Handling clip as part of the surface state, as opposed to being part of
the operation state, is cumbersome and a hindrance to providing true proxy
surface support. For example, the clip must be copied from the surface
onto the fallback image, but this was forgotten causing undue hassle in
each backend. Another example is the contortion the meta surface
endures to ensure the clip is correctly recorded. By contrast passing the
clip along with the operation is quite simple and enables us to write
generic handlers for providing surface wrappers. (And in the future, we
should be able to write more esoteric wrappers, e.g. automatic 2x FSAA,
trivially.)
In brief, instead of the surface automatically applying the clip before
calling the backend, the backend can call into a generic helper to apply
clipping. For raster surfaces, clip regions are handled automatically as
part of the composite interface. For vector surfaces, a clip helper is
introduced to replay and callback into an intersect_clip_path() function
as necessary.
Whilst this is not primarily a performance related change (the change
should just move the computation of the clip from the moment it is applied
by the user to the moment it is required by the backend), it is important
to track any potential regression:
ppc:
Speedups
========
image-rgba evolution-20090607-0 1026085.22 0.18% -> 672972.07 0.77%: 1.52x speedup
▌
image-rgba evolution-20090618-0 680579.98 0.12% -> 573237.66 0.16%: 1.19x speedup
▎
image-rgba swfdec-fill-rate-4xaa-0 460296.92 0.36% -> 407464.63 0.42%: 1.13x speedup
▏
image-rgba swfdec-fill-rate-2xaa-0 128431.95 0.47% -> 115051.86 0.42%: 1.12x speedup
▏
Slowdowns
=========
image-rgba firefox-periodic-table-0 56837.61 0.78% -> 66055.17 3.20%: 1.09x slowdown
▏
Based on the work by Øyvind Kolås and Pierre Tardy -- many thanks to
Pierre for pushing this backend for inclusion as well as testing and
reviewing my initial patch. And many more thanks to pippin for writing the
backend in the first place!
Hacked and chopped by myself into a suitable basis for a backend. Quite a
few issues remain open, but would seem to be ready for testing on suitable
hardware.
Hook into the scanner to write out binary version of the tokenized
objects -- note we bind executable names (i.e. check to see if is an
operator and substitute the name with an operator -- this breaks
overloading of operators by scripts).
By converting scripts to a binary form, they are more compact and
execute faster:
firefox-world-map.trace 526850146 bytes
bound.trace 275187755 bytes
[ # ] backend test min(s) median(s) stddev. count
[ 0] null bound 34.481 34.741 0.68% 3/3
[ 1] null firefox-world-map 89.635 89.716 0.19% 3/3
[ 0] drm bound 79.304 79.350 0.61% 3/3
[ 1] drm firefox-world-map 135.380 135.475 0.58% 3/3
[ 0] image bound 95.819 96.258 2.85% 3/3
[ 1] image firefox-world-map 156.889 156.935 1.36% 3/3
[ 0] xlib bound 539.130 550.220 1.40% 3/3
[ 1] xlib firefox-world-map 596.244 613.487 1.74% 3/3
This trace has a lot of complex paths and the use of binary floating point
reduces the file size by about 50%, with a commensurate reduction in scan
time and significant reduction in operator lookup overhead. Note that this
test is still IO/CPU bound on my i915 with its pitifully slow flash...
It is easier on the eye to use
'1 index set-source exch pop'
rather than
'dup /p0 exch def p0 set-source /p0 undef'
(as patterns are expected to be temporary so we strive to avoid naming
them).
After opening a specific file or fd for ourselves, reset the
CAIRO_TRACE_FD to point to an invalid fd in order to prevent any child
processes (who inherit our environment) from attempting to trace cairo
calls. If we allow them to continue, then the two traces will intermix
and be unreplayable.
Embed the pixels for images less than 32*32 as this catches most icons
which are frequently uploaded, but is still an unlikely size for a
destination image surface.
Using a null surface is a convenient method to measure the overhead of the
performance testing framework, so export it although as a test-surface so
that it will only be available in development builds and not pollute
distributed libraries.
Need to allow user programs to dump their traces into the common output
directory, when using /etc/ld.so.preload to capture traces for the entire
desktop.
Carl spotted this last night, but I misinterpreted it as an old problem
caused by the application changing its working directory before its first
cairo call - thus causing cairo-trace to attempt to open a file in the new
directory. Instead the problem was attempting to trace an executable with
an absolute path, where we just tagged it with a .lzma extentsion and
attempted to pipe the output there. Obviously this fails for the user
profiling system binaries. So use basename to strip the leading path.
python lazily loads libcairo.so and so it is not available via RTLD_NEXT,
and we need to dlopen cairo ourselves. Similarly the linker is not able to
resolve any naked function references and so we need to ensure that all of
our own calls into the library are wrapped with DLCALL.
These can be reasonably large and persist for long times due to the
font holdover caches, so manually swap them out to save space on tiny
swapless machines.
As an aide to tiny swapless systems write the rarely used bytes that
define type42 fonts into a deleted file and mmap them back into our
address space.
csi_string_new() duplicated the bytes which was not what was desired, so
implement a csi_string_new_for_bytes() to take ownership and prevent the
leak that was occuring, for example, every time we create a new font face.
Seeing unexpected time inside pixman composite is quite disturbing when
trying to track down the apparent slowness in some benchmarks. Remove one
source of this artefact by simply memcpy'ing pixel data when trivial.
Objects like cairo_scaled_font_t may return a reference to a previously
defined scaled-font instead of creating a new token each time. This caused
cairo-trace to overflow its operand stack by pushing a new instance of the
old token every time. Modify the tracer such that a font token can only
appear once on the stack -- for font-faces we remove the old operand and
for scaled-fonts we simply pop, chosen to reflect expected usage.
Applications such as swfdec have a strictly correct use of mark-dirty and
so we need an option to re-enable mark-dirty tracing in conjunction with
--profile.
When using fonts circular references are established between the holdover
font caches and the interpreter which need manual intervention via
cairo_script_interpreter_finish() to break.
To save typing when creating macro-benchmarks --profile disables
mark-dirty and caller-info and compresses the trace using LZMA. Not for
computers short on memory!
The glyph advance cache was only enabled for glyph indices < 256,
causing a large number of misses for non-ASCII text. Improve this by
simply applying the modulus of the index to select the cache slot - which
may cause some glyph advances to be overwritten and re-queried, but
improves the hit rate.
Felt like pulling the latest stuff, since I branched back in February.
Conflicts:
build/configure.ac.features
src/cairo.h
util/cairo-script/csi-replay.c
Record the current working directory and pass that along to cairo-trace so
that the trace output is local to the user and not the application. This
is vital if the application is called via a script that changes directory.