On slow machines the call to pixman_fill_sse2() on similar surfaces that
we know are already zeroed takes a significant amount of time [12.77% of
the profile for a firefox trace, cf to just 3% of the profile is spent
inside memset].
Rather than solve why the pixman_fill_sse2() is so slow, simply skip the
redundant clears.
Use the DRM interface to h/w accelerate composition on image surfaces.
The purpose of the backend is simply to explore what such a hardware
interface might look like and what benefits we might expect. The
use case that might justify writing such custom backends are embedded
devices running a drm compositor like wayland - which would, for example,
allow one to write applications that seamlessly integrated accelerated,
dynamic, high quality 2D graphics using Cairo with advanced interaction
(e.g. smooth animations in the UI) driven by a clutter framework...
In this first step we introduce the fundamental wrapping of GEM for intel
and radeon chipsets, and, for comparison, gallium. No acceleration, all
we do is use buffer objects (that is use the kernel memory manager) to
allocate images and simply use the fallback mechanism. This provides a
suitable base to start writing chip specific drivers.
Handling clip as part of the surface state, as opposed to being part of
the operation state, is cumbersome and a hindrance to providing true proxy
surface support. For example, the clip must be copied from the surface
onto the fallback image, but this was forgotten causing undue hassle in
each backend. Another example is the contortion the meta surface
endures to ensure the clip is correctly recorded. By contrast passing the
clip along with the operation is quite simple and enables us to write
generic handlers for providing surface wrappers. (And in the future, we
should be able to write more esoteric wrappers, e.g. automatic 2x FSAA,
trivially.)
In brief, instead of the surface automatically applying the clip before
calling the backend, the backend can call into a generic helper to apply
clipping. For raster surfaces, clip regions are handled automatically as
part of the composite interface. For vector surfaces, a clip helper is
introduced to replay and callback into an intersect_clip_path() function
as necessary.
Whilst this is not primarily a performance related change (the change
should just move the computation of the clip from the moment it is applied
by the user to the moment it is required by the backend), it is important
to track any potential regression:
ppc:
Speedups
========
image-rgba evolution-20090607-0 1026085.22 0.18% -> 672972.07 0.77%: 1.52x speedup
▌
image-rgba evolution-20090618-0 680579.98 0.12% -> 573237.66 0.16%: 1.19x speedup
▎
image-rgba swfdec-fill-rate-4xaa-0 460296.92 0.36% -> 407464.63 0.42%: 1.13x speedup
▏
image-rgba swfdec-fill-rate-2xaa-0 128431.95 0.47% -> 115051.86 0.42%: 1.12x speedup
▏
Slowdowns
=========
image-rgba firefox-periodic-table-0 56837.61 0.78% -> 66055.17 3.20%: 1.09x slowdown
▏
Whilst constructing the path, if the operations continue to be
axis-aligned lines, allow the is_box and is_region flags to persist. These
are set to false as soon as a curve-to is added, a diagonal or in the case
of is_region a non-integer point.
Use the cow-snapshotting mechanism to store the meta surface replay (either
to an image inside acquire_source_image() or to a similar surface during
clone_similar()).
Fixes Bug 17971 -- Extreme slowdown for manual convolutions in most
vector backends.
https://bugs.freedesktop.org/show_bug.cgi?id=17971
Observe that patterns are not altered during an operation and so we are
safe to use the data from the original pattern without copying. (This is
enforced through the declaration that the backends operate on constant
patterns which are not allowed to be referenced or destroyed.)
Fixes bug 22356 -- Spurious "out of memory" error on system without fonts
https://bugs.freedesktop.org/show_bug.cgi?id=22356
If FcFontMatch() fails, then it means that there are no fonts available on
the system (or it may have been a malloc error, we have no way of telling).
Instead of report NO_MEMORY and disabling all drawing, one of the
rationales for including a builtin font was so that we could continue even
in the face of this error and show *something* to the user. (This being a
last resort (and especially important for demos!) and hopefully easier to
diagnose than no output at all.)
cairo_region_union_rectangle() is linear in the number of rectangles
in the region. There is no way to make it significantly faster without
losing the ability to return errors synchronously, so a
cairo_region_create_rectangles() is needed to avoid a large
performance regression.
The lazy resolution of patterns was defeating the scaled_font cache as
ft-fonts that resolved to the same unscaled font were being given different
font-faces upon creation. We can keep the lazy resolution by simply asking
the ft backend to create a fully resolved ft-font-face when we need to
create a scaled-font. This font is then keyed by the resolved font-face
and so will match all future lazily resolved identical patterns.
Provide a mechanism for backends to attach and remove snapshots. This can
be used by backends to provide a cache for _cairo_surface_clone_similar(),
or by the meta-surfaces to only emit a single pattern for each unique
snapshot.
In order to prevent stale data being returned upon a snapshot operation,
if the surface is modified (via the 5 high level operations, and on
notification of external modification) we break the association with any
current snapshot of the surface and thus preserve the current data for
their use.
Allow the caller to choose whether or not various conversions take place.
The first flag is used to disable the expansion of reflected patterns into a
repeating surface.
The number of mime-types attached to a surface is usually small,
typically zero. Therefore it is quicker to do a strcmp() against
each key in the private mime-data array than it is to intern the
string (i.e. compute a hash, search the hash table, and do a final
strcmp).
As an aide to tracking down the source of uninitialised reads, run
VALGRIND_CHECK_MEM_IS_DEFINED() over the contents of image surfaces at the
boundary between backends, e.g. upon setting a glyph image or acquiring
a source image.
Damian Frank noted
[http://lists.cairographics.org/archives/cairo/2009-May/017095.html]
a performance problem with an older XServer with an
unaccelerated composite - similar problems will be seen with non-XRender
servers which will trigger extraneous fallbacks. The problem he found was
that painting an ARGB32 image onto an RGB24 destination window (using
SOURCE) was going via the RENDER protocol and not core. He was able to
demonstrate that this could be worked around by declaring the pixel data as
an RGB24 image. The issue is that the image is uploaded into a temporary
pixmap of matching depth (i.e. 32 bit for ARGB32 and 24 bit for RGB23
data), however the core protocol can only blit between Drawables of
matching depth - so without the work-around the Drawables are mismatched
and we either need to use RENDER or fallback.
This patch adds a content mask to _cairo_surface_clone_similar() to
provide the extra bit of information to the backends for when it is
possible for them to drop channels from the clone. This is used by the
xlib backend to only create a 24 bit source when blitting to a Window.
Eliminate the extremely short-lived and oft unnecessary heap allocation
of the region by first checking to see whether the clip exceeds the
surface bounds and only then intersect the clip with a local
stack-allocated region.
Specifically,
cairo_region_union_rect -> cairo_region_union_rectangle
cairo_region_create_rect -> cairo_region_create_rectangle
Also delete cairo_region_clear() which is not that useful.
Jeff Muizelaar pointed out that the severe overallocation implicit in the
current version of the glyph cache is obnoxious and prevents him from
accepting the trunk into Mozilla. Jeff captured a trace of scaled font
and glyph usage during a tp run and presented his analysis in
http://lists.cairographics.org/archives/cairo/2009-March/016706.html
Using that data, the design was changed to allocate pages of glyphs from a
capped global pool but with per-font hash tables. This should allow the
glyph cache to have tight memory bounds with fair allocation according to
usage. Note that both the old design and the 1.8 glyph cache had
essentially unbounded memory constraints, since each scaled font could
cache up to 256 glyphs (1.8) or had a reserved page (old), with no limit
on the number of active fonts. Currently the eviction policy is a simple
random strategy, this gives a 'fair' allotment of the cache, but a LRU
variant might perform better.
On a sample run of firefox-3.0.7 perusing BBC news in 32 languages:
1.8: cache allocation 8190x, ~1.2 MiB; elapsed 88.2s
old: cache allocation 771x, ~13.8 MiB; elapsed 81.7s
lean: cache allocation 433x, ~1.8 MiB; elapsed 82.4s
_cairo_win32_tmpfile() uses _open_osfhandle() which is not available
on Windows CE. However, Windows CE doesn't have the permisions problems
that necessitated _cairo_win32_tmpfile() in the first place so we can just
use tmpfile() on Windows CE.
Move the mime-data into its own array so that it cannot be confused with
user-data and we do not need to hard-code the copy list during
snapshotting. The copy-on-snapshotting code becomes far simpler and will
accommodate all future mime-types.
Keeping mime-data separate from user-data is important due to the
principle of least surprise - the API is different and so it would be
surprising if you queried for user-data and were returned an opaque
mime-data pointer, and vice versa. (Note this should have been prevented
by using interned strings, but conceptually it is cleaner to make the
separation.) Also it aides in trimming the user data arrays which are
linearly searched.
Based on the original patch by Adrian Johnson:
http://cgit.freedesktop.org/~ajohnson/cairo/commit/?h=metadata&id=37e607cc777523ad12a2d214708d79ecbca5b380
Adds an error code replacing CAIRO_STATUS_NO_MEMORY in one case where it
is not really appropriate. CAIRO_STATUS_INVALID_SIZE is used by several
backends that do not support image sizes beyond 2^15 pixels on each side.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Currently glyphs are cached independently in each font i.e. each font
maintains a cache of up to 256 glyphs, and there can be as many scaled fonts
in use as the application needs and references (we maintain a holdover
cache of 512 scaled fonts as well).
Alternatively, as in this patch, we can maintain a global pool of glyphs
split between all open fonts. This allows a heavily used individual font
to cache more glyphs than we could allow if we used per-font glyph caches,
but at the same time maintains fairness across all fonts (by using random
replacement) and provides a cap on the maximum number of global glyphs.
The glyphs are allocated in pages, which are cached in the global pool.
Using pages means we can exploit spatial locality within the font
(nearby indices are typically used in clusters) to reduce frequency of small
allocations and allow the scaled font to reserve a single MRU page of
glyphs. This caching dramatically reduces the cairo overhead during the
cairo-perf benchmarks, and drastically reduces the number of allocations
made by the application (for example browsing multi-lingual site with
firefox).
Rename approximate_extents() to approximate_clip_extents() so that it is
consistent with the fill and stroke variants and clearer under what
circumstances you may wish to use it.
With Behdad's analytical analysis of the spline bbox, tolerance is now
redundant for the path extents and the approximate bounds, so remove it
from the functions parameters.
The way this works is very simple: Once for X, and once for Y, it
takes the derivative of the bezier equation, equals it to zero and
solves to find the extreme points, and if the extreme points are
interesting, adds them to the bounder.
Not the fastest algorithm out there, but my estimate is that if
_de_casteljau() ends up breaking a stroke in at least 10 pieces,
then the new bounder is faster. Would be good to see some real
perf data.