Friday, May 10, 2013

Moving the Needle - a Quick Audit of the 7970

I dropped X-Plane into AMD's GPU PerfStudio 2 to see what might be going on on the GPU side of things.  This is on a Sandy Bridge i5 with a 7970 on a 1920 x 1200 monitor.  I cranked the sim up to 4x SSAA (stupidly simplistic anti-aliasing) to really push fill rate.

And...

(Sigh...)

We're mostly CPU bound.  Still.

You know those GDC talks where the first thing the IHVs say is "most games are CPU bound?"  I can feel them glaring in my direction.

There was one case where I was able to absolutely hose the GPU: a full screen of thick particle-type clouds at 1080p + with 4x SSAA.  You can tell you're GPU bound just by the amount of heat the card blows out.

What was cool though, was running a batch profile in GPU PerfStudio and sorting by GPU time.  At a whopping, facemelting, eye bleeding 32 ms was a single batch of cloud puffs (about 10k vertices) that covered most of the screen.  The histogram falls off from there, with the next few most expensive cloud batches taking a few ms and everything else being noise.

This isn't hugely surprising...we know our cloud system is a fill-rate pig and thus responds badly to very large render surfaces...the real surprise is how the GTX 680 copes with them so well.  (What is less obvious is what to do about it; I fear the halo-type artifacts that a lot of half-res solutions provide may be a deal-breaker; re-rendering the edges with stencil will increase our geometry count and we do have a lot of puffs.  Probably the right thing is to start using hardware MSAA surfaces for the G-Buffer to leverage hardware compression.)

I went looking for anything else that might be slow and finally got an answer about pre-stenciling lighting volumes.  Our first deferred rendering implementation pre-stenciled lighting volumes to cut fill rate; we dropped the stencil pass when we found that we were CPU and AGP bandwidth bound; the only time losing stencil was a problem was in high-zoom lighting scenarios.

With the GPU profiler, I could see that a moderate sized batch of light volumes around the Apron at our Seattle demo airport takes about 1.5 ms to render at the aforementioned rather large resolution.  The scene has maybe 3 or 4 volumes of that magnitude, and the rest are small enough in screen space that we don't need to care.

That only adds up to 6-10 ms of GPU time though - and given that the sun pass is fast enough to not show up in the top 10 list, FXAA is fast and even scenery fill isn't so bad, it's clear why light fill isn't the long pole, particularly when the CPU is struggling to get around the frame in 30 ms.  Cutting CPU work does appear to be the right thing here.

The real right thing, some day, in the future, when I have, like, spare time, would be to do two scene graph passes: the first one would draw lights except in the near N meters, with no stencil; the second pass would only grab the near lights and would do two passes, stenciling out the lights first.  This would give us the fill rate fix in the one case where it matters: when the light is close enough to be huge in screen space.  (Our non-sun lights tend to be reasonably small in world space - they are things like street lights, apron lights, and airplane landing lights.)
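A minimal sketch of the near/far split described above, assuming a hypothetical Light record that only carries its distance from the camera.  The names (Light, split_lights_by_distance, near_limit_m) are made up for illustration and are not X-Plane code; the real scene graph traversal would be more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical light record; only the camera distance matters for the split.
struct Light {
    float distance_m;  // distance from the camera, in meters
};

// Reorders `lights` so far lights come first and returns the index of the
// first near light.  Lights in [0, split) would be drawn with no stencil;
// lights in [split, size) would get the two-pass stencil treatment.
std::size_t split_lights_by_distance(std::vector<Light>& lights, float near_limit_m) {
    assert(near_limit_m >= 0.0f);
    auto first_near = std::partition(
        lights.begin(), lights.end(),
        [near_limit_m](const Light& l) { return l.distance_m >= near_limit_m; });
    return static_cast<std::size_t>(first_near - lights.begin());
}
```

The partition is linear in the number of lights, so the extra CPU cost should be small next to the draw calls themselves.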

There is one limitation to GPU PerfStudio 2 that frustrated me: because it doesn't sniff the call-stack during draw calls, it can't easily do data mining for the source of poor performance.  That is, if you have a big app that generates a huge frame using a number of common subsystems and one subsystem sucks, the tool won't single that subsystem out for you.  (Note: I did not experiment with trying to inject frame markers into the draw call stream...I don't even know if they support that using the KHR debug extension.)

My next step will be to integrate the NV and ATI performance counter APIs directly into the sim.  We have, at various times, had various timing utilities to allow us to rapidly instrument a code section in a way that follows the logical design, rather than just the call stack.  (Shark was so good that we didn't use any utilities for a while.)  With the GPU performance counters, we could potentially build HUD-style GPU metering directly into the app.
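To make "instrument a code section in a way that follows the logical design" concrete, here is a rough sketch of a scoped timer keyed by a logical section name.  It measures CPU wall-clock time only; it does not touch the NV or ATI counter APIs, and all of the names (ScopedSectionTimer, section_table) are hypothetical rather than anything from the sim.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

struct SectionStats {
    double total_ms = 0.0;  // accumulated wall-clock time
    int    calls    = 0;    // number of times the section ran
};

// Hypothetical global table, keyed by logical section name.
inline std::map<std::string, SectionStats>& section_table() {
    static std::map<std::string, SectionStats> table;
    return table;
}

// Times the enclosing scope and adds the result to the named section.
class ScopedSectionTimer {
public:
    explicit ScopedSectionTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {
        assert(!name_.empty());
    }
    ~ScopedSectionTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        SectionStats& s = section_table()[name_];
        s.total_ms += std::chrono::duration<double, std::milli>(elapsed).count();
        s.calls += 1;
    }
    ScopedSectionTimer(const ScopedSectionTimer&) = delete;
    ScopedSectionTimer& operator=(const ScopedSectionTimer&) = delete;
private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Because the key is a name and not a call stack, two unrelated call sites can both report into "clouds", which is the point of instrumenting by design rather than by stack.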

Saturday, April 20, 2013

There Must Be 50 Ways to Draw Your Streaming Quads

So you want to draw a series of streaming 2-d quads.  (By streaming I mean: their position or orientation is changing per frame.)  Maybe it's your UI, maybe it is text, maybe it is some kind of 2-d game or particle system.  What is the fastest way to draw them?

For a desktop, one legitimate response might be "who cares"?  Desktops are so fast now that it would be worth profiling naive code first to make sure optimization is warranted.  For the iPhone, you probably still care.

Batch It Up

The first thing you'll need to do is to put all of your geometry in a single VBO, and reference only a single texture (e.g. a texture atlas). This is mandatory for any kind of good performance.  The cost of changing the source buffer or texture (in terms of CPU time spent in the driver) is quite a bit larger than the time it takes to actually draw a single quad.  If you can't draw in bulk, stop now - there's no point in optimizing anything else.

(As a side note: you'll eat the cost of changing VBOs even if you simply change the base address pointer within the same VBO.  Every time you muck around with the vertex pointers or vertex formats, the driver has to re-validate that what you are doing isn't going to blow up the GPU.  It's incredible how many cases the driver has to check to make sure that a subsequent draw call is reasonable.  glVertexPointer is way more expensive than it looks!)

The Naive Approach

The naive approach is to use the OpenGL transform stack (e.g. glRotate, glTranslate, etc.) to position each quad, then draw it.  Under the hood, the OpenGL transform stack translates into "uniform" state change - that is, changing state that is considered invariant during a draw call but changes between draw calls.  (If you are using the GL 2.0 core profile, you would code glRotate/glTranslate yourself by maintaining a current matrix and changing a uniform.)

When drawing a lot of stuff, uniforms are your friend; because they are known to be uniform for a single draw call, the driver can put them in memory on the GPU where they are quick to access over a huge number of triangles or quads.  But when drawing a very small amount of geometry, the cost of changing the uniforms (more driver calls, more CPU time) begins to outweigh the benefit of having the GPU "do the math".

In particular, if each quad has its own matrix stack in 2-d, you are saving 24 MADs per quad by requiring the driver to rebuild the current uniform state.  (How much does that cost?  A lot more than 24 MADs.)  Even ignoring the uniforms, the fact that a uniform changed means each draw call can only draw 1 quad.  Not fast.

Stream the Geometry

One simple option is to throw out hardware transform on the GPU and simply transform the vertices on the CPU before "pushing" them to the GPU.  Since the geometry of the quads is changing per frame, you were going to have to send them to the GPU anyway.  This technique has a few advantages and disadvantages.
  • Win: You get all of your drawing in a single OpenGL draw call with a single VBO.  So your driver time is going to be low and you're talking to the hardware efficiently.
  • Win: This requires no newer GL 3.x/4.x kung fu.  That's good if you're using OpenGL ES 2.0 on an iPhone, for example.
  • Fail: You have to push every vertex every frame. That costs CPU and (on desktops) bus bandwidth.
  • Not-total-fail: Once you commit to pushing everything every frame, varying UV maps in real time carries no extra penalty; and there isn't a bus to jam up on a mobile device.
Note that if we were using naive transforms, we'd still have to "push" a 16-float uniform matrix to the card (plus a ton of overhead that goes with it), so 16 floats of 2-d coordinates plus texture is a wash.  As a general rule I would say that if you are using uniforms to transform single primitives, try using the CPU instead.
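Here is a minimal sketch of the CPU-transform approach, assuming a hypothetical Quad2D record (center, half extents, rotation, atlas rectangle).  It produces the 16 floats per quad mentioned above: four vertices, each with x, y, u, v.  The struct layout and names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Quad2D {            // hypothetical per-quad state
    float cx, cy;          // center position
    float half_w, half_h;  // half extents
    float angle_rad;       // rotation about the center
    float u0, v0, u1, v1;  // rectangle in the texture atlas
};

// Rebuilds the whole vertex stream: 4 vertices of (x, y, u, v) per quad.
void build_stream(const std::vector<Quad2D>& quads, std::vector<float>& out) {
    out.clear();
    out.reserve(quads.size() * 16);
    for (const Quad2D& q : quads) {
        const float c = std::cos(q.angle_rad), s = std::sin(q.angle_rad);
        // Corners in quad-local space, counter-clockwise from bottom-left.
        const float lx[4] = {-q.half_w,  q.half_w, q.half_w, -q.half_w};
        const float ly[4] = {-q.half_h, -q.half_h, q.half_h,  q.half_h};
        const float us[4] = {q.u0, q.u1, q.u1, q.u0};
        const float vs[4] = {q.v0, q.v0, q.v1, q.v1};
        for (int i = 0; i < 4; ++i) {
            out.push_back(q.cx + c * lx[i] - s * ly[i]);  // rotate, then translate
            out.push_back(q.cy + s * lx[i] + c * ly[i]);
            out.push_back(us[i]);
            out.push_back(vs[i]);
        }
    }
    assert(out.size() == quads.size() * 16);
}
```

The resulting array would be uploaded to the one streaming VBO each frame and drawn with a single call; the upload and draw are left out here because they depend on the GL version you target.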

Stupid OpenGL Tricks


If you are on a desktop with a modern driver, you can in theory leverage the compute power of the GPU, cut down your bandwidth, and still avoid uniform-CPU-misery.

Disclaimer: while we use instancing heavily in X-Plane, I have not tried this technique for 2-d quads.  Per the first section, in X-Plane desktop we don't have any cases where we care enough.  The streaming case was important for iPhone.

To cut down the amount of streamed data:
  • Set the GPU up for vertex-array-divisor-style instancing.
  • In your instance array, push the transform data.  You might have an opportunity for compression here; for example, if all of your transforms are translate+2-d rotate (no scaling ever), you can pass a pair of 2-d offsets and the sin/cos of the rotation and let the shader apply the math ad-hoc, rather than using a full 4x4 matrix.  If your UV coordinates change per quad, you'll need to pass some mix of UV translations/scales.  (Again, if there is a regularity to your data you can save instance space.)
  • The mesh itself is a simple 4-vertex quad in a static VBO.
You issue a single instanced draw call and the card "muxes" together the instancing transforms with the vertex data.  You get a single batch, a small amount of data transferred over the bus, and low CPU use.
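As a sketch of the compression idea (translate plus 2-d rotate, no scaling), the following models the per-instance record and the math a vertex shader would apply to each corner of the static quad.  It runs on the CPU purely to show the arithmetic; QuadInstance, make_instance and apply_instance are hypothetical names, and per the disclaimer above this has not been tried in X-Plane.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// 4 floats per instance instead of a 16-float 4x4 matrix.
struct QuadInstance {
    float tx, ty;        // 2-d offset
    float sin_a, cos_a;  // rotation, precomputed on the CPU
};
static_assert(sizeof(QuadInstance) == 4 * sizeof(float), "instance should pack to 4 floats");

QuadInstance make_instance(float tx, float ty, float angle_rad) {
    return {tx, ty, std::sin(angle_rad), std::cos(angle_rad)};
}

// What the shader would do per vertex: rotate the static corner, then translate.
Vec2 apply_instance(const QuadInstance& inst, Vec2 corner) {
    return {inst.tx + inst.cos_a * corner.x - inst.sin_a * corner.y,
            inst.ty + inst.sin_a * corner.x + inst.cos_a * corner.y};
}
```

If UVs vary per quad, the record would grow by whatever UV offset/scale terms you need, which is still well short of a full matrix.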

There are a few other ways to cut this up that might not be as good as hw instancing:
  • There are other ways to instance using the primitive ID and UBOs or TBOs - YMMV.
  • If you have no instancing, you can use immediate mode to push the transforms and make individual draw calls.  This case will probably outperform uniforms, but probably not outperform streaming and CPU transform.
  • You could use geometry shaders, but um, don't.

Wednesday, March 20, 2013

Shiny Parts

Just a science experiment from the weekend.  LDraw parts have been scaled to get them to all fit on screen together.


Saturday, March 16, 2013

Lego Lighting Effects

I was flipping through MOC-train pictures and was struck by this image.  What got my attention is not only that it is a very well done model, but also that it is one of the few images in the photo stream that really pops despite being a computer-graphics render (as opposed to a photo of a real lego set); most of the renderings don't have the same "ooh" factor as real models.

(Compare the first image to this image from the same set, which appears to be more like a screen capture from a lego editing program - a very simple forward-shaded lighting environment. The first image works because the lighting environment does enough interesting things to make the model start to look like it exists in a real 3-d space.)

I was able to get smooth shading working (more or less) in BrickSmith, at least as a prototype, and that got me thinking: what are we going to do with the new rendering engine?  Now that we have shaders and smooth normals, what lighting would actually look good?  The existing lighting model makes models look a bit like instructions; it's great for clarity and editing without eye strain, but no one is going to think you're looking at a photo.

So I took the only lego set I actually own now (the Maersk train), and held it up to the window while turning it. I don't actually play with it - it just sits on my shelf, so the parts are still clean and relatively finger-print free.  It looks to me like there are a few critical lighting effects that we'll need to capture to get a high quality render.  The good news is that they are probably all doable in real time.  Here is a brain dump:
  • Lego bricks have some kind of BRDF - they are highly reflective at some angles, and the reflection strength dies off with angle; the BRDF may be more complicated than a standard exponential specular hilite.  Given the small number of part surfaces, it would not be insane to model each specific BRDF with a lookup table texture.
  • Normal mapping: it turns out that a square brick doesn't actually have a flat side.  There is a subtle bit of 'indent' in the center of the side relative to the corners.  I don't know if this is intentional or a limit of the manufacturing process (I'll go with "intentional" since TLC is known for their insane levels of quality control) but there is no question that a flat surface is not actually flat. The amount of curvature depends on the part, and the shape of the curvature appears to have a pattern - the brick wall goes 'out' at the corner, creating tell-tale reflections just inside the bounds of the brick.  This effect could potentially be created with texture-based normal mapping.
  • The slope bricks have a 'grit' texture etched into the sloped sides; this effectively changes the BRDF.  The question then is whether this should be done with a normal map or BRDF tweak.  The answer might be to use something like LEAN mapping, e.g. a normal map that produces a correct specularity change when mipmap-filtered.  (Again, we could get away with a technique that is considered "expensive" for game content because legos have very few distinct materials and LDraw makes almost no use of textures; that texturing hardware is just sitting waiting for us.)
  • The brick edges present a difficult problem; they are represented as line segments in LDraw (to make it easy to provide a wire-frame around bricks for instruction-style drawing).  In a real lego model, the edges of the bricks appear to be slightly faceted, which makes them feel less sharp.  This leads to two effects: specular hilites off the edge of the brick, and 'dark cracks' between bricks, which I would say is essentially self-shadowing or ambient occlusion.  My thought is to set up the lines with the average normal of the 'crease' it represents and then use them to overpaint some specularity, but I haven't tried this yet.
  • There is slight variation in the direction of the bricks - a modeler can assemble bricks with varying degrees of tightness, and if desired, can leave the bricks a little bit loose to get some variation in their exact orientation.  This leads to variations in normals (and thus lighting) as well as self shadowing and more/less visible cracks at their junctions.  My thinking is that this could be simulated by applying some tiny offset to the transform of individual bricks.
  • While some POV-Ray style renders use cast shadows (including the one I linked to) I think that ambient occlusion might provide better lighting cues.  People usually play with and observe legos indoors, and the indoor environment often has heavily diffused lighting.
Putting this wish list together, I can imagine:
  • Lighting via an environment map (to capture variable changes in diffuse lighting levels with multiple lighting reflection sources) and
  • Rendering to a deferred surface, with lines blending changes into the normal vector plane.  (Some normal-mapping schemes are reasonably amenable to hardware blending.)
  • Lighting with screen space reflectance/ambient occlusion - that is, we walk the neighborhood around our pixel in screen space, capturing shadowing and local color bounce, and lookup the ray in the environment map for rays that escape.
I will be the first to admit that I have no idea how a material BRDF, local screen space GI, and environment maps play together.

Those questions may also be slightly moot; the LDraw data for parts does not contain normal maps or even surface roughness descriptions, so good input data on the lighting properties of the bricks might not even be available.

But this is all walking before we crawl; smooth normals are not fully coded or debugged, the new renderer hasn't shipped yet, and I still don't have an LOD scheme to cut vertex count.

Thursday, March 14, 2013

How to Jam an Arrangement_2 into a General_polygon_set_2

I spent about three hours yesterday tracking down a weird bug in CGAL - I have code that builds a general polygon set out of an arrangement, exports the polygons, and weirdly the polygons had duplicate points.  This is an impossibility for a valid arrangement.

To my annoyance, I discovered today as I went to write the bug up that I knew about this bug...over three years ago. :-(  I get annoyed when I search for the answer to an obscure OpenGL problem and find my own post (e.g. I'm not going to find anything I didn't already know), but it's even more annoying to waste hours on the bug and then have that happen.

Basically if you are going to build a general polygon set by providing a pre-built arrangement, there are two things you must do:
  • Remove redundant edges - the GPS code assumes that the arrangement doesn't have needless edges (which will screw up traversal).  Fortunately, the GPS code has a utility to do this, which I just call.
  • Then you have to ensure that the direction of the underlying curves along the various edges is consistent - that is, for a given counter-clockwise boundary, every underlying curve goes either with or against the edge.
(After redundant edge removal, the arrangement will contain no antennas, so it will always be possible to get consistency on both sides of a CCB.)

I wrote code to enforce this second condition by flipping the curve of any halfedge where (1) the curve goes against the halfedge and (2) the halfedge is adjacent to the "contained" side of the map.
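The flip rule can be sketched on a toy data structure.  To be loud about it: ToyCurve and ToyHalfedge below are stand-ins invented for this example and are NOT CGAL types or CGAL API; a real version would walk the arrangement's halfedges and compare each halfedge's direction against its x-monotone curve.  The sketch assumes redundant edges are already gone, so at most one halfedge of each twin pair borders a contained face.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins, not CGAL types.
struct ToyCurve {
    bool left_to_right;       // direction of the underlying curve
};
struct ToyHalfedge {
    std::size_t curve;        // index of the curve shared with the twin
    bool left_to_right;       // direction this halfedge runs
    bool face_contained;      // is the incident face inside the polygon set?
};

// Flips any curve that runs against a halfedge bordering a contained face.
// Returns the number of curves flipped.
std::size_t make_curves_consistent(std::vector<ToyCurve>& curves,
                                   const std::vector<ToyHalfedge>& halfedges) {
    std::size_t flipped = 0;
    for (const ToyHalfedge& he : halfedges) {
        assert(he.curve < curves.size());
        ToyCurve& c = curves[he.curve];
        const bool against = (c.left_to_right != he.left_to_right);
        if (against && he.face_contained) {
            c.left_to_right = !c.left_to_right;
            ++flipped;
        }
    }
    return flipped;
}
```

After this pass every halfedge on the contained side runs with its curve, which is the consistency the polygon set traversal appears to expect.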

With this, polygon set operations work on arbitrary map input.

Why Did You Try This?

Forcing pre-made arrangements into polygon sets requires sub-classing the general polygon set template instantiation to get direct access to things like the arrangement, and it's not particularly safe.  It also requires your arrangement to have the containment flag on the face data mixed in.  Why go to the trouble?  I did this for two reasons:
  • Sometimes the polygonal set data I want to process came from an arrangement, and that arrangement is fairly huge.  Having to construct the arrangement out of polygons the normal way requires geometry tests - topology data would be lost and rediscovered.  For big maps this is really performance-painful.
  • I have some operations that work on arrangements that are precursors to boolean sets. For example, the airport surface area data are fundamentally polygon sets (e.g. in the set is the airport surface area) but some of the constructive processing (e.g. simplifying the contour) runs on arrangements.
When an arrangement is turned into a polygon set, one of the results is a rather drastic map cleaning.  Since the polygon set cares only about containment (what points are in, what are out), random features like roads tend to get blown away.

Wednesday, January 30, 2013

Instancing for BrickSmith

BrickSmith is a FOSS 3-d LDraw-compatible editor for Mac; basically it lets you model legos on your computer. BrickSmith is a wonderful program - really a joy to use.  I have submitted a few patches, mostly little features that I want for my own modeling.  Recently I rewrote the low level OpenGL drawing code to improve performance and quality; hopefully we'll ship this in the next major patch.

This post describes the OpenGL techniques I picked.  Since OpenGL is a cross-platform standard, it's possible that the design might be of interest (at least for reference) to other LDraw program developers.  In some cases I will reference X-Plane performance numbers because I have access to that data.

LDraw Basics

If you aren't familiar with LDraw's simulation of lego bricks, here are the operative points to a graphics programmer:
  • The file format effectively turns into something like a push-down state stack and individual line, tri, and quad primitives.
  • The vast majority of drawing tends to be colored lines and polygons; while texture support was added to the format, it's not yet in wide-spread production.
  • LDraw models tend to be vertex bound; the format contains no LOD for reducing vertex count, and the lego parts are modeled in full geometric detail.  (Consider: even though the lego 'studs' are only broken into octagon-prisms, you still have 1024 of them on a single baseplate.)

Basic OpenGL Decisions

The new renderer uses shaders, not the fixed function pipeline.  Because of this, I was able to program one or two LDraw-specific tricks into the shaders to avoid CPU work.

The shaders understand the LDraw concept of a "current" color (which is the top of a stack of color changes induced by the inclusion of sub-parts/sub-models) vs static hard-coded colors; a given part might be a mix of "this is red" and "fill in the blank".  I represent these meta-colors that come off the stack as special RGBA quadruples with A=0 and RGB having a special value; the shader can then pull off the current stack state and substitute it in.  This is important because it means that I can use a single mesh (with color) for any given part regardless of color changes (the mesh doesn't have to be edited) and I don't have to draw the mesh in two batches (which would cost CPU time).
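The substitution the shader performs can be sketched on the CPU like this.  The sentinel values are assumptions: the text only says "A=0 and RGB having a special value", so kCurrentMarker and kComplementMarker below are hypothetical encodings, and the real shader's values may differ.

```cpp
#include <cassert>

struct RGBA { float r, g, b, a; };

// Hypothetical sentinel encodings for the meta-colors (alpha is always 0).
constexpr RGBA kCurrentMarker    = {1.0f, 1.0f, 1.0f, 0.0f};
constexpr RGBA kComplementMarker = {0.0f, 0.0f, 0.0f, 0.0f};

inline bool same_rgb(const RGBA& x, const RGBA& y) {
    return x.r == y.r && x.g == y.g && x.b == y.b;
}

// Mirrors the per-vertex logic: hard-coded colors pass through untouched,
// meta-colors are replaced by the colors supplied for this part placement.
RGBA resolve_color(const RGBA& vertex_color, const RGBA& current, const RGBA& complement) {
    if (vertex_color.a == 0.0f) {
        if (same_rgb(vertex_color, kCurrentMarker))    return current;
        if (same_rgb(vertex_color, kComplementMarker)) return complement;
    }
    return vertex_color;
}
```

Exact float comparison is safe here only because the sentinels are written into the mesh verbatim and never interpolated before the check.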

BrickSmith "flattens" parts in the standard LDraw library into a single simple representation - in other words, the 'stack' is simulated and the final output is consolidated.  Thus a part is typically a single set of tris, lines, and quads, all stored in a single VBO, with no state change.  Thus the library parts are "atomic".  The VBO is loaded as STATIC_DRAW (because it is virtually never changed) for maximum speed.

Because LDraw models are currently flat shaded, BrickSmith does not attempt to index and share vertices; all VBOs are non-indexed, and a distinct set of GL_QUADS is maintained to avoid vertex duplication.

(I believe we would get a performance boost by indexing vertices, but only if smooth shading could be employed; with flat shading virtually all vertices have different normals and cannot be merged.)

Attribute Instancing

A naive way to draw part meshes would be to use glPushMatrix/glRotate/glTranslate sequences to push transforms to the GPU.  The problem with this technique is that it is CPU expensive; the built-in matrix transforms are uniform state, and on modern cards this uniform state has to live in a buffer where the GPU can read it.

Thus each time you 'touch' the matrix state, the entire set of built-in uniforms including the ones you haven't messed with (your projection matrix, lighting values, etc.) gets copied into a new chunk of buffer that must be sent to the card.  The driver doesn't know that you're only going to touch transform, so it can't be efficient.

That uniform buffer will then either be in AGP memory (requiring the card to read over the PCIe bus at least at first to draw) or it will have to be DMAed into VRAM (requiring the card to set up, schedule, and wait for a DMA transfer).  Either way, that's a lot of work to do per draw call, and it's going to limit the total number of draw calls we can have.

Remember, 5000 draw calls is "a lot" of draw calls for real-time framerates.  But 5000 bricks is only one or two big lego models.  If you model all of the modular houses and you just want to see them in realtime, that's 17,090 parts -- that's a lot of draw calls!

One trick we can do to lower the cost of matrix transforms is to store our model view matrix in vertex attributes rather than in a uniform.  Vertex attributes are very cheap to change (via glVertexAttrib4f) and it allows us to draw one brick many times with no uniform change (and thus all of that uniform work by the card gets skipped).  If we draw one brick many times, we can avoid a VBO bind, avoid uniform change, and just alternate glVertexAttrib4fv, glDrawArrays, repeat.

This technique is sometimes called "immediate mode" instancing because we're using immediate mode to jam our instancing data down the pipe quickly.  For X-Plane on OS X, immediate mode instancing is about 2x faster than the built-in uniform/matrix transforms.

BrickSmith's new renderer code is built around a 24-float instance: a 4x4 matrix stored in 4 attributes, an RGBA current color, and an RGBA complement color.
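As a rough picture of that layout, here is what a 24-float instance record might look like.  The struct and field names are hypothetical (BrickSmith's actual declaration may differ); only the 16 + 4 + 4 float breakdown comes from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout of one instance: 24 floats in total.
struct BrickInstance {
    float transform[16];       // 4x4 matrix, fed to the shader as four vec4 attributes
    float current_rgba[4];     // "current" color for this placement
    float complement_rgba[4];  // complement color for this placement
};
static_assert(sizeof(BrickInstance) == 24 * sizeof(float), "instance must be 24 packed floats");

// Number of vec4 attribute slots one instance occupies (4 matrix rows + 2 colors).
constexpr int kInstanceAttribSlots =
    static_cast<int>(sizeof(BrickInstance) / (4 * sizeof(float)));

// Bytes needed to hold `count` instances back to back.
constexpr std::size_t instance_bytes(std::size_t count) {
    return count * sizeof(BrickInstance);
}
```

With 4-byte floats each instance is 96 bytes, so the record occupies six vec4 attribute slots.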

Hardware Instancing

One nice thing about using attributes to instance is that it makes using hardware instancing simple.  With hardware instancing, we give the hardware an array of 24-float "instances" (e.g. the position/colors of a list of the same brick in many places) and the brick itself, and issue a single draw call; the hardware draws the brick mesh for each instance location.

Hardware instancing is much faster than immediate mode instancing - YMMV but in X-Plane we see instancing running about 10x faster than immediate mode; X-Plane can draw over 100,000 individual "objects" when instancing is used - more than enough for our modular houses.

To use instancing we put the instance data into a VBO and use glVertexAttribDivisor to tell OpenGL that we want some attributes (the instance data) to be used once per model, while the other data is once per vertex.
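The divisor rule is easy to state as arithmetic.  The function below is a CPU-side model of which array element an attribute reads for a given vertex and instance; it is only a model of the documented glVertexAttribDivisor behavior (ignoring any base-instance offset), not driver code, and the function name is made up.

```cpp
#include <cassert>
#include <cstddef>

// Which element of an attribute array is fetched for (vertex, instance)?
//   divisor == 0 : the attribute advances once per vertex (ordinary mesh data)
//   divisor == N : the attribute advances once every N instances
std::size_t attrib_fetch_index(unsigned divisor, std::size_t vertex, std::size_t instance) {
    return divisor == 0 ? vertex : instance / divisor;
}
```

For a brick renderer the mesh attributes would keep divisor 0 and the instance attributes would use divisor 1, so every vertex of instance i sees the same element i of the instance array.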

For BrickSmith, the instance locations are GL_STREAM_DRAW - BrickSmith generates the list of bricks per frame as the renderer traverses the model.  So the bricks themselves are static but their locations are on the fly.  I chose this design because it was the simplest; BrickSmith has no pre-existing way to cache sets of brick layouts.  At 24 floats, even a 100,000 brick model is only about 10 MB of data per frame - well within the range of what we can push to the card.

(By comparison, X-Plane precomputes and saves sets of objects, so the instance location data is GL_STATIC_DRAW.)

Drawing Dispatch

The actual drawing code uses a mix of immediate mode instancing, hardware instancing, and straight drawing.  The logic goes something like this:
  • As draw calls come in, we simply accumulate information about what the app wants to draw.  Parts that are off screen are culled.  (Since we are bound on the GPU, we can afford the CPU time to reduce vertex count.)
  • If a part contains translucency, it goes in a special "sorted" bucket, which is drawn last, from back to front.  The sorting costs CPU time, so we only do this when we identify a part with translucency.
  • Parts with "stack complexity" (e.g. texturing) that need to be drawn in multiple draw calls go in another bucket, the "just draw it" bucket - they are sorted by part so that we can avoid changing VBOs a lot - changing VBOs takes driver time!
  • Parts that are simple go in the "instancing" bucket, and we accumulate a list of locations and parts (again, organized by part.)
When it comes time to draw the instancing bucket, we choose immediate mode or hardware instancing based on (1) whether the GPU has hardware instancing and (2) the number of instances; for a very small number of instances, it's cheaper to push the attributes than to change the buffer bindings (which is necessary for hardware instancing). The exact cutoff will vary with app, but typically hardware pays for more than 3 instances.
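The dispatch logic above can be sketched as two small decisions.  All names are hypothetical, and two details are assumptions: that translucency takes priority when a part is both translucent and "stack complex", and that the cutoff is exactly 3 (the text only offers that as a typical figure).

```cpp
#include <cassert>
#include <cstddef>

enum class Bucket { Sorted, JustDrawIt, Instancing };
enum class InstanceMode { Immediate, Hardware };

struct PartTraits {            // hypothetical summary of a part
    bool has_translucency;
    bool has_stack_complexity; // e.g. texturing that needs multiple draw calls
};

// Assumption: translucency wins if a part has both properties.
Bucket choose_bucket(const PartTraits& p) {
    if (p.has_translucency)     return Bucket::Sorted;      // drawn last, back to front
    if (p.has_stack_complexity) return Bucket::JustDrawIt;  // sorted by part
    return Bucket::Instancing;
}

// Rule of thumb from the text; the real cutoff should be measured per app.
constexpr std::size_t kHardwareCutoff = 3;

InstanceMode choose_instance_mode(bool gpu_has_hw_instancing, std::size_t instance_count) {
    if (gpu_has_hw_instancing && instance_count > kHardwareCutoff)
        return InstanceMode::Hardware;
    return InstanceMode::Immediate;
}
```

Keeping the cutoff in one constant makes it cheap to re-tune once you have profile numbers for a given driver.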

Note that all instanced parts are written into a single giant stream buffer.  This lets us avoid mapping and unmapping the buffer over and over, and it lets us avoid having a huge number of small buffers.  Generally fewer, larger VBOs are better - they're relatively expensive objects for the driver to manage; if your VBOs are less than one VM page, find a way to merge them.
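A minimal sketch of packing every part's instances into one contiguous array, with a (first, count) range per part so each part can still be drawn from its slice.  The types and the use of an integer part id are assumptions for illustration; the std::vector stands in for the mapped stream VBO.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Instance { float data[24]; };  // one 24-float instance record

struct DrawRange {
    int         part_id;  // hypothetical part identifier
    std::size_t first;    // index of the part's first instance in the stream
    std::size_t count;    // number of instances of this part
};

// Concatenates the per-part lists into `stream` (standing in for the one
// big stream buffer) and returns where each part's slice landed.
std::vector<DrawRange> pack_instances(const std::map<int, std::vector<Instance>>& by_part,
                                      std::vector<Instance>& stream) {
    stream.clear();
    std::vector<DrawRange> ranges;
    std::size_t total = 0;
    for (const auto& entry : by_part) {
        if (entry.second.empty()) continue;  // nothing to draw for this part
        ranges.push_back({entry.first, stream.size(), entry.second.size()});
        stream.insert(stream.end(), entry.second.begin(), entry.second.end());
        total += entry.second.size();
    }
    assert(stream.size() == total);
    return ranges;
}
```

With this layout the buffer is filled once per frame, and each part's draw only needs an offset into it.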

Performance Results

The new renderer often runs about 2x faster than the existing code, while providing sorted transparency, and it typically runs at significantly lower CPU.

One case where it did not run faster was with Datsville - the Datsville model I have for testing is about 39,000 bricks, resulting in 125 million vertices.  It runs on my 4870 at about 5 fps.

In the old renderer, I would see 100% CPU and about 5 fps; with the new one, maybe 30-35% CPU and 5.1 fps.  Why no more speed?  It turns out that the total vertex capacity of the card is only about 500 million vertices/second, so the card is vertex bound. (This is quite rare for games.)  When the model is partly off-screen, framerate increases significantly.

Thursday, December 13, 2012

Static Libraries and Plugins: Global Pain

Years after the X-Plane plugin SDK was ported to operating systems that support Unix .a (archive) static libraries, I have finally come to understand what a mess global symbols can make with incorrect linker settings. The problem is that the incorrect linker settings are almost always the defaults. This blog post will explain what goes wrong with this kind of linker setup and how to fix it.  While this stuff might be obvious to those intimately familiar with Linux and Unix-style linking, it's a bit astonishing to anyone coming from the Windows and pre-OS X Mac world, where the assumptions about linkage are very different.

This post may also thoroughly slander Linux, and if I learn why I'm an idiot and the whole problem can be solved in a much better way, hey, that's great.  I'd much rather find out that there's a better way and I'm wrong than find out that things really are as broken as they seem.

Global Symbols and Shared Libraries

Unix-style linkers (e.g. ld on both OS X and Linux) support a shared global namespace for symbols exported from shared libraries. Simply put, for any given symbol name, there can be only one 'real' implementation of that symbol, and the first dynamic library (or host app with dynamic linkage, which is basically all host apps these days) to introduce that symbol defines it for every dynamic library.

In other words, if you have five implementations of "void a()" in your dynamic libraries, the first one loaded is used by everyone.  It's a global namespace.

Note that if your symbol is not global, it will not be replaced by an earlier variant.  So if your symbol isn't global, other people having global symbols can't hose you.

The implications of this are clear: you should be very very careful and very very minimal about what gets exported into the global namespace, because of the risk of symbol collision.  I found a bug in an X-Plane plugin because the internal routine sasl_done (in a plugin called sasl) was global and the second instance loaded - sasl_done from libsasl2.dylib had already been loaded by the OS.  The results: a random call into a DLL when the plugin thought it was calling itself!

Unfortunately, the default for GCC is to put everything into the global namespace.  As gcc 3.x fades into history, more code is using -fvisibility=hidden and attributes more aggressively, but the defaults make it really easy to do the wrong thing and dump a whole lot of symbols into the flat namespace.
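For reference, this is roughly what attribute-based visibility looks like with GCC or Clang.  The macro names and the helper function are hypothetical, and the signature of PluginStart is an assumption; the idea is to build the plugin with -fvisibility=hidden and mark only the official entry points as default visibility.

```cpp
#include <cassert>

// Hypothetical convenience macros around the GCC/Clang visibility attribute.
#if defined(__GNUC__) || defined(__clang__)
  #define PLUGIN_EXPORT __attribute__((visibility("default")))
  #define PLUGIN_LOCAL  __attribute__((visibility("hidden")))
#else
  #define PLUGIN_EXPORT
  #define PLUGIN_LOCAL
#endif

// Internal helper: hidden, so it cannot collide with (or be replaced by)
// a same-named global symbol from another dynamic library.
PLUGIN_LOCAL int helper_done(int code) { return code + 1; }

// Official entry point: the only symbol meant to leave the plugin.
extern "C" PLUGIN_EXPORT int PluginStart(void) { return helper_done(0); }
```

With -fvisibility=hidden on the command line the PLUGIN_LOCAL marking is redundant for your own code, but it documents intent and still protects you if someone builds with the defaults.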

There is one exception to this global calling: if you use dlsym to resolve a symbol from a specific dynamic library (as returned by dlopen), dlsym finds it in that dynamic library, like you would expect.  Therefore if you have a plugin with an "official" entry point (like "PluginStart") you can load multiple plugins into the global namespace and find the "right" start function via dlsym.  (If a plugin called its own start routine, it might jump into the wrong plugin due to global namespace issues.)
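A small sketch of the host side of that lookup, using the POSIX dlopen/dlsym calls.  The function-pointer signature for PluginStart and the helper name find_plugin_start are assumptions for the example, not the X-Plane SDK's actual declarations.  (On older glibc you may need to link with -ldl.)

```cpp
#include <cassert>
#include <dlfcn.h>

using PluginStartFn = int (*)(void);  // hypothetical entry point signature

// Loads the library at `path` and resolves PluginStart from that specific
// handle, rather than relying on whichever global symbol loaded first.
// Returns nullptr if the library or the symbol cannot be found.
// On success the handle is intentionally left open: the returned pointer
// is only valid while the library stays loaded.
PluginStartFn find_plugin_start(const char* path) {
    assert(path != nullptr);
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) return nullptr;
    void* sym = dlsym(handle, "PluginStart");
    if (sym == nullptr) {
        dlclose(handle);
        return nullptr;
    }
    return reinterpret_cast<PluginStartFn>(sym);
}
```

A real host would keep the handle around so it can dlclose the plugin on unload; that bookkeeping is omitted here.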

What Am I Exporting?

On both OS X and Linux you can use "nm" to view your globally exported symbols:
nm my_plugin.dylib | grep "T "
The stuff with a capital T from nm are code symbols in the global namespace.  If you make a plugin DLL that has a lot of those, your code may not operate if there are other plugins already loaded.

Static Libraries: Not So Static

In the Unix world, the .a (static archive) format is basically a collection of .o files with some header info to optimize when the code is linked.  .o files retain the hidden/visible attribute information that is used by the linker to export symbols out of a dynamic library.

What this means is: under normal operation, the linker may export dynamic library symbols out of a static library you link against. In other words, if you link against libpng.a, you may end up having your DLL export all of the symbols of libpng!  If you aren't the first dynamic library to load, the version of libpng you get may not be the static one you asked for.

This behavior is astonishing at best, but unfortunately it is, again, the default: if the static library didn't specifically set its symbols to hidden, you get "leakage" of static library symbols out of the client shared library.  Unfortunately, from my experience this kind of leakage happens all of the time.  With X-Plane we statically link libcurl, libfreetype and libpng, and all three have their symbols marked globally by default.  These are ./configure based libraries and we don't want to start second-guessing their build decisions.  Unfortunately the code tends to be marked up to build the right API in "shared library" mode but not static mode.

You can see this behavior using nm -m on OS X or objdump -t on Linux.

Working Around Library Leak

Someday we may reach a point where all Unix static libraries keep their symbols "hidden" for dynamic library purposes, but until then there is something a DLL can do to work around this problem: use an explicit list of symbol exports.

Using an explicit list of symbol exports is often considered annoying when an API has a large set of public entry points; usually attributes marking specific functions are preferred.  The advantage of an "official list" at link time is that the linker hides everything except that list, and if any static libraries have globally visible symbols, their absence from the master list fixes the problem.

(As an example of how to set this up for gcc on Linux and OS X, see here.)

Addendum: What About Namespacing?

Both Linux and OS X have over time developed ways to cope with the flat namespace problem.

On OS X, a dynamic library can be linked with a two-level namespace.  The symbol is resolved against both the name of the providing dylib and the symbol itself. The result is that symbols come only from the dylibs where you thought they would come from.  If at link time symbol A comes from library X, library X is the only place where it will be provided in the future.  (This is the semantics Windows developers are used to.)

On Linux, library APIs can contain version information; as far as I can tell this works by "decorating" symbols with a named library version (e.g. @@GLIBC_2.0).  When the ABI is changed, symbols cannot conflict between versions, and in theory this may also protect against cross-talk between libraries since the version symbol has some kind of short universal library identifier.  I have found almost no documentation on library versioning; if anyone has a good Linux link I'll add it to this post.

X-Plane's plugin system does not use either of these mechanisms because the plugin system is older than both of them. (Technically on OS X two-level namespaces are older than the plugin system, but the plugin system is older than @loader_path, which is a requirement for strict linking of a dylib in the SDK.)  Thus we are stuck with the global namespace and find ourselves trying to force people to keep their symbols to themselves.