Comments on The Hacks of Life: There Must Be 50 Ways to Draw Your Streaming Quads

From my experience developing a voxel engine: for ...

2016-05-26T20:22:32.707-04:00

From my experience developing a voxel engine: for small vertex strides and small "instance" sizes, using a geometry shader to blow up each instance is MUCH faster than instancing. As the vertex stride increases, however, the geometry shader becomes inferior to simply filling a VBO with a pair of triangles for each quad. As for instancing in that case, common wisdom amongst the voxel gurus is that instancing is intended for *large* vertices-per-instance counts, and small ones won't perform well at all. At the end of the day, though, this only matters if you're not already fill-rate bound... and being fill-rate bound is a likely scenario if your voxels are better looking than Minecraft's :)
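A minimal CPU-side sketch of the "pair of triangles for each quad" option, for illustration only. The `Vertex` and `Quad` types and the `expandQuads` helper are hypothetical names, not taken from any engine; a real vertex would likely carry more attributes, and the result would still have to be copied into a VBO and drawn as GL_TRIANGLES.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal vertex: 2D position only, to keep the stride small.
struct Vertex {
    float x, y;
};

// Hypothetical quad "instance": center plus half extents.
struct Quad {
    float cx, cy;  // center
    float hw, hh;  // half width, half height
};

// Expand each quad into two triangles (6 vertices per quad).
std::vector<Vertex> expandQuads(const std::vector<Quad>& quads) {
    std::vector<Vertex> out;
    out.reserve(quads.size() * 6);
    for (const Quad& q : quads) {
        const Vertex bl{q.cx - q.hw, q.cy - q.hh};  // bottom left
        const Vertex br{q.cx + q.hw, q.cy - q.hh};  // bottom right
        const Vertex tl{q.cx - q.hw, q.cy + q.hh};  // top left
        const Vertex tr{q.cx + q.hw, q.cy + q.hh};  // top right
        // Both triangles use counter-clockwise winding (y pointing up).
        out.push_back(bl); out.push_back(br); out.push_back(tr);
        out.push_back(bl); out.push_back(tr); out.push_back(tl);
    }
    assert(out.size() == quads.size() * 6);
    return out;
}
```

The cost of this approach is six full vertices per quad, which is why it only starts to win once the per-vertex stride is large enough that the geometry shader's output becomes the bottleneck.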

95% of the time you want to go pos0, color0, norma...

2013-11-06T12:45:36.850-05:00

95% of the time you want to go pos0, color0, normal0, pos1, color1, normal1. The GPU is going to fetch data from memory in relatively big chunks (it has a wide memory system) - if all of the data in a single vertex is nearby, then all of the data it fetches is useful immediately, and cache utilization is good.
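As a rough sketch of what that interleaved layout looks like in memory, assuming a hypothetical vertex with a float position, a 4-byte RGBA color and a float normal (the struct and helper names below are made up for illustration). The stride and offsets are the values you would hand to glVertexAttribPointer for each attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved vertex: pos0, color0, normal0, pos1, color1, ...
struct Vertex {
    float        position[3];  // 12 bytes
    std::uint8_t color[4];     //  4 bytes, RGBA
    float        normal[3];    // 12 bytes
};

// Holds on mainstream compilers; the whole vertex fits in one small chunk.
static_assert(sizeof(Vertex) == 28, "expected a tightly packed 28-byte vertex");

// Stride and byte offset of one attribute inside the interleaved buffer.
struct AttribLayout {
    std::size_t stride;
    std::size_t offset;
};

constexpr AttribLayout kPosition{sizeof(Vertex), offsetof(Vertex, position)};
constexpr AttribLayout kColor{sizeof(Vertex), offsetof(Vertex, color)};
constexpr AttribLayout kNormal{sizeof(Vertex), offsetof(Vertex, normal)};

// Byte offset, from the start of the buffer, of attribute `a` of vertex `i`.
inline std::size_t attribByteOffset(AttribLayout a, std::size_t i) {
    assert(a.offset < a.stride);  // the attribute must lie inside one vertex
    return i * a.stride + a.offset;
}
```

Every attribute of vertex `i` falls within the same 28-byte span, which is the locality argument made above.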

The only exception is if you need some data to change per frame and some is static - in that case, keep the data separate AND use separate VBOs. That way you can set the unchanged data to STATIC_DRAW and the changing data to STREAM_DRAW.

I'm curious, I've only this week started p...

2013-11-06T11:28:51.609-05:00

I'm curious, I've only this week started playing around properly with OpenGL so please excuse my ignorance here.

When you say your vertex attributes for all mesh data are interleaved in a single VBO...

Is that to say that the vertices and the attributes themselves are interleaved? e.g.:
{pos_0,colour_0,pos_1,colour_1,...,pos_n,colour_n}

Or is each individual object encoded such that the vertices are all sequential, followed by a sequential block of attributes?
{pos_0,pos_1,...,pos_n,colour_0,colour_1,...,colour_n}

I suspect the latter, such that the vertices for an object are sequential, followed by a sequential block of attributes for that object, and then repeat for each object...

Is there any measurable benefit of either approach?

Hi Vladimir, We always keep all vertex attributes...

2013-04-21T11:43:20.979-04:00

Hi Vladimir,

We always keep all vertex attributes interleaved in a single VBO for both iOS and desktop - only the indices are in a separate VBO.

We do this to maintain locality - whatever hw fetches a vertex, we want the vertex to sit in a cache line, etc.

Cheers
Ben

You are right, "Test, compare and analyze usi...

2013-04-20T21:46:59.811-04:00

You are right, "Test, compare and analyze using Instruments" is what I recommend to everyone too :D

Btw do you have all vertex data in one VBO and stream everything or do you split them into multiple so you can update (or not) them separately? I mean in the iOS version. I prefer the interleaved method but in our case it doesn't really matter... I think :)

Totally true - on IOS with GLES 2.0 you do have th...

2013-04-20T21:26:56.536-04:00

Totally true - on iOS with GLES 2.0 you do have the option to write a shader to 'decompress' some kind of packed up transformation. But you also bring up the other issue: the GPUs aren't super-beefy. I think it's a question of what the game is bound up on... if the game is dying for CPU, an offload is a win. If the CPU is idle and you've maxed out the poor GPU, then trying to use GPU-based transform is not a win.

I think in our latest code we use indexed triangles and orphaning and it works pretty well. You can tell pretty easily when you've won with Instruments - when you're doing something the driver doesn't like, a ton of CPU stuff with ominous names will show up in the stack trace under the glDrawXXX call. :-)
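For readers unfamiliar with the indexed-triangle setup, here is a small sketch of how the index data for a batch of quads might be generated. `buildQuadIndices` is a hypothetical helper, not code from the article, and the corner order (0 = bottom left, 1 = bottom right, 2 = top right, 3 = top left) is an assumption. The indices never change, so they can live in a static buffer while only the four vertices per quad are streamed; orphaning the vertex buffer is typically done by calling glBufferData with a NULL data pointer before refilling it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build 16-bit indices for `quadCount` quads, 4 vertices and 6 indices each.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // The largest index is quadCount * 4 - 1 and must fit in 16 bits.
    assert(quadCount * 4 <= 65536);

    // Two triangles per quad, assuming corners are stored bl, br, tr, tl.
    static const std::uint16_t pattern[6] = {0, 1, 2, 0, 2, 3};

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::size_t base = q * 4;  // first vertex of this quad
        for (std::uint16_t corner : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + corner));
        }
    }
    return indices;
}
```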

Wow, what a nice article! You are right, the str...

2013-04-20T21:19:45.145-04:00

Wow, what a nice article!

You are right, the streaming method is the best for rendering dynamic 2D geometry under iOS (OpenGL ES 2.0). The thing is how to do it most efficiently - I mean how to send data to the GPU… and also what data.

You touched on the latter part by mentioning "instancing" and "compressing". When I say "instancing under OpenGL ES 2.0" I mean that you send the object's transformation matrix (compressed if possible) as a part of the vertex data, extract/recreate the matrix in the vertex shader and then do the transform on the GPU. This is the method which I use in our game for rendering many dynamic 2D objects (which have relatively few vertices on their own) in one draw call. Btw I use "linked" triangle strips, which seemed like a better solution than options like an indexed triangle list. I didn't perform any relevant comparison test to prove that, though.
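To make the "compressed matrix in the vertex data" idea concrete, here is a CPU-side reference sketch of the unpacking a vertex shader could perform. It assumes the packed form is a translation, a rotation angle and a positive uniform scale; that encoding, and the names `PackedTransform`, `unpack` and `apply`, are assumptions for illustration and not necessarily what the game uses.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-object data stored alongside each vertex.
struct PackedTransform {
    float tx, ty;  // translation
    float angle;   // rotation in radians
    float scale;   // uniform scale (assumed positive in this sketch)
};

struct Vec2 {
    float x, y;
};

// 2D affine matrix: x' = a*x + c*y + tx,  y' = b*x + d*y + ty
struct Mat3x2 {
    float a, b, c, d, tx, ty;
};

// Rebuild the full matrix from 4 floats instead of sending 6.
Mat3x2 unpack(const PackedTransform& p) {
    assert(p.scale > 0.0f);
    const float cs = std::cos(p.angle) * p.scale;
    const float sn = std::sin(p.angle) * p.scale;
    return Mat3x2{cs, sn, -sn, cs, p.tx, p.ty};
}

// Transform a local-space vertex position.
Vec2 apply(const Mat3x2& m, Vec2 v) {
    return Vec2{m.a * v.x + m.c * v.y + m.tx,
                m.b * v.x + m.d * v.y + m.ty};
}
```

The same arithmetic can run on the CPU before upload or in the vertex shader after it, which is exactly the trade-off being weighed in this thread.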

But I wonder if doing the transform on the GPU is faster than doing it on the CPU (using NEON instructions if possible). PowerVR GPUs' SIMD architecture should be quite a good match for that task, but considering the differences in frequencies (1 GHz CPU vs 200 MHz GPU) I am not sure if the GPU is really that much faster at doing a 2D transform. Also there are many different SoCs in iOS devices - starting with the A4 in the iPhone 4 (Cortex-A8/SGX535) and ending with the newest A6X in the iPad 4 (?Cortex-A9 Custom?/SGX554), so GPU/CPU performance may vary from device to device. There is no recommended method. We are not vertex bound so I don't plan to investigate further… for now :)

Now about the problem of how to send data to the GPU (updating your VBO). This always translates to questions like: "Single-buffered, double-buffered or even triple-buffered? Use a ring buffer? Discard an old buffer? Use glMapBuffer(Range), glBufferData or glBufferSubData?", and more similar ones to which only the GPU driver engineers know the answer :) There are just too many ways to do it, which means you can easily do it wrong. But do I actually care? Nope, because in most iOS games the differences are really not so huge and again vary from one iDevice to another… I tested some variants a long time ago on my old iPhone 4 and iPad 2 and decided to stick with the best one after that. So right now I use a triple-buffered approach with glMapBuffer while discarding an old buffer. I haven't done any comparison tests since then, so maybe another method is better on newer devices, but I think I don't care enough because the game still runs at 60 frames per second and it has other problems :)
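The bookkeeping behind a triple-buffered scheme can be sketched without any GL calls. The `BufferRing` class below is a hypothetical helper that only decides which of N buffers to fill each frame; mapping, discarding and drawing from the chosen VBO are left out, and the slot index would select among buffer handles created elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Cycles through `bufferCount` slots so the buffer written this frame is not
// the one the GPU may still be reading from the previous frame(s).
class BufferRing {
public:
    explicit BufferRing(std::size_t bufferCount)
        : count_(bufferCount), next_(0) {
        assert(bufferCount > 0);
    }

    // Returns the slot to fill this frame and advances to the next one.
    std::size_t acquire() {
        const std::size_t slot = next_;
        next_ = (next_ + 1) % count_;
        return slot;
    }

    std::size_t size() const { return count_; }

private:
    std::size_t count_;  // number of buffers in the ring
    std::size_t next_;   // slot handed out by the next acquire()
};
```

With three slots, a buffer is reused only after two other frames have been submitted, which is the slack that lets the driver avoid stalling on a buffer still in flight. Whether three is the right number is device dependent, as noted above.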

So in the end I am streaming my dynamic geometry using my preferred method and all seems pretty fast in that part of the pipeline because I am fill-rate bound anyway :)

#end_of_rant

PS.: Btw I really like your older articles about double-buffered VBOs :)