
172. Optimising shaders

October 9, 2014

As the title suggests, this is about making shaders run faster. All the hints that follow are taken from Apple documentation for iOS.

I could summarise many of the hints into just this –

treat your fragment shader as the contents of a really massive “for” loop.

It is extremely important to optimise everything you can in there, because that code runs once for every pixel drawn.


  1. There are several references below to OpenGL ES 3.0 capability. This applies only to the iPhone 5S and 6, and the iPad Air (or better). Even then, I am not certain that Codea implements the latest OpenGL versions.
  2. Direct quotes are shown in this colour.
  3. All the hints come from here

Don’t branch if you can avoid it

Apple says multiple times that you should avoid using “if” statements that allow the shader to do different things. For example, if you have a shader that handles a variety of lighting based on uniform inputs, and your app isn’t going to use some of them, it’s better to remove them altogether from the shader. The reason is partly that “they can reduce the ability to execute operations in parallel on 3D graphics processors (although this performance cost is reduced on OpenGL ES 3.0–capable devices)”.

Specific advice is as follows:

Your app may perform best if you avoid branching entirely. For example, instead of creating a large shader with many conditional options, create smaller shaders specialized for specific rendering tasks. There is a tradeoff between reducing the number of branches in your shaders and increasing the number of shaders you create. Test different options and choose the fastest solution.

If your shaders must use branches, follow these recommendations:

  • Best performance: Branch on a constant known when the shader is compiled.
  • Acceptable: Branch on a uniform variable.
  • Potentially slow: Branch on a value computed inside the shader.
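
As a rough sketch of the "best performance" case, here is a fragment shader where an optional fog effect is switched on or off by a constant known at compile time. The names (USE_FOG, uTexture, uFogColor, vTexCoord, vFogAmount) are my own, made up for illustration, and I am assuming the app sets USE_FOG before it compiles the shader.

```glsl
// Fragment shader sketch (GLSL ES 1.00). All names here are hypothetical.
#define USE_FOG 1   // assumed to be set by the app before compiling

precision mediump float;

uniform lowp sampler2D uTexture;
uniform lowp vec4 uFogColor;

varying highp vec2 vTexCoord;
varying lowp float vFogAmount;   // 0.0 = no fog, 1.0 = fully fogged

void main()
{
    lowp vec4 col = texture2D(uTexture, vTexCoord);
#if USE_FOG
    // This line only exists in the compiled shader when USE_FOG is 1,
    // so there is no "if" left to evaluate at run time
    col.rgb = mix(col.rgb, uFogColor.rgb, vFogAmount);
#endif
    gl_FragColor = col;
}
```

The tradeoff Apple mentions still applies: compiling the shader once with USE_FOG set to 1 and once with 0 gives you two shaders to manage instead of one shader with a branch.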


  • When in doubt, default to high precision.
  • Colors in the 0.0 to 1.0 range can usually be represented using low precision variables.
  • Position data should usually be stored as high precision.
  • Normals and vectors used in lighting calculations can usually be stored as medium precision
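
To show roughly how those precision hints map onto declarations, here is a small vertex shader sketch. The attribute and uniform names are hypothetical, and I am assuming uLightDir has already been normalised by the app.

```glsl
// Vertex shader sketch (GLSL ES 1.00). All names here are hypothetical.
uniform highp mat4 uModelViewProjection;  // position maths: high precision
uniform mediump vec3 uLightDir;           // lighting vector: medium precision
                                          // (assumed to be normalised already)

attribute highp vec4 aPosition;           // positions: high precision
attribute mediump vec3 aNormal;           // normals: medium precision
attribute lowp vec4 aColor;               // colours in 0.0 to 1.0: low precision

varying lowp vec4 vColor;

void main()
{
    // Simple diffuse term, clamped so surfaces facing away get no light
    mediump float diffuse = max(dot(normalize(aNormal), uLightDir), 0.0);
    vColor = vec4(aColor.rgb * diffuse, aColor.a);
    gl_Position = uModelViewProjection * aPosition;
}
```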

Minimise multiplications

highp float f0, f1;
highp vec4 v0, v1;
v0 = (v1 * f0) * f1; // multiply vec by float, then result by float: 4 + 4 = 8 multiplies

...runs slower than this

highp float f0, f1;
highp vec4 v0, v1;
v0 = v1 * (f0 * f1); // multiply floats, then result by vec: 1 + 4 = 5 multiplies

...and if you are only changing (say) the x and z components, then only calculate those components, which halves the work

v1.xz = v1.xz * f1; // 2 multiplies instead of 4

Avoid messing with texture coordinates

Dynamic texture lookups, also known as dependent texture reads, occur when a fragment shader computes texture coordinates rather than using the unmodified texture coordinates passed into the shader. Dependent texture reads are supported at no performance cost on OpenGL ES 3.0–capable hardware; on other devices, dependent texture reads can delay loading of texel data, reducing performance. When a shader has no dependent texture reads, the graphics hardware may prefetch texel data before the shader executes, hiding some of the latency of accessing memory.

This says that the graphics hardware is very efficient at looking up textures when it can use the texture coordinates exactly as they were passed in. If you recalculate them in the fragment shader, it may hurt performance on devices without OpenGL ES 3.0 capability. (For example, my favourite tiling shader recalculates texture positions, but I think the huge benefits outweigh performance issues.)
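
One way to avoid a dependent read, sketched below, is to do the coordinate calculation in the vertex shader instead. The names are hypothetical (uScroll is an offset I have invented for the example). As far as I can tell, this only works for calculations that vary evenly across a triangle, such as an offset or a scale.

```glsl
// Vertex shader sketch (GLSL ES 1.00). All names here are hypothetical.
uniform highp mat4 uModelViewProjection;
uniform highp vec2 uScroll;      // texture offset the app wants applied

attribute highp vec4 aPosition;
attribute highp vec2 aTexCoord;

varying highp vec2 vTexCoord;

void main()
{
    // Apply the offset once per vertex. The interpolated result
    // arrives in the fragment shader ready to use as it is.
    vTexCoord = aTexCoord + uScroll;
    gl_Position = uModelViewProjection * aPosition;
}
```

The fragment shader then calls texture2D with vTexCoord untouched, so the lookup is not a dependent read. Something like fract() for tiling cannot be moved this way, because it does not interpolate correctly between vertices.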

Keep everything simple

This is pretty obvious. They are talking about things like lighting. For example, setting light intensity simply based on distance from the camera gives a realistic effect of a lamp, and is much faster than full Phong lighting.
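
Here is a minimal sketch of that distance-based lighting. The straight-line falloff and all the names (uLightRange, vDistance and so on) are my own assumptions, not something taken from Apple's documentation, and I am assuming the vertex shader has already worked out the distance from the camera.

```glsl
// Fragment shader sketch (GLSL ES 1.00). All names here are hypothetical.
precision mediump float;

uniform lowp sampler2D uTexture;
uniform mediump float uLightRange;  // distance at which the light fades to black

varying highp vec2 vTexCoord;
varying mediump float vDistance;    // distance from camera, assumed to be
                                    // calculated in the vertex shader

void main()
{
    lowp vec4 col = texture2D(uTexture, vTexCoord);
    // Full brightness at the camera, fading in a straight line to zero
    mediump float intensity = clamp(1.0 - vDistance / uLightRange, 0.0, 1.0);
    gl_FragColor = vec4(col.rgb * intensity, col.a);
}
```

That is one divide, one subtract, a clamp and a multiply per pixel, with no normals or light vectors needed.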

Vertex culling

Apple’s advice is not to try to figure out which vertices not to draw, because that is done for you.

A TBDR graphics processor automatically uses the depth buffer to perform hidden surface removal for the entire scene, ensuring that only one fragment shader is run for each pixel. Traditional techniques for reducing fragment processing are not necessary. For example, sorting objects or primitives by depth from front to back effectively duplicates the work done by the GPU, wasting CPU time.


Because tile memory is part of the GPU hardware, parts of the rendering process such as depth testing and blending are much more efficient—in both time and energy usage—than on a traditional stream-based GPU architecture. Because this architecture processes all vertices for an entire scene at once, the GPU can perform hidden surface removal before fragments are processed. Pixels that are not visible are discarded without sampling textures or performing fragment processing, significantly reducing the calculations that the GPU must perform to render the tile.

In other words, running any extra polygon clipping to cull vertices is probably duplicating what is already happening.

This is not the same as having separate meshes and only drawing those which will show. That will be more efficient if you do it yourself. Apple is talking about figuring out which vertices in a single mesh will show. If you don’t have to draw it at all, that has to be faster. Of course, meshes have an overhead, so you don’t want too many of them. As usual, there is a balance somewhere.

As an example, I have a 3D dungeon 82 x 114 pixels across, with 5 million pixels in total, which, when empty, draws at almost 60 frames per second, showing the efficiency of Apple’s culling. As I add stuff, the speed drops. It may get to the point where it is worth breaking the huge mesh into individual rooms and corridors and only drawing those meshes which are immediately around the player. But perhaps there will be little or no difference. You can only really be sure by testing.


