209. 3D Ray Tracing (revisited)

April 9, 2015

This post is about how you can use “ray tracing” to detect what is in front of you, which can be used for things like self-driving cars. The example below is jerky and crude, but the car is steering on its own without a map of the track, using only what it can see.

This post explains three ways to do it, including a shader-based approach that I think works well, and – if you are interested in 3D programming – a couple of nice insights into how it works.


First, I apologise if you find some of this difficult to understand. This is one case where it would take me several pages to explain everything simply, so I have to assume some understanding.

We see what is in front of us by using a camera (our eyes) to record light bouncing off objects, and then our brain figures out where things are.

There are two steps –

  1. collect information about the world, and
  2. make sense of it.

The first part can be difficult, and the second part will depend largely on what you want your application to do. So I’m going to focus mainly on the first part – collecting the information about the world.

Using the camera

If we wanted our program to steer itself around the real world, we could use the iPad camera to take snapshots (ie it does the information collecting for us), and then analyse the pictures. It would probably be much too slow unless you moved very slowly – and too difficult because the iPad camera can’t take stereo pictures, so distance would be hard to estimate – but there are other applications for object detection. I’ve previously used this approach for an experiment in motion detection.

But usually, none of this matters, because we draw a scene inside the computer, and we want to steer around that. So we can’t use the camera. This makes life difficult, because now we don’t have any light bouncing off the objects and forming an image in memory. We will have to do that part ourselves.

Do it yourself

The three posts starting here show how you can draw an extra copy of the screen to an image in memory, and examine it to see what is in front of you. It will only work for 2D, because an image in memory has no depth.

You can think of the method as starting at the camera position, and “looking” along the screen at a series of angles (one at a time), moving forward along each angle, pixel by pixel, and stopping when we hit a filled pixel. This is ray tracing.
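
Here is a minimal sketch of that idea, assuming the screen copy is already sitting in an image called img and that the camera sits at (camX, camY) on it – all the names are just for illustration.

--step along one ray, pixel by pixel, until we hit something
--img is the screen copy in memory, (camX,camY) the camera position on it
function castRay(img,camX,camY,angle,maxDist)
    local dx,dy=math.cos(angle),math.sin(angle)
    for d=1,maxDist do
        local x=math.floor(camX+dx*d+0.5)
        local y=math.floor(camY+dy*d+0.5)
        --give up if we walk off the edge of the image
        if x<1 or x>img.width or y<1 or y>img.height then return nil end
        local r,g,b,a=img:get(x,y)
        --treat any non-transparent pixel as a hit, and return its distance
        if a>0 then return d end
    end
    return nil --nothing found within maxDist
end

You would call this for a series of angles fanned out across your field of view.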

It is not a very efficient way to do it, not least because drawing an extra copy of the screen can slow things down, even before you start looking at individual pixels. Also, we waste a lot of time looking at empty pixels between us and the objects on the screen. You could make it faster by only checking every (say) third pixel, but then you run the risk that an object might be quite thin (eg a flat picture), and you might skip right over it without realising.

You can speed things up by keeping a map of your scene in a 2D table in memory, and checking your position against that. If there are moving objects in the scene, you know where they are, so you can cheat and not bother with ray tracing. (And for performance reasons, we tend to cheat as much as possible).

Physics and ray tracing

The physics engine in Codea includes a ray tracing function (check out physics in the online reference). It is similar to the manual approach above, in that you give it a start and end point (which is the same as looking along a line at an angle), and it tells you if any physics objects are on the line between the points. It is probably more efficient than the manual approach, but you still have to look along a series of angles around the screen, and physics objects come with overheads that can slow things down.

I wouldn’t use this approach unless you are already working with physics objects, in which case, it could be quite useful. It also only works in 2D, like the manual approach above, because the physics engine is 2D.
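
As a sketch only, checking a single direction might look something like this – I’m assuming the raycast function takes a start and end point and returns information about the first body on the line (nil if there is none); check the physics section of the reference for the exact name and the fields of the result.

--look along one direction using the physics engine (a sketch; see the
--online reference for the exact raycast function and what it returns)
function lookAlong(camPos,angle,maxDist)
    local target=camPos+vec2(math.cos(angle),math.sin(angle))*maxDist
    local hit=physics.raycast(camPos,target) --nil if nothing is on the line
    if hit then
        return camPos:dist(hit.point) --distance to the first object hit
    end
end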

3D camera simulation using a shader

Imagine that you had a special camera that took a picture of your scene, and instead of colours, the pixels held information about distance from the camera. Then you could read the pixel values and “see” the outline of the world in front of you.

A shader can do this for you. If you don’t know shaders, but you have a fair understanding of Codea, you should have a look at them – they are amazing, and nowhere near as difficult as they seem at first. If you do know shaders, look at the code (below) and you’ll see this one is very simple.

Anything that you want to “see” needs to be in a mesh, because shaders can only attach to meshes. A sprite, ellipse or line will be ignored.

The shader

As well as drawing everything to the screen, you draw your meshes to an image in memory using a special shader that stores distance information in each pixel instead of colours. This means you get a picture where each pixel gives you the z value of the closest object along the direction of that pixel. We need to think about how to do that, because although we have the red, green and blue values where we can store data, each has to be between 0 and 255 (and we can’t use the alpha value, because that can get modified by OpenGL to smooth colours).

What I have done is just use the red pixel colour, and assume z values are always negative (if they can be positive as well, you will need to rethink this). I provide the shader with the maximum depth, and it divides the (absolute) value of z by this depth, giving a fraction between 0 and 1. I did it this way deliberately, because in shaders, colour values don’t go from 0 to 255, but from 0 to 1. I note in passing that I could also easily pass in an object identifier and store that in (say) the blue pixel, allowing me to identify which object is at each pixel.

The result is that the shader stores the z value as a fraction of the maximum depth, which is probably good enough in most cases. I do all the work in the vertex shader, because the depth is unlikely to vary greatly between the points of any particular mesh triangle, and it reduces the amount of work the shader has to do. If this assumption is incorrect, you can move the calculation to the fragment shader.

So here is my vertex shader. It has two custom inputs – the combined model and view matrix, and the maximum distance – plus the standard position attribute and modelViewProjection matrix that Codea provides.

uniform mat4 modelViewProjection; //standard Codea matrix
uniform mat4 mModel; //modelMatrix()*viewMatrix(), passed in from Lua
uniform lowp float s; //maximum distance

attribute vec4 position; //vertex position, provided by the mesh

varying lowp vec4 vColor; //colour passed on to the fragment shader

void main()
{
    lowp vec4 p = mModel * position; //transform vertex into camera space
    lowp float z = abs(p.z)/s; //divide by max distance, giving 0 to 1
    vColor = vec4(z,0.0,0.0,1.0); //store the depth fraction in the red channel
    gl_Position = modelViewProjection * position;
}

Using the shader results

The method above gives you a picture where each pixel has a coded depth value in the red channel. To retrieve the z value for each pixel, you need to

  • read the red value from the pixel
  • divide by 255 and multiply by the maximum distance given to the shader
  • make the result negative (assuming all z values need to be negative)
  • subtract the camera z position

What you do with the z value after that, is up to you.

Steering

The self driving car video at the top uses this approach, but the driving is jerky because it is quite difficult to program, even if I have the z values of all the screen pixels (in fact, for performance reasons, I only bother getting z values for one row of pixels across the middle of the screen, because that is enough to tell me where all the side walls are).

My initial approach was to aim the car at the pixel that is furthest away (ie simply read in all the pixels, and look for the one with the biggest z value), and this works pretty well, with the car always aiming itself at the next corner. The problem is that it tends to happily drive into the side wall as it gets near the corner, so I had to find some way of keeping it a certain distance from the side wall. This is what produces the jerkiness, as the car is constantly correcting.
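
As a rough sketch, finding that furthest pixel is just a loop over the middle row of the depth image – imgRay, RayRange and Sample are the same illustrative names used in the performance code further below.

--find the screen x of the furthest pixel in the middle row of imgRay
local bestX,bestZ=WIDTH/2,0
for i=1,WIDTH,Sample do
    local r=imgRay:get(i,HEIGHT/2) --red value holds the depth fraction
    local z=r*RayRange/255 --distance from the camera (no need to subtract camPos.z just to compare)
    if z>bestZ then bestX,bestZ=i,z end
end
--bestX can now be turned into a steering angle, as described in the next section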

I won’t bother describing what I did in more detail, because I’m sure it could be done much better. But I will explain how I turned the location of the furthest pixel into a rotation angle.

Calculating the rotation angle

Suppose I look at all the distances stored in my row of pixels, and the furthest z (=-1200) is for x=233 (where x is between 1 and WIDTH). A picture may show this best.

The camera is at the bottom of the screen, looking at the scene (the big picture at the back). The little screen in front is what is actually drawn for us, and is a 2D window looking onto the 3D scene. So the pixel we are looking at in the scene is actually a long way away, at z=-1200, with some x and y value, but on the little screen it is drawn in 2D at x=233 (and y = HEIGHT/2, because we are only working with the middle row of pixels on the screen).

We need to know how much to rotate, to turn from going straight up the screen, to following the dotted red line going toward the left. If we can calculate this angle, we can add it to the existing rotation angle of the car, to steer it.

The problem is that this x value is 2D (“orthographic”), but we need a 3D angle, so we have to convert from 2D to 3D. I found this was an educational problem, because I learned something doing it. (Learning for me always involves hours of drawing diagrams on paper, trying all sorts of things, and eventually stumbling on the answer).

We can solve it if we understand how Codea draws the little screen that we see. We need to convert the x value into a number between -1 (left of screen) and +1 (right of screen), like so

X = (x - WIDTH/2) / (WIDTH/2), which, if the screen were 1024 pixels wide, would give

X = (233 - 512) / 512 = -0.545

We also have to allow for the width of the viewing window (normally 45 degrees). The projectionMatrix does the work for us, containing the adjustment factor we need to apply to the x value above.

So we calculate X = X / projectionMatrix()[1] = -0.545 / 1.81 = -0.3

This is the amount that x changes for each change of -1 in z. We can now use trigonometry, because the tan of the angle we want is the x value divided by the z value, ie -0.3 / -1 in our case. We can take the arctan to find the angle, which is 0.29 radians, or 16.7 degrees to the left.

So the final answer is that if we have a pixel with an x position of x, then the angle (in degrees) is

angle = math.deg(math.atan((x - WIDTH/2) / (WIDTH/2) / projectionMatrix()[1]))

and we can add this to the existing rotation angle of the car.
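
Wrapped up as a little function, it looks like this (a sketch; negative results mean the pixel is to the left of the screen centre, and the sign you feed to your car will depend on how it measures rotation).

--convert a screen x position into a steering angle in degrees
function steerAngle(x)
    local X=(x-WIDTH/2)/(WIDTH/2) --convert to the range -1 to +1
    X=X/projectionMatrix()[1] --allow for the width of the viewing window
    return math.deg(math.atan(X)) --tan of the angle is X (for a z change of 1)
end
--for the example above (1024 wide screen), steerAngle(233) is about -16.7 degrees, ie to the left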

Calculating the x value in the 3D scene

We know the x position on the screen (in the front, 2D window), and we know the z value in the 3D scene behind. Can we use these to calculate the x value in the 3D scene behind?

The answer is yes, although it took me quite a few hours of frustration to figure it out, not least because I have trouble visualising geometric figures. Note I am assuming the y (height) value is constant throughout, which will be the case in programs like desktop racing.

In the picture below, the camera is looking at an angle to the right, along the blue dotted line. The pixel we want is shown by the red dots on the near screen and far screen.

What I’m going to do is travel along the blue line (the camera direction) for a distance of z pixels, to the green point A. Then I’m going to turn 90 degrees to the left and travel X pixels in that direction to reach the point B that we want in 3D space, having first calculated what X is.

So the steps are as follows, assuming the camera settings are camera(pos.x,pos.y,pos.z,look.x,look.y,look.z) where pos and look are both vec3

  1. camera direction lookDir = (look - pos):normalize()
  2. travel along the camera direction for z pixels A = pos + lookDir*(-z)
  3. calculate X by first converting the position x (on the small screen) to a range of -1 to +1 exactly as I did just above, ie new_x = (x - WIDTH/2) / (WIDTH/2) / projectionMatrix()[1]
  4. this gives us the distance of x from the camera direction line. We can now pro rate this to calculate X, the distance of A from B in the main scene, since we know that if the x and z values for the little screen are new_x and 1, and the z value for the big screen is z, then X = new_x / 1 * z
  5. We now need the direction from A to B, and I know that for any 2D vector (x,y), the direction that is rotated at 90 degrees is (-y,x). So the direction is vec3(-lookDir.z,0,lookDir.x)
  6. and then B = A + vec3(-lookDir.z,0,lookDir.x)*X

and B is a vec3 with the x and z value of the 3D point in the main scene.

I’m sure this looks complicated – and it took me hours to figure out (and there may be a better way) – but it runs pretty fast because there are only a few calculations.
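
Put together, the six steps look something like this – a sketch using the pos and look vec3s from the camera call above, and assuming, as in the example, that z is negative; check the signs against your own coordinate setup.

--convert a screen x position and its (negative) z depth into a 3D point
function screenTo3D(pos,look,x,z)
    local lookDir=(look-pos):normalize() --1. camera direction
    local A=pos+lookDir*(-z) --2. travel z pixels along the camera direction
    local new_x=(x-WIDTH/2)/(WIDTH/2)/projectionMatrix()[1] --3. convert x to the -1/+1 scale
    local X=new_x*z --4. pro rate out to the main scene
    local sideDir=vec3(-lookDir.z,0,lookDir.x) --5. direction 90 degrees from lookDir
    return A+sideDir*X --6. B, the point in the 3D scene
end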

Performance (speed)

Even though shaders are fast and efficient, making a copy of the whole scene and then reading all the pixels out of it is going to be slow. So we need to look at ways to speed things up. Here are a few suggestions.

  • don’t do it every frame, but maybe only a few times per second
  • don’t draw every object, just the ones you need
  • don’t draw the whole screen
  • don’t read and check every pixel along the width of the image in memory. Unless objects are very small, we can find them by only testing every 5th or 10th pixel.

The racing example above creates a memory image 3 times per second, drawing all the track objects, but restricting drawing to a single row of pixels in the middle of the screen (which is enough to tell us where the sides of the track are), giving us an image with just one row of information. Note – you restrict drawing using the clip command, and this speeds things up enormously.

This is the code for restricting drawing to one row

--initialise a full size image if it's the first time
--use a global image as Codea crashes if you use temporary images
imgRay=imgRay or image(WIDTH,HEIGHT)
clip(1,HEIGHT/2-1,WIDTH,1) --restrict drawing to one row in the middle
setContext(imgRay)
  --add code to draw everything here (using the depth shader)
setContext()
clip() --remove the clip so normal drawing isn't affected
--read all the pixels from left to right, in steps of Sample
--(eg if Sample is 3, we only read every 3rd pixel)
for i=1,WIDTH,Sample do
    r=imgRay:get(i,HEIGHT/2) --get red value
    z=r*RayRange/255-camPos.z --see note underneath
    --do something with z
end

We convert the red pixel value back to a z value by dividing by 255 and multiplying by the maximum distance (RayRange) we gave the shader. Then we subtract the z value of the camera position – we need to do this because the shader measures all distances using the camera as the centre.

It may not always be possible to choose just one row (or a few rows together) of pixels that give you all the information you need, if objects are scattered all over the screen. An alternative is to set the y scale so the objects are all drawn infinitely high, and then draw on a picture that is just one pixel high. This means all the objects will appear on that one row of pixels. It also seems to be faster than my approach above, using clip.

The code is only slightly different

--initialise a one pixel high image 
imgRay=imgRay or image(WIDTH,1)
scale(1,10000) --scale y axis hugely
setContext(imgRay)
--draw everything here
setContext()
scale() --reset scale
--read all the pixels from left to right, in steps of Sample
for i=1,WIDTH,Sample do
    r=imgRay:get(i,1) --get red value
    z=r*RayRange/255-camPos.z --see note underneath
    --do something with z
end