From GPWiki
Jump to: navigation, search

Techniques For Dynamic Per-Pixel Lighting

In 1978 the computer graphics guru James Blinn published his paper “Simulation of wrinkled surfaces” [Blinn78] that proposed perturbing the normal vectors across the surface of a triangle – a technique more commonly known as “bump mapping”. The idea is that if you vary the inputs you can simulate a much more complex surface; this complexity can be added via well understood texture mapping principles and does not actually require the underlying geometry to be any more complex. This has been a particularly important technique as it generates much higher quality images whilst having a modest performance cost. This was particularly important in 1978 as computers were substantially less powerful and had much less available storage to handle increased geometric complexity.

This chapter of the book covers this concept of rendering geometrically complex surfaces without the need for complex geometry. In later chapters it becomes the foundations that can be used to set up inputs for the actual lighting equations.


Lighting resolution

Whilst this chapter focuses on the concept of lighting calculated for each individual pixel it is worth appreciating that this isn’t the only resolution at which we can compute lighting results. The following section briefly considers the quality trade-offs that occur when using different resolutions.

There are three levels worth considering, arguably there are more but for practical purposes they aren’t so interesting. A typical model rendered using Direct3D will consist of many triangles and each of these individual triangles could have independent lighting properties. Each of these triangles is defined by three vertices and we could compute lighting properties for each of these and then interpolate the results across the area of the triangle. The next step would be to compute lighting properties for each individual pixel – the highest possible resolution. The progression from per-triangle through per-vertex and finishing at per-pixel matches the progression of lowest to highest image quality.

The following three images show this scale – the geometry and inputs are identical with only the Vertex, Geometry or Pixel shader having been altered.

Image 3.1.png
Image 3.1

Per-triangle lighting, shown as image 3.1, highlights the geometric complexity of the model. In order to achieve a higher resolution of lighting it would be necessary to increase the polygon count, which in turn also increases the amount of data that must be stored and processed.

Image 3.2.png
Image 3.2

Whilst it is not strictly a requirement that adjacent triangles share common vertices it is a fairly standard method – one that is used to demonstrate per-vertex lighting in image 3.2. Each vertex stores the average face normal computed from all the triangles that share it. If we didn’t do this then the results tend to be much closer to the per-triangle resolution. The overall results are much smoother than per-triangle but it appears to accentuate the edges that make up each triangle. This is due to the linear interpolation of colour across the surface of the triangle when a sphere really should use a higher order interpolation.

Image 3.3.png
Image 3.3

Finally per-pixel lighting is shown in image 3.3; it is immediately obvious that the lighting suggests that the sphere is much smoother than the previous two resolutions. However notice that the silhouette of the sphere is still very low resolution and can spoil this illusion of a smooth sphere.

This scale corresponds with another aspect of computer graphics – shading. The difference between lighting and shading is often overlooked due to being tightly coupled, but effectively “flat shading” is per-triangle, “Gouraud shading” is per-vertex and “Phong shading” is per-pixel.

Choosing a resolution

The previous section demonstrated the difference that computing lighting at different resolutions can make. Ultimately it boils down to a common trade-off in computing (and one that is especially prevalent in graphics) – that of performance versus quality. A simple flat-shaded per-triangle approach will almost always be faster than per-pixel but conversely it will almost certainly be of a lower quality. The foundation and theory chapter discussed the finer points of computing results at different stages in the pipeline but it is also worth considering that you need not have a static distribution of work.

More specifically it is perfectly reasonable to consider an LOD (Level Of Detail) algorithm for lighting. Using such a system may qualify as a regular optimization that improves overall speed but it could also allow dynamic balancing of quality where it really matters. For example, distant background objects may have their resolution reduced to per-triangle whereas a foreground character may utilize a complex high resolution per-pixel technique. As the character moves away or as the background becomes more important the resolution of lighting calculations can change accordingly.

Creating The Source Data

Per-pixel lighting requires the model being rendered to have a texture map that provides the modified surface normal vectors. This input is often referred to as a ‘Normal Map’. Whilst this is the fundamental source of input it is possible that additional data can be required, as will be demonstrated later in this chapter.

Normal maps can be defined in a number of different coordinate systems – primarily object space or tangent space.

Object space, also referred to as model space, is the coordinate system that the source geometry is defined in. Typically this will be within a unit-volume centred at the origin and rely on the ‘world transform’ to scale, position and orient it with regards to other objects in the scene. Each pixel in the normal map stores a vector in object-space and the light’s properties will need to be transformed from world space to object space for the calculations to be performed. This makes computation very simple by comparison to tangent-space such that object-space normal mapping will typically be faster. Normal map and geometry are explicitly coupled making re-use very difficult and increasing storage requirements, but an advantage is that it allows for easy addition of decals to the surface. Applying object-space normal maps to geometry that is animated or deformable is difficult because as soon as the geometry upon which the texture is mapped is altered the data stored in the normal map becomes invalid. Updating or compensating for this at run-time can be very costly.

Tangent space is used for the remainder of this chapter and will be explained in more detail later. Fundamentally normal vectors stored in tangent space are tied to their texture (surface detail in any normal map often corresponds with other textures mapped to the same pixels) rather than the underlying geometry; consequently they will be much more re-usable. The disadvantage is that they require more computation as well as additional per-vertex attributes (or even more computation to runtime generate these attributes). Many multimedia applications have enormous requirements for art assets that anything to reduce the quantity to be authored or to be stored at runtime is a definite advantage. This perspective tends to favour tangent-space over object-space and consequently most implementations of per-pixel lighting will use tangent-space normal maps.

There is no required method for creating the source data and for the most part it entirely depends on what the eventual normal map will be used for as well as its relation to any other input data. Conceptually there are two options - artistically or dynamically.

For the former it relies on an artist using appropriate tools to author a normal-map in much the same way as they might generate other texture data. Several industry standard tools offer features to assist in this as well as there being a number of “tricks of the trade” for generating normal map data using cameras and carefully constructed lighting environments. This approach may well generate higher quality results but it can quite easily create a lot of work for artists and may not be an effective use of their time.

The alternative is to use one or more computer algorithms or techniques to generate the normal maps. Depending on the complexity of the algorithm this can be used to great effect and it also automates the process – which can be a substantial benefit.

A simple but effective approach is to start with height-map data, conventionally this is a grey-scale image where black is the lowest and white is the highest. Then, for each pixel, you can inspect neighbouring elements and determine the normal vector according to the gradients. The D3DX10 library provides a convenient function to do exactly this: D3DX10ComputeNormalMap().

A more complex approach is actually to take a high-resolution mesh and use normal maps to encode high-frequency surface variations in the texture and map this over low-resolution geometry. This has been used to great effect by several commercial tools (such as ZBrush and Mudbox) and graphics engines (such as Unreal Engine 3 and CryTek’s PolyBump technology). This process allows the artist to create content at a high resolution (possibly using hundreds or thousands of times more polygons) yet have it displayed at interactive frame rates with a minimal loss of quality. However, generating normal maps using this approach is beyond the scope of this text.

Storing The Source Data

A typical normal vector is stored in terms of X, Y and Z values – each of which are between -1 and +1 and giving the overall vector a length of 1. However, texture data is typically stored as Red, Green and Blue intensity information. More specifically it is unusual to have negative colour intensities – they will more typically be defined in the 0 to 1 range rather than the -1 to +1 range that a vector requires. This poses the challenge of mapping information between the two formats.

Direct3D 10 makes this job a little easier than with Direct3D 9 simply due to the larger number of guaranteed texture formats. In particular the guaranteed presence of several “High Dynamic Range” formats means you can avoid any sort of mapping and store the texture data in exactly the same way that your application might wish to deal with (e.g. D3DXVECTOR3 maps directly to DXGI_FORMAT_R32G32B32_FLOAT). However doing such a simple conversion has distinct storage-space considerations. A modest 96bit per pixel 512x512 normal map will require 3mb of VRAM which, on its own, isn’t so significant but if you want per-pixel lighting on a large number of objects then you can quite quickly see how you might exhaust available memory (don’t be fooled by the new WDDM virtualization – if you blow all the real VRAM it may still work but performance is likely to suffer due to the paging overhead).

A more conventional format might be DXGI_FORMAT_R8G8B8A8_SNORM – this requires only a third of the VRAM as the previous example. It is immediately noticeable that we are now only using 8bits per component instead of the full 32bit of the previous example; this might set alarm bells ringing but experience with this technology in previous versions of Direct3D has shown it to be relatively unimportant. You are more likely to generate quality errors in other places.

Another useful approach to storing normal-maps is to consider a two-component form. The DXGI format list offers a number of two-component pixel formats such as 8, 16 and 32 bit/channel Red/Green only textures (DXGI_FORMAT_R8G8_*, DXGI_FORMAT_R16G16_* and DXGI_FORMAT_R32G32_*). Use of these formats requires a bit more work on the part of the programmer and doesn’t necessarily work in every situation.

Equation 3.1.png

As shown in the above equation a unit vector can be expressed as a simple summation equal to one. If we rearrange the above equation for the Z component we get:

Equation 3.2.png

The immediate problem is that there are two, positive or negative, values for the Z component. Given that we are using this technique to perturb normal vectors across the surface we can, in most cases, assume that the vector points away from the surface – more specifically, we can assume it will be the positive value. For tangent-space normals the Z (blue channel) is pointing away from the surface. Using the following HLSL function we can sample a 2-component texture and then reconstruct the original 3-component one:

float2 n = tex2D( sampler, t ).rg;
float z = sqrt( 1(n.r * n.r)(n.g * n.g) );
float3 normal = normalize( float3( n.r, n.g, z ) );

This trick works well for tangent-space normal maps but tends to fall apart when dealing with object or world space normal vectors (which are much less common but do have valid uses). Given that it works for the majority of cases and gives a neat 33% storage improvement it is well worth considering.

To go even further you can consider using the block-compressed formats (analogous to the DXT1-5 formats from previous versions of Direct3D). In particular DXGI_FORMAT_BC5_SNORM works well with normal maps as it offers dual independent components (meaning we have to decode a 2-component normal to 3-component as previously discussed).

Block compressed formats work on 4x4 blocks of pixels thus it should come as no surprise that the overall texture dimensions must be multiples of four. If you stick with conventional 2n dimensions then this shouldn’t matter in the slightest. When using DXGI_FORMAT_BC5_SNORM we will require 16 bytes per 4x4 block – which is at least a 2:1 compression ratio (best case would be 12:1).

Diagram 3.4.png
Diagram 3.4

DXGI_FORMAT_BC5_SNORM stores two independent colours encoded as 8 bit Red/Green – effectively two DXGI_FORMAT_R8G8_SNORM pixels. From the two values for each colour it is possible to generate six other intermediary values via interpolation – for 8 shades of red and 8 shades of green. For each of the 16 pixels in the 4x4 grid two 3 bit values are stored (3 bits can represent 8 different values) which are used to look-up one of the 8 shades of red or green.

It is worth noting that this format is not a lossless compression system – you will get lower quality results when using compressed formats. Despite this it remains mostly trial-and-error as to whether this will be noticeable in the final image.

Apart from the aforementioned need to decode from 2-component to 3-component the pixel shader needs not have any further involvement – the GPU is responsible for decoding the compressed data (it also performs filtering and blending at 8 bit resolution) upon request.

As a brief summary, the following table shows the various formats and methods discussed thus far:

Pixel Format Storage Per-Pixel Storage for a 512x512 normal-map (with full mip-chain) Comments
R32G32B32_FLOAT 12 bytes 3.0mb (4mb) Easiest to work with, maps to data-types used by the application
R8G8B8A8_SNORM 4 bytes 1.0mb (1.33mb) Requires the application to re-map input formats. One unused channel (Alpha), could be used to store additional data.
R11G11B10_FLOAT 4 bytes 1.0mb (1.33mb) Requires the application to re-map input formats.
R32G32_FLOAT 8 bytes 2.0mb (2.66mb) Application and storage data-types match, but requires the pixel shader to decode to 3-component form
R16G16_FLOAT 4 bytes 1.0mb (1.33mb) Requires the application to re-map input formats and the pixel shader must decode to 3-component form
R8G8_SNORM 2 bytes 512kb (683kb) Same as above
BC5_SNORM 1 byte 256kb (341kb) Texture must have dimensions that are multiple of four, is a compressed format so quality may suffer and will require decoding in the pixel shader

Thinking about the storage format for your normal maps is without a doubt a good idea – as is seen in many parts of computer graphics, there is almost always a trade-off between quality, storage and performance. It is worth considering the above choices in a broad context where you might be dealing with tens, hundreds or even thousands of normal-maps in a given scene; by reducing the storage space required you can either increase the resolution (a 2048x2048 compressed normal map is only 1mb larger than an uncompressed 512x512 map) or simply increase the number of normal maps (For one 512x512 uncompressed normal maps you could store 12 compressed maps).

On a related, but slightly different note it is worth considering the use of mip-mapping with normal maps. In particular it can be a very easy way of introducing visual errors into your final image.

Without mip-maps it is possible to introduce aliasing that causes the final image to sparkle – very small movements of the geometry can result in substantially different parts of the normal-map being sampled if it is of a very high resolution. This can mean that the inputs into the lighting equation dramatically change and thus you get a noticeable visual artefact.

Using mip-maps helps to eliminate this problem but it can allow for different errors to creep in. Unless you create your normal maps with a tool that is specifically aware of what the content really is then you may find regular image-processing techniques actually break the stored data. In particular, D3DX can generate mip-chains for you when you load your image but when it down-samples and blends neighbouring pixels the result may no longer be decoded as a unit-length vector. If you then feed these un-normalised vectors into lighting equations you’ll often get dull and flat results. This may not always be noticeable (by definition, lower mip-maps are used when the contribution to the final image is relatively small) but it is worth bearing in mind. A simple remedy is to ensure that your normalize() any vectors sampled from a texture, but this will add to the arithmetic complexity of the shader.

There is one final storage-reduction trick that can beat all of the methods discussed previously in this section: Run-time normal generation. A normal map is often generated from a height-map but as will be shown later in this chapter the height map might also be an input into a per-pixel lighting algorithm.

This effectively duplicates the data – the height map is being uploaded to the GPU and additional data derived from this is also being uploaded. Therefore an obvious optimization is to upload only the original data and then derive the normal map as and when required.

The trade-off is the performance decrease due to the extra work done versus the reduction in storage. In some applications storage space can be much more valuable than performance therefore making this a compelling option.

// Determine the offsets
float2 o1 = float2( vPixelSize.x, 0.0f         );
float2 o2 = float2( 0.0f,         vPixelSize.y );
// Take three samples to determine two vectors that can be
// use to generate the normal at this pixel
float h0 = texHeightMap.Sample( DefaultSampler, tc ).r;
float h1 = texHeightMap.Sample( DefaultSampler, tc + o1 ).r;
float h2 = texHeightMap.Sample( DefaultSampler, tc + o2 ).r;
float3 v01 = float3( o1, h1 - h0 );
float3 v02 = float3( o2, h2 - h0 );
float3 n = cross( v01, v02 );
// Can be useful to scale the Z component to tweak the
// amount bumps show up, less than 1.0 will make them
// more apparent, greater than 1.0 will smooth them out
n.z *= 0.5f;
return normalize( n );

The previous code listing shows a simple way of generating a normal from a height-map. It uses the standard method for generating a triangle’s face normal. The quality, shown in image 3.5, proves to be much lower than that generated by an offline pre-processor. An alternative is to make use of Sobel filter kernels to determine the U and V gradients; this requires eight samples but has much lower arithmetic overheads and has generates better quality result.

The following two kernels are used to generate the U and V gradients:

Equation 3.3.png

Normal vectors have a length of 1.0 which combined with the U and V gradient’s allows the computation of the missing 3rd element of the normal:

Equation 3.4.png

This can all be put together in HLSL:

// Compute the necessary offsets:
float2 o00 = tc + float2( -vPixelSize.x, -vPixelSize.y );
float2 o10 = tc + float2(          0.0f, -vPixelSize.y );
float2 o20 = tc + float2(  vPixelSize.x, -vPixelSize.y );
float2 o01 = tc + float2( -vPixelSize.x, 0.0f          );
float2 o21 = tc + float2(  vPixelSize.x, 0.0f          );
float2 o02 = tc + float2( -vPixelSize.x,  vPixelSize.y );
float2 o12 = tc + float2(          0.0f,  vPixelSize.y );
float2 o22 = tc + float2(  vPixelSize.x,  vPixelSize.y );
// Use of the sobel filter requires the eight samples
// surrounding the current pixel:
float h00 = texHeightMap.Sample( DefaultSampler, o00 ).r;
float h10 = texHeightMap.Sample( DefaultSampler, o10 ).r;
float h20 = texHeightMap.Sample( DefaultSampler, o20 ).r;
float h01 = texHeightMap.Sample( DefaultSampler, o01 ).r;
float h21 = texHeightMap.Sample( DefaultSampler, o21 ).r;
float h02 = texHeightMap.Sample( DefaultSampler, o02 ).r;
float h12 = texHeightMap.Sample( DefaultSampler, o12 ).r;
float h22 = texHeightMap.Sample( DefaultSampler, o22 ).r;
// Evaluate the Sobel filters
float Gx = h00 - h20 + 2.0f * h01 - 2.0f * h21 + h02 - h22;
float Gy = h00 + 2.0f * h10 + h20 - h02 - 2.0f * h12 - h22;
// Generate the missing Z
float Gz = 0.5f * sqrt( 1.0f - Gx * Gx - Gy * Gy );
// Make sure the returned normal is of unit length
return normalize( float3( 2.0f * Gx, 2.0f * Gy, Gz ) );
Image 3.5.png
Image 3.5

Reading from a texture, top, computing using three samples, middle, and the Sobel filter, bottom, all generate different quality results. In general a pre-computed normal map will be of a higher quality but it can require a substantial amount of storage space. The Sobel filter is probably the best trade-off for performance versus storage.

This can all be put together to form the following HLSL function:

float3 FetchNormalVector
                float2 tc,
                uniform bool readFromTexture,
                uniform bool useSobelFilter
    if( readFromTexture )
        // Use the simple pre-computed look-up
        float3 n = texNormalMap.Sample( DefaultSampler, tc ).rgb;
        return normalize( n * 2.0f - 1.0f );
        if( useSobelFilter )
            // Previously discussed code for Sobel filter
            // goes here.
            // Previously discussed code for 3-sample
            // evaluation goes here.

This function is relied upon for the techniques demonstrated later in the chapter. It encapsulates the three main methods for fetching per-pixel normal vectors and allows for easy experimentation. There are two very important points to note with the above code. Both of which will result in an incorrect image and can be really quite difficult bugs to track down.

Firstly, the underlying normal-map texture is stored in DXGI_FORMAT_R8G8B8_UNORM – that is, each component is in the range 0.0 to 1.0 whereas the vector we want should be in the -1.0 to +1.0 range. A simple multiply by two and subtract one will perform the correct remapping, but it is all too easy to forget!

Storing normal maps in _UNORM form is fairly standard, but as previously discussed not compulsory. This initial remapping may not be required depending on the chosen implementation.

Secondly, the Sample() call is explicitly swizzled to ‘rgb’ before being manipulated by the previous remapping and most importantly before being an input into the normalize() intrinsic. The return value from Sample() is a float4 which would result in a 4D vector normalization that would then be truncated to the float3 declaration used to hold the results. Because the incoming texture contains only three components it will be padded to a 4D vector with a 1.0 value for the 4th component. This throws the normalize() result out and the 3D vector we pass into the lighting calculations is no longer unit-length. The net result is that lighting can appear to be dull due to the way that the Lambertian N•L scalar can never be greater than this shortened and corrupted input vector. By forcing the sample to be a 3D vector before normalization we can completely avoid this problem.

Moving From Per-Vertex To Per-Pixel

For this section of the book we will be using tangent-space normal maps (as opposed to the previously mentioned object-space normal maps). This gets slightly tricky in that the per-vertex attributes that our geometry is defined with are no longer in the same coordinate system as the vectors stored in the normal map texture. Thus it becomes necessary to transform all inputs into the same coordinate frame – failure to do this will generate noticeably incorrect results.

As with most transformations we can go in either direction – we could transform the normal-map into world space coordinates or we could transform per-vertex attributes to tangent-space coordinates. It is common to do the latter simply because it requires less computation (which equates to better performance). However there are still a number of situations where the choice isn’t so obvious or you may want to implement it differently. For example, environment mapping needs a world-space normal and using a vertex shader to transform light vectors to tangent space can chew up a large number of interpolators (the resource used to transport data between shader units) if many lights are being used.

The most important point is that we perform any calculations on inputs that are all in the same coordinate system. Failure to do this (e.g. performing the dot-product on a world-space and tangent-space vector) results in very noticeable errors which, from the source code, can be quite tricky to spot.

From a mathematical point of view a 3D coordinate system is defined through three linearly independent basis vectors. For tangent space we will be using tangent, normal and bitangent (this is often incorrectly referred to as “binormal”) vectors which are based on the texture coordinates that map the normal map over the surface of the triangle.

Diagram 3.6.png
Diagram 3.6

The preceding diagram shows how this coordinate system looks; it is worth noting that the tangent frame is defined for the individual triangle. Whilst Direct3D 10, via the SV_PrimitiveID system-generated value, would allow us to store per-primitive attributes it is easier (and more common) to attach the vectors as per-vertex attributes.

Diagram 3.7.png
Diagram 3.7

Providing the vectors as per-vertex attributes introduces one potential problem though – the case where two triangles share the same vertex yet have different coordinate frames. Diagram 3.7 shows an example of a situation known as a “bow tie”. The vertex linking the two triangles might have a common texture coordinate, but the other two vertices in each triangle result in a different per-triangle coordinate frame. Solving this problem isn’t difficult but it does require the splitting/duplication of vertex data which can lead to less efficient vertex processing (less utilization of the post-transform vertex cache for example) and increased storage requirements.

Generating the tangent frame is relatively simple and depending on the available art tools may not even be required in the main application. It is important to note that the tangent frame is actually a per-triangle attribute and it is quite common to store averages of neighbouring triangles for the per-vertex attributes.

  • Let V0, V1 and V2 be the vertex positions making up the triangle
  • Define A and B as:

Equation 3.5.png

  • Let TC0, TC1 and TC2 be the texture coordinates at each of the vertices
  • Define P and Q as:

Equation 3.6.png

  • We can now express the Tangent and Bitangent with the following equation:

Equation 3.7.png

  • The preceding equation gives us unnormalized vectors, so the next step is to normalize:

Equation 3.8.png

  • As well as the Tangent and Bitangent we also need the normal, N:

Equation 3.9.png

  • The three vectors might not be exactly orthogonal but we can use the Gram-Schmidt algorithm to orthogonalize:

Equation 3.10.png

  • These vectors can then be rearranged for the following, final, matrix:

Equation 3.11.png

For more in-depth information on the above method, refer to [Lengyel03].

Implementing the algorithm is not difficult, but D3DX9 had a function to do all this work for you. At the time of writing this function is not in D3DX10, but there are ways of using the more advanced D3DX9 mesh library as a pre-processor for D3DX10 meshes until the D3DX10 library matures. The sample code accompanying this section demonstrates a way of achieving this.

In much the same way that we could choose the storage format for the normal-map texture we also have a lot of choice about how the tangent-space coordinate frame is attached to each vertex. This can have notable performance (especially if using animated or deformable models) and storage trade-offs as will be demonstrated in the following sections:

Uncompressed storage'

This method is very simple – the application can compute the necessary basis vectors or get the D3DX library to do the work and then simply attach the raw data to each vertex. This is both easy to understand and straight-forward to write code for. The problem is, quite simply, that it is uncompressed and that it stands to inflate our storage requirements. More subtly it can also affect the performance of our rendering.

In particular the GPU will usually be required to stream vertex data from the main video memory into the local cache before the vertex shader can operate on the information. It is a simple case that the more data the chip has to stream the slower the overall performance is likely to be. Given that a simple vertex might be a position, normal and texture coordinate (32 bytes) adding an additional 36 bytes (112.5%) seems a little heavy.

For uncompressed mode it would be necessary to use the following declaration:

D3D10_INPUT_ELEMENT_DESC uncompressed[] = 
        "POSITION", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, 
        "NORMAL", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, 
        "BITANGENT", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 24, 
        "TANGENT", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 36, 
        "TEXCOORD", 0, 
        DXGI_FORMAT_R32G32_FLOAT,    0, 48, 

This would correspond to the following input into a vertex shader:

struct vsUncompressed
    float3 position        : POSITION;
    float3 normal        : NORMAL;
    float3 bitangent        : BITANGENT;
    float3 tangent        : TANGENT;
    float2 texcoord        : TEXCOORD;

Compress to 2-axis

The next step on from uncompressed data is to trade the storage of one vector against additional computation costs. Using the cross-product equation we can reconstruct the bitangent vector provided we know both of the tangent and normal vectors. This is a relatively easy storage space win – the reduction of 12 bytes per vertex, down to adding only 24 bytes, is almost always going to be more beneficial than any overhead introduced by the cross product operation.

There is one complication in using this technique: tangent-space handedness. When trying to reconstruct the missing vector there are two possible, opposite, directions similar to the situation encountered when storing normal vectors in a 2-channel texture. The resolution to this issue is to multiply the cross product of the normal and tangent vectors by a ±1 bias. This value is simply the determinant of the matrix produced in step-9 of the preceding instructions on computing the tangent-space transformation matrix.

The application configuration would look like:

D3D10_INPUT_ELEMENT_DESC uncompressed2Axis[] = 
        "POSITION", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, 
        "TANGENT", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, 
        "NORMAL", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 24, 
        "BIAS", 0, 
        DXGI_FORMAT_R32_FLOAT, 0, 36, 
        "TEXCOORD", 0, 
        DXGI_FORMAT_R32G32_FLOAT,    0, 40, 

This corresponds to the following input into the vertex shader:

struct vsUncompressed2Axis
    float3 position       : POSITION;
    float3 tangent        : TANGENT;
    float3 normal         : NORMAL;
    float  bias           : BIAS
    float2 texcoord       : TEXCOORD;

For clarity the bias is separated into a separate attribute but in practice it would be valid to widen either the tangent or normal attributes to a 4-component form and store the bias in the 4th component. If doing this be careful to use the appropriate ‘xyz’ swizzle when decoding the missing vector.

The vectors can then be decoded using this simple HLSL fragment:

float3 bitangent = cross( vertex.tangent, vertex.tangent );
bitangent *= vertex.bias;
bitangent = normalize( bitangent );

Compress to half-precision

The next step is to make use of half-precision vectors. Whilst the majority of C/C++ applications will use 32bit IEEE754 floating point numbers (or, if they don’t use 32bit they’ll use 64bit) the HLSL language allows us to pack in 16bit floating point numbers. The common shader core’s for D3D10 all operate at 32bit precision, but there isn’t anything to stop the application picking a lesser format for storage.

In particular the application can use DXGI_FORMAT_R16G16B16_FLOAT as shown in the following declaration:

D3D10_INPUT_ELEMENT_DESC uncompressed2Axis[] = 
        "POSITION", 0, 
        DXGI_FORMAT_R16G16B16_FLOAT, 0, 0, 
        "TANGENT", 0, 
        DXGI_FORMAT_R16G16B16_FLOAT, 0, 6, 
        "NORMAL", 0, 
        DXGI_FORMAT_R16G16B16_FLOAT, 0, 12, 
        "BIAS", 0, 
        DXGI_FORMAT_R16_FLOAT, 0, 18, 
        "TEXCOORD", 0, 
        DXGI_FORMAT_R32G32_FLOAT,    0, 20, 

It is worth noting that the common shader cores introduced in Direct3D 10 all run at 32 bit precision internally – however this does not apply to input data (be it per-vertex attributes or a half-precision texture). This upward casting from 16bit floating point to 32bit floating point is handled transparently as far as the application and shader is concerned. The above fragment can be used via the same HLSL as the previous method.

Compress to 8-bit integer

The next step would be to consider properly compressing the per-vertex attributes. Using half-precision vectors is a form of compression, but more of a convenient one as opposed to an explicit attempt at compression.

Using D3D10’s new integer instruction set could yield numerous packing opportunities and based on trial-and-error it is possible to greatly reduce the amount of per-vertex data. As with most compression schemes the result is a trade-off between storage requirements versus quality.

A key observation, based in part on the discussion of encoding normal vectors in textures, is that it is possible to store each component of a vector in an 8 bit integer quantity. Experience shows that this doesn’t usually result in a substantial loss of quality therefore making it a viable choice in this situation. The more adventurous programmer could look into using Direct3D 10’s new integer instruction set to implement fixed-point storage with the DXGI_FORMAT_R10G10B10A2 formats. This could be a useful middle-ground between using 8bit integers and half-precision floating point quantity. The 2-bit alpha channel may seem useless, but it is a good candidate for storing the bias component discussed in the previous section.

The new DXGI formats for D3D10 don’t actually have a 3-channel 8 bit format, only 1, 2 and 4 channel forms. Consequently the following structure is slightly unintuitive as it splits the individual vectors up so as not to waste an additional byte per attribute:

D3D10_INPUT_ELEMENT_DESC compressed8Bit2Axis[] = 
        "POSITION", 0, 
        DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, 
        "TANGENT_X", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 12, 
        "TANGENT_Y", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 13, 
        "TANGENT_Z", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 14, 
        "NORMAL_X", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 15, 
        "NORMAL_Y", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 16, 
        "NORMAL_Z", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 17, 
        "BIAS", 0, 
        DXGI_FORMAT_R8_SNORM, 0, 18, 
        "TEXCOORD", 0, 
        DXGI_FORMAT_R32G32_FLOAT,    0, 19, 

The above IA signature can be mapped to the following HLSL input structure, note that the input is declared as float3 – it relies on D3D10 upscaling the R8_SNORM to a R32_FLOAT:

struct vsCompressed
    float position      : POSITION;
    float tangent_x     : TANGENT_X;
    float tangent_y     : TANGENT_Y;
    float tangent_z     : TANGENT_Z;
    float normal_x      : NORMAL_X;
    float normal_y      : NORMAL_Y;
    float normal_z      : NORMAL_Z;
    float bias          : BIAS;
    float2 texcoord     : TEXCOORD;

Compute with the geometry shader

As previously discussed in this section, the tangent-frame information is really a per-triangle attribute rather than per-vertex. Attaching the information to the vertices is more of a legacy method – these techniques all work under Direct3D 9 where there was no scope for either generating or attaching per-primitive attributes.

Simply put we can use the new Geometry Shader stage to generate the tangent-space vectors dynamically on the GPU. This reduces the storage requirements (remember that for a given triangle we could have three sets of tangent-space vectors if implemented per-vertex) and associated memory bandwidth requirements. It also eliminates the case of “bow ties” where conflicting data requires a shared vertex to be duplicated and split.

The obvious downside is that it trades storage requirements against execution time requirements; the geometry shader will have to compute the vectors every time the triangle is rendered – regardless of whether it has actually changed. For static geometry it is most likely a better option to compute the results on the CPU and upload it; dynamic and procedural geometry might well see a substantial performance improvement by computing entirely on the GPU. Using previous techniques any change to the geometry would most likely require the CPU to intervene and both recalculate and re-upload the results – neither of which is good for performance.

The following HLSL function can be used in a Geometry Shader to generate the appropriate transform matrix:

float3x3 ComputeTangentSpaceMatrix
                in float3 pos[3], 
                in float2 tc[3] 
    float3 A = pos[1] - pos[0];
    float3 B = pos[2] - pos[0];
    float2 P = tc[1] - tc[0];
    float2 Q = tc[2] - tc[0];
    float fraction = 1.0f / ( P.x * Q.y - Q.x * P.y );
    float3 normal = normalize( cross( A, B ) );
    float3 tangent = float3
        (Q.y * A.x - P.y * B.x) * fraction,
        (Q.y * A.y - P.y * B.y) * fraction,
        (Q.y * A.z - P.y * B.z) * fraction
    float3 bitangent = float3
        (P.x * B.x - Q.x * A.x) * fraction,
        (P.x * B.y - Q.x * A.y) * fraction,
        (P.x * B.z - Q.x * A.z) * fraction
    // Some simple aliases
    float NdotT = dot( normal, tangent );
    float NdotB = dot( normal, bitangent );
    float TdotB = dot( tangent, bitangent );
    // Apply Gram-Schmidt orthogonalization
    tangent    = tangent - NdotT * normal;
    bitangent  = bitangent - NdotB * normal - TdotB * tangent;
    // Pack the vectors into the matrix output
    float3x3 tsMatrix;
    tsMatrix[0] = tangent;
    tsMatrix[1] = bitangent;
    tsMatrix[2] = normal;
    return tsMatrix;

CPU based computation will typically include adjacent triangles when computing per-vertex attributes – a simple case of averaging the vectors for each face that shares a particular vertex. This may be mathematically incorrect but it can allow for a more aesthetically pleasing result in much the same way that averaging vertex normal vectors can give the impression that the surface is smoother than it really is (refer back to image 3.2). By computing the tangent-space transform in the geometry shader this is still possible if adjacency information is used, but it would mean calling the ComputeTangentSpaceMatrix() function four times per-primitive.


Method Additional Storage Comments
Uncompressed 36 bytes/vertex Easiest to work with, but may well inflate storage and bandwidth requirements
2-axis and compute 3rd in the shader 28 bytes/vertex Relatively easy to work with, but offloads storage against a minimal run-time computation cost
Half-Precision vectors 14 bytes/vertex Slightly more complex to work with and trades storage space against computation cost
Compressed vectors 7 bytes/vertex Potential loss of quality due to low-resolution of the data
Geometry Shader None! computation cost in favour of not having any additional data storage

A Framework For Per-Pixel Lighting

Before covering the actual algorithms for per-pixel lighting it is useful to define a common framework. From a high-level the following three algorithms do only two things:

  1. Compute the texture coordinate
  2. Compute the illumination

The remainder of this chapter deals with the first part and the next six deal with the second part. This chapter uses a very simple Lambertian N•L computation for illumination as a form of place-holder for the models discussed in later chapters.

Each of the techniques can use the same vertex shader to perform the tangent-space transformations:

VS_TS_OUTPUT vsTangentSpace( in VS_INPUT v )
    // Transform the incoming model-space position to
    // projection space - a standard VS operation.
    o.position = mul( float4( v.position, 1.0f ), mWorldViewProj );
    // The world-space vertex position is useful for
    // later calculations.
    float3 pWorld = mul( float4( v.position, 1.0f ), mWorld ).xyz;
    // The per-vertex attributes are in model space, so we
    // need to transform them to world space before we can
    // generate a world->tangent transformation
    float3 vTangentWS   = mul( v.tangent,   (float3x3)mWorld );
    float3 vBitangentWS = mul( v.bitangent, (float3x3)mWorld );
    float3 vNormalWS    = mul( v.normal,    (float3x3)mWorld );
    // Pack the data into the matrix for transforming
    // the actual input data
    float3x3 mWorldToTangentSpace;
    mWorldToTangentSpace[0] = normalize( vTangentWS );
    mWorldToTangentSpace[1] = normalize( vBitangentWS );
    mWorldToTangentSpace[2] = normalize( vNormalWS );
    // Use the matrix to transform all necessary
    // inputs for the pixel shader.
    o.lightdir = mul( mWorldToTangentSpace, -vLightDir );
    o.viewdir = pCameraPosition - pWorld;
    o.viewdir = mul( mWorldToTangentSpace, o.viewdir );
    // Pass through any other necessary attributes
    o.texcoord = v.texcoord;
    return o;

The code included above only transforms two vectors into tangent space – the light and view directions. These are standard inputs into most BRDF lighting models and all that is required for the techniques discussed later in this section, but there more complex techniques may well need to transform additional inputs – simply add the additional code towards the end of the above vertex shader.

The vertex shader sets up the inputs for the pixel shader which is where the two main pieces of work are performed:

float4 psMain( in VS_TS_OUTPUT_V ) : SV_TARGET
    // Normalize the input vectors
    float3 view  = normalize( v.viewdir );
    float3 light = normalize( v.lightdir );
    // Determine the offset for this pixel
    float2 offset = ComputeOffset( v.texcoord, view );
    // Retrieve the normal
    float3 normal = FetchNormalVector( v.texcoord, true );
    // Evaluate the lighting model
    float4 colour = ComputeIllumination( normal, view, light );
    // Return the final colour
    return colour;

The ComputeOffset() and ComputeIllumination() calls are abstractions for the main pieces of work. The former is discussed for the remainder of this section and the latter represents the BRDF lighting model that is discussed over the next six chapters.

In developing more complex material systems (for which the lighting models are simply a foundation) the framework proposed here would need to be changed. As should be immediately obvious, the input vertex data is assumed to be using uncompressed tangent-space vectors; it is also worth noting that this framework is suitable only for a single light source. Real-world applications are more likely to use a loop (using a for() construct) to handle multiple lights. For simplicity these are not included here but they are all valid modifications that can be implemented on top of this simple example.

Simple Normal Mapping

This technique is the closest to James Blinn’s original 1978 paper and is also the simplest of the three being covered in this chapter. Despite its simplicity it is a very effective technique that can be used to great effect.

The code-listing presented as part of the previous section is in fact simple normal mapping. As described, these three techniques deal with perturbing the normal across the surface in order to simulate a more detailed surface. For simple normal mapping with just read a new normal directly from the normal map texture bound to this particular primitive.

The pixel shader does not need to do any additional work – the vertex shader has already transformed the input data into tangent space (which is the same coordinate system that the normal map is stored in) and the texture coordinate does not need to be modified. Therefore the complete pixel shader is as follows:

float4 psNormalMap( in VS_TS_OUTPUT v ) : SV_TARGET
	float3 normal = FetchNormalVector( v.texcoord, true );
	return ComputeIllumination( v.lightdir, normal, 0.0f );

The previous fragment relies on two functions one of which is included below for reference. To simplify the source code provided to accompany this chapter several common operations are refactored into their own functions.

float4 ComputeIllumination
        float3 LightDir, 
        float3 Normal, 
        float  Shadow 
    float NdotL = dot( LightDir, Normal );
    Shadow      = 1.0f - Shadow;
    float c = max( 0.25f, NdotL * Shadow );
    return float4( c, c, c, 1.0f );

Refer back to the earlier discussion on run-time generation of normal vectors from the height-map for a full explanation of the FetchNormalVector() function.

Image 3.8.png
Image 3.8

The preceding image shows the results of the normal mapping technique when applied to a completely flat surface. It is apparent that the surface contains additional detail, although this is most obvious when seen interactively – the lighting changes as expected, something that is difficult to show in a single static image.

Upon closer evaluation of image 3.8 it might start to seem incorrect – whilst there is additional detail to the surface it still looks very flat. For a given surface detail it is possible to see as much of the surface facing away from the view as it is of the surface facing towards the view. In the real-world you would expect the height of the surface detail to partially occlude the surface facing away. In much the same way that if you look at a 3D cube you only expect to see three faces – not four. The following technique aims to reduce this error and the final technique aims to eliminate it entirely.

Parallax Mapping With Offset Limiting

Parallax is the way that parts of the same surface appear to change position relative to each other due to movement of the observer. With simple normal mapping there is no parallax effect – the relative position of each part of surface remains the same regardless of where you observe the surface from. It is best observed interactively when you can change the viewing position – if the surface was actually bumpy then you’d expect to see more/less of opposing sides of the bumps depending on where you positioned the camera.

The next progression for per-pixel mapping requires the use of an additional piece of per-pixel information – elevation. This is usually easy enough to provide given that the normal map is typically generated from a height-map (see “Creating the source data” for more discussion on this). Simple normal mapping did not perturb the texture coordinate before sampling the normal map, but by modifying this such that it compensates for parallax it is possible to reduce one of the main graphical flaws in simple normal mapping. This approach was proposed in “Parallax Mapping with Offset Limiting: A Per-Pixel Approximation of Uneven Surfaces” [Welsh04]. As will be shown, a relatively simple modification to simple normal mapping can generate a substantial quality improvement.

Diagram 3.9.bmp
Diagram 3.9

Diagram 3.9 shows the actual surface as the bold, black line and adds the elevation information as the dashed line. In tracing a line from the viewer to the pixel there are cases where it will pick the normal vector incorrectly. Simple normal mapping provides no occlusion and makes it possible to see parts of the surface that are facing away from the camera. Parallax mapping with offset limiting compensates for this by inspecting the elevation information (stored in a height-map) and offsetting the texture coordinate so that a more accurate sample is used.

Diagram 3.10.bmp
Diagram 3.10

By sampling the height-map at the coordinate passed down from the vertex shader (Toriginal) we can compute an offset that we add in order to sample the correct location (Tcorrected) for any other resources (in this instance we only sample a normal map but sampling a diffuse texture is also fairly standard). As shown in diagram 3.10 the offset computed by using just the height is not actually correct and is thus only an approximation to the proper result (Tactual). Acquiring the correct offset is possible but is a substantially more expensive algorithm thus making this approximation worth using.

Parallax mapping with offset limiting first has to scale the height information which in most cases will be stored in the range 0.0 to 1.0. Given that texture coordinates are typically constrained to a 0.0 to 1.0 range thus offsetting these by a height in the same range would be far too much – providing wrapping/mirroring is enabled you’d almost certainly end up fetching data that is a long way from the desired location. As can be seen with a little experimentation, tweaking the offset calculations for this technique can destabilise the results – at a certain point the approximation is sufficiently wrong as to generate noticeable artefacts in the final image.

The first step in this algorithm is to perform a simple linear transformation on the height retrieved from the height-map:

Equation 3.12 web.png

Typical values for scale will be less than 0.05 with the bias typically being the scale multiplied by -0.5. The actual values are partly dependent on source data and the desired outcome. This transformed height value can then be used to scale the offset in the direction of the viewer (a vectors transformed to tangent space by the vertex shader). This direction will be in the (X, Y) part of the 3D vector with the Z component representing the steepness of the view angle at a given point.

Equation 3.13 web.png

The original parallax mapping approach ([Kaneko01]) divided the offset by the view vector’s Z component which is mathematically more accurate. However, in doing so the offset ends up becoming extremely large at grazing angles (where the Z component approaches zero) due to the division by a very small value. As was previously mentioned, having a large offset results in the sampling of incorrect source data and thus you get very noticeable graphical artefacts. The preceding equation does not divide by the Z component thus limiting the offset to a maximum of Heightsb; it isn’t the only way that the offset could be limited but it has the distinct advantage that it simplifies the pixel shader rather than adds potentially complicated methods for limiting the computed offset.

float4 psParallaxMapping
            in VS_TS_OUTPUT v, 
            uniform float HEIGHT_SCALE 
        : SV_TARGET
    // Compute the height at this location
    float height = texHeightMap.Sample( DefaultSamp, v.texcoord ).r;
    height = HEIGHT_SCALE * height - (HEIGHT_SCALE / 2.0f);
    // Compute the offset    
    float2 offsetDir = normalize( v.viewdir ).xy;
    float2 offsetSample = v.texcoord + offsetDir * height;
    // Take the samples with the shifted offset
    float3 normal = FetchNormalVector( offsetSample, true );
    // Compute the final colour
    return ComputeIllumination( v.lightdir, normal, 0.0f );

The above code listing contains the simple modifications to the standard framework introduced previously. Note that the only real change is modifying the texture coordinate passed down from the vertex shader – the remainder of the pixel shader is identical to that used in simple normal mapping.

Image 3.11.png
Image 3.11

The preceding image, 3.11, shows parallax mapping with a scale of 0.05 and bias of -0.025. It should be immediately obvious that the surface is more realistic than with simple normal mapping; note how the amount of each bump’s sloped sides varies across the surface of the geometry. This is much closer to what the eye and brain are expecting and is thus much more acceptable than simple normal mapping. The following image, 3.12 shows a debug output of the parallax effect – one where the value is just the length of the computed offset. Darker areas of the surface correspond with areas where little or no offset is required and lighter areas where the offset is greater. Notice that the areas subject to the largest offsets are at the glancing angles and where the surface height is greatest; the darker area on the surface corresponds to where the view direction is close to being perpendicular to the underlying geometry.

Image 3.12.png
Image 3.12


This is the 3rd and final approach to per-pixel lighting that will be covered. The previous section implemented a way to approximate the correct offset that generates very good results given the simplicity of the change. This section presents a hybrid technique of relief mapping ([Policarpo05] and ‘Parallax Occlusion Mapping’ – [Tartarchuk06]). Both use much the same theory of ray-tracing in the pixel shader to accurately compute the correct offset and thus attain the best image quality. This is interesting not only because of the high-quality results but also because of the way it blurs the line between ray-traced graphics and traditional polygonal raster graphics; two techniques that, for the most part, are considered to be mutually exclusive.

Up-front the mention of ray-tracing might worry the more experienced graphics programmer – arguably the primary motivation behind real-time graphics using polygonal methods is that their performance is an order of magnitude better than that of ray-traced equivalents. Thus ray-tracing is synonymous with non-real-time performance. Unsurprisingly this makes it the slowest of the three techniques demonstrated in this chapter but this doesn’t invalidate its viability as a real-time rendering effect. With regards to performance, previous implementations have shown best performance with the more advanced shading models – version 3.0 pixel shaders for example. The process of ray-tracing in a pixel shader benefits greatly from a shader model that can handle dynamic looping and branching – features that come as standard in shader model 4.0. Arguably the fact that the software shader model is good at expressing these programming constructs does not guarantee that the hardware is good at executing them – but such things are beyond the scope of this text.

The previous technique introduced the use of the view-direction and relief mapping takes this further by not just using it for a single-step offset but to define a line from which to take multiple samples from. It is the viewing vector that forms the ray in which we are to trace. Take the following diagram where the bold line is the surface and the dashed line is the one we wish to approximate (note that, unlike previous techniques, we are using a subtractive approach – more on this later). Conventional ray-tracing is heavy on the mathematics and fundamentally relies on equations that test the intersection between the current ray and basic primitives (such as planes or spheres), the ray-tracing that can be employed for this technique simply checks whether we are above or below the height of the surface being represented. It therefore becomes a simple test whereby the transition from above to below the height of the surface marks the point of intersection. Take the following diagram where the intersection exists between the 3rd and 4th sample:

Diagram 3.13.png
Diagram 3.13

Referring to any good computer graphics or even general computer science text-book would suggest that a binary search is a perfect candidate in this situation. Given that we would like to reduce the number of samples taken (each of which increases the required bandwidth) a smart algorithm such as binary search would be great. However, consider the following diagram:

Diagram 3.14.png
Diagram 3.14

Diagram 3.14 shows a case where the first mid-point is already beyond the actual point of intersection. The binary search continues scanning forward, looking for a point below the surface which it quickly finds and then assumes the true point of intersection lies between these two – even though the first point is incorrect.

The preceding diagram shows a relatively rare case that is usually reserved for surfaces with high-frequency height-map information however it hides a more subtle artefact; the relative viewer orientation need not move a huge amount for the above characteristic to either appear or disappear. This is one of the worst situations for computer graphics quality – the artefact appears to flicker between visible and invisible much like “Z-Fighting” artefacts. Luckily there is a convenient work-around to this initial setback – a hybrid of both binary and linear searching works quite well. The basic idea is to run a linear search to get an initial reference point and then to improve the accuracy of this by using a secondary binary search:

Diagram 3.15.png
Diagram 3.15

The first four samples are the linear search – performed until the first sample below the surface is found. A binary search, the 5th and 6th samples, can then operate between the last two linear search locations which should be above/before and below/after the true point of intersection. It is still possible to find test-cases that fool this system, but the in the vast majority of cases this hybrid approach is accurate.

This hybrid approach is used in the implementation of Relief Mapping and whilst it can yield very accurate results with minimal samples it has one problem when mapped to real hardware. The samples taken during the final binary searching phase are classed as “dependent reads” – the location being sampled is dependent on some arithmetic performed by the shader. This makes it much harder for the GPU and its driver to schedule instructions efficiently and ultimately leads to poor performance. The linear searching phase of the algorithm does not suffer from this problem – whilst the number of samples are dynamic the actual locations are at well defined intervals and not dependent on any arithmetic in the shader. The Parallax Occlusion Mapping technique utilizes only a linear search in an attempt to make the algorithm as optimal as possible for current GPU architectures. The final refinement stage can be implemented in arithmetic rather than as an additional search which yields equivalent results and is therefore a perfectly valid optimization.

This final refinement stage requires solving for the intersection of the surface and the ray corresponding to the view direction. Due to linear filtering we can assume that the pixels sampled for the height of the surface have a linear distribution between them which helps in defining the intersection.

Diagram 3.16.png
Diagram 3.16

The simple linear search can be stopped at the first iteration where the interpolated height along the ray is below the surface – Rafter. By implication the previous iteration’s interpolated height, Rbefore, will be above the surface. A sample of the underlying height-map will have been taken such that a pairing of interpolated height and actual height (Hbelow and Habove) is also known.

It can be assumed that the two actual heights will be from different pixels in the height-map, but it depends on the resolution of the linear search as to whether they are neighbours – but this minor detail can be skipped over in order to simplify things. The continuous surface that the discrete height-map represents can be determined through linear interpolation between the known discrete values. Solving the following equation should give us a distance between the known points where the intersection occurs – for simplicity it is assumed that the distance will be the same along both line segments, but in reality this probably won’t be the case. Testing shows this to have little to no impact on the final image.

Equation 3.14 web.png

All of the above can be implemented in the following pixel shader:

float3 ray_trace
            float2 pStart, 
            float3 vDir, 
            uniform float HEIGHT_SCALE, 
            uniform uint MIN_SAMPLES, 
            uniform uint MAX_SAMPLES, 
            uniform bool COUNT_SAMPLES_TAKEN 
    // Compute initial parallax displacement direction:
    float2 vParallaxDirection = normalize( vDir.xy );
    // The length of this vector determines the
    // furthest amount of displacement:
    float fLength          = length( vDir );
    fLength               *= fLength;
    float vDirZ_sq         = vDir.z * vDir.z;
    float fParallaxLength  = sqrt( fLength - vDirZ_sq ) 
    fParallaxLength       /= vDir.z; 
    // Compute the actual reverse parallax displacement vector:
    float2 vParallaxOffset = vParallaxDirection * fParallaxLength;
    // Need to scale the amount of displacement to account
    // for different height ranges in height maps.
    vParallaxOffset *= HEIGHT_SCALE;
    // corrected for tangent space. Normal is always z=1 in TS and 
    // v.viewdir is in tangent space as well...
    uint nNumSteps = (int)lerp
                                normalize( vDir ).z 
    float fCurrHeight        = 0.0;
    float fStepSize          = 1.0 / (float) nNumSteps;
    uint nStepIndex          = 0;
    float2 vTexCurrentOffset = pStart;
    float2 vTexOffsetPerStep = fStepSize * vParallaxOffset;
    float3 rVal = float3
                        - ( vTexOffsetPerStep * nNumSteps ),
    float dx = ddx( pStart.x );
    float dy = ddy( pStart.y );
    float fPrevHeight = 1.0;
    float curr_ray_dist = 1.0f;
    while ( nStepIndex < nNumSteps ) 
        // Determine where along our ray we currently are.
        curr_ray_dist -= fStepSize;
        vTexCurrentOffset -= vTexOffsetPerStep;
        fCurrHeight = texHeightMap.SampleGrad
        if( COUNT_SAMPLES_TAKEN ) 
            rVal.z += ( 1.0f / (MAX_SAMPLES - MIN_SAMPLES) );
        // Because we're using heights in the [0..1] range 
        // and the ray is defined in terms of [0..1] scanning
        // from top-bottom we can simply compare the surface 
        // height against the current ray distance.
        if ( fCurrHeight >= curr_ray_dist ) 
            // Push the counter above the threshold so that
            // we exit the loop on the next iteration
            nStepIndex = nNumSteps + 1;
            // We now know the location along the ray of the first
            // point *BELOW* the surface and the previous point 
            // *ABOVE* the surface:
            float ray_dist_above = curr_ray_dist 
                                   + ( 1.0f / (float)nNumSteps );
            float ray_dist_below = curr_ray_dist;
            // We also know the height of the surface before and
            // after we intersected it:
            float surf_height_before = fPrevHeight;
            float surf_height_after = fCurrHeight;
            float numerator = ray_dist_above - surf_height_before;
            float denominator = 
                  (surf_height_after - surf_height_before) 
                  - (ray_dist_below - ray_dist_above);
            // As the angle between the view direction and the
            // surface becomes closer to parallel (e.g. grazing
            // view angles) the denominator will tend towards zero.
            // When computing the final ray length we'll 
            // get a divide-by-zero and bad things happen.
            float x = 0.0f;
            if( all( denominator ) )
                x = numerator / denominator;
            // Now that we've found the position along the ray
            // that indicates where the true intersection exists
            // we can translate this into a texture coordinate 
            // - the intended output of this utility function.
            rVal.xy = lerp
                            vTexCurrentOffset + vTexOffsetPerStep, 
            fPrevHeight = fCurrHeight;
    return rVal;

The results:

Image 3.17.png
Image 3.17

The following image shows the number of samples required to render the above screenshot, lighter areas required more samples than darker. The exact sampling counts can be changed, but in this instance black would be 15 samples and white would be 75 samples.

Image 3.18.png
Image 3.18

Given the previously discussed method for tracing the ray defined by the view direction it is worth considering that there are actually two rays that could be considered for this approach:

Diagram 3.19.png
Diagram 3.19

The approach previously discussed can be used to find the point marked ‘Actual’ in diagram 3.19; from here a vector from the light to the surface intersection can be computed. Tracing along this new vector reveals an intersection (marked ‘A’ in diagram 3.19) that implies the fragment currently being rendered receives no light energy due to being in a shadow.

Exactly the same technique for tracing the ray as before makes it possible to determine whether the pixel in question is in a shadow cast by another part of the surface – also known as “self shadowing”. A slightly modified version of the code can be used for tracing shadows as it is no longer of any interest where the intersection is – just that there is, or is not an intersection along the ray. By rearranging the method to return a simple binary answer the code generated can be much less complicated and hence faster. The following code shows the refined ray-tracing method:

float2 vParallaxDirection = normalize(  v.lightdir.xy );
float fLength          = length( vDir );
fLength               *= fLength;
float vDirZ_sq         = vDir.z * vDir.z;
float fParallaxLength  = sqrt( fLength - vDirZ_sq ) 
fParallaxLength       /= vDir.z; 
float2 vParallaxOffset = vParallaxDirection * fParallaxLength;
vParallaxOffset *= HEIGHT_SCALE;
uint nNumSteps = (int)lerp
                            normalize( v.lightdir ).z 
float  fCurrHeight       = 0.0;
float  fStepSize         = 1.0 / (float) nNumSteps;
uint   nStepIndex        = 0;
float2 vTexCurrentOffset = uv.xy;
float2 vTexOffsetPerStep = fStepSize * vParallaxOffset;
float  fCurrentBound     = 1.0;
float dx                 = ddx( v.texcoord.x );
float dy                 = ddy( v.texcoord.y );
float fPrevHeight        = 1.0;
float initial_height = texHeightMap.Sample
                                    ).r + SHADOW_OFFSET;
    uv.z += ( 1.0f / ( (MAX_SAMPLES - MIN_SAMPLES) + 1 ) );
while( nStepIndex < nNumSteps )
    vTexCurrentOffset += vTexOffsetPerStep;
    float RayHeight = lerp
                            nStepIndex / nNumSteps
    fCurrHeight = texHeightMap.SampleGrad
        uv.z += ( 1.0f / ( (MAX_SAMPLES - MIN_SAMPLES) + 1 ) );
    if( fCurrHeight > RayHeight )
        // ray has gone below the height of the surface, therefore
        // this pixel is occluded...
        s = 1.0f;
        nStepIndex = nNumSteps;

As shown in the above code-listing it is necessary to add a small bias to the height of the ray. Without this bias it is all too easy to introduce a very noticeable aliasing where pixels are marked as being in shadow when they shouldn’t be. The cause for this isn’t technically an error but a case where the test is too accurate. Checking the ray for intersections can result in it incorrectly identifying the initial pixel as occluding the ray or if there is a very slight variation with one of its neighbours (e.g. for an 8 bit height map the origin could have value 150 and the neighbour 151) . Adding the bias to the calculation generates the results that we expect as it effectively skips intersections with the original pixel – correcting the starting location to not sample the original pixel is an alternative approach but still yields some minor cases of aliasing that this simple biasing approach manages to avoid.

The following image, 3.20 shows the results of a self-shadowed ray-traced per-pixel lighting implementation. It is immediately obvious that the shadowed areas are ‘hard’ – by definition the algorithm only determines whether a pixel is or is not within shadow. More visually pleasing results can be achieved by implementing soft shadowing, one method for this is described in [Tartarchuk06] and is not substantially more difficult or complex to implement. By utilizing the hard self-shadowing approach shown here it is worth noting that the frequency of the height map becomes a crucial factor. A low frequency height map contains smooth and gentle changes in height whereas a high frequency map will have sharp and dramatic features. The importance is that hard self-shadowing favours a more high-frequency surface as the hardness of the shadows appears more visually acceptable; self-shadowing on a low-frequency surface still works but can look incorrect (regardless of the fact that it is actually correct!) to the observer. All three of the core algorithms demonstrated in this chapter favour low- to mid- frequency surfaces for best results, consequently this means that some degree of testing is essential to ensure that the final results are acceptable.

Image 3.20.png
Image 3.20

As should be immediately obvious, performing this operation requires double the work and thus the expense may not seem worthwhile. However, as will be covered in subsequent chapters of this book it is worth considering that running a test at this point may well allow execution to completely skip any implementation of the lighting equation. For the more complex lighting models that require a huge number of instructions it may well be a net saving to eliminate them via an initial self-shadowing test. Then again, as has been previously stressed the advantages of this are largely dependent on the underlying hardware’s implementation of the high level constructs such as dynamic branching.

Comparison Of Results

This chapter started out by explaining why per-pixel lighting is important and then proceeded to explain three different methods for achieving the same thing. This final section compares the results of each of these techniques and provides a simple summary.

Image 3.21.bmp Image 3.22.bmp
Image 3.21 – Simple Normal Mapping Image 3.22 – Parallax Mapping with Offset Limiting
Image 3.23.bmp Image 3.24.bmp
Image 3.23 – Ray-Traced Image 3.24 – Ray-Traced with Self-Shadowing

The previous four images were all taken with the same environment settings and are directly comparable. Simple normal mapping exhibits an almost annoying behaviour of flipping between raised or depressed bumps depending on how you look at it – either way it is difficult to extract the exact shape of the surface being represented. Image 3.22, parallax mapping with offset limiting, is an improvement and it becomes clear that the surface is made up of a grid of depressions. With the ray-traced approach there is no doubt as to the surface being represented and the self-shadowing form reveals a lot about the direction of the light source.

There is a natural progression through the techniques demonstrated in this chapter: Ray traced parallax with offset limiting, normal mapping and no per-pixel lighting. [Tartarchuk06] includes the idea of using mip-mapping to blend between these different methods based on a given pixel’s level of detail in the final image. If the pixel shader manually computes the mip-level to be used then it can select a higher or lower quality per-pixel lighting technique. The basic idea being that the shader scales the quality of the technique according to the actual contribution to the final image.

The following table contains a technical summary of each technique. Using the command-line HLSL compiler it is possible to generate an HTML formatted assembly code listing (use the /Fc and /Cc parameters) and it is therefore possible to determine how complex the shader really is. The values shown in the following table are provided only for guidance purposes – it is possible to implement some of these techniques with more or less instructions and the HLSL compiler is often being improved such that in time the same code might compile to less instructions.

Name Normal Map? Height Map? Instructions Samples Dynamic Flow Comments
None No No 2 0 No Would require additional geometry to simulate a more complex surface
Simple Yes No 9 1 No Good quality results but can show some unexpected results
Parallax with offset Yes Yes 15 2 No A simple change to the base algorithm yielding substantially better results
Ray-traced Yes Yes 68 15-75 Yes Best quality results with a high computation cost
Ray-traced with shadows Yes Yes 114 30-150 Yes Improves upon parallax occlusion by adding self-shadowing to the surface

It is important to realise that the two ray-traced techniques use dynamic flow control (via looping) and as such the number of instructions executed could be substantially higher than those used to define the program itself.

Both the parallax and ray-traced approaches utilize a height-map and therefore become good candidates for run-time normal generation. By taking this route the number of instructions increases by between 13 and 38 whilst the samples increase by 2 or 7 depending on the technique used.


[Blinn78] “Simulation of Wrinked Surfaces”, Computer Graphics (SIGGRAPH 1978 Proceedings), volume 12, pages 286-292, August 1978.

[Lengyel03] Page 184 of Mathematics for 3D Game Programming & Computer Graphics 2nd Edition by Eric Lengyel. ISBN 1-58450-277-0.

[Welsh04] “Parallax Mapping with Offset Limiting: A Per-Pixel Approximation of Uneven Surfaces”, Terry Welsh, January 2004.

[Kaneko01] “Detailed Shape Representation with Parallax Mapping”, Tomomichi Kaneko, et al. 2001.

[Policarpo05] “Real-Time Relief Mapping on Arbitrary Polygonal Surfaces”, Policarpo, Oliveira and Comba, 2005.

[Tartarchuk06] “Practical parallax occlusion mapping for highly detailed surface rendering”, Natalya Tartarchuk, ATI Research, Presented at the Game Developers Conference 2006.

Navigate to other chapters in this section:

Foundation & Theory Direct Light Sources Techniques For Dynamic Per-Pixel Lighting Phong and Blinn-Phong Cook-Torrance Oren-Nayar Strauss Ward Ashikhmin-Shirley Comparison and Summary