Coder Chameleon: January 2011

Wednesday, January 26, 2011

In the Land of Mordor where the Shadows Lie

An investigation into the techniques used in state-of-the-art videogame shadows.

God of War III : Shadows

These are really crisp, clean shadows based on a fairly standard implementation of cascaded shadow maps with PCF filtering, combined with some great techniques for culling non-shadowed parts of the scene. The team obtains really high quality shadows by using a multipass deferred "tiling" approach in which the nearest cascade is supersampled. This works really well, is flexible, and is made possible by culling techniques that use the PS3 depth bounds test. Very cool stuff.

Unfortunately, the depth bounds test that the God of War III shadows depend upon so heavily is an Nvidia extension. The Nvidia depth bounds test uses hierarchical z, so it is very fast. On hardware that doesn't support the hi-near-z and hi-far-z tests, it should be possible to use hi-stencil instead by rendering view-aligned boxes.

http://www.gdcvault.com/play/1012341/God-of-War-III

Algorithm

All shadow-casting lights are parallel light sources, so a position can be transformed into shadow space with a single matrix multiply and no w divide.

float3 shadowPos = pixelPos * posToShadowMatrix;
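To make the "no w divide" point concrete, here is a minimal sketch in Python. It assumes the position-to-shadow transform is a 3x4 affine matrix (which is what a parallel light with an orthographic projection gives you). The function name, the matrix layout and the example values are illustrative assumptions, not taken from the talk.

```python
# Hypothetical sketch: with a parallel (directional) light the projection is
# orthographic, so position -> shadow space is a purely affine transform.

def to_shadow_space(pos, pos_to_shadow):
    """Transform a 3D position by a 3x4 affine matrix (3 rows of 4 values).

    Each output component is one dot product plus a translation term;
    there is no fourth row, so there is no w to divide by.
    """
    x, y, z = pos
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in pos_to_shadow)


# Assumed example matrix: maps a 128-unit cube centred on the origin to [0, 1].
POS_TO_SHADOW = (
    (1.0 / 128.0, 0.0, 0.0, 0.5),
    (0.0, 1.0 / 128.0, 0.0, 0.5),
    (0.0, 0.0, 1.0 / 128.0, 0.5),
)

shadow_pos = to_shadow_space((64.0, -64.0, 0.0), POS_TO_SHADOW)
```

A perspective (spot light) projection would need a fourth row and a divide by w per pixel, which is the cost this setup avoids.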

Holy Grail = 1 filtered texel per pixel

A "White Buffer" (WB) is used to store full screen shadow buffer data for up to 4 lights. Shadow map cascades are no longer sampled in the opaque rendering pass. Instead, each cascade's shadow map is rendered into the WB, and the WB is then sampled once in the opaque pass.

Min blending is used to render the cascades to the WB in an order independent manner.
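A small sketch of why min blending makes the cascade order irrelevant. It assumes a convention where 1.0 means fully lit and 0.0 means fully shadowed, that the WB is cleared to white, and that a cascade writes 1.0 for pixels it does not cover. These conventions and the function names are assumptions for illustration.

```python
from itertools import permutations


def min_blend(dst, src):
    """Per-pixel min blend: keep the darker (more shadowed) value."""
    return [min(d, s) for d, s in zip(dst, src)]


def composite(cascades, num_pixels):
    """Blend every cascade's shadow term into a buffer cleared to white."""
    wb = [1.0] * num_pixels  # assumed clear value: white = fully lit
    for cascade in cascades:
        wb = min_blend(wb, cascade)
    return wb


# Three made-up cascades covering a 4-pixel "screen".
cascades = [
    [1.0, 0.2, 1.0, 1.0],
    [1.0, 0.5, 0.3, 1.0],
    [0.9, 1.0, 1.0, 1.0],
]

# min() is commutative and associative, so every ordering gives the same WB.
results = {tuple(composite(order, 4)) for order in permutations(cascades)}
```

Here `results` ends up holding a single buffer, whichever order the cascades are rendered in.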

The ZCull unit is used to minimize the cost of the full screen cascade passes. The ZCull unit keeps a conservative ZNear and a conservative ZFar, which together provide a "depth bounds test".

Left: ZNear. Right: ZFar

As the figure below illustrates, we are only interested in the areas of the screen that have depth values which fall within the z near and z far values.

Depth bounds test illustrated.
http://http.developer.nvidia.com/GPUGems/gpugems_ch09.html
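A rough model of the depth bounds test, assuming each coarse screen tile stores a conservative nearest and farthest depth. The function and parameter names are hypothetical. This is not the RSX or Nvidia API, just the logic of the test.

```python
def tile_may_pass(tile_znear, tile_zfar, bounds_min, bounds_max):
    """Coarse test: keep a tile only if its depth range overlaps the bounds.

    Because the tile range is conservative, a rejected tile is guaranteed to
    contain no pixel inside the bounds, so it can be skipped entirely.
    """
    return tile_znear <= bounds_max and tile_zfar >= bounds_min


def pixel_passes(stored_depth, bounds_min, bounds_max):
    """Fine test: the depth already in the depth buffer must lie in the bounds."""
    return bounds_min <= stored_depth <= bounds_max


def surviving_tiles(tiles, bounds_min, bounds_max):
    """Return indices of tiles that still need per-pixel work."""
    return [
        i
        for i, (znear, zfar) in enumerate(tiles)
        if tile_may_pass(znear, zfar, bounds_min, bounds_max)
    ]


# Made-up tiles as (conservative znear, conservative zfar) pairs.
tiles = [(0.1, 0.2), (0.3, 0.6), (0.7, 0.9)]
kept = surviving_tiles(tiles, 0.25, 0.65)
```

With the bounds set to one cascade's depth range, only tiles that can contain pixels of that cascade are shaded.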

So, the rendering process proceeds (for a 3 cascade shadow) as:

ZPrePass
Cascade 2 Shadow Map
Render Cascade 2 to WB
Cascade 1 Shadow Map
Render Cascade 1 to WB
Cascade 0 Shadow Map
Render Cascade 0 to WB
Opaque

This deferred approach can be used to apply different settings to each cascade, e.g. different sampling quality or different resolutions. It can also be used to tile the cascades and so increase their effective resolution. For example, we might want to double the effective resolution of cascade 0 (the nearest cascade) in each dimension by rendering it as 4 tiles of 1024 x 1024 (2048 x 2048 effective) rather than as a single 1024 x 1024 tile.

ZPrePass
Cascade 2 Shadow Map
Render Cascade 2 to WB
Cascade 1 Shadow Map
Render Cascade 1 to WB
Cascade 0 Shadow Map (TILE 0)
Render Cascade 0 to WB (TILE 0)
Cascade 0 Shadow Map (TILE 1)
Render Cascade 0 to WB (TILE 1)
Cascade 0 Shadow Map (TILE 2)
Render Cascade 0 to WB (TILE 2)
Cascade 0 Shadow Map (TILE 3)
Render Cascade 0 to WB (TILE 3)
Opaque

That is a lot of rendering passes that can only be fast if we apply some aggressive optimizations.

Using this technique the GOW team could achieve 10 megapixel resolution (approx 3600x2800) shadows for the nearest cascade.
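The pass lists above can be generated mechanically. The sketch below is only an illustration of the ordering and of how tiling multiplies the effective resolution. The function names and the `tiles_per_cascade` parameter are made up for this example.

```python
def shadow_pass_schedule(num_cascades, tiles_per_cascade=None):
    """Build the pass list: farthest cascade first, each tile as its own pair."""
    tiles_per_cascade = tiles_per_cascade or {}
    passes = ["ZPrePass"]
    for cascade in range(num_cascades - 1, -1, -1):
        tiles = tiles_per_cascade.get(cascade, 1)
        for tile in range(tiles):
            suffix = f" (TILE {tile})" if tiles > 1 else ""
            passes.append(f"Cascade {cascade} Shadow Map{suffix}")
            passes.append(f"Render Cascade {cascade} to WB{suffix}")
    passes.append("Opaque")
    return passes


def effective_resolution(tile_size, tiles_x, tiles_y):
    """Effective shadow map size when a cascade is split into a tile grid."""
    return tile_size * tiles_x, tile_size * tiles_y


plain = shadow_pass_schedule(3)            # the 8-pass list
tiled = shadow_pass_schedule(3, {0: 4})    # cascade 0 split into 4 tiles
size = effective_resolution(1024, 2, 2)    # 2 x 2 grid of 1024 x 1024 tiles
```

Each tile reuses the same shadow map memory, so the extra resolution costs passes rather than memory, which is why the culling optimizations matter so much.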

Optimization

In most GOW scenes shadows run at 4-5 ms.

Depth reconstruction in every pass has to go away: a single pass writes an fp32 depth buffer, which reduces the cost from 5 cycles to 1 texture read.

Two speed hits for shadows:
1) Rendering casters
2) Full screen passes for receivers

Cull Optimizations

Two reasons why a pixel is not shadowed:
1) Not in volume made by casters
2) In caster volume but not right receiver type

These correspond to two kinds of culling: volume culling and receiver type culling.

Receiver Type Culling

Two kinds of receiver type culling used in GOW:

1) Baked shadow integration. For example, the ground may use baked shadows, but there will be a matching hidden caster to cast dynamic shadows on dynamic shadow receivers. Hidden casters are not visible in the world and are rendered only to the cascade shadow map. Hidden casters are rendered in unique shadow map cascades.

ZCull Unit culls stencil too.
- Can cull based on depth independent criteria.
- ZPrepass can lay down scene stencil for free.
- Same reject rate as Z.

Stencil culling is used to mask out baked shadow receivers when rendering the hidden casters.

2) Minor characters cast only on geometry marked as "background", so minor characters do not self-shadow. The beauty of this is that the more minor characters there are on screen, the fewer pixels are affected by shadows. This optimization appears to make more sense when you have baked shadows.
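Both kinds of receiver type culling can be thought of as a stencil mask applied while a cascade is blended into the WB. The sketch below assumes each receiver type is a stencil bit laid down in the ZPrepass and reuses min blending. The bit values and function names are hypothetical.

```python
STENCIL_BAKED_RECEIVER = 0x01  # hypothetical bit: pixel already has baked shadows
STENCIL_BACKGROUND = 0x02      # hypothetical bit: pixel belongs to "background"


def blend_masked(wb, shadow, stencil, mask, ref):
    """Min-blend shadow into wb only where (stencil & mask) == ref.

    This mirrors a stencil equality test: pixels that fail are left untouched.
    """
    return [
        min(w, s) if (st & mask) == ref else w
        for w, s, st in zip(wb, shadow, stencil)
    ]


stencil = [STENCIL_BAKED_RECEIVER | STENCIL_BACKGROUND, STENCIL_BACKGROUND, 0x00]
wb = [1.0, 1.0, 1.0]
shadow = [0.2, 0.2, 0.2]

# Hidden casters: skip pixels that already receive baked shadows.
hidden = blend_masked(wb, shadow, stencil, STENCIL_BAKED_RECEIVER, 0)

# Minor characters: shadow only pixels marked as background.
minor = blend_masked(wb, shadow, stencil, STENCIL_BACKGROUND, STENCIL_BACKGROUND)
```

On the real hardware the rejected pixels would be culled in coarse blocks by the ZCull unit, not tested one at a time as in this model.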

Volume Culling

CPU based algorithm, GPU results

2D Cells from 3D Cells

No more full screen passes.

CPU processing per 2D cell. We want to know 3 things for every cell:
1) Do we need to draw it?
2) How big does it really need to be?
3) At what distances from the camera are the shadows?

(64 cells 8x8 for GOW)

One 8x4x8 gridded box per cascade

Casters and receivers in the grid. What casters hit which receivers?

For each caster, the sphere bounds are stretched in the shadow direction to form a capsule. Rather than use a capsule for visibility testing, this was converted to an AABB (or parallel projected frustum).
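A minimal sketch of the capsule-to-AABB step, assuming the caster's bounding sphere is swept a fixed distance along a unit-length shadow direction. The `length` parameter and the choice of coordinate space are assumptions. The talk may well build the box in light space.

```python
def swept_sphere_aabb(center, radius, shadow_dir, length):
    """Axis-aligned box enclosing a sphere swept along shadow_dir.

    The sweep forms a capsule from `center` to `center + shadow_dir * length`.
    The box is the min/max of the two end points, padded by the radius.
    """
    end = tuple(c + d * length for c, d in zip(center, shadow_dir))
    lo = tuple(min(a, b) - radius for a, b in zip(center, end))
    hi = tuple(max(a, b) + radius for a, b in zip(center, end))
    return lo, hi


# A caster 10 units above the ground with the light pointing straight down.
lo, hi = swept_sphere_aabb((0.0, 10.0, 0.0), 1.0, (0.0, -1.0, 0.0), 10.0)
```

The box is looser than the capsule, so it is conservative: it can only add grid cells to the caster's list, never miss one.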

Side view of the grid showing how z near, z far are determined for a 2D screen cell. Partial use of a cell can also be determined by projecting the bounds to the 2D cell.
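The per-cell questions (draw it? how big? what depth range?) can be sketched as a reduction over projected boxes. This assumes each shadowed 3D cell has already been projected to a screen rectangle in normalized [0, 1] coordinates plus a depth range. The data layout and the brute-force loop are illustrative only.

```python
def reduce_to_screen_cells(boxes, grid=8):
    """Collapse projected 3D boxes into per-cell screen bounds and depth range.

    boxes: iterable of (x0, y0, x1, y1, znear, zfar) in normalized screen space.
    Returns {(cx, cy): (x0, y0, x1, y1, znear, zfar)} for cells that need drawing.
    """
    cells = {}
    for x0, y0, x1, y1, znear, zfar in boxes:
        for cy in range(grid):
            for cx in range(grid):
                # Clip the box's screen rectangle to this cell.
                rx0, ry0 = max(x0, cx / grid), max(y0, cy / grid)
                rx1, ry1 = min(x1, (cx + 1) / grid), min(y1, (cy + 1) / grid)
                if rx0 >= rx1 or ry0 >= ry1:
                    continue  # the box does not touch this cell
                old = cells.get((cx, cy))
                if old is None:
                    cells[(cx, cy)] = (rx0, ry0, rx1, ry1, znear, zfar)
                else:
                    # Grow the cell's rectangle and depth range to cover both.
                    cells[(cx, cy)] = (
                        min(old[0], rx0), min(old[1], ry0),
                        max(old[2], rx1), max(old[3], ry1),
                        min(old[4], znear), max(old[5], zfar),
                    )
    return cells


boxes = [
    (0.0, 0.0, 0.25, 0.125, 5.0, 10.0),
    (0.125, 0.0, 0.1875, 0.0625, 2.0, 6.0),
]
cells = reduce_to_screen_cells(boxes)
```

A cell missing from the result does not need to be drawn at all. For the rest, the rectangle gives the quad to render and the depth range gives the values for the depth bounds test.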

Summary

These are some great techniques for achieving high quality shadows. They are more applicable to a game that uses baked environment shadows and uses static environment lighting. For a game with fully dynamic lighting and shadows these techniques may not be as useful.

Tuesday, January 18, 2011

Getting Started with XNA

I've been meaning to mess around with XNA a bit. It looks like a great tool for quickly testing ideas on actual Xbox hardware. Not a big fan of C# though. Here I'll post some of the info and links that I find useful as I get into XNA development.

YouTube - XNA 4.0 Tutorial: Part 1

Back to Basics - SIMD Part III - Cell Architecture

Appendix

http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-1.pdf

http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-2.pdf

Monday, January 17, 2011

Back to Basics - SIMD Part II - Intel

Wikipedia covers the basics of Intel's SSE instruction set:
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

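In 2011, Intel is introducing a new AVX SIMD instruction set (first in the Sandy Bridge processors) featuring 256-bit registers and data paths. Each register holds eight single-precision or four double-precision floats. SSE5 was an AMD proposal rather than an Intel one, and it was later revised into the XOP, FMA4 and CVT16 extensions.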

When it comes to writing SIMD code, using intrinsics can make programming for these enhanced instruction sets simpler and less error prone than using asm blocks.

http://www.codeproject.com/KB/recipes/sseintro.aspx

Friday, January 14, 2011

Back to Basics - SIMD Part I - iPhone

To start off the Coder Chameleon blog I'd like to spend some time investigating SIMD on the most prevalent hardware architectures and determining their commonalities and differences. We'll start with the iPhone and survey the major hardware platforms, then maybe develop a simple math library utilizing SIMD on each platform.

SIMD on the iPhone

The Wikipedia page on the iPhone details the evolving hardware architecture of the platform (http://en.wikipedia.org/wiki/IPhone). The original iPhone and iPhone 3G were based on an ARM11 core (ARMv6 architecture), which had a limited set of SIMD instructions, but it was not until the iPhone 3GS that ARM's NEON general-purpose SIMD engine became available.

The iPhone 3GS uses the S5PC100 chip, which is based on the ARM Cortex-A8 architecture.

An excellent presentation on SIMD on the iPhone:
Cranking Floating Point Performance to 11 by Noel Llopis
http://www.slideshare.net/llopis/cranking-floating-point-performance-to-11-on-the-iphone-2111775

The presentation above references the vfp math library (by Wolfgang Engel): http://code.google.com/p/vfpmathlibrary

NEON

Here is an excellent introduction to NEON on the iPhone:
http://wanderingcoder.net/2010/06/02/intro-neon/

NEON provides 16 x 128-bit registers named q0 to q15 (q for quadword).
The same registers can also be referenced as 32 x 64-bit doubleword registers named d0 to d31.
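The two register views alias each other: each q register overlaps a pair of d registers, with q0 covering d0 (low half) and d1 (high half), q1 covering d2 and d3, and so on. A tiny helper makes the mapping explicit. The function name is made up for illustration.

```python
def d_registers_for_q(q_index):
    """Return the (low, high) doubleword registers that alias quadword qN."""
    if not 0 <= q_index <= 15:
        raise ValueError("NEON quadword registers are q0 to q15")
    return f"d{2 * q_index}", f"d{2 * q_index + 1}"


low, high = d_registers_for_q(3)  # q3 aliases d6 (low half) and d7 (high half)
```

This matters when mixing 64-bit and 128-bit instructions: writing d6 changes the low half of q3.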

NEON instructions use a kind of Hungarian notation. Each NEON instruction begins with 'v', which can be followed by one or more of these modifiers:

'q' means the instruction saturates
'r' means the instruction rounds
'h' means it halves

Practically all instructions need a suffix to describe the size and type of the elements being operated upon, from .u8 (unsigned byte) to .f32 (single-precision floating point). For example, vqadd.s16. If the element size changes as part of the operation, the suffix indicates the size of the narrowest input.
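To show what the modifiers mean per element, here is a small Python model of three instructions on lanes of signed 16-bit integers: vqadd.s16 (saturating add), vhadd.s16 (halving add) and vrhadd.s16 (rounding halving add). It models the arithmetic only. The Python function names mirror the mnemonics but are not a real API.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def vqadd_s16(a, b):
    """'q': add, then clamp each lane to the signed 16-bit range."""
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]


def vhadd_s16(a, b):
    """'h': add at full precision, then shift each lane right by one bit."""
    return [(x + y) >> 1 for x, y in zip(a, b)]


def vrhadd_s16(a, b):
    """'r' + 'h': as vhadd, but add 1 before the shift so the result rounds."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]


saturated = vqadd_s16([32767, -32768, 100], [1, -1, 23])
halved = vhadd_s16([3, 7], [4, 7])
rounded = vrhadd_s16([3, 7], [4, 7])
```

A plain vadd.i16 would wrap around on overflow in the first two lanes, where the saturating form clamps instead.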

APPENDIX I - NEON and VFP References

ARM Info Center: NEON and VFP Programming

http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/

http://www.delmarnorth.com/microwave/requirements/TestCodeTutorial_neon-test_draft2.pdf

APPENDIX II - General iPhone Hardware Reference

Wandering Coder - A Few Things iOS Developers Should Know About ARM Architecture

Thursday, January 13, 2011

Welcome to Coder Chameleon.

The purpose of this blog is to track research into various aspects of videogame programming: graphics, AI, sound, languages, optimization and hardware. The blog is mainly a personal research tool for me to log research notes and references, but I hope it will be of some use to others too.

I will try to strike a balance between being comprehensive and concise by eliminating fluff and linking to references for details. I don't have a lot of time to devote to the blog so things may evolve in a slow and unpredictable fashion.