This post is part of the Path Tracing Optimizations in Indiana Jones™ series.
While adding a path-tracing mode to Indiana Jones and the Great Circle™ in 2024, we used Shader Execution Reordering (SER), a feature available on NVIDIA GPUs since the NVIDIA GeForce RTX 40 Series, to improve the GPU performance.
To optimize the use of SER in the main path-tracing pass (TraceMain), we used the NVIDIA Nsight Graphics GPU Trace Profiler. We found that the pass's RayGen shader was using a large number of ray-tracing (RT) live-state bytes, which was lowering the efficiency of SER. Using the Ray Tracing Live State tab in the GPU Trace Profiler, we identified the GLSL variables responsible for those RT live-state bytes and either eliminated them or reduced their sizes.
As a result, SER now saves 24% of GPU time on the TraceMain pass in the profiled scene below on an NVIDIA GeForce RTX 5080 GPU.
The profiled scene

All of the data in this post is captured from the first moment of gameplay of Indiana Jones and the Great Circle™ on a GeForce RTX 5080 GPU, with the graphics settings from Table 1.
| Setting | Value |
| --- | --- |
| Output Resolution | 4K UHD (3840 x 2160) |
| DLSS Mode | DLSS Ray Reconstruction, Performance Mode |
| Graphics Preset | High |
| Path-Tracing Mode | Full Ray Tracing |
| Ray-Traced Lights | All Lights |
| Vegetation Animation Quality | Ultra |
This part of the game takes place in a jungle with a lot of dense, alpha-tested geometry, and it is one of the most challenging locations to ray trace. With DLSS-RR in Performance mode and the output resolution set to 4K UHD (3840 x 2160), the path tracing is done at 1080p.
Starting point: 4.08 ms
Figure 2 shows the TraceMain pass within the GPU Trace Profiler of Nsight Graphics.

As provided by GPU Trace, this cmdTraceRays workload had the following GPU metrics:
- GPU Time: 4.08 ms
- Predicated-On Active Threads per Warp: 38%
The Predicated-On Active Threads per Warp metric is the average number of SIMT-active thread lanes per warp (out of 32 lanes) per executed instruction at the SM level. For example, if an average of 12 of the 32 lanes in a warp are active per executed instruction, the metric reports 37.5%.
The following conditions can cause a suboptimal percentage of active threads per warp:
- Dynamic loops with divergent loop counts within a warp
- Dynamic branches with divergent executed paths within a warp
- Different hit shaders getting invoked within a warp, in RayGen shaders
Figure 3 shows why a divergent dynamic branch increases shader latency. For more information, see Optimize GPU Workloads for Graphics Applications with NVIDIA Nsight Graphics.

Shader Execution Reordering
For RayGen shaders, Shader Execution Reordering (SER) can be an effective way to increase the average percentage of active threads per warp. SER can be used to reorder the threads of a RayGen shader after a traceRay call, according to a key derived from the hit object and an optional user-provided coherence hint.
Invoking the hit shader or accessing hit-related data is more coherent after the reorder call, as threads with the same or similar keys are grouped to execute together.
- For Vulkan, SER is exposed by the VK_NV_ray_tracing_invocation_reorder extension.
- For DX12, SER is now available in DXR 1.2 or through NvAPI.
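On the Vulkan path, the hitObjectNV type and the reorderThreadNV and hitObjectExecuteShaderNV intrinsics that appear in the engine code below are provided by the GL_NV_shader_invocation_reorder GLSL extension (traceRayHitObject in the engine code is presumably a wrapper around the raw hitObjectTraceRayNV intrinsic). A minimal sketch of the declarations a SER-capable RayGen shader needs:

```glsl
#version 460
#extension GL_EXT_ray_tracing : require
// Provides hitObjectNV, hitObjectTraceRayNV(), reorderThreadNV(),
// and hitObjectExecuteShaderNV() for Shader Execution Reordering.
#extension GL_NV_shader_invocation_reorder : require
```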
SER was implemented in the engine by adding a shader permutation of the main path-tracing RayGen shader for GPUs that support SER (GeForce RTX 40 Series and later), with the following GLSL code:
```glsl
#if defined( RPF_RT_ENABLE_SER )
    hitObjectNV hitObject;
    traceRayHitObject( hitObject, rayFlags, instanceMask, ray, rayPayload, topLevelAccelerationStructure );
    reorderThreadNV( hitObject, bounceNum == ( bounceId + 1 ) ? 1 : 0, 1 );
    hitObjectExecuteShaderNV( hitObject, _hitPayloadIndex( rayPayload ) );
#else
    traceRayEXT( rayFlags, instanceMask, ray, rayPayload, topLevelAccelerationStructure );
#endif
```
The coherence hint bounceNum == ( bounceId + 1 ) ? 1 : 0 is an implementation of the idea described in the original Shader Execution Reordering whitepaper: “A related situation can be found in path tracers or multi-bounce reflections. There, it is often highly effective to include in the coherence hint some additional information about whether the main loop will terminate.”
With SER: 3.63 ms
With the SER implementation, the TraceMain pass became 11% cheaper (4.08 => 3.63 ms), and the average Predicated-On Active Threads per Warp metric increased from 38% to 70% (Figure 4).

|  | SER OFF | SER ON |
| --- | --- | --- |
| GPU Time | 4.08 ms | 3.63 ms |
| Active Threads / Warp | 38% | 70% |
Ray-tracing live-state spills
Table 3 shows additional GPU metrics that give a sense of what changed between the two traces:
|  | SER OFF | SER ON |
| --- | --- | --- |
| VidL2 Throughput | 58.3% | 85.6% |
| L1TEX Throughput | 30.6% | 43.0% |
| VidL2 Total from L1TEX | 9.22 GiB | 12.55 GiB |
Adding SER to a RayGen shader is actually expected to increase L2 traffic if the RayGen shader has a significant number of spilled ray-tracing live-state bytes with SER off. These are the local variables that the ray-tracing driver saves to memory before a traceRay call and reloads after the traceRay call completes, when continuing the execution of the RayGen shader.
With the presence of reorderThread, RT live-state spills can become more costly, as the live-state data potentially must be transferred across the GPU by the SER implementation when reordering threads. The higher the number of RT live-state bytes spilled at traceRay or reorderThread call sites, the more GPU overhead the live state can produce.
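To make the concept concrete, here is a hypothetical, stripped-down RayGen shader (not the game's code) in which a value computed before traceRayEXT is read again after it, making it a candidate for RT live-state spilling:

```glsl
#version 460
#extension GL_EXT_ray_tracing : require

layout( binding = 0 ) uniform accelerationStructureEXT topLevelAS;
layout( binding = 1, rgba16f ) uniform image2D outImage;
layout( location = 0 ) rayPayloadEXT vec3 payloadRadiance;

void main()
{
    vec2 uv = ( vec2( gl_LaunchIDEXT.xy ) + 0.5 ) / vec2( gl_LaunchSizeEXT.xy );

    // 'throughput' is live across the trace call: the driver must keep it
    // alive over traceRayEXT(), typically by spilling it to memory before
    // the call and reloading it afterwards...
    vec3 throughput = vec3( uv, 1.0 );

    traceRayEXT( topLevelAS, gl_RayFlagsOpaqueEXT, 0xFF, 0, 0, 0,
                 vec3( uv, 0.0 ), 0.0, vec3( 0.0, 0.0, 1.0 ), 1000.0, 0 );

    // ...because it is read again here, after the call has returned.
    imageStore( outImage, ivec2( gl_LaunchIDEXT.xy ),
                vec4( payloadRadiance * throughput, 1.0 ) );
}
```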
Ray Tracing Live State tab
Figure 5, from the GPU Trace Profiler, shows in the bottom-right area of the tool a section with summaries of the shader profiler data associated with the selected time range (the TraceMain pass in this case).

First, let’s drag the horizontal separator between the bottom-left and bottom-right sections of the tool all the way to the left, and choose the Ray Tracing Live State tab:

Figure 6 shows, for each RT call site (traceRay, reorderThread, or callable), the GLSL declarations of variables that were declared before the call site and reused after it, and for which the RT driver performed live-state spilling around the call site (writing the value of the variable to memory before the call site and reloading it after).
In this case, a total of 222 bytes per thread were spilled to global memory for this RayGen shader.
Optimization 1: Loop removal
Now, let’s expand the information on call site #2 to see the list of GLSL lines that correspond to spilled RT live-state declarations:

The largest entry in the list is:
```glsl
HitDesc primaryHitDesc = PrimaryHitDescFromGbuffer( pixelPos32 );
```
This primaryHitDesc struct is initialized from the GBuffer at the beginning of the RayGen shader, and its values are used to decide whether and how the first ray should be traced.
But why are 72 bytes from this struct reused after the first traceRay call, given that it should only be used to set up that first ray? The RayGen shader had the following loop across paths, where PATHTRACE internally calls traceRayHitObject and reorderThreadNV in an inner loop:
```glsl
for ( uint pathId = 0; ( pathId < pathNum ) && !primaryHitDesc.isSky; ++pathId )
{
    PathOutputDesc pathDesc = PathOutputDescNull();
    // ...
    PATHTRACE( pixelPos, primaryHitDesc, bounceNum, pathDesc );
    if ( pathDesc.isValid == false )
    {
        continue;
    }
    // ...
    if ( pathDesc.isDiffuse )
    {
        hitDistanceDiff = normHitDist;
        radianceDiff += pathDesc.radiance;
        numSamplesDiff += 1.0;
    }
    else
    {
        NRD_FrontEnd_SpecHitDistAveraging_Add( hitDistanceSpec, normHitDist );
        radianceSpec += pathDesc.radiance;
        numSamplesSpec += 1.0;
    }
}
radianceDiff /= max( 1.0, numSamplesDiff );
radianceSpec /= max( 1.0, numSamplesSpec );
```
In the shipping game, pathNum is always 1, so this loop is really a branch. Because it was written as a loop, however, the complete primaryHitDesc struct was spilled before the for loop and reloaded at the start of each loop iteration. Since the loop is a branch in practice, we rewrote it as one:
```glsl
PathOutputDesc pathDesc = PathOutputDescNull();
if ( !primaryHitDesc.isSky )
{
    // ...
    PATHTRACE( pixelPos, primaryHitDesc, bounceNum, pathDesc, throughput );
    // ...
    if ( pathDesc.isValid )
    {
        if ( pathDesc.isDiffuse )
        {
            hitDistanceDiff = normHitDist;
            radianceDiff += pathDesc.radiance;
        }
        else
        {
            NRD_FrontEnd_SpecHitDistAveraging_Add( hitDistanceSpec, normHitDist );
            radianceSpec += pathDesc.radiance;
        }
    }
}
```
Changing the loop into an isSky branch removed the 72 bytes of spilled RT live state.
Optimization 2: Using FP16 precision
The following GLSL line is listed in the Ray Tracing Live State tab from Figure 7:
```glsl
pathDesc.radianceAndAccumHitDist.xyz
```
This keeps track of the accumulated radiance per path:
```glsl
pathDesc.radianceAndAccumHitDist.xyz += ( hitDesc.shading.xyz * throughput.xyz );
```
The radianceAndAccumHitDist variable was declared as a float4, which was overkill in this case. By demoting radianceAndAccumHitDist from float4 to half4 (f16vec4), which compiles to four FP16 values, the RT live-state size for this four-component vector was reduced by half, without affecting image quality.
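As a sketch (assuming the GL_EXT_shader_explicit_arithmetic_types_float16 GLSL extension is available; the engine's own half4 alias may be defined differently), the change amounts to:

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Before: a full-precision vec4 (float4), 16 bytes of potential RT live state per thread.
// vec4 radianceAndAccumHitDist = vec4( 0.0 );

// After: four FP16 components, 8 bytes per thread.
f16vec4 radianceAndAccumHitDist = f16vec4( 0.0 );
```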
Before and after RT live-state optimizations
With SER enabled, for the TraceMain pass, the RT live-state optimizations reduced the RT live state reported in GPU Trace from 222 to 84 bytes, and the GPU time spent on that pass by 15% (3.63 => 3.08 ms) on an NVIDIA GeForce RTX 5080 GPU.
| With SER ON | Before | After |
| --- | --- | --- |
| GPU Time | 3.63 ms | 3.08 ms |
| Active Threads / Warp | 70% | 68% |
| RT Live State Spilled | 222 bytes | 84 bytes |
Figure 8 shows the Before version with SER ON, at 222 bytes.

Figure 9 shows the After version with SER ON, at 84 bytes.

SER ON compared to SER OFF, after RT live-state optimizations
Before the RT live-state optimizations, enabling SER made the TraceMain pass 11% cheaper on a GeForce RTX 5080 GPU.
| Before | SER OFF | SER ON |
| --- | --- | --- |
| RT Live State Spilled | 180 bytes | 222 bytes |
| GPU Time | 4.08 ms | 3.63 ms (-11%) |
| Active Threads / Warp | 38% | 70% |
Now, after having optimized the RT live state using the Nsight GPU Trace Profiler, let’s compare SER ON to SER OFF again on that pass, with the same conditions as before, except for the GLSL differences from the live-state optimizations.
| After | SER OFF | SER ON |
| --- | --- | --- |
| RT Live State Spilled | 68 bytes | 84 bytes |
| GPU Time | 4.07 ms | 3.08 ms (-24%) |
| Active Threads / Warp | 38% | 68% |
Reducing the number of RT live-state spilled bytes in the GLSL made SER an even greater accelerator. After the RT live-state reduction optimizations, enabling SER saves 24% of GPU time in that pass.
Other RT live-state optimizations
This post compares the performance of SER with two versions of the GLSL, before and after applying the RT live-state optimizations:
- After: The shipping version of the GLSL from December 2024, with all optimizations.
- Before: The After version with the two optimizations reverted (loop removal, and using FP16 precision for radianceAndAccumHitDist).

In reality, we implemented additional RT live-state optimizations in the GLSL of the TraceMain pass, which are part of the After version but were not reverted when generating the Before version for this post:
- The RGB throughput vector of the path tracer was a float3. Changing it to a half3 helped and looked the same.
- The bounceId and bounceNum variables of the bounce loop were 32-bit integers. Making them uint16_t helped. (uint8_t would have been possible, too.)
- The ray direction in the bounce loop body was a float3. Packing it into a uint32_t helped and looked the same (see the sketch after this list).
- Moving the signal demodulation from the end of the RayGen shader to a subsequent pass helped, by removing dependencies on GBuffer material data.
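As an illustration of the ray-direction packing idea (a hypothetical sketch only; this post does not describe the exact encoding used in the game), a unit direction can be stored in a single uint32_t with an octahedral mapping and two 16-bit UNORM components:

```glsl
// Hypothetical octahedral packing of a unit direction into one uint32_t.
vec2 OctWrap( vec2 v )
{
    return ( 1.0 - abs( v.yx ) ) * vec2( v.x >= 0.0 ? 1.0 : -1.0,
                                         v.y >= 0.0 ? 1.0 : -1.0 );
}

uint PackDirection( vec3 n ) // n must be normalized
{
    vec2 p = n.xy / ( abs( n.x ) + abs( n.y ) + abs( n.z ) );
    p = ( n.z >= 0.0 ) ? p : OctWrap( p );
    return packUnorm2x16( p * 0.5 + 0.5 ); // [-1,1] -> [0,1] -> 2 x 16-bit UNORM
}

vec3 UnpackDirection( uint packedDir )
{
    vec2 p = unpackUnorm2x16( packedDir ) * 2.0 - 1.0;
    vec3 n = vec3( p, 1.0 - abs( p.x ) - abs( p.y ) );
    float t = max( -n.z, 0.0 );
    n.xy += vec2( n.x >= 0.0 ? -t : t, n.y >= 0.0 ? -t : t );
    return normalize( n );
}
```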
All of these optimizations were implemented in time for the initial release of the path-tracing mode of the game.
Conclusion
In RayGen shaders that have a poor percentage of average active threads per warp, adding SER can produce significant speedups with minimal code changes. Reducing the number of RT live-state spilled bytes reported in the Nsight Graphics GPU Trace Profiler can make SER an even greater accelerator.
In HLSL, declaring a floating-point variable as FP16 can be done using DXC with Shader Model 6.2 or later and the explicit HLSL type float16_t. For more information, see Half The Precision, Twice The Fun: Working With FP16 In HLSL.
The next post in this series, Path Tracing Optimizations in Indiana Jones™: Opacity Micro-Maps and Compaction of Dynamic BLASs, covers using Opacity Micro-Maps (OMMs) as an effective way to speed up traceRay calls (or rayQuery objects) in scenes with alpha-tested materials, as well as compacting the BLASs of the dynamic vegetation to save VRAM.
Acknowledgements
We would like to thank all of the people from NVIDIA and MachineGames who made it possible to ship Indiana Jones and the Great Circle™ with path tracing on December 9, 2024:
MachineGames: Sergei Kulikov, Andreas Larsson, Marcus Buretorp, Joel de Vahl, Nikita Druzhinin, Patrik Willbo, Markus Ålind, Magnus Auvinen, Jim Kjellin, Truls Bengtsson, Jorge Luna (MPG).
NVIDIA: Ivan Povarov, Juho Marttila, Jussi Rasanen, Jiho Choi, Oleg Arutiunian, Dmitrii Zhdan, Evgeny Makarov, Johannes Deligiannis, Fedor Gatov, Vladimir Mulabaev, Dajuan Mcdaniel, Dean Bent, Jon Story, Eric Reichley, Magnus Andersson, Pierre Moreau, Rasmus Barringer, Michael Haidl, Martin Stich.