This post is part of the Path Tracing Optimizations in Indiana Jones series. While adding a path-tracing mode to Indiana Jones and the Great Circle in 2024, we used Shader Execution Reordering (SER), a feature available on NVIDIA GPUs since the NVIDIA GeForce RTX 40 Series, to improve the GPU performance. To optimize the use of SER in the main path-tracing pass (), we used the NVIDIA…
]]>The first post in this series, Path Tracing Optimization in Indiana Jones™: Shader Execution Reordering and Live State Reductions, covered ray-gen shader level optimizations that sped up the main path-tracing pass (“TraceMain”) of Indiana Jones and the Great Circle™. This second blog post covers additional GPU optimizations that were made at the level of the ray-tracing acceleration…
]]>As ray tracing becomes the predominant rendering technique in modern game engines, a single GPU RayGen shader can now perform most of the light simulation of a frame. To manage this level of complexity, it becomes necessary to observe a decomposition of shader performance at the HLSL or GLSL source-code level. As a result, shader profilers are now a must-have tool for optimizing ray tracing.
]]>If you are a DirectX 12 (DX12) game developer, you may have noticed that GPU times displayed in real time in your game HUD may change over time for a given pass. This may be the case even if nothing has changed on the application side. One reason for GPU time variations may be GPU Boost dynamically changing the GPU core clock frequency. Still, even with GPU Boost disabled using the DX12…
]]>UPDATE: NVIDIA Nsight Graphics 2023.3 and later feature the new Real-Time Shader Profiler, the first temporal sampling profiler for GPU shaders. This profiler enables you to examine the most expensive shaders at each moment in your frame. For more information, see GPU Trace UI in the Nsight Graphics User Guide. A less well-known but cool feature of NVIDIA Nsight Graphics is the Shader…
]]>This post was updated on May 19, 2023. How to optimize DX12 resource uploads from the CPU to the GPU over the PCIe bus is an old problem with many possible solutions, each with their pros and cons. In this post, I show how moving cherry-picked DX12 UPLOAD heaps to GPU upload heaps can be a simple solution to speed up PCIe-limited workloads. Take the example of a vertex buffer (VB)…
]]>As part of my GDC 2019 session, Optimizing DX12/DXR GPU Workloads using Nsight Graphics: GPU Trace and the Peak-Performance-Percentage (P3) Method, I presented an optimization technique named thread-group tiling, a type of thread-group ID swizzling. This is an important technique for optimizing 2D, full-screen, compute shader passes that are doing widely spread texture fetches and which are…
]]>While working with game developers on pre-release games, NVIDIA has had a steady flow of bugs reported where a game stutters for multiple milliseconds during gameplay. These stutter bugs can ruin the experience of the gamer, possibly making the game unplayable (as with the release of Batman Arkham Knight on PC), so they should be treated with a high priority. Until 2018, the only tool that…
]]>Many GPU performance analysis tools are based on a capture and replay mechanism, where a frame is first captured (either in-memory or to disk), and then replayed multiple times to be profiled. Nsight Graphics: GPU Trace differs in that it directly profiles the frames emitted by a live application, with no constraint on subsequent frames to be identical. This approach makes the tool simpler than…
]]>Figuring out how to reduce the GPU frame time of a rendering application on PC is challenging for even the most experienced PC game developers. In this blog post, we describe a performance triage method we’ve been using internally at NVIDIA to let us figure out the main performance limiters of any given GPU workload (also known as perf marker or call range), using NVIDIA-specific hardware metrics.
]]>