The Peak-Performance Analysis Method for Optimizing Any GPU Workload

Figuring out how to reduce the GPU frame time of a rendering application on a PC can be a challenging task, even for the most experienced PC game developers. This blog post describes a performance triage method we have been using internally at NVIDIA to figure out the main performance limiters of any given GPU workload (also known as perf marker or call range), using NVIDIA-specific hardware metrics.

Our performance triage method does not start from assumptions or knowledge about what is being rendered on the GPU. Instead, it starts solely from hardware metrics letting us know how well the whole GPU is utilized, what hardware units and sub-units are limiting the performance, as well as how close they are running to their respective peak performance (also known as “Speed Of Light” or ”SOL”). If the application does not use asynchronous compute, this hardware-centric information can then be mapped back to what the graphics API and shaders are doing, providing guidance on how to improve the GPU performance of any given workload:

  1. If no GPU unit has a high throughput (compared to its SOL), then we strive to improve the achieved throughput of at least one unit.
  2. If some GPU unit has a high throughput (compared to its SOL), then we figure out how to remove work from that unit.

Nsight Range Profiler

The hardware metrics per GPU workload can be captured by our PerfWorks library on DX11, DX12 and OpenGL for all NVIDIA GPUs starting from the Kepler architecture [1] (Maxwell, Pascal and Volta GPUs are fully supported). Although the PerfWorks headers are not public yet, the library can be used today via publicly available tools: the Range Profiler of Nsight: Visual Studio Edition 5.5 for DX12, DX11 and OpenGL 4.6 (but not Vulkan yet), as well as Microsoft’s “PIX on Windows” for DX12.

[1] As noted on Wikipedia, the GeForce 600 and 700 series are mostly Kepler, the 900 series is Maxwell, the 1000 series is Pascal, and the TITAN V is Volta.

Step 1: Lock the GPU Core Clock

First, to get the most deterministic measurements, we recommend that you always lock your GPU Core Clock frequency before collecting any performance metrics (and unlock it afterward, to restore maximum performance and to minimize power consumption and noise when the GPU is idle or just doing desktop rendering). On Windows 10 with “Developer Mode” enabled, this can be done by running a simple DX12 application that calls SetStablePowerState(TRUE) on a dummy DX12 device and then goes to sleep without releasing the device, as described in this blog post.
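For reference, here is a minimal sketch of such a clock-locking utility (assuming Windows 10 with Developer Mode enabled; error handling omitted):

    // Hedged sketch: create a dummy DX12 device, lock the GPU core clock via
    // SetStablePowerState, then sleep so the lock stays in effect. Requires
    // Windows 10 with Developer Mode enabled; link with d3d12.lib. Closing the
    // app releases the device, which unlocks the clock.
    #include <d3d12.h>
    #include <windows.h>

    int main()
    {
        ID3D12Device* device = nullptr;
        if (SUCCEEDED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
        {
            device->SetStablePowerState(TRUE); // lock the core clock
            Sleep(INFINITE);                   // keep the device (and lock) alive
        }
        return 0;
    }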

NOTE: Since release 5.5 of Nsight: Visual Studio Edition, the Range Profiler is now effectively calling SetStablePowerState() before/after profiling any Range, using an internal driver API that works on all Windows versions (not just Windows 10) and does not require the OS to be in “Developer Mode”. So you should not need to worry about locking your GPU Core clock when using the Nsight Range Profiler.

Step 2: Capture a Frame with the Nsight HUD

For non-UWP (Universal Windows Platform) applications, this can be done by drag-and-dropping your EXE (or batch file) onto the “NVIDIA Nsight HUD Launcher” shortcut that Nsight installs on your Desktop, playing until you reach the location in the game you want to capture, and then:

  1. Pressing CTRL-Z to show the Nsight HUD in the top-right section of the screen, and
  2. Clicking on the “Pause and capture frame” button in the Nsight HUD, or pressing the space bar to initiate the capture.

You can export the current frame to a Visual Studio C++ project by clicking on the “Save capture to disk” button. (By default, the Nsight exported frames get saved to C:\Users...\Documents\NVIDIA Nsight\Captures...)

You can click on “Resume” to keep playing your game in order to find other locations where you may want to capture more frames.

NOTE: You can skip the “Save capture to disk” step and jump directly to the next steps (Scrubber & Range Profiler analysis), but we recommend always saving your captures to disk and archiving them, so you can go back to them later if needed. Saving exported frames to disk lets you attach the data to your analysis, so you or anyone else on your team can try to reproduce your results.

At NVIDIA, we treat performance analysis as a scientific process where we provide all the repro data associated with our analysis and encourage colleagues to repro and review our results. In our experience, it is also a good practice to capture frames before and after a performance optimization attempt has been made (be it successful or not), analyze how the hardware metrics have changed, and learn from the results.

Additional Notes on Nsight Frame Captures:

  • For Nsight frames, it does not matter if your app is running in windowed mode, full screen mode or full screen borderless mode, since Nsight always runs the frames in a hidden window anyway. Just make sure the resolution and graphics settings are the ones you want to profile with.
  • For DX12 applications, we assume asynchronous compute is not being used, otherwise some of the hardware metrics may be biased by multiple workloads executing concurrently on the GPU. As of today, for all PerfWorks-based analysis, we recommend disabling the use of asynchronous compute in DX12 applications.
  • With regards to DX12 asynchronous copy calls (in COPY queues), it is OK to use these in frame captures, but you should know PerfWorks and Nsight currently do not profile the COPY-queue calls separately. So any COPY-queue calls executing concurrently with other DIRECT-queue calls may have an impact on the GPU DRAM traffic in these workloads.
  • Dropping the exe to the “Nsight HUD Launcher” won’t work for UWP apps. The Nsight launch approach for UWP is currently only supported through the Visual Studio IDE.

Step 3: Break Down the GPU Frame Time

A top-down view of the GPU times is a great way to figure out which perf markers/workloads are the most expensive in a frame. For an HBAO+ DX11 test app rendering SSAO in 4K on a GeForce GTX 1060 6GB, the GPU-frame-time breakdown looks like this in Nsight’s Scrubber:

Figure 1. Example GPU-frame-time breakdown in the Scrubber within Nsight: Visual Studio Edition 5.5.

In the “Perf Markers” row, the Scrubber shows the elapsed GPU time per workload measured via D3D timestamp queries, as well as the percentage of the GPU frame time (with the Present call excluded) that each workload is taking. In the example from Figure 1, it is immediately obvious which workload is most expensive in this frame: “DrawCoarseAOPS”, taking 49.1% of the GPU frame time.

NOTE: To repro this result, you can download version 3.1 of the HBAO+ source code from GitHub and then run the “SampleApp_D3D11” project in Visual Studio. To make the RenderAO call emit perf markers, you can define ENABLE_PERF_MARKERS=1 in GFSDK_SSAO_D3D11 -> Project Properties -> C/C++ -> Preprocessor. For the record, here is what the frame looks like:

Step 4: Launch the Nsight Range Profiler

You can open the Visual Studio solution file of the frame capture that you exported to disk in Step 2, build the solution in Release x64, and then go to the Nsight menu in Visual Studio and click on “Start Graphics Debugging”.

To reach the Nsight Range Profiler, you can launch the Nsight frame capture EXE, then:

  1. Press CTRL-Z and then Space
  2. ALT-Tab to your Visual Studio window
  3. Find the Scrubber tab that Nsight has added there
  4. Right-click on the workload you want to profile and click on “Profile [Perf Markers]…”

NOTE: If for some reason, CTRL-Z + SPACE is not working for your app, you can ALT-Tab to Visual Studio and click on Visual Studio -> Nsight menu -> “Pause and Capture Frame”.

Let’s invoke the Range Profiler on the “DrawCoarseAOPS” workload from Step 3 (by profiling only this call range, and nothing else) by right-clicking on the “DrawCoarseAOPS” box in the scrubber and doing “Profile [Perf Markers] DrawCoarseAOPS”:

Figure 2. Launching the Nsight Range Profiler for a given workload, from the Scrubber window.

The Range Profiler injects PerfWorks calls inside the Nsight frame capture and collects a set of PerfWorks metrics for the specified workload. Once the profiling is complete, Nsight shows the collected metrics in a new section of the Range Profiler window below the Scrubber.

Step 5: Inspect the Top SOLs and Cache Hit Rates

We start by inspecting the metrics from the “Pipeline Overview” Summary section of the Range Profiler. This is just a view of PerfWorks metrics for the current workload. By hovering the mouse over any of the metrics, the actual PerfWorks metric name gets displayed in a tooltip:

Figure 3. Nsight Range Profiler tooltip showing PerfWorks metric names and descriptions.

5.1. The Per-Unit SOL% Metrics

The first top-level metrics to look at for each workload are the per-unit SOL% metrics of the GPU. These convey how close each unit is to its maximum theoretical throughput or Speed Of Light (SOL). At a high level, one can imagine the per-unit SOL% metric as a ratio of achieved throughput to SOL throughput. However, for units with multiple sub-units or concurrent data paths, the per-unit SOL% is the MAX over all the sub-SOL-metrics of all sub-units & data paths.

NOTE: If you are not familiar with the names of the units in our GPUs, please read the following blog post which provides a high-level overview of how the logical graphics pipeline maps to GPU units in our GPU architectures: “Life of a triangle – NVIDIA’s logical pipeline”, as well as slides 7 to 25 from the GTC 2016 presentation: “GPU-Driven Rendering”. In this context:

  • “IA” (Input Assembler) loads indices and vertices (before the vertex shader gets called).
  • SM (Streaming Multiprocessor) runs the shaders.
  • TEX performs SRV fetches (and UAV accesses, since Maxwell).
  • L2 is the Level-2 cache attached to each DRAM partition.
  • CROP does color writes & blending to render targets.
  • ZROP does depth-stencil testing.
  • DRAM (“Memory” in the Range Diagram) is the GPU video memory.

NOTE: A simplified view of the graphics pipeline and its mapping to GPU units can be found within the Nsight Range Profiler results, by selecting “Range Diagram” in the Pipeline Overview. In this diagram, the per-unit SOL% values are displayed as green bars:

Figure 4. Nsight Range Profiler: Pipeline Overview -> Range Diagram.

As you can see on Figure 4, a GPU today is not a simple linear pipeline (A->B->C->…), but rather a network of interconnected units (SM<->TEX<->L2, SM->CROP<->L2, etc). Simple “bottleneck” calculations, which rely on fixed upstream and downstream interfaces for each unit, are not sufficient to reason about GPU performance. Therefore, in doing our analysis, we primarily look at each unit’s SOL% metric to determine units and/or issues limiting performance. The next section will discuss this approach in detail.

5.2. The “Top SOL Units”

In our performance triage methodology, we always start by looking at the top 5 SOL units and their associated SOL% metrics. These are the top 5 hardware units that limit the GPU performance of this workload. The Nsight Range Profiler shows the top 5 SOL% metrics (aka “Top SOLs”) in the Pipeline Overview – Summary section, for instance:

Figure 5. Example of “Top SOLs” in the Range Profiler -> Pipeline Overview within Nsight: Visual Studio Edition 5.5.

Case 1: Top SOL% > 80%

If the top SOL% value is > 80%, then we know that the profiled workload is running very efficiently (close to max throughput) on the GPU, and to speed it up one should try removing work from the top SOL unit, possibly shifting it to another unit. For example, for workloads with the SM as top SOL unit and SOL% > 80%, one can try skipping groups of instructions opportunistically, or consider moving certain computations to lookup tables. Another example, for workloads limited by texture throughput: moving structured-buffer loads to constant-buffer loads in shaders that access structured buffers uniformly (where all threads load data from the same address), since structured buffers are serviced by the TEX unit.
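For that last optimization, here is a hedged HLSL sketch (with hypothetical names) of moving warp-uniform loads off the TEX path:

    // Before: even though lightIndex is the same for all threads, each load of
    // g_lightData goes through the TEX unit.
    StructuredBuffer<float4> g_lightData : register(t0);

    // After: the same data bound as a constant buffer is serviced by the
    // constant-buffer data path instead, removing work from TEX.
    // MAX_LIGHTS is a hypothetical compile-time bound.
    #define MAX_LIGHTS 256
    cbuffer LightData : register(b0)
    {
        float4 g_lightDataCB[MAX_LIGHTS];
    };

    float4 FetchLight(uint lightIndex) // lightIndex is uniform across the warp
    {
        return g_lightDataCB[lightIndex]; // was: g_lightData[lightIndex]
    }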

 

Case 2: Top SOL% < 60%

If the top SOL% value is < 60%, this means the top SOL unit and all other GPU units with lower SOL% are under-utilized (idle cycles), running inefficiently (stall cycles), or not hitting their fast paths due to the specifics of the workload they are given. Examples of these situations include:

  • The app being partly CPU limited (see Section 6.1.1);
  • Lots of Wait For Idle commands or Graphics<->Compute switches draining the GPU pipeline repeatedly (see Section 6.1.2);
  • TEX fetching from a texture object with a format, dimensionality, or filter mode that makes it run at reduced throughput by design (see these synthetic benchmarks for GTX 1080). For instance, a 50% TEX SOL% is expected when sampling a 3D texture with tri-linear filtering;
  • Memory-subsystem inefficiencies, such as poor cache hit rates in the TEX or L2 units, sparse DRAM accesses causing a low DRAM SOL%, and VB/IB/CB/TEX fetches from system memory instead of GPU DRAM;
  • Input Assembly fetching a 32-bit index buffer (half-rate compared to 16-bit indices).

NOTE: In this case, we can use the top SOL% value to derive an upper bound on the maximum gain that can be achieved on this workload by reducing inefficiencies: if a given workload is running at 50% of its SOL, and by assuming that one can increase the SOL% up to 90% by reducing internal inefficiencies, we know the max expected gain on the workload is 90/50 = 1.8x = 80%.

 

Case 3: Top SOL% in [60, 80]

In this case (gray zone), we follow the approaches from both Case 1 (high Top SOL%) and Case 2 (low Top SOL%).

NOTE: The per-unit SOL% metrics are all defined relative to the elapsed GPU cycles (wall clock), which may be different from the active cycles (cycles where that hardware unit is not idle). The main reason we define them relative to elapsed cycles and not per-unit active cycles is to make SOL% metrics comparable, by giving them all a common denominator. Another benefit of defining them relative to elapsed cycles is that any GPU idle cycles that limit the overall GPU performance are reported as a low top SOL% value for that workload (top-level metric in our SOL-guided triage).

 

5.3. Secondary SOL Units and TEX and L2 Hit Rates

The reason the Nsight Range Profiler is reporting the top 5 SOL units and not just the top one is that there may be multiple hardware units that interact with one another and all limit the performance to some extent. So we recommend manually clustering the SOL units based on their SOL% values. (In practice, a 10% delta seems to work reasonably well to define these clusters, but we recommend doing the clustering manually to not miss anything.)

NOTE: We also recommend looking at the TEX (L1) and L2 hit rates, which are displayed in the “Memory” section of the Range Profiler. In general, hit rates greater than 90% are great, between 80% and 90% good, and below 80% poor (may limit performance significantly).

 

This full-screen HBAO+ blur workload with the top 5 SOLs:

             SM:94.5% | TEX:94.5% | L2:37.3% | CROP:35.9% | DRAM:27.7%

… is both SM and TEX limited. And since the SM and TEX SOL% values are identical, we can infer that the SM performance is most likely limited by the throughput of an interface between the SM and TEX units: either SM requests to TEX, or TEX returning data back to the SM.

It has TEX hit rate 88.9% and L2 hit rate 87.3%.

See the study of this workload in the “TEX-Interface Limited Workload” Appendix.

 

Switching away from the HBAO+ example, here are some typical game-engine workloads that we have recently analyzed.

 

This SSR workload with top SOLs:

             SM:49.1% | L2:36.8% | TEX:35.8% | DRAM:33.5% | CROP:0.0%

… has the SM as primary limiter, and the L2, TEX and DRAM units as secondary limiters, with TEX hit rate 54.6% and L2 hit rate 76.4%. This poor TEX hit rate can explain the low SM SOL%: since the TEX hit rate is poor (most likely due to adjacent pixels fetching far-apart texels), the average TEX latency as seen by the SM is higher than usual and more challenging to hide.

NOTE: Here the active units are actually a dependency chain: SM -> TEX -> L2 -> DRAM.

 

This GBuffer-fill workload with top SOLs:

             TEX:54.7% | SM:46.0% | DRAM:37.2% | L2:30.7% | CROP:22.0%

… has the TEX & SM units as primary limiters, and DRAM & L2 as secondary limiters, with TEX hit rate 92.5%, and L2 hit rate 72.7%.

 

This tiled-lighting compute shader with top SOLs:

             SM:70.4% | L2:67.5% | TEX:49.3% | DRAM:42.6% | CROP:0.0%

… has SM & L2 as primary limiters, and TEX & DRAM as secondary limiters, with TEX hit rate 64.3% and L2 hit rate 85.2%.

 

This shadow-map generation workload with top SOLs:

             IA:31.6% | DRAM:19.8% | L2:16.3% | VAF:12.4% | CROP:0.0%

… is IA-limited (Input Assembler) and has a low top SOL%. In this case, changing the index-buffer format from 32 to 16 bits helped a lot. The TEX hit rate was irrelevant because TEX is not in the top 5 SOL units. And the L2 hit rate was 62.6%.

 

Step 6: Understand the Performance Limiters

Having done Step 5, we know what the top SOL units are (GPU unit names and % of max throughput) for each workload of interest, as well as the TEX and L2 hit rates.

We know that the top-SOL GPU units are limiting the performance of the workload being studied, because these are the units that are running the closest to their maximum throughput. We now need to understand what is limiting the performance of these top-SOL units.

 

6.1. If the Top SOL% is Low

As outlined above in Case 2 of Section 5.2, there are multiple possible causes for this. We often refer to these as pathologies, and as with real-life patients, a workload may suffer from multiple pathologies simultaneously. We start by inspecting the values of the following metrics: the “GPU Idle%” and the “SM Unit Active%”.

 

6.1.1. The “GPU Idle%” metric

The “GPU Idle%” metric from the Range Profiler maps to the “gr__idle_pct” PerfWorks metric. It is the percentage of the GPU elapsed cycles during which the whole Graphics and Compute hardware pipeline was idle for the current workload. These “GPU Idle” cycles are the cycles during which the CPU is not feeding commands fast enough to the GPU, and as a result the GPU pipeline is fully empty, with no work to process. Note that pipeline drains caused by Wait For Idle commands are not counted as “GPU Idle”.

If you see this metric greater than 1% for any given GPU workload, then you know that the workload is CPU bound for some reason, and the performance impact of this CPU boundness is at least 1% on the workload. In this case, we recommend measuring the total elapsed CPU time per workload spent in the following CPU calls, and then trying to minimize the most expensive ones:

For DX11:

  • Flush{,1}
  • Map
  • UpdateSubresource{,1}

For DX12:

  • Wait
  • ExecuteCommandLists

For DX11 and DX12:

  • Any Create or Release calls

DX11 Notes:

  • ID3D11DeviceContext::Flush forces a command-buffer kickoff, which may require the Flush() call to stall on the CPU.
  • Calling ID3D11DeviceContext::Map on a STAGING resource can cause a CPU stall due to resource contention, when mapping the same staging resource in consecutive frames. In this case, the Map call in the current frame must wait internally until the previous frame (which is using the same resource) has been processed before returning.
  • Calling ID3D11DeviceContext::Map with D3D11_MAP_WRITE_DISCARD can cause a CPU stall due to the driver running out of versioning space. That is because each time a Map(WRITE_DISCARD) call is performed, our driver returns a new pointer to a fixed-size memory pool. If the driver runs out of versioning space, the Map call stalls.
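A common mitigation for these Map stalls is to round-robin over a small pool of STAGING resources, so the resource being mapped was last written several frames earlier. Here is a hedged sketch with hypothetical names:

    // Cycle through kNumBuffered staging textures so the Map() below reads a
    // copy issued kNumBuffered-1 frames ago, which the GPU should be done with.
    const UINT kNumBuffered = 3; // >= max number of frames in flight
    ID3D11Texture2D* g_staging[kNumBuffered]; // created once with D3D11_USAGE_STAGING
    UINT g_frame = 0;

    void ReadbackPerFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* gpuTex)
    {
        ctx->CopyResource(g_staging[g_frame % kNumBuffered], gpuTex); // write this frame's slot

        UINT readSlot = (g_frame + 1) % kNumBuffered; // the oldest slot
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (SUCCEEDED(ctx->Map(g_staging[readSlot], 0, D3D11_MAP_READ, 0, &mapped)))
        {
            // ... consume mapped.pData ...
            ctx->Unmap(g_staging[readSlot], 0);
        }
        ++g_frame;
    }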

DX12 Notes:

  • Each ExecuteCommandLists (ECL) call has some GPU idle overhead associated with it, for kicking off a new command buffer. So, to reduce GPU idle time, we recommend batching all your command lists into as few ECL calls as possible, unless you really want command-buffer kickoffs to happen at certain points in the frame (for example, to reduce input latency in VR apps with a single frame in flight).
  • When an application calls ID3D12CommandQueue::Wait on a fence, the OS (Windows 10) holds off submitting new command buffers to the GPU for that command queue until the Wait call returns.
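For instance, here is a hedged sketch of such batching (the command-list names are hypothetical):

    // Record several command lists, then submit them with a single
    // ExecuteCommandLists call instead of one call per list, so the GPU sees
    // one command-buffer kickoff instead of three.
    ID3D12CommandList* lists[] = { shadowCmdList, gbufferCmdList, lightingCmdList };
    commandQueue->ExecuteCommandLists(_countof(lists), lists);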

NOTE: The CPU times per API call can be measured using Nsight by capturing a frame and then going to Visual Studio -> Nsight -> Windows -> Events.

6.1.2. The “SM Unit Active%” metric

The “SM Unit Active%” metric reports the percentage of elapsed cycles with at least 1 warp (32 threads) active on each SM instance, averaged across all of the SM instances. Note that warps waiting on memory requests are still counted as active / in flight.

In Nsight: Visual Studio Edition 5.5, this metric is exposed in the Range Profiler results in the Pipeline Overview Summary.

For a full-screen quad or a compute shader workload, the SM should be active for more than 95% of the workload. If not, then most likely there are imbalances between the SMs, with some SMs being idle (no active warp) and others being active (at least one active warp). This can happen when running shaders with non-uniform warp latencies. Also, serialized Dispatch calls with a low number of small thread groups cannot go wide enough to fill all SMs on the GPU.

If, for any geometry-rendering workload, the SM Active% is below 95%, you know it may be possible to get a performance gain on this workload by overlapping asynchronous compute work with it. On DX12 and Vulkan, this can be done by using a compute-only queue. Note that it is also possible to get a speedup from async compute even if the SM Active% is close to 100%, because the SMs may be tracked as active while still being able to take on more active warps.

Another reason why the “SM Unit Active%” may be below 95% for any workload is frequent GPU pipeline drains (Wait For Idle commands, aka WFIs), which can be caused by:

    • Frequent Compute<->Graphics switches in the same DX11 context or in the same DX12 queue.
        • Problem: Switching between Draw and Dispatch calls in the same hardware queue causes a GPU WFI to be executed. Also, performing non-CS state-setting calls in a compute-only workload (e.g. mapping a constant buffer which is bound to both CS and graphics shader stages) can result in WFI being executed.
        • Solution: Batch all Compute work and do NOT interleave graphics and compute API calls, including Map(WRITE_DISCARD) calls on “sticky” resources that are bound on all shader stages.

       

    • On DX11, multiple render calls with the same UAV bound and no “UAV overlap” hints provided to our driver.
        • Problem: By default, subsequent render calls having a bound UAV in common are conservatively separated by GPU WFI commands injected by our driver, to prevent any data hazards.
        • Solution: The following calls can be used to disable the insertion of UAV-related Wait For Idle commands: NvAPI_D3D11_{Begin,End}UAVOverlap or NvAPI_D3D11_BeginUAVOverlapEx (see the sketch after this list).
        • Note that on DX12, OpenGL and Vulkan, the UAV-related Wait For Idle commands are explicitly controlled by the application using API calls (ResourceBarrier, glMemoryBarrier, or vkCmdPipelineBarrier).

       

    • On DX12, ResourceBarrier calls with little work in between the barriers.
        • Problem: Each batch of back-to-back ResourceBarrier calls on a given queue can cause all of the GPU work on that queue to be drained.
        • Solution: To minimize the performance impact of ResourceBarrier calls, it is important to minimize the number of locations in the frame where ResourceBarrier calls are performed.

       

    • Hundreds of state changes in workloads using tessellation shaders and/or geometry shaders (except for GSs that were created as pass-through Fast GSs, using NVAPI for DX11 and DX12, or NV_geometry_shader_passthrough for GL).
        • Problem: Having a lot of state changes with HS and DS active can cause the SMs to be drained, for shader scheduling reasons.
        • Solution: Minimize the number of state changes (including resource bindings) in workloads that have tessellation shaders.
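As referenced above, here is a hedged sketch of the DX11 UAV-overlap hint (assuming NVAPI has been initialized with NvAPI_Initialize; “ctx” is the ID3D11DeviceContext):

    // Tell the driver the calls between the markers have no UAV data hazards,
    // so it can skip the conservative WFIs it would otherwise insert.
    NvAPI_D3D11_BeginUAVOverlap(ctx);
    // ... back-to-back Dispatch/Draw calls that share a bound UAV but write
    //     to disjoint memory ranges ...
    NvAPI_D3D11_EndUAVOverlap(ctx);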

       

6.2. If the Top SOL Unit is the SM

If the SM is the top SOL unit (or close, in terms of SOL%), we then analyze the values of the “SM Issue Utilization” metrics in the Nsight Range Profiler.

The SM is a complex unit with several sub-units, each with an associated SOL% metric. The PerfWorks library allows querying one of these SM sub-SOL% metrics: “sm__issue_active_per_elapsed_cycle_sol_pct”, which is exposed under the name “SM Issue Utilization Per Elapsed Cycle” in the Range Profiler. This metric is the percentage of elapsed cycles during which the SM scheduler issued at least one instruction. If in a given workload we have “SM Issue Utilization Per Elapsed Cycle” == “SM SOL%”, then we know the top SM sub-limiter is the SM instruction scheduler, that is, the workload is SM-issue limited.

The Range Profiler also exposes the “sm__issue_active_per_active_cycle_sol_pct” metric under the name “SM Issue Utilization Per Active Cycle”. The only difference with the previous metric is that the Per Active Cycle one is normalized by the number of SM active cycles (cycles with at least one warp active) as opposed to elapsed cycles (wall clock).

The “SM Issue Utilization Per Active Cycle” metric is useful for determining whether a given workload is partly limited by the “SM Occupancy” (“sm__active_warps_avg_per_active_cycle” PerfWorks metric).

 

Case 1: “SM Issue Utilization Per Active Cycle” > 80%

If the SM is the top SOL unit and “SM Issue Utilization Per Active Cycle” is greater than 80%, then the current workload is mainly limited by the SM scheduler issue rate, and increasing the SM occupancy would therefore not improve performance significantly (typically no more than a 5% gain on the workload).

In this case, the next step of the performance triage process is figuring out what kind of instructions are saturating the bandwidth of the SM scheduler. Typically these are math instructions (FP32 or integer ops), but they can also be memory instructions such as texture fetches or shared-memory accesses. Note that TEX instructions (SRV and UAV accesses) are unlikely to be the limiter for workloads with the SM as top SOL unit and SM Issue Utilization > 80%, unless the TEX unit also has a SOL% value close to that of the SM.

 

Case 2: “SM Issue Utilization Per Active Cycle” < 60%

As described on slide 15 in this GTC 2013 talk (at t=14 min), when a given warp instruction cannot be issued (because its operands are not ready or because the pipeline sub-unit it needs for execution is not ready — we call this a warp stall), then the SM instruction scheduler tries to hide latency by switching to a different active warp. So there are two ways one can help the SM scheduler issue more instructions per SM active cycle:

  1. Increasing the SM occupancy (number of active warps the scheduler can switch to) and
  2. Reducing the SM issue-stall latencies (so warps stay in the stalled state for fewer cycles).

Approach 1: Increasing the SM Occupancy

If the SM is the top SOL unit (or close), “SM SOL%” < 80%, and “SM Issue Utilization Per Active Cycle” < 60%, then increasing the SM occupancy should improve performance.

To increase the SM occupancy, one must first figure out what is limiting it.

The most common SM-occupancy limiter for pixel and compute shaders is the number of hardware registers per thread used by the shader.

The impact of the hardware register count on the maximum theoretical occupancy (number of active warps per active cycle) is available in our CUDA Occupancy Calculator. Here are the theoretical occupancy graphs for “Compute Capability” 6.1, which includes all GeForce GTX 10XX and Quadro Pxxxx GPUs (GP10X):

Figure 6. Graph from the CUDA Occupancy Calculator for “Compute Capability” 6.1.

Other than registers, the following resources can also limit the SM occupancy on Maxwell, Pascal and Volta GPUs:

  • For graphics shaders:
    • The total size of the Vertex Shader output attributes.
    • The total size of the Pixel Shader input attributes.
    • The total sizes of the input & output attributes of the HS, DS or GS.
    • For Pixel Shaders, out-of-order completion of pixel warps (typically due to dynamic control flow such as dynamic loops or early-exit branches). Note that CSs do not have this issue as much, since CS thread groups can complete in arbitrary order. See slide 39 from our GDC 2016 talk on “Practical DirectX 12” by Gareth Thomas & Alex Dunn.
  • For compute shaders:
    • The thread-group size can directly impact the SM occupancy, since warps of a thread group launch on the SM in an all-or-none fashion (i.e. either all warps of a thread group have the necessary resources available and launch together, or none do). The larger the thread-group size, the coarser the quantization of resources like shared memory and the register file. While some algorithms may genuinely require large thread groups, in all other cases developers should try to restrict thread-group sizes to 64 or 32 threads as much as possible, because thread groups of 64 or 32 threads give our shader compiler the most flexibility in picking the best possible register target for a shader program (see the sketch after this list).
    • Furthermore, for shaders with high register usage (>= 64) and with high thread-group barrier stall times in the SM (GroupMemoryBarrierWithGroupSync() in HLSL), lowering the thread-group size to 32 may produce a speedup compared to 64. The >= 64 register guidance ensures that the 32 max thread-groups per SM limit (architecture-dependent “Thread Blocks / Multiprocessor” limit in the CUDA Occupancy Calculator) does not become the primary SM occupancy limiter.
    • The total number of shared-memory bytes allocated per thread group can also directly impact the SM occupancy, as can be seen by plugging in various numbers into the CUDA Occupancy Calculator.
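Here is the thread-group regrouping sketch referenced above (a hypothetical 1D compute shader): halving the group size while doubling the group count keeps the total thread count and every SV_DispatchThreadID unchanged, but gives the compiler finer-grained occupancy quantization.

    // Before: [numthreads(128, 1, 1)] with ctx->Dispatch(groupCount, 1, 1).
    // After: 64-thread groups, twice as many of them; per-thread work unchanged.
    [numthreads(64, 1, 1)]
    void CSMain(uint3 tid : SV_DispatchThreadID)
    {
        DoWorkForItem(tid.x); // hypothetical per-item function
    }
    // C++ side: ctx->Dispatch(2 * groupCount, 1, 1);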

Our CUDA Nsight documentation page on Achieved Occupancy includes a couple of additional possible SM occupancy limiters for Compute Shaders:

  • Unbalanced workload within thread groups.
  • Too few thread groups launched. This can also be a problem for graphics shaders that do not launch enough warps to fully occupy the SMs in-between GPU Wait For Idles.

NOTE: There are actually two approaches to reducing thread-group sizes:

  • Approach 1: Lowering the thread-group size by a factor of N and simultaneously increasing grid launch dimensions by N. See above.
  • Approach 2: Merging the work of N>=2 threads into a single thread. This allows sharing common data between the N merged threads via registers, or performing reductions in registers instead of in shared memory with atomic ops (e.g. InterlockedMin in HLSL). This approach also has the advantage of automatically amortizing thread-group-uniform operations across the N merged threads. However, one should be wary of register bloat from this approach.
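As a minimal sketch of Approach 2 (hypothetical names, assuming a simple min-reduction), merging two threads into one moves part of the reduction into registers and halves the number of shared-memory atomics:

    groupshared uint g_groupMin;

    [numthreads(32, 1, 1)] // was 64 threads handling one item each
    void CSMain(uint3 tid : SV_DispatchThreadID, uint gtid : SV_GroupIndex)
    {
        if (gtid == 0) g_groupMin = 0xFFFFFFFF;
        GroupMemoryBarrierWithGroupSync();

        uint a = LoadItem(2 * tid.x + 0); // LoadItem is hypothetical
        uint b = LoadItem(2 * tid.x + 1);
        InterlockedMin(g_groupMin, min(a, b)); // min(a, b) happens in registers
        // ... rest of the reduction ...
    }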

NOTE: If you want to get a sense of whether your SM occupancy for a given workload is mainly register-count limited for some full-screen Pixel Shader or some Compute Shader, you can do the following:

  • Do Add… “Program Ranges” in the Scrubber and find the shader-program range you are interested in studying. Right-click on your Program Range, launch the Range Profiler and check the “SM Occupancy” value in the Pipeline Overview Summary.
  • Select some render call in your Range by clicking on the “Time (ms)” line in the Scrubber. Switch tab to the API Inspector, select the shader stage that takes most of the cycles in the workload (PS or CS), and click on the “Stats” link next to the shader name.
  • This opens up the “Shaders” Nsight window (see screenshot below) which shows the hardware register count for the shader, in the “Regs” column. Note that you may need to wait for a few seconds for the shader stats to get populated.
  • Look up the Max Theoretical Occupancy associated with this register count by using the CUDA Occupancy Calculator graph from Figure 6, and compare it to the actual “SM Occupancy” reported by the Range Profiler for this shader.
  • If the achieved occupancy is much lower than the max occupancy, then you know the SM occupancy is limited by something else than just the amount of registers per thread.

Figure 7. Nsight’s “Shaders View”, showing hardware register counts per shader in the “Regs” column.

To reduce the total number of registers allocated for a given shader, one can look at the DX shader assembly and study the number of registers used in each branch of the shader. The hardware needs to allocate registers for the most register-hungry branch, and the warps that skip that branch run with a sub-optimal SM occupancy.

For full-screen passes (pixel or compute shaders), a typical way to address this problem is to run a pre-pass that classifies the pixels into different regions and run different shader permutations for each region:

  • For compute shaders, this SIGGRAPH 2016 presentation describes a solution using DispatchIndirect calls to apply specialized compute shaders to different tiles on screen, with a variable number of thread blocks per shader permutation: “Deferred Lighting in Uncharted 4” – Ramy El Garawany (Naughty Dog).
  • For pixel shaders, a different specialization approach can be used: a full-screen stencil buffer can be filled up in a pre-pass to classify the pixels. Then, multiple draw calls can be performed efficiently, by relying on the stencil test to happen before the pixel-shader execution (which should be done automatically by our driver), and using stencil tests that discard the pixels that the current shader permutation does not touch. This GDC 2013 presentation uses this approach to optimize MSAA deferred rendering: “The Rendering Technologies of Crysis 3” – Tiago Sousa, Carsten Wenzel, Chris Raine (Crytek).
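Here is a hedged C++ sketch of the stencil-based specialization (all names hypothetical): after the pre-pass has written a class ID per pixel into the stencil buffer, one full-screen draw per shader permutation uses an EQUAL stencil test so each pixel shader only runs on its own class of pixels.

    // Depth is disabled (zero-initialized desc); only the stencil test is used.
    D3D11_DEPTH_STENCIL_DESC dsDesc = {};
    dsDesc.StencilEnable = TRUE;
    dsDesc.StencilReadMask = 0xFF;
    dsDesc.FrontFace.StencilFunc = D3D11_COMPARISON_EQUAL; // pass only this class
    dsDesc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
    dsDesc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
    dsDesc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_KEEP;
    dsDesc.BackFace = dsDesc.FrontFace;
    ID3D11DepthStencilState* equalStencil = nullptr;
    device->CreateDepthStencilState(&dsDesc, &equalStencil);

    for (UINT classId = 0; classId < kNumPermutations; ++classId)
    {
        ctx->OMSetDepthStencilState(equalStencil, classId); // StencilRef = class ID
        ctx->PSSetShader(psPermutation[classId], nullptr, 0);
        DrawFullScreenQuad(ctx); // stencil-culled before the pixel shader runs
    }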

Finally, to better understand the SM occupancy limitedness of a given compute shader, you can use our CUDA Occupancy Calculator spreadsheet. To use it, just fill in the CUDA Compute Capability for your GPU, as well as the resource usage for your shader (thread-group size, register count from the Shader View, and shared-memory size in bytes).

 

Approach 2: Reducing the SM issue-stall latencies

There is another way to increase the SM issue utilization% other than increasing the SM occupancy: by reducing the number of SM issue-stall cycles. These are the SM active cycles between instruction issue cycles during which a warp instruction is stalled, due to one of its operands not being ready or due to resource contention on the datapath that this instruction needs to be executed on.

If all your shader is doing is math operations and texture fetches and the SM issue utilization% per active cycle is low, then it is reasonable to assume that most of the stall cycles are coming from the dependencies with the texture fetch results, which we often refer to as the “texture latency”. In this case:

  1. If the shader contains a loop for which the number of iterations can be known at shader compilation time (possibly by using different shader permutations per loop count), then try forcing FXC to fully unroll the loop by using the [unroll] loop attribute in the HLSL.
  2. If your shader is doing a dynamic loop that cannot be fully unrolled (e.g. a ray-marching loop), try batching the texture-fetch instructions to reduce the number of TEX-dependency stalls (by grouping independent texture fetches in batches of 2 to 4 back-to-back instructions at the HLSL level).
    See the “Optimizing Ray-Marching Loops” Appendix.
  3. If your shader iterates through all of the MSAA sub-samples per pixel for a given texture, fetch all of the sub-samples together, in a single batch of TEX instructions for that texture (see the sketch after this list). Since the MSAA sub-samples are stored next to each other in DRAM, fetching them together maximizes the TEX hit rate.
  4. If a texture load is guarded by a conditional that is expected to be true most of the time (e.g. if (idx < maxidx) loadData(idx)), consider executing the load unconditionally and clamping the coordinate (loadData(min(idx, maxidx-1))).
  5. Try reducing the TEX latency by improving the TEX and L2 cache hit rates. The TEX & L2 hit rates can be improved by tweaking sampling patterns to make adjacent pixels/threads fetch more adjacent texels, by using mipmaps if applicable, as well as by reducing texture dimensions & using more compact texture formats.
  6. Try reducing the number of executed TEX instructions (possibly using branches per texture instruction, which get compiled as TEX instruction predicates, see the FXAA 3.11 HLSL for an example, e.g.: “if(!doneN) lumaEndN = FxaaLuma(…);”).
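For item 3, here is a hedged HLSL sketch (hypothetical names, assuming 4xMSAA):

    Texture2DMS<float4, 4> g_msaaColor : register(t0);

    float4 ResolvePixel(int2 pixel)
    {
        // The four loads are independent, so they can be issued back-to-back,
        // and the sub-samples (stored adjacently in DRAM) hit in the TEX cache.
        float4 s0 = g_msaaColor.Load(pixel, 0);
        float4 s1 = g_msaaColor.Load(pixel, 1);
        float4 s2 = g_msaaColor.Load(pixel, 2);
        float4 s3 = g_msaaColor.Load(pixel, 3);
        return 0.25 * (s0 + s1 + s2 + s3);
    }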

 

Case 3: Issue Utilization % in [60,80]

In this case, we follow the approaches from both Case 1 (high issue utilization) and Case 2 (low issue utilization).

 

6.3. If the Top SOL unit is not the SM

6.3.1. If the Top SOL unit is TEX, L2, or DRAM

If the top SOL unit is not the SM but is one of the memory-subsystem units (TEX-L1, L2, and DRAM), then it’s possible that the root cause of the poor performance is TEX or L2 cache thrashing caused by a non-GPU-friendly access pattern (typically, with adjacent threads in a warp accessing far-apart memory). In this case, the top limiting unit may be TEX or L2, but the root cause may be in the shaders executed by the SM, so it’s worth triaging the SM performance using the method from Section 6.2 (if the Top SOL unit is the SM).

If the top SOL unit is the DRAM and its SOL% value is not poor (> 60%), then this workload is DRAM-throughput limited, and merging it with another pass should speed up the frame. A typical example is merging a gamma-correction pass with another post-processing pass.

 

6.3.2. If the Top SOL unit is CROP or ZROP

If CROP is the Top SOL unit, you can try using a smaller render-target format (e.g. R11G11B10F instead of RGBA16F), and if you are using Multiple Render Targets, you can try reducing the number of render targets. Also, killing pixels in the pixel shader more aggressively may be worth it (for instance, for certain transparency effects, discarding pixels that have less than 1% opacity). See this blog post for more strategies for optimizing transparency rendering: “Transparency (or Translucency) Rendering”.

If ZROP is the Top SOL unit, you can try using a smaller depth format (e.g. D16 instead of D24X8 for shadow maps, or D24S8 instead of D32S8), as well as drawing opaque objects more in front-to-back order so that ZCULL (coarse-granularity depth test) has a chance to discard more pixels before ZROP & the pixel shader get invoked.

 

6.3.3. If the Top SOL unit is IA

As mentioned earlier, IA does index-buffer and vertex-buffer loads, to gather the input vertices of the vertex shader. If IA is the Top SOL unit, you can try using 16-bit index buffers instead of 32-bit ones. If that does not increase the IA SOL%, you can then try optimizing geometry for vertex reuse and locality. Also, separating out the position stream from the rest of the attributes may be beneficial for z-only or shadow map rendering.
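For the index-buffer change, here is a hedged DX11 sketch (hypothetical names; the index data itself must be converted to 16-bit values at asset-build time):

    // Bind 16-bit indices so IA fetches them at full rate.
    ctx->IASetIndexBuffer(indexBuffer16, DXGI_FORMAT_R16_UINT, 0); // was DXGI_FORMAT_R32_UINT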

Summary

To wrap up, our SOL-guided performance triage method does the following, for any given GPU workload:

  • Check the “Top SOL%” value (Sections 5.2 and 5.3).
    • If > 80% => (A) try removing work from the top SOL unit (Case 1 of Section 5.2).
    • If < 60% => (B) try increasing the top SOL% (Case 2 of Section 5.2, and Section 6.1).
    • Else do both (A) and (B).
  • If the SM is the Top SOL unit (or close, in terms of SOL%) and “SM SOL%” < 80%:
    • Check the “SM Issue Utilization Per Active Cycle” value.
      • If > 80% => (C) try skipping groups of instructions opportunistically or consider moving certain computations to lookup tables (Case 1 of Section 6.2).
      • If < 60% => (D) try increasing the SM occupancy (number of active warps in flight) and reducing the number of SM issue-stall cycles (Case 2 of Section 6.2).
      • Else do both (C) and (D).
  • If some other unit is the Top SOL unit, try reducing the amount of work sent to this unit. (see Section 6.3).

To use this method, all you need is Nsight: Visual Studio Edition 5.5 and the latest available graphics driver installed on your PC.

 

Appendix: Performance Triage Examples

Example 1: TEX-Interface Limited Workload

The HBAO+ “DrawBlurXPS” full-screen pixel-shader workload, running on a GTX 1060 6GB @ 1506 MHz, has the following metrics in the Nsight Range Profiler:

  • Top SOLs: SM:94.5% | TEX:94.5% | L2:37.2% | DRAM:27.6%
  • GPU Idle: 0.0% => not CPU limited (see Section 6.1.1)
  • SM Unit Active: 99.5% => no SM idleness issue (see Section 6.1.2)
  • SM Issue Utilization Per Elapsed Cycle: 47.2%
  • SM Issue Utilization Per Active Cycle: 47.5%
  • SM Occupancy: 62.2 active warps per active cycle
  • TEX hit rate: 88.9%
  • L2 hit rate: 87.3%

Analysis:

  • Because the SM SOL% and TEX SOL% are both so close (equal actually), we know that the workload is limited by the throughput of an interface between the SM and TEX units.
  • Because the SM and TEX SOL%s are so high (94.5%), we know that the workload is completely limited by the throughputs of the SM and TEX units (95% is as high as SOL% values get in practice).
  • Because the SM Issue Utilization Per Active Cycle is much lower than 80%, we know that the bandwidth of the instruction scheduler is far from being saturated. So increasing the SM occupancy (number of active warps per active cycle) could help in theory. But because the SM SOL% (94.5%) is so high, we know that trying to increase occupancy would not increase performance significantly.
  • The TEX and L2 hit rates are good (close to 90%).

Conclusion: This workload is primarily limited by the bandwidth of an SM-TEX interface and the only way to speed it up significantly is to reduce the number of executed TEX instructions (e.g. by using Gather instructions to do 4 single-channel loads in one TEX instruction, or by calculating multiple output values per thread and sharing texture samples via registers).

Example 2: Math-Limited workload

The HBAO+ “DrawCoarseAOPS” workload has the following metrics in the Range Profiler on a GTX 1060 @ 1506 MHz:

  • Top SOLs: SM:93.4% | TEX:71.9% | L2:50.1% | DRAM:27.3% | CROP:3.6%
  • GPU Idle: 0.0% => not CPU limited (see section 6.1.1)
  • SM Unit Active: 99.5% => no SM idleness issue (see section 6.1.2)
  • SM Issue Utilization Per Elapsed Cycle: 93.4%
  • SM Issue Utilization Per Active Cycle: 95.9%

Top SOL analysis:

  • The top SOL unit is the SM, which is running at 93.4% of its maximum throughput, and the secondary SOL units (TEX, L2 and DRAM) are more than 20% behind the SM SOL, so we know that the workload is mainly SM-throughput limited.
  • The second top limiter is the TEX unit, running at 71.9% of its maximum throughput.

Now, let’s consider the following thought experiment: if the TEX unit in this GPU were somehow made infinitely fast, the workload would still be limited by the work happening inside the SM, at 93% of the SM SOL. Some of that work is typically texture fetches from shaders, so if the TEX unit were infinitely fast and the SM were initially limited by TEX latency, the SM SOL% might increase. But because 95% is as high as SOL% values get in practice, there is no way the SM SOL% can increase significantly by making any other unit faster (or sending it less work). Therefore, the only way to speed up this workload significantly is to figure out what the internal performance limiters within the SM are.

Additional analysis:

  1. Because the SM is the top limiter and “SM Issue Utilization Per Elapsed Cycle” == “SM SOL%”, we know that the max SM sub-SOL metric of this workload is the SM Issue Utilization%, that is, the workload is mainly limited by the issue rate of the instructions.
  2. Because “SM Issue Utilization Per Active Cycle” > 80%, we know that the workload is not significantly limited by the SM occupancy, that is, increasing the number of active warps per active cycle would not significantly improve performance.
  3. We can infer that the performance is not limited by the latency of the texture-fetch instructions, otherwise we would have “SM Issue Utilization Per Active Cycle” much lower than 80%.
  4. We can infer that the performance is not primarily limited by the TEX unit, otherwise we would have the TEX SOL% value being much closer to the SM SOL% value.

Conclusion: This workload is primarily limited by the number of math instructions (FP32 ops and/or integer ops and/or other math ops such as rsqrt, pow, cos/sin, etc.) in the shader. To speed it up, one needs to find a way to shade fewer pixels or execute fewer math instructions per pixel.

Example 3: TEX-Latency Limited workload

Now, let’s take a look at the “Advanced Motion Blur” DX11 SDK sample from https://github.com/NVIDIAGameWorks/D3DSamples

By default, the app is running in 720p windowed mode. To make it start in 4K full-screen mode, I’ve made the following edit in main.cpp:

#if 0
    deviceParams.startFullscreen = false;
    deviceParams.backBufferWidth = 1280;
    deviceParams.backBufferHeight = 720;
#else
    deviceParams.startFullscreen = true;
    deviceParams.backBufferWidth = 3840;
    deviceParams.backBufferHeight = 2160;
#endif

Next, we can manually insert a workload marker in main.cpp, which Nsight will intercept and turn into a “Perf Marker” Range in the Range Profiler:

if(g_view_mode == VIEW_MODE_FINAL)
{
    D3DPERF_BeginEvent(0x0, L"Final Pass");
    //…
    ctx->Draw(6, 0);
    D3DPERF_EndEvent();
}

And finally, we can profile this workload using the Nsight Range Profiler by following the steps from Step 4. On a GTX 1060 @ 1506 MHz, the Range Profiler reports the following metrics:

  • Top SOLs: SM:40.7% | TEX:39.8% | L2:36.3% | CROP:26.4% | DRAM:25.7%
  • GPU Elapsed Time: 0.70 ms
  • GPU Idle: 0.2% => not CPU limited (see section 6.1.1)
  • SM Unit Active: 97.4% => no SM idleness issue (see section 6.1.2)
  • SM Issue Utilization Per Elapsed Cycle: 40.7%
  • SM Issue Utilization Per Active Cycle: 41.8%
  • SM Occupancy: 37.0 active warps per active cycle
  • TEX hit rate: 80.1%
  • L2 hit rate: 83.0%

Analysis:

  • The top SOL% units are SM and TEX.
  • The top SOL% value is below 60%, so this workload is running inefficiently on the GPU.
  • The workload is TEX-latency limited because “SM Issue Utilization Per Active Cycle” is far below 80% and the TEX SOL% is so close to the SM SOL%.
  • The workload is also SM-occupancy limited, since the top SOL unit is the SM, “SM SOL%” < 80% and “SM Issue Utilization Per Active Cycle” < 60% (see Case 2 of Section 6.2).
  • The TEX and L2 hit rates are good.

Experiment: Removing the Early Exit

From Case 2 of Section 6.2, we know that for workloads with the SM as top SOL unit, “SM SOL%” < 80% and “SM Issue Utilization Per Active Cycle” < 60%, the performance may be limited by:

  • High issue-stall latencies (due to not having enough Instruction Level Parallelism to hide the instruction latencies), and/or
  • Low SM occupancy (due to not having enough active warps to hide the issue stalls).

Let us figure out the main cause of the low SM occupancy first. We know that this pixel shader has an early-exit branch, and by outputting a debug color when the branch is taken, we can see that most of the pixels take the early exit. We also know from Approach 1 in Section 6.2 that out-of-order pixel-shader warp completion (compared to launch order) can limit the SM occupancy.

To verify that the early-exit branch is actually limiting the occupancy, let us do the experiment of simply removing it from the pixel shader (ps_gather.hlsl):

#if 0
    //If the velocities are too short, we simply show the color texel and exit
    if(TempNX<HALF_VELOCITY_CUTOFF)
    {
        return CX;
    }
#endif

 

Metrics                                   New value   Old value   New / Old
GPU Elapsed Time                          5.05 ms     0.70 ms     7.21x
Top SOL[0]                                SM:47.2%    SM:40.7%    1.16x
Top SOL[1]                                TEX:39.2%   TEX:39.8%   0.98x
SM Unit Active                            99.9%       97.4%       1.03x
SM Issue Utilization Per Elapsed Cycle    47.2%       40.7%       1.16x
SM Issue Utilization Per Active Cycle     47.2%       41.8%       1.13x
SM Occupancy                              62.0        37.0        1.68x
TEX hit rate                              85.8%       80.1%       1.07x
L2 hit rate                               95.2%       83.0%       1.15x

Table 1. New metrics with the early-exit branch removed.

We see the new workload is still SM-occupancy limited, but the occupancy (62.0) is reaching the hardware limit (64.0) — and even with this maxed-out occupancy, we see the workload is still TEX-latency limited with top SOL units SM and TEX.

Obviously, the workload has become 7x slower due to the removal of the early exit, but that is OK, because we want to study its performance limiters this way. This is actually an approach we often use to analyze workloads that we suspect have multiple performance issues: if the initial problem is too complex and the root cause of the low SOL%s is not clear, we simplify the problem (in this case by removing the suspected root cause of the low occupancy) and redo the analysis on the simpler problem, until the top SOL%s become good. This is effectively a divide-and-conquer performance-debugging approach. To use a medical analogy, we are eliminating one pathology in order to make the analysis of the other pathologies clearer and free of cross-talk.

Note: There is a risk that performance optimizations that work for the simplified problem may not speed up the original problem. But even if that happens, the analysis of the simpler problem is still very much worth having in most cases: not only does it help us verify that our understanding of the performance issues is correct, but it can also inform re-architecting a rendering algorithm to avoid GPU performance inefficiencies.

Conclusion:

The main reason for the poor SM occupancy in this workload is that the pixel shader has an early exit. Moving this pixel shader to a compute shader can alleviate the problem (warps that complete sooner than others within a given thread group would still limit the SM occupancy), but would also add some overhead for writing out the results via UAV write instructions.

Alternatively, using the stencil-masking approach described in Approach 1 of Section 6.2 should also help in this case, as it would make the full-screen pixel shader process only the complex pixels, and all of these pixels would complete in launch order.

 

Optimization 1: Using R11G11B10F instead of RGBA16F

The original SDK sample app uses the RGBA16F texture format to store HDR colors, and these are the colors that are filtered by the Motion Blur “Final Pass”.

Because the alpha channel of this RGBA16F texture is never actually used by this application, we can change it to the more compact R11G11B10F format, which produces output images that look the same in this case. Implementing this optimization is just a format change in main.cpp:

        CreateTextureWithViews(
            device, surface_desc->Width, surface_desc->Height,
#if 0
            DXGI_FORMAT_R16G16B16A16_FLOAT,
            DXGI_FORMAT_R16G16B16A16_FLOAT,
            DXGI_FORMAT_R16G16B16A16_FLOAT,
#else
            DXGI_FORMAT_R11G11B10_FLOAT,
            DXGI_FORMAT_R11G11B10_FLOAT,
            DXGI_FORMAT_R11G11B10_FLOAT,
#endif

Doing that helps reduce the TEX latency of the color fetches by increasing the TEX hit rate:

Metrics                                   New value   Old value   New / Old
GPU Elapsed Time                          4.25 ms     5.05 ms     19% gain
Top SOL[0]                                SM:63.7%    SM:47.2%    1.35x
Top SOL[1]                                TEX:52.9%   TEX:39.2%   1.35x
SM Unit Active                            99.9%       99.9%       1.00x
SM Issue Utilization Per Elapsed Cycle    63.7%       47.2%       1.35x
SM Issue Utilization Per Active Cycle     63.8%       47.2%       1.35x
SM Occupancy                              62.8        62.0        1.01x
TEX hit rate                              94.2%       85.8%       1.10x
L2 hit rate                               91.5%       95.2%       0.96x

Table 2. New metrics with the color format reduced from RGBA16F to R11G11B10F.

 

Optimization 2: Loop Unrolling

Now, let’s look at the HLSL for this full-screen pixel shader. It contains the following loop, with three texture fetches per iteration and a dynamic branch that skips the loop body for one particular iteration index:

    for (int i = 0; i < c_S; ++i)
    {
        // Skip the same fragment
        if (i == SelfIndex) { continue; }
        //…
        float2 VY = readBiasScale(texVelocity.SampleLevel(sampPointClamp, Y, 0).xy);
        //…
        float ZY = getDepth(Y);
        //…
        Weight += alphaY;
        Sum += (alphaY * texColor.SampleLevel(sampLinearClamp, Y, 0).xyz);
    }

Running this HLSL through FXC with the following command line:

    fxc /T ps_5_0 /Ges /O3 ps_gather.hlsl

… produces the following control flow & texture fetches in the DXASM:

loop
  itof r5.w, r4.w
  ge r6.x, r5.w, cb0[21].z
  breakc_nz r6.x
  ieq r6.x, r3.x, r4.w
  if_nz r6.x
    mov r4.w, r3.w
    continue
  endif
  ...
  sample_l_indexable(texture2d)(float,float,float,float) r6.zw, r6.xyxx, t2.zwxy, s1, l(0.000000)
  ...
  sample_l_indexable(texture2d)(float,float,float,float) r6.w, r6.xyxx, t1.yzwx, s1, l(0.000000)
  ...
  sample_l_indexable(texture2d)(float,float,float,float) r6.xyz, r6.xyxx, t0.xyzw, s2, l(0.000000)
  mad r5.xyz, r5.wwww, r6.xyzx, r5.xyzx
  iadd r4.w, r4.w, l(1)
endloop

We know this workload is partly TEX-latency limited: there are not enough independent instructions between each texture fetch and the math instructions that depend on it. To let our shader compiler better schedule the texture instructions, this loop needs to be unrolled; the compiler then knows that all of the texture fetches will be executed and can schedule them freely (possibly batching multiple fetches together) to best cover their latency with independent math instructions.

In this case, the number of loop iterations (“c_S”) must be known at FXC compilation time, which is not the case in the original shader, where c_S is a constant-buffer value.

The c_S constant is stored in a constant buffer, but because its value does not change from frame to frame, it is possible to generate permutations of this pixel shader for different c_S values, or to just hard-code the value with a #define, like this:

#if 0
    float  c_S;
#else
    #define c_S 15
    float  c_S_unused;
#endif

Now that c_S is known at FXC compilation time, the for() loop can then be fully unrolled using the [unroll] HLSL keyword, like this:

    [unroll] for (int i = 0; i < c_S; ++i)

Metrics                                   New value   Old value   New / Old
GPU Elapsed Time                          3.44 ms     4.25 ms     24% gain
Top SOL[0]                                SM:81.2%    SM:63.7%    1.27x
Top SOL[1]                                TEX:81.2%   TEX:52.9%   1.53x
SM Unit Active                            99.9%       99.9%       1.00x
SM Issue Utilization Per Elapsed Cycle    74.4%       63.7%       1.17x
SM Issue Utilization Per Active Cycle     74.5%       63.8%       1.17x
SM Occupancy                              45.2        62.8        0.72x
TEX hit rate                              96.2%       94.2%       1.02x
L2 hit rate                               85.6%       91.5%       0.94x

Table 3. New metrics with the unrolled loop.

The SM occupancy has gone down from 62.8 to 45.2 active warps per active cycle due to increased register pressure: fully unrolling the loop has resulted in more TEX instructions being executed concurrently, which consumes more registers for storing the TEX results. Still, the “SM Issue Utilization Per Active Cycle” metric is 74.5%, which is close to 80%, so having a few more active warps in flight may improve the performance a bit, but by no more than 10% on the workload.

Overall, the two optimizations combined have produced a 47% performance gain on the workload (5.05 to 3.44 ms). And for the record, with the early-exit branch added back, these optimizations produce a 21% gain on the workload (0.75 to 0.62 ms).

Appendix: Optimizing Ray-Marching Loops

As mentioned in Approach 2 of Section 6.2 (“Reducing the SM issue-stall latencies”), dynamic loops containing TEX instructions can be limited by the TEX latency when there are no independent instructions available to schedule from the same warp (no instruction-level parallelism) and the number of active warps on the SM is insufficient to fully hide that latency.

This happens commonly in ray-marching dynamic loops, which are typically used in Screen-Space Reflections (SSR) shaders, as well as ray-marched volumetric lighting.

A typical SSR ray-marching loop with early exit looks like this in HLSL:

    float MinHitT = 1.0;
    float RayT = Jitter * Step + Step;

    [loop] for ( int i = 0; i < NumSteps; i++ )
    {
        float3 RayUVZ = RayStartUVZ + RaySpanUVZ * RayT;
        float SampleDepth = Texture.SampleLevel( Sampler, RayUVZ.xy, GetMipLevel(i) ).r;

        float HitT = GetRayHitT(RayT, RayUVZ, SampleDepth, Tolerance);
        [branch] if (HitT < 1.0)
        {
            MinHitT = HitT;
            break;
        }

        RayT += Step;
    }

 

By partially unrolling the loop by a factor of 2 and placing the texture-fetch instructions back-to-back in the HLSL, batches of independent texture instructions can be executed together, and the latency of the second fetch can be partly hidden by the latency of the first one.

The resulting HLSL then looks like this, assuming that NumSteps is a multiple of 2:

    float MinHitT = 1.0;
    float RayT = Jitter * Step + Step;


    [loop] for ( int i = 0; i < NumSteps; i += 2 )
    {
        float RayT_0 = RayT;
        float RayT_1 = RayT + Step;

        float3 RayUVZ_0 = RayStartUVZ + RaySpanUVZ * RayT_0;
        float3 RayUVZ_1 = RayStartUVZ + RaySpanUVZ * RayT_1;
        
        // batch texture instructions to better hide their latencies        
        float SampleDepth_0 = Texture.SampleLevel( Sampler, RayUVZ_0.xy, GetMipLevel(i+0) ).r;
        float SampleDepth_1 = Texture.SampleLevel( Sampler, RayUVZ_1.xy, GetMipLevel(i+1) ).r;

        float HitT_0 = GetRayHitT(RayT_0, RayUVZ_0, SampleDepth_0, Tolerance);
        float HitT_1 = GetRayHitT(RayT_1, RayUVZ_1, SampleDepth_1, Tolerance);
        [branch] if (HitT_0 < 1.0 || HitT_1 < 1.0)
        {
            MinHitT = min(HitT_0, HitT_1);
            break;
        }

        RayT += Step * 2.0;
    }

By implementing the above optimization in an SSR test app in 4K on a GTX 1080 @ 1607 MHz, the elapsed GPU time of the SSR workload went from 0.54 ms to 0.42 ms (a 29% speedup on the workload). Going further and batching 4 texture fetches instead of 2 brought the GPU time down to 0.33 ms (a 64% speedup on the workload: 0.54 -> 0.33 ms / frame).

Acknowledgements

This blog post would not have been possible without the help of Marc Blackstein, Ram Rangan, and Zhen Yang who taught me and the NVIDIA DevTech group the SOL-guided performance triage method presented in this blog.

I would like to thank the following NVIDIA employees for their valuable expertise and feedback: Avinash Baliga, Alexey Barkovoy, Iain Cantlay, Jon Jansen, Alfred Junklewitz, Jeff Kiel, Justin Kim, Christoph Kubisch, Patrick Neill, Suryakant Patidar, Aurelio Reis, Mathias Schott, Greg Smith, John Spitzer, Nick Stam, Fabian Thuering, Yury Uralsky and Dmitry Zhdan.
