New: Vulkan Device Generated Commands

This post was updated substantially in March 2020. The original post was about the experimental NVX extension, while the current post is about the production NV version.

We are releasing the VK_NV_device_generated_commands (DGC) Vulkan extension, which allows GPU generation of the most frequent rendering commands. This extension offers an improved design over the experimental NVX extension.

Vulkan now has functionality like the DirectX12 ExecuteIndirect function. However, we added the ability to change shaders (ShaderGroups) on the GPU as well. This marks the first time that an open graphics API has provided functionality to create compact command buffer streams on the device, avoiding the worst-case state setup of previous indirect drawing methods.

Motivation

With general advances in programmable shading, the GPU can take on an ever-increasing set of responsibilities for rendering, by computing supplemental data and allowing a greater variety of rendering algorithms to be implemented. However, when it comes to setting up state for draw calls, the decisions must primarily be made on the CPU. Therefore, explicit synchronization or working from past frame’s results was necessary. Device-generated commands remove this readback latency and overcome existing inefficiencies.

Another trend is the development towards global scene representations on the GPU. For example, in ray tracing (VK_NV/KHR_ray_tracing), the GPU stores geometry and instance information for the scene. Recent API developments (VK_VERSION_1_2, or VK_EXT_descriptor_indexing and VK_EXT_buffer_device_address) allow the GPU to use bindless resources. This extension supplements such representations and allows you to determine the subset of objects to be rasterized on the device by modifying buffer contents before the command generation.

This diagram shows how ray tracing and rasterization have similar pipeline setups in the context of this extension. Both use resources, which can be buffers or images. Both ray tracing and graphics pipelines have some notion of shader groups - whether used in the shader groups described in this extension, or in the Shader Binding Tables of ray tracing. Similarly, both ray tracing and graphics pipelines have some concept of an instance buffer, whether used in indirect bind/draw/push commands in the case of graphics pipelines, or the top-level acceleration structure in the case of ray tracing pipelines. They also both use vertex and index buffers; for graphics pipelines, these can be referenced directly, while ray tracing requires them to be built into bottom-level acceleration structures. — *Figure 1. Unified Scene Representation*

In Figure 1, the center represents the scene resources and bindings stored in buffers. With this extension, both ray tracing and rasterization have similar setups to provide geometry and shading information in buffers and pipelines. In both cases, the decision of what work to perform can be done on the GPU: what rays to shoot, or which objects to rasterize and which shading to use.

Some usage scenarios that may benefit from the functionality of this extension are:

Occlusion culling: Use custom shader-based culling techniques to obtain greater accuracy with easy access to the latest depth-buffer information.
Object sorting: In forward shading-limited scenarios, sort the objects front to back to achieve greater performance.
Level of detail: Influence the shaders or geometry used for an object depending on its screen-space footprint.
Work distribution: Bucket arbitrary workloads for improved coherency in their resource usage, for example, by using more specialized shaders that have compile-time optimizations applied.

Evolution of GPU Work Creation

This diagram shows how GPU work creation has evolved over time, captioned "Ongoing evolution to source command data from GPU buffers". Left: Draw Indirect: Typically change the number of primitives and the number of instances. Middle left: Multi Draw Indirect: Multiple draw calls with different index or vertex offsets. Middle right: GL_NV_command_list & DX12 ExecuteIndirect: Change shader input bindings for each draw. Right: VK_NV_device_generated_commands: Change shaders per draw. — *Figure 2. Device-Generated Commands*

Over the years, GPU work creation has evolved. First, Draw Indirect provided the ability to source basic parameters for a draw call (number of primitives, instances, etc.) from GPU buffers that can be written by shaders.

Next came the popular Multi Draw Indirect feature that is practical for both CPU time reasons, including approaching zero driver overhead (AZDO) as well as GPU culling or level-of-detail algorithms.

Following this, our OpenGL GL_NV_command_list extension added the ability to do basic state changes (for example, vertex/index/uniform-buffers) using tokens stored in regular GPU buffers. Similar functionality was later exposed in DirectX 12 Execute Indirect, and we are now introducing this to Vulkan with DGC.

One problem with existing indirect techniques is that the command buffer must be prepared for the worst-case scenario on the CPU. You may also require more memory than necessary for the distribution of indirect commands, given that it is not known which state buckets they will end up in. To solve these problems, this extension adds further capabilities, such as the ability to switch between different shaders.

This diagram shows how GPU-driven state setup can be more efficient in terms of resource usage than CPU-driven state setup. A CPU-driven approach must allocate the worst-case amount of space to contain the data for draw indirect calls for each state. But some states might never be used, other states might be underutilized, and the amount of work could even be reduced by a shader (for instance, during occlusion culling). Instead, GPU-driven state setup yields a single, compact stream without unnecessary state setup and potentially with less memory. Ideally, you should still group the state a bit; the hardware can then filter out redundant state changes. — *Figure 3. Indirect State Commands*

Principal usage

To leverage this new functionality, there are a few additions to the Vulkan API. You need to express what commands to generate, make shaders bindable from the device, and run the actual generation and execution.

This diagram shows that a NVkIndirectCommandsLayoutNV object defines the stateless sequence of commands to be generated and sets the static arguments. It contains device-generated command sequences, such as BindShaderGroup(index), BindVertexBuffer(binding, bufferDeviceAddress), and Draw(various arguments) calls. BindShaderGroup sets shaders by indexing into the active graphics pipeline's shader groups. These can reference other pipelines. For BindVertexBuffer and Draw commands, dynamic arguments can be passed as buffer content at command generation time. — *Figure 4. Indirect Commands Layout*

A new Vulkan object type is introduced:

VkIndirectCommandsLayoutNV: The layout encodes the sequence of commands to be generated and where the data is found in the input streams. In addition to the command tokens and predefined functional arguments (binding units, variable sizes, etc.), additional usage flags that affect the generation can also be provided.

Graphics pipeline creation is extended by adding shader groups:

VkGraphicsPipelineShaderGroupsCreateInfoNV: As part of VkGraphicsPipelineCreateInfo, you can define multiple shader groups within the pipeline. By default, every pipeline has one group. The shader groups are bound through an index either coming from an indirect command or the new vkCmdBindPipelineShaderGroupNV. It is also possible to append shader groups by referencing existing compatible graphics pipelines, which eases integration.
VkGraphicsShaderGroupCreateInfoNV: A shader group can override a subset of the pipeline’s base graphics state. This includes the ability to change shaders, as well as the vertex or tessellation setup.

The following steps generate commands on the device:

Define a sequence of commands to be generated, using a VkIndirectCommandsLayoutNV object.
For the ability to change shaders, create the graphics pipelines with VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV and then create an aggregate graphics pipeline that imports those pipelines as graphics shader groups using VkGraphicsPipelineShaderGroupsCreateInfoNV, which extends VkGraphicsPipelineCreateInfo.
Create a preprocess buffer based on sizing information acquired using vkGetGeneratedCommandsMemoryRequirementsNV.
Fill the input buffers for the generation step and set up the VkGeneratedCommandsInfoNV struct accordingly. It is used in preprocessing as well as execution. Some commands require buffer addresses, which can be acquired using vkGetBufferDeviceAddress.
(Optional) Use a separate preprocess step using vkCmdPreprocessGeneratedCommandsNV.
Run the execution using vkCmdExecuteGeneratedCommandsNV.

Key design choices

Compared to the approach of GL_NV_command_list or DirectX12 ExecuteIndirect, this extension uses some different design choices:

Graphics Shader Groups: Like ray tracing, a graphics pipeline contains a collection of shader groups that can be accessed from the device.
Stateless Command Sequences: While the new design also uses command tokens, these tokens define a stateless sequence. The GL_NV_command_list extension encoded a serial token stream, which allowed you to resolve redundant state setup and create compact streams. However, the serial token stream design was not as portable or parallel-processing friendly as the traditional stateless design of MDI or ExecuteIndirect. Not all command types must be generated, although stateful commands must be defined before work-provoking ones.

VkIndirectCommandsTokenTypeNV	Equivalent vkCmd	Input buffer data
VK_INDIRECT_COMMANDS_TOKEN_TYPE_SHADER_GROUP_NV	vkCmdBindPipelineShaderGroup	ShaderGroup index
VK_INDIRECT_COMMANDS_TOKEN_TYPE_STATE_FLAGS_NV	–	vkFrontFace state
VK_INDIRECT_COMMANDS_TOKEN_TYPE_INDEX_BUFFER_NV	vkCmdBindIndexBuffer	VkDeviceAddress and VkIndexType
VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_NV	vkCmdBindVertexBuffers	VkDeviceAddress and stride
VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_NV	vkCmdPushConstant	Raw values
VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_NV	vkCmdDrawIndexedIndirect	Draw arguments
VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_NV	vkCmdDrawIndirect	Draw arguments
VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_TASKS_NV	vkCmdDrawMeshTasksIndirectNV	Draw arguments

Optional Explicit Preprocessing: This extension optionally allows some preprocessing to be done separately. This way, you can asynchronously overlap preparation of new draw calls with the drawing of a previous batch.
Explicit Resource Control: To fit into Vulkan’s explicit design, the memory requirements for command generation are exposed and managed directly through buffers by you.

This diagram shows how device-generated commands can be preprocessed on the compute queue, at the same time that the graphics queue is drawing running commands generated by the previous batch. Uses synchronization between generating and running commands. The caption says "Optionally preprocess commands asynchronously with rendering, and ping pong between buffers. Preprocess buffers are explicitly sized according to vkGetGenerateCommandsMemoryRequirementsNV and are always required (even if no explicit preprocess step is used)." — *Figure 5. Asynchronous Preprocessing*

Multiple Input Streams: Previous approaches only allowed a single input buffer stream. MDI and ExecuteIndirect are designed to use an array of structures (AoS) layout, which makes partial updates less cache-efficient. DGC lets you define how many input streams are used and where tokens are stored. This way, you can choose to use a structure of arrays (SoA) layout for compact, cache-friendly streams, or use a hybrid layout to easily separate dynamic from static content.

Command sequences are flexible and support being laid out in multiple different ways. Traditional approaches used a single interleaved stream, which corresponds to an array-of-structures, or AoS layout. However, the NV extension allows flexible streams in structure-of-arrays (SoA), array-of-structures (AoS), or even hybrid layouts for cache-efficient updates. For instance, you can have two buffers, one in SoA and one in AoS, to separate static from dynamic data and update just the data that you need. — *Figure 6. Multiple Input Streams*

Relaxed Ordering: By default, all sequences are processed in linear order. You can relax the ordering requirement to be implementation-dependent, which speeds up processing.
Sequence Count & Index Buffer: The number of sequences, as well as the subset, can be provided by additional buffers containing count and index values. This allows for easy sorting or filtering of draw calls without having to change the input data.

This diagram shows how sequence ordering can be controlled in many different ways. On the left, you can have the default monotonic order of command sequences - 0, 1, 2, 3, 4, 5, 6, 7 - with the number of sequences given by the GPU (in this case, 8). In the middle, the implementation can have its own incoherent order - 3, 2, 0, 1 - and the actual number of command sequences can be provided by the GPU buffer. On the right, you can even have a subset of sequences - 2, 5, 1, 4 - and pass the number and order of sequences as an additional GPU buffer - in this case, 4 indices, 2, 5, 1, and 4. — *Figure 7. Flexible Sequencing*

What’s the catch?

No feature is free of trade-offs. A device-generation approach means that some driver-side optimizations may not apply. Furthermore, the generation process can add to the overall frame time, in cases where the CPU is able to record commands without affecting the GPU time. Finally, it requires additional GPU memory.

In summary, the goal of this extension is primarily to reduce the amount of actual work done on the GPU, by making decisions on the device about what and how work is generated. It is not about off-loading command generation from CPU to the GPU in general.

Sample and availability

A new open-source sample is available on GitHub. This extension is exposed in our beta driver first, and will be available in production drivers later.

Special thanks to Neil Bickford, Matthew Rusch, Mathias Schott, Daniel Koch, James Helferty, and Markus Tavenrath who contributed to the posts.