Framebuffer Fetch in Vulkan

7230.GettyImages_5F00_2D00_5F00_956353800.jpg_5F00_2D00_5F00_1800x1012x2.jpg_2D00_900x506x2.jpg?_=638422195029628982

8 minute read time.

Tile-based architectures, like Arm Mali and Immortalis GPUs, allow fragment shaders access to the contents of the framebuffer from tile memory. This can be beneficial if we need to implement deferred shading. OpenGL ES exposes this functionality in such mechanisms as Pixel Local storage (PLS) and Framebuffer Fetch (FBF), which are explained in detail in this blog by Roberto Lopez Mendez.

Subpasses in Vulkan provides a similar mechanism. Within a subpass, a barrier must be inserted between each color attachment write and the corresponding input attachment read. This makes techniques such as programmable blending difficult to implement because you cannot insert these barriers between primitives in the same draw call.

Additionally, we expect new Vulkan applications to prefer Dynamic Rendering, where there is only a single subpass per render pass. In this case, we need a new way to support the use-cases that previously required multiple subpasses.

For these reasons, we have added a set of new extensions to Vulkan. We will introduce these extensions in this blog post but first let's explain why they are necessary.

Render pass feedback loop

Imagine the following scenario, a subpass uses the same framebuffer attachment both as input and output for color or depth and stencil. That means that the subpass will be able to read and write the data from and to the same attachment. This situation is called a render pass feedback loop.

An access pattern causing a render pass feedback loop

There are exceptions if the subpass reads only those components that are not written to. For example, if it reads RG channels and writes to BA channels, then there is no feedback loop.

What is the problem with render pass feedback loops? The writes are not automatically made visible to reads. So, within the same subpass, it is a data race. The behavior in this case is undefined and the application may simply crash.

If the crash does not happen, consider the following problem related with data race. When sampling the attachment at a given coordinate, we do not know whether we will get the value updated by the previous shader invocation. It can lead to artifacts shown on the following image:

Rendering artifacts due to data races

Subpass self-dependency

We can avoid the render pass feedback loop by adding a subpass self-dependency. That means that we first specify a subpass dependency during render pass initialization with srcSubpass and dstSubpass values pointing to the same subpass. This enables us to later add memory barriers for the attachment using vkCmdPipelineBarrier within the render pass instance to synchronize read accesses prior to the barrier with write accesses after the barrier. One limitation of this approach, is that the barrier can only be inserted between draw calls; it does not allow us to synchronize accesses to the attachment within a draw call. This matters for use-cases such as programmable blending where triangles in a single draw call may be overlapping.

Rasterization Order Extensions

Let us now look at how we can synchronize attachment access without the need for subpass self-dependencies and explicit pipeline barriers.

The VK_EXT_rasterization_order_attachment_access extension provides a mechanism for accessing the contents of framebuffer attachments from one fragment to the next in rasterization order.

To enable this functionality for a color attachment, you need to add the VK_PIPELINE_COLOR_BLEND_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_ACCESS_BIT_EXT bit to the flags value of pColorBlendState when initializing the rendering pipeline. When using render pass objects, you also need to set the VK_SUBPASS_DESCRIPTION_RASTERIZATION_ORDER_ATTACHMENT_COLOR_ACCESS_BIT_EXT bit in the flags value of pSubpasses when creating the render pass.

Similarly, you can enable it for depth and stencil attachments by adding the VK_PIPELINE_DEPTH_STENCIL_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_STENCIL_ACCESS_BIT_EXT bit to the flags value of pDepthStencilState.

This creates a framebuffer-local dependency for each fragment, just like PLS or FBF in OpenGL ES.

Note: If you have an older device with a Mali GPU where this extension is not exposed; it is still possible to experiment with this functionality because the implementation will synchronize the attachment accesses by default. In this case, it is still necessary to specify a subpass self-dependency, but the memory barrier is not needed.

Originally, this extension could not be used with dynamic rendering, but with the recent introduction of VK_KHR_dynamic_rendering_local_read this is now possible. We'll say more about this in the next section.

Finally, we should briefly mention the VK_ARM_rasterization_order_attachment_access extension. This provides identical functionality with the EXT extension. In fact, the two extensions are aliases of each other. There may be a small number of devices in the market that only support the ARM version. But if your device supports both extensions, there’s never a reason to use the ARM vendor extension.

Shader Tile Image Extension

The purpose of the VK_EXT_shader_tile_image extension is to provide rasterization order access to the contents of framebuffer attachments when using dynamic rendering, and to do so in a way that is explicit and guaranteed to access data from tile memory.

When designing this extension, we had a choice between two abstraction levels:

An explicit API that targets tile-based GPUs with fast raster order access to tile memory.
A more abstract API that would target all GPUs, where data may be opportunistically stored in tile memory, or in caches. However, this is not guaranteed.

For VK_EXT_shader_tile_image we chose the first approach. The recently introduced VK_KHR_dynamic_rendering_local_read extension takes the second approach.

When VK_KHR_dynamic_rendering_local_read is used in combination with VK_EXT_rasterization_order_attachment_access it provides the same functionality as VK_EXT_shader_tile_image. For that reason, VK_KHR_dynamic_rendering_local_read targets all GPUs, Khronos decided to include it in the Vulkan Roadmap 2024 Milestone. Developers may prefer to use that extension for portability when it is available because it uses the same shader interface as subpasses. Also, it has the benefit of being compatible with the examples in the following sections. But if VK_EXT_shader_tile_image is also present on the device, you can be confident that framebuffer attachment data is kept on-chip during the dynamic render pass.

Usage examples

Programmable blending

Mentioned previously, it is impractical to implement programmable blending using subpasses and subpass self-dependencies. For some use cases we can rely on default blending in Vulkan, more complicated blending algorithms require doing it in shader. For example, the previous color values are loaded from i_color attachment and then updated by assigning the value to o_color, which is the same attachment:

Fullscreen

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_color;

layout(location = 0) out vec4 o_color; layout(set = 0, binding = 0) uniform BlendUniform

vec4 a;

vec4 b;

} blend_uniform;

void main(void)

vec4 color = subpassLoad(i_color);

color = lerp(color, blend_uniform.a, color.w * blend_uniform.a.w);

color *= blend_uniform.b;

o_color = color;

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_color;

layout(location = 0) out vec4 o_color; layout(set = 0, binding = 0) uniform BlendUniform 
{ 
    vec4 a;
    vec4 b;
} blend_uniform; 

void main(void) 
{ 
    vec4 color = subpassLoad(i_color); 

    color = lerp(color, blend_uniform.a, color.w * blend_uniform.a.w); 

    color *= blend_uniform.b;

    o_color = color;
}

In other words, the value of i_color at the current coordinate is overwritten and can be used in subsequent draw calls.

Deferred shading

A detailed explanation of deferred shading implementation on Mali is covered in a previous blog from Hans-Kristian Arntzen. It explains how to use subpasses and attachments to implement this technique. Following that explanation, we split the rendering into two steps:

Geometry pass: this will be the first subpass in Vulkan
Lighting pass: the next subpass

In the geometry pass, we render all the objects in the scene and store their properties (albedo, normal, depth) in the output attachments. These are then used in the next pass to calculate lighting. In the following image, you can see multiple lights applied in the final subpass:

Example scene using deferred shading

The lighting shader looks like this:

Fullscreen

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_depth;

layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_albedo;

layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_normal;

layout(location = 0) out vec4 o_color;

layout(set = 0, binding = 0) uniform LightsInfo

Light lights[MAX_LIGHT_COUNT];

} lights_info;

void main()

// Read G-Buffer values using subpassLoad

// Calculate world position and normal

vec3 L = vec3(0.0);

for (uint i = 0U; i < MAX_LIGHT_COUNT; ++i)

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_depth; 
layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_albedo; 
layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_normal; 

layout(location = 0) out vec4 o_color; 
 
layout(set = 0, binding = 0) uniform LightsInfo 
{ 
    Light lights[MAX_LIGHT_COUNT];
} lights_info; 
 
... 
 
void main()
{  
    // Read G-Buffer values using subpassLoad 
    // Calculate world position and normal 
 
    vec3 L = vec3(0.0); 
 
    for (uint i = 0U; i < MAX_LIGHT_COUNT; ++i) 
    {
        L += calculate_light(position, normal, lights_info.lights[i]);
    } 

    o_color = vec4(ambient_color + L * albedo.xyz, 1.0); 
}

The rasterization order extension enables us to implement the same logic in a single subpass. We can avoid the need for a fixed MAX_LIGHT_COUNT value. Instead, we accumulate the final color value in the lighting pass, calling the shader for each light and adding to the previous values (probably with some custom blending logic).

In this version of the shader, we are using subpassLoad to read the previous color values, and then write the updated value to o_color, which is bound to the same attachment:

Fullscreen

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_light;

layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_depth;

layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_albedo;

layout(input_attachment_index = 3, binding = 3) uniform subpassInput i_normal;

layout(location = 0) out vec4 o_color;

void main()

// Read G-Buffer values using subpassLoad

// Calculate world position and normal

vec3 previous_light_value = subpassLoad(i_light).xyz;

// Add ambient light value, if it’s the first light at this location

previous_light_value = max(previous_light_value, ambient_color);

vec3 light_value = calculate_light(position, normal, light);

layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_light; 
layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_depth; 
layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_albedo; 
layout(input_attachment_index = 3, binding = 3) uniform subpassInput i_normal; 
 
layout(location = 0) out vec4 o_color; 
 
... 
 
void main() 
{ 
    // Read G-Buffer values using subpassLoad 
    // Calculate world position and normal

    vec3 previous_light_value = subpassLoad(i_light).xyz; 

    // Add ambient light value, if it’s the first light at this location
    previous_light_value = max(previous_light_value, ambient_color); 

    vec3 light_value = calculate_light(position, normal, light); 

    o_color = vec4(previous_light_value + light_value * albedo.xyz, 1.0); 
}

Conclusions

Subpasses and subpass dependencies are great built-in mechanisms in Vulkan. This helps us to access tile-memory on mobile GPUs in an efficient way. Similar functionality existed in OpenGL ES using pixel local storage and framebuffer fetch extensions. However, the OpenGL ES extensions add more freedom for implementing such techniques as programmable blending.

With the extensions described in this blog, you can now do the same in Vulkan without needing to specify subpass self-dependencies.

In summary, let us recap when you should use the different extensions mentioned in this blog for framebuffer fetch:

If you are using multiple subpasses, use VK_EXT_rasterization_order_attachment_access.
If you are using dynamic rendering , use VK_KHR_dynamic_rendering_local_read in combination with VK_EXT_rasterization_order_attachment_access if you are on a device that supports the Vulkan 2024 Roadmap Profile; or use VK_EXT_shader_tile_image.

Framebuffer Fetch in Vulkan

Framebuffer Fetch in Vulkan

Render pass feedback loop

Subpass self-dependency

Rasterization Order Extensions

Shader Tile Image Extension

Usage examples

Programmable blending

Deferred shading

Conclusions

Recommend

喜提全网热搜、商家消费者都玩嗨，饿了么免单为啥总能爆火出圈？

OpenAI is working on AI education, safety initiative with Common Sense

阿里巴巴达摩院与安提瓜和巴布达达成医疗 AI 合作

Apple warns proposed UK law will affect software updates around the world

Fail2ban怎么解放并放入白名单

微信AI升级微信对话开放平台推出产品 “小微助手”

2024: 从头安装 Arch Linux 记录

Stripe Sessions 2024—come join us

昆仑万维：旗下 AI 游戏《Club Koala》首次 Beta 版测试预计 3 月展开

我国实现电视开机广告全面取消开机时长大幅缩短

About Joyk