In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.
There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more on constructing it. One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let’s look at how.
Exploiting the Z-Order Curve
The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity. Continue reading
In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that limit its applicability in practice.
In this part we will turn our attention to a more sophisticated approach, hierarchical tree traversal, that avoids these issues to a large extent. In the process, we will further explore the role of divergence in parallel programming, and show a couple of practical examples of how to improve it.
Bounding Volume Hierarchy
We will build our approach around a bounding volume hierarchy (BVH), which is a commonly used acceleration structure in ray tracing (for example). A bounding volume hierarchy is essentially a hierarchical grouping of 3D objects, where each group is associated with a conservative bounding box.
This series of posts aims to highlight some of the main differences between conventional programming and parallel programming on the algorithmic level, using broad-phase collision detection as an example. The first part will give some background, discuss two commonly used approaches, and introduce the concept of divergence. The second part will switch gears to hierarchical tree traversal in order to show how a good single-core algorithm can turn out to be a poor choice in a parallel setting, and vice versa. The third and final part will discuss parallel tree construction, introduce the concept of occupancy, and present a recently published algorithm that has specifically been designed with massive parallelism in mind.
Why Go Parallel?
The computing world is changing. In the past, Moore’s law meant that the performance of integrated circuits would roughly double every two years, and that you could expect any program to automatically run faster on newer processors. However, ever since processor architectures hit the Power Wall around 2002, opportunities for improving the raw performance of individual processor cores have become very limited. Today, Moore’s law no longer means you get faster cores—it means you get more of them. As a result, programs will not get any faster unless they can effectively utilize the ever-increasing number of cores.
Out of the current consumer-level processors, GPUs represent one extreme of this development. NVIDIA GeForce GTX 480, for example, can execute 23,040 threads in parallel, and in practice requires at least 15,000 threads to reach full performance. The benefit of this design point is that individual threads are very lightweight, but together they can achieve extremely high instruction throughput.
One might argue that GPUs are somewhat esoteric processors that are only interesting to scientists and performance enthusiasts working on specialized applications. While this may be true to some extent, the general direction towards more and more parallelism seems inevitable. Learning to write efficient GPU programs not only helps you get a substantial performance boost, but it also highlights some of the fundamental algorithmic considerations that I believe will eventually become relevant for all types of computing. Continue reading