In this blog post I will briefly discuss the importance and simplicity of graph coloring and its application to one of the most common problems in sparse linear algebra – the incomplete-LU factorization. My goal is to convince you that graph coloring is a problem that is well-suited for GPUs and that it should be viewed as a tool that can be used to expose latent parallelism even in cases where it is not obvious. In fact, I will apply this tool to expose additional parallelism in one of the most popular black-box preconditioners/smoothers—the incomplete-LU factorization—which is used in many applications, including Computational Fluid Dynamics; Computer-Aided Design, Manufacturing, and Engineering (CAD/CAM/CAE); and Seismic Exploration (Figure 1).

## What is Graph Coloring?

In general, graph coloring refers to the problem of finding the minimum number of colors that can be used to color the nodes of a graph, such that no two adjacent (connected) nodes have the same color. For example, the graph in Figure 2 can be colored with two colors (green and yellow).

Why is this mathematical problem of interest to us? Well, imagine that each node in the graph represents a task and each edge represents a dependency between two tasks. Then, graph coloring tells us which tasks are independent. Assuming that the edges have no particular direction assigned to them, we can process the tasks with the same color in parallel (they are independent by construction), perform a barrier, and proceed to the next set of tasks that are identified by a different color. Not all problems can be mapped to such a framework, but many are amenable to it.

The next question we should answer is how difficult is it to perform graph coloring? Now that the cuSPARSE routine provides a graph coloring implementation in the `csrcolor()`

routine, for most users it is trivially easy. But in this post I want to talk about implementing the algorithm itself in a bit more detail.

It is well-known that finding the best solution to this problem is NP-complete. However, there are many parallel algorithms that can find an approximate solution very quickly. Indeed, the exact solution is often not even required, as long as we obtain enough parallelism to fully utilize our parallel computing platform. Continue reading