This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace.
A great circle, also known as an orthodrome or Riemannian circle, of a sphere is the intersection of the sphere and a plane which passes through the center point of the sphere.
For almost any pair of points on the surface of a sphere, the shortest (surface) distance between these points is the path along the great circle between them. If you have ever flown from Europe to the west coast of North America and wondered why you passed over Greenland, your flight most likely followed a great circle path in order to conserve fuel.
Following is the code for a CUDA C function, haversine(), which computes the great-circle distance of two points on earth (or another sphere), using the Haversine formula. People differ on their philosophy as to the “correct” radius when assuming a spherical earth (given that earth is not a sphere; Wikipedia provides some guidance on this matter). Therefore, the earth’s radius is an input to the function, which also allows trivial switching between kilometers and miles as units. The accuracy of the formula when computed in single precision should generally be fully adequate for distance computations within a continent.
/* This function computes the great-circle distance of two points on earth
   using the Haversine formula, assuming spherical shape of the planet. A
   well-known numerical issue with the formula is reduced accuracy in the
   case of near antipodal points.
   lat1, lat2: latitudes of the two points, in degrees [-90, +90]
   lon1, lon2: longitudes of the two points, in degrees [-180, +180]
   radius:     radius of the earth in user-defined units, e.g. 6378.2 km or
               3963.2 miles
   returns:    distance of the two points, in the same units as radius
   Reference: http://en.wikipedia.org/wiki/Great-circle_distance
*/
__device__ float haversine (float lat1, float lon1, float lat2, float lon2,
                            float radius)
{
    float dlat, dlon, c1, c2, d1, d2, a, c, t;

    c1 = cospif (lat1 / 180.0f);
    c2 = cospif (lat2 / 180.0f);
    dlat = lat2 - lat1;
    dlon = lon2 - lon1;
    d1 = sinpif (dlat / 360.0f);
    d2 = sinpif (dlon / 360.0f);
    t = d2 * d2 * c1 * c2;
    a = d1 * d1 + t;
    c = 2.0f * asinf (fminf (1.0f, sqrtf (a)));
    return radius * c;
}
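For readers without a GPU at hand, the same arithmetic can be checked on the host; the sketch below emulates sinpif(x)/cospif(x) as sinf(πx)/cosf(πx) (the CUDA intrinsics avoid that explicit multiplication by π, which is precisely their accuracy advantage).

```cpp
#include <cassert>
#include <cmath>

// Host-side sketch of the haversine() function above. sinpif/cospif are
// emulated with an explicit multiplication by pi, so this is slightly less
// accurate than the CUDA version.
inline float haversine_host(float lat1, float lon1, float lat2, float lon2,
                            float radius)
{
    const float pi = 3.14159265358979f;
    float c1 = cosf(pi * (lat1 / 180.0f));
    float c2 = cosf(pi * (lat2 / 180.0f));
    float d1 = sinf(pi * ((lat2 - lat1) / 360.0f));
    float d2 = sinf(pi * ((lon2 - lon1) / 360.0f));
    float a  = d1 * d1 + d2 * d2 * c1 * c2;
    float c  = 2.0f * asinf(fminf(1.0f, sqrtf(a)));
    return radius * c;
}
```

Two easy sanity checks: a quarter of the equator, from (0°, 0°) to (0°, 90°), must come out as radius·π/2, and equatorial antipodes as radius·π.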
Thanks to Norbert Juffa for contributing this code.
Note: this post was co-written by Alex Şuhan and Todd Mostak of MapD.
At MapD our goal is to build the world’s fastest big data analytics and visualization platform that enables lag-free interactive exploration of multi-billion row datasets. MapD supports standard SQL queries as well as a visualization API that maps OpenGL primitives onto SQL result sets.
Although MapD is fast running on x86-64 CPUs, our real advantage stems from our ability to leverage the massive parallelism and memory bandwidth of GPUs. The most powerful GPU currently available is the NVIDIA Tesla K80 Accelerator, with up to 8.74 teraflops of compute performance and nearly 500 GB/sec of memory bandwidth. By supporting up to eight of these cards per server we see orders-of-magnitude better performance on standard data analytics tasks, enabling a user to visually filter and aggregate billions of rows in tens of milliseconds, all without indexing. The following video shows the MapD dashboard animating 750 million tweets in real time. Nothing in this demo is pre-computed or canned. Our big data visual analytics platform is running on 8 NVIDIA Tesla K40 GPUs in a single server to power the dashboard.
Fast hardware is only half of the story, so at MapD we have invested heavily in optimizing our code such that a wide range of analytic workloads run optimally on GPUs. In particular, we have worked hard so that common SQL analytic operations, such as filtering (WHERE) and GROUP BY, run as fast as possible. One of the biggest payoffs in this regard has been moving from the query interpreter that we used in our prototype to a JIT (Just-In-Time) compilation framework built on LLVM. LLVM allows us to transform query plans into architecture-independent intermediate code (LLVM IR) and then use any of the LLVM architecture-specific “backends” to compile that IR code for the needed target, such as NVIDIA GPUs, x86-64 CPUs, and ARM CPUs.
Query compilation has the following advantages over an interpreter:
For example, to evaluate the expression x*2+3 over a column x, an interpreter-based query engine would first evaluate x*2 for a number of rows, storing that to an intermediate buffer. The intermediate results stored in that buffer would then be read and summed with 3 to get the final result. Writing and reading these intermediate results to memory wastes memory bandwidth and/or valuable cache space. Compare this to a compiled query, which can simply store the result of the first subexpression (x*2) in a register before computing the final result, allowing the cache to be used for other purposes, for example to create the hash table necessary for a query’s GROUP BY clause. This is related to the loop fusion and kernel fusion compiler optimizations.
An efficient interpreter would likely involve executing instructions represented by vectors of opcodes/byte-codes. Decoding the byte-code to get the required operations and then branching to the correct operation requires a significant amount of extra cycles. On the other hand, pre-generating compiled code for the query avoids the inefficiencies of this virtual machine approach.
Depending on the number and range of the columns used in a GROUP BY clause, different hash strategies are optimal. Some of them rely on generating collision-free hash functions based on the range of the data, which is only known at runtime. Reproducing such functionality efficiently with an interpreter, particularly when the number and types of columns can vary, is difficult.
Of course, LLVM is not the only way to generate a JIT query compiler. Some databases employ source-to-source compilers to convert SQL to another source language like C++, which they then compile using regular compilers like gcc. We think that an LLVM-based compiler has significant advantages over a transpiler, including:
LLVM IR is quite portable over the various architectures we run on (GPU, x86-64, ARM). In contrast, source language generation requires more attention to syntactic differences, particularly in divergent cases like CUDA vs. OpenCL (both can be targeted with LLVM quite easily).
LLVM comes with built-in code validation APIs and tools. For example, comparison and arithmetic operations on integers will fail (with a useful error message) if the operand widths are different. Once a function is generated, llvm::verifyFunction performs additional sanity checks, ensuring (among other things) that the control flow graph of our query is well-formed.
LLVM is powerful and battle-proven for CPUs, but our product focuses on GPUs. If we could use LLVM for GPU code compilation we’d get all the benefits we’ve mentioned while also being able to run on a CPU when needed. Fortunately, the NVIDIA Compiler SDK made this a reality long before we started to build our product.
The NVIDIA Compiler SDK includes libNVVM, an LLVM-based compiler backend and NVVM IR, a rather extensive subset of LLVM IR. Thanks to our choice of LLVM and libNVVM, our system runs on NVIDIA GPUs, GPU-less ultrabooks, and even on the 32-bit ARM CPU on the Jetson TK1, all using the same code base.
MapD does not need to directly generate all code. We offload some of the functionality to a runtime written in C++ whenever code generation would be tedious and error-prone without any performance benefits. This approach is a great fit for things like aggregate functions, handling arithmetic on columns with SQL null values, hash dictionaries and more. The LLVM-based C++ compiler, clang, generates the corresponding LLVM IR, and we combine it with our explicitly generated IR.
As is always the case when compilation is involved, the time required to generate native code is an important consideration. An interactive system sees new queries all the time as the user refines them in search of insight. We’re able to keep code generation consistently under 30 ms for entirely new queries, which is good enough to be unnoticeable in the console, especially for massive datasets. However, for “mere billions” of rows, our UI is able to show smooth animations over multiple correlated charts. Since the actual execution is so fast in this case, 30 ms can matter a lot.
Fortunately, these queries are structurally identical and only differ in the value of literals as the filter window moves across the time range or the user selects the tail of a histogram. With caching in place, compilation time becomes a non-issue. We keep it simple and still generate the IR, then use it as a key in the native code cache. The LLVM API offers an easy way to serialize source level entities (functions in our case), shown below.
std::string serialize_function(const llvm::Function* f)
{
    std::stringstream ss;
    llvm::raw_os_ostream os(ss);
    f->print(os);
    return ss.str();
}
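The serialized function text can then serve as the key of a native-code cache. A minimal sketch (the NativeCode and CodeCache names are hypothetical, not MapD's actual classes):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical compiled-code handle; in a real system this would wrap a
// CUmodule/CUfunction or a host function pointer.
struct NativeCode { int id; };

class CodeCache {
public:
    // Returns cached code for this IR, invoking the compile callback only
    // on a miss.
    template <class Compile>
    const NativeCode& get(const std::string& ir, Compile compile) {
        auto it = cache_.find(ir);
        if (it == cache_.end())
            it = cache_.emplace(ir, compile()).first;
        return it->second;
    }
    size_t size() const { return cache_.size(); }
private:
    std::unordered_map<std::string, NativeCode> cache_;
};
```

With this in place, structurally identical queries (differing only in literal values that are passed as parameters) hit the cache and skip native code generation entirely.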
Ideas are great in performance-focused systems, but the proof is in the pudding. As it turns out, MapD extracts a lot of performance out of GPUs.
Queries using filter and aggregate routinely hit more than 80% of the available bandwidth. We’ve measured more than 240 GB/sec on a single K40 (vs. a theoretical max of 288 GB/sec) for a filter and count query touching a single column. When grouping by a single column with 20 possible values and some skew (the carrier in the airline data set in Figure 1), MapD can only reach slightly more than 100 GB/sec on the K40. On the new Titan X GPU, based on the Maxwell architecture, we are able to get more than 200 GB/sec on the same query, on a single card. Maxwell handles contention in shared memory atomics significantly better than the Kepler architecture, which explains this great result on skewed inputs. We’re looking forward to this feature being implemented on future generations of Tesla cards as well.
MapD is easily able to get a 40-50x speedup on a multi-GPU system, even when compared to our own code running on a high end dual-socket CPU system, and there are even queries for which the gap is two orders of magnitude (this is often code with lots of divisions, which tend to be slow on x86-64). Compared to other leading in-memory CPU-based databases, which typically use interpreters or source-to-source compilers, the speedup can easily be three orders of magnitude, as Figure 2 shows.
We’ve learned a lot about LLVM and JIT compilation for GPUs while building MapD’s interactive query engine, and we’d like to share some of that experience with you.
Most MapD runtime functions are marked as always_inline, which forces the LLVM AlwaysInliner optimization pass to inline them, eliminating function call overhead and increasing the scope for other optimization passes. For example, the following is a reasonable way of implementing a max aggregate.
extern "C" __attribute__((always_inline))
void agg_max(int64_t* agg, const int64_t val)
{
    *agg = std::max(*agg, val);
}
Note that the function is not marked as __device__, since this is not CUDA C++ code. Any explicit call to this function will eventually be inlined, and the result can run unmodified on the GPU. Also, if agg points to a value allocated on the stack (as is the case for queries without a GROUP BY clause), the PromoteMemoryToRegister pass will place it in a register for the inner loop of the query. The runtime functions which need GPU-specific implementations are part of a regular CUDA C++ library we can call from the query.
We’ve said that NVVM generates native code, but there actually is an additional step we haven’t discussed. From the IR we generate, NVVM generates PTX, which in turn is compiled to native code for the GPU. Especially if you’re bundling a CUDA C++ library with the generated code, like we do, caching the result of this last step is very important. Make sure the compute cache directory is writable by your application or else it will silently fail and recompile every time. The code snippet below shows how we bundle a library with the PTX we generate.
checkCudaErrors(cuLinkCreate(num_options, &option_keys[0],
                             &option_values[0], &link_state_));
if (!lib_path.empty()) {
  // To create a static CUDA library:
  // 1. nvcc -std=c++11 -arch=sm_30 --device-link
  //         -c [list of .cu files]
  // 2. nvcc -std=c++11 -arch=sm_30
  //         -lib [list of .o files generated by step 1]
  //         -o [library_name.a]
  checkCudaErrors(cuLinkAddFile(link_state_, CU_JIT_INPUT_LIBRARY,
                                lib_path.c_str(), num_options,
                                &option_keys[0], &option_values[0]));
}
checkCudaErrors(cuLinkAddData(link_state_, CU_JIT_INPUT_PTX,
                              static_cast<void*>(ptx), strlen(ptx) + 1,
                              0, num_options, &option_keys[0],
                              &option_values[0]));
void* cubin;
size_t cubin_size;
checkCudaErrors(cuLinkComplete(link_state_, &cubin, &cubin_size));
checkCudaErrors(cuModuleLoadDataEx(&module_, cubin, num_options,
                                   &option_keys[0], &option_values[0]));
checkCudaErrors(cuModuleGetFunction(&kernel_, module_, func_name.c_str()));
There is an upper bound on the number of registers a block can use, so the CU_JIT_THREADS_PER_BLOCK option should be set to the block size. Failing to do so can make the translation to native code fail. We hit this issue for queries with many projected columns and a lot of threads per block before setting this option.
Speaking of libraries, not all POSIX C functions are included in the CUDA C++ runtime libraries. In our case, we needed gmtime_r for the EXTRACT family of SQL functions. Fortunately, we’ve been able to port it from newlib and compile it with NVCC.
Just a word of caution: despite sharing the IR specification, NVVM and LLVM are ultimately different code-bases. Going with an older version of LLVM, preferably the one NVVM is based on, can help. We decided against that approach since the LLVM API offers a wide range of “IR surgery” features and we were able to fix up these mismatches, but your mileage may vary.
Also, unlike LLVM IR, unaligned loads are not allowed in NVVM IR. The address of a load must be a multiple of the size of the type; otherwise, the query would crash with an invalid memory access error on the GPU, even if the load is not annotated as aligned.
Creating a SQL JIT for GPUs is just one of the many optimizations we’ve implemented to make MapD as fast as possible. If you’d like to learn more about MapD, please visit the MapD website, download our white paper, or read our blog.
In my previous post, I introduced statistical machine translation and showed how it can and should be viewed from the perspective of machine learning: as supervised learning where the input and output are both variable-length sequences. In order to introduce you to neural machine translation, I spent half of the previous post on recurrent neural networks, specifically about how they can (1) summarize a sequence and (2) probabilistically model a sequence. Based on these two properties of recurrent neural networks, in this post I will describe in detail an encoder-decoder model for statistical machine translation.
I’m not a neuroscientist or a cognitive scientist, so I can’t speak authoritatively about how the brain works. However, if I were to guess what happens in my brain when I try to translate a short sentence in English to Korean, my brain encodes the English sentence into a set of neuronal activations as I hear them, and from those activations, I decode the corresponding Korean sentence. In other words, the process of (human) translation involves the encoder which turns a sequence of words into a set of neuronal activations (or spikes, or whatever’s going on inside a biological brain) and the decoder which generates a sequence of words in another language, from the set of activations (see Figure 1).
This idea of encoder-decoder architectures is the basic principle behind neural machine translation. In fact, this type of architecture is at the core of deep learning, where the biggest emphasis is on learning a good representation. In some sense, you can always cut any neural network in half, and call the first half an encoder and the other a decoder.
Starting with the work by Kalchbrenner and Blunsom at the University of Oxford in 2013, this encoder-decoder architecture has been proposed by a number of groups, including the Machine Learning Lab (now, MILA) at the University of Montreal (where I work) and Google, as a new way to approach statistical machine translation [Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015]. (There is also older, related work by Mikel Forcada at the University of Alicante from 1997! [Forcada and Neco, 1997].) Although there is no restriction on which particular type of neural network is used as either an encoder or a decoder, I’ll focus on using a recurrent neural network for both.
Let’s build our first neural machine translation system! But, before I go into details, let me first show you a big picture of the whole system in Figure 2. Doesn’t it look scarily complicated? Nothing to worry about, as I will walk you through this system one step at a time.
We start from the encoder, a straightforward application of a recurrent neural network, based on its property of sequence summarization. If you recall the previous post, this should be very natural. In short, we apply the recurrent activation function recursively over the input sequence, or sentence, until the end when the final internal state of the RNN is the summary of the whole input sentence.
First, each word in the source sentence is represented as a so-called one-hot vector, or 1-of-K coded vector as in Figure 3. This kind of representation is the dumbest representation you can ever find. Every word is equidistant from every other word, meaning that it does not preserve any relationships among them.
We take a hierarchical approach to extracting a sentence representation, a vector that summarizes the input sentence. In that hierarchy, the first step is to obtain a meaningful representation of each word. But, what do I mean by “meaningful” representation? A short answer is “we let the model learn from data!”, and there isn’t a longer answer.
The encoder linearly projects the 1-of-K coded vector (see Figure 3) with a matrix which has as many columns as there are words in the source vocabulary and as many rows as you want (typically 100–500). This projection, shown in Figure 4, results in a continuous vector for each source word, and each element of the vector is later updated to maximize the translation performance. I’ll get back to what this means shortly.
At this point, we have transformed a sequence of words into a sequence of continuous vectors s_1, …, s_T, and the recurrent neural network comes in. At the end of the last post, I said that one of the two key capabilities of the RNN was a capability of summarizing a sequence, and here, I will use an RNN to summarize the sequence of continuous vectors corresponding to the words in a source sentence. Figure 5 illustrates how an RNN does it.
I can write this process of summarization in mathematical notation as

h_t = φ_θ(h_{t-1}, s_t), for t = 1, …, T,

where h_0 is an all-zero vector and φ_θ is the recurrent activation function. In other words, after the last word’s continuous vector s_T is read, the RNN’s internal state h_T represents a summary of the whole source sentence.
Now that we have a summary vector, a natural question comes to mind: “what does this summary vector look like?” I would love to spend hours talking about what that summary vector should look like, what it means and how it’s probably related to representation learning and deep learning, but I think one figure from [Sutskever et al., 2014] says it all in a much more compact form (Figure 6).
To plot the points in Figure 6, Sutskever et al. (2014) trained a neural machine translation system similar to the one we are building here on a large parallel corpus of English and French. Once the model was trained on the corpus, they fed several English sentences into the encoder to get their corresponding sentence representations, or summary vectors h_T. (I guess, in order to show off their model’s awesomeness!)
Unfortunately, human beings are pretty three-dimensional, and our screens and papers can only faithfully draw two-dimensional projections. So it’s not easy to show anyone a vector which has hundreds of numbers, especially on paper. There are a number of information visualization techniques for high-dimensional vectors using a much lower-dimensional space. In the case of Figure 6, Sutskever et al. (2014) used principal component analysis (PCA) to project each vector onto a two-dimensional space spanned by the first two principal components (the x-axis and y-axis in Figure 6). From this, we can get a rough sense of the relative locations of the summary vectors in the original space. What we can see from Figure 6 is that the summary vectors do preserve the underlying structure, including semantics and syntax (if there’s such a thing as syntax); in other words, similar sentences are close together in summary-vector space.
Now that we have a nice fixed-size representation of a source sentence, let’s build a decoder, again using a recurrent neural network (the top half in Figure 2). Again, I will go through each step of the decoder. It may help to keep in mind that the decoder is essentially the encoder flipped upside down.
Let’s start by computing the RNN’s internal state z_i based on the summary vector h_T of the source sentence, the previous word u_{i-1} and the previous internal state z_{i-1}. Don’t worry, I’ll shortly tell you how to get the word. The new internal state is computed by

z_i = φ_{θ'}(h_T, u_{i-1}, z_{i-1}).

The details of the recurrent activation function φ were described in the previous post. Figure 7 illustrates this computation. With the decoder’s internal hidden state ready, we can now score each target word based on how likely it is to follow all the preceding translated words given the source sentence. This is done by assigning a probability to each word (Figure 8). Note, a probability is different from a score in that the probabilities over all possible words sum to one, while the scores don’t need to.
First, we score each word k given a hidden state z_i such that

e(k) = w_k^T z_i + b_k,

where w_k and b_k are the (target) word vector and a bias, respectively.
Let’s forget about the bias for now, and think of the first term, the dot product between two vectors. The dot product is larger when the target word vector and the decoder’s internal state are similar to each other, and smaller otherwise. Remember: a dot product gives the length of the projection of one vector onto another; if they are similar vectors (nearly parallel) the projection is longer than if they are very different (nearly perpendicular). So this mechanism scores a word high if it aligns well with the decoder’s internal state.
Once we compute the score of every word, we now need to turn the scores into proper probabilities using

p(k) = exp(e(k)) / Σ_{k'} exp(e(k')).

This type of normalization is called softmax [Bridle, 1990].
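In code, softmax is typically computed after subtracting the maximum score; the shift cancels out mathematically (a common factor of exp(-max) appears in numerator and denominator) but keeps exp() from overflowing. A small sketch:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Numerically stable softmax: shifting all scores by their maximum does not
// change the resulting probabilities, but bounds the arguments of exp().
std::vector<double> softmax(const std::vector<double>& scores)
{
    double mx = scores[0];
    for (double s : scores) mx = std::max(mx, s);
    std::vector<double> p(scores.size());
    double sum = 0.0;
    for (size_t i = 0; i < scores.size(); ++i) {
        p[i] = std::exp(scores[i] - mx);
        sum += p[i];
    }
    for (double& v : p) v /= sum;
    return p;
}
```

Without the shift, scores like 1000 would overflow to infinity in double precision; with it, the same inputs yield a well-defined distribution.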
Now we have a probability distribution over the target words, which we can use to select a word by sampling the distribution (see here), as Figure 9 shows. After choosing the i-th word, we go back to the first step of computing the decoder’s internal hidden state (Figure 7), scoring and normalizing the target words (Figure 8) and selecting the next, (i+1)-th word (Figure 9), repeating until we select the end-of-sentence word.
Okay, now we have a neural machine translation system ready. How do we train this system so that it can actually translate? As usual with any machine learning model, there are many ways to tune this model to do actual translation. Here, I will describe how to train a neural machine translation model based on the previously described encoder-decoder by maximizing the log-likelihood. Maximum (log-) likelihood estimation (MLE) is a common statistical technique.
First, a so-called parallel corpus must be prepared. Each sample in the corpus is a pair (x^n, y^n) of source and target sentences. Each sentence is a sequence of integer indices corresponding to words, which is equivalent to a sequence of one-hot vectors. (A one-hot vector is a binary vector with a single element set to 1. Multiplying a matrix by a one-hot vector, with the matrix on the left, is equivalent to taking the i-th column of the matrix, where the i-th element of the one-hot vector is 1.) Given any pair from the corpus, the NMT model can compute the conditional log-probability of y^n given x^n, log P(y^n | x^n), and we write the log-likelihood of the whole training corpus as

L(θ) = (1/N) Σ_{n=1}^{N} log P(y^n | x^n),

where N is the number of training pairs.
All we need to do is to maximize this log-likelihood function, which we can do using stochastic gradient descent (SGD).
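To see MLE plus gradient ascent in the smallest possible setting (a Bernoulli coin rather than an NMT model), the sketch below estimates a coin's bias from observed flips; with p = sigmoid(t), the gradient of the log-likelihood with respect to the logit t works out to k − n·p:

```cpp
#include <cassert>
#include <cmath>

// Maximize the Bernoulli log-likelihood of k heads in n flips with respect
// to the logit t, where p = sigmoid(t). The gradient of the log-likelihood
// is (k - n*p), and gradient ascent converges to the MLE p = k/n.
double fit_bernoulli(int k, int n, double lr = 0.05, int steps = 2000)
{
    double t = 0.0; // logit; p = 0.5 initially
    for (int i = 0; i < steps; ++i) {
        double p = 1.0 / (1.0 + std::exp(-t));
        t += lr * (k - n * p); // ascend the log-likelihood
    }
    return 1.0 / (1.0 + std::exp(-t));
}
```

The NMT case is the same idea with millions of parameters and minibatches of sentence pairs in place of a single logit and a handful of coin flips.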
The gradient of the log-likelihood with respect to all the parameters can be easily and efficiently computed by backpropagation. All you need to do is to build a backward graph starting from the final log-probability to the input, and compute derivatives of each operator in the forward computational graph. Well, I don’t know about you, but that sounds awfully complicated and time-consuming to me. Instead of doing it manually, we can use Theano’s automatic differentiation procedure by calling theano.tensor.grad(-loglikelihood, parameters). Here is an example, and here is more detailed documentation.
Once Theano has automatically computed the derivative of the log-likelihood with respect to each parameter, we update the parameter to move along that derivative slowly. Often, this slowly moving part is one that frustrates a lot of people, and this is one of the reasons why many people mistake deep learning for black magic. I agree that finding good learning parameters (initial learning rate, learning rate scheduling, momentum coefficient and its scheduling, and so on) can be frustrating.
So, I often simply go for one of the recently proposed adaptive learning rate algorithms. Among the many of them, Adadelta [Zeiler, 2012] and Adam [Kingma and Ba, 2015] are my favourites. They can be easily implemented in Theano, and if you’re not too keen on reading the paper and implementing the algorithm, you can refer to the Theano documentation. Also you might want to have a look at these visualizations of the optimization algorithms (also Figure 10), although I must warn you that the behavior in this low-dimensional space is not necessarily representative of the behavior in a higher-dimensional space.
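For a sense of how little code an adaptive method takes, here is a sketch of Adam minimizing the toy objective f(x) = x² (the hyperparameters are the defaults suggested in the paper; this is an illustration, not NMT training code):

```cpp
#include <cassert>
#include <cmath>

// Adam [Kingma and Ba, 2015] minimizing f(x) = x^2, whose gradient is 2x.
// Each step rescales the gradient by running estimates of its first and
// second moments, with bias correction for their zero initialization.
double adam_minimize(double x, int steps)
{
    const double lr = 0.1, b1 = 0.9, b2 = 0.999, eps = 1e-8;
    double m = 0.0, v = 0.0;
    for (int t = 1; t <= steps; ++t) {
        double g = 2.0 * x;               // gradient of x^2 at x
        m = b1 * m + (1.0 - b1) * g;      // first moment (mean of gradients)
        v = b2 * v + (1.0 - b2) * g * g;  // second moment (uncentered variance)
        double mhat = m / (1.0 - std::pow(b1, t)); // bias correction
        double vhat = v / (1.0 - std::pow(b2, t));
        x -= lr * mhat / (std::sqrt(vhat) + eps);
    }
    return x;
}
```

Because the step size is effectively normalized by the gradient magnitude, the same learning rate works across parameters with very different scales, which is exactly what makes these methods forgiving to tune.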
Does this mean that I can train a neural machine translation model on a large parallel corpus with my laptop? Unfortunately, no. As you may have guessed, the amount of computation you need for each update is quite large, and the number of SGD updates needed to fully train a model is also large. Let’s first count what kind of computation is required for a single forward pass: for each source word, an embedding lookup and a recurrent state update; and for each target word, a recurrent state update followed by scoring, and normalizing over, the entire target vocabulary. Considering that the source and target vocabularies often contain tens or hundreds of thousands of words, and that the word embeddings and recurrent states have hundreds to thousands of dimensions, the whole computation load is quite substantial. Furthermore, almost the same amount of computation is required for backpropagation, i.e., computing the gradient of the log-likelihood function.
Note that most of these computations are matrix-vector or matrix-matrix multiplications of high-dimensional vectors or matrices, and when it comes to general matrix-matrix multiplication (GEMM) of large matrices, it’s well known that GPUs significantly outperform CPUs (in terms of wall-clock time). So it’s crucial to have a nice set of the latest GPUs to develop, debug and train neural machine translation models.
For instance, see Table 1 for how much time you can save by using GPUs. The table presents only the time needed for translation, and the gap between CPU and GPU grows much greater when training a model, as the complexity estimates above show.
                                  | CPU (Intel i7-4820K) | GPU (GTX TITAN Black)
RNNsearch [Bahdanau et al., 2015] |        0.09s         |        0.02s

Table 1. The average per-word decoding/translation time. From [Jean et al., 2015].
I can assure you that I didn’t write this section to impress NVIDIA; you really do need good GPUs to train any realistic neural machine translation models, at least until scalable and affordable universal quantum computers become available!
Continuing from the previous post, today I described how a recently proposed neural machine translation system is designed using recurrent neural networks. The neural machine translation system in today’s post is a simple, basic model that was recently shown to be excellent in practice for English-French translation.
Based on this basic neural machine translation model, in my next post I will tell you how we can push neural machine translation much further by introducing an attention mechanism into the model. Furthermore, I will show you how we can use neural machine translation to translate from images and even videos into their descriptions! Stay tuned.
Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units).
You can enable linear interpolation (also known as linear filtering) on texture fetches in CUDA kernels. This hardware filtering uses a low-precision interpolant, so for this and other reasons it’s common to lerp in software.
The standard way to lerp is:
(1-t)*v0 + t*v1
Here’s a generic host/device function that performs a lerp:
template <typename T>
__host__ __device__ inline T lerp(T v0, T v1, T t)
{
    return (1 - t) * v0 + t * v1;
}
But we can do better. Compiled as-is, this maps to three floating-point operations. The compiler won’t re-arrange floating-point computation if the transformation does not preserve numerical equality, with the exception of the straightforward contraction of an FMUL followed by a dependent FADD into an FMA.
To maximize performance, you may want to manually re-arrange the terms in the computation above so it can be reduced to two FMAs:
fma(t, v1, fma(-t, v0, v0))
This FMA-optimized version also provides slightly better accuracy overall. Here’s the full function:
template <typename T>
__host__ __device__ inline T lerp(T v0, T v1, T t)
{
    return fma(t, v1, fma(-t, v0, v0));
}
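A quick host-side comparison of the two formulations using std::fma; note that the FMA form hits both endpoints exactly (at t=0 the inner fma returns v0 untouched, and at t=1 it returns exactly 0, leaving v1):

```cpp
#include <cassert>
#include <cmath>

// Two-FMA lerp, as in the text: fma(-t, v0, v0) computes (1-t)*v0 in one
// rounding, then fma(t, v1, .) adds t*v1 in one more.
static float lerp_fma(float v0, float v1, float t)
{
    return std::fma(t, v1, std::fma(-t, v0, v0));
}

// Naive three-operation form for comparison.
static float lerp_naive(float v0, float v1, float t)
{
    return (1.0f - t) * v0 + t * v1;
}
```

On exactly representable inputs both forms agree; the FMA form's advantage is one fewer operation and one fewer rounding step on general inputs.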
For one use case (seismic processing CUDA code) we have seen performance improve by 5% just by optimizing the linear interpolation function as shown above. Not bad for a few minutes of work.
Thanks to Norbert Juffa for providing this Pro Tip.
In this blog post I will briefly discuss the importance and simplicity of graph coloring and its application to one of the most common problems in sparse linear algebra – the incomplete-LU factorization. My goal is to convince you that graph coloring is a problem that is well-suited for GPUs and that it should be viewed as a tool that can be used to expose latent parallelism even in cases where it is not obvious. In fact, I will apply this tool to expose additional parallelism in one of the most popular black-box preconditioners/smoothers—the incomplete-LU factorization—which is used in many applications, including Computational Fluid Dynamics; Computer-Aided Design, Manufacturing, and Engineering (CAD/CAM/CAE); and Seismic Exploration (Figure 1).
In general, graph coloring refers to the problem of finding the minimum number of colors that can be used to color the nodes of a graph, such that no two adjacent (connected) nodes have the same color. For example, the graph in Figure 2 can be colored with two colors (green and yellow).
Why is this mathematical problem of interest to us? Well, imagine that each node in the graph represents a task and each edge represents a dependency between two tasks. Then, graph coloring tells us which tasks are independent. Assuming that the edges have no particular direction assigned to them, we can process the tasks with the same color in parallel (they are independent by construction), perform a barrier, and proceed to the next set of tasks that are identified by a different color. Not all problems can be mapped to such a framework, but many are amenable to it.
The next question we should answer is: how difficult is it to perform graph coloring? Now that the cuSPARSE library provides a graph coloring implementation in the csrcolor()
routine, for most users it is trivially easy. But in this post I want to talk about implementing the algorithm itself in a bit more detail.
It is well-known that finding the best solution to this problem is NP-complete. However, there are many parallel algorithms that can find an approximate solution very quickly. Indeed, the exact solution is often not even required, as long as we obtain enough parallelism to fully utilize our parallel computing platform.
The typical parallel algorithm for finding the approximate graph coloring is based on Luby’s maximal independent set algorithm. In this algorithm we:
1. assign a random number to each node of the graph;
2. add to the independent set every node whose random number is larger than the random numbers of all of its uncolored neighbors;
3. repeat the previous step on the remaining uncolored nodes until no more nodes can be added to the set.
The third step allows us to find the maximal independent set, but graph coloring works even if we only find an independent set, although it will require more colors. The nodes in the (maximal) independent set can be assigned the same color, and the algorithm proceeds in the same fashion to color the remaining nodes of the graph.
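To make the algorithm concrete, here is a host-side Python sketch of the same Jones-Plassmann-Luby idea, using a hypothetical adjacency-dict graph representation (the name color_jpl and the 5-cycle test graph are illustrative, not from the post):

```python
import random

def color_jpl(adj, seed=0):
    """Luby-style greedy coloring. adj: dict node -> set of neighbors.
    Returns dict node -> color (0, 1, 2, ...)."""
    rng = random.Random(seed)
    randoms = {v: rng.random() for v in adj}
    colors = {}
    c = 0
    while len(colors) < len(adj):
        # nodes whose random number beats all uncolored neighbors form
        # an independent set; they all receive color c "in parallel"
        newly = [v for v in adj if v not in colors
                 and all(u in colors or randoms[v] > randoms[u]
                         for u in adj[v])]
        for v in newly:
            colors[v] = c
        c += 1
    return colors

# illustrative graph: a 5-cycle (not the Figure 2 graph from the post)
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
colors = color_jpl(adj)
# proper coloring: no two adjacent nodes share a color
assert all(colors[v] != colors[u] for v in adj for u in adj[v])
```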
The following code sample shows a CUDA C++ kernel for finding an independent set from an adjacency matrix of a graph stored in compressed sparse row (CSR) format, where Ao
contains the matrix row offsets, Ac
contains the column indices and Av
contains the non-zero values.
#include <thrust/fill.h>
#include <thrust/count.h>

__global__ void color_jpl_kernel(int n, int c, const int *Ao, const int *Ac,
                                 const double *Av, const int *randoms,
                                 int *colors)
{
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        bool f = true; // true iff you have max random

        // ignore nodes colored earlier
        if (colors[i] != -1) continue;

        int ir = randoms[i];

        // look at neighbors to check their random number
        for (int k = Ao[i]; k < Ao[i+1]; k++) {
            // ignore nodes colored earlier (and yourself)
            int j = Ac[k];
            int jc = colors[j];
            if (((jc != -1) && (jc != c)) || (i == j)) continue;
            int jr = randoms[j];
            if (ir <= jr) f = false;
        }

        // assign color if you have the maximum random number
        if (f) colors[i] = c;
    }
}

#define CUDA_MAX_BLOCKS 4096 // illustrative upper bound; original value elided

void color_jpl(int n, const int *Ao, const int *Ac, const double *Av,
               int *colors)
{
    int *randoms; // allocation and initialization of random array omitted
    thrust::fill(colors, colors + n, -1); // init colors to -1
    for (int c = 0; c < n; c++) {
        int nt = 256;
        int nb = min((n + nt - 1) / nt, CUDA_MAX_BLOCKS);
        color_jpl_kernel<<<nb, nt>>>(n, c, Ao, Ac, Av, randoms, colors);
        int left = (int)thrust::count(colors, colors + n, -1);
        if (left == 0) break;
    }
}
Luby’s algorithm is very powerful because it provides an outline that can be adjusted for different purposes using different heuristics. For example, we can use multiple hash functions instead of random numbers to assign multiple colors at once in order to perform the graph coloring faster. We can also attempt to use the degree (number of in/out edges) of a node in combination with a random number to color nodes with larger/smaller degree first, which could potentially allow us to perform reordering that minimizes fill-in and exposes additional parallelism. Many such combinations are still open research problems.
Another interesting aspect of this parallel outline is that it is ideally suited for GPUs. It is very easy to map the available parallelism to the CUDA programming model. For example, each CUDA thread can look only at its local neighbors and essentially perform the color assignment independently of others.
Figure 3 illustrates the performance of the above algorithm on several sample problems with two different heuristics based on the original idea about random numbers (Jones-Plassmann-Luby) and the novel idea about multiple hash functions (Cohen-Castonguay). The experiments were performed with CUDA Toolkit 7.0 on Ubuntu 14.04 LTS, on an NVIDIA Tesla K40c GPU Accelerator. For more details please refer to our technical report [1].
The incomplete-LU factorization is an algorithm that approximately factors a large sparse matrix A into lower and upper triangular matrices L and U, so that A ≈ LU.
This approximation is later used as an iterative method preconditioner or an algebraic multigrid smoother, and is designed to speed up convergence to the solution of a given linear system.
In order to find the L and U triangular factors we perform Gaussian elimination; that is, we scale and add rows together to eliminate elements below the main diagonal of A. We do no pivoting; in other words, we do not shuffle rows to bring the largest element in a row or column to the diagonal. However, we might perform diagonal boosting to improve numerical stability; that is, we might bump the value of the diagonal element slightly if we consider it to be too small (or zero).
If we performed the above algorithm in dense storage, we would simply compute the LU-factorization without pivoting (ignore the diagonal boosting for now). However, in sparse storage the algorithm is more complex. Notice that as we proceed with Gaussian elimination, the factors L and U will have more elements than A, because as we scale and add two rows the number of elements in the resulting row will be the union of the elements of both rows. Therefore, in order to compute the LU-factorization we first need to estimate the required storage, allocate it and compute all the extra elements, which can be quite computationally expensive.
The incomplete-LU factorization simply drops the extra elements created during the Gaussian elimination process. Different heuristics for dropping the elements result in different types of incomplete-LU, such as ILU0, ILUT and ILU(p). We focus on the incomplete-LU with 0 fill-in (ILU0), where all extra elements outside of the sparsity pattern of the original matrix are dropped.
On one hand, in order to parallelize this algorithm, we can analyze the dependencies between rows, and find out which rows can be processed in parallel. For example, for a given coefficient matrix A,
the dependencies between rows are represented by the directed acyclic graph (DAG) in Figure 4.
Notice that it is identical to the graph on which we performed graph coloring in Figure 2. Here the independent rows are organized into levels and this parallel approach is called level scheduling.
On the other hand, we can explicitly reorder the matrix a priori based on graph coloring (such that nodes with the same color are ordered next to each other) and only then analyze it for parallelism. The reordered matrix based on the coloring from the previous section is shown below.
It turns out that reordering based on graph coloring exposes additional parallelism, making the dependency DAG shorter and wider. This means fewer colors, which means fewer sequential stages, each with greater parallelism.
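To see why a shorter, wider DAG helps, recall that level scheduling assigns each row a level equal to the length of the longest dependency chain ending at it; rows on the same level are processed in parallel, and the number of distinct levels is the number of sequential stages. A small Python sketch with hypothetical dependency DAGs (not the matrix from the post):

```python
def levels(deps):
    """deps[i] = set of rows that row i depends on (a DAG).
    Returns the level of each row: rows on one level are independent."""
    memo = {}
    def level(i):
        if i not in memo:
            memo[i] = 0 if not deps[i] else 1 + max(level(j) for j in deps[i])
        return memo[i]
    return {i: level(i) for i in deps}

def n_stages(deps):
    # number of sequential stages = number of distinct levels
    return len(set(levels(deps).values()))

# a chain-like DAG: 0 <- 1 <- 2 <- 3, i.e. 4 sequential stages
chain = {0: set(), 1: {0}, 2: {1}, 3: {2}}
# a wider DAG over the same rows: 1, 2, 3 all depend only on 0
wide = {0: set(), 1: {0}, 2: {0}, 3: {0}}

assert n_stages(chain) == 4  # tall and narrow: little parallelism
assert n_stages(wide) == 2   # short and wide: rows 1-3 run in parallel
```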
In fact, for realistic matrices we can have up to 100 times more rows per level after performing reordering based on graph coloring, as Figure 6 illustrates.
Ultimately, as a result of graph coloring we are able to better utilize the GPU. Indeed, we can now take full advantage of its memory bandwidth because we have exposed enough parallelism in our problem. We show the resulting improvement in performance on a sample set of matrices in Fig. 7, where we have used the coloring algorithm implemented in the cuSPARSE library csrcolor()
routine. The experiments were performed with CUDA Toolkit 7.0 on Ubuntu 14.04 LTS, on an NVIDIA Tesla K40c GPU Accelerator. For more details please refer to the technical report [1].
Finally, note that we have used approximate graph coloring, and therefore could potentially further improve the performance of ILU0 with better heuristics or re-coloring techniques.
Graph coloring is a general technique that can enable greater parallelism to be extracted from a problem. As an example, I have shown that reordering matrix rows based on graph coloring can provide a significant speedup of the incomplete-LU factorization algorithm on the GPU. Also, I have shown that parallel approximate graph coloring itself is well suited for the GPU. In fact, the algorithm parallelizes extremely well and can be adapted to a variety of problems by using different heuristics.
[1] M. Naumov, P. Castonguay, J. Cohen, “Parallel Graph Coloring with Applications to the Incomplete-LU Factorization on the GPU”, NVIDIA Research Technical Report, May 2015.
]]>Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analysis using the R statistical computing environment. Often these programs have a very long run time. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware.
In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-core and multi-node parallelism can reduce run times. In this blog I’ll examine another way to leverage parallelism in R, harnessing the processing cores in a general-purpose graphics processing unit (GPU) to dramatically accelerate commonly used clustering algorithms in R. The most widely used GPUs for GPU computing are the NVIDIA Tesla series. A Tesla K40 GPU has 2,880 integrated cores, 12 GB of memory with 288 GB/sec of bandwidth delivering up to 5 trillion floating point calculations per second.
The examples in this post build on the excellent work of Mr. Chi Yau available at r-tutor.com. Chi is the author of the CRAN open-source rpud
package as well as rpudplus
, R libraries that make it easy for developers to harness the power of GPUs without programming directly in CUDA C++. To learn more about R and parallel programming with GPUs you can download Chi’s e-book. For illustration purposes, I’ll focus on an example involving distance calculations and hierarchical clustering, but you can use the rpud package to accelerate a variety of applications.
Cluster analysis, or clustering, is the process of grouping objects such that objects in the same cluster are more similar (by a given metric) to each other than to objects in other clusters. Cluster analysis is a problem with significant parallelism. In a post on the Teraproc blog we showed an example of clustering analysis using k-means. In this post we’ll look at hierarchical clustering in R with hclust
, a function that makes it simple to create a dendrogram (a tree diagram as in Figure 1) based on differences between observations. This type of analysis is useful in all kinds of applications from taxonomy to cancer research to time-series analysis of financial data.
Similar to our k-means example, grouping observations in a hierarchical fashion depends on being able to quantify the differences (or distances) between observations. This means calculating the Euclidean distance between pairs of observations (think of this as the Pythagorean Theorem extended to more dimensions). Chi Yau explains this in his two posts Distance Matrix by GPU and Hierarchical Cluster Analysis, so we won’t attempt to cover all the details here.
R’s hclust
function accepts a matrix of previously computed distances between observations. The dist
function in R computes the difference between rows in a dataset, supporting multiple methods including Euclidean distance (the default). If I have a set of M observations (rows), each with N attributes (columns), for each distance calculation I need to compute the length of a vector in N-dimensional space between the observations. There are M(M-1)/2 discrete distance calculations between all pairs of rows. Thus, computation scales as the square of the number of observations: for 10 observations I need 45 distance calculations, for 100 observations I need 4,950, and for 100,000 observations I need 4,999,950,000 (almost 5 billion) distance calculations. As you can see, dist
can get expensive for large datasets.
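The pair count behind these figures is just M(M-1)/2, which the quoted numbers confirm; a quick Python check:

```python
def n_pairs(m):
    # number of distinct row pairs, i.e. distance calculations dist() performs
    return m * (m - 1) // 2

# the counts quoted in the text
assert n_pairs(10) == 45
assert n_pairs(100) == 4_950
assert n_pairs(100_000) == 4_999_950_000
```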
Before I can start running applications, first I need access to a system with a GPU. Fortunately, for a modest price I can rent a machine with GPUs for a couple of hours. In preparing this blog I used two hours of machine time on the Teraproc R Analytics Cluster-as-a-Service. The service leverages Amazon EC2 and the total cost for machine time was $1.30; quite a bit cheaper than setting up my own machine! The reason I was able to use so little time is because the process of installing the cluster is fully automated by the Teraproc service. Teraproc’s R Cluster-as-a-Service provides CUDA, R, R Studio and other required software components pre-installed and ready to use. OpenLava and NFS are also configured automatically, giving me the option to extend the cluster across many GPU capable compute nodes and optionally use Amazon spot pricing to cut costs.
I deployed a one-node cluster on Teraproc.com using the Amazon g2.2xlarge machine type as shown below. I could have installed the g2.2xlarge instance myself from the Amazon EC2 console, but in this case I would have needed to install R, R Studio and configure the environment myself resulting in spending more time and money. You can learn how to set up an R cluster yourself on different node types (including free machines) at the Teraproc R Analytics Cluster-as-a-Service website. If you already have an Amazon EC2 account you can set up a cluster in as little as five minutes.
The g2.2xlarge machine instance is a Sandy Bridge based machine with 8 cores / vCPUs on a Xeon E5-2670 processor, 15 GB of memory, a solid-state disk drive and an NVIDIA GRID K520 GPU. The on-demand price for this machine is $0.65 per hour. The NVIDIA GRID K520 has two GK104 graphics processors, each with 1,536 cores, on a single card with 8 GB of RAM.
First we use the teraproc.com R-as-a-cluster service to provision the R environment, making sure that we select the correct machine type (g2.2xlarge) and install a one-node cluster, as Figure 1 shows. This automatically deploys a single-node cluster complete with R Studio and provides us with a URL to access the R Studio environment.
Using the shell function within R Studio (under the Tools menu), I can run an operating system command to make sure that the GPU is present on the machine.
gordsissons@ip-10-0-93-199:~$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
To use the rpud
package to access GPU functions we need to install it first. Run the command below from inside R-Studio to install rpud
from CRAN.
> install.packages("rpud")
Next, use the library
command to access rpud
functions.
> library(rpud)
Rpudplus 0.5.0 http://www.r-tutor.com
Copyright (C) 2010-2015 Chi Yau. All Rights Reserved.
Rpudplus is free for academic use only. There is absolutely NO warranty.
If everything is working properly, we should be able to see the GPU on our Amazon instance from the R command prompt by calling rpuGetDevice
.
> rpuGetDevice()
GRID K520 GPU
[1] 0
The following listing shows a sample R program that compares performance of a hierarchical clustering algorithm with and without GPU acceleration. The first step is to create a suitable dataset where we can control the number of observations (rows) as well as the number of dimensions (columns) for each observation. The test.data
function returns a matrix of random values based on the number of rows and columns provided.
The run_cpu
function calculates all the distances between the observations (rows) using R’s dist
function, and then runs R’s native hclust
function against the computed distances stored in dcpu
to create a dendrogram. The run_gpu
function performs exactly the same computations using the GPU-optimized versions of dist
and hclust
(rpuDist
and rpuHclust
) from the rpud
package.
The R script creates a matrix m of a particular size by calling test.data
and then measures and displays the time required to create a hierarchical cluster using both the CPU and GPU functions.
library("rpud")

#
# function to populate a data matrix
#
test.data <- function(dim, num, seed=10) {
    set.seed(seed)
    matrix(rnorm(dim * num), nrow=num)
}

run_cpu <- function(matrix) {
    dcpu <- dist(matrix)
    hclust(dcpu)
}

run_gpu <- function(matrix) {
    dgpu <- rpuDist(matrix)
    rpuHclust(dgpu)
}

#
# create a matrix with 20,000 observations each with 100 data elements
#
m <- test.data(100, 20000)

#
# Run dist and hclust to calculate hierarchical clusters using CPU
#
print("Calculating hclust with Sandy Bridge CPU")
print(system.time(cpuhclust <- run_cpu(m)))

#
# Run dist and hclust to calculate hierarchical clusters using GPU
#
print("Calculating hclust with NVIDIA K520 GPU")
print(system.time(gpuhclust <- run_gpu(m)))
Running the script yields the following results:
> source('~/examples/rgpu_hclust.R')
[1] "Calculating hclust with Sandy Bridge CPU"
   user  system elapsed
294.760   0.746 295.314
[1] "Calculating hclust with NVIDIA K520 GPU"
   user  system elapsed
 19.285   3.160  22.431
To explore the GPU vs. CPU speedup, we ran the script on datasets with a varying number of rows and plotted the results. The distance calculation is highly parallel on the GPU, while much of the GPU-optimized hclust
calculation runs on the CPU. For this reason the calculation scales well as the dataset gets larger and the time required for the distance calculations dominates.
Number of rows | Number of dimensions | Total elements | # distance calculations | CPU time (seconds) | GPU time (seconds) | Speed-up |
--- | --- | --- | --- | --- | --- | --- |
1,000 | 100 | 100,000 | 1,998,000 | 0.50 | 0.04 | 11.8 |
2,000 | 100 | 200,000 | 7,996,000 | 2.06 | 0.17 | 12.1 |
5,000 | 100 | 500,000 | 49,990,000 | 13.42 | 1.17 | 11.5 |
10,000 | 100 | 1,000,000 | 199,980,000 | 59.83 | 5.03 | 11.9 |
15,000 | 100 | 1,500,000 | 449,970,000 | 141.15 | 11.61 | 12.2 |
20,000 | 100 | 2,000,000 | 799,960,000 | 295.31 | 22.43 | 13.2 |
Looking at the run times side by side, we see that running the multiple steps with the GPU is over ten times faster than running on the CPU alone.
The result of our analysis is a hierarchical cluster that we can display as a dendrogram like Figure 1 using R’s plot
command.
> plot(gpuhclust,hang = -1)
Our results clearly show that running this type of analysis on a GPU makes a lot of sense. Not only can we complete calculations ten times faster, but just as importantly we can reduce the cost of resources required to do our work. We can use these efficiencies to do more thorough analysis and explore more scenarios. By using the Teraproc service, we make GPU computing much more accessible to R programmers who may not otherwise have access to GPU-capable nodes.
In a future post we’ll show how you can tackle very large analysis problems with clusters of GPU-capable machines. Try out Teraproc R Analytics Cluster-as-a-Service today! To learn about other ways to accelerate your R code with GPUs, check out the post Accelerate R Applications with CUDA by NVIDIA’s Patric Zhao.
]]>Neural machine translation is a recently proposed framework for machine translation based purely on neural networks. This post is the first of a series in which I will explain a simple encoder-decoder model for building a neural machine translation system [Cho et al., 2014; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013]. In a later post I will describe how an attention mechanism can be incorporated into the simple encoder-decoder model [Bahdanau et al., 2015], leading to the state-of-the-art machine translation model for a number of language pairs including En-Fr, En-De, En-Tr and En-Zh [Gulcehre et al., 2015; Jean et al., 2015]. Furthermore, I will introduce recent work which has applied this framework of neural machine translation to image and video description generation [Xu et al., 2015; Li et al., 2015].
First, let’s start with a brief overview of machine translation. In fact, the name, machine translation, says everything. We want a machine to translate text in one language, which we will call the source sentence, to corresponding text in another language, which we call the target sentence. (Although ideally the machine should be able to translate a whole document from one language to another, let us concentrate in this blog post on sentence-level machine translation.)
There are multiple ways to build such a machine that can translate languages. For instance, we can ask a bilingual speaker to give us a set of rules transforming a source sentence into a correct translation. This is not a great solution, as you can imagine, because we don’t even know the set of rules underlying a single language, not to mention the rules underlying a pair of languages. It is simply hopeless to write an exhaustive set of rules for translating a source sentence into a correct translation. Hence, in this blog post, we focus on a statistical approach where those rules, either implicitly or explicitly, are automatically extracted from a large corpus of text.
This statistical approach to machine translation is called statistical machine translation. The goal is the same (build a machine that translates a sentence from one language to another), but we let the machine learn from data how to translate rather than design a set of rules for the machine (See Fig. 1 for a graphical illustration.) Learning is based on statistical methods, which should sound familiar to anyone who has taken a basic course on machine learning. In fact, statistical machine translation is nothing but a particular application of machine learning, where the task is to find a function that maps from a source sentence to a corresponding target.
One important characteristic of machine translation is that the target (translation) function is neither one-to-one nor many-to-one as in many other applications of machine learning (such as classification, which is many-to-one), but one-to-many, in the sense that one source sentence can be translated into many possible translations. Because of this, we model the translation function not as a deterministic function but as a conditional probability P(y|x) of a target sentence y (translation) given a source sentence x. The conditional probability may assign equally high probability to more than one well-separated configuration/sentence, leading to a one-to-many relationship between source and target sentences.
Now let’s say you want to build a statistical machine translation system that translates a source sentence in English to a sentence in French. The first and probably most important job is to collect pairs of source sentences and their corresponding translations. I will use x^n and y^n to represent a source sentence and its corresponding translation, respectively. The superscript n means that it’s the n-th pair in a set of many more pairs (often, we need tens to hundreds of thousands of pairs to train a good translation model.) I’ll use D = {(x^1, y^1), …, (x^N, y^N)} to denote the data set with N pairs.
Where can I get these training pairs? For widely used languages in machine translation, you probably want to check out the Workshop on Statistical Machine Translation or the International Workshop on Spoken Language Translation.
With the training data in hand, we can now score a model by looking at how well the model works on the training data D. The score, which I’ll call the log-likelihood of the model, is the average of the log-likelihood of the model on each pair (x^n, y^n). With the probabilistic interpretation of the machine translation model, the log-likelihood of the model on each pair is simply how high a log-probability the model assigns to the pair: log P(y^n | x^n, θ), where θ is a set of parameters that defines the model. Then, the overall score of the model on the training data is

L(θ) = (1/N) Σ_{n=1}^{N} log P(y^n | x^n, θ).
If the log-likelihood is low, the model is not giving enough probability mass to the correctly translated pairs, meaning that it’s wasting its probability mass on some wrong translations. Thus, we want to find a configuration of the model, or the values of the parameters θ, that maximizes this log-likelihood, or score.
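Concretely, the score is the mean of the log-probabilities the model assigns to the training pairs, so a model that puts more mass on the correct translations scores higher. A toy Python sketch (the probabilities are invented for illustration):

```python
import math

def log_likelihood(pair_probs):
    # average log-probability the model assigns to each (source, target) pair
    return sum(math.log(p) for p in pair_probs) / len(pair_probs)

# hypothetical models: A assigns more mass to the correct translations than B
model_a = [0.5, 0.4, 0.6]
model_b = [0.1, 0.2, 0.1]

assert log_likelihood(model_a) > log_likelihood(model_b)
# a model that assigned probability 1 to every pair would score 0, the maximum
assert abs(log_likelihood([1.0, 1.0, 1.0])) < 1e-12
```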
In machine learning, this is known as a maximum likelihood estimator. But we’re left with a perhaps more important question: how do we model P(y|x)?
This question of how to model the conditional distribution P(y|x) has been asked and answered for a long time, starting more than 20 years ago at IBM T.J. Watson Research Center [Brown et al., 1993 and references therein]. The core of research on statistical machine translation (SMT) since then has been a log-linear model, where we approximate the logarithm of the true P(y|x) with a linear combination of many features:

log P(y|x) ≈ Σ_i f_i(x, y) w_i + C,

where C is the normalization constant. In this case, a large part of the research comes down to finding a good set of feature functions f_i, and there is a very well-written textbook that covers all the details about it [Koehn, 2009].
In this approach to statistical machine translation, often the only thing left to machine learning is to find a nice set of coefficients w_i that balance among different features, or to filter/re-rank a set of potential translations decoded from the log-linear model [Schwenk, 2007]. More specifically, neural networks have been used both as a part of the feature functions and to re-rank so-called n-best lists of possible translations, as in the middle and right panels of Fig. 2.
In this blog post, on the other hand, I focus on a recently proposed approach, called neural machine translation, where machine learning, and more specifically a neural network, has more or even full control, as in the left panel of Fig. 2. As is usual with general deep learning, neural machine translation (NMT) does not rely on pre-designed feature functions. (By pre-designed feature functions, I mean those that are not learned.) Rather, the goal of NMT is to design a fully trainable model of which every component is tuned based on training corpora to maximize its translation performance.
A fully trainable NMT model starts from as raw a representation of a source sentence as possible and finishes by generating as raw a representation of a target sentence as possible. Here, let’s consider a sequence of words as the most raw representation of a sentence. (This is not true for most natural languages, but without loss of generality, I will consider a word the smallest unit.) Each word in a sequence is represented by its integer index in a vocabulary. For instance, in the vocabulary of English sorted according to frequency, “the” will be the first word, represented as an integer 1. Let me use X = (x_1, x_2, …, x_T) to denote a source sentence, and Y = (y_1, y_2, …, y_T′) a target sentence.
Given a source sequence X of word indices, the NMT model computes the conditional probability P(Y|X). Next I’ll discuss how we can build a neural network to approximate this conditional probability P(Y|X).
One important property of machine translation, or any task based on natural languages, is that we deal with variable-length input X and output Y. In other words, the lengths T and T′ are not fixed.
To deal with these types of variable-length input and output, we need to use a recurrent neural network (RNN). Widely used feed-forward neural networks, such as convolutional neural networks, do not maintain internal state other than the network’s own parameters. Whenever a single sample is fed into a feed-forward neural network, the network’s internal state, or the activations of the hidden units, is computed from scratch and is not influenced by the state computed from the previous sample. On the other hand, an RNN maintains its internal state while reading a sequence of inputs, which in our case will be a sequence of words, thereby being able to process an input of any length.
Let me explain this in more detail. The main idea behind RNNs is to compress a sequence of input symbols into a fixed-dimensional vector by using recursion. Assume at step t that we have a vector h_{t-1} which is the history of all the preceding symbols. The RNN will compute the new vector, or its internal state, h_t, which compresses all the preceding symbols as well as the new symbol x_t by

h_t = φ_θ(x_t, h_{t-1}),

where φ_θ is a function parametrized by θ which takes as input the new symbol x_t and the history h_{t-1} up to the (t-1)-th symbol. Initially, we can safely assume that h_0 is an all-zero vector.
The recurrent activation function φ is often implemented as, for instance, a simple affine transformation followed by an element-wise nonlinear function:

h_t = tanh(W x_t + U h_{t-1} + b).

In this formulation, the parameters θ include the input weight matrix W, the recurrent weight matrix U and the bias vector b. I must say that this is not the only possibility, and there is a very large room for designing a novel recurrent activation function. See Fig. 3 for some examples from [Pascanu et al., 2014].
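A minimal pure-Python sketch of this recurrence, with tiny hand-picked weights (all values are illustrative; a real model learns W, U and b):

```python
import math

def rnn_step(W, U, b, x, h):
    # h_new = tanh(W x + U h + b), with vectors/matrices as plain lists
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                      + sum(U[i][j] * h[j] for j in range(len(h)))
                      + b[i])
            for i in range(len(b))]

# 2-dimensional input and state; arbitrary illustrative weights
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2,  0.0], [0.0, 0.2]]
b = [0.0, 0.1]

h = [0.0, 0.0]  # h_0 is an all-zero vector
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:  # a length-3 input sequence
    h = rnn_step(W, U, b, x, h)  # state compresses the history so far

# tanh keeps every component of the state strictly inside (-1, 1)
assert len(h) == 2 and all(-1.0 < v < 1.0 for v in h)
```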
This simple type of RNN can be implemented very easily using (for instance) Theano, which allows your RNN to be run on either the CPU or the GPU transparently. See Recurrent Neural Networks with Word Embeddings; note that the whole RNN code is written in less than 10 lines!
Recently, it has been observed that it is better, or easier, to train a recurrent neural network with more sophisticated activation functions such as long short-term memory units [Hochreiter and Schmidhuber, 1997] and gated recurrent units [Cho et al., 2014].
As was the case with the simple recurrent activation function, the parameters here include the input weight matrices W, W_z and W_r, the recurrent weight matrices U, U_z and U_r, and the bias vectors b, b_z and b_r.
Although these units look much more complicated than the simple RNN, the implementation with Theano, or any other deep learning framework, such as Torch, is just as simple. For instance, see LSTM Networks for Sentiment Analysis (example code).
I have explained a recurrent neural network (RNN) as a history compressor, but it can also be used to probabilistically model a sequence. Here, by probabilistically modeling a sequence I mean a machine learning model that computes the probability p(X) of any given sequence X = (x_1, x_2, …, x_T). How can we formulate p(X) such that it can be written as a recurrence?
Let’s start by rewriting p(X) = p(x_1, x_2, …, x_T) into

p(X) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_T | x_1, …, x_{T-1}),

which comes from the definition of conditional probability, p(a | b) = p(a, b) / p(b). From this, we can make a recursive formula such that

p(x_1, x_2, …, x_t) = p(x_1, x_2, …, x_{t-1}) p(x_t | x_1, …, x_{t-1}).
Now, we let an RNN model p(x_t | x_1, …, x_{t-1}) at each time t by

p(x_t | x_1, …, x_{t-1}) = g_θ(h_{t-1}).

The RNN outputs a probability distribution over the next symbol, conditioned on the whole history up to the (t-1)-th symbol via its internal state h_{t-1}. In other words, at each time step, the RNN tries to predict the next symbol given the history of the input symbols.
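The chain-rule factorization is easy to verify numerically: if each next-symbol conditional is a valid distribution, multiplying them along a sequence yields a valid distribution over whole sequences. A toy Python sketch, with a hand-written stand-in for the RNN’s next-symbol distribution (the vocabulary and probabilities are invented for illustration):

```python
from itertools import product

VOCAB = ["a", "b"]

def p_next(history):
    # toy stand-in for the RNN's g(h_{t-1}): a distribution over the next
    # symbol given the history (here: a slight preference to repeat "a")
    if history and history[-1] == "a":
        return {"a": 0.7, "b": 0.3}
    return {"a": 0.4, "b": 0.6}

def p_sequence(seq):
    # chain rule: p(x_1, ..., x_T) = prod_t p(x_t | x_1, ..., x_{t-1})
    p, history = 1.0, []
    for sym in seq:
        p *= p_next(history)[sym]
        history.append(sym)
    return p

# the conditionals define a valid distribution over whole sequences:
# probabilities of all length-3 sequences sum to 1
total = sum(p_sequence(seq) for seq in product(VOCAB, repeat=3))
assert abs(total - 1.0) < 1e-12
```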
There are a lot of interesting properties and characteristics of recurrent neural networks that I would love to spend hours talking about, but I will have to stop here for this blog post, since what I have described so far is all you need to start building a neural machine translation system. For those who are more interested in recurrent neural networks, I suggest you read the following papers. Obviously, this list is not exhaustive. Or, you can also check out my slides on how to use recurrent neural networks for language modeling.
In this post, I introduced machine translation, and described how statistical machine translation approaches the problem of machine translation. In the framework of statistical machine translation, I have discussed how neural networks can be used to improve the overall translation performance.
The goal of this blog series is to introduce a novel paradigm for neural machine translation; this post laid the groundwork, concentrating on two key capabilities of recurrent neural networks: sequence summarization and probabilistic modeling of sequences.
Based on these two properties, in the next post, I will describe the actual neural machine translation system based on recurrent neural networks. I’ll also show you why GPUs are so important for Neural Machine Translation! Stay tuned.
]]>Today software companies use frameworks such as .NET to target multiple platforms from desktops to mobile phones with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the task parallel library, they face a bigger challenge using GPUs to tackle compute intensive tasks. To accelerate .NET applications with GPUs, developers must write functions in CUDA C/C++ and write or generate code to interoperate between .NET and CUDA C/C++.
Alea GPU closes this gap by bringing GPU computing directly into the .NET ecosystem. With Alea GPU you can write GPU functions in any .NET language you like, compile with your standard .NET build tool and accelerate it with a GPU. Alea GPU offers a full implementation of all CUDA features, and code compiled with Alea GPU performs as well as equivalent CUDA C/C++ code.
Alea GPU is a professional CUDA development stack for .NET and Mono built directly on top of the NVIDIA compiler toolchain. Alea GPU offers the following benefits:
You can easily install Alea GPU as a Nuget package, as Figure 1 shows.
Alea GPU is easy to use for all kinds of parallel problems. Developers can write GPU code in any .NET language and use the full set of CUDA device functions provided by NVIDIA LibDevice, as well as CUDA device parallel intrinsic functions, such as thread synchronization, warp vote functions, warp shuffle functions, and atomic functions. Let’s consider a simple example which applies the same calculation to many data values. SquareKernel
is a GPU kernel written in C# that accesses memory on the GPU.
static void SquareKernel(deviceptr<double> outputs, deviceptr<double> inputs, int n)
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < n; i += stride)
    {
        outputs[i] = inputs[i] * inputs[i];
    }
}
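The indexing in SquareKernel is the common grid-stride loop pattern: each thread starts at its global index and strides by the total number of threads in the grid, so any grid size covers all n elements, each exactly once. Below is a plain C sketch of the same logic, with fixed stand-in values for the CUDA built-ins blockDim.x and gridDim.x (the names square_thread and square_all are hypothetical, for illustration only):

```c
#define BLOCK_DIM 4   /* stand-in for CUDA's blockDim.x */
#define GRID_DIM  2   /* stand-in for CUDA's gridDim.x  */

/* Work done by one simulated thread: square the elements it owns. */
static void square_thread(double *outputs, const double *inputs, int n,
                          int block_idx, int thread_idx)
{
    int start  = block_idx * BLOCK_DIM + thread_idx;
    int stride = GRID_DIM * BLOCK_DIM;
    for (int i = start; i < n; i += stride)
        outputs[i] = inputs[i] * inputs[i];
}

/* Run every simulated thread in turn; on the GPU these run in parallel. */
static void square_all(double *outputs, const double *inputs, int n)
{
    for (int b = 0; b < GRID_DIM; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            square_thread(outputs, inputs, n, b, t);
}
```

Because the stride equals the total thread count, the loop works for any n, including sizes that are not a multiple of the grid size.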
Alea GPU kernels require no special attribution and have access to the full CUDA semantics. Invoking a CUDA kernel requires configuring the thread block and grid layout, transferring data to device memory, and launching the kernel. The above SquareKernel
GPU function can be launched as shown in the following code.
static double[] SquareGPU(double[] inputs)
{
    var worker = Worker.Default;
    using (var dInputs = worker.Malloc(inputs))
    using (var dOutputs = worker.Malloc<double>(inputs.Length))
    {
        const int blockSize = 256;
        var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
        var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
        var lp = new LaunchParam(gridSize, blockSize);
        worker.Launch(SquareKernel, lp, dOutputs.Ptr, dInputs.Ptr, inputs.Length);
        return dOutputs.Gather();
    }
}
When we call worker.Launch
, Alea GPU Just-In-Time (JIT) compiles the kernel function SquareKernel
, loads it into the worker
and executes it on the GPU attached to the worker.
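One small helper in the launch configuration above, Common.divup, is a ceiling division: it computes the smallest number of blocks of the given size that covers all inputs. A C equivalent (the name divup here is just an illustrative stand-in for Alea's helper):

```c
/* Ceiling division: smallest k such that k * b >= n (for positive n, b). */
static int divup(int n, int b)
{
    return (n + b - 1) / b;
}
```

For example, with inputs.Length = 1000 and blockSize = 256, divup yields 4 blocks; 3 full blocks would cover only 768 elements.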
The JIT compilation workflow is extremely flexible. It allows code generation and execution on the fly, enabling GPU scripting and rapid prototyping. JIT compilation is also very useful for application scenarios where the algorithms depend on runtime information. JIT compilation adds a small start-up time overhead and requires deployment of the Alea GPU compiler along with the application.
An alternative is Ahead-Of-Time (AOT) compilation. For kernel functions tagged with the attribute AOTCompile
, the Alea GPU compiler generates PTX code at compile time and embeds it into the assembly as a binary resource.
[AOTCompile]
static void SquareKernel(deviceptr<double> outputs, deviceptr<double> inputs, int n)
...
AOT compilation saves run-time compilation overhead and simplifies deployment because only the Alea GPU runtime components need to be installed. More details about JIT and AOT compilation can be found in the Alea GPU manual.
Another benefit of GPU development in .NET is that all GPU resources are managed, thus simplifying development and leading to more robust code. For example, all memory objects allocated through a Worker
instance are disposable. The using
statement
using (var dOutputs = worker.Malloc<double>(inputs.Length)) { ... }
is a convenient syntax that ensures the correct use of IDisposable
objects, providing a clean and safe mechanism for releasing unmanaged resources. You can find more details in the Alea GPU tutorial.
Alea GPU is fully cross-platform. The code is compiled on one platform and the resulting assembly is binary compatible with all other platforms. Alea GPU supports Windows, Linux, Mac OS X and is also tested on the ARM based Tegra development kits.
In combination with other .NET libraries, impressive cross-platform GPU-accelerated applications with sophisticated user interfaces or graphics visualization can be developed. The n-body simulation (Figure 2) in the Alea GPU tutorial is an example which uses OpenGL through OpenTK to display the simulation results. Its code base is 100% cross-platform.
Developing high-performance generic GPU kernels for basic parallel primitives such as scan, reduce, sort or linear algebra codes for parallelized matrix multiplication or linear system solving is challenging and time-consuming.
Alea GPU offers productivity gains in the form of a range of GPU algorithms and integrated libraries such as cuBLAS and cuDNN. These library interfaces are fully type-safe, and library functions can be mixed seamlessly with custom GPU kernels developed in .NET as both rely on the same memory management and data types for GPU memory and GPU pointers.
Alea GPU provides a rich set of device-side functions and advanced CUDA features which are useful for creating sophisticated GPU algorithms, including device intrinsics such as __ballot and __atomic_add, and the LibDevice functions exposed through LibDeviceEx.
Alea GPU is flexible enough to handle complex CUDA code found in some advanced CUDA C++ libraries. A good example is the CUB library of generic GPU parallel algorithm primitives. We have ported a subset of the CUB primitives to .NET using Alea GPU and made them available in Alea Unbound. Here is an example of how to use the device-level sum scan primitive in C#:
public static void DeviceScanInclusive()
{
    const int numItems = 1000000;
    var rng = new Random(42);
    var inputs = Enumerable.Range(0, numItems).Select(i => rng.Next(-10, 10)).ToArray();
    var gpuScanModule = DeviceSumScanModuleI32.Default;
    using (var gpuScan = gpuScanModule.Create(numItems))
    using (var dInputs = gpuScanModule.GPUWorker.Malloc(inputs))
    using (var dOutputs = gpuScanModule.GPUWorker.Malloc<int>(inputs.Length))
    {
        gpuScan.InclusiveScan(dInputs.Ptr, dOutputs.Ptr, numItems);
        var actual = dOutputs.Gather();
        Assert.AreEqual(actual, inputs.ScanInclusive(0, (a, b) => a + b).ToArray());
    }
}
The generic scan primitive Primitives.DeviceScanModule<T>
expects the binary operator to be used in the scan process as a delegate Func<T, T, T>
:
public static void DeviceGenericScanInclusive()
{
    const int numItems = 1000000;
    var rng = new Random(42);
    var inputs = Enumerable.Range(0, numItems).Select(i => rng.Next(-10, 10)).ToArray();
    Func<int, int, int> scanOp = Math.Max;
    var zero = int.MinValue; // identity element for Math.Max over int
    using (var gpuScanModule = new Primitives.DeviceScanModule<int>(GPUModuleTarget.DefaultWorker, scanOp))
    using (var gpuScan = gpuScanModule.Create(numItems))
    using (var dInputs = gpuScanModule.GPUWorker.Malloc(inputs))
    using (var dOutputs = gpuScanModule.GPUWorker.Malloc<int>(inputs.Length))
    {
        gpuScan.InclusiveScan(dInputs.Ptr, dOutputs.Ptr, numItems);
        var actual = dOutputs.Gather();
        Assert.AreEqual(actual, inputs.ScanInclusive(zero, scanOp).ToArray());
    }
}
Following the design of CUB, Alea Unbound has warp, block and device-wide primitives. The warp- and block-wide primitives can be used in kernels as convenient plugin components to write new algorithms. Alea Unbound algorithms deliver the same performance as the CUB CUDA C/C++ counterparts. They are all implemented in F# using warp shuffle or shared memory with union storage for optimal shared memory use.
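For reference, the semantics of the inclusive sum scan these primitives implement can be stated in a few lines of sequential C; the GPU versions compute the same result in parallel (this reference code is illustrative and not part of Alea Unbound):

```c
/* Inclusive scan with +: out[i] = in[0] + in[1] + ... + in[i]. */
static void inclusive_scan(const int *in, int *out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}
```

Replacing + with any associative binary operator (such as max, as in the generic example above) gives the generic scan.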
Besides the primitive algorithms, Alea Unbound also provides fast implementations of matrix multiplication, matrix transpose, random number generators, statistical functions and some linear system solvers.
Alea GPU provides first class tools for coding, debugging and profiling which are fully integrated into Visual Studio. GPU kernels developed with Alea GPU can be debugged on Windows with the NVIDIA Nsight Visual Studio Debugger.
To support debugging and profiling, Alea GPU has three compilation levels: Optimized
, Profiling
and Diagnostic
:
Level | Description | Profiling | Debugging
---|---|---|---
Optimized | No source line information nor variable metadata | No | No
Profiling | Source line information but no variable metadata | Yes | No
Diagnostic | Source line information and variable metadata | Yes | Yes
The Nsight Visual Studio debugger allows breakpoints to be set directly in Alea GPU source code, even in F# code quotations. The full range of standard debugging features is available, such as memory inspection, local variable values, and memory checks, as Figure 3 shows. Debugging functionality is based on LLVM debug metadata generated by Alea GPU.
The compilation level Profiling
also supports source code correlation, as Figure 4 shows.
The Alea GPU tutorial has detailed explanations about how to debug and profile GPU-accelerated .NET applications.
Developing GPU algorithms is often an iterative process. Usually many variations have to be explored to fine-tune the algorithm. GPU scripting and rapid prototyping greatly improve productivity and encourage the developer to thoroughly investigate the efficiency of the code.
Alea GPU is the only solution that can deliver GPU scripting and a REPL in Visual Studio for interactive prototyping of GPU code. F# code can be directly sent to the F# interactive console for execution. The JIT compilation mode of Alea GPU allows for the execution of F# GPU code on the fly in the F# interactive console, as Figure 5 shows.
GPU code can also be embedded in F# scripts which are then executed with fsi.exe
as ordinary scripts on the console.
The example above illustrates the most basic usage of Alea GPU. It uses plain functions or methods for GPU code and a separate function for memory management and kernel execution and is therefore very well-suited for simple applications. For more complex problems Alea GPU offers two alternatives: Class instances and Workflows.
Class instances use a class derived from GPUModule
or ILGPUModule
to manage all GPU resources. CUDA compile-time arguments can be supplied to the constructor. This allows for the creation of advanced kernels using generics, such as the following example generic map kernel.
internal class TransformModule<T> : ILGPUModule
{
    private readonly Func<T, T> op;

    public TransformModule(GPUModuleTarget target, Func<T, T> opFunc)
        : base(target)
    {
        op = opFunc;
    }

    [Kernel]
    public void Kernel(int n, deviceptr<T> x, deviceptr<T> y)
    {
        var start = blockIdx.x * blockDim.x + threadIdx.x;
        var stride = gridDim.x * blockDim.x;
        for (var i = start; i < n; i += stride)
            y[i] = op(x[i]);
    }

    ...
}
Workflows specify all GPU resources and kernels in composable cuda {...}
workflow blocks. This exposes the full expressive power of Alea GPU and is very well suited for scripting. This feature is only available in F#.
let template (transform:Expr<int -> int -> int>) = cuda {
    let! kernel =
        <@ fun (z:deviceptr<int>) (x:deviceptr<int>) (y:deviceptr<int>) (n:int) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                z.[i] <- (%transform) x.[i] y.[i]
                i <- i + stride @>
        |> Compiler.DefineKernel

    return Entry(fun program ->
        let worker = program.Worker
        let kernel = program.Apply kernel
        let lp = LaunchParam(16, 256)
        let run (x:int[]) (y:int[]) =
            let n = x.Length
            use x = worker.Malloc(x)
            use y = worker.Malloc(y)
            use z = worker.Malloc(n)
            kernel.Launch lp z.Ptr x.Ptr y.Ptr n
            z.Gather()
        run) }
You can find more details about the programming approaches that Alea GPU supports in the Alea GPU tutorial.
Alea GPU is a complete compiler built on top of the popular LLVM compiler infrastructure and the NVIDIA CUDA compiler SDK. Code compiled with Alea GPU delivers the same performance as equivalent CUDA C/C++ or CUDA Fortran. Alea GPU takes advantage of the code optimization passes in LLVM as well as the GPU-specific optimizations provided in the NVIDIA NVVM compiler back-end.
Sign up for a free Alea GPU hands-on webinar hosted by NVIDIA on July 8, 2015.
Alea GPU contains individual packages which can be conveniently installed through NuGet. The deployment package Alea.CUDA.Fody installs the necessary compilers together with the ahead-of-time compilation tool.
To install Alea GPU, run the following command in the
Package Manager Console:
PM> Install-Package Alea.CUDA.Fody
Alea GPU requires a CUDA-capable GPU with compute capability 2.0 (Fermi architecture) or higher and an installed CUDA driver version 6.5 or higher.
Finally, you need an Alea GPU license. The community edition is free and supports consumer GPUs of the GeForce product line. Register on the QuantAlea web page, select Client Login
and sign up to retrieve a free community edition license. For applications which require enterprise hardware or mobile GPUs, QuantAlea provides commercial licenses.
We are interested to hear all of your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.
Went from training 700 img/s in MNIST to 1500 img/s (using CUDA) to 4000 img/s (using cuDNN) that is just freaking amazing! @GPUComputing
— Leon Palafox (@leonpalafox) March 27, 2015
I stumbled upon the above tweet by Leon Palafox, a Postdoctoral Fellow at the University of Arizona Lunar and Planetary Laboratory, and reached out to him to discuss his success with GPUs and share it with other developers interested in using deep learning for image processing.
We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.
Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.
The project is in the development phase; we expect to have it completed in one or two years, depending on the number of features we wish to train for. Previously, we spent much of our time processing the images with the CNN, but now, thanks to the NVIDIA cuDNN library, we have substantially reduced that processing time.
As of now, we are focusing on the identification of volcanic rootless cones and impact craters, but plan to extend our search to include other landforms like sand dunes, recurring slope lineae (thought to be formed by seasonal seeps of surface water), and cloud formations. Of particular interest are dynamic phenomena because once we have developed a robust identification algorithm we can apply it to time series satellite observations to investigate how the Martian environment changes through time. Mars provides an ideal place to develop and test such approaches, but our ultimate aim will be to apply similar techniques to study the Earth.
We access the CUDA libraries from MATLAB, since at present most of our implementation is in MATLAB. We use the MatConvNet framework, which, like Theano and Caffe, provides a great set of tools to build and deploy your own Convolutional Neural Network (CNN) architectures. It also provides great CUDA interfaces to the cuDNN library.
We are still fine tuning some of the libraries, but in essence we use a CNN very similar to LeNet, albeit modified to work in this particular regime. We are also running five CNNs in parallel, each of which is using different pixel sizes to search for differently scaled features in the image.
We have five machines and each has two NVIDIA Quadro K5000s.
Figures 1 and 3 show the output for a region in Elysium Planitia, where we use the CNN to map the locations of volcanic rootless cones. This process, if done at a larger scale, would be an incredible tool for understanding the geologic history of Mars. The output has been processed by five CNNs looking for features at different scales, and finally pooled to generate a contour map.
So far we have trained on 800 examples of Martian landforms selected from full-resolution HiRISE images. Each HiRISE image typically has a resolution of 0.25 m/pixel and covers a swath that is 6-km-wide, resulting in file sizes up to 3.4 GB. Individual HiRISE images are spectacular, but the most impressive data products are digital terrain models generated using stereo-photogrammetry (Figure 3). This data enables us to visualize the surface of Mars in three dimensions and generate simulated illumination images that can be used to expand our natural training sets.
Additionally, we are using the trained Convolutional Neural Networks (CNNs) to examine thousands of HiRISE images within Elysium Planitia. This is a volcanic region that includes some of the youngest lava flows on Mars, covering a total of millions of square kilometers (Figure 1). Performing a manual search for every rootless cone in this region would be prohibitively time-consuming. Fortunately, automated approaches will enable us to map these landforms over a vast area and use the results of this systematic survey to infer the regional distribution of former ground-ice deposits in Elysium Planitia for the first time.
This project has many challenges, from the algorithm implementation to the analysis of the results. I think the biggest challenge is having readily available databases to use and train over for the different features on the surface of Mars.
While there are some databases, not all of them are very consistent, unlike in the computer vision community, which has the MNIST and CIFAR standards.
This is both good and bad. It is good in the sense that it allows you to tackle a real-world problem with state-of-the-art tools, but since databases are not consistent, there is a lot of skepticism in the community about whether the approach will work with all the features on the surface. However, in planetary science, the situation is different because the data collected from instruments, like HiRISE, is made freely available to the public in a standardized form through the NASA Planetary Data System (PDS).
I’ve used CUDA before, but not intensively, because most of my previous research focused on processing data other than images. I had a couple of classes and projects where I used CUDA, but this is the first time that it has become critical to optimize the efficiency of the image analysis approach to investigate a problem on such a large scale.
In this short time, having access to a powerful GPU greatly reduces the amount of time that I need to process each of the images that I need to analyze. These images are from the HiRISE database, which consists of 35,000 grayscale and color images with a total database size over 25 TB, and I need to apply different types of CNNs to classify them correctly and to choose the most suitable architecture for our particular problem.
Without using GPUs, it would take days to finish processing a single image, while recent results have shown that an hour or two is enough to process a single image using a GPU.
Having access to different research groups and topics has allowed me to use the experience I have gained from some Machine Learning topics in one area and apply it in a different way to a different area. For example, my work in Bayesian networks applied to gene networks has a huge applicability in my previous work at UCLA on EEG data from Brain–Machine Interfaces.
And surprisingly enough, my work on time series analysis can also find a use in planetary sciences by using things like Hidden Markov Models to model various terrain profiles.
Machine learning has gone from a niche, research-oriented area in the 1990s to an industry boom in the past decade. Part of this is the explosion of readily available data brought about by the information revolution of the last several years.
Many companies are investing a large amount of resources in their own data science divisions, like Facebook, which created its own machine learning laboratory a year ago. This has gotten people more excited about using machine learning tools, since it increases their value in the job market.
This is not without its downside: many people apply machine learning software without knowing the nuts and bolts of the process, which could result in disappointment for companies in the future when they realize their classifiers and tools are not tuned to their particular datasets. I’ve seen my fair share of implementations with no preanalysis of the data whatsoever, which just used the same tuning parameters as the textbook example.
I think that in the next few years more sophisticated machine learning tools will become available, and most of the work will be oriented toward large datasets. More effort has to be put into teaching people how the algorithms work, rather than just how to use them.
I think a more pervasive use of unmanned aerial vehicles (UAVs) would be amazing. Having them readily available running algorithms like CNNs to do feature recognition would allow us to have real-time information about natural disasters, riots and local events.
I can imagine how having a UAV in a large setting like the Coachella Valley Music and Arts Festival or a football stadium would allow the organizers to better control the flow of people in real time to prevent accidents.
Having a UAV using a CNN to track wildfires would allow us to have information on how they spread, and in some cases, how to stop or prevent them.
While the privacy implications are still a concern, I think there is much to gain from this technology, and mounting NVIDIA cards in them would be even better since we could do real-time image processing without the need to transmit video over Wi-Fi.
Do you have a success story like Leon’s that involves GPUs? If so, comment below and tell us about your research – we love reading and sharing our developers’ stories!
To learn more about deep learning on GPUs, visit the NVIDIA Deep Learning developer portal. Check out the related posts below, especially the intro to cuDNN: Accelerate Machine Learning with the cuDNN Deep Neural Network Library. Be sure to check out the cuDNN Webinar Recording: GPU-Accelerated Deep Learning with cuDNN. If you are interested in embedded applications of deep learning, check out the post Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1.
Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau allow programmers to see a big-picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extension (NVTX) and the MPI standard itself, this is pretty easy to do.
The NVTX API lets you embed information within a GPU profile, such as marking events or annotating ranges in the timeline with details about application behavior during that time. Jiri Kraus wrote past posts about generating custom application timelines with NVTX, and about using it to label individual MPI ranks in MPI profiles. In this post I’ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we’ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.
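To make the interception pattern concrete, here is the shape of a single hand-written wrapper. To keep this sketch self-contained and compilable without MPI or NVTX installed, the nvtx_* and PMPI_* names below are stand-in stubs with simplified signatures; in a real wrapper you would call the actual NVTX API (nvtxRangePushEx/nvtxRangePop) and the real PMPI entry point with the full MPI_Send signature:

```c
/* --- stubs standing in for NVTX and the real MPI library --- */
static int events[4];   /* records the order of calls: 1=push, 2=send, 3=pop */
static int nevents;

static void nvtx_range_push(const char *name) { (void)name; events[nevents++] = 1; }
static void nvtx_range_pop(void)              { events[nevents++] = 3; }
static int  PMPI_Send_stub(const void *buf)   { (void)buf; events[nevents++] = 2; return 0; }

/* --- the wrapper: open a range, forward to the real implementation, close --- */
int MPI_Send_wrapped(const void *buf)
{
    nvtx_range_push("MPI_Send");    /* range opens on the profiler timeline */
    int err = PMPI_Send_stub(buf);  /* the real MPI_Send would run here     */
    nvtx_range_pop();               /* range closes when the call returns   */
    return err;
}
```

The tool intercepts MPI_Send, brackets the underlying PMPI_Send with an NVTX range, and returns the original error code, so the application sees no behavioral difference.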
Wrapping every MPI routine in this way is a bit tedious, but fortunately there’s a tool to automate the process. We’ll use the wrap.py
script found at https://github.com/scalability-llnl/wrap to generate the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a github gist):
#include <pthread.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>

// Setup event category name
{{fn name MPI_Init}}
  nvtxNameCategoryA(999, "MPI");
  {{callfn}}
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char name[256];
  sprintf(name, "MPI Rank %d", rank);
  nvtxNameOsThread(pthread_self(), name);
  nvtxNameCudaDeviceA(rank, name);
{{endfn}}

// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw MPI_Bcast MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init MPI_Recv_init}}
  nvtxEventAttributes_t eventAttrib = {0};
  eventAttrib.version = NVTX_VERSION;
  eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  eventAttrib.message.ascii = "{{name}}";
  eventAttrib.category = 999;
  nvtxRangePushEx(&eventAttrib);
  {{callfn}}
  nvtxRangePop();
{{endfn}}
So what’s happening in this file? First, it includes the NVTX header file, and then loops over a series of common MPI functions and inserts the beginning of an NVTX range (nvtxRangePushEx
) and then ends the range as we leave the MPI routine (nvtxRangePop
). For convenience, I’ve named the range after the MPI routine being called. All I need to do now is call wrap.py
to generate a C file with my PMPI wrappers, which I’ll then build with my MPI C compiler.
$ python wrap/wrap.py -g -o nvtx_pmpi.c nvtx.w
$ mpicc -c nvtx_pmpi.c
Now I just need to rerun my code with these wrappers. To do this I’ll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I’ll use the simple Jacobi Iteration used in the GTC session Multi GPU Programming with MPI, which you can find on Github. Once I’ve built both the application and the wrappers generated above, I run the executable as follows.
$ mpicc -fast -ta=tesla -Minfo=all $HOME/nvtx_pmpi.o laplace2d.c -L$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d
One word of caution: the linking order does matter when using tools such as PMPI, so if you run your code and are not seeing the expected results, the object file containing the wrappers may not appear early enough in the build command.
In the above commands I’m rebuilding my code with the necessary bits. I’m also setting MV2_USE_CUDA at runtime to enable CUDA-awareness in my MVAPICH library. Additionally, I’m telling nvprof to generate a timeline file per MPI process by passing the MV2_COMM_WORLD_RANK environment variable to nvprof, which is defined to equal the MPI rank of each process. Figure 1 is the result of importing one of the resulting nvprof output files into the Visual Profiler and zooming in to an area of interest.
Looking in the “Markers and Ranges” row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX can give you interesting insights into how the MPI calls in your application interact with the GPU.