MapD: Massive Throughput Database Queries with LLVM on GPUs

Note: this post was co-written by Alex Şuhan and Todd Mostak of MapD.

At MapD our goal is to build the world’s fastest big data analytics and visualization platform that enables lag-free interactive exploration of multi-billion row datasets. MapD supports standard SQL queries as well as a visualization API that maps OpenGL primitives onto SQL result sets.

Although MapD is fast running on x86-64 CPUs, our real advantage stems from our ability to leverage the massive parallelism and memory bandwidth of GPUs. The most powerful GPU currently available is the NVIDIA Tesla K80 Accelerator, with up to 8.74 teraflops of compute performance and nearly 500 GB/sec of memory bandwidth. By supporting up to eight of these cards per server we see orders-of-magnitude better performance on standard data analytics tasks, enabling a user to visually filter and aggregate billions of rows in tens of milliseconds, all without indexing. The following Video shows the MapD dashboard, showing 750 million tweets animated in real time. Nothing in this demo is pre-computed or canned. Our big data visual analytics platform is running on 8 NVIDIA Tesla K40 GPUs on a single server to power the dashboard.

Fast hardware is only half of the story, so at MapD we have invested heavily into optimizing our code such that a wide range of analytic workloads run optimally on GPUs. In particular, we have worked hard so that common SQL analytic operations, such as filtering (WHERE) and GROUP BY, run as fast as possible. One of the biggest payoffs in this regard has been moving from the query interpreter that we used in our prototype to a JIT (Just-In-Time) compilation framework built on LLVM. LLVM allows us to transform query plans into architecture-independent intermediate code (LLVM IR) and then use any of the LLVM architecture-specific “backends” to compile that IR code for the needed target, such as NVIDIA GPUs, x64 CPUs, and ARM CPUs.

Query compilation has the following advantages over an interpreter:

  1. Since it is inefficient to evaluate a query plan for a single row at a time (in one “dispatch”), an interpreter requires the use of extra buffers to store the intermediate results of evaluating an expression. For example, to evaluate the expression x*2+3, an interpreter-based query engine would first evaluate x*2 for a number of rows, storing that to an intermediate buffer. The intermediate results stored in that buffer would then be read and summed with 3 to get the final result. Writing and reading these intermediate results to memory wastes memory bandwidth and/or valuable cache space. Compare this to a compiled query which can simply store the result of the first subexpression (x*2) into a register before computing the final result, allowing the cache to be used for other purposes, for example to create the hash table necessary for a query’s GROUP BY clause. This is related to loop fusion and kernel fusion compiler optimizations. Continue reading
Theano Logo

Introduction to Neural Machine Translation with GPUs (Part 2)

Note: This is part two of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 3.

In my previous post, I introduced statistical machine translation and showed how it can and should be viewed from the perspective of machine learning: as supervised learning where the input and output are both variable-length sequences. In order to introduce you to neural machine translation, I spent half of the previous post on recurrent neural networks, specifically about how they can (1) summarize a sequence and (2) probabilistically model a sequence. Based on these two properties of recurrent neural networks, in this post I will describe in detail an encoder-decoder model for statistical machine translation.

Encoder-Decoder Architecture for Machine Translation

Figure 1. Encoder-Decoder for Machine Translation.
Figure 1. Encoder-Decoder for Machine Translation.

I’m not a neuroscientist or a cognitive scientist, so I can’t speak authoritatively about how the brain works. However, if I were to guess what happens in my brain when I try to translate a short sentence in English to Korean, my brain encodes the English sentence into a set of neuronal activations as I hear them, and from those activations, I decode the corresponding Korean sentence. In other words, the process of (human) translation involves the encoder which turns a sequence of words into a set of neuronal activations (or spikes, or whatever’s going on inside a biological brain) and the decoder which generates a sequence of words in another language, from the set of activations (see Figure 1).

This idea of encoder-decoder architectures is the basic principle behind neural machine translation. In fact, this type of architecture is at the core of deep learning, where the biggest emphasis is on learning a good representation. In some sense, you can always cut any neural network in half, and call the first half an encoder and the other a decoder.

Starting with the work by Kalchbrenner and Blunsom at the University of Oxford in 2013, this encoder-decoder architecture has been proposed by a number of groups, including the Machine Learning Lab (now, MILA) at the University of Montreal (where I work) and Google, as a new way to approach statistical machine translation [Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015]. (There is also older, related work by Mikel Forcada at the University of Alcante from 1997! [Forcada and Neco, 1997].) Although there is no restriction on which particular type of neural network is used as either an encoder or a decoder, I’ll focus on using a recurrent neural network for both.

Let’s build our first neural machine translation system! But, before I go into details, let me first show you a big picture of the whole system in Figure 2. Doesn’t it look scarily complicated? Nothing to worry about, as I will walk you through this system one step at a time.

 Figure 2. The very first neural machine translation system.
Figure 2. The very first neural machine translation system.

Continue reading


GPU Pro Tip: Lerp Faster in C++

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units).

Linear Interpolation
Linear Interpolation (from Wikipedia)

You can enable linear interpolation (also known as linear filtering) on texture fetches in CUDA kernels. This hardware filtering uses a low-precision interpolant, so for this and other reasons it’s common to lerp in software.

The standard way to lerp is:

(1-t)*v0 + t*v1

Here’s a generic host/device function that performs a lerp:

template <typename T>
__host__ __device__
inline T lerp(T v0, T v1, T t) {
    return (1-t)*v0 + t*v1;

But we can do better. Continue reading


Graph Coloring: More Parallelism for Incomplete-LU Factorization

In this blog post I will briefly discuss the importance and simplicity of graph coloring and its application to one of the most common problems in sparse linear algebra – the incomplete-LU factorization. My goal is to convince you that graph coloring is a problem that is well-suited for GPUs and that it should be viewed as a tool that can be used to expose latent parallelism even in cases where it is not obvious. In fact, I will apply this tool to expose additional parallelism in one of the most popular black-box preconditioners/smoothers—the incomplete-LU factorization—which is used in many applications, including Computational Fluid Dynamics; Computer-Aided Design, Manufacturing, and Engineering (CAD/CAM/CAE); and Seismic Exploration (Figure 1).

Fig. 1: Applications that benefit from graph coloring applied to incomplete-LU factorization.
Figure 1: Applications that benefit from graph coloring applied to incomplete-LU factorization.

What is Graph Coloring?

In general, graph coloring refers to the problem of finding the minimum number of colors that can be used to color the nodes of a graph, such that no two adjacent (connected) nodes have the same color. For example, the graph in Figure 2 can be colored with two colors (green and yellow).

Fig. 2: This simple graph coloring requires two colors.
Figure 2: This simple graph coloring requires two colors.

Why is this mathematical problem of interest to us? Well, imagine that each node in the graph represents a task and each edge represents a dependency between two tasks. Then, graph coloring tells us which tasks are independent. Assuming that the edges have no particular direction assigned to them, we can process the tasks with the same color in parallel (they are independent by construction), perform a barrier, and proceed to the next set of tasks that are identified by a different color. Not all problems can be mapped to such a framework, but many are amenable to it.

The next question we should answer is how difficult is it to perform graph coloring? Now that the cuSPARSE routine provides a graph coloring implementation in the csrcolor() routine, for most users it is trivially easy. But in this post I want to talk about implementing the algorithm itself in a bit more detail.

It is well-known that finding the best solution to this problem is NP-complete. However, there are many parallel algorithms that can find an approximate solution very quickly. Indeed, the exact solution is often not even required, as long as we obtain enough parallelism to fully utilize our parallel computing platform. Continue reading


GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service

Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analysis using the R statistical computing environment. Often these programs have a very long run time. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware.

In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-core and multi-node parallelism can reduce run times. In this blog I’ll examine another way to leverage parallelism in R, harnessing the processing cores in a general-purpose graphics processing unit (GPU) to dramatically accelerate commonly used clustering algorithms in R. The most widely used GPUs for GPU computing are the NVIDIA Tesla series. A Tesla K40 GPU has 2,880 integrated cores, 12 GB of memory with 288 GB/sec of bandwidth delivering up to 5 trillion floating point calculations per second.

The examples in this post build on the excellent work of Mr. Chi Yau available at Chi is the author of the CRAN open-source rpud package as well as rpudplus, R libraries that make is easy for developers to harness the power of GPUs without programming directly in CUDA C++. To learn more about R and parallel programming with GPUs you can download Chi’s e-book. For illustration purposes, I’ll focus on an example involving distance calculations and hierarchical clustering, but you can use the rpud package to accelerate a variety of applications.

Hierarchical Clustering in R

Cluster analysis, or clustering, is the process of grouping objects such that objects in the same cluster are more similar (by a given metric) to each other than to objects in other clusters. Cluster analysis is a problem with significant parallelism. In a post on the Teraproc blog we showed an example that involved clustering analysis using k-means. In this post we’ll look at hierarchical cluster in R with hclust, a function that makes it simple to create a dendrogram (a tree diagram as in Figure 1) based on differences between observations. This type of analysis is useful in all kinds of applications from taxonomy to cancer research to time-series analysis of financial data.

Figure 1: Dendrogram created using hierarchical clustering in R.
Figure 1: Dendrogram created using hierarchical clustering in R.

Continue reading

Theano Logo

Introduction to Neural Machine Translation with GPUs (part 1)

Note: This is the first part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 2 and part 3.

Neural machine translation is a recently proposed framework for machine translation based purely on neural networks. This post is the first of a series in which I will explain a simple encoder-decoder model for building a neural machine translation system [Cho et al., 2014Sutskever et al., 2014Kalchbrenner and Blunsom, 2013]. In a later post I will describe how an attention mechanism can be incorporated into the simple encoder-decoder model [Bahdanau et al., 2015], leading to the state-of-the-art machine translation model for a number of language pairs including En-Fr, En-De, En-Tr and En-Zh [Gulcehre et al., 2015Jean et al., 2015]. Furthermore, I will introduce recent work which has applied this framework of neural machine translation to image and video description generation [Xu et al., 2015Li et al., 2015].

Statistical Machine Translation

First, let’s start with a brief overview of machine translation. In fact, the name, machine translation, says everything. We want a machine to translate text in one language, which we will call the source sentence, to corresponding text in another language, which we call the target sentence. (Although ideally the machine should be able to translate a whole document from one language to another, let us concentrate in this blog post on sentence-level machine translation.)

There are multiple ways to build such a machine that can translate languages. For instance, we can ask a bilingual speaker to give us a set of rules transforming a source sentence into a correct translation. This is not a great solution, as you can imagine, because we don’t even know the set of rules underlying a single language, not to mention the rules underlying a pair of languages. It is simply hopeless to write an exhaustive set of rules for translating a source sentence into a correct translation. Hence, in this blog post, we focus on a statistical approach where those rules, either implicitly or explicitly, are automatically extracted from a large corpus of text.

This statistical approach to machine translation is called statistical machine translation. The goal is the same (build a machine that translates a sentence from one language to another), but we let the machine learn from data how to translate rather than design a set of rules for the machine (See Fig. 1 for a graphical illustration.) Learning is based on statistical methods, which should sound familiar to anyone who has taken a basic course on machine learning. In fact, statistical machine translation is nothing but a particular application of machine learning, where the task is to find a function that maps from a source sentence to a corresponding target.

Figure 1. Statistical Machine Translation
Figure 1. Statistical Machine Translation

Continue reading


Accelerate .NET Applications with Alea GPU

Today software companies use frameworks such as .NET to target multiple platforms from desktops to mobile phones with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the task parallel library, they face a bigger challenge using GPUs to tackle compute intensive tasks. To accelerate .NET applications with GPUs, developers must write functions in CUDA C/C++ and write or generate code to interoperate between .NET and CUDA C/C++.

Alea GPU closes this gap by bringing GPU computing directly into the .NET ecosystem. With Alea GPU you can write GPU functions in any .NET language you like, compile with your standard .NET build tool and accelerate it with a GPU. Alea GPU offers a full implementation of all CUDA features, and code compiled with Alea GPU performs as well as equivalent CUDA C/C++ code.

CUDA on .NET with Alea GPU

Alea GPU is a professional CUDA development stack for .NET and Mono built directly on top of the NVIDIA compiler toolchain. Alea GPU offers the following benefits:

  • Easy to use
  • Cross-platform
  • Support for many existing GPU algorithms and libraries
  • Debugging and profiling functionality
  • JIT compilation and a compiler API for GPU scripting
  • Future-oriented technology based on LLVM
  • No compromise on performance

You can easily install Alea GPU as a Nuget package, as Figure 1 shows.

Figure 1: Alea GPU Nuget packages.
Figure 1: Alea GPU Nuget packages.

Ease of Use

Alea GPU is easy to use for all kinds of parallel problems. Developers can write GPU code in any .NET language and use the full set of CUDA device functions provided by NVIDIA LibDevice, as well as CUDA device parallel intrinsic functions, such as thread synchrhonization, warp vote functions, warp shuffle functions, and atomic functions. Let’s consider a simple example which applies the same calculation to many data values. SquareKernel is a GPU kernel written in C# that accesses memory on the GPU.

static void SquareKernel(deviceptr outputs, 
                         deviceptr inputs, int n)
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < n; i += stride)
        outputs[i] = inputs[i] * inputs[i];

Continue reading


Deep Learning for Image Understanding in Planetary Science

I stumbled upon the above tweet by Leon Palafox, a Postdoctoral Fellow at the The University of Arizona Lunar and Planetary Laboratory, and reached out to him to discuss his success with GPUs and share it with other developers interested in using deep learning for image processing.

Tell us about your research at The University of Arizona

Leon Palafox
Leon Palafox

We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.

Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.

Figure 1: A view of Mars centered on Elysium Planitia, which includes some of the youngest volcanic terrains on the planet. Performing a systematic regional survey of sub-kilometer-scale landforms, such as volcanic rootless cones, would be prohibitively time consuming using manual methods, but is ideal for Machine Learning algorithms such as Convolutional Neural Networks (CNNs).
Figure 1: A view of Mars centered on Elysium Planitia, which includes some of the youngest volcanic terrains on the planet. Performing a systematic regional survey of sub-kilometer-scale landforms, such as volcanic rootless cones, would be prohibitively time consuming using manual methods, but is ideal for Machine Learning algorithms such as Convolutional Neural Networks (CNNs).

Continue reading


GPU Pro Tip: Track MPI Calls In The NVIDIA Visual Profiler

Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extensions (NVTX) and the MPI standard itself, this is pretty easy to do.

The NVTX API lets you embed information within a GPU profile, such as marking events or annotating ranges in the timeline with details about application behavior during that time. Jiri Kraus wrote past posts about generating custom application timelines with NVTX, and about using it to label individual MPI ranks in MPI profiles. In this post I’ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we’ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.

Wrapping every MPI routine in this way is a bit tedious, but fortunately there’s a tool to automate the process. We’ll use the script found at to generate the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a github gist):

#include <pthread.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>
// Setup event category name
{{fn name MPI_Init}}
  nvtxNameCategoryA(999, "MPI");
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char name[256];
  sprintf( name, "MPI Rank %d", rank );
  nvtxNameOsThread(pthread_self(), name);
  nvtxNameCudaDeviceA(rank, name);
// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany
MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv
MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw MPI_Bcast
MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init MPI_Recv_init }}
  nvtxEventAttributes_t eventAttrib = {0};
  eventAttrib.version = NVTX_VERSION;
  eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  eventAttrib.message.ascii  = "{{name}}";
  eventAttrib.category = 999;

So what’s happening in this file? First, it includes the NVTX header file, and then loops over a series of common MPI functions and inserts the beginning of an NVTX range (nvtxRangePushEx) and then ends the range as we leave the MPI routine (nvtxRangePop). For convenience, I’ve named the range after the MPI routine being called. All I need to do now is call to generate a C file with my PMPI wrappers, which I’ll then build with my MPI C compiler.

$ python wrap/ -g -o nvtx_pmpi.c nvtx.w
$ mpicc -c nvtx_pmpi.c

Now I just need to rerun my code with these wrappers. To do this I’ll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I’ll use the simple Jacobi Iteration used in the GTC session Multi GPU Programming with MPI, which you can find on Github. Once I’ve built both the application and the wrappers generated above, I run the executable as follows.

$ mpicc -fast -ta=tesla -Minfo=all $HOME/nvtx_pmpi.o laplace2d.c -L$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d

One word of caution: the linking order does matter when using tools such as PMPI, so if you run your code and are not seeing the expected results, the object file containing the wrappers may not appear early enough in the build command.

In the above commands I’m rebuilding my code with the necessary bits. I’m also setting MV2_USE_CUDA at runtime to enable cuda-awareness in my MVAPICH library. Additionally I’m informing nvprof to generate a timeline file per-MPI process by passing the MV2_COMM_WORLD_RANK environment variable to nvprof, which is defined to equal the MPI rank of each process. Figure 1 is the result of importing one of these resulting nvprof output files into Visual Profiler and then zooming in to an area of interest.

NVIDIA Visual Profiler with MPI ranges.
Figure 1: NVIDIA Visual Profiler with MPI ranges.

Looking in the “Markers and Ranges” row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX can give you interesting insights into how the MPI calls in your application interact with the GPU.


Parallel Direct Solvers with cuSOLVER: Batched QR

[Note: Lung Sheng Chien from NVIDIA also contributed to this post.]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers.  Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers.   These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today.   Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability.  Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine which uses the factors and a right hand side vector and solves them one at a time to give a highly accurate solution. Figure 1 shows an example of LDL^T factorization of a dense matrix.   A solver for this factorization would first solve the transpose of L part, then apply the inverse of the D (diagonal) part in parallel, then solve again with L to arrive at the final answer. The benefit of direct solvers is that (unlike iterative solvers), they always find a solution (when the factors exist; more on this later) and once a factorization is found, solutions for many right-hand sides can be performed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

Figure 1: Example of LDL^T factorization

Continue reading