As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:
Does your CUDA application need to target a specific GPU? If you are writing GPU enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This can be useful if you are attempting to share resources on a node or you want your GPU enabled executable to target a specific GPU
As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing. You can also use it to control execution of applications for which you don’t have source code, or to launch multiple instances of a program on a single machine, each with its own environment and set of visible devices. Continue reading →
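To make the last use case concrete, here is a minimal sketch of launching several instances of a program, each with its own set of visible devices. The helper names `env_for_devices` and `launch_instances` are hypothetical and not from the post; only the CUDA_VISIBLE_DEVICES variable and its comma-separated list of device indices are standard. The child process here is a stand-in that prints the value it inherits, not a real CUDA application.

```python
import os
import subprocess
import sys


def env_for_devices(device_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs.

    Hypothetical helper: it only sets CUDA_VISIBLE_DEVICES to a
    comma-separated list of device indices and leaves the base untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)
    return env


def launch_instances(command, device_groups):
    """Start one copy of `command` per device group, each with its own environment."""
    return [
        subprocess.Popen(command, env=env_for_devices(group))
        for group in device_groups
    ]


if __name__ == "__main__":
    # Stand-in for a CUDA executable: a child Python process that reports the
    # value it inherited. Replace it with your own program, e.g. ["./my_app"].
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    for proc in launch_instances(child, [[0], [1], [2, 3]]):
        proc.wait()
```

Inside each instance, the CUDA runtime renumbers the visible devices starting from 0, so a process launched with devices 2 and 3 sees them as devices 0 and 1.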
It’s that time of year again! Here at NVIDIA we’re hard at work getting ready for the 2014 GPU Technology Conference, the world’s most important GPU developer conference. Taking place in the heart of Silicon Valley, GTC offers unmatched opportunities to learn how to harness the latest GPU technology including 500 sessions, hands-on labs and tutorials, technology demos, and face-to-face interaction with industry luminaries and NVIDIA technologists.
Come to the epicenter of computing technology March 24-27, and see how your peers are using GPUs to accelerate impactful results in various disciplines of scientific and computational research. Register for GTC now, because the Early Bird discount for GTC registration ends in one week on Wednesday, January 29th. The Early Bird discount is 25% on a full-conference registration, and to sweeten the deal I can offer Parallel Forall readers an extra 20% off using the code GM20PFB. That gets you four days of complete access to GTC for just $720, or $360 for academic and government employees. Don’t miss it, register now!
Here are a few talks to give you an idea of the breadth and quality of talks you will see at GTC: Continue reading →
Alex St. John has a new post on his blog “The Saint” about his first experience porting C++ classes to run on the GPU with CUDA 6 and Unified Memory.
The introduction of Unified Memory in CUDA, for the first time makes it practical to move huge bodies of general C++ code entirely up to the GPU and to write and run entire complex code systems entirely on the GPU with minimal CPU governance. In theory a big leap, but not without some new challenges.
Alex extends the example I provided in my post Unified Memory in CUDA 6 to make it portable between the CPU and the GPU, with a switch to select managed memory or host memory allocation. He also touches on an approach to making the member functions of the class portable (using __host__ __device__; see my post about Hemi for more ideas on this topic).
Overall it looks like Alex had a very positive experience with Unified Memory: “Using this approach I ported several thousand lines of C++ code and half a dozen objects to CUDA 6.0 in a couple days.” I expect many programmers to have similar good experiences in the future.
The key to the power of GPUs is their thousands of parallel processors that execute threads. Anyone who has worked with even a handful of threads knows how easy it can be to introduce race conditions, and how difficult it can be to debug and fix these errors. Because a modern GPU can have thousands of simultaneously executing threads, NVIDIA engineers felt it was imperative to create an incredibly powerful tool for detecting and debugging race conditions.
This racecheck tool comes as part of the cuda-memcheck command-line utility. In CUDA 5.5 a new racecheck analysis mode presents a much more human-readable analysis of your code, even reporting which source lines conflict with other lines. In this episode of CUDACasts we use a simple version of Conway’s Game of Life to show the new racecheck features of cuda-memcheck. We’ll start with a few race condition bugs, and then use the analysis tool to find and fix them.
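The episode's code is not reproduced here, so the following is only a sequential Python analogy for one way a Game of Life update can go wrong: writing new cell states into the same grid that neighboring cells are still reading. On a GPU the same read-after-write hazard arises between threads rather than between loop iterations. The function names and the wrap-around grid are assumptions made for illustration.

```python
def count_neighbors(grid, r, c):
    """Count the live neighbors of cell (r, c) on a grid that wraps at the edges."""
    rows, cols = len(grid), len(grid[0])
    return sum(
        grid[(r + dr) % rows][(c + dc) % cols]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def next_state(alive, neighbors):
    """Conway's rule: birth on 3 neighbors, survival on 2 or 3."""
    return 1 if neighbors == 3 or (alive and neighbors == 2) else 0


def step_double_buffered(grid):
    """Correct update: every read comes from the old grid, every write goes to a new one."""
    return [
        [next_state(grid[r][c], count_neighbors(grid, r, c))
         for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


def step_in_place(grid):
    """Buggy update: later cells read neighbors that were already overwritten."""
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            grid[r][c] = next_state(grid[r][c], count_neighbors(grid, r, c))
    return grid


if __name__ == "__main__":
    blinker = [[0] * 5 for _ in range(5)]
    blinker[2][1] = blinker[2][2] = blinker[2][3] = 1
    correct = step_double_buffered(blinker)
    broken = step_in_place([row[:] for row in blinker])
    print("results match:", correct == broken)
```

Keeping separate input and output buffers removes the hazard in this sketch. On the GPU, fixing a real race may also require synchronization between threads, which this sequential version cannot show.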
Artefacto Estudio is a developer of interactive applications and games. The company’s projects include a real-time virtual shoe fitting kiosk that allows people to “try on” shoes using augmented reality powered by Microsoft Kinect and GPU computing (see the video).
The following is an excerpt from our interview (you can read the complete Spotlight here).
NVIDIA: Néstor, tell us a bit about Artefacto Estudio. Néstor: Artefacto is an independent development studio. We integrate solutions using cutting-edge technologies like Microsoft Kinect, Oculus Rift and Leap Motion.
NVIDIA: How did you become involved in the shoe industry? Néstor: An ad agency, Kempertrautmann, was seeking a technology partner to work on a prototype for a virtual shoe fitting exhibit for Goertz, the German shoe company.
Many industries use Computational Fluid Dynamics (CFD) to predict fluid flow forces on products during the design phase, using only numerical methods. A famous example is Boeing’s 777 airliner, which was designed and built without the construction (or destruction) of a single model in a wind tunnel, an industry first. This approach dramatically reduces the cost of designing new products for which aerodynamics is a large part of the value add. Another good example is Formula 1 racing, where a fraction of a percentage point reduction in drag forces on the car body can make the difference between a winning and a losing season.
Users of CFD models crave higher accuracy and faster run times. The key enabling algorithm for realistic models in CFD is Algebraic Multi-Grid (AMG). This algorithm allows solution times to scale linearly with the number of unknowns in the model; it can be applied to arbitrary geometries with highly refined and unstructured numerical meshes; and it can be run efficiently in parallel. Unfortunately, AMG is also very complex and requires specialty programming and mathematical skills, which are in short supply. Add in the need for GPU programming skills, and GPU-accelerated AMG seems a high mountain to climb. Existing GPU-accelerated AMG implementations (most notably the one in CUSP) are more proofs of concept than industrial-strength solvers for real-world CFD applications, and highly tuned multi-threaded and/or distributed CPU implementations can outperform them in many cases. Industrial CFD users had few options for GPU acceleration, so NVIDIA decided to do something about it.
NVIDIA partnered with ANSYS, provider of the leading CFD software Fluent, to develop a high-performance, robust and scalable GPU-accelerated AMG library. We call the library AmgX (for AMG Accelerated). Fluent 15.0 uses AmgX as its default linear solver, and it takes advantage of a CUDA-enabled GPU when it detects one. AmgX can even use MPI to connect clusters of servers to solve very large problems that require dozens of GPUs. The aerodynamics problem in Figure 1 required 48 NVIDIA K40X GPUs, and involved 111 million cells and over 440 million unknowns. Continue reading →
This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for an nx × ny = 1024 × 768 element matrix that is distributed amongst four devices. Each device contains a horizontal slice of the input matrix shown in the figure, as well as a horizontal slice of the output matrix. These input matrix slices of 1024 × 192 elements are divided into four tiles containing 256 × 192 elements each, which are referred to as p2pTile in the code. As the name indicates, the p2pTiles are used for peer-to-peer transfers. After a p2pTile has been transferred to the appropriate device if necessary (tiles on the block diagonal do not need to be transferred as the input and output tiles are on the same device), a CUDA transpose kernel launch transposes the elements within the p2pTile using thread blocks that process smaller tiles of 32 × 32 elements.
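The book's CUDA Fortran code is not reproduced in this excerpt, so the sketch below only simulates the data movement described above in plain Python, with nested lists standing in for device memory. The function names are hypothetical, and the "devices" are just list indices. It shows which tile goes where and counts how many tiles would need a peer-to-peer transfer.

```python
def split_rows(matrix, n_devices):
    """Give each simulated device a horizontal slice (a contiguous band of rows)."""
    rows_per_device = len(matrix) // n_devices
    return [
        matrix[d * rows_per_device:(d + 1) * rows_per_device]
        for d in range(n_devices)
    ]


def distributed_transpose(matrix, n_devices):
    """Transpose `matrix` (ny rows by nx columns) one tile at a time.

    Returns (out_slices, n_transfers): one horizontal slice of the transposed
    matrix per device, and the number of tiles that changed device.
    """
    ny, nx = len(matrix), len(matrix[0])
    assert ny % n_devices == 0 and nx % n_devices == 0
    tile_h, tile_w = ny // n_devices, nx // n_devices
    in_slices = split_rows(matrix, n_devices)
    # The output has nx rows and ny columns; each device owns tile_w of those rows.
    out_slices = [
        [[None] * ny for _ in range(tile_w)] for _ in range(n_devices)
    ]
    n_transfers = 0
    for src in range(n_devices):
        for dst in range(n_devices):
            # Tile number `dst` of the slice on device `src` ends up on device `dst`.
            if src != dst:
                n_transfers += 1  # off-diagonal tiles must move between devices
            for r in range(tile_h):
                for c in range(tile_w):
                    value = in_slices[src][r][dst * tile_w + c]
                    out_slices[dst][c][src * tile_h + r] = value
    return out_slices, n_transfers


if __name__ == "__main__":
    # Sizes from the text: a 1024 x 768 matrix on four devices, 32 x 32 thread-block tiles.
    nx, ny, n_devices, block = 1024, 768, 4, 32
    tile_w, tile_h = nx // n_devices, ny // n_devices
    print("p2pTile size:", tile_w, "x", tile_h)
    print("32 x 32 tiles per p2pTile:", (tile_w // block) * (tile_h // block))

    # A small matrix keeps the simulated transpose quick to run.
    small = [[r * 8 + c for c in range(8)] for r in range(4)]
    slices, moved = distributed_transpose(small, 4)
    print("tiles transferred:", moved)
```

With four devices there are 16 p2pTiles in total, and the four on the block diagonal stay where they are, so the sketch counts 12 transfers.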
In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your hardware. For example, if your program executes inefficient code, it may cause the GPU to work harder than it needs to, leading to higher power consumption, and a potential slow-down due to throttling.
A new profiling feature in CUDA 5.5 allows you to profile the clocks, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8 and Nsight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.
KIPAC members work in the Physics and Applied Physics Departments at Stanford University and at the SLAC National Accelerator Laboratory.
To handle the massive amounts of data involved in cosmological measurements, Debbie and her colleagues Matt Bellis (now an assistant professor at Siena College) and Mark Allen (now a data scientist at Chegg) teamed up to explore the potential of GPU computing and CUDA.
They concluded that “GPUs are a useful tool for cosmological calculations, allowing calculations to be made one or two orders of magnitude faster.” Their results were presented in a paper titled Cosmological Calculations on the GPU, which appeared earlier this year in Astronomy and Computing.
So far in the CUDA Python mini-series on CUDACasts, I introduced you to using the @vectorize decorator and CUDA libraries, two different methods for accelerating code using NVIDIA GPUs. In today’s CUDACast, I’ll be demonstrating how to use the NumbaPro compiler from Continuum Analytics to write CUDA Python code which runs on the GPU.
In CUDACast #12, we’ll continue using the Monte Carlo options pricing example, and I’ll show how to write the step function in CUDA Python rather than using the @vectorize decorator. In addition, by using the nvprof command-line profiler, we’ll be able to see the speed-up we’re able to achieve by writing the code explicitly in CUDA.
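For readers unfamiliar with the example, here is a rough CPU-only sketch of what a Monte Carlo step function can look like. It is not the code from the episode: the geometric Brownian motion model, the function names, and the parameters are all assumptions, and it uses only the Python standard library rather than NumbaPro. The per-element `step` function is the kind of scalar routine that a decorator such as @vectorize, or a hand-written GPU kernel, would apply to every path in parallel.

```python
import math
import random


def step(price, dt, rate, volatility, noise):
    """Advance one price path by one time step of geometric Brownian motion.

    Assumed model (not taken from the episode):
    S_next = S * exp((rate - volatility**2 / 2) * dt + volatility * sqrt(dt) * noise)
    """
    drift = (rate - 0.5 * volatility ** 2) * dt
    diffusion = volatility * math.sqrt(dt) * noise
    return price * math.exp(drift + diffusion)


def simulate_paths(n_paths, n_steps, s0, rate, volatility, horizon, seed=0):
    """Run every path forward n_steps and return the final prices."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    prices = [s0] * n_paths
    for _ in range(n_steps):
        # On a GPU, this loop over paths is what runs in parallel.
        prices = [
            step(p, dt, rate, volatility, rng.gauss(0.0, 1.0)) for p in prices
        ]
    return prices


if __name__ == "__main__":
    final = simulate_paths(n_paths=1000, n_steps=50, s0=100.0,
                           rate=0.05, volatility=0.2, horizon=1.0)
    print("mean final price:", sum(final) / len(final))
```

Each path depends only on its own previous value and its own random draw, which is why the step maps naturally onto one GPU thread per path.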