About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC, he recognized this nascent trend, coined a name for it (GPGPU: General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter.

5 Powerful New Features in CUDA 6

Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.

Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are:

  • support for Unified Memory;
  • CUDA on Tegra K1 mobile/embedded system-on-a-chip;
  • XT and Drop-In library interfaces;
  • remote development in Nsight Eclipse Edition;
  • many improvements to the CUDA developer tools.



Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere

Today, cars are learning to see pedestrians and road hazards; robots are becoming more capable; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements of these devices are increasing exponentially, while their size, weight, and power budgets continue to shrink. To address these and other embedded parallel computing applications, NVIDIA last week announced an awesome new developer platform at the 2014 GPU Technology Conference: Jetson TK1.

Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.

Jetson TK1 is aimed at two groups of people. The first is OEMs, including robotics, avionics, and medical device companies, that want to develop new products based on the Tegra K1 SoC and need a development platform that makes it easy to write software for those products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who want a platform for creating amazing technology, such as robots, security devices, or anything else that needs substantial parallel computing or computer vision in a small, flexible, low-power package. For this group, Jetson TK1 offers the size and adaptability of a Raspberry Pi or Arduino with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!

Tegra K1

Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 CUDA cores.

cuda_pro_tip

CUDA Pro Tip: Increase Application Performance with NVIDIA GPU Boost

NVIDIA GPU Boost™ is a feature available on NVIDIA® GeForce® and Tesla® products. It makes use of any available power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute-intensive workloads running on clusters. This post is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a server or workstation. Note that GPU Boost is a system setting, which means that this Pro Tip applies to any user of a CUDA-accelerated application, not just developers.

The Tesla K40 board targets a specific power budget (235W) when running a highly optimized compute workload, but HPC workloads vary in power consumption and profile, as Figure 1 shows. For many applications, then, there is power headroom. NVIDIA GPU Boost for Tesla lets customers use that available headroom to select higher graphics clocks using NVML or nvidia-smi.

Figure 1: Average GPU Power Consumption for Real Applications on Tesla K20X.
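If you want to try this yourself, the following is a minimal sketch (not taken from the Acceleware post below) that uses the NVML C API to query the supported clocks on device 0 and select the highest available graphics clock; the roughly equivalent nvidia-smi commands appear in the comments. Setting application clocks typically requires administrative privileges, and you must link against NVML (-lnvidia-ml). The array sizes are assumptions that are generous for Tesla boards.

```cuda
// Minimal sketch: raise application clocks with NVML.
// Roughly equivalent to:
//   nvidia-smi -q -d SUPPORTED_CLOCKS        (list supported clocks)
//   nvidia-smi -ac <memMHz>,<gfxMHz>         (set application clocks)
#include <cstdio>
#include <nvml.h>

#define NVML_CHECK(call)                                                \
    do {                                                                \
        nvmlReturn_t r = (call);                                        \
        if (r != NVML_SUCCESS) {                                        \
            printf("NVML error: %s\n", nvmlErrorString(r));             \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main()
{
    NVML_CHECK(nvmlInit());

    nvmlDevice_t dev;
    NVML_CHECK(nvmlDeviceGetHandleByIndex(0, &dev));

    // Query supported memory clocks (Tesla boards expose a small, fixed set).
    unsigned int memClocks[16], memCount = 16;
    NVML_CHECK(nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks));

    // Query the graphics clocks supported at the first memory clock
    // and pick the highest one.
    unsigned int gfxClocks[32], gfxCount = 32, best = 0;
    NVML_CHECK(nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[0],
                                                    &gfxCount, gfxClocks));
    for (unsigned int i = 0; i < gfxCount; ++i)
        if (gfxClocks[i] > best) best = gfxClocks[i];

    printf("Setting application clocks: %u MHz memory, %u MHz graphics\n",
           memClocks[0], best);
    NVML_CHECK(nvmlDeviceSetApplicationsClocks(dev, memClocks[0], best));

    NVML_CHECK(nvmlShutdown());
    return 0;
}
```

Application clocks persist until they are reset (for example with nvidia-smi -rac), so remember to restore the defaults when you are done experimenting.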


A great post by Saad Rahim on the Acceleware Blog covers everything you need to know to use GPU Boost. In the post, Saad benchmarks two applications with varying clocks on K40: Reverse Time Migration (RTM), a depth-migration algorithm used to image complex geologies; and a finite-difference time-domain (FDTD) electromagnetic solver.

5 Things You Should Know About the New Maxwell GPU Architecture

The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture. That makes Maxwell a great architecture for future NVIDIA Tesla products, so stay tuned.

This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning, and pointers to more resources.

Figure: Maxwell vs. Kepler power efficiency.

1. The Heart of Maxwell: More Efficient Multiprocessors

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.

Improved Instruction Scheduling

The number of CUDA cores per SM has been reduced to a power of two; however, with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means that CUDA cores per GPU will be substantially higher than in comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.


CUDA Pro Tip: Do The Kepler Shuffle

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each other’s registers using a new instruction called SHFL, or “shuffle”.
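To give a flavor of what this looks like in code, here is a minimal sketch (an illustration, not code from the talks mentioned below) of a warp-wide sum reduction built on the Kepler-era __shfl_down() intrinsic.

```cuda
// Sketch: warp-level sum reduction using the Kepler SHFL instruction.
// At each step, every lane adds the value held by the lane `offset` positions
// above it, halving the number of distinct partial sums until lane 0 has the total.
__inline__ __device__ int warpReduceSum(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;  // lane 0 now holds the sum of the warp's 32 inputs
}
```

Because the data never leaves the register file, there is no shared memory footprint and no need for a block-wide synchronization.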

In upcoming posts here on Parallel Forall we will demonstrate uses of shuffle. To prepare, I highly recommend watching the following recording of a GTC 2013 talk by Julien Demouth entitled “Kepler’s SHUFFLE (SHFL): Tips and Tricks”. In the talk, Julien covers many uses for shuffle, including reductions, scans, transpose, and sorting, demonstrating that shuffle is always faster than safe uses of shared memory, and never slower than unsafe uses of shared memory.

Earlier this year on the Acceleware blog, Kelly Goss wrote a detailed post about shuffle, including a worked example. Like Julien, Kelly provided several reasons to use shuffle.


CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES

As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:

Does your CUDA application need to target a specific GPU? If you are writing GPU-enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This can be useful if you are attempting to share resources on a node or you want your GPU-enabled executable to target a specific GPU.

As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing. You can also use it to control execution of applications for which you don’t have source code, or to launch multiple instances of a program on a single machine, each with its own environment and set of visible devices.
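For completeness, here is a minimal sketch of the run-time approach: enumerate whatever devices are visible and pick one explicitly. Note that CUDA_VISIBLE_DEVICES also affects this enumeration; for example, launching with CUDA_VISIBLE_DEVICES=1 causes physical device 1 to appear to the application as device 0.

```cuda
// Sketch: enumerate visible devices and select one at run time.
// Only the devices allowed by CUDA_VISIBLE_DEVICES are enumerated, renumbered from 0.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices are visible to this process.\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    cudaSetDevice(0);  // choose a device before any allocations or kernel launches
    return 0;
}
```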


Register for GTC 2014 Now and Save 40%!

It’s that time of year again! Here at NVIDIA we’re hard at work getting ready for the 2014 GPU Technology Conference, the world’s most important GPU developer conference. Taking place in the heart of Silicon Valley, GTC offers unmatched opportunities to learn how to harness the latest GPU technology, with 500 sessions, hands-on labs and tutorials, technology demos, and face-to-face interaction with industry luminaries and NVIDIA technologists.

Come to the epicenter of computing technology March 24-27 and see how your peers are using GPUs to accelerate impactful results in various disciplines of scientific and computational research. Register for GTC now, because the Early Bird discount ends in one week, on Wednesday, January 29th. The Early Bird discount is 25% off a full-conference registration, and to sweeten the deal I can offer Parallel Forall readers an extra 20% off using the code GM20PFB. That gets you four days of complete access to GTC for just $720, or $360 for academic and government employees. Don’t miss it; register now!

With 500 sessions to choose from, the breadth and quality of the talks at GTC is hard to beat, and you’re sure to find sessions relevant to your own work.


The Saint on Porting C++ Classes to CUDA with Unified Memory

Alex St. John has a new post on his blog “The Saint” about his first experience porting C++ classes to run on the GPU with CUDA 6 and Unified Memory.

The introduction of Unified Memory in CUDA makes it practical, for the first time, to move huge bodies of general C++ code entirely up to the GPU and to write and run entire complex code systems entirely on the GPU with minimal CPU governance. In theory a big leap, but not without some new challenges.

Alex extends the example I provided in my post Unified Memory in CUDA 6 to make it portable between the CPU and GPU, with a switch to select managed memory or host memory allocation. He also touches on an approach to making the member functions of the class portable (using __host__ __device__; see my post about Hemi for more ideas on this topic).
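Roughly, the pattern looks like the sketch below (Alex’s actual classes differ): a small base class whose operator new and operator delete allocate instances in managed memory, combined with __host__ __device__ member functions so the same methods compile for both processors. The DataElement class here is a hypothetical example, not code from either post.

```cuda
// Sketch: a base class that places objects in Unified Memory so that a single
// pointer to a derived object is valid on both the CPU and the GPU.
#include <cstddef>
#include <cuda_runtime.h>

class Managed {
public:
    void *operator new(std::size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);  // managed allocation instead of host new
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaFree(ptr);
    }
};

// Derived classes inherit managed allocation; marking member functions
// __host__ __device__ lets the same code run on either processor.
class DataElement : public Managed {
public:
    int value;
    __host__ __device__ int doubled() const { return 2 * value; }
};
```

With this in place, `new DataElement` returns a pointer that can be dereferenced in host code and passed directly to a kernel, subject to the usual rule that the CPU must not touch managed data while a kernel is running.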

Overall it looks like Alex had a very positive experience with Unified Memory: “Using this approach I ported several thousand lines of C++ code and half a dozen objects to CUDA 6.0 in a couple days.” I expect many programmers to have similarly positive experiences in the future.


Unified Memory in CUDA 6

With CUDA 6, we’re introducing one of the most dramatic programming model improvements in the history of the CUDA platform: Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer had to view things. Data that is shared between the CPU and GPU must be allocated in both memories and explicitly copied between them by the program. This adds a lot of complexity to CUDA programs.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

In this post I’ll show you how Unified Memory dramatically simplifies memory management in GPU-accelerated applications. The image below shows a really simple example. Both versions of the code load a file from disk, sort the bytes in it, and then use the sorted data on the CPU before freeing the memory. The code on the right runs on the GPU using CUDA and Unified Memory. The only differences are that the GPU version launches a kernel (and synchronizes after launching it), and allocates space for the loaded file in Unified Memory using the new API cudaMallocManaged().

If you have programmed CUDA C/C++ before, you will no doubt be struck by the simplicity of the code on the right. Notice that we allocate memory once, and we have a single pointer to the data that is accessible from both the host and the device. We can read directly into the allocation from a file, and then we can pass the pointer directly to a CUDA kernel that runs on the device. Then, after waiting for the kernel to finish, we can access the data again from the CPU. The CUDA runtime hides all the complexity, automatically migrating data to the place where it is accessed.
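In case the image doesn’t come through for you, here is a minimal sketch of the same pattern (simplified; the figure’s version reads a file and sorts it): a single cudaMallocManaged() allocation is initialized on the CPU, processed by a kernel on the GPU, and then read again on the CPU after a synchronization, with no explicit copies anywhere.

```cuda
// Sketch of the Unified Memory pattern: one allocation, one pointer,
// touched by both CPU and GPU code with no explicit cudaMemcpy calls.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                // stand-in for real work (e.g. sorting)
}

int main()
{
    const int n = 1 << 20;
    char *data;
    cudaMallocManaged(&data, n);            // managed memory, visible to CPU and GPU

    for (int i = 0; i < n; ++i)             // initialize on the CPU
        data[i] = i % 100;                  // (the figure's code reads a file instead)

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                // wait before touching the data on the CPU

    printf("data[0] = %d\n", data[0]);      // use the result on the CPU
    cudaFree(data);
    return 0;
}
```

Compare this with the traditional version, which would need a separate device allocation with cudaMalloc() and explicit cudaMemcpy() calls in both directions.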

The New Parallel Forall

Today I’m excited to introduce you to the new and improved Parallel Forall blog. For over a year and a half, Parallel Forall has been a part of NVIDIA’s CUDA Zone website, but to better serve the parallel programming community, today we’re launching a dedicated home for Parallel Forall at http://devblogs.nvidia.com/parallelforall.

In addition to a fantastic new design (including a mobile-friendly theme), the new Parallel Forall adds a host of features that readers expect from a great blog. The most exciting of these are the features that let readers engage with our stories and with each other: comments and social sharing buttons. You can comment at the bottom of any post, and above each post you’ll find convenient social media sharing buttons for Twitter, Reddit, Facebook, Google+, LinkedIn, and email. Please get involved in the conversation, and share it with your network.

Parallel Forall also now helps you find more content of interest. First is an improved search (see the search box in the sidebar). Second, at the bottom of each story you will see recommendations of related posts. And finally, we’re featuring our wonderful authors more prominently — each of their posts will feature their photo and bio, with a link to that author’s archive page with all of their posts on Parallel Forall.

For the past few months we have featured three regular post series, CUDACasts, CUDA Pro Tips, and CUDA Spotlights, so the new Parallel Forall menu bar highlights these series for quick access. CUDACasts are short, informative “screencast” videos that teach you about parallel programming techniques and CUDA features and strategies. CUDA Pro Tips provide concise, focused insight into useful techniques on the CUDA platform. CUDA Spotlights are in-depth interviews with people who are using CUDA successfully for solving difficult problems across a broad range of fields.

In addition to these series, we will continue to feature a variety of in-depth posts and series on parallel programming on the CUDA platform, such as our popular Introduction to CUDA C/C++ series, and our recent posts on GPU computing in MATLAB and Python.

We hope you agree that Parallel Forall is a great resource for programming on the CUDA platform, and we need your help to make it even better. So visit Parallel Forall regularly, subscribe to the feed (you can also subscribe via email using the button in the sidebar), share posts with your friends, join in the discussion, and let us know what you like and what you’d like to read about in the future.