Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters. The most recent top 500 list of the worlds fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world’s fastest supercomputer, Oak Ridge National Labs TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.
In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.
I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link so please check it out!
There are multiple motivating reason for building a GPU-based research cluster. Continue reading