CUDA Spotlight: GPU-Accelerated Genomics

This week’s Spotlight is on Dr. Knut Reinert. Knut is a professor at Freie Universität in Berlin, Germany, and chair of the Algorithms in Bioinformatics group in the Institute of Computer Science. Knut and his team focus on the development of novel algorithms and data structures for problems in the analysis of biomedical mass data. In particular, the group develops mathematical models for analyzing large genomic sequences and data derived from mass spectrometry experiments (for example, for detecting differential expression of proteins between normal and diseased samples). Previously, Knut was at Celera Genomics, where he worked on bioinformatics algorithms and software for the Human Genome Project, which assembled the very first human genome.

On Oct. 22, 2013, Knut will deliver a GTC Express Webinar presentation titled "Intro to SeqAn, an Open-Source C++ Template Library."

Following is an excerpt from our interview (you can read the complete Spotlight here).

Knut Reinert, Freie Univ. Berlin

NVIDIA: Knut, tell us about the SeqAn library.
Knut: Before setting up the Algorithmic Bioinformatics group at Freie Universität, I had been working for years at a U.S. company – Celera Genomics in Maryland – where I worked on the assembly of both the Drosophila (fruit fly) and human genomes. A central part of these projects was the development of large software packages containing algorithms for assembly and genome analysis developed by the Informatics Research team at Celera.

Although successful, the endeavor clearly showed the lack of available implementations in sequence analysis, even for standard tasks. Implementations of much needed algorithmic components were either not available, or hard to access in third-party, monolithic software products.

With this experience in mind, and having been educated at the Max-Planck Institute for Computer Science in Saarbrücken (the home of very successful software libraries like LEDA and CGAL), I put the development of such a software library high on my research agenda.

The fundamental idea was that the library should be comprehensive for the field of sequence analysis, easy to use, and, most of all, efficiently implemented (because of the tremendous data volumes in genomics, ~3 GB per human genome).

In 2003, Andreas Gogol-Döring joined the Algorithmic Bioinformatics group. Over the next 18 months, lively discussions about goals, different software designs, and the possible content of the library followed, which led to various prototypes that allowed us to verify the design ideas with the corresponding implementations.

Although this approach was rather work-intensive, it led to a lot of insights and finally to the current SeqAn design, which in our opinion, fulfills our initial goals.

In the following years we were able to attract a handful of very talented PhD students who joined the project. In 2006, David Weese and Tobias Rausch joined the SeqAn team, followed by Anne-Katrin Emde in 2008. Their help in augmenting the functionality of SeqAn, in implementing algorithms and data types, and in providing documentation and tutorials was indispensable in making SeqAn a great product.

NVIDIA: Describe the hardware/software platform currently in use by your lab.
Knut: We have modest needs, since we develop tools and software libraries and do not in general have a lot of data analysis tasks. We have a couple of high-memory compute servers (6 to 12 cores, with between 48 and 256 GB of main memory). In addition, we have a compute cluster, with some NVIDIA Tesla K20s in the cluster and some on a dedicated development machine.

NVIDIA: What types of parallel algorithms are being implemented?
Knut: So far we have implemented only the low-hanging fruit, which has worked astonishingly well. The parallelization consists of two parts.

The first was to bring an involved data structure, the FM-index, generically onto the GPU. The FM-index allows fast exact and approximate searches of patterns in genome-sized data. It is popular because it uses a small amount of memory. For example, an enhanced suffix array (another efficient search index) for the complete human genome needs 42 GB of main memory, while an FM-index needs much less. Our new implementation in SeqAn needs only 3.7 GB and hence fits on current GPUs.
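To make the idea of an FM-index search concrete, here is a minimal C++ sketch of exact-match counting by backward search. It is an illustration under stated assumptions, not SeqAn's implementation: the names `ToyFmIndex`, `code`, `first`, and `occ` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so none of the memory savings mentioned above are reproduced here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Map the sentinel and the four bases to 0..4; anything else maps to -1.
inline int code(char c) {
    switch (c) {
        case '$': return 0;
        case 'A': return 1;
        case 'C': return 2;
        case 'G': return 3;
        case 'T': return 4;
        default:  return -1;
    }
}

// Toy FM-index over the alphabet {A, C, G, T}. Hypothetical type, not SeqAn's API.
struct ToyFmIndex {
    std::array<std::size_t, 5> first{};           // first[c]: characters smaller than c
    std::vector<std::array<std::size_t, 5>> occ;  // occ[i][c]: count of c in bwt[0, i)

    explicit ToyFmIndex(std::string text) {
        for (char c : text) assert(code(c) >= 1);  // only A, C, G, T are accepted
        text.push_back('$');                       // unique sentinel, sorts before all bases
        const std::size_t n = text.size();

        // Naive suffix array: sort suffix start positions lexicographically.
        std::vector<std::size_t> sa(n);
        std::iota(sa.begin(), sa.end(), std::size_t{0});
        std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });

        // The BWT character of row i is the character preceding suffix sa[i] (cyclically).
        occ.assign(n + 1, std::array<std::size_t, 5>{});
        for (std::size_t i = 0; i < n; ++i) {
            const char prev = text[(sa[i] + n - 1) % n];
            occ[i + 1] = occ[i];
            ++occ[i + 1][code(prev)];
        }
        // Prefix sums over the character totals give the start of each character's block.
        for (int c = 1; c < 5; ++c) first[c] = first[c - 1] + occ[n][c - 1];
    }

    // Backward search: number of occurrences of a non-empty pattern in the text.
    std::size_t count(const std::string& pattern) const {
        std::size_t lo = 0, hi = occ.size() - 1;  // half-open suffix array interval [lo, hi)
        for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
            const int c = code(*it);
            if (c < 1) return 0;  // character outside the alphabet cannot match
            lo = first[c] + occ[lo][c];
            hi = first[c] + occ[hi][c];
        }
        return hi - lo;
    }
};
```

For the text "ACGTACGTAC", `ToyFmIndex("ACGTACGTAC").count("AC")` returns 3. Each pattern character costs two table lookups, so the query time depends on the pattern length rather than the text length. Production FM-indices get their small footprint by sampling or compressing the occurrence table, which this sketch deliberately leaves out.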

Secondly, we programmed a parallel traversal on this index in a generic fashion, which allows the code to run on a multicore CPU as well as on the GPU.
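One common way to write such backend-independent code in C++ is tag dispatch: the algorithm is written once against a small "for each index" primitive, and a policy tag selects how that primitive runs. The sketch below is an assumption about the general technique, not SeqAn's actual mechanism. The names `Serial`, `Threads`, `forEachIndex`, and `countAll` are hypothetical, a plain string search stands in for the index traversal, and only CPU backends are shown (a GPU backend would add an overload that launches a CUDA kernel).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct Serial {};                      // policy tag: run on the calling thread
struct Threads { unsigned workers; };  // policy tag: run on `workers` CPU threads

// Call body(i) for every i in [0, n), one after another.
template <typename Body>
void forEachIndex(Serial, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// Same contract, with [0, n) split into contiguous chunks, one thread per chunk.
template <typename Body>
void forEachIndex(Threads policy, std::size_t n, Body body) {
    const std::size_t workers = std::max<std::size_t>(1, policy.workers);
    const std::size_t chunk = (n + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();
}

// Count (possibly overlapping) occurrences of every pattern in text.
// Written once; the policy argument decides where the loop body runs.
template <typename Policy>
std::vector<std::size_t> countAll(Policy policy, const std::string& text,
                                  const std::vector<std::string>& patterns) {
    std::vector<std::size_t> hits(patterns.size(), 0);
    forEachIndex(policy, patterns.size(), [&](std::size_t i) {
        // Each index writes only hits[i], so no synchronization is needed.
        std::size_t pos = text.find(patterns[i]);
        while (pos != std::string::npos) {
            ++hits[i];
            pos = text.find(patterns[i], pos + 1);
        }
    });
    return hits;
}
```

Calling `countAll(Serial{}, text, patterns)` and `countAll(Threads{4}, text, patterns)` produces the same result vector; only the execution backend differs. Because the queries are independent of one another, the work splits cleanly across threads, and the same property is what makes this kind of workload a good fit for a GPU.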

NVIDIA: What approaches have been useful for CUDA development?
Knut: First we made sure that the whole SeqAn library compiles with the CUDA compiler on the GPU. Thus, we could seamlessly compile code in SeqAn on the CPU and the GPU.

Secondly, we took a holistic approach and thought “big.” Instead of aiming to speed up only certain aspects of the algorithm, we thought about how we can implement the infrastructure to port ALL our indices to the GPU. The end result is simple and elegant.

NVIDIA: Were any specific GPU-accelerated libraries utilized?
Knut: Yes. We used the device vectors and pointers from Thrust to easily port our data structures to the GPU.

Read the full interview. Read more CUDA Spotlights.


About Calisa Cole

Calisa joined NVIDIA in 2003 and currently focuses on marketing related to CUDA, NVIDIA’s parallel computing architecture. Previously she ran Cole Communications, a PR agency for high-tech startups. She majored in Russian Studies at Wellesley and earned an MA in Communication from Stanford. Calisa is married and the mother of three boys. Her favorite non-work activities are fiction writing and playing fast games of online scrabble.