Scaling Keras Model Training to Multiple GPUs

Keras Multi-GPU Training with MxNet on NVIDIA DGXKeras is a powerful deep learning meta-framework which sits on top of existing frameworks such as TensorFlow and Theano. Keras is highly productive for developers; it often requires 50% less code to define a model than native APIs of deep learning frameworks require (here’s an example of LeNet-5 trained on MNIST data in Keras (71 lines) and TensorFlow (156 lines)). This productivity has made it very popular as a university and MOOC teaching tool, and as a rapid prototyping platform for applied researchers and developers.

Unfortunately, Keras is quite slow in terms of single-GPU training and inference time (regardless of the backend). It is also hard to get it to work on multiple GPUs without breaking its framework-independent abstraction.

Last week, the MXNet community introduced a release candidate for MXNet v0.11.0 with support for Keras v1.2. In this post I’ll show you how to use Keras with the MXNet backend to achieve high performance and excellent multi-GPU scaling. To learn more about the MXNet v0.11.0 release candidate, check out this post on the Amazon Web Services AI blog.

Overcoming Keras Performance Limitations

The primitive operations that Keras uses to build a network graph are not always optimal relative to framework best practices. A good example of this is that achieving maximum performance with TensorFlow requires using different API calls than the ones shown in public TensorFlow examples. Keras constructs the graph for Resnet-50 more or less like the ResNet-50 implementation in the TensorFlow examples, while the highly-optimized model in TensorFlow’s performance guide is quite different. The difference in training throughput (images per second) between these two ResNet-50 implementations is about a factor of two (when training on eight Tesla P100 GPUs, such as on an NVIDIA DGX-1). This is a significant difference. Given the fact that models such as ResNet-50 train for many hours even on a DGX-1, faster time to results means much higher productivity.

Multi-GPU Scaling

Using multiple GPUs is currently not officially supported in Keras using existing Keras backends (Theano or TensorFlow), even though most deep learning frameworks have multi-GPU support, including TensorFlow, MXNet, CNTK, Theano, PyTorch, and Caffe2.

Various developers have come up with workarounds for Keras’s lack of native multi-GPU support, but none are officially supported by Keras. This raises concerns about API divergence and potential corner cases that have not been code reviewed by core Keras developers. Keras’s official blog also demonstrates that by breaking the backend-independent abstraction and exposing TensorFlow’s multi-GPU primitives, it’s possible to get Keras to scale. Unfortunately this requires the user to understand the operation of the backend and its APIs, and exposes low-level operations such as multi-GPU gradient reduction to the user.

Better Performance with Keras and MXNet

Can we do better, leveraging Keras’s high-level API, while still achieving good single-GPU performance and multi-GPU scaling? It turns out that the answer is yes, thanks to the MXNet backend for Keras, and MXNet’s efficient data pipeline. As a motivating example, I’ll show you how to build a fast and scalable ResNet-50 model in Keras.

First, make sure you have MXNet installed. You can easily install MXNet with GPU support using pip. For higher performance, though, I highly recommend building from source, because you can select C++ compiler flags and other options to optimize performance for your target architecture. This is also a preferable route if pip packages lag behind the source. This happens when a new version of CUDA is released and the build pipeline for the pip binaries hasn’t caught up with the source.

Once you have MXNet installed, it’s time to set up Keras. For this experiment, you will need Keras with MXNet support, to install Keras from a fork rather than the original repository. The MXNet Keras fork is maintained by the MXNet team, but for this example, I will use  an NVIDIA fork of Keras which contains the ResNet-50 example developed for this blog post.

First, make sure you have SciPy installed, since Keras requires it. Then clone the Keras repository and install Keras.

pip install scipy
git clone
cd keras
python install

Note that should you wish to swap out backends, you just need to install another backend, say TensorFlow, and then change the backend framework from mxnet to tensorflow in ~/.keras/keras.json.

Data Preparation

Now that you have installed Keras with MXNet backend support, you need to download the ImageNet dataset and preprocess it into the MXNet-friendly RecordIO format. Create an ImageNet account  to access the dataset. Upon creating the account, you will obtain a user ID and an access key. The following download script asks for the authentication information and where to store the data.

cd examples
bash ./

Please note that the ImageNet dataset is about 250 GB in compressed form, so this download can take a while.

Once you have downloaded and preprocessed the dataset, it’s time to create the “.lst” files containing image metadata. Assuming that MXNet is installed in /opt/mxnet and the dataset resides in /data/imagenet, you can issue the following commands

python /opt/mxnet/tools/ --list True --recursive True train /data/imagenet/train-jpeg
python /opt/mxnet/tools/ --list True --recursive True val /data/imagenet/val-jpeg

Now you need to create RecordIO images. The RecordIO format is able to wrap many images along with their metadata into a single large file. Since the ImageNet dataset is very large, this process takes some time. While performing the conversion, you can resize the images to a specific dimension, since the original images have non-uniform dimensions. I chose to resize to 480×480 pixels, with a target JPEG image quality of 95 (for more information about the JPEG quality factor, see here). If your machine has lots of CPU cores, you should increase the number of processing threads to the number of physical cores in the system. For example, the NVIDIA DGX-1 has two CPUs with 20 cores each, so I set the num-thread parameter to 40.

python /opt/mxnet/tools/ --resize 480 --quality 95 --num-thread 40 train /data/imagenet/train-jpeg
python /opt/mxnet/tools/ --resize 256 --quality 95 --num-thread 40 val /data/imagenet/val-jpeg

That’s it! Now you’re ready to train your model. The source code for model training can be found here. You will note that the get_data() function relies on an MXNet image iterator. To use TensorFlow’s serialization format (TFRecord), which stores images and associated metadata in Protobuf format, you have to change this function to instead deserialize TFRecord files. It’s possible to use TensorFlow’s image I/O and MXNet’s execution backend as well, since ultimately Keras expects the inputs to be in NumPy array format, so the image pipeline and deep learning backends can be mixed and matched.

The Importance of an Efficient Data Pipeline

As I discussed in the last section, you can swap out the image pipelines from different deep learning frameworks, while keeping the graph execution framework (e.g. MXNet) fixed. You could even skip the framework’s image pipeline altogether, and instead use Pillow, OpenCV or other tools that read raw JPEGs and metadata text files to make the image pipeline backend-independent. Unfortunately, this is typically very slow compared to the frameworks’ own pipelines, even for one GPU. There are multiple reasons for this:

  • The image processing routines are implemented in C++ (e.g. for Python’s OpenCV wrapper), but the library control is in Python. Jumping back and forth between native code and the Python interpreter adds to the overhead.
  • The functions exposed in Python process one image at a time, requiring loops to batch up images. These are Python loops, and as many benchmarks show, Python code is often more than 10x slower than C++ code.
  • The Python interpreter has a global interpreter lock (GIL), preventing multiple Python threads from executing at once. Workarounds that allow Python users to benefit from multi-core machines, e.g. multiprocessing, tend to be slower than a single-process, multi-threaded application, because data has to be copied between processes rather than shared by threads in a single process.

As expected, doing just image decoding via Python and Keras is very slow. Batching the images manually in Python resulted in about 257 images per second on a DGX-1. Even using Keras’s batching and augmentation wrapper (with augmentation disabled), which has some level of concurrency, only achieved 1,332 images per second. Meanwhile, MxNet’s image pipeline can decode about 3,767 480×480 pixel JPEG images per second with an intermediate level of augmentation (random cropping, left-right flipping, etc.).

As I show in this post, the 8-GPU throughput for ResNet-50 with the optimized image pipeline is 1,417 images per second, so using a Python image pipeline would have severely reduced the performance, even without augmentation. Since ResNet-50 needs augmentation in order to converge to the stated top-1 and top-5 accuracy, one can only imagine how much slower it would have been if we added augmentation to the already slow Python-based image pipeline.

A Few Words about Multi-GPU

Note the model_agnostic_compile() wrapper in our ResNet-50 script.

def backend_agnostic_compile(model, loss, optimizer, metrics, args):
  if keras.backend._backend == 'mxnet':
      gpu_list = ["gpu(%d)" % i for i in range(args.num_gpus)]
          context = gpu_list)
      if args.num_gpus > 1:
          print("Warning: num_gpus > 1 but not using MxNet backend")

The specification of the list of GPUs to use is specific to MXNet’s fork of Keras, and does not exist as an option when using other backends such as TensorFlow or Theano. Hence, this wrapper permits the user to benefit from multi-GPU performance using MXNet, while keeping the model fully general for other backends.

A word of caution: the data batch provided by the image pipeline is based on a global batch size that gets divided into shards if there are multiple GPUs. Therefore running the same model on a single GPU in another framework may blow up with an out of memory exception. The solution is simple: divide the global batch by the number of GPUs used previously to obtain the batch size adequate for one GPU. It’s also possible to configure a script that takes the per-GPU batch size and determines the global batch size based on the number of GPUs specified by the user. In this case, the model should not run out of memory on a single GPU, and should simply run faster on multiple GPUs.

Performance Results

Figure 1 compares the Keras model training performance using the MXNet backend to a reference model written using MXNet’s native API.

Figure 1. ResNet-50 training throughput (images per second) comparing Keras using the MXNet backend to a native MXNet implementation.
Figure 1. ResNet-50 training throughput (images per second) comparing Keras using the MXNet backend (green bars) to a native MXNet implementation (blue bars).

Keras performance is a bit worse than if we implemented the same model using the native MXNet API, but it’s not too far off. For a single GPU, the difference is about 15%. For 8 GPUs, it is 30%. However, for many users, the research and development productivity gain from Keras’s convenient abstractions outweighs this small slowdown. Performance with MXNet’s efficient image pipeline and fast and scalable backend is very good compared to running Keras training using a naive Python image pipeline (running Pillow or OpenCV and batching examples via the user’s script). For ResNet-50, Keras’s multi-GPU performance on an NVIDIA DGX-1 is even competitive with training this model using some other frameworks’ native APIs.

These results were obtained on an NVIDIA DGX-1 system with 8 Pascal GPUs and the following system details.

  • ​CPUs: 2x Intel Xeon CPU E5-2698 v4 @ 2.20GHz (20 cores per socket, 40 cores total)
  • GPUs: 8x Tesla P100
  • Interconnect: NVLink between 4 GPUs attached to a given PCIe root complex (4 GPUs per CPU), QPI interconnect between CPUs
  • 512 GB RAM
  • Keras 1.2.1
  • MxNet 0.10.1
  • CUDA 8
  • cuDNN 6
  • base OS: Ubuntu 16.04 LTS
  • NVIDIA Driver 384
  • gcc 5.4.0

Try Keras with MXNet Today

After reading this post, I hope you can see that Keras is not only a productive way to develop deep learning models, but it can train them fast on multi-GPU machines like NVIDIA DGX-1 using the MXNet backend. If you would like to give the examples a try, make sure to pull the NVIDIA fork of DMLC Keras, which includes MXNet backend support and our ResNet-50 example.

To learn more about the MXNet v0.11.0 release candidate with support for Keras v1.2, check out this post on the Amazon Web Services AI blog.

  • nor_he

    The following code works for me. Simply drop the code in and it’s good to go. As far as I can tell it only scales on a single box, not a distributed cluster.

    • Marek Kolodziej

      @nor_he:disqus Yes, that works, I tried it before too. I also did mention it in the article in the following sentence:

      “Keras’s official blog also demonstrates that by breaking the backend-independent abstraction and exposing TensorFlow’s multi-GPU primitives, it’s possible to get Keras to scale.”

      The above sentence also hyperlinks to the official Keras blog:

      Note the next sentence, though:

      “Unfortunately this requires the user to understand the operation of the backend and its APIs, and exposes low-level operations such as multi-GPU gradient reduction to the user.”

      The point is that, IMHO, Keras should not require the user to know anything TensorFlow (or MxNet, CNTK or Theano) specific in order to do data-parallel training on multiple GPUs. The MxNet backend for Keras does data parallelism *behind* the API, and it allows the user to list the available GPUs while constructing the model. Arguably, the same could have been done in the TF backend – to expose the GPU list as an argument to Keras’s model.compile, or via some other Keras API unified for all backends, and to hide “with tf.device” inside the TensorFlow backend for Keras as an implementation detail.

      Had it been planned that way from the start, there would have been one abstraction to ensure multi-GPU data parallelism across all Keras backends that support it. Since MxNet, TensorFlow, CNTK and Theano currently all support multi-GPU training, keeping the abstraction unified while changing per-backend implementations would have maintained Keras’s goal as the meta-API. The “I” in API stands for “interface,” not implementation, and the multi-GPU Keras approach for TensorFlow mixes the interface (Keras code) with the implementation (“with tf.device(),” as in the Keras TensorFlow blog post), effectively making the code non-portable.

      Granted, the MxNet approach is based on a Keras fork, meaning that the fact that the change for listing GPUs was added to the interface hasn’t propagated to other implementations yet. The spirit of it is correct, though – keep the interface abstract and portable, and hide the implementation specifics behind the abstraction. Once you can officially specify the GPU list in the Keras API, how backend-specific calls are made to ensure data parallelism will become the role of the implementations, and one API could support all mult-GPU training for all underlying frameworks.

      One mundane example comes to mind. Say you’re doing logging in Java and have log4j, java.util.logging, logback and other options to choose from, but you don’t want to rewrite the code every time the implementation library changes. This is where SLF4J comes in – you call its APIs, and it delegates the work to the chosen backend. If one had to intermix SLF4J code with, say, log4j code, the interface (SLF4J) and implementation (log4j) would have become intermixed, and one would no longer be able to swap out the backend. More importantly though, that would have beeen a leaky abstraction. A solid abstraction encapsulates the implementation details, so the user only need concern themselves with the main API.

    • Marek Kolodziej

      @nor_he:disqus Yes, these approaches do work. In fact, I did mention this in the blog post. Recall the following sentence:

      “Keras’s official blog also demonstrates that by breaking the backend-independent abstraction and exposing TensorFlow’s multi-GPU primitives, it’s possible to get Keras to scale. ”

      There was also a hyperlink associated with the above:

      The main problem I see here is that this requires the user to mix the interface (Keras) with the implementation (“with tf.device()” wrappers), which effectively makes the code non-portable to other backends.

      The simplest analogy that I can see here is the SLF4J logging wrapper (Simple Logging Facade for Java). The user only need to concern themselves with writing API calls using the facade, and can keep swapping backends (log4j, java.util.logging, logback) at will. Also thinking in terms of object-oriented programming, having an interface shouldn’t require the user to look into the implementation, which “with tf.device()” requries us to do.

      Granted, the GPU list provided to Keras’s model.compile() only works in the MxNet Keras fork now, but I would argue that this is the right approach. The change has been made at the interface level, which will hopefully soon become absorbed into mainstream Keras, and it’s the Keras backends’ job to detemine how to make multi-GPU data parallelism happen. That way one can have one abstraction that’s stable, and can swap out the backends while maintaining a portable multi-GPU solution.

      The second issue I mentioned in my post is that TensorFlow performance varies widely, e.g. the difference between this ResNet-50 tutorial implementation ( and this performance-optimized implementation ( is about 2x on a DGX-1. Since TensorFlow has many ways of doing the same thing, it can’t always be expected that a highly abstract implementation will be fast, even if the backend can support that. MxNet’s approach is usually to have one “mainstream” way of doing a particular thing, so it is easier to map the Keras frontend to the MxNet backend while keeping performance reasonably close to that of the native API.

      I encourage you to try both backends and see what performance differences you get.

    • thanks !

  • Michael

    What I didn’t really get:

    Is the performance issue of Keras for single GPUs mostly based on the need for efficient data pipelines (could also just use Dataflow from Tensorpack then) or is it primarily based on how Keras builds a network internally?

    Framework-native Keras input is currently also in the works:

    • Marek Kolodziej

      It’s actually both a pipelines and a backend issue. In my experiments with TF vs. MxNet, MxNet was quite a bit faster. This is not necessarily TF’s fault, because TF can be shown to be competitive with other frameworks on a DGX-1 (with P100 GPUs) for 8 GPUs on ResNet-50, but the thing is that Keras constructs the TF graph very differently than a highly optimized TF graph. It seems to me that the Keras backend for TF uses high-level ops that are not particularly important, and to get full perf in TF, one has to use low-level ops such as the “staging area” node, which essentially does explicit data prefetch (which is implicit in other frameworks, rather than part of the graph definition). See here, for example ( So, the issue regarding TF + Keras performance is not with TF per se, but with how Keras uses TF. It seems to me that the MxNet backend was simply written in a way where the graph constructed by Keras simply uses the backend better. With TF it’s entirely possible, but would require rewriting some parts of the Keras backend. Also as I mentioned, handling multi-GPU behind the scenes in the TF backend would be convenient, but IMHO having that control in Keras rather than at the user level would permit further perf optimizations, e.g. deciding when to use a traditional TF multi-device reduction vs. using NVIDIA’s nccl library for the reduction (e.g. nccl benefits perf for a large number of GPUs and large transfers). The heuristics for that could be encapsulated in the TF backend for Keras, since right now, TF exposes many ways of doing the same thing directly to the user, but most users wouldn’t know which path is efficient in which case.

    • Unfortunately, Keras is quite slow in terms of single-GPU training and inference time (regardless of the backend). It is also hard to get it to work on multiple GPUs without breaking its framework-independent abstraction.