Turing H.264 Video Encoding Speed and Quality

All NVIDIA GPUs starting with Kepler support fully accelerated hardware video encoding, and GPUs starting with Fermi support fully accelerated hardware video decoding. The recently released Turing hardware delivers Tensor Cores and better machine learning performance, but the new GPU also incorporates new multimedia features, such as an improved NVENC unit that delivers better compression and image quality in video codecs.

Let’s take a closer look at performance and quality of the new NVENC unit designed into Turing.

NVENC Performance Test Setup

H.264 emerged 15 years ago and has become a ubiquitous video coding standard, the most important and widespread codec in the industry. These tests show how the Tesla T4 performs versus the well-known open source encoder libx264 in two scenarios:

  • High Quality mode, which represents the most common encoding scenarios, with VBR control and B-frames enabled.
  • Low Latency Fast mode, which is useful in latency-sensitive applications such as remote gaming or video conferencing.

For these tests, we prepared a test machine with the configuration shown in Table 1:

Table 1. Testing Configuration

Component   Tesla T4 NVENC                         libx264
CPU         Dual Intel Xeon E5-2660v3 @ 2.6 GHz    Dual Intel Xeon E5-2660v3 @ 2.6 GHz
GPU         TU104 (Tesla T4)                       N/A
RAM         128 GB                                 128 GB
FFmpeg      4.0.2                                  4.0.2
Driver      415.15                                 N/A

These performance tests use the encoding parameters shown in Table 2:

Table 2. Encoding Parameters

Preset: High Quality
  NVENC:   -c:v h264_nvenc -preset medium -b:v BITRATE -bufsize BITRATE*2 -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0
  libx264: -c:v libx264 -preset medium -b:v BITRATE -bufsize BITRATE*2 -profile:v high -tune psnr -vsync 0 -threads 4

Preset: Low Latency Fast
  NVENC:   -c:v h264_nvenc -preset llhp -rc cbr_ld_hq -b:v BITRATE -bufsize BITRATE/FRATE -profile:v high -g 999999 -vsync 0
  libx264: -c:v libx264 -preset fast -b:v BITRATE -bufsize BITRATE/FRATE -profile:v high -g 999999 -x264opts no-sliced-threads:nal-hrd=cbr -tune zerolatency -threads 4 -vsync 0
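To make the BITRATE and FRATE placeholders concrete, here is a minimal Python sketch that expands the parameters above into an FFmpeg argument list. The function names, the input and output file names, the 8 Mbit/s example bitrate, and the choice to round BITRATE/FRATE down to an integer are assumptions for illustration; the article does not show the full command lines that were actually run.

```python
import shlex


def build_encode_args(encoder, mode, bitrate, frame_rate=30):
    """Return the encoder arguments from the parameter table as a list.

    encoder: "nvenc" or "libx264"; mode: "high_quality" or "low_latency".
    bitrate is in bits per second; frame_rate only matters in low-latency mode.
    """
    if mode == "high_quality":
        bufsize = bitrate * 2              # -bufsize BITRATE*2
    elif mode == "low_latency":
        bufsize = bitrate // frame_rate    # -bufsize BITRATE/FRATE (rounded down: assumption)
    else:
        raise ValueError(f"unknown mode: {mode}")

    rate = ["-b:v", str(bitrate), "-bufsize", str(bufsize), "-profile:v", "high"]
    table = {
        ("nvenc", "high_quality"):
            ["-c:v", "h264_nvenc", "-preset", "medium"] + rate
            + ["-bf", "3", "-b_ref_mode", "2", "-temporal-aq", "1",
               "-rc-lookahead", "20", "-vsync", "0"],
        ("libx264", "high_quality"):
            ["-c:v", "libx264", "-preset", "medium"] + rate
            + ["-tune", "psnr", "-vsync", "0", "-threads", "4"],
        ("nvenc", "low_latency"):
            ["-c:v", "h264_nvenc", "-preset", "llhp", "-rc", "cbr_ld_hq"] + rate
            + ["-g", "999999", "-vsync", "0"],
        ("libx264", "low_latency"):
            ["-c:v", "libx264", "-preset", "fast"] + rate
            + ["-g", "999999", "-x264opts", "no-sliced-threads:nal-hrd=cbr",
               "-tune", "zerolatency", "-threads", "4", "-vsync", "0"],
    }
    if (encoder, mode) not in table:
        raise ValueError(f"unknown encoder: {encoder}")
    return table[(encoder, mode)]


def build_command(input_path, output_path, encoder, mode, bitrate, frame_rate=30):
    """Wrap the encoder arguments in a plain 'ffmpeg -i INPUT ... OUTPUT' call (assumed shape)."""
    args = build_encode_args(encoder, mode, bitrate, frame_rate)
    return ["ffmpeg", "-i", input_path] + args + [output_path]


# Hypothetical file names and bitrate, for illustration only.
cmd = build_command("input.y4m", "output.h264", "nvenc", "low_latency", 8_000_000)
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than a single string makes it straightforward to pass them to a process launcher without shell quoting issues.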

We used a variety of input videos for rate-distortion (RD) estimation, such as basketball_drive, bq_terrace, cactus, crowd_run, ducks_take_off, jokey, kimono, and many more at 1280×720, 1920×1080 and 3840×2160 resolutions.
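The RD curves in the results plot PSNR against bitrate. As a reminder of what that metric measures, the sketch below computes PSNR for 8-bit samples using the standard definition, 10·log10(MAX² / MSE). It is a simplified illustration that works on flat lists of sample values; the article does not say which planes were measured or how per-frame values were averaged, so those details are left out rather than assumed.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the original and the decoded samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no distortion
    return 10 * math.log10(max_value ** 2 / mse)


# Toy example: every sample is off by exactly 1, so MSE = 1.
print(round(psnr([100, 100, 100, 100], [101, 99, 101, 99]), 2))
```

Each point on an RD curve pairs one such PSNR value with the bitrate that produced it, so a curve that sits higher means less distortion for the same number of bits.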

Performance and Quality Results

Balanced Sequences

Figures 1 through 4 show that the Tesla T4 delivers the same or slightly better visual quality than libx264 in High Quality mode for all-round balanced sequences like Kimono, BQ Terrace and Park Scene.

Figure 1. PSNR RD curve for Kimono sequence in 720p resolution.
Figure 2. PSNR RD curve for BQ Terrace sequence in 1080p resolution.
Figure 3. PSNR RD curve for Park Scene sequence in 720p resolution.
Figure 4. PSNR RD curve for Park Scene sequence in 1080p resolution.

The Tesla T4 shows better prediction and filtering than libx264, as illustrated in Figures 5 and 6.

Figure 5. Frame of bq_terrace sequence encoded by libx264.
Figure 6. Frame of bq_terrace sequence encoded by T4 NVENC.

High-Motion Sequences

The Red Kayak and Cactus sequences include significant chaotic and circular motion, respectively. NVENC shows a clear advantage over libx264 in these scenes, which require complex inter prediction, as shown in Figures 7 and 8.

Figure 7. PSNR RD curve for Red Kayak sequence in 1080p resolution.
Figure 8. PSNR RD curve for Cactus sequence in 1080p resolution.

The Tesla T4 NVENC easily outperforms libx264 in low-latency mode, as highlighted in Figures 9 and 10. Notice how the Tesla T4 is more effective at high resolutions, offering 1 dB better PSNR at the same bitrate.
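To put a 1 dB gap in perspective: because PSNR is 10·log10(MAX² / MSE), a gain of g dB at the same bitrate corresponds to the mean squared error shrinking by a factor of 10^(-g/10). The small sketch below does that conversion; the function name is hypothetical, and the result is plain arithmetic on the PSNR definition rather than a measurement from these tests.

```python
def mse_ratio_for_psnr_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR rises by gain_db decibels."""
    return 10 ** (-gain_db / 10)


ratio = mse_ratio_for_psnr_gain(1.0)
# A 1 dB PSNR gain means the MSE drops to roughly 79% of its previous value.
print(f"MSE ratio for +1 dB: {ratio:.3f}")
```

In other words, a 1 dB advantage means roughly a fifth less squared error for the same number of bits.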

Figure 9. PSNR RD curve for Cactus sequence in 1080p resolution and Low Latency mode.
Figure 10. PSNR RD curve for Ducks Take Off sequence in 2160p resolution.

The difference in visual quality is easy to see by eye, as illustrated in Figures 11 and 12:

Figure 11. First frame of ducks_take_off sequence encoded by libx264.
Figure 12. First frame of ducks_take_off sequence encoded by T4 NVENC.

Turing GPUs come equipped with powerful NVENC video encoding units that encode video more efficiently than sophisticated software encoders like libx264, thanks to the combination of higher performance and lower energy consumption. The ideal transcoding solution needs to be cost effective (dollars/stream) and power efficient (watts/stream). Let’s look at performance and power consumption results averaged across multiple test sequences, as presented in Figures 13 and 14.

Figure 13. Number of streams encoded simultaneously at 30 FPS in High Quality mode.
Figure 14. Number of streams encoded simultaneously at 30 FPS in Low Latency mode.

The T4 encodes 22 720p streams simultaneously in High Quality mode. The GPU can also handle ten streams on average at 1080p and two or three at Ultra HD (2160p) resolution. This is almost double the throughput of libx264 at an equal visual quality level.
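One way to sanity-check these stream counts is to convert them into raw pixel throughput, which is streams × width × height × frames per second. The sketch below does this for the High Quality numbers quoted above at 30 FPS. Treating "two or three" 2160p streams as 2.5 is an assumption made only to get a single figure, and the function name is hypothetical.

```python
RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "2160p": (3840, 2160),
}


def pixel_throughput(resolution, streams, fps=30):
    """Pixels encoded per second for a number of simultaneous streams."""
    width, height = RESOLUTIONS[resolution]
    return width * height * fps * streams


# High Quality stream counts from the text; 2.5 is the midpoint of "two or three" (assumption).
hq_streams = {"720p": 22, "1080p": 10, "2160p": 2.5}
for name, count in hq_streams.items():
    mpix = pixel_throughput(name, count) / 1e6
    print(f"{name}: {mpix:.0f} Mpixel/s")
```

Under these assumptions, all three resolutions come out at roughly 610 to 620 megapixels per second, which suggests the stream counts scale with pixel count as one would expect. The same function can be applied to the Low Latency stream counts.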

Running in low-latency mode shows an even larger advantage for the T4. It can encode 37 streams at 720p, 17-18 at 1080p, and 4-5 at Ultra HD, which is 2-2.7x higher performance than libx264 with higher visual quality. Figures 15 and 16 show the watts-per-stream results.

Figure 15. Average watts per stream power consumption in High Quality mode.
Figure 16. Average watts per stream power consumption in Low Latency mode.

The Tesla T4 also shows great power efficiency, outperforming libx264 by 2-4x in High Quality mode and by up to 5x in Low Latency mode while keeping the CPU load low.
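The watts-per-stream metric is simply measured power draw divided by the number of simultaneous streams, and the efficiency ratio compares two such values. The sketch below shows the arithmetic. The article does not publish the raw power readings, so the numbers in the example are placeholders chosen for easy arithmetic, not measurements, and the function names are hypothetical.

```python
def watts_per_stream(power_watts, streams):
    """Average power cost of one encoded stream."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return power_watts / streams


def efficiency_ratio(baseline_wps, candidate_wps):
    """How many times fewer watts per stream the candidate needs than the baseline."""
    return baseline_wps / candidate_wps


# Placeholder readings (NOT from the article): two encoders drawing the same
# power while sustaining different numbers of streams.
baseline = watts_per_stream(100, 10)    # 10.0 W per stream
candidate = watts_per_stream(100, 25)   # 4.0 W per stream
print(efficiency_ratio(baseline, candidate))  # 2.5
```

Because the metric divides by stream count, an encoder can win either by drawing less power or by sustaining more streams at the same power.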

Conclusion

NVIDIA’s Tesla T4 has greatly improved encoding capabilities in comparison to previous generations. It shows the same or better visual quality compared to software encoders like libx264 in High Quality mode while outperforming them in Low Latency mode. This equates to twice the performance at 2-5x lower power consumption.
