VPF: Hardware-Accelerated Video Processing Framework in Python

Support for accelerated hardware video encoding began with the Kepler generation of NVIDIA GPUs, and all GPUs since the Fermi generation support hardware video acceleration decoding through the NVIDIA Video Codec SDK.

While showing great performance and flexibility, it requires knowledge of C/C++. Another option is to use third party libraries and applications like FFmpeg or GStreamer which again require C/C++ expertise to be built-in and customized per user.

However, hardware accelerated video features might be useful for a broader audience, and the intent of VPF (Video Processing Framework) is a simple, yet powerful tool for utilizing NVIDIA GPUs when working with video using Python. VPF utilizes the NVIDIA Video Codec SDK for flexibility and performance, and provides developers with the ease-of-use inherent to Python.

VPF is a set of C++ libraries and Python bindings which provides full hardware acceleration for video processing tasks such as decoding, encoding, transcoding and GPU-accelerated color space and pixel format conversions. VPF is a CMake-based open source cross-platform software released under Apache 2 license. It relies on FFmpeg library for (de)muxing and pybind11 project for building Python bindings.

VPF exports C++ video processing classes into PyNvCodec Python module. To illustrate the ease of use, let’s start with a quick code snippet which shows how to do fully hardware-accelerated video transcoding on a GPU without raw frames copying between Host and Device:

import PyNvCodec as nvc

gpuID = 0
encFile = "big_buck_bunny_1080p_h264.mov"
xcodeFile = open("big_buck_bunny_1080p.h264", "wb")

nvDec = nvc.PyNvDecoder(encFile, gpuID)
nvEnc = nvc.PyNvEncoder({'preset': 'hq', 'codec': 'h264', 's': '1920x1080'}, gpuID)

while True:
    rawSurface = nvDec.DecodeSingleSurface()
    # Decoder will return zero surface if input file is over;
    if not (rawSurface.GetCudaDevicePtr()):
        break
    
    encFrame = nvEnc.EncodeSingleSurface(rawSurface)
    if(encFrame.size):
        frameByteArray = bytearray(encFrame)
         xcodeFile.write(frameByteArray)

# Encoder is asynchronous, so we need to flush it
encFrames = nvEnc.Flush()
for encFrame in encFrames:
    encByteArray = bytearray(encFrame)
    xcodeFile.write(encByteArray)

Despite a simple design, VPF demonstrates good performance. The transcoding sample shown above is enough to saturate Nvenc unit on an RTX 5000 GPU as illustrated below:

Big Buck Bunny sequence contains 14,315 frames and can be transcoded within 32 seconds which gives ~447fps without using any advanced techniques such as producer-consumer pattern with decoded frames queue being shared by decoder and encoder launched in separate threads. Since all the transcoding is done on the GPU, there’s no noticeable CPU load.

The core part of VPF are PyNvDecoder and PyNvEncoder classes which are Python bindings to NVIDIA Video Codec SDK. There are two major data types which VPF operates:

NumPy arrays for CPU-side data
User-transparent Surface class which represents GPU-side data

Since GPU-side memory objects allocation is complex and influences performance heavily, all VPF classes methods which return Surface, own them and may reuse previously returned Surface upon next call. Unlike that, VPF classes methods return new NumPy array instance every time they are called. Move constructors are used for that to avoid memory copy overheads.

Both PyNvDecoder and PyNvEncoder classes support only NV12 pixel format for the sake of simplicity. Other pixel formats are supported with set of color space and pixel format conversion classes. All conversions are GPU-accelerated and done in VRAM memory for better performance.

PyNvDecoder class has five main methods:

`DecodeSingleSurface`	Decodes single frame from input video, returns Surface with decoded pixels. The next time a user calls this method, previously returned Surface may be reused. If the frame isn’t decoded, decoded Surface’s `GetCudaDevicePtr` method will return zero.
`DecodeSingleFrame`	Decodes single frame from input video, returns NumPy array with decoded pixels. Next time user calls this method, another NumPy array instance will be returned. If frame isn’t decoded, it will return empty NumPy array. This operation does Device to Host memory copy.
`Width`	Returns decoded frame width.
`Height`	Returns decoded frame height.
`PixelFormat`	Returns decoded frame pixel format.

User may mix DecodeSingleSurface and DecodeSingleFrame calls, it will not break decoder internal state. Decoder class supports H.264 and H.265 codecs.

PyNvEncoder class has six methods:

`EncodeSingleSurface`	Takes NV12 Surface with raw pixels, encodes it and returns elementary video bitstream as NumPy array. The encoder is asynchronous, so this method may return empty array upon the first few calls (depending on encoder settings), which is not an error.
`EncodeSingleFrame`	Takes NumPy array with raw pixels, encodes it and returns elementary video bitstream as NumPy array. The encoder is asynchronous, so this method may return empty array upon first few calls (depending on encoder settings), which is not an error.
`Flush`	Flushes the encoder. It does not return unless all raw frames in encoder’s queue are encoded and returns a list of NumPy arrays with elementary stream bytes.
`Width`	Returns encoded frame width.
`Height`	Returns encoded frame height.
`PixelFormat`	Returns encoded frame pixel format.

If a user mixes EncodeSingleSurface and EncodeSingleFrame calls, it will not break the encoder internal state. Also, PyNvEncoder can take an input frame of arbitrary resolution and resize it on the GPU on-the-fly before the actual encoding. The encoder class supports H.264 and H.265 codecs, and has low latency, so in the end of the encoding session, one should call Flush method that will flush the encoder frame queue.

Below is a list of supported encoder parameters:

Parameter	Type	Meaning
`profile`	string	Encoding profile. Possible values for h264: `baseline`, `main`, `high<,code>.` Possible values for hevc: `main`.
`lookahead`	integer	Size of look-ahead.
`vbvinit`	integer + unit	VBV initial delay in bits, can be in unit of 1, K, M.
`bitrate`	integer + unit	Average bit rate, can be in unit of 1, K, M.
`fps`	integer	Frame rate.
`preset`	string	Encoding preset. Possible values: `default`, `hp`, `hq`, `bd`, `ll`, `ll_hp`, `ll_hq`, `lossless`, `lossless_hp`.
`constqp`	integer	QP value for constqp rate control mode.
`qmin`	integer	Min QP value.
`qmax`	integer	Max QP value.
`cq`	integer	Target constant quality level for VBR mode Possible values: `[0,51]`, `0=auto`.
`initqp`	integer	Initial QP value.
`temporalaq`		(No value) Enable temporal AQ.
`vbvbufsize`	integer + unit	VBV buffer size in bits, can be in unit of 1, K, M.
`bf`	integer	Number of consecutive B-frames.
`rc`	string	Rate control mode. Possible values: `constqp`, `vbr`, `cbr`, `cbr_lowdelay_hq`, `cbr_hq`, `vbr_hq`.
`aq`	integer	Enable spatial AQ and set its strength Possible values: `[0,15]`, `0=auto`.
`maxbitrate`	integer + unit	Max bit rate, can be in unit of 1, K, M.
`gop`	integer	Length of GOP (Group of Pictures).
`codec`	string	Video codec. Possible values: `h264`, `hevc`.

HardwareSurface class is a wrapper around CUdeviceptr:

GetCudaDevicePtr

Returns CUdeviceptr handle to CUDA memory object.

For memory transfers between Host and Device, there are two classes named PyFrameUploader and PySurfaceDownloader.

PyFrameUploader is used for uploading a NumPy array to GPU. It has only one method:

UploadSingleFrame

Uploads a numpy array to GPU, returns handle to uploaded Surface. Next time user calls this method, previously returned Surface may be reused.

PySurfaceDownloader is used to download a Surface from GPU. It also has only one method:

DownloadSingleSurface

Downloads a GPU-side Surface into CPU-side numpy array. Next time user calls this method, another numpy array instance will be returned.

Finally, there’s a PySurfaceConverter class which is used for GPU-accelerated color space and pixel format conversion. Below is the list of supported conversions:

YUV420 to NV12
NV12 to YUV420
NV12 to RGB

PySurfaceConverter has one method:

Execute

Performs conversion on the GPU, returns handle to Surface with output format. The next time a user calls this method, previously returned Surface may be reused.

VPF provides developers with a simple, yet powerful Python tool for fully hardware -accelerated video encoding, decoding and processing classes. Thanks to the C++ code underneath the Python bindings, it allows you to achieve high GPU utilization within tens of code lines. Decoded video frames are exposed either as NumPy arrays or CUDA device pointers for simpler interaction and features extension. VPF does not impose any restrictions above the NVIDIA Video Codec SDK and allows you to fully utilize the potential of NVIDIA professional-grade GPUs.