Setting Up GPU Telemetry with NVIDIA Data Center GPU Manager

Understanding GPU usage provides important insights for IT administrators managing a data center. Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation, diagnose anomalies, and increase overall data center efficiency. NVIDIA Data Center GPU Manager (DCGM) offers a comprehensive tool suite to simplify administration and monitoring of NVIDIA Tesla-accelerated data centers.

One key capability provided by DCGM is GPU telemetry. DCGM includes sample code for integrating GPU metrics with open source telemetry frameworks such as collectd and Prometheus. You can also use the DCGM API to write custom code that integrates with site-specific telemetry frameworks.

Let’s look at how to integrate DCGM with collectd on a CentOS system, making GPU telemetry data available alongside your existing telemetry data.

Integrating DCGM with collectd

Prerequisites

First, you need to install and configure collectd and DCGM.

If collectd is not already present on the system, you can install it from the EPEL repository. (Unless otherwise specified, all command line examples need to be run as a superuser.)

# yum install -y epel-release
# yum install -y collectd

DCGM is available free of charge from the NVIDIA website. Download the x86_64 RPM package and install it.

# rpm --install datacenter-gpu-manager-1.5.6-1.x86_64.rpm

The DCGM host engine service (nv-hostengine) needs to be running in order to collect the GPU telemetry data.

# nv-hostengine

Verify that the DCGM host engine service is running by using it to query the current temperature of the GPUs. Note that this command can be run as a non-superuser.

$ dcgmi dmon -e 150 -c 1

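If you want the host engine to start automatically when the system starts, configure a DCGM systemd service using a unit file like the one below; otherwise, you will need to start the host engine manually whenever the system restarts. On most systemd-based distributions, locally defined unit files go in /etc/systemd/system.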

[Unit]
Description=DCGM service

[Service]
User=root
PrivateTmp=false
ExecStart=/usr/bin/nv-hostengine -n
Restart=on-abort

[Install]
WantedBy=multi-user.target

Setting up the DCGM collectd plugin

Now that you’ve successfully installed collectd and DCGM, the real work to integrate the two begins. The DCGM package includes a sample collectd plugin implemented using the DCGM Python binding. The plugin needs to be installed and configured to use it with collectd.

First, copy the DCGM Python binding and collectd plugin to the collectd plugin directory. The DCGM collectd plugin installs into a subdirectory to separate it from other collectd plugins.

# mkdir /usr/lib64/collectd/dcgm
# cp /usr/src/dcgm/bindings/*.py /usr/lib64/collectd/dcgm
# cp /usr/src/dcgm/samples/scripts/dcgm_collectd_plugin.py /usr/lib64/collectd/dcgm

Next, verify that the plugin is configured with the correct location of the DCGM library (libdcgm.so) on this system. The DCGM library is installed in /usr/lib64 on CentOS systems by default. Edit /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py so that the variable g_dcgmLibPath is set to /usr/lib64.

# sed -i -e 's|\(g_dcgmLibPath =\) '"'"'/usr/lib'"'"'|\1 '"'"'/usr/lib64'"'"'|g' /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py
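The nested quoting in the sed command above is easy to get wrong, so it can help to preview the substitution on a scratch copy before touching the real plugin. The following is a minimal sketch: the scratch file and its single line are assumptions standing in for dcgm_collectd_plugin.py, and the expression is the same substitution rewritten with double quotes and without -i, so nothing is modified in place.

```shell
# Scratch file standing in for dcgm_collectd_plugin.py (contents assumed).
plugin=$(mktemp)
printf "g_dcgmLibPath = '/usr/lib'\n" > "$plugin"

# Same substitution as above, quoted with double quotes; without -i,
# sed only prints the result and leaves the file untouched.
lib_path_line=$(sed -e "s|\(g_dcgmLibPath =\) '/usr/lib'|\1 '/usr/lib64'|g" "$plugin" \
    | grep '^g_dcgmLibPath')

echo "$lib_path_line"   # prints: g_dcgmLibPath = '/usr/lib64'
rm -f "$plugin"
```

To run the same preview against the installed plugin, replace "$plugin" with /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py. Because the pattern includes the closing quote after /usr/lib, a file that already points at /usr/lib64 is left unchanged.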

The DCGM plugin is initially configured to collect a number of generally useful GPU metrics. You can customize the list of metrics by modifying the g_publishFieldIds variable. You’ll find the names and meanings of the available fields in /usr/src/dcgm/bindings/dcgm_fields.py.

Configuring collectd

Once the DCGM collectd plugin has been set up, collectd still needs to be configured to recognize the new metrics.

First, configure collectd to recognize the DCGM plugin by adding dcgm.conf to /etc/collectd.d.

LoadPlugin python
<Plugin python>
      ModulePath "/usr/lib64/collectd/dcgm"
      LogTraces true
      Interactive false
      Import "dcgm_collectd_plugin"
</Plugin>

Second, add a corresponding collectd type for each of the GPU fields defined in /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py. Assuming no additional fields were defined, append the following type information to /usr/share/collectd/types.db.

### DCGM types
ecc_dbe_aggregate_total                  value:GAUGE:0:U
ecc_sbe_aggregate_total                  value:GAUGE:0:U
ecc_dbe_volatile_total                   value:GAUGE:0:U
ecc_sbe_volatile_total                   value:GAUGE:0:U
fb_free                                  value:GAUGE:0:U
fb_total                                 value:GAUGE:0:U
fb_used                                  value:GAUGE:0:U
gpu_temp                                 value:GAUGE:U:U
gpu_utilization                          value:GAUGE:0:100
mem_copy_utilization                     value:GAUGE:0:100
memory_clock                             value:GAUGE:0:U
memory_temp                              value:GAUGE:U:U
nvlink_bandwidth_total                   value:GAUGE:0:U
nvlink_recovery_error_count_total        value:GAUGE:0:U
nvlink_replay_error_count_total          value:GAUGE:0:U
pcie_replay_counter                      value:GAUGE:0:U
pcie_rx_throughput                       value:GAUGE:0:U
pcie_tx_throughput                       value:GAUGE:0:U
power_usage                              value:GAUGE:0:U
power_violation                          value:GAUGE:0:U
retired_pages_dbe                        value:GAUGE:0:U
retired_pages_pending                    value:GAUGE:0:U
retired_pages_sbe                        value:GAUGE:0:U
sm_clock                                 value:GAUGE:0:U
thermal_violation                        value:GAUGE:0:U
total_energy_consumption                 value:GAUGE:0:U
xid_errors                               value:GAUGE:0:U
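Appending the list by hand more than once would leave duplicate type definitions in types.db. One way to make the step repeatable is a small helper that skips any type name the target file already defines. This is a sketch only: append_types is a hypothetical helper, not part of collectd or DCGM, and the demonstration runs against scratch files rather than /usr/share/collectd/types.db.

```shell
# append_types SRC DEST (hypothetical helper): append each type definition
# from SRC to DEST unless DEST already defines a type with the same name.
append_types() {
    src=$1
    dest=$2
    while read -r name spec; do
        # Skip blank lines and comment lines such as "### DCGM types".
        case $name in ''|'#'*) continue ;; esac
        # The type name is the first whitespace-delimited field of a line.
        if ! awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$dest"; then
            printf '%-40s %s\n' "$name" "$spec" >> "$dest"
        fi
    done < "$src"
}

# Demonstration on scratch files (stand-ins for the real types.db).
workdir=$(mktemp -d)
printf 'gpu_temp value:GAUGE:U:U\n' > "$workdir/types.db"
printf '### DCGM types\ngpu_temp value:GAUGE:U:U\npower_usage value:GAUGE:0:U\n' \
    > "$workdir/dcgm-types.db"

append_types "$workdir/dcgm-types.db" "$workdir/types.db"
append_types "$workdir/dcgm-types.db" "$workdir/types.db"   # second run adds nothing

type_count=$(grep -c '' "$workdir/types.db")   # number of lines in the file
echo "$type_count"
rm -rf "$workdir"
```

On a real system, you would save the DCGM type list to a file of its own and pass that file and /usr/share/collectd/types.db as the two arguments, running as a superuser.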

If you defined additional GPU fields when installing the DCGM collectd plugin, you must manually add a corresponding collectd type to the list above. The Python field name in /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py and the collectd type in /usr/share/collectd/types.db are related, but different. To correlate the two variants of a metric name, use the field ID defined in /usr/src/dcgm/bindings/dcgm_fields.py. For example, DCGM_FI_DEV_GPU_TEMP represents the GPU temperature in /usr/lib64/collectd/dcgm/dcgm_collectd_plugin.py. Looking up this field in /usr/src/dcgm/bindings/dcgm_fields.py shows that it corresponds to field ID 150. The command dcgmi dmon -l lists the collectd-visible field names; the collectd type name corresponding to field ID 150 is gpu_temp.
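The first half of that lookup, from Python field name to field ID, can be scripted. The sketch below assumes that dcgm_fields.py assigns each ID on a line of the form NAME = NUMBER; the scratch file is a stand-in containing only the one field discussed above, so check the layout of the real file before relying on it.

```shell
# Scratch excerpt standing in for dcgm_fields.py; the line layout of the
# real file is an assumption.
fields=$(mktemp)
cat > "$fields" <<'EOF'
# GPU temperature field, as discussed in the text
DCGM_FI_DEV_GPU_TEMP = 150
EOF

# Split on runs of spaces and '=' and print the ID of the named field.
field_id=$(awk -F'[ =]+' '$1 == "DCGM_FI_DEV_GPU_TEMP" { print $2 }' "$fields")

echo "$field_id"   # prints: 150
rm -f "$fields"
```

On a real system, point the awk command at /usr/src/dcgm/bindings/dcgm_fields.py, then find the same ID in the output of dcgmi dmon -l to get the collectd type name.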

(Re-)Start collectd

Restart collectd so that it picks up the new configuration. When DCGM is successfully integrated with collectd, collectd should report output similar to the following when it starts.

collectd[25]: plugin_load: plugin "python" successfully loaded.
…
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/gpu_temp-0; value time = 1539719060.000; last cache update = 1539719060.000;
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/power_usage-0; value time = 1539719060.000; last cache update = 1539719060.000;
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/ecc_sbe_volatile_total-0; value time = 1539719060.000; last cache update = 1539719060.000;
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/ecc_dbe_volatile_total-0; value time = 1539719060.000; last cache update = 1539719060.000;
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/ecc_sbe_aggregate_total-0; value time = 1539719060.000; last cache update = 1539719060.000;
...
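A quick way to confirm which DCGM metrics collectd is receiving is to pull the type names out of log lines like the ones above. The following sketch works on a scratch file holding three of those lines; how you obtain the log on a real system (for example, from the systemd journal or a collectd log file) depends on your setup and is not covered here.

```shell
# Scratch log holding three of the lines shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
collectd[25]: plugin_load: plugin "python" successfully loaded.
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/gpu_temp-0; value time = 1539719060.000; last cache update = 1539719060.000;
collectd[25]: uc_update: Value too old: name = f707be0c326d/dcgm_collectd-GPU-ace28880-3f61-dbc4-1f8c-0dc7916f3108/power_usage-0; value time = 1539719060.000; last cache update = 1539719060.000;
EOF

# Value names have the form host/dcgm_collectd-<GPU UUID>/<type>-<instance>;
# keep only <type> and drop duplicates.
metrics=$(sed -n 's|.*name = [^/]*/dcgm_collectd-[^/]*/\([^;]*\)-[0-9]*;.*|\1|p' "$log" | sort -u)

echo "$metrics"
rm -f "$log"
```

Each name this prints should also appear in types.db. A metric that is missing from the output may simply not have been logged yet, so treat the result as a spot check rather than a full inventory.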

The GPU data provided by DCGM can be visualized alongside the rest of your monitoring data, as shown in Figure 1.

Figure 1. Example output from collectd, visualized by Grafana

Summary

Integrating DCGM with the collectd telemetry framework provides IT administrators with a comprehensive view of GPU usage. If you are already using collectd, the information in this blog post will enable you to include GPU monitoring on the same pane of glass as the rest of your telemetry data. If you are using another telemetry framework, please see Chapter 4 of the DCGM User’s Guide for more information on how to integrate GPU metrics into your solution.

GPU telemetry only scratches the surface of the full DCGM feature set. DCGM also includes active health checks and diagnostics, as well as management and accounting capabilities.
