Lead Image © Gino Santa Maria, 123RF.com

Storage monitoring with Grafana

Painting by Numbers

Create intuitive and meaningful visualizations of storage performance values with a "TIG" stack: Telegraf, InfluxDB, and Grafana. By Andreas Stolzenberger

Performance values as plain old numbers do not present a visually appealing overview of system performance, but graphical dashboards can help you visualize what would otherwise be boring metrics. A number of free applications visualize metrics in almost any form desired, and one of the most popular open source tools in this family is Grafana [1]. Without much programming knowledge, you can build dashboards to present Internet of Things (IoT) values, stock prices, or the performance data of monitored systems. In this article, I show you how to use Grafana in a convenient GUI to display storage performance values, as well as how to retrieve the desired Simple Network Management Protocol (SNMP) data from InfluxDB.

SNMP, the source for the performance information in this example, is supported by all common operating systems and networked devices. This example uses Synology network-attached storage (NAS) as the data source. Because the queries only use entries from the Management Information Base (MIB) v2 standard, the example will also work with other Linux-based NAS and storage area network (SAN) devices or commercial storage systems. However, before Grafana can visually evaluate performance data, other tools are needed to collect the data and store the results in a usable way.

InfluxDB Database

Performance data can best be saved in time series databases (TSDBs), which automatically reduce the volume of acquired data by reducing the accuracy of the measurements over time according to appropriate rules. For example, a metric database retains up-to-the-minute readings for several days, but then only hourly readings for older data, and after a further period of time, only one value per day. Popular TSDBs include RRDtool, Prometheus, and InfluxDB.
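The thinning-out rule described above can be sketched in a few lines of Python. This is only an illustration of the principle, not how InfluxDB or RRDtool implement it internally; the function name downsample and the sample readings are made up for the example.

```python
from collections import defaultdict
from statistics import mean

def downsample(points, bucket_seconds):
    """Collapse (unix_timestamp, value) pairs into one mean value per bucket."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its bucket
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical per-minute readings covering two hours (values 0 to 119)
minute_readings = [(minute * 60, float(minute)) for minute in range(120)]
hourly = downsample(minute_readings, 3600)
print(hourly)
```

The 120 per-minute readings shrink to two hourly values; applying the function again with 86400 seconds would reduce them to one value per day.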

InfluxDB [2] is an open source TSDB with a simple HTTP API on port 8086 and an SQL-flavored query language. Anyone who has ever worked with SQL databases will find InfluxDB very easy to use, and the API requires little programming knowledge. To store stock market prices in InfluxDB, for example, all you need is a simple Bash script that first retrieves the price data from a website and then saves the data in the database:

curl -i -XPOST 'http://localhost:8086/write?db=stockprices' --data-binary "Stock,Symbol=$symbol value=$price"

This command adds a share price value for Symbol to the Stock table in the stockprices database. InfluxDB itself writes the current timestamp to the entry. In principle, even a simple Bash script would be sufficient to query performance values with snmpget, then filter them with awk, and write them to the database with a similar curl statement.
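If you prefer Python to curl, the same write could look roughly like the following minimal sketch. It assumes an InfluxDB 1.x instance on localhost with an existing stockprices database; the helper names stock_line and write_point and the symbol ACME are hypothetical.

```python
import urllib.request

def stock_line(symbol, price):
    """Build one line-protocol record: measurement, tag set, field set."""
    return f"Stock,Symbol={symbol} value={price}"

def write_point(line, db="stockprices", host="http://localhost:8086"):
    """POST a line-protocol record to the InfluxDB 1.x /write endpoint."""
    request = urllib.request.Request(
        f"{host}/write?db={db}", data=line.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # InfluxDB answers 204 No Content on success

line = stock_line("ACME", 12.5)  # hypothetical symbol and price
print(line)
# write_point(line)  # uncomment when an InfluxDB instance is listening
```

The record carries no timestamp, so InfluxDB adds the current time itself, just as with the curl call.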

However, InfluxDB has some helpful tools to make this operation easier. In addition to its own visualization front end Chronograf, the Telegraf data importer collects information with various import plugins and sends the data to InfluxDB. Of course, the import plugins also include an SNMP grabber.

Setting Up the Grafana Host

In this example, I used a virtual machine with CentOS 7 as the Grafana host. Strictly speaking, the test setup does not even rely on a full-fledged virtual machine, but uses an LXC container with CentOS 7, which is absolutely fine. Manufacturers InfluxData and Grafana Labs both provide RPM repositories for their tools. After a minimal CentOS 7 installation (and the yum update -y command), you should create two repository files, influxdb.repo and grafana.repo, in the /etc/yum.repos.d directory, the contents of which are shown in Listing 1. To install the required tools, enter:

Listing 1: Repository Files

[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/centos/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

yum install net-snmp net-snmp-utils grafana telegraf influxdb

For this example, neither InfluxDB nor Grafana requires special configuration options; both can be started with the default values:

systemctl enable influxdb
systemctl start influxdb
systemctl enable grafana-server
systemctl start grafana-server

InfluxDB (on port 8086) and Grafana (on port 3000) are now available. No firewall is deployed in the demo environment. The default login to Grafana is the admin account with a password of admin. In a production environment, you would need to secure access to InfluxDB with an account, a password, and SSL; then deploy a reverse HTTP proxy with SSL termination upstream of Grafana and unlock the required ports with the firewall-cmd command.

Configuring Telegraf

Before Telegraf can start reading data over SNMP, the SNMP service needs to be running on the storage system and serving the public community. To choose another community, you change the name in the configuration. Because the SNMPv1/v2c protocol versions are insecure, they should only be used for read access; to change the configuration of devices over SNMP, use the secure but more complex SNMPv3. To enable SNMPv2c on a Linux system, simply install the appropriate package with the SNMP daemon and add an entry for rocommunity public as the only line in the /etc/snmp/snmpd.conf file. Of course, the firewall must allow UDP access to port 161.

Now you have to decide which SNMP data you want to collect. The snmpwalk command-line tool queries and displays parts of the SNMP MIB of a target system. The Synology NAS is running on IP address 192.168.2.6 and responds to SNMP queries over protocol version 2c with the public community. The command

snmpwalk -v 2c -c public 192.168.2.6

first outputs the complete SNMP MIB of the storage system, but in this example, I only want to look at selected values, and I want Telegraf to collect the following metrics:

System name
Network traffic
Disk I/O
Space usage of the filesystems
CPU and RAM utilization (optional)

The name of the system is important, because you will not want to run the SNMP query against one system only. Later, you will want to display the data for several systems in different views on the Grafana dashboard. For the purposes here, a new /etc/telegraf/telegraf.conf (Listing 2) first tells Telegraf to request new data every 30 seconds and store the data in the telegraf database on the local InfluxDB instance. If the database does not yet exist, Telegraf will create it. The configuration for the SNMP data source is:

[[inputs.snmp]]
agents = [ "192.168.2.6:161" ]
version = 2
community = "public"
name = "snmp"

Listing 2: Requesting New Data

#Telegraf Configuration, Collect SNMP
[agent]
interval = "30s"
round_interval = true

[outputs]
[outputs.influxdb]
urls = [ "http://localhost:8086" ]
database = "telegraf"

Several systems can be specified in the agents line, which Telegraf then queries sequentially. This information is then followed by the queries for individual MIB values, all of which later end up in the snmp table:

[[inputs.snmp.field]]
name = "hostname"
oid = "RFC1213-MIB::sysName.0"
is_tag = true
[[inputs.snmp.field]]
name = "RAMFree"
oid = "1.3.6.1.4.1.2021.4.6.0"

The SNMP object identifier (OID) can be specified both numerically and by its name. The hostname field is also assigned the is_tag modifier, which defines the field as a table index, making it easier to use in later queries. In this example, Telegraf queries the OID UCD-SNMP-MIB::memAvailReal as the only memory value. It is one of the few metrics that actually refers to the available RAM. Most other memory values only provide information about the virtual memory (i.e., RAM plus swap). If you want the exact details, you can of course query all memory metrics. Optionally, Telegraf can also retrieve the CPU load:

[[inputs.snmp.field]]
name = "CPUsystem"
oid = "1.3.6.1.4.1.2021.11.10.0"

This worked for the old Synology NAS in our lab, which only uses a single-core Atom CPU. Modern systems require further requests for all cores (10.0, 10.1, 10.2, etc.). However, this test does without the direct CPU values and later prefers to query the OS value "System Load", which indicates the load of the system independent of the number of cores.

Reading Complete SNMP Tables

SNMP organizes various items of system information in tables, which is quite practical for my purposes because Telegraf can retrieve complete SNMP tables in a single action. With a networked storage system, administrators naturally want to know exactly how many gigabytes are coming in and going out over the network interfaces. Telegraf therefore collects the complete network table:

[[inputs.snmp.table]]
name = "if"
inherit_tags = [ "hostname" ]
oid = "IF-MIB::ifXTable"

[[inputs.snmp.table.field]]
name = "ifName"
oid = "IF-MIB::ifName"
is_tag = true

The ifName field is a table index, which makes it easy later to display the values of the various network interfaces separately. This example could also be used to monitor a managed network switch. The IF-MIB then lists all switch ports and their loads, and it works for Fibre Channel switches, as well. The input

snmpwalk -v 2c -c public 192.168.2.6 IF-MIB::ifXTable

shows the complete table content, including the names of the interfaces and many different counters for packets sent and received.

On the basis of the same pattern, Telegraf will also import the disk I/O values into InfluxDB, which are also organized in a standard MIB table; the device name later acts as an index. The input/output operations per second (IOPS) here are far more interesting than the throughput (MBps) per disk. Bandwidth bottlenecks in sequential data transfer are primarily caused by the network connection. The disks, on the other hand, with limited IOPS, cause problems in the case of many small instances of random access, such as database queries or simultaneous access by different clients. The entries are thus:
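A rough back-of-the-envelope calculation shows why IOPS, rather than throughput, is the figure to watch for random access. The numbers used here (150 random IOPS for one disk, 4KiB per request, a 1Gbps link) are illustrative assumptions, not measurements from the test NAS.

```python
def throughput_mb_per_s(iops, request_bytes):
    """Data rate that results from a given number of I/O operations per second."""
    return iops * request_bytes / 1_000_000

# Assumed figures for illustration only
random_rate = throughput_mb_per_s(iops=150, request_bytes=4096)
gigabit_limit = 1_000_000_000 / 8 / 1_000_000  # 1Gbps expressed in MBps

print(f"random 4KiB access: {random_rate:.2f} MBps")
print(f"gigabit link limit: {gigabit_limit:.0f} MBps")
```

Under these assumptions, the disk is saturated by small random requests long before the network link comes anywhere near its limit.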

[[inputs.snmp.table]]
name = "diskio"
inherit_tags = [ "hostname" ]
oid = "UCD-DISKIO-MIB::diskIOTable"

[[inputs.snmp.table.field]]
name = "DiskName"
oid = "UCD-DISKIO-MIB::diskIODevice"
is_tag = true

Telegraf retrieves information on disk allocation and system load from two further standard SNMP tables (Listing 3).

Listing 3: Disk and Load Requests

[[inputs.snmp.table]]
name = "diskusage"
inherit_tags = [ "hostname" ]
oid = "HOST-RESOURCES-MIB::hrStorageTable"

[[inputs.snmp.table.field]]
name = "VolumeName"
oid = "HOST-RESOURCES-MIB::hrStorageDescr"
is_tag = true

[[inputs.snmp.table]]
name = "load"
inherit_tags = [ "hostname" ]
oid = "UCD-SNMP-MIB::laTable"

[[inputs.snmp.table.field]]
name = "loadtime"
oid = "UCD-SNMP-MIB::laNames"
is_tag = true

For the moment, the configuration covers all the values that Grafana will need for visualization. Because InfluxDB does not require a rigid database structure, you can add more tables or single values to the configuration later on. Additionally, you can use this configuration, as mentioned above, to query data from several systems. The

inherit_tags = [ "hostname" ]

entry attaches the hostname tag to every table row, so that InfluxDB queries can later select values by system, but more about this later.

To check whether the Telegraf configuration actually works, first issue the telegraf -test command. The tool then parses the configuration, executes the queries, and displays the results at the command line. You can check whether the results suit your needs and, if not, change the queries. If everything is fine, enter

systemctl restart telegraf
systemctl enable telegraf

to start the service and deliver fresh metrics to InfluxDB every 30 seconds.
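To confirm that data is actually arriving, you could ask InfluxDB directly before moving on to Grafana. The following is a minimal sketch that assumes the InfluxDB 1.x /query endpoint on localhost and the telegraf database from the configuration above; the helper names query_url and run_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

def query_url(influxql, db="telegraf", host="http://localhost:8086"):
    """Build a URL for the InfluxDB 1.x /query endpoint."""
    params = urllib.parse.urlencode({"db": db, "q": influxql})
    return f"{host}/query?{params}"

def run_query(influxql):
    """Send the query and return the decoded JSON answer."""
    with urllib.request.urlopen(query_url(influxql)) as response:
        return json.load(response)

url = query_url('SELECT * FROM "snmp" ORDER BY time DESC LIMIT 1')
print(url)
# print(run_query("SHOW MEASUREMENTS"))  # needs a running InfluxDB instance
```

If everything works, SHOW MEASUREMENTS should list the tables named in the Telegraf configuration: snmp, if, diskio, diskusage, and load.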

Creating Custom Dashboards

Grafana works on a simple principle: data sources that supply information on the one hand and visualizations that display data from the sources on the other. Grafana combines several visualizations in dashboards and also has a simple user and rights system, so you can restrict access to dashboards to individual groups and users. Here, however, I will not be looking at access controls.

A newly installed Grafana first requires a new password and the first data source. In this case, InfluxDB is on http://localhost:8086, Access: Server (Default) with the telegraf database, which does not require a username and password. I quickly create a new dashboard named Synology and get started with an initial visualization task showing the network traffic (Figure 1).

Figure 1: Overview of NAS performance. The dashboard displays disk and network I/O, system load, and storage system utilization level.

The Add Panel button at the top of the dashboard starts the dialog for the new visualization, which first wants information about the query. Grafana does not require manual input; rather, it relies on point-and-click in the Query Builder, which greatly simplifies even the more complex database queries. The query starts with the FROM statement. The first and only data source is also the default system. The table is simply named if in the Telegraf configuration set up earlier, and the WHERE selection filters for host and interface names. The NAS in the test goes by the name fatbox and has two LAN ports, of which only eth1 is attached to the switch. The selection is therefore:

FROM default if WHERE ifName = eth1 AND hostname = fatbox

For the SELECT statements, Grafana now only suggests the fields that match the FROM filter criteria. SNMP does not provide values in megabits per second, but simply counts the incoming and outgoing network octets (bytes) in 64-bit counters. The following selection is required for a value in bits per second:

SELECT field(ifHCInOctets) mean() derivative(1s) math(*8) alias(IN)

The ifHCInOctets field is a 64-bit integer that returns the number of incoming octets; the derivative(1s) function calculates the change from second to second. With new values only every 30 seconds, mean() determines the mean value between the last data points, and math(*8) converts the octet (=byte) per second into a bit per second value. The alias(IN) is only used for cosmetic reasons so that the legend for the graph reads if.IN.
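The conversion from counter readings to a data rate can be reproduced outside Grafana in a few lines of Python. This sketch only mirrors the derivative(1s) and math(*8) steps; the function name and the counter readings are invented for illustration, and counter resets are not handled.

```python
def bits_per_second(samples):
    """Turn (unix_timestamp, octet_counter) samples into bit/s rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        octets_per_second = (c1 - c0) / (t1 - t0)  # derivative(1s)
        rates.append((t1, octets_per_second * 8))  # math(*8)
    return rates

# Hypothetical ifHCInOctets readings taken 30 seconds apart
samples = [(0, 1_000_000), (30, 4_750_000), (60, 8_500_000)]
print(bits_per_second(samples))
```

With these readings, the counter grows by 3,750,000 octets in each 30-second interval, which corresponds to 125,000 octets or 1,000,000 bits per second.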

To display the OUT value, as well, simply click on the + at the end of the query and scroll to fields/field. Grafana then duplicates the existing query into a second SELECT query. This second line is then assigned the (ifHCOutOctets) field and the alias (OUT):

SELECT field(ifHCOutOctets) mean() derivative(1s) math(*8) alias(OUT)

Now the visualization will show the incoming and outgoing network traffic, but the two graphs overlap. To make this a little more clear cut, simply assign the math() entry of the OUT graph a math(*-8) entry instead of math(*8). Grafana now visualizes the OUT traffic in a far more intuitive graph as a negative value in the downward direction.

For up-to-date values at all times, you can set the displayed time span and refresh interval in the upper right corner of the dashboard. In this early phase, Last 1 hour Refresh every 30s is recommended.

Grafana shows the section icons for further graph configuration to the left of the query. From the Visualization icon, you can set the graphic type and the display options. The default values will normally be fine for line graphs. In the Axes | Left Y section, you can define the measurement Unit; in this example it is Data Rate – bit/s. The General tab has a field for the visualization name, which is then saved by pressing the Save Dashboard icon at the top of the screen.

Refining the Display

Using the same procedure, you can create a second panel with the disk IOPS (Figure 2). Choose whether you want to monitor the values of all physical disks separately (i.e., sda, sdb, sdc). However, the monitoring example here only considers the multidisk device dm-1 (i.e., the software RAID that the Synology NAS has created from the disks). Depending on the configuration of your network storage, completely different device names will appear here.

Figure 2: Grafana's graphical query tool builds InfluxDB queries and displays the results immediately.

Those with iSCSI target services and a block back end will find their iSCSI target devices listed separately as dm-2, dm-3, and so on. iSCSI targets in file mode, on the other hand, save the virtual disk as a file, whose IOPS appear as part of dm-1 and cannot be monitored separately. The query for IOPS is the same:

FROM default diskio WHERE DiskName = dm-1 AND hostname = fatbox
SELECT field(diskIOReads) mean() derivative(1s) math(*-1) alias(ReadIO)

The read value extends in the downward direction because of the math(*-1) entry, so that it visually mirrors the OUT traffic in the network panel. The second graph, similar to the first, uses the diskIOWrites field and alias(WriteIO) and omits the math() entry. Now just fine tune the appearance, and the second panel is done and dusted.

The last item of the Graph section lets you create alerts. Grafana can then alert in various channels if the monitored values drop below or climb above a certain value over a defined period of time. This is not necessary for I/O values. However, values such as temperatures, fan speeds, or UPS battery status can also be collected with this SNMP setup, and for these the alert function makes sense. In addition to good old email, Grafana can notify messengers such as Slack, Telegram, and Discord through configurable notification channels.

To display the fill level of the NAS as a percentage, you need a more complex query and a nice Singlestat panel (Figure 3). InfluxDB can perform mathematical calculations in the queries. For the fill level of the NAS /volume1 filesystem in percent, you need to divide the SNMP hrStorageUsed value from hrStorageTable by the total capacity hrStorageSize and multiply the result by 100.

Figure 3: NAS fill in a Singlestat Panel.

For queries that Grafana cannot create in the user interface, you can first create a simpler query with the graphical tools (e.g., only for hrStorageUsed) and then switch from the graphical to the text query by pressing the eye icon. After adding the division by hand, the query reads:

SELECT mean("hrStorageUsed")/mean("hrStorageSize")*100 FROM "diskusage" WHERE ("VolumeName"='/volume1') AND ("hostname"='fatbox') AND $timeFilter GROUP BY time($__interval) fill(previous)

Unlike the user interface query, the text query displays the required InfluxDB syntax with parentheses and quotes. You can check immediately whether your manual edits to the query actually work with the Query Inspector, which displays the complete output of a query onscreen, including all error messages.
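The arithmetic behind the fill level query is simple enough to check by hand. The following sketch uses invented readings; it relies only on the fact that hrStorageUsed and hrStorageSize are both counted in the same allocation units, so the unit size cancels out of the ratio.

```python
def fill_percent(used_units, size_units):
    """Fill level in percent; both values are counted in allocation units."""
    return used_units / size_units * 100

# Hypothetical hrStorageUsed and hrStorageSize readings for /volume1
level = fill_percent(used_units=1_500_000, size_units=2_000_000)
print(level)
```

A volume with 1,500,000 of 2,000,000 allocation units in use is 75 percent full, regardless of how large a single unit is.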

To display this query appropriately, select the Singlestat option in the Visualization section. To make it pretty, add the following values to the Value pane: Stat: current; Unit: percent (0-100); Thresholds: 50,80 with the colors green, yellow, and red; and Gauge: Show, with Threshold Markers checked. The thresholds change the color of the graph when the specified values are exceeded, so this gauge will show red as of 80 percent occupancy of the volume. The Stat: current entry forces the display to show the latest acquired value. If the Stat: avg default is kept, the graph shows the average value over the period selected top right.
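The way two thresholds split the gauge into three color ranges can be expressed as a small function. This is only a sketch of the idea, not Grafana code; it assumes that a value equal to a threshold already counts as having reached it, and the function name gauge_color is made up.

```python
def gauge_color(percent, thresholds=(50, 80), colors=("green", "yellow", "red")):
    """Pick the color of the range that the value falls into."""
    # Count how many thresholds the value has reached (assumption: >= counts)
    index = sum(percent >= limit for limit in thresholds)
    return colors[index]

for value in (30, 65, 80):
    print(value, gauge_color(value))
```

Two thresholds always need three colors: one for the range below the first threshold, one between the two, and one above the second.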

Conclusions

Grafana lets you create intuitive and meaningful visualizations with TSDB queries without the need for prior training. The tool is easy to use and does not require in-depth knowledge of a query language for simple visualizations. Forums provide ready-made dashboards for various evaluation purposes, but in most cases, the overhead involved in customizing an imported dashboard will be greater than defining your own. The flexible InfluxDB and Telegraf team on the back end collects monitoring data without too much configuration overhead. Once set up, the Telegraf, InfluxDB, and Grafana (TIG) stack can be expanded continuously so that you can monitor your IT environment graphically and seamlessly.