Lead image © Raoul Fesquet, fotolia.com

Grafana and time series databases

More than a Thousand Words

We look at database back ends for monitoring, alerting, and trending analysis in the Grafana visualization tool. By Martin Loschwitz

Admins are rumored to feel more at ease working with text-based terminals than with graphical tools, and the command line often is better suited to classic admin tasks than a GUI because it allows scripting and direct input. For other admin tasks, though, this assessment is typically inverted: When it comes to processing measurement data and statistics, visual tools are clearly superior. In particular, data for monitoring, alerting, and trending (MAT) that comes from several sources requires visualization for meaningful analysis.

Virtually every large environment, whether a container platform or a public cloud environment, is strongly dependent on MAT. Only MAT provides reliable clues on the health and usage of systems that let you know when you need to add new hardware because your platform is currently fully utilized. Grafana [1] targets admins who need MAT analysis.

Match Winner

Grafana cooperates with various back ends and can pull the data you want to display from many sources. Its developers refer to this configuration as data-driven architecture. In place of classic event monitoring, Grafana uses a time-series-based principle, in which monitoring is a spin-off of the need to collect various metrics continuously. For example, if you operate a multinode cluster for MySQL based on Galera, you will want the load on all database back ends to be roughly equal. If system load on one of the back ends suddenly drops off, it is a strong indicator that something is wrong.
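The Galera example can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Grafana functionality: The function name find_outliers, the node names, the load values, and the 50 percent tolerance are all assumptions made for the example.

```python
from statistics import median


def find_outliers(loads, tolerance=0.5):
    """Return nodes whose load deviates from the cluster median.

    loads: mapping of node name -> current system load (hypothetical values).
    tolerance: allowed relative deviation from the median (assumed 50%).
    """
    mid = median(loads.values())
    if mid == 0:
        return []
    return sorted(
        node for node, load in loads.items()
        if abs(load - mid) / mid > tolerance
    )


# Three Galera nodes; the load on galera3 has suddenly dropped off.
cluster = {"galera1": 4.1, "galera2": 3.9, "galera3": 0.2}
suspects = find_outliers(cluster)
```

Comparing each node against the median rather than the mean keeps a single misbehaving node from distorting the reference value.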

Unlike typical incident monitoring in the style of Nagios, Grafana bases its conclusions on performance data. Although Grafana is not primarily about monitoring, it does help you prepare the corresponding time series from monitoring systems in an easily interpretable way. In this article, I first highlight the key features of Grafana and then present the most important back ends from which it can draw its information.

New Monitoring Paradigm

To understand the motivation behind Grafana, you need to take a small excursion into the world of monitoring. A paradigm change has taken place in the past few years that, in turn, is closely linked to cloud computing. Monitoring a cloud is different from monitoring conventional IT platforms, which to a certain degree are static and, after setting up the environment, change only in the details. The standard tools for monitoring are well known to experienced admins: Nagios, Icinga, Check_MK, and various other solutions of the same design.

Monitoring in conventional environments relies on events. If a service stops running on a server, the monitoring system notices and raises an alert. Trending plays a minor role in this classic scenario, because the workload of such a setup will tend to grow evenly, giving you sufficient time to purchase new hardware. That said, even conventional monitoring solutions cannot completely do without trending. For example, PNP4Nagios [2] uses checks to collect performance data and then displays the data in a graphical format directly in the Nagios web interface.

This arrangement no longer works for a public cloud, because it is not predictable when the platform will need to scale horizontally by adding new servers. For example, a new customer with a huge workload could easily set up an account in a typical public cloud and start generating a massive load.

Trending Becomes More Important

For newer types of platforms, trending therefore plays a bigger role than for their conventional predecessors. PNP4Nagios or comparable solutions look more like stop-gaps in these cases. They normally store their measurement data in the background in a normal database, usually MySQL, which is perfectly suited for classic, event-based monitoring, in which incidents become separate entries in a table. Because you are interested in the individual events in this scenario and the data is stored exactly that way in MySQL, you will have no problem retrieving or processing the data.

Trending changes the rules of the game because it does not focus on events, but on the evolution of performance data over time. In the case of trending, you no longer want to know whether or not a specific service was working at a given time; instead, you are interested in the central health values of the systems (e.g., CPU load and RAM usage). If their values remain high over a period of time, new hardware will be required for load balancing.
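A minimal sketch of this kind of question, assuming samples are available as (timestamp, value) pairs, might look like the following. The function name sustained_high, the 90 percent threshold, and the sample data are hypothetical and chosen only for illustration.

```python
def sustained_high(samples, threshold, min_seconds):
    """Return True if values stay above threshold for at least min_seconds.

    samples: list of (unix_timestamp, value) tuples sorted by time.
    """
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts          # a new high-load run begins here
            if ts - run_start >= min_seconds:
                return True
        else:
            run_start = None            # run interrupted, start over
    return False


# Hypothetical CPU utilization in percent, one sample per minute.
cpu = [(0, 35), (60, 92), (120, 95), (180, 91), (240, 96), (300, 40)]
needs_capacity = sustained_high(cpu, threshold=90, min_seconds=180)
```

A single spike does not trigger the check; only an uninterrupted run above the threshold does, which is the difference between an event and a trend.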

However, if you want to generate this information from individual events stored in a MySQL database, it would require many individual database queries, and thus a correspondingly high load. Also, it takes quite a while for MySQL or a similar database to provide answers to these queries.

Storage with a Difference

At this point, time series databases enter the scene. Although they resemble MySQL databases in many ways, their internal structures clearly differ. Time series databases are designed out of the box to save measurements in their temporal sequence, rather than as individual events. Therefore, they can respond far faster than MySQL to a request such as, "What was the CPU load of system x in period y?"
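Why temporal ordering makes such requests cheap can be shown with a toy in-memory series. Real time series databases use far more elaborate storage formats; the class name Series and its methods are assumptions for this sketch, which only illustrates that a range query on sorted timestamps needs two binary searches instead of a scan over individual event rows.

```python
from bisect import bisect_left, bisect_right


class Series:
    """Toy in-memory time series: samples are kept sorted by timestamp."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        # Assumes samples arrive in chronological order.
        self.timestamps.append(ts)
        self.values.append(value)

    def query(self, start, end):
        """Return all (ts, value) pairs with start <= ts <= end."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))


cpu_load = Series()
for ts, value in [(10, 0.4), (20, 0.7), (30, 1.2), (40, 0.9)]:
    cpu_load.append(ts, value)

window = cpu_load.query(20, 30)   # a period
point = cpu_load.query(30, 30)    # a period that is a single point in time
```

The second query shows that a single point in time is just a very short period, so the same structure can answer both kinds of question.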

Fortuitously, in a time series, classic, event-based monitoring is also available – almost as a kind of side product – because the defined time period can be just a single point in time (e.g., exactly 30 seconds ago). MAT systems based on time series can therefore be used for classic monitoring as well as for comprehensive trending, and this is precisely where current developments are heading: All monitoring systems that have arisen in recent years rely in the background on a database with a time series structure.

How Does Grafana Work?

The Grafana developers noticed that a number of time series databases were already on the market, but even the prettiest trending data will not help if you cannot somehow visualize it and thus understand it. Grafana slots neatly into this gap. Essentially, Grafana is a server that speaks HTTP(S) and exposes a graphical web interface.

Under the hood, the service has a plugin interface for various back ends (i.e., between Grafana and various time series databases). In the web interface, you determine what information Grafana should show from the respective back end. Grafana then collects the required information and proceeds to generate images. A unified interface presents the most important information from various time series databases at a glance. Because Grafana places great emphasis on flexibility, you can set up its GUI (i.e., the dashboard or control center) to suit your preferences.

Grafana also is easy to install and use. On their website, the developers explain how to install and configure their tool on Red Hat Enterprise Linux, CoreOS, Debian, or Ubuntu. Grafana also can be operated on Windows servers and Macs. For a simple approach, a Docker container comes with Grafana pre-installed, which you can boot on the fly in any Docker environment. Within a few minutes, you have access to a running Grafana in which only the data sources need to be defined.
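Data sources can be defined in the web interface, but Grafana also exposes an HTTP API for this purpose. The following sketch builds such a request without sending it. The field names follow the data source API as commonly documented and should be checked against the documentation for your Grafana version; the function name datasource_request, the URLs, and the token are placeholders invented for the example.

```python
import json
from urllib.request import Request


def datasource_request(grafana_url, api_token, name, ds_type, ds_url):
    """Build (but do not send) a request that registers a data source."""
    payload = {
        "name": name,
        "type": ds_type,      # e.g. "graphite", "influxdb", "prometheus"
        "url": ds_url,
        "access": "proxy",    # the Grafana server queries the back end itself
    }
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_token,
        },
        method="POST",
    )


# Placeholder addresses and token; adjust to your own environment.
req = datasource_request(
    "http://localhost:3000", "EXAMPLE_TOKEN",
    "metrics", "prometheus", "http://localhost:9090",
)
```

Passing req to urllib.request.urlopen() against a running Grafana instance would submit it; the sketch stops short of that so it can be read and tested without a server.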

Support extends from common time series databases such as OpenTSDB and InfluxDB, through Prometheus, to more exotic solutions such as OpenStack Gnocchi. Grafana now even supports databases that are not genuine time series databases: Elasticsearch is just one example.

In the following sections, I look at the most important back ends and show how Grafana can be used with the respective databases.

Graphite

Graphite [3] is the first in the pageant of Grafana back ends. The phonetic similarity already suggests that the two solutions have something in common: Like Grafana, Graphite lets you visualize metrics data. Unlike with Grafana, however, visualization is not Graphite's declared main task. Instead, Graphite is primarily a time series database with a GUI added on.

At its core, Graphite comprises three components: Carbon acts as a kind of broker that waits for external agents to deliver metrics, which are then stored in Whisper, the core component of the software (i.e., the time series database). Interestingly, Graphite also comes with a web interface: The tool with the uninspiring name graphite-web can easily visualize data from Whisper. Some admins thus rightly ask why Grafana would need to dock with Graphite.
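How an agent delivers a metric to Carbon can be sketched with Carbon's plaintext protocol, in which each sample is one line consisting of metric path, value, and Unix timestamp. The helper names carbon_line and send_to_carbon, the metric path, and the host are assumptions for illustration; port 2003 is the usual default for the plaintext listener but may differ in your setup.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one sample in Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host="localhost", port=2003):
    """Send formatted lines to a Carbon listener (host and port assumed)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Build a line without sending it; the metric path is hypothetical.
line = carbon_line("servers.web1.cpu.load", 0.75, 1465839830)
```

Calling send_to_carbon([line]) would hand the sample to a running Carbon daemon, which then stores it in Whisper.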

Basically, a quick look at the two solutions answers this question, because Graphite and Grafana differ greatly in terms of appearance and functionality. Grafana is much more extensive than the Graphite web interface (Figure 1); it also looks far more modern. Grafana indisputably offers clear advantages over Graphite, especially when you need to visualize data from multiple sources in a central interface.

Figure 1: In terms of appearance, Graphite loses out to Grafana.

In Grafana, you visualize data simply by configuring multiple data sources and building appropriate dashboards. Graphite, however, can only output a list of the values that it finds in its own Whisper database. Graphite also does not let you submit your own queries or define special views. The fact is, Grafana is not just a drop-in replacement for the Graphite web interface: It offers far more features and is universally usable.

Graphite itself is pretty much useless without a connected monitoring solution providing the data. Admins therefore generally combine Graphite with Sensu [4] (Figure 2), a complete monitoring environment of a new type that uses Graphite as the storage location for its own payload data. People looking for a modern Nagios alternative often end up at the Nagios-compatible Sensu, because it can recycle the well-known Nagios checks.

Figure 2: Sensu, here with its Uchiwa graphical front end, builds on a time series database such as Graphite or InfluxDB to provide monitoring.

In combination with Grafana and Sensu, Graphite's primary task is storing data, so you can easily dispense with the Graphite web interface if you use Grafana (Figure 3).

Figure 3: Grafana lets you compose dashboards to suit your needs.

InfluxDB

If you want to leave Graphite out of the equation, you will find a possible alternative in InfluxDB, which is also a genuine time series database that was developed for requirements very similar to those handled by Graphite. Nevertheless, the products are independent and, apart from individual design principles, have little in common with each other. Weighing the pros and cons of the two solutions against each other is futile; naturally, both candidates claim to offer a little more performance than the other solution.

InfluxDB integration with Grafana is absolutely comparable to that of Graphite in terms of quality. Ultimately, it is all about your specific use case and, not least, about your own preferences as an admin, so you will need to run both products in a direct comparison and complete realistic test scenarios. Like Graphite, InfluxDB only makes sense in combination with a tool such as Sensu, which acts as the supplier of the coveted metrics.
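What a supplier such as Sensu ultimately hands to InfluxDB is a set of points in line protocol: measurement name, optional tags, one or more fields, and a timestamp. The sketch below formats one such point. The function name influx_line and the sample values are hypothetical, and the code deliberately skips escaping and non-float field types, so treat it as an illustration rather than a complete writer.

```python
def influx_line(measurement, tags, fields, timestamp_ns):
    """Format one point in InfluxDB line protocol.

    Assumes names and tag values contain no spaces, commas, or equals
    signs (no escaping is done) and that all field values are floats.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical measurement with a nanosecond timestamp.
point = influx_line(
    "cpu_load", {"host": "db1", "region": "eu"},
    {"value": 0.64}, 1465839830100400200,
)
```

Tags such as host are what you later filter and group by when you build Grafana panels on top of the data.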

Elasticsearch

Elasticsearch perfectly proves that Grafana can be used as a monitoring tool in combination with other programs. What may look nonsensical at first glance makes perfect sense when you take a closer look. Because various tools use Elasticsearch to save their results – especially in performance metering – Felix Barnsteiner, who is responsible for the Stagemonitor [5] Java performance application, pushed the integration of Grafana and Elasticsearch. Stagemonitor stores its measured values in Elasticsearch, so with an interface between Grafana and Elasticsearch, you can now perfectly visualize these values. In this sense, Elasticsearch is a time series database after all.

OpenTSDB

In contrast to Graphite and InfluxDB, OpenTSDB never explicitly refers to trending in its product description. Instead, the solution presents as a distributed data store that can cope with large volumes of time series data and leaves it up to admins to consider the use cases that match its capabilities.

Under these circumstances, admins are unlikely to opt for OpenTSDB on a whim. From the user's perspective, the solution is simple to use; from an admin perspective, however, it implies a running Hadoop cluster, because OpenTSDB stores its data in HBase, which runs on top of Hadoop. In other words, you cannot easily build this setup. However, if you happen to have a working OpenTSDB cluster that already holds the relevant data, you can add Grafana through the matching plugin.

Amazon CloudWatch

Amazon offers its own metrics service – also a time series database – for its Amazon Web Services cloud. A plugin for Grafana can read data from CloudWatch and present it visually. Grafana thus demonstrates that it is also suitable for hybrid workloads: With a single click, you can visualize the current load data for a private cloud or your part of a public cloud.

Prometheus

A prime example of the full integration of a MAT solution in Grafana is Prometheus [6], a monitoring solution that is a time series database at its core (see also the article on Prometheus in this issue). The aim of the original SoundCloud project was to monitor large and fast-growing computer farms, including the ability to detect failures very quickly and to plan setup and scale-out in the long term.

In contrast to the solutions already presented in this article (i.e., Graphite, OpenTSDB, InfluxDB), the Prometheus developers focused on performance: The Prometheus server uses its own format to store data on disk and indexes it with LevelDB. Of the back ends for Grafana referred to here, Prometheus is the only complete monitoring solution: In all other cases, the time series database is only the back end used by another solution (e.g., Sensu) to store its data.

In terms of functionality, Prometheus is state of the art and owes great flexibility to its highly modular architecture. The Node Exporter is a good example of this. It runs on the individual hosts, performs tests there, and records the results. Subsequently, the central Prometheus server retrieves the Node Exporter results from all hosts and saves the information accordingly.
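What the server retrieves from an exporter is plain text over HTTP, one sample per line. The following sketch parses a small subset of that text format. The function name parse_metrics is hypothetical, the sample scrape is invented (metric names vary between Node Exporter versions), and the parser assumes lines without timestamps and without spaces inside label values.

```python
def parse_metrics(text):
    """Parse a small subset of the Prometheus text exposition format.

    Skips comments and blank lines; assumes no timestamps and no spaces
    inside label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the last whitespace is the series name.
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


# Invented excerpt of what an exporter might return.
scrape = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemFree 2.1e+09
"""
metrics = parse_metrics(scrape)
```

Because the hosts only expose their current values and the server does the collecting, adding a host means adding a scrape target rather than reconfiguring the host to push data somewhere.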

Prometheus shows evidence that it is designed for large and highly elastic environments. For example, it has a direct interface to Etcd, a daemon that keeps records for all the nodes in a cluster. As soon as you add a new server to a cluster, Prometheus automatically learns about it via a detour to Etcd. If the automation works and automatically rolls out a Node Exporter on each server, you only need to install a new server to integrate it into your existing monitoring setup.

When the first Prometheus article [6] went to press, the developers still offered a graphical interface. PromDash (Figure 4) was in many ways a competitor of Grafana, but far inferior in almost all aspects. In the end, Grafana released a plugin to access Prometheus data (Figure 5), and shortly afterward, the Prometheus developers announced they had stopped working on PromDash.

Figure 4: SoundCloud originally developed its own GUI for Prometheus, PromDash, but it is now officially dead …

Figure 5: … and the Prometheus developers recommend using Grafana.

Because the Prometheus developers wrote the storage back end for their metrics from scratch, they took the opportunity to develop their own query language: PromQL. Grafana makes good use of the query language: Once you have set up a Prometheus server in Grafana as a data source, you can configure arbitrary PromQL queries to generate graphics in Grafana. All told, the team of Grafana and Prometheus turns out to be an extremely potent combination: If you are looking for real MAT, you definitely should take a closer look at it.
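To give an impression of what such a query looks like on the wire, the following sketch builds a URL for the range query endpoint of the Prometheus HTTP API, which is the kind of request a graph panel needs. The function name range_query_url, the server address, the time range, and the PromQL expression are assumptions chosen for illustration; the metric node_load1 presumes a Node Exporter is being scraped.

```python
from urllib.parse import urlencode


def range_query_url(base_url, promql, start, end, step):
    """Build a URL for the Prometheus range query endpoint."""
    params = urlencode({
        "query": promql,   # the PromQL expression
        "start": start,    # Unix timestamps delimiting the period
        "end": end,
        "step": step,      # resolution in seconds
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Average one-minute load per instance over one hour, one point per minute.
url = range_query_url(
    "http://localhost:9090",
    "avg by (instance) (node_load1)",
    start=1465839000, end=1465842600, step=60,
)
```

In Grafana, you would enter only the PromQL expression in the panel editor; the time range and resolution come from the dashboard.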

Gnocchi

Grafana also offers genuine added value for OpenStack admins with Gnocchi [7]. This database was developed by the OpenStack project as its metering service and is configurable as a data source in Grafana. You might remember that Ceilometer was designed to take care of metering in OpenStack, but for many years, it has led the life of a wallflower, largely because the first versions of Ceilometer expected MySQL as the storage back end. Metering data from OpenStack includes values such as the number of virtual machines running per customer, the number of virtual networks created, or the entire memory consumed by all of a customer's volumes. Such a workload is not perfectly suited for MySQL, so Ceilometer turned out to be resource-hungry and thus unpopular with admins in the OpenStack environment.

Gnocchi is an attempt to fix these problems. This time series database was specially designed for data in a metering and billing context. Gnocchi assumes the role of the storage engine, and Ceilometer populates it with the acquired data. The Gnocchi data source in Grafana lets you read several values from Gnocchi and display them in a visually appealing way.

Grafana does not create a complete billing solution with Ceilometer and Gnocchi. If you want to use the data from Gnocchi for billing purposes, Gnocchi still needs to be integrated into the billing system. The strength of the Grafana-Gnocchi combination is that it visualizes the workloads of individual projects, on which you can base a forecast as to when the next expansion stage will be needed. If relevant data is available from all the projects – even better!

In the context of OpenStack, Grafana also offers the only good approach to visualizing the Gnocchi data in a way that lets you draw conclusions at a glance. OpenStack cloud admins could thus do worse than taking a closer look at Grafana.

Conclusions

The bottom line on Grafana and its back ends is that the solution can be installed easily and quickly, is easy to maintain, and offers genuine added value in everyday life. If you already operate a time series database, or will be doing so soon, you should look into Grafana at an early stage and try out the features provided by the software. Grafana is very useful in the admin's daily grind and in dealing with time series data.