Tools Prometheus Add-Ons Lead image: Photo by Valerie Elash on Unsplash

Make better use of Prometheus with Grafana, Telegraf, and Alerta

Makeover

The Prometheus monitoring tool might not always look like one of the Titans, but add-ons like Alerta or Telegraf can improve its looks. By Martin Loschwitz

Prometheus is designed specifically for monitoring large, scalable setups. The solution comprises several components: Prometheus itself scrapes the metrics and stores them in its time series database. The Prometheus Node Exporter reads basic system values on the target systems and exposes them. Pushgateway accepts the values that Prometheus itself cannot read directly on the hosts. Prometheus compares the incoming metric data with freely definable alerting rules; if a value gets out of control, it fires an alert, and the Alertmanager groups the alerts and notifies the admins.

Theoretically, these components can be used to build a complete monitoring, alerting, and trending (MAT) system that easily monitors large environments, but practically, the components in this compilation lack elementary functions.

In this article, I look at complementary projects that exist for Prometheus that make admin life easier, including ready-made dashboards for data visualization, various metric data exporters, and two tools that display alarms graphically and coherently.

Beautiful Is Not Enough

The measurement data in Prometheus is good, but you need a way to visualize it. For a long time, the Prometheus project developed its own dashboard; today, the official recommendation is to use Grafana, which can be configured to use Prometheus as a data source.

If you want to display measured values from Prometheus in Grafana, however, you need suitable dashboards. To make matters worse, the metrics provided by Prometheus Node Exporter are not sufficient in many setups. Special cases like MySQL, RabbitMQ, or Open vSwitch are not handled by the Node Exporter.

The Alertmanager also causes trouble. Most conventional monitoring systems offer a clearly arranged web page on which the current alarms are listed. However, the Prometheus Alertmanager only has a rudimentary GUI that does not meet modern requirements.

Grafana

If you use Prometheus as a classical monitoring system, you do not even come into direct contact with a large part of the measurement data collected by the tool, because only the alerting rules analyze incoming values, and Alertmanager raises the alarm if necessary. However, this leaves one of Prometheus' core functions unleveraged: trending.

Trending requires that the metric data stored in Prometheus be prepared graphically in such a way that it is comprehensible to the admin. A table listing the current RAM usage figures for 5,000 hosts doesn't help you much. A corresponding graph that shows the progression of RAM usage over time provides much more information.
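
The raw data behind such a graph can be requested from the range query endpoint of the Prometheus HTTP API. The following Python sketch only builds the request URL and does not contact a server; the host name, the metric name, the timestamps, and the helper function are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def range_query_url(base_url, expr, start, end, step="60s"):
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical server and metric; start and end are Unix timestamps.
url = range_query_url(
    "http://prometheus.example.com:9090",
    "node_memory_MemAvailable_bytes",
    start=1700000000,
    end=1700003600,
)
print(url)
```

The step parameter determines how many data points the graph receives: one hour of data at a 60-second step yields roughly 60 points per time series.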

Grafana offers exactly this possibility: The program specializes in the graphical display of values from various tools and now supports a large number of data sources. One of the great strengths of Grafana is undoubtedly its modularity: You can define what you want displayed, and how, and the graphs you want to combine. Welcome to the world of Grafana dashboards: You can define any number of dashboards, and any number of metrics can be visualized in them, as long as they come from the same data source.

The problem is that the Grafana dashboard list is empty out of the box. If you build a brand new Prometheus setup, you first have to construct your own dashboards laboriously. Grafana expects the dashboards in JSON format, and the names of the individual metrics are not so intuitive that you would automatically hit on them without help.

Although you can find tools that let admins create JSON files in a GUI, you still need the catalog of metrics, and the work is still time consuming (Figure 1).

Figure 1: In a GUI, you can click to compile dashboards in Grafana.
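
To give an impression of what such a JSON file contains, the following Python sketch assembles a heavily reduced dashboard with two panels. The field names follow the general shape of Grafana's dashboard JSON, but the exact schema differs between Grafana versions, so treat the structure, the data source name, and the two metric names as assumptions rather than a template that imports as is.

```python
import json


def make_dashboard(title, panels):
    """Return a minimal dashboard dict; real dashboards carry many more fields."""
    return {
        "title": title,
        "panels": [
            {
                "id": idx,
                "type": "graph",
                "title": panel_title,
                "datasource": "Prometheus",  # assumed data source name
                "targets": [{"expr": expr}],  # query sent to the data source
            }
            for idx, (panel_title, expr) in enumerate(panels, start=1)
        ],
    }


# Metric names are assumptions based on common Node Exporter output.
dashboard = make_dashboard(
    "Host overview (example)",
    [
        ("Available RAM", "node_memory_MemAvailable_bytes"),
        ("Load average (1m)", "node_load1"),
    ],
)
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Even this reduced form shows why the metric catalog matters: Every panel stands or falls with the expression in its target.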

Not Reinventing the Wheel

The good news is that others also have wanted to wed Prometheus and Grafana, so you can find several setups that do so. Grafana offers a marketplace on its website [1], where developers can make their DIY dashboards available to the public. Admins can download these dashboards and simply import them into Grafana by hitting the Import button. If necessary, you can also adapt the templates.

If you want to use prebuilt dashboards in Grafana from the web, you will find different filters on the left side of the selection page. These filters are important, because how the dashboard is put together depends primarily on the exporter you use to collect your metrics. For example, if you use Prometheus Node Exporter, you need a matching dashboard. If you decide on Telegraf, the Grafana dashboards for Node Exporter will not work.

Several dashboards from the Grafana collection stand out. Knut Ytterhaug's Node Exporter Server Metrics, for example, lets you compare the values for Node Exporter metrics on different servers. As an alternative, Thomas Cheronneau's Prometheus System dashboard represents the values of a single server (Figure 2).

Figure 2: Node Exporter collects metrics for basic system details, which can then be processed graphically.

Grafana dashboards that display data prepared by Docker are particularly popular: Docker Host & Container Overview, for example (Figure 3), shows which containers run on a host and the resources they consume.

Figure 3: Even if you use Docker, you will find prebuilt Grafana dashboards to process your metrics.

Ready-made dashboards are also available for more pedestrian software: The Apache dashboard, for example, displays the metrics Prometheus can collect from the Apache web server. If you also use other exporters, the scope of the dashboards available for Grafana expands again. If you collect data from an OpenStack setup with Telegraf, for example, you can display it with the appropriate dashboard for Grafana.

All in all, it is clear: If you combine Grafana with the dashboards from its marketplace, you can get started far faster without losing flexibility. A detailed look at the Grafana website is therefore recommended to every prospective Prometheus admin.

Telegraf

Almost every Prometheus setup uses the Prometheus Node Exporter, but it is not absolutely necessary. As long as the data is available in the appropriate format, Prometheus does not care from which sources it obtains its metrics. Node Exporter has nevertheless developed into a general-purpose tool because of its feature scope. It collects basic data such as CPU load, RAM usage, and load average without further configuration. Thus, it offers the same functions that many admins first used in classic monitoring systems.

Node Exporter has various plugins that support not only basic values but also special cases, such as reading network statistics for InfiniBand cards. However, other functions have virtually withered in the program, such as reading SMART statistics for hard drives, for which Node Exporter can only call an immature external script. The Prometheus developers take the position that this SMART implementation is exactly what Prometheus is supposed to offer; if that is not enough for users, the developers imply that they are employing Prometheus incorrectly.

If you don't want to have a long discussion with the Prometheus developers, Telegraf might be a good alternative to Node Exporter: The program is part of InfluxData's TICK Stack and therefore lives in the competitor's camp. However, the open APIs provided by modern applications, including Prometheus and InfluxData, allow software from the two camps to work together.

Talking to Prometheus

Prometheus' open standards allowed Telegraf's developers to provide it with a function that outputs Prometheus-compatible metric values. Telegraf is, then, a drop-in replacement for Prometheus Node Exporter. Like Node Exporter, Telegraf opens a TCP/IP port on the respective system and exposes its metric data there. You can configure Prometheus so that it not only connects to Node Exporter, but also queries the Telegraf port.
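
What Prometheus finds on such a port is plain text with one sample per line. The Python sketch below parses a small, made-up excerpt in that text format; the metric names, labels, and values are illustrative, and the parser assumes simple lines without trailing timestamps, which the full format also allows.

```python
def parse_metrics(text):
    """Parse simple 'name{labels} value' lines; comments and blanks are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up excerpt of what an exporter might expose.
sample = """\
# HELP cpu_usage_idle Telegraf collected metric
# TYPE cpu_usage_idle gauge
cpu_usage_idle{cpu="cpu-total",host="web01"} 97.5
mem_available_percent{host="web01"} 42.0
"""

for series, value in parse_metrics(sample).items():
    print(series, value)
```

Because every exporter speaks this same format, Prometheus does not need to know whether Node Exporter or Telegraf produced the page.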

The only clue that Prometheus is not talking to its native exporter is that the metrics have different names. However, this is irrelevant for Prometheus functionality, because the product is designed to support a wide variety of exporters.

Telegraf is not superior in every way to Prometheus Node Exporter, which, for example, collects various metrics with no adequate equivalent in Telegraf (e.g., metrics for Mellanox network cards). At the same time, from the admin's point of view, it doesn't make much sense to have both Telegraf and Node Exporter collect all metrics, because many are similar, and you would end up with a huge volume of redundant metrics in Prometheus.

Telegraf offers a fundamental advantage in large setups with individual network segments that are isolated from each other by firewalls. If you want to use different exporters in such environments, the result is often massive holes in the firewalls, because theoretically, every exporter needs an open port on every host and in every firewall on the path between the host and Prometheus.

Telegraf can work around that problem by collecting metrics from other exporters that only listen on the localhost IP address 127.0.0.1 and pass them to Prometheus in one fell swoop. This setup saves admins the need for bulk firewall rule editing.
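
The principle can be sketched in a few lines of Python. This is not how Telegraf is implemented; it is a simplified illustration in which the two localhost endpoints and the function names are hypothetical. The pages of all reachable local exporters are concatenated so that only a single port would need to be reachable from the outside.

```python
from urllib.request import urlopen

# Hypothetical exporters that listen on the loopback address only.
LOCAL_ENDPOINTS = [
    "http://127.0.0.1:9104/metrics",
    "http://127.0.0.1:9419/metrics",
]


def fetch(url, timeout=5):
    """Read one metrics page over HTTP."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def collect(endpoints, fetcher=fetch):
    """Concatenate the metric pages of all reachable local endpoints."""
    pages = []
    for url in endpoints:
        try:
            pages.append(fetcher(url).rstrip("\n"))
        except OSError:
            continue  # skip exporters that are currently down
    return "\n".join(pages) + "\n"
```

Calling collect(LOCAL_ENDPOINTS) returns one combined page. The fetcher parameter exists only so the function can be tried without running exporters, for example with a lambda that returns canned text.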

In addition to the collection function described above, Telegraf has an extremely practical function that calls external scripts and sends their output to Prometheus as metrics. In this way, you can create exporters quickly for metrics that currently have none.

Conceivably, for example, you could check compliance on systems with InSpec [2], which can output its results in JSON format. If you call InSpec as an external script from Telegraf and send the results to Prometheus, Prometheus can trigger an alarm if a compliance test fails.
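
A script of this kind could look like the following Python sketch. It assumes that the InSpec JSON report nests its results under profiles, controls, and results with a status field; the metric name and the sample report are invented for illustration. How Telegraf is configured to call the script and parse its output is a separate step not shown here.

```python
import json


def inspec_to_metrics(report_json):
    """Turn an InSpec JSON report into one failure gauge per control."""
    report = json.loads(report_json)
    lines = []
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            control_id = control["id"]
            # Count the individual checks of this control that failed.
            failed = sum(
                1
                for result in control.get("results", [])
                if result.get("status") == "failed"
            )
            lines.append(
                f'inspec_control_failed_results{{control="{control_id}"}} {failed}'
            )
    return "\n".join(lines)


# Invented miniature report; a real one contains far more fields.
sample_report = json.dumps({
    "profiles": [{
        "controls": [
            {"id": "ssh-01", "results": [{"status": "passed"}]},
            {"id": "ssh-02", "results": [{"status": "failed"},
                                         {"status": "passed"}]},
        ]
    }]
})
print(inspec_to_metrics(sample_report))
```

An alerting rule in Prometheus would then only need to check whether one of these gauges is greater than zero.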

Alerta

The Prometheus Alertmanager will always work reliably if configured correctly. The graphics it creates to visualize alerts, on the other hand, are not convincing. Although it has a web interface, it only lists the alarms that currently exist in the Alertmanager in a very plain format.

If you are familiar with the charts that Nagios or Icinga use, you won't appreciate the Alertmanager output. Its web GUI is certainly not suitable for use as a tool that runs on a screen in a control center to notify admins in case of alerts. However, precisely such a tool is part of the standard toolkit in most operations.

Because Alertmanager has an open API, tools other than those from the Prometheus toolchain can connect to it. The Alerta developers took advantage of this and gave their tool an interface that lets it talk to Prometheus Alertmanager.

Alerta itself comprises several components. The most important from the admin's point of view is undoubtedly the web interface (Figure 4). Alerta displays alerts with multiple colors you can set in a configuration file. The Alerta API makes sure that alarms appear in the Alerta web front end in the first place.

Figure 4: Alerta is a clear and intuitive alternative to the Prometheus Alertmanager GUI.

All alerts are managed by Alerta through its API; the web interface only displays them. As an alternative to the web interface, you can use a command-line interface (CLI) to manipulate alerts in the shell. However, Alertmanager, which also has a CLI component, can do this, as well.
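
To illustrate what a request to this API might look like, the following Python sketch prepares an HTTP request for a single alert without sending it. The field names and the key-based Authorization header reflect the Alerta alert format as far as it is generally documented; the URL, the API key, and all values are hypothetical, so check the details against the API reference of your Alerta version.

```python
import json
from urllib.request import Request


def build_alert_request(api_url, api_key, resource, event, severity, text):
    """Prepare (but do not send) a POST request that reports one alert."""
    payload = {
        "resource": resource,        # what is affected, e.g. a host
        "event": event,              # what happened
        "environment": "Production",
        "severity": severity,
        "service": ["example-service"],
        "text": text,
    }
    return Request(
        f"{api_url.rstrip('/')}/alert",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Key {api_key}",
        },
        method="POST",
    )


# Hypothetical endpoint and key.
request = build_alert_request(
    "http://alerta.example.com/api", "demo-key",
    resource="web01", event="HighLoad", severity="major",
    text="Load average above threshold",
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to urllib.request.urlopen would submit the alert; the web interface and the CLI are both clients of this same API.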

The web interface proves to be extremely useful in practice: Not only does it display current alarms in tabular form, it also offers various details when the admin clicks on an alarm from the list, such as the date on which the alarm first appeared or the type of alarm.

Alerta differentiates between several problem levels when displaying errors: Critical errors are displayed in red, Major errors are highlighted in orange, less critical Minor errors are highlighted in yellow, and a Warning is highlighted in blue. Prometheus determines the degree to which a problem is relevant.

Even Better Overview

To further improve its intelligibility, Alerta also divides incoming alerts into different categories when sent by the monitoring system. If you used labels in Prometheus to ensure that alarms were sorted by different parameters, Alerta adopts this categorization without any further configuration. The cooperation between Alerta and Prometheus proved to be error-free in the test.

However, the integration between the two services is not quite perfect when it comes to the admin manipulating alerts. Alerta offers the ability to acknowledge an alert so that it no longer appears in the list of acute alerts. However, clicking on Acknowledge does not directly affect the corresponding alert in Alertmanager. The Alerta project on GitHub offers a module [3] that can be used to retrofit this type of integration.

Undoubtedly Alerta's greatest strength is its versatility, which runs through the entire program on several levels. For example, Alerta has its own account management, which can be linked easily to external services such as OAuth or LDAP. Alerta is not explicitly built for the Prometheus Alertmanager, so it can also field alerts from a variety of other services, including classic monitoring systems like Nagios and Zabbix.

This flexibility can turn out to be a big advantage in the enterprise: If you already run classic monitoring for conventional setups and then add a cloud with Prometheus, Alerta combines all the alerts under a uniform interface. Alerta also has native interfaces for a variety of public cloud implementations, such as Amazon CloudWatch.

You can also set up multiple alarm sources of the same type in Alerta. If you run a geographically distributed setup and roll out separate Prometheus instances together with their Alertmanagers, you can easily collect all of these alerts in Alerta at a central location.

Alerta is a useful graphical alerting tool for Prometheus. Because the Alerta developers offer their tool in the form of a Docker container, it is very easy to try out. Configuring Alerta also poses few challenges worth mentioning for anyone with cloud experience. Alerta is a genuine recommendation in this respect.

Unsee

A second project, Unsee [4], also addresses the problems of the Alertmanager web interface. The software was developed by cloud provider Cloudflare and is specifically aimed at Prometheus. The comprehensive integration with other tools demonstrated by Alerta is missing from Unsee.

Unsee also takes a different approach from Alerta when it comes to visualizing alerts: Whereas Alerta is more reminiscent of Nagios or Icinga, Unsee takes the term "dashboard" seriously and arranges its alerts in boxes (Figure 5) that are distributed across the screen.

Figure 5: The Unsee dashboard approach is explicitly aimed at Prometheus users.

Unsee also communicates directly with Alertmanager through its API and thus notices defined labels that belong to alerts. Like Alerta, Unsee also displays these labels and tags. If you confirm an alert with the Acknowledge function in Alertmanager, this information also appears in Unsee.
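
How a dashboard might use these labels can be sketched in Python. The sample below imitates the list of alerts that the Alertmanager API returns, reduced to the labels and a status field; the exact response format depends on the API version, and the alert names, label values, and the helper function are invented for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, label):
    """Group alert names by the value of one label, as a dashboard box would."""
    grouped = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = labels.get(label, "(none)")
        grouped[key].append(labels.get("alertname", "unknown"))
    return dict(grouped)


# Invented sample in the rough shape of an Alertmanager alert list.
alerts = [
    {"labels": {"alertname": "HighLoad", "severity": "warning",
                "cluster": "eu-1"}, "status": {"state": "active"}},
    {"labels": {"alertname": "DiskFull", "severity": "critical",
                "cluster": "eu-1"}, "status": {"state": "active"}},
    {"labels": {"alertname": "HostDown", "severity": "critical",
                "cluster": "us-1"}, "status": {"state": "active"}},
]

for cluster, names in group_alerts(alerts, "cluster").items():
    print(cluster, names)
```

Grouping by a different label, such as severity, only requires changing the second argument.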

Unsee can also handle alerts itself, so you can bypass the Alertmanager dashboard if you wish. If you don't want to use this functionality, you can run Unsee in read-only mode, although setting it up requires some tinkering and is not very intuitive.

Unsee is generally much simpler than Alerta, which is evident in the lack of its own user administration. Instead, the login works with the credentials the user would use in the Alertmanager web interface.

If you don't like Alerta, you will find a very lightweight yet efficient alternative in Unsee. It looks just as good and, like Alerta, is distributed by its developers as a Docker container, so trying it out is not difficult.

Conclusions

Prometheus alone will not meet all your needs, but if you make use of various extensions, you gain genuine added value in the form of extended functions. As a visual display for alarms, Alerta unquestionably outperforms the Prometheus Alertmanager web interface, and other external tools, such as Telegraf, are also very valuable extensions.