Tools: MAT with the TICK Stack
Lead Image © Krishna Kumar-Sivaraman, 123RF.com

Monitoring, alerting, and trending with the TICK Stack

Cloud Radar

If you are looking for a monitoring, alerting, and trending solution for large landscapes, you will find all the components you need in the TICK Stack. By Martin Loschwitz

Anyone who needs to monitor large IT setups (e.g., a cloud) faces a challenge: Nodes come and go, and not every departure is a failure that needs to trigger an alert. In addition to monitoring and alerting, trending is also necessary; in many cases, it is the only way you can know when to add hardware to compensate for an increased base load.

Soon it becomes clear that typical monitoring solutions such as Nagios or Zabbix will not do. If you look into the subject in more detail, you end up with time series databases. In this article, I introduce the four components of the TICK Stack [1] (Telegraf, InfluxDB, Chronograf, and Kapacitor) and explain their respective strengths.

Deficits and Alternatives

The most prominent representative of this genre is probably Prometheus [2]. Launched as an internal tool by SoundCloud, the program and the additional components attached to it are now popular, but power users complain: In many respects Prometheus is missing functions, and design decisions were made that are not a good match for many setups.

An example is Prometheus Node Exporter, which is designed to collect metrics from the systems in the environment, often not in a way that the administrator desires (see the "Prometheus Add-ons" article in the previous issue [3]). Moreover, with a Prometheus server, you cannot store metrics redundantly and in a distributed storage system.

If your setup becomes too large for a single Prometheus instance, you have to split it, thus possibly canceling out one of the biggest advantages of a monitoring, alerting, and trending (MAT) system – namely, the single point of administration.

Additionally, Prometheus slows down as the volume of data increases. The program is fine for short and mid-term trending, but if you want to keep trending data safe for years, you will reach the limits of Prometheus as it loses much of its speed.

This situation is amazing considering the software only stores measured values numerically. Other solutions that store log messages as strings in the database can handle far larger amounts of data. To cut a long story short: You have good reasons not to build your platform for MAT on the current king of the hill, Prometheus.

If you are looking for an alternative to Prometheus, you might find InfluxDB, a component of the TICK Stack, useful. At the end of the article, I show that Prometheus and the TICK Stack are not necessarily mutually exclusive.

InfluxDB

The heart of the TICK Stack is the InfluxDB time series database [4]. Note that time series databases do not use tables, like classical databases, as a central element for data management; instead, they align all stored data in a time stream.

Time series databases are especially practical for the study of trends, with a focus on the value of a parameter over time, so you can guess how this value might develop in the future. For example, if you know how RAM usage has increased over the last three months, you can recognize when you might have to buy additional hardware to avoid resource bottlenecks.
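
The RAM example can be made concrete with a few lines of Python. The following is only a minimal sketch of the trending idea, not something the TICK Stack ships: It fits a straight line through past readings and estimates when the line will hit a capacity limit. The function name, the 256GiB limit, and the sample readings are assumptions made up for illustration.

```python
from datetime import date, timedelta


def forecast_limit_date(samples, limit):
    """Fit a least-squares line through (date, value) samples and return the
    date on which the fitted line reaches `limit`.

    Returns None if usage is flat or falling. If the limit has already been
    exceeded, the returned date lies in the past.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    origin = samples[0][0]
    xs = [(day - origin).days for day, _ in samples]
    ys = [value for _, value in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    days_until_limit = (limit - intercept) / slope
    return origin + timedelta(days=round(days_until_limit))


# Hypothetical weekly readings of used RAM in GiB over roughly three months.
readings = [(date(2024, 1, 1) + timedelta(weeks=i), 100 + 4 * i) for i in range(13)]
estimated = forecast_limit_date(readings, limit=256)
```

A straight line is the simplest possible model; real load often grows in steps or seasonally, so treat the result as a rough planning aid.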

A query of this kind is problematic for traditional databases: They need to browse through tables and rows and compare them with a time you specify as a parameter. If an entry falls within the specified period, it is part of the result. A graph can only be drawn at the end. The process is time-consuming, because the format of the result is different from the format used internally by the database, and the transposition of the data required for this process is resource intensive.

Time series databases, on the other hand, store the data exactly as you would like to view it later, thus saving a great deal of overhead. Because many system parameters can be displayed with reference to measured data, monitoring is a kind of waste product of trending in MAT environments, at least if you add components to the time series database that can trigger alarms according to specific metrics.
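
To illustrate why time-ordered storage makes range queries cheap, consider the following toy sketch. It is not how InfluxDB works internally; it only shows the principle that, if points are kept in time order, a query for a period boils down to finding two positions and returning everything in between. The class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class Series:
    """Toy in-memory time series: timestamps kept sorted, values alongside."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, timestamp, value):
        # Assumption: points arrive in time order, as metrics typically do.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("out-of-order point")
        self._times.append(timestamp)
        self._values.append(value)

    def range(self, start, end):
        """Return all (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect_left(self._times, start)
        hi = bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))
```

The slice returned by `range()` is already in the order a graph needs, so no per-row comparison or transposition step is required.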

InfluxDB is the time series database in the TICK Stack (Figure 1). The tool differs noticeably from Prometheus in many areas. One big difference is that InfluxDB can use not only numbers as measured values, but also strings, which is especially practical if you want to log events. With Prometheus, you would need to take a roundabout route: If a message containing the word ERROR appears in the kernel log, Prometheus can show that log messages with ERROR exist; however, it cannot display the details of the problem, such as the error message itself. InfluxDB, on the other hand, can.

Figure 1: InfluxDB is a time series database for storing metrics and the heart of the TICK Stack.
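
InfluxDB accepts data points in a text format known as line protocol, in which string field values are written in double quotes and integers carry an `i` suffix. The following Python helper is a simplified sketch of how a log event with a string field could be formatted; it covers only the basic escaping rules, and the measurement, tag, and field names are made up for this example.

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render one field value: booleans, integers, floats, or strings."""
    if isinstance(value, bool):          # check bool first; it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    # Strings: escape backslashes first, then double quotes.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical kernel log event stored together with its full message text.
line = to_line(
    "kernel_log",
    {"host": "node1"},
    {"message": 'ERROR: disk "sda" failed', "count": 1},
    1700000000000000000,
)
```

The point here is that the error message itself travels with the data point, which a purely numeric store cannot offer.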

Fundamental Differences

InfluxDB differs from Prometheus in other ways, too. Starting with the basics, Prometheus collects metrics directly from hosts (the pull principle). InfluxDB, on the other hand, expects a separate process to deliver values. Both principles have their fans, and depending on the context, one or the other might be more suitable. InfluxDB cannot claim a fundamental advantage; nevertheless, you need to keep in mind how you collect data when planning your future setup.
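
The difference between the two principles is easiest to see side by side. The following sketch models both in plain Python without any networking; all class names are hypothetical and stand in for the server and agent roles only.

```python
class PullServer:
    """Pull model: the server knows its targets and asks each one for values."""

    def __init__(self, targets):
        self.targets = targets      # name -> callable returning a dict of metrics
        self.store = []

    def scrape(self, now):
        for name, read_metrics in self.targets.items():
            for metric, value in read_metrics().items():
                self.store.append((now, name, metric, value))


class PushServer:
    """Push model: the server only listens; agents decide when to send."""

    def __init__(self):
        self.store = []

    def write(self, now, name, metric, value):
        self.store.append((now, name, metric, value))


class Agent:
    """Runs on the monitored host and delivers its readings to a PushServer."""

    def __init__(self, name, read_metrics, server):
        self.name = name
        self.read_metrics = read_metrics
        self.server = server

    def report(self, now):
        for metric, value in self.read_metrics().items():
            self.server.write(now, self.name, metric, value)
```

In the pull model, the server has to learn about every new node before it can collect anything; in the push model, a new node shows up by itself as soon as its agent starts reporting. Which behavior you prefer depends on how dynamic your environment is.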

The two solutions also differ in the way you query data from the database (e.g., to interpret it graphically). Prometheus uses its own query language (PromQL) specially designed for use in MAT solutions, whereas InfluxDB uses an SQL-like language (InfluxQL) that will meet your needs if you are already familiar with SQL. Although PromQL is not very difficult to learn, it would be wrong to say you could use it without any preparation.

Backups and Clustering

InfluxDB is clearly ahead of Prometheus when it comes to data storage. The Prometheus approach is simple: Individual Prometheus instances record and store the data on local hard drives. If you have to monitor more than a single instance can handle, you just run several instances, each of which covers a part of the setup. Clustering can be achieved with some restrictions. The developers explicitly advise against mirroring the content of multiple Prometheus instances in a central database. If you want to create backups of Prometheus, you have to set up a second instance that collects the same measurement values as the first.

InfluxDB is far more agile. If you need clustering to interconnect instances of InfluxDB, you get exactly this feature for a price in the commercial version. InfluxDB follows an open source core model, wherein the basic functionality is available free of charge as open source software, and additional features, such as clustering, come at a price.

Data retention is also more flexible with InfluxDB: Instead of just one database, like Prometheus, InfluxDB can contain multiple databases, each using its own retention policy. InfluxDB has a genuine backup function that uses either binary or raw files – just like a classic database such as MariaDB.

InfluxDB also works better for downsampling (i.e., reducing the sampling points of a time series). Instead of defining rules at the beginning to stipulate which data is to be kept, InfluxDB continuously queries data with InfluxQL and stores the result at any location (e.g., in another database). Downsampling is therefore simple and flexible, specifically because the queries can be changed easily during operation.
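
InfluxDB performs this kind of downsampling with continuous queries written in InfluxQL. The Python function below is not such a query; it is merely a sketch of the underlying operation, which groups points into fixed time buckets and keeps one mean value per bucket. The function name and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Reduce (unix_timestamp, value) points to one mean value per time bucket."""
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket it falls into.
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical readings taken every 30 seconds, reduced to one value per minute.
raw = [(0, 1.0), (30, 3.0), (60, 10.0), (90, 20.0), (120, 5.0)]
per_minute = downsample_mean(raw, 60)
```

Because the bucket size is just a parameter, changing the resolution later means changing one value, which mirrors the flexibility described above.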

Alerts by Kapacitor

Although InfluxDB has a number of advantages over Prometheus, unlike Prometheus, it cannot trigger events from stored readings. In a MAT setup, this is a problem, because a database that recognizes a difficulty in the data but cannot sound an alarm is not helpful. Enter TICK Stack Kapacitor [5] (Figure 2), the InfluxDB watchdog that can generate alarms according to defined parameters and then pass them on to various alerting frameworks. If you are looking for the counterpart in Prometheus, the most likely candidate is Alert Manager.

Figure 2: Kapacitor triggers alarms according to various events and thus retrofits InfluxDB with alerting functionality.

Kapacitor can do much more: In addition to detecting anomalies in the metrics, Kapacitor also features automatic detection of the environment in which it runs, which saves configuration work. If Kapacitor notices by itself that it is running in an AWS environment, various alarms that are preconfigured for this scenario take effect. The same applies to the Microsoft Azure Cloud or Kubernetes environments.

Versatility Is Key

If you think Kapacitor is a pure alarm generator, you would be wrong. With a plugin interface, the tool can be extended almost arbitrarily and connected to various external systems. The functionality is then by no means limited to plain vanilla monitoring. Conceivably, you could transfer collected metrics to a connected system for machine learning, to draw conclusions from the available information. Kapacitor is clearly superior to Prometheus' Alert Manager, at least in this respect.

Finally, it is worth mentioning that Kapacitor has its own domain-specific language (DSL), TICKscript, a scripting language with which you can define alarm conditions. The content of a TICKscript should be a task description that contains the condition that leads to an action and describes the action Kapacitor should take. Although this step has a learning curve, the Kapacitor DSL is well documented.
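
The structure of such a task, a condition plus an action, can be mirrored in a few lines of Python. To be clear, the following is not TICKscript and does not use any Kapacitor API; it is a hypothetical sketch that only shows how the two parts of a task description relate to each other.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertTask:
    """A task couples a condition on a reading with an action to take."""

    name: str
    condition: Callable[[float], bool]
    action: Callable[[str], None]

    def evaluate(self, value: float) -> bool:
        """Run the action if the condition holds; report whether it fired."""
        if self.condition(value):
            self.action(f"{self.name}: value {value} violates condition")
            return True
        return False


# Hypothetical task: alert when disk usage exceeds 90 percent.
# The action simply appends to a list; a real handler might send email or chat.
sent = []
task = AlertTask("disk_usage", lambda v: v > 90, sent.append)
task.evaluate(50.0)
task.evaluate(95.0)
```

Keeping the condition and the action separate is what lets the same check feed different alerting frameworks.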

Where the Data Lives

The best monitoring system is useless if it can't access the data in some way. Because the TICK Stack seeks to be a complete offering, its developers have also put much thought into this subject. The answer is Telegraf [6]: In a MAT system based on InfluxDB, this service collects the metrics on the servers and passes them to InfluxDB for storage.

Telegraf (Figure 3) is similar to Prometheus Node Exporter in the same way that Kapacitor is like its Prometheus counterpart Alert Manager. Telegraf is far more powerful than the Node Exporter, which becomes clear when comparing the scope of delivery: More than 100 plugins for Telegraf extend the program to include a variety of functions. When you roll out Telegraf on the systems from which you want to collect metrics, you use a configuration file to define which plugins will be active and which data will find its way into InfluxDB.

Figure 3: Telegraf reads basic system values (in a Ceph cluster, e.g., the CPU load, disk I/O, and network throughput).
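
Telegraf itself is configured with its own configuration file and plugins; the Python sketch below does not reproduce either. It only illustrates the principle that a configuration decides which plugins are active and therefore which values end up in the database. The plugin names, option keys, and the `gather()` function are invented for this example.

```python
import shutil


def disk_plugin(options):
    """Report how full a filesystem is, as a percentage."""
    usage = shutil.disk_usage(options.get("path", "/"))
    percent = 100 * usage.used / usage.total if usage.total else 0.0
    return {"disk_used_percent": percent}


def static_plugin(options):
    """Return fixed values; handy for testing the pipeline."""
    return dict(options.get("values", {}))


# Registry of available plugins (hypothetical names).
PLUGINS = {"disk": disk_plugin, "static": static_plugin}


def gather(config):
    """Run only the plugins named in `config` and merge their metrics."""
    metrics = {}
    for plugin_name, options in config.items():
        if plugin_name not in PLUGINS:
            raise KeyError(f"unknown plugin: {plugin_name}")
        metrics.update(PLUGINS[plugin_name](options))
    return metrics


# Only plugins listed here are active; everything else stays silent.
config = {"disk": {"path": "/"}, "static": {"values": {"answer": 42}}}
metrics = gather(config)
```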

When selecting the desired plugins, you can choose almost anything your heart desires: PowerDNS or PostgreSQL can be evaluated just as easily as Apache or iptables. Telegraf also supports various other data stores out of the box, such as Kafka, MongoDB, and CouchDB. Several interesting parameters for classic Linux systems are also provided (e.g., disk fill states, current RAM usage, and system load).

Additionally, various functions let Telegraf monitor services on other systems. Network hardware typically speaks SNMP and outputs data over this interface, but Telegraf cannot be installed on most switches, because it would require Cumulus, which still has a niche status despite all its advantages. However, if you roll out a Telegraf instance that collects metrics from the devices in question over SNMP, the data will find its way into InfluxDB.

Telegraf also plays a central role in generating alarms, which shows once again that the TICK Stack and Prometheus are clearly different in the details. Alert Manager connects directly to the Prometheus server and monitors incoming measurement data, whereas in the TICK Stack, Telegraf supplies Kapacitor with data and, in parallel, stores the data in the database. Therefore, if you want to use the TICK Stack, you at least need a minimal Telegraf-InfluxDB-Kapacitor (TIK) setup.

Colorful Chronograf

The "C" in TICK Stack also plays a role: The Chronograf tool [7] conjures up colorful graphs from data in InfluxDB for interpreting the measured data. This TICK Stack component is the only one that might leave you with mixed feelings. InfluxDB, Telegraf, and Kapacitor prove to be unrivaled winners in their respective fields, but Chronograf might lead you to ask, "Why?"

Clearly Chronograf has exactly the same target group as Grafana, which is all about data visualization. When Influx started developing Chronograf, Grafana already existed and was already able to evaluate data stored in InfluxDB. Whereas the developers of Prometheus gave up their own PromDash user interface in favor of Grafana (Figure 4), Influx went the opposite way and developed its own user interface.

Figure 4: Grafana is a popular tool for visualizing metrics.

Their success was limited: Chronograf (Figure 5) is naturally better designed than Grafana for InfluxDB, but it still does not offer all the functions of its competitor and cannot claim to have more useful graphs than Grafana. Many admins do without Chronograf and use the TIK setup. All told, the feeling Chronograf delivers is a very clear case of "Not invented here."

Figure 5: TICK Stack Chronograf is a competitor of Grafana and performs worse in a direct comparison.

Peaceful Coexistence?

Many administrators see the choice between Prometheus and the TICK Stack as a black and white decision: Either you commit yourself to Prometheus and rely on the components from its stack or you take InfluxDB and put its helpers on your systems. As usual, there are many shades between black and white, and this is no different when choosing a MAT solution: Instead of an either-or decision, a "why not both" alternative is also conceivable. You can combine the best of both worlds and use the available functions optimally.

Instead of using the not always optimally designed Prometheus Node Exporter, you can use Telegraf to collect metrics on the hosts. This solution offers even more possibilities, because Telegraf comes with various features that are missing in the Prometheus Node Exporter.

Parts of both solutions can be combined even better when it comes to storing data. Whereas Prometheus is not very good at storing data permanently in large quantities to implement long-term trending, InfluxDB is far better at this task and can even store distributed data in a storage cluster on the network, while hardly slowing down – even with large amounts of data.

Prometheus has advantages in short-term trending, and it is also a step ahead when the task is metrics-based monitoring. What could be more obvious than combining the two solutions? The InfluxDB developers even explicitly provide for such a possibility.

InfluxDB developer Paul Dix, who is responsible for the adapter that translates between InfluxDB and Prometheus, describes this kind of setup in a blog post [8]. In a presentation at Percona Live Europe 2017, database manufacturer Percona also discussed the combination of solutions in detail [9]. Finally, the InfluxDB documentation contains a description of this process.

The Best of Both Worlds

Metric data can be collected by the Prometheus Node Exporter, Telegraf, or any other exporter that meets the requirements of Prometheus. All data is sent first to Prometheus – Chronograf and Kapacitor do not have a role in this setup – and the data in Prometheus is downsampled to suit your needs: Instead of saving all of the data, it might be fine to save exemplary data records for longer periods of time, because they are equally meaningful. For example, a downsampling approach could be to keep all data from the past four weeks, as well as one data set for each day of the previous two months and one data set per week of the last year.
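
The tiered scheme from this example can be sketched as a small Python function. This is only one possible reading of the rule: I assume the daily tier covers the 60 days before the four-week window, the weekly tier reaches back 365 days in total, and the first point of each day or week is the one that is kept. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def thin_out(points, now):
    """Apply tiered retention to (datetime, value) points.

    - younger than 4 weeks: keep every point
    - the 60 days before that: keep the first point of each day
    - up to 365 days old: keep the first point of each ISO week
    - older: drop
    """
    full_cutoff = now - timedelta(weeks=4)
    daily_cutoff = full_cutoff - timedelta(days=60)
    weekly_cutoff = now - timedelta(days=365)
    kept, seen = [], set()
    for ts, value in sorted(points, key=lambda p: p[0]):
        if ts >= full_cutoff:
            kept.append((ts, value))
            continue
        if ts >= daily_cutoff:
            key = ("day", ts.date())
        elif ts >= weekly_cutoff:
            iso = ts.isocalendar()
            key = ("week", iso[0], iso[1])   # ISO year and week number
        else:
            continue
        if key not in seen:
            seen.add(key)
            kept.append((ts, value))
    return kept
```

Whether the first point, the last point, or a mean represents a day or week best depends on the metric; the sketch picks the first one only to keep the logic short.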

In this scenario, you slot an adapter between Prometheus and InfluxDB that communicates with the Prometheus server on one side and with InfluxDB on the other. According to the downsampling rules, the adapter reads the required data from Prometheus and saves it in InfluxDB. From Prometheus' point of view, all this is completely transparent and does not increase load to a noticeable extent.
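
At its core, such an adapter has to translate between the two data models. The following sketch shows one plausible mapping: the Prometheus metric name (carried in the `__name__` label) becomes the measurement, the remaining labels become tags, and the sample becomes a field named `value`. This mapping and the function name are assumptions for illustration; check the InfluxDB documentation for the conventions your version actually applies.

```python
def prometheus_sample_to_point(labels, value, timestamp_ms):
    """Map one Prometheus sample to an InfluxDB-style point (as a dict).

    Assumed mapping, for illustration only:
      __name__ label   -> measurement
      other labels     -> tags
      sample value     -> field "value"
      millisecond time -> nanosecond time
    """
    tags = dict(labels)                 # copy so the caller's dict is untouched
    measurement = tags.pop("__name__")
    return {
        "measurement": measurement,
        "tags": tags,
        "fields": {"value": float(value)},
        "time_ns": timestamp_ms * 1_000_000,
    }


# Hypothetical sample as a Node Exporter might expose it.
point = prometheus_sample_to_point(
    {"__name__": "node_load1", "instance": "node1:9100"},
    0.42,
    1700000000000,
)
```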

Such a construct does not mean you sacrifice the advantage of a single point of administration. Although InfluxDB is available in a setup of this type as a data source parallel to Prometheus, if Grafana is used as the tool to visualize the data, matching dashboards can also be integrated into Grafana for InfluxDB. The end result is the same interface, but one that accesses another source in the background.

As a basic rule, if you combine Prometheus and InfluxDB, you get the best of both worlds.

Conclusions

Neither the TICK Stack nor Prometheus is a panacea when it comes to MAT. When comparing the two products, you need to bear in mind that their objectives are not the same: The TICK Stack sees itself as a solution for processing measurement data that can also act as a platform for monitoring, alerting, and trending.

Therefore, InfluxDB has functions that are missing in Prometheus. On the other hand, Prometheus sees itself more as a pure MAT solution and gets by with less overhead. Your own requirements should determine whether you choose the TICK Stack or Prometheus for your situation.

If you want to run MAT on numerical measurements, you'll be better off with Prometheus and probably reach your goal faster. However, if you need more flexibility and want to log events as text and collect metric data, you should have a look at the TICK Stack.

Always remember that the two solutions can be combined. If you ultimately use the TICK Stack as long-term storage for your Prometheus data, you can run Telegraf, Kapacitor, and Chronograf in parallel with Prometheus and let them handle the work.