Whenever we discuss monitoring and alerting systems in DevOps, we often come across the TICK stack! What is the TICK stack? What is so special about it? How is it different from the ELK stack, Prometheus, Grafana, CloudWatch, and New Relic? I will try to answer all of these questions briefly, but my main motivation for writing this blog is the alert-flooding issue I faced while testing my TICK stack.
Note: This blog is not about the detailed workings of TICK or its setup.
What is TICK? What is special about it?
TICK is a complete collection of services provided by the InfluxData community to capture, store, stream, process, and visualize data, giving us a highly available and robust solution for monitoring and alerting. TICK is an abbreviation for:
- Telegraf – A very lightweight server agent for scraping metrics from the system it runs on; it can also pull metrics from various third-party sources such as Kafka and StatsD.
- InfluxDB – Known as the heart of the TICK stack, it is genuinely one of the most efficient, high-performance data stores for handling high volumes of time-series data. It is open source and uses a SQL-like query language (InfluxQL).
- Chronograf – The stack’s visualization engine and administrative user interface. It makes it quick and simple to set up and maintain monitoring and alerting for our infrastructure. Chronograf is easy to use and includes templates and libraries that allow us to rapidly build dashboards with real-time visualizations of the data and to easily create automation and alerting rules.
- Kapacitor – A data processing engine capable of processing both stream and batch data from InfluxDB. It lets us plug in our own custom logic or user-defined functions to process alerts with dynamic thresholds, compute statistical anomalies, match metrics against patterns, and perform specific actions based on these alerts, such as dynamic load rebalancing. Kapacitor integrates with OpsGenie, HipChat, Alerta, Sensu, Slack, PagerDuty, and many more applications. It is scripted with TICKscript, which opens up a great many customization options.
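To make Kapacitor’s role concrete, here is a minimal TICKscript sketch; the measurement name, threshold, and Slack channel are illustrative assumptions, not anything from our setup. It streams CPU metrics collected by Telegraf and raises a critical alert to Slack:

```tickscript
// cpu_alert.tick -- minimal illustrative sketch
stream
    // subscribe to the 'cpu' measurement that Telegraf writes to InfluxDB
    |from()
        .measurement('cpu')
        .groupBy('host')
    |alert()
        // custom logic is expressed as a lambda expression
        .crit(lambda: "usage_idle" < 10)
        .message('CPU critical on {{ index .Tags "host" }}')
        // one of Kapacitor's many integrations
        .slack()
        .channel('#alerts')
```

A script like this is registered and activated with the `kapacitor define` and `kapacitor enable` commands.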
How is it different from other monitoring and alerting tools?
Let me tell you a secret: every tool has its own significance, strengths, and weaknesses. It all comes down to the project’s use case and situation, and which tool or stack suits it best. I have briefly described the common use-case differences between:
- ELK Stack and TICK Stack
- Prometheus and TICK Stack
TICK vs ELK
Elasticsearch and InfluxDB are both robust, highly available engines for storing time-series data; the choice to use one over the other comes down to the application’s time sensitivity and data types. Each has its own set of pros and cons, and you can even leverage both solutions in a single project.
InfluxDB is best for time-critical applications that need real-time querying, and it can handle a greater write throughput than Elasticsearch. Elasticsearch, however, is more suitable for textual data such as log messages, requests, and responses. Because it is built for full-text search, Elasticsearch remains the superior option for querying by content.
Retaining metrics in InfluxDB while using Elasticsearch for metadata can be an effective way to combine the two: Elasticsearch can quickly locate text-based events by their timestamps, and InfluxDB can run calculations on the data as it comes in.
TICK vs Prometheus
Prometheus is a monitoring system for infrastructure that was designed to be fast, memory-efficient, and easy to use.
If you want to do more than mere monitoring, InfluxDB is a fantastic solution for storing time-series data, such as data from sensor networks or data used in real-time analytics. On the other hand, Prometheus was designed for monitoring, specifically distributed, cloud-native monitoring. It shines in this category, with several beneficial integrations with existing products.
Finally, coming to the main motivation for this blog: the alert-flooding issue we faced while testing the alerting mechanism on our TICK stack setup.
The issue the team faced was a flood of alerts in our dedicated Slack channel for the same underlying problem. Let me put it simply: an alert fires when some field value crosses a threshold, right? The problem was that when the field value crossed the threshold and stayed above it for, say, a minute, continuous alerts kept flooding in every second or so for that entire minute. Either the field value had to drop back down on its own, or we had to disable the alert to stop the flood in our Slack channel.
This was the configuration of our TICKscript that was set up for alerting –
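The original script itself is not reproduced here, but conceptually it was a plain threshold alert along these lines (the measurement, field name, threshold, and channel below are placeholders, not our real values):

```tickscript
// alerting script -- conceptual sketch of the flooding setup
stream
    |from()
        .measurement('app_metrics')
    |alert()
        .crit(lambda: "value" > 80)
        .message('value crossed threshold: {{ index .Fields "value" }}')
        .slack()
        .channel('#team-alerts')
        // note: nothing here suppresses repeated notifications, so every
        // point above the threshold re-sends the alert to Slack
```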
So, following the conventional method, we started searching for a solution and found some suggestions on the InfluxData forum and Stack Overflow saying to turn on the .stateChangesOnly property.
We updated the configuration by enabling the property, but there was still no improvement. Some forum posts suggested setting .stateChangesOnly to a specific value, such as 5 minutes. The .stateChangesOnly property means the same alert is never re-triggered until its state (OK, Info, Warning, Critical) changes relative to the threshold. For example, with .stateChangesOnly(5m), the same alert will not fire again within 5 minutes if the state doesn’t change, but it will fire again at 5-minute intervals while the state persists; this is useful so that we don’t miss an ongoing alert.
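As a sketch, enabling the property in the alert node looks like this (field name, threshold, and channel are again placeholders):

```tickscript
    |alert()
        .crit(lambda: "value" > 80)
        // only notify when the alert level actually changes ...
        .stateChangesOnly(5m)
        // ... or at most once every 5 minutes while the same state
        // persists, so a long-running alert is not silently forgotten
        .slack()
        .channel('#team-alerts')
```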
So how was it eventually resolved?
We started analyzing our data over longer intervals to see how it changed. Since we were only testing, we had not analyzed the data frequency or its minimum, average, and maximum values. When we did, we found that our data was fluctuating within milliseconds, moving above and below the threshold very quickly and generating a great many alerts. Because it was just random test data, we had initially set the threshold by looking only at the graph of the past 15 minutes. We also had not enabled positive (OK) alerts, so we could not tell whether the alert status was changing or not. After analyzing the data for the last month, we set the threshold correctly and also enabled positive alert notifications so we could identify and analyze the alerts correctly.
So what actually happened was this: the data was arriving so fast and so randomly that a negative alert would fire within milliseconds, recover a few milliseconds later, then go negative again, then OK again. Since we had not enabled positive alert notifications, we could not see this, and we thought the same alert was being triggered repeatedly.
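Sketched in TICKscript, the eventual fix was a properly chosen threshold plus keeping recovery (OK) notifications enabled; the recalibrated threshold of 95 below is a made-up placeholder:

```tickscript
    |alert()
        // threshold recalibrated after studying the min/avg/max of a
        // month of data, instead of just the last 15 minutes
        .crit(lambda: "value" > 95)
        .stateChangesOnly()
        // deliberately NOT calling .noRecoveries() -- we want the OK
        // ("positive") notifications so state flapping is visible in Slack
        .slack()
        .channel('#team-alerts')
```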
This may not seem like a big issue to some; after all, we had simply set the wrong threshold. But after hours of debugging and googling, it taught us some major lessons.
Below are my learnings from this issue:
- Analyze the data properly: you should have a good understanding of the data values before setting up alert rules.
- Positive alerts are as crucial as negative alerts; try never to disable them. They provide very good insight and solve many problems. We would have understood this issue firsthand if positive alerts had been enabled.
- Don’t dive straight into Google and Stack Overflow for issues and configuration; they can sometimes take you down a completely different road, and you end up wasting your time and energy. Try to debug on your own with the basics first.
Opstree is an End to End DevOps solution provider.