Have you noticed an unexpected unallocation of Shards happening at a duration of 1 hour resulting in Cluster state switching from Green > Yellow > Red > Yellow > Green?. During this transition, ES becomes unreachable and the API calls start responding with non 200 code.
3 Master Node with 3 Worker Node.
Analysis of Error
Garbage Collector Sawtooth Pattern
The reason for this sawtooth pattern is that the JVM continuously needs to allocate memory on the heap as new objects are created very frequently as a part of the elasticsearch own program execution like for search queries, write queries, flush, refresh operation and more. Most of these objects are however short-lived objects and quickly become available for collection by the garbage collector in the young region of the heap. When the garbage collector finishes you’ll see a drop on the memory usage graph. Reference
Note: Most of the Elasticsearch objects are short-lived and collected by Garbage Collector at the Young region.
High allocation Rates Of Objects Cause Performance Issues
The GC logs provide a way to capture how frequently your app is allocating objects. While high allocation rates aren’t necessarily a problem, they can lead to performance issues. To see whether this is affecting your app you can compare the size of the young generation after a collection and before the next one.
For example, the following three GC log entries show the app is allocating objects at about 12.48GB/sec.
[31.087s][info ][gc ] GC(153) Pause Young (Normal) (G1 Evacuation Pause) 3105M->794M(3946M) 55.590ms [31.268s][info ][gc ] GC(154) Pause Young (Normal) (G1 Evacuation Pause) 3108M->798M(3946M) 55.425ms [31.453s][info ][gc ] GC(155) Pause Young (Normal) (G1 Evacuation Pause) 3113M->802M(3946M) 55.571ms
Between 31.087s and 31.268s 2314M was allocated by the app (3108M-794M) and between 31.268s and 31.453s another 2315M was allocated (3113M-798M). This works out about 2.3GB every 200ms or 12.48GB/sec. Depending on your app’s behavior, allocating objects at that rate could negatively affect its performance. Reference
High allocation Of Objects Cause High Frequency Of Garage Collection
We observed Garbage Collector is executed in an interval of every 1 minute, because of the high rate of object allocation due to Elasticsearch own programs like search queries on shards, and a lot of shards on 3 worker node result in a lot of object with very high frequency. Also when Garage Collector executes, it causes the “Stop the World State” means the main elastic search worker’s main thread stop. When the main thread of elastic search is unresponsive for a long duration, the elasticsearch master assumes that worker node has left cluster and it reallocates the shards among other nodes.
Below is an example error obtained from elasticsearch logs
Previously we set “2 primary with 6 replicas” for each index which was causing a lot of shards and more shards mean more parallel read operations across each shard resulting in more objects being created frequently. And Elasticsearch suggests having 600 shards per node.
So we decided on this change :
– 2 primary with 1 replica
– Increase worker node from 3 to 5.
Reason to increase worker node
First, with the increase in the number of worker nodes, we are able to maintain minimum shard count on each node.
Second, 3 parallel Garbage Collectors previously (on each worker node) had to handle a lot of garbage collection but with 5 Garbage Collectors, the task of garbage collection is divided.
Third, with shards divided among 5 nodes, objects created by search queries are also divided among 5 nodes. So on each node frequent object count decreases, which, in turn, decreases the frequency of Garbage Collector execution.
Opstree is an End to End DevOps solution provider