Docker Inside Out – A Journey to the Running Container

Introduction: 

Necessity is the mother of invention, and the same holds true for Docker. With the pressure to split monolithic applications for the sake of simplicity, we arrived at Docker, and it made our lives much easier. We all access Docker through the docker CLI, but I often wonder what it does behind the scenes to run a container. Let’s dig deeper into that in this blog.

There’s a saying that “behind every successful man, there is a woman”. I would love to give my perspective on this. One thing I have observed in the lives of successful people I know is that there is a lot of truth in this statement, though it varies with the situation: in most cases these women are not directly helping with the man’s primary work, but taking care of other important things around it so that he can concentrate on that work. Keeping this in mind, I expect that there are other components behind the docker CLI as well that lead to the successful creation of containers. Whenever I talk about Docker containers with developers in my organization who are new to Docker, the only thing I hear from them is “the docker CLI is used to invoke the Docker daemon to run a container”.

But the Docker daemon is not the process that actually runs a container. It delegates the action to containerd, which in turn controls a set of runtimes (runc by default). The runtime is then responsible for creating a new process (calling the runtime specified in the configuration parameters) with some isolation, and only then executing the entrypoint of that container.

Components involved

  • Docker-cli
  • Dockerd
  • Containerd
  • RunC
  • Containerd-shim

Docker-cli: Used to make Docker API calls.

Dockerd: dockerd listens for Docker Engine API requests, which it can accept over three different types of socket: unix, tcp, and fd. It manages the host’s container life-cycles with the help of containerd; the actual container life-cycle management is therefore outsourced to containerd.
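
As a quick illustration (the socket addresses below are only examples), dockerd can be started with one or more -H/--host flags, and the CLI can be pointed at any of those sockets; exposing the tcp socket without TLS is unsafe and shown here purely for demonstration:

# Listen on the default unix socket plus a local tcp socket
dockerd -H unix:///var/run/docker.sock -H tcp://127.0.0.1:2375

# Point the CLI at the tcp socket instead of the default unix socket
docker -H tcp://127.0.0.1:2375 ps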

Containerd: Actually manages container life-cycle through the below mentioned tasks:

  • Image push and pull
  • Management of storage
  • Executing containers, by calling runc with the right parameters.

Let’s go through some subsystems of containerd:

Runc: containerd uses runc to run containers according to the OCI specification.
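
To get a feel for what containerd ultimately asks runc to do, you can drive runc by hand with an OCI bundle. A minimal sketch (the bundle name and the busybox image are just examples; runc typically needs root):

mkdir -p mycontainer/rootfs && cd mycontainer
# Populate a root filesystem from an existing image
docker export $(docker create busybox) | tar -C rootfs -xf -
# Generate a default OCI config.json for this bundle
runc spec
# Create and start the container straight from the OCI bundle
sudo runc run mycontainer-demo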

Containerd shim: This component was added in Docker 1.11. It is the parent process of every container started, and it is what enables daemon-less containers. The shim lets the runtime, i.e. runc, exit once it has started the container, while the shim itself stays behind as the container’s parent. This way we don’t need long-running runtime processes for containers: when you start nginx you should only see the nginx process and the shim.

Daemon-less: When I mention daemon-less containers in the paragraph above, I mean this brings a real advantage. Before the containerd shim existed, upgrading the Docker daemon without restarting all your containers was a big pain; the containerd shim was introduced to solve exactly this problem.

The communication between Dockerd and ContainerD

We can see how Docker delegates all the work of setting up the container to containerd. As for the interactions between docker, containerd, and runc, we can understand them without even looking at the source code – plain strace and pstree can do the job.

Command:

When no containers are running:

ps fxa | grep docker -A 3 

Result:

Working of all components together

Well, to see how all these components work together, we need to initialize a container – Nginx in our case. We will fire the same command again after running an Nginx container.

This shows us that we have two daemons running – the docker daemon and the docker-containerd daemon.

Given that dockerd interacts heavily with containerd all the time and the latter is never exposed to the internet, it is a safe bet that its interface is unix-socket based.
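
A quick way to check that bet (socket paths may differ between Docker versions and distributions, so treat these as illustrative):

# containerd's API socket and dockerd's own API socket
ls -l /run/containerd/containerd.sock /var/run/docker.sock

# strace on dockerd should show connect() calls towards the containerd socket
sudo strace -f -e trace=connect -p $(pidof dockerd) 2>&1 | grep containerd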

High-Level overview of initializing container

Initializing container to see involvement of all components

Command:

docker run --name docker-nginx -p 80:80 -d nginx
docker ps
pstree  -pg | grep -e docker -e containerd 
ps fxa | grep -i "docker" -A 3 | grep -v "java"

Summary

By now, it should be clear that dockerd is not the only component involved in running a container. We got to know which components back a running container besides dockerd, and how they work together to manage the life-cycle of a container.

I hope we now have a good understanding of the Docker components involved. It’s time to see things practically on your own with the commands discussed in this blog, rather than just mugging up theoretical concepts.

That’s all till next time. Thanks for reading – I’d really appreciate your feedback, so please leave a comment below if you have any feedback or queries.

Happy Containerization !!


Recap Amrita InCTF 2019 | Part 1

Amrita InCTF 10th Edition is an offline CTF (Capture the Flag) event hosted by Amrita University at their Amritapuri campus, 10 km from Kayamkulam in Kerala, India. In this year’s edition, two people from Opstree got invited to the final round after roughly two months of solving challenges online. The final rounds were held on the 28th, 29th and 30th of December 2019. The first two days comprised talks by various people from the industry, and the third day was kept for the final competition. In this three-blog series starting now, we’d like to share the knowledge, experiences and learnings from this three-day event.

Talk from Cisco

The hall was filled with a little more than 300 people, many of them college students ranging from sophomore year up to final and pre-final year. To our surprise, there were also roughly 50+ school students sitting ready to compete in the final event. The initial talk by Cisco was refreshing and very insightful for everyone in the room. It focused on how technology is changing lives all around the world, be it machine learning helping doctors treat patients faster, drones being used to put out fires, or IoT-enabled systems providing efficient irrigation in remote areas. The speakers also made the point that learning across a broader segment of technologies and tools serves you longer than in-depth knowledge of a single technology.
One thing that really stuck with me: never learn a technology just for the sake of it or for the hype around it, but with a thought on how it can solve a problem around us.

Talk Title: Cyberoam startup and experiences – Hemal Patel

Hemal Patel talked about his startups and how he has always learned through failures. The talk was full of experiences, and it is always serene to listen to someone telling how they failed over and over again, which eventually led them to succeed at whatever they are doing today. He talked about Cyberoam, now a Sophos company, which secures organizations with its wide range of product offerings at the network gateway. The talk went on to give us an overview of how business is done with different governments around the world, how entrepreneurship is so much more than just tackling a problem at a business level, and how Cyberoam ended up building the product it has today.

Talk on Security by Cisco – Radhika Singh and Prapanch Ramamoorthy 

This was a wide-ranging talk about a lot of things affecting us. We’ll try to list most of it here.

The talk started out by exploring free/open WiFi. Though free WiFi has obvious benefits, it comes with a lot of risks as well. To name a few:

→ Sniffing

→ Snooping 

→ Spoofing

These are just a few of the ways you can be compromised over free WiFi.

You can read up more on it here.

The talk also presented us with facts about data: only 1% of the total data is generated by laptops and computers; the rest is generated by smartphones, smart TVs and other IoT devices. Hence the very important point of securing IoT devices.

It was pointed out during the talk that the majority of companies worry about security at the far end of the IoT chain, i.e. over the cloud, but not many care about the edge devices and how the lack of security measures there can compromise them.

There was a really interesting case study about how IoT devices brought down the internet for the entire US east coast, and how, in its early days, this attack was just meant to buy some more time to submit an assignment. Read more on this story from 2016 here.

Hackers prefer to exploit IOT devices over cloud infrastructure.

Memes apart, the talk also focused on privacy vs security and how Google’s DNS resolution encryption helps in securing DNS-based internet traffic on the world wide web.

National Critical Information Infrastructure Protection Centre(NCIIPC)

National Critical Information Infrastructure Protection Centre (NCIIPC) is an organisation of the Government of India created under Section 70A of the Information Technology Act, 2000 (amended 2008), through a gazette notification on 16th January 2014. Based in New Delhi, it is designated as the National Nodal Agency in respect of Critical Information Infrastructure Protection.

Representatives from this organization were there to speak at the event. They talked in detail about what a CII (Critical Information Infrastructure) is and how any company with such infrastructure needs to inform the government about it.

A CII is basically any information infrastructure (of a financial, medical or similar institute) which, if compromised, can affect the national security of the country. Attacking any such infrastructure is an act of terrorism as defined by Section 66F of the IT Act.

They talked about some of the threats they deal with at the national level, in particular how the BGP routing protocol, which works on trust, was recently compromised to route Indian traffic via Pakistani servers/routers.


One more interesting talk was about the composition of the Internet.

We tend to think that the internet we see comprises 90% of the total internet, but in reality it’s just about 4% – bummer, right? The deep web comprises around 90% of the total internet, and as a matter of fact no one completely knows the darknet and its volume, so even the numbers mentioned above are as good as a guess.

 This was a very insightful talk and put a lot of things in perspective.

Digital Forensics – Beyond the Good Ol’ Data Recovery by Ajith Ravindran 


This talk by Ajith Ravindran focused mainly on computer forensics: the application of investigation and analysis techniques to gather and preserve evidence from a particular computing device in a way that is suitable for presentation in a court of law.

The majority of tips and tricks shared were about retrieving data from Windows-based machines even after it has been deleted from the system, and how such data can be presented as proof of crimes.

Some of the tricks discussed are mentioned below:

The prefetch files in Windows give us the list of files and executables last accessed and the number of times they were executed.

Userassist allows investigators to see what programs were recently executed on a system.

Shellbags list files that have been accessed by a user at least once.

The Master File Table enables us to get a list of all the files in the system, including those that entered the system via the network or USB drives.

$UsnJrnl gives us information regarding all user activities in the past 1-2 days.

Hiberfil.sys is a file the system creates when the computer goes into hibernation mode. Hibernate mode uses the Hiberfil.sys file to store the current state (memory) of the PC on the hard drive and the file is used when Windows is turned back on.

That was all from the day 1 talks. Come back next Tuesday for the talks from day 2, and in the final segment of this series we’ll write about the attack/defense and jeopardy CTF experience.

Stay Tuned, Happy Blogging!

Log Everything as JSON

Logging and monitoring are like Tony Stark and his Iron Man suit: the two go together. They work best together because they complement each other well.

For many years, logs have been an essential part of troubleshooting application and infrastructure performance. But over time we have realized that logs are not only meant for troubleshooting; they can also be used for business dashboards, visualization and performance analysis.

So logging application data to a file is great, but we need more.

Why is JSON logging the best approach?

To understand the greatness of JSON logging, let’s look at this conversation between Anuj (a System Engineer) and Kartik (a Business Analyst).


A few days later, Kartik complains that the web interface is broken. Anuj scratches his head, takes a look at the logs, and realizes that a developer has added an extra field to the log lines, which broke his custom parser.

I am sure anyone can face a similar kind of situation.

In this case, if the developer had designed the application to write logs as JSON, it would have been a piece of cake for Anuj to create a parser, because he could simply look up fields by their JSON keys and it wouldn’t matter how many new fields get added to the log line.

The biggest benefit of logging in JSON is that it has a structured format. This makes it possible to analyze application logs just like Big Data: the log is not just readable, but a database that can be queried for each and every field. Also, every programming language can parse it.

Magic with JSON logging

Recently, we created a sample Golang application to get hands-on experience with the code build, code test and deployment phases of Golang applications. While writing this application, we incorporated the functionality to write logs in JSON.
The sample logs look something like this:-
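
Since the screenshot of the sample logs is not reproduced here, below is a minimal sketch of how such JSON logs can be produced with logrus; the field names (employee_name, employee_city, etc.) are illustrative and not necessarily the exact ones used in the sample application.

package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Emit every log line as a JSON object instead of plain text.
	log.SetFormatter(&log.JSONFormatter{})

	// Structured fields become JSON keys that any parser can query directly.
	log.WithFields(log.Fields{
		"employee_name": "Anuj",
		"employee_city": "Noida",
		"endpoint":      "/api/v1/employee",
		"status":        200,
	}).Info("employee record served")
}

This prints a line similar to:

{"employee_city":"Noida","employee_name":"Anuj","endpoint":"/api/v1/employee","level":"info","msg":"employee record served","status":200,"time":"2020-01-05T10:00:00+05:30"}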

And while integrating ELK for log analysis, the only parsing configuration we have to add in Logstash is:-

 
filter {
    json {
        source => "message"
    }
}

After this, we don’t require any further parsing, and we can add as many fields as we like to the log file.

As you can see, I have all the fields available in Kibana, like employee name and employee city, and for this we did not have to add any complex parsing in Logstash or in any other tool. I can also create a beautiful business dashboard with this data.

Application Repository Link:-
https://github.com/opstree/Opstree-Go-WebApp

Conclusion

It will not take too long to migrate from text logging to JSON logging, as log drivers are available for multiple programming languages. I am sure JSON logging will add flexibility to your current logging system.
If your organization is using a log management platform like Splunk, ELK, etc., JSON logging makes a good companion to it.

Some of the popular logging drivers which support JSON output are:-

Golang:- https://github.com/sirupsen/logrus
Python:- https://github.com/thangbn/json-logging-python
Java:- https://howtodoinjava.com/log4j2/log4j-2-json-configuration-example/
PHP:- https://github.com/nekonomokochan/php-json-logger

I hope we now have a good understanding of JSON logging. Now it’s time to choose your logging format wisely.


That’s all I have. Thanks for reading – I’d really appreciate any and all feedback, so please leave a comment below if you have any feedback or queries.

Cheers till next time!!


How to test Ansible playbooks/roles using Molecule with Docker

Why Molecule?

Have you ever faced an issue where your Ansible code executes successfully but something still goes wrong – the service is not started, the configuration is not changed, etc.?

There is another issue you might have faced: your code runs successfully on Red Hat 6 but not on Red Hat 7. To make your code smart enough to run on every Linux flavour, Molecule came into the picture. Let’s do some brainstorming on Molecule.

Molecule has the capability to execute a YAML linter and the custom test cases you have written for your Ansible code. We will explain the linter and test cases below.

Why is code testing required?

Sometimes a playbook executes fine but does not give us the desired result; to catch this, we should use code testing in Ansible.

In general, code testing helps developers find bugs in code/applications and make sure the same bugs don’t cause the application to break. It also helps us deliver applications/software to coding standards, and it increases code stability.

Introduction :

The whole idea is that with Molecule (a testing tool) you can test whether your Ansible code functions correctly, with all its functionalities, on every Linux flavour.

Molecule includes a code linter that analyses your source code for potential errors. It can detect problems such as syntax errors and structural issues like the use of undefined variables.

Molecule can also create a VM/container environment automatically and, on top of it, execute your Ansible code to verify all its functionalities.

Molecule can also check syntax, idempotency, code quality, etc.

Molecule only supports Ansible 2.2 or later.

NOTE: To run an Ansible role with Molecule on different OS flavours we can use the cloud, Vagrant, or containerization (Docker).
Here we will use Docker.

Let’s start.

How Molecule works:

“When we set up Molecule, a directory named “molecule” is created inside the Ansible role directory, and Molecule reads its main configuration file, “molecule.yml”, from that directory. Molecule then creates the platforms (containers/instances/servers) on your local machine; once that is complete, it executes the Ansible playbook/role inside the newly created platforms and, after successful execution, runs the test cases. Finally, Molecule destroys all the newly created platforms.”

Installation of Molecule:

Installing Molecule is quite simple.

$ sudo apt-get update
$ sudo apt-get install -y python-pip libssl-dev
$ pip install molecule [ Install Molecule ]
$ pip install --upgrade --user setuptools [ do not run in case of VM ]

That’s it.

Now it’s time to set up an Ansible role with Molecule. We have two options to integrate Ansible with Molecule:

  1. With a new Ansible role
  2. With an existing Ansible role

1. Set up a new Ansible role with Molecule:

$ molecule init role --role-name ansible-role-nginx --driver-name docker

When we run the above command, a molecule directory is created inside the Ansible role directory.

2. Set up an existing Ansible role with Molecule:

Go into the Ansible role directory and run the below command.

$ molecule init scenario --driver-name docker

When we run the above command, a molecule directory is created inside the Ansible role directory.

NOTE: Molecule internally uses the ansible-galaxy init command to create a role.

Below are the main configuration files of Molecule:

  • molecule.yml – Contains the definition of OS platforms, dependencies, container platform driver, testing tool, etc.
  • playbook.yml – The playbook for executing the role in Vagrant/Docker.
  • tests/test_default.py – We can write test cases here.

Content of molecule.yml

cat molecule/default/molecule.yml

---
molecule:
  ignore_paths:
    - venv

dependency:
  name: galaxy
driver:
  name: docker
lint:
  name: yamllint	
platforms:
  - name: centos7
    image: centos/systemd:latest
    privileged: True
  - name: ubuntu16
    image: ubuntu:16.04
provisioner:
  name: ansible
  lint:
    name: ansible-lint
#    enabled: False
verifier:
  name: testinfra
  lint:
    name: flake8
scenario:
  name: default  # optional
  create_sequence:
    - create
    - prepare
  check_sequence:
    - destroy
    - dependency
    - create

Explanation of above contents:

Dependency:

Testing roles may rely upon additional dependencies. Molecule handles managing these dependencies by invoking configurable dependency managers.

“Ansible Galaxy” is the default dependency manager.

Linter:

A linter is a program which analyses our code for potential errors.

What can code linters do for you?

A code linter can check for:

  1. Syntax errors
  2. Undefined variables
  3. Best practice or code style guidelines
  4. Extra lines
  5. Extra spaces, etc.

We have linters for almost every programming language; for example, yamllint for YAML.

yamllint: It checks for syntax validity, key repetition, line length, trailing spaces, indentation, etc.
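
For reference, yamllint itself is configured through a .yamllint file in the repository/role root; below is a small illustrative example that extends the default rules and relaxes two of them:

# .yamllint
extends: default

rules:
  line-length:
    max: 120
  truthy: disable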

provisioner: Ansible is the default provisioner. No other provisioner will be supported.

Flake8: the default verifier linter, used for the Python test files.

platforms:

Defines which platforms (containers) will be created and on which the Ansible code will be executed.

Driver:

The driver defines the platform where your Ansible code will be executed.

Molecule supports below drivers:

  • Azure
  • Docker
  • EC2
  • GCE
  • Openstack
  • Vagrant

Scenario:

Scenario – a scenario defines what will be performed when we run Molecule.

Below is the default scenario:

–> Test matrix

└── default
├── lint
├── destroy
├── dependency
├── syntax
├── create
├── prepare
├── converge
├── idempotence
├── side_effect
├── verify
└── destroy

However, we can change this scenario and sequence by editing the molecule.yml file:

scenario:
  name: default  # optional
  create_sequence:      # molecule create 
    - create
    - prepare
  check_sequence:       # molecule check 
    - destroy
    - dependency
    - create
    - prepare
    - converge
    - check
    - destroy
  converge_sequence:    # molecule converge 
    - dependency
    - create
    - prepare
    - converge
  destroy_sequence:     # molecule destroy 
    - cleanup
    - destroy
  test_sequence:        # molecule test 
#    - lint
    - cleanup
    - dependency
    - syntax
    - create
    - prepare
    - converge

NOTE: If any one scenario (action) fails, the others will not be executed. This is the default Molecule behaviour.

Here I am defining all the scenarios:

lint: Checks all the YAML files with yamllint

destroy: If there is already a container running with the same name, destroys that container

dependency: This action allows you to pull dependencies from ansible-galaxy if your role requires them

syntax: Checks the role with ansible-lint

create: Creates the Docker image and uses that image to start our test containers

prepare: This action executes the prepare playbook, which brings the host to a specific state before running converge. This is useful if your role requires a pre-configuration of the system before the role is executed.

Example: prepare.yml

---
- name: Prepare
  hosts: all
  gather_facts: false
  tasks:
    - name: Install net-tools and curl
      apt:
        name: ['curl', 'net-tools']
        state: present
      when: ansible_os_family == "Debian"

NOTE: when we run “molecule converge”, the below tasks will be performed:

====> Create –> create.yml will be called
====> Prepare –> prepare.yml will be called
====> Provisioning –> playbook.yml will be called

converge: Run the role inside the test container.

idempotence: Molecule runs the playbook a second time to check for idempotence, making sure no unexpected changes are made across multiple runs.

side_effect: Intended to test HA failover scenarios or the like. See the Ansible provisioner documentation.

verify: Runs the tests we have written, inside the container.

destroy: Destroys the created container

NOTE: When we run Molecule commands, a Molecule-managed directory named molecule is created inside /tmp. It contains the Ansible configuration, a Dockerfile for each Linux flavour, and the Ansible inventory.

cd /tmp/molecule

tree
.
└── osm_nginx
└── default
├── ansible.cfg
├── Dockerfile_centos_systemd_latest
├── Dockerfile_ubuntu_16_04
├── inventory
│ └── ansible_inventory.yml
└── state.yml

state.yml – maintains which scenarios have been performed.

# Molecule managed

---
converged: true
created: true
driver: docker
prepared: true

Testing:

This is the most important part of Molecule, where we will write some test cases.

Testinfra is the default test runner.

Install the testinfra Python module, then run the verify step:

  • $ pip install testinfra
  • $ molecule verify

Molecule calls the below file for unit tests when using the “testinfra” verifier:

molecule/default/tests/test_default.py

verifier:

A verifier is used for running your test cases.

Below are the three verifiers which we can use with Molecule:

  • testinfra – It uses the Python language for writing test cases.
  • goss – It uses the YAML language for writing test cases.
  • serverspec – It uses the Ruby language for writing test cases.

Here I am using testinfra as the verifier for writing test cases.

Molecule commands:

  • # molecule check [ Run playbook.yml in check mode ]
  • # molecule create [ Create the instance/platform ]
  • # molecule destroy [ Destroy the instance/platform ]
  • # molecule verify [ Perform unit tests ]
  • # molecule test [ Performs the default scenario sequence ]
  • # molecule prepare
  • # molecule converge

NOTE: To enable debug logs, run a command with the --debug flag

$ molecule --debug test

Sample Test cases :

cat molecule/default/tests/test_default.py

import os

import testinfra.utils.ansible_runner

testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
    os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('all')

def test_user(host):
    user = host.user("www-data")
    assert user.exists

def test_nginx_is_installed(host):
    nginx = host.package("nginx")
    assert nginx.is_installed


def test_nginx_running_and_enabled(host):
    distribution = host.system_info.distribution
    if distribution == 'debian':
        nginx = host.service("nginx")
        assert nginx.is_running
        assert nginx.is_enabled

def test_nginx_is_listening(host):
    assert host.socket('tcp://127.0.0.1:80').is_listening

That’s all! We have covered all the required topics, which will help you create your own Molecule environment and test cases.

Thanks all!!! See you soon with a new and effective blog 🙂

Links you may refer:

https://yamllint.readthedocs.io/en/stable/

Collect Logs with Fluentd in K8s. (Part-2)

Thanks for going through part 1 of this series; if you haven’t, check it out here: EFK 7.4.0 Stack on Kubernetes (Part-1). In this part, we will focus on solving our log collection problem for the Docker containers inside the cluster. We will do so by deploying Fluentd as a DaemonSet inside our K8s cluster; a DaemonSet ensures that all (or some) nodes run a copy of a pod, so every worker node of the K8s cluster runs a Fluentd pod.

In Kubernetes, containerized applications that log to stdout and stderr have their log streams captured and redirected to JSON files on the nodes. The Fluentd Pod will tail these log files, filter log events, transform the log data, and ship it off to the Elasticsearch cluster we deployed earlier.

In addition to container logs, the Fluentd agent will tail Kubernetes system component logs like kubelet, kube-proxy, and Docker logs. To see a full list of sources tailed by the Fluentd logging agent, consult the kubernetes.conf file used to configure the logging agent. 

Step-1 Service Account for Fluentd

First, we will create a Service Account called fluentd that the Fluentd pods will use to access the Kubernetes API, along with a ClusterRole and ClusterRoleBinding. We create it in the logging namespace with the label app: fluentd.

#fluentd-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
  labels:
    app: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
  labels:
    app: fluentd
rules:
- apiGroups:
  - "*"
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging

Step-2 Fluent Configuration as ConfigMap

Secondly, we’ll create a ConfigMap, fluentd-configmap, to provide a config file to our Fluentd DaemonSet with all the required properties.

Here, we will be creating a “separate index for each namespace” to isolate the different environments. Optionally, you can create indexes per pod name as well in the K8s cluster.

Also, we will see how to “disable the index creation for certain namespaces & pods.”

#fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-configmap
  namespace: logging
  labels:
    app: fluentd
    kubernetes.io/cluster-service: "true"
data:
  fluent.conf: |
    <match fluent.**>
      @type null
    </match>
    <source>
      @type tail
      path /var/log/containers/*.log
# Here in exclude_path, we can define the path having the namespace name like prometheus, logging etc for which we don't want to create the indexes.
      exclude_path ["/var/log/containers/*prometheus*.log", "/var/log/containers/*logging*.log"]
      pos_file /var/log/fluentd-containers.log.pos
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      tag kubernetes.*
      format json
      read_from_head false
    </source>
    <filter kubernetes.**>
      @type kubernetes_metadata
      verify_ssl false
    </filter>
    <match kubernetes.**>
        @type elasticsearch_dynamic
        include_tag_key true
        logstash_format true
#Below line is use to isolate the indexes as per different namespaces in K8s.
        logstash_prefix kubernetes-${record['kubernetes']['namespace_name']}
#Uncomment the below line, if want to isolate the indexes as per different pods in K8s.
        #logstash_prefix kubernetes-${record['kubernetes']['pod_name']}
        host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
        port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
        scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
        user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
        password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
        reload_connections false
        reconnect_on_error true
        reload_on_failure true
        <buffer>
            flush_thread_count 16
            flush_interval 5s
            chunk_limit_size 2M
            queue_limit_length 32
            retry_max_interval 30
            retry_forever true
        </buffer>
    </match>

Step-3 Fluentd as Daemonset

Now, we will deploy Fluentd as a DaemonSet, which deploys an agent on each node of the K8s cluster to collect logs according to the settings configured in Step-2.

#fluentd-daemonset.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
  labels:
    app: fluentd
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      app: fluentd
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        app: fluentd
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccount: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.7.3-debian-elasticsearch7-1.0
        env:
          - name:  FLUENT_ELASTICSEARCH_HOST
            value: "elasticsearch.logging.svc.cluster.local"
          - name:  FLUENT_ELASTICSEARCH_PORT
            value: "9200"
          - name: FLUENT_ELASTICSEARCH_SCHEME
            value: "http"
          - name: FLUENT_ELASTICSEARCH_USER
            value: "elastic"
          - name: FLUENT_ELASTICSEARCH_PASSWORD
            valueFrom:
              secretKeyRef:
                name: efk-pw-elastic
                key: password
          - name: FLUENT_ELASTICSEARCH_SED_DISABLE
            value: "true"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluentconfig
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluentconfig
        configMap:
          name: fluentd-configmap

Step-4 Apply the files & see indexes in Kibana

We will apply all the configured files as below:

kubectl apply -f fluentd-service-account.yaml \
              -f fluentd-configmap.yaml \
              -f fluentd-daemonset.yaml
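
A quick way to verify the rollout (assuming the names and labels used in the manifests above) is to check that one fluentd pod is running per worker node and that it is shipping logs:

kubectl get daemonset fluentd -n logging
kubectl get pods -n logging -l app=fluentd -o wide
kubectl logs -n logging -l app=fluentd --tail=20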

Now, open the Kibana dashboard with the admin user created in Part-1, navigate to Management in the left bar, and click on Index Management under Elasticsearch. Here you can see the indexes created for the different namespaces in the K8s cluster.

Here, I have indexes for 2 namespaces, i.e. prod & kube-system; the rest I have disabled.

Next steps

In the following article [Collect Metrics with Elastic Metricbeat & Heartbeat for K8s Monitoring (Part-3)], we will learn how to install and configure Metricbeat & Heartbeat to collect the K8s metrics.

EFK 7.4.0 Stack on Kubernetes. (Part-1)

INTRODUCTION

In this article, we will learn how to set up a complete stack for your Kubernetes environment; it’s a one-stop solution for logging, monitoring, alerting & authentication. This kind of solution allows your team to gain visibility over your infrastructure and each application.

So, what is the EFK Stack? “EFK” is the acronym for three open source projects: Elasticsearch, Fluentd, and Kibana. Elasticsearch is a search and analytics engine. Fluentd is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana lets users visualize the data in Elasticsearch with charts and graphs.

The Elastic Stack is the next evolution of the EFK Stack.

Overview of EFK Stack

To achieve this, we will be using the EFK stack version 7.4.0, composed of Elasticsearch, Fluentd, Kibana, Metricbeat, Heartbeat, APM-Server, and ElastAlert, on a Kubernetes environment. This article series will walk through a standard Kubernetes deployment, which, in my opinion, gives an overall better understanding of each step of installation and configuration.

PREREQUISITES

Before you begin with this guide, ensure you have the following available to you:

  • A Kubernetes 1.10+ cluster with role-based access control (RBAC) enabled
    • Ensure your cluster has enough resources available to roll out the EFK stack; if not, scale your cluster by adding worker nodes. We’ll be deploying a 3-Pod Elasticsearch cluster for each of the master & data roles (you can scale this down to 1 if necessary).
    • Every worker node will also run a Fluentd & Metricbeat Pod.
    • As well as a single Pod each of Kibana, Heartbeat, APM-Server & ElastAlert.
  • The kubectl command-line tool installed on your local machine, configured to connect to your cluster.
    Once you have these components set up, you’re ready to begin with this guide.
  • For the Elasticsearch cluster to store the data, create a StorageClass in your appropriate cloud provider. If doing an on-premise deployment, use NFS for the same.
  • Make sure you have applications running in your K8s cluster to see the complete functioning of the EFK Stack.

Step 1 – Creating a Namespace

Before we start the deployment, we will create the namespace. Kubernetes lets you separate objects running in your cluster using a “virtual cluster” abstraction called Namespaces. In this guide, we’ll create a logging namespace into which we’ll install the EFK stack & its components.
To create the logging Namespace, use the below yaml file.

#logging-namespace.yaml
kind: Namespace
apiVersion: v1
metadata:
  name: logging
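
Apply the manifest and confirm the namespace exists (the file name simply follows the comment at the top of the manifest):

kubectl apply -f logging-namespace.yaml
kubectl get namespace logging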

Step 2 – Elasticsearch StatefulSet Cluster

To set up the monitoring stack, first we will deploy Elasticsearch; this will act as the database to store all the data (metrics, logs and traces). The database will be composed of three scalable nodes connected together into a cluster, as recommended for production.

Here we will enable x-pack authentication to make the stack more secure against potential attackers.

Also, we will be using a custom Docker image which has the Elasticsearch repository-s3 plugin installed along with the required certs. This will be required later for Snapshot Lifecycle Management (SLM).

Note: the same plugin can be used to take snapshots to AWS S3 and Alibaba Cloud OSS.

1. Build the Docker image from the below Dockerfile:

FROM docker.elastic.co/elasticsearch/elasticsearch:7.4.0
USER root
ARG OSS_ACCESS_KEY_ID
ARG OSS_SECRET_ACCESS_KEY
RUN elasticsearch-plugin install --batch repository-s3
RUN elasticsearch-keystore create
RUN echo $OSS_ACCESS_KEY_ID | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key
RUN echo $OSS_SECRET_ACCESS_KEY | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.secret_key
RUN elasticsearch-certutil cert -out config/elastic-certificates.p12 -pass ""
RUN chown -R elasticsearch:root config/

Now let’s build the image and push it to your private container registry.

docker build -t <registry-path>/elasticsearch-s3oss:7.4.0 --build-arg OSS_ACCESS_KEY_ID=<key> --build-arg OSS_SECRET_ACCESS_KEY=<secret> .

docker push <registry-path>/elasticsearch-s3oss:7.4.0

2. Setup the ElasticSearch master node:

The first node type of the cluster we’re going to set up is the master, which is responsible for controlling the cluster.

As the first k8s object, we’ll create a headless Kubernetes Service, elasticsearch-master (in elasticsearch-master-svc.yaml), that will define a DNS domain for the 3 pods. A headless service does not perform load balancing and does not have a static IP.

#elasticsearch-master-svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: logging
  name: elasticsearch-master
  labels:
    app: elasticsearch
    role: master
spec:
  clusterIP: None
  selector:
    app: elasticsearch
    role: master
  ports:
    - port: 9200
      name: http
    - port: 9300
      name: node-to-node

Next is a StatefulSet deployment for the master node (elasticsearch-master.yaml), which describes the running service (docker image, number of replicas, environment variables and volumes).

#elasticsearch-master.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: logging
  name: elasticsearch-master
  labels:
    app: elasticsearch
    role: master
spec:
  serviceName: elasticsearch-master
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
      role: master
  template:
    metadata:
      labels:
        app: elasticsearch
        role: master
    spec:
      affinity:
        # Try to put each ES master node on a different node in the K8s cluster
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - elasticsearch
                  - key: role
                    operator: In
                    values:
                      - master
                topologyKey: kubernetes.io/hostname
      # spec.template.spec.initContainers
      initContainers:
        # Fix the permissions on the volume.
        - name: fix-the-volume-permission
          image: busybox
          command: ['sh', '-c', 'chown -R 1000:1000 /usr/share/elasticsearch/data']
          securityContext:
            privileged: true
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
        # Increase the default vm.max_map_count to 262144
        - name: increase-the-vm-max-map-count
          image: busybox
          command: ['sysctl', '-w', 'vm.max_map_count=262144']
          securityContext:
            privileged: true
        # Increase the ulimit
        - name: increase-the-ulimit
          image: busybox
          command: ['sh', '-c', 'ulimit -n 65536']
          securityContext:
            privileged: true

      # spec.template.spec.containers
      containers:
        - name: elasticsearch
          image: <registery-path>/elasticsearch-s3oss:7.4.0
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
          resources:
            requests:
              cpu: 0.25
            limits:
              cpu: 1
              memory: 1Gi
          # spec.template.spec.containers[elasticsearch].env
          env:
            - name: network.host
              value: "0.0.0.0"
            - name: discovery.seed_hosts
              value: "elasticsearch-master.logging.svc.cluster.local"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2"
            - name: ES_JAVA_OPTS
              value: -Xms512m -Xmx512m
            - name: node.master
              value: "true"
            - name: node.ingest
              value: "false"
            - name: node.data
              value: "false"
            - name: search.remote.connect
              value: "false"           
            - name: cluster.name
              value: prod
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
         # parameters to enable x-pack security.
            - name: xpack.security.enabled
              value: "true"
            - name: xpack.security.transport.ssl.enabled
              value: "true"
            - name: xpack.security.transport.ssl.verification_mode
              value: "certificate"
            - name: xpack.security.transport.ssl.keystore.path
              value: elastic-certificates.p12
            - name: xpack.security.transport.ssl.truststore.path
              value: elastic-certificates.p12
          # spec.template.spec.containers[elasticsearch].volumeMounts
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data

      # use the secret if pulling image from private repository
      imagePullSecrets:
        - name: prod-repo-sec
  # Here we are using the cloud storage class to store the data, make sure u have created the storage-class as pre-requisite.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: elastic-cloud-disk
      resources:
        requests:
          storage: 20Gi

Now, apply these files to the K8s cluster to deploy the elasticsearch master nodes.

$ kubectl apply -f elasticsearch-master.yaml \
                -f elasticsearch-master-svc.yaml
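
Before moving on, it is worth confirming that the three master pods come up Ready (the names and labels follow from the StatefulSet defined above):

kubectl rollout status statefulset/elasticsearch-master -n logging
kubectl get pods -n logging -l app=elasticsearch,role=master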

3. Setup the ElasticSearch data node:

The second node type we’re going to set up is the data node, which is responsible for hosting the data and executing the queries (CRUD, search, aggregation).

Here also, we’ll create a headless Kubernetes service (in elasticsearch-data-svc.yaml) that will define a DNS domain for the 3 data pods.

#elasticsearch-data-svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: logging 
  name: elasticsearch
  labels:
    app: elasticsearch
    role: data
spec:
  clusterIP: None
  selector:
    app: elasticsearch
    role: data
  ports:
    - port: 9200
      name: http
    - port: 9300
      name: node-to-node

Next is a StatefulSet deployment for the data node (elasticsearch-data.yaml), which describes the running service (docker image, number of replicas, environment variables and volumes).

#elasticsearch-data.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: logging 
  name: elasticsearch-data
  labels:
    app: elasticsearch
    role: data
spec:
  serviceName: elasticsearch-data
  # This is number of nodes that we want to run
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
      role: data
  template:
    metadata:
      labels:
        app: elasticsearch
        role: data
    spec:
      affinity:
        # Try to put each ES data node on a different node in the K8s cluster
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - elasticsearch
                  - key: role
                    operator: In
                    values:
                      - data
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 300
      # spec.template.spec.initContainers
      initContainers:
        # Fix the permissions on the volume.
        - name: fix-the-volume-permission
          image: busybox
          command: ['sh', '-c', 'chown -R 1000:1000 /usr/share/elasticsearch/data']
          securityContext:
            privileged: true
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
        # Increase the default vm.max_map_count to 262144
        - name: increase-the-vm-max-map-count
          image: busybox
          command: ['sysctl', '-w', 'vm.max_map_count=262144']
          securityContext:
            privileged: true
        # Increase the ulimit
        - name: increase-the-ulimit
          image: busybox
          command: ['sh', '-c', 'ulimit -n 65536']
          securityContext:
            privileged: true
      # spec.template.spec.containers
      containers:
        - name: elasticsearch
          image: <registery-path>/elasticsearch-s3oss:7.4.0
          imagePullPolicy: Always
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
          resources:
            limits:
              memory: 4Gi
          # spec.template.spec.containers[elasticsearch].env
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-master.logging.svc.cluster.local"
            - name: ES_JAVA_OPTS
              value: -Xms3g -Xmx3g
            - name: node.master
              value: "false"
            - name: node.ingest
              value: "true"
            - name: node.data
              value: "true"
            - name: cluster.remote.connect
              value: "true"
            - name: cluster.name
              value: prod
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: xpack.security.enabled
              value: "true"
            - name: xpack.security.transport.ssl.enabled
              value: "true"  
            - name: xpack.security.transport.ssl.verification_mode
              value: "certificate"
            - name: xpack.security.transport.ssl.keystore.path
              value: elastic-certificates.p12
            - name: xpack.security.transport.ssl.truststore.path
              value: elastic-certificates.p12 
          # spec.template.spec.containers[elasticsearch].volumeMounts
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data

      # use the secret if pulling image from private repository
      imagePullSecrets:
        - name: prod-repo-sec

# Here we are using the cloud storage class to store the data, make sure u have created the storage-class as pre-requisite.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: elastic-cloud-disk
      resources:
        requests:
          storage: 50Gi

Now, apply these files to K8s Cluster to deploy elasticsearch data nodes.

$ kubectl apply -f elasticsearch-data.yaml \
                -f elasticsearch-data-svc.yaml

4. Generate the X-Pack passwords and store one in a k8s secret:

We enabled the x-pack security module above to secure our cluster, so we need to initialize the passwords. Execute the following command, which runs the program bin/elasticsearch-setup-passwords inside a data node container (any node would work), to generate the default users and passwords.

$ kubectl exec $(kubectl get pods -n logging | grep elasticsearch-data | sed -n 1p | awk '{print $1}') \
    -n logging \
    -- bin/elasticsearch-setup-passwords auto -b

Changed password for user apm_system
PASSWORD apm_system = uF8k2KVwNokmHUomemBG

Changed password for user kibana
PASSWORD kibana = DBptcLh8hu26230mIYc3

Changed password for user logstash_system
PASSWORD logstash_system = SJFKuXncpNrkuSmVCaVS

Changed password for user beats_system
PASSWORD beats_system = FGgIkQ1ki7mPPB3d7ns7

Changed password for user remote_monitoring_user
PASSWORD remote_monitoring_user = EgFB3FOsORqOx2EuZNLZ

Changed password for user elastic
PASSWORD elastic = 3JW4tPdspoUHzQsfQyAI

Note the elastic user password; we will add it to a k8s secret (efk-pw-elastic) which will be used by the other stack components to connect to the elasticsearch data nodes for data ingestion.

$ kubectl create secret generic efk-pw-elastic \
    -n logging \
    --from-literal password=3JW4tPdspoUHzQsfQyAI
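
To sanity-check the cluster at this point, you can port-forward one of the data pods and query the cluster health as the elastic user (the pod name follows from the StatefulSet above, and the password is the one generated earlier):

kubectl port-forward -n logging elasticsearch-data-0 9200:9200 &
curl -u elastic:3JW4tPdspoUHzQsfQyAI "http://localhost:9200/_cluster/health?pretty"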

Step 3 – Kibana Setup

To launch Kibana on Kubernetes, we’ll create a ConfigMap, kibana-configmap, to provide a config file to our deployment with all the required properties; a Service called kibana; a Deployment consisting of one pod replica (you can scale the number of replicas depending on your production needs); and an Ingress, which routes outside traffic to the Service inside the cluster. You need an Ingress controller for this step.

#kibana-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: kibana-configmap
  namespace: logging
data:
  kibana.yml: |
    server.name: kibana
    server.host: "0"
    # Optionally can define dashboard id which will launch on main Kibana Page.
    kibana.defaultAppId: "dashboard/781b10c0-09e2-11ea-98eb-c318232a6317"
    elasticsearch.hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
    elasticsearch.username: ${ELASTICSEARCH_USERNAME}
    elasticsearch.password: ${ELASTICSEARCH_PASSWORD}
---
#kibana-service.yaml 
apiVersion: v1
kind: Service
metadata:
  namespace: logging
  name: kibana
  labels:
    app: kibana
spec:
  selector:
    app: kibana
  ports:
    - port: 5601
      name: http
---
#kibana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: logging 
  name: kibana
  labels:
    app: kibana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:7.4.0
          ports:
            - containerPort: 5601
          env:
            - name: SERVER_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: SERVER_HOST
              value: "0.0.0.0"
            - name: ELASTICSEARCH_HOSTS
              value: http://elasticsearch.logging.svc.cluster.local:9200
            - name: ELASTICSEARCH_USERNAME
              value: kibana
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: elasticsearch-pw-elastic
                  key: password
            - name: XPACK_MONITORING_ELASTICSEARCH_USERNAME
              value: elastic
            - name: XPACK_MONITORING_ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: efk-pw-elastic
                  key: password
          volumeMounts:
          - name: kibana-configmap
            mountPath: /usr/share/kibana/config
      volumes:
      - name: kibana-configmap
        configMap:
          name: kibana-configmap
---
#kibana-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kibana
  namespace: logging
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  # Specify the tls secret.
  tls:
  - secretName: prod-secret
    hosts:
    - kibana.example.com
   
  rules:
  - host: kibana.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: kibana
          servicePort: 5601

Now, let’s apply these files to deploy Kibana to K8s cluster.

$ kubectl apply  -f kibana-configmap.yaml \
                 -f kibana-service.yaml \
                 -f kibana-deployment.yaml \
                 -f kibana-ingress.yaml
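
You can verify the deployment before opening the dashboard (assuming the labels and names used above):

kubectl get pods -n logging -l app=kibana
kubectl get ingress kibana -n logging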

Now, open Kibana at the domain name https://kibana.example.com in your browser (which we defined in our Ingress), or expose the kibana service on a NodePort and access the dashboard that way.

Now, log in with the username elastic and the password generated before and stored in the secret (efk-pw-elastic), and you will be redirected to the index page:

Last, create a separate admin user with the superuser role to access the Kibana dashboard.

Finally, we are ready to use the ElasticSearch + Kibana stack which will serve us to store and visualize our infrastructure and application data (metrics, logs and traces).

Next steps

In the following article [Collect Logs with Fluentd in K8s. (Part-2)], we will learn how to install and configure fluentd to collect the logs.

Prometheus-Alertmanager integration with MS-teams

As we know, monitoring our infrastructure is one of the most critical components of infrastructure management, ensuring the proper functioning of our applications and infrastructure. But it is of no use if we are not getting notified of alarms and threats in our system. As a better practice, if we send all of the notifications to a common workspace, it becomes much easier for our team to track the status and performance of our infrastructure.

Last week, all of a sudden, my company chose to migrate from Slack to MS Teams as the common chatroom, which meant notifications would also have to be reconfigured for MS Teams. If you search a bit, you will find that Alertmanager has no direct configuration for MS Teams the way it does for Slack. As a DevOps engineer I didn’t stop there and looked for other solutions, and I found out that we need a proxy between Alertmanager and MS Teams for forwarding alerts, so I proceeded to configure one.

There are a couple of tools which we can use as this proxy, but I preferred prometheus-msteams, for a couple of reasons:

  • Well-structured documentation.
  • Easy to configure.
  • We have more control in hand: we can customise the alert notification, and we can also configure it to send notifications to multiple channels on MS Teams.

Even with the well-described documentation, I still faced some challenges, and it took half of my day.

How does it work?

Firstly, Prometheus sends an alert to Alertmanager on the basis of the rules we configured in the Prometheus server. For instance, if memory usage of a server goes above 90%, an alert is generated and sent to Alertmanager by the Prometheus server. Afterwards, Alertmanager forwards this alert to prometheus-msteams, which in turn sends it in JSON format to the MS Teams channel.
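
For reference, the memory example above corresponds to a Prometheus alerting rule along the following lines; this is only a sketch, and the metric names assume node_exporter 0.16+ with your own thresholds and file layout:

# memory-alerts.yml (referenced from rule_files in prometheus.yml)
groups:
  - name: node-memory
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"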

How to Run and Configure prometheus-msteams

We have multiple options to run prometheus-msteams

  1. Running on standalone Server (Using Binary)
  2. Running as a Docker Container

Running on Server

Firstly, you need to download the binary; click here to download it from the latest releases.

When you execute the binary with --help on your system, you can see multiple options with descriptions, which help us run prometheus-msteams much like man pages do.

prometheus-msteams server --help

You can run the prometheus-msteams service as follows.

./prometheus-msteams server \
    -l localhost \
    -p 2000 \
    -w "Webhook of MS-teams channel"

Explanation of the above options:

  • -l: the address prometheus-msteams listens on; the default is “0.0.0.0”. In the above example, prometheus-msteams listens on localhost.
  • -p: the port prometheus-msteams listens on; the default is 2000.
  • -w: the incoming webhook of the MS Teams channel goes here.

Now that you know how to run prometheus-msteams on the server, let’s configure it with Alertmanager.

Step 1 (Creating Incoming Webhook)

Create a channel in MS Teams where you want to send alerts. Click on Connectors (found in the channel options), then search for the ‘Incoming Webhook’ connector, from which you can create a webhook for this channel. An incoming webhook is used by external services to send notifications into the channel to track activities.

Step 2 (Run prometheus-msteams)

By now, you have an incoming webhook for the channel where you want to send notifications. Next, you need to set up prometheus-msteams and run it.

To keep more options open for the future, you can use config.yml to provide the webhook, so that you can add multiple webhooks later and send alerts to multiple channels in MS Teams if you need to.

$ sudo nano /opt/prometheus-msteams/config.yml

Add webhooks as shown below. If you want to add another webhook, you can add it right after the first one.

connectors:
  - alert_channel: "WEBHOOK URL"

The next step is to add a template for custom notifications.

$ sudo nano /opt/prometheus-msteams/card.tmpl

Copy the following content into your file, or modify the template as per your requirements. This template can be customized and uses the Go templating engine.

{{ define "teams.card" }}
{
  "@type": "MessageCard",
  "@context": "http://schema.org/extensions",
  "themeColor": "{{- if eq .Status "resolved" -}}2DC72D
                 {{- else if eq .Status "firing" -}}
                    {{- if eq .CommonLabels.severity "critical" -}}8C1A1A
                    {{- else if eq .CommonLabels.severity "warning" -}}FFA500
                    {{- else -}}808080{{- end -}}
                 {{- else -}}808080{{- end -}}",
  "summary": "Prometheus Alerts",
  "title": "Prometheus Alert ({{ .Status }})",
  "sections": [ {{$externalUrl := .ExternalURL}}
  {{- range $index, $alert := .Alerts }}{{- if $index }},{{- end }}
    { 
      "facts": [
        {{- range $key, $value := $alert.Annotations }}
        {
          "name": "{{ reReplaceAll "_" "\\\\_" $key }}",
          "value": "{{ reReplaceAll "_" "\\\\_" $value }}"
        },
        {{- end -}}
        {{$c := counter}}{{ range $key, $value := $alert.Labels }}{{if call $c}},{{ end }}
        {
          "name": "{{ reReplaceAll "_" "\\\\_" $key }}",
          "value": "{{ reReplaceAll "_" "\\\\_" $value }}"
        }
        {{- end }}
      ],
      "markdown": true
    }
    {{- end }}
  ]
}
{{ end }}

Create a prometheus-msteams user, and use --no-create-home and --shell /bin/false to prevent this user from logging into the server.

$ sudo useradd --no-create-home --shell /bin/false prometheus-msteams

Create a service file to run prometheus-msteams as a service with the following command.

$ sudo nano /etc/systemd/system/prometheus-msteams.service

The service file tells systemd to run prometheus-msteams as the prometheus-msteams user, with the configuration file located at /opt/prometheus-msteams/config.yml and the template file in the same directory.

Copy the following content into prometheus-msteams.service file.

[Unit]
Description=Prometheus-msteams
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus-msteams
Group=prometheus-msteams
Type=simple
ExecStart=/usr/local/bin/prometheus-msteams server -l localhost -p 2000 --config /opt/prometheus-msteams/config.yml --template-file /opt/prometheus-msteams/card.tmpl

[Install]
WantedBy=multi-user.target

prometheus-msteams listens on localhost on port 2000, and you have to provide the configuration file and the template as well.

To use the newly created service, reload systemd.

$ sudo systemctl daemon-reload

Now start prometheus-msteams.

$ sudo systemctl start prometheus-msteams.service

Check, whether the service is running or not.

$ sudo systemctl status prometheus-msteams

Lastly, enable the service to start on the boot.

$ sudo systemctl enable prometheus-msteams

Now that prometheus-msteams is up and running, we can configure Alertmanager to send alerts to it.

Step 3(Configure ALERTMANAGER)

Open alertmanager.yml file in your favorite editor.

$ sudo vim /etc/alertmanager/alertmanager.yml

You can configure Alertmanager as shown below.

global:
  resolve_timeout: 5m

templates:
  - '/etc/alertmanager/*.tmpl'

receivers:
- name: alert_channel
  webhook_configs:
  - url: 'http://localhost:2000/alert_channel'
    send_resolved: true

route:
  group_by: ['critical','severity']
  group_interval: 5m
  group_wait: 30s
  receiver: alert_channel
  repeat_interval: 3h

In the above configuration, Alertmanager sends alerts to prometheus-msteams, which is listening on localhost, and we set send_resolved: true so that resolved alerts are also sent.

The critical alert to MS-teams will look like below.

When alert resolved, it will look like below.

Note: The logs of prometheus-msteams are written to the /var/log/syslog file, where you will find every notification sent by prometheus-msteams. Apart from this, if something goes wrong and you are not getting notifications, you can debug using the syslog file.
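
If you want to test the chain without waiting for a real alert, you can POST a hand-written payload in the Alertmanager webhook format straight to the prometheus-msteams endpoint; the payload below is a minimal illustrative example, not a capture from a real Alertmanager:

curl -X POST http://localhost:2000/alert_channel \
  -H 'Content-Type: application/json' \
  -d '{
        "version": "4",
        "status": "firing",
        "externalURL": "http://alertmanager.example.com",
        "commonLabels": { "severity": "warning" },
        "alerts": [
          {
            "status": "firing",
            "labels": { "alertname": "TestAlert", "severity": "warning" },
            "annotations": { "summary": "Test alert sent with curl" }
          }
        ]
      }'

A card should appear in the MS Teams channel if the webhook and template are wired up correctly.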

As Docker Container

You can also run prometheus-msteams as a container on your system. All the prometheus-msteams configuration files stay the same; you just need to run the following command.

docker run -d -p 2000:2000 \
    --name="promteams"  \
    -v /opt/prometheus-msteams/config.yml:/tmp/config.yml \
    -e CONFIG_FILE="/tmp/config.yml" \
    -v /opt/prometheus-msteams/card.tmpl:/tmp/card.tmpl \
    -e TEMPLATE_FILE="/tmp/card.tmpl" \
    docker.io/bzon/prometheus-msteams:v1.1.4

Now that you are all set to get alerts in an MS Teams channel, you can see that it isn’t as difficult as you originally thought. Of course, this is not the only way to get alerts in MS Teams; you can always use a different tool like prom2teams, etc. With this, I think we are ready to move ahead and explore other monitoring tools as well.

I hope this blog post explains everything clearly. I would really appreciate your feedback in the comments.