Kaushik Gattu: August 2019

Friday, August 16, 2019

Docker container health check

For example, suppose a webhook is running as a container. How do we know whether it is healthy or unhealthy?

How does a health check benefit our application?

In some scenarios the container is in the running state, but there is no interaction between the application container and the client.

One reason for this kind of behavior may be load.

In such a scenario we need to troubleshoot to understand the behavior of the container.

We can troubleshoot the container with several approaches, for example:

docker ps

docker ps -a (analyze why the container is in the stopped state and go through that container's logs with docker logs)

But the logs of a stopped container do not always explain what went wrong, and a container started with --rm leaves no logs behind. So what is the best way to troubleshoot?

How do we check the health of a container using the HEALTHCHECK instruction in the Dockerfile?

What is a health check?

Health checks are exactly what they sound like: a way of checking the health of some resource. In the case of Docker, a health check is a command used to determine the health of a running container.

When a health check command is specified, it tells Docker how to test the container to see if it is working. With no health check specified, Docker has no way of knowing whether the services running within your container are actually up.

Take the example of a Python app built with the Flask framework:

pythonapp/Dockerfile

FROM python:2.7

MAINTAINER Madhu Sudhan reddy "jmstechhome@gmail.com"

COPY . /app

WORKDIR /app

RUN pip install -r requirements.txt

ENTRYPOINT ["python"]

CMD ["app.py"]

Steps

#Simple Python Helloworld app using docker

Build the image using the following command

docker build -t pythonapp:v1 .

Run the Docker container using the command shown below.

docker run -it -p 80:5000 --name myapp pythonapp:v1

The application will be accessible at

http://<host_ip>:80

pythonapp/app.py

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello world!!"

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')

pythonapp/requirements.txt

flask

>>>>

Let's start with the requirements.txt:

Flask==0.12.2

And the Dockerfile

FROM python:3.6-alpine

COPY . /app

WORKDIR /app

RUN pip install -r requirements.txt

CMD ["python", "app.py"]

And finally, app.py:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello world'

if __name__ == '__main__':
    app.run(host="0.0.0.0")

Now let's build the image:

docker build -t docker-flask .

This should build pretty quickly. Then we will run the container:

docker run --rm --name docker-flask -p 5000:5000 docker-flask

Now test by opening your browser to localhost:5000. You should see "Hello world".

Add a health check to the Dockerfile

Since the goal of the container is to serve traffic on port 5000, our health check should make sure this is happening.

A health check is configured in the Dockerfile using the HEALTHCHECK instruction. There are two ways to use the HEALTHCHECK instruction:

HEALTHCHECK [OPTIONS] CMD command

Or, if you want to disable a health check inherited from a parent image:

HEALTHCHECK NONE

We are going to use the first form. Let's add the HEALTHCHECK instruction and use curl to ensure that our app is serving traffic on port 5000.

Add this line to the Dockerfile right before the last line (CMD):

HEALTHCHECK CMD curl --fail http://localhost:5000/ || exit 1

In this case we are using the default options, which are interval 30s, timeout 30s, start-period 0s and retries 3. Read the HEALTHCHECK instruction reference for more information on the options.
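To make the retries option concrete, here is a rough Python model of how the status moves between starting, healthy and unhealthy. It is a simplified sketch, not Docker's implementation: the names HealthState and replay are made up for illustration, and the model ignores start-period and timeout handling.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Simplified model of the health state tracked for one container."""
    retries: int = 3            # default value of --retries
    status: str = "starting"    # status before any check has succeeded
    failing_streak: int = 0

    def record(self, exit_code: int) -> str:
        """Apply one health check result (0 = success, anything else = failure)."""
        if exit_code == 0:
            self.failing_streak = 0
            self.status = "healthy"
        else:
            self.failing_streak += 1
            # The status flips only after `retries` consecutive failures.
            if self.failing_streak >= self.retries:
                self.status = "unhealthy"
        return self.status


def replay(exit_codes, retries=3):
    """Return the status after each check in a sequence of exit codes."""
    state = HealthState(retries=retries)
    return [state.record(code) for code in exit_codes]
```

For example, replay([1, 1, 0]) gives ["starting", "starting", "healthy"]: two failed checks are tolerated because the streak never reaches three.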

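The Dockerfile now looks like this (curl is not included in the Alpine image, so it is installed with apk):

FROM python:3.6-alpine

COPY . /app

WORKDIR /app

RUN apk add --no-cache curl

RUN pip install -r requirements.txt

HEALTHCHECK CMD curl --fail http://localhost:5000/ || exit 1

CMD ["python", "app.py"]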
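Since the image already ships Python, the same check could also be written without installing curl. The following is only a sketch under that assumption; the file name healthcheck.py and the function name check are hypothetical, not part of the original app.

```python
import urllib.error
import urllib.request


def check(url="http://localhost:5000/", timeout=5.0):
    """Return 0 if the URL answers with a 2xx/3xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 400 else 1
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return 1
```

If this were saved as healthcheck.py in /app, the instruction could become: HEALTHCHECK CMD python -c "import healthcheck, sys; sys.exit(healthcheck.check())". Like the curl version, exit code 0 means healthy and 1 means unhealthy.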

See the health status

Let's rebuild and run our container:

docker build -t docker-flask .

docker run --rm --name docker-flask -p 5000:5000 docker-flask

Now let's take a look at the health status. Notice that we passed the --name option to the command above so we can easily inspect the container:

docker inspect --format='{{json .State.Health}}' docker-flask

If you run this immediately after the container starts, you will see that the status is starting:

{"Status":"starting","FailingStreak":0,"Log":[]}

And after the health check runs (after the default interval of 30s), the status changes and the log records the check:

{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"...","End":"2017-07-29...", ...}]}
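If you want to poll the health status from a script instead of reading the JSON by eye, something like the following could work. It is a minimal sketch: summarize_health and container_health are illustrative names, and container_health assumes the docker CLI is on the PATH.

```python
import json
import subprocess


def summarize_health(inspect_json):
    """Summarize the JSON printed by docker inspect for .State.Health."""
    # A container without a health check prints "null", so fall back to {}.
    health = json.loads(inspect_json) or {}
    log = health.get("Log") or []
    last = log[-1] if log else {}
    return {
        "status": health.get("Status"),
        "failing_streak": health.get("FailingStreak", 0),
        "checks_recorded": len(log),
        "last_exit_code": last.get("ExitCode"),
    }


def container_health(name):
    """Run docker inspect for a container and summarize its health."""
    result = subprocess.run(
        ["docker", "inspect", "--format={{json .State.Health}}", name],
        check=True, capture_output=True, text=True,
    )
    return summarize_health(result.stdout)
```

Calling container_health("docker-flask") right after startup should report the status as starting with no checks recorded yet.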

Thursday, August 15, 2019

How to install Gradle ?

Install on macOS

The current Gradle release is 5.6. You can download binaries and view docs for all Gradle versions from the Gradle releases page.

Prerequisites

$ java -version
java version "1.8.0_121"

Homebrew

$ brew install gradle



Step 3. Configure your system environment



$ export PATH=$PATH:/opt/gradle/gradle-5.6/bin



Step 4. Verify your installation

$ gradle -v



Generate the Gradle wrapper (gradlew)



gradle wrapper --gradle-version 2.13
Starting a Gradle Daemon (subsequent builds will be faster)



Upgrade with the Gradle Wrapper



$ ./gradlew wrapper --gradle-version=5.6 --distribution-type=bin






$ ./gradlew tasks
Downloading https://services.gradle.org/distributions/gradle-5.6-bin.zip
...
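After an upgrade, the wrapper records the distribution it will download in gradle/wrapper/gradle-wrapper.properties, under the distributionUrl key. As a small sketch (the function name wrapper_version is made up for illustration), Python can read that file to confirm which version a project is pinned to. The pattern only handles plain release versions such as 5.6, not release candidates.

```python
import re


def wrapper_version(properties_text):
    """Return the Gradle version named in the distributionUrl entry, or None."""
    for line in properties_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "distributionUrl":
            match = re.search(r"gradle-(\d+(?:\.\d+)*)-(?:bin|all)\.zip", value)
            return match.group(1) if match else None
    return None
```

Usage: wrapper_version(open("gradle/wrapper/gradle-wrapper.properties").read()).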

Monday, August 12, 2019

Chaos engineering and just-in-time access for the network, VMs and keys

Chaos Engineering

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Introduction:

Traditional software testing approaches like unit, regression and integration testing validate known conditions and scenarios.

A distributed system has services whose interactions can cause unpredictable behavior in production.

Chaos engineering gauges the stability of more complex distributed systems involving multiple components and services, and the interactions between them.

Examples of interactions

When a service is unavailable, does it fall back or fail gracefully without impacting the whole system? E.g. can the account summary still be accessed when the payment system is down?

What happens when things fail and retries put additional load on the system?

When the site is slow, what if users keep clicking again and again? Are the transactions idempotent, or is the backend overloaded?
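The three questions above map onto three common defensive patterns: a fallback, retries with backoff, and idempotency keys. The Python below is a toy sketch of each. Every name in it (PaymentBackend, call_with_backoff, account_summary) is hypothetical, and the in-memory backend stands in for a real payment system.

```python
import random
import time


class PaymentBackend:
    """Hypothetical in-memory backend that applies each idempotency key at most once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # A repeated click or retry: return the stored result, charge nothing.
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = {"charged": amount}
        return self._results[idempotency_key]


def call_with_backoff(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry `operation` with exponential backoff and jitter so retries do not pile up."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


def account_summary(fetch_payments, cached_payments):
    """Fall back to cached payment data when the payment system is down."""
    try:
        return {"payments": fetch_payments(), "stale": False}
    except ConnectionError:
        return {"payments": cached_payments, "stale": True}
```

A chaos experiment would then check that these paths actually behave this way when the dependency is slowed down or shut off, rather than trusting the code review.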

Ramifications:

Outages can have negative impact on brand reputation

Engineering costs to retroactively figure out the root cause of an issue

SLA breach can lead to service providers having to compensate

Need for chaos engineering

To comprehend the systemic effect of changes in a distributed system

Understand vulnerable points of service

Improve resiliency of the system

Advantages

Customer satisfaction

Increased availability and durability means no business outages for

More technical insight into the services:

Better understanding of system failure modes

Improved mean time to detection for issues

Reduction in repeated issues

Chaos engineering Stages

Stage 0:

Preparing for disaster

Establish observability

Define the critical dependencies

Define the non-critical dependencies

Create a disaster recovery failover playbook

Create a critical dependency failover playbook

Create a non-critical dependency failover playbook

Publish the above and get team-wide agreement

Manually execute a failover exercise

Stage 1:

Injecting chaos internally

Perform critical dependency failure tests in non-production

Publish test results

( like a vaccine, we inject harm to build immunity ).

Stage 2:

Pushing the envelope forward

Perform frequent, semi-automated tests

Execute a resiliency experiment in prod

Publish test results

Stage 3:

Automating chaos internally

Automate resiliency testing in non-production

Semi-automate disaster recovery failover

Stage 4:

Injecting automated chaos everywhere

Integrate resiliency testing in CI/CD

Automate resiliency and disaster recovery failover testing in production

Gremlin : Injecting chaos in example services

Types of attacks

Shutdown

Time travel

CPU

Disk

Black hole

DNS

Memory

Latency

Process killer

Packet Loss

Application-level

gremlind (daemon) -> registers with the control plane via secret-based authentication -> control plane

gremlin (client) -> executes attack orders from users on the host machine -> control plane
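To get a feel for what a latency or application-level attack does to a caller, the effect can be imitated in-process. The decorator below is an illustrative sketch only. It is not Gremlin's API, and the name inject_chaos and its parameters are assumptions made for this example.

```python
import functools
import random
import time


def inject_chaos(latency_seconds=0.0, failure_rate=0.0,
                 rng=random.random, sleep=time.sleep):
    """Wrap a function so each call may be delayed or fail."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                sleep(latency_seconds)  # imitates a latency attack
            if rng() < failure_rate:
                # Imitates an application-level failure of the dependency.
                raise ConnectionError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating a client call with @inject_chaos(latency_seconds=0.5, failure_rate=0.2) in a non-production environment shows quickly whether timeouts, retries and fallbacks around that call behave as expected.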

How to implement chaos engineering?

In an example service

Identify all the candidate components for attacks

Create CPU and shutdown attack scripts

Attack generation on collector and logging tier

Integration with slack for alerting the teams

Attacks to be created for other services

Automated attacks

Resiliency implementation

Controlled automated attacks in all envs - sandbox and QA
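A CPU attack script from the list above can be very small. The sketch below is a hand-rolled illustration, not a Gremlin command: it starts a number of child Python processes that spin until a deadline and then exit on their own. The names cpu_attack and BURN_SNIPPET are hypothetical.

```python
import subprocess
import sys

# Code run by each child process: spin until the requested number of seconds has passed.
BURN_SNIPPET = (
    "import sys, time\n"
    "end = time.time() + float(sys.argv[1])\n"
    "while time.time() < end:\n"
    "    pass\n"
)


def cpu_attack(seconds, workers=1):
    """Keep `workers` CPU cores busy for roughly `seconds`; return the child exit codes."""
    children = [
        subprocess.Popen([sys.executable, "-c", BURN_SNIPPET, str(seconds)])
        for _ in range(workers)
    ]
    return [child.wait() for child in children]
```

Because the attack is time-boxed, it stops by itself even if the operator walks away, which matters when the same script is later run on a schedule in sandbox and QA.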

JTAP ( Just-in-time access to prod )

What’s wrong with how things are now ?

No AD integration

Cross-cloud makes things tricky

Revocation is messy

Giving someone temporary access is not possible

How does JTAP work?

A web app that users log in to and request access

User authentication via AD + MFA

User will be provided with unique credentials that have an expiry date

The SSH key is valid for four hours

SSH

ssh -i vis user@publicipaddress
eval $(ssh-agent -s)
ssh-add -t 3600 vis
ssh-keygen -Lf vis-cert.pub (shows the certificate's four-hour validity window)
cd /var/log
tail auth.log
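The notes do not show how the short-lived credentials are issued. Given the references to a CA key and a cert, one plausible approach is OpenSSH user certificates signed with ssh-keygen. The sketch below rests on that assumption. The function names and the jtap- identity prefix are hypothetical, and the first function only builds the command rather than running it.

```python
import re
from datetime import datetime


def signing_command(ca_key, public_key, user, hours=4):
    """Build an ssh-keygen command that signs `public_key` as a user certificate."""
    return [
        "ssh-keygen",
        "-s", ca_key,            # CA private key used for signing
        "-I", f"jtap-{user}",    # key identity, which shows up in the server's auth log
        "-n", user,              # principal (login name) the certificate is valid for
        "-V", f"+{hours}h",      # valid from now until `hours` hours from now
        public_key,
    ]


def validity_hours(keygen_output):
    """Extract the validity window, in hours, from `ssh-keygen -Lf <cert>` output."""
    match = re.search(r"Valid: from (\S+) to (\S+)", keygen_output)
    if not match:
        return None  # e.g. "Valid: forever"
    start, end = (
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S") for value in match.groups()
    )
    return (end - start).total_seconds() / 3600
```

Feeding the output of ssh-keygen -Lf vis-cert.pub into validity_hours gives a quick automated check that issued certificates really expire after about four hours.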

Pros

No third-party dependency

No config change on target VMs every time there’s a new request

Existing methods will work

Auditing / traceability

Can revoke access for individual users, even after cert is issued

Things to keep in mind

Secrecy of the CA key is paramount

Regular key rotation is recommended