Friday, August 16, 2019

Docker container health check



For example, suppose a webhook is running as a container. How do we know whether it is healthy or unhealthy?

How does a health check benefit our application?

In some scenarios, the container is in the running state, but we see no interaction between the application in the container and its clients.

One reason for this kind of behavior may be load.

In this scenario we need to troubleshoot to understand the behavior of the container. There are several approaches, for example:

docker ps

docker ps -a (to find out why a container is in the stopped state, then go through that container's logs with docker logs)

But the logs of a stopped container are gone once the container has been removed (for example, when it was started with --rm), and the logs alone may not show that the application stopped responding. So what is the best way to troubleshoot?


How do we check the health of a container using the HEALTHCHECK instruction in the Dockerfile?


What is a health check?

Health checks are exactly what they sound like: a way of checking the health of some resource. In the case of Docker, a health check is a command used to determine the health of a running container.

When a health check command is specified, it tells Docker how to test the container to see if it's working. With no health check specified, Docker has no way of knowing whether the services running within your container are actually up.

Take the example of a Python app built with the Flask framework.

pythonapp/Dockerfile 

FROM python:2.7
MAINTAINER Madhu Sudhan reddy  "jmstechhome@gmail.com"
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
ENTRYPOINT ["python"]
CMD ["app.py"]

Steps 

#Simple Python Helloworld app using docker
Build the image using the following command
docker build -t pythonapp:v1 .
Run the Docker container using the command shown below.
docker run -it -p 80:5000 --name myapp pythonapp:v1
The application will be accessible at
http://<host_ip>:80


pythonapp/app.py 

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello world!!"

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')

pythonapp/requirements.txt
flask

>>>>

Let's start with requirements.txt:
Flask==0.12.2

And the Dockerfile:

FROM python:3.6-alpine 
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["python", "app.py"]

And finally, app.py:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello world'

if __name__ == '__main__':
    # Listen on all interfaces so the published port is reachable from the host.
    app.run(host="0.0.0.0")

Now let's build the image:

docker build -t docker-flask .

This should build pretty quickly. Then we will run the container:

docker run --rm --name docker-flask -p 5000:5000 docker-flask

Now test by opening your browser to localhost:5000. You should see "Hello world".

Add a health check to the Dockerfile 

Since the goal of the container is to serve traffic on port 5000, our health check should make sure this is happening.

A health check is configured in the Dockerfile using the HEALTHCHECK instruction. There are two ways to use the HEALTHCHECK instruction:

HEALTHCHECK [OPTIONS] CMD command

Or if you want to disable a health check from a parent image:

HEALTHCHECK NONE

We are going to use the first form. Let's add the HEALTHCHECK instruction and use curl to ensure that our app is serving traffic on port 5000.

Add this line to the Dockerfile right before the last line (CMD):

HEALTHCHECK CMD curl --fail http://localhost:5000/ || exit 1

In this case, we are using the default options, which are interval 30s, timeout 30s, start-period 0s and retries 3. See the HEALTHCHECK instruction reference for more information on the options.
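To see how the retries option turns individual check results into a container status, here is a simplified Python model. It is only a sketch of the documented behavior, not Docker's implementation: the function name health_statuses is made up for illustration, and the model ignores interval, timeout and start-period entirely.

```python
def health_statuses(results, retries=3):
    """Return (status, failing_streak) after each check result.

    `results` is a sequence of booleans: True for a passing check (exit 0),
    False for a failing one. Simplified model: timing options are ignored.
    """
    status, streak = "starting", 0
    history = []
    for passed in results:
        if passed:
            # One passing check is enough to mark the container healthy.
            status, streak = "healthy", 0
        else:
            streak += 1
            # Only `retries` consecutive failures flip the status.
            if streak >= retries:
                status = "unhealthy"
        history.append((status, streak))
    return history


print(health_statuses([False, False, False]))
# [('starting', 1), ('starting', 2), ('unhealthy', 3)]
```

With the default of 3 retries, one or two failed checks leave the status unchanged; only the third consecutive failure marks the container unhealthy.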

FROM python:3.6-alpine
COPY . /app
WORKDIR /app
RUN apk add --no-cache curl
RUN pip install -r requirements.txt
HEALTHCHECK CMD curl --fail http://localhost:5000/ || exit 1
CMD ["python", "app.py"]
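As an alternative to installing curl in the image, the same probe can be written in Python, which the image already ships. The sketch below is an assumption, not part of the original setup: the function name probe and the opener parameter (there only to make the function easy to test) are made up for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return 0 if `url` answers with a 2xx status, otherwise 1."""
    try:
        with opener(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP error statuses
        # (HTTPError is a subclass of URLError) all count as unhealthy.
        return 1
```

If this were saved as healthcheck.py (a hypothetical file name) somewhere importable inside the image, the instruction could become something like: HEALTHCHECK CMD python -c "import sys, healthcheck; sys.exit(healthcheck.probe('http://localhost:5000/'))". The exit code follows the same convention as the curl version: 0 for healthy, 1 for unhealthy.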

See the health status 

Let's rebuild and run our container:

docker build -t docker-flask .

docker run --rm --name docker-flask -p 5000:5000 docker-flask

Now let's take a look at the health status. Notice that we passed the --name option to the command above so we can easily inspect the container:

docker inspect --format='{{json .State.Health}}' docker-flask

If you run this immediately after the container starts, you will see that the status is starting:

{"Status":"starting","FailingStreak":0,"Log":[]}

And after the first health check has run (after the default interval of 30s), the status changes and the log gets an entry:

{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"...","End":"2017-07-29...", ...}]}
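When this status needs to be checked from a script rather than by eye, the JSON printed by docker inspect can be parsed directly. The following is a minimal sketch: summarize_health is a name made up for illustration, and the sample string is the "starting" output shown above rather than live output.

```python
import json


def summarize_health(raw):
    """Turn the JSON printed by `docker inspect` into a one-line summary."""
    health = json.loads(raw)
    log = health.get("Log") or []  # treat a missing or null log as empty
    return "{} (failing streak {}, {} check(s) logged)".format(
        health["Status"], health["FailingStreak"], len(log)
    )


sample = '{"Status":"starting","FailingStreak":0,"Log":[]}'
print(summarize_health(sample))
# starting (failing streak 0, 0 check(s) logged)
```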



Thursday, August 15, 2019

How to install Gradle?

Install on macOS

The current Gradle release is 5.6. You can download binaries and view docs for all Gradle versions from the

Prerequisites 

$ java -version
java version "1.8.0_121"

Homebrew 

$ brew install gradle

Step 3. Configure your system environment

$ export PATH=$PATH:/opt/gradle/gradle-5.6/bin

Step 4. Verify your installation

$ gradle -v

Generate the Gradle wrapper (gradlew)

$ gradle wrapper --gradle-version 2.13
Starting a Gradle Daemon (subsequent builds will be faster)

Upgrade with the Gradle Wrapper

$ ./gradlew wrapper --gradle-version=5.6 --distribution-type=bin



$ ./gradlew tasks
Downloading https://services.gradle.org/distributions/gradle-5.6-bin.zip
...

Monday, August 12, 2019

Chaos engineering and just-in-time access for the network, VMs and keys


Chaos Engineering 

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Introduction:

Traditional software testing approaches like unit, regression and integration testing validate known conditions and scenarios.

A distributed system has services whose interactions can cause unpredictable behavior in production.

Chaos engineering is a way to gauge the stability of more complex distributed systems involving multiple components and services, and the interactions between them.

Examples of interactions

When a service is unavailable, does the system fall back or fail gracefully without impacting the whole system? E.g. can the account summary still be accessed when the payment system is down?

What happens when things fail and retries put additional burden on the system?

When the site is slow, what if users keep clicking again and again? Are the transactions idempotent, or is the backend overloaded?
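The idempotency question can be made concrete with a small Python sketch. Everything in it is hypothetical (the PaymentService class, the pay method and the idempotency key format); it only illustrates the idea that a repeated request with the same key must not be applied twice.

```python
class PaymentService:
    """Toy backend that applies each request at most once."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.charges = 0    # how many times money actually moved

    def pay(self, idempotency_key, amount):
        # A repeated key means the user (or a retry) re-sent the same request:
        # return the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        result = {"charged": amount, "charge_number": self.charges}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.pay("order-42", 100)
again = service.pay("order-42", 100)  # impatient second click
print(first == again, service.charges)
# True 1
```

A chaos experiment that slows the site down would then check exactly this property: repeated clicks and retries add load, but do not add charges.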

Ramifications:

Outages can have negative impact on brand reputation 
Engineering costs to retroactively figure out the root cause of an issue 
SLA breach can lead to service providers having to compensate 

Need for chaos engineering 

To comprehend the systemic effect of changes in a distributed system
Understand vulnerable points of service 
Improve resiliency of the system 

Advantages 

Customer satisfaction

Increased availability  and durability means no business outages for 

More technical insights of the services:

Better understanding of system failure modes 
Improved mean time to detection for issues 
Reduction in repeated issues 


Chaos engineering Stages 

Stage 0

Preparing for disaster 

Establish observability 
Define the critical dependencies 
Define the non-critical dependencies 
Create a disaster recovery failover playbook 
Create a critical dependency failover playbook 
Create a non-critical dependency failover playbook 
Publish the above and get team-wide agreement 
Manually execute a failover exercise 

Stage 1

Injecting chaos internally 

Perform critical dependency failure tests in non-production 
Publish test results 

( like a vaccine, we inject harm to build immunity ).
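A critical dependency failure test can be sketched in Python by wrapping the dependency in a fault injector and checking that the caller degrades gracefully. All names here (with_faults, DependencyDown, fetch_balance, account_summary) are hypothetical; this is a toy stand-in for a real chaos tool, reusing the account summary example from earlier.

```python
import random


class DependencyDown(Exception):
    """Raised by the injector to imitate an unavailable dependency."""


def with_faults(func, failure_rate, rng=None):
    """Wrap `func` so that a fraction of calls fail before reaching it."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown(func.__name__)
        return func(*args, **kwargs)

    return wrapper


def fetch_balance(account):
    # Hypothetical critical dependency.
    return {"account": account, "balance": 120}


def account_summary(account, balance_source):
    """Degrade gracefully: show a partial summary when the dependency fails."""
    try:
        return balance_source(account)
    except DependencyDown:
        return {"account": account, "balance": None}


always_down = with_faults(fetch_balance, failure_rate=1.0)
print(account_summary("acc-1", always_down))
# {'account': 'acc-1', 'balance': None}
```

With failure_rate=1.0 every call fails, which makes the test deterministic; lower rates imitate a flaky dependency instead of a dead one.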

Stage 2

Pushing the envelope forward 

Perform frequent, semi-automated tests 
Execute a resiliency experiment in prod 
Publish test results 

Stage 3

Automating chaos internally 

Automate resiliency testing in non-production 
Semi-automate disaster recovery failover 

Stage 4

Injecting automated chaos everywhere

Integrate resiliency testing in CI/CD
Automate resiliency and disaster recovery failover testing in production 

Gremlin: Injecting chaos in example services

Types of attacks 

Shutdown 
Time travel 
CPU
Disk 
Black hole 
IO
DNS
Memory 
Latency 
Process killer 
Packet Loss 
Application-level 

GremlinD >>> registers with the control plane via secret-based authentication >>> control plane
Gremlin >>> attack orders from users are executed by the gremlin client on the host machine >>> control plane

How to implement chaos engineering?

1) In the example service

Identify all the candidate components for attacks 
Create CPU and shutdown attack scripts 
Attack generation on collector and logging tier 
Integration with slack for alerting the teams 
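To give a feel for what a CPU attack script does, here is a minimal Python stand-in. It is not Gremlin's client or API; burn_cpu is a hypothetical helper that keeps one core busy for a fixed time, so that monitoring and alerting can be observed under load in a safe environment.

```python
import time


def burn_cpu(seconds):
    """Busy-loop on one core for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


started = time.monotonic()
burn_cpu(0.2)
print("burned for about {:.1f}s".format(time.monotonic() - started))
```

A real attack would run one such loop per core and for much longer; the short duration here just keeps the example harmless.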

2)

Attacks to be created for other services 
Automated attacks 
Resiliency implementation 
Controlled automated attacks in all environments (sandbox and QA)


JTAP (just-in-time access to prod)

What's wrong with how things are now?

No AD integration 
Cross-cloud makes things tricky 
Revocation is messy 
Giving someone temporary access is not possible 

How does JTAP work?

A web app that users log in to and request access
User authentication via AD + MFA
User will be provided with unique credentials that have an expiry date

The issued SSH credentials are valid for four hours
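The expiry rule can be sketched in a few lines of Python. This is an illustration of the idea only, not JTAP's code: the Credential class and the issue and is_valid helpers are hypothetical, and the four-hour default mirrors the validity period mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Credential:
    user: str
    issued_at: datetime
    expires_at: datetime


def issue(user, now, ttl=timedelta(hours=4)):
    """Create a credential that stops working `ttl` after `now`."""
    return Credential(user, now, now + ttl)


def is_valid(cred, at, revoked_users=frozenset()):
    """Valid only inside the time window and while the user is not revoked."""
    return cred.user not in revoked_users and cred.issued_at <= at < cred.expires_at


now = datetime(2019, 8, 12, 9, 0, tzinfo=timezone.utc)
cred = issue("alice", now)
print(is_valid(cred, now + timedelta(hours=3)))  # True: inside the window
print(is_valid(cred, now + timedelta(hours=5)))  # False: after expiry
```

Because the expiry is part of the credential itself, nothing has to be cleaned up on the target machines when the window closes.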

SSH 

  1. ssh -i vis user@publicipaddress
  2. eval $(ssh-agent -s)
  3. ssh-add -t 3600 vis
  4. ssh-keygen -L -f vis-cert.pub (shows the certificate's four-hour validity window)
  5. cd /var/log (auth.log is in this directory)
  6. tail auth.log

Pros 

No third-party dependency 
No config change on target VMs every time there’s a new request 
Existing methods will work 
Auditing / traceability 
Can revoke access for individual users, even after cert is issued

Things to keep in mind 

Secrecy of the CA key is paramount
Regular key rotation is recommended