Kaushik Gattu: Chaos engineering and just-in-time access for the network, VMs and the keys

Chaos Engineering

The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Introduction:

Traditional software testing approaches like unit, regression and integration testing validate known conditions and scenarios

A distributed system has services whose interactions can cause unpredictable behavior in production

Chaos engineering is used to gauge the stability of more complex distributed systems, which involve multiple components and services and the interactions between them

Examples of interactions

When a service is unavailable, does it fall back or fail gracefully without impacting the whole system? E.g., can the account summary still be accessed when the payment system is down?

What happens when things fail and retries put additional load on the system?

When the site is slow, what if users keep clicking again and again? Are the transactions idempotent, or is the backend overloaded?
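
The retry question above is commonly handled by spacing retries out instead of repeating them immediately. A minimal sketch, assuming a hypothetical helper named retry_with_backoff (not part of the original notes) that doubles the wait after each failed attempt and gives up after a fixed number of tries:

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, doubling the delay between attempts so that a struggling
# dependency is not hit by an immediate burst of retries.
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, retry_with_backoff 5 1 some_command would wait 1, 2, 4 and 8 seconds between its five attempts. Backoff only limits the extra load; the retried operation still has to be idempotent for repeated attempts to be safe.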

Ramifications:

Outages can have a negative impact on brand reputation

Engineering costs of retroactively figuring out the root cause of an issue

An SLA breach can oblige service providers to pay compensation

Need for chaos engineering

To comprehend the systemic effect of changes in a distributed system

To understand the vulnerable points of a service

To improve the resiliency of the system

Advantages

Customer satisfaction

Increased availability and durability means no business outages for

More technical insight into the services:

Better understanding of system failure modes

Improved mean time to detection for issues

Reduction in repeated issues

Chaos engineering stages

Stage 0:

Preparing for disaster

Establish observability

Define the critical dependencies

Define the non-critical dependencies

Create a disaster recovery failover playbook

Create a critical dependency failover playbook

Create a non-critical dependency failover playbook

Publish the above and get team-wide agreement

Manually execute a failover exercise

Stage 1:

Injecting chaos internally

Perform critical dependency failure tests in non-production

Publish test results

(Like a vaccine, we inject harm to build immunity.)

Stage 2:

Pushing the envelope forward

Perform frequent, semi-automated tests

Execute a resiliency experiment in prod

Publish test results

Stage 3:

Automating chaos internally

Automate resiliency testing in non-production

Semi-automate disaster recovery failover

Stage 4:

Injecting automated chaos everywhere

Integrate resiliency testing in CI/CD

Automate resiliency and disaster recovery failover testing in production
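
The later stages amount to running the same experiment loop automatically: confirm the system is healthy, inject a failure, and check that it stays healthy. A minimal sketch of such a loop, assuming a hypothetical function run_experiment whose two arguments are the names of a health check and an attack command (both supplied by the caller); a non-zero return code could be used to fail a CI/CD job:

```shell
# run_experiment STEADY_STATE_CHECK ATTACK
# Both arguments are names of commands or shell functions (hypothetical).
# Returns 0 if the steady state holds before and after the attack,
#         2 if the system was already unhealthy (experiment skipped),
#         1 if the steady state was lost after the attack.
run_experiment() {
    steady_state_check=$1
    attack=$2
    if ! "$steady_state_check"; then
        echo "steady state not met; skipping experiment" >&2
        return 2
    fi
    "$attack" || echo "attack command exited non-zero" >&2
    if "$steady_state_check"; then
        echo "steady state held under attack"
        return 0
    fi
    echo "steady state lost under attack" >&2
    return 1
}
```

Skipping the experiment when the system is already unhealthy keeps the attack from making an existing incident worse, which matters most once experiments run in production.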

Gremlin: Injecting chaos in example services

Types of attacks

Shutdown

Time travel

CPU

Disk

Black hole

DNS

Memory

Latency

Process killer

Packet Loss

Application-level
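
To make one of these attack types concrete, the following is a rough stand-in for a CPU attack written in plain shell. It is not how Gremlin implements the attack and it does not use the Gremlin CLI; the function name cpu_attack is hypothetical. It keeps one core busy for a fixed number of seconds, which is enough to watch how dashboards and alerts react in a sandbox:

```shell
# cpu_attack SECONDS
# Hypothetical stand-in for a CPU attack: spins in a loop on one core
# until the requested number of seconds has passed, then stops on its own.
cpu_attack() {
    duration=$1
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        :   # no-op; the loop itself is the load
    done
}
```

The attack always has a bounded duration, so a forgotten experiment cannot keep a host saturated indefinitely.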

GremlinD -> control plane: registers via secret-based authentication

Gremlin client -> control plane: attack orders from users are executed by the Gremlin client on the host machine

How to implement chaos engineering?

In an example service

Identify all the candidate components for attacks

Create CPU and shutdown attack scripts

Attack generation on collector and logging tier

Integration with Slack for alerting the teams

Attacks to be created for other services

Automated attacks

Resiliency implementation

Controlled automated attacks in all environments: sandbox and QA
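
The Slack integration mentioned above can be as small as one HTTP POST to an incoming webhook. A minimal sketch, assuming the webhook URL is supplied in an environment variable named SLACK_WEBHOOK_URL and that the helper names slack_payload and notify_slack are free to choose (both are hypothetical, not from the original setup):

```shell
# slack_payload MESSAGE
# Builds the JSON body for a Slack incoming webhook. Backslashes and double
# quotes in MESSAGE are escaped; other control characters are not handled.
slack_payload() {
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"text": "%s"}' "$escaped"
}

# notify_slack MESSAGE
# Posts MESSAGE to the webhook in SLACK_WEBHOOK_URL (assumed variable name).
# If the variable is unset, the payload is only printed to stderr.
notify_slack() {
    if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
        echo "SLACK_WEBHOOK_URL not set; would send: $(slack_payload "$1")" >&2
        return 0
    fi
    curl -sS -X POST -H 'Content-Type: application/json' \
        --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

Calling notify_slack "CPU attack started on collector tier" before an attack and again after it ends gives the team a timeline to compare against their own alerts.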

JTAP (Just-in-time access to prod)

What’s wrong with how things are now?

No AD integration

Cross-cloud makes things tricky

Revocation is messy

Giving someone temporary access is not possible

How does JTAP work?

A web app where users log in and request access

User authentication via AD + MFA

User will be provided with unique credentials that have an expiry date

The issued SSH certificate is valid for four hours

SSH

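# Connect to the target VM with the private key "vis" (ssh also loads vis-cert.pub if it sits next to the key)
ssh -i vis user@publicipaddress
# Optionally, start an agent and load the key for 3600 seconds (one hour)
eval "$(ssh-agent -s)"
ssh-add -t 3600 vis
# Inspect the certificate; the "Valid:" line shows the four-hour window
ssh-keygen -Lf vis-cert.pub
# On the target VM, check the SSH login entries
cd /var/log
tail auth.log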
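
The commands above use a certificate named vis-cert.pub, but the notes do not show how JTAP issues it. A minimal sketch of the signing step with standard OpenSSH tooling, assuming an Ed25519 CA key and hypothetical names (issue_cert, the key ID, the principal); the real JTAP service may do this differently:

```shell
# issue_cert CA_KEY USER_PUBKEY KEY_ID PRINCIPAL
# Signs USER_PUBKEY with the CA private key and writes <name>-cert.pub next
# to it. -I sets the key ID that appears in sshd logs, -n the login name the
# certificate is valid for, and -V +4h makes it valid from now for four hours.
issue_cert() {
    ssh-keygen -q -s "$1" -I "$3" -n "$4" -V +4h "$2"
}

# Example (file names and IDs are illustrative):
#   ssh-keygen -q -t ed25519 -N '' -f ca_key   # once; keep ca_key secret
#   ssh-keygen -q -t ed25519 -N '' -f vis      # user key pair
#   issue_cert ca_key vis.pub vis-request-001 user
#   ssh-keygen -Lf vis-cert.pub                # shows key ID, principal, validity
```

With this approach, target VMs only need to trust the CA public key once (for example through the TrustedUserCAKeys option in sshd_config), which would match the point about not changing VM configuration for every new request.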

Pros

No third-party dependency

No config change on target VMs every time there’s a new request

Existing methods will work

Auditing / traceability

Can revoke access for individual users, even after cert is issued
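
The notes do not say how JTAP revokes an issued certificate. One standard OpenSSH mechanism is a key revocation list (KRL), which sshd consults through its RevokedKeys option. A minimal sketch under that assumption, with hypothetical helper names revoke_cert and is_revoked:

```shell
# revoke_cert KRL_FILE CA_PUBKEY CERT
# Adds CERT (issued by the CA whose public key is CA_PUBKEY) to KRL_FILE,
# creating the file on first use and updating it (-u) afterwards.
revoke_cert() {
    if [ -f "$1" ]; then
        ssh-keygen -q -k -u -f "$1" -s "$2" "$3"
    else
        ssh-keygen -q -k -f "$1" -s "$2" "$3"
    fi
}

# is_revoked KRL_FILE CERT
# Succeeds if ssh-keygen reports CERT as revoked in KRL_FILE.
is_revoked() {
    ssh-keygen -Q -f "$1" "$2" 2>/dev/null | grep -q 'REVOKED'
}
```

The updated KRL still has to reach the target VMs for the revocation to take effect. Certificates that are never revoked simply expire at the end of their validity window.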

Things to keep in mind

Secrecy of the CA key is paramount

Regular key rotation is recommended

Kaushik Gattu

Monday, August 12, 2019
