Chaos Engineering
Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Introduction:
Traditional software testing approaches like unit, regression and integration testing validate known conditions and scenarios
A distributed system has services whose interactions can cause unpredictable behavior in production
Chaos engineering gauges the stability of complex distributed systems that involve multiple components and services, and of the interactions among them
Examples of interactions
When a service is unavailable, does it fall back or fail gracefully without impacting the whole system? E.g. can the account summary still be accessed when the payment system is down?
What happens when things fail and retries put additional burden on the system?
When the site is slow, what if users keep clicking again and again? Are the transactions idempotent, or is the backend overloaded?
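The retry question above can be made concrete with a small sketch. It is an assumption for illustration, not something taken from the example services: the helper caps the number of attempts and doubles the wait between them, so that retries do not pile extra load onto a backend that is already struggling. The name retry_with_backoff and its parameters are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1 delay="$base_delay"
  while true; do
    # Success on any attempt ends the loop immediately.
    "$@" && return 0
    # Give up once the attempt budget is spent, instead of retrying forever.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    # Double the wait so a failing dependency sees less and less traffic.
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

A call such as retry_with_backoff 5 1 some_command would try some_command at most five times, waiting 1, 2, 4 and 8 seconds between attempts. Backoff only limits load; it does not make a request safe to repeat, so the idempotency question still has to be answered by the backend.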
Ramifications:
Outages can have a negative impact on brand reputation
Engineering cost of retroactively working out the root cause of an issue
An SLA breach can oblige the service provider to pay compensation
Need for chaos engineering
To comprehend the systemic effects of changes in a distributed system
To understand the vulnerable points of a service
To improve the resiliency of the system
Advantages
Customer satisfaction
Increased availability and durability mean no business outages
More technical insight into the services:
Better understanding of system failure modes
Improved mean time to detection for issues
Reduction in repeated issues
Chaos Engineering Stages
Stage 0
Preparing for disaster
Establish observability
Define the critical dependencies
Define the non-critical dependencies
Create a disaster recovery failover playbook
Create a critical dependency failover playbook
Create a non-critical dependency failover playbook
Publish the above and get team-wide agreement
Manually execute a failover exercise
Stage 1
Injecting chaos internally
Perform critical dependency failure tests in non-production
Publish test results
(Like a vaccine, we inject harm to build immunity.)
Stage 2
Pushing the envelope forward
Perform frequent, semi-automated tests
Execute a resiliency experiment in prod
Publish test results
Stage 3
Automating chaos internally
Automate resiliency testing in non-production
Semi-automate disaster recovery failover
Stage 4
Injecting automated chaos everywhere
Integrate resiliency testing in CI/CD
Automate resiliency and disaster recovery failover testing in production
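An automated resiliency test of the kind these stages build towards can be sketched as a small experiment runner. This is a minimal sketch under assumptions, not the tooling used for the example services: run_experiment is a hypothetical function, and its three arguments are the names of caller-supplied commands for checking steady state, injecting the fault and rolling it back.

```shell
#!/usr/bin/env bash
# Hypothetical experiment skeleton: verify steady state, inject a fault,
# verify again, then always roll back.
# Usage: run_experiment STEADY_CHECK_CMD INJECT_CMD ROLLBACK_CMD
run_experiment() {
  local steady_check="$1" inject="$2" rollback="$3"
  local result=0

  # Never inject a fault into a system that is already unhealthy.
  if ! "$steady_check"; then
    echo "ABORT: system not in steady state before injection"
    return 2
  fi

  if "$inject"; then
    if "$steady_check"; then
      echo "PASS: steady state held during fault"
    else
      echo "FAIL: steady state lost during fault"
      result=1
    fi
  else
    echo "ABORT: fault injection itself failed"
    result=2
  fi

  # Roll back regardless of the outcome so the fault does not linger.
  "$rollback" || echo "WARN: rollback command failed"
  return "$result"
}
```

Because the function returns a non-zero status on FAIL or ABORT, a CI/CD job could call it as a pipeline step and stop the release when the steady state does not hold.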
Gremlin: Injecting chaos in example services
Types of attacks
Shutdown
Time travel
CPU
Disk
Black hole
IO
DNS
Memory
Latency
Process killer
Packet Loss
Application-level
GremlinD (daemon) -> plane: registers with the plane via secret-based authentication
Gremlin (client) -> plane: attack orders from users are executed by the Gremlin client on the host machine
How to implement chaos engineering?
1) In the example service
Identify all the candidate components for attacks
Create CPU and shutdown attack scripts
Attack generation on collector and logging tier
Integration with Slack for alerting the teams
2) Attacks to be created for other services
Automated attacks
Resiliency implementation
Controlled automated attacks in all environments (sandbox and QA)
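The CPU attack script mentioned in the first step is not shown in these notes. As a rough stand-in that does not use Gremlin at all, the sketch below burns a chosen number of CPU workers for a fixed time and then always stops them, which keeps the blast radius bounded. The function name cpu_attack and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a CPU attack: keep WORKERS busy loops running
# for DURATION seconds, then stop them.
# Usage: cpu_attack WORKERS DURATION_SECONDS
cpu_attack() {
  local workers="$1" duration="$2"
  local pids="" i=0
  while [ "$i" -lt "$workers" ]; do
    # Each background subshell spins in a tight loop and pins roughly one core.
    ( while :; do :; done ) &
    pids="$pids $!"
    i=$((i + 1))
  done
  sleep "$duration"
  # Word splitting of $pids is intentional: it holds a list of process IDs.
  kill $pids 2>/dev/null || true
  wait $pids 2>/dev/null || true
  echo "cpu attack finished: workers=$workers duration=${duration}s"
}
```

Starting with one worker for a few seconds in sandbox, and only then widening the attack, matches the controlled approach described above.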
JTAP (Just-in-time access to prod)
What’s wrong with how things are now?
No AD integration
Cross-cloud makes things tricky
Revocation is messy
Giving someone temporary access is not possible
How does JTAP work?
A web app that users log in to in order to request access
User authentication via AD + MFA
User will be provided with unique credentials that have an expiry date
The SSH key (certificate) is valid for four hours
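SSH
- ssh -i vis user@publicipaddress ( connect to the target VM with the issued key )
- eval $(ssh-agent -s) ( start an SSH agent for the current shell )
- ssh-add -t 3600 vis ( load the key into the agent; -t 3600 keeps it there for one hour )
- ssh-keygen -Lf vis-cert.pub ( shows the four-hour validity window of the certificate )
- cd /var/log
- tail auth.log ( shows the most recent authentication log entries )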
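The commands above cover the user side only. How JTAP issues the certificate is not described here, so the following is a sketch that assumes an OpenSSH user CA signs the public key of the user with a four-hour validity. The CA key generated in the script, the key ID vis@jtap-demo and the temporary directory are all hypothetical; only the key name vis comes from the commands above.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory so the sketch never touches real keys.
workdir=$(mktemp -d)

# Hypothetical CA key pair; in practice this key would be kept secret.
ssh-keygen -q -t ed25519 -N '' -C 'jtap-demo-ca' -f "$workdir/ca"

# Key pair of the requesting user.
ssh-keygen -q -t ed25519 -N '' -C 'vis' -f "$workdir/vis"

# Sign the user public key: -I sets the key ID that appears in logs,
# -n lists the login names the certificate is valid for,
# -V +4h makes it expire four hours from now.
ssh-keygen -q -s "$workdir/ca" -I 'vis@jtap-demo' -n vis -V +4h "$workdir/vis.pub"

# Signing writes vis-cert.pub next to the key; print its details.
ssh-keygen -L -f "$workdir/vis-cert.pub"
```

For such certificates to be accepted, sshd on the target VMs would need to trust the CA public key, for example through the TrustedUserCAKeys directive in sshd_config. That is a one-time setting, which fits the point that no per-request change is needed on the VMs.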
Pros
No third-party dependency
No config change on target VMs every time there’s a new request
Existing methods will work
Auditing / traceability
Can revoke access for individual users, even after cert is issued
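The notes do not say how JTAP revokes an issued certificate. One possible mechanism, assumed here purely for illustration, is an OpenSSH key revocation list (KRL). The sketch issues two certificates from a hypothetical demo CA, revokes only one of them, and then queries the KRL for both. All file names, key IDs and serial numbers are made up.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical CA and two user keys, each signed with its own serial (-z).
ssh-keygen -q -t ed25519 -N '' -C 'jtap-demo-ca' -f "$workdir/ca"
ssh-keygen -q -t ed25519 -N '' -C 'vis' -f "$workdir/vis"
ssh-keygen -q -t ed25519 -N '' -C 'other' -f "$workdir/other"
ssh-keygen -q -s "$workdir/ca" -I 'vis@jtap-demo' -n vis -z 1 -V +4h "$workdir/vis.pub"
ssh-keygen -q -s "$workdir/ca" -I 'other@jtap-demo' -n other -z 2 -V +4h "$workdir/other.pub"

# Build a KRL that revokes only the certificate issued to vis.
ssh-keygen -q -k -f "$workdir/revoked.krl" "$workdir/vis-cert.pub"

# ssh-keygen -Q exits non-zero when the queried key or certificate is revoked.
if ssh-keygen -Q -f "$workdir/revoked.krl" "$workdir/vis-cert.pub" > /dev/null 2>&1; then
  vis_status=active
else
  vis_status=revoked
fi
if ssh-keygen -Q -f "$workdir/revoked.krl" "$workdir/other-cert.pub" > /dev/null 2>&1; then
  other_status=active
else
  other_status=revoked
fi
echo "vis: $vis_status, other: $other_status"
```

On the target VMs, sshd would consult such a list through the RevokedKeys directive in sshd_config, so distributing an updated KRL is what makes the revocation effective before the certificate expires.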
Things to keep in mind
Secrecy of the CA key is paramount
Regular key rotation is recommended