Monday, August 12, 2019

Chaos engineering and just-in-time access for the network, VMs and the keys


Chaos Engineering 

The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Introduction:

Traditional software testing approaches like unit, regression and integration testing validate known conditions and scenarios.

A distributed system has services whose interactions can cause unpredictable behavior in production 

Chaos engineering is used to gauge the stability of more complex distributed systems involving multiple components and services, and the interactions between them.

Examples of interactions 

When a service is unavailable, does it fall back or fail gracefully without impacting the whole system? E.g. can the account summary still be accessed when the payment system is down?

What happens when things fail and retries put additional burden on the system?

When the site is slow, what if the users keep clicking again and again? Are the transactions idempotent, or is the backend overloaded?
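
The last two questions can be made concrete with a small sketch. It assumes a hypothetical PaymentService that deduplicates requests by an idempotency key, and a client that spaces out its retries with capped exponential backoff and jitter. The class, function and key names are illustrative and not taken from any real system.

```python
import random


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=None):
    """Return the wait in seconds before each retry.

    Uses capped exponential backoff with full jitter so that many
    clients retrying at once do not hit the backend in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts - 1):  # no wait after the final attempt
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


class PaymentService:
    """Hypothetical backend that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored result
        self.charges_made = 0  # number of real side effects

    def charge(self, idempotency_key, amount):
        # A repeated click or a client retry reuses the same key, so the
        # stored result is returned and the charge is not applied twice.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42", 100)
second = service.charge("order-42", 100)  # impatient user clicks again
delays = backoff_delays(4, rng=random.Random(0))
```

With this shape, repeated clicks cost the backend a dictionary lookup instead of a second transaction, and the jittered delays keep retries from arriving in synchronized bursts.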

Ramifications:

Outages can have a negative impact on brand reputation 
Engineering costs to retroactively figure out the root cause of an issue 
An SLA breach can lead to service providers having to compensate customers 

Need for chaos engineering 

To comprehend the systemic effect of changes in a distributed system 
Understand the vulnerable points of a service 
Improve resiliency of the system 

Advantages 

Customer satisfaction 

Increased availability and durability mean no business outages for customers 

More technical insight into the services:

Better understanding of system failure modes 
Improved mean time to detection for issues 
Reduction in repeated issues 


Chaos engineering stages 

Stage 0

Preparing for disaster 

Establish observability 
Define the critical dependencies 
Define the non-critical dependencies 
Create a disaster recovery failover playbook 
Create a critical dependency failover playbook 
Create a non-critical dependency failover playbook 
Publish the above and get team-wide agreement 
Manually execute a failover exercise 

Stage 1

Injecting chaos internally 

Perform critical dependency failure tests in non-production 
Publish test results 

(Like a vaccine, we inject harm to build immunity.)
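
A critical dependency failure test in non-production could be sketched as below. It reuses the account summary example from the introduction. FakePaymentClient and account_summary are hypothetical stand-ins, and the failure is injected by flipping a flag rather than by a real chaos tool.

```python
class PaymentSystemDown(Exception):
    """Raised by the fake payment client while a failure is injected."""


class FakePaymentClient:
    """Stand-in for the payment dependency with a switchable failure."""

    def __init__(self):
        self.fail = False

    def pending_payments(self, account_id):
        if self.fail:
            raise PaymentSystemDown("injected failure")
        return [{"account": account_id, "amount": 25}]


def account_summary(account_id, payments):
    """Build the summary and degrade gracefully if payments are unavailable."""
    summary = {"account": account_id, "balance": 1000, "degraded": False}
    try:
        summary["pending"] = payments.pending_payments(account_id)
    except PaymentSystemDown:
        # Fallback: still serve the core summary, flag the missing part.
        summary["pending"] = None
        summary["degraded"] = True
    return summary


payments = FakePaymentClient()
healthy = account_summary("acct-1", payments)

payments.fail = True  # inject the failure
during_outage = account_summary("acct-1", payments)
```

The test passes if the summary is still served during the injected outage and is marked as degraded, which is the kind of result worth publishing to the team.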

Stage 2

Pushing the envelope forward 

Perform frequent, semi-automated tests 
Execute a resiliency experiment in prod 
Publish test results 

Stage 3

Automating chaos internally 

Automate resiliency testing in non-production 
Semi-automate disaster recovery failover 

Stage 4

Injecting automated chaos everywhere 

Integrate resiliency testing in CI/CD
Automate resiliency and disaster recovery failover testing in production 
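
One way an automated resiliency test could be shaped for a pipeline is sketched below. The run_experiment function and the toy two-replica system are assumptions made for illustration. The idea is to check the steady state, inject the fault, check again, and always roll back.

```python
def run_experiment(steady_state, inject, rollback):
    """Run one resiliency experiment and report whether the hypothesis held.

    steady_state: callable returning True while the system looks healthy
    inject:       callable that starts the fault
    rollback:     callable that removes the fault (always called once injected)
    """
    if not steady_state():
        return "skipped"  # never inject into an already unhealthy system
    inject()
    try:
        return "passed" if steady_state() else "failed"
    finally:
        rollback()


# Toy system: two replicas, healthy while at least one is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
exit_code = 0 if result == "passed" else 1  # a CI job could fail on non-zero
```

Returning a non-zero exit code for anything other than "passed" lets the CI/CD stage block a release when the system does not tolerate the fault.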

Gremlin: injecting chaos in example services 

Types of attacks 

Shutdown 
Time travel 
CPU
Disk 
Black hole 
IO
DNS
Memory 
Latency 
Process killer 
Packet Loss 
Application-level 

GremlinD >> registers with the control plane via secret-based authentication >> control plane 
Gremlin >> attack orders from users are executed by the Gremlin client on the host machine >> control plane 

How to implement chaos engineering?

In the example service 

1)

Identify all the candidate components for attacks 
Create CPU and shutdown attack scripts 
Attack generation on collector and logging tier 
Integration with slack for alerting the teams 
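
The Slack alerting step could look roughly like this, assuming a Slack incoming webhook is configured for the team channel. The message wording and the function names are illustrative, and the webhook URL is a placeholder that is not called here.

```python
import json
import urllib.request


def build_attack_alert(attack_type, target, environment):
    """Return the JSON body for a Slack incoming-webhook message."""
    text = f"Chaos attack started: {attack_type} on {target} ({environment})"
    return json.dumps({"text": text}).encode("utf-8")


def send_alert(webhook_url, body):
    """POST the message to the webhook URL Slack issued for the channel."""
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


body = build_attack_alert("cpu", "collector", "sandbox")
# send_alert("https://hooks.slack.com/services/...", body)  # placeholder URL
```

Sending the alert just before the attack script starts gives the owning team a chance to correlate any alarms with the experiment.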

2)

Attacks to be created for other services 
Automated attacks 
Resiliency implementation 
Controlled automated attacks in all environments (sandbox and QA) 


JTAP (Just-in-time access to prod) 

What’s wrong with how things are now?

No AD integration 
Cross-cloud makes things tricky 
Revocation is messy 
Giving someone temporary access is not possible 

How does JTAP work?

A web app that users log in to and request access 
User authentication via AD + MFA 
User will be provided with unique credentials that have an expiry date

The SSH key is allowed for four hours 
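
As a rough sketch of how credentials with an expiry could be issued, assuming JTAP signs the user's public key with OpenSSH's ssh-keygen and a CA key: the paths, the username and the serial number below are placeholders, and only the command line is built, not executed.

```python
def signing_command(ca_key, user_pubkey, username, serial, validity="+4h"):
    """Build the ssh-keygen argv that signs user_pubkey with the CA key."""
    return [
        "ssh-keygen",
        "-s", ca_key,       # CA private key used for signing
        "-I", username,     # key identity, recorded in sshd logs
        "-n", username,     # principal (login name) the cert is valid for
        "-V", validity,     # validity interval, here four hours from now
        "-z", str(serial),  # serial number, useful for later revocation
        user_pubkey,
    ]


cmd = signing_command("/path/to/ca_key", "vis.pub", "user", serial=1)
# subprocess.run(cmd, check=True) would write vis-cert.pub next to vis.pub
```

Because the expiry is part of the signed certificate, the target VM needs no per-request change as long as it already trusts the CA.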

SSH 

  1. ssh -i vis user@publicipaddress
  2. eval $(ssh-agent -s)
  3. ssh-add -t 3600 vis ( the agent holds the key for 3600 seconds, i.e. one hour )
  4. ssh-keygen -Lf vis-cert.pub ( it will show the validity for the 4 hours )
  5. cd /var/log ( the auth.log file is here )
  6. tail auth.log
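
The validity shown in step 4 can also be checked programmatically. The sketch below assumes the output contains a line shaped like the sample. The timestamps are made up for illustration, and a certificate without a bounded validity window is rejected.

```python
from datetime import datetime, timedelta
import re

# Illustrative line in the shape printed by `ssh-keygen -L`; timestamps are made up.
SAMPLE = "        Valid: from 2019-08-12T10:00:00 to 2019-08-12T14:00:00"


def validity_window(ssh_keygen_output):
    """Return (start, end) datetimes parsed from the 'Valid:' line."""
    match = re.search(r"Valid: from (\S+) to (\S+)", ssh_keygen_output)
    if match is None:
        raise ValueError("no bounded 'Valid:' line found")
    fmt = "%Y-%m-%dT%H:%M:%S"
    return tuple(datetime.strptime(part, fmt) for part in match.groups())


start, end = validity_window(SAMPLE)
lifetime = end - start
```

Comparing lifetime against timedelta(hours=4) is a quick way to confirm that an issued certificate matches the four-hour policy.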

Pros 

No third-party dependency 
No config change on target VMs every time there’s a new request 
Existing methods will work 
Auditing / traceability 
Can revoke access for individual users, even after cert is issued

Things to keep in mind 

Secrecy of the CA key is paramount 
Regular key rotation is recommended 



