Sunday, May 31, 2020

SRE

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that needs to ship continuously and the operations team that is responsible for the reliability of the production environment. Site reliability engineering shifts the responsibility for production reliability to the SRE on the development team.
Site reliability engineers typically spend up to 50% of their time on the daily tasks that keep the application reliable and the rest of their time developing software.
A key skill of a site reliability engineer is a deep understanding of the application. This includes knowledge of the code, how the application runs, how it is configured, and how it scales.
Some of the typical responsibilities of a site reliability engineer are to:
  • Proactively monitor and review application performance.
  • Handle on-call and emergency support.
  • Ensure that the software has good logging and diagnostics.
  • Create and maintain operational runbooks.
  • Help triage escalated support tickets.
  • Work on feature requests, defects, and other development tasks.
  • Contribute to the overall product roadmap.
  • Perform live site reviews and capture feedback for system outages.

Site reliability engineering versus DevOps


DevOps builds a healthy working relationship between the operations staff and the development team. By breaking down the silos between the two, DevOps produces a more robust, reliable product.
Both SRE and DevOps are methodologies that address an organization's need for a way to manage the production environment. As you've seen in the previous modules, DevOps feedback systems can identify problems and alert the developers, who then solve the issue. With SRE, a person on the development team looks for issues with site reliability on a daily basis and is probably the person who solves those problems, as well. While DevOps teams would usually choose to leave the production environment untouched unless absolutely necessary, SREs will likely make changes.

Site reliability engineering skills

The type of skills that are needed vary depending on the application, how and where it is deployed, and how it is monitored. For example, organizations that use serverless technologies won't need someone with in-depth knowledge of Windows or Linux systems management. However, these skills are critical to teams that use servers for deployments.
Other key skills for a good SRE focus on application monitoring and diagnostics. An SRE should have experience with application performance management tools like Application Insights. They should also understand application logging best practices and exception handling.

Continuously monitor applications and services

What is continuous monitoring?

Continuous monitoring refers to the processes and technologies that you can use to monitor each phase of your application's lifecycle. Continuous monitoring helps you validate the health, performance, and reliability of your application and infrastructure as changes move from development to production.
Continuous monitoring builds on CI/CD concepts, which help you develop and deliver software faster and more reliably to provide continuous value to your users.

What is observability?

Observability refers to making data available from within the system that you wish to monitor. Monitoring is the actual task of collecting and displaying this data.
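For example, one simple way a system can make data available for monitoring is to emit structured log events that a collector can parse later. Here's a generic shell sketch; the field names are illustrative, not any particular product's schema:

```shell
#!/bin/sh
# Emit a structured (JSON) log line that a log collector can parse.
# Field names here are illustrative, not a specific product's schema.
log_event() {
  level="$1"
  message="$2"
  printf '{"timestamp":"%s","level":"%s","message":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$message"
}

log_event INFO "request handled in 42 ms"
```

Because the event is machine-readable, a monitoring pipeline can index and query it by level or message instead of scraping free-form text.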

What is Azure Monitor?

Azure Monitor is a service in Azure that provides full-stack observability across applications and infrastructure, both in the cloud and on-premises.

Azure Monitor works with development tools such as Visual Studio and Visual Studio Code so you can use it during your development and test phases. It integrates with Azure DevOps to provide release management and work item management during your deployment phases.

Azure Monitor also integrates with IT service management (ITSM) and Security information and event management (SIEM) tools to help you track issues and incidents within your existing IT processes.

Enable monitoring on your applications

Applications are complex and have many interconnected components. To visualize the end-to-end transactions and connections across all systems, you need to enable monitoring on all of your web applications and services.
If you don't have an existing project in Azure DevOps, you might start with Azure DevOps Projects. Azure DevOps Projects makes it easy to create a starter CI/CD pipeline, using your existing code and Git repository.
Then, you can add continuous monitoring to your release pipeline by combining Azure Pipelines with Azure Application Insights. Application Insights is a feature of Azure Monitor that you can use to monitor your live applications. We'll take a closer look at Application Insights shortly.

Enable monitoring on your infrastructure

Applications are only as reliable as their underlying infrastructure. Having monitoring enabled across your entire infrastructure helps you achieve full observability and makes it easier to discover root causes when something fails.
Azure Monitor helps you track the health and performance of your entire hybrid infrastructure, including virtual machines, containers, storage, and networks.
With Azure Monitor, you can collect metrics and logs from your applications, the operating systems and services they rely on, and the Azure resources they run on.

The inner loop

The inner loop is the iterative process that developers perform when they write, build, and debug code. There are, of course, other things that a developer does, but the inner loop is the core set of steps that developers perform over and over before they share their work.


Exactly what goes into each developer's inner loop depends on the technologies they work with, the tools they use, and their own preferences.
For example, if you're writing a C# library, your inner loop might include coding, building, and testing. If you're doing web development, your inner loop might include coding, bundling, and refreshing your web browser to see your progress.
The following illustrates these two types of inner loops:

Each of these loops might include a fourth step, where you commit and integrate your changes with the team's central repository, for example, on GitHub.
In reality, most codebases consist of multiple moving parts. The definition of an inner loop on any single codebase can vary, depending on the project.

Which steps in the inner loop add value?


In Assess your existing software development process, you and the A team learned how a value stream map (VSM) can help you analyze your current release cycle process. Like the team's VSM, you can measure your inner loop to see which parts have value to the customer and which parts are eating up time without producing any value.
You can group the steps within the inner loop into three broad categories: experimentation, feedback collection, and tax.
In the example of building a C# library, here's how you might categorize each step:

  • Coding: experimentation
  • Building: feedback collection
  • Testing: feedback collection
  • Committing and integrating changes: tax

Of all the steps in the inner loop, coding is the only one that adds customer value. Building and testing code are important, but ultimately you use these as tools to gain feedback about whether the changes provide sufficient value. For example, does the code compile? Does the feature satisfy the requirements? Does the feature work correctly with other features?
Tax describes work that neither adds value nor provides feedback, but is still necessary. In contrast, you can categorize unnecessary work as waste and then eliminate that work.

How can I optimize the inner loop?


Now that we've categorized the steps within the inner loop, here are some general statements that we can make:

  • The activities within the inner loop should happen as quickly as possible.
  • The total loop execution time should be proportional to the changes that you're making.
  • You should minimize the time it takes to collect feedback, but maximize the quality of the feedback that you get.
  • You should minimize the tax you pay by eliminating it where it isn't necessary. For example, commit your changes only after all tests pass.
  • Remember that, as the size of your codebase grows, so does the size of the inner loop. For example, having more code means you need to run more tests, which in turn slows down the inner loop.
If you've ever worked on a large, monolithic codebase, you may have experienced a situation where even small changes require a disproportionate amount of time to execute the feedback-collection steps of the inner loop.
There's no single solution that ensures that your inner loop is optimal. But it's important to notice a slowdown when it happens and to address what's causing it.
Here are a few things you and your team can do to optimize the inner loop:
  • Only build and test what changed.
  • Cache intermediate build results to speed up full builds.
  • Break up the codebase into smaller units.
You can gain immediate benefits by implementing the first two recommendations. However, use caution when breaking your codebase into smaller units. When done incorrectly, breaking your codebase into too many small units can have the opposite effect: a tangled loop.
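The first recommendation, building only what changed, is the core idea behind incremental build tools such as make. Here's a minimal shell sketch of that check, skipping the build step when the output is already newer than the source (the file names are hypothetical, and `cp` stands in for a real compile step):

```shell
#!/bin/sh
# Minimal sketch of "only build what changed": skip the build when the
# output is already newer than the source. This is the core check behind
# incremental build tools such as make. File names are hypothetical, and
# cp stands in for the real compile step.
build_if_changed() {
  src="$1"
  out="$2"
  if [ ! -e "$out" ] || [ "$src" -nt "$out" ]; then
    echo "rebuilding $out from $src"
    cp "$src" "$out"
  else
    echo "up to date: $out"
  fi
}
```

Running the function twice in a row without touching the source does the work only once; the second call detects that nothing changed and skips the build, keeping the inner loop's execution time proportional to the change.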

What are tangled loops?


A tangled loop happens when multiple processes, each with its own inner loop, become dependent on one another.
Say that your monolithic codebase has some set of core functionality that does much of the difficult work your application needs to perform. You might package that code into a helper library.
To do this, you would typically move your library code to a separate repository and then set up a CI/CD pipeline that builds, tests, and packages the library. The pipeline might then publish the result to a package server. You would then configure your application to pull that library from the package server.
Development of the library code forms its own inner loop. When you make changes to the library and, for example, submit a pull request to merge your changes, you transition the workflow from the inner loop to the outer loop. The outer loop includes anything that depends on your library, for example, your monolithic application.

Initially, you might see some benefits. For example, you might see decreased build times in your application because the library code is already built for you. It's likely, though, that you'll need to develop an application feature that requires new capabilities in the library. This is where teams who have incorrectly separated their codebases start to feel pain.
When you evolve code in two separate repositories where a dependency is present, you'll probably experience some friction. In terms of the inner and outer loops, the inner loop of the original codebase now includes the outer loop of the library code that was previously separated out.
Outer loops can include taxes such as code reviews, security scanning, package signing, and release approvals. You don't want to pay that tax every time you add a function to the library that your application needs.
In practice, this situation can force developers to work around processes or code in order to move forward. Such workarounds can build up taxes that you'll have to pay at some point.
This doesn't mean that breaking code up into separate packages is a bad thing. You just need to carefully consider the impact your decisions have on the outer loop.


Configuration management tools

Popular configuration management tools:

1. Ansible
2. Azure Automation
3. Azure Custom Script Extension
4. Chef
5. Cloud-init
6. PowerShell DSC
7. Puppet

For each tool, you'll get a general sense for how it works, which programming languages are involved, and how it integrates with Azure.


Ansible


Ansible is an open-source product, sponsored by Red Hat, that automates cloud provisioning, configuration management, and application deployments. You can use Ansible to provision Azure resources such as virtual machines, containers, networks, and even complete cloud infrastructures. You can also use Ansible to configure your Azure resources after they're provisioned, which is the focus in this module.

In addition to Azure, Ansible supports other public clouds as well as private cloud frameworks.

With Ansible, you write playbooks that express your desired configuration. A playbook is a YAML file, making it a form of declarative automation.

Here's a basic example that you'll work with later. It defines service accounts for users named testuser1 and testuser2.


---
- hosts: all
  become: yes
  tasks:
  - name: Add service accounts
    user:
      name: "{{ item }}"
      comment: service account
      create_home: no
      shell: /usr/sbin/nologin
      state: present
    loop:
    - testuser1
    - testuser2

You use the ansible-playbook command to apply this configuration, like this:


ansible-playbook \
  --inventory ./azure_rm.yml \
  --user azureuser \
  --private-key ~/.ssh/ansible_rsa \
  --limit=tag_Ansible_test1 \
  ./users.yml

You'll see this process in greater detail later in this module.

If you run this command multiple times, Ansible configures the user accounts only if they don't exist or have changed. This command is therefore an idempotent operation.
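Idempotency isn't specific to Ansible. Here's a plain-shell sketch of an "ensure present" operation, mirroring the playbook's state: present semantics: it changes the system only when the desired state isn't already in place. (The accounts file is a stand-in; a real module would inspect /etc/passwd and call useradd.)

```shell
#!/bin/sh
# Idempotent "ensure present": add the name only if it's missing.
# The accounts file is a stand-in for real system state; a production
# tool would inspect /etc/passwd and call useradd instead.
ensure_user() {
  name="$1"
  file="$2"
  if grep -qx "$name" "$file" 2>/dev/null; then
    echo "ok: $name already present"
  else
    echo "$name" >> "$file"
    echo "changed: added $name"
  fi
}
```

Run it twice with the same arguments and the second run reports "ok" without modifying anything, which is exactly the behavior Ansible's "changed"/"ok" task results describe.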

Ansible is also agentless, so you don't have to install Ansible software on the managed machines. However, you do need to install Python on your managed machines. By default, Ansible connects to Linux machines over the SSH protocol, and Windows machines over WinRM.

You typically use a control machine to manage your systems. A control machine includes the Ansible software and the playbooks you need to run. The control machine pushes configuration changes to your nodes. Later in this module, you'll set up a control machine and run Ansible playbooks from that control machine in Azure Pipelines.

Although Ansible is agentless, both the control machine and managed nodes require Python to enable Ansible to connect to remote systems and issue commands on those systems.

Ansible inventories


In Ansible, the inventory is a file that defines the hosts upon which the tasks in a playbook operate. Ansible represents what systems it manages by using an .ini or YAML file that puts all of your managed machines in groups of your own choosing.

For a VM deployment on Azure, you could define each VM and its IP address or hostname, similar to this:
hosts:
  vm1:
    ansible_host: 13.79.22.89
  vm2:
    ansible_host: 40.87.135.194
This is an example of a static inventory. If these IP addresses change, or if you add or remove systems, you'd need to update this inventory file over time.

A more flexible approach is to use a dynamic inventory. A dynamic inventory enables Ansible to discover which systems to configure at runtime.

Here's the dynamic inventory file you're going to use in this module:

plugin: azure_rm
include_vm_resource_groups:
- learn-ansible-rg
auth_source: auto
keyed_groups:
- prefix: tag
  key: tags


This inventory specifies that each VM in the learn-ansible-rg resource group belongs to the inventory. The keyed_groups part groups VMs by their tag names. You'll work with a complete example later in this module.

Ansible on Azure


There are a number of ways you can use Ansible on Azure.

On Azure Marketplace, you'll find a number of images that you can use. They include:

  • Red Hat Ansible instance on Linux, published by Microsoft.
    You can use this image to bring up a control machine, which includes Ansible, the Azure CLI, and other tools, to manage your fleet.
  • Ansible Tower, published by Red Hat.
    Ansible Tower helps organizations scale IT automation and manage complex deployments across physical, virtual, and cloud infrastructures. With Ansible Tower, you can:
      • Provision Azure environments using pre-built Ansible playbooks.
      • Use role-based access control (RBAC) to define who or what can see, change, or delete objects, or utilize specific capabilities.
      • Maintain centralized logging for complete auditability and compliance.
      • Use the many content resources available on Ansible Galaxy.

You can also set up Ansible on a Linux VM running on Azure, or in your datacenter, and use that as your control machine. Although Ansible doesn't support Windows as the control machine, you can run Ansible from Windows through Windows Subsystem for Linux, Azure Cloud Shell, or Visual Studio Code.

Azure Automation

Azure Automation is a service in Azure that helps you automate manual tasks. Automation has the concept of a runbook: a set of tasks that performs an automated procedure. Tasks in a runbook are written in PowerShell, PowerShell Workflow, or Python. You can run a runbook either manually or on a schedule.
Here's a basic example that uses PowerShell Workflow to stop a running service:
Workflow Stop-MyService
{
  $Output = InlineScript {
    $Service = Get-Service -Name MyService
    $Service.Stop()
    $Service
  }

  $Output.Name
}
Although the name implies that you can use Azure Automation only on Azure, it's more flexible than that. Automation has a feature called hybrid runbook worker. This feature gives Automation access to resources in other clouds or in your on-premises environment that would otherwise be blocked by a firewall.
Automation also provides a Desired State Configuration (DSC) pull server that enables you to create definitions for how a specified set of VMs should be configured. DSC then ensures that the required configuration is applied and that the VM stays consistent. Automation DSC runs on both Windows and Linux.

Azure Custom Script Extension


The Custom Script Extension is a way to download and run scripts on your Azure VMs. You can run the extension when you create a VM, or any time after the VM is in use.
You can store your scripts in Azure Storage or in a public location, such as GitHub. You can run scripts manually or as part of a more automated deployment.
You can use the Custom Script Extension with Windows or Linux VMs. Here's an example that uses the az vm extension set command to run a Bash script that installs Nginx web server on a Linux VM.

az vm extension set \
  --resource-group my-rg \
  --vm-name my-vm \
  --name customScript \
  --publisher Microsoft.Azure.Extensions \
  --version 2.0 \
  --settings '{"fileUris":["https://raw.githubusercontent.com/MicrosoftDocs/mslearn-welcome-to-azure/master/configure-nginx.sh"]}' \
  --protected-settings '{"commandToExecute": "./configure-nginx.sh"}'

Chef

Chef is an infrastructure automation tool that enables you to configure and manage your systems.
Chef helps you to manage your infrastructure in the cloud, on-premises, or in a hybrid environment. You express your configurations by writing recipes that describe everything your systems need to run your application. Chef recipes use a declarative syntax that's based on the Ruby programming language. A recipe uses the .rb file extension.
A Chef recipe is made up of resources. Chef provides built-in resource types that enable you to configure various parts of the system. For example, the package resource enables you to install or remove a package. The service resource enables you to manage a service.
Here's the Chef recipe that installs Internet Information Services (IIS) web server on Windows, which you saw earlier in this module:
powershell_script 'Install IIS' do
  action :run
  code 'Add-WindowsFeature Web-Server'
end

service 'w3svc' do
  action [ :enable, :start ]
end

template 'c:\inetpub\wwwroot\Default.htm' do
  source 'Default.htm.erb'
  rights :read, 'Everyone'
end
Most Chef resources are idempotent, meaning you can apply the same configuration repeatedly.
You can package multiple recipes into a cookbook. A cookbook might contain recipes that configure the various parts of MySQL, Nginx, OpenSSL, or any other kind of software.
Building on the previous code example, an IIS cookbook might contain recipes that configure application pools, virtual directories, and virtual sites. You can define roles to specify which recipes are applied to a system based on that system's function. For example, you might define the "webserver" role to run recipes that install and configure IIS, Apache, or Nginx web servers. The "database" role might run recipes that install and configure MySQL or Microsoft SQL Server.
Chef and the Chef community maintain cookbooks on Chef Supermarket.

Chef on Azure


There are a number of ways you can use Chef on Azure.
On Azure Marketplace, you'll find a number of images that you can use. They include:
  • Chef Extension for Windows and Linux, published by Chef Software.
    These images come with the Chef Client. Chef Client is an agent that runs on each node that's managed through Chef. Chef Client applies the cookbooks and recipes you specify. Chef Client can also send reporting data back to a Chef Server or a Chef Automate server, so that you can track and audit your configuration runs over time.
  • Chef Automate , published by Chef Software.
    Chef Automate enables you to package and test your applications, and provision and update your infrastructure. Using Chef, you can manage all of it with compliance and security checks, and dashboards that give you visibility into your entire stack.
You can also set up Chef on a Linux or Windows VM running on Azure, or in your datacenter.

Cloud-init

Cloud-init, by Canonical, is a way to customize a Linux VM as it boots for the first time. You can use cloud-init to install packages, write files, and configure users.
You write cloud-init files by using YAML. Consider this basic cloud-init configuration that installs PIP, the package manager for Python, and NumPy, a package for scientific computing with Python.
#cloud-config
packages:
  - python-pip
runcmd:
  - pip install numpy
In this example, packages specifies the list of packages to install; here, we install python-pip. runcmd specifies the list of commands to run on first boot; here, we use PIP to install the NumPy package.
This configuration is declarative, meaning you don't need to specify how to install python-pip. Cloud-init recognizes the Linux distribution that's running and uses the appropriate package manager to install the python-pip package. For example, it can use apt on Debian-based systems or yum on Red Hat Enterprise Linux.
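As a rough sketch of how that kind of distribution detection can work (cloud-init's real implementation is more involved), a script can check which package-manager binary is available and build the install command accordingly:

```shell
#!/bin/sh
# Pick a package-install command based on which package manager exists
# on this system. A rough sketch of distribution detection; cloud-init's
# real logic is more involved.
detect_install_cmd() {
  if command -v apt-get >/dev/null 2>&1; then
    echo "apt-get install -y"
  elif command -v dnf >/dev/null 2>&1; then
    echo "dnf install -y"
  elif command -v yum >/dev/null 2>&1; then
    echo "yum install -y"
  else
    echo "unknown"
  fi
}

detect_install_cmd
```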
Here's an example that uses the Azure CLI to bring up an Ubuntu VM on Azure and apply this configuration.
az vm create \
  --resource-group my-rg \
  --name my-vm \
  --admin-username azureuser \
  --image UbuntuLTS \
  --custom-data cloud-init.txt \
  --generate-ssh-keys
The --custom-data argument specifies the cloud-init configuration to run when the VM boots for the first time. The cloud-init configuration is located in cloud-init.txt.

PowerShell DSC


PowerShell Desired State Configuration (DSC) is a management platform that defines the configuration of target machines. You can use PowerShell DSC to manage Windows or Linux systems.
DSC configurations define what to install and configure on a machine. A local configuration manager (LCM) engine runs on each target node that processes requested actions based on pushed configurations. A pull server is a web service that runs on a central host to store the DSC configurations and associated resources. The pull server communicates with the LCM engine on each target node to provide the required configurations and report on compliance.
Here's a basic example that uses PowerShell DSC to configure IIS on Windows.
Configuration MyWebsite {

  Import-DscResource -ModuleName PsDesiredStateConfiguration

  Node "localhost" {
    WindowsFeature WebServer {
      Ensure = "Present"
      Name   = "Web-Server"
    }
  }
}
You would then compile this configuration into a Management Object Format (MOF) file, which is the format that DSC can consume. To do this, you run the configuration like a function. Here's an example:
. .\MyWebsite.ps1
MyWebsite
The first line makes the configuration function available in the console. The second line runs the configuration. The result is a folder, named MyWebsite, which contains a file named localhost.mof.
To apply the configuration, you run the Start-DscConfiguration cmdlet, like this:
Start-DscConfiguration .\MyWebsite

Puppet


Puppet is an automation platform that handles the application delivery and deployment process. Agents installed on target machines allow the Puppet master to apply manifests that define the desired configuration of your infrastructure.
You express your configurations by writing manifest files known as Puppet Program files. Manifests describe everything your systems need to run your application. Puppet manifests use a declarative syntax that's based on the Ruby programming language. A manifest uses the .pp file extension.
A Puppet manifest is made up of resources. Puppet provides built-in resource types that enable you to configure various parts of the system. For example, the file resource enables you to manage a file. The service resource enables you to manage a service.
Here's a basic Puppet manifest that installs IIS web server on Windows:
$iis_features = ['Web-WebServer']

iis_feature { $iis_features:
  ensure => 'present'
}
You can package multiple manifests into a module. Puppet and its partners maintain modules on Puppet Forge.
In this example, the iis_feature resource is provided by the puppetlabs-iis module, which helps you manage IIS sites and application pools.

Puppet on Azure

There are a number of ways you can use Puppet on Azure.
On Azure Marketplace, you'll find a number of images that you can use. They include:
  • Puppet Agent, published by Puppet, is a virtual machine extension that installs the Puppet agent on your Windows VM.
  • Puppet Enterprise, published by Puppet, enables you to automate the entire lifecycle of your infrastructure.
You can also set up Puppet on a Linux or Windows VM running on Azure, or in your datacenter.