Becoming a rockstar SRE: electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Excel in site reliability engineering by learning from field-driven lessons on observability and reliability in code, architecture, process, systems management, costs, and people to minimize downtime and enhance developers' output. Purchase of the print or Kindle book includes a free eBook in th...


Bibliographic Details
Other Authors:	Proffitt, Jeremy, author; Anami, Rod, author
Format:	eBook
Language:	English
Published:	Birmingham, England ; Mumbai : Packt [2023]
Edition:	1st ed.
Subjects:	Reliability (Engineering); Computer engineering.
View in Biblioteca Universitat Ramon Llull:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009742736806719

Table of Contents:

Cover
Title Page
Copyright and Credits
Dedication
Contributors
Table of Contents
Preface
Part 1 - Understanding the Basics of Who, What, and Why
Chapter 1: SRE Job Role - Activities and Responsibilities
Making this journey personal
SRE driving forces
SRE skills
SRE traits
Understanding the mindset and hobbies of an SRE
SRE affinity game
SRE guiding principles
SRE hobbies
DevOps engineers versus SRE versus others
DevOps and site reliability engineers
Software and site reliability engineers
Describing an SRE's main responsibilities
An overview of the daily activities of an SRE
People that inspire
Jeremy's recognition - Paul Tyma, former CTO, LendingTree
Rod's recognition - Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl
Summary
Further reading
Chapter 2: Fundamental Numbers - Reliability Statistics
SLA commitment - a conversation, not a number
Internal partner SLAs
External partner SLAs
The cost of more 9s in an SLA
A final word on SLAs
Defining and leveraging SLOs and SLIs
SLOs
SLOs and time
Tracking outage frequency with the MTBF
Measuring the downtime with the MTTR
Understanding the customer and revenue impact
Transparency in outages
The rockstar SRE's SLA
Summary
Chapter 3: Imperfect Habits - Duct Tape Architecture and Spaghetti Code
The business of software development - let's start with the dollars
Defining the "value" of software to a business
The value of protecting business
The value of growing a business
The value of saving labor costs
The A/B testing mindset - the art of change in customer interaction
A/B testing in customer flows
Analyzing the results of A/B testing
Leveraging A/B testing to satisfy quarterly numbers
Dedication to the craft of development - and why some are just here for a job
A quick guide to communicating with your colleagues
Reviewing the merge request - it's about training, oversight, and reliability
Avoiding the typical rubber stamp mentality
A word on production deployments
Why businesses want us to outright ignore best practices
The truth about the ownership of a developer's time
Understanding the flaws in how we estimate development cost
Fast, good, cheap - pick one
Why is observability the answer to reliability issues?
The cost of highly available architecture
Mixing good and bad - tricks to wrapping bad code and making it resilient
Alerting that fires actions
Adding additional logging to monitor potential issues
Using try catch to encapsulate exceptions
Retries to the rescue…or not
Summary
Part 2 - Implementing Observability for Site Reliability Engineering
Chapter 4: Essential Observability - Metrics, Events, Logs, and Traces (MELT)
Technical requirements
Accomplishing systems monitoring and telemetry
Monitoring targets for infrastructure
Monitoring types and tools
Monitoring golden signals
Monitoring data
Understanding APM
Getting to know topology self-discovery, the blast radius, predictability, and correlation
Alerting - the art of doing it quietly
The user perspective notification trigger principle
Event-to-incident mapping principle
Mixing everything into observability
Outages versus downtime
Observability architecture
Observability effectiveness
In practice - applying what you have learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Chapter 5: Resolution Path - Master Troubleshooting
Properly defining the problem - and what to ask and not ask
Source of information
The knowledge base of the reporter
Naming conventions
False urgency
Executive summary
Breaking down and testing systems
Breaking down hardware versus the operating system
Breaking down a web API
Understanding the steps
The problems with this method of troubleshooting
Previous and common events - checking for the simple problems
Prior Root Cause Analysis (RCA) documents
Timeline analysis
Comparison
The best approach
Effective research both online and among peers
The art of the Google search
Skimming the content quickly and refining it
Never forget your internal resources
Breaking down source code efficiently
Code you've never seen
When that fails
Logging plus code
In practice - applying what you've learned
Summary
Chapter 6: Operational Framework - Managing Infrastructure and Systems
Technical requirements
Approaching systems administration as a discipline
Design
Installation
Configuration
App deployment
Management
Upgrade
Uninstallation
Understanding IT service management
ITIL
DevOps
Seeing systems administration as multiple layers and multiple towers
Automating systems provisioning and management
Infrastructure as Code
Immutable infrastructure
In practice - applying what you've learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Chapter 7: Data Consumed - Observability Data Science
Technical requirements
Making data-driven decisions
Defining the question and options
Determining which data to use
Identifying which data is already available
Collecting the missing data
Analyzing all datasets together
Presenting the decision as a record
Documenting the lessons learned in the process
Solving problems through a scientific approach
Formulation
Hypothesis
Prediction
Experiment
Analysis
Understanding the most common statistical methods
Percentages
Mean, average, and standard deviation
Quantiles and percentiles
Histograms
Using other mathematical models in observability
Visualizing histograms with Grafana
In practice - applying what you've learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Part 3 - Applying Architecture for Reliability
Chapter 8: Reliable Architecture - Systems Strategy and Design
Technical requirements
Designing for reliability
Architectural aspects
Reliability equations
Design patterns
Modern applications
Splitting and balancing the workload
Splitting
Balancing
Failing over - almost as good
Scaling up and out - horizontal versus vertical
Horizontal
Vertical
Autoscaling
In practice - applying what you've learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Chapter 9: Valued Automation - Toil Discovery and Elimination
Technical requirements
Eliminating toil
Toil redefined
Why toil is bad
Handling toil the right way
Treating automation as a software problem
Document
Algorithm
Code
Automating the (in)famous CI/CD pipeline
Continuous integration
Continuous delivery
Production releases
In practice - applying what you've learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Chapter 10: Exposing Pipelines - GitOps and Testing Essentials
A basic pipeline - building automation to deploy infrastructure as code architecture and code
Pipelines in chronological order
Pipeline templates
Errors or breaks in pipelines
Using containers in pipelines
Pipeline artifacts
Pipeline troubleshooting tips
Automating compliance and security in pipelines
Library age
Application security testing
Dynamic Application Security Testing (DAST)
Static Application Security Testing (SAST)
Secrets scanning
Automated linting for code quality and standards
Compiling with linting feedback
Validating functionality during deployment with automated testing
Why is testing so important to reliability?
Test data
The types of testing
When to test a pipeline
Testing observability
Automated rollbacks
The reduction of developer toil through automated processes
What is the impact of addressing toil?
In practice - applying what you've learned
Preparing AWS for the lab
Creating your repository
Adding secrets to your repository
Downloading and committing the lab files
Understanding the pipeline
Adding more steps
Testing but not deploying
Lab final thoughts
Summary
Chapter 11: Worker Bees - Orchestrations of Serverless, Containers, and Kubernetes
Technical requirements
The multiple definitions of serverless
Serverless Framework
Serverless computing
Serverless functions
Monitoring serverless functions
Errors
Containers and why we love them
Isolation
Immutability
Promotability
Tagging
Rollbacks
Security
Signable
Monitoring containers
Kubernetes and other ways to orchestrate containers
Health checks
Crashing and force-closing containers
HTTP-based load balancing
Server load balancing
Containers as a Service (CaaS)
Simple container orchestration
Kubernetes
Deployment techniques and workers
Traditional replacement deployment
Rolling deployment
A/B or blue/green deployment
Canary deployment
Automation and rolling back failed deployments
Rollback metrics
When to roll back
How to roll back
In practice - applying what you've learned
