Becoming a rockstar SRE: electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems
Excel in site reliability engineering by learning from field-driven lessons on observability and reliability in code, architecture, process, systems management, costs, and people to minimize downtime and enhance developers' output. Purchase of the print or Kindle book includes a free eBook in th...
Other Authors: | , |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham, England ; Mumbai : Packt [2023] |
Edition: | 1st ed |
Subjects: | |
View in Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009742736806719 |
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Dedication
- Contributors
- Table of Contents
- Preface
- Part 1 - Understanding the Basics of Who, What, and Why
- Chapter 1: SRE Job Role - Activities and Responsibilities
- Making this journey personal
- SRE driving forces
- SRE skills
- SRE traits
- Understanding the mindset and hobbies of an SRE
- SRE affinity game
- SRE guiding principles
- SRE hobbies
- DevOps engineers versus SRE versus others
- DevOps and site reliability engineers
- Software and site reliability engineers
- Describing an SRE's main responsibilities
- An overview of the daily activities of an SRE
- People that inspire
- Jeremy's recognition - Paul Tyma, former CTO, LendingTree
- Rod's recognition - Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl
- Summary
- Further reading
- Chapter 2: Fundamental Numbers - Reliability Statistics
- SLA commitment - a conversation, not a number
- Internal partner SLAs
- External partner SLAs
- The cost of more 9s in an SLA
- A final word on SLAs
- Defining and leveraging SLOs and SLIs
- SLOs
- SLOs and time
- Tracking outage frequency with the MTBF
- Measuring the downtime with the MTTR
- Understanding the customer and revenue impact
- Transparency in outages
- The rockstar SRE's SLA
- Summary
- Chapter 3: Imperfect Habits - Duct Tape Architecture and Spaghetti Code
- The business of software development - let's start with the dollars
- Defining the "value" of software to a business
- The value of protecting business
- The value of growing a business
- The value of saving labor costs
- The A/B testing mindset - the art of change in customer interaction
- A/B testing in customer flows
- Analyzing the results of A/B testing
- Leveraging A/B testing to satisfy quarterly numbers
- Dedication to the craft of development - and why some are just here for a job
- A quick guide to communicating with your colleagues
- Reviewing the merge request - it's about training, oversight, and reliability
- Avoiding the typical rubber stamp mentality
- A word on production deployments
- Why businesses want us to outright ignore best practices
- The truth about the ownership of a developer's time
- Understanding the flaws in how we estimate development cost
- Fast, good, cheap - pick one
- Why is observability the answer to reliability issues?
- The cost of highly available architecture
- Mixing good and bad - tricks to wrapping bad code and making it resilient
- Alerting that fires actions
- Adding additional logging to monitor potential issues
- Using try catch to encapsulate exceptions
- Retries to the rescue…or not
- Summary
- Part 2 - Implementing Observability for Site Reliability Engineering
- Chapter 4: Essential Observability - Metrics, Events, Logs, and Traces (MELT)
- Technical requirements
- Accomplishing systems monitoring and telemetry
- Monitoring targets for infrastructure
- Monitoring types and tools
- Monitoring golden signals
- Monitoring data
- Understanding APM
- Getting to know topology self-discovery, the blast radius, predictability, and correlation
- Alerting - the art of doing it quietly
- The user perspective notification trigger principle
- Event-to-incident mapping principle
- Mixing everything into observability
- Outages versus downtime
- Observability architecture
- Observability effectiveness
- In practice - applying what you have learned
- Lab architecture
- Lab contents
- Lab instructions
- Summary
- Further reading
- Chapter 5: Resolution Path - Master Troubleshooting
- Properly defining the problem - and what to ask and not ask
- Source of information
- The knowledge base of the reporter
- Naming conventions
- False urgency
- Executive summary
- Breaking down and testing systems
- Breaking down hardware versus the operating system
- Breaking down a web API
- Understanding the steps
- The problems with this method of troubleshooting
- Previous and common events - checking for the simple problems
- Prior Root Cause Analysis (RCA) documents
- Timeline analysis
- Comparison
- The best approach
- Effective research both online and among peers
- The art of the Google search
- Skimming the content quickly and refining it
- Never forget your internal resources
- Breaking down source code efficiently
- Code you've never seen
- When that fails
- Logging plus code
- In practice - applying what you've learned
- Summary
- Chapter 6: Operational Framework - Managing Infrastructure and Systems
- Technical requirements
- Approaching systems administration as a discipline
- Design
- Installation
- Configuration
- App deployment
- Management
- Upgrade
- Uninstallation
- Understanding IT service management
- ITIL
- DevOps
- Seeing systems administration as multiple layers and multiple towers
- Automating systems provisioning and management
- Infrastructure as Code
- Immutable infrastructure
- In practice - applying what you've learned
- Lab architecture
- Lab contents
- Lab instructions
- Summary
- Further reading
- Chapter 7: Data Consumed - Observability Data Science
- Technical requirements
- Making data-driven decisions
- Defining the question and options
- Determining which data to use
- Identifying which data is already available
- Collecting the missing data
- Analyzing all datasets together
- Presenting the decision as a record
- Documenting the lessons learned in the process
- Solving problems through a scientific approach
- Formulation
- Hypothesis
- Prediction
- Experiment
- Analysis
- Understanding the most common statistical methods
- Percentages
- Mean, average, and standard deviation
- Quantiles and percentiles
- Histograms
- Using other mathematical models in observability
- Visualizing histograms with Grafana
- In practice - applying what you've learned
- Lab architecture
- Lab contents
- Lab instructions
- Summary
- Further reading
- Part 3 - Applying Architecture for Reliability
- Chapter 8: Reliable Architecture - Systems Strategy and Design
- Technical requirements
- Designing for reliability
- Architectural aspects
- Reliability equations
- Design patterns
- Modern applications
- Splitting and balancing the workload
- Splitting
- Balancing
- Failing over - almost as good
- Scaling up and out - horizontal versus vertical
- Horizontal
- Vertical
- Autoscaling
- In practice - applying what you've learned
- Lab architecture
- Lab contents
- Lab instructions
- Summary
- Further reading
- Chapter 9: Valued Automation - Toil Discovery and Elimination
- Technical requirements
- Eliminating toil
- Toil redefined
- Why toil is bad
- Handling toil the right way
- Treating automation as a software problem
- Document
- Algorithm
- Code
- Automating the (in)famous CI/CD pipeline
- Continuous integration
- Continuous delivery
- Production releases
- In practice - applying what you've learned
- Lab architecture
- Lab contents
- Lab instructions
- Summary
- Further reading
- Chapter 10: Exposing Pipelines - GitOps and Testing Essentials
- A basic pipeline - building automation to deploy infrastructure as code architecture and code
- Pipelines in chronological order
- Pipeline templates
- Errors or breaks in pipelines
- Using containers in pipelines
- Pipeline artifacts
- Pipeline troubleshooting tips
- Automating compliance and security in pipelines
- Library age
- Application security testing
- Dynamic Application Security Testing (DAST)
- Static Application Security Testing (SAST)
- Secrets scanning
- Automated linting for code quality and standards
- Compiling with linting feedback
- Validating functionality during deployment with automated testing
- Why is testing so important to reliability?
- Test data
- The types of testing
- When to test a pipeline
- Testing observability
- Automated rollbacks
- The reduction of developer toil through automated processes
- What is the impact of addressing toil?
- In practice - applying what you've learned
- Preparing AWS for the lab
- Creating your repository
- Adding secrets to your repository
- Downloading and committing the lab files
- Understanding the pipeline
- Adding more steps
- Testing but not deploying
- Lab final thoughts
- Summary
- Chapter 11: Worker Bees - Orchestrations of Serverless, Containers, and Kubernetes
- Technical requirements
- The multiple definitions of serverless
- Serverless Framework
- Serverless computing
- Serverless functions
- Monitoring serverless functions
- Errors
- Containers and why we love them
- Isolation
- Immutability
- Promotability
- Tagging
- Rollbacks
- Security
- Signable
- Monitoring containers
- Kubernetes and other ways to orchestrate containers
- Health checks
- Crashing and force-closing containers
- HTTP-based load balancing
- Server load balancing
- Containers as a Service (CaaS)
- Simple container orchestration
- Kubernetes
- Deployment techniques and workers
- Traditional replacement deployment
- Rolling deployment
- A/B or blue/green deployment
- Canary deployment
- Automation and rolling back failed deployments
- Rollback metrics
- When to roll back
- How to roll back
- In practice - applying what you've learned