Becoming a Rockstar SRE: Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems

Excel in site reliability engineering by learning from field-driven lessons on observability and reliability in code, architecture, process, systems management, costs, and people to minimize downtime and enhance developers' output. Purchase of the print or Kindle book includes a free eBook in th...

Bibliographic Details
Other Authors: Proffitt, Jeremy, author; Anami, Rod, author
Format: eBook
Language: English
Published: Birmingham, England ; Mumbai : Packt [2023]
Edition: 1st ed.
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009742736806719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Part 1 - Understanding the Basics of Who, What, and Why
  • Chapter 1: SRE Job Role - Activities and Responsibilities
  • Making this journey personal
  • SRE driving forces
  • SRE skills
  • SRE traits
  • Understanding the mindset and hobbies of an SRE
  • SRE affinity game
  • SRE guiding principles
  • SRE hobbies
  • DevOps engineers versus SRE versus others
  • DevOps and site reliability engineers
  • Software and site reliability engineers
  • Describing an SRE's main responsibilities
  • An overview of the daily activities of an SRE
  • People that inspire
  • Jeremy's recognition - Paul Tyma, former CTO, LendingTree
  • Rod's recognition - Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl
  • Summary
  • Further reading
  • Chapter 2: Fundamental Numbers - Reliability Statistics
  • SLA commitment - a conversation, not a number
  • Internal partner SLAs
  • External partner SLAs
  • The cost of more 9s in an SLA
  • A final word on SLAs
  • Defining and leveraging SLOs and SLIs
  • SLOs
  • SLOs and time
  • Tracking outage frequency with the MTBF
  • Measuring the downtime with the MTTR
  • Understanding the customer and revenue impact
  • Transparency in outages
  • The rockstar SRE's SLA
  • Summary
  • Chapter 3: Imperfect Habits - Duct Tape Architecture and Spaghetti Code
  • The business of software development - let's start with the dollars
  • Defining the "value" of software to a business
  • The value of protecting business
  • The value of growing a business
  • The value of saving labor costs
  • The A/B testing mindset - the art of change in customer interaction
  • A/B testing in customer flows
  • Analyzing the results of A/B testing
  • Leveraging A/B testing to satisfy quarterly numbers
  • Dedication to the craft of development - and why some are just here for a job
  • A quick guide to communicating with your colleagues
  • Reviewing the merge request - it's about training, oversight, and reliability
  • Avoiding the typical rubber stamp mentality
  • A word on production deployments
  • Why businesses want us to outright ignore best practices
  • The truth about the ownership of a developer's time
  • Understanding the flaws in how we estimate development cost
  • Fast, good, cheap - pick one
  • Why is observability the answer to reliability issues?
  • The cost of highly available architecture
  • Mixing good and bad - tricks to wrapping bad code and making it resilient
  • Alerting that fires actions
  • Adding additional logging to monitor potential issues
  • Using try catch to encapsulate exceptions
  • Retries to the rescue…or not
  • Summary
  • Part 2 - Implementing Observability for Site Reliability Engineering
  • Chapter 4: Essential Observability - Metrics, Events, Logs, and Traces (MELT)
  • Technical requirements
  • Accomplishing systems monitoring and telemetry
  • Monitoring targets for infrastructure
  • Monitoring types and tools
  • Monitoring golden signals
  • Monitoring data
  • Understanding APM
  • Getting to know topology self-discovery, the blast radius, predictability, and correlation
  • Alerting - the art of doing it quietly
  • The user perspective notification trigger principle
  • Event-to-incident mapping principle
  • Mixing everything into observability
  • Outages versus downtime
  • Observability architecture
  • Observability effectiveness
  • In practice - applying what you have learned
  • Lab architecture
  • Lab contents
  • Lab instructions
  • Summary
  • Further reading
  • Chapter 5: Resolution Path - Master Troubleshooting
  • Properly defining the problem - and what to ask and not ask
  • Source of information
  • The knowledge base of the reporter
  • Naming conventions
  • False urgency
  • Executive summary
  • Breaking down and testing systems
  • Breaking down hardware versus the operating system
  • Breaking down a web API
  • Understanding the steps
  • The problems with this method of troubleshooting
  • Previous and common events - checking for the simple problems
  • Prior Root Cause Analysis (RCA) documents
  • Timeline analysis
  • Comparison
  • The best approach
  • Effective research both online and among peers
  • The art of the Google search
  • Skimming the content quickly and refining it
  • Never forget your internal resources
  • Breaking down source code efficiently
  • Code you've never seen
  • When that fails
  • Logging plus code
  • In practice - applying what you've learned
  • Summary
  • Chapter 6: Operational Framework - Managing Infrastructure and Systems
  • Technical requirements
  • Approaching systems administration as a discipline
  • Design
  • Installation
  • Configuration
  • App deployment
  • Management
  • Upgrade
  • Uninstallation
  • Understanding IT service management
  • ITIL
  • DevOps
  • Seeing systems administration as multiple layers and multiple towers
  • Automating systems provisioning and management
  • Infrastructure as Code
  • Immutable infrastructure
  • In practice - applying what you've learned
  • Lab architecture
  • Lab contents
  • Lab instructions
  • Summary
  • Further readings
  • Chapter 7: Data Consumed - Observability Data Science
  • Technical requirements
  • Making data-driven decisions
  • Defining the question and options
  • Determining which data to use
  • Identifying which data is already available
  • Collecting the missing data
  • Analyzing all datasets together
  • Presenting the decision as a record
  • Documenting the lessons learned in the process
  • Solving problems through a scientific approach
  • Formulation
  • Hypothesis
  • Prediction
  • Experiment
  • Analysis
  • Understanding the most common statistical methods
  • Percentages
  • Mean, average, and standard deviation
  • Quantiles and percentiles
  • Histograms
  • Using other mathematical models in observability
  • Visualizing histograms with Grafana
  • In practice - applying what you've learned
  • Lab architecture
  • Lab contents
  • Lab instructions
  • Summary
  • Further reading
  • Part 3 - Applying Architecture for Reliability
  • Chapter 8: Reliable Architecture - Systems Strategy and Design
  • Technical requirements
  • Designing for reliability
  • Architectural aspects
  • Reliability equations
  • Design patterns
  • Modern applications
  • Splitting and balancing the workload
  • Splitting
  • Balancing
  • Failing over - almost as good
  • Scaling up and out - horizontal versus vertical
  • Horizontal
  • Vertical
  • Autoscaling
  • In practice - applying what you've learned
  • Lab architecture
  • Lab contents
  • Lab instructions
  • Summary
  • Further reading
  • Chapter 9: Valued Automation - Toil Discovery and Elimination
  • Technical requirements
  • Eliminating toil
  • Toil redefined
  • Why toil is bad
  • Handling toil the right way
  • Treating automation as a software problem
  • Document
  • Algorithm
  • Code
  • Automating the (in)famous CI/CD pipeline
  • Continuous integration
  • Continuous delivery
  • Production releases
  • In practice - applying what you've learned
  • Lab architecture
  • Lab contents
  • Lab instructions
  • Summary
  • Further reading
  • Chapter 10: Exposing Pipelines - GitOps and Testing Essentials
  • A basic pipeline - building automation to deploy infrastructure as code architecture and code
  • Pipelines in chronological order
  • Pipeline templates
  • Errors or breaks in pipelines
  • Using containers in pipelines
  • Pipeline artifacts
  • Pipeline troubleshooting tips
  • Automating compliance and security in pipelines
  • Library age
  • Application security testing
  • Dynamic Application Security Testing (DAST)
  • Static Application Security Testing (SAST)
  • Secrets scanning
  • Automated linting for code quality and standards
  • Compiling with linting feedback
  • Validating functionality during deployment with automated testing
  • Why is testing so important to reliability?
  • Test data
  • The types of testing
  • When to test a pipeline
  • Testing observability
  • Automated rollbacks
  • The reduction of developer toil through automated processes
  • What is the impact of addressing toil?
  • In practice - applying what you've learned
  • Preparing AWS for the lab
  • Creating your repository
  • Adding secrets to your repository
  • Downloading and committing the lab files
  • Understanding the pipeline
  • Adding more steps
  • Testing but not deploying
  • Lab final thoughts
  • Summary
  • Chapter 11: Worker Bees - Orchestrations of Serverless, Containers, and Kubernetes
  • Technical requirements
  • The multiple definitions of serverless
  • Serverless Framework
  • Serverless computing
  • Serverless functions
  • Monitoring serverless functions
  • Errors
  • Containers and why we love them
  • Isolation
  • Immutability
  • Promotability
  • Tagging
  • Rollbacks
  • Security
  • Signable
  • Monitoring containers
  • Kubernetes and other ways to orchestrate containers
  • Health checks
  • Crashing and force-closing containers
  • HTTP-based load balancing
  • Server load balancing
  • Containers as a Service (CaaS)
  • Simple container orchestration
  • Kubernetes
  • Deployment techniques and workers
  • Traditional replacement deployment
  • Rolling deployment
  • A/B or blue/green deployment
  • Canary deployment
  • Automation and rolling back failed deployments
  • Rollback metrics
  • When to roll back
  • How to roll back
  • In practice - applying what you've learned