#4 DevOps

Feedback Loops: Monitoring, Observability, and Learning from Failure

Dives into monitoring architectures, incident response workflows, and creating psychological safety for blameless retrospectives. Real examples from microservices and ML pipelines.

14:36 The DevOps Handbook (book) 40.1 MB 2 plays

Transcript

DevOps Handbook Study Guide

Quiz: Short Answer Questions (2-3 sentences)

1. According to Mike Rother, what was the most important practice the Lean community missed when trying to replicate Toyota's performance, and what is its purpose?
2. Explain the difference between large batch size and small batch size (single-piece flow) in the context of the brochure folding example.
3. What is "waste" according to Shigeo Shingo, and what are three major types of manufacturing waste he identified?
4. Describe a greenfield project in the context of technology and why starting with one might be easier.
5. What are "systems of record" and "systems of engagement" within the concept of bimodal IT?
6. How does embedding Operations engineers into service teams contribute to market-oriented outcomes?
7. What is the primary purpose of a version control system in software development?
8. Explain the concept of "dark launches" and their benefit as described in the text.
9. What is telemetry in the context of technology systems, and what is the goal of creating it?
10. According to Dan North, when deciding between an ERROR or WARN logging level, what thought experiment should you use?
Quiz Answer Key

1. Rother identified the improvement kata as the missed practice. Its purpose is to create structure for the daily, habitual practice of improvement work, because it is consistent practice that improves outcomes.
2. In large batch, you complete one step for all items before moving to the next step for all items. In small batch (single-piece flow), you complete all steps for a single item before starting the next one.
3. Waste is the use of any material or resource beyond what the customer requires and is willing to pay for. Three of the types Shingo identified are inventory, overproduction, and waiting; any three of his seven (inventory, overproduction, extra processing, transportation, waiting, motion, and defects) are acceptable.
4. A greenfield project is a new software project or initiative built from scratch with few constraints. It can be easier because you worry less about existing code, processes, and teams.
5. Systems of record are ERP-like systems running the core business where data correctness is paramount. Systems of engagement are customer-facing or employee-facing systems like e-commerce or productivity applications.
6. Embedding Ops engineers makes product teams more self-sufficient by reducing reliance on centralized Operations. This aligns Ops priorities with product team goals and connects them more closely to customers.
7. A version control system records changes to files or sets of files, allowing developers to track, compare, merge, and restore past revisions, which reduces the risk of losing work or being unable to undo a bad change.
8. In a dark launch, a feature is deployed to production but initially kept invisible to customers, then progressively rolled out to small customer segments, with the release halted if problems are found. This minimizes the number of customers impacted by defects or performance issues.
9. Telemetry is an automated process for collecting measurements and data from remote points for monitoring. The goal is to continuously create telemetry to confirm services are operating correctly and quickly identify problems.
10. North suggests imagining being woken up at 4 a.m. If the condition doesn't warrant an urgent wake-up call, it is likely a WARN level rather than an ERROR.
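
The batch-size answer above can be made concrete with a small timing sketch. The numbers are assumptions chosen for illustration (10 items, 4 steps, 10 seconds per step, one worker), and the function names are hypothetical. The sketch only computes when the first and the last finished item appear under each strategy.

```python
def large_batch(items: int, steps: int, step_seconds: int) -> tuple[int, int]:
    """Return (first_done, last_done) in seconds when each step is run for all items."""
    # The final step can only begin after every earlier step
    # has been completed for every item in the batch.
    start_of_last_step = (steps - 1) * items * step_seconds
    first_done = start_of_last_step + step_seconds
    last_done = steps * items * step_seconds
    return first_done, last_done


def single_piece(items: int, steps: int, step_seconds: int) -> tuple[int, int]:
    """Return (first_done, last_done) in seconds when each item goes through all steps."""
    # The first item is finished as soon as it has passed through every step once.
    first_done = steps * step_seconds
    last_done = steps * items * step_seconds
    return first_done, last_done


if __name__ == "__main__":
    # Assumed workload: 10 items, 4 steps, 10 seconds per step.
    print("large batch :", large_batch(10, 4, 10))
    print("single piece:", single_piece(10, 4, 10))
```

Under these assumed numbers the total work is identical (400 seconds), but the first finished item appears after 310 seconds with the large batch and after 40 seconds with single-piece flow, so a mistake in an early step is discovered far sooner.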
Essay Format Questions (No Answers Provided)

Analyze the significance of the "improvement kata" as described by Mike Rother in the context of achieving world-class agility, reliability, and security in technology organizations. How does this daily, habitual practice differ from traditional approaches to improvement, and what are its potential benefits?
Compare and contrast the large batch size strategy with the small batch size (single-piece flow) strategy as discussed in the text. Explain the dramatic difference in outcomes using the provided example and discuss why small batch sizes are generally preferred in a DevOps context.
Discuss the importance of eliminating hardship and waste in the value stream, drawing upon Shigeo Shingo's definition of waste. How do the different types of waste manifest in a technology value stream, and what are the potential consequences of not addressing them?
Evaluate the benefits and challenges of adopting a market-oriented team structure compared to a functional orientation. How does the embedding of Operations engineers into service teams support a market-oriented approach, and what are the implications for collaboration and achieving DevOps outcomes?
Explain the critical role of telemetry in supporting disciplined problem-solving behavior and achieving organizational goals. Discuss the different levels of metrics (business, application, infrastructure) and how they contribute to understanding reality and detecting when that understanding is incorrect.
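
The three metric levels named in the last question can all be emitted through one uniform call. The following is a minimal sketch that formats metrics in the plain-text StatsD line format (`name:value|type`) and sends them over UDP. The metric names, the `format_metric` and `emit` helpers, and the host and port (localhost and 8125, the conventional StatsD default) are assumptions for illustration, not something the guide specifies.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line: 'c' = counter, 'ms' = timer, 'g' = gauge."""
    if metric_type not in {"c", "ms", "g"}:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def emit(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a metric line as a fire-and-forget UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, one per level.
    emit(format_metric("business.orders.completed", 1, "c"))      # business
    emit(format_metric("app.checkout.response_time", 230, "ms"))  # application
    emit(format_metric("infra.web01.cpu_load", 0.75, "g"))        # infrastructure
```

Sending over UDP is a deliberate trade-off in this sketch: the application never blocks waiting on the metrics collector, at the cost of occasionally losing a data point.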
Glossary of Key Terms

Toyota Kata: A framework for managing people for improvement, adaptiveness, and superior results, emphasizing structured, daily practice for improvement.
Improvement Kata: The core practice identified by Mike Rother, focusing on creating structure for the daily, habitual practice of improvement work to improve outcomes.
Large Batch Size: A strategy where a single step is completed for all items in a batch before moving to the next step for the entire batch.
Small Batch Size / Single-Piece Flow: A strategy where all steps required to complete a single item are performed sequentially before starting the next item.
Waste (Lean Definition): The use of any material or resource beyond what the customer requires and is willing to pay for. Examples include inventory, overproduction, waiting, defects, etc.
Greenfield Project: A new software project or initiative built from scratch with few constraints, in which applications and infrastructure are created anew.
Bimodal IT: A concept describing the wide spectrum of IT services enterprises support, typically categorized into systems of record and systems of engagement.
Systems of Record: ERP-like systems that run the core business, where data correctness and transactions are paramount.
Systems of Engagement: Customer-facing or employee-facing systems designed for interaction and productivity.
Market-Oriented Teams: Teams structured to optimize for speed, often responsible for feature development, testing, securing, deploying, and supporting their service end-to-end.
Functional Orientation: A team structure optimized for cost, where teams are organized around specific functions (e.g., development, operations, testing).
Service Teams: A term used interchangeably with feature teams, product teams, development teams, and delivery teams; these teams are primarily responsible for developing, testing, and securing code to deliver value.
Version Control System: A system that records changes to files or sets of files, allowing for tracking, comparing, merging, and restoring past revisions.
Commits / Revisions: Groups of changes recorded within a version control system, along with metadata like who made the change and when.
Trunk: The main development line in a version control system.
Long-Lived Private Branches / Feature Branches: Development branches that exist for an extended period before merging back into the main development line, often resulting in large batch sizes of changes.
Blue-Green Deployment: A release pattern that involves running two identical production environments (blue and green) and switching traffic between them to deploy a new version with minimal downtime and risk.
Dark Launch: A release pattern where a feature is progressively rolled out to small segments of customers, allowing for monitoring and halting the release if problems are found, without the customer necessarily being aware of the feature initially.
Telemetry: An automated communications process by which measurements and other data are collected at remote points and transmitted for monitoring and analysis.
Logging Levels: Categories used to classify log messages based on their severity or informational value (e.g., DEBUG, INFO, WARN, ERROR, FATAL).
StatsD: A widely used metrics library designed to make it easy for developers to generate telemetry from their code.
Graphite / Grafana: Tools often used in conjunction with StatsD to render metric events into graphs and dashboards for visualization and analysis.
Business Metrics: Telemetry related to key business outcomes, such as sales transactions, revenue, user signups, or churn rate.
Application Metrics: Telemetry related to application performance and behavior, such as transaction times, user response times, or application faults.
Infrastructure Metrics: Telemetry related to the underlying infrastructure, such as web server traffic, CPU load, or disk usage.
Customer Acquisition Funnel: The theoretical steps a potential customer takes to make a purchase, with measurable journey events.
Vanity Metrics: Metrics that provide little useful information for making actionable decisions.
Gaussian Distribution / Normal Distribution: A symmetrical, bell-shaped probability distribution often found in natural phenomena and data sets.
Standard Deviation (σ): A measure of the amount of variation or dispersion of a set of values.
Mean (µ): The average of a set of values.
Non-Gaussian Distribution: A probability distribution that does not have the classic, symmetrical bell curve shape.
Outlier Detection: Statistical and visualization techniques used to identify data points that significantly vary from the norm.
Anomaly Detection: Techniques used to identify unusual patterns or data points that deviate from expected behavior.
Periodic/Seasonal Metric Data: Telemetry that exhibits regular and predictable patterns over time (daily, weekly, yearly).
Site Reliability Engineer (SRE): A term coined at Google for Ops engineers who have a software engineering background and are tasked with operations work.
Code Review: The practice of having fellow engineers scrutinize changes to applications or environments before they are committed or deployed.
Imposter Syndrome: A psychological term describing individuals who are unable to internalize their accomplishments and feel like frauds despite external evidence of their competence.
Blameless Post-Mortem: A meeting or process conducted after an incident or accident to understand what happened and identify systemic improvements, focusing on how the system contributed to the outcome rather than blaming individuals.
Counterfactual Statements: Statements that frame a problem in terms of an imagined system ("If I had known...") rather than the system that actually exists.
Game Days: Exercises that simulate disaster scenarios or system failures to test the resilience of services and the organization's ability to respond and recover.
Disaster Recovery Program (DiRT): Google's program dedicated to simulating disaster scenarios to test and improve disaster recovery capabilities.
Improvement Blitz / Kaizen Blitz: A dedicated and concentrated period of time (often several days) to address a particular issue and implement process improvements.
Technical Debt: The cost incurred later by choosing an easy or quick solution now instead of a better approach that would take longer.
Fixits: Dedicated periods, often short (e.g., 24 hours), where teams self-organize to address problems they care about, typically non-feature work like paying down technical debt.
Brakeman: A static analysis security vulnerability scanner specifically for Ruby on Rails applications.
PCI DSS (Payment Card Industry Data Security Standard): A set of security requirements designed to protect cardholder data.
Cardholder Data Environment (CDE): The people, processes, and technology that store, process, or transmit cardholder data or sensitive authentication data, as defined by PCI DSS.
Separation of Duties: A security principle that involves dividing responsibilities and privileges among different individuals or teams to prevent a single person from being able to perform actions that could compromise security without oversight.
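
The statistical entries in the glossary (mean, standard deviation, outlier detection) combine into a simple check: flag any data point that lies more than a chosen number of standard deviations from the mean. This is a minimal sketch that assumes roughly Gaussian data. The sample values, the `find_outliers` name, and the threshold of 3 are illustrative assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the sample window
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) > threshold * sigma]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one obvious spike.
    response_times = [100, 102, 98, 101, 99] * 4 + [500]
    print(find_outliers(response_times))
```

With the sample data above, only the 500 ms spike is flagged. For telemetry that is not Gaussian, a fixed standard-deviation threshold can alert too often or too rarely, so a check like this is a starting point rather than a general-purpose rule.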

Sources

The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (ISBN 978-1942788003)