#6 DevOps

Enterprise Transformation: Scaling DevOps and Secure Collaboration

Blueprint for scaling DevOps across large organizations. Covers platform engineering, security compliance automation, and team structures for cloud/ML initiatives.

Transcript

Briefing Document: DevOps Principles from "The DevOps Handbook"

Source: Excerpts from "the-devops-handbook-how-to-create-world-class-agility-reliability-and-security-in-technology-organizations-978-1942788003.pdf"

Overview: These excerpts from "The DevOps Handbook" delve into core principles and practices that drive high-performing technology organizations, drawing heavily on lessons from Lean manufacturing, particularly the Toyota Production System. The document highlights the importance of creating a culture of feedback and learning, minimizing waste, enabling market-oriented teams, automating processes, and integrating security throughout the value stream. Real-world case studies, such as Etsy and Google, are used to illustrate the successful application of these concepts.

Key Themes and Ideas:

The Importance of Improvement Kata and Daily Practice:
Mike Rother's "Toyota Kata" emphasizes that simply adopting Lean tools doesn't replicate the performance of companies like Toyota. The key lies in the "improvement kata," which involves structured, habitual practice of improvement work.
This daily practice is crucial for improving outcomes. The cycle of establishing desired future states, setting weekly targets, and continuously improving daily work is the guiding principle.
This concept is mirrored in DevOps by fostering a culture of continuous learning and improvement.
Minimizing Batch Sizes to Increase Flow and Reduce Risk:
The difference between large and small batch sizes in work processes is dramatic. Using a small batch strategy, like "single-piece flow," significantly reduces the time it takes for a single item (e.g., a brochure, a code change) to be completed.
In technology, this translates to making smaller code changes and deploying more frequently. Large batch sizes in code commits lead to significant chaos and rework, as seen in the HP LaserJet example.
"Small batch development" and frequent commits to trunk are essential for a smooth and safe deployment pipeline.
Identifying and Eliminating Waste in the Value Stream:
Drawing from Shigeo Shingo's work on the Toyota Production System, waste is defined as "the use of any material or resource beyond what the customer requires and is willing to pay for."
Seven major types of manufacturing waste are identified: inventory, overproduction, extra processing, transportation, waiting, motion, and defects.
In a technology value stream, this translates to reducing activities that don't directly add value to the customer, such as excessive handoffs, manual processes, and waiting for environments.
Functional vs. Market-Oriented Teams:
Achieving DevOps outcomes requires moving away from a "functional orientation" (optimizing for cost) towards a "market orientation" (optimizing for speed).
Market-oriented teams are cross-functional and independent, responsible for the entire lifecycle of their service, from conception to retirement, including development, testing, security, deployment, and support.
This structure enables many small teams to work safely and independently, quickly delivering value to the customer.
Embedding Operations Engineers into Service Teams:
A key way to enable market-oriented outcomes is by embedding Operations engineers directly into product or service teams.
This reduces reliance on centralized Operations teams and aligns the Ops engineers' priorities with the goals of the product teams they are supporting.
This practice fosters a closer connection to internal and external customers and allows teams to become more self-sufficient in service delivery and support.
The Critical Role of Version Control:
Version control is a mandatory practice for individual developers and teams. It records changes and supports commits, comparisons, merges, and restoration of past revisions.
It minimizes risks by providing a mechanism to revert to previous versions in production.
Significant problems arise when developers work in long-lived private branches and merge infrequently, resulting in large batches of changes.
Creating Production-Like Environments on Demand:
A significant hardship in traditional IT is the difficulty and delay in obtaining testing environments. Developers often resort to using old, potentially unreliable environments.
The goal is to enable the rapid, automated creation of production-like environments on demand. This requires checking environment configurations into version control and using automated configuration systems.
The concept of "immutable infrastructure," where environments are rebuilt rather than repaired, is a valuable practice in this area.
Decoupling Database Changes from Application Changes:
Traditionally, database changes are tightly coupled with application releases, making rollbacks difficult and potentially leading to data loss.
A better approach is to decouple these changes by only making additive changes to the database (never mutating existing objects) and ensuring applications are backward-compatible with older database versions.
IMVU is cited as an example, achieving fifty deployments per day, some requiring database changes, by adopting this pattern.
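The additive pattern described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not IMVU's actual implementation: the `users` table, the `display_name` column, and the function names are all hypothetical. The point it shows is that after an additive migration, the old application version keeps working unchanged while the new version uses the added column.

```python
import sqlite3


def create_schema(conn):
    # Original schema that the old application version was written against.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def migrate_additive(conn):
    # Additive change only: a new nullable column is added.
    # Nothing existing is renamed, retyped, or dropped.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_app_read(conn):
    # Old application version: only knows about the original columns.
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def new_app_read(conn):
    # New application version: prefers the new column and falls back to the
    # old one for rows written before the migration.
    return conn.execute(
        "SELECT id, COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO users (name) VALUES ('ada')")
    migrate_additive(conn)
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES ('grace', 'Grace H.')"
    )
    return old_app_read(conn), new_app_read(conn)
```

Because both readers work against the migrated schema, the application can be rolled back to the old version without touching the database.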
Low-Risk Release Patterns (Blue-Green Deployment, Canary Releases, Feature Toggles):
To reduce the risk associated with deployments, various release patterns are employed.
Blue-Green Deployment: Maintaining two identical production environments (blue and green). Traffic is switched from the old (blue) to the new (green). This allows for quick rollback by switching traffic back to the blue environment. Dixons Retail is an example of this pattern applied to point-of-sale systems.
Canary Release: Progressively rolling out a new feature or version to a small segment of users before a full rollout. This allows for monitoring and halting the release if issues arise, minimizing the impact on the majority of customers.
Feature Toggles (or Flags): Enabling or disabling features dynamically in production. This allows for releasing code with features turned off and then enabling them for specific user segments or all users, independent of code deployments.
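A feature toggle combined with a canary-style percentage rollout can be sketched as follows. This is one common approach, assumed here for illustration and not taken from the book: each user is hashed into a stable bucket from 0 to 99, and a feature is on for users whose bucket falls below the configured percentage. The feature name and the `rollout_percent` dictionary are hypothetical; in practice the percentages would come from a configuration service so they can change without a code deployment.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user deterministically to a bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, rollout_percent: dict) -> bool:
    """Return True if the feature is on for this user.

    Features missing from rollout_percent are treated as off, so code can
    ship to production dark and be enabled later.
    """
    return bucket(user_id, feature) < rollout_percent.get(feature, 0)
```

Raising the percentage only ever adds users, so anyone who saw the feature at 10% still sees it at 50%. Setting the percentage back to 0 turns the feature off for everyone without a rollback of the code.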
Creating and Utilizing Telemetry for Problem Detection and Understanding Reality:
Telemetry is defined as the automated collection and transmission of data from remote points for monitoring.
Creating telemetry at all levels of the application stack (business, application, infrastructure) in production and pre-production environments is crucial.
Telemetry allows for confirming correct service operation, quickly determining what is going wrong when problems occur, and making informed decisions.
It helps build an understanding of reality and detect when that understanding is incorrect.
Etsy's "Church of Graphs" culture, where they track everything and make it easy for engineers to add telemetry, is highlighted. Their creation and open-sourcing of StatsD is an example of making it easy to instrument code.
Overlaying production deployment activities on telemetry graphs helps identify the causal relationship between changes and issues.
Actionable business metrics, as opposed to vanity metrics, are essential for informing product changes and enabling experimentation (A/B testing). They also provide context for infrastructure metrics, fostering better collaboration between Dev and Ops.
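To show how little code it can take to instrument an application, here is a minimal sketch of emitting metrics in the StatsD plain-text line format (`name:value|type`, with an optional `|@sample_rate` suffix) over UDP. The metric names are hypothetical, and the host and port are assumptions (8125 is the conventional StatsD port). A real service would more likely use an existing StatsD client library than hand-rolled code like this.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=1.0):
    """Build one StatsD line. Common types: 'c' counter, 'ms' timer, 'g' gauge."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget send over UDP, so telemetry never blocks the caller."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def record_deploy(service):
    # Hypothetical helper: counting deployments as a metric makes it possible
    # to overlay deploy events on the same graphs as other telemetry.
    return format_metric(f"deploys.{service}", 1, "c")
```

For example, `send_metric(format_metric("logins.success", 1, "c"))` increments a counter each time a login succeeds.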
Leveraging Statistical Analysis for Anomaly Detection:
Simple statistical techniques, like using means and standard deviations, can help detect potential problems by alerting when a metric deviates significantly from its average.
However, simply using a "3 standard deviation" rule on data that isn't normally distributed can lead to over-alerting, as illustrated by the downloads per minute example from Dr. Toufic Boubez.
More advanced techniques, such as those used by Netflix to predict viewing demand based on historical seasonal patterns, are necessary for non-Gaussian data. Metrics related to user behavior often have predictable daily, weekly, and yearly patterns.
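The mean-and-standard-deviation rule can be sketched in a few lines with Python's statistics module. The function names and sample values are illustrative assumptions, not data from the book. The sketch also shows one way the rule misbehaves on skewed data: a single large spike inflates the standard deviation so much that the lower bound goes negative, which is meaningless for a count.

```python
import statistics


def sigma_bounds(history, k=3.0):
    """Return (low, high) alert thresholds at k standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return mean - k * stdev, mean + k * stdev


def is_anomalous(value, history, k=3.0):
    """True if value falls outside the k-sigma band around the historical mean."""
    low, high = sigma_bounds(history, k)
    return value < low or value > high


# Roughly symmetric history: the band is tight and a large jump is flagged.
steady = [10, 12, 11, 9, 10, 11, 10, 9, 12, 11]

# Heavily skewed history: one spike dominates the mean and standard deviation.
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 50]
```

With `steady`, a value of 30 is flagged and a value of 10 is not. With `skewed`, the lower bound is below zero, so a drop to zero would never trigger an alert. This is a hint that the data violates the rule's assumption of a normal distribution and that a different technique is needed.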
Institutionalizing Code and Environment Reviews:
Reviews of changes to applications or environments by fellow engineers improve quality and enable cross-training, peer learning, and skill development.
The point just before code is committed to trunk is a logical place to require reviews, especially for higher-risk changes like database modifications or security-sensitive components.
Review statistics should be monitored to ensure reviews are not merely being "rubber-stamped."
Fostering a Culture of Blameless Post-Mortems:
When incidents or accidents occur, a "just culture" is essential, focusing on systemic improvements rather than blaming individuals.
Blameless post-mortems are conducted to understand what happened, why it made sense for people to take the actions they did, and how to prevent recurrence.
Focusing on the system as it exists, rather than imagined systems, is crucial. Counterfactual statements ("If I had known...") should be reframed to understand the reality of the situation at the time of the incident.
Publicizing post-mortems increases organizational learning and transparency.
Conducting Game Days to Build Resilience and Practice Disaster Recovery:
Game Days involve simulating disruptive events (e.g., server failures, network outages) in a controlled environment.
The goal is to progressively create more resilient services and a higher degree of assurance that operations can resume when inopportune events occur.
Game Days also create significant learning opportunities and build a more resilient organization. Google's Disaster Recovery Program (DiRT) is a prime example, simulating various catastrophic events.
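On a small scale, the idea of injecting failures and checking that the service degrades gracefully can be sketched in code. This is a toy illustration only, not how Google's DiRT exercises work: the `FaultInjector` class, the dependency name, and the page-rendering function are all hypothetical.

```python
class FaultInjector:
    """Hypothetical Game Day helper that makes named dependencies fail on demand."""

    def __init__(self, failing=()):
        self.failing = set(failing)

    def call(self, dependency, fn, *args):
        # Simulate an outage for any dependency listed as failing.
        if dependency in self.failing:
            raise ConnectionError(f"injected failure: {dependency}")
        return fn(*args)


def fetch_recommendations(user_id):
    # Stand-in for a real network call to a recommendations service.
    return [f"item-for-{user_id}"]


def render_home_page(injector, user_id):
    """Render the page, degrading gracefully if recommendations are unavailable."""
    try:
        recs = injector.call("recommendations", fetch_recommendations, user_id)
    except ConnectionError:
        recs = []  # Serve the page without recommendations instead of failing.
    return {"user": user_id, "recommendations": recs, "degraded": not recs}
```

Running the same request with `FaultInjector()` and with `FaultInjector(failing={"recommendations"})` shows whether the page still renders when the dependency is down.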
Reserving Time for Organizational Learning and Improvement:
Inspired by the Toyota Production System's "improvement blitz" (or "kaizen blitz"), it's crucial to dedicate concentrated periods of time to address specific issues or technical debt.
These rituals, such as day- or week-long improvement blitzes or "Fixits" (as practiced at Google), involve teams self-organizing to fix problems they care about, with no feature work allowed.
These events help enforce the practice of reserving time for improvement work and foster cross-functional collaboration. Facebook's HipHop PHP compiler project originated from a hack day, demonstrating the potential of such initiatives.
Integrating Information Security into the Daily Work of Development and Operations:
Security should not be a separate, siloed function but integrated throughout the DevOps value stream.
Embedding security responsibilities and expertise within development and operations teams is essential.
Creating security telemetry and making it visible to developers can help them understand the hostile operating environment and the impact of their code. Etsy's graph showing SQL injection attempts is a powerful example.
Automating security testing and integrating it into the deployment pipeline helps catch vulnerabilities early.
While integration is key, certain regulatory requirements (like PCI DSS) may necessitate physically and logically separating systems handling sensitive data and implementing strict separation of duties, as illustrated by Etsy's ICHT application case study.
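As one small example of an automated security test that could run in a deployment pipeline, the sketch below checks that a lookup function is not vulnerable to a classic SQL injection payload. It uses Python's built-in sqlite3 module; the `users` table and the function names are hypothetical, and a real pipeline would combine many such tests with static and dynamic analysis tools.

```python
import sqlite3


def make_test_db():
    # Throwaway in-memory database seeded with known rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user(conn, username):
    # Parameter binding: the driver treats username purely as data,
    # never as part of the SQL statement.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()


def test_injection_payload_matches_nothing():
    conn = make_test_db()
    # If the query were built by string concatenation, this payload would
    # turn the WHERE clause into a tautology and return every row.
    assert find_user(conn, "' OR '1'='1") == []
    # A legitimate lookup still works.
    assert find_user(conn, "alice") == [(1, "alice")]
```

A test like this fails the build if someone later rewrites `find_user` with string formatting, catching the vulnerability before it reaches production.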
Considering Both Systems of Record and Systems of Engagement:
Gartner's concept of bimodal IT acknowledges that enterprises support a wide spectrum of services.
Systems of Record (e.g., ERP, HR, financial systems) prioritize correctness and data integrity.
Systems of Engagement (e.g., customer-facing applications) prioritize agility and responsiveness.
DevOps practices are particularly well-suited for Systems of Engagement due to their focus on speed and frequent changes, but the principles of automation and improvement can be applied to Systems of Record as well.
Summary of Key Takeaways:

High-performing technology organizations prioritize continuous improvement and learning, often adopting practices inspired by Lean methodologies.
Minimizing batch sizes and increasing flow through the value stream leads to faster delivery and reduced risk.
Identifying and eliminating waste in all forms is fundamental to efficiency.
Empowering independent, cross-functional, market-oriented teams is key to achieving speed and agility.
Integrating Operations into development teams fosters collaboration and shared responsibility.
Version control, automated environments, and decoupled database changes are essential technical practices.
Implementing low-risk release patterns enables frequent and safe deployments.
Comprehensive telemetry and statistical analysis are vital for understanding system behavior, detecting anomalies, and making data-driven decisions.
A culture of blameless post-mortems promotes learning from failures.
Regularly scheduling time for improvement work and simulating disruptive events build resilience and organizational learning.
Information security must be integrated into the daily work of all teams.
Different approaches may be needed for Systems of Record and Systems of Engagement, but DevOps principles are applicable to both.

Sources

The-Devops-Handbook-How-To-Create-World-Class-Agility-Reliability-And-Security-In-Technology-Organizations-978-1942788003