#5 DevOps

Culture of Experimentation: Risk-Taking and Organizational Learning

Strategies for fostering innovation, implementing architectural safety nets, and converting failures into improvements, with case studies from tech giants and scaling startups.

Episode length: 15:18. Source: The DevOps Handbook (book).

FAQ
What is the "improvement kata" and how does it relate to the Toyota Production System?
The "improvement kata," described by Mike Rother in "Toyota Kata," is a core practice that Rother codified from observing the Toyota Production System. It is a structured, habitual routine for daily improvement work: establishing a desired future state, setting weekly target outcomes, and continually improving daily work. According to Rother, it is the consistent daily practice of this routine that drives superior outcomes, as it did at Toyota. His research suggested that the Lean community had missed this fundamental practice, focusing on toolkits rather than on the underlying cultural routine of continuous improvement.

How does using small batch sizes impact the speed and efficiency of work, as exemplified by the brochure folding exercise?
The difference between large and small batch sizes is dramatic. In a large batch strategy, each step of a process (folding, inserting, sealing, and stamping brochures) is completed for the entire batch before the next step begins, so the first finished item appears only near the end of the whole run. In a small batch strategy, or "single-piece flow," all steps are completed for one item before the next item is started. This sharply reduces the lead time for the first finished item and keeps less work in progress. It also surfaces problems sooner: a mistake in an early step is discovered after one item rather than after the whole batch, which means quicker feedback and less rework.
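The arithmetic behind the exercise can be sketched in a few lines of Python. The figures used here (ten brochures, four steps, ten seconds per step, a single worker) are illustrative assumptions rather than numbers stated above, and the function names are hypothetical.

```python
def first_item_lead_time(step_seconds, batch_size):
    """Seconds until the first fully finished item leaves the process.

    Assumes one worker who completes each step for the whole batch
    before starting the next step (batch_size=1 is single-piece flow).
    """
    *earlier_steps, last_step = step_seconds
    # Every earlier step is done for the whole batch first; the first item
    # is finished as soon as it alone has passed through the last step.
    return sum(earlier_steps) * batch_size + last_step


def total_working_time(step_seconds, item_count):
    """Total seconds of work for all items; independent of batch size."""
    return sum(step_seconds) * item_count


# Assumed figures: fold, insert, seal, stamp at 10 seconds each, 10 brochures.
steps = [10, 10, 10, 10]

print(first_item_lead_time(steps, batch_size=10))  # large batch: 310
print(first_item_lead_time(steps, batch_size=1))   # single-piece flow: 40
print(total_working_time(steps, item_count=10))    # 400 in both cases
```

In this simplified single-worker model the total working time is identical for both strategies. What shrinks is the wait for the first finished item, which is also the moment a mistake in any step first becomes visible.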

What are "systems of record" and "systems of engagement" within the context of enterprise IT, and why is this distinction important?
Gartner research popularized the concept of "bimodal IT," which broadly categorizes enterprise services. "Systems of record" are the core business systems, such as ERP, HR, and financial reporting, where the accuracy and integrity of transactions and data are paramount. "Systems of engagement" are customer-facing or employee-facing systems, like e-commerce platforms and productivity applications. The distinction matters because the two types usually have different requirements for agility, reliability, and security: systems of record tend to change slowly and often carry regulatory and compliance obligations, while systems of engagement change much faster to support rapid feedback and experimentation. These differences influence how each type is developed, deployed, and managed.

How does adopting a "market-oriented" team structure, as opposed to a "functional orientation," contribute to achieving DevOps outcomes?
Functional orientation optimizes for cost by grouping individuals with similar skills, often leading to handoffs between teams and slower delivery. Market orientation, on the other hand, optimizes for speed by creating small, cross-functional, and independent teams (often called service teams or feature teams). These teams are responsible for the entire lifecycle of a service, from conception to retirement, including development, testing, security, deployment, and support. This autonomy allows them to deliver value to the customer more quickly and safely.

What is the significance of version control systems in a technology organization, and how do they support DevOps practices?
Using a version control system is increasingly treated as mandatory for developers and teams. Such a system records changes to files, so that past revisions can be stored, compared, merged, and restored. In the context of DevOps, version control is critical: it provides a single source of truth for code and configuration, reduces risk by making rollback easy, and underpins automated environment builds and continuous integration pipelines. The consistent use of version control is seen as a predictor of organizational performance.

How can an organization utilize production telemetry to improve their systems and respond to issues effectively?
Telemetry involves the automated collection and transmission of data from applications and environments. Production telemetry is essential for confirming that services are operating correctly and, when problems occur, for quickly identifying what is going wrong, ideally before customers are affected. By collecting and analyzing data at the business, application, and infrastructure levels, organizations can gain insight into system behavior, anticipate problems, and make informed decisions. Tools like StatsD and Graphite simplify instrumentation: developers can add a metric with very little code and visualize it alongside production changes.
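To make the instrumentation step concrete, here is a minimal sketch of emitting StatsD-style metrics from Python using only the standard library. It relies on the plain-text StatsD line format (`name:value|type`, sent over UDP, conventionally to port 8125). The metric names, helper functions, and host are hypothetical, and a real service would more likely use an existing StatsD client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: 'c' = counter, 'ms' = timer, 'g' = gauge."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line as a UDP datagram (fire and forget).

    host and port are assumptions: a StatsD daemon on the local
    machine listening on its conventional default port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
login_counter = format_metric("login.success", 1, "c")
checkout_timer = format_metric("checkout.duration", 320, "ms")

print(login_counter)   # login.success:1|c
print(checkout_timer)  # checkout.duration:320|ms
```

With a StatsD daemon running, calling `send_metric(login_counter)` at the point where a login succeeds would increment the counter. Because UDP is fire and forget, the instrumented code does not block or fail if the daemon is unavailable.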

How can statistical analysis, such as using means and standard deviations, help in detecting potential problems within systems based on telemetry data?
Statistical analysis of telemetry data can be valuable for anticipating problems. While simple methods like setting alerts based on a fixed number of standard deviations from the mean can be a starting point, it's crucial to understand the distribution of the data. If the data does not have a Gaussian (normal) distribution, a simple standard deviation rule can lead to excessive alerting. Techniques that account for non-Gaussian distributions or periodic/seasonal patterns (common in user data) can provide more accurate anomaly detection, helping to identify situations that deviate from historical norms and potentially indicate issues.
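The standard deviation rule, and one simple way to account for periodic patterns, can be sketched as follows. This is an illustration under stated assumptions, not a recommended detector: the threshold of three standard deviations, the function names, the per-hour grouping, and the sample data are all hypothetical.

```python
import statistics


def is_anomalous(history, value, k=3.0):
    """True if value lies more than k standard deviations from the mean.

    Only meaningful when the history is roughly Gaussian.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(value - mean) > k * stdev


def is_anomalous_for_slot(history_by_slot, slot, value, k=3.0):
    """Same rule, but compared only against history from the same
    periodic slot (for example, the same hour of the day)."""
    return is_anomalous(history_by_slot[slot], value, k)


# Hypothetical requests-per-minute samples: quiet at 03:00, busy at 15:00.
history_by_hour = {
    3: [10, 12, 8, 11, 9, 10, 13, 7],
    15: [100, 102, 98, 101, 99, 100, 103, 97],
}
pooled = history_by_hour[3] + history_by_hour[15]

# 100 requests/minute at 03:00 is ten times the usual load for that hour.
print(is_anomalous(pooled, 100))                       # False
print(is_anomalous_for_slot(history_by_hour, 3, 100))  # True
print(is_anomalous_for_slot(history_by_hour, 15, 100)) # False
```

Pooling all hours produces a two-peaked, non-Gaussian history whose standard deviation is so large that the unusual night-time spike goes unnoticed. Comparing each value only with its own time slot restores the sensitivity of the rule.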

What are some examples of rituals or practices organizations can institutionalize to encourage organizational learning and pay down technical debt?
Organizations can institutionalize rituals to foster learning and address technical debt. Examples include "improvement blitzes" or "kaizen blitzes": dedicated, concentrated periods (often several days) in which teams set feature work aside and focus intently on a specific problem area, such as a problematic section of code or an environment issue. These blitzes can involve cross-functional teams, which promotes collaboration and knowledge sharing. Another practice is conducting "blameless post-mortems" after incidents. These reviews focus on understanding the systemic factors that contributed to the issue rather than on blaming individuals, and they are crucial for creating a learning culture and preventing recurrence. Google's Disaster Recovery Testing (DiRT) program is an example of deliberately simulating failure to build resilience and learn.

Sources

The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. ISBN 978-1942788003.