AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker

Ryan Peterman

96,516 views 12 days ago

Video Summary

Marc Brooker, a distinguished engineer at AWS, discusses his extensive career focused on building robust distributed systems, emphasizing the critical importance of hands-on experience and deep understanding derived from operational involvement. He highlights that true impact comes from identifying and solving significant problems, a skill honed through customer engagement, technical trend analysis, and meticulous review of post-mortems, thousands of which he has analyzed. Brooker also touches on the evolution of software engineering with AI, the enduring value of deep technical expertise, and the necessity for engineers, particularly senior ones, to remain adaptable and hands-on in the face of rapid technological change, asserting that a lack of practical engagement leads to inaccurate opinions.

One notable detail: Brooker has analyzed between 3,000 and 4,000 cloud system post-mortems and Amazon COEs over his career, underscoring his commitment to learning from failures.

Short Highlights

  • Identifying problems that matter is more impactful than the volume of work, often found by engaging broadly with customers and analyzing technical trends.
  • Spending 15 years on call provided invaluable practical knowledge for building distributed systems by deeply understanding post-mortems and COEs.
  • A robust post-mortem culture is crucial for organizational and strategic improvement, moving beyond fixing proximal causes to addressing systemic issues.
  • Caching, while beneficial for performance, carries significant risks of metastable failures if not implemented carefully, leading to system instability.
  • AI is poised to fundamentally change software engineering by potentially alleviating supply constraints, enabling greater software creation, and shifting career focus towards problem-solving and adaptability.

Key Details

The Importance of Hands-On Experience and Problem Identification [0:00]

  • True understanding and impact in engineering stem from hands-on involvement rather than purely theoretical knowledge.
  • Identifying problems that matter, by engaging with customers and analyzing technical trends, is more impactful than the volume of one's work.
  • Early career engineers should focus on understanding customer needs and business context to find impactful problems.

"If you aren't doing it hands-on, your opinion about it is very likely to be completely wrong."

Learning Through Operational Experience and Post-Mortems [0:56]

  • A realization early in a career highlighted that the direction of work and the problems addressed are more impactful than simply working more hours.
  • Identifying important problems involves going broad, listening to customer pain points, understanding their investments, and observing industry trends.
  • Technical trends like faster networks, storage, and compute (CPUs, GPUs) create opportunities for innovation.

"What would be your advice on how do you find problems that matter?"

The Role of On-Call in Deep System Understanding [5:38]

  • Spending 15 years on call provided practical knowledge in building distributed systems through analyzing post-mortems and COEs.
  • On-call is a vital way to learn how systems truly run, behave, and how customers use them, including unexpected usage patterns.
  • The goal of on-call should be deep understanding and bringing that knowledge back to improve systems and communicate learnings broadly.

"I would say that the majority of my in-practice knowledge about how to build distributed systems has come from being on call and analyzing and deeply understanding these post-mortems and COEs."

Building a Strong Post-Mortem Culture at AWS [7:53]

  • AWS uses a weekly meeting in which engineers and leaders discuss COEs (Correction of Error documents, Amazon's post-incident reports) and post-mortems to extract lessons and apply them company-wide.
  • This practice forces leadership and engineers to deeply understand system operations and their underlying causes, leading to better product design and architectural decisions.
  • Post-mortems should lead to concrete action items at tactical, organizational, and strategic levels, potentially inspiring new tools or services.

"And I think that particular mechanism, that particular Wednesday morning meeting that we have, is one of the things that has been a core, almost causal factor behind AWS's success."

The Nuances of Caching in Distributed Systems [24:26]

  • While caching leverages locality principles for performance, it introduces a problematic "mode" where the cache is empty or contains stale data, leading to system slowdowns or outages.
  • Metastable failures occur when a system shifts into a degraded state from which it cannot recover autonomously due to issues like database contention or network saturation.
  • Brooker prefers avoiding caching where possible, favoring patterns like complete materialized views or scalable backends like DSQL or DynamoDB.

"But the downside of caches, especially in distributed systems, is they have this mode, right? There's a mode where the cache is full, and the cache is full of the right data in time and space to perform very well, and there's a mode where the cache is empty or contains the wrong data."
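
The two cache modes and the resulting metastable state can be illustrated with a toy model. This is a minimal sketch, not anything shown in the talk: the class `CacheSim`, all of its parameters, and the assumption that an overloaded backend completes no requests at all are hypothetical simplifications chosen to make the effect visible.

```python
from dataclasses import dataclass


@dataclass
class CacheSim:
    """Toy model of a cache in front of a backend (all numbers are hypothetical)."""
    request_rate: float = 1000.0     # client requests arriving per tick
    backend_capacity: float = 200.0  # backend requests that can complete per tick
    max_hit_rate: float = 0.9        # best hit rate the workload's locality allows
    warm_per_success: float = 0.001  # hit-rate gain per successful backend fetch
    decay: float = 0.02              # hit-rate lost per tick as entries expire

    def step(self, hit_rate: float) -> float:
        """Advance one tick and return the new cache hit rate."""
        misses = self.request_rate * (1.0 - hit_rate)
        if misses <= self.backend_capacity:
            successes = misses       # backend keeps up, so every miss refills the cache
        else:
            successes = 0.0          # assumption: an overloaded backend times out on everything
        new_rate = hit_rate + successes * self.warm_per_success - self.decay
        return max(0.0, min(self.max_hit_rate, new_rate))


def run(sim: CacheSim, hit_rate: float, ticks: int) -> float:
    """Run the simulation for a number of ticks and return the final hit rate."""
    for _ in range(ticks):
        hit_rate = sim.step(hit_rate)
    return hit_rate


if __name__ == "__main__":
    sim = CacheSim()
    print("started warm:", run(sim, 0.9, 50))
    print("started cold:", run(sim, 0.0, 50))
```

Under these assumptions the same system with the same traffic has two stable states. Started warm, only about a tenth of requests reach the backend and the hit rate stays at its ceiling. Started cold (for example after a restart or a cache flush), every request misses, the backend is pushed past capacity, nothing completes, and the cache never refills, so the system stays degraded until something outside it intervenes.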

The Future of Software Engineering with AI [29:39]

  • Software development has historically been supply-constrained, and the economic landscape for building software is rapidly changing, offering vast opportunities for more and better software.
  • Software careers will evolve, requiring adaptation to new technologies and methodologies, with success predicated on the ability to lead and adapt to change.
  • There will be a spectrum of software practice, including niche areas in older techniques and a mainstream embracing AI-powered development for unprecedented speed and cost efficiency.

"[What] I deeply believe about software is we have only just started to see the impact that software is going to have on the world."

Balancing Doing and Discussing for Career Growth [58:02]

  • A career requires balancing "doing" (hands-on work) with "discussing" (communication and visibility).
  • Being 100% on the "doing" side leads to high expertise but can limit impact and visibility; being 100% on the "discussing" side can lead to an "apparent competence" without deep skill.
  • The optimal balance involves being deeply engaged in practice while also communicating effectively, with a personal preference leaning towards a 75-25 split favoring hands-on work.

"On the doing versus discussing axis, I kind of view the doing one as if you were too far, you would be underrated. And if you were too far on the discussing, you would be overrated."

The Value of Writing and Humility in Engineering [49:59]

  • Writing is a powerful multiplier for technical professionals, allowing ideas to be shared broadly and lastingly, forcing mental clarity and deeper thinking.
  • Distinguishing valuable documentation from busywork depends on understanding the purpose: creating artifacts for the future, capturing key decisions, or sharpening one's own thinking.
  • Embracing new tools hands-on is crucial; those who don't engage directly often have inaccurate mental models, making their opinions on these tools essentially fiction.

"Writing forces a level of mental clarity that speaking, making slide decks, etc. doesn't, and you know, that's something that has also really been my experience: sitting down to write something down forces me to think that through at a depth that I wouldn't have been forced to think it through without that."

Advice for Junior and Senior Engineers in a Changing Landscape [37:11]

  • Junior engineers need to understand customer and business context early in their careers, moving beyond simply translating requirements into code. The industry is shifting towards integrated problem-solving from the outset.
  • Senior engineers must remain hands-on, deeply understanding how software development practices have changed and leveraging their experience with new tools, rather than relying solely on influence from a distance.
  • Both junior and senior engineers must be adaptable, curious, and hands-on to thrive, with an emphasis on continuous learning and acknowledging the evolving tools and practices.

"And so really is getting back to, you know, why are you here? Why did you get into this career? And I think it really gets us as technology focused people closer to our original answer to that."

Admired Engineers and Continuous Learning [01:04:39]

  • Engineers like Al Vermeulen are admired for their ability to operate at multiple levels, from deep technical details (e.g., Paxos) to high-level strategy and communication.
  • The best engineers are not celebrities but focus on having impact, doing cool work for customers, and optimizing for engineering education.
  • Foundational knowledge in computer science, infrastructure, and networking remains critical, even with rapid industry changes, and can be leveraged more effectively with new tools.

"And I really admired that ability to work sort of almost at every level. And I was like, 'Wow, you know, this is something I aspire to.'"
