AWS re:Invent 2025 - Under the hood: Architecting Amazon EKS for scale and performance (CNS429)
AWS Events
Video Summary
Amazon EKS is presented as a powerful, managed Kubernetes service designed for scale and performance, particularly for demanding AI/ML workloads. The video traces EKS's evolution from its initial launch to advanced features like Auto Mode, Hybrid Nodes, and the open-source Karpenter project. A significant focus is placed on "UltraScale" clusters, capable of supporting up to 100,000 nodes, which necessitated a complete architectural rethink of etcd. This innovation allows massive AI/ML training and inference operations to run within a single cluster. EKS also introduces "Provisioned Control Plane" to offer predictable, high-performance tiers for customers who need more control than the standard auto-scaling model.
Anthropic, a leading AI company, shares their journey and motivations for adopting EKS, particularly their reliance on UltraScale clusters for training large language models like Claude. They emphasize the challenges of managing massive workloads, optimizing resource utilization, and the importance of features like parallel image pulling and advanced scheduling. Anthropic's experience underscores how EKS is enabling cutting-edge AI development by providing a robust, scalable, and flexible platform, pushing the boundaries of what's possible in cloud-native AI.
Short Highlights
- Amazon EKS has evolved over eight years; 93% of companies now run Kubernetes in production or are evaluating it.
- EKS enables AI/ML workloads by providing scalable infrastructure for data processing, model training, and inference.
- UltraScale clusters support up to 100,000 nodes, requiring significant architectural changes to etcd for performance.
- Provisioned Control Plane offers tiered performance options for predictable control plane behavior and capacity management.
- Anthropic leverages EKS, particularly UltraScale clusters, for training large language models like Claude, managing over 99% of their compute via EKS.
Key Details
Under the Hood: Amazon EKS - Architecting for Scale and Performance [0:02]
- Key Insights:
- Kubernetes has revolutionized application deployment, with 93% of companies using it in production or evaluating it.
- Managing Kubernetes at scale (hundreds of clusters) is a significant challenge.
- Amazon EKS is a fully managed, CNCF-certified Kubernetes service that offloads operational burdens.
- AWS runs tens of millions of clusters at scale, making EKS a trusted platform for reliable, scalable, and secure applications.
- EKS launched in 2018 and has evolved over eight years, adding features like managed control planes, managed add-ons, IPv6 support, Auto Mode, and Hybrid Nodes.
- The Karpenter project, an open-source cluster autoscaler donated to the CNCF, autoscales and bin-packs applications efficiently.
"Your mission is to deliver the applications and not worry about the infrastructure unless you really require to."
Accelerating AI/ML with Amazon EKS [04:42]
- Key Insights:
- EKS provides scalable infrastructure for AI/ML workloads, including autonomous vehicle development, robotics, and generative AI.
- Generative AI and large language model training on EKS leverage AWS EC2 capabilities for scalable infrastructure and inference endpoints.
- EKS enables running thousands of agents in the emerging field of agentic AI at scale.
- AI/ML workloads are characterized by vast data requirements, compute intensity, high-bandwidth low-latency networking, and parallel storage.
- Gartner predicts that by 2028, 95% of new AI deployments will use Kubernetes, up from less than 30% today.
"The architectural approach allows organizations like Anthropic to also run the AI/ML workloads on EKS."
Why Customers Choose EKS for AI/ML [07:10]
- Key Insights:
- EKS is upstream-compliant and offers a rich tooling ecosystem with vibrant open-source integrations.
- It provides unparalleled customization capabilities, allowing granular control over infrastructure and EC2 instance types.
- EKS integrates with a wide variety of AWS services across compute, storage, and networking.
- Hybrid Nodes enable running AI/ML workloads on-premises or at the edge, connecting to the EKS control plane.
- Karpenter helps achieve cost optimization through efficient management and bin-packing of compute resources.
"EKS because it's upstream uh confirment it already comes with a tooling ecosystem right you can quick start with a vibrant open source system with all of the tooling integrations."
Amazon EKS Control Plane Architecture and Resiliency [08:39]
- Key Insights:
- EKS control planes have API server instances across two Availability Zones (AZs) and etcd spread across three AZs for high availability.
- The control plane and etcd run in private VPCs for isolation.
- API server endpoints can be configured as public, private, or both, fronted by a network load balancer (see the sketch after this section).
- Deterministic resiliency is employed during failures, involving pausing updates, redirecting traffic to healthy AZs, relocating leader-elected controllers, and maintaining etcd quorum.
- EKS proactively tests failure scenarios, including packet drops and service disconnects.
"Our highest priority is what we want to aim is for the static stability even when there is a a failure and existing cluster continue to run and provide that availability for you to run in your applications and to be able to connect to your Kubernetes control plane."
EKS Control Plane Scaling Architecture [11:03]
- Key Insights:
- EKS control plane scaling is triggered by multiple signals including CPU, memory, node count, and etcd size.
- Recent improvements include parallel API server and etcd scaling, blue-green deployments, and S3 prefetching to reduce warm-up times.
- Scaling is intelligent and conservative, using progressively larger instances and smart cooldown periods (reduced to 15 minutes).
- Automatic QPS and burst rate adjustments are made for controller frameworks and schedulers.
- These optimizations help maintain 99.95% availability and reduce scaling time from 50 minutes to 10 minutes.
"This actually helps reduce the warm-up times for all of the components and the scaling process is both intelligent and conservative as well."
EKS Data Plane Options [12:18]
- Key Insights:
- EKS offers a range of data plane options: self-managed node groups, managed node groups, Auto Mode node pools, and hybrid nodes.
- Auto Mode node pools shift operational responsibility to AWS for managing instances and add-ons.
- Karpenter, a CNCF project, acts as a next-generation autoscaler supporting various instance types, GPUs, and capacity reservations for cost optimization (see the NodePool sketch after this section).
- Hybrid nodes allow leveraging on-premises or edge infrastructure connected to the EKS control plane.
"Carpenter gives you the flexibility of choosing different instance types from on demand and the spot instances."
EKS Instance Type and Silicon Innovation [13:41]
- Key Insights:
- EKS supports a broad array of EC2 instance types, including general purpose, compute-optimized, storage-optimized, high IO, and graphics-intensive.
- Choices include different processors (AWS, Intel), high memory footprints, accelerated compute, and various storage and networking options.
- EKS provides access to AWS silicon innovations like Nitro and Graviton for optimal price performance.
- Customers have flexible purchasing options: on-demand, savings plans, and spot instances.
- EKS supports various GPU instance types (e.g., g5, p5) and AWS AI accelerators like Trainium and Inferentia.
"With EKS you have access to all of the silicon innovation that is happening here at uh AWS whether it is powered by nitro or the graviton for the best price performances as well."
EKS Features Accelerating AI/ML Innovation [15:18]
- Key Insights:
- Millions of GPU-powered instances are used with Amazon EKS, doubling since the previous year.
- EKS simplifies usage with Auto Mode and Hybrid Nodes, and optimizes compute with features like parallel pulls of OCI images.
- Accelerated AMIs significantly reduce setup time for GPU and AI accelerators.
- Kubernetes DRA enables fine-grained sharing and allocation of GPU resources across multiple AI workloads.
- The Mountpoint for Amazon S3 CSI driver provides direct access to data in S3 as a local file system, reducing data-loading times (see the PersistentVolume sketch after this section).
"With the launch of the Kubernetes DRA, you can actually enable the fine grained um sharing and um allocations of the GPU resources across multiple of AI workloads and the pods."
EKS Day-2 Operations and Architectural Patterns [17:32]
- Key Insights:
- Node health and auto-repair features monitor node health, integrating with EC2 health checks and Kubernetes node conditions to replace unhealthy nodes.
- Container Insights provides comprehensive observability into EKS AI/ML workloads.
- Customers architect solutions with EKS for AI/ML tooling, Jupyter notebooks, workflow configuration, job scheduling, and model registries.
- EKS serves as a unified platform for data pipelines, model development, training, and inference, integrating with other AWS services.
- Customers implement MLOps workflows on EKS, running analysis, model development, and deployment using frameworks like Kubeflow, Ray, and MLflow.
"This is all great right but AI/ML workloads have some key challenges even after all of these building the stronger foundations and the features that we talked about."
Challenges and the Introduction of EKS UltraScale Clusters [19:42]
- Key Insights:
- AI/ML training requires massive, coordinated compute across thousands of instances with low latency and high bandwidth.
- Orchestration frameworks often struggle to work across multiple clusters, increasing operational overhead.
- Customers seek reduced operational overhead, simplified cluster management, shared governance, improved cost efficiency, and increased resource utilization.
- EKS UltraScale clusters are introduced to support up to 100,000 nodes in a single cluster, enabling the management of 800,000 NVIDIA GPUs or 1.6 million AWS Trainium chips.
"Customers are really looking for reduced operational overhead, simplified cluster management and also have the shared governance on these clusters, thus allowing them to get to the improved cost efficiency that they're looking for."
Reimagining etcd for UltraScale Clusters [22:05]
- Key Insights:
- etcd, the heart of Kubernetes configuration, was traditionally a three-node cluster using Raft consensus.
- etcd was not designed for 100,000 nodes per cluster, necessitating reimagining its architecture.
- Key changes include offloading consensus to a purpose-built multi-AZ transaction journal, moving from the disk-based BoltDB to an in-memory database (backed by tmpfs), and partitioning high-traffic keys (nodes, pods, leases, events) into dedicated etcd stores (see the routing sketch after this section).
- These changes allow for horizontal scaling of etcd and unlock orders of magnitude higher read/write throughput.
- The API semantics expected by Kubernetes remain unchanged.
"Traditional etcd was never designed to handle this volume of uh this volume of objects."
UltraScale Cluster Performance and Object Management [26:50]
- Key Insights:
- UltraScale clusters have been tested with massive AI/ML workloads, including stateful sets on 100,000 nodes and mixed-mode fine-tuning jobs.
- All these workloads run within a single UltraScale cluster, avoiding the complexity of multi-cluster coordination.
- At peak, UltraScale clusters manage tens of millions of objects, including approximately 8 million pods, 100,000 node objects, 6 million lease objects, and tens of millions of events.
- A single UltraScale cluster supports an etcd database of up to 20 GB, 2.5x that of standard EKS.
- Achieved read throughput is around 7,500 requests per second, and write throughput peaks at 8,000-9,000 requests per second, significantly outperforming traditional etcd.
"This is clearly unprecedented scale. uh traditional etcd was never designed to handle this volume of uh this volume of objects."
UltraScale Cluster Latency and Throughput Optimizations [28:56]
- Key Insights:
- Read, write, and delete requests are served within 100 milliseconds to 1 second at P99.
- List requests, which can return millions of objects, complete in 5 to 20 seconds, below the upstream SLO of 30 seconds (a paginated LIST sketch follows this section).
- Optimizations include tuning request timeouts, retry strategies, worker parallelism, and throttling rules.
- An upstream change allows serving consistent read requests from an API server cache, reducing latency by eliminating etcd round trips.
- List request encoding was optimized from batch to incremental, significantly reducing memory consumption.
"Because you can have all the throughput in the world, but if you can if you're taking seconds to serve a request, uh that will still, you know, impact your workloads."
Optimizing Data Plane and Networking for AI/ML Workloads [32:04]
- Key Insights:
- AI/ML workloads run on instance types with up to 100 Gbps network bandwidth and high IOPS/throughput EBS volumes.
- Large container images (often >5 GB) are handled efficiently using the AWS SOCI (Seekable OCI) snapshotter to parallelize download and unpacking.
- CNI changes allow a single pod to connect to all network cards on an instance, accessing the full network bandwidth.
- The combination of these features reduces the time from pod scheduling to readiness by 3x.
- Karpenter pre-assigns IP prefixes to nodes at launch, further reducing node-readiness time for scaled deployments (see the prefix-delegation sketch after this section).
"The combination of these two changes reduces the time it takes from a pod to go from being scheduled to running to having all the data it needs by 3x."
Introducing Provisioned Control Plane for EKS [34:34]
- Key Insights:
- Provisioned Control Plane allows customers to proactively select a performance tier matching their business needs, rather than relying solely on reactive scaling.
- Three new performance tiers (XL, 2XL, 4XL) offer significantly higher API request concurrency, pod scheduling rates, and database sizes than standard clusters.
- Customers can temporarily scale up to a higher tier for events and scale back down, optimizing for both performance and cost.
- Standard control plane scales automatically based on workload demand within tier levels, suitable for predictable patterns or when guaranteed capacity isn't critical.
- Provisioned Control Plane is for predictable high performance capacity, performance-critical workloads, or massively scalable AI/ML workloads.
"Provisioned control pane is when you need predictable high performance capacity. That's when provision control pane comes in."
Understanding Provisioned Control Plane Tiers and Utilization [39:01]
- Key Insights:
- Tiers are differentiated by API request concurrency (conversations with clients), pod scheduling rate (pods per second the scheduler can place), and maximum database size (up to 16 GB for etcd).
- Tier selection guidance includes considering custom operators/controllers (API concurrency), frequent large deployments (pod scheduling rate), and large clusters with many objects (database size).
- New metrics are available to monitor tier utilization in real time, including API request concurrency, pod scheduling rate, and database size (a monitoring sketch follows this section).
- Customers can proactively scale up before events, monitor metrics, and upgrade tiers on the fly if needed.
"If you find yourself running a lot of custom Kubernetes operators, controllers, I would consider looking at how you're doing with respect to API request concurrency and select a higher tier."
Anthropic's Journey with EKS and UltraScale Clusters [43:34]
- Key Insights:
- Anthropic uses EKS extensively, managing over 99% of their compute, for developing AI models like Claude.
- They emphasize scaling applications to fit within a single Kubernetes cluster rather than splitting across multiple.
- UltraScale clusters provide a single pane of glass for observability and simplify application design around Kubernetes.
- Large training jobs and image sizes (35 GB) require optimizations like parallel pulling, which significantly reduces pod-readiness time.
- Anthropic uses a custom scheduler, Cgrapher, that scales with the number of workloads, not just pods, for efficient scheduling of large training jobs.
- They focus on optimizing DNS, avoiding service meshes, and utilizing object stores with pre-fetch buffers for data access.
"I really want Kubernetes to sort of work for the application. I don't want to design my application around Kubernetes."
Anthropic's Future Needs and EKS Evolution [56:41]
- Key Insights:
- Anthropic is looking forward to controllers being sharded by namespace for better failure-domain isolation in large clusters.
- They are excited about Karpenter support for capacity reservations for GPU and Trainium workloads.
- Moving towards IPv6 for large-scale flat networks to enable services to communicate without a service mesh.
- EFS CSI with multi-attach is used for Jupyter users with multi-pod workflows requiring better consistency than code replication.
- They anticipate upstream improvements in EndpointSlices and CoreDNS multi-core capabilities.
"Seeing that get name spaced out getting sharded out is going to give me the failure domains and the properties that I care about."
Closing Thoughts and Resources [58:18]
- Key Insights:
- EKS has become a trusted foundation for running battle-tested clusters, with UltraScale clusters serving specialized workloads.
- Provisioned Control Plane makes UltraScale learnings accessible to all EKS customers for optimized performance and cost.
- Resources like user guides, monthly workshops, and the AI on EKS website are available for customers.
- A 500-level chalk talk provides deep technical details on EKS and the transaction journal architecture.
"EKS has become the foundation trusted way to run clusters at you know battle tested reliability went to ultra scale like our sort of motivations behind why we build that for specialized workloads like the ones that Anthropic is running and then really like how we're making all of that available to you through through provision control plane."