Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 3 - Transformers & Large Language Models

Stanford Online

Video Summary

This lecture introduces Large Language Models (LLMs), building upon previous discussions of self-attention and transformer architectures. LLMs are defined as language models that predict the next token, distinguished by their immense scale in terms of model parameters (billions), training data (hundreds of billions to trillions of tokens), and computational requirements. While earlier models like BERT are encoder-only and do not produce text, modern LLMs are typically decoder-only and excel at text-to-text tasks.

A significant advancement discussed is the Mixture of Experts (MoE) architecture. Inspired by the idea of activating only the experts relevant to a given input, it increases model capacity without proportionally increasing the computational cost of each inference. An MoE layer uses a gating network to route the input to specialized "expert" networks, typically the feed-forward networks inside the transformer block. Sparse MoEs, which activate only the top-K experts per token, are particularly efficient. The lecture also covers the nuances of response generation, contrasting greedy decoding, beam search, and sampling, with temperature emerging as the key parameter for controlling output diversity and creativity.
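
To make the routing idea concrete, here is a minimal sketch of a sparse MoE layer, assuming a PyTorch-style setup; the class and hyperparameter names (SparseMoE, num_experts, top_k) are illustrative, not code from the lecture. A gating network scores all experts for each token, only the top-K experts run, and their outputs are combined with renormalized gate weights.

```python
# Minimal sketch of a sparse Mixture-of-Experts (MoE) layer.
# Names and hyperparameters are illustrative assumptions, not the lecture's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an ordinary feed-forward block of the transformer layer.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # Gating network: scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.gate(x)                  # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run, so compute per token stays roughly
        # constant even as the total number of experts (capacity) grows.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

For example, `SparseMoE()(torch.randn(10, 512))` returns a tensor of the same shape, with each token having passed through only its two selected experts.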

The discussion then moves to techniques for optimizing LLM inference. These include KV caching to store and reuse the keys and values of previously processed tokens, Grouped-Query Attention (GQA) to reduce memory bandwidth by sharing key/value heads across groups of query heads, and memory-management strategies like paged attention to handle large contexts efficiently. Further optimizations shrink the attention memory footprint through multi-head latent attention (MLA). Finally, advanced inference techniques such as speculative decoding and multi-token prediction are introduced: they use a smaller draft model or additional prediction heads to propose several tokens at once, which the main model then verifies, accelerating generation while aiming to preserve output quality.
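
As an illustration of KV caching, the sketch below (plain NumPy, a single attention head, made-up projection matrices) appends the new token's key and value to a growing cache at every decoding step, so the projections of past tokens are computed once and reused rather than recomputed.

```python
# Minimal sketch of KV caching during autoregressive decoding, using NumPy.
# Shapes and weight matrices are illustrative assumptions; real implementations
# cache per layer and per head.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []        # grows by one entry per generated token

def decode_step(x_t):
    """x_t: (d,) embedding of the newest token only."""
    q = x_t @ W_q
    # Compute K/V for the new token once and append; past tokens are reused
    # from the cache instead of being re-projected every step.
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)        # (t, d)
    V = np.stack(v_cache)        # (t, d)
    scores = K @ q / np.sqrt(d)  # (t,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ V              # attention output for the new position

for step in range(5):            # simulate 5 decoding steps
    out = decode_step(rng.standard_normal(d))
print(out.shape)                 # (16,)
```

With the cache, each step only attends over the stored entries instead of re-running the whole prefix through the key/value projections; GQA and multi-head latent attention then reduce how much of this cache has to be kept in memory.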

Short Highlights

  • Introduces Large Language Models (LLMs) as scaled-up, decoder-only transformers for text generation.
  • Explains Mixture of Experts (MoE) architecture for increased capacity and efficiency.
  • Details response generation strategies: greedy decoding, beam search, and sampling with temperature (see the sketch after this list).
  • Covers inference optimizations: KV caching, Grouped-Query Attention (GQA), paged attention, and multi-head latent attention (MLA).
  • Discusses advanced generation techniques like speculative decoding and multi-token prediction for faster inference.
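
As referenced in the sampling bullet above, here is a small sketch (toy logits and vocabulary, NumPy) contrasting greedy decoding with temperature sampling: the temperature divides the logits before the softmax, so T < 1 sharpens the distribution toward the greedy choice and T > 1 flattens it for more diverse output.

```python
# Minimal sketch contrasting greedy decoding with temperature sampling over a
# toy next-token distribution. The logits and vocabulary are made up for illustration.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.5, 0.3, -1.0])       # scores for 4 candidate tokens
vocab = ["the", "a", "cat", "quantum"]

# Greedy decoding: always take the highest-scoring token (deterministic).
greedy_token = vocab[int(np.argmax(logits))]

# Temperature sampling: divide logits by T before the softmax.
# T < 1 sharpens the distribution (more deterministic),
# T > 1 flattens it (more diverse / "creative").
rng = np.random.default_rng(0)
for T in (0.5, 1.0, 2.0):
    probs = softmax(logits / T)
    sampled = vocab[rng.choice(len(vocab), p=probs)]
    print(f"T={T}: probs={np.round(probs, 3)} sample={sampled}")
print("greedy:", greedy_token)
```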

Key Details

Introduction to LLMs and Transformer Architectures [0:05]

  • The lecture introduces Large Language Models (LLMs) as a key topic, building on previous concepts like self-attention and transformers.
  • LLMs are defined as language models that predict the probability of the next token in a sequence.
  • Key characteristics of LLMs include their large model size (billions of parameters), massive training data (hundreds of billions to trillions of tokens), and significant computational requirements.
  • Existing transformer-based models are categorized into three main types: encoder-decoder, encoder-only (like BERT), and decoder-only.
  • Modern LLMs are primarily decoder-only and perform text-to-text tasks.

A large language model is a language model. So, a language model is a model that assigns probability to sequences of tokens.
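
In standard notation, assigning probability to a sequence of tokens and predicting the next token are two views of the same factorization:

```latex
P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P\big(x_t \mid x_1, \ldots, x_{t-1}\big)
```

Training maximizes the probability of each observed next token given its prefix, and generation samples or selects the next token from the same conditional distribution.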
