
Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks
Stanford Online
Video Summary
This lecture covers key advancements and variations in transformer architectures, focusing on self-attention mechanisms and their evolution. It begins by recapping the concept of self-attention, its mathematical formulation, and the transformer architecture's encoder-decoder structure. The discussion then delves into the critical role of positional embeddings in transformers, explaining their necessity due to the loss of sequential information and exploring two methods: learned embeddings and fixed sinusoidal embeddings, highlighting the advantages of the latter for generalization.
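For reference, here is a minimal NumPy sketch of the fixed sinusoidal scheme described above: each position is mapped to a vector of sines and cosines at geometrically spaced frequencies, with no learned parameters. The function name and shapes are illustrative, not taken from the lecture.

```python
# Minimal sketch of fixed sinusoidal positional embeddings (Vaswani et al., 2017).
# Each position gets a d_model-dimensional vector built from sines and cosines
# of geometrically spaced frequencies; nothing here is learned.
import numpy as np

def sinusoidal_positional_embeddings(max_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_embeddings(max_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Because the table is a fixed function of position, it can be evaluated at positions longer than any sequence seen during training, which is the generalization advantage the lecture attributes to sinusoidal over learned embeddings.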
The lecture also examines modifications to the original transformer, including positional embeddings, layer normalization, and attention mechanisms. It details how rotary position embeddings (RoPE) are employed in modern models to directly incorporate positional information into attention calculations, demonstrating their mathematical basis. Furthermore, it discusses changes in layer normalization, from post-norm to pre-norm and the use of RMS norm for efficiency, and explores variations in attention, such as sliding window attention and multi-query attention, aimed at optimizing computational complexity and memory usage.
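A minimal sketch of the RoPE idea mentioned above, assuming the common formulation in which consecutive pairs of query/key dimensions are rotated by position-dependent angles before the dot product; the names and shapes are illustrative, not a specific library API.

```python
# Minimal sketch of rotary position embeddings (RoPE): rotate each consecutive
# pair of dimensions of a query or key vector by an angle proportional to its
# position, so attention scores depend on relative offsets.
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate pairs of dimensions of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    freqs = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # split into (even, odd) pairs
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = apply_rope(np.random.randn(16, 64), np.arange(16))
k = apply_rope(np.random.randn(16, 64), np.arange(16))
```

Since queries and keys at positions m and n are rotated by angles proportional to m and n, their dot product depends only on the offset m - n, which is how RoPE injects relative distance directly into the attention calculation.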
Finally, the lecture introduces encoder-only architectures, exemplified by BERT, emphasizing its bidirectional nature and its suitability for classification tasks. It elaborates on BERT's pre-training objectives (Masked Language Model and Next Sentence Prediction) and fine-tuning process, along with its limitations such as context length and latency. It also touches upon distillation and RoBERTa as methods to address these limitations, showcasing the ongoing evolution and optimization of transformer models.
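As an illustration of the MLM objective, here is a minimal sketch of BERT-style input corruption, assuming the standard recipe (select roughly 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and keep 10% unchanged). The token IDs are placeholders rather than a real tokenizer's vocabulary.

```python
# Minimal sketch of masked language model (MLM) input corruption in the BERT style.
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # illustrative placeholder IDs

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                                # model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```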
Short Highlights
- Positional Embeddings: Essential for transformers to retain position information, with sinusoidal embeddings offering better generalization than learned ones.
- Layer Normalization: Evolution from post-norm to pre-norm, and RMS norm for improved training stability and efficiency (a minimal RMS norm sketch follows this list).
- Rotary Position Embeddings (RoPE): A modern approach to inject positional information directly into attention calculations, ensuring relative distance is captured.
- BERT Architecture: An encoder-only model focused on bidirectional representations, trained with Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives.
- Model Variations: Techniques like sliding window attention, multi-query attention, distillation, and RoBERTa aim to optimize performance, reduce complexity, and improve efficiency.
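A minimal RMS norm sketch, assuming the standard formulation: normalize by the root mean square of the activations with a learned per-dimension gain, and no mean subtraction or bias, which is what makes it cheaper than LayerNorm.

```python
# Minimal sketch of RMSNorm over the last (hidden) dimension.
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)  # root mean square per vector
    return gain * x / rms                                         # rescale with learned gain

x = np.random.randn(4, 512)        # (batch, hidden)
gain = np.ones(512)                # learned per-dimension scale
print(rms_norm(x, gain).shape)     # (4, 512)
```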
Key Details
Lecture Logistics and Recap [0:05]
- Audio quality of the previous lecture was suboptimal.
- The final exam date is a placeholder and might be moved earlier in the week.
- Lecture one introduced the concept of self-attention, where each token attends to all others using queries, keys, and values.
- The self-attention mechanism can be expressed by the formula softmax(QK^T / sqrt(d_k)) V (see the sketch after this list).
- The Transformer architecture, composed of an encoder and decoder, was initially developed for machine translation.
- Multi-head attention allows the model to learn different ways of projecting inputs into queries, keys, and values.
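A minimal NumPy sketch of the scaled dot-product attention formula above, softmax(QK^T / sqrt(d_k)) V, for a single head and without masking:

```python
# Minimal sketch of scaled dot-product attention for one head.
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V                                        # weighted sum of value vectors

Q = np.random.randn(5, 64); K = np.random.randn(5, 64); V = np.random.randn(5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)            # (5, 64)
```

Multi-head attention repeats this computation with several learned projections of the inputs into queries, keys, and values, then concatenates the per-head outputs.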