Lecture 1
From fundamentals to the transformer architecture, and a plan for building one from scratch
Welcome everyone. Today we're diving into Large Language Models. This lecture sets the stage for everything we'll build throughout the course. We're going to cover what LLMs actually are, why they matter, and how they work at a high level. By the end, you'll have a mental map of the entire pipeline, from raw text data all the way to a fine-tuned model that can do useful things. Think of this as the bird's-eye view before we zoom in and start coding in the lectures that follow. Let's get started.
This is our roadmap. We have three big themes today. First, high-level explanations of the fundamental concepts behind LLMs: what they are and where they fit in the AI landscape. Second, insights into the transformer architecture, which is the engine under the hood of virtually every modern LLM. And third, a plan for building an LLM from scratch, the actual roadmap we'll follow through the rest of this course. Keep this table of contents in mind as we go; every section builds on the previous one. By the end of this lecture, you should be able to explain to someone what an LLM is, why transformers matter, and what the training pipeline looks like.
Section 1.1
Let's start at the very beginning. Before we can build one, we need to understand what a large language model actually is β and just as importantly, what it isn't. This section grounds us in the key definitions and shows where LLMs sit in the broader landscape of AI.
A neural network designed to understand, generate, and respond to human-like text, trained on massive amounts of text data.
Here's our working definition. An LLM is a neural network designed to understand, generate, and respond to human-like text. Three things to notice here. First, "large" refers to both the model size (we're talking tens or hundreds of billions of parameters) and the dataset size, sometimes encompassing huge portions of the publicly available internet. Second, the core training objective is deceptively simple: predict the next word. That's it. This harnesses the sequential nature of language to learn context, structure, and relationships. It surprises many researchers that such a simple task produces such capable models. Third, these models are built on the transformer architecture, which we'll explore in depth later. The transformer lets the model pay selective attention to different parts of the input, and that's what makes it so good at handling the nuances of human language.
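To see just how simple that objective is, here's a minimal sketch in plain Python, using a made-up sentence, of how a single piece of raw text yields a whole series of context-to-next-word training examples.

```python
# A minimal sketch of the next-word-prediction objective on a toy sentence.
# The example text is made up; real LLMs work on subword token IDs, not words.
text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# Each training example pairs a context with the single word that follows it.
for i in range(1, len(words)):
    context, target = words[:i], words[i]
    print(f"{' '.join(context):40s} -> {target}")
```

Every position in the text becomes a training example for free; that's the whole trick.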
LLMs are a form of Generative AI: deep neural networks that create new content (text, images, media)
Walk through this diagram from the outside in. Artificial intelligence is the broadest umbrella: it includes rule-based systems, genetic algorithms, expert systems, even fuzzy logic. Machine learning is a subset: algorithms that learn from data rather than being explicitly programmed. Deep learning narrows further to neural networks with three or more layers. And LLMs sit inside deep learning: they're a specific application of deep neural networks trained on massive text corpora. Notice the "Generative AI" label off to the side. Because LLMs generate text, they fall under generative AI, or GenAI. But GenAI is broader: it also includes image generators, music generators, and so on. The key takeaway: LLMs live at the intersection of deep learning and generative AI, and they inherit techniques and ideas from every layer of this hierarchy.
This comparison is crucial for understanding why deep learning changed everything. On the left, traditional machine learning. A human expert has to manually identify and extract the relevant features. Think about building a spam filter the old way: you'd hand-engineer features like "contains the word FREE" or "has more than three exclamation marks." It works for narrow tasks, but it doesn't scale. On the right, deep learning. The model automatically discovers which features matter, directly from the raw data. No human expert needed for feature engineering. This is why deep learning unlocked performance on complex language tasks like parsing detailed instructions, contextual analysis, and generating coherent original text. Previous approaches could classify spam just fine, but they couldn't write an email from a list of keywords, something that's trivial for today's LLMs.
Section 1.2
Now that we know what LLMs are and where they fit, let's look at what they can actually do. This section covers the practical applications that have made LLMs so transformative, and why you might want to build your own.
Translation
Machine translation between languages
Text Generation
Fiction, articles, code
Chatbots
ChatGPT, Gemini, virtual assistants
Sentiment Analysis
Understanding opinion & tone
Summarization
Condensing lengthy passages
Knowledge Retrieval
Medicine, law, specialized domains
LLMs are invaluable for automating almost any task involving parsing and generating text.
Look at the breadth here: six major application areas, and this isn't even exhaustive. Machine translation, generating novel text including fiction, articles, and code, powering chatbots and virtual assistants like ChatGPT and Gemini, sentiment analysis, summarization, and knowledge retrieval from vast document collections in specialized fields like medicine or law. What's remarkable is the contrast with earlier NLP models. Those older models were typically designed for one specific task: you'd build a separate model for translation, another for summarization, another for sentiment. LLMs demonstrate broad proficiency across all of these tasks with a single model. That versatility is what makes them so powerful and why they've ushered in a new era for NLP.
Custom LLMs (e.g., BloombergGPT for finance) can outperform general-purpose LLMs on specific tasks.
This is the motivational slide: why bother building from scratch when you can just use ChatGPT? Five reasons. First, understanding mechanics and limitations: you can't truly understand what an LLM can and can't do until you've built one. Second, domain-specific models. Research shows that custom-built LLMs tailored for specific tasks can outperform general-purpose ones. Call out BloombergGPT here: it's a real example of an LLM specialized for finance that outperforms generic models on financial tasks. Third, data privacy: many companies simply can't share sensitive data with third-party providers. Fourth, on-device deployment: smaller custom models can run directly on customer devices. And fifth, autonomy: you control the model, the data, and the update cycle. These aren't theoretical benefits; they're driving real investment in custom LLM development right now.
Section 1.3
Let's get into the how. This section lays out the actual pipeline: the stages you go through to get from a blank slate to a working, task-specific language model. This is the roadmap for our entire course.
Walk through the flow left to right. You start with raw text, and "raw" is important here: it means regular text without any labeling. No human annotator has gone through and tagged anything. The model trains on this raw text in the pretraining phase, using self-supervised learning: it generates its own labels from the structure of the data itself. The "pre" in "pretraining" tells you this is the initial phase where the model develops a broad understanding of language. What comes out is a foundation model: a general-purpose model that understands language but isn't specialized for anything yet. Then comes fine-tuning, where you take that foundation model and train it further on a smaller, labeled dataset for your specific task. The result is a task-specific model. This two-stage approach is the standard recipe across the industry.
Train on instruction-answer pairs
→ Personal assistants, chatbots
Train on text-label pairs
→ Spam filters, sentiment analysis
Two columns here, and it's important to keep them distinct. On the left, instruction fine-tuning. Your labeled dataset consists of instruction-answer pairs: "Summarize this text" paired with a good summary, "Translate this to French" paired with the translation. This is how you build a chatbot or assistant. On the right, classification fine-tuning. Your dataset is text paired with class labels: "This email is spam," "This review is positive." This is how you build classifiers. Both start from the same pretrained foundation model, but they produce very different end products. Understanding this distinction matters because it determines what kind of data you need to collect and how you structure your training pipeline.
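To make the data distinction concrete, here's a sketch of what each kind of labeled dataset might look like. The field names and examples are hypothetical; real datasets come in many formats.

```python
# Instruction fine-tuning: instruction-answer pairs.
instruction_data = [
    {"instruction": "Summarize the following text: ...", "answer": "A short summary ..."},
    {"instruction": "Translate to French: Good morning.", "answer": "Bonjour."},
]

# Classification fine-tuning: text-label pairs.
classification_data = [
    {"text": "Congratulations, you have won a FREE prize!", "label": "spam"},
    {"text": "Are we still meeting at 3pm tomorrow?", "label": "not spam"},
]
```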
Section 1.4
Now we get to the engine room. The transformer is the architecture that makes all of this possible. We'll look at where it came from, how it works, and the two major variants that emerged from it.
Vaswani et al., 2017: the architecture behind most modern LLMs
Originally designed for machine translation (English → German/French)
This is the paper that started it all: "Attention Is All You Need" by Vaswani et al., published in 2017. Before this, most sequence models relied on recurrent neural networks, which process text one token at a time. The transformer broke away from that by introducing a mechanism that could look at all positions in the input simultaneously. Two submodules to note. The encoder processes input text and encodes it into numerical representations, vectors that capture meaning. The decoder takes those encoded vectors and generates output text. The original transformer used both, repeated six times each. But as we'll see, different applications found they only needed one half. This paper is arguably the single most important paper in modern AI; everything we're building in this course traces back to it.
Self-attention allows the model to weigh the importance of different words relative to each other in a sequence.
Self-attention is the key innovation inside the transformer, and it's worth understanding intuitively before we implement it later. The mechanism allows the model to weigh the importance of different words or tokens in a sequence relative to each other. Why does this matter? Because meaning in language depends heavily on context. The word "bank" means something different in "river bank" versus "bank account." Self-attention lets the model capture these long-range dependencies and contextual relationships. When processing a word, the model looks at every other word in the sequence and asks: "How relevant is each of these to understanding the current word?" That's the core idea. We'll implement this from scratch in a later lecture, but for now, just understand that this is what gives transformers their power.
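As a preview of what we'll implement later, here's a minimal self-contained sketch of that core weighting idea using NumPy. A real transformer adds learned query, key, and value projections, multiple heads, and causal masking; this stripped-down version only shows each token being re-expressed as a weighted mix of all the tokens around it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: turns scores into weights that sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (seq_len, d) matrix of token embeddings.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # how relevant is each token to every other token
    weights = softmax(scores, axis=-1)   # each row: attention weights for one token
    return weights @ X                   # each output is a weighted mix of all tokens

# Toy example: 4 tokens with 3-dimensional embeddings.
X = np.random.rand(4, 3)
print(self_attention(X).shape)  # (4, 3)
```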
→ Text classification, sentiment, document categorization
→ Translation, summarization, fiction, code generation
From the original transformer, two major families emerged, and they went in different directions. On the left, BERT: Bidirectional Encoder Representations from Transformers. BERT uses only the encoder submodule. It's trained using masked word prediction: you hide a word in a sentence and ask the model to predict what's missing. Because it can look at context from both directions, left and right, it's "bidirectional." BERT excels at understanding tasks like classification and question answering. On the right, GPT: Generative Pre-trained Transformer. GPT uses only the decoder submodule. It's trained on next-word prediction: given a sequence, predict what comes next. It can only look left, at the preceding context. GPT excels at generation tasks. In this course, we're building a GPT-style model, so the decoder side is our focus.
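A toy example helps contrast the two training objectives. The sketch below is illustrative only (real models work on subword token IDs rather than whole words); it shows what the input and target look like for masked word prediction versus next-word prediction.

```python
sentence = "the cat sat on the mat".split()

# BERT-style masked word prediction: hide a word, predict it from BOTH sides.
masked_input = sentence.copy()
masked_input[2] = "[MASK]"          # "the cat [MASK] on the mat"
bert_target = sentence[2]           # "sat"

# GPT-style next-word prediction: see only the LEFT context, predict what follows.
gpt_input = sentence[:3]            # "the cat sat"
gpt_target = sentence[3]            # "on"

print(masked_input, "->", bert_target)
print(gpt_input, "->", gpt_target)
```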
Generalizes to tasks without any prior examples
Learns from a minimal number of examples in the input
GPT models can perform tasks they weren't explicitly trained for, enabled by massive pretraining.
This slide shows something remarkable about GPT-style models. Zero-shot learning means the model can handle a completely new task without any specific examples. You just describe the task in the prompt, and the model generalizes from its pretraining knowledge. For instance, you can ask it to translate a sentence to French even though it wasn't explicitly trained as a translation model. Few-shot learning takes it one step further: you provide a small number of examples in the prompt, and the model learns the pattern on the fly. Maybe you show it two examples of English-to-French translation, and then give it a third sentence to translate. This ability to generalize without task-specific fine-tuning was one of the most surprising and impactful discoveries in LLM research. It's what makes these models so versatile in practice.
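To make the distinction tangible, here are two illustrative prompts written as plain Python strings. The wording is made up; the point is only the shape of the prompt, zero examples versus a few in-context examples.

```python
zero_shot_prompt = (
    "Translate the following English sentence into French:\n"
    "Where is the train station?"
)

few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "Where is the train station? ->"
)
```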
Section 1.5
We've talked about architecture, but an equally important ingredient is data. This section looks at the sheer scale of data needed to train models like GPT-3 and what that costs. The numbers here are eye-opening.
| Dataset | Description | Tokens | Proportion in training mix |
|---|---|---|---|
| CommonCrawl | Web crawl data | 410B | 60% |
| WebText2 | Web crawl data | 19B | 22% |
| Books1 | Internet book corpus | 12B | 8% |
| Books2 | Internet book corpus | 55B | 8% |
| Wikipedia | High-quality text | 3B | 3% |
Total: ~499B tokens available · Model trained on 300B tokens · CommonCrawl alone ≈ 570 GB
Let's walk through this table because the numbers tell a story. CommonCrawl, filtered, contributes 410 billion tokens and makes up 60% of the training mix. That's a filtered version of a massive web crawl; even after filtering, it comes to roughly 570 GB of text. WebText2 adds 19 billion tokens at 22%: this is curated web content, and notice that the percentages are sampling proportions during training, not shares of the total token count, which is why a relatively small dataset can carry a 22% weight. Books1 and Books2 together contribute about 67 billion tokens at 16%. And Wikipedia adds 3 billion tokens at 3%. The total is roughly 499 billion tokens, but here's an interesting detail: GPT-3 was actually trained on about 300 billion tokens, so it didn't even see the full dataset once. The key takeaway: the diversity of sources matters just as much as the raw volume. Web text, books, and encyclopedic content each contribute different qualities to the model's understanding.
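If you want to sanity-check these numbers yourself, here's a quick plain-Python calculation using the figures from the table above. It also illustrates the point about sampling weight versus token share.

```python
# Token counts are in billions; "weight" is the sampling proportion during training.
datasets = {
    "CommonCrawl": {"tokens": 410, "weight": 0.60},
    "WebText2":    {"tokens": 19,  "weight": 0.22},
    "Books1":      {"tokens": 12,  "weight": 0.08},
    "Books2":      {"tokens": 55,  "weight": 0.08},
    "Wikipedia":   {"tokens": 3,   "weight": 0.03},
}

total_tokens = sum(d["tokens"] for d in datasets.values())
print(f"Total available: ~{total_tokens}B tokens")  # ~499B

# Sampling weight != token share: WebText2 is only ~4% of the tokens
# but is sampled 22% of the time, i.e. it is oversampled for quality.
print(f"WebText2 token share: {datasets['WebText2']['tokens'] / total_tokens:.1%}")
```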
Estimated cloud computing cost to pretrain GPT-3
Let that number sink in: $4.6 million just for the compute to pretrain GPT-3. And that's a 2020 estimate; it doesn't include the data collection, cleaning, researcher salaries, failed experiments, or infrastructure. This is why pretraining from scratch is something only a handful of organizations can afford. It's also why the two-stage approach matters so much: you pretrain once at enormous cost, then fine-tune many times at relatively low cost. For us in this course, we'll work with much smaller models, but understanding the scale helps you appreciate why transfer learning and fine-tuning are so important in practice. When someone says "just retrain the model," this number is the answer for why that's not trivial.
Section 1.6
Now let's zoom in on the specific architecture we'll be building. We've seen how GPT relates to the original transformer; now we'll look at exactly what makes GPT tick and why the decoder-only design works so well.
Point out the visual here: the encoder is crossed out. GPT strips away the encoder entirely and keeps only the decoder portion of the original transformer. This is a key architectural choice. The original transformer used six encoder blocks and six decoder blocks; GPT-3 scales the decoder to 96 transformer layers with 175 billion parameters. Why drop the encoder? Because for text generation, you don't need a separate encoding step: the decoder can both process the input context and generate the output in one pass. This simplification made the architecture more elegant and, as it turned out, incredibly powerful. The decoder processes tokens left to right, attending only to previous positions, which makes it naturally suited for sequential text generation.
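One nice side effect of the decoder-only design is that the whole model can be described by a handful of hyperparameters. Below is a sketch of GPT-3's reported settings written as the kind of plain Python config dictionary we'll use for our own, much smaller model later in the course; treat the values as reported figures, not something we'll reproduce.

```python
# Reported GPT-3 hyperparameters (decoder-only; roughly 175B parameters in total).
GPT3_CONFIG = {
    "vocab_size": 50257,      # BPE vocabulary (same tokenizer family as GPT-2)
    "context_length": 2048,   # maximum number of tokens attended to at once
    "emb_dim": 12288,         # embedding / hidden dimension
    "n_heads": 96,            # attention heads per layer
    "n_layers": 96,           # stacked decoder-style transformer blocks
}
```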
Here's the central paradox of LLMs: the training task is absurdly simple, just predict the next word. That's it. But from this simple objective emerge remarkable capabilities. Why does it work? Because next-word prediction is a form of self-supervised learning. You don't need any human-labeled data. The "label" for each training example is simply the next word in the text. The structure of language itself provides the supervision signal. To accurately predict what comes next, the model has to learn grammar, facts, reasoning patterns, style, and even elements of common sense. It's all baked into the sequential structure of language. This is one of the deepest insights in modern AI: a simple objective applied at massive scale can produce emergent intelligence.
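Here's what "the label is simply the next word" looks like in code: a minimal sketch with made-up token IDs, where the targets are nothing more than the inputs shifted by one position.

```python
token_ids = [464, 2068, 7586, 21831, 18045]   # an encoded sentence (IDs are illustrative)

inputs  = token_ids[:-1]    # [464, 2068, 7586, 21831]
targets = token_ids[1:]     # [2068, 7586, 21831, 18045]

# At each position, the model is trained to predict the token that comes next.
for x, y in zip(inputs, targets):
    print(f"given ...{x}  predict {y}")
```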
Each output becomes part of the input for the next prediction, one word at a time.
Walk through each step here; this makes the generation process concrete. Step one: the model receives a prompt and predicts the most likely next token. Step two: that predicted token is appended to the input, and the whole sequence is fed back into the model. Step three: the model predicts the next token given the now-longer sequence. Step four: repeat. This is what "autoregressive" means: each prediction depends on all previous predictions. The model generates one token at a time, always conditioning on everything that came before. It's sequential, it's iterative, and it's how every GPT-style model generates text. When you see ChatGPT streaming words one by one, this is exactly what's happening under the hood.
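Here's a minimal sketch of that loop. A real LLM returns a probability distribution over the vocabulary at each step; the fake_next_token stand-in below just fabricates a token ID so the loop structure is runnable on its own.

```python
def fake_next_token(context):
    # Placeholder for "take the argmax of the model's logits".
    return max(context) + 1

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = fake_next_token(tokens)  # step 1: predict the next token
        tokens.append(next_token)             # step 2: append it to the input
    return tokens                             # steps 3-4: repeat on the longer sequence

print(generate([5, 9, 2], max_new_tokens=4))  # [5, 9, 2, 10, 11, 12, 13]
```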
This is one of the most fascinating aspects of LLMs. Emergent behaviors are capabilities the model wasn't explicitly trained for. For example, GPT models can perform translation, even though they were trained purely on next-word prediction, not on parallel translation corpora. How? Because the pretraining data naturally contains multilingual text and implicit translation patterns. The model picks up on these patterns without being told to. This is what we mean by "emergent": the behaviors emerge from scale and data diversity, not from explicit training. It's worth pausing here because this has profound implications: we can't always predict in advance what a sufficiently large model will be capable of. This is both exciting and a source of ongoing research and debate.
Alternative architectures aim to improve computational efficiency, but whether they can match transformer-based LLMs remains to be seen.
This is a common misconception worth clearing up. Three key points. First, not all transformers are LLMs: the transformer architecture is also used in computer vision, protein folding, and other domains. Second, not all LLMs are transformers: some LLMs are built on recurrent or convolutional architectures, though transformers dominate today. Third, "LLM" refers to the scale and application, while "transformer" refers to the architecture. You can have a small transformer model that nobody would call an LLM, and you could theoretically have an LLM built on a different architecture. Precision in terminology matters when you're discussing these systems, especially in technical contexts.
Section 1.7
We've covered the what and why. Now let's preview the how: the three stages we'll work through in the rest of this course. This is your roadmap going forward.
Data preparation & sampling, attention mechanism, LLM architecture
Training loop, model evaluation, loading pretrained weights
Classification model or personal assistant from labeled data
Three stage cards to walk through. Stage 1 is Foundation, where we build the building blocks: data preparation, tokenization, the attention mechanism, and the LLM architecture itself. We'll code all of this from scratch. Stage 2 is Pretraining: implementing the training loop, evaluating the model, and understanding how to load pretrained weights when training from scratch isn't practical. Stage 3 is Fine-tuning: taking a pretrained model and specializing it, either as a classification model or as a personal assistant. Notice the flow connects these stages: you can't fine-tune without pretraining, and you can't pretrain without the foundational components. Each stage builds on the previous one, and each corresponds to upcoming lectures in this course. This is the journey we're embarking on together.
Let's crystallize what we've covered. One: LLMs have transformed the field of natural language processing; they represent a genuine paradigm shift. Two: modern LLMs are trained in two steps, pretraining on unlabeled text and then fine-tuning on labeled data for specific tasks. Three: the transformer architecture with its self-attention mechanism is the foundation; this is what made the breakthrough possible. Four: large, diverse datasets are essential; quality and diversity matter as much as raw volume. Five: despite being trained on the simple task of next-word prediction, LLMs exhibit emergent capabilities they weren't explicitly trained for. And one bonus insight: fine-tuned LLMs can outperform general-purpose LLMs on specific tasks, which is why building your own still matters. Keep these five points, plus the bonus, in mind; they're the thread that connects everything we'll do from here on out.
Lecture 1: Complete
Next: Lecture 2, Working with Text Data
And that wraps up Lecture 1. We've covered a lot of ground: from definitions and applications to the transformer architecture, training data at scale, the GPT architecture specifically, and the three-stage roadmap for building an LLM from scratch. In the next lecture, we'll roll up our sleeves and start with the first practical step: working with text data and building a tokenizer. That's where the coding begins. See you in Lecture 2.