Lecture 1
From fundamentals to the transformer architecture, and a plan for building one from scratch
Welcome everyone. Today we're diving into Large Language Models. This lecture sets the stage for everything we'll build throughout the course. We're going to cover what LLMs actually are, why they matter, and how they work at a high level. By the end, you'll have a mental map of the entire pipeline, from raw text data all the way to a fine-tuned model that can do useful things. Think of this as the bird's-eye view before we zoom in and start coding in the lectures that follow. Let's get started.
This is our roadmap. We have three big themes today. First, high-level explanations of the fundamental concepts behind LLMs: what they are and where they fit in the AI landscape. Second, insights into the transformer architecture, which is the engine under the hood of virtually every modern LLM. And third, a plan for building an LLM from scratch, the actual roadmap we'll follow through the rest of this course. Keep this table of contents in mind as we go; every section builds on the previous one. By the end of this lecture, you should be able to explain to someone what an LLM is, why transformers matter, and what the training pipeline looks like.
Section 1.1
Let's start at the very beginning. Before we can build one, we need to understand what a large language model actually is β and just as importantly, what it isn't. This section grounds us in the key definitions and shows where LLMs sit in the broader landscape of AI.
A neural network designed to understand, generate, and respond to human-like text, trained on massive amounts of text data.
Here's our working definition. An LLM is a neural network designed to understand, generate, and respond to human-like text. Three things to notice here. First, "large" refers to both the model size (we're talking tens or hundreds of billions of parameters) and the dataset size, sometimes encompassing huge portions of the publicly available internet. Second, the core training objective is deceptively simple: predict the next word. That's it. This harnesses the sequential nature of language to learn context, structure, and relationships. It surprises many researchers that such a simple task produces such capable models. Third, these models are built on the transformer architecture, which we'll explore in depth later. The transformer lets the model pay selective attention to different parts of the input, and that's what makes it so good at handling the nuances of human language.
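To see just how simple that objective is, here's a minimal sketch in plain Python, using a made-up sentence, of how a single piece of raw text yields a whole series of context-to-next-word training examples.

```python
# A minimal sketch of the next-word-prediction objective on a toy sentence.
# The example text is made up; real LLMs work on subword token IDs, not words.
text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# Each training example pairs a context with the single word that follows it.
for i in range(1, len(words)):
    context, target = words[:i], words[i]
    print(f"{' '.join(context):40s} -> {target}")
```

Every position in the text becomes a training example for free; that's the whole trick.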
LLMs are a form of Generative AI: deep neural networks that create new content (text, images, media)
Walk through this diagram from the outside in. Artificial intelligence is the broadest umbrella: it includes rule-based systems, genetic algorithms, expert systems, even fuzzy logic. Machine learning is a subset: algorithms that learn from data rather than being explicitly programmed. Deep learning narrows further to neural networks with three or more layers. And LLMs sit inside deep learning: they're a specific application of deep neural networks trained on massive text corpora. Notice the "Generative AI" label off to the side. Because LLMs generate text, they fall under generative AI, or GenAI. But GenAI is broader: it also includes image generators, music generators, and so on. The key takeaway: LLMs live at the intersection of deep learning and generative AI, and they inherit techniques and ideas from every layer of this hierarchy.
This comparison is crucial for understanding why deep learning changed everything. On the left, traditional machine learning. A human expert has to manually identify and extract the relevant features. Think about building a spam filter the old way: you'd hand-engineer features like "contains the word FREE" or "has more than three exclamation marks." It works for narrow tasks, but it doesn't scale. On the right, deep learning. The model automatically discovers which features matter, directly from the raw data. No human expert needed for feature engineering. This is why deep learning unlocked performance on complex language tasks like parsing detailed instructions, contextual analysis, and generating coherent original text. Previous approaches could classify spam just fine, but they couldn't write an email from a list of keywords, something that's trivial for today's LLMs.
Section 1.2
Now that we know what LLMs are and where they fit, let's look at what they can actually do. This section covers the practical applications that have made LLMs so transformative, and why you might want to build your own.
Translation
Machine translation between languages
Text Generation
Fiction, articles, code
Chatbots
ChatGPT, Gemini, virtual assistants
Sentiment Analysis
Understanding opinion & tone
Summarization
Condensing lengthy passages
Knowledge Retrieval
Medicine, law, specialized domains
LLMs are invaluable for automating almost any task involving parsing and generating text.
Look at the breadth here: six major application areas, and this isn't even exhaustive. Machine translation, generating novel text including fiction, articles, and code, powering chatbots and virtual assistants like ChatGPT and Gemini, sentiment analysis, summarization, and knowledge retrieval from vast document collections in specialized fields like medicine or law. What's remarkable is the contrast with earlier NLP models. Those older models were typically designed for one specific task: you'd build a separate model for translation, another for summarization, another for sentiment. LLMs demonstrate broad proficiency across all of these tasks with a single model. That versatility is what makes them so powerful and why they've ushered in a new era for NLP.
Custom LLMs (e.g., BloombergGPT for finance) can outperform general-purpose LLMs on specific tasks.
This is the motivational slide: why bother building from scratch when you can just use ChatGPT? Five reasons. First, understanding mechanics and limitations: you can't truly understand what an LLM can and can't do until you've built one. Second, domain-specific models. Research shows that custom-built LLMs tailored for specific tasks can outperform general-purpose ones. Call out BloombergGPT here: it's a real example of an LLM specialized for finance that outperforms generic models on financial tasks. Third, data privacy: many companies simply can't share sensitive data with third-party providers. Fourth, on-device deployment: smaller custom models can run directly on customer devices. And fifth, autonomy: you control the model, the data, and the update cycle. These aren't theoretical benefits; they're driving real investment in custom LLM development right now.
Section 1.3
Let's get into the how. This section lays out the actual pipeline: the stages you go through to get from a blank slate to a working, task-specific language model. This is the roadmap for our entire course.
Walk through the flow left to right. You start with raw text, and "raw" is important here: it means regular text without any labeling. No human annotator has gone through and tagged anything. The model trains on this raw text in the pretraining phase, using self-supervised learning: it generates its own labels from the structure of the data itself. The "pre" in "pretraining" tells you this is the initial phase where the model develops a broad understanding of language. What comes out is a foundation model: a general-purpose model that understands language but isn't specialized for anything yet. Then comes fine-tuning, where you take that foundation model and train it further on a smaller, labeled dataset for your specific task. The result is a task-specific model. This two-stage approach is the standard recipe across the industry.
Train on instruction-answer pairs
→ Personal assistants, chatbots
Train on text-label pairs
→ Spam filters, sentiment analysis
Two columns here, and it's important to keep them distinct. On the left, instruction fine-tuning. Your labeled dataset consists of instruction-answer pairs: "Summarize this text" paired with a good summary, "Translate this to French" paired with the translation. This is how you build a chatbot or assistant. On the right, classification fine-tuning. Your dataset is text paired with class labels: "This email is spam," "This review is positive." This is how you build classifiers. Both start from the same pretrained foundation model, but they produce very different end products. Understanding this distinction matters because it determines what kind of data you need to collect and how you structure your training pipeline.
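To make the data distinction concrete, here's a sketch of what each kind of labeled dataset might look like. The field names and examples are hypothetical; real datasets come in many formats.

```python
# Instruction fine-tuning: instruction-answer pairs.
instruction_data = [
    {"instruction": "Summarize the following text: ...", "answer": "A short summary ..."},
    {"instruction": "Translate to French: Good morning.", "answer": "Bonjour."},
]

# Classification fine-tuning: text-label pairs.
classification_data = [
    {"text": "Congratulations, you have won a FREE prize!", "label": "spam"},
    {"text": "Are we still meeting at 3pm tomorrow?", "label": "not spam"},
]
```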
Section 1.4
Now we get to the engine room. The transformer is the architecture that makes all of this possible. We'll look at where it came from, how it works, and the two major variants that emerged from it.
Vaswani et al., 2017: the architecture behind most modern LLMs
Originally designed for machine translation (English → German/French)
This is the paper that started it all: "Attention Is All You Need" by Vaswani et al., published in 2017. Before this, most sequence models relied on recurrent neural networks, which process text one token at a time. The transformer broke away from that by introducing a mechanism that could look at all positions in the input simultaneously. Two submodules to note. The encoder processes input text and encodes it into numerical representations, vectors that capture meaning. The decoder takes those encoded vectors and generates output text. The original transformer used both, repeated six times each. But as we'll see, different applications found they only needed one half. This paper is arguably the single most important paper in modern AI; everything we're building in this course traces back to it.
Self-attention allows the model to weigh the importance of different words relative to each other in a sequence.
Self-attention is the key innovation inside the transformer, and it's worth understanding intuitively before we implement it later. The mechanism allows the model to weigh the importance of different words or tokens in a sequence relative to each other. Why does this matter? Because meaning in language depends heavily on context. The word "bank" means something different in "river bank" versus "bank account." Self-attention lets the model capture these long-range dependencies and contextual relationships. When processing a word, the model looks at every other word in the sequence and asks: "How relevant is each of these to understanding the current word?" That's the core idea. We'll implement this from scratch in a later lecture, but for now, just understand that this is what gives transformers their power.
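As a preview of what we'll implement later, here's a minimal self-contained sketch of that core weighting idea using NumPy. A real transformer adds learned query, key, and value projections, multiple heads, and causal masking; this stripped-down version only shows each token being re-expressed as a weighted mix of all the tokens around it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: turns scores into weights that sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (seq_len, d) matrix of token embeddings.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # how relevant is each token to every other token
    weights = softmax(scores, axis=-1)   # each row: attention weights for one token
    return weights @ X                   # each output is a weighted mix of all tokens

# Toy example: 4 tokens with 3-dimensional embeddings.
X = np.random.rand(4, 3)
print(self_attention(X).shape)  # (4, 3)
```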
→ Text classification, sentiment, document categorization
→ Translation, summarization, fiction, code generation
From the original transformer, two major families emerged, and they went in different directions. On the left, BERT: Bidirectional Encoder Representations from Transformers. BERT uses only the encoder submodule. It's trained using masked word prediction: you hide a word in a sentence and ask the model to predict what's missing. Because it can look at context from both directions, left and right, it's "bidirectional." BERT excels at understanding tasks like classification and question answering. On the right, GPT: Generative Pre-trained Transformer. GPT uses only the decoder submodule. It's trained on next-word prediction: given a sequence, predict what comes next. It can only look left, at the preceding context. GPT excels at generation tasks. In this course, we're building a GPT-style model, so the decoder side is our focus.
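A toy example helps contrast the two training objectives. The sketch below is illustrative only (real models work on subword token IDs rather than whole words); it shows what the input and target look like for masked word prediction versus next-word prediction.

```python
sentence = "the cat sat on the mat".split()

# BERT-style masked word prediction: hide a word, predict it from BOTH sides.
masked_input = sentence.copy()
masked_input[2] = "[MASK]"          # "the cat [MASK] on the mat"
bert_target = sentence[2]           # "sat"

# GPT-style next-word prediction: see only the LEFT context, predict what follows.
gpt_input = sentence[:3]            # "the cat sat"
gpt_target = sentence[3]            # "on"

print(masked_input, "->", bert_target)
print(gpt_input, "->", gpt_target)
```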
Generalizes to tasks without any prior examples
Learns from a minimal number of examples in the input
GPT models can perform tasks they weren't explicitly trained for, enabled by massive pretraining.
This slide shows something remarkable about GPT-style models. Zero-shot learning means the model can handle a completely new task without any specific examples. You just describe the task in the prompt, and the model generalizes from its pretraining knowledge. For instance, you can ask it to translate a sentence to French even though it wasn't explicitly trained as a translation model. Few-shot learning takes it one step further: you provide a small number of examples in the prompt, and the model learns the pattern on the fly. Maybe you show it two examples of English-to-French translation, and then give it a third sentence to translate. This ability to generalize without task-specific fine-tuning was one of the most surprising and impactful discoveries in LLM research. It's what makes these models so versatile in practice.
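To make the distinction tangible, here are two illustrative prompts written as plain Python strings. The wording is made up; the point is only the shape of the prompt, zero examples versus a few in-context examples.

```python
zero_shot_prompt = (
    "Translate the following English sentence into French:\n"
    "Where is the train station?"
)

few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "Where is the train station? ->"
)
```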
Section 1.5
We've talked about architecture, but an equally important ingredient is data. This section looks at the sheer scale of data needed to train models like GPT-3 and what that costs. The numbers here are eye-opening.
| Dataset | Description | Tokens | Proportion in training mix |
|---|---|---|---|
| CommonCrawl | Web crawl data | 410B | 60% |
| WebText2 | Web crawl data | 19B | 22% |
| Books1 | Internet book corpus | 12B | 8% |
| Books2 | Internet book corpus | 55B | 8% |
| Wikipedia | High-quality text | 3B | 3% |
Total: ~499B tokens available · Model trained on 300B tokens · CommonCrawl alone ≈ 570 GB
Let's walk through this table because the numbers tell a story. CommonCrawl, filtered, contributes 410 billion tokens and makes up 60% of the training mix. That's a filtered version of a massive web crawl; even after filtering, it comes to roughly 570 GB of text. WebText2 adds 19 billion tokens at 22%: this is curated web content, and notice that the percentages are sampling proportions during training, not shares of the total token count, which is why a relatively small dataset can carry a 22% weight. Books1 and Books2 together contribute about 67 billion tokens at 16%. And Wikipedia adds 3 billion tokens at 3%. The total is roughly 499 billion tokens, but here's an interesting detail: GPT-3 was actually trained on about 300 billion tokens, so it didn't even see the full dataset once. The key takeaway: the diversity of sources matters just as much as the raw volume. Web text, books, and encyclopedic content each contribute different qualities to the model's understanding.
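If you want to sanity-check these numbers yourself, here's a quick plain-Python calculation using the figures from the table above. It also illustrates the point about sampling weight versus token share.

```python
# Token counts are in billions; "weight" is the sampling proportion during training.
datasets = {
    "CommonCrawl": {"tokens": 410, "weight": 0.60},
    "WebText2":    {"tokens": 19,  "weight": 0.22},
    "Books1":      {"tokens": 12,  "weight": 0.08},
    "Books2":      {"tokens": 55,  "weight": 0.08},
    "Wikipedia":   {"tokens": 3,   "weight": 0.03},
}

total_tokens = sum(d["tokens"] for d in datasets.values())
print(f"Total available: ~{total_tokens}B tokens")  # ~499B

# Sampling weight != token share: WebText2 is only ~4% of the tokens
# but is sampled 22% of the time, i.e. it is oversampled for quality.
print(f"WebText2 token share: {datasets['WebText2']['tokens'] / total_tokens:.1%}")
```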
Estimated cloud computing cost to pretrain GPT-3
Let that number sink in: $4.6 million just for the compute to pretrain GPT-3. And that's a 2020 estimate; it doesn't include the data collection, cleaning, researcher salaries, failed experiments, or infrastructure. This is why pretraining from scratch is something only a handful of organizations can afford. It's also why the two-stage approach matters so much: you pretrain once at enormous cost, then fine-tune many times at relatively low cost. For us in this course, we'll work with much smaller models, but understanding the scale helps you appreciate why transfer learning and fine-tuning are so important in practice. When someone says "just retrain the model," this number is the answer for why that's not trivial.
Section 1.6
Now let's zoom in on the specific architecture we'll be building. We've seen how GPT relates to the original transformer; now we'll look at exactly what makes GPT tick and why the decoder-only design works so well.
Point out the visual here: the encoder is crossed out. GPT strips away the encoder entirely and keeps only the decoder portion of the original transformer. This is a key architectural choice. The original transformer used six encoder blocks and six decoder blocks; GPT-3 scales the decoder to 96 transformer layers with 175 billion parameters. Why drop the encoder? Because for text generation, you don't need a separate encoding step: the decoder can both process the input context and generate the output in one pass. This simplification made the architecture more elegant and, as it turned out, incredibly powerful. The decoder processes tokens left to right, attending only to previous positions, which makes it naturally suited for sequential text generation.
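One nice side effect of the decoder-only design is that the whole model can be described by a handful of hyperparameters. Below is a sketch of GPT-3's reported settings written as the kind of plain Python config dictionary we'll use for our own, much smaller model later in the course; treat the values as reported figures, not something we'll reproduce.

```python
# Reported GPT-3 hyperparameters (decoder-only; roughly 175B parameters in total).
GPT3_CONFIG = {
    "vocab_size": 50257,      # BPE vocabulary (same tokenizer family as GPT-2)
    "context_length": 2048,   # maximum number of tokens attended to at once
    "emb_dim": 12288,         # embedding / hidden dimension
    "n_heads": 96,            # attention heads per layer
    "n_layers": 96,           # stacked decoder-style transformer blocks
}
```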
Here's the central paradox of LLMs: the training task is absurdly simple, just predict the next word. That's it. But from this simple objective emerge remarkable capabilities. Why does it work? Because next-word prediction is a form of self-supervised learning. You don't need any human-labeled data. The "label" for each training example is simply the next word in the text. The structure of language itself provides the supervision signal. To accurately predict what comes next, the model has to learn grammar, facts, reasoning patterns, style, and even elements of common sense. It's all baked into the sequential structure of language. This is one of the deepest insights in modern AI: a simple objective applied at massive scale can produce emergent intelligence.
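Here's what "the label is simply the next word" looks like in code: a minimal sketch with made-up token IDs, where the targets are nothing more than the inputs shifted by one position.

```python
token_ids = [464, 2068, 7586, 21831, 18045]   # an encoded sentence (IDs are illustrative)

inputs  = token_ids[:-1]    # [464, 2068, 7586, 21831]
targets = token_ids[1:]     # [2068, 7586, 21831, 18045]

# At each position, the model is trained to predict the token that comes next.
for x, y in zip(inputs, targets):
    print(f"given ...{x}  predict {y}")
```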
Each output becomes part of the input for the next prediction, one word at a time.
Walk through each step here; this makes the generation process concrete. Step one: the model receives a prompt and predicts the most likely next token. Step two: that predicted token is appended to the input, and the whole sequence is fed back into the model. Step three: the model predicts the next token given the now-longer sequence. Step four: repeat. This is what "autoregressive" means: each prediction depends on all previous predictions. The model generates one token at a time, always conditioning on everything that came before. It's sequential, it's iterative, and it's how every GPT-style model generates text. When you see ChatGPT streaming words one by one, this is exactly what's happening under the hood.
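Here's a minimal sketch of that loop. A real LLM returns a probability distribution over the vocabulary at each step; the fake_next_token stand-in below just fabricates a token ID so the loop structure is runnable on its own.

```python
def fake_next_token(context):
    # Placeholder for "take the argmax of the model's logits".
    return max(context) + 1

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = fake_next_token(tokens)  # step 1: predict the next token
        tokens.append(next_token)             # step 2: append it to the input
    return tokens                             # steps 3-4: repeat on the longer sequence

print(generate([5, 9, 2], max_new_tokens=4))  # [5, 9, 2, 10, 11, 12, 13]
```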
This is one of the most fascinating aspects of LLMs. Emergent behaviors are capabilities the model wasn't explicitly trained for. For example, GPT models can perform translation, even though they were trained purely on next-word prediction, not on parallel translation corpora. How? Because the pretraining data naturally contains multilingual text and implicit translation patterns. The model picks up on these patterns without being told to. This is what we mean by "emergent": the behaviors emerge from scale and data diversity, not from explicit training. It's worth pausing here because this has profound implications: we can't always predict in advance what a sufficiently large model will be capable of. This is both exciting and a source of ongoing research and debate.
Alternative architectures aim to improve computational efficiency, but whether they can match transformer-based LLMs remains to be seen.
This is a common misconception worth clearing up. Three key points. First, not all transformers are LLMs: the transformer architecture is also used in computer vision, protein folding, and other domains. Second, not all LLMs are transformers: some LLMs are built on recurrent or convolutional architectures, though transformers dominate today. Third, "LLM" refers to the scale and application, while "transformer" refers to the architecture. You can have a small transformer model that nobody would call an LLM, and you could theoretically have an LLM built on a different architecture. Precision in terminology matters when you're discussing these systems, especially in technical contexts.
Section 1.7
We've covered the what and why. Now let's preview the how: the three stages we'll work through in the rest of this course. This is your roadmap going forward.
Data preparation & sampling, attention mechanism, LLM architecture
Training loop, model evaluation, loading pretrained weights
Classification model or personal assistant from labeled data
Three stage cards to walk through. Stage 1 is Foundation, where we build the building blocks: data preparation, tokenization, the attention mechanism, and the LLM architecture itself. We'll code all of this from scratch. Stage 2 is Pretraining: implementing the training loop, evaluating the model, and understanding how to load pretrained weights when training from scratch isn't practical. Stage 3 is Fine-tuning: taking a pretrained model and specializing it, either as a classification model or as a personal assistant. Notice the flow connects these stages: you can't fine-tune without pretraining, and you can't pretrain without the foundational components. Each stage builds on the previous one, and each corresponds to upcoming lectures in this course. This is the journey we're embarking on together.
Let's crystallize what we've covered. One: LLMs have transformed the field of natural language processing; they represent a genuine paradigm shift. Two: modern LLMs are trained in two steps, pretraining on unlabeled text and then fine-tuning on labeled data for specific tasks. Three: the transformer architecture with its self-attention mechanism is the foundation; this is what made the breakthrough possible. Four: large, diverse datasets are essential; quality and diversity matter as much as raw volume. Five: despite being trained on the simple task of next-word prediction, LLMs exhibit emergent capabilities they weren't explicitly trained for. And one bonus insight: fine-tuned LLMs can outperform general-purpose LLMs on specific tasks, which is why building your own still matters. Keep these five points, plus the bonus, in mind; they're the thread that connects everything we'll do from here on out.
Lecture 1: Complete
Next: Lecture 2, Working with Text Data
And that wraps up Lecture 1. We've covered a lot of ground: from definitions and applications to the transformer architecture, training data at scale, the GPT architecture specifically, and the three-stage roadmap for building an LLM from scratch. In the next lecture, we'll roll up our sleeves and start with the first practical step: working with text data and building a tokenizer. That's where the coding begins. See you in Lecture 2.