Part VI: Productionization
Chapter 30

Multimodal Models

Vision transformers, CLIP, and multimodal LLMs
20 Exercises
30.1

Every model in this book so far has lived in a world of text. But the real world is rich with images, audio, and video, and much of human communication and knowledge is non-textual. A model that can SEE a photo, READ a chart, or HEAR speech is far more useful than one limited to text. This chapter extends the LLM to MULTIPLE MODALITIES — and the central idea turns out to be a beautiful and surprisingly simple extension of everything we have built.

What Multi-modal Models Can Do

Capability | Example
Image description | Describe what is in a photo
Visual question answering | 'What color is the car in this image?'
Chart & document understanding | Read a graph, table, or scanned page
OCR and text-in-images | Extract and reason about text within images
Visual reasoning | 'Why is this meme funny?' / diagram interpretation
Audio understanding | Transcribe and answer questions about speech
Grounding | Point to where an object is in an image

The Big Idea: Turn Everything Into Tokens

Here is the key insight that makes multi-modal models work, and it is elegantly simple. A Transformer processes a sequence of TOKENS — it does not care what those tokens originally were. If we can turn an IMAGE into a sequence of tokens that live in the SAME space as text tokens, the very same Transformer can process images and text together, attending across both. Multi-modality is, at its heart, the problem of converting other modalities into tokens the language model already knows how to handle.

MM Note: One Architecture, Many Modalities
The Transformer is modality-agnostic: it operates on sequences of vectors (Chapter 13). Text becomes vectors via token embeddings; images can become vectors via a vision encoder; audio can become vectors via an audio encoder. Once everything is a sequence of vectors in a shared space, the SAME attention mechanism mixes information across modalities. This is why the Transformer became the universal architecture — it generalizes far beyond text.
So the question 'how do we make an LLM see?' reduces to 'how do we turn an image into vectors that the LLM can attend to alongside text?' The rest of this chapter answers that question: first the components (vision encoders, CLIP), then how they connect to the LLM (LLaVA), then training, then audio.
Intuition: A Shared Language of Vectors
Imagine two people who speak different languages but both understand a third, common language. To communicate, each translates into the common language. Multi-modal models do this with VECTORS: images and text are different 'languages', and the shared embedding space is the common language they both translate into. Once an image is translated into the shared vector space, the LLM can 'read' it just as it reads text.
The whole game of multi-modality is learning these translations so that an image of a dog and the word 'dog' land in nearby places in the shared space. Get the translation right, and the model can reason across modalities seamlessly.
30.2

Text is naturally a sequence of tokens, but an image is a grid of pixels — how do we turn it into a sequence? The answer, from the Vision Transformer (ViT; Dosovitskiy et al., 2020), is delightfully simple: cut the image into PATCHES, and treat each patch as a token. This single idea is what lets the Transformer process images.

Patches as Tokens

A ViT splits an image into a grid of fixed-size patches — say 16×16 pixels each. A 224×224 image becomes a 14×14 grid = 196 patches. Each patch is flattened and passed through a small linear layer to produce a vector — a 'patch embedding'. Now the image is a sequence of 196 vectors, exactly the form a Transformer expects. Patches are to images what tokens are to text.

Shape Trace: Image to patch tokens

Operation | Shape | Note
input image | (224, 224, 3) | raw pixels
split into patches | (196, 16, 16, 3) | 14×14 grid of 16×16 patches
flatten each patch | (196, 768) | 16·16·3 = 768 pixel values per patch
linear projection | (196, 768) | patch embeddings (the model dim is also 768 in this example)
+ position embeddings | (196, 768) | where each patch is
→ Transformer | (196, 768) | a sequence of tokens!

Just like text tokens, patch tokens get POSITION EMBEDDINGS so the model knows where each patch was in the image (top-left, center, etc.) — because, as with text, attention is otherwise order-blind (Chapter 13). With patches embedded and positioned, a standard Transformer can process the image, with each patch attending to every other patch to build up an understanding of the whole scene.

Python: Patchifying an image (the ViT idea)
import torch

def patchify(image, patch_size=16):
    """Turn an image into a sequence of flattened patches."""
    # image: (C, H, W); H and W must be multiples of patch_size
    C, H, W = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Non-overlapping windows over H, then W: (C, H/p, W/p, p, p)
    patches = image.unfold(1, patch_size, patch_size) \
                   .unfold(2, patch_size, patch_size)
    # Collapse the grid into one axis: (C, num_patches, p*p)
    patches = patches.reshape(C, -1, patch_size * patch_size)
    # Put the patch axis first, then flatten the channels into each patch
    num_patches = patches.size(1)
    patches = patches.permute(1, 0, 2).reshape(num_patches, -1)
    return patches   # (num_patches, C*patch_size*patch_size)

# A 224x224 image with 16x16 patches -> 196 patch tokens.
# Each is linearly projected to the model dim and given a position embedding.
# Then a standard Transformer processes them -- exactly like text tokens.
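The projection and position-embedding steps mentioned in the comments above can be sketched as a small module. This is a minimal illustration, not the reference ViT implementation: the class name PatchEmbedding, the learned (rather than fixed) position table, and the omission of a class token are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Linear patch projection plus learned position embeddings (illustrative)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        assert image_size % patch_size == 0, "image must divide evenly into patches"
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size
        self.proj = nn.Linear(patch_dim, dim)
        # One learned vector per patch position (assumption: learned, not sinusoidal).
        self.pos = nn.Parameter(torch.zeros(self.num_patches, dim))
        nn.init.normal_(self.pos, std=0.02)

    def forward(self, image):
        # image: (C, H, W)
        p = self.patch_size
        C, H, W = image.shape
        # (C, H/p, W/p, p, p): one p x p window per grid cell and channel
        x = image.unfold(1, p, p).unfold(2, p, p)
        # (H/p, W/p, C, p, p) -> (num_patches, C*p*p), patches in row-major order
        x = x.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)
        return self.proj(x) + self.pos


embed = PatchEmbedding()
tokens = embed(torch.randn(3, 224, 224))
print(tokens.shape)  # torch.Size([196, 768])
```

The output has exactly the shape of the last rows of the shape trace: 196 tokens of width 768, ready for a standard Transformer.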
Patch token
A fixed-size square region of an image (e.g. 16×16 pixels), flattened and linearly projected into a vector, treated as one token in the sequence a Transformer processes — the visual analogue of a text token.
MM Note: Patches Are the Bridge
The patch-as-token idea is the bridge that lets all the Transformer machinery from Part III apply to images. Once an image is a sequence of patch tokens, attention, layers, and everything else work unchanged. The Vision Transformer showed that you do not need image-specific architectures (like convolutional networks) — the same Transformer that processes text processes images, given patches. This unification is what made multi-modal LLMs natural.
A practical note: the number of patch tokens grows with image resolution, and these tokens consume context window just like text tokens. A high-resolution image can become hundreds or thousands of tokens, which is why image-token efficiency (Section 30.7) matters for multi-modal models.
30.3

Turning an image into patch tokens is only the first step. Those raw patch embeddings don't yet capture MEANING — they're just projected pixels. A VISION ENCODER is a Transformer (a ViT) that processes the patch tokens into rich feature vectors that capture what is actually in the image: objects, textures, relationships, text. The vision encoder is the 'eye' of a multi-modal model.

The Vision Transformer as Encoder

Arch Stack: A Vision Transformer (ViT) encoder (read bottom to top)

image features | (196, 768) meaningful vectors
Transformer blocks (×N) | self-attention over patches
patch + position embeddings | (196, 768)
patchify | image → 196 patch tokens
input image | (224, 224, 3)

The ViT encoder works just like the text Transformers of Part III: stacked blocks of self-attention and feed-forward layers, with each patch attending to all others. Through these layers, the patch representations become increasingly meaningful — early layers capture edges and textures, later layers capture objects and scene-level meaning. The output is a sequence of feature vectors, one per patch, that richly describe the image's content.
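As a rough sketch of that stack, PyTorch's built-in Transformer encoder layers can stand in for the ViT blocks. The helper name make_vit_encoder and the sizes are illustrative assumptions; production ViTs differ in details (class token, final layer norm, initialization) that are left out here.

```python
import torch
import torch.nn as nn


def make_vit_encoder(dim=768, depth=12, heads=12):
    """Stack of pre-norm Transformer blocks over patch tokens (sizes are illustrative)."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=heads,
        dim_feedforward=4 * dim,
        activation="gelu",
        batch_first=True,   # inputs are (batch, tokens, dim)
        norm_first=True,    # layer norm before attention and feed-forward
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


# Small sizes so the example runs quickly; the defaults are closer to a base-size ViT.
encoder = make_vit_encoder(dim=64, depth=2, heads=4).eval()
patch_tokens = torch.randn(1, 196, 64)      # already embedded and positioned
with torch.no_grad():
    features = encoder(patch_tokens)
print(features.shape)  # torch.Size([1, 196, 64])
```

The sequence length is unchanged: one feature vector comes out for every patch token that goes in.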

ML Connection: The Same Architecture, A Different Input
It is worth pausing on how little had to change. The vision encoder is the SAME Transformer architecture from Part III — self-attention, feed-forward, residuals, layer norm — applied to patch tokens instead of word tokens. The deep lesson of the last decade is that the Transformer is a general-purpose sequence processor; what differs across modalities is only how you turn the raw input into tokens. Vision, audio, even protein sequences and time series — all yield to the same architecture with the right tokenization.
This is why mastering the Transformer (Part III) was such a high-leverage investment: it is the substrate for nearly all modern AI, across every modality. Multi-modal models are not a new architecture so much as the same architecture fed new kinds of tokens.

Where the Vision Encoder Comes From

Crucially, multi-modal LLMs usually do not train a vision encoder from scratch. They start from a PRETRAINED one — most often a CLIP vision encoder (next section) — which has already learned rich visual features from massive image data. This is transfer learning: reuse a vision encoder that already 'sees' well, and connect it to the language model. The next section explains CLIP, the pretrained vision encoder that powers most multi-modal LLMs.

30.4

CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) is one of the most influential ideas in multi-modal AI. It learns a SHARED embedding space where images and their text descriptions land close together: a photo of a dog and the text 'a dog' get nearby vectors. This shared space is the foundation that lets a language model understand images, and CLIP's vision encoder is the 'eye' inside most multi-modal LLMs.

Contrastive Learning: Match the Pairs

CLIP trains on hundreds of millions of (image, caption) pairs scraped from the web. The training objective is CONTRASTIVE: given a batch of image-caption pairs, push each image's embedding CLOSE to its OWN caption's embedding, and FAR from all the OTHER captions in the batch. The model learns to align matching images and texts while separating mismatched ones — building, across millions of pairs, a shared space where meaning is shared across modalities.

Pipeline Flow: How CLIP learns a shared image-text space

1. Collect pairs: Hundreds of millions of (image, caption) pairs from the web
2. Encode both: Image encoder → image vector; text encoder → text vector
3. Contrastive loss: Pull matching pairs together, push mismatches apart
4. Result: A shared space where 'a dog' ≈ a photo of a dog
Pseudocode: CLIP contrastive objective (sketch)
For a batch of N (image, text) pairs:
    compute similarity(image_i, text_j) for ALL i, j
    the DIAGONAL (i == j) are the TRUE matching pairs

Maximize similarity on the diagonal (matches),
minimize it off-diagonal (mismatches).
# A symmetric cross-entropy over images-to-texts and texts-to-images.
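The sketch above translates almost line for line into PyTorch. The version below is a minimal illustration under two assumptions: the embeddings are already computed by some image and text encoders, and the temperature is a fixed constant (CLIP itself learns it during training).

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) image-text similarity matrix."""
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature              # (N, N)
    # Row i's correct column is i: the matches sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                  # each image picks its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)              # each caption picks its image
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
img = torch.randn(8, 32)
aligned = clip_contrastive_loss(img, img.clone())            # perfectly matched pairs
shuffled = clip_contrastive_loss(img, img.roll(1, dims=0))   # every pair mismatched
print(aligned.item() < shuffled.item())  # True
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the signal that pulls matching pairs together.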

After training, CLIP can measure how well any image matches any text — enabling zero-shot classification ('is this image more like "a cat" or "a dog"?'), image search by text, and more. But for multi-modal LLMs, the prize is CLIP's VISION ENCODER: a network that produces image features already ALIGNED with language. Those features are the perfect input to hand to a language model, because they already live in a space connected to text meaning.
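Zero-shot classification then amounts to a nearest-neighbour lookup in the shared space. The sketch below uses small hand-written vectors as hypothetical stand-ins for real encoder outputs; only the comparison logic is the point.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), label_embs, dim=-1)  # (num_labels,)
    # A real CLIP model would scale the similarities by its learned temperature first.
    probs = sims.softmax(dim=0)
    return labels[int(probs.argmax())], probs


# Hypothetical embeddings standing in for encoder outputs.
labels = ["a photo of a cat", "a photo of a dog"]
label_embs = torch.tensor([[1.0, 0.0, 0.2],
                           [0.0, 1.0, 0.2]])
image_emb = torch.tensor([0.1, 0.9, 0.3])   # pretend this came from a dog photo
best, probs = zero_shot_classify(image_emb, label_embs, labels)
print(best)  # a photo of a dog
```

No classifier was trained for "cat" versus "dog"; the label set is just text, which is why the approach is called zero-shot.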

MM Note: Why CLIP Features Are Ideal for LLMs
CLIP's vision encoder is special because its image features were trained to align with TEXT. This means the visual features already 'speak the language of language' — they encode the kind of semantic content that text describes (objects, attributes, relationships) rather than raw low-level pixel statistics. Feeding CLIP features to a language model is therefore far easier than feeding raw pixels: the features are already in a language-aligned, meaningful form.
This is why the dominant recipe for multi-modal LLMs (LLaVA, next section) starts with a frozen CLIP vision encoder. CLIP did the hard work of learning language-aligned visual features; the multi-modal LLM just needs to connect those features to the LLM's input.
30.5

We now have the pieces: a language model (the whole book so far) and a vision encoder that produces language-aligned image features (CLIP). LLaVA (Large Language and Vision Assistant; Liu et al., 2023) showed a strikingly simple way to connect them, and it became the dominant recipe for vision-language models. The key component is just a small PROJECTION layer.

The Architecture: Encoder + Projector + LLM

Arch Stack: The LLaVA architecture (read bottom to top)

LLM (the language model) | processes image + text tokens together
projected image tokens | now in the LLM's embedding space
projection layer (the bridge) | maps CLIP features → LLM space
CLIP vision encoder (frozen) | image → language-aligned features
input image | (224, 224, 3)

The recipe has three parts. (1) A VISION ENCODER (frozen CLIP) turns the image into feature vectors. (2) A PROJECTION layer — often just a small multi-layer perceptron — maps those vectors into the LLM's embedding space, producing 'image tokens' that look to the LLM like word embeddings. (3) The LLM processes these image tokens INTERLEAVED with text tokens, attending across both. The image tokens are simply prepended to (or mixed with) the text tokens in the sequence.

The Projection Layer Is the Crucial Bridge

The projection layer is the heart of the connection. CLIP features live in CLIP's space; the LLM expects embeddings in ITS space. The projector translates between them — turning each image feature vector into a vector that the LLM can process as if it were a token embedding. Remarkably, this projector can be small (a one- or two-layer MLP), because CLIP features are already language-aligned (Section 30.4) — the translation needed is modest.

Python: The LLaVA-style connection
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, clip_encoder, llm, clip_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision = clip_encoder        # CLIP vision encoder, frozen below
        for p in self.vision.parameters():
            p.requires_grad = False        # keep the vision encoder frozen
        self.llm    = llm                  # the language model
        # The PROJECTOR: maps CLIP features -> LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image, text_tokens):
        # 1. Encode the image -> feature vectors (one per patch)
        feats = self.vision(image)              # (num_patches, clip_dim)
        # 2. Project into the LLM's space -> 'image tokens'
        image_tokens = self.projector(feats)      # (num_patches, llm_dim)
        # 3. Embed text and CONCATENATE image + text tokens
        text_emb = self.llm.embed(text_tokens)    # (T, llm_dim)
        seq = torch.cat([image_tokens, text_emb], dim=0)
        # 4. The LLM processes the combined sequence as usual
        return self.llm(seq)

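# The image becomes a few hundred 'tokens' (one per patch) that the LLM
# attends to alongside the text. To the LLM, an image is just more tokens.
# Note: llm.embed(...) and llm(seq) stand in for whatever embedding and
# forward interfaces the actual LLM exposes.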
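To see the shapes flow end to end, the same connection can be run with toy stand-ins. Everything here is hypothetical scaffolding: ToyVisionEncoder and ToyLLM are placeholders for a real CLIP encoder and a real language model, and the dimensions are shrunk so the example runs instantly.

```python
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a frozen CLIP encoder: returns one feature vector per patch."""

    def __init__(self, num_patches=196, clip_dim=32):
        super().__init__()
        self.num_patches, self.clip_dim = num_patches, clip_dim

    def forward(self, image):
        return torch.randn(self.num_patches, self.clip_dim)


class ToyLLM(nn.Module):
    """Stand-in LLM exposing embed(...) and a forward pass over embeddings."""

    def __init__(self, vocab=100, llm_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, llm_dim)
        self.head = nn.Linear(llm_dim, vocab)

    def forward(self, seq):
        return self.head(seq)   # (sequence_length, vocab) logits


vision, llm = ToyVisionEncoder(), ToyLLM()
projector = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64))

image = torch.zeros(3, 224, 224)
text_tokens = torch.tensor([5, 17, 42, 8])          # 4 text tokens

image_tokens = projector(vision(image))             # (196, 64)
text_emb = llm.embed(text_tokens)                   # (4, 64)
seq = torch.cat([image_tokens, text_emb], dim=0)    # (200, 64)
logits = llm(seq)
print(seq.shape, logits.shape)  # torch.Size([200, 64]) torch.Size([200, 100])
```

The combined sequence is 200 positions long: 196 image tokens followed by 4 text tokens, all of the same width.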
MM Note: Images Become 'Words' the LLM Reads
The conceptual payoff: after projection, an image is just a sequence of tokens prepended to the text — the LLM 'reads' the image tokens exactly as it reads word tokens, attending across both to answer questions about the image. The LLM did not need a new architecture; it needed its input vocabulary extended with image tokens via the projector. A picture quite literally becomes a few hundred 'words' the model can read.
This simplicity is why LLaVA was so influential: a powerful vision-language model from a frozen vision encoder, a pretrained LLM (frozen while the projector is aligned, then usually fine-tuned), and a small trainable projector. Most of the heavy lifting was already done by the separately-pretrained vision encoder and LLM; the projector just connects them.
30.6

How is a vision-language model trained? Building on the LLaVA recipe, training typically happens in STAGES, reusing the pretraining and SFT ideas from earlier in the book. The staged approach is efficient because it leverages the already-trained components and only learns what is genuinely new.

Stage 1: Alignment (Train the Projector)

The first stage trains ONLY the projection layer, with both the vision encoder and the LLM FROZEN. Using image-caption pairs, the projector learns to map image features into the LLM's space such that the LLM can describe the image. Since only the small projector trains, this stage is cheap. Its goal is purely to teach the bridge — to align the image tokens with the LLM's expectations.

Stage 2: Instruction Tuning (Visual SFT)

The second stage is SFT (Chapter 22) on multi-modal INSTRUCTION data — examples of images paired with questions and good answers ('What is in this image?' → a helpful description; 'Read the text in this sign' → the text). Here the projector AND usually the LLM are trained (the vision encoder often stays frozen). This teaches the model to FOLLOW INSTRUCTIONS about images — to be a helpful visual assistant, not just a captioner.
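The freezing pattern of the two stages can be expressed as a few lines that toggle requires_grad. This is a sketch under the common configuration described above (vision encoder frozen throughout, LLM trained only in Stage 2); the helper name configure_stage and the tiny stand-in modules are assumptions for illustration.

```python
import torch.nn as nn


def configure_stage(vision, projector, llm, stage):
    """Set requires_grad for the given stage; return the trainable parameters."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    for p in vision.parameters():
        p.requires_grad = False            # vision encoder stays frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True             # projector trains in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)     # LLM trains only in Stage 2 (the usual case)
    modules = (vision, projector, llm)
    return [p for m in modules for p in m.parameters() if p.requires_grad]


def count(params):
    return sum(p.numel() for p in params)


# Tiny stand-in modules (hypothetical sizes) just to show the parameter counts.
vision, projector, llm = nn.Linear(8, 8), nn.Linear(8, 16), nn.Linear(16, 16)
stage1 = count(configure_stage(vision, projector, llm, stage=1))
stage2 = count(configure_stage(vision, projector, llm, stage=2))
print(stage1, stage2)  # 144 416
```

The returned list is what would be handed to the optimizer, so Stage 1 updates only the projector's weights while Stage 2 also updates the LLM's.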

Pipeline Flow: Staged training of a vision-language model

1. Pretrained parts: Start from a pretrained CLIP encoder + a pretrained LLM
2. Stage 1: align: Train ONLY the projector on image-caption pairs (cheap)
3. Stage 2: visual SFT: Instruction-tune on image+question+answer data
4. (Optional) preference: RLHF/DPO on visual responses for helpfulness
MM Note: Reusing Everything We Built
Notice how multi-modal training reuses the entire book: a pretrained LLM (Part IV), the SFT recipe (Chapter 22), and even preference optimization (Chapters 23–24) applied to visual responses. The genuinely new parts are small — the projector and the multi-modal data. This is why capable vision-language models could be built so quickly after strong LLMs existed: most of the ingredients were already on the shelf.
The staged approach also reflects good engineering: train the cheap new bridge first (Stage 1), then do the more expensive instruction tuning (Stage 2). Freezing the expensive, already-good components (the vision encoder throughout, and the LLM during Stage 1) keeps training affordable and avoids damaging their pretrained abilities.

Where Multi-modal Data Comes From

Multi-modal instruction data is often generated by using a strong existing model: show a powerful model an image with annotations, and have it generate questions and answers about the image. This 'distillation' approach (echoing Chapter 22's data sources) scales the creation of visual instruction data cheaply. Human-annotated data and existing captioned datasets supplement it.

30.7

A practical reality shapes multi-modal models: images consume TOKENS, and tokens are expensive (they fill the context window and cost compute, per Chapter 27). A single high-resolution image can become hundreds or thousands of tokens. Managing this 'image-token budget' is a key engineering concern, and it explains many design choices in real multi-modal systems.

The Resolution-Tokens Trade-off

More image resolution means more patches means more tokens. A 224×224 image at 16×16 patches is 196 tokens; a 1024×1024 image is over 4,000 tokens — more than many entire text prompts. High resolution lets the model read fine detail (small text, distant objects) but at a steep token cost. Multi-modal systems must balance visual detail against the token budget.
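The arithmetic is worth making explicit, since the token count grows with the square of the image side. A minimal helper, assuming square patches and image sizes that divide evenly (the function name is illustrative):

```python
def image_token_count(height, width, patch_size=16):
    """Number of patch tokens for an image that divides evenly into patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return (height // patch_size) * (width // patch_size)


for side in (224, 448, 1024):
    print(side, image_token_count(side, side))
# 224 196
# 448 784
# 1024 4096
```

Doubling the side length quadruples the token count, which is why resolution is the main lever on the image-token budget.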

Approach | What it does
Fixed low resolution | Few tokens, but misses fine detail
Tiling / cropping | Split a high-res image into tiles, each encoded separately
Token pooling / merging | Combine nearby patch tokens to reduce count
Resampler (Q-Former) | Learn a fixed small number of query tokens to summarize the image
Adaptive resolution | Use more tokens only when detail is needed
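Of the approaches in the table, token pooling is the simplest to illustrate. The sketch below averages each 2×2 block of neighbouring patch tokens, assuming the tokens are stored in row-major grid order; real systems may use learned or similarity-based merging instead of a plain average.

```python
import torch
import torch.nn.functional as F


def pool_patch_tokens(tokens, grid_size, window=2):
    """Average each window x window block of neighbouring patch tokens into one token."""
    n, dim = tokens.shape
    assert n == grid_size * grid_size, "tokens must form a square grid"
    grid = tokens.t().reshape(1, dim, grid_size, grid_size)   # (1, dim, H, W)
    pooled = F.avg_pool2d(grid, kernel_size=window)           # (1, dim, H/window, W/window)
    return pooled.flatten(2).squeeze(0).t()                   # (n / window**2, dim)


tokens = torch.randn(196, 768)            # 14 x 14 grid of patch tokens
merged = pool_patch_tokens(tokens, grid_size=14)
print(merged.shape)  # torch.Size([49, 768])
```

A 2×2 window cuts the token count by a factor of four, at the cost of blurring detail within each block.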

A common technique is a RESAMPLER (like the Q-Former in BLIP-2, or a Perceiver resampler): instead of passing all patch tokens to the LLM, a small module learns to compress them into a FIXED, small number of tokens (say 32 or 64) that summarize the image. This decouples the LLM's token cost from the image resolution — the LLM always sees a fixed, manageable number of image tokens regardless of how big the image is.
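The core mechanism of a resampler, learned queries that cross-attend to the patch tokens, fits in a short module. This is a deliberately reduced sketch: a single attention layer with an illustrative class name, whereas the actual Q-Former and Perceiver resampler are deeper and include components not shown here.

```python
import torch
import torch.nn as nn


class Resampler(nn.Module):
    """Learned queries cross-attend to patch tokens, producing a fixed number of outputs."""

    def __init__(self, dim=768, num_queries=32, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim); num_patches may vary between calls
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)   # queries attend to patches
        return self.norm(out + q)                           # (batch, num_queries, dim)


resampler = Resampler(dim=64, num_queries=32, heads=4)
low_res = resampler(torch.randn(1, 196, 64))     # 224x224 image at 16x16 patches
high_res = resampler(torch.randn(1, 4096, 64))   # 1024x1024 image at 16x16 patches
print(low_res.shape, high_res.shape)  # torch.Size([1, 32, 64]) torch.Size([1, 32, 64])
```

Whether the encoder produces 196 or 4,096 patch tokens, the LLM receives 32, because the output length is set by the number of queries rather than by the input.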

MM Note: The Token Budget Drives Design
Many design differences between multi-modal models come down to how they handle the image-token budget. Some pass all patch tokens (simple, accurate, expensive); some use a resampler to compress to a fixed small count (efficient, may lose detail); some tile high-res images and process tiles (good for documents and fine text, many tokens). There is no single best answer — it depends on whether your application needs fine visual detail (documents, charts) or coarse understanding (scene description).
This connects directly to the inference economics of Chapter 27: image tokens cost prefill compute, KV-cache memory, and context-window space just like text tokens. A model that uses 4,000 tokens per image is far more expensive to serve than one using 64, so token efficiency is both a capability and a cost decision.
30.8

Once you understand the vision recipe, audio follows the same pattern — which is the beauty of the tokenize-everything approach. To make a model HEAR, we need to turn audio into tokens the LLM can process, then connect them with a projector, just as we did for images. The modality changes; the recipe does not.

Turning Audio Into Tokens

Audio is a waveform: a 1D signal of amplitudes over time. Two common ways to tokenize it: (1) convert the waveform into a SPECTROGRAM (a 2D time-frequency image) and patchify it like an image, or (2) use a pretrained AUDIO ENCODER (like Whisper's encoder, which itself takes a log-mel spectrogram as input) that outputs a sequence of feature vectors. Either way, the result is a sequence of vectors (audio tokens) that a projector maps into the LLM's space.
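The first route can be sketched with torch.stft alone. This is a simplified illustration: it uses a linear-frequency log-magnitude spectrogram and treats each time frame as one token, whereas speech models typically use mel-scaled features; the window and hop sizes and the embedding width of 64 are assumptions.

```python
import math
import torch
import torch.nn as nn


def audio_to_frames(waveform, n_fft=400, hop_length=160):
    """Waveform (num_samples,) -> log-magnitude frames (num_frames, n_fft // 2 + 1)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)    # (freq_bins, num_frames)
    log_mag = torch.log(spec.abs() + 1e-6)
    return log_mag.t()                                        # one row per time frame


sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate
waveform = torch.sin(2 * math.pi * 440.0 * t)   # one second of a 440 Hz tone

frames = audio_to_frames(waveform)               # one vector per 10 ms hop
to_tokens = nn.Linear(frames.size(1), 64)        # hypothetical embedding size
audio_tokens = to_tokens(frames)
print(frames.shape, audio_tokens.shape)  # torch.Size([101, 201]) torch.Size([101, 64])
```

One second of audio becomes about a hundred frame tokens here, so long recordings face the same token-budget pressure as high-resolution images.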

Pipeline Flow: Making an LLM hear (mirrors the vision recipe)

1. Audio waveform: Raw sound signal over time
2. Audio encoder: Spectrogram + encoder → audio feature vectors
3. Projector: Map audio features → LLM embedding space
4. LLM: Process audio tokens alongside text

Notice this is EXACTLY the structure of the vision pipeline (Section 30.5), with 'audio encoder' swapped for 'vision encoder'. The projector and LLM roles are identical. This is the power of the shared-token framework: a new modality just needs an encoder that turns it into vectors and a projector that maps those vectors into the LLM's space. Speech recognition, audio question-answering, and music understanding all follow this template.

Vision pipeline | Audio pipeline
Image → patches | Waveform → spectrogram/frames
Vision encoder (CLIP/ViT) | Audio encoder (Whisper-style)
Projector → LLM space | Projector → LLM space
LLM reads image tokens | LLM reads audio tokens
VQA, captioning, OCR | Transcription, audio Q&A
MM Note: Output Modalities Too
So far we have discussed multi-modal INPUT (the model perceives images/audio). Models can also produce multi-modal OUTPUT — generating images (covered in Chapter 7's generative models), speech, or other modalities. Some models are 'any-to-any', accepting and producing multiple modalities. The input side (perception) and output side (generation) use related but distinct techniques; this chapter focuses on input, where the tokenize-and-project recipe dominates.
Increasingly, frontier models are natively multi-modal — trained on text, images, and audio together from the start, rather than bolting vision onto a text model afterward. Native multi-modality can yield deeper cross-modal understanding, though the staged LLaVA-style approach remains the most accessible and widely-used recipe.
30.9

Step back and see the single principle unifying everything in this chapter: the SHARED EMBEDDING SPACE. Every modality — text, images, audio — is converted into vectors in a common space, where the model can reason across them all. This idea is the conceptual core of multi-modal AI, and it is worth understanding deeply.

One Space, Many Modalities

In a multi-modal model, a word, an image patch, and an audio frame all become vectors in the SAME space. Because they share a space, the attention mechanism can relate them: a question token can attend to image tokens, an image region can be linked to a word. The model does not have separate 'vision reasoning' and 'language reasoning' — it has ONE reasoning process operating over a unified sequence of multi-modal tokens. This is what enables genuine cross-modal understanding, like answering a text question about an image.

Intuition: Why a Shared Space Enables Cross-Modal Reasoning
Imagine each modality lived in its OWN separate space, with no connection. Then the model could process images and text, but never RELATE them — it could not answer 'what color is the car in this photo?' because the word 'car' and the image of the car would be in incomparable spaces. A SHARED space is what lets the model connect 'car' (text) to the car (pixels): they are nearby vectors the attention mechanism can link.
This is why CLIP's contribution was so foundational — it explicitly LEARNED a shared image-text space. Everything downstream (LLaVA, audio LLMs) builds on the principle that putting modalities in one space is what makes reasoning across them possible. The shared space is not an implementation detail; it is the whole idea.
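A toy calculation makes the point concrete. The vectors below are hand-written in a hypothetical 3-dimensional shared space, and the learned query and key projections of real attention are omitted; the only claim is that a text vector can score image vectors once both live in the same space.

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-d shared space; axes loosely mean (car-ness, tree-ness, sky-ness).
image_tokens = torch.tensor([[0.1, 0.1, 2.0],    # patch 0: sky
                             [0.1, 2.0, 0.1],    # patch 1: tree
                             [2.0, 0.1, 0.1]])   # patch 2: car
car_word = torch.tensor([[2.0, 0.0, 0.0]])       # text token 'car' used as a query

# Scaled dot-product attention weights of the word over the image patches.
scores = car_word @ image_tokens.t() / image_tokens.size(1) ** 0.5
weights = F.softmax(scores, dim=-1)
print(weights.argmax().item())  # 2
```

The word 'car' places most of its attention on the car patch. If the image vectors lived in an unrelated space, the dot products would be meaningless and no such link could form.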

Alignment Is the Hard Part

The central challenge of multi-modality is ALIGNMENT — getting the different modalities to occupy the shared space consistently, so that semantically-related things across modalities land near each other. CLIP achieves it with contrastive learning; LLaVA's projector achieves it for the LLM's space; audio encoders achieve it for sound. When alignment is good, cross-modal reasoning works; when it is poor, the model 'sees' the image but cannot connect it to language. Most of the difficulty and the research in multi-modality is about achieving good alignment.

ML Connection: Embeddings, All the Way Down
This chapter closes a loop opened back in Chapter 8 on embeddings. The whole edifice of modern AI rests on representing things — words, images, sounds, concepts — as vectors in meaningful spaces, where geometry encodes meaning and nearby vectors mean similar things. Multi-modality is the grand extension of that idea: put EVERYTHING in a shared vector space, and a single model can reason across all of it.
From word2vec (Chapter 8) to CLIP to multi-modal LLMs, the through-line is the same: learn good embeddings, and remarkable capabilities follow. The shared embedding space is perhaps the single most important idea in modern machine learning, and multi-modal models are its most striking demonstration.
30.10

Multi-modal models are powerful but imperfect. A clear-eyed view of their strengths and weaknesses is essential for using them well and not over-trusting them.

Strong at | Weak at / fails
Describing image content | Precise spatial reasoning (exact positions)
Visual question answering | Counting many objects accurately
Reading clear text (OCR) | Tiny / low-quality / dense text
Chart & document understanding | Complex diagrams, fine measurements
General scene understanding | Reasoning needing pixel precision
Common visual concepts | Rare/novel visual content unseen in training

Why the Weaknesses Exist

Many limitations trace to the patch tokenization. Because the image is summarized into a limited number of patch tokens, FINE DETAIL can be lost — small text, precise positions, exact counts. The model sees a compressed representation, not every pixel. Higher resolution (more tokens) helps but costs more (Section 30.7). Other weaknesses mirror text LLMs: they can hallucinate about images (describing objects that aren't there), and their visual knowledge is bounded by training data.

⚠️
Multi-modal Models Hallucinate Too
Just as text LLMs invent facts, vision-language models can 'hallucinate' visual content — confidently describing objects, text, or details that are not actually in the image. This often happens when the model relies on its language priors (what USUALLY appears in such scenes) over what is actually shown. For example, asked about a kitchen photo, it might mention a refrigerator that isn't visible because kitchens usually have one.
The same caution as everywhere in this book applies: do not over-trust the output, especially for precise or critical visual details (medical images, exact readings, fine text). Verify important visual claims, and be aware that confident-sounding descriptions can be wrong — a vision-language model's fluency is not a guarantee of visual accuracy.
30.11

The chapter's recipe (encoder + projector + LLM) is the dominant pattern, but the multi-modal landscape has several architectural approaches and a clear direction of travel. A brief map helps place what you have learned.

Approach | How it fuses modalities
Projection (LLaVA) | Vision encoder → projector → LLM input. Simple, dominant.
Cross-attention (Flamingo) | LLM layers cross-attend to image features
Resampler (BLIP-2) | Q-Former compresses image to a few query tokens
Native multi-modal | Trained on all modalities jointly from scratch
Unified any-to-any | One model accepts and generates multiple modalities

Two Ways to Fuse: Input vs Cross-Attention

There are two main ways to inject visual information into an LLM. The LLaVA approach puts image tokens directly into the LLM's INPUT sequence (simple, lets every layer attend to the image). The Flamingo approach (Alayrac et al., 2022) instead adds CROSS-ATTENTION layers inside the LLM that attend to image features (keeps the image out of the text sequence, so many images do not consume the context window, at the cost of extra layers). Both work; the input-injection approach (LLaVA) became more popular for its simplicity.
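The cross-attention alternative can be sketched as a layer in which text hidden states query the image features. This is a simplified illustration rather than Flamingo's actual block: the class name is hypothetical, the feed-forward part is omitted, and the zero-initialized tanh gate is included on the understanding that Flamingo uses such gating so that the pretrained LLM is undisturbed at the start of training.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Text hidden states (queries) attend to image features (keys/values)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # tanh gate starting at 0: the layer is an identity at initialization,
        # so the pretrained LLM's behaviour is initially unchanged.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_feats):
        attended, _ = self.attn(self.norm(text_hidden), image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended


layer = GatedCrossAttention()
text_hidden = torch.randn(1, 10, 64)     # 10 text positions
image_feats = torch.randn(1, 196, 64)    # 196 image features, never added to the sequence
out = layer(text_hidden, image_feats)
print(out.shape)  # torch.Size([1, 10, 64])
```

The text sequence stays 10 positions long no matter how many image features are attended to, which is the contrast with input injection, where every image token lengthens the sequence.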

MM Note: Toward Native Multi-modality
The field is moving from 'bolt a vision encoder onto a text LLM' toward NATIVE multi-modal models trained on text, images, and audio together from the beginning. Native training can produce deeper cross-modal understanding, because the model learns the modalities jointly rather than stitching them together afterward. The frontier models increasingly take this approach.
But the staged, projection-based recipe of this chapter (LLaVA) remains the most accessible and instructive: it builds a capable vision-language model from off-the-shelf parts with a small projector, and it makes the core ideas — tokenize, project into a shared space, attend across modalities — crystal clear. Understanding it is the foundation for understanding the more integrated approaches.

Multi-modal Meets Everything Else

Multi-modal capabilities combine with the rest of Part VI: multi-modal RAG retrieves relevant images as well as text (Chapter 29); multi-modal agents can SEE a screen and act on it (Chapter 28); multi-modal reasoning models think about images step by step (Chapter 25). Vision and audio are not a separate track — they extend every capability we have built, letting models perceive and act in a richer, more human world.

30.12

Let us consolidate the chapter into one coherent picture of how a model comes to see and hear, from raw input to grounded answer.

Pipeline Flow: The complete multi-modal recipe

1. Tokenize the modality: Image → patches; audio → spectrogram/frames
2. Encode: A pretrained encoder (CLIP for vision) → meaningful features
3. Project: A small projector maps features into the LLM's space
4. Fuse: Modality tokens join text tokens in one shared-space sequence
5. Attend & reason: The LLM attends across all modalities to reason
6. Generate: A grounded answer about the image/audio + text

The Three Ideas to Remember

If you remember three things from this chapter, make them these. First, the TRANSFORMER IS MODALITY-AGNOSTIC — it processes sequences of vectors, whatever they originally were. Second, MULTI-MODALITY IS TOKENIZATION — the whole problem reduces to turning each modality into tokens in a shared space (encoder + projector). Third, the SHARED EMBEDDING SPACE is what enables cross-modal reasoning — putting everything in one space lets attention relate a word to a pixel to a sound.

MM Note: The Elegance of the Solution
It is worth appreciating how ELEGANT the multi-modal solution is. We did not invent a new kind of model for vision, another for audio, and a way to glue them together. We took the one architecture we already had — the Transformer — and fed it new kinds of tokens. The same attention, the same layers, the same training recipes (pretraining, SFT, preference optimization) all carried over. Multi-modality is less a new field than a natural consequence of the token-and-attention paradigm.
This is the deep lesson of the chapter, and of much of modern AI: find a general representation (tokens in a shared vector space) and a general processor (attention), and capability after capability follows from the same foundation. Master the foundation, and the modalities take care of themselves.

30.13

Multi-modal Quick-Reference

Concept | Key idea | Remember
Core idea | Turn every modality into tokens | Transformer is modality-agnostic
Patch tokens | Cut image into patches = tokens | ViT: patches are visual tokens
Vision encoder | ViT processes patches into features | Same architecture, new input
CLIP | Contrastive image-text alignment | Language-aligned visual features
LLaVA | Encoder + projector + LLM | The projector is the bridge
Staged training | Align projector, then visual SFT | Reuses the whole book
Token budget | Images cost many tokens | Resamplers compress them
Audio LLMs | Same recipe, audio encoder | Tokenize-and-project again
Shared space | All modalities in one space | Enables cross-modal reasoning

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

Exercise 1: Pen & Paper
Explain the core idea that makes multi-modal models possible. Why is the Transformer described as 'modality-agnostic'?
Exercise 2: Pen & Paper
Describe how an image is turned into tokens (the ViT patch idea). For a 336×336 image with 14×14 patches, how many patch tokens result?
Exercise 3: Pen & Paper
Why do patch tokens need position embeddings? What would go wrong without them?
Exercise 4: Pen & Paper
Explain CLIP's contrastive objective. Why does training on (image, caption) pairs produce a shared image-text space?
Exercise 5: Pen & Paper
Why are CLIP's vision features especially well-suited as input to an LLM, compared to raw pixels?
Exercise 6: Pen & Paper
Describe the LLaVA architecture (encoder, projector, LLM). What is the role of the projector, and why can it be small?
Exercise 7: Pen & Paper
Explain the two-stage training of a vision-language model. What is frozen and what trains in each stage, and why?
Exercise 8: Pen & Paper
Explain the image-token budget. Why are high-resolution images expensive, and how does a resampler help?
Exercise 9: Pen & Paper
Show how the audio pipeline mirrors the vision pipeline. What is the only component that fundamentally changes?
Exercise 10: Pen & Paper
Explain why a SHARED embedding space (not separate per-modality spaces) is what enables cross-modal reasoning. Give an example that requires it.
Exercise 11: Code
Implement patchify: turn an image tensor into a sequence of flattened patch tokens. Verify the token count for several image sizes and patch sizes.
Exercise 12: Code
Build a minimal Vision Transformer encoder (patch embed + position embed + a few Transformer blocks) and run an image through it to get patch features.
Exercise 13: Code
Implement CLIP's contrastive loss for a batch of image and text embeddings. Verify it pulls matching pairs together and pushes mismatches apart.
Exercise 14: Code
Use a pretrained CLIP model to do zero-shot classification: score an image against several text labels and pick the best. Test on a few images.
Exercise 15: Code
Implement the LLaVA-style projector (an MLP) that maps vision features to an LLM's embedding dimension. Confirm the output shape matches the LLM's token embeddings.
Exercise 16: Code Lab
Connect a frozen vision encoder to a small LLM via a projector. Concatenate image tokens with text tokens and run the combined sequence through the LLM.
Exercise 17: Code Lab
Implement Stage-1 alignment training: freeze the encoder and LLM, and train ONLY the projector on image-caption pairs so the model can describe images.
Exercise 18: Code
Implement a simple resampler that compresses N patch tokens into a fixed number K of query tokens (e.g. via cross-attention). Show that the LLM's image-token count becomes independent of image resolution.
Exercise 19: Code
Build an audio tokenization pipeline: convert a waveform to a spectrogram and patchify it like an image, producing audio tokens for an LLM.
Exercise 20: Code (Challenge)
Build a minimal end-to-end vision-language model: a (small/pretrained) vision encoder, a trainable projector, and a small LLM. Train the projector on image-caption pairs (Stage 1), then instruction-tune on a small set of image-question-answer examples (Stage 2). Evaluate it on held-out visual questions, and probe its limits by testing fine-detail tasks (counting, small text) where you expect it to struggle — then explain those failures in terms of patch tokenization and the token budget.

Further reading: “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021, CLIP). “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020, ViT). “Visual Instruction Tuning” (Liu et al., 2023, LLaVA). “Flamingo” (Alayrac et al., 2022) for cross-attention fusion. “BLIP-2” (Li et al., 2023) for the Q-Former resampler. “Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022, Whisper) for audio encoding. Surveys on multi-modal large language models for the broader landscape.


Next → Chapter 31: Serving at Scale

You now have a fast (Chapter 27), tool-using (Chapter 28), knowledge-grounded (Chapter 29), and multi-modal (this chapter) model. The final chapter of Part VI tackles the last deployment challenge: serving the model RELIABLY to millions of users. Chapter 31 covers the systems engineering of production LLM serving — load balancing and autoscaling across many GPUs and replicas, routing and scheduling requests, managing reliability and cost at scale, and the full production stack that turns a model into a dependable service. It closes Part VI by completing the journey from a trained model to a real, scalable product — and bridges to the frontier techniques of Part VII.
