Part VI: Productionization

Chapter 30

Multi-modal LLMs

Teaching models to see and hear: vision encoders, image tokenization, the CLIP and LLaVA recipes, audio LLMs, and the art of aligning different modalities in a shared embedding space.

20 Exercises

Learning Objectives

1.	Explain why and how a text model can be extended to other modalities.
2.	Understand how images are turned into tokens a Transformer can process.
3.	Understand vision encoders and the Vision Transformer (ViT).
4.	Understand CLIP and contrastive alignment of images and text.
5.	Understand the LLaVA recipe: connecting a vision encoder to an LLM.
6.	Explain how a projection layer maps image features into the LLM's space.
7.	Understand how multi-modal models are trained in stages.
8.	Extend the ideas to audio and other modalities.
9.	Reason about what multi-modal models can and cannot do.
10.	Appreciate the shared-embedding-space principle that unifies modalities.

Every model in this book so far has lived in a world of text. But the real world is rich with images, audio, and video, and much of human communication and knowledge is non-textual. A model that can SEE a photo, READ a chart, or HEAR speech is far more useful than one limited to text. This chapter extends the LLM to MULTIPLE MODALITIES — and the central idea turns out to be a beautiful and surprisingly simple extension of everything we have built.

What Multi-modal Models Can Do

Capability	Example
Image description	Describe what is in a photo
Visual question answering	'What color is the car in this image?'
Chart & document understanding	Read a graph, table, or scanned page
OCR and text-in-images	Extract and reason about text within images
Visual reasoning	'Why is this meme funny?' / diagram interpretation
Audio understanding	Transcribe and answer questions about speech
Grounding	Point to where an object is in an image

The Big Idea: Turn Everything Into Tokens

Here is the key insight that makes multi-modal models work, and it is elegantly simple. A Transformer processes a sequence of TOKENS — it does not care what those tokens originally were. If we can turn an IMAGE into a sequence of tokens that live in the SAME space as text tokens, the very same Transformer can process images and text together, attending across both. Multi-modality is, at its heart, the problem of converting other modalities into tokens the language model already knows how to handle.

✧

MM Note: One Architecture, Many Modalities

The Transformer is modality-agnostic: it operates on sequences of vectors (Chapter 13). Text becomes vectors via token embeddings; images can become vectors via a vision encoder; audio can become vectors via an audio encoder. Once everything is a sequence of vectors in a shared space, the SAME attention mechanism mixes information across modalities. This is why the Transformer became the universal architecture — it generalizes far beyond text.

So the question 'how do we make an LLM see?' reduces to 'how do we turn an image into vectors that the LLM can attend to alongside text?'. The rest of this chapter answers that question — first the components (vision encoders, CLIP), then how they connect to the LLM (LLaVA), then training, then audio.

✧

Intuition: A Shared Language of Vectors

Imagine two people who speak different languages but both understand a third, common language. To communicate, each translates into the common language. Multi-modal models do this with VECTORS: images and text are different 'languages', and the shared embedding space is the common language they both translate into. Once an image is translated into the shared vector space, the LLM can 'read' it just as it reads text.

The whole game of multi-modality is learning these translations so that an image of a dog and the word 'dog' land in nearby places in the shared space. Get the translation right, and the model can reason across modalities seamlessly.

Text is naturally a sequence of tokens, but an image is a grid of pixels — how do we turn it into a sequence? The answer, from the Vision Transformer (ViT; Dosovitskiy et al., 2020), is delightfully simple: cut the image into PATCHES, and treat each patch as a token. This single idea is what lets the Transformer process images.

Patches as Tokens

A ViT splits an image into a grid of fixed-size patches — say 16×16 pixels each. A 224×224 image becomes a 14×14 grid = 196 patches. Each patch is flattened and passed through a small linear layer to produce a vector — a 'patch embedding'. Now the image is a sequence of 196 vectors, exactly the form a Transformer expects. Patches are to images what tokens are to text.

Shape Trace: Image to patch tokens

Operation	Shape	Note
input image	(224, 224, 3)	raw pixels
split into patches	(196, 16, 16, 3)	14×14 grid of 16×16 patches
flatten each patch	(196, 768)	16·16·3 = 768 values per patch
linear projection	(196, 768)	patch embeddings
+ position embeddings	(196, 768)	where each patch is
→ Transformer	(196, 768)	a sequence of tokens!
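The numbers in the shape trace can be checked with a few lines of arithmetic. The helper below is a minimal sketch; the name `patch_grid` is made up for illustration. Note that the flattened patch size (16·16·3 = 768) only happens to equal the 768-dimensional model width used in the trace; the two are independent in general.

```python
def patch_grid(height, width, patch_size=16, channels=3):
    """Return (num_patches, flat_patch_dim) for a ViT-style patch split."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows = height // patch_size                     # patches down the image
    cols = width // patch_size                      # patches across the image
    num_patches = rows * cols                       # sequence length the Transformer sees
    flat_dim = patch_size * patch_size * channels   # values in one flattened patch
    return num_patches, flat_dim

print(patch_grid(224, 224))   # (196, 768)
```

Doubling the side length quadruples the patch count, which is the root of the token-cost concerns discussed later in the chapter.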

Just like text tokens, patch tokens get POSITION EMBEDDINGS so the model knows where each patch was in the image (top-left, center, etc.) — because, as with text, attention is otherwise order-blind (Chapter 13). With patches embedded and positioned, a standard Transformer can process the image, with each patch attending to every other patch to build up an understanding of the whole scene.

Python•Patchifying an image (the ViT idea)
import torch

def patchify(image, patch_size=16):
    """Turn an image into a sequence of flattened patch vectors."""
    # image: (C, H, W); H and W must be multiples of patch_size
    C, H, W = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # unfold rows, then columns -> (C, H/p, W/p, p, p)
    patches = image.unfold(1, patch_size, patch_size) \
                   .unfold(2, patch_size, patch_size)
    # put the patch grid first, then flatten each patch
    patches = patches.permute(1, 2, 0, 3, 4)
    return patches.reshape(-1, C * patch_size * patch_size)   # (num_patches, C*p*p)

# A 224x224 image with 16x16 patches -> 196 patch tokens.
# Each is linearly projected to the model dim and given a position embedding.
# Then a standard Transformer processes them -- exactly like text tokens.
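The projection and position-embedding steps mentioned in the comments above can be sketched as a small module. This is an illustrative sketch, not the reference ViT implementation: the class name `PatchEmbedding` is hypothetical, and a learned per-position vector is only one of several position-embedding schemes.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Project flattened patches to the model width and add position embeddings."""

    def __init__(self, num_patches=196, patch_dim=768, model_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, model_dim)   # linear patch projection
        # One learned vector per patch position (an assumption; other schemes exist).
        self.pos = nn.Parameter(torch.zeros(num_patches, model_dim))
        nn.init.normal_(self.pos, std=0.02)

    def forward(self, patches):
        # patches: (num_patches, patch_dim) -> (num_patches, model_dim)
        return self.proj(patches) + self.pos

patches = torch.randn(196, 768)        # random stand-in for patchified pixels
tokens = PatchEmbedding()(patches)
print(tokens.shape)                    # torch.Size([196, 768])
```

Because the position table has exactly one row per patch, this sketch only accepts images that produce the patch count it was built for.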

Patch token

A fixed-size square region of an image (e.g. 16×16 pixels), flattened and linearly projected into a vector, treated as one token in the sequence a Transformer processes — the visual analogue of a text token.

✧

MM Note: Patches Are the Bridge

The patch-as-token idea is the bridge that lets all the Transformer machinery from Part III apply to images. Once an image is a sequence of patch tokens, attention, layers, and everything else work unchanged. The Vision Transformer showed that you do not need image-specific architectures (like convolutional networks) — the same Transformer that processes text processes images, given patches. This unification is what made multi-modal LLMs natural.

A practical note: the number of patch tokens grows with image resolution, and these tokens consume context window just like text tokens. A high-resolution image can become hundreds or thousands of tokens, which is why image-token efficiency (Section 30.7) matters for multi-modal models.

Turning an image into patch tokens is only the first step. Those raw patch embeddings don't yet capture MEANING — they're just projected pixels. A VISION ENCODER is a Transformer (a ViT) that processes the patch tokens into rich feature vectors that capture what is actually in the image: objects, textures, relationships, text. The vision encoder is the 'eye' of a multi-modal model.

The Vision Transformer as Encoder

Arch Stack: A Vision Transformer (ViT) encoder

image features	(196, 768) meaningful vectors
Transformer blocks (×N)	self-attention over patches
patch + position embeddings	(196, 768)
patchify	image → 196 patch tokens
input image	(224, 224, 3)

The ViT encoder works just like the text Transformers of Part III: stacked blocks of self-attention and feed-forward layers, with each patch attending to all others. Through these layers, the patch representations become increasingly meaningful — early layers capture edges and textures, later layers capture objects and scene-level meaning. The output is a sequence of feature vectors, one per patch, that richly describe the image's content.
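As a rough sketch of that stack, PyTorch's built-in encoder blocks can be applied directly to patch tokens. The depth, width, and head count below are illustrative assumptions rather than the configuration of any particular pretrained ViT, and the input is random data standing in for embedded patches.

```python
import torch
import torch.nn as nn

def make_vit_encoder(model_dim=768, num_heads=12, num_layers=2):
    """Stack standard Transformer encoder blocks (self-attention + feed-forward)."""
    block = nn.TransformerEncoderLayer(
        d_model=model_dim,
        nhead=num_heads,
        dim_feedforward=4 * model_dim,
        activation="gelu",
        batch_first=True,    # inputs are (batch, sequence, features)
        norm_first=True,     # pre-norm blocks
    )
    return nn.TransformerEncoder(block, num_layers=num_layers)

encoder = make_vit_encoder().eval()
patch_tokens = torch.randn(1, 196, 768)    # (batch, patches, dim), random stand-in
with torch.no_grad():
    features = encoder(patch_tokens)
print(features.shape)                        # torch.Size([1, 196, 768])
```

The output keeps one vector per patch, so the sequence length is unchanged; only the content of each vector is refined by attention over all the other patches.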

▶

ML Connection: The Same Architecture, A Different Input

It is worth pausing on how little had to change. The vision encoder is the SAME Transformer architecture from Part III — self-attention, feed-forward, residuals, layer norm — applied to patch tokens instead of word tokens. The deep lesson of the last decade is that the Transformer is a general-purpose sequence processor; what differs across modalities is only how you turn the raw input into tokens. Vision, audio, even protein sequences and time series — all yield to the same architecture with the right tokenization.

This is why mastering the Transformer (Part III) was such a high-leverage investment: it is the substrate for nearly all modern AI, across every modality. Multi-modal models are not a new architecture so much as the same architecture fed new kinds of tokens.

Where the Vision Encoder Comes From

Crucially, multi-modal LLMs usually do not train a vision encoder from scratch. They start from a PRETRAINED one — most often a CLIP vision encoder (next section) — which has already learned rich visual features from massive image data. This is transfer learning: reuse a vision encoder that already 'sees' well, and connect it to the language model. The next section explains CLIP, the pretrained vision encoder that powers most multi-modal LLMs.

CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) is one of the most influential ideas in multi-modal AI. It learns a SHARED embedding space where images and their text descriptions land close together: a photo of a dog and the text 'a dog' get nearby vectors. This shared space is the foundation that lets a language model understand images, and CLIP's vision encoder is the 'eye' inside most multi-modal LLMs.

Contrastive Learning: Match the Pairs

CLIP trains on hundreds of millions of (image, caption) pairs scraped from the web. The training objective is CONTRASTIVE: given a batch of image-caption pairs, push each image's embedding CLOSE to its OWN caption's embedding, and FAR from all the OTHER captions in the batch. The model learns to align matching images and texts while separating mismatched ones, building, across hundreds of millions of pairs, a shared space in which matching images and captions sit close together.

Pipeline Flow: How CLIP learns a shared image-text space

1	Collect pairs	Hundreds of millions of (image, caption) pairs from the web
2	Encode both	Image encoder → image vector; text encoder → text vector
3	Contrastive loss	Pull matching pairs together, push mismatches apart
4	Result	A shared space where 'a dog' ≈ a photo of a dog

text•CLIP contrastive objective (sketch)
For a batch of N (image, text) pairs:
    compute similarity(image_i, text_j) for ALL i, j
    the DIAGONAL (i == j) are the TRUE matching pairs

Maximize similarity on the diagonal (matches),
minimize it off-diagonal (mismatches).
# A symmetric cross-entropy over images-to-texts and texts-to-images.
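The objective sketched above can be written in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the function name is made up, the temperature is a fixed constant here (CLIP treats it as a learned parameter), and the embeddings are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)      # unit length: dot product = cosine
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # logits[i, j] = sim(image_i, text_j)
    targets = torch.arange(logits.size(0))          # true match for row i is column i
    loss_i2t = F.cross_entropy(logits, targets)     # each image picks its caption
    loss_t2i = F.cross_entropy(logits.T, targets)   # each caption picks its image
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
emb = torch.randn(8, 32)
aligned = clip_contrastive_loss(emb, emb)                    # matching pairs coincide
shuffled = clip_contrastive_loss(emb, emb.roll(1, dims=0))   # every pair is mismatched
print(aligned.item() < shuffled.item())                      # True
```

The loss is low when the diagonal of the similarity matrix dominates each row and column, and high when the best match for an image sits off the diagonal.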

After training, CLIP can measure how well any image matches any text — enabling zero-shot classification ('is this image more like "a cat" or "a dog"?'), image search by text, and more. But for multi-modal LLMs, the prize is CLIP's VISION ENCODER: a network that produces image features already ALIGNED with language. Those features are the perfect input to hand to a language model, because they already live in a space connected to text meaning.
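Zero-shot classification with a shared space amounts to picking the caption whose embedding is closest to the image embedding. The sketch below uses hand-made three-dimensional vectors as stand-ins for encoder outputs; they are not real CLIP embeddings, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose text embedding is most similar to the image embedding."""
    scores = {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}
    return max(scores, key=scores.get), scores

# Toy vectors standing in for text-encoder outputs (made up for illustration).
label_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_vec = [0.2, 0.8, 0.1]   # pretend image embedding, nearer the dog caption

best, scores = zero_shot_classify(image_vec, label_vecs)
print(best)   # a photo of a dog
```

No classifier is trained for the new labels; adding a class only means embedding one more caption.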

✧

MM Note: Why CLIP Features Are Ideal for LLMs

CLIP's vision encoder is special because its image features were trained to align with TEXT. This means the visual features already 'speak the language of language' — they encode the kind of semantic content that text describes (objects, attributes, relationships) rather than raw low-level pixel statistics. Feeding CLIP features to a language model is therefore far easier than feeding raw pixels: the features are already in a language-aligned, meaningful form.

This is why the dominant recipe for multi-modal LLMs (LLaVA, next section) starts with a frozen CLIP vision encoder. CLIP did the hard work of learning language-aligned visual features; the multi-modal LLM just needs to connect those features to the LLM's input.

We now have the pieces: a language model (the whole book so far) and a vision encoder that produces language-aligned image features (CLIP). LLaVA (Large Language and Vision Assistant; Liu et al., 2023) showed a strikingly simple way to connect them, and it became the dominant recipe for vision-language models. The key component is just a small PROJECTION layer.

The Architecture: Encoder + Projector + LLM

Arch Stack: The LLaVA architecture

LLM (the language model)	processes image + text tokens together
projected image tokens	now in the LLM's embedding space
projection layer (the bridge)	maps CLIP features → LLM space
CLIP vision encoder (frozen)	image → language-aligned features
input image	(224, 224, 3)

The recipe has three parts. (1) A VISION ENCODER (frozen CLIP) turns the image into feature vectors. (2) A PROJECTION layer — often just a small multi-layer perceptron — maps those vectors into the LLM's embedding space, producing 'image tokens' that look to the LLM like word embeddings. (3) The LLM processes these image tokens INTERLEAVED with text tokens, attending across both. The image tokens are simply prepended to (or mixed with) the text tokens in the sequence.

The Projection Layer Is the Crucial Bridge

The projection layer is the heart of the connection. CLIP features live in CLIP's space; the LLM expects embeddings in ITS space. The projector translates between them — turning each image feature vector into a vector that the LLM can process as if it were a token embedding. Remarkably, this projector can be small (a one- or two-layer MLP), because CLIP features are already language-aligned (Section 30.4) — the translation needed is modest.

Python•The LLaVA-style connection
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, clip_encoder, llm, clip_dim=1024, llm_dim=4096):
        super().__init__()
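        self.vision = clip_encoder        # CLIP vision encoder, kept frozen
        for p in self.vision.parameters():
            p.requires_grad = False       # no gradient updates reach the encoder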
        self.llm    = llm                  # the language model
        # The PROJECTOR: maps CLIP features -> LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image, text_tokens):
        # 1. Encode the image -> feature vectors (one per patch)
        feats = self.vision(image)              # (num_patches, clip_dim)
        # 2. Project into the LLM's space -> 'image tokens'
        image_tokens = self.projector(feats)      # (num_patches, llm_dim)
        # 3. Embed text and CONCATENATE image + text tokens
        text_emb = self.llm.embed(text_tokens)    # (T, llm_dim)
        seq = torch.cat([image_tokens, text_emb], dim=0)
        # 4. The LLM processes the combined sequence as usual
        return self.llm(seq)

# The image becomes a few hundred 'tokens' (one per patch) that the LLM
# attends to alongside the text. To the LLM, an image is just more tokens.

✧

MM Note: Images Become 'Words' the LLM Reads

The conceptual payoff: after projection, an image is just a sequence of tokens prepended to the text. The LLM 'reads' the image tokens exactly as it reads word tokens, attending across both to answer questions about the image. The LLM did not need a new architecture; it needed only a projector that delivers image features as vectors in its input embedding space (the text vocabulary itself is unchanged). A picture quite literally becomes a few hundred 'words' the model can read.

This simplicity is why LLaVA was so influential: a powerful vision-language model from a frozen vision encoder, a pretrained LLM (frozen during alignment, fine-tuned afterward), and a small trainable projector. Most of the heavy lifting was already done by the separately-pretrained vision encoder and LLM; the projector just connects them.

How is a vision-language model trained? Building on the LLaVA recipe, training typically happens in STAGES, reusing the pretraining and SFT ideas from earlier in the book. The staged approach is efficient because it leverages the already-trained components and only learns what is genuinely new.

Stage 1: Alignment (Train the Projector)

The first stage trains ONLY the projection layer, with both the vision encoder and the LLM FROZEN. Using image-caption pairs, the projector learns to map image features into the LLM's space such that the LLM can describe the image. Since only the small projector trains, this stage is cheap. Its goal is purely to teach the bridge — to align the image tokens with the LLM's expectations.

Stage 2: Instruction Tuning (Visual SFT)

The second stage is SFT (Chapter 22) on multi-modal INSTRUCTION data — examples of images paired with questions and good answers ('What is in this image?' → a helpful description; 'Read the text in this sign' → the text). Here the projector AND usually the LLM are trained (the vision encoder often stays frozen). This teaches the model to FOLLOW INSTRUCTIONS about images — to be a helpful visual assistant, not just a captioner.

Pipeline Flow: Staged training of a vision-language model

1	Pretrained parts	Start from a pretrained CLIP encoder + a pretrained LLM
2	Stage 1: align	Train ONLY the projector on image-caption pairs (cheap)
3	Stage 2: visual SFT	Instruction-tune on image+question+answer data
4	(Optional) preference	RLHF/DPO on visual responses for helpfulness
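In code, the difference between the stages largely comes down to which parameters receive gradients. The sketch below shows one way to express that with `requires_grad`. The helper names are hypothetical, the three modules are tiny stand-ins for a real CLIP encoder, projector, and LLM, and keeping the vision encoder frozen in both stages is an assumption that matches the common case described above.

```python
import torch.nn as nn

def set_trainable(module, flag):
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vision, projector, llm, stage):
    """Stage 1 trains only the projector; stage 2 also trains the LLM."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    set_trainable(vision, False)      # encoder stays frozen in both stages (in this sketch)
    set_trainable(projector, True)    # the bridge is trained in both stages
    set_trainable(llm, stage == 2)    # LLM frozen for alignment, trained for visual SFT

def count_trainable(*modules):
    """Total number of parameters that will receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Tiny stand-in modules so the example runs instantly.
vision = nn.Linear(32, 16)
projector = nn.Sequential(nn.Linear(16, 24), nn.GELU(), nn.Linear(24, 24))
llm = nn.Linear(24, 24)

configure_stage(vision, projector, llm, stage=1)
stage1 = count_trainable(vision, projector, llm)
configure_stage(vision, projector, llm, stage=2)
stage2 = count_trainable(vision, projector, llm)
print(stage1, stage2)   # 1008 1608
```

An optimizer for each stage would then be built only from the parameters whose `requires_grad` is true.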

✧

MM Note: Reusing Everything We Built

Notice how multi-modal training reuses the entire book: a pretrained LLM (Part IV), the SFT recipe (Chapter 22), and even preference optimization (Chapters 23–24) applied to visual responses. The genuinely new parts are small — the projector and the multi-modal data. This is why capable vision-language models could be built so quickly after strong LLMs existed: most of the ingredients were already on the shelf.

The staged approach also reflects good engineering: train the cheap new bridge first (Stage 1), then do the more expensive instruction tuning (Stage 2). Freezing the expensive, already-good components (the vision encoder throughout, and the LLM during Stage 1) keeps training affordable and avoids damaging their pretrained abilities.

Where Multi-modal Data Comes From

Multi-modal instruction data is often generated by using a strong existing model: show a powerful model an image with annotations, and have it generate questions and answers about the image. This 'distillation' approach (echoing Chapter 22's data sources) scales the creation of visual instruction data cheaply. Human-annotated data and existing captioned datasets supplement it.

A practical reality shapes multi-modal models: images consume TOKENS, and tokens are expensive (they fill the context window and cost compute, per Chapter 27). A single high-resolution image can become hundreds or thousands of tokens. Managing this 'image-token budget' is a key engineering concern, and it explains many design choices in real multi-modal systems.

The Resolution-Tokens Trade-off

More image resolution means more patches means more tokens. A 224×224 image at 16×16 patches is 196 tokens; a 1024×1024 image is over 4,000 tokens — more than many entire text prompts. High resolution lets the model read fine detail (small text, distant objects) but at a steep token cost. Multi-modal systems must balance visual detail against the token budget.

Approach	What it does
Fixed low resolution	Few tokens, but misses fine detail
Tiling / cropping	Split a high-res image into tiles, each encoded separately
Token pooling / merging	Combine nearby patch tokens to reduce count
Resampler (Q-Former)	Learn a fixed small number of query tokens to summarize the image
Adaptive resolution	Use more tokens only when detail is needed
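To make the cost of tiling concrete, the sketch below estimates image tokens when a large image is covered by fixed-size tiles that are each encoded separately. The tile size (336), patch size (14), and the extra downscaled overview tile are illustrative assumptions, not the settings of any specific system.

```python
import math

def tokens_per_image(height, width, patch_size=14, tile=336, add_thumbnail=True):
    """Estimate image tokens when an image is split into fixed-size tiles."""
    tokens_per_tile = (tile // patch_size) ** 2                   # patches in one tile
    tiles = math.ceil(height / tile) * math.ceil(width / tile)    # tiles covering the image
    # Assumed extra: one downscaled overview tile when the image needs several tiles.
    extra = 1 if add_thumbnail and tiles > 1 else 0
    return (tiles + extra) * tokens_per_tile

for size in [(336, 336), (672, 672), (1344, 1008)]:
    print(size, tokens_per_image(*size))
# (336, 336) 576
# (672, 672) 2880
# (1344, 1008) 7488
```

Under these assumptions a single document-sized image already costs several thousand tokens, which is the pressure that motivates pooling and resampling.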

A common technique is a RESAMPLER (like the Q-Former in BLIP-2, or a Perceiver resampler): instead of passing all patch tokens to the LLM, a small module learns to compress them into a FIXED, small number of tokens (say 32 or 64) that summarize the image. This decouples the LLM's token cost from the image resolution — the LLM always sees a fixed, manageable number of image tokens regardless of how big the image is.
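The core mechanism can be sketched as a set of learned query vectors that cross-attend to the patch features. This is a deliberately reduced illustration: a single attention layer with hypothetical names and sizes, far simpler than the actual Q-Former or Perceiver resampler designs.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of patch features into a fixed set of tokens."""

    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        # Learned query vectors: one per output token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, dim); num_patches may vary per call
        batch = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (batch, num_queries, dim)
        out, _ = self.attn(q, patch_feats, patch_feats)       # queries attend to patches
        return self.norm(out)                                 # (batch, num_queries, dim)

resampler = Resampler()
small = resampler(torch.randn(1, 196, 256))     # low-resolution image
large = resampler(torch.randn(1, 4096, 256))    # high-resolution image
print(small.shape, large.shape)                 # both torch.Size([1, 32, 256])
```

The output length is set by the number of queries, not by the number of patches, which is what decouples the LLM's token cost from image resolution.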

✧

MM Note: The Token Budget Drives Design

Many design differences between multi-modal models come down to how they handle the image-token budget. Some pass all patch tokens (simple, accurate, expensive); some use a resampler to compress to a fixed small count (efficient, may lose detail); some tile high-res images and process tiles (good for documents and fine text, many tokens). There is no single best answer — it depends on whether your application needs fine visual detail (documents, charts) or coarse understanding (scene description).

This connects directly to the inference economics of Chapter 27: image tokens are input tokens, so they add prompt-processing (prefill) compute and occupy the context window just like text tokens. A model that uses 4,000 tokens per image is far more expensive to serve than one using 64, so token efficiency is both a capability and a cost decision.

Once you understand the vision recipe, audio follows the same pattern — which is the beauty of the tokenize-everything approach. To make a model HEAR, we need to turn audio into tokens the LLM can process, then connect them with a projector, just as we did for images. The modality changes; the recipe does not.

Turning Audio Into Tokens

Audio is a waveform: a 1D signal of amplitudes over time. Two common ways to tokenize it: (1) convert the waveform into a SPECTROGRAM (a 2D time-frequency image) and patchify it like an image, or (2) use a pretrained AUDIO ENCODER (like Whisper's encoder, which itself takes a spectrogram as input) that outputs a sequence of feature vectors. Either way, the result is a sequence of vectors (audio tokens) that a projector maps into the LLM's space.
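As with images, it helps to count how many tokens a clip turns into. The sketch below assumes common speech-processing settings (16 kHz audio, a 400-sample window, a 160-sample hop) and a hypothetical encoder that halves the frame rate. All of these values and function names are illustrative assumptions, and no padding is applied.

```python
def audio_frame_count(num_samples, window=400, hop=160):
    """Number of full analysis windows that fit in a waveform (no padding)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop

def audio_token_count(seconds, sample_rate=16_000, window=400, hop=160, stride=2):
    """Spectrogram frames, then tokens left after the encoder downsamples by `stride`."""
    frames = audio_frame_count(int(seconds * sample_rate), window, hop)
    return frames, frames // stride

frames, tokens = audio_token_count(30)
print(frames, tokens)   # 2998 1499
```

Under these assumptions thirty seconds of speech costs on the order of 1,500 tokens, so audio faces the same token-budget pressure as high-resolution images.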

Pipeline Flow: Making an LLM hear (mirrors the vision recipe)

1	Audio waveform	Raw sound signal over time
2	Audio encoder	Spectrogram + encoder → audio feature vectors
3	Projector	Map audio features → LLM embedding space
4	LLM	Process audio tokens alongside text

Notice this is EXACTLY the structure of the vision pipeline (Section 30.5), with 'audio encoder' swapped for 'vision encoder'. The projector and LLM roles are identical. This is the power of the shared-token framework: a new modality just needs an encoder that turns it into vectors and a projector that maps those vectors into the LLM's space. Speech recognition, audio question-answering, and music understanding all follow this template.

Vision pipeline	Audio pipeline
Image → patches	Waveform → spectrogram/frames
Vision encoder (CLIP/ViT)	Audio encoder (Whisper-style)
Projector → LLM space	Projector → LLM space
LLM reads image tokens	LLM reads audio tokens
VQA, captioning, OCR	Transcription, audio Q&A

✧

MM Note: Output Modalities Too

So far we have discussed multi-modal INPUT (the model perceives images/audio). Models can also produce multi-modal OUTPUT — generating images (covered in Chapter 7's generative models), speech, or other modalities. Some models are 'any-to-any', accepting and producing multiple modalities. The input side (perception) and output side (generation) use related but distinct techniques; this chapter focuses on input, where the tokenize-and-project recipe dominates.

Increasingly, frontier models are natively multi-modal — trained on text, images, and audio together from the start, rather than bolting vision onto a text model afterward. Native multi-modality can yield deeper cross-modal understanding, though the staged LLaVA-style approach remains the most accessible and widely-used recipe.

Step back and see the single principle unifying everything in this chapter: the SHARED EMBEDDING SPACE. Every modality — text, images, audio — is converted into vectors in a common space, where the model can reason across them all. This idea is the conceptual core of multi-modal AI, and it is worth understanding deeply.

One Space, Many Modalities

In a multi-modal model, a word, an image patch, and an audio frame all become vectors in the SAME space. Because they share a space, the attention mechanism can relate them: a question token can attend to image tokens, an image region can be linked to a word. The model does not have separate 'vision reasoning' and 'language reasoning' — it has ONE reasoning process operating over a unified sequence of multi-modal tokens. This is what enables genuine cross-modal understanding, like answering a text question about an image.

✧

Intuition: Why a Shared Space Enables Cross-Modal Reasoning

Imagine each modality lived in its OWN separate space, with no connection. Then the model could process images and text, but never RELATE them — it could not answer 'what color is the car in this photo?' because the word 'car' and the image of the car would be in incomparable spaces. A SHARED space is what lets the model connect 'car' (text) to the car (pixels): they are nearby vectors the attention mechanism can link.

This is why CLIP's contribution was so foundational — it explicitly LEARNED a shared image-text space. Everything downstream (LLaVA, audio LLMs) builds on the principle that putting modalities in one space is what makes reasoning across them possible. The shared space is not an implementation detail; it is the whole idea.

Alignment Is the Hard Part

The central challenge of multi-modality is ALIGNMENT — getting the different modalities to occupy the shared space consistently, so that semantically-related things across modalities land near each other. CLIP achieves it with contrastive learning; LLaVA's projector achieves it for the LLM's space; audio encoders achieve it for sound. When alignment is good, cross-modal reasoning works; when it is poor, the model 'sees' the image but cannot connect it to language. Most of the difficulty and the research in multi-modality is about achieving good alignment.

▶

ML Connection: Embeddings, All the Way Down

This chapter closes a loop opened back in Chapter 8 on embeddings. The whole edifice of modern AI rests on representing things — words, images, sounds, concepts — as vectors in meaningful spaces, where geometry encodes meaning and nearby vectors mean similar things. Multi-modality is the grand extension of that idea: put EVERYTHING in a shared vector space, and a single model can reason across all of it.

From word2vec (Chapter 8) to CLIP to multi-modal LLMs, the through-line is the same: learn good embeddings, and remarkable capabilities follow. The shared embedding space is perhaps the single most important idea in modern machine learning, and multi-modal models are its most striking demonstration.

Multi-modal models are powerful but imperfect. A clear-eyed view of their strengths and weaknesses is essential for using them well and not over-trusting them.

Strong at	Weak at / fails
Describing image content	Precise spatial reasoning (exact positions)
Visual question answering	Counting many objects accurately
Reading clear text (OCR)	Tiny / low-quality / dense text
Chart & document understanding	Complex diagrams, fine measurements
General scene understanding	Reasoning needing pixel precision
Common visual concepts	Rare/novel visual content unseen in training

Why the Weaknesses Exist

Many limitations trace to the patch tokenization. Because the image is summarized into a limited number of patch tokens, FINE DETAIL can be lost — small text, precise positions, exact counts. The model sees a compressed representation, not every pixel. Higher resolution (more tokens) helps but costs more (Section 30.7). Other weaknesses mirror text LLMs: they can hallucinate about images (describing objects that aren't there), and their visual knowledge is bounded by training data.

⚠️

Multi-modal Models Hallucinate Too

Just as text LLMs invent facts, vision-language models can 'hallucinate' visual content — confidently describing objects, text, or details that are not actually in the image. This often happens when the model relies on its language priors (what USUALLY appears in such scenes) over what is actually shown. For example, asked about a kitchen photo, it might mention a refrigerator that isn't visible because kitchens usually have one.

The same caution as everywhere in this book applies: do not over-trust the output, especially for precise or critical visual details (medical images, exact readings, fine text). Verify important visual claims, and be aware that confident-sounding descriptions can be wrong — a vision-language model's fluency is not a guarantee of visual accuracy.

The chapter's recipe (encoder + projector + LLM) is the dominant pattern, but the multi-modal landscape has several architectural approaches and a clear direction of travel. A brief map helps place what you have learned.

Approach	How it fuses modalities
Projection (LLaVA)	Vision encoder → projector → LLM input. Simple, dominant.
Cross-attention (Flamingo)	LLM layers cross-attend to image features
Resampler (BLIP-2)	Q-Former compresses image to a few query tokens
Native multi-modal	Trained on all modalities jointly from scratch
Unified any-to-any	One model accepts and generates multiple modalities

Two Ways to Fuse: Input vs Cross-Attention

There are two main ways to inject visual information into an LLM. The LLaVA approach puts image tokens directly into the LLM's INPUT sequence (simple, lets every layer attend to the image). The Flamingo approach (Alayrac et al., 2022) instead adds CROSS-ATTENTION layers inside the LLM that attend to image features (keeps the image separate from the text sequence, can be more parameter-efficient for many images). Both work; the input-injection approach (LLaVA) became more popular for its simplicity.
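The cross-attention alternative can be sketched as a block in which text hidden states attend to image features, with a tanh gate that starts at zero so the pretrained LLM's behavior is initially unchanged. This is a simplified illustration in the spirit of Flamingo's gated cross-attention, not its actual implementation; the class name and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text hidden states attend to image features; a tanh gate scales the update."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: block starts as identity

    def forward(self, text_hidden, image_feats):
        # text_hidden: (batch, T, dim); image_feats: (batch, P, dim)
        attended, _ = self.attn(self.norm(text_hidden), image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttention()
text = torch.randn(1, 5, 64)
image = torch.randn(1, 196, 64)
out = block(text, image)
print(out.shape)                 # torch.Size([1, 5, 64])
print(torch.equal(out, text))    # True at initialization, because the gate is zero
```

Note that the sequence length stays at T: the image never enters the text sequence, whereas input injection would produce a sequence of length P + T.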

✧

MM Note: Toward Native Multi-modality

The field is moving from 'bolt a vision encoder onto a text LLM' toward NATIVE multi-modal models trained on text, images, and audio together from the beginning. Native training can produce deeper cross-modal understanding, because the model learns the modalities jointly rather than stitching them together afterward. The frontier models increasingly take this approach.

But the staged, projection-based recipe of this chapter (LLaVA) remains the most accessible and instructive: it builds a capable vision-language model from off-the-shelf parts with a small projector, and it makes the core ideas — tokenize, project into a shared space, attend across modalities — crystal clear. Understanding it is the foundation for understanding the more integrated approaches.

Multi-modal Meets Everything Else

Multi-modal capabilities combine with the rest of Part VI: multi-modal RAG retrieves relevant images as well as text (Chapter 29); multi-modal agents can SEE a screen and act on it (Chapter 28); multi-modal reasoning models think about images step by step (Chapter 25). Vision and audio are not a separate track — they extend every capability we have built, letting models perceive and act in a richer, more human world.

Let us consolidate the chapter into one coherent picture of how a model comes to see and hear, from raw input to grounded answer.

Pipeline Flow: The complete multi-modal recipe

1	Tokenize the modality	Image → patches; audio → spectrogram/frames
2	Encode	A pretrained encoder (CLIP for vision) → meaningful features
3	Project	A small projector maps features into the LLM's space
4	Fuse	Modality tokens join text tokens in one shared-space sequence
5	Attend & reason	The LLM attends across all modalities to reason
6	Generate	A grounded answer about the image/audio + text

The Three Ideas to Remember

If you remember three things from this chapter, make them these. First, the TRANSFORMER IS MODALITY-AGNOSTIC — it processes sequences of vectors, whatever they originally were. Second, MULTI-MODALITY IS TOKENIZATION — the whole problem reduces to turning each modality into tokens in a shared space (encoder + projector). Third, the SHARED EMBEDDING SPACE is what enables cross-modal reasoning — putting everything in one space lets attention relate a word to a pixel to a sound.

✧

MM Note: The Elegance of the Solution

It is worth appreciating how ELEGANT the multi-modal solution is. We did not invent a fundamentally new kind of model for vision, another for audio, and an elaborate mechanism to glue them together. We took the one architecture we already had, the Transformer, and fed it new kinds of tokens; the vision encoder is itself a Transformer (ViT), and the only new glue is a small projector. The same attention, the same layers, the same training recipes (pretraining, SFT, preference optimization) all carried over. Multi-modality is less a new field than a natural consequence of the token-and-attention paradigm.

This is the deep lesson of the chapter, and of much of modern AI: find a general representation (tokens in a shared vector space) and a general processor (attention), and capability after capability follows from the same foundation. Master the foundation, and the modalities take care of themselves.

Multi-modal Quick-Reference

Concept	Key idea	Remember
Core idea	Turn every modality into tokens	Transformer is modality-agnostic
Patch tokens	Cut image into patches = tokens	ViT: patches are visual tokens
Vision encoder	ViT processes patches into features	Same architecture, new input
CLIP	Contrastive image-text alignment	Language-aligned visual features
LLaVA	Encoder + projector + LLM	The projector is the bridge
Staged training	Align projector, then visual SFT	Reuses the whole book
Token budget	Images cost many tokens	Resamplers compress them
Audio LLMs	Same recipe, audio encoder	Tokenize-and-project again
Shared space	All modalities in one space	Enables cross-modal reasoning
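
The token-budget row is easy to quantify. As a rough sketch, assuming square images, a patch size of 14, and a hypothetical resampler that always emits `K_QUERIES = 64` tokens (an illustrative number, not one from this chapter), the patch count grows with the square of the side length while the resampled count stays fixed:

```python
def patch_token_count(height, width, patch):
    """Number of patch tokens for an image that divides evenly into patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)

K_QUERIES = 64  # hypothetical resampler output size (assumption)

for side in (224, 448, 896):
    n = patch_token_count(side, side, 14)
    print(f"{side}x{side}: {n} patch tokens, {K_QUERIES} after resampling")
# 224x224: 256 patch tokens, 64 after resampling
# 448x448: 1024 patch tokens, 64 after resampling
# 896x896: 4096 patch tokens, 64 after resampling
```

Doubling the side length quadruples the number of tokens the LLM must attend over, which is why high-resolution inputs are where compression pays off.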

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

✎

Exercise 1: Pen & Paper

Explain the core idea that makes multi-modal models possible. Why is the Transformer described as 'modality-agnostic'?

✎

Exercise 2: Pen & Paper

Describe how an image is turned into tokens (the ViT patch idea). For a 336×336 image with 14×14 patches, how many patch tokens result?

✎

Exercise 3: Pen & Paper

Why do patch tokens need position embeddings? What would go wrong without them?

✎

Exercise 4: Pen & Paper

Explain CLIP's contrastive objective. Why does training on (image, caption) pairs produce a shared image-text space?

✎

Exercise 5: Pen & Paper

Why are CLIP's vision features especially well-suited as input to an LLM, compared to raw pixels?

✎

Exercise 6: Pen & Paper

Describe the LLaVA architecture (encoder, projector, LLM). What is the role of the projector, and why can it be small?

✎

Exercise 7: Pen & Paper

Explain the two-stage training of a vision-language model. What is frozen and what trains in each stage, and why?

✎

Exercise 8: Pen & Paper

Explain the image-token budget. Why are high-resolution images expensive, and how does a resampler help?

✎

Exercise 9: Pen & Paper

Show how the audio pipeline mirrors the vision pipeline. What is the only component that fundamentally changes?

✎

Exercise 10: Pen & Paper

Explain why a SHARED embedding space (not separate per-modality spaces) is what enables cross-modal reasoning. Give an example that requires it.

✎

Exercise 11: Code

Implement patchify: turn an image tensor into a sequence of flattened patch tokens. Verify the token count for several image sizes and patch sizes.

✎

Exercise 12: Code

Build a minimal Vision Transformer encoder (patch embed + position embed + a few Transformer blocks) and run an image through it to get patch features.

✎

Exercise 13: Code

Implement CLIP's contrastive loss for a batch of image and text embeddings. Verify it pulls matching pairs together and pushes mismatches apart.

✎

Exercise 14: Code

Use a pretrained CLIP model to do zero-shot classification: score an image against several text labels and pick the best. Test on a few images.

✎

Exercise 15: Code

Implement the LLaVA-style projector (an MLP) that maps vision features to an LLM's embedding dimension. Confirm the output shape matches the LLM's token embeddings.

✎

Exercise 16: Code Lab

Connect a frozen vision encoder to a small LLM via a projector. Concatenate image tokens with text tokens and run the combined sequence through the LLM.

✎

Exercise 17: Code Lab

Implement Stage-1 alignment training: freeze the encoder and LLM, and train ONLY the projector on image-caption pairs so the model can describe images.

✎

Exercise 18: Code

Implement a simple resampler that compresses N patch tokens into a fixed K query tokens (e.g. via cross-attention). Show the LLM's token count becomes independent of image resolution.

✎

Exercise 19: Code

Build an audio tokenization pipeline: convert a waveform to a spectrogram and patchify it like an image, producing audio tokens for an LLM.

✎

Exercise 20: Code (Challenge)

Build a minimal end-to-end vision-language model: a (small/pretrained) vision encoder, a trainable projector, and a small LLM. Train the projector on image-caption pairs (Stage 1), then instruction-tune on a small set of image-question-answer examples (Stage 2). Evaluate it on held-out visual questions, and probe its limits by testing fine-detail tasks (counting, small text) where you expect it to struggle — then explain those failures in terms of patch tokenization and the token budget.

Further reading: “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021, CLIP). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al., 2020, ViT). “Visual Instruction Tuning” (Liu et al., 2023, LLaVA). “Flamingo: a Visual Language Model for Few-Shot Learning” (Alayrac et al., 2022) for cross-attention fusion. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models” (Li et al., 2023) for the Q-Former resampler. “Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022, Whisper) for audio encoding. Surveys on multi-modal large language models cover the broader landscape.

Next → Chapter 31: Serving at Scale

You now have a fast (Chapter 27), tool-using (Chapter 28), knowledge-grounded (Chapter 29), and multi-modal (this chapter) model. The final chapter of Part VI tackles the last deployment challenge: serving the model RELIABLY to millions of users. Chapter 31 covers the systems engineering of production LLM serving — load balancing and autoscaling across many GPUs and replicas, routing and scheduling requests, managing reliability and cost at scale, and the full production stack that turns a model into a dependable service. It closes Part VI by completing the journey from a trained model to a real, scalable product — and bridges to the frontier techniques of Part VII.

