Multi-modal Models
Every model in this book so far has lived in a world of text. But the real world is rich with images, audio, and video, and much of human communication and knowledge is non-textual. A model that can SEE a photo, READ a chart, or HEAR speech is far more useful than one limited to text. This chapter extends the LLM to MULTIPLE MODALITIES — and the central idea turns out to be a beautiful and surprisingly simple extension of everything we have built.
What Multi-modal Models Can Do
| Capability | Example |
|---|---|
| Image description | Describe what is in a photo |
| Visual question answering | 'What color is the car in this image?' |
| Chart & document understanding | Read a graph, table, or scanned page |
| OCR and text-in-images | Extract and reason about text within images |
| Visual reasoning | 'Why is this meme funny?' / diagram interpretation |
| Audio understanding | Transcribe and answer questions about speech |
| Grounding | Point to where an object is in an image |
The Big Idea: Turn Everything Into Tokens
Here is the key insight that makes multi-modal models work, and it is elegantly simple. A Transformer processes a sequence of TOKENS — it does not care what those tokens originally were. If we can turn an IMAGE into a sequence of tokens that live in the SAME space as text tokens, the very same Transformer can process images and text together, attending across both. Multi-modality is, at its heart, the problem of converting other modalities into tokens the language model already knows how to handle.
Text is naturally a sequence of tokens, but an image is a grid of pixels — how do we turn it into a sequence? The answer, from the Vision Transformer (ViT; Dosovitskiy et al., 2020), is delightfully simple: cut the image into PATCHES, and treat each patch as a token. This single idea is what lets the Transformer process images.
Patches as Tokens
A ViT splits an image into a grid of fixed-size patches — say 16×16 pixels each. A 224×224 image becomes a 14×14 grid = 196 patches. Each patch is flattened and passed through a small linear layer to produce a vector — a 'patch embedding'. Now the image is a sequence of 196 vectors, exactly the form a Transformer expects. Patches are to images what tokens are to text.
Shape Trace: Image to patch tokens
| Operation | Shape | Note |
|---|---|---|
| input image | (224, 224, 3) | raw pixels |
| split into patches | (196, 16, 16, 3) | 14×14 grid of 16×16 patches |
| flatten each patch | (196, 768) | 16×16×3 = 768 pixel values per patch |
| linear projection | (196, 768) | patch embeddings (model dim 768, as in ViT-Base) |
| + position embeddings | (196, 768) | where each patch is |
| → Transformer | (196, 768) | a sequence of tokens! |
Just like text tokens, patch tokens get POSITION EMBEDDINGS so the model knows where each patch was in the image (top-left, center, etc.) — because, as with text, attention is otherwise order-blind (Chapter 13). With patches embedded and positioned, a standard Transformer can process the image, with each patch attending to every other patch to build up an understanding of the whole scene.
import torch


def patchify(image, patch_size=16):
    """Turn an image into a sequence of flattened patch vectors."""
    # image: (C, H, W) -> grid of (patch_size x patch_size) patches
    C, H, W = image.shape
    # unfold height, then width: (C, H/p, W/p, p, p)
    patches = image.unfold(1, patch_size, patch_size) \
                   .unfold(2, patch_size, patch_size)
    # rearrange into (num_patches, C * patch_size * patch_size)
    patches = patches.reshape(C, -1, patch_size * patch_size)
    num_patches = patches.size(1)
    patches = patches.permute(1, 0, 2).reshape(num_patches, -1)
    return patches  # (num_patches, C * patch_size * patch_size)


# A 224x224 image with 16x16 patches -> 196 patch tokens.
# Each is linearly projected to the model dim and given a position embedding.
# Then a standard Transformer processes them, exactly like text tokens.

Turning an image into patch tokens is only the first step. Those raw patch embeddings don't yet capture MEANING; they're just projected pixels. A VISION ENCODER is a Transformer (a ViT) that processes the patch tokens into rich feature vectors that capture what is actually in the image: objects, textures, relationships, text. The vision encoder is the 'eye' of a multi-modal model.
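The 'projected pixels' step can be made concrete. The following is a minimal sketch under stated assumptions, not a reference implementation: the class name `PatchEmbedding` and its default sizes are illustrative choices that match the 224×224 image and 16×16 patches used above, and the position embeddings are assumed to be learned parameters.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Hypothetical helper: image -> patch tokens with position embeddings."""

    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of each flattened patch to the model dimension.
        self.proj = nn.Linear(patch_size * patch_size * channels, dim)
        # One learned position vector per patch location (an assumption).
        self.pos = nn.Parameter(torch.zeros(num_patches, dim))

    def forward(self, image):
        # image: (C, H, W), with H and W multiples of the patch size
        C, H, W = image.shape
        p = self.patch_size
        # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (num_patches, C*p*p)
        patches = image.reshape(C, H // p, p, W // p, p)
        patches = patches.permute(1, 3, 0, 2, 4).reshape(-1, C * p * p)
        return self.proj(patches) + self.pos


tokens = PatchEmbedding()(torch.randn(3, 224, 224))
print(tokens.shape)  # torch.Size([196, 768])
```

At this point each token is still only a linear function of 768 pixel values plus a position vector, which is why an encoder is needed on top.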
The Vision Transformer as Encoder
Arch Stack: A Vision Transformer (ViT) encoder
| Layer (top = output) | Details |
|---|---|
| image features | (196, 768) meaningful vectors |
| Transformer blocks (×N) | self-attention over patches |
| patch + position embeddings | (196, 768) |
| patchify | image → 196 patch tokens |
| input image | (224, 224, 3) |
The ViT encoder works just like the text Transformers of Part III: stacked blocks of self-attention and feed-forward layers, with each patch attending to all others. Through these layers, the patch representations become increasingly meaningful — early layers capture edges and textures, later layers capture objects and scene-level meaning. The output is a sequence of feature vectors, one per patch, that richly describe the image's content.
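The stack described above can be sketched with PyTorch's built-in Transformer blocks. This is an illustration only: `build_vit_encoder` is a hypothetical helper, the demo uses deliberately small sizes so it runs quickly, and a real ViT adds details (such as a final layer norm) that are omitted here.

```python
import torch
import torch.nn as nn


def build_vit_encoder(dim=768, heads=12, depth=12):
    """Stack of standard Transformer blocks applied to patch tokens."""
    block = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=heads,
        dim_feedforward=4 * dim,
        activation="gelu",
        batch_first=True,   # inputs are (batch, tokens, dim)
        norm_first=True,    # pre-norm blocks
    )
    return nn.TransformerEncoder(block, num_layers=depth, enable_nested_tensor=False)


# Small sizes for a quick demo; ViT-Base uses dim=768, 12 heads, 12 blocks.
encoder = build_vit_encoder(dim=64, heads=4, depth=2).eval()
patch_tokens = torch.randn(1, 196, 64)   # embedded and positioned patches
with torch.no_grad():
    features = encoder(patch_tokens)
print(features.shape)  # torch.Size([1, 196, 64])
```

The output has the same shape as the input: one feature vector per patch, now informed by every other patch through self-attention.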
Where the Vision Encoder Comes From
Crucially, multi-modal LLMs usually do not train a vision encoder from scratch. They start from a PRETRAINED one — most often a CLIP vision encoder (next section) — which has already learned rich visual features from massive image data. This is transfer learning: reuse a vision encoder that already 'sees' well, and connect it to the language model. The next section explains CLIP, the pretrained vision encoder that powers most multi-modal LLMs.
CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) is one of the most influential ideas in multi-modal AI. It learns a SHARED embedding space where images and their text descriptions land in the same place. A photo of a dog and the text 'a dog' get nearby vectors. This shared space is the foundation that lets a language model understand images, and CLIP's vision encoder is the 'eye' inside most multi-modal LLMs.
Contrastive Learning: Match the Pairs
CLIP trains on roughly 400 million (image, caption) pairs scraped from the web. The training objective is CONTRASTIVE: given a batch of image-caption pairs, push each image's embedding CLOSE to its OWN caption's embedding, and FAR from all the OTHER captions in the batch (and likewise for each caption against the other images). Over hundreds of millions of pairs, the model learns to align matching images and texts while separating mismatched ones, building a space in which meaning is shared across modalities.
Pipeline Flow: How CLIP learns a shared image-text space
| Step | Stage | What happens |
|---|---|---|
| 1 | Collect pairs | Hundreds of millions of (image, caption) pairs from the web |
| 2 | Encode both | Image encoder → image vector; text encoder → text vector |
| 3 | Contrastive loss | Pull matching pairs together, push mismatches apart |
| 4 | Result | A shared space where 'a dog' ≈ a photo of a dog |
For a batch of N (image, text) pairs:
    compute similarity(image_i, text_j) for ALL i, j
    the DIAGONAL entries (i == j) are the TRUE matching pairs
    maximize similarity on the diagonal (matches),
    minimize it off the diagonal (mismatches).
    # Implemented as a symmetric cross-entropy over images-to-texts and texts-to-images.

After training, CLIP can measure how well any image matches any text. This enables zero-shot classification ('is this image more like "a cat" or "a dog"?'), image search by text, and more. But for multi-modal LLMs, the prize is CLIP's VISION ENCODER: a network that produces image features already ALIGNED with language. Those features are the perfect input to hand to a language model, because they already live in a space connected to text meaning.
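The contrastive objective and the zero-shot use described above can be written in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the embeddings are random stand-ins for encoder outputs, and the temperature is fixed at 0.07 for simplicity (CLIP learns its temperature during training, starting from that value).

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)   # unit vectors -> cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # entry (i, j) = sim(image_i, text_j)
    targets = torch.arange(logits.size(0), device=logits.device)  # true pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # each image picks its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # each caption picks its image
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_embs):
    """Index of the class prompt most similar to a single image embedding."""
    sims = F.normalize(class_text_embs, dim=-1) @ F.normalize(image_emb, dim=-1)
    return int(sims.argmax())


torch.manual_seed(0)
emb = torch.randn(8, 32)                               # stand-in embeddings
aligned = clip_contrastive_loss(emb, emb)              # every pair matches
shuffled = clip_contrastive_loss(emb, emb.roll(1, 0))  # every pair mismatched
print(aligned.item() < shuffled.item())  # True
```

The loss is low when the diagonal dominates each row and column of the similarity matrix, and high when the true partner is no more similar than the other items in the batch.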
We now have the pieces: a language model (the whole book so far) and a vision encoder that produces language-aligned image features (CLIP). LLaVA (Large Language and Vision Assistant; Liu et al., 2023) showed a strikingly simple way to connect them, and it became the dominant recipe for vision-language models. The key component is just a small PROJECTION layer.
The Architecture: Encoder + Projector + LLM
Arch Stack: The LLaVA architecture
| Layer (top = output) | Details |
|---|---|
| LLM (the language model) | processes image + text tokens together |
| projected image tokens | now in the LLM's embedding space |
| projection layer (the bridge) | maps CLIP features → LLM space |
| CLIP vision encoder (frozen) | image → language-aligned features |
| input image | (224, 224, 3) |
The recipe has three parts. (1) A VISION ENCODER (frozen CLIP) turns the image into feature vectors. (2) A PROJECTION layer — often just a small multi-layer perceptron — maps those vectors into the LLM's embedding space, producing 'image tokens' that look to the LLM like word embeddings. (3) The LLM processes these image tokens INTERLEAVED with text tokens, attending across both. The image tokens are simply prepended to (or mixed with) the text tokens in the sequence.
The Projection Layer Is the Crucial Bridge
The projection layer is the heart of the connection. CLIP features live in CLIP's space; the LLM expects embeddings in ITS space. The projector translates between them — turning each image feature vector into a vector that the LLM can process as if it were a token embedding. Remarkably, this projector can be small (a one- or two-layer MLP), because CLIP features are already language-aligned (Section 30.4) — the translation needed is modest.
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    def __init__(self, clip_encoder, llm, clip_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision = clip_encoder  # frozen CLIP vision encoder
        self.llm = llm              # the language model
        # The PROJECTOR: maps CLIP features -> LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image, text_tokens):
        # 1. Encode the image -> feature vectors (one per patch).
        #    no_grad keeps the frozen encoder out of the backward pass.
        with torch.no_grad():
            feats = self.vision(image)              # (num_patches, clip_dim)
        # 2. Project into the LLM's space -> 'image tokens'
        image_tokens = self.projector(feats)        # (num_patches, llm_dim)
        # 3. Embed text and CONCATENATE image + text tokens.
        #    `llm.embed` stands for the LLM's token-embedding lookup.
        text_emb = self.llm.embed(text_tokens)      # (T, llm_dim)
        seq = torch.cat([image_tokens, text_emb], dim=0)
        # 4. The LLM processes the combined sequence as usual
        #    (assumes `llm` accepts embeddings rather than token ids).
        return self.llm(seq)


# The image becomes a sequence of 'tokens' the LLM attends to alongside
# the text. To the LLM, an image is just more tokens in the sequence.

How is a vision-language model trained? Building on the LLaVA recipe, training typically happens in STAGES, reusing the pretraining and SFT ideas from earlier in the book. The staged approach is efficient because it leverages the already-trained components and only learns what is genuinely new.
Stage 1: Alignment (Train the Projector)
The first stage trains ONLY the projection layer, with both the vision encoder and the LLM FROZEN. Using image-caption pairs, the projector learns to map image features into the LLM's space such that the LLM can describe the image. Since only the small projector trains, this stage is cheap. Its goal is purely to teach the bridge — to align the image tokens with the LLM's expectations.
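In code, 'train only the projector' comes down to disabling gradients on the two pretrained components and handing only the projector's parameters to the optimizer. The sketch below uses tiny linear layers as stand-ins for the real CLIP encoder and LLM; those stand-ins, the `freeze` helper, and all sizes are assumptions made so the example runs in isolation.

```python
import torch
import torch.nn as nn


def freeze(module):
    """Disable gradients for every parameter of a pretrained component."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


# Stand-ins for the real components (assumptions, sized for a quick demo).
vision_encoder = freeze(nn.Linear(48, 32))   # pretend CLIP: patch pixels -> features
llm_body = freeze(nn.Linear(16, 100))        # pretend LLM: hidden state -> vocab logits
projector = nn.Sequential(nn.Linear(32, 16), nn.GELU(), nn.Linear(16, 16))

# Only the projector's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

patches = torch.randn(4, 48)                 # 4 toy patch inputs
caption_ids = torch.randint(0, 100, (4,))    # toy caption targets, one per position

before = projector[0].weight.clone()
logits = llm_body(projector(vision_encoder(patches)))   # (4, 100)
loss = nn.functional.cross_entropy(logits, caption_ids)
loss.backward()                              # gradients flow THROUGH the frozen LLM
optimizer.step()

print(vision_encoder.weight.grad is None)            # True: frozen
print(not torch.equal(before, projector[0].weight))  # True: projector updated
```

Note that the gradient still passes through the frozen LLM on its way back to the projector; freezing stops the LLM's weights from changing, not the signal from reaching the bridge.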
Stage 2: Instruction Tuning (Visual SFT)
The second stage is SFT (Chapter 22) on multi-modal INSTRUCTION data — examples of images paired with questions and good answers ('What is in this image?' → a helpful description; 'Read the text in this sign' → the text). Here the projector AND usually the LLM are trained (the vision encoder often stays frozen). This teaches the model to FOLLOW INSTRUCTIONS about images — to be a helpful visual assistant, not just a captioner.
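One practical detail of visual SFT is deciding which positions contribute to the loss. A common convention in instruction tuning, assumed here rather than taken from this chapter, is to supervise only the answer tokens and ignore the image and question positions. The helper `build_labels` and the toy token ids below are hypothetical.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the default ignore_index of F.cross_entropy


def build_labels(num_image_tokens, question_ids, answer_ids):
    """Supervise only the answer; image and question positions are ignored."""
    ignored = torch.full(
        (num_image_tokens + len(question_ids),), IGNORE_INDEX, dtype=torch.long
    )
    return torch.cat([ignored, answer_ids])


question_ids = torch.tensor([11, 12, 13])
answer_ids = torch.tensor([21, 22])
labels = build_labels(num_image_tokens=4, question_ids=question_ids, answer_ids=answer_ids)
print(labels.tolist())  # [-100, -100, -100, -100, -100, -100, -100, 21, 22]

# Next-token loss: the logits at position t predict the label at position t + 1.
vocab_size = 50
logits = torch.randn(len(labels), vocab_size)   # stand-in for the model's output
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```

With this masking the model is never asked to predict image tokens or to reproduce the question; it is only graded on producing the answer given both.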
Pipeline Flow: Staged training of a vision-language model
| Step | Stage | What happens |
|---|---|---|
| 1 | Pretrained parts | Start from a pretrained CLIP encoder + a pretrained LLM |
| 2 | Stage 1: align | Train ONLY the projector on image-caption pairs (cheap) |
| 3 | Stage 2: visual SFT | Instruction-tune on image+question+answer data |
| 4 | (Optional) preference | RLHF/DPO on visual responses for helpfulness |
Where Multi-modal Data Comes From
Multi-modal instruction data is often generated with a strong existing model: give it an image, or text annotations of the image such as captions and object bounding boxes, and have it generate questions and answers about the image. LLaVA built its instruction data this way, prompting a text-only GPT-4 with captions and bounding boxes. This 'distillation' approach (echoing Chapter 22's data sources) scales the creation of visual instruction data cheaply. Human-annotated data and existing captioned datasets supplement it.
A practical reality shapes multi-modal models: images consume TOKENS, and tokens are expensive (they fill the context window and cost compute, per Chapter 27). A single high-resolution image can become hundreds or thousands of tokens. Managing this 'image-token budget' is a key engineering concern, and it explains many design choices in real multi-modal systems.
The Resolution-Tokens Trade-off
More image resolution means more patches means more tokens. A 224×224 image at 16×16 patches is 196 tokens; a 1024×1024 image is over 4,000 tokens — more than many entire text prompts. High resolution lets the model read fine detail (small text, distant objects) but at a steep token cost. Multi-modal systems must balance visual detail against the token budget.
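The arithmetic behind these numbers is simple enough to check directly. The function name below is hypothetical, and the sketch assumes square, non-overlapping patches with no extra special tokens.

```python
def image_token_count(height, width, patch_size=16):
    """Number of patch tokens for an image whose sides divide evenly by patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


for side in (224, 448, 1024):
    print(side, image_token_count(side, side))
# 224 196
# 448 784
# 1024 4096
```

Doubling the side length quadruples the token count, so the cost grows with the image's area rather than its width.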
| Approach | What it does |
|---|---|
| Fixed low resolution | Few tokens, but misses fine detail |
| Tiling / cropping | Split a high-res image into tiles, each encoded separately |
| Token pooling / merging | Combine nearby patch tokens to reduce count |
| Resampler (Q-Former) | Learn a fixed small number of query tokens to summarize the image |
| Adaptive resolution | Use more tokens only when detail is needed |
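As one illustration of the 'token pooling / merging' row, neighbouring patch tokens can be averaged in 2×2 blocks, cutting the count by a factor of four. This is a deliberately simple sketch with a hypothetical function name; real systems may use learned or similarity-based merging instead of plain averaging.

```python
import torch
import torch.nn.functional as F


def pool_patch_tokens(tokens, grid_size, window=2):
    """Average each window x window block of neighbouring patch tokens."""
    n, dim = tokens.shape
    assert n == grid_size * grid_size, "tokens must form a square grid"
    grid = tokens.reshape(grid_size, grid_size, dim).permute(2, 0, 1)  # (dim, G, G)
    pooled = F.avg_pool2d(grid.unsqueeze(0), kernel_size=window)       # (1, dim, G/w, G/w)
    return pooled.squeeze(0).flatten(1).t()                            # (n / w^2, dim)


tokens = torch.randn(196, 768)            # 14 x 14 grid of patch tokens
merged = pool_patch_tokens(tokens, grid_size=14)
print(merged.shape)  # torch.Size([49, 768])
```

The first merged token is the mean of the four tokens in the top-left 2×2 block of the grid, so spatial layout is preserved at a coarser scale.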
A common technique is a RESAMPLER (like the Q-Former in BLIP-2, or a Perceiver resampler): instead of passing all patch tokens to the LLM, a small module learns to compress them into a FIXED, small number of tokens (say 32 or 64) that summarize the image. This decouples the LLM's token cost from the image resolution — the LLM always sees a fixed, manageable number of image tokens regardless of how big the image is.
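The core of a resampler is a set of learned query vectors that cross-attend to the patch tokens. The sketch below is a single-layer simplification with a hypothetical class name and small sizes; the actual Q-Former and Perceiver resampler are multi-layer modules with additional components.

```python
import torch
import torch.nn as nn


class Resampler(nn.Module):
    """Hypothetical single-layer resampler: learned queries cross-attend to patches."""

    def __init__(self, dim=64, num_queries=32, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)  # queries read from patches
        return out                                          # (batch, num_queries, dim)


resampler = Resampler()
for num_patches in (196, 4096):
    print(resampler(torch.randn(1, num_patches, 64)).shape)
# torch.Size([1, 32, 64]) both times
```

Because the number of queries is fixed, the output length is the same for a 196-patch image and a 4,096-patch image.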
Once you understand the vision recipe, audio follows the same pattern — which is the beauty of the tokenize-everything approach. To make a model HEAR, we need to turn audio into tokens the LLM can process, then connect them with a projector, just as we did for images. The modality changes; the recipe does not.
Turning Audio Into Tokens
Audio is a waveform: a 1D signal of amplitudes over time. Two common ways to tokenize it: (1) convert the waveform into a SPECTROGRAM (a 2D time-frequency image) and patchify it like an image, or (2) use a pretrained AUDIO ENCODER (like Whisper's encoder, which itself takes a log-mel spectrogram as input) that converts the audio into a sequence of feature vectors. Either way, the result is a sequence of vectors, the audio tokens, that a projector maps into the LLM's space.
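The spectrogram route can be sketched with a short-time Fourier transform. The parameters below are assumptions for illustration: a 16 kHz sample rate, a 400-sample window (25 ms) and a 160-sample hop (10 ms). For brevity the frames are fed straight to a linear projector; a real system would place an audio encoder in between and would typically use mel-scaled features.

```python
import torch


def audio_to_frames(waveform, n_fft=400, hop_length=160):
    """Waveform (num_samples,) -> log-magnitude frames (num_frames, n_fft // 2 + 1)."""
    spec = torch.stft(
        waveform,
        n_fft=n_fft,
        hop_length=hop_length,
        window=torch.hann_window(n_fft),
        return_complex=True,
    )                                   # (freq_bins, num_frames), complex
    log_mag = torch.log1p(spec.abs())   # compress the dynamic range
    return log_mag.t()                  # one row per time frame


waveform = torch.randn(16000)           # one second of stand-in audio
frames = audio_to_frames(waveform)
projector = torch.nn.Linear(frames.size(1), 512)   # 512 is a stand-in LLM width
audio_tokens = projector(frames)
print(frames.shape[1], audio_tokens.shape[1])  # 201 512
```

Each row of `frames` plays the role a flattened patch plays for images: a fixed-size slice of the raw signal, ready to be encoded and projected.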
Pipeline Flow: Making an LLM hear (mirrors the vision recipe)
| Step | Stage | What happens |
|---|---|---|
| 1 | Audio waveform | Raw sound signal over time |
| 2 | Audio encoder | Spectrogram + encoder → audio feature vectors |
| 3 | Projector | Map audio features → LLM embedding space |
| 4 | LLM | Process audio tokens alongside text |
Notice this is EXACTLY the structure of the vision pipeline (Section 30.5), with 'audio encoder' swapped for 'vision encoder'. The projector and LLM roles are identical. This is the power of the shared-token framework: a new modality just needs an encoder that turns it into vectors and a projector that maps those vectors into the LLM's space. Speech recognition, audio question-answering, and music understanding all follow this template.
| Vision pipeline | Audio pipeline |
|---|---|
| Image → patches | Waveform → spectrogram/frames |
| Vision encoder (CLIP/ViT) | Audio encoder (Whisper-style) |
| Projector → LLM space | Projector → LLM space |
| LLM reads image tokens | LLM reads audio tokens |
| VQA, captioning, OCR | Transcription, audio Q&A |
Step back and see the single principle unifying everything in this chapter: the SHARED EMBEDDING SPACE. Every modality — text, images, audio — is converted into vectors in a common space, where the model can reason across them all. This idea is the conceptual core of multi-modal AI, and it is worth understanding deeply.
One Space, Many Modalities
In a multi-modal model, a word, an image patch, and an audio frame all become vectors in the SAME space. Because they share a space, the attention mechanism can relate them: a question token can attend to image tokens, an image region can be linked to a word. The model does not have separate 'vision reasoning' and 'language reasoning' — it has ONE reasoning process operating over a unified sequence of multi-modal tokens. This is what enables genuine cross-modal understanding, like answering a text question about an image.
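To see what 'a question token can attend to image tokens' means mechanically, consider attention over a concatenated sequence. The sketch below is a toy: the vectors are random stand-ins, and it uses a single head with identity query/key projections, which is a simplification of real attention layers.

```python
import torch

torch.manual_seed(0)
dim = 16
image_tokens = torch.randn(4, dim)   # stand-ins for projected patch tokens
text_tokens = torch.randn(3, dim)    # stand-ins for word embeddings
sequence = torch.cat([image_tokens, text_tokens], dim=0)   # (7, dim), one shared space

# Scaled dot-product scores between every pair of positions.
scores = sequence @ sequence.t() / dim ** 0.5
# Causal mask: each position attends to itself and earlier positions only.
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)
weights = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)

last_text = weights[-1]                   # attention paid by the final text token
on_image = last_text[:4].sum().item()     # share of attention spent on image tokens
on_text = last_text[4:].sum().item()      # share spent on text tokens
print(round(on_image + on_text, 4))       # 1.0
```

The final text token distributes a single attention budget over image and text positions alike; nothing in the computation distinguishes the two modalities once they share a space.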
Alignment Is the Hard Part
The central challenge of multi-modality is ALIGNMENT: getting the different modalities to occupy the shared space consistently, so that semantically related things across modalities land near each other. CLIP achieves it between images and text with contrastive learning; LLaVA's projector achieves it between CLIP features and the LLM's space; an audio encoder and its projector achieve it for sound. When alignment is good, cross-modal reasoning works; when it is poor, the model 'sees' the image but cannot connect it to language. Most of the difficulty, and most of the research, in multi-modality is about achieving good alignment.
Multi-modal models are powerful but imperfect. A clear-eyed view of their strengths and weaknesses is essential for using them well and not over-trusting them.
| Strong at | Weak at / fails |
|---|---|
| Describing image content | Precise spatial reasoning (exact positions) |
| Visual question answering | Counting many objects accurately |
| Reading clear text (OCR) | Tiny / low-quality / dense text |
| Chart & document understanding | Complex diagrams, fine measurements |
| General scene understanding | Reasoning needing pixel precision |
| Common visual concepts | Rare/novel visual content unseen in training |
Why the Weaknesses Exist
Many limitations trace to the patch tokenization. Because the image is summarized into a limited number of patch tokens, FINE DETAIL can be lost — small text, precise positions, exact counts. The model sees a compressed representation, not every pixel. Higher resolution (more tokens) helps but costs more (Section 30.7). Other weaknesses mirror text LLMs: they can hallucinate about images (describing objects that aren't there), and their visual knowledge is bounded by training data.
The chapter's recipe (encoder + projector + LLM) is the dominant pattern, but the multi-modal landscape has several architectural approaches and a clear direction of travel. A brief map helps place what you have learned.
| Approach | How it fuses modalities |
|---|---|
| Projection (LLaVA) | Vision encoder → projector → LLM input. Simple, dominant. |
| Cross-attention (Flamingo) | LLM layers cross-attend to image features |
| Resampler (BLIP-2) | Q-Former compresses image to a few query tokens |
| Native multi-modal | Trained on all modalities jointly from scratch |
| Unified any-to-any | One model accepts and generates multiple modalities |
Two Ways to Fuse: Input vs Cross-Attention
There are two main ways to inject visual information into an LLM. The LLaVA approach puts image tokens directly into the LLM's INPUT sequence (simple, lets every layer attend to the image). The Flamingo approach (Alayrac et al., 2022) instead adds CROSS-ATTENTION layers inside the LLM that attend to image features (keeps the image separate from the text sequence, can be more parameter-efficient for many images). Both work; the input-injection approach (LLaVA) became more popular for its simplicity.
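The cross-attention alternative can be sketched as a layer in which text hidden states supply the queries and image features supply the keys and values. The class below is a hypothetical simplification; it includes a tanh gate initialised to zero, in the spirit of Flamingo's gated cross-attention, so that the layer leaves the pretrained LLM's behaviour unchanged at the start of training.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Text hidden states attend to image features; a tanh gate scales the update."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: starts as a no-op

    def forward(self, text_hidden, image_feats):
        # Queries come from text; keys and values come from the image.
        update, _ = self.attn(text_hidden, image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * update


layer = GatedCrossAttention()
text_hidden = torch.randn(1, 5, 64)     # 5 text positions
image_feats = torch.randn(1, 196, 64)   # 196 patch features, never added to the sequence
out = layer(text_hidden, image_feats)
print(out.shape)                        # torch.Size([1, 5, 64])
print(torch.equal(out, text_hidden))    # True at initialisation, because the gate is zero
```

In contrast with input injection, the text sequence stays 5 positions long no matter how many image features it reads from.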
Multi-modal Meets Everything Else
Multi-modal capabilities combine with the rest of Part VI: multi-modal RAG retrieves relevant images as well as text (Chapter 29); multi-modal agents can SEE a screen and act on it (Chapter 28); multi-modal reasoning models think about images step by step (Chapter 25). Vision and audio are not a separate track — they extend every capability we have built, letting models perceive and act in a richer, more human world.
Let us consolidate the chapter into one coherent picture of how a model comes to see and hear, from raw input to grounded answer.
Pipeline Flow: The complete multi-modal recipe
| Step | Stage | What happens |
|---|---|---|
| 1 | Tokenize the modality | Image → patches; audio → spectrogram/frames |
| 2 | Encode | A pretrained encoder (CLIP for vision) → meaningful features |
| 3 | Project | A small projector maps features into the LLM's space |
| 4 | Fuse | Modality tokens join text tokens in one shared-space sequence |
| 5 | Attend & reason | The LLM attends across all modalities to reason |
| 6 | Generate | A grounded answer about the image/audio + text |
The Three Ideas to Remember
If you remember three things from this chapter, make them these. First, the TRANSFORMER IS MODALITY-AGNOSTIC — it processes sequences of vectors, whatever they originally were. Second, MULTI-MODALITY IS TOKENIZATION — the whole problem reduces to turning each modality into tokens in a shared space (encoder + projector). Third, the SHARED EMBEDDING SPACE is what enables cross-modal reasoning — putting everything in one space lets attention relate a word to a pixel to a sound.
Multi-modal Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Core idea | Turn every modality into tokens | Transformer is modality-agnostic |
| Patch tokens | Cut image into patches = tokens | ViT: patches are visual tokens |
| Vision encoder | ViT processes patches into features | Same architecture, new input |
| CLIP | Contrastive image-text alignment | Language-aligned visual features |
| LLaVA | Encoder + projector + LLM | The projector is the bridge |
| Staged training | Align projector, then visual SFT | Reuses the whole book |
| Token budget | Images cost many tokens | Resamplers compress them |
| Audio LLMs | Same recipe, audio encoder | Tokenize-and-project again |
| Shared space | All modalities in one space | Enables cross-modal reasoning |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.
Further reading: “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021, CLIP). “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020, ViT). “Visual Instruction Tuning” (Liu et al., 2023, LLaVA). “Flamingo” (Alayrac et al., 2022) for cross-attention fusion. “BLIP-2” (Li et al., 2023) for the Q-Former resampler. “Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022, Whisper) for audio encoding. Surveys on multi-modal large language models for the broader landscape.
Next → Chapter 31: Serving at Scale
You now have a fast (Chapter 27), tool-using (Chapter 28), knowledge-grounded (Chapter 29), and multi-modal (this chapter) model. The final chapter of Part VI tackles the last deployment challenge: serving the model RELIABLY to millions of users. Chapter 31 covers the systems engineering of production LLM serving — load balancing and autoscaling across many GPUs and replicas, routing and scheduling requests, managing reliability and cost at scale, and the full production stack that turns a model into a dependable service. It closes Part VI by completing the journey from a trained model to a real, scalable product — and bridges to the frontier techniques of Part VII.