Chapter 6 · Predict, Don't Reconstruct

CONCEPT 6.1

Masked autoencoder (MAE)

Split an image into patches, hide most of them, and train an encoder (seeing only the visible patches) plus a light decoder to reconstruct the missing pixels. To fill the gaps the model must learn both local texture and global structure.
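The patch-and-mask step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `patchify` and `random_mask`, the 32×32 image, the 8-pixel patch size, and the all-zero stand-in for the decoder output are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (num_patches, patch*patch*C) rows."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return (visible_idx, masked_idx) for a uniformly random mask."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # hypothetical toy image
patches = patchify(image, patch=8)            # 16 patches, 8*8*3 = 192 values each
visible_idx, masked_idx = random_mask(len(patches), 0.75, rng)

encoder_input = patches[visible_idx]          # the encoder only ever sees these
targets = patches[masked_idx]                 # pixels the decoder must rebuild

# Stand-in for a decoder output (a constant guess); the reconstruction
# loss is computed on the masked patches only.
prediction = np.zeros_like(targets)
loss = np.mean((prediction - targets) ** 2)
```

With a 75% ratio the encoder processes only 4 of the 16 patches here, which is where much of the method's efficiency comes from.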

Drag the mask ratio up to 75%+ and hit reconstruct. Collapse isn't the danger here — a constant output can't possibly rebuild different patches. The danger is something subtler…

🌍 Everyday analogy: it’s the “fill in the blank” game, for pictures. Just like you can read “the c_t sat on the m_t” because language has structure, the model can fill missing patches because images do too. To guess the hidden bits it has to actually understand the scene.

💡 Key idea: masking is a flexible, label-free way to create a hard prediction task — the seed of generative self-supervision.

Masked autoencoder

CONCEPT 6.2 · 6.3

Generative vs joint-embedding → JEPA

Force a model to predict every pixel and it burns capacity on textures, noise, even JPEG artifacts — detail that's often irrelevant. Joint-embedding methods (MoCo, SimCLR, BYOL, DINO) instead compare views in representation space — abstract and semantic — but lean on hand-designed augmentations.

JEPA takes the road in between: keep the predictive challenge, but predict in embedding space, guided by a small conditioning variable z that says what to predict (here, the target block's location). Drag the target block — the predictor must produce the embedding of whatever content lives there.
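To make "predict in embedding space, conditioned on z" concrete, here is a rough NumPy sketch. Everything in it is a simplifying assumption: single linear maps stand in for the encoders and predictor, the context is mean-pooled, and the target encoder is a plain copy of the context encoder (I-JEPA instead keeps it as an exponential moving average of the context encoder). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 192, 32

# Hypothetical stand-ins: one linear map each instead of real networks.
W_context = rng.normal(0, 0.1, (patch_dim, embed_dim))    # context encoder
W_target = W_context.copy()                                # target encoder (assumed plain copy)
pos_embed = rng.normal(0, 0.1, (num_patches, embed_dim))   # z: one vector per patch location
W_pred = rng.normal(0, 0.1, (2 * embed_dim, embed_dim))    # predictor

def encode(patches, weights):
    """Map flattened patches to embeddings."""
    return patches @ weights

def predict(context_summary, z):
    """Predict target embeddings from the pooled context plus the conditioning z."""
    ctx = np.broadcast_to(context_summary, z.shape)
    return np.concatenate([ctx, z], axis=1) @ W_pred

patches = rng.random((num_patches, patch_dim))
target_idx = np.array([5, 6, 9, 10])                       # a 2x2 target block on a 4x4 grid
context_idx = np.setdiff1d(np.arange(num_patches), target_idx)

context_summary = encode(patches[context_idx], W_context).mean(axis=0)
z = pos_embed[target_idx]                                  # "which locations to predict"
predicted = predict(context_summary, z)                    # shape (4, embed_dim)
target = encode(patches[target_idx], W_target)             # treated as a fixed target in training

# The loss compares embeddings with embeddings; no pixels are reconstructed.
jepa_loss = np.mean((predicted - target) ** 2)
```

Moving the target block only changes `target_idx`, and therefore z and the target embeddings; the context encoder and predictor are unchanged.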

🌍 Everyday analogy: a TV weather forecaster doesn’t paint every individual raindrop on the map — they predict the gist: “rainy, around 12°C.” MAE paints every raindrop (and every speck of sensor noise); JEPA forecasts the gist (the embedding). The variable z is the sticky-note saying which day to forecast.

Go deeper: why predicting the “gist” is smarter than pixels

Some things in an image are genuinely unpredictable — the exact pattern of leaves on a tree, film grain, JPEG artifacts. A pixel-predictor (MAE) is graded on getting those exactly right, so it burns capacity memorising noise that helps no downstream task. JEPA predicts in representation space, where “leaves” is a concept and the unpredictable details have already been abstracted away. It keeps the hard, useful part of prediction (what’s broadly there) and drops the impossible, useless part (every pixel). That’s also why it doesn’t need hand-designed augmentations: the masking itself creates the prediction challenge.
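One way to make "genuinely unpredictable" concrete is a toy noise model, which is an assumption for illustration rather than anything specific to MAE. Suppose a hidden pixel x splits into a part μ(c) that the visible context c determines and zero-mean noise ε that it does not. The expected squared error of any prediction x̂(c) then decomposes as:

```latex
\begin{align}
x &= \mu(c) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid c] = 0, \quad \operatorname{Var}(\varepsilon \mid c) = \sigma^2 \\
\mathbb{E}\big[(x - \hat{x}(c))^2 \mid c\big]
  &= \big(\mu(c) - \hat{x}(c)\big)^2
   + 2\big(\mu(c) - \hat{x}(c)\big)\,\mathbb{E}[\varepsilon \mid c]
   + \mathbb{E}[\varepsilon^2 \mid c] \\
  &= \big(\mu(c) - \hat{x}(c)\big)^2 + \sigma^2
\end{align}
```

Under this assumption the best a pixel predictor can do is x̂(c) = μ(c), and the σ² term stays in the loss however large the model is. A target embedding that is insensitive to ε carries no such floor.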

💡 Key idea: generative = predict pixels · joint-embedding = match views · JEPA = predict missing embeddings, conditioned on z. No pixel reconstruction, no augmentation crutch.

🎯 Compare: watch the MAE panel sweat over pixels while JEPA nails the abstract target.

MAE pixels vs JEPA embeddings (I-JEPA)

CONCEPT 6.4

From images to video: V-JEPA

The same idea extends naturally to video. Now we mask spatio-temporal tubes (each one a spatial region extended across several frames) and predict their features from the visible context. To do this well, the model must learn that objects persist, motion continues, and actions have consequences, all from raw video.
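A tube mask is easy to picture as a boolean array over a (frames, height, width) grid of patches. The sketch below is illustrative only: the function name `tube_mask`, the 8-frame 6×6 grid, and the choice to extend the tube across every frame of the clip are assumptions for the example.

```python
import numpy as np

def tube_mask(frames, grid_h, grid_w, top, left, height, width):
    """Boolean mask of shape (frames, grid_h, grid_w); True marks hidden patches.

    The same spatial rectangle is hidden in every frame, so the masked
    region forms a tube through time.
    """
    spatial = np.zeros((grid_h, grid_w), dtype=bool)
    spatial[top:top + height, left:left + width] = True
    return np.broadcast_to(spatial, (frames, grid_h, grid_w)).copy()

mask = tube_mask(frames=8, grid_h=6, grid_w=6, top=1, left=2, height=3, width=3)

# Hypothetical per-patch features with layout (T, H, W, D).
video_tokens = np.random.default_rng(0).random((8, 6, 6, 16))
context_tokens = video_tokens[~mask]   # visible context the encoder sees
hidden_tokens = video_tokens[mask]     # tokens whose features must be predicted
```

Because the rectangle is identical in every frame, the model cannot recover the hidden content by copying the same location from a neighbouring frame; it has to reason from the surrounding context.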

Play the clip: the shaded tube is hidden, and the model predicts its features over time. In frozen-backbone evaluation, where the encoder is left untouched and only a light head is trained on top, V-JEPA (and the scaled-up V-JEPA 2) beat pixel-prediction methods.

🌍 Everyday analogy: cover the screen for a second while someone pours coffee. You don’t panic — you know the cup kept filling, the stream kept flowing, the mug didn’t teleport. To predict the hidden tube’s features, the model must learn those same commonsense facts about how the world keeps going.

💡 Key idea: predicting features through time forces a model to understand how scenes evolve — the foundation for world models.

Spatio-temporal tube masking

CONCEPT 6.5

One frozen encoder, many tasks

After pre-training, the encoder is frozen and reused. Bolt on a small attentive probe (a couple of transformer blocks) for action recognition. Or pipe the video features through an MLP into a language model for visual question answering — “what changed on the table while the camera looked away?”
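The two heads can be sketched as small functions that consume the same frozen features. This is a deliberately reduced illustration: the probe here is a single learned-query attention pooling step followed by a linear classifier rather than full transformer blocks, the projector is a two-layer MLP, and every name, shape, and weight is a hypothetical stand-in.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_probe(features, query, W_cls):
    """Pool frozen token features with one learned query, then classify."""
    scores = features @ query / np.sqrt(features.shape[1])   # (num_tokens,)
    weights = softmax(scores)                                 # attention over tokens
    pooled = weights @ features                               # (feat_dim,)
    return softmax(pooled @ W_cls), weights

def mlp_projector(features, W1, W2):
    """Map frozen features into a language model's embedding width (assumed)."""
    hidden = np.maximum(features @ W1, 0.0)                   # ReLU
    return hidden @ W2

rng = np.random.default_rng(0)
num_tokens, feat_dim, num_classes, lm_dim = 72, 16, 5, 24

# Output of the frozen encoder: computed once and never updated.
frozen_features = rng.random((num_tokens, feat_dim))

# Head 1: action recognition. Only `query` and `W_cls` would be trained.
query = rng.normal(0, 0.1, feat_dim)
W_cls = rng.normal(0, 0.1, (feat_dim, num_classes))
class_probs, attn = attentive_probe(frozen_features, query, W_cls)

# Head 2: visual tokens for a language model. Only W1 and W2 would be trained.
W1 = rng.normal(0, 0.1, (feat_dim, 32))
W2 = rng.normal(0, 0.1, (32, lm_dim))
visual_tokens = mlp_projector(frozen_features, W1, W2)
```

Both heads read `frozen_features` without modifying it, which is the sense in which the representation is reused across tasks.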

Pick a downstream head and a question to see how the same representation feeds very different tasks.

🌍 Everyday analogy: train one excellent pair of eyes, then lend them to different specialists — a referee who calls the action, a commentator who answers questions about the play — without ever re-growing the eyes. “Frozen” just means we don’t retrain the expensive part; we bolt a cheap, task-specific head on top.

💡 Key idea: a strong frozen representation is a reusable substrate — perception, Q&A, and (next chapter) planning all plug into it.

Downstream heads & VQA