Masked autoencoder (MAE)
Split an image into patches, hide most of them, and train an encoder (seeing only the visible patches) plus a light decoder to reconstruct the missing pixels. To fill the gaps the model must learn both local texture and global structure.
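The patch-and-mask step can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names (`patchify`, `random_mask`, `masked_mse`), the 32×32 image, and the 8-pixel patch size are all assumptions made for the example, and scoring the loss only on hidden patches is likewise an assumption here. A trivial baseline (predict the mean visible patch everywhere) stands in for the encoder and decoder.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted index arrays for the visible and the hidden patches."""
    num_hidden = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_hidden:]), np.sort(perm[:num_hidden])

def masked_mse(pred, target, hidden):
    """Mean squared error over the hidden patches only."""
    return float(np.mean((pred[hidden] - target[hidden]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # toy image, values in [0, 1)
patches = patchify(image, patch=8)            # 16 patches, 192 values each
visible, hidden = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Stand-in for encoder + decoder: predict the mean visible patch everywhere.
baseline_pred = np.tile(patches[visible].mean(axis=0), (len(patches), 1))
baseline_loss = masked_mse(baseline_pred, patches, hidden)
```

With a 75% ratio the encoder in this toy setup would see only 4 of the 16 patches, which is why it can be run cheaply on the visible subset alone.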
Drag the mask ratio up to 75% or more and hit reconstruct. Collapse isn't the danger here: a constant output cannot rebuild patches that differ from one another. The danger is subtler, and the next section takes it up.
Generative vs joint-embedding → JEPA
Force a model to predict every pixel and it burns capacity on textures, noise, even JPEG artifacts — detail that's often irrelevant. Joint-embedding methods (MoCo, SimCLR, BYOL, DINO) instead compare views in representation space — abstract and semantic — but lean on hand-designed augmentations.
JEPA takes the road in between: keep the predictive challenge, but predict in embedding space, guided by a small conditioning variable z that says what to predict (here, the target block's location). Drag the target block — the predictor must produce the embedding of whatever content lives there.
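A toy version of that prediction step might look like the following. Everything here is a hypothetical stand-in: the encoders are single linear layers with a `tanh` rather than real networks, the target is a single patch rather than a block, the target encoder is simply a fixed copy of the context encoder, and `pos_table` is an assumed lookup that turns the target location into the conditioning variable z. The point is only to show where the loss lives: between a predicted embedding and a target embedding, never between pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PATCHES, PATCH_DIM, EMBED_DIM = 16, 48, 8

# Hypothetical toy weights; real encoders would be much larger networks.
W_context = rng.normal(scale=0.1, size=(PATCH_DIM, EMBED_DIM))
W_target = W_context.copy()   # assumption: target encoder is a fixed copy
pos_table = rng.normal(scale=0.1, size=(NUM_PATCHES, EMBED_DIM))  # location -> z
W_pred = rng.normal(scale=0.1, size=(2 * EMBED_DIM, EMBED_DIM))

def encode(patches, weights):
    """Map flattened patches to embeddings."""
    return np.tanh(patches @ weights)

def predict(context_embeddings, target_index):
    """Predict the target embedding from pooled context plus the location code z."""
    summary = context_embeddings.mean(axis=0)
    z = pos_table[target_index]
    return np.concatenate([summary, z]) @ W_pred

def jepa_loss(patches, context_idx, target_index):
    """Squared error between predicted and actual target embeddings."""
    context = encode(patches[context_idx], W_context)
    target = encode(patches[target_index], W_target)  # treated as a constant
    prediction = predict(context, target_index)
    return float(np.mean((prediction - target) ** 2))

patches = rng.random((NUM_PATCHES, PATCH_DIM))
context_idx = np.array([0, 1, 4, 5])          # visible context patches
loss_a = jepa_loss(patches, context_idx, target_index=10)
loss_b = jepa_loss(patches, context_idx, target_index=15)
```

Moving the target from patch 10 to patch 15 changes only z and the target embedding; the context stays the same, which mirrors dragging the target block while the visible region stays put.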
Go deeper: why predicting the “gist” is smarter than pixels
Some things in an image are genuinely unpredictable — the exact pattern of leaves on a tree, film grain, JPEG artifacts. A pixel-predictor (MAE) is graded on getting those exactly right, so it burns capacity memorising noise that helps no downstream task. JEPA predicts in representation space, where “leaves” is a concept and the unpredictable details have already been abstracted away. It keeps the hard, useful part of prediction (what’s broadly there) and drops the impossible, useless part (every pixel). That’s also why it doesn’t need hand-designed augmentations: the masking itself creates the prediction challenge.
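A small numeric sketch makes the argument concrete. Assume a patch is a predictable structure plus unpredictable noise, and assume a deliberately crude stand-in encoder, `gist`, that keeps only the patch mean (a real encoder is learned, not hand-written). Even a perfect guess at the structure pays the full noise cost in pixel space, while the error in the abstracted space is far smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH_PIXELS = 256
NOISE_STD = 0.2

structure = np.full(PATCH_PIXELS, 0.5)        # the predictable part
noise = rng.normal(scale=NOISE_STD, size=PATCH_PIXELS)
target = structure + noise                    # plus unpredictable detail

# The best a predictor can do without access to the noise itself.
prediction = structure

def gist(patch):
    """Hypothetical stand-in encoder: keep only the patch mean."""
    return float(patch.mean())

pixel_error = float(np.mean((prediction - target) ** 2))
gist_error = (gist(prediction) - gist(target)) ** 2
```

Here `pixel_error` lands near the noise variance (`NOISE_STD ** 2`) no matter how good the predictor is, whereas `gist_error` shrinks as the patch grows because the noise averages out.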
From images to video: V-JEPA
The same idea extends naturally to video. Now we mask spatio-temporal tubes — a region across several frames — and predict their features from the visible context. To do this well the model must learn that objects persist, motion continues, and actions have consequences — all from raw video.
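A spatio-temporal tube is easy to express as a boolean mask over a grid of video tokens. The sketch below is illustrative only: `tube_mask` and its parameters are hypothetical names, and the 8×14×14 token grid with 32-dimensional tokens is an assumed layout, not one taken from any particular model.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, frames, rows, cols):
    """Boolean (T, H, W) mask; True marks hidden token positions.

    frames, rows, cols are (start, stop) pairs; the same spatial
    rectangle is hidden in every frame of the chosen range.
    """
    mask = np.zeros((num_frames, grid_h, grid_w), dtype=bool)
    mask[frames[0]:frames[1], rows[0]:rows[1], cols[0]:cols[1]] = True
    return mask

mask = tube_mask(8, 14, 14, frames=(2, 7), rows=(3, 9), cols=(4, 9))

# Toy token grid: one 32-dimensional vector per (frame, row, column).
tokens = np.random.default_rng(0).random((8, 14, 14, 32))
context_tokens = tokens[~mask]    # visible tokens, shape (N_visible, 32)
target_tokens = tokens[mask]      # hidden tokens,  shape (N_hidden, 32)
```

Because the rectangle is identical in every masked frame, the model cannot recover the hidden content by copying the same location from a neighbouring frame inside the tube; it has to reason from the surrounding context.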
Play the clip: the shaded tube is hidden, and the model predicts its features over time. In frozen-backbone evaluations, where the pre-trained encoder is not fine-tuned, V-JEPA (and the scaled-up V-JEPA 2) outperform pixel-prediction methods.
One frozen encoder, many tasks
After pre-training, the encoder is frozen and reused. Bolt on a small attentive probe (a couple of transformer blocks) for action recognition. Or pipe the video features through an MLP into a language model for visual question answering — “what changed on the table while the camera looked away?”
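Both heads can be sketched on top of the same frozen features. This is a simplified illustration under stated assumptions: the features are random stand-ins for encoder output, the attentive probe is reduced to a single learned-query attention pooling followed by a linear classifier (not the full transformer blocks mentioned above), and the language model side is represented only by the shape of its input sequence. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOKENS, EMBED_DIM, NUM_CLASSES, LM_DIM = 64, 16, 5, 24

# Stand-in for the frozen encoder's output on one clip; never updated.
features = rng.normal(size=(NUM_TOKENS, EMBED_DIM))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Head 1: attentive probe (only these parameters would be trained).
query = rng.normal(scale=0.1, size=EMBED_DIM)
W_key = rng.normal(scale=0.1, size=(EMBED_DIM, EMBED_DIM))
W_value = rng.normal(scale=0.1, size=(EMBED_DIM, EMBED_DIM))
W_cls = rng.normal(scale=0.1, size=(EMBED_DIM, NUM_CLASSES))

def attentive_probe(feats):
    """Pool tokens with a learned query, then classify the pooled vector."""
    keys, values = feats @ W_key, feats @ W_value
    weights = softmax(keys @ query / np.sqrt(EMBED_DIM))
    return (weights @ values) @ W_cls, weights

# Head 2: MLP projector into a language model's embedding space.
W1 = rng.normal(scale=0.1, size=(EMBED_DIM, 32))
W2 = rng.normal(scale=0.1, size=(32, LM_DIM))

def project_to_lm(feats):
    """Two-layer MLP with ReLU: visual tokens -> LM-sized tokens."""
    return np.maximum(feats @ W1, 0.0) @ W2

logits, attn = attentive_probe(features)
question_embeddings = rng.normal(size=(7, LM_DIM))   # hypothetical text tokens
lm_input = np.concatenate([project_to_lm(features), question_embeddings])
```

The two heads share nothing except `features`, so swapping one for the other leaves the encoder untouched.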
Pick a downstream head and a question to see how the same representation feeds very different tasks.