Chapter 05

No Negatives Needed

Negatives are expensive. What if a network could learn just by predicting one view from another, with no negatives at all? It sounds impossible (surely it collapses?), yet a few careful asymmetries make it work. Welcome to self-distillation: BYOL and DINO.

CONCEPT 5.1

BYOL: student predicts teacher

Two networks: a student (trained by gradient descent) and a teacher (its target). The student looks at one view and tries to predict the teacher's representation of another view. With nothing else, the cheapest solution is collapse: both networks output the same constant for every input. Three asymmetries save it:

① the teacher is a slow EMA of the student · ② a stop-gradient on the teacher branch (it's a target, not optimised) · ③ only the student has an extra predictor head. Toggle them off one by one and watch collapse return.
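
The three guards can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the function names and the value tau=0.99 are assumptions made for this example, and the vectors stand in for real network outputs. The stop-gradient has no direct equivalent here because there is no autograd; in a tensor library the teacher output would simply be treated as a constant when computing gradients.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(student_prediction, teacher_projection):
    """Normalised squared error, equal to 2 - 2 * cosine similarity.

    student_prediction: output of the student's extra predictor head (guard 3).
    teacher_projection: the target; treated as a constant (guard 2, stop-gradient).
    """
    p = l2_normalize(student_prediction)
    z = l2_normalize(teacher_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))

def ema_update(teacher_weights, student_weights, tau=0.99):
    """Guard 1: the teacher drifts slowly toward the student (tau is illustrative)."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_weights, student_weights)]

print(byol_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0  (prediction matches target)
print(byol_loss([1.0, 0.0], [0.0, 1.0]))  # 2.0  (orthogonal vectors)
```

Note that the loss only measures direction, so a student that predicts the right direction at the wrong scale still scores zero.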

🌍 Everyday analogy: a student learning from a calm, steady mentor — but through frosted glass (the predictor head), so they can’t just trace the mentor’s answer; they have to reconstruct it. And the mentor never copies the student back (stop-gradient). Remove those guards and the pair just agree to both hand in blank pages.
Go deeper: why doesn’t this collapse without negatives?

This stunned a lot of researchers. With no negatives, the “write C on everything” shortcut should win — both branches output a constant and the loss is zero. What saves it is that the three asymmetries make that shortcut a moving target the student can never quite reach: the teacher keeps drifting (EMA), the student must route through an extra predictor the teacher doesn’t have, and gradients only flow one way. Empirically the representation stays rich and spread out. It feels like a magic trick — and to a degree the field is still arguing about exactly why it works.
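
One simple way to tell "rich and spread out" from "collapsed" is to look at how much the embeddings vary across a batch. The sketch below is a hypothetical diagnostic written for this chapter (the function name and the toy batches are made up for illustration): it averages the per-dimension variance, which is exactly zero when every input maps to the same vector.

```python
from statistics import pvariance

def mean_feature_variance(embeddings):
    """Average per-dimension variance across a batch of embedding vectors."""
    dims = list(zip(*embeddings))  # transpose: one tuple per feature dimension
    return sum(pvariance(d) for d in dims) / len(dims)

# Toy batches: four inputs, three-dimensional embeddings.
collapsed = [[0.5, 0.5, 0.5]] * 4            # every input gets the same vector
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

print(mean_feature_variance(collapsed))  # 0.0
print(mean_feature_variance(healthy))    # clearly above zero
```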

💡 Key idea: asymmetry stops both branches from taking the same trivial shortcut at the same time. This is BYOL — Bootstrap Your Own Latent.
🧪 Break it: turn off all three guards and watch the variance crash to zero.
BYOL collapse guards
CONCEPT 5.2

DINO: centering & sharpening

DINO drops the predictor and instead turns each output into a probability distribution over learned prototypes (soft, self-invented categories). The student is trained to match the teacher's distribution.

Two opposing forces keep it healthy. Centering subtracts the teacher's running average so no single prototype can dominate — but alone it pushes toward a boring uniform output. A low teacher temperature (sharpening) counteracts that, making the teacher confident. Balance the two.
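
Here is a minimal sketch of those two forces in plain Python. The function names are invented for this example, and the default values (teacher temperature 0.04, student temperature 0.1, center momentum 0.9) are illustrative settings, not a claim about any particular implementation. Each list of logits holds one score per prototype.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def teacher_targets(teacher_logits, center, teacher_temp=0.04):
    """Centering (subtract the running mean) then sharpening (low temperature)."""
    centered = [x - c for x, c in zip(teacher_logits, center)]
    return [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]

def update_center(center, batch_teacher_logits, momentum=0.9):
    """Running average of the teacher's raw outputs over recent batches."""
    batch_mean = [sum(col) / len(col) for col in zip(*batch_teacher_logits)]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]

def dino_loss(teacher_probs, student_logits, student_temp=0.1):
    """Cross-entropy between the teacher's target and the student's distribution."""
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_log_probs))

logits = [1.0, 0.9, 0.8]
no_center = [0.0, 0.0, 0.0]
print(teacher_targets(logits, no_center, teacher_temp=1.0))   # nearly flat
print(teacher_targets(logits, no_center, teacher_temp=0.04))  # sharply peaked
print(teacher_targets(logits, center=logits))                 # fully centered: uniform
```

The last line shows why centering alone is not enough: if the center exactly cancels the logits, the target is uniform no matter how low the temperature is.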

🌍 Everyday analogy: sorting laundry into bins nobody labelled. Centering stops every sock landing in one overflowing bin (“everything is class 4!”). Sharpening stops the opposite failure — limply spreading each item across all bins equally. You want confident, decisive sorting that still uses all the bins.
💡 Key idea: centering prevents one-prototype collapse; sharpening prevents uniform collapse. Together → rich, peaked-but-diverse targets. This is DINO.
Prototype distribution · center vs sharpen
CONCEPT 5.3

Multi-crop: local meets global

Instead of just two views, DINO makes many crops: a couple of large global crops and several small local ones. The teacher sees only the global crops; the student sees everything. To match the teacher, the student must infer the whole from a tiny piece.

Resample the crops. Each small local patch (green) has to be mapped toward the global understanding (blue) — that's how the model learns objects are recognisable even from fragments.
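
To make the pairing concrete, the sketch below lists which (teacher view, student view) pairs would contribute to the loss. It is a bookkeeping illustration only: the function name, the view labels, and the crop counts (2 global, 6 local) are assumptions chosen for this example. The teacher is paired only with global crops, the student with every crop, and a view is never compared with itself.

```python
def multicrop_pairs(num_global=2, num_local=6):
    """Return (teacher_view, student_view) pairs for one image."""
    global_views = [f"global_{i}" for i in range(num_global)]
    local_views = [f"local_{i}" for i in range(num_local)]
    all_views = global_views + local_views
    # Teacher: global crops only. Student: every crop except the teacher's own.
    return [(t, s) for t in global_views for s in all_views if s != t]

pairs = multicrop_pairs()
print(len(pairs))  # 14 = 2 teacher views x 7 other views each
print(pairs[:3])
```

Most of these pairs ask the student to match a global target from a local crop, which is where the part-to-whole pressure comes from.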

🌍 Everyday analogy: a zoomed-in photo of an elephant’s ear. You can still say “elephant.” Force the student to match the big-picture view from nothing but a tiny crop, over and over, and it learns those part→whole links — so a fragment becomes enough to recognise the whole.
💡 Key idea: "local → global" prediction teaches part-whole structure for free.
Multi-crop views