BYOL: student predicts teacher
Two networks: a student (trained by gradient descent) and a teacher (its target). The student looks at one view and tries to predict the teacher's representation of another view. With nothing else, this collapses to a constant output: every input maps to the same vector, and the prediction loss is trivially zero. Three asymmetries save it:
① the teacher is a slow EMA of the student · ② a stop-gradient on the teacher branch (it's a target, not optimised) · ③ only the student has an extra predictor head. Toggle them off one by one and watch collapse return.
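The first two asymmetries can be sketched in a few lines of plain Python. This is a minimal illustration, not BYOL's actual implementation: the function names and the decay value `tau=0.99` are assumptions chosen for the example, weights are flat lists of floats, and the stop-gradient appears only as a comment because plain Python has no autograd.

```python
import math

def ema_update(teacher, student, tau=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    tau is an illustrative decay rate; closer to 1.0 means a slower teacher.
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def prediction_loss(prediction, target):
    """Squared distance between unit vectors, written as 2 - 2 * cosine similarity.

    `target` is the teacher's output and is treated as a constant. In an
    autograd framework it would sit behind a stop-gradient so that no
    gradient reaches the teacher.
    """
    p = l2_normalize(prediction)
    z = l2_normalize(target)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))
```

For example, `ema_update([0.0], [1.0], tau=0.9)` moves the teacher weight only a tenth of the way toward the student, which is what makes the target drift slowly rather than jump.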
Go deeper: why doesn’t this collapse without negatives?
This stunned a lot of researchers. With no negatives, the “write C on everything” shortcut should win — both branches output a constant and the loss is zero. What saves it is that the three asymmetries make that shortcut a moving target the student can never quite reach: the teacher keeps drifting (EMA), the student must route through an extra predictor the teacher doesn’t have, and gradients only flow one way. Empirically the representation stays rich and spread out. It feels like a magic trick — and to a degree the field is still arguing about exactly why it works.
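To make the shortcut concrete, here is a small sketch of a collapsed network, assuming a hypothetical constant vector `CONSTANT` and placeholder strings standing in for images. Whatever pair of views is compared, the loss is zero, which is exactly why the objective alone cannot rule collapse out.

```python
import math

def normalized_sq_distance(p, z):
    """Squared distance between the unit-length versions of p and z."""
    p_norm = math.sqrt(sum(x * x for x in p))
    z_norm = math.sqrt(sum(x * x for x in z))
    return sum((a / p_norm - b / z_norm) ** 2 for a, b in zip(p, z))

# Hypothetical constant "C" that a collapsed network writes on everything.
CONSTANT = [0.3, -1.2, 0.7]

def collapsed_network(_image):
    """Ignores its input entirely and always returns the same vector."""
    return CONSTANT

# Placeholder inputs; a real pipeline would pass augmented image tensors.
views = ["cat, crop A", "cat, crop B", "truck, crop A"]
losses = [
    normalized_sq_distance(collapsed_network(a), collapsed_network(b))
    for a in views
    for b in views
]
worst_loss = max(losses)  # zero for every pair, matching or not
```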
DINO: centering & sharpening
DINO drops the predictor and instead turns each output into a probability distribution over learned prototypes (soft, self-invented categories) by passing it through a softmax. The student is trained to match the teacher's distribution by minimising the cross-entropy between the two.
Two opposing forces keep it healthy. Centering subtracts a running average of the teacher's outputs from the teacher's logits, so no single prototype can dominate; on its own, though, it pushes toward a boring uniform output. A low teacher temperature (sharpening) counteracts that by making the teacher's distribution peaked and confident. Each force cancels the other's failure mode, so the two must be balanced.
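The two forces can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not DINO's reference code: the function names, the temperatures `0.04` and `0.1`, and the center momentum `0.9` are illustrative values, and logits are plain lists of floats with one entry per prototype.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax; a lower temperature gives a sharper output."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(logits, center, teacher_temp=0.04):
    """Centering (subtract the running mean), then sharpening (low temperature)."""
    centered = [x - c for x, c in zip(logits, center)]
    return softmax(centered, teacher_temp)

def update_center(center, batch_teacher_logits, momentum=0.9):
    """Running average of the teacher's logits across batches."""
    n = len(batch_teacher_logits)
    batch_mean = [sum(column) / n for column in zip(*batch_teacher_logits)]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher's distribution to the student's.

    The teacher side is a fixed target; no gradient would flow through it.
    """
    t = teacher_probs(teacher_logits, center, teacher_temp)
    s = softmax(student_logits, student_temp)
    return -sum(ti * math.log(si + 1e-12) for ti, si in zip(t, s))
```

Note how a prototype that the teacher favours on every input ends up in `center` and is subtracted away, while the low `teacher_temp` keeps the remaining differences from being flattened into a uniform distribution.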
Multi-crop: local meets global
Instead of just two views, DINO makes many crops: a couple of large global crops and several small local ones. The teacher sees only the global crops; the student sees everything. To match the teacher, the student must infer the whole from a tiny piece.
Resample the crops. Each small local patch (green) has to be mapped toward the global understanding (blue) — that's how the model learns objects are recognisable even from fragments.
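The pairing rule behind multi-crop can be sketched as below. The helper names and the crop counts (2 global, 6 local) are assumptions for illustration, and views are assumed to be ordered with the global crops first. Each teacher output for a global crop is compared with the student output for every other view, never with the student's output for the same crop.

```python
def multicrop_pairs(num_global=2, num_local=6):
    """Return (teacher_view, student_view) index pairs.

    Views 0 .. num_global-1 are global crops; the rest are local crops.
    The teacher only ever sees global crops; the student sees all views.
    """
    num_views = num_global + num_local
    return [
        (t, s)
        for t in range(num_global)
        for s in range(num_views)
        if s != t  # skip comparing a crop with itself
    ]

def multicrop_loss(teacher_outputs, student_outputs, pair_loss):
    """Average a per-pair loss over all valid teacher/student pairs.

    teacher_outputs: one output per global crop.
    student_outputs: one output per view, global crops first.
    pair_loss: any callable (teacher_out, student_out) -> float.
    """
    num_global = len(teacher_outputs)
    num_local = len(student_outputs) - num_global
    pairs = multicrop_pairs(num_global, num_local)
    total = sum(pair_loss(teacher_outputs[t], student_outputs[s]) for t, s in pairs)
    return total / len(pairs)
```

With 2 global and 6 local crops this gives 14 pairs, 12 of which ask the student to match a global view from a small local patch.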