Chapter 08

Anti-Collapse by Design

EMA teachers and stop-gradients dodge collapse with tricks. A cleaner family fights it with math: keep every embedding dimension informative and non-redundant. This road runs whitening → Barlow Twins → VICReg → SIGReg, and finishes at LeJEPA.

CONCEPT 8.1

Whitening & the covariance matrix

The covariance matrix tells you which embedding dimensions move together. A fully collapsed embedding maps every input to the same point, so every variance (and with it the whole covariance matrix) is zero. Whitening centres, rotates, and rescales the embeddings so the covariance becomes the identity: every direction has unit variance and no two directions are correlated.

Toggle whitening: the tilted, squashed cloud becomes a clean circular blob, and the off-diagonal covariance entries snap to zero.

🌍 Everyday analogy: a graphic equaliser for your embeddings. Whitening turns up the quiet frequencies and turns down the booming ones so every band carries its share — and it makes sure the “left” and “right” channels aren’t just duplicates of each other.
💡 Key idea: identity covariance directly forbids the zero-variance collapse, but full whitening needs a costly matrix operation: the inverse square root of the covariance matrix.
[Interactive demo: Whitening · covariance → identity]

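To make the whitening step concrete, here is a minimal NumPy sketch. It uses ZCA whitening via an eigendecomposition, which is one common choice among several. The function name `whiten`, the `eps` guard, and the toy 2-D cloud are illustrative assumptions, not taken from the chapter's demo.

```python
import numpy as np

def whiten(z, eps=1e-8):
    """ZCA-whiten embeddings z of shape (n_samples, n_dims). Illustrative sketch."""
    z_centered = z - z.mean(axis=0, keepdims=True)
    cov = z_centered.T @ z_centered / (z.shape[0] - 1)
    # Symmetric eigendecomposition: cov = V diag(w) V^T.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Inverse square root of the covariance; eps (assumed value) guards
    # against division by near-zero variances.
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return z_centered @ inv_sqrt

rng = np.random.default_rng(0)
# A tilted, squashed 2-D cloud: correlated dimensions with unequal variance.
mixing = np.array([[3.0, 0.0], [2.0, 0.5]])
z = rng.normal(size=(2000, 2)) @ mixing.T

cov_before = np.cov(z, rowvar=False)          # large off-diagonal entries
cov_after = np.cov(whiten(z), rowvar=False)   # approximately the identity
```

Comparing `cov_before` with `cov_after` mirrors the toggle described above: the off-diagonal entries go to roughly zero and the diagonal to roughly one. The `np.linalg.eigh` call is the expensive part, and its cost grows quickly with the embedding dimension.
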
CONCEPT 8.2

Barlow Twins: cross-correlation → identity

Barlow Twins gets whitening's effect without the inversion. It builds the cross-correlation matrix between the embeddings of two augmented views and pushes it toward the identity: diagonal → 1 (matching dimensions agree across views) and off-diagonal → 0 (different dimensions don't copy each other).

Hit train and watch the matrix light up along the diagonal and darken everywhere else.

🌍 Everyday analogy: a quality-control checklist filled in by two inspectors. Every row should match across the two of them (diagonal → 1: they agree), but no two rows should ask the same question (off-diagonal → 0: no wasted, duplicated checks).
💡 Key idea: one objective reconciles the two views and makes the dimensions non-redundant.
[Interactive demo: Barlow Twins cross-correlation]

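The objective described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than a reference implementation: the function name, the off-diagonal weight `lambda_offdiag`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-8):
    """Cross-correlation loss for two views of shape (batch, dims). Sketch only."""
    n = z_a.shape[0]
    # Standardise every dimension across the batch (zero mean, unit std).
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # c[i, j] = correlation of dimension i in view A with dimension j in view B.
    c = z_a.T @ z_b / n
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)            # diagonal -> 1
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # off-diagonal -> 0
    return on_diag + lambda_offdiag * off_diag, c

rng = np.random.default_rng(0)
base = rng.normal(size=(512, 8))

# Healthy case: two noisy views of embeddings with independent dimensions.
good_a = base + 0.05 * rng.normal(size=base.shape)
good_b = base + 0.05 * rng.normal(size=base.shape)
loss_good, c_good = barlow_twins_loss(good_a, good_b)

# Redundant case: every dimension is a noisy copy of dimension 0.
copied = np.repeat(base[:, :1], 8, axis=1)
red_a = copied + 0.05 * rng.normal(size=base.shape)
red_b = copied + 0.05 * rng.normal(size=base.shape)
loss_redundant, c_redundant = barlow_twins_loss(red_a, red_b)
```

In the redundant case the views still agree, so the diagonal is close to 1, but the off-diagonal entries are close to 1 as well, and the loss is correspondingly higher. No matrix inverse or decomposition appears anywhere in the computation.
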
CONCEPT 8.3

VICReg: three jobs, three terms

VICReg makes the anti-collapse logic explicit by splitting it into three losses: variance (keep every dimension alive), invariance (pull matching views together), and covariance (stop dimensions carrying the same info).

Set each weight to zero in turn: kill variance and the cloud collapses; kill covariance and the dimensions become redundant; kill invariance and the two views drift apart.

🌍 Everyday analogy: three rules for a gym class. Variance: nobody’s allowed to just stand still (every dimension stays active). Invariance: dance partners stay in sync (the two views match). Covariance: don’t let everyone do the exact same exercise (dimensions shouldn’t duplicate each other). Drop any one rule and the class falls apart in a different way.
💡 Key idea: keep the views close, keep variance healthy, decorrelate the dimensions — a clear, tunable recipe (at the cost of balancing three terms).
Break each term: zero out variance, then covariance, then invariance — feel what each one defends.
[Interactive demo: VICReg · variance / invariance / covariance]

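The three terms can be written out directly. The sketch below is a minimal NumPy version under stated assumptions: the function names, the hinge target `gamma`, the `eps` constant, and the default weights are illustrative choices, and the toy batches stand in for real encoder outputs.

```python
import numpy as np

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) for two views of shape (batch, dims)."""
    n, d = z_a.shape

    # Invariance: mean squared distance between matching views.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension std: penalise any dimension whose std < gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by dims.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return invariance, variance, covariance

def vicreg_loss(z_a, z_b, w_inv=25.0, w_var=25.0, w_cov=1.0):
    """Weighted sum of the three terms; the weights here are assumed defaults."""
    inv, var, cov = vicreg_terms(z_a, z_b)
    return w_inv * inv + w_var * var + w_cov * cov

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 4))
healthy_b = healthy + 0.01 * rng.normal(size=healthy.shape)
collapsed = np.full((256, 4), 0.3)               # every input -> same point
redundant = np.repeat(healthy[:, :1], 4, axis=1)  # all dims carry the same info

inv_h, var_h, cov_h = vicreg_terms(healthy, healthy_b)
inv_c, var_c, cov_c = vicreg_terms(collapsed, collapsed)
inv_r, var_r, cov_r = vicreg_terms(redundant, redundant)
```

Each failure mode trips a different term: the collapsed batch has zero invariance loss but a large variance penalty, while the redundant batch keeps its variance healthy and is caught only by the covariance term. Setting a weight to zero in `vicreg_loss` removes exactly one of these defences.
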
CONCEPT 8.4

SIGReg: regularise the whole distribution

Why juggle three terms? Instead of controlling only second-order statistics (variance and covariance), regularise the entire embedding distribution toward one simple target: an isotropic Gaussian, a balanced bell-shaped cloud that is equally spread in every direction.

Checking Gaussianity in high dimensions is expensive, so SIGReg sketches it: it projects the embeddings onto many random 1-D directions and tests each projection against a standard Gaussian. If enough random slices look standard-Gaussian, the whole distribution is pulled into shape. Train and watch the histogram hug the bell curve.

🌍 Everyday analogy: you can’t take in the shape of a whole crowd at once, so you photograph it from many random angles. If every snapshot shows the same tidy bell-shaped silhouette, the crowd as a whole must be well-shaped. SIGReg checks Gaussianity the same way, through many random 1-D “photos.”
💡 Key idea: one principled regulariser keeps the whole embedding distribution healthy — this is SIGReg (sketched isotropic Gaussian regularisation).
[Interactive demo: SIGReg · random projections vs Gaussian]

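The random-projection idea can be illustrated with a small NumPy sketch. The text above does not specify which 1-D Gaussianity test SIGReg uses, so this example swaps in a simple stand-in: an unweighted squared distance between each projection's empirical characteristic function and that of a standard Gaussian. The function name, the number of projections, and the grid of evaluation points are all assumptions.

```python
import numpy as np

def sliced_gaussianity_penalty(z, n_projections=64, t_grid=None, rng=None):
    """Average mismatch between random 1-D projections of z and N(0, 1).

    Stand-in statistic (an assumption, not the SIGReg test itself):
    squared gap between empirical and Gaussian characteristic functions.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_grid = np.linspace(-3.0, 3.0, 13) if t_grid is None else t_grid
    n, d = z.shape

    # Random unit directions, one per column.
    directions = rng.normal(size=(d, n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                                # (n, n_projections)

    # Empirical characteristic function: mean of exp(i * t * x) over the batch.
    phase = proj[:, :, None] * t_grid[None, None, :]     # (n, P, T)
    ecf = np.exp(1j * phase).mean(axis=0)                # (P, T)

    # Characteristic function of a standard Gaussian: exp(-t^2 / 2).
    target = np.exp(-0.5 * t_grid ** 2)
    return float(np.mean(np.abs(ecf - target) ** 2))

rng = np.random.default_rng(0)
gaussian = rng.normal(size=(1000, 16))           # already isotropic Gaussian
collapsed = 0.01 * rng.normal(size=(1000, 16))   # almost a single point
stretched = gaussian * np.linspace(0.2, 3.0, 16) # anisotropic spread

penalty_gaussian = sliced_gaussianity_penalty(gaussian, rng=rng)
penalty_collapsed = sliced_gaussianity_penalty(collapsed, rng=rng)
penalty_stretched = sliced_gaussianity_penalty(stretched, rng=rng)
```

The isotropic Gaussian batch scores close to zero, while the collapsed and stretched batches are penalised, because their 1-D slices are respectively too narrow and unevenly wide. Only cheap 1-D statistics are computed, regardless of the embedding dimension.
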
CONCEPT 8.5 · FINALE

LeJEPA: two losses, one world model

Put the pieces together. JEPA says what to predict — future or missing embeddings, never pixels. SIGReg says how not to collapse — keep the latent space well-distributed. Combine them and you train a world model end-to-end with just two loss terms.

Applied to control, LeJEPA world models are tiny: around 15M parameters and trainable on a single GPU in hours. They plan dramatically faster than huge foundation-model world models while staying competitive. Early days, but a very promising direction.

🌍 Everyday analogy: one chef, two habits. Habit ① “taste and predict the next flavour the dish needs” (JEPA). Habit ② “keep the pantry full and varied so you’re never down to one ingredient” (SIGReg). A small kitchen, but it turns out fast, excellent meals.
💡 The whole story in one line: learn to predict in concept space (JEPA) + keep that space alive (SIGReg) = a small, fast, label-free world model.
[Interactive demo: LeJEPA · the two-loss recipe]

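The two-loss recipe amounts to adding a prediction term and a weighted regulariser. The sketch below shows only that structure: the function names and the weight `reg_weight` are hypothetical, and `moment_placeholder` is a deliberately crude stand-in that is not SIGReg (it only checks per-dimension mean and variance), used so the example runs on its own.

```python
import numpy as np

def two_term_loss(predicted, target, embeddings, regulariser, reg_weight=0.1):
    """Prediction loss in embedding space plus a weighted anti-collapse term."""
    prediction = float(np.mean((predicted - target) ** 2))
    regularisation = float(regulariser(embeddings))
    total = prediction + reg_weight * regularisation
    return total, prediction, regularisation

def moment_placeholder(z):
    """Placeholder regulariser (NOT SIGReg): push each dimension's mean
    toward 0 and variance toward 1."""
    mean_gap = np.mean(z.mean(axis=0) ** 2)
    var_gap = np.mean((z.var(axis=0) - 1.0) ** 2)
    return mean_gap + var_gap

rng = np.random.default_rng(0)

# Healthy case: well-spread target embeddings, slightly imperfect predictions.
target = rng.normal(size=(512, 8))
predicted = target + 0.1 * rng.normal(size=target.shape)
total_healthy, pred_healthy, reg_healthy = two_term_loss(
    predicted, target, target, moment_placeholder)

# Collapsed case: everything maps to zero, so prediction is trivially perfect.
zeros = np.zeros((512, 8))
total_collapsed, pred_collapsed, reg_collapsed = two_term_loss(
    zeros, zeros, zeros, moment_placeholder)
```

The collapsed encoder wins on the prediction term alone (its error is exactly zero), which is why a prediction loss by itself is not enough. Once the regulariser is added, the healthy solution has the lower total loss.
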