Whitening & the covariance matrix
The covariance matrix tells you which embedding dimensions move together. A fully collapsed embedding maps every input to the same point, so it has zero variance in every direction. Whitening centres the embeddings, then rotates and rescales them so the covariance becomes the identity: every direction has unit variance and no two directions are correlated.
Toggle whitening: the tilted, squashed cloud becomes a clean circular blob, and the off-diagonal covariance entries snap to zero.
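A minimal sketch of what the toggle does, assuming ZCA whitening via an eigendecomposition of the sample covariance. The function name `whiten`, the `eps` stabiliser and the toy `mixing` matrix are illustrative choices, not taken from the demo itself.

```python
import numpy as np

def whiten(z, eps=1e-8):
    """ZCA-whiten the rows of z, shape (n_samples, n_dims)."""
    centered = z - z.mean(axis=0)
    # Sample covariance, (n_dims, n_dims)
    cov = centered.T @ centered / (len(z) - 1)
    # Symmetric eigendecomposition: cov = V diag(w) V^T
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Inverse square root of the covariance; eps guards tiny eigenvalues
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ inv_sqrt

# Toy data: a tilted, squashed 2-D cloud (the mixing matrix is made up)
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0], [1.5, 0.3]])
cloud = rng.normal(size=(1000, 2)) @ mixing

white = whiten(cloud)
cov_after = np.cov(white, rowvar=False)  # close to the 2x2 identity
```

The eigendecomposition is the costly step here, and it is the part that methods like Barlow Twins try to avoid.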
Barlow Twins: cross-correlation → identity
Barlow Twins gets whitening's effect without explicitly inverting a covariance matrix (whitening needs its inverse square root). It builds the cross-correlation matrix between the embeddings of two augmented views of the same inputs and pushes it toward the identity: diagonal → 1 (each dimension agrees with itself across views) and off-diagonal → 0 (different dimensions don't copy each other).
Hit train and watch the matrix light up along the diagonal and darken everywhere else.
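The objective behind that matrix can be sketched in a few lines of NumPy. This is an illustration rather than a reference implementation: the name `barlow_twins_loss`, the `off_diag_weight` default and the toy inputs are assumptions made for the example.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Cross-correlation loss for two views, each of shape (n_samples, n_dims)."""
    n = z_a.shape[0]
    # Standardise each dimension over the batch
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, (n_dims, n_dims)
    c = z_a.T @ z_b / n
    # Diagonal should be 1: matching dimensions agree across views
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    # Off-diagonal should be 0: different dimensions stay decorrelated
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return on_diag + off_diag_weight * off_diag, c

rng = np.random.default_rng(0)
features = rng.normal(size=(512, 4))

# Identical views with independent dimensions: matrix is close to identity
loss_good, c_good = barlow_twins_loss(features, features)

# Every dimension is a copy of the first: off-diagonal entries are all ~1
redundant = np.repeat(features[:, :1], 4, axis=1)
loss_redundant, c_redundant = barlow_twins_loss(redundant, redundant)
```

No matrix inverse or eigendecomposition appears anywhere; the only heavy operation is one matrix product per batch.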
VICReg: three jobs, three terms
VICReg makes the anti-collapse logic explicit by splitting it into three losses: variance (keep every dimension alive), invariance (pull matching views together), and covariance (stop different dimensions from carrying the same information).
Set each weight to zero in turn: kill variance and the cloud collapses; kill covariance and the dimensions become redundant; kill invariance and the two views drift apart.
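To see why each term matters, here is a rough NumPy sketch of the three terms, returned separately so they can be weighted or switched off one at a time. The function name `vicreg_terms`, the hinge target `gamma`, the `eps` value and the toy batches are assumptions made for this example.

```python
import numpy as np

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) for two views of shape (n, d)."""
    n, d = z_a.shape

    # Invariance: matching views should land close together
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension std: penalise dimensions that die out
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Squared off-diagonal covariance: penalise redundant dimensions
        centered = z - z.mean(axis=0)
        cov = centered.T @ centered / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return invariance, variance, covariance

rng = np.random.default_rng(0)
healthy = rng.normal(size=(2000, 4))                # spread out, decorrelated
collapsed = np.ones((2000, 4))                      # every input maps to one point
redundant = np.repeat(healthy[:, :1], 4, axis=1)    # all dimensions are copies

inv_h, var_h, cov_h = vicreg_terms(healthy, healthy)
inv_c, var_c, cov_c = vicreg_terms(collapsed, collapsed)
inv_r, var_r, cov_r = vicreg_terms(redundant, redundant)
```

The collapsed batch has perfect invariance but a large variance penalty, and the redundant batch passes the variance check but fails the covariance one, which is why no single term is enough on its own.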
SIGReg: regularise the whole distribution
Why juggle three terms? VICReg's anti-collapse terms only control second-order statistics: variance and covariance. SIGReg instead regularises the entire embedding distribution toward one simple target, an isotropic Gaussian: a balanced bell-shaped cloud, equally spread in every direction.
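Checking Gaussianity directly in high dimensions is expensive, so SIGReg sketches it: project the embeddings onto many random 1-D directions and test each projection against a standard Gaussian. A distribution is pinned down by its 1-D projections, so if enough random slices look standard-Gaussian, the whole distribution is pulled into shape. Train and watch the histogram hug the bell curve.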
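The slicing idea can be illustrated with a simplified stand-in, which is not the exact SIGReg statistic. It projects onto random unit directions and compares each projection's empirical characteristic function with that of a standard Gaussian, exp(-t²/2). The function name, the number of slices and the grid of `t` values are all assumptions made for this sketch.

```python
import numpy as np

def sliced_gaussianity_penalty(z, num_slices=64, seed=0):
    """Average mismatch between random 1-D projections of z and N(0, 1)."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    # Random unit directions, one per column
    directions = rng.normal(size=(d, num_slices))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                      # (n, num_slices)
    # Evaluation points for the characteristic function (hypothetical grid)
    t = np.linspace(-3.0, 3.0, 17)
    # Empirical characteristic function per slice: mean over samples of exp(i t x)
    ecf = np.exp(1j * proj[:, :, None] * t).mean(axis=0)   # (num_slices, len(t))
    # Characteristic function of a standard Gaussian
    target = np.exp(-0.5 * t ** 2)
    return float(np.mean(np.abs(ecf - target) ** 2))

rng = np.random.default_rng(1)
gaussian = rng.normal(size=(1000, 8))   # already isotropic Gaussian
collapsed = np.zeros((1000, 8))         # every embedding at the origin

penalty_gaussian = sliced_gaussianity_penalty(gaussian)
penalty_collapsed = sliced_gaussianity_penalty(collapsed)
```

The cost grows with the number of slices rather than exponentially with the embedding dimension, which is the point of slicing.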
LeJEPA: two losses, one world model
Put the pieces together. JEPA says what to predict — future or missing embeddings, never pixels. SIGReg says how not to collapse — keep the latent space well-distributed. Combine them and you train a world model end-to-end with just two loss terms.
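Written out, the two-term objective has roughly the shape below. The squared-error form of the prediction term and the weight λ are assumptions made for illustration; the text above only fixes that there are two terms, one for prediction and one against collapse.

```latex
\mathcal{L}
  \;=\;
  \underbrace{\lVert \hat{z} - z \rVert_2^2}_{\text{JEPA: predict embeddings}}
  \;+\;
  \lambda \, \underbrace{\mathrm{SIGReg}(Z)}_{\text{keep latents well-distributed}}
```

Here z is the target embedding, ẑ is the predictor's output, Z is the batch of embeddings, and λ trades prediction accuracy against the shape of the latent distribution.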
Applied to control, LeJEPA world models are small: around 15M parameters, trainable on a single GPU in hours. They plan far faster than world models built on large foundation models while staying competitive. It is early days, but a very promising direction.