Train two views to agree
Take two camera views of the same scene, say left and right. Encoder A compresses the left view into a code vector a; Encoder B compresses the right view into a code vector b. We add a consistency loss, the squared distance ‖a−b‖², which is zero only when both codes are identical.
Press train and watch the two points in embedding space converge. Both encoders are being nudged toward describing the scene the same way — discovering the signal the two views share, without a single label.
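To make the training loop concrete, here is a minimal sketch of the same idea outside the demo. Everything in it is an assumption for illustration: the two "cameras" are fixed random linear maps of a 2-D scene, both encoders are plain linear layers, and names such as W_a, W_b and consistency_loss are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": a hidden 2-D signal seen through two different fixed linear cameras.
n, d_scene, d_view, d_code = 256, 2, 8, 2
scene = rng.normal(size=(n, d_scene))
left = scene @ rng.normal(size=(d_scene, d_view))    # left view
right = scene @ rng.normal(size=(d_scene, d_view))   # right view

# Encoder A and Encoder B are plain linear maps (an assumption for brevity).
W_a = rng.normal(scale=0.1, size=(d_view, d_code))
W_b = rng.normal(scale=0.1, size=(d_view, d_code))

def consistency_loss(W_a, W_b):
    """Mean squared distance ||a - b||^2 between the two codes."""
    a, b = left @ W_a, right @ W_b
    return np.mean(np.sum((a - b) ** 2, axis=1))

lr = 0.005
loss_start = consistency_loss(W_a, W_b)
for step in range(200):
    diff = left @ W_a - right @ W_b        # a - b for every sample
    grad_a = 2 * left.T @ diff / n         # gradient of the loss w.r.t. W_a
    grad_b = -2 * right.T @ diff / n       # gradient of the loss w.r.t. W_b
    W_a -= lr * grad_a
    W_b -= lr * grad_b
loss_end = consistency_loss(W_a, W_b)

print(f"loss before: {loss_start:.4f}  after: {loss_end:.6f}")
```

Note that nothing in this loss asks the codes to stay informative. It only asks them to match.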
The trap: representation collapse
Here's the villain of this entire course. The laziest way to make two outputs agree is to make both outputs constant, ignoring the input entirely. Loss = zero, knowledge learned = zero.
Each dot is one input's embedding. Hit train (naive) and watch the whole cloud crush into a single point — that's collapse. Now switch on the variance floor and retrain: the cloud is forced to stay spread out and meaningful.
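The demo does not spell out how its variance floor is computed, so the sketch below is one plausible form, not the demo's actual code: a hinge penalty on each embedding dimension's standard deviation, in the spirit of the variance term used by VICReg. The function names, the floor of 1.0 and the weight of 1.0 are all assumptions.

```python
import numpy as np

def agreement_loss(a, b):
    """Mean squared distance between paired embeddings."""
    return np.mean(np.sum((a - b) ** 2, axis=1))

def variance_floor_penalty(z, floor=1.0, eps=1e-4):
    """Hinge on per-dimension std: zero once every dimension's std reaches the floor."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, floor - std))

def total_loss(a, b, weight):
    return agreement_loss(a, b) + weight * (
        variance_floor_penalty(a) + variance_floor_penalty(b)
    )

rng = np.random.default_rng(0)

# A spread-out cloud whose two views nearly agree.
spread_a = rng.normal(size=(512, 2))
spread_b = spread_a + 0.05 * rng.normal(size=(512, 2))

# A collapsed cloud: every input maps to the same point.
collapsed_a = np.zeros((512, 2))
collapsed_b = np.zeros((512, 2))

for weight in (0.0, 1.0):
    print(
        f"weight={weight}: spread={total_loss(spread_a, spread_b, weight):.4f}, "
        f"collapsed={total_loss(collapsed_a, collapsed_b, weight):.4f}"
    )
```

With the weight at zero, the collapsed cloud has the lower loss, which is exactly the trap. With the penalty switched on, the ordering flips and the spread-out cloud wins.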
IMAX — measure the shared signal
How do we reward agreement without inviting collapse? Becker & Hinton’s answer starts with a simple model. Imagine both encoders output a single number. Call them a and b. Each is a noisy measurement of the same hidden signal s:
a = s + noise_A
b = s + noise_B
Now look at what happens when you add or subtract them:
- a + b = 2s + (noise_A + noise_B) → the signal is doubled; noise only adds weakly
- a − b = noise_A − noise_B → the signal cancels completely; only disagreement remains
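So var(a+b) is large when the shared signal is strong (good!), and var(a−b) is small when the two outputs agree (also good!). Maximise the first relative to the second, that is, the ratio var(a+b) / var(a−b), and under this model you maximise the mutual information between the two views. Because the objective is a ratio, shrinking both outputs toward a constant gains nothing: the ratio is unchanged when a and b are scaled down together, so collapse is no longer a shortcut.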
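You can check the add-and-subtract argument numerically. The sketch below draws samples from the toy model and reports both variances, plus half the log of their ratio, which is the form the IMAX objective is usually written in. The function name imax_stats and the chosen signal and noise levels are assumptions for illustration.

```python
import numpy as np

def imax_stats(signal_std, noise_std, n=20000, seed=0):
    """Sample a = s + noise_A, b = s + noise_B and return
    (var(a+b), var(a-b), 0.5 * log of their ratio)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(scale=signal_std, size=n)
    a = s + rng.normal(scale=noise_std, size=n)
    b = s + rng.normal(scale=noise_std, size=n)
    var_sum = np.var(a + b)
    var_diff = np.var(a - b)
    objective = 0.5 * np.log(var_sum / var_diff)
    return var_sum, var_diff, objective

strong = imax_stats(signal_std=2.0, noise_std=0.5)   # signal dominates
weak = imax_stats(signal_std=0.2, noise_std=0.5)     # noise dominates

for name, (var_sum, var_diff, objective) in (("strong", strong), ("weak", weak)):
    print(f"{name}: var(a+b)={var_sum:.3f}  var(a-b)={var_diff:.3f}  objective={objective:.3f}")
```

In both runs var(a−b) stays near twice the noise variance, because the signal cancels. Only var(a+b), and with it the objective, responds to the signal strength.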
The sliders let you control signal strength and per-view noise. Watch the scatter plot: when signal is strong, points cluster along the diagonal a = b. When noise dominates, the cloud scatters.
Go deeper: why is this exactly “mutual information”?
Mutual information measures how much knowing one view tells you about the other. Under the Gaussian model above (a = s + noise_A, b = s + noise_B with independent Gaussian noise), the quantity ½ log[var(a+b) / var(a−b)] is exactly the information that the average (a+b)/2 carries about the hidden signal s. When both views are equally noisy, it also rises and falls in step with the mutual information I(a;b). So “make the shared part loud and the private part quiet” isn’t a hack: it’s information theory in disguise. The catch: it relies on Gaussian and symmetry assumptions that rarely hold on real data, which is why later methods (Chapter 3 onwards) drop them.
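For readers who want the algebra, here is a short worked version. The symbols σ_s² (variance of s) and σ_n² (variance of each noise term, taken equal in both views) are notation introduced for this sketch, and the whole derivation assumes s and both noise terms are independent zero-mean Gaussians.

```latex
\begin{align*}
\operatorname{var}(a+b) &= 4\sigma_s^2 + 2\sigma_n^2,
\qquad
\operatorname{var}(a-b) = 2\sigma_n^2 \\[4pt]
\tfrac12 \log \frac{\operatorname{var}(a+b)}{\operatorname{var}(a-b)}
  &= \tfrac12 \log\!\left(1 + \frac{2\sigma_s^2}{\sigma_n^2}\right)
   = I\!\left(\tfrac{a+b}{2};\, s\right) \\[4pt]
I(a;b) &= -\tfrac12 \log\!\left(1 - \rho^2\right),
\qquad
\rho = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_n^2}
\end{align*}
```

The middle line is the standard capacity formula for a Gaussian channel, applied to the average (a+b)/2, whose noise variance is σ_n²/2. Both the middle and last expressions grow as the ratio σ_s²/σ_n² grows, so pushing one up pushes the other up too.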
Prediction makes agreement harder to fake
IMAX needs Gaussian assumptions that rarely hold. Here’s a cleaner guardrail that works without them: instead of asking both outputs to simply match, reveal only part of the output to the model and make it predict the hidden part.
Why does this stop collapse? A collapsed model outputs the same constant vector no matter the input. But if the task is to predict a held-out piece from the visible piece, a predictor fed a constant can do no better than guess the average of the hidden piece, so it can never capture real structure. The only way to succeed is for the encoder to preserve genuine, varying information about the input.
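Here is a small numerical version of that argument, as a sketch under stated assumptions. The data is synthetic (a visible half and a hidden half driven by the same latent factors), the predictor is an ordinary least-squares fit, and the "encoders" are stand-ins: one passes the visible half through unchanged, the other returns a constant. All names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a visible half and a hidden half that share latent structure.
n, d_latent, d_half = 2000, 3, 4
latent = rng.normal(size=(n, d_latent))
visible = latent @ rng.normal(size=(d_latent, d_half))
hidden = latent @ rng.normal(size=(d_latent, d_half)) + 0.1 * rng.normal(size=(n, d_half))

def fit_predictor_mse(codes, targets):
    """Fit a least-squares linear predictor (with bias) from codes to targets
    and return its mean squared error on the same data."""
    X = np.hstack([codes, np.ones((len(codes), 1))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return np.mean((X @ weights - targets) ** 2)

informative_codes = visible                    # encoder keeps the input's variation
collapsed_codes = np.full_like(visible, 0.7)   # encoder ignores the input entirely

mse_informative = fit_predictor_mse(informative_codes, hidden)
mse_collapsed = fit_predictor_mse(collapsed_codes, hidden)
mse_mean_baseline = np.mean((hidden - hidden.mean(axis=0)) ** 2)

print(f"informative codes: {mse_informative:.4f}")
print(f"collapsed codes:   {mse_collapsed:.4f}")
print(f"predict the mean:  {mse_mean_baseline:.4f}")
```

The collapsed encoder's error matches the predict-the-mean baseline, because a constant code gives the predictor nothing to work with beyond a bias term. The informative codes bring the error down to roughly the noise level of the hidden half.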
In the sim, the cyan bars are “seen” and the amber bars are “hidden.” Watch what happens when you force collapse: the predictor tries to guess the amber values from the flat cyan input — and consistently fails because there’s no signal left to use.