Chapter 2 · Two Views, One Truth

CONCEPT 2.1

Train two views to agree

Take two camera views of the same scene, for instance left and right. Encoder A compresses the left view into a code vector a; Encoder B compresses the right view into a code vector b. We add a consistency loss, the squared distance ‖a−b‖², which is zero only when both codes are identical.

Press train and watch the two points in embedding space converge. Both encoders are being nudged toward describing the scene the same way — discovering the signal the two views share, without a single label.
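To make the agreement loss concrete, here is a minimal NumPy sketch. Everything in it is an assumption chosen for illustration: the toy data (a shared `scene` vector plus per-view noise), the linear encoders `W_a` and `W_b`, the learning rate, and the step count. It is not the code behind the trainer on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one shared scene seen through two noisy "cameras".
scene = rng.normal(size=(256, 8))
left = scene + 0.1 * rng.normal(size=scene.shape)
right = scene + 0.1 * rng.normal(size=scene.shape)

# Assumed linear encoders A and B, mapping 8-D views to 2-D codes.
W_a = rng.normal(scale=0.5, size=(8, 2))
W_b = rng.normal(scale=0.5, size=(8, 2))

def consistency_loss(W_a, W_b):
    """Mean squared distance ||a - b||^2 between the two codes."""
    diff = left @ W_a - right @ W_b
    return np.mean(np.sum(diff ** 2, axis=1))

loss_before = consistency_loss(W_a, W_b)

lr = 0.01
for _ in range(200):
    diff = left @ W_a - right @ W_b              # code difference, shape (256, 2)
    grad_a = 2 * left.T @ diff / len(left)       # gradient of the loss w.r.t. W_a
    grad_b = -2 * right.T @ diff / len(right)    # gradient of the loss w.r.t. W_b
    W_a -= lr * grad_a
    W_b -= lr * grad_b

loss_after = consistency_loss(W_a, W_b)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

No labels appear anywhere in the loop: the only training signal is the gap between the two codes.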

🌍 Everyday analogy: two witnesses describe the same fender-bender from opposite corners. If both genuinely saw it, their stories should line up on the facts that matter — colour of the car, who ran the red light. Forcing the “stories” to agree squeezes out the shared truth and discards what's particular to each vantage point.

💡 Key idea: agreement between views is a free training signal — the data labels itself through its own structure.

Agreement trainer

CONCEPT 2.2

The trap: representation collapse

Here's the villain of this entire course. The laziest way to make two outputs agree is to make both outputs constant, ignoring the input entirely. Loss = zero, knowledge learned = zero.

Each dot is one input's embedding. Hit train (naive) and watch the whole cloud crush into a single point — that's collapse. Now switch on the variance floor and retrain: the cloud is forced to stay spread out and meaningful.
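The sketch below shows the loophole in numbers. The text does not define the variance floor precisely, so the hinge penalty on per-dimension standard deviation used here (`variance_floor_penalty`, with its `floor` and `eps` parameters) is an assumed form, not necessarily what the simulation implements.

```python
import numpy as np

rng = np.random.default_rng(0)

def agreement_loss(a, b):
    """Mean squared distance between paired embeddings."""
    return np.mean(np.sum((a - b) ** 2, axis=1))

def variance_floor_penalty(z, floor=1.0, eps=1e-4):
    """Assumed hinge penalty: positive when a dimension's std falls below `floor`."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, floor - std))

# Collapsed encoders: every input maps to the same point.
a_collapsed = np.tile([0.3, -0.7], (100, 1))
b_collapsed = a_collapsed.copy()

# Healthy encoders: embeddings vary with the input and still roughly agree.
a_spread = rng.normal(size=(100, 2))
b_spread = a_spread + 0.05 * rng.normal(size=(100, 2))

for name, a, b in [("collapsed", a_collapsed, b_collapsed),
                   ("spread", a_spread, b_spread)]:
    print(name, agreement_loss(a, b), variance_floor_penalty(a))
```

The collapsed pair wins on agreement alone, with a loss of exactly zero, but it pays a large penalty once the floor is added. That trade-off is what keeps the cloud spread out.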

🌍 Everyday analogy: tell two students “just agree with each other on the exam” and the laziest winning move is for both to answer “C” to every question. They agree perfectly and learn absolutely nothing. That’s collapse — a loophole, not an education.

💡 Key idea: every method from here on is, at heart, a clever trick to avoid collapse.

💥 Try it: collapse the cloud, then rescue it with the variance floor.

Collapse chamber

CONCEPT 2.3

IMAX — measure the shared signal

How do we reward agreement without inviting collapse? Becker & Hinton’s answer starts with a simple model. Imagine both encoders output a single number. Call them a and b. Each is a noisy measurement of the same hidden signal s:

a = s + noise_A
b = s + noise_B

Now look at what happens when you add or subtract them:

a + b = 2s + (noise_A + noise_B) → the signal is doubled; noise only adds weakly
a − b = noise_A − noise_B → the signal cancels completely; only disagreement remains

So: var(a+b) is large when the shared signal is strong (good!), and var(a−b) is small when agreement is high (also good!). Make the first large relative to the second and, under the Gaussian assumptions of this model, you maximise the mutual information between the two views.
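A quick simulation of the two variances, as a sketch. The helper name `imax_stats` and its parameters are made up for this example, and both views are assumed to have the same noise level. The two variances are combined as ½ log(var(a+b) / var(a−b)), the form usually attributed to Becker & Hinton.

```python
import numpy as np

def imax_stats(signal_std, noise_std, n=20_000, seed=0):
    """Simulate a = s + noise_A, b = s + noise_B and return
    (var(a+b), var(a-b), 0.5 * log of their ratio)."""
    rng = np.random.default_rng(seed)
    s = signal_std * rng.normal(size=n)        # shared hidden signal
    a = s + noise_std * rng.normal(size=n)     # view A: signal plus private noise
    b = s + noise_std * rng.normal(size=n)     # view B: signal plus private noise
    var_sum = np.var(a + b)
    var_diff = np.var(a - b)
    objective = 0.5 * np.log(var_sum / var_diff)
    return var_sum, var_diff, objective

strong = imax_stats(signal_std=2.0, noise_std=0.5)
weak = imax_stats(signal_std=0.2, noise_std=0.5)
print("strong signal:", strong)
print("weak signal:  ", weak)
```

With the noise level fixed, var(a−b) barely moves between the two runs, while var(a+b) and the objective rise sharply when the signal is strong.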

The sliders let you control signal strength and per-view noise. Watch the scatter plot: when signal is strong, points cluster along the diagonal a = b. When noise dominates, the cloud scatters.

🌍 Everyday analogy: noise-cancelling headphones use two microphones. Add their signals and the music (shared) gets louder. Subtract them and only the random seat-rattle (private noise) survives. IMAX says: pump up the music, cancel the rattle — and measure how much music there is.

Go deeper: why is this exactly “mutual information”?

Mutual information measures how much knowing one view tells you about the other. Under the Gaussian model above (a = s + noise_A, b = s + noise_B with independent Gaussian noise), making var(a+b) large relative to var(a−b) is equivalent to raising the mutual information I(a;b): both grow as the shared signal gets stronger relative to the noise. So “make the shared part loud and the private part quiet” isn’t a hack; it’s information theory in disguise. The catch: it relies on Gaussian and symmetry assumptions that rarely hold on real data, which is why later methods (Chapter 3 onwards) drop them.
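For readers who want the algebra, here is a short derivation under the same Gaussian model. The symbols σ_s², σ_A², σ_B² (variances of the signal and of the two noises) are notation introduced for this sketch, and s, noise_A, noise_B are assumed independent.

```latex
\begin{align*}
\operatorname{var}(a+b) &= 4\sigma_s^2 + \sigma_A^2 + \sigma_B^2, \\
\operatorname{var}(a-b) &= \sigma_A^2 + \sigma_B^2, \\
\frac{a+b}{2} &= s + \frac{\text{noise}_A + \text{noise}_B}{2}, \\
I\!\left(\tfrac{a+b}{2};\, s\right)
  &= \tfrac12 \log\!\left(1 + \frac{\sigma_s^2}{(\sigma_A^2 + \sigma_B^2)/4}\right)
   = \tfrac12 \log \frac{\operatorname{var}(a+b)}{\operatorname{var}(a-b)}.
\end{align*}
```

The last line uses the standard result that a Gaussian signal observed through additive Gaussian noise carries ½ log(1 + signal variance / noise variance) of information. The log-ratio of the two variances is therefore exactly the information the averaged output carries about s, and with the noise levels held fixed it rises together with I(a;b) as the shared signal strengthens.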

💡 Key idea: collapse means all outputs are constant → variance = 0. Keeping var(a+b) high guarantees the model hasn’t collapsed. This is IMAX (information maximisation).

Mutual-information meter

CONCEPT 2.4

Prediction makes agreement harder to fake

IMAX needs Gaussian assumptions that rarely hold. Here’s a cleaner guardrail that works without them: instead of asking both outputs to simply match, reveal only part of the output to the model and make it predict the hidden part.

Why does this stop collapse? A collapsed model outputs the same constant vector no matter the input. But if the task is to predict a held-out piece from the visible piece, a constant predictor just predicts the mean — it can never capture real structure. The only way to succeed is for the encoder to preserve genuine, varying information about the input.
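The argument can be checked with a toy experiment. This is a sketch under stated assumptions: the data, the linear relation `true_map` between the visible and hidden parts, and the least-squares predictor are all invented for illustration, and the "informative encoder" simply passes the visible part through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the hidden part depends linearly on the visible part.
n = 500
visible = rng.normal(size=(n, 4))
true_map = rng.normal(size=(4, 3))
hidden = visible @ true_map + 0.1 * rng.normal(size=(n, 3))

def prediction_error(features, target):
    """Fit a least-squares linear predictor (with bias) and return its MSE."""
    X = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.mean((X @ coef - target) ** 2)

# Collapsed encoder: the same constant code for every input.
collapsed_code = np.ones((n, 2))
# Informative encoder (assumed): keeps the visible part intact.
informative_code = visible

err_collapsed = prediction_error(collapsed_code, hidden)
err_informative = prediction_error(informative_code, hidden)
baseline = hidden.var(axis=0).mean()   # error of always predicting the mean

print(f"collapsed:   {err_collapsed:.4f}")
print(f"informative: {err_informative:.4f}")
print(f"predict-the-mean baseline: {baseline:.4f}")
```

From a constant code the best the predictor can do is output the mean of the hidden part, so its error matches the baseline. Only a code that varies with the input lets the error drop below it.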

In the sim, the cyan bars are “seen” and the amber bars are “hidden.” Watch what happens when you force collapse: the predictor tries to guess the amber values from the flat cyan input — and consistently fails because there’s no signal left to use.

🌍 Everyday analogy: a crossword. Because across-clues constrain the down-answers, you can’t scribble the same letter everywhere — the grid only works when pieces genuinely fit. Asking the model to predict one part from another imposes that same real constraint.

💡 Key idea: prediction is a natural anti-collapse force — and this “predict the missing piece” idea is the seed that grows into JEPA in Chapter 6.

Predict-the-held-out-part