Chapter 03

Push & Pull

IMAX from Chapter 2 worked in theory — but it needed Gaussian assumptions and broke down on real images. The next idea is bolder and simpler: forget hand-crafted formulas. Just say "similar things should land near each other in space, different things should land far apart." That's contrastive learning — and the key ingredient that stops collapse is the negative.

CONCEPT 3.1

Embeddings: anchor, positive, negative

An embedding is a point in space — a compressed description of what the model thinks matters about an input. Three roles:

  • Anchor — the reference input (e.g. a photo of a cat)
  • Positive — a different view of the same thing: crop the same cat, rotate it, change brightness. Still the same cat, just a different glimpse. Should land near the anchor.
  • Negative — a completely different input (e.g. a photo of a dog). Should land far from the anchor.

No labels are needed: positives are created by augmenting the anchor (free), and negatives are just other items in the batch (also free).
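To make the "free" part concrete, here is a minimal sketch of how one unlabeled batch yields all three roles. The `augment` and `make_triplets` helpers are hypothetical names, and the random jitter is only a stand-in for real image augmentations such as crops or rotations.

```python
import random

def augment(x, rng, noise=0.05):
    """Stand-in augmentation: jitter each feature a little.
    (Assumption: a real pipeline would crop, rotate, or recolor an image.)"""
    return [v + rng.uniform(-noise, noise) for v in x]

def make_triplets(batch, rng):
    """For each item: anchor = the item itself, positive = an augmented copy,
    negatives = every other item in the batch. No labels are consulted."""
    triplets = []
    for i, anchor in enumerate(batch):
        positive = augment(anchor, rng)
        negatives = [x for j, x in enumerate(batch) if j != i]
        triplets.append((anchor, positive, negatives))
    return triplets

rng = random.Random(0)
batch = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]  # toy "images" as 2-D vectors
triplets = make_triplets(batch, rng)
anchor, positive, negatives = triplets[0]
print(len(negatives))  # 2: the other two items in the batch
```

Note that a bigger batch automatically means more negatives per anchor, at no extra labeling cost.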

Drag the anchor around. The positive is pulled toward it; the negatives are pushed away. Toggle negatives off to see why they matter: without the “push apart” force, everything drifts together, and we get collapse again.

🌍 Everyday analogy: a wedding seating plan. Close friends (positives) belong at the same table; feuding relatives (negatives) go to opposite corners. Without that “keep them apart” rule, the lazy solution is to cram everyone onto one table — collapse, wedding edition.
💡 Key idea: negatives are what stop collapse. Without them, “pull similar things together” has only one stable solution: everything in the same spot.
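Collapse can be reproduced in a few lines. The sketch below is a toy, not the page's simulation: it assumes a hypothetical one-parameter encoder f(x) = w·x and trains it with a pull-only loss (no negatives). Gradient descent finds the lazy solution, w → 0, which maps every input to the same spot.

```python
import random

rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(32)]
# Positive views: each input plus jitter (stand-in for crop/rotate/brightness).
positives = [x + rng.uniform(-0.5, 0.5) for x in inputs]

# Mean squared gap between an input and its positive view.
mean_sq_gap = sum((x - p) ** 2 for x, p in zip(inputs, positives)) / len(inputs)

w = 1.0   # toy encoder f(x) = w * x (hypothetical, a single parameter)
lr = 1.0
for _ in range(200):
    # Pull-only loss: mean (f(x) - f(p))^2 = w^2 * mean_sq_gap.
    grad = 2.0 * w * mean_sq_gap
    w -= lr * grad

embeddings = [w * x for x in inputs]
spread = max(embeddings) - min(embeddings)
print(f"w = {w:.2e}, spread = {spread:.2e}")
```

The loss is perfectly happy at w = 0: every positive sits exactly on its anchor, because everything sits on everything.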
[Interactive: Embedding space · drag the anchor]

CONCEPT 3.2

Margins, hinges & triplets

If we only ever push negatives away, the model can keep lowering its loss by shoving them farther and farther, without limit. A hinge loss fixes this: once a negative is farther than a margin m, stop pushing, because it now costs nothing. A triplet loss goes relative: the anchor just needs to be closer to its positive than to its negative, by at least the margin.
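Both losses fit in a line each. This is a minimal sketch assuming plain Euclidean distance (some formulations use squared distance instead); the function names are illustrative, not taken from any library.

```python
import math

def hinge_negative_loss(anchor, negative, margin):
    """Penalizes a negative only while it is inside the margin."""
    return max(0.0, margin - math.dist(anchor, negative))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the positive is closer than the negative by at least `margin`."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor, positive = (0.0, 0.0), (1.0, 0.0)
print(hinge_negative_loss(anchor, (0.5, 0.0), margin=2.0))     # 1.5 (inside the margin)
print(hinge_negative_loss(anchor, (3.0, 0.0), margin=2.0))     # 0.0 (past the margin)
print(triplet_loss(anchor, positive, (1.5, 0.0), margin=1.0))  # 0.5 (negative too close)
print(triplet_loss(anchor, positive, (4.0, 0.0), margin=1.0))  # 0.0 (negative clears the margin)
```

The `max(0.0, ...)` is the hinge: it clips the loss, and therefore the gradient, to zero as soon as the margin is satisfied.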

Drag the points. The bands show the margin. Watch the loss vanish the moment the negative clears the margin.

🌍 Everyday analogy: a nightclub bouncer with a velvet rope. He only hassles people crowding the entrance; once you’re comfortably past the rope he stops caring. The margin m is that rope — negatives beyond it cost nothing, so the model stops wasting energy shoving them ever further away.
💡 Key idea: margins keep training stable and carve out a structured feature space instead of an infinite shoving match.
[Interactive: Triplet & margin lab]

CONCEPT 3.3

InfoNCE: the lineup game

Triplet loss uses one negative at a time — a weak signal. What if we compared against a whole crowd at once? InfoNCE does exactly that:

  1. Take a query q (the anchor’s embedding) and a batch of keys: one true match k⁺ and many distractors k₁…kₙ.
  2. Score q against every key with a dot product (higher = more similar).
  3. Run a softmax over the scores → a probability for each key.
  4. The loss is −log P(k⁺): zero when we’re perfectly confident in the right key, high when we’re confused.
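The four steps above can be sketched directly. This is a bare-bones version for a single query, with the temperature left out (equivalently, fixed at 1); `info_nce` is an illustrative name, not a library function.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive_key, negative_keys):
    """-log P(k+) under a softmax over dot-product scores."""
    keys = [positive_key] + list(negative_keys)   # step 1: one match + distractors
    scores = [dot(query, k) for k in keys]        # step 2: dot-product scores
    # Step 3: softmax, computed in log space for numerical stability.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[0]                    # step 4: -log P(k+)

q = [1.0, 0.0]
k_pos = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
confident = info_nce(q, k_pos, distractors)           # q lines up with k+
confused = info_nce([0.0, 0.0], k_pos, distractors)   # all scores tie
print(round(confident, 3), round(confused, 3))
```

When every score ties, the softmax is uniform and the loss equals the log of the number of keys: the model is guessing.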

The temperature τ controls sharpness; typically the scores are divided by τ before the softmax. Low τ: the softmax is very peaked, so tiny score differences cause big confidence swings (brittle). High τ: the softmax is flat, so the model barely distinguishes the true match from noise (useless). The beam widths in the sim show this: watch how lowering or raising τ changes the distribution.
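To see the effect in numbers, here is a small sketch that assumes the common convention of dividing each score by τ before the softmax. The scores are made up for illustration; the first one plays the true match.

```python
import math

def softmax(scores, temperature):
    """Softmax over scores / temperature, shifted by the max for stability."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.9, 0.8, 0.1, -0.3]  # hypothetical scores; index 0 is the true match
p_low = softmax(scores, 0.05)   # peaked: a 0.1 score gap becomes a big lead
p_mid = softmax(scores, 0.5)
p_high = softmax(scores, 5.0)   # flat: close to a uniform guess

for tau, probs in ((0.05, p_low), (0.5, p_mid), (5.0, p_high)):
    print(tau, [round(p, 3) for p in probs])
```

The same four scores go in each time; only τ changes how decisively the softmax commits to the top one.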

🌍 Everyday analogy: a police lineup. The witness (query) must identify the one real suspect (k⁺) among look-alikes. Temperature is the witness’s decisiveness: low τ = “it’s definitely #3”; high τ = “honestly… could be anyone.” Too decisive and a sleepless night makes the witness wrong. Too vague and the lineup is useless.
Go deeper: why this scales and what “NCE” means

NCE stands for Noise-Contrastive Estimation. The trick: frame self-supervised learning as a classification problem (“which key is real?”) and use the classification cross-entropy as the loss. Minimizing this loss turns out to maximize a lower bound on the mutual information between query and key (like IMAX, but without Gaussian assumptions). That bound is capped at the log of the number of keys, so more negatives per batch let it climb higher and track the true shared structure more closely. In practice, larger batches (more negatives) tend to give cleaner gradients and better representations.
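For reference, the bound is usually written as follows. Here N = n + 1 is the total number of keys (one true match plus n distractors), and the temperature is omitted for simplicity, which is an assumption of this sketch.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = \mathbb{E}\left[ -\log
      \frac{\exp(q \cdot k^{+})}
           {\exp(q \cdot k^{+}) + \sum_{i=1}^{n} \exp(q \cdot k_i)}
    \right],
\qquad
I(q;\, k^{+}) \;\ge\; \log N - \mathcal{L}_{\text{InfoNCE}},
\qquad N = n + 1 .
```

Since the loss is never negative, the right-hand side can never exceed log N, which is why a small pool of negatives limits how much mutual information the bound can certify.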

💡 Key idea: one loss does two jobs: pull q toward k⁺ and push it away from every negative simultaneously. More negatives generally mean a richer training signal and, in practice, better representations.
🌡️ Play: use the temperature slider to find where the model is confident but not overconfident. Too low and it’s brittle; too high and the green bar disappears.
[Interactive: InfoNCE softmax & temperature]