Embeddings: anchor, positive, negative
An embedding is a point in space — a compressed description of what the model thinks matters about an input. Three roles:
- Anchor — the reference input (e.g. a photo of a cat)
- Positive — a different view of the same thing: crop the same cat, rotate it, change brightness. Still the same cat, just a different glimpse. Should land near the anchor.
- Negative — a completely different input (e.g. a photo of a dog). Should land far from the anchor.
No labels are needed: positives are created by augmenting the anchor (free), and negatives are just other items in the batch (also free).
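As a rough illustration of how such triples can be assembled without labels, the sketch below uses toy 2-D vectors in place of images. The `augment` and `make_triples` helpers are hypothetical names, and the random jitter is only a stand-in for real augmentations such as cropping, rotation, or brightness changes.

```python
import random

def augment(x, rng, noise=0.05):
    """Hypothetical stand-in for crop/rotate/brightness: add small jitter."""
    return [v + rng.uniform(-noise, noise) for v in x]

def make_triples(batch, rng):
    """Build (anchor, positive, negatives) for every item, using no labels."""
    triples = []
    for i, x in enumerate(batch):
        anchor = x
        positive = augment(x, rng)  # a different view of the same item
        negatives = [y for j, y in enumerate(batch) if j != i]  # rest of the batch
        triples.append((anchor, positive, negatives))
    return triples

rng = random.Random(0)
batch = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # toy inputs (assumed data)
triples = make_triples(batch, rng)

for anchor, positive, negatives in triples:
    print(anchor, "->", len(negatives), "negatives")
```

Note that taking negatives from the rest of the batch is a convenience, not a guarantee: the sketch never checks whether another batch item happens to show the same thing as the anchor.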
Drag the anchor around. The positive is pulled toward it; the negatives are pushed away. Toggle negatives off to see why they matter: without the “push apart” force, everything drifts together and the embeddings collapse again.
Margins, hinges & triplets
If we only ever push negatives away, there is no stopping point: the model can always lower the loss by pushing them farther, so the distances grow without limit. A hinge loss fixes this: once a negative is farther than a margin m, the loss is zero and the pushing stops. A triplet loss is relative: the anchor only needs to be closer to its positive than to its negative by at least the margin.
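The two losses can be sketched in a few lines. This is a minimal version that assumes Euclidean distance and a margin of 1.0; the function names and the example points are illustrative, not taken from the demo.

```python
import math

def dist(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hinge_negative_loss(anchor, negative, margin=1.0):
    """Penalize a negative only while it is closer than the margin."""
    return max(0.0, margin - dist(anchor, negative))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by at least the margin."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]  # distance 0.5 from the anchor
near_neg = [0.6, 0.0]  # distance 0.6: inside the margin
far_neg = [3.0, 4.0]   # distance 5.0: well past the margin

print(hinge_negative_loss(anchor, near_neg), hinge_negative_loss(anchor, far_neg))
print(triplet_loss(anchor, positive, near_neg), triplet_loss(anchor, positive, far_neg))
```

In both cases the far negative contributes exactly zero, so it produces no gradient and is no longer pushed.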
Drag the points. The bands show the margin. Watch the loss vanish the moment the negative clears the margin.
InfoNCE: the lineup game
Triplet loss uses one negative at a time — a weak signal. What if we compared against a whole crowd at once? InfoNCE does exactly that:
- Take a query q (the anchor’s embedding) and a batch of keys: one true match k⁺ and many distractors k₁…kₙ.
- Score q against every key with a dot product (higher = more similar).
- Run a softmax over the scores → a probability for each key.
- The loss is −log P(k⁺): zero when we’re perfectly confident in the right key, high when we’re confused.
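The four steps above can be sketched directly. This minimal version takes plain Python lists as embeddings and leaves out any score scaling; `info_nce` and its arguments are illustrative names, and the log-sum-exp form is used only for numerical stability.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, keys, positive_index):
    """Return -log P(k+), where P is a softmax over the dot-product scores."""
    scores = [dot(query, k) for k in keys]
    m = max(scores)  # subtract the max before exponentiating
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy lineup (assumed data)

loss_right = info_nce(query, keys, positive_index=0)  # true match scores highest
loss_wrong = info_nce(query, keys, positive_index=2)  # true match scores lowest
print(loss_right, loss_wrong)
```

When every key scores the same, the softmax is uniform and the loss equals log of the number of keys, which is the value for a model that is guessing at random.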
The temperature τ controls sharpness: every score is divided by τ before the softmax. Low τ: the softmax is very peaked, so tiny score differences cause big confidence swings (brittle). High τ: the softmax is flat, so the model barely distinguishes the true match from noise (useless). The beam widths in the sim show this: watch how lowering or raising τ changes the distribution.
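To see the effect in numbers, here is a small sketch that applies a softmax to the same three scores at different temperatures. The scores and the τ values are arbitrary choices for illustration.

```python
import math

def softmax(scores, temperature):
    """Softmax over scores divided by the temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 0.9, 0.2]  # the first two keys are nearly tied

for tau in (0.05, 1.0, 20.0):
    probs = softmax(scores, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

At τ = 0.05 the gap of 0.1 between the top two scores is enough to give the first key most of the probability, while at τ = 20 all three keys end up close to one third each.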
Go deeper: why this scales and what “NCE” means
NCE stands for Noise-Contrastive Estimation. The trick: frame self-supervised learning as a classification problem (“which key is real?”) and use the classification cross-entropy as the loss. Minimizing this loss turns out to maximize a lower bound on the mutual information between the query and its true match (like IMAX, but without Gaussian assumptions). The bound is capped by the number of keys in the lineup, so more negatives per batch let it get tighter and the model genuinely learns to capture shared structure. In practice, larger batches (more negatives) tend to give cleaner gradients and better representations.
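Written out, the relationship looks as follows. The notation here is an assumption layered on the text: K is the full set of keys (the true match k⁺ plus all distractors), N is its size, and the expectation is over sampled lineups.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp(q \cdot k^{+} / \tau)}
                {\sum_{k \in K} \exp(q \cdot k / \tau)}
    \right],
\qquad
I(q;\, k^{+}) \;\ge\; \log N - \mathcal{L}_{\text{InfoNCE}},
\qquad N = |K|
```

Because the loss is never negative, the right-hand side can never exceed log N. A lineup of 8 keys can certify at most log 8 nats of shared information, however good the embeddings are, which is one way to see why adding negatives helps.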