Chapter 4 · Scaling the Negatives

CONCEPT 4.1

Memory bank & the staleness problem

Idea: compute an embedding for every image once, store them all in a bank, and reuse the stored embeddings as negatives, so there is no need to re-encode thousands of images at every step. The dictionary becomes huge at almost no extra compute.

But there's a catch. The encoder keeps changing every step, while the stored embeddings were made by older versions of it. They go stale — inconsistent with today's encoder — and stale negatives give noisy gradients. Watch the bank rot as training marches on.
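To make the rot concrete, here is a minimal sketch under loud assumptions: the "encoder" is just a 2-D rotation, and each training step is modelled as a small extra rotation (`DRIFT_PER_STEP`, a made-up number). Nothing here is a real model; the names `encode` and `bank_agreement` are purely illustrative.

```python
import math

def encode(x, angle):
    """Toy 'encoder' (assumption): rotate a 2-D input by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))

images = [(1.0, 0.0), (0.6, 0.8), (-0.5, 2.0)]
DRIFT_PER_STEP = 0.01  # assumed: how far one training step moves the encoder

# Fill the bank once, using the encoder as it is at step 0.
bank = [encode(x, 0.0) for x in images]

def bank_agreement(step):
    """Mean cosine similarity between stored and freshly encoded embeddings."""
    angle = step * DRIFT_PER_STEP
    fresh = [encode(x, angle) for x in images]
    return sum(cosine(b, f) for b, f in zip(bank, fresh)) / len(images)

for step in (0, 10, 50, 100):
    print(step, round(bank_agreement(step), 3))
```

In this toy the agreement is exactly the cosine of the accumulated drift, so it starts at 1.0 and falls steadily as the step count grows. A real encoder drifts in a far messier way, but the direction of the effect is the same.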

🌍 Everyday analogy: trying to recognise coworkers using last year’s ID photos. New haircuts, beards, glasses — the older the photo, the worse it matches the person in front of you. Stored embeddings rot the same way as the encoder keeps learning.

💡 Key idea: a big dictionary is useless if its entries don't speak the same “language” as the current encoder.

Memory bank · watch it go stale

CONCEPT 4.2

MoCo: momentum encoder + a queue

MoCo's two moves. First, a momentum encoder: the key encoder receives no gradient updates of its own; its weights are an exponential moving average (EMA) of the query encoder's weights. It changes slowly, so the stored negatives stay consistent with one another. Second, a FIFO queue: each step the newest batch of keys is enqueued and the oldest batch is dropped.

This decouples the number of negatives from the batch size: the queue can hold far more keys than a single batch. Adjust the momentum and the queue length and watch the consistency meter respond; lower the momentum and the dictionary starts to wobble.
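The queue half of the idea can be sketched in a few lines. This is an illustration, not MoCo's actual code: `KeyQueue` and `capacity` are hypothetical names, and plain strings stand in for key embeddings so the bookkeeping is easy to follow.

```python
from collections import deque

class KeyQueue:
    """FIFO dictionary of negatives with a fixed capacity (illustrative)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops its oldest items when full.
        self.keys = deque(maxlen=capacity)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

queue = KeyQueue(capacity=8)
batch_size = 2
negatives_seen = []

for step in range(6):
    batch = [f"key_{step}_{i}" for i in range(batch_size)]
    negatives = list(queue.keys)       # negatives come from earlier batches
    negatives_seen.append(len(negatives))
    queue.enqueue(batch)               # newest keys in, oldest keys out

print(negatives_seen)  # [0, 2, 4, 6, 8, 8]
```

Once the queue is full, every step sees 8 negatives even though each batch contributes only 2 keys. The number of negatives is set by `capacity`, not by the batch size.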

🌍 Everyday analogy: the teacher copies the student’s notebook slowly — a little each day — so the “answer key” never lurches around overnight; everything stays consistent. Meanwhile the negatives ride a sushi-train conveyor belt: each step a fresh plate is added at one end and the oldest plate is taken off the other.

Go deeper: why an EMA teacher and not just a copy?

If the key encoder were an exact copy of the query encoder, it would jump every single step and all your stored negatives would instantly become stale again: back to square one. The exponential moving average (key ← m·key + (1−m)·query, with m ≈ 0.99 here; the MoCo paper’s default is m = 0.999) makes the teacher drift so gently that thousands of negatives encoded over many recent steps still “speak the same language.” That’s what lets MoCo keep a giant dictionary without a giant batch: the queue holds the negatives, the slow teacher keeps them comparable.
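A toy scalar version of this argument, as a sketch only: assume a single query "weight" that jitters between 0 and 2 on alternate steps (a crude stand-in for noisy gradient updates), and measure the largest single-step jump of the key weight for different momentum values. The function names are illustrative, not taken from any library.

```python
def ema_step(key_w, query_w, m):
    """One momentum update: m * old key weight + (1 - m) * query weight."""
    return m * key_w + (1.0 - m) * query_w

def largest_key_jump(m, steps=200):
    """Largest single-step change of the key weight under a jittery query."""
    key_w = 1.0
    largest = 0.0
    for t in range(1, steps + 1):
        query_w = 1.0 + (-1.0) ** t   # assumed jitter: 0, 2, 0, 2, ...
        new_key_w = ema_step(key_w, query_w, m)
        largest = max(largest, abs(new_key_w - key_w))
        key_w = new_key_w
    return largest

for m in (0.0, 0.5, 0.99):
    print(m, largest_key_jump(m))
```

With m = 0 the key weight is a plain copy and jumps by the full swing of the query. With m = 0.99 each step moves it by only about 1% of the gap, so keys encoded a few steps apart come from nearly the same encoder.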

💡 Key idea: a slowly-evolving teacher keeps a large dictionary consistent — many negatives, no staleness, no giant batches.

🌀 Experiment: drop momentum to ~0.5 and see the negatives turn inconsistent.

Momentum encoder & FIFO queue

CONCEPT 4.3

SimCLR: augmentations & the similarity matrix

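Two implementation details turn out to matter enormously. (1) Strong, random augmentations (crop, resize, blur, colour-jitter), which quietly define what the model should ignore. (2) A learnable projection head (a small MLP) on top of the encoder: the contrastive loss is applied to the head's output, and the head is discarded after pre-training so that downstream tasks use the encoder's own representation.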
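To show how one image becomes a positive pair, here is a deliberately tiny sketch. It assumes a grayscale "image" stored as a 2-D list of values in [0, 1] and uses two simplified stand-ins for the real augmentations: a random crop and a brightness jitter. The function names and default values are illustrative assumptions, not SimCLR's settings.

```python
import random

def random_crop(image, size, rng):
    """Pick a random size x size window from a 2-D list of pixel values."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def brightness_jitter(image, strength, rng):
    """Scale all pixels by one random factor, then clamp to [0, 1]."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in image]

def two_views(image, rng, crop_size=3, strength=0.4):
    """Run the same random pipeline twice to get a positive pair."""
    def augment():
        cropped = random_crop(image, crop_size, rng)
        return brightness_jitter(cropped, strength, rng)
    return augment(), augment()

rng = random.Random(0)
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # 4x4 gradient
view_a, view_b = two_views(image, rng)
```

The two views usually differ in position and brightness, yet they come from the same image, and that is exactly the kind of difference the model is asked to ignore.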

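For each batch we build a similarity matrix between all embeddings. Blue cells are positive pairs (the two views of one image); the diagonal (each embedding compared with itself) is ignored, and the rest are negatives. Training raises the similarity in the blue cells, so they grow brighter, and lowers it in the negative cells. Toggle augmentations and step training.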
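The matrix and the loss built on it can be sketched in plain Python. Assumptions to note: rows 2i and 2i+1 are taken to be the two views of image i, the similarity is cosine similarity, and `temperature=0.5` is just an example value. The loss follows the softmax-over-similarities form used by SimCLR, written for readability rather than speed.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(embeddings):
    """Cosine similarity between every pair of embeddings."""
    z = [normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]

def contrastive_loss(embeddings, temperature=0.5):
    """Mean over rows of -log softmax at the positive cell."""
    sim = similarity_matrix(embeddings)
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        positive = i ^ 1  # partner index: 0<->1, 2<->3, ...
        # Denominator covers every other view; the diagonal is excluded.
        denom = sum(math.exp(sim[i][k] / temperature) for k in range(n) if k != i)
        total += -math.log(math.exp(sim[i][positive] / temperature) / denom)
    return total / n

# Paired views point the same way here, so positives are bright cells.
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same vectors, but the pairing no longer matches the geometry.
shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(contrastive_loss(aligned) < contrastive_loss(shuffled))  # True
```

When the positive cells hold the largest similarities in their rows the loss is small; when the positives are no more similar than the negatives it is large. Minimising it is what brightens the blue cells.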

🌍 Everyday analogy: you recognise your friend whether they’re wearing a hat, standing in shadow, or photographed sideways. Augmentations teach exactly that — “these surface changes don’t change what this is.” And the projection head is like jotting rough notes for an exam, then throwing them away once you’ve actually learned the material.

💡 Key idea: augmentations are the secret ingredient: they tell the model which changes still count as “the same thing.” This recipe is SimCLR; combined with MoCo's momentum encoder and queue, it becomes MoCo v2.

SimCLR similarity matrix