Chapter 1 · See the Hidden World

CONCEPT 1.1

Random-dot stereograms

Two images of pure noise. Shift a hidden region of dots sideways between the left and right views and, when you cross your eyes to fuse them, a 3D shape floats out. Neither image on its own shows an edge, an outline, or a label. The depth lives only in the relationship between the two views.

Drag the slider to change how far the hidden region is shifted between the two views (the disparity). The reveal panel shows the recovered shape, which gets brighter as the disparity signal gets stronger.
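
To make the idea concrete, here is a minimal sketch of how a stereogram pair could be built and how a simple matcher could recover the hidden shape. It is not the code behind the lab: the function names, image size, box position and window size are all assumptions chosen for illustration, and the matcher is plain block matching (compare small windows at each candidate shift and keep the best one).

```python
import random


def make_stereogram(width=64, height=32, box=(20, 8, 24, 16), disparity=3, seed=0):
    """Build two binary noise images that differ only inside a hidden box.

    box is (x0, y0, w, h). All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    right = [row[:] for row in left]  # start as an exact copy of the left view
    x0, y0, w, h = box
    for y in range(y0, y0 + h):
        # Copy the hidden region into the right view, shifted left by `disparity`.
        for x in range(x0, x0 + w):
            right[y][x - disparity] = left[y][x]
        # Refill the strip uncovered by the shift with fresh noise.
        for x in range(x0 + w - disparity, x0 + w):
            right[y][x] = rng.randint(0, 1)
    return left, right


def estimate_disparity(left, right, max_d=6, radius=2):
    """For each pixel, pick the shift whose small window agrees best across views."""
    height, width = len(left), len(left[0])
    depth = [[0] * width for _ in range(height)]
    offsets = range(-radius, radius + 1)
    for y in range(radius, height - radius):
        for x in range(radius + max_d, width - radius):
            best_d, best_score = 0, -1
            for d in range(max_d + 1):
                # Count matching dots between the left window and the shifted right window.
                score = sum(
                    left[y + dy][x + dx] == right[y + dy][x + dx - d]
                    for dy in offsets
                    for dx in offsets
                )
                if score > best_score:  # ties keep the smaller shift
                    best_d, best_score = d, score
            depth[y][x] = best_d
    return depth


left, right = make_stereogram(disparity=3)
depth = estimate_disparity(left, right)
# Crude "reveal panel": mark pixels where a non-zero shift was detected.
for row in depth[::2]:
    print("".join("#" if d else "." for d in row))
```

Each image alone is just coin-flip noise. The box only shows up once the two views are compared, which is the point of the concept: the signal is in the agreement between views, not in either view.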

🌍 Everyday analogy: remember those “Magic Eye” posters from the ’90s? A single speckly pattern that repeats with slight variations, and if you relax your eyes a hidden dolphin pops out in 3D. Each eye locks onto a slightly different repeat, so your two eyes still supply the two views; your brain is the network that finds the depth they secretly share.

💡 Key idea: useful structure can be hidden in the agreement between two views — waiting to be discovered without a single label.

🎯 Challenge: push the disparity up until the reveal locks in, then click the shape you see.

Stereogram lab

CONCEPT 1.2

What is a world model?

A world model is a system that predicts what happens next. Watch a ball bounce for a moment and you instinctively know where it'll be in a second — you're running a tiny physics simulator in your head.

Toggle the model on. Faint “ghost” dots show where the model thinks the ball will travel. The better its internal model of gravity and walls, the closer the ghosts hug reality.
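
The ghost dots can be sketched as a rollout of the model's own physics, compared against what really happens. This is a minimal sketch, not the sandbox's actual code: the `step`, `rollout` and `mean_error` helpers, the box size, the bounce factor and the time step are all assumptions made for illustration.

```python
import math


def step(state, dt=0.05, gravity=-9.8, width=10.0, bounce=0.9):
    """Advance a ball (x, y, vx, vy) by one time step in a box with a floor at y=0."""
    x, y, vx, vy = state
    vy += gravity * dt
    x += vx * dt
    y += vy * dt
    if y < 0.0:            # hit the floor: reflect and lose some energy
        y, vy = -y, -vy * bounce
    if x < 0.0:            # left wall
        x, vx = -x, -vx * bounce
    elif x > width:        # right wall
        x, vx = 2 * width - x, -vx * bounce
    return (x, y, vx, vy)


def rollout(state, steps, **params):
    """Predict future positions by repeatedly applying the step rule."""
    path = []
    for _ in range(steps):
        state = step(state, **params)
        path.append((state[0], state[1]))
    return path


def mean_error(path_a, path_b):
    """Average distance between matching points of two paths."""
    total = sum(math.hypot(ax - bx, ay - by)
                for (ax, ay), (bx, by) in zip(path_a, path_b))
    return total / len(path_a)


start = (1.0, 8.0, 2.0, 0.0)
reality = rollout(start, steps=40)  # the "real" ball uses the default gravity
for guess in (-9.8, -9.0, -5.0):
    ghosts = rollout(start, steps=40, gravity=guess)  # the model's belief about gravity
    print(f"model gravity {guess}: mean ghost error {mean_error(reality, ghosts):.3f}")
```

A model whose gravity matches reality produces ghosts that sit exactly on the true path; the further its guess drifts, the further the ghosts stray.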

🌍 Everyday analogy: a pool player chalks the cue and, before striking, sees in their mind exactly where the balls will scatter. That little mental simulator — “if this, then that” — is a world model. We all run one constantly; AI is just learning to build its own.

💡 Key idea: intelligence leans heavily on prediction. The rest of this course is about learning good predictive representations without hand-labelling the world.

Predict-the-future sandbox

CONCEPT 1.3

Supervised vs. self-supervised

The obvious approach: feed the network both views and ask it to output the depth. That's supervised learning, and it only works if someone first labelled the correct depth for every training example. Labels are slow and costly to produce, and the budget for them runs out fast.

In the simulator, data streams in. With supervised learning you must hand-label each item (you have a limited budget). Self-supervised learning instead invents its own task from the data's structure — so it can learn from all of it.
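
The budget effect can be sketched in a few lines. This is an illustrative toy, not the game's implementation: the stream of number sequences, the `human_label` stand-in and the "hide the last value" pretext task are all assumptions, picked only to show where the training targets come from in each case.

```python
import random


def supervised_examples(stream, budget, label_fn):
    """Keep only as many items as the labelling budget allows."""
    return [(item, label_fn(item)) for item in stream[:budget]]


def self_supervised_examples(stream):
    """Pretext task: hide the last value of each sequence and predict it from the rest."""
    return [(seq[:-1], seq[-1]) for seq in stream]


def human_label(seq):
    """Stand-in for a costly human judgement (hypothetical labelling rule)."""
    return sum(seq) > 2.5


rng = random.Random(0)
stream = [[rng.random() for _ in range(5)] for _ in range(1000)]

labelled = supervised_examples(stream, budget=50, label_fn=human_label)
pretext = self_supervised_examples(stream)
print(len(labelled), "supervised examples vs", len(pretext), "self-supervised examples")
```

The supervised set stops growing when the budget is spent, while the self-supervised set grows with every item that streams in, because each target is carved out of the data itself.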

🌍 Everyday analogy: supervised learning is paying tutors to hand-write the answer on the back of every flashcard — accurate, but you go broke fast and run out of cards. Self-supervised learning is a toddler with a box of blocks: nobody labels anything, yet by stacking and knocking them over the child learns gravity, balance and shape. The world is its own answer key.

💡 Key idea: the real question of this whole field — can a network learn useful structure from the relationship between views, with no labels at all?

The labelling-budget game

So… can a machine do this?