Random-dot stereograms
Two images of pure noise. Shift a hidden region of dots between the left and right views, cross your eyes, and a 3D shape floats out. No edges, no texture, no labels. The depth lives only in the relationship between the two views.
Drag the slider to change how far the hidden region is shifted (the disparity). The reveal panel recovers the shape, which grows brighter as the signal gets stronger.
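If you'd like to see the trick in code, here is a minimal sketch of how such a pair could be built and then "revealed". It is not the code behind the demo: the names `make_stereogram`, `reveal` and `match_rate` are hypothetical, and the reveal step is one simple assumption (check, pixel by pixel, whether the right view matches the left view shifted by the chosen disparity).

```python
import random


def make_stereogram(mask, disparity, seed=0):
    """Build a left/right pair of binary noise images.

    mask: 2D list of bools marking the hidden region (hypothetical input).
    disparity: horizontal shift, in pixels, of that region in the right view.
    """
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    left = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    right = [row[:] for row in left]
    for y in range(height):
        # First refill the region's original spot with fresh noise...
        for x in range(width):
            if mask[y][x]:
                right[y][x] = rng.randint(0, 1)
        # ...then paste the region's dots at the shifted position.
        for x in range(width):
            if mask[y][x] and 0 <= x + disparity < width:
                right[y][x + disparity] = left[y][x]
    return left, right


def reveal(left, right, disparity):
    """Return 1 where the right view equals the left view shifted by
    `disparity`, and 0 elsewhere (including columns with no source pixel)."""
    height, width = len(left), len(left[0])
    return [
        [
            int(0 <= x - disparity < width
                and right[y][x] == left[y][x - disparity])
            for x in range(width)
        ]
        for y in range(height)
    ]


def match_rate(reveal_map, region):
    """Fraction of matching pixels over a list of (y, x) coordinates."""
    values = [reveal_map[y][x] for y, x in region]
    return sum(values) / len(values)


# A square hidden region in a 64x64 image, shifted by 4 pixels.
mask = [[20 <= y < 44 and 20 <= x < 44 for x in range(64)] for y in range(64)]
left, right = make_stereogram(mask, disparity=4)
reveal_map = reveal(left, right, disparity=4)
```

Inside the shifted square every pixel matches, while everywhere else the two views agree only by chance, roughly half the time. Neither image alone contains the square; it shows up only when you compare them.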
What is a world model?
A world model is a system that predicts what happens next. Watch a ball bounce for a moment and you instinctively know where it'll be in a second — you're running a tiny physics simulator in your head.
Toggle the model on. Faint “ghost” dots show where the model thinks the ball will travel. The better its internal model of gravity and walls, the closer the ghosts hug reality.
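To make the "ghost" idea concrete, here is a rough sketch, assuming a ball that moves only vertically and bounces off a floor. The functions `step`, `rollout` and `ghost_error`, along with the time step and bounciness values, are hypothetical choices for illustration and not the demo's actual physics. The "world model" here is the same simulator run with its own guess at gravity.

```python
def step(state, gravity, dt=0.02, restitution=0.9):
    """Advance (height, velocity) by one time step, bouncing off a floor at 0."""
    height, velocity = state
    velocity -= gravity * dt
    height += velocity * dt
    if height < 0.0:
        # Reflect the overshoot and lose a little energy on the bounce.
        height = -height
        velocity = -velocity * restitution
    return (height, velocity)


def rollout(state, gravity, n_steps):
    """Predict the next n_steps heights from a starting state."""
    heights = []
    for _ in range(n_steps):
        state = step(state, gravity)
        heights.append(state[0])
    return heights


def ghost_error(true_gravity, model_gravity, start=(1.0, 0.0), n_steps=20):
    """Mean gap between the real path and the model's predicted ghosts."""
    real = rollout(start, true_gravity, n_steps)
    ghosts = rollout(start, model_gravity, n_steps)
    return sum(abs(r - g) for r, g in zip(real, ghosts)) / n_steps


perfect = ghost_error(true_gravity=9.81, model_gravity=9.81)
close = ghost_error(true_gravity=9.81, model_gravity=9.0)
poor = ghost_error(true_gravity=9.81, model_gravity=5.0)
```

When the model's gravity matches the real one, the ghosts land exactly on the true path. The further off its guess, the more the ghosts drift away from the ball.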
Supervised vs. self-supervised
The obvious approach: feed the network both views and ask it to output the depth. That's supervised learning — and it only works if a human first labelled the correct depth for every example. Labels are slow, costly, and run out fast.
In the simulator, data streams in. With supervised learning you must hand-label each item, and your labelling budget is limited, so most of the stream goes unused. Self-supervised learning instead invents its own task from the data's structure, so it can learn from all of it.
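The difference is easy to see in a toy sketch. Everything below is an illustrative assumption: the stream of short number sequences, the `label_fn` standing in for a human labeller, and the made-up task of hiding the last value of each sequence so it can be predicted from the rest.

```python
def supervised_examples(stream, label_fn, budget):
    """Only the first `budget` items get a (costly) human label."""
    return [(item, label_fn(item)) for item in stream[:budget]]


def self_supervised_examples(stream):
    """Invent a task from the data itself: hide the last value of each
    sequence and use it as the target for the values before it."""
    return [(item[:-1], item[-1]) for item in stream]


# A hypothetical stream of 100 short sequences.
stream = [[i, i + 1, i + 2, i + 3] for i in range(100)]

labelled = supervised_examples(stream, label_fn=lambda item: sum(item) % 2,
                               budget=10)
invented = self_supervised_examples(stream)
```

The supervised set stops growing once the budget is spent, while the self-supervised set gets one training pair from every item that arrives, with no labeller involved.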