Action-conditioned world model
"Action-conditioned" means the prediction depends not just on the current state but on the agent's chosen action. Encode the current frame, pick an action, and the predictor outputs the resulting future embedding.
Fire the thrusters. The ghost trail is the model imagining where the agent ends up before it actually moves. Different action → different predicted future.
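To make "different action, different predicted future" concrete, here is a minimal sketch in Python. It assumes a toy linear predictor in place of a trained network: the names `predict_next`, `A`, and `B`, the dimensions, and the two thrust actions are all hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2  # illustrative sizes, not from the source

# Hypothetical stand-ins for learned weights: a real predictor would be a
# trained network, not a fixed linear map.
A = np.eye(STATE_DIM)
B = rng.normal(size=(STATE_DIM, ACTION_DIM))

def predict_next(z, a):
    """Predict the next embedding from the current embedding z and action a."""
    return A @ z + B @ a

z0 = rng.normal(size=STATE_DIM)      # embedding of the current frame
thrust_left = np.array([-1.0, 0.0])  # two candidate actions
thrust_right = np.array([1.0, 0.0])

z_left = predict_next(z0, thrust_left)
z_right = predict_next(z0, thrust_right)

# Same starting embedding, different actions, different imagined futures.
print(np.linalg.norm(z_left - z_right))
```

The point is only the function signature: the predictor takes the action as an input alongside the state, so changing the action changes the prediction.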
Planning with the cross-entropy method
To find a good plan, we roll out candidate action sequences through the model and score each one by how close its predicted outcome gets to the goal. The cross-entropy method (CEM) is a beautifully simple "guess & improve" search:
① sample many random action sequences · ② roll each out & score it · ③ keep the best (the elite set) · ④ refit a Gaussian to the elites & sample again. Repeat and the plans zero in on the goal. Re-running this search at every step, executing only the first action of the best plan and then replanning, is model predictive control (MPC).
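The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: `cem_plan`, `toy_cost`, the sample counts, and the toy task (a point whose final position is the sum of its actions) are all hypothetical, and the Gaussian is fitted independently per action dimension.

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim,
             n_samples=200, n_elite=20, n_iters=10, seed=0):
    """Cross-entropy method over action sequences (lower cost is better)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # 1. Sample many candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        plans = mean + std * noise
        # 2. Roll each one out and score it.
        costs = np.array([cost_fn(p) for p in plans])
        # 3. Keep the best few: the elite set.
        elite = plans[np.argsort(costs)[:n_elite]]
        # 4. Refit the Gaussian to the elites, then loop to sample again.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # small floor avoids a zero std
    return mean

# Toy task (an assumption for illustration): the final position is the sum
# of the actions, and the cost is the distance from that position to a goal.
goal = np.array([3.0, -2.0])

def toy_cost(plan):
    return np.linalg.norm(plan.sum(axis=0) - goal)

best_plan = cem_plan(toy_cost, horizon=5, action_dim=2)
print(best_plan.sum(axis=0))  # should land near the goal
```

In a world-model setting, `cost_fn` would roll the plan through the learned predictor instead of summing actions; the search loop itself would stay the same.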
Go deeper: planning in a learned latent space
Classic model predictive control needs an engineer to hand-write the physics (“the arm weighs X, the friction is Y”). Action-conditioned JEPA replaces that with a learned dynamics model that predicts the next embedding, and it scores a plan by how close the plan's predicted final embedding lands to the goal embedding. The cross-entropy method is just the search that finds good action sequences inside that imagined space. After training on roughly 62 hours of unlabelled robot video, this is enough to plan real manipulation tasks, with no reward labels and no physics equations.
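Putting the pieces together, the sketch below plans in a latent space: a plan is scored by the distance between its predicted final embedding and a goal embedding, and the search is rerun after every executed action. Everything here is an assumption for illustration: the linear `predict_next` stands in for a learned predictor, the goal embedding is constructed to be reachable, and the "real world" is simulated by the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, ACT, HORIZON = 4, 2, 3  # illustrative sizes

# Hypothetical stand-in for a learned action-conditioned predictor.
B = 0.5 * rng.normal(size=(DIM, ACT))

def predict_next(z, a):
    return z + B @ a

def plan_cost(z, plan, z_goal):
    """Roll the plan out in latent space; score the final embedding."""
    for a in plan:
        z = predict_next(z, a)
    return np.linalg.norm(z - z_goal)

def cem(z, z_goal, n_samples=128, n_elite=16, n_iters=6):
    """Compact cross-entropy search over action sequences."""
    mean = np.zeros((HORIZON, ACT))
    std = np.ones((HORIZON, ACT))
    for _ in range(n_iters):
        plans = mean + std * rng.standard_normal((n_samples, HORIZON, ACT))
        costs = np.array([plan_cost(z, p, z_goal) for p in plans])
        elite = plans[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

z = np.zeros(DIM)                    # embedding of the current frame
z_goal = B @ np.array([4.0, -3.0])   # a reachable goal embedding (assumed)
start_dist = np.linalg.norm(z - z_goal)

# MPC loop: plan, execute only the first action, then replan from the
# new state. Here the environment is the model itself (an assumption).
for step in range(10):
    plan = cem(z, z_goal)
    z = predict_next(z, plan[0])

end_dist = np.linalg.norm(z - z_goal)
print(start_dist, end_dist)
```

Note that no reward function or physics equation appears anywhere: the only signal is the distance between embeddings. On a real robot, the line that updates `z` would instead execute the action and re-encode the next camera frame.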