Generative World Modelling for Humanoids: 1X World Model Challenge
Abstract
In this talk, I’ll present our methods, which won both tracks of the 1X World Model Challenge.
The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes.
For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using adaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch.
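To make the two conditioning and fine-tuning ideas above concrete, here is a minimal NumPy sketch of an adaLN-Zero modulated block and a LoRA-augmented linear layer. It is an illustration under stated assumptions, not the implementation used in the challenge entry: the class names (`AdaLNZeroBlock`, `LoRALinear`), the dimensions, the single matrix standing in for the attention/MLP sublayer, and the direct linear projection of the robot state are all hypothetical simplifications.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Parameter-free layer norm over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    """Residual block whose shift, scale and gate are regressed from a conditioning vector."""

    def __init__(self, dim, cond_dim, rng):
        # Zero-initialised modulation projection: shift, scale and gate all start at 0,
        # so the block is an identity map at initialisation (the "Zero" in adaLN-Zero).
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Hypothetical stand-in for the attention/MLP sublayer of a real transformer block.
        self.w_ff = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, x, cond):
        # x: (tokens, dim) token features; cond: (cond_dim,) e.g. an embedded robot state.
        shift, scale, gate = np.split(cond @ self.w_mod + self.b_mod, 3)
        h = layer_norm(x) * (1.0 + scale) + shift
        return x + gate * (h @ self.w_ff)


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / rank) * A @ B."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight  # frozen pretrained weight, shape (in_dim, out_dim)
        in_dim, out_dim = weight.shape
        self.a = rng.normal(scale=0.01, size=(in_dim, rank))  # trainable down-projection
        self.b = np.zeros((rank, out_dim))  # trainable up-projection, zero at init
        self.scaling = alpha / rank

    def __call__(self, x):
        return x @ self.weight + self.scaling * (x @ self.a @ self.b)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 tokens with 8 features each (arbitrary sizes)
robot_state = rng.normal(size=5)   # hypothetical 5-dimensional state embedding

block = AdaLNZeroBlock(dim=8, cond_dim=5, rng=rng)
lora = LoRALinear(rng.normal(size=(8, 8)), rank=2, alpha=4.0, rng=rng)

modulated = block(tokens, robot_state)  # equals `tokens` until w_mod/b_mod are trained
adapted = lora(tokens)                  # equals the frozen layer's output until `b` is trained
```

Both pieces start as no-ops: the zero-initialised gate and the zero-initialised `b` matrix mean a pretrained model's behaviour is unchanged at the start of training, which is the usual motivation for pairing them with a foundation model.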
Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
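For reference, PSNR is the standard peak signal-to-noise ratio, 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below computes it for a predicted and a ground-truth frame; the function name `psnr` and the assumption that pixels are floats in [0, 1] are illustrative choices, and the challenge's official evaluation script may differ in details such as pixel range or averaging over frames.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames: PSNR is unbounded
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy example: a frame that is uniformly off by 0.1 has MSE = 0.01.
target_frame = np.zeros((16, 16, 3))
pred_frame = target_frame + 0.1
score_db = psnr(pred_frame, target_frame)  # approximately 20 dB
```

Because the scale is logarithmic, each 10 dB increase corresponds to a tenfold reduction in mean squared error.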
Event
Location
Hawaii Conference Center
1801 Kalakaua Avenue, Honolulu, Hawaii HI 96815