World models, also known as world simulators, are being touted by some as the next big thing in AI.
AI pioneer Fei-Fei Li's World Labs has raised $230 million to build "large world models," and DeepMind hired one of the creators of OpenAI's video generator, Sora, to work on "world simulators." (Sora was released on Monday.)
But what the heck are these things?
World models take inspiration from the mental models of the world that humans develop naturally. Our brains take the abstract representations from our senses and form them into a more concrete understanding of the world around us, producing what we called "models" long before AI adopted the phrase. The predictions our brains make based on these models influence how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber gives the example of a baseball batter. Batters have milliseconds to decide how to swing their bat -- shorter than the time it takes for visual signals to reach the brain. The reason they're able to hit a 100-mile-per-hour fastball is because they can instinctively predict where the ball will go, Ha and Schmidhuber say.
"For professional players, this all happens subconsciously," the research duo writes. "Their muscles reflexively swing the bat at the right time and location in line with their internal models' predictions. They can quickly act on their predictions of the future without the need to consciously roll out possible future scenarios to form a plan."
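The batter's trick can be sketched in a few lines of code. The example below is a deliberately crude illustration, not Ha and Schmidhuber's actual architecture (their paper uses learned neural components); it simply shows what "acting on a prediction instead of waiting for perception" means: extrapolate from the last two sightings of the ball and commit to a position before fresh visual input could possibly arrive.

```python
def predict_position(p0, p1, dt, horizon):
    """A toy 'internal model': constant-velocity extrapolation.

    Given two recent observations p0 and p1 taken dt seconds apart,
    predict where the ball will be `horizon` seconds after p1.
    """
    velocity = (p1 - p0) / dt
    return p1 + velocity * horizon


# Two sightings of a ~100 mph (~44 m/s) fastball, 10 ms apart
# (distances from home plate in meters -- illustrative numbers):
seen_at_0ms = 18.40
seen_at_10ms = 17.96

# The swing is planned against this prediction of the ball's position
# 150 ms out, not against incoming (too slow) visual signals.
predicted = predict_position(seen_at_0ms, seen_at_10ms, dt=0.010, horizon=0.150)
```

A real world model replaces the hard-coded physics here with predictions learned from data, but the role is the same: a cheap forward simulation that lets the agent act ahead of its senses.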
It's these subconscious reasoning aspects of world models that some believe are prerequisites for human-level intelligence.
Modeling the world
While the concept has been around for decades, world models have gained popularity recently in part because of their promising applications in the field of generative video.
Most, if not all, AI-generated videos veer into uncanny valley territory. Watch them long enough and something bizarre will happen, like limbs twisting and merging into each other.
While a generative model trained on years of video might accurately predict that a basketball bounces, it doesn't actually have any idea why -- just like language models don't really understand the concepts behind words and phrases. But a world model with ...