So, OpenAI released a new video model called Sora that's capable of generating up to one minute of HD video. The results are impressive, and given the media storm around it, you've probably heard about it.
I won't bore you with the nth iteration of how capable the model is; just watch the demos:
If you can't get enough of it, here are 24 Sora examples from Twitter/X that are not on OpenAI's Sora webpage.
Sora, and AI video in general, has three implications: technical ones regarding movies and games, technical ones regarding computer vision and AGI, and philosophical ones regarding our perception of mediated reality. (I'll leave out the ethical implications of dataset creation from unlicensed material without compensation for now.)
I'll discuss the first two technical implications (AI for movies and entertainment, and Sora as a world model for AGI) in this post, and follow up with some musings on the philosophical implications in the coming days.
GOOD INTERNET is a reader-supported online mag. If you like what i do here, you can support this thing by upgrading your subscription to a paid plan or use one of the other support options you can find at the bottom of this issue.
Sora at the Movies
While fanboys on the tweeties will tell you that we'll see AI-generated movies, like, tomorrow, AI tech that's actually usable for movies is at least a decade away. Sure, we do see a lot of short films coming out of this, and Runway launched an AI short film festival last year, but besides their novelty factor and experimental setup, these films are not very interesting from a cinematic perspective. They are weird and interesting for being so, but that's pretty much all there is. And if they are not weird, they are generic. (Yet.)
Ask yourself how a movie creates intensity. It is never the otherworldly, outlandish orgy of visuals you can conjure, but actors conveying a more-than-just-believable emotion that you can feel while they act it out on screen. This, in combination with precise camera movements, framing and lighting, encodes what a film is trying to say into a sequence of frames.
Movies are not merely plausible. Movies are more than believable. A good movie touches you because you can relate, because you make a connection to the character on screen, and that's hard to do. Nothing you see in a movie is coincidental, down to the very smallest details, like the fly on Jack Elam's face in Once Upon a Time in the West.
In generative AI, everything is coincidental, and it is still remarkable just how random the movements of those figures are, and i don't mean weird hands.
In one of those Sora demo vids, a woman walks down the street. She walks aimlessly, and we get no sense of why she's walking or where she's going. It's just somewhat coherent walking, but the ultimate goal of walking is not to make one step after the other, the goal of walking is to go somewhere. Sure, she's just wandering around, strolling through Tokyo on a busy night, but it still feels unconnected to anything, there's no aim, no meaning behind the movement. And we do see that in those frames: this woman walking in Tokyo has nowhere to go, she just makes one step after the other. This is not cinematic, this is flat and boring advertising aesthetics.
These aimless and random movements, and all the missing details in facial expression and body movement, make it impossible for synthetic video to convey a more-than-just-believable emotion that you can feel.
At the same time that OpenAI demoed Sora, ByteDance (TikTok's parent company) published a paper about a new technique they call Boximator for more controlled animation edits: you can tell an AI that a hand should move up or whatever, and you can somewhat control which areas should move and where, but you still can't control how they move. These techniques lack the fine-grained details needed to make acting work. A hand moving up doesn't say anything about how it should go up, how exactly the individual fingers will move, how the hand rotates, how fast or slow the rhythm of the movement should be, how single fingers twist while the hand moves, how the figure contracts its muscles so that a mere hand movement is visibly felt across the whole body on screen. With AI you have no control over any of these, while an actress spends years in class to gain exactly this fine-grained, exact and precise control over her body.
If you think AI can achieve this level of detailed body control of an animated figure within the next 5 years, i'm more than willing to place a bet against it. And then we'd still be years away from applying this level of detailed movement control to every object in every frame. Even classic CGI, which has been around for 40 years now, is not fully there yet, and you can see the difference between CGI and real actors every. single. time. Human eyes are pretty good at identifying real life, you know.
I'm more than convinced that Sora will become good enough to replace some stock video stuff, just like Midjourney and DALL-E are good enough to replace some stock photo and generic illustration stuff, because stock media, too, is ultimately an aimless and shallow representation of commodified, hollowed-out emotions. But, besides some experimental stuff, this is not film and that is not art. I don't think AI will solve that problem for quite a while.
Games are a bit trickier to assess because they are not as fixed a medium as movies are. Once we see realtime generative video AI (and that may be a technical hurdle that won't be broken for some time, early experimental successes aside), i can absolutely imagine game environments generated on the fly based on user input. The question remains whether those are fun to play.
You see, in contrast to what Andrew White says on the tweeties, OpenAI's Sora cannot simulate Minecraft. It can produce an incoherent video in the same aesthetic, but there's a pig running backwards, the character's jumps are incoherent, and, needless to say, Sora cannot simulate all the crafting.
Sure, in a few years you might be able to sort of direct a game and simulate its world by pre-scripting your generative game environments with a system prompt, but as with movies: turning this into a coherent experience that is also fun to play will take a lot of time.
World-Simulators for AGI
OpenAI released Sora for two reasons: one is to have a generative AI product to attack the stock video market, or to provide video tools for moviemakers, and make money from this. But the ultimate and bigger goal of OpenAI is not AI cinema, but creating a world model for future AGI that is capable of "understanding" spatial realities, to build "general purpose simulators of the physical world". The visual output is a byproduct (and a good measure of the AI's "understanding" of those spatial realities).
Arguably, Sora is not anywhere near "understanding" physical reality. Sure, it's way better at "understanding" what "eating spaghetti" means than the famed Will Smith eating spaghetti GIF, in the sense that, while Sora has no idea how spaghetti tastes or what al dente feels like when you bite into pasta, it clearly shows a better "understanding" of what noodles do when humans eat them.
Given the speed of advancements in AI tech, i can imagine AI systems with coherent physical world models within a few years, depending on the quality of training data. "Coherent physical world models" in that context means: an AI system is capable of image recognition and "understands" object constancy, a psychological feat (developmental psychologists call it object permanence) that kids develop within their first year: we humans understand that an object stays in physical reality even when it's suddenly hidden by something. When i throw a blanket over a box, the box stays in reality and doesn't vanish. The AI model then "understands" what a box is, what a blanket is, and how both behave in reality when they interact in various ways.
Sora is not there yet, clearly. In the demo videos you can sometimes see object constancy, sometimes you can't. Objects sometimes transform after being hidden by something (the windows on a train on a leaf suddenly vanish, a car going by in the background doesn't reappear after driving behind a tree, and so on).
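If you want to turn "object constancy" into something you can actually check instead of just eyeball in demo clips, a minimal sketch could look like the following. It's Python, the per-frame detections are hypothetical (in practice they'd come from some off-the-shelf tracker), and it only catches the crudest failure mode: a thing gets occluded and comes back as something else.

```python
# Minimal sketch of an "object constancy" check over a generated clip.
# The detections are hypothetical stand-ins: each one is just a dict with a
# persistent track id and a class label, as a real tracker would produce.
from collections import defaultdict

def constancy_violations(frames: list[list[dict]]) -> set[int]:
    """Return track ids of objects that were occluded at some point and
    came back as something else (a train window turning into a wall, say)."""
    seen_frames = defaultdict(list)   # track id -> frame indices where it appears
    seen_labels = defaultdict(set)    # track id -> every label it was detected as
    for t, detections in enumerate(frames):
        for det in detections:
            seen_frames[det["id"]].append(t)
            seen_labels[det["id"]].add(det["label"])

    violations = set()
    for tid, ts in seen_frames.items():
        was_occluded = (ts[-1] - ts[0] + 1) > len(ts)   # missing frames in between
        transformed = len(seen_labels[tid]) > 1          # ...and the label changed
        if was_occluded and transformed:
            violations.add(tid)
    return violations

# The blanket-over-a-box scenario: the box disappears for a frame and must
# come back as a box, not as something else.
clip = [
    [{"id": 1, "label": "box"}],
    [],                              # blanket thrown over the box
    [{"id": 1, "label": "box"}],     # still a box: no violation
]
print(constancy_violations(clip))    # -> set()
```

A model with a real grip on object constancy should produce clips where this set stays empty for every tracked thing; the Sora demos, going by the failures above, wouldn't.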
Object constancy isn't the only thing Sora fails at. A dog walking along the outside of a building from one window to the next would clearly fall down, which has nothing to do with "weird mutant hands" but with a missing understanding of spatial depth in combination with physical reality. Sora doesn't understand that a dog can't walk out of a window like that, because it doesn't understand what a dog is and what "outside of a building" is. But if you combine image recognition ("what is a building", "what is a window", "what is a dog") with depth estimation ("where is the dog in relation to the window and the building") and some knowledge about the physical world ("dogs walking out of windows will fall down"), such a system might soon develop the physical world model that OpenAI is aiming at.
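To make that combination a bit more concrete, here's a hand-wavy sketch of the idea as a pipeline. Everything in it is an assumption for illustration, not anything OpenAI or Meta has published: labels stand in for image recognition, (x, y, z) positions stand in for depth estimation, and the "physics" is the crudest rule imaginable.

```python
# Hypothetical sketch: recognition + depth + one physics rule ("unsupported
# things fall"). All names and numbers are made up for illustration.
from dataclasses import dataclass

@dataclass
class Thing:
    label: str    # "dog", "window", "building"  (image recognition)
    x: float      # horizontal position in the scene
    y: float      # height above the ground
    z: float      # distance from the camera      (depth estimation)

def is_supported(thing: Thing, scene: list[Thing], tol: float = 0.5) -> bool:
    """Crude physics: you're held up if you're on the ground or if something
    sits directly underneath you at roughly the same depth."""
    return thing.y == 0 or any(
        other is not thing
        and abs(other.x - thing.x) < tol
        and abs(other.z - thing.z) < tol
        and other.y < thing.y
        for other in scene
    )

def complaints(scene: list[Thing]) -> list[str]:
    return [f"the {t.label} is hovering in mid-air"
            for t in scene if not is_supported(t, scene)]

# The dog from the demo clip, roughly: it steps out of the window, but in
# depth it is a metre in front of the facade, with nothing underneath it.
scene = [
    Thing("building", x=2.0, y=0.0, z=10.0),
    Thing("window",   x=2.0, y=3.0, z=10.0),
    Thing("dog",      x=2.0, y=3.0, z=9.0),
]
print(complaints(scene))   # -> ['the dog is hovering in mid-air']
```

The point is not this toy rule, of course, but that each of the three ingredients (what is it, where is it, how does the world behave) has to be there before "dogs walking out of windows will fall down" even becomes expressible.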
Meta made some good progress on understanding depth in visual inputs with their DINOv2 model, and with V-JEPA they just released a new architecture for a "physical world model" capable of "detecting and understanding highly detailed interactions between objects". It's clear that both OpenAI and Meta are trying to develop components for multimodal AI systems that have a spatial "understanding" of the world: systems capable of reading moving images kind of like we do, of understanding movement and displacement, of getting a sense for three dimensions, and of combining that visual-physical information with audio and text.
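If you want to poke at the depth part yourself, the DINOv2 backbones are one torch.hub call away. The sketch below only pulls per-patch features out of a single frame; the small depth head you'd have to put on top of those features to get an actual depth map is left out, so treat this as a starting point under that assumption, not a recipe.

```python
# Sketch: extracting DINOv2 patch features from one video frame.
# A depth head trained on top of these per-patch features is what would turn
# them into a depth map; that part is omitted here.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

frame = torch.rand(1, 3, 224, 224)   # stand-in for a real RGB frame;
                                     # sides must be multiples of the 14px patch size
with torch.no_grad():
    out = model.forward_features(frame)

patch_tokens = out["x_norm_patchtokens"]
print(patch_tokens.shape)            # torch.Size([1, 256, 384]): one 384-dim feature
                                     # per 14x14 patch of the 224x224 frame
```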
And they are on to something.
In the fascinating paper Grounded language acquisition through the eyes and ears of a single child, researchers mounted a camera on the head of a child while he ran around all day, played, talked silly and pointed at stuff. They got 61 hours of video in 600,000 frames out of that, annotated with 37,500 transcribed "utterances" of the child. Then they trained a "generic" neural network on this dataset.
Amazingly, their
model acquire(d) many word-referent mappings present in the child’s everyday experience, enable(d) zero-shot generalization to new visual referents, and align(ed) its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.
This study not only points at what AI models with spatial "understanding" combined with language and audio might be capable of in the coming years, but also at how limited AI in its current state still is.
According to (Alison) Gopnik, a psychology professor at the University of California, Berkeley, babies have three core skills that AI systems lack. First, babies excel at imaginative model building, creating a conceptual framework to explain the world. They are also curious, adventure-loving and embodied learners, actively exploring new environments rather than being passively encased in lines of code. And babies are social animals, learning from all those they interact with, helping develop empathy, altruism and a moral sensibility.
So, OpenAI and Meta might be on to something in trying to give AI a spatial "understanding", with the goal of enabling AI to develop "physical world models" that correlate with text and audio and result in multimodal AI systems. But it's a very long path down that road, and they are nowhere near a child, who intuitively understands that dogs can't walk along the outside of buildings, that plastic chairs don't mutate and hover in the air, and that objects are persistent.
And even when we have AI systems capable of somewhat "understanding" physical reality, i'd still bet that Robert De Niro will beat them at playing a Schnitzel on the big screen for a very long time.
Take it away, Bob.