With all the buzz around OpenAI’s new Sora 2 video generation model, you might be wondering what sets it apart from previous state-of-the-art models like Veo 3. Here’s the breakdown.
Visual fidelity:
Sora 2 improves visual fidelity by generating frames natively at 720p and then upscaling, which keeps textures and object edges sharp.
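OpenAI hasn’t published Sora 2’s pipeline, so the following is only a conceptual Python sketch of what “generate natively, then upscale” could look like. The resolutions and the generate_frame/upscale functions are hypothetical stand-ins (here just numpy placeholders), not Sora’s actual API; the point is the ordering, where the upscaler only has to sharpen detail that already exists rather than invent it.

```python
import numpy as np

BASE_RES = (720, 1280)     # assumed native generation resolution (H, W)
TARGET_RES = (1080, 1920)  # assumed delivery resolution

def generate_frame(prompt: str, t: int) -> np.ndarray:
    """Stand-in for the real sampler; returns one RGB frame at BASE_RES."""
    rng = np.random.default_rng(hash((prompt, t)) % 2**32)
    return rng.random((*BASE_RES, 3), dtype=np.float32)

def upscale(frame: np.ndarray, target: tuple) -> np.ndarray:
    """Stand-in upscaler: nearest-neighbor resize via index mapping."""
    h, w, _ = frame.shape
    th, tw = target
    ys = np.arange(th) * h // th
    xs = np.arange(tw) * w // tw
    return frame[ys][:, xs]

def render_clip(prompt: str, num_frames: int) -> list[np.ndarray]:
    # Every frame is produced at the native base resolution first,
    # and upscaling happens as a separate post-process.
    return [upscale(generate_frame(prompt, t), TARGET_RES) for t in range(num_frames)]

clip = render_clip("a glass of water tipping over", num_frames=4)
print(clip[0].shape)  # (1080, 1920, 3)
```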
Object permanence has also improved: incorporating Long Context Tuning research into the model’s architecture lets it “remember” entities across cuts.
Fluid graphics have also improved, partly thanks to the model’s better understanding of physics.
Physics:
One of Sora 2’s biggest improvements is its understanding of physics. This comes largely from incorporating a differentiable physics engine into the generative loop, which lets the model learn real-world dynamics. Paired with a “referee model” that flags physics errors and feeds them back into training, this gives Sora 2 an unprecedented level of quality when modeling dynamic processes and events.
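OpenAI hasn’t released Sora 2’s training code, so the Python sketch below is only an assumption about how a “referee” signal could plug into a training step: a stand-in generator proposes object trajectories, a stand-in referee scores how far their motion deviates from gravity, and that penalty is added to the usual loss. All names are hypothetical, and a real referee would be a learned model rather than a hand-written rule.

```python
import numpy as np

def generator(prompt: str, rng: np.random.Generator) -> np.ndarray:
    """Stand-in generator: returns a (frames, objects, xyz) trajectory tensor."""
    return rng.normal(size=(16, 3, 3)).astype(np.float32)

def referee_score(trajectory: np.ndarray, dt: float = 1.0 / 24) -> float:
    """Stand-in referee: penalize acceleration that deviates from gravity."""
    velocity = np.diff(trajectory, axis=0) / dt
    accel = np.diff(velocity, axis=0) / dt
    gravity = np.array([0.0, -9.81, 0.0], dtype=np.float32)
    return float(np.mean((accel - gravity) ** 2))  # lower = more plausible

def training_step(prompt: str, rng: np.random.Generator, lambda_phys: float = 0.1):
    traj = generator(prompt, rng)
    recon_loss = float(np.mean(traj ** 2))   # placeholder for the usual generative loss
    phys_penalty = referee_score(traj)       # referee feedback on physics plausibility
    total = recon_loss + lambda_phys * phys_penalty
    return total, phys_penalty

rng = np.random.default_rng(0)
loss, penalty = training_step("a basketball bouncing off a rim", rng)
print(f"loss={loss:.3f}  physics penalty={penalty:.3f}")
```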
Audio:
Alongside physics, I think this is Sora 2’s biggest improvement. Sora 2 tightly couples audio with video, encoding audio spectrograms into a latent space shared with the video frames. This allows realistic, layered audio that stays tightly synchronized with the picture. Compared to other generative models, audio in Sora 2 feels far less like an afterthought.
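The exact architecture isn’t public, so this Python sketch only illustrates the general idea of a shared latent space: project spectrogram frames and video patches into tokens of the same width so a single model can attend across both streams. The dimensions and the patch/projection scheme are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT = 256  # assumed shared token width

# Toy inputs: 16 video frames of 64 patch features each, plus a mel spectrogram.
video_patches = rng.random((16, 64, 48)).astype(np.float32)   # (frames, patches, feat)
mel_spectrogram = rng.random((128, 80)).astype(np.float32)    # (time steps, mel bins)

# Separate learned projections (here just random matrices) map each modality
# into the same latent width.
W_video = rng.normal(scale=0.02, size=(48, D_LATENT)).astype(np.float32)
W_audio = rng.normal(scale=0.02, size=(80, D_LATENT)).astype(np.float32)

video_tokens = video_patches.reshape(-1, 48) @ W_video   # (16*64, 256)
audio_tokens = mel_spectrogram @ W_audio                  # (128, 256)

# One joint token sequence: a single model can now attend over sound and
# picture together, which is what keeps them synchronized.
joint_sequence = np.concatenate([video_tokens, audio_tokens], axis=0)
print(joint_sequence.shape)  # (1152, 256)
```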
Social interactions and virality:
Sora 2’s Cameo collaboration system lets users insert their own likeness and voice into generated videos, encouraging personalized memes, reaction videos, and branded messages. While there are legitimate concerns about safeguarding identity, Sora keeps the owner of the likeness in control of how their “cameo” is used. Combined with the Sora app that OpenAI has released, Sora 2 seems poised to drive social interaction and virality.