Step Into AI Videos: OpenAI’s Sora 2 Adds Sound and You

▼ Summary
– OpenAI released Sora 2, a second-generation AI model that generates videos with synchronized dialogue and sound effects, and launched an iOS app for inserting users into AI videos via “cameos.”
– The model was demonstrated in a video featuring a photorealistic Sam Altman speaking in slightly unnatural tones against fantastical backdrops like a duck race and glowing garden.
– Sora 2 can create realistic background soundscapes, speech, and sound effects, joining competitors like Google’s Veo 3 and Alibaba’s Wan 2.5 in generating synchronized audio.
– It shows improved visual consistency, follows complex multi-shot instructions while maintaining coherency, and is described as OpenAI’s “GPT-3.5 moment for video.”
– The model simulates complex physical movements more accurately, addressing prior failures by avoiding object morphing and ensuring realistic physics, such as a basketball rebounding properly after a missed shot.

OpenAI has unveiled Sora 2, a powerful new video-generation model and the first in the Sora line to produce synchronized audio alongside its visuals. The second-generation system generates videos in multiple artistic styles, complete with realistic dialogue and environmental sounds. Alongside the model, OpenAI launched a dedicated iOS social app that lets users place themselves directly into AI-generated videos through a feature the company calls “cameos.”
To demonstrate Sora 2’s capabilities, OpenAI released a sample video. It presents a strikingly photorealistic version of CEO Sam Altman speaking to viewers. His voice carries a slightly synthetic quality as he stands within imaginative settings, including a whimsical ride-on duck race and a luminous garden of glowing mushrooms. The model is engineered to produce what OpenAI describes as “sophisticated background soundscapes, speech, and sound effects with a high degree of realism.”
This development places OpenAI in direct competition with other industry leaders. Back in May, Google’s Veo 3 became the first major video-synthesis model from a large AI lab to offer synchronized audio generation. Just recently, Alibaba launched Wan 2.5, an open-weights video model that also creates accompanying audio. With Sora 2, OpenAI officially enters this new arena of audiovisual AI.
Beyond sound, Sora 2 shows significant improvements in visual consistency compared to its predecessor. It can interpret and execute more complicated, multi-shot instructions while preserving a coherent narrative flow across scenes. OpenAI characterizes this leap as its “GPT-3.5 moment for video,” drawing a parallel to the pivotal advancement that ChatGPT represented in the evolution of text-generation AI.
The new model also appears to handle physical interactions with greater accuracy. OpenAI states that Sora 2 can realistically simulate intricate physical movements, such as Olympic gymnastics routines and triple axel jumps in figure skating. This marks a notable upgrade from the original Sora model released in February 2024. Following the debut of Sora 1 Turbo last year, observers noted several prominent failures in similar video-generation tasks, issues that OpenAI claims to have resolved with this latest version.
In its official announcement, the company noted that previous video models tended to be “overoptimistic”: they would often warp objects and distort reality just to fulfill a text prompt’s request. For instance, if a basketball player missed a shot, the ball might inexplicably teleport into the hoop. With Sora 2, if a player misses, the ball realistically bounces off the backboard instead.
(Source: Ars Technica)





