OpenAI text-to-video model Sora wows X but still has weaknesses

OpenAI has unveiled its first video generation model, dubbed Sora, which can create detailed, movie-like scenes in resolutions up to 1080p.

Artificial intelligence (AI) firm OpenAI unveiled its first-ever text-to-video model to a strong reception on Thursday, though the firm admits the model still has a ways to go. 

OpenAI unveiled the new generative AI model, dubbed Sora, on Feb. 15, which is said to create detailed videos from simple text prompts, continue existing videos, and even generate scenes based on a still image.

In its Feb. 15 blog post, OpenAI claimed the AI model can generate movie-like scenes in resolutions up to 1080p, including multiple characters, specific types of motion and accurate details of the subject and background.

How Sora works

Much like OpenAI’s image-based predecessor, Dall-E 3, Sora operates on what’s known as a diffusion model.

Diffusion refers to a generative process in which the model starts from something resembling “static noise” and gradually transforms it into a coherent video or image by “removing the noise” over many steps.
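The iterative denoising idea can be illustrated with a toy sketch. This is not Sora's actual method: a real diffusion model uses a trained neural network to predict the noise at each step, whereas here the "prediction" is faked using a known target, purely to show the loop structure of gradually refining noise into an output.

```python
import random

def denoise_toy(target, steps=50, seed=0):
    """Toy illustration of diffusion-style sampling: start from pure
    noise and repeatedly remove a fraction of the remaining noise.
    (Hypothetical example; a real model would predict the noise with
    a neural network instead of using a known target.)"""
    rng = random.Random(seed)
    # Begin with pure "static noise".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        # Each step removes a slice of the estimated remaining noise,
        # nudging the sample toward a coherent output.
        sample = [s + (t - s) / (steps - step) for s, t in zip(sample, target)]
    return sample

target = [0.2, -1.0, 0.5, 0.0]   # stand-in for "clean" pixel values
result = denoise_toy(target)
```

After all steps, the sample has converged to the clean target values, mirroring how a diffusion model's output sharpens from noise into an image or video frame over successive denoising steps.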

The AI firm wrote that Sora builds on past research behind its ChatGPT and Dall-E 3 models, which it claims allows Sora to represent user inputs more faithfully.

OpenAI admitted that Sora still contained several weaknesses and could struggle to simulate the physics of a complex scene accurately, namely by muddling up the nature of cause and effect.

“For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.”

The firm said the new tool could also confuse the “spatial details” of a given prompt by mixing up lefts and rights or failing to follow precise descriptions of directions.

Sora can accidentally generate physically implausible motion. Source: OpenAI

OpenAI said the new generative model is, for now, only available to “red teamers” — tech parlance for experts who adversarially probe a system for flaws — to assess “critical areas for harms or risks,” as well as to select designers, visual artists and filmmakers who will give feedback on how to advance the model.

In December 2023, a report from Stanford University revealed that AI image-generation tools trained on the LAION database had ingested thousands of images of illegal child abuse material, raising serious ethical and legal concerns for text-to-image and text-to-video models.

Users on X left speechless

Dozens of demo videos showing Sora in action have been circulating on X, where the model is now trending with over 173,000 posts.

In a bid to show off what the new generative model is capable of, OpenAI CEO Sam Altman opened himself up to custom video-generation requests from users on X, with the AI chief sharing a total of seven Sora-generated videos, ranging from a duck riding on the back of a dragon to golden retrievers recording a podcast on a mountaintop.

AI commentator Mckay Wrigley — along with many others — wrote that the videos generated by Sora had left him speechless.

In a Feb. 15 post to X, Nvidia senior researcher Jim Fan declared that anyone who believed Sora to be just another “creative toy,” like Dall-E 3, would be dead wrong.

In Fan’s view, Sora is less a video-generation tool and more a “data-driven physics engine,” as the AI model isn’t just generating abstract video but also deterministically creating the physics of objects in the scene itself.
