Inside OpenAI's Sora: Surge to #1 App, Key Product Decisions & How Video Models Learn Physics — Unsupervised Learning

Sora, OpenAI’s video generation app, launched in late 2025 and quickly became the #1 app in the App Store, driven by a combination of breakthrough model intelligence and a product designed around social creation rather than passive consumption.
- The app was built by a surprisingly small team of roughly 40 people (about 9–10 researchers, under 20 product engineers, and ~13 systems engineers), and its success exceeded even the team’s optimistic expectations.
- Unlike ChatGPT’s launch, which was initially underestimated, Sora’s team had ambitious goals from the start, though hitting #1 for a full month was still a surprise.

Cameo as the killer feature: The ability to insert yourself (via a short video and audio check) into AI-generated scenes emerged organically when an engineer on the team, Bobo, uploaded internal Slack videos and people started tagging each other. It quickly became the dominant behavior on the platform.
- Cameo makes generated content feel personal and human, distinguishing Sora from purely AI-generated feeds that users find hollow.
- The team initially worried the app had become “just cameos” but realized this was actually the core value.
Social-first, not consumption-first: The team deliberately designed the recommender system to prioritize creativity and connection over engagement-maximizing clickbait.
- The feed emphasizes content from people you know and remix chains, inspired by the team’s experience at Instagram where friend-prioritized feeds proved healthier.
- Remixing is treated as a first-class creative act—easy to trace and attribute in a way that isn’t possible with traditional photo/video.
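
One way to picture why remix chains are easy to trace: if every remix keeps a pointer to the post it was made from, attribution is just a walk back along those pointers. The sketch below is illustrative only. The `Video` record, its fields, and `attribution_chain` are hypothetical names, not a description of how Sora actually stores remixes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Video:
    """Hypothetical record for a post; not Sora's real schema."""
    video_id: str
    creator: str
    parent_id: Optional[str] = None  # None marks an original (non-remix) post


def attribution_chain(videos: Dict[str, Video], video_id: str) -> List[str]:
    """Return the creators behind a post, from the original to the given remix."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = video_id
    while current is not None:
        if current in seen:
            # Guard against corrupted data where parent links form a loop.
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        video = videos[current]
        chain.append(video.creator)
        current = video.parent_id
    chain.reverse()  # we collected remix -> original, so flip the order
    return chain


# Usage with made-up data: v3 remixes v2, which remixes the original v1.
videos = {
    "v1": Video("v1", "ana"),
    "v2": Video("v2", "ben", parent_id="v1"),
    "v3": Video("v3", "chloe", parent_id="v2"),
}
credits = attribution_chain(videos, "v3")
print(credits)
```

Storing only the parent link keeps each record small, and the full lineage can still be recovered on demand for display or credit.
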
Independent app, not a ChatGPT feature: The team prototyped social features inside ChatGPT (Reddit-style threaded image chains) but ultimately decided Sora needed its own surface because ChatGPT feels single-player and sacred—mixing public social behavior there would feel jarring.

Sora 1 was the “GPT-1 moment” for video: Launched in early 2023 alongside DALL-E 3, it was the first model capable of high-resolution, consistent generation beyond ~1 second (up to 60 seconds), though at enormous cost (~$50 per 720p video).
Sora 2 is the “GPT-3.5 moment”: A step-function jump in both intelligence and usability. It can handle complex physics (gymnastics routines, glass shattering) in a single shot, whereas Sora 1 required hundreds of attempts for coherent stories.
- API pricing dropped to cents on the dollar compared to Sora 1, and the team expects costs to continue falling by orders of magnitude.
- The model shows remarkable stylistic range—cinematic, doorbell footage, anime, podcast-style—without the mode collapse seen in competitors’ models.

At a fundamental level, both LLMs and video models learn world models through prediction tasks. A diffusion-based video model predicts the underlying signal in a noisily corrupted video; to do this well, it must implicitly learn how objects move, how light behaves, and how physical interactions unfold.
- This understanding is an emergent property of scale—models that internalize physics achieve lower loss, creating optimization pressure toward world-modeling.
- Video intelligence is multimodal in a unique way: a single model must capture both intellectual content (e.g., a calculus lecture) and physical complexity (e.g., every person shifting in the background of a gymnastics routine).
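
The optimization pressure described above can be illustrated with a toy denoising experiment. This is a minimal sketch under strong assumptions, not Sora's training setup: the "video" is just one object's position per frame, the corruption is additive Gaussian noise, and both "denoisers" are hand-written rather than learned. All function names and parameters are hypothetical. The point is only that a predictor that has internalized the motion law (constant velocity here) recovers the clean signal with lower loss than one that ignores dynamics.

```python
import random


def make_clean_video(num_frames, start=0.0, velocity=0.5):
    """Toy 'video': the position of one object moving at constant velocity."""
    return [start + velocity * f for f in range(num_frames)]


def corrupt(clean, sigma, rng):
    """Add Gaussian noise to every frame (the corruption the model must undo)."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in clean]


def naive_denoiser(noisy):
    """Ignores dynamics entirely: returns each noisy frame unchanged."""
    return list(noisy)


def physics_denoiser(noisy):
    """Assumes constant-velocity motion: least-squares line fit across frames."""
    n = len(noisy)
    mean_t = (n - 1) / 2.0
    mean_x = sum(noisy) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(noisy))
    var = sum((t - mean_t) ** 2 for t in range(n))
    velocity = cov / var
    start = mean_x - velocity * mean_t
    return [start + velocity * t for t in range(n)]


def denoising_loss(predicted, clean):
    """Mean squared error between the predicted and the true clean signal."""
    return sum((p - c) ** 2 for p, c in zip(predicted, clean)) / len(clean)


def average_loss(denoiser, trials=200, num_frames=32, sigma=1.0, seed=0):
    """Average the denoising loss over many independently corrupted copies."""
    rng = random.Random(seed)
    clean = make_clean_video(num_frames)
    total = 0.0
    for _ in range(trials):
        noisy = corrupt(clean, sigma, rng)
        total += denoising_loss(denoiser(noisy), clean)
    return total / trials


naive_loss = average_loss(naive_denoiser)
physics_loss = average_loss(physics_denoiser)
print(f"naive: {naive_loss:.3f}  physics-aware: {physics_loss:.3f}")
```

In a real model nothing hand-codes the motion law. The claim in the text is that gradient descent on this kind of objective, at sufficient scale, pushes the network toward representations that play the role of `physics_denoiser`.
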

Long-horizon simulation: The next major frontier is simulating processes that unfold over hours, days, or years—useful for biology, physics research, and robotics. This requires fundamental breakthroughs beyond current techniques.
Scientific discovery via simulation: Bill Peebles predicts that by early 2028, video models will produce the first scientific breakthroughs by simulating physical phenomena (e.g., turbulence), likely in classical physics where observational data maps well to video.
Robotics and simulation data: The team is bullish on repurposing video models for robotics pre-training, since they deeply understand local motion and dexterity—areas where real-world trajectory data is scarce.

Sora launched a credit-based monetization system shortly after release, initially offering 30 free generations per day. This is seen as a first step, not the final model.
- The team is exploring brand integration (e.g., auctioning in-video product placement to brands) and plans to prioritize early-adopter rights holders and creators in monetization pilots.
- Character cameos (launched the day before the interview) allow rights holders to make their IP available for user-generated content, opening a new revenue channel.

The team uses OpenAI’s reasoning models as part of a multi-layered moderation stack, enabling a small team to manage safety at scale.
- Cameo introduced novel challenges—users must feel in control of their likeness—requiring careful guardrails around impersonation and public figure protections.
- The team acknowledges ongoing friction: content moderation failures are visible and frustrating, and they are iterating daily.

The team sees natural connections to ChatGPT (e.g., responding to “how do I fix my toilet?” with a generated instructional video) and to OpenAI’s browser/agent efforts (e.g., a video assistant helping book flights).
- However, they are cautious about merging entertainment into ChatGPT’s utility-driven experience and emphasize that any integration must be done thoughtfully.

Sora launched first in the US and Canada, then expanded to Korea, Japan, and Southeast Asia. Each region has developed a distinct creative flavor—Japanese creators, for instance, produce highly aesthetic content that differs markedly from US styles.
- Cross-cultural remix chains are a highlight, with users from different countries riffing on each other’s content.
- The team notes that even mundane content (e.g., someone discussing the Toronto accent) can become educational, echoing the learning dynamic that made TikTok compelling.

Changed minds: Rohan has accelerated his timelines for some AI capabilities but delayed expectations for consumer adoption and enterprise deployment, noting the gap between scientific progress and usable product interfaces. Bill has come to value human creative intent more than expected—even knowing a person iterated on a prompt or rejected samples makes generated content feel more meaningful.
What they’d build on the API: Thomas wants to build interactive storytelling or gaming experiences. Rohan is excited about gaming, where generative art removes a traditional bottleneck. Bill would focus on science or robotics-facing models.
Why so few consumer products: Building a good consumer product is extremely hard even without the added complexity of brand-new technology; the combination makes it exceptionally difficult to nail.
