Tech Behind OpenAI Sora and Stable Diffusion 3: Diffusion Transformers

Updated on March 2 2024

In February 2024, OpenAI previewed Sora, its new text-to-video model, and Stability AI announced Stable Diffusion 3, the latest iteration of its popular open source image generator. Both took the world by storm, and we were curious about the technology that makes these tools unique.

Enter Diffusion Transformers. Diffusion transformers are a new class of generative models that have shown remarkable capabilities for high-fidelity image and video synthesis. They build upon the transformer architecture that has driven breakthroughs in large language models, adapting these scaling properties to the visual domain.

This article explains what makes diffusion transformers unique, how Sora and Stable Diffusion 3 use them, and what advantages they offer over alternative generative frameworks.

What are Diffusion Transformers?

Diffusion transformers were proposed in December 2022 by AI researchers William Peebles and Saining Xie in the paper “Scalable Diffusion Models with Transformers.” They sought to marry the strengths of transformers and diffusion models to create a new synthesis paradigm optimized for scalability and sample quality.

Diffusion models are generative frameworks that gradually convert random noise into realistic outputs through a Markov chain of small denoising steps. The denoising network in earlier diffusion models was typically a convolutional U-Net; Peebles and Xie replaced it with a transformer.

Transformers provide greater ability to parallelize computation and directly model global contexts. This facilitates scaling to vastly bigger datasets and model sizes, while also improving sharpness and coherence in image generations.
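To make "directly model global contexts" concrete, here is a minimal sketch of single-head self-attention in plain Python. It is a simplification for illustration: the query, key, and value projections are assumed to be the identity, and the function and variable names are hypothetical rather than taken from any diffusion transformer codebase.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score this token against every token in the sequence, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all tokens.
        outputs.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                        for c in range(d)])
    return outputs, weights

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy patch embeddings
mixed, weights = self_attention(patches)
```

Every row of `weights` is strictly positive, so each patch draws on every other patch in a single layer. The loop over queries is independent per token, which is why real implementations compute it as one batched matrix multiplication.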

By unifying both methodologies into diffusion transformers, state-of-the-art benchmarks have been achieved on datasets like ImageNet at resolutions up to 512×512 pixels. Extensions to video synthesis seen in Sora likewise demonstrate advantages over predecessors.


How Do Diffusion Transformers Work?

Like all generative models, diffusion transformers aim to capture patterns from training data, then leverage these learned representations to produce new samples. The process utilizes three key phases –

  1. Encoding inputs
  2. Probabilistically diffusing representations
  3. Iteratively denoising to plausible outputs


Raw image or video data gets encoded into a compressed latent state through an autoencoder backbone. This distills inputs into an efficient feature space omitting unnecessary information.
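As a rough illustration of how much this compression shrinks the problem, the sketch below computes the latent grid size and the number of patch tokens the transformer would see. The defaults (8x spatial downsampling and 2×2 patches) are assumptions based on the settings described in the original diffusion transformer paper; the article does not state the values used by Sora or Stable Diffusion 3, and the function name is hypothetical.

```python
def latent_token_count(height, width, downsample=8, patch=2):
    """Return the latent grid shape and the number of patch tokens.

    downsample: assumed spatial reduction factor of the autoencoder.
    patch: assumed side length of each square patch in latent space.
    """
    if height % (downsample * patch) or width % (downsample * patch):
        raise ValueError("image size must be divisible by downsample * patch")
    latent_h, latent_w = height // downsample, width // downsample
    tokens = (latent_h // patch) * (latent_w // patch)
    return (latent_h, latent_w), tokens

shape_256, tokens_256 = latent_token_count(256, 256)
shape_512, tokens_512 = latent_token_count(512, 512)
```

Under these assumptions a 256×256 image becomes a 32×32 latent grid and 256 tokens, while a 512×512 image becomes 1,024 tokens. Because self-attention cost grows with the square of the token count, working in latent space rather than on raw pixels keeps the sequence length manageable.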


A schedule of Gaussian noise is carefully added to corrupt latent representations over successive diffusion steps. Working backwards, the model learns how to gradually remove noise and revert to original high-fidelity encodings. Architecturally, stacked transformers handle this denoising through self-attention across spacetime input patches.
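The noising step can be sketched in a few lines. The example below assumes a linear noise schedule and the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε used by denoising diffusion models; the schedule endpoints, step count, and function names are illustrative assumptions, not values confirmed for Sora or Stable Diffusion 3.

```python
import math
import random

def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta): the fraction of signal left at each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, alpha_bar_t, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * e for x, e in zip(x0, eps)], eps

def recover_x0(xt, eps, alpha_bar_t):
    """Invert the noising step given the exact noise; the network learns to estimate eps."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * e) / a for x, e in zip(xt, eps)]

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
latent = [0.5, -1.0, 2.0, 0.0]           # a toy latent vector
noisy, eps = add_noise(latent, abar[500], rng)
restored = recover_x0(noisy, eps, abar[500])
```

Here `restored` matches `latent` because the true noise is known. In training, the transformer is asked to predict that noise from `noisy` and the step index alone, and sampling applies the learned estimate over many small steps.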


Finally, the decoder half of the autoencoder maps the denoised latents back to pixels. For image synthesis, those pixels are the final output; for video, the decoded frames are then encoded into a common video format.


What Makes Diffusion Transformers Superior?

Diffusion Transformer: Image credit – William Peebles, Saining Xie

Relative to predecessor methods like GANs and RNNs, diffusion transformers confer several advantages stemming from transformer-based operations and architectural properties:

  • Improved training stability from optimizing a simple denoising objective rather than pitting two networks against each other adversarially.
  • Greater scalability through parallelizable self-attention over patch-level latent tokens, which also improves modeling of broader context.
  • Reduced distortion artifacts via optimizing diffusion trajectories holistically across full generated samples.
  • Seamless consolidation of multimodal inputs such as class labels and descriptive text prompts within self-attention.
  • Encoder-decoder compression provides control over synthetic sample resolution independent of model size.
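The denoising objective behind the first point can be written compactly. The block below shows the simplified noise-prediction loss that is widely used for diffusion models; it is an assumption that any particular system uses exactly this form, and the symbols (x_0 for the clean latent, ε for the sampled noise, c for the conditioning signal such as a class label or text prompt, ᾱ_t for the cumulative noise schedule) are introduced here for illustration.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_{\text{simple}}
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right]
```

This is an ordinary regression loss with a fixed target, which is one reason training tends to be more stable than the two-player game a GAN has to balance.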

Empirically, these benefits translate into state-of-the-art Fréchet Inception Distance (FID) scores on class-conditional ImageNet, where a lower FID indicates greater realism and sample diversity in generator outputs. For Sora, OpenAI’s technical report shows sample quality improving markedly as training compute increases, but it does not publish a quantitative figure.
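For readers unfamiliar with the metric, the sketch below computes a one-dimensional analogue of the Fréchet distance. Real FID compares multivariate Gaussians fitted to Inception-v3 features of real and generated images; this toy version, with hypothetical function and variable names, only illustrates why a lower value means the two distributions are closer.

```python
import statistics

def frechet_distance_1d(real, generated):
    """Fréchet distance between two 1-D Gaussians fitted to the samples.

    In one dimension the formula reduces to
    (mean difference)^2 + (standard deviation difference)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2

real = [0.0, 1.0, 2.0, 3.0]
close = [0.1, 1.1, 2.1, 3.1]   # nearly the same distribution
far = [5.0, 5.0, 5.0, 5.0]     # wrong mean and no diversity

score_same = frechet_distance_1d(real, real)
score_close = frechet_distance_1d(real, close)
score_far = frechet_distance_1d(real, far)
```

The collapsed sample set `far` is penalized twice, once for its shifted mean and once for its missing spread, which is how the metric captures both realism and diversity.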


Comparisons to Alternative Generative Models

Before diffusion transformers, major contenders for text-to-image generation included autoregressive models like the original DALL-E as well as adversarial networks typified by NVIDIA’s GauGAN 2. While each was innovative, diffusion transformers move markedly ahead in scalability and sample fidelity.

Autoregressive Models

The original DALL-E and other autoregressive models generate images one element at a time (pixels or, in DALL-E’s case, discrete image tokens), iteratively predicting the next element conditioned on all previous ones through a chain rule expansion. Despite innovations in VQ-VAE-style encoding and hierarchical decoding, output quality remains constrained by single-stream sequential processing.
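The chain rule expansion mentioned above can be written out as follows. The notation is introduced here for illustration: x_1 through x_N are the image elements in generation order and c is the conditioning prompt.

```latex
p(x \mid c) = \prod_{i=1}^{N} p\left(x_i \mid x_1, \dots, x_{i-1},\, c\right)
```

Sampling from this factorization takes N sequential steps, because each element depends on everything generated before it, and an early mistake is carried into every later prediction.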

In contrast, diffusion transformers update the entire latent representation in parallel at each denoising step. This avoids the error that accumulates when elements are generated one at a time and reduces the resulting artifacts. Quantitatively, FID reportedly improves by about 65% (for FID, lower is better).

Generative Adversarial Networks

GANs like GauGAN 2, by contrast, pit a generator network against a discriminator that tries to distinguish real from synthetic samples. However, GAN training is prone to mode collapse, in which the generator produces only a narrow range of outputs.

Diffusion transformers avoid this through probabilistic diffusion processes that smoothly guide samples toward the data manifold learned from the training set. Quantitatively, this reportedly manifests in Inception Score improvements of over 70% relative to GAN variants.

OpenAI Sora’s Use of Diffusion Transformers

As a video generation model, Sora represents one pioneering application of scaling up diffusion transformers. Let’s analyze how OpenAI specifically adapted this architecture for sequential data synthesis.

The core insight was training diffusion transformers on variable-duration internet video chunks represented by latent spatial-temporal patches. Encoders compressed pixels into efficient patch tokens mimicking how language models ingest text.
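The idea of spacetime patches can be illustrated with a toy example. The sketch below cuts a small "video" (a nested list indexed by frame, row, and column) into non-overlapping blocks that span both time and space, then flattens each block into one token. The patch sizes and function name are hypothetical; OpenAI has not published the patch dimensions Sora uses, and in practice the patches are cut from compressed latents rather than raw values.

```python
def patchify_video(video, pt, ph, pw):
    """Split a [frames][height][width] grid into flattened spacetime patches.

    pt, ph, pw: patch extent in time, height, and width (assumed values).
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # One token covers pt frames and a ph x pw spatial window.
                tokens.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return tokens

# Four 4x4 frames, with every element given a unique number.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = patchify_video(video, pt=2, ph=2, pw=2)
```

With these sizes the 64 values become 8 tokens of 8 values each. A single image is simply the special case of one frame with `pt=1`, which is what lets the same token format cover both images and video.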

OpenAI implemented transformers in Sora as probabilistic denoising diffusion models. By operating across full videos rather than individual frames, Sora maintains subject consistency even through scene cuts or occlusion.

Notably, patch tokenization provided a common data schema for both image and video domains. This allowed unified training on massive internet visual content spanning both modalities. To enable rich text conditioning, OpenAI applied the re-captioning technique introduced with DALL-E 3: a trained captioner model generated detailed descriptive captions for the training videos.

OpenAI’s technical report illustrates the effect of scale with samples generated at base, 4x, and 32x training compute, in which quality improves visibly at each step. The report does not disclose model sizes, distortion or text-alignment metrics, or the training hardware.

Architecting Sora as a diffusion transformer was instrumental to achieving these quality bars through scalability. Sample coherence over 60-second durations would otherwise fail with frame-by-frame architectures like RNNs.


Stable Diffusion 3’s Use of Diffusion Transformers

As an open source counterpart for text-to-image generation, Stability AI’s Stable Diffusion 3 (SD3) also uses a diffusion transformer and has shown noticeably more refined output than its predecessors.

Notable customizations reportedly include training on 5 billion text-image pairs from the LAION dataset. This focus on internet photos allowed specializing for higher fidelity at resolutions up to 2048×2048 pixels.

Quantitatively, the shift to diffusion transformers in Stable Diffusion 3 reportedly lowered FID by more than 10 points, indicating enhanced realism.

To support access for hobbyists up through studios, Stable Diffusion 3 offers model sizes spanning 800 million to 8 billion parameters. The smaller end runs on personal GPUs while the upper end demands commercial hardware like NVIDIA’s A100 data center GPUs.
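A back-of-the-envelope calculation shows why the two ends of that range land on different hardware. The sketch below estimates the memory needed just to hold the diffusion model's weights, assuming 16-bit (2-byte) parameters. That precision is an assumption, the function name is hypothetical, and the estimate leaves out text encoders, the autoencoder, and activations, so real requirements are higher.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

small_gb = weight_memory_gb(800e6)  # 800 million parameters
large_gb = weight_memory_gb(8e9)    # 8 billion parameters
```

Under these assumptions the smallest model needs about 1.6 GB for its weights and the largest about 16 GB, which is consistent with the split between consumer GPUs and data center cards described above.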



In conclusion, diffusion transformers are a state-of-the-art foundation for conditional image and video generation. Quality metrics, as well as operational merits related to scalability, position models like Sora and Stable Diffusion 3 ahead of their predecessors.

Looking forward, active research on guided diffusion trajectories and multi-stage training hints at further quality improvements on the horizon. With model sizes continuing to expand, we wonder how we will distinguish AI-generated videos and images from real ones.

Still, both Sora and SD3 leave us excited to see what’s coming next in this space.
