Diffusion Models Explained – From Noisy Images to State-of-the-Art Generation

For most of the 2010s, if you wanted to generate photorealistic images with a neural network, you worked with GANs – Generative Adversarial Networks. The results were impressive. The training was notoriously unstable. Mode collapse, where your generator learns to produce a narrow range of outputs rather than the full diversity of the training distribution, was a constant nemesis.

Then came diffusion models. Slower to generate, more stable to train, and ultimately more capable. By 2022 they had produced DALL·E 2, Stable Diffusion, and Imagen – systems that made the GAN era look like a rehearsal. Understanding how they work is worth the investment, because the core idea is both elegant and counterintuitive.

The Core Idea: Destruction and Reversal

A diffusion model is built around a surprisingly simple premise. Take an image. Gradually add Gaussian noise to it over many steps until it becomes pure static indistinguishable from random noise. This is the forward process – easy to define, completely automatic, requires no learning.

Now train a neural network to reverse this process. Given a slightly noisy image, predict what the noise was, so you can subtract it and recover a slightly less noisy image. If you can do this reliably at every noise level, you can start from pure noise and iteratively denoise your way to a clean image.

That’s it. The core of diffusion is learning to reverse a noise process you defined yourself.

The forward process is usually written as a Markov chain: each noisy step depends only on the previous one. After enough steps (typically hundreds to thousands in early formulations), the original image information is entirely destroyed. The terminal distribution is just isotropic Gaussian noise.

The reverse process mirrors this structure. At each reverse step, the model estimates the noise component and subtracts it. Crucially, the model doesn’t directly predict the clean image – it predicts the noise, which turns out to be an equivalent but better-behaved prediction target.

What the Neural Network Actually Learns

The workhorse of most modern diffusion models is a U-Net architecture – an encoder-decoder with skip connections that allow high-frequency spatial information to flow directly from encoder layers to the corresponding decoder layers without passing through the bottleneck.

The U-Net is conditioned on the noise level (or timestep) at which it’s operating, typically via sinusoidal positional embeddings similar to those used in transformers. This conditioning is important: the model needs to behave differently when denoising almost-clean images versus nearly-pure noise, because the relevant structure to preserve is completely different.

In text-conditioned models like Stable Diffusion, the U-Net is additionally conditioned on a text embedding – typically from a pretrained CLIP or T5 encoder — via cross-attention layers. This is what allows the model to steer generation toward a semantic target specified in natural language.

Latent Diffusion: The Efficiency Breakthrough

Running diffusion in pixel space is expensive. The forward and reverse processes operate on the full spatial resolution of the image, which means your U-Net is processing large tensors at every step.

The key insight in Latent Diffusion Models (LDM), the architecture underlying Stable Diffusion, is that you don’t need to do this in pixel space. Instead, you first train a variational autoencoder (VAE) that compresses images into a compact latent representation. Then you run the entire diffusion process in that latent space, which is typically 8× smaller in each spatial dimension.

At inference time, you denoise in latent space and then decode the final latent back to pixels using the VAE decoder. The perceptual quality is preserved because the VAE was trained to reconstruct images faithfully. But the computational cost of diffusion is dramatically reduced.

This is the architectural move that made high-resolution diffusion models practical to run on consumer hardware. The denoising U-Net operates on a 64×64 latent (representing a 512×512 pixel image) rather than the full 512×512 pixel grid.

Classifier-Free Guidance: Pushing Toward the Prompt

Even with text conditioning, early diffusion models didn’t always produce outputs that closely matched the input text. The model had learned the joint distribution of images and text descriptions, but at inference time you want to heavily bias generation toward the specific text condition, not just sample from the joint distribution.

Classifier-free guidance (CFG), introduced by Ho and Salimans in 2021, solved this elegantly. During training, you randomly drop the text conditioning for some percentage of samples, training the model to generate unconditionally as well as conditionally. At inference time, you run two forward passes: one with the text condition, one without. You then extrapolate in the direction from the unconditional prediction toward the conditional prediction, amplified by a guidance scale parameter.

A guidance scale of 1 gives you the conditional prediction directly. Higher values push the output aggressively toward the prompt, at the cost of some diversity and sometimes image naturalness. In practice, guidance scales between 7 and 15 tend to give the best tradeoff for photorealistic generation.

Why Diffusion Won the Generative Race

GANs are faster to sample from – one forward pass versus hundreds of denoising steps. But diffusion models have several structural advantages that ultimately mattered more.

Training stability: the diffusion objective is a weighted sum of denoising losses across noise levels, which is a simple regression objective. No adversarial dynamics, no mode collapse, no discriminator to balance against.

Coverage: because the forward process destroys all structure, the model must learn to generate the full diversity of the training distribution rather than finding a local optimum. GANs notoriously struggled to cover multimodal distributions.

Compositionality: the iterative generation process turns out to be surprisingly flexible. Inpainting, outpainting, image-to-image translation, ControlNet-style spatial conditioning – all of these can be implemented as modifications to the sampling procedure rather than changes to the model architecture.

The sampling speed problem has been substantially addressed by distillation methods (LCM, SDXL-Turbo) that reduce the required steps from hundreds to four or even one. The gap with GANs on latency is now narrow for most practical applications.

What we’re left with is an architecture that feels almost too principled to be as powerful as it is — a model trained to reverse a noise process it defined itself, scaling into one of the most capable generative systems ever built.

Written by

MLM Papers View 6 posts by MLM Papers →