Stable Diffusion

Stable Diffusion is a generative AI model that produces unique photorealistic images from text and image prompts. It originally launched in 2022. Besides images, you can also use the model to create videos and animations. The model is based on diffusion technology and uses latent space.

Stable Diffusion is a latent diffusion model that generates AI images from text. Instead of operating in the high-dimensional image space, it first compresses the image into the latent space.

In the simplest form, Stable Diffusion is a text-to-image model. Give it a text prompt. It will return an AI image matching the text.

Diffusion model

Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they have seen in training. In the case of Stable Diffusion, the data are images.

Why is it called a diffusion model? Because its math looks very much like diffusion in physics. Let’s say I trained a diffusion model with only two kinds of images: cats and dogs. In the figure below, the two peaks on the left represent the groups of cat and dog images.

Forward diffusion

A forward diffusion process adds noise to a training image, gradually turning it into a featureless noise image. The forward process will turn any cat or dog image into a noise image. Eventually, you won’t be able to tell whether it was initially a dog or a cat. (This is important.)

It’s like a drop of ink falling into a glass of water. The ink drop diffuses in the water. After a few minutes, it randomly distributes itself throughout the water. You can no longer tell whether it initially fell at the center or near the rim. Below is an example of an image undergoing forward diffusion. The dog image turns to random noise.

[Figure: forward diffusion of a dog image into random noise]
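To make the ink-drop picture concrete, here is a minimal sketch of forward diffusion on two toy "images" (flat lists of pixel values). The step size `beta`, the pixel values, and the function names are my own assumptions for illustration, and the update rule is the common variance-preserving form, not necessarily the exact one used to train Stable Diffusion.

```python
import math
import random

def forward_diffusion(pixels, num_steps, beta=0.02, rng=None):
    """Add a little Gaussian noise to every pixel, num_steps times.

    Each step shrinks the previous image slightly and mixes in fresh noise,
    so the values stay bounded while the original content fades away.
    """
    rng = rng or random.Random(0)
    keep = math.sqrt(1.0 - beta)   # how much of the previous image survives
    mix = math.sqrt(beta)          # how much fresh noise is mixed in
    x = list(pixels)
    for _ in range(num_steps):
        x = [keep * v + mix * rng.gauss(0.0, 1.0) for v in x]
    return x

def mean(values):
    return sum(values) / len(values)

# Two toy "images" with 1,000 pixels each: a bright one and a dark one.
cat = [5.0] * 1000
dog = [-5.0] * 1000

noisy_cat = forward_diffusion(cat, num_steps=500, rng=random.Random(1))
noisy_dog = forward_diffusion(dog, num_steps=500, rng=random.Random(2))

# After enough steps both averages sit close to 0: you can no longer tell
# which image you started from.
print(round(mean(noisy_cat), 2), round(mean(noisy_dog), 2))
```

With only a handful of steps the two averages would still be far apart, which is the toy version of "you can still tell the dog from the cat".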

Reverse diffusion

Now comes the exciting part. What if we can reverse the diffusion? Like playing a video backward. Going backward in time. We will see where the ink drop was initially added. Starting from a noisy, meaningless image, reverse diffusion recovers a cat OR a dog image. This is the main idea.

Technically, every diffusion process has two parts: (1) drift and (2) random motion. The reverse diffusion drifts towards either cat OR dog images but nothing in between. That’s why the result can either be a cat or a dog.

How training is done

The idea of reverse diffusion is undoubtedly clever and elegant. But the million-dollar question is, “How can it be done?” To reverse the diffusion, we need to know how much noise was added to an image. The answer is teaching a neural network model to predict the noise added. It is called the noise predictor in Stable Diffusion, and it is a U-Net model. The training goes as follows.

1. Pick a training image, like a photo of a cat.
2. Generate a random noise image.
3. Corrupt the training image by adding this noise image over a certain number of steps.
4. Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.

Noise is sequentially added at each step. The noise predictor estimates the total noise added up to each step:

[Figure: noise predictor]
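The training recipe above can be sketched as follows. This is only an illustration under assumptions: it uses the standard shortcut of jumping straight to a given step with a cumulative schedule (`alpha_bars`), the schedule values are made up, and a real noise predictor is a U-Net updated by gradient descent, which is not shown here. All function names are hypothetical.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t steps (index 0 means no noise)."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars

def make_training_example(image, alpha_bars, rng):
    """Return (noisy_image, step, noise): the two inputs and the correct answer."""
    step = rng.randrange(1, len(alpha_bars))        # random corruption level
    noise = [rng.gauss(0.0, 1.0) for _ in image]    # the random noise image
    signal = math.sqrt(alpha_bars[step])
    sigma = math.sqrt(1.0 - alpha_bars[step])
    noisy = [signal * p + sigma * n for p, n in zip(image, noise)]
    return noisy, step, noise

def mse(predicted, target):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
cat_image = [0.5] * 64                              # stand-in training image
noisy, step, true_noise = make_training_example(cat_image, alpha_bars, rng)

# A predictor that always answers "no noise" has a large loss. Training
# tunes the weights so the prediction moves toward true_noise.
loss_untrained = mse([0.0] * len(noisy), true_noise)
loss_perfect = mse(true_noise, true_noise)
```

The loss is the only feedback the network gets: it sees the noisy image and the step number, guesses the noise, and is corrected toward `true_noise`.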

Reverse diffusion

Now we have the noise predictor. How do we use it? We first generate a completely random image and ask the noise predictor to tell us the noise. We then subtract this estimated noise from that image. Repeat this process a few times. You will get an image of either a cat or a dog.

[Figure: reverse diffusion]

Reverse diffusion works by subtracting the predicted noise from the image successively. You may notice we have no control over whether we generate a cat or a dog image. We will address this when we talk about conditioning. For now, image generation is unconditioned. You can read more about reverse diffusion sampling and samplers in the linked article.
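Here is a toy version of unconditioned reverse diffusion that you can run. Everything in it is an assumption made for illustration: the "dataset" is just two one-pixel images (`CAT = -1` and `DOG = +1`), the noise predictor is a closed-form formula that is ideal for that dataset instead of a trained U-Net, and the update is a deterministic DDIM-style step, which is one of several possible samplers.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.05):
    """Fraction of the original signal left after t steps (index 0 means no noise)."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars

CAT, DOG = -1.0, 1.0   # the toy "dataset": two one-pixel images

def ideal_noise_predictor(x, alpha_bar):
    """Best possible noise estimate when every clean image is CAT or DOG."""
    signal = math.sqrt(alpha_bar)
    sigma_sq = 1.0 - alpha_bar
    expected_clean = math.tanh(signal * x / sigma_sq)   # leans toward -1 or +1
    return (x - signal * expected_clean) / math.sqrt(sigma_sq)

def reverse_diffusion(x, alpha_bars):
    """Walk from the noisiest step down to step 0, removing noise each time."""
    for t in range(len(alpha_bars) - 1, 0, -1):
        noise = ideal_noise_predictor(x, alpha_bars[t])
        clean = (x - math.sqrt(1.0 - alpha_bars[t]) * noise) / math.sqrt(alpha_bars[t])
        # Put back only as much noise as step t - 1 is supposed to contain.
        x = (math.sqrt(alpha_bars[t - 1]) * clean
             + math.sqrt(1.0 - alpha_bars[t - 1]) * noise)
    return x

rng = random.Random(0)
alpha_bars = make_alpha_bars(200)
samples = [reverse_diffusion(rng.gauss(0.0, 1.0), alpha_bars) for _ in range(200)]

cats = sum(1 for s in samples if s < 0)
dogs = sum(1 for s in samples if s > 0)
print(cats, dogs)
```

Every sample starts as pure noise and ends up at a cat or a dog, never halfway between, and nothing in the code lets you choose which one. That is exactly the missing piece that conditioning supplies.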

Stable Diffusion model

Now I need to tell you some bad news: What we just talked about is NOT how Stable Diffusion works! The reason is that the above diffusion process is in image space. It is computationally very, very slow. You won’t be able to run it on any single GPU, let alone a crappy GPU on your laptop.

The image space is enormous: a 512×512 image with three color channels is a 786,432-dimensional space! Diffusion models like Google’s Imagen and OpenAI’s DALL-E work in pixel space. They use some tricks to make the model faster, but it is still not enough.

Latent diffusion model

Stable Diffusion is a latent diffusion model designed to solve the speed problem. Instead of operating in the high-dimensional image space, it first compresses the image into the latent space. The latent space is 48 times smaller so it reaps the benefit of crunching a lot fewer numbers. That’s why it’s a lot faster.

Variational Autoencoder

The compression is done using a technique called the variational autoencoder. That’s precisely what the VAE files are. The variational autoencoder (VAE) neural network has two parts: an encoder and a decoder. The encoder compresses an image to a lower-dimensional representation in the latent space. The decoder restores the image from the latent space.

[Figure: variational autoencoder]

The variational autoencoder transforms the image to and from the latent space. The latent space of the Stable Diffusion model is 4x64x64, 48 times smaller than the image pixel space. All the forward and reverse diffusions are actually carried out in the latent space.
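If you want to check the "48 times smaller" figure yourself, the arithmetic is short. The variable names are mine; the sizes come straight from the text.

```python
image_values = 512 * 512 * 3     # height x width x color channels
latent_values = 4 * 64 * 64      # channels x height x width of the latent

compression = image_values // latent_values
print(image_values, latent_values, compression)   # 786432 16384 48
```

So each diffusion step works on 16,384 numbers instead of 786,432.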

So during training, instead of generating a noisy image, it generates a random tensor in latent space (latent noise). Instead of corrupting an image with noise, it corrupts the representation of the image in latent space with the latent noise. This is a lot faster because the latent space is smaller.

Image Resolution & Upscaling

The image resolution is reflected in the size of the latent image tensor. The size of the latent image is 4x64x64 only for 512×512 images. It is 4x96x64 for a portrait image that is 768 pixels tall and 512 pixels wide. That’s why it takes longer and more VRAM to generate a larger image.
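The relationship between image size and latent size can be sketched as a small helper. The function name is hypothetical, and the downscale factor of 8 is simply what the numbers above imply (512 / 64 = 8); the latent is written as channels x height x width.

```python
def latent_shape(height, width, channels=4, downscale=8):
    """Latent tensor shape (channels, height, width) for an image size in pixels."""
    if height % downscale or width % downscale:
        raise ValueError("height and width should be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)

def latent_size(shape):
    """Total number of values in a latent tensor of the given shape."""
    channels, height, width = shape
    return channels * height * width

square = latent_shape(512, 512)      # (4, 64, 64)
portrait = latent_shape(768, 512)    # (4, 96, 64)

# The portrait latent holds 1.5 times as many values as the square one.
relative_cost = latent_size(portrait) / latent_size(square)
```

The count of latent values is only a rough proxy for time and VRAM, but it shows why larger images cost more.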

Since Stable Diffusion v1 is fine-tuned on 512×512 images, generating images larger than 512×512 could result in duplicate objects, e.g., the infamous two heads.

To generate a large print, keep at least one side of the image at 512 pixels. Use an AI upscaler or the image-to-image function for image upscaling. Alternatively, use the SDXL model. It has a larger default size of 1,024 x 1,024 pixels.

Why is latent space possible?

You may wonder why the VAE can compress an image into a much smaller latent space without losing information. The reason is, unsurprisingly, natural images are not random. They have high regularity: A face follows a specific spatial relationship between the eyes, nose, cheek, and mouth. A dog has 4 legs and is a particular shape.

In other words, the high dimensionality of images is artifactual. Natural images can be readily compressed into the much smaller latent space without losing any information. This is called the manifold hypothesis in machine learning.

Reverse diffusion in latent space

Here’s how latent reverse diffusion in Stable Diffusion works.

1. A random latent space matrix is generated.
2. The noise predictor estimates the noise of the latent matrix.
3. The estimated noise is then subtracted from the latent matrix.
4. Steps 2 and 3 are repeated for a set number of sampling steps.
5. The decoder of the VAE converts the latent matrix to the final image.
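The five steps above can be sketched as one short loop. This is a skeleton under assumptions, not Stable Diffusion's actual code: `toy_predict_noise` and `toy_decode` are hypothetical stand-ins for the trained U-Net and the VAE decoder, the latent is flattened to a plain list, and a real sampler scales the subtraction according to a noise schedule.

```python
import random

def generate_image(predict_noise, decode, latent_size, sampling_steps, seed=0):
    rng = random.Random(seed)
    # Step 1: start from a random latent (flattened to a list for simplicity).
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    # Step 4: repeat steps 2 and 3 for the chosen number of sampling steps.
    for step in range(sampling_steps):
        noise = predict_noise(latent, step)                  # step 2
        latent = [v - n for v, n in zip(latent, noise)]      # step 3
    # Step 5: the decoder turns the final latent into an image.
    return decode(latent)

# Hypothetical stand-ins so the sketch runs without a trained model.
TARGET = [0.25] * 16

def toy_predict_noise(latent, step):
    # Treat half of the distance from TARGET as the noise to remove this step.
    return [0.5 * (v - t) for v, t in zip(latent, TARGET)]

def toy_decode(latent):
    # A real decoder upsamples to pixels; this one only clamps values to [0, 1].
    return [min(1.0, max(0.0, v)) for v in latent]

image = generate_image(toy_predict_noise, toy_decode,
                       latent_size=16, sampling_steps=20)
```

Passing the predictor and decoder in as arguments keeps the loop itself identical no matter which model or decoder you plug in.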

What is a VAE file?

VAE files are used in Stable Diffusion v1 to improve eyes and faces. They are the decoder of the autoencoder we just talked about. By further fine-tuning the decoder, the model can paint finer details.

You may realize that what I said earlier is not entirely true. Compressing an image into the latent space does lose information, since the original VAE did not recover the fine details. Instead, the VAE decoder is responsible for painting fine details.

Text Conditioning (text-to-image)

Our understanding is incomplete: Where does the text prompt enter the picture? Without it, Stable Diffusion is not a text-to-image model. You will either get an image of a cat or a dog without any way to control it.

This is where conditioning comes in. The purpose of conditioning is to steer the noise predictor so that the predicted noise will give us what we want after subtracting from the image.

Below is an overview of how a text prompt is processed and fed into the noise predictor. The tokenizer first converts each word in the prompt to a number (token), which is then converted to a 768-value vector embedding. The embeddings are then processed by the text transformer and are ready to be consumed by the noise predictor.

[Figure: text conditioning]

The overview shows how the text prompt is processed and fed into the noise predictor to steer image generation. In Stable Diffusion, the text prompt is tokenized and converted to embeddings. It is then processed by the text transformer and consumed by the noise predictor.

Tokenizer

The text prompt is first tokenized by a CLIP tokenizer. CLIP is a deep learning model developed by OpenAI that learns to match images with text descriptions. Stable Diffusion v1 uses CLIP’s tokenizer.

A tokenizer can only tokenize words it has seen during training. For example, there are “dream” and “beach” in the CLIP model but not “dreambeach”. The tokenizer would break up the word “dreambeach” into two tokens, “dream” and “beach”. So one word does not always mean one token.

Another piece of fine print is that word boundaries are also part of a token. In the above case, the phrase “dream beach” produces two tokens that each mark the end of a word, while “dreambeach” produces “dream” without an end-of-word mark followed by “beach”. The two prompts therefore do not produce the same tokens.

Stable Diffusion models (v1) are limited to using 75 tokens in a prompt, which is not always the same as 75 words!
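To see why one word is not always one token, here is a toy tokenizer. It is not CLIP's tokenizer: CLIP uses byte-pair encoding with a large learned vocabulary, while this sketch swaps in a simple greedy longest-match rule with a tiny made-up vocabulary (`VOCAB`) and ignores word-boundary markers. It only illustrates the splitting and the 75-token limit.

```python
VOCAB = {"dream", "beach", "a", "on", "the"}   # hypothetical toy vocabulary
MAX_TOKENS = 75                                # prompt limit quoted above for v1

def tokenize_word(word, vocab):
    """Greedily split one word into the longest pieces found in the vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: keep the single character as a token.
            tokens.append(word[start])
            start += 1
    return tokens

def tokenize(prompt, vocab=VOCAB, max_tokens=MAX_TOKENS):
    """Split a prompt into tokens and drop everything past the token limit."""
    tokens = []
    for word in prompt.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens[:max_tokens]

print(tokenize("dreambeach"))        # one word, two tokens
print(len(tokenize("a " * 100)))     # 100 words, truncated to the limit
```

An unknown word falls apart into many small tokens, so a prompt can hit the limit with far fewer than 75 words.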

Embedding

Stable Diffusion v1 uses OpenAI’s ViT-L/14 CLIP model. The embedding is a 768-value vector. Each token has its own unique embedding vector. The embeddings are fixed by the CLIP model, which learned them during training.

Why do we need embeddings? Because some words are closely related to each other, and we want to take advantage of this information. For example, the embeddings of man, gentleman, and guy are nearly identical because they can be used interchangeably. Monet, Manet, and Degas all painted in impressionist styles but in different ways. The names have close but not identical embeddings.
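"Close" embeddings are usually compared with cosine similarity, where 1 means the vectors point in the same direction. The sketch below uses made-up 4-value vectors purely for illustration; they are not real CLIP embeddings, which have 768 values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only, not taken from any real model.
toy_embeddings = {
    "man":   [0.90, 0.10, 0.00, 0.20],
    "guy":   [0.88, 0.12, 0.02, 0.19],
    "monet": [0.10, 0.90, 0.30, 0.00],
    "manet": [0.15, 0.80, 0.45, 0.05],
}

synonyms = cosine_similarity(toy_embeddings["man"], toy_embeddings["guy"])
painters = cosine_similarity(toy_embeddings["monet"], toy_embeddings["manet"])
unrelated = cosine_similarity(toy_embeddings["man"], toy_embeddings["monet"])
```

In this toy setup the interchangeable words score highest, the two painters score high but lower, and the unrelated pair scores low, which mirrors the pattern described above.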

Researchers have shown that finding the proper embeddings can trigger arbitrary objects and styles, a fine-tuning technique called textual inversion.

Feeding embeddings to noise predictor

This stage goes from the embeddings to the noise predictor. The embedding needs to be further processed by the text transformer before being fed into the noise predictor. The transformer is like a universal adapter for conditioning. In this case, its input is text embedding vectors, but it could as well be something else like class labels, images, and depth maps. The transformer not only further processes the data but also provides a mechanism to include different conditioning modalities.

Cross-attention

The output of the text transformer is used multiple times by the noise predictor throughout the U-Net. The U-Net consumes it by a cross-attention mechanism. That’s where the prompt meets the image.

Let’s use the prompt “A man with blue eyes” as an example. Stable Diffusion pairs the two words “blue” and “eyes” together (self-attention within the prompt) so that it generates a man with blue eyes but not a man with a blue shirt. It then uses this information to steer the reverse diffusion towards images containing blue eyes (cross-attention between the prompt and the image).
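For readers who want to see the mechanics, here is a bare-bones sketch of scaled dot-product cross-attention. It assumes the usual formulation, where queries come from image locations and keys and values come from the prompt tokens. The learned projection matrices that a real U-Net applies are left out, and the tiny 3-value vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per image location. keys/values: one vector per token."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)                       # how much each token matters here
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Made-up vectors for the tokens "man", "blue", "eyes".
token_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
token_values = token_keys
# Two image locations: one in the eye region, one in the shirt region.
eye_region = [0.0, 2.0, 2.0]
shirt_region = [2.0, 0.0, 0.0]

outputs, weights = cross_attention([eye_region, shirt_region], token_keys, token_values)
```

In this toy setup the eye region puts most of its weight on "blue" and "eyes", while the shirt region mostly attends to "man". That per-location weighting is how the prompt can affect some parts of the image more than others.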

A side note: Hypernetwork, a technique to fine-tune Stable Diffusion models, hijacks the cross-attention network to insert styles. LoRA models modify the weights of the cross-attention module to change styles. The fact that modifying this module alone can fine-tune a Stable Diffusion model tells you how important this module is.

Other Conditionings

The text prompt is not the only way a Stable Diffusion model can be conditioned. Both a text prompt and a depth image are used to condition the depth-to-image model. ControlNet conditions the noise predictor with detected outlines, human poses, etc., and achieves excellent control over image generation.

Text-to-image step-by-step

In text-to-image, you give Stable Diffusion a text prompt, and it returns an image.

Random Tensor Generated in Latent Space

Stable Diffusion generates a random tensor in the latent space. You control this tensor by setting the seed of the random number generator. If you set the seed to a certain value, you will always get the same random tensor. This is your image in latent space. But it is all noise for now.

[Figure: text-to-image, a random tensor is generated in latent space]
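The effect of the seed is easy to demonstrate. This sketch uses Python's built-in `random` module only to stay dependency-free; real implementations draw the tensor with their tensor library's own generator, so the actual numbers will differ. The function name is hypothetical.

```python
import random

def random_latent(seed, shape=(4, 64, 64)):
    """Draw a Gaussian random tensor (as nested lists) from a seeded generator."""
    rng = random.Random(seed)
    channels, height, width = shape
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]

first = random_latent(seed=42)
again = random_latent(seed=42)
other = random_latent(seed=43)

print(first == again)   # same seed, same starting noise
print(first == other)   # different seed, different starting noise
```

Because everything after this point is driven by the starting tensor and the prompt, fixing the seed is what lets you reproduce an image or change one setting at a time.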

Noise Predictor Estimates Noise

The noise predictor U-Net then takes the latent noisy image and text prompt as input and predicts the noise, also in latent space (a 4x64x64 tensor).

[Figure: text-to-image, the noise predictor estimates the noise]

Subtract Noise from Latent Image

We then subtract the latent noise from the latent image. This becomes your new latent image.

[Figure: text-to-image, the noise is subtracted from the latent image]

The previous two steps (noise prediction and subtraction) are repeated for a certain number of sampling steps, for example, 20 times.

Convert Latent Image to Pixel Space

Finally, the decoder of the VAE converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.

[Figure: text-to-image, the VAE decoder converts the latent image to pixel space]

Here’s how the image evolves in each sampling step.

[Figure: the image at each sampling step, Euler sampler]

Noise schedule

The image changes from noisy to clean. You may wonder whether the noise predictor is simply not working well in the initial steps. That is only partly true. The real reason is that we aim for an expected noise level at each sampling step. This is called the noise schedule. Below is an example.

[Figure: noise schedule]

The noise schedule is something we define. We can choose to subtract the same amount of noise at each step. Or we can subtract more in the beginning, like above. The sampler subtracts just enough noise in each step to reach the expected noise in the next step. That’s what you see in the step-by-step image.
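Here is a small sketch of two noise schedules and a sampler that follows them. The schedule shapes, the starting noise level of 10, and the function names are assumptions for illustration. The update is a plain Euler step, and the noise predictor is a stand-in that knows the clean answer, so each step lands exactly on the scheduled noise level.

```python
import random

def linear_schedule(sigma_max, steps):
    """Expected noise level per step, dropping by the same amount each time."""
    return [sigma_max * (1 - i / steps) for i in range(steps + 1)]

def front_loaded_schedule(sigma_max, steps, power=3.0):
    """Expected noise level per step, dropping quickly at first and then slowly."""
    return [sigma_max * (1 - i / steps) ** power for i in range(steps + 1)]

def euler_sample(x, sigmas, predict_noise):
    """Remove just enough noise at each step to reach the next scheduled level."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        noise = predict_noise(x, sigma)            # unit-scale noise estimate
        x = [v + (sigma_next - sigma) * n for v, n in zip(x, noise)]
    return x

# Stand-in predictor that knows the clean latent (a real one is a trained U-Net).
clean = [0.2, -0.4, 0.7]

def oracle_predict_noise(x, sigma):
    return [(v - c) / sigma for v, c in zip(x, clean)]

rng = random.Random(0)
sigma_max, steps = 10.0, 20
start = [c + sigma_max * rng.gauss(0.0, 1.0) for c in clean]

result_linear = euler_sample(start, linear_schedule(sigma_max, steps), oracle_predict_noise)
result_front = euler_sample(start, front_loaded_schedule(sigma_max, steps), oracle_predict_noise)
```

Both schedules start at the same noise level and end at zero. They differ only in how much noise is left at each intermediate step, which is why the in-between images look different from sampler to sampler.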

Image-to-image step-by-step

Image-to-image transforms an image into another one using Stable Diffusion. It was first proposed in the SDEdit method. SDEdit can be applied to any diffusion model, so we have image-to-image for Stable Diffusion (a latent diffusion model).

An input image and a text prompt are supplied as the input in image-to-image. The generated image will be conditioned by both the input image and the text prompt. For example, using this amateur drawing and the prompt “photo of perfect green apple with stem, water droplets, dramatic lighting” as inputs, image-to-image can turn it into a professional drawing:

[Figure: image-to-image example]

Step by step, here’s how image-to-image works.

Input Image Encoded to Latent Space

The input image is first encoded to latent space.

[Figure: image-to-image, the input image is encoded to latent space]

Noise Added to Latent Image

Noise is added to the latent image. Denoising strength controls how much noise is added. If it is 0, no noise is added. If it is 1, the maximum amount of noise is added so that the latent image becomes a completely random tensor.

[Figure: image-to-image, noise is added to the latent image]
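One common way to turn denoising strength into an amount of noise is to map it to a step on the noise schedule. The sketch below assumes that mapping, a made-up 1,000-step schedule, and a flattened latent. The function names are hypothetical, and actual implementations may differ in the details.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t steps (index 0 means no noise)."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars

def add_noise_for_strength(latent, strength, alpha_bars, rng):
    """Noise the latent up to the schedule step chosen by the denoising strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    step = round(strength * (len(alpha_bars) - 1))   # 0 = untouched, last = most noise
    signal = math.sqrt(alpha_bars[step])
    sigma = math.sqrt(1.0 - alpha_bars[step])
    noisy = [signal * v + sigma * rng.gauss(0.0, 1.0) for v in latent]
    return noisy, step

alpha_bars = make_alpha_bars(1000)
latent = [0.3, -1.2, 0.8, 0.05]      # stand-in for an encoded input image

untouched, step_zero = add_noise_for_strength(latent, 0.0, alpha_bars, random.Random(0))
halfway, step_half = add_noise_for_strength(latent, 0.5, alpha_bars, random.Random(0))
scrambled, step_full = add_noise_for_strength(latent, 1.0, alpha_bars, random.Random(0))

# Share of the original latent that survives at full strength (close to zero).
signal_left_at_full_strength = math.sqrt(alpha_bars[step_full])
```

At strength 0 the latent comes back unchanged, and at strength 1 almost none of the input survives, so the result behaves like plain text-to-image. Values in between keep part of the input's composition.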