A Stroll in the Landscape of Diffusion Models - The Start
Diffusion is a method to generate samples from a data distribution. The data distribution can be anything, ranging from images to the edges of a graph. Diffusion models are behind much of today's high-quality AI-generated artwork, video, and music.
Before getting into it, I want to explain the why behind this article. I have been reading up on diffusion models for a while, and sometimes I feel lost when reading a paper, wondering, “Where does this paper fit in the grand picture?”. I have written this article to build that bigger-picture context and to answer that question. Note: I am a student interested in diffusion models and have read some papers on the topic. This article is nowhere near exhaustive and reflects my own view of the research landscape.
With that said, we will start our journey now. As we begin, we see a billboard telling us we are close to the Land of Diffusion Models. The paper “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”[1] serves as the billboard.
The goal behind generative models is to model a complex distribution, such as that of images, with simpler distributions that allow us to sample new data points. This was the problem the authors of [1] wanted to tackle. Their idea is as follows:
- Introduce noise to the data and slowly convert the data into noise. This is termed the forward process.
- Then learn a reverse process, which iteratively adds information back to the noise, thus restoring the data distribution.
This process seems like a viable direction for the following reasons:
- Let’s say you have a mouse that you have to help move towards a target; placing cheese along the way at short intervals will have a higher success rate than placing cheese only at the target. Similarly, an iterative reverse process is more viable than jumping straight to the target data distribution.
- We will also use a fan to spread the smell of the cheese, making it easier for the mouse to follow the trail and reach the target. Adding noise has a similar effect: it spreads out the probability density, so we can feel the data’s presence from further away.
- Noise can take different forms, and when we use a form such as Gaussian noise, we can rely on a result proven by William Feller in 1949: for sufficiently small step sizes, the reverse of the diffusion process has the same functional form as the forward process. This makes the learning easier.
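To make the forward process concrete, here is a minimal sketch of Gaussian forward diffusion in PyTorch. The linear variance schedule, the number of steps, and the function names are illustrative choices of mine, not values taken from [1].

```python
import torch

# Illustrative linear variance schedule (placeholder values, not from [1]).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """One forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)
    return (1.0 - betas[t]).sqrt() * x_prev + betas[t].sqrt() * noise

def diffuse(x0):
    """Run the full forward chain, gradually turning data into (near) Gaussian noise."""
    x = x0
    for t in range(T):
        x = forward_step(x, t)
    return x
```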
To learn the reverse process, the authors used a network with convolutional layers that predicted the mean and covariance of each reverse step.
However, this method struggled to generate high-quality samples for complex data distributions such as natural images.
Knowing that we are close to reaching the land of Diffusion Models, we march forward. As we walk, we see a road running parallel to ours; this road is marked by the approach of Score Matching and Langevin Dynamics in the paper “Generative Modeling by Estimating Gradients of the Data Distribution” [2].
What is score matching? The score function of a distribution is given as $$\nabla_x \log p(x)$$ The authors minimized the Fisher divergence to estimate the score. The Fisher divergence is the expected squared l2 distance between the data’s true score and the estimated score. However, computing it directly requires the data’s score, which we do not have access to. To circumvent this issue, the authors used score matching techniques (such as denoising score matching) that minimize an equivalent objective without knowing the original data’s score function.
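To make this concrete, the Fisher divergence and its denoising score matching surrogate can be written as follows, where $s_\theta$ denotes the learned score network and $q_\sigma$ a Gaussian noising kernel (notation chosen here for illustration, not copied from [2]):

$$D_F = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\left[\left\lVert s_\theta(x) - \nabla_x \log p(x) \right\rVert_2^2\right]$$

$$\tfrac{1}{2}\,\mathbb{E}_{x \sim p,\ \tilde{x} \sim q_\sigma(\tilde{x}\mid x)}\left[\left\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x) \right\rVert_2^2\right]$$

The second form only requires the score of the known Gaussian kernel, $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x) = -(\tilde{x} - x)/\sigma^2$, so no access to the true data score is needed.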
It’s trivial to sample when we have the distribution, but how do we sample when all we have is the score function? The authors used Langevin dynamics to do so. It is a way to use the mystical equation called the Langevin equation (I am not probing deeper into it, as it belongs more to physics than computer science) to model molecular systems. The authors adapted the Langevin equation to sample from the distribution iteratively using only the score function, as sketched below.
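Here is a minimal sketch of that iterative sampler, assuming we already have a learned score function `score_fn`; the step size and number of steps are placeholders, not values from [2].

```python
import torch

def langevin_sample(score_fn, x_init, step_size=1e-4, n_steps=1000):
    """Langevin dynamics sampling with a learned score function:
    x_{k+1} = x_k + (step_size / 2) * score(x_k) + sqrt(step_size) * z,  z ~ N(0, I)
    """
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * z
    return x
```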
As we move forward on the road, we come across a signpost welcoming us to the land of Diffusion models. Our signpost is the famous paper “Denoising Diffusion Probabilistic Models” [3].
The authors of [3] mainly proposed the following changes:
- Fix the reverse-process variances to time-dependent constants instead of learning them.
- Predict the noise added at each step, effectively learning a denoiser.
The authors used a U-Net architecture built from Wide ResNet blocks and added sinusoidal position embeddings of the timestep, so that the model knows the noise level at each step.
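Here is a minimal sketch of the resulting training step, assuming an abstract noise-prediction model `eps_model(x_t, t)` (e.g., the U-Net); the schedule values are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Illustrative linear schedule; alphas_bar is the cumulative product of (1 - beta).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, x0):
    """One training step: add noise to x0 at a random timestep, predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # q(x_t | x_0) in closed form
    return F.mse_loss(eps_model(x_t, t), noise)              # simple noise-prediction loss
```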
Straight ahead, you see the parallel road intersecting with the road we are on; this intersection is the paper [4], which unified both approaches seen so far using Stochastic Differential Equations (SDEs).
In the earlier approaches, we applied a finite number of noise perturbations to the image. The authors of [4] instead transform a data point into random noise continuously over time, a process described by a fixed SDE (Stochastic Differential Equation) with no learnable parameters. Using an earlier proven result, the reverse process is itself an SDE whose drift can be derived from the score. The characteristics of the model change depending on the SDE used; thus, the SDE is specific to the model and acts much like a hyperparameter defining it. They also proposed a more powerful sampler, the Predictor-Corrector sampler, which helps them sample high-resolution images.
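As a rough illustration, here is a minimal Euler-Maruyama sketch of sampling from the reverse-time SDE; the drift `f`, diffusion coefficient `g`, and score model `score_fn` are assumed to be supplied by the user and are not taken from [4].

```python
import torch

def reverse_sde_sample(score_fn, x_T, f, g, n_steps=1000, T=1.0):
    """Euler-Maruyama integration of the reverse-time SDE
    dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw,  from t = T down to t = 0.
    """
    dt = T / n_steps
    x = x_T.clone()
    for i in range(n_steps):
        t = T - i * dt
        drift = f(x, t) - (g(t) ** 2) * score_fn(x, t)
        x = x - drift * dt + g(t) * (dt ** 0.5) * torch.randn_like(x)
    return x
```

This shows only the “predictor” part; the Predictor-Corrector sampler of [4] additionally interleaves Langevin correction steps like the one sketched earlier.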
We now have our first look at the vast landscape past the intersection. There are a lot of factories in the west. Looking past the chimneys of the factories, you can see an enormous mountain range. On the other side, you can see beautiful budding flowers in a pristine garden. Looking past the garden, you see a dense forest and a river flowing next to it, branching out to faraway places. You hear some sounds coming from the riverbank. You also notice that the road you traveled goes much further, and you wonder where it leads.
We will continue our journey in the next post.
References
[1] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” Nov. 18, 2015, arXiv:1503.03585. doi: 10.48550/arXiv.1503.03585.
[2] Y. Song and S. Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution,” Oct. 10, 2020, arXiv:1907.05600. doi: 10.48550/arXiv.1907.05600.
[3] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” Dec. 16, 2020, arXiv:2006.11239. doi: 10.48550/arXiv.2006.11239.
[4] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” Feb. 10, 2021, arXiv:2011.13456. doi: 10.48550/arXiv.2011.13456.