Project 5 - Part A: The Power of Diffusion Models

Background of the project:

In this part of the project, we explore a pre-trained diffusion model from DeepFloyd and use it for denoising, inpainting, and creating visual anagrams and hybrid images.

Part 0. Setup

This part is a first look at what the diffusion model produces. We first follow the steps in the project spec to gain access to the model, then use it to generate images from a few prompts. Note that the prompts here are text embeddings of the prompt text rather than the raw text. We also need to set a random seed for this part and all later parts of the project; I am using 180. Here are the results with different prompts and numbers of inference steps.

inference steps = 20:

an oil painting of a snowy mountain village:

a man wearing a hat:

a rocket ship:

inference steps = 50:

an oil painting of a snowy mountain village:

a man wearing a hat:

a rocket ship:

Observation: As you can see, the images are slightly better with more inference steps. This is expected: fewer inference steps run faster but give up some quality.


Part 1 - 1. Implementing the Forward Process

For this part, we use the pre-trained DeepFloyd denoiser, but first we need to implement the forward process so that we can add noise to images. By convention, the clean image is \(x_0\) and \(x_T\) is pure noise, so a larger t means more noise; for DeepFloyd models, T = 1000. The forward process noises the image according to the given t using the formula \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), where \(\epsilon \sim N(0, 1)\). Note that we do not just add noise; we also scale the image. The values \(\bar{\alpha}_t\) for \( t \in [0, 999] \) are stored in the alphas_cumprod variable, which we can get from stage_1.scheduler.alphas_cumprod. Here are my results:
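The forward process described above can be sketched in a few lines. This is a minimal NumPy sketch, not the project code: the function name forward, the random placeholder image, and the linear beta schedule used to build alphas_cumprod are all assumptions for illustration (the real values come from stage_1.scheduler.alphas_cumprod).

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng):
    """Noise a clean image: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, 1), one value per pixel
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Stand-in schedule (assumption): linear betas, NOT DeepFloyd's actual values.
T = 1000
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

rng = np.random.default_rng(180)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # placeholder for a real image
x_250, _ = forward(x0, 250, alphas_cumprod, rng)
x_750, _ = forward(x0, 750, alphas_cumprod, rng)
```

Because \(\bar{\alpha}_t\) shrinks as t grows, x_750 keeps less of the image and more of the noise than x_250.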

Original Image:

Images after adding noise:

t = 250:

t = 500:

t = 750:

Observation: As you can see, the image gets noisier as t increases.


Part 1 - 2. Classical Denoising

This part is fairly simple. Before we start using the diffusion model for denoising, let's try a classical denoising method: Gaussian blur filtering. Here are my results:

Images Before Gaussian blur filtering:

t = 250:

t = 500:

t = 750:

Images After Gaussian blur filtering:

t = 250:

t = 500:

t = 750:

Observation: You can tell that the results are really bad, which is why we need the diffusion model for this job.


Part 1 - 3. One-Step Denoising

In this part, we can finally start using the pre-trained diffusion model for denoising. We implement one-step denoising: given a noisy image \(x_t\) and the timestep t, we predict the noise and use it to directly estimate \(x_0\), the clean image. Since the model was trained with text conditioning, we also need to pass in a text prompt embedding; we use "a high quality photo" here. Here are my results:

Images Before One-Step Denoising:

t = 250:

t = 500:

t = 750:

Images After One-Step Denoising:

t = 250:

t = 500:

t = 750:

Observation: As you can see, the higher the t (the noisier the image), the more the prediction departs from the original image.

Note: We add noise to the image using the formula \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), where \(\epsilon \sim N(0, 1)\). To obtain the denoised image from the predicted noise, we solve this formula for \(x_0\): \(x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon) / \sqrt{\bar{\alpha}_t}\).
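Solving the forward formula for \(x_0\) can be checked numerically. The sketch below is only an illustration: estimate_x0 is a hypothetical helper name, the value of abar_t is made up, and the true noise is passed in where the model's noise prediction would normally go.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # placeholder clean image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3  # hypothetical value of alphas_cumprod[t]

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = estimate_x0(x_t, eps, abar_t)  # exact when eps_hat is the true noise
```

With the true noise the clean image is recovered exactly; with a predicted noise the estimate is only as good as the prediction, which is why it degrades at higher t.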


Part 1 - 4. Iterative Denoising

Diffusion models are designed to denoise iteratively, so in this part we implement iterative denoising: at every timestep, we denoise and obtain the image at the previous timestep. For example, we can start with \(x_{1000}\), get \(x_{999}\), then \(x_{998}\), and continue until we reach \(x_0\). However, running the diffusion model that many times takes a lot of time and computing power, so we skip steps instead. Using strided_timesteps (an array of timesteps where strided_timesteps[0] is the largest t, 990 in this case, and strided_timesteps[-1] is 0), we predict the image at strided_timesteps[i + 1] when we are at strided_timesteps[i]. In this implementation, my stride is 30: starting at \(x_{990}\), I get \(x_{960}\), then \(x_{930}\), and so on until \(x_0\). This saves a lot of time and computing power but still works well. In every iteration, we predict the image at the previous timestep using the following formula and constants:

where t is strided_timesteps[i] and t' is strided_timesteps[i + 1]. The estimate of \(x_0\) is obtained the same way as the clean image in one-step denoising. Here are my results:
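One step of this loop can be sketched as follows, assuming the standard DDPM update \(x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma\) with \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}\) and \(\beta_t = 1 - \alpha_t\). The helper names, the stand-in schedule, and the optional noise argument (standing in for the variance term \(v_\sigma\)) are assumptions for illustration, not the project code.

```python
import numpy as np

# Stand-in schedule (assumption): linear betas, not DeepFloyd's actual values.
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def make_strided_timesteps(start=990, stride=30):
    """Timesteps from `start` down to 0 in steps of `stride`."""
    return list(range(start, -1, -stride))

def denoise_step(x_t, eps_hat, t, t_prev, alphas_cumprod, noise=None):
    """Move from timestep t to the less noisy timestep t_prev (t' in the text)."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    # Clean-image estimate, same as in one-step denoising.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    x_prev = (np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0_hat \
        + (np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
    # Optional added noise (the variance term); omitted when None.
    return x_prev if noise is None else x_prev + noise

strided_timesteps = make_strided_timesteps()
```

With a stride of 30, strided_timesteps[10] is 690, which matches the first timestep shown in the results.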

Iterative Denoising Images:

t = 690:

t = 540:

t = 390:

Iterative Denoising Images:

t = 240:

t = 90:

t = 0:

Result Comparison:

Iterative Denoising:

One-Step Denoising:

Gaussian blur filtering:

Observation: It is quite obvious that iterative denoising gives the best result compared to one-step denoising and Gaussian blur.

Note: The timesteps shown here start at 690 because we start at strided_timesteps[10] (i_start = 10). We do this to give the model at least some information about the Campanile, so that its results still look like the Campanile.


Part 1 - 5. Diffusion Model Sampling

As discussed at the end of the previous part, we used i_start = 10 to preserve some information from the original image so that the diffusion model outputs something similar to it. If we instead use i_start = 0 and pass in pure noise, we are denoising pure noise, which lets us generate images from scratch. Here are my results:

Sample Images:

Sample 1:

Sample 2:

Sample 3:

Sample 4:

Sample 5:

Observation: As we can see, the results here are a little blurry and low in quality, and some of them are nonsensical. We will fix that in the next section.


Part 1 - 6. Classifier-Free Guidance (CFG)

As we saw in the previous part, the results are low in quality. To improve them, we implement classifier-free guidance (CFG). For CFG, we compute both a conditional and an unconditional noise estimate, \(\epsilon_{c}\) and \(\epsilon_{u}\), and then form a new noise estimate with the formula \(\epsilon = \epsilon_{u} + \gamma(\epsilon_{c} - \epsilon_{u})\), where \(\gamma\) controls the strength of CFG. For good image quality we need \(\gamma > 1\) (I use \(\gamma = 7\) here). Note that the conditional prompt is "a high quality photo" and the unconditional prompt is "", the null prompt. Here are my results:
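The CFG formula is a one-line combination of the two noise estimates. In this minimal sketch, cfg_noise is a hypothetical helper name and the constant arrays stand in for the two model outputs.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u)."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# Toy arrays standing in for the conditional / unconditional model outputs.
eps_c = np.full((3, 4, 4), 2.0)
eps_u = np.full((3, 4, 4), 1.0)
eps = cfg_noise(eps_c, eps_u, gamma=7.0)  # every entry is 1 + 7 * (2 - 1) = 8
```

Setting gamma = 1 returns the conditional estimate unchanged and gamma = 0 returns the unconditional one; values above 1 push the estimate further along the direction from unconditional to conditional.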

Sample Images:

CFG Sample 1:

CFG Sample 2:

CFG Sample 3:

CFG Sample 4:

CFG Sample 5:

Observation: As we can see, the results here are much better in quality than in the previous part without CFG, and there are no nonsensical images anymore.


Part 1 - 7. Image-to-image Translation

As we saw in Part 1 - 4, when we add noise to an image and then denoise it, the predicted result is "edited". The more noise we add, the stronger the editing effect. This is expected, since the model has to "hallucinate" a little to bring the noisy image back onto the manifold of natural images. So in this part, by starting at different values of i_start (different noise levels, where a larger i_start means less noise), we can see different levels of editing. When we start at a low noise level, the result is very similar to the original image. This is the SDEdit algorithm. Here are my results:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:


Part 1 - 7 - 1. Editing Hand-Drawn and Web Images

We can also start with an unrealistic image and project it onto the natural image manifold using the same technique as in the previous part. Here are my results:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with different noise level:

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Part 1 - 7 - 2. Inpainting

We can use the same procedure for inpainting. Given an image and a binary mask m, we keep everything outside the mask unchanged, while the content inside the mask is generated by the diffusion model. The procedure is very similar to the previous part, except that at every denoising step we apply \(x_t \leftarrow m x_t + (1 - m)\,\text{forward}(x_{orig}, t)\), which replaces everything outside the mask with the original image noised to timestep t. Here are my results:
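The masking step can be sketched as below. This is a minimal illustration under assumptions: inpaint_project is a hypothetical name, the mask is 1 where new content should be generated and 0 where the original is kept, and the forward noising is written inline so the block stands on its own.

```python
import numpy as np

def inpaint_project(x_t, x_orig, mask, t, alphas_cumprod, rng):
    """Keep generated content where mask == 1, noised original where mask == 0."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    # Forward process applied to the original image at the current timestep.
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 4, 4))          # current noisy sample
x_orig = rng.uniform(-1.0, 1.0, (3, 4, 4))    # placeholder original image
mask = np.zeros((1, 4, 4))
mask[:, :2, :] = 1.0                          # regenerate the top half only
```

Noising the original to the current timestep keeps both regions at the same noise level, so the model sees a consistent image at every step.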

Original Image:

Mask:

Inpainted Image:

Original Image:

Mask:

Inpainted Image:

Original Image:

Mask:

Inpainted Image:

Part 1 - 7 - 3. Text-Conditional Image-to-image Translation

For this part, we do the same thing as in regular SDEdit, but this time, instead of "a high quality photo", we use other prompts to guide the projection.

Original Image:

SDEdit with prompt: "a rocket ship":

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with prompt: "a pencil":

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:

Original Image:

SDEdit with prompt: "a photo of a dog":

i_start = 1:

i_start = 3:

i_start = 5:

i_start = 7:

i_start = 10:

i_start = 20:


Part 1 - 8. Visual Anagrams

In this part, we use the diffusion model to create visual anagrams: images that look like one thing right-side up and like another thing upside down. This requires two prompts in total. The idea is simple: we get the conditional and unconditional noise predictions for the image with the first prompt and use CFG to get noise1. We then flip the image, get the conditional and unconditional noise predictions for the flipped image with the second prompt, and use CFG to get noise2. Finally, we flip noise2 back to right-side up and average noise1 and noise2 to get the final noise estimate. The rest of the procedure is the same as in the previous parts. Here are my results:
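The flip-and-average step can be sketched as follows. Here cfg_eps is a hypothetical callable that returns the CFG noise estimate for an image and a prompt (in the project it would wrap the model), and the image layout is assumed to be (C, H, W) so that flipping axis -2 turns it upside down.

```python
import numpy as np

def anagram_noise(x_t, cfg_eps, prompt_1, prompt_2):
    """Average the upright estimate with the un-flipped estimate of the flipped image."""
    noise_1 = cfg_eps(x_t, prompt_1)
    flipped = np.flip(x_t, axis=-2)        # turn the image upside down
    noise_2 = cfg_eps(flipped, prompt_2)   # estimate for the flipped view
    noise_2 = np.flip(noise_2, axis=-2)    # flip back to right-side up
    return (noise_1 + noise_2) / 2.0

def toy_cfg_eps(x, prompt):
    """Toy stand-in for the model: marks the top row only for prompt 'p2'."""
    out = np.zeros_like(x)
    if prompt == "p2":
        out[..., 0, :] = 1.0
    return out

result = anagram_noise(np.zeros((3, 4, 4)), toy_cfg_eps, "p1", "p2")
```

In the toy example the second prompt writes into the top row of the flipped image, so after flipping back its contribution lands in the bottom row of the final estimate.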

an oil painting of an old man:

an oil painting of people around a campfire:

a rocket ship:

a pencil:

an oil painting of a snowy mountain village:

a photo of the amalfi cost:


Part 1 - 9. Hybrid Images

In this part, we create hybrid images, which look like one thing when viewed up close and another thing when viewed from far away. The idea is very similar to the previous part. First, we get the conditional and unconditional noise predictions for both prompts (no flipping) and compute a CFG noise estimate for each, as in the previous part. Then we use a Gaussian blur to take the low frequencies of one noise estimate and the high frequencies of the other, and add the two together (no averaging). Here are my results:
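Combining the two CFG noise estimates by frequency can be sketched with a small separable Gaussian blur. The kernel size and sigma below are illustrative assumptions, as are the function names; the high-pass part is taken as the estimate minus its blurred version.

```python
import numpy as np

def gaussian_kernel_1d(size=9, sigma=2.0):
    """Normalized 1D Gaussian kernel (size should not exceed the image side)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def low_pass(img, size=9, sigma=2.0):
    """Separable Gaussian blur over the last two axes (zero padding at borders)."""
    k = gaussian_kernel_1d(size, sigma)
    blur = lambda v: np.convolve(v, k, mode="same")
    out = np.apply_along_axis(blur, -1, img)
    return np.apply_along_axis(blur, -2, out)

def hybrid_noise(noise_1, noise_2):
    """Low frequencies of noise_1 plus high frequencies of noise_2 (no averaging)."""
    return low_pass(noise_1) + (noise_2 - low_pass(noise_2))

rng = np.random.default_rng(0)
noise_a = rng.standard_normal((3, 32, 32))  # stand-in CFG estimate, prompt 1
noise_b = rng.standard_normal((3, 32, 32))  # stand-in CFG estimate, prompt 2
combined = hybrid_noise(noise_a, noise_b)
```

The prompt whose estimate is low-passed is the one visible from far away, and the high-passed one shows up when viewed up close.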

Hybrid Image of skull and waterfall:

a lithograph of waterfalls:

a lithograph of a skull:

Hybrid Image of old man and snowy mountain village:

an oil painting of a snowy mountain village:

an oil painting of an old man:

Hybrid Image of snowy mountain village and waterfall:

a lithograph of waterfalls:

an oil painting of a snowy mountain village:

Project 5 - Part B: Diffusion Models from Scratch!

Background of the project:

We have already seen what a pre-trained diffusion model can achieve in the previous part, so now we implement and train our own. In this part, we train a diffusion model on the MNIST handwritten digits dataset so that we can denoise noisy handwritten digits back to clean digit images.

Part 1. Training a Single-Step Denoising UNet

Part 1 - 1. Implementing the UNet:

In this part, we train a simple one-step denoiser: given a noisy image z, it predicts the original clean image x by optimizing the following loss function: \(\mathbf{L} = \mathbf{E}_{z, x}\|\mathbf{D}_{\theta}(z) - x\|^2\). The denoiser uses a network architecture called a UNet. The main idea of the UNet is that it has an encoder that downsamples the image and a decoder that upsamples the encoded image. To preserve image details, it also uses skip connections. The details are all in the project spec, which is easy to follow.

Part 1 - 2. Using the UNet to Train a Denoiser:

Before we start training, we need to create the dataset. We create the noisy images using the formula \(z = x + \sigma\epsilon\), where \(\epsilon \sim N(0, 1)\). Here the clean image is the x in the loss function and z is the noisy image. Here are my results:
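The noising formula can be sketched as below. The random batch stands in for MNIST digits and the particular list of \(\sigma\) values is an assumption chosen for illustration.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """z = x + sigma * eps with eps ~ N(0, 1), drawn independently per pixel."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 1, 28, 28))  # placeholder for MNIST digits

# Illustrative noise levels (assumption); sigma = 0.5 is the training level.
noisy = {sigma: add_noise(x, sigma, rng) for sigma in [0.0, 0.2, 0.5, 1.0]}
```

Unlike the forward process in Part A, the clean image is not scaled here; only the noise amplitude changes with \(\sigma\).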

Once we have the image sets, we can start training. For training, we use noisy images with \(\sigma = 0.5\). All the recommended setup and hyperparameters are given in the spec. Here are my results:

Training Loss:

Result at epoch 1:

Result at epoch 5:

Since we trained the network with \(\sigma = 0.5\), let's see how it performs at different noise levels. Here are my results:

Observation: As we can see, the denoised results get worse at higher noise levels (higher \(\sigma\)), which makes sense since it is harder for the model to denoise an image with more noise in it.


Part 2. Training a Diffusion Model

Just like in Part A, after finishing one-step denoising we can now implement iterative denoising. We still use an MSE loss as in the previous part, but this time, instead of having the model predict the clean image, we want it to predict the noise, so the loss function is as follows: \(\mathbf{L} = \mathbf{E}_{\epsilon, x}\|\epsilon_{\theta}(z) - \epsilon\|^2\). As in Part A, the noisy image at timestep t is \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), so we need all the constants, but this time we create the constants ourselves. And since we are doing iterative denoising, the model also needs to take the timestep as an input, so the final loss function becomes: \(\mathbf{L} = \mathbf{E}_{\epsilon, x_0, t}\|\epsilon_{\theta}(x_t, t) - \epsilon\|^2\).
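Creating the constants ourselves can be sketched as below. The linear schedule and the specific values (beta from 1e-4 to 0.02, T = 300) are assumptions for illustration and may differ from the recommended setup; ddpm_schedule is a hypothetical name.

```python
import numpy as np

def ddpm_schedule(beta_1=1e-4, beta_T=0.02, T=300):
    """Linearly spaced betas, with alpha_t = 1 - beta_t and alpha_bar_t = prod(alpha)."""
    betas = np.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product up to each timestep
    return {"betas": betas, "alphas": alphas, "alpha_bars": alpha_bars}

schedule = ddpm_schedule()
```

The alpha_bars array plays the same role as alphas_cumprod in Part A: it starts close to 1 and decreases toward 0 as t grows.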

Part 2 - 1. Adding Time Conditioning to UNet

To add time conditioning to the network, we need a new fully connected block (the FCB block). Since the t we pass in is a batch of timesteps, we need to broadcast the block's output so that it matches the shape of the outputs of the up blocks. And since each t is a scalar, we normalize t to be in the range [0, 1].
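The normalization and broadcasting can be sketched in NumPy. The function names are hypothetical, the embedding values are made up in place of a real FCB block output, and combining by addition is an assumption; the point is only how a (B, C) embedding lines up with a (B, C, H, W) feature map.

```python
import numpy as np

def time_input(t, T):
    """Normalize integer timesteps to [0, 1] and shape them as a (B, 1) batch."""
    return (np.asarray(t, dtype=np.float64) / T).reshape(-1, 1)

def modulate(features, t_embed):
    """Broadcast a (B, C) embedding over a (B, C, H, W) feature map and add it."""
    return features + t_embed[:, :, None, None]

t_norm = time_input([0, 150, 300], T=300)       # [[0.0], [0.5], [1.0]]
features = np.zeros((3, 2, 4, 4))               # stand-in up-block output
t_embed = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # made-up FCB output
out = modulate(features, t_embed)
```

Indexing with [:, :, None, None] adds two trailing axes of size 1, so each sample and channel gets a single value repeated over every spatial position.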

Part 2 - 2. Training the UNet

Now we can finally start training our time-conditioned UNet. The detailed algorithm is in the project spec, but note that for every image we pick a random timestep, and we must not forget to normalize t. We also add a learning rate scheduler that updates the learning rate after each epoch. Here are my results using the recommended hyperparameters:
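Building one training batch can be sketched as below. This is a minimal sketch under assumptions: timesteps are 0-based indices into the schedule, the schedule values are the illustrative ones, and make_training_batch is a hypothetical name. The model call and optimizer step are left out.

```python
import numpy as np

def make_training_batch(x0, alpha_bars, rng):
    """Pick one random timestep per image and noise each image to that timestep."""
    B, T = x0.shape[0], len(alpha_bars)
    t = rng.integers(0, T, size=B)             # random timestep per image
    eps = rng.standard_normal(x0.shape)        # regression target for the model
    abar = alpha_bars[t].reshape(B, 1, 1, 1)   # broadcast over (C, H, W)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, t, eps

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 300))  # assumed schedule
x0 = rng.uniform(0.0, 1.0, size=(8, 1, 28, 28))              # placeholder digits

x_t, t, eps = make_training_batch(x0, alpha_bars, rng)
t_norm = t / len(alpha_bars)  # normalized timestep passed to the network
```

The loss for this batch would then be the mean squared error between the model's output for (x_t, t_norm) and eps.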

Part 2 - 3. Sampling from the UNet

Sampling here is essentially the same as in Part A, except that we do not need to predict the variance; we can instead use the list of \(\beta\) values that we created. Here are my results:

Result at epoch 1:

Result at epoch 5:

Result at epoch 10:

Result at epoch 15:

Result at epoch 20:

Observation: As we can see, the results get better as the number of epochs increases, but the final results are still somewhat noisy. We will add class conditioning in the next section to get even better results.


Part 2 - 4. Adding Class-Conditioning to UNet

As we saw in the previous section, the results are not good enough, so we add class conditioning (labels 0 - 9) for better results and image generation. To do this, we need 2 more FCB blocks for the class input. The class labels are one-hot encoded. Also, since we still want the model to work without class conditioning, we implement a dropout mask that sets the class-conditioning vector to the zero vector 10% of the time. Here are my results using the recommended hyperparameters:
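The one-hot encoding and dropout mask can be sketched as below. The function name class_condition and the parameter name p_uncond are hypothetical, and the mask is drawn once per sample.

```python
import numpy as np

def class_condition(labels, rng, num_classes=10, p_uncond=0.1):
    """One-hot encode labels, then zero out each row with probability p_uncond."""
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    # keep is 1.0 for conditioned samples and 0.0 for dropped ones.
    keep = (rng.random(len(labels)) >= p_uncond).astype(np.float64)
    return one_hot * keep[:, None]

rng = np.random.default_rng(0)
c_train = class_condition([3, 7, 1, 0], rng)            # about 10% of rows zeroed
c_uncond = class_condition([3, 7], rng, p_uncond=1.0)   # every row zeroed
```

With p_uncond = 1.0 every row becomes the zero vector, which is the unconditional input; with p_uncond = 0.0 the plain one-hot vectors are returned.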

Part 2 - 5. Sampling from the Class-Conditioned UNet

As we observed in Part A, we need CFG to get good results. For the unconditional prediction, we simply pass in a mask with a 100% dropout rate. Here are my results:

Result at epoch 1:

Result at epoch 5:

Result at epoch 10:

Result at epoch 15:

Result at epoch 20:

Observation: As we can see, the results are much better than in the previous part; even the early epochs give really good results.


Final Reflection of the Project

Although this project was really long and hard, I still think it is really interesting to see how powerful machine learning models are in computer vision. It gives me a better idea of what I can do with computer vision now, with all the different models available. To whoever is reading this: this was a really great semester, thank you for your help.