DeepFloyd Diffusion Model
The first section of this project uses the pretrained DeepFloyd model. To get a sense of the model, we first sample images from a selection of text prompts and numbers of inference steps, all using the seed 180. As is evident in the images, lower inference-step counts tend to produce smooth, less detailed images, while higher counts produce more detailed representations of the prompt.
Sampling Loops
Initially, the goal is to generate high-quality images by progressively denoising noisy inputs by using pretrained DeepFloyd denoisers. In diffusion models, a clean image is corrupted by adding noise iteratively until it becomes pure noise. The reverse process removes this noise step by step, using the model's predictions to recover the clean image. Image generation begins with pure noise sampled from a Gaussian distribution. The model predicts and removes noise at each step, gradually refining the image until it is clean.
Forward Process
The forward process in diffusion models progressively adds noise to a clean image. It is defined by a Gaussian distribution \( q(x_t|x_0) \) with mean \( \sqrt{\bar{\alpha}_t}x_0 \) and variance \( (1 - \bar{\alpha}_t) \). The noisy image \( x_t \) at timestep \( t \) is computed as \( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon \), where \( \epsilon \) is sampled from a standard normal distribution \( N(0, I) \). As \( t \) increases, the amount of noise added to the image increases: for small \( t \), \( \bar{\alpha}_t \) remains close to 1, resulting in little noise, while for large \( t \), \( \bar{\alpha}_t \) approaches 0, adding more noise. The forward process was implemented using the variable alphas_cumprod, which contains precomputed values of \( \bar{\alpha}_t \) for timesteps \( t \in [0, 999] \). A test image, resized to 64x64 pixels, was used to demonstrate the process, with the forward process applied at timesteps \( t = 250, 500, 750 \) to generate progressively noisier versions of the image.
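For reference, a minimal sketch of this forward step in PyTorch, assuming alphas_cumprod is a 1-D tensor of the precomputed \( \bar{\alpha}_t \) values and im is an image tensor (the helper name and the choice to also return \( \epsilon \) are ours):

```python
import torch

def forward_process(im, t, alphas_cumprod):
    """Noise a clean image to timestep t: x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps."""
    alpha_bar = alphas_cumprod[t]          # scalar \bar{alpha}_t
    eps = torch.randn_like(im)             # eps ~ N(0, I)
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
    return x_t, eps
```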
Classical Denoising
The first way we can attempt to remove the noise from the image is with a classical denoising method such as Gaussian blur filtering. As the following results show, achieving a good result this way is close to impossible.
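The Gaussian-blur baseline can be as simple as the sketch below; the kernel size and sigma are illustrative values, and no setting recovers the clean image well:

```python
import torch
import torchvision.transforms.functional as TF

def classical_denoise(x_t: torch.Tensor) -> torch.Tensor:
    # Classical "denoising": blur the noisy image with a Gaussian kernel.
    return TF.gaussian_blur(x_t, kernel_size=5, sigma=2.0)
```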
One-Step Denoising
Since classical methods cannot remove the noise from the image, we instead attempt one-step denoising: we use a pretrained diffusion model to estimate the noise and recover something closer to the clean image.
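A sketch of one-step denoising, assuming a hypothetical estimate_noise(x_t, t) wrapper around the pretrained UNet that returns the predicted noise:

```python
import torch

def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
    """Estimate the clean image x_0 from x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = estimate_noise(x_t, t)       # predicted noise from the UNet
    # Invert x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps for x_0.
    x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
    return x0_hat
```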
Iterative Denoising
To further improve this process we can implement iterative denoising. In this method the core denoising step uses a formula that calculates \(x_{t'}\) (the less noisy image) from \(x_t\) (the current noisy image). The formula is: $$ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma $$ This formula balances our clean image estimate (\(x_0\)), the current noisy image (\(x_t\)), and a variance term (\(v_\sigma\)), with their relative contributions controlled by the alpha and beta parameters. The weights are determined by the alphas and betas, where \(\alpha_t = \bar{\alpha}_t/\bar{\alpha}_{t'}\) and \(\beta_t = 1-\alpha_t\), and the \(\bar{\alpha}_t\) values come from alphas_cumprod. This weighting ensures a smooth transition between noise levels, with the contribution of the estimated clean image increasing as we move to lower noise levels. Each iteration of the denoising loop applies this formula, stepping from one strided timestep to the next and progressively producing cleaner images until we reach the final result.
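A sketch of the loop, assuming strided_timesteps is ordered from noisiest to cleanest and that estimate_noise and add_variance are hypothetical wrappers for the UNet's noise prediction and the \(v_\sigma\) term:

```python
import torch

def iterative_denoise(x_t, strided_timesteps, alphas_cumprod,
                      estimate_noise, add_variance):
    """Step from the noisiest strided timestep down to the cleanest."""
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = a_bar_t / a_bar_prev
        beta_t = 1 - alpha_t

        eps_hat = estimate_noise(x_t, t)
        # Current estimate of the clean image from x_t.
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()

        # Weighted combination of x0_hat, x_t, and the variance term.
        x_t = (a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t \
            + add_variance(x_t, t)
    return x_t
```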
Diffusion Model Sampling
Another thing we can do with iterative denoising is to start from pure noise: by applying the denoising process to random noise, we can essentially create an image out of nothing. The following are 5 results from this process.
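Using the sketches above, sampling from scratch amounts to starting the same loop from Gaussian noise (the 64x64 resolution matches the stage-1 model):

```python
import torch

# Sample an image "from nothing": start at pure Gaussian noise and run
# the full iterative denoising loop from the noisiest timestep.
x_T = torch.randn(1, 3, 64, 64)
sample = iterative_denoise(x_T, strided_timesteps, alphas_cumprod,
                           estimate_noise, add_variance)
```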
Classifier Free Guidance (CFG)
The iterative denoising process often produces poor results, sometimes generating nonsensical images. To improve quality, Classifier-Free Guidance (CFG) combines conditional and unconditional noise estimates. The formula used is: $${\epsilon} = \epsilon_{u} + \gamma (\epsilon_{c} - \epsilon_{u})$$ Here, \(\gamma\) is the guidance scale. When \(\gamma = 0\), the estimate is purely unconditional; at \(\gamma = 1\), it is fully conditional. For \(\gamma > 1\), higher-quality images are often produced. The denoising process is modified to include an unconditional noise estimate using an empty prompt embedding, as the model is trained to handle this case. The following are 5 results from this improved process.
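A sketch of the CFG combination, assuming the hypothetical estimate_noise wrapper also accepts a prompt embedding; the guidance scale value is illustrative:

```python
# Classifier-free guidance: mix unconditional and conditional estimates.
gamma = 7.0                                      # guidance scale (illustrative)
eps_u = estimate_noise(x_t, t, empty_prompt_embeds)
eps_c = estimate_noise(x_t, t, prompt_embeds)
eps_hat = eps_u + gamma * (eps_c - eps_u)        # used in place of the plain estimate
```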
Image-to-Image Translation
In this implementation, we combined adding noise to an existing image with denoising it in order to make edits to the image. The extent of the edit depends on the amount of noise added: more noise leads to larger edits, as the diffusion model becomes more "creative" during the denoising process. This is because denoising forces the noisy image back onto the manifold of natural images, allowing the model to "hallucinate" or generate new details. Using the SDEdit algorithm, we applied this process to several images. By adding varying levels of noise and denoising without any conditioning, we generated a series of images that progressively resemble the original image as the amount of added noise decreases. Starting indices of [1, 3, 5, 7, 10, 20] were used for the denoising process.
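A sketch of this loop over starting indices, reusing the forward_process and iterative_denoise helpers sketched earlier (again assuming strided_timesteps runs from noisiest to cleanest):

```python
# SDEdit-style editing: noise the original image to an intermediate
# timestep, then denoise back; a smaller starting index means more noise
# and therefore a larger edit.
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    t_start = strided_timesteps[i_start]
    x_t, _ = forward_process(original_image, t_start, alphas_cumprod)
    edits.append(iterative_denoise(x_t, strided_timesteps[i_start:],
                                   alphas_cumprod, estimate_noise, add_variance))
```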
Editing Hand Drawn and Web Images
The above process works particularly well if you start with nonrealistic images. The following repeats the process with a mix of nonrealistic web images and hand-drawn images.
Inpainting
Inpainting involves modifying an image based on a binary mask. Given an image and a mask, the same process we have been using generates new content wherever the mask equals 1, while preserving the original content where the mask equals 0.
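One way to implement this is to force the region outside the mask back to the (re-noised) original image after every denoising step; the helper below reuses forward_process from the earlier sketch:

```python
def inpaint_constrain(x_t, t, mask, original_image, alphas_cumprod):
    """Keep generated content where mask == 1, original content elsewhere."""
    x_orig_t, _ = forward_process(original_image, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * x_orig_t
```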
Text-Conditional Image-to-Image Translation
We can perform the same process as image-to-image translation, but guided by a text prompt, so that the generation starts out matching the prompt and gradually looks more like the original image as less noise is added.
Visual Anagrams
Visual Anagrams use diffusion models to create optical illusions by blending two prompts. An image is denoised with one prompt, for example "an oil painting of an old man", to compute one noise estimate, while its flipped version is denoised with another prompt, say "an oil painting of people around a campfire". The flipped estimate is flipped back into alignment, averaged with the first, and used for reverse diffusion, generating an image that changes appearance based on its orientation. This process follows the equations: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \]
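A sketch of this noise estimate, assuming the hypothetical estimate_noise wrapper takes a prompt embedding and that the flip is a vertical flip along the height dimension:

```python
import torch

# Visual anagram: average the estimate for prompt 1 with the flipped-back
# estimate for prompt 2 computed on the flipped image.
eps_1 = estimate_noise(x_t, t, p1_embeds)
eps_2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, p2_embeds),
                   dims=[-2])
eps_hat = (eps_1 + eps_2) / 2
```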
Hybrid Images
Factorized Diffusion was used to create hybrid images by combining elements of two distinct text prompts. The process involves generating two noise estimates using a diffusion model, each corresponding to a specific prompt. A composite noise estimate is then created by blending the low frequencies of one noise estimate with the high frequencies of the other. The algorithm to do so is defined as: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = \text{lowpass}(\epsilon_1) + \text{highpass}(\epsilon_2) \]
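A sketch of the hybrid noise estimate, using a Gaussian blur as the lowpass filter (the kernel size and sigma are illustrative values):

```python
import torchvision.transforms.functional as TF

def lowpass(e):
    # Gaussian blur as a simple lowpass filter.
    return TF.gaussian_blur(e, kernel_size=33, sigma=2.0)

# Hybrid image: low frequencies from prompt 1, high frequencies from prompt 2.
eps_1 = estimate_noise(x_t, t, p1_embeds)
eps_2 = estimate_noise(x_t, t, p2_embeds)
eps_hat = lowpass(eps_1) + (eps_2 - lowpass(eps_2))
```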
Training a Single Step Denoising UNet
Given a noisy image \( z \), we aim to train a denoiser \( D_\theta \) such
that it maps \( z \) to a clean image \( x \). To do so, we can optimize over an L2 loss:
\[
L = \mathbb{E}_{z,x} \|D_\theta(z) - x\|^2
\]
The operations of this UNet are defined in the following diagrams:
To train our denoiser, we need to generate training data pairs of \( (z, x) \), where each \( x \) is a clean MNIST digit. For each training batch, we can generate pairs using the following noising process: \[ z = x + \sigma \epsilon, \quad \text{where } \epsilon \sim N(0, I). \] The result of this noising process looks like the following:
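A minimal sketch of generating these pairs, assuming x is a batch of clean MNIST digits:

```python
import torch

def make_training_pair(x, sigma=0.5):
    """Return (z, x): a noisy input z = x + sigma * eps and its clean target x."""
    eps = torch.randn_like(x)      # eps ~ N(0, I)
    z = x + sigma * eps
    return z, x
```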
We can now train our model to perform the denoising.
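A sketch of one training step under the L2 objective above, where denoiser is the UNet \( D_\theta \) and x is a batch of clean digits; the optimizer choice and learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

# One optimization step for the single-step denoiser D_theta.
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

z, x_clean = make_training_pair(x, sigma=0.5)
loss = F.mse_loss(denoiser(z), x_clean)    # L2 loss from above
optimizer.zero_grad()
loss.backward()
optimizer.step()
```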
The model was trained with \(\sigma = 0.5\). The following are results for noise levels it was not trained on.
Training a Diffusion Model
To implement diffusion with a U-Net model, the model is trained to predict the added noise \(\epsilon\) rather than the clean image \(\mathbf{x}\). This adjustment simplifies the training objective, transitioning the loss function from predicting \(\mathbf{x}\) to directly minimizing the difference between the predicted noise and the true noise: \[ L = \mathbb{E}_{\mathbf{x},\epsilon}\|\epsilon_\theta(z) - \epsilon\|^2 \] Sampling then iteratively denoises an image from pure noise \(\epsilon \sim N(0, I)\) to a realistic image.

At each timestep \(t\), a noisy version of the image, \(\mathbf{x}_t\), is generated as a weighted combination of the clean image \(\mathbf{x}_0\) and noise \(\epsilon\), with weights determined by the variance schedule parameters \(\overline{\alpha}_t\) and \(1 - \overline{\alpha}_t\): \[ \mathbf{x}_t = \sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon \] A variance schedule \(\beta_t\) is constructed with values linearly spaced from \(\beta_0 = 0.0001\) to \(\beta_T = 0.02\) over \(T\) timesteps. Using \(\beta_t\), the corresponding \(\alpha_t\) is defined as \(\alpha_t = 1 - \beta_t\), and the cumulative product of the \(\alpha_t\) gives \(\overline{\alpha}_t\) for each timestep: \[ \overline{\alpha}_t = \prod_{s=1}^t \alpha_s \] These parameters ensure that \(\overline{\alpha}_t\) is close to 1 for small \(t\), representing little noise, and close to 0 for large \(t\), representing pure noise.

The denoising process uses a single U-Net \(\epsilon_\theta\) conditioned on the timestep \(t\). The U-Net is trained to predict the noise \(\epsilon\) added to \(\mathbf{x}_0\), allowing it to iteratively denoise the input. The final objective minimizes the mean squared error between the predicted and true noise: \[ L = \mathbb{E}_{\mathbf{x}_0,\epsilon,t}\|\epsilon_\theta(\mathbf{x}_t, t) - \epsilon\|^2 \] This approach avoids training a separate model for each timestep by using a single model conditioned on the timestep. A minimal sketch of the schedule construction is given below; we also need to inject the scalar \(t\) into the U-Net to condition it, which was done according to the following diagrams:
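In the sketch, \(T = 300\) is an illustrative number of timesteps and the helper name is ours:

```python
import torch

# Variance schedule: linearly spaced betas, per-step alphas, and their
# cumulative product \bar{alpha}_t.
T = 300                                    # illustrative
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def noise_to_timestep(x0, t):
    """x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps."""
    a_bar = alphas_bar[t]
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps
```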
This diagram uses an FCBlock (fully connected block) to inject the timestep into the U-Net. The FCBlock is defined as follows:
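One possible PyTorch realization of such a block; the exact layer choice (Linear, GELU, Linear) is an assumption:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the (normalized) scalar timestep to a feature vector that is
    injected into a U-Net layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)
```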
Now to train the time-conditioned UNet, we pick a random image from the training set and a random timestep \(t\), and train the denoiser to predict the noise in \(\mathbf{x}_t\). We repeat this for different images and different values of \(t\) until the model converges. Sampling from the trained UNet gives the following results.
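For reference, one such training step might look like the sketch below; the unet(x_t, t) call signature, the normalization of \(t\), and the schedule variables reuse assumptions from the earlier sketches:

```python
import torch
import torch.nn.functional as F

# One training step for the time-conditioned UNet: random timesteps per
# image, noise the batch, and regress the prediction against eps.
t = torch.randint(0, T, (x0.shape[0],))       # x0: batch of clean digits
x_t, eps = noise_to_timestep(x0, t.view(-1, 1, 1, 1))
t_norm = (t.float() / T).unsqueeze(1)         # scalar t in [0, 1) for the FCBlocks
loss = F.mse_loss(unet(x_t, t_norm), eps)
loss.backward()
```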
To improve the results and give us greater control over image generation, we can optionally
condition our UNet on the digit class (0-9). This requires adding two additional FCBlocks to the UNet.
Training and sampling from the class-conditioned UNet, shown below, are almost identical to the time-conditioned case.
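A sketch of the class-conditioning part of a training step; the one-hot encoding, the 10% drop probability, and the unet(x_t, t_norm, c) signature are assumptions:

```python
import torch
import torch.nn.functional as F

# Class conditioning: a one-hot class vector fed through the extra FCBlocks,
# randomly zeroed so the model also learns the unconditional case (which
# enables classifier-free guidance at sampling time).
p_uncond = 0.1
c = F.one_hot(labels, num_classes=10).float()      # labels: (B,) digit classes
drop = (torch.rand(c.shape[0], 1) < p_uncond).float()
c = c * (1 - drop)                                 # zeroed rows act as "no class"
loss = F.mse_loss(unet(x_t, t_norm, c), eps)
```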
The improved results from this process are shown below: