Project 5: Diffusion Models
This project explores diffusion models through two main components:
- Part A: Experimenting with pretrained diffusion models (DeepFloyd IF)
- Part B: Building and training a diffusion model from scratch for MNIST image generation
Part 0: Setup and Initial Testing
To begin working with DeepFloyd IF, I first set up the necessary environment and access:
- Created a Hugging Face account and accepted the license for DeepFloyd/IF-I-XL-v1.0
- Generated and configured my Hugging Face Hub access token
- Downloaded the precomputed text embeddings to manage GPU memory constraints
Initial Generation Tests
Using a random seed of 42, I tested the model with three different prompts.
Playing (20 inference steps)
Playing (100 inference steps)
Observations
When comparing different numbers of inference steps, I found that the quality of the generated images is generally consistent: there was not much difference between 20 and 100 inference steps. The model produced high-quality, consistent results for the provided prompts. Particularly noteworthy was its ability to generate diverse images that closely matched the textual descriptions in the prompts.
Part A: The Power of Diffusion Models
A1.1: Implementing the Forward Process
In this section, I implemented the forward process of a diffusion model, which progressively adds noise to an image. Starting with a clean test image of the Campanile (resized to 64x64), I applied increasing levels of noise at timesteps t = 250, 500, and 750. As shown in the images below, the forward process gradually transforms the clear image into increasingly noisy versions.
Original Image
t = 250
t = 500
t = 750
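For reference, the forward process follows the standard DDPM noising equation; below is a minimal sketch, assuming `alphas_cumprod` holds the scheduler's precomputed cumulative products (the variable name and the `campanile` tensor are illustrative of my setup):

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

# e.g. noisy versions of the 64x64 Campanile image at t = 250, 500, 750:
# im_250, _ = forward_process(campanile, 250, alphas_cumprod)
```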
A1.2: Classical Denoising
In this section, I explored classical denoising methods using Gaussian blur filtering on the noisy images generated in the previous section. Despite attempting to optimize the blur parameters, the results demonstrate the limitations of classical denoising approaches when dealing with significant noise levels.
Comparison at t=250
Comparison at t=500
Comparison at t=750
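The classical baseline is simply a Gaussian blur applied to each noisy image. A minimal sketch using torchvision; the kernel size and sigma here are illustrative rather than the exact values I settled on:

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy, kernel_size=5, sigma=2.0):
    """Try to suppress the added noise with a plain Gaussian blur (no learning)."""
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```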
A1.3: One-Step Denoising
In this section, I implemented one-step denoising using a pretrained diffusion model. The UNet model was used to estimate and remove noise from the images, conditioned on timesteps and the text prompt "a high quality photo". The results demonstrate the effectiveness of the learned denoising process compared to classical methods.
One-step Denoising Results at t=250
One-step Denoising Results at t=500
One-step Denoising Results at t=750
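Given the UNet's noise estimate, the clean image can be recovered in a single step by inverting the forward-process equation. A sketch, where `estimate_noise` is a hypothetical wrapper around the stage-1 UNet call conditioned on the "a high quality photo" embedding:

```python
import torch

def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
    """Predict the noise with the UNet, then solve the forward equation for x0."""
    alpha_bar_t = alphas_cumprod[t]
    eps_hat = estimate_noise(x_t, t)  # UNet noise prediction at timestep t
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)
    return x0_hat
```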
A1.4: Iterative Denoising
This section demonstrates the iterative denoising process using strided timesteps. Starting from a highly noisy image (t=690), we gradually denoise the image using steps of size 30. The results show the progression of the denoising process and compare it against one-step denoising and Gaussian blur; iterative denoising clearly outperforms both.
Iterative Denoising Progress (Every 5th Step)
Iterative Denoising Result
One-step Denoising Result
Gaussian Blur Result
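A sketch of the iterative loop, assuming a descending list of strided timesteps (e.g. 690, 660, ..., 0) and the same hypothetical `estimate_noise` wrapper; the per-step variance term is omitted here for brevity:

```python
import torch

def iterative_denoise(x_t, strided_timesteps, alphas_cumprod, estimate_noise):
    """Denoise from a high-noise timestep down to t = 0 along strided timesteps.
    Each update interpolates between the current x_t and the one-step clean
    estimate x0_hat (DDPM posterior mean)."""
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev
        beta_t = 1.0 - alpha_t

        eps_hat = estimate_noise(x_t, t)
        x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)

        x_t = (torch.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * x_t
    return x_t
```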
A1.5: Diffusion Model Sampling
In this section, we explore image generation from pure noise using the iterative denoising process. Starting with random noise and using the prompt "a high quality photo", we demonstrate the model's ability to generate images from scratch. Below are five samples generated using this method, showing the model's capability to create diverse outputs from random noise.
Five Generated Samples from Random Noise
A1.6: Classifier-Free Guidance (CFG)
This section demonstrates the power of Classifier-Free Guidance (CFG) in improving image generation quality. By combining conditional and unconditional noise estimates with a guidance scale of 7, we achieve significantly better results compared to the basic sampling method. Below are five samples generated using CFG, showing notably improved image quality and coherence.
Five Generated Samples using CFG (Scale = 7)
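The CFG combination itself is a one-liner: run the UNet twice, once with the text conditioning and once with the null ("") prompt, and extrapolate. A sketch with the hypothetical `estimate_noise` wrapper now also taking a prompt embedding:

```python
def cfg_noise_estimate(x_t, t, estimate_noise, cond_emb, uncond_emb, scale=7.0):
    """Classifier-free guidance: push the conditional noise estimate away from
    the unconditional one by the guidance scale."""
    eps_uncond = estimate_noise(x_t, t, uncond_emb)  # null ("") prompt embedding
    eps_cond = estimate_noise(x_t, t, cond_emb)      # e.g. "a high quality photo"
    return eps_uncond + scale * (eps_cond - eps_uncond)
```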
A1.7: Image-to-image Translation
This section explores image-to-image translation using different i_start values. Following the SDEdit algorithm, we add varying amounts of noise to images and then denoise them using CFG. The results show how different starting indices (noise levels) affect the balance between preserving the original image content and allowing creative modifications. Below are the results for i_start values [1, 3, 5, 7, 10, 20] using the prompt "a high quality photo".
Image-to-image Translation Results with Different Starting Indices
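In code, SDEdit is just the forward process followed by CFG iterative denoising starting at index i_start. A minimal sketch reusing the pieces above, where `denoise_from` stands in for the CFG version of the iterative denoiser:

```python
import torch

def sdedit(image, i_start, strided_timesteps, alphas_cumprod, denoise_from):
    """Noise the input image to strided_timesteps[i_start], then denoise it back.
    Small i_start -> lots of noise -> more creative edits; large i_start -> the
    output stays close to the original image."""
    t_start = strided_timesteps[i_start]
    alpha_bar = alphas_cumprod[t_start]
    x_t = torch.sqrt(alpha_bar) * image + torch.sqrt(1.0 - alpha_bar) * torch.randn_like(image)
    return denoise_from(x_t, strided_timesteps[i_start:])
```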
A1.7.1: Editing Hand-Drawn and Web Images
This section explores how the image-to-image translation process works with non-realistic source images. We experiment with both web-sourced images and hand-drawn sketches, demonstrating how the model projects these onto the natural image manifold.
Original Web Image
First Hand-drawn Sketch
Second Hand-drawn Sketch
Web Image Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
First Sketch Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
Second Sketch Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
A1.7.2: Inpainting
In this section, I implemented inpainting following the RePaint paper's methodology. The process involves using a binary mask to selectively preserve original image content while generating new content in masked regions. For each denoising step, pixels outside the edit mask are replaced with the original image (with appropriate noise added), while pixels inside the mask are generated by the model.
Campanile Inpainting Mask
Campanile Inpainting Example
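The key operation is a per-step mask replacement: after each denoising update, pixels outside the edit mask are overwritten with the original image noised to the current timestep. A minimal sketch (mask = 1 inside the region to regenerate), reusing the `forward_process` helper from earlier:

```python
def inpaint_step(x_t, t, original, mask, alphas_cumprod):
    """Keep generated content inside the mask, and force everything outside the
    mask back to the (appropriately noised) original image."""
    x_t_orig, _ = forward_process(original, t, alphas_cumprod)
    return mask * x_t + (1.0 - mask) * x_t_orig
```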
A1.7.3: Text-Conditional Image-to-image Translation
This section explores text-guided image-to-image translation, combining SDEdit's projection approach with text conditioning. Instead of using the generic "a high quality photo" prompt, we use specific text prompts to guide the generation process. This allows for controlled modifications while maintaining a balance between preserving the original image structure and incorporating elements from the text description. Below are the results using different noise levels [1, 3, 5, 7, 10, 20] with a specific text prompt.
Campanile (a rocket ship) (Noise Levels [1, 3, 5, 7, 10, 20])
A1.8: Visual Anagrams & Optical Illusions
In this section, I implemented visual anagrams using diffusion models to create images that reveal different content when viewed upside down. The technique involves averaging noise estimates from two different text prompts - one for the upright image and one for the inverted image. This creates fascinating optical illusions where a single image contains two distinct interpretations depending on its orientation.
Oil Painting of People Around a Campfire (Upright)
Oil Painting of an Old Man (Flipped)
A Man Wearing a Hat (Upright)
A Photo of a Man (Flipped)
A Photo of the Amalfi Coast (Upright)
A Photo of a Hipster Barista (Flipped)
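A sketch of the anagram noise estimate: one UNet pass on the upright image with the first prompt, one pass on the vertically flipped image with the second prompt (its output flipped back), and the two estimates averaged. The prompt embeddings and `estimate_noise` wrapper are the same hypothetical pieces used above:

```python
import torch

def anagram_noise_estimate(x_t, t, estimate_noise, prompt_a, prompt_b):
    """Average the noise estimate for the upright image (prompt_a) with the
    flipped-back estimate for the flipped image (prompt_b)."""
    eps_a = estimate_noise(x_t, t, prompt_a)
    eps_b = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return 0.5 * (eps_a + eps_b)
```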
A1.9: Hybrid Images & Factorized Diffusion
This section demonstrates the creation of hybrid images using factorized diffusion. The technique combines low frequencies from one noise estimate with high frequencies from another, creating images that appear different when viewed from different distances. Using a Gaussian blur with kernel size 33 and sigma 2, we create compelling hybrid images that reveal different content based on viewing distance.
Hybrid Image: Skull (Far) / Waterfall (Close)
Hybrid Image: Lithograph of a Skull (Far) / Amalfi Coast (Close)
Hybrid Image: Amalfi Coast (Far) / Dog (Close)
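A sketch of the factorized-diffusion noise estimate, combining the low-pass of one prompt's estimate with the high-pass of the other's; the low-pass is the Gaussian blur with kernel size 33 and sigma 2 mentioned above:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, estimate_noise, prompt_far, prompt_close,
                          kernel_size=33, sigma=2.0):
    """Low frequencies from the 'far' prompt, high frequencies from the 'close' prompt."""
    eps_far = estimate_noise(x_t, t, prompt_far)      # dominates when viewed from far away
    eps_close = estimate_noise(x_t, t, prompt_close)  # dominates when viewed up close
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_close - TF.gaussian_blur(eps_close, kernel_size=kernel_size, sigma=sigma)
    return low + high
```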
Part B: Diffusion Models from Scratch
Part B1: UNet Implementation
In this section, I implemented a UNet architecture for single-step denoising. The model was trained to map noisy MNIST digits back to their clean versions using an L2 loss function. The UNet consists of downsampling and upsampling blocks with skip connections.
1.2 Training Process
The training process involved generating noisy versions of MNIST digits using various sigma values. The figure below shows the effect of different noise levels on the input images:
Figure 3: Varying levels of noise on MNIST digits
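A minimal sketch of the noising operation and one training step of the single-step denoiser. Here the noise is purely additive (z = x + sigma * eps) and the loss is the L2 distance between the UNet output and the clean digit; `unet`, the optimizer, and the training sigma are placeholders for my actual setup:

```python
import torch
import torch.nn.functional as F

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

def train_step(unet, optimizer, x, sigma):
    """One optimization step: denoise z back toward x under an L2 loss."""
    z = add_noise(x, sigma)
    loss = F.mse_loss(unet(z), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```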
Training Results
Figure 4: Training Loss Curve over 5 Epochs
Figure 5: Results on test set after 1 epoch of training
Figure 6: Results on test set after 5 epochs of training
1.2.2 Out-of-Distribution Testing
To evaluate the model's generalization capabilities, we tested it on noise levels (σ) that weren't seen during training. The results below show how the model performs across different noise intensities:
σ = 0.0
σ = 0.2
σ = 0.4
σ = 0.5
σ = 0.6
σ = 0.8
σ = 1.0
Lower noise levels result in more accurate denoising, while higher noise levels produce blurrier reconstructions with more error.
Part B2: Time-Conditioned UNet
In this section, I implemented a time-conditioned UNet for diffusion modeling. The model was trained to predict noise at different timesteps, enabling iterative denoising of images. This builds upon the basic UNet from Part B1 by adding time conditioning through FCBlocks.
Architecture Modifications
The UNet architecture was modified to include time conditioning through FCBlocks:
- Added two FCBlocks for time embedding
- Modified the unflatten and up1 layers to incorporate time information
- Normalized time values to [0,1] range before embedding
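A minimal sketch of the conditioning pieces, assuming FCBlock is a small two-layer MLP whose output is broadcast onto the unflatten and up1 feature maps; the widths and exact modulation scheme here are illustrative of my implementation rather than prescriptive:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the normalized timestep to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass (t normalized to [0, 1], shape [B, 1]):
#   t1 = self.t_embed1(t)                          # modulates the unflatten features
#   t2 = self.t_embed2(t)                          # modulates the up1 features
#   unflatten = unflatten + t1[:, :, None, None]   # broadcast over spatial dims
#   up1       = up1       + t2[:, :, None, None]
```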
Training Process
The model was trained with the following specifications:
- Batch size: 128
- Hidden dimension: 64
- Learning rate: 1e-3 with exponential decay
- Training duration: 20 epochs
Training Loss Over 20 Epochs
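The training step itself follows the standard DDPM recipe: sample a random timestep per image, noise the batch with the schedule's cumulative products, and regress the injected noise with MSE. A sketch, assuming T = 300 diffusion steps and a `unet(x_t, t_normalized)` call signature (both assumptions of my setup):

```python
import torch
import torch.nn.functional as F

def ddpm_train_step(unet, optimizer, x0, alphas_cumprod, num_timesteps=300):
    """Sample a timestep per image, noise the batch, and predict the noise."""
    B = x0.shape[0]
    t = torch.randint(0, num_timesteps, (B,), device=x0.device)
    alpha_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * eps

    eps_hat = unet(x_t, t.float().view(B, 1) / num_timesteps)  # time normalized to [0, 1]
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```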
Sampling Results
Below are the sampling results after 5 and 20 epochs of training. The model shows improvement in generation quality as training progresses.
Generated Samples after 5 Epochs
Generated Samples after 20 Epochs
Analysis
The training loss curve shows steady improvement over the 20 epochs, with the most significant improvements occurring in the first 10 epochs. The sampling results demonstrate the model's ability to generate increasingly clear and coherent MNIST digits as training progresses.
Part B3: Class-Conditioned UNet
Building upon the time-conditioned UNet, this section implements additional class conditioning to enable controlled generation of specific MNIST digits. The model now accepts both time step t and class label c as conditioning signals.
Architecture Enhancements
The UNet architecture was further modified to include class conditioning:
- Added two additional FCBlocks for class embedding
- Implemented one-hot encoding for class labels (0-9)
- Added class conditioning dropout (p=0.1) for unconditional training
- Modified the modulation scheme to incorporate both time and class information
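A sketch of the class-conditioning inputs: the label is one-hot encoded and, with probability 0.1, the whole conditioning vector is zeroed out so the same network also learns the unconditional distribution needed for classifier-free guidance:

```python
import torch
import torch.nn.functional as F

def make_class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels and randomly drop (zero out) the
    conditioning for a fraction p_uncond of the batch."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_uncond).float()
    return c * (1.0 - drop)[:, None]
```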
Training Configuration
The model was trained with the following specifications:
- Batch size: 128
- Hidden dimension: 64
- Learning rate: 1e-3 with exponential decay
- Training duration: 20 epochs
- Class conditioning dropout: 10%
- Classifier-free guidance scale: 5.0
Training Loss Over 20 Epochs
Generation Results
Below are the sampling results after 5 and 20 epochs of training, showing four instances of each digit (0-9). The results demonstrate the model's ability to generate specific digits while maintaining variation within each class.
Generated Samples after 5 Epochs (4 instances per digit)
Generated Samples after 20 Epochs (4 instances per digit)
Analysis
The class-conditioned model shows clear improvements over the time-only conditioned version, and sample quality continues to improve with additional training epochs. With classifier-free guidance, the generated digits are consistently recognizable as their target classes while still varying within each class.