Project 5: Diffusion Models
This project explores diffusion models through two main components:
- Part A: Experimenting with pretrained diffusion models (DeepFloyd IF)
- Part B: Building and training a diffusion model from scratch for MNIST image generation
Part 0: Setup and Initial Testing
To begin working with DeepFloyd IF, I first set up the necessary environment and access:
- Created a Hugging Face account and accepted the license for DeepFloyd/IF-I-XL-v1.0
- Generated and configured my Hugging Face Hub access token
- Downloaded the precomputed text embeddings to manage GPU memory constraints
Initial Generation Tests
Using a random seed of 42, I tested the model with three different prompts.
Playing (20 inference steps)
Playing (100 inference steps)
Observations
When comparing different numbers of inference steps, I found that the quality of the generated images is generally consistent: there was not much difference between 20 and 100 inference steps. The model produced high-quality, consistent results for the provided prompts. Particularly noteworthy was its ability to generate diverse images that closely matched the textual descriptions in the prompts.
Part A: The Power of Diffusion Models
A1.1: Implementing the Forward Process
In this section, I implemented the forward process of a diffusion model, which progressively adds noise to an image. Starting with a clean test image of the Campanile (resized to 64x64), I applied increasing levels of noise at timesteps t = 250, 500, and 750. As shown in the images below, the forward process gradually transforms the clear image into increasingly noisy versions.
Original Image
t = 250
t = 500
t = 750
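For reference, the forward process follows the standard DDPM noising equation; below is a minimal sketch, assuming `alphas_cumprod` holds the scheduler's precomputed cumulative products (the variable name and the `campanile` tensor are illustrative of my setup):

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

# e.g. noisy versions of the 64x64 Campanile image at t = 250, 500, 750:
# im_250, _ = forward_process(campanile, 250, alphas_cumprod)
```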
A1.2: Classical Denoising
In this section, I explored classical denoising methods using Gaussian blur filtering on the noisy images generated in the previous section. Despite attempting to optimize the blur parameters, the results demonstrate the limitations of classical denoising approaches when dealing with significant noise levels.
Comparison at t=250
Comparison at t=500
Comparison at t=750
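The classical baseline is simply a Gaussian blur applied to each noisy image. A minimal sketch using torchvision; the kernel size and sigma here are illustrative rather than the exact values I settled on:

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy, kernel_size=5, sigma=2.0):
    """Try to suppress the added noise with a plain Gaussian blur (no learning)."""
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```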
A1.3: One-Step Denoising
In this section, I implemented one-step denoising using a pretrained diffusion model. The UNet model was used to estimate and remove noise from the images, conditioned on timesteps and the text prompt "a high quality photo". The results demonstrate the effectiveness of the learned denoising process compared to classical methods.
One-step Denoising Results at t=250
One-step Denoising Results at t=500
One-step Denoising Results at t=750
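Given the UNet's noise estimate, the clean image can be recovered in a single step by inverting the forward-process equation. A sketch, where `estimate_noise` is a hypothetical wrapper around the stage-1 UNet call conditioned on the "a high quality photo" embedding:

```python
import torch

def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
    """Predict the noise with the UNet, then solve the forward equation for x0."""
    alpha_bar_t = alphas_cumprod[t]
    eps_hat = estimate_noise(x_t, t)  # UNet noise prediction at timestep t
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)
    return x0_hat
```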
A1.4: Iterative Denoising
This section demonstrates the iterative denoising process using strided timesteps. Starting from a highly noisy image (t=690), we gradually denoise the image using steps of size 30. The results show the progression of the denoising process and compare it against one-step denoising and Gaussian blur; iterative denoising clearly outperforms both.
Iterative Denoising Progress (Every 5th Step)
Iterative Denoising Result
One-step Denoising Result
Gaussian Blur Result
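A sketch of the iterative loop, assuming a descending list of strided timesteps (e.g. 690, 660, ..., 0) and the same hypothetical `estimate_noise` wrapper; the per-step variance term is omitted here for brevity:

```python
import torch

def iterative_denoise(x_t, strided_timesteps, alphas_cumprod, estimate_noise):
    """Denoise from a high-noise timestep down to t = 0 along strided timesteps.
    Each update interpolates between the current x_t and the one-step clean
    estimate x0_hat (DDPM posterior mean)."""
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev
        beta_t = 1.0 - alpha_t

        eps_hat = estimate_noise(x_t, t)
        x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)

        x_t = (torch.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * x_t
    return x_t
```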
A1.5: Diffusion Model Sampling
In this section, we explore image generation from pure noise using the iterative denoising process. Starting with random noise and using the prompt "a high quality photo", we demonstrate the model's ability to generate images from scratch. Below are five samples generated using this method, showing the model's capability to create diverse outputs from random noise.
Five Generated Samples from Random Noise
A1.6: Classifier-Free Guidance (CFG)
This section demonstrates the power of Classifier-Free Guidance (CFG) in improving image generation quality. By combining conditional and unconditional noise estimates with a guidance scale of 7, we achieve significantly better results compared to the basic sampling method. Below are five samples generated using CFG, showing notably improved image quality and coherence.
Five Generated Samples using CFG (Scale = 7)
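The CFG combination itself is a one-liner: run the UNet twice, once with the text conditioning and once with the null ("") prompt, and extrapolate. A sketch with the hypothetical `estimate_noise` wrapper now also taking a prompt embedding:

```python
def cfg_noise_estimate(x_t, t, estimate_noise, cond_emb, uncond_emb, scale=7.0):
    """Classifier-free guidance: push the conditional noise estimate away from
    the unconditional one by the guidance scale."""
    eps_uncond = estimate_noise(x_t, t, uncond_emb)  # null ("") prompt embedding
    eps_cond = estimate_noise(x_t, t, cond_emb)      # e.g. "a high quality photo"
    return eps_uncond + scale * (eps_cond - eps_uncond)
```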
A1.7: Image-to-image Translation
This section explores image-to-image translation using different i_start values. Following the SDEdit algorithm, we add varying amounts of noise to images and then denoise them using CFG. The results show how different starting indices (noise levels) affect the balance between preserving the original image content and allowing creative modifications. Below are the results for i_start values [1, 3, 5, 7, 10, 20] using the prompt "a high quality photo".
Image-to-image Translation Results with Different Starting Indices
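In code, SDEdit is just the forward process followed by CFG iterative denoising starting at index i_start. A minimal sketch reusing the pieces above, where `denoise_from` stands in for the CFG version of the iterative denoiser:

```python
import torch

def sdedit(image, i_start, strided_timesteps, alphas_cumprod, denoise_from):
    """Noise the input image to strided_timesteps[i_start], then denoise it back.
    Small i_start -> lots of noise -> more creative edits; large i_start -> the
    output stays close to the original image."""
    t_start = strided_timesteps[i_start]
    alpha_bar = alphas_cumprod[t_start]
    x_t = torch.sqrt(alpha_bar) * image + torch.sqrt(1.0 - alpha_bar) * torch.randn_like(image)
    return denoise_from(x_t, strided_timesteps[i_start:])
```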
A1.7.1: Editing Hand-Drawn and Web Images
This section explores how the image-to-image translation process works with non-realistic source images. We experiment with both web-sourced images and hand-drawn sketches, demonstrating how the model projects these onto the natural image manifold.
Original Web Image
First Hand-drawn Sketch
Second Hand-drawn Sketch
Web Image Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
First Sketch Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
Second Sketch Translation Progress (Noise Levels [1, 3, 5, 7, 10, 20])
A1.7.2: Inpainting
In this section, I implemented inpainting following the RePaint paper's methodology. The process involves using a binary mask to selectively preserve original image content while generating new content in masked regions. For each denoising step, pixels outside the edit mask are replaced with the original image (with appropriate noise added), while pixels inside the mask are generated by the model.
Campanile Inpainting Mask
Campanile Inpainting Example
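The key operation is a per-step mask replacement: after each denoising update, pixels outside the edit mask are overwritten with the original image noised to the current timestep. A minimal sketch (mask = 1 inside the region to regenerate), reusing the `forward_process` helper from earlier:

```python
def inpaint_step(x_t, t, original, mask, alphas_cumprod):
    """Keep generated content inside the mask, and force everything outside the
    mask back to the (appropriately noised) original image."""
    x_t_orig, _ = forward_process(original, t, alphas_cumprod)
    return mask * x_t + (1.0 - mask) * x_t_orig
```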
A1.7.3: Text-Conditional Image-to-image Translation
This section explores text-guided image-to-image translation, combining SDEdit's projection approach with text conditioning. Instead of using the generic "a high quality photo" prompt, we use specific text prompts to guide the generation process. This allows for controlled modifications while maintaining a balance between preserving the original image structure and incorporating elements from the text description. Below are the results using different noise levels [1, 3, 5, 7, 10, 20] with a specific text prompt.
Campanile (a rocket ship) (Noise Levels [1, 3, 5, 7, 10, 20])
A1.8: Visual Anagrams & Optical Illusions
In this section, I implemented visual anagrams using diffusion models to create images that reveal different content when viewed upside down. The technique involves averaging noise estimates from two different text prompts - one for the upright image and one for the inverted image. This creates fascinating optical illusions where a single image contains two distinct interpretations depending on its orientation.
Oil Painting of People Around a Campfire (Upright)
Oil Painting of an Old Man (Flipped)
A Man Wearing a Hat (Upright)
A Photo of a Man (Flipped)
A Photo of the Amalfi Coast (Upright)
A Photo of a Hipster Barista (Flipped)
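A sketch of the anagram noise estimate: one UNet pass on the upright image with the first prompt, one pass on the vertically flipped image with the second prompt (its output flipped back), and the two estimates averaged. The prompt embeddings and `estimate_noise` wrapper are the same hypothetical pieces used above:

```python
import torch

def anagram_noise_estimate(x_t, t, estimate_noise, prompt_a, prompt_b):
    """Average the noise estimate for the upright image (prompt_a) with the
    flipped-back estimate for the flipped image (prompt_b)."""
    eps_a = estimate_noise(x_t, t, prompt_a)
    eps_b = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return 0.5 * (eps_a + eps_b)
```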
A1.9: Hybrid Images & Factorized Diffusion
This section demonstrates the creation of hybrid images using factorized diffusion. The technique combines low frequencies from one noise estimate with high frequencies from another, creating images that appear different when viewed from different distances. Using a Gaussian blur with kernel size 33 and sigma 2, we create compelling hybrid images that reveal different content based on viewing distance.
Hybrid Image: Skull (Far) / Waterfall (Close)
Hybrid Image: Lithograph of a Skull (Far) / Amalfi Coast (Close)
Hybrid Image: Amalfi Coast (Far) / Dog (Close)
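A sketch of the factorized-diffusion noise estimate, combining the low-pass of one prompt's estimate with the high-pass of the other's; the low-pass is the Gaussian blur with kernel size 33 and sigma 2 mentioned above:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, estimate_noise, prompt_far, prompt_close,
                          kernel_size=33, sigma=2.0):
    """Low frequencies from the 'far' prompt, high frequencies from the 'close' prompt."""
    eps_far = estimate_noise(x_t, t, prompt_far)      # dominates when viewed from far away
    eps_close = estimate_noise(x_t, t, prompt_close)  # dominates when viewed up close
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_close - TF.gaussian_blur(eps_close, kernel_size=kernel_size, sigma=sigma)
    return low + high
```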
Part B: Diffusion Models from Scratch
Part B1: UNet Implementation
In this section, I implemented a UNet architecture for single-step denoising. The model was trained to map noisy MNIST digits back to their clean versions using an L2 loss function. The UNet consists of downsampling and upsampling blocks with skip connections.
1.2 Training Process
The training process involved generating noisy versions of MNIST digits using various sigma values. The figure below shows the effect of different noise levels on the input images:
Figure 3: Varying levels of noise on MNIST digits
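A minimal sketch of the noising operation and one training step of the single-step denoiser. Here the noise is purely additive (z = x + sigma * eps) and the loss is the L2 distance between the UNet output and the clean digit; `unet`, the optimizer, and the training sigma are placeholders for my actual setup:

```python
import torch
import torch.nn.functional as F

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

def train_step(unet, optimizer, x, sigma):
    """One optimization step: denoise z back toward x under an L2 loss."""
    z = add_noise(x, sigma)
    loss = F.mse_loss(unet(z), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```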
Training Results
Figure 4: Training Loss Curve over 5 Epochs
Figure 5: Results on test set after 1 epoch of training
Figure 6: Results on test set after 5 epochs of training
1.2.2 Out-of-Distribution Testing
To evaluate the model's generalization capabilities, we tested it on noise levels (σ) that weren't seen during training. The results below show how the model performs across different noise intensities:
σ = 0.0
σ = 0.2
σ = 0.4
σ = 0.5
σ = 0.6
σ = 0.8
σ = 1.0
Lower noise levels result in more accurate denoising, while higher noise levels produce blurrier reconstructions with more error.
Part B2: Time-Conditioned UNet
In this section, I implemented a time-conditioned UNet for diffusion modeling. The model was trained to predict noise at different timesteps, enabling iterative denoising of images. This builds upon the basic UNet from Part B1 by adding time conditioning through FCBlocks.
Architecture Modifications
The UNet architecture was modified to include time conditioning through FCBlocks:
- Added two FCBlocks for time embedding
- Modified the unflatten and up1 layers to incorporate time information
- Normalized time values to [0,1] range before embedding
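A minimal sketch of the conditioning pieces, assuming FCBlock is a small two-layer MLP whose output is broadcast onto the unflatten and up1 feature maps; the widths and exact modulation scheme here are illustrative of my implementation rather than prescriptive:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the normalized timestep to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass (t normalized to [0, 1], shape [B, 1]):
#   t1 = self.t_embed1(t)                          # modulates the unflatten features
#   t2 = self.t_embed2(t)                          # modulates the up1 features
#   unflatten = unflatten + t1[:, :, None, None]   # broadcast over spatial dims
#   up1       = up1       + t2[:, :, None, None]
```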
Training Process
The model was trained with the following specifications:
- Batch size: 128
- Hidden dimension: 64
- Learning rate: 1e-3 with exponential decay
- Training duration: 20 epochs
Training Loss Over 20 Epochs
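The training step itself follows the standard DDPM recipe: sample a random timestep per image, noise the batch with the schedule's cumulative products, and regress the injected noise with MSE. A sketch, assuming T = 300 diffusion steps and a `unet(x_t, t_normalized)` call signature (both assumptions of my setup):

```python
import torch
import torch.nn.functional as F

def ddpm_train_step(unet, optimizer, x0, alphas_cumprod, num_timesteps=300):
    """Sample a timestep per image, noise the batch, and predict the noise."""
    B = x0.shape[0]
    t = torch.randint(0, num_timesteps, (B,), device=x0.device)
    alpha_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * eps

    eps_hat = unet(x_t, t.float().view(B, 1) / num_timesteps)  # time normalized to [0, 1]
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```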
Sampling Results
Below are the sampling results after 5 and 20 epochs of training. The model shows improvement in generation quality as training progresses.
Generated Samples after 5 Epochs
Generated Samples after 20 Epochs
Analysis
The training loss curve shows steady improvement over the 20 epochs, with the most significant improvements occurring in the first 10 epochs. The sampling results demonstrate the model's ability to generate increasingly clear and coherent MNIST digits as training progresses.
Part B3: Class-Conditioned UNet
Building upon the time-conditioned UNet, this section implements additional class conditioning to enable controlled generation of specific MNIST digits. The model now accepts both time step t and class label c as conditioning signals.
Architecture Enhancements
The UNet architecture was further modified to include class conditioning:
- Added two additional FCBlocks for class embedding
- Implemented one-hot encoding for class labels (0-9)
- Added class conditioning dropout (p=0.1) for unconditional training
- Modified the modulation scheme to incorporate both time and class information
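A sketch of the class-conditioning inputs: the label is one-hot encoded and, with probability 0.1, the whole conditioning vector is zeroed out so the same network also learns the unconditional distribution needed for classifier-free guidance:

```python
import torch
import torch.nn.functional as F

def make_class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels and randomly drop (zero out) the
    conditioning for a fraction p_uncond of the batch."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_uncond).float()
    return c * (1.0 - drop)[:, None]
```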
Training Configuration
The model was trained with the following specifications:
- Batch size: 128
- Hidden dimension: 64
- Learning rate: 1e-3 with exponential decay
- Training duration: 20 epochs
- Class conditioning dropout: 10%
- Classifier-free guidance scale: 5.0
Training Loss Over 20 Epochs
Generation Results
Below are the sampling results after 5 and 20 epochs of training, showing four instances of each digit (0-9). The results demonstrate the model's ability to generate specific digits while maintaining variation within each class.
Generated Samples after 5 Epochs (4 instances per digit)
Generated Samples after 20 Epochs (4 instances per digit)
Analysis
The class-conditioned model shows clear improvements over the time-only conditioned version, and sample quality continues to improve with additional training epochs. With classifier-free guidance, the generated digits are consistently recognizable as their target classes while still varying within each class.