# DiffBIR: Blind Image Restoration with Generative Diffusion Prior

DiffBIR is a blind image restoration framework that leverages generative diffusion priors from Stable Diffusion to restore degraded images. It supports three main tasks: Blind Super-Resolution (BSR), Blind Face Restoration (BFR), and Blind Image Denoising (BID). The framework uses a two-stage pipeline: stage 1 removes degradations with a cleaner model (SwinIR, BSRNet, or SCUNet), and stage 2 applies a ControlNet-based diffusion model to generate high-quality, realistic details.

The project offers three model versions (v1, v2, v2.1), with v2.1 being the latest. It features improved image captioning via LLaVA, better samplers, and enhanced tiled-sampling support for processing high-resolution images on limited VRAM. DiffBIR automatically downloads pretrained weights and provides both command-line inference and a Gradio web interface for interactive use.

## Installation

Install DiffBIR by cloning the repository and setting up a conda environment.

```bash
# Clone the repository
git clone https://github.com/XPixelGroup/DiffBIR.git
cd DiffBIR

# Create and activate conda environment
conda create -n diffbir python=3.10
conda activate diffbir
pip install -r requirements.txt
```

## Blind Image Super-Resolution (BSR)

Upscale low-resolution images while restoring details and removing degradation artifacts using the BSR inference loop.
```bash
# DiffBIR v2.1 (recommended) - uses LLaVA for auto-captioning
python -u inference.py \
  --task sr \
  --upscale 4 \
  --version v2.1 \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input inputs/demo/bsr \
  --output results/v2.1_demo_bsr

# DiffBIR v2 (ECCV paper version) - no auto-captioning
python -u inference.py \
  --task sr \
  --upscale 4 \
  --version v2 \
  --sampler spaced \
  --steps 50 \
  --captioner none \
  --pos_prompt '' \
  --neg_prompt 'low quality, blurry, low-resolution, noisy, unsharp, weird textures' \
  --cfg_scale 4 \
  --input inputs/demo/bsr \
  --output results/v2_demo_bsr \
  --device cuda --precision fp32
```

## Blind Face Restoration (BFR) - Aligned Faces

Restore degraded face images that are pre-aligned (cropped and centered on the face) using the specialized face restoration pipeline.

```bash
# DiffBIR v2.1 for aligned face restoration
python -u inference.py \
  --task face \
  --upscale 1 \
  --version v2.1 \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input inputs/demo/bfr/aligned \
  --output results/v2.1_demo_bfr_aligned

# DiffBIR v2 for aligned face restoration
python -u inference.py \
  --task face \
  --upscale 1 \
  --version v2 \
  --sampler spaced \
  --steps 50 \
  --captioner none \
  --pos_prompt '' \
  --neg_prompt 'low quality, blurry, low-resolution, noisy, unsharp, weird textures' \
  --cfg_scale 4.0 \
  --input inputs/demo/bfr/aligned \
  --output results/v2_demo_bfr_aligned \
  --device cuda --precision fp32
```

## Blind Face Restoration (BFR) - Unaligned/Whole Image

Restore faces within full scene images, automatically detecting and enhancing both faces and background.
```bash
# DiffBIR v2.1 for unaligned face + background restoration
python -u inference.py \
  --task face_background \
  --upscale 2 \
  --version v2.1 \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input inputs/demo/bfr/whole_img \
  --output results/v2.1_demo_bfr_unaligned

# DiffBIR v2 for unaligned face + background restoration
python -u inference.py \
  --task face_background \
  --upscale 2 \
  --version v2 \
  --sampler spaced \
  --steps 50 \
  --captioner none \
  --pos_prompt '' \
  --neg_prompt 'low quality, blurry, low-resolution, noisy, unsharp, weird textures' \
  --cfg_scale 4.0 \
  --input inputs/demo/bfr/whole_img \
  --output results/v2_demo_bfr_unaligned \
  --device cuda --precision fp32
```

## Blind Image Denoising (BID)

Remove noise from images while preserving and enhancing details using the denoising inference loop.

```bash
# DiffBIR v2.1 for image denoising
python -u inference.py \
  --task denoise \
  --upscale 1 \
  --version v2.1 \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input inputs/demo/bid \
  --output results/v2.1_demo_bid

# DiffBIR v2 for image denoising
python -u inference.py \
  --task denoise \
  --upscale 1 \
  --version v2 \
  --sampler spaced \
  --steps 50 \
  --captioner none \
  --pos_prompt '' \
  --neg_prompt 'low quality, blurry, low-resolution, noisy, unsharp, weird textures' \
  --cfg_scale 4.0 \
  --input inputs/demo/bid \
  --output results/v2_demo_bid \
  --device cuda --precision fp32
```

## Tiled Sampling for Low VRAM

Process high-resolution images on GPUs with limited memory by enabling tiled inference for all pipeline stages.
```bash
# Enable tiled sampling for all stages to reduce VRAM usage
python -u inference.py \
  --task sr \
  --upscale 4 \
  --version v2.1 \
  --captioner llava \
  --cfg_scale 8 \
  --input inputs/demo/bsr \
  --output results/tiled_output \
  --cleaner_tiled \
  --cleaner_tile_size 256 \
  --cleaner_tile_stride 128 \
  --vae_encoder_tiled \
  --vae_encoder_tile_size 256 \
  --vae_decoder_tiled \
  --vae_decoder_tile_size 256 \
  --cldm_tiled \
  --cldm_tile_size 512 \
  --cldm_tile_stride 256
```

## Custom Model Inference

Run inference with your own trained IRControlNet checkpoint by specifying the training config and model path.

```bash
# Inference with custom-trained model
python -u inference.py \
  --upscale 4 \
  --version custom \
  --train_cfg path/to/training/config.yaml \
  --ckpt path/to/saved/checkpoint.pt \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input inputs/demo/bsr \
  --output results/custom_demo_bsr
```

## Gradio Web Interface

Launch an interactive web UI for real-time image restoration with adjustable parameters.

```bash
# Launch Gradio interface with LLaVA captioner (recommended)
python run_gradio.py --captioner llava

# For low-VRAM systems, use RAM captioner or disable captioning
python run_gradio.py --captioner ram
python run_gradio.py --captioner none

# Control LLaVA quantization (4-bit default, 8-bit, or 16-bit)
python run_gradio.py --captioner llava --llava_bit 8
```

## Training Stage 1 (SwinIR Cleaner)

Train the stage-1 degradation removal model (SwinIR) on your own dataset.

```bash
# 1. Generate file lists for training and validation sets
find /path/to/images -type f > files.list
shuf files.list > files_shuf.list
head -n 10000 files_shuf.list > train.list
tail -n +10001 files_shuf.list > val.list

# 2. Edit configs/train/train_stage1.yaml with your paths:
#    - dataset.train.file_list: path/to/train.list
#    - dataset.val.file_list: path/to/val.list
#    - train.exp_dir: path/to/experiment/output

# 3. Start training with accelerate
accelerate launch train_stage1.py --config configs/train/train_stage1.yaml
```

## Training Stage 2 (IRControlNet)

Train the stage-2 ControlNet model that generates realistic details using diffusion.

```bash
# 1. Download pretrained Stable Diffusion v2.1
wget https://huggingface.co/stabilityai/stable-diffusion-2-1-base/resolve/main/v2-1_512-ema-pruned.ckpt --no-check-certificate

# 2. Generate training file list (same as stage 1)
find /path/to/images -type f > train.list

# 3. Edit configs/train/train_stage2.yaml with your paths:
#    - train.sd_path: path/to/v2-1_512-ema-pruned.ckpt
#    - train.swinir_path: path/to/stage1/checkpoint.pt
#    - train.exp_dir: path/to/experiment/output
#    - dataset.train.file_list: path/to/train.list

# 4. Start training with accelerate
accelerate launch train_stage2.py --config configs/train/train_stage2.yaml
```

## Python API - Pipeline Usage

Use DiffBIR programmatically in Python by importing the pipeline classes and inference loops.

```python
import torch
import numpy as np
from PIL import Image
from omegaconf import OmegaConf

from diffbir.model import ControlLDM, SwinIR, Diffusion
from diffbir.pipeline import SwinIRPipeline
from diffbir.utils.common import instantiate_from_config, load_model_from_url
from diffbir.inference.pretrained_models import MODELS

device = "cuda"

# Load stage-1 cleaner model (SwinIR)
swinir = instantiate_from_config(OmegaConf.load("configs/inference/swinir.yaml"))
swinir.load_state_dict(load_model_from_url(MODELS["swinir_realesrgan"]))
swinir.eval().to(device)

# Load stage-2 model (ControlLDM)
cldm = instantiate_from_config(OmegaConf.load("configs/inference/cldm.yaml"))
sd_weight = load_model_from_url(MODELS["sd_v2.1_zsnr"])
cldm.load_pretrained_sd(sd_weight)
control_weight = load_model_from_url(MODELS["v2.1"])
cldm.load_controlnet_from_ckpt(control_weight)
cldm.eval().to(device)
cldm.cast_dtype(torch.float16)

# Load diffusion scheduler
diffusion = instantiate_from_config(OmegaConf.load("configs/inference/diffusion_v2.1.yaml"))
diffusion.to(device)

# Create pipeline
pipeline = SwinIRPipeline(swinir, cldm, diffusion, None, device)

# Load and process image
lq = Image.open("input.jpg").convert("RGB")
lq = lq.resize((lq.width * 4, lq.height * 4), Image.BICUBIC)  # Upscale for SR
lq_array = np.array(lq)[None]  # Add batch dimension

# Run restoration
with torch.no_grad(), torch.autocast(device, torch.float16):
    result = pipeline.run(
        lq_array,
        steps=10,
        strength=1.0,
        cleaner_tiled=False,
        cleaner_tile_size=256,
        cleaner_tile_stride=128,
        vae_encoder_tiled=False,
        vae_encoder_tile_size=256,
        vae_decoder_tiled=False,
        vae_decoder_tile_size=256,
        cldm_tiled=True,
        cldm_tile_size=512,
        cldm_tile_stride=256,
        pos_prompt="high quality, detailed",
        neg_prompt="low quality, blurry",
        cfg_scale=8.0,
        start_point_type="noise",
        sampler_type="edm_dpm++_3m_sde",
        noise_aug=0,
        rescale_cfg=False,
        s_churn=0,
        s_tmin=0,
        s_tmax=300,
        s_noise=1,
        eta=1,
        order=1,
    )

# Save result
Image.fromarray(result[0]).save("output.png")
```

## Sampler Options

DiffBIR supports multiple diffusion samplers with different speed/quality trade-offs.
```bash
# Available samplers (use with --sampler flag):
#
# Fast samplers (10-15 steps recommended):
# - edm_dpm++_3m_sde (default, high quality)
# - edm_dpm++_2m_sde
# - edm_dpm++_2m
# - dpm++_m2
#
# Standard samplers (50 steps recommended):
# - spaced
# - ddim
#
# EDM samplers with stochastic options:
# - edm_euler, edm_euler_a
# - edm_heun
# - edm_dpm_2, edm_dpm_2_a
# - edm_lms
# - edm_dpm++_2s_a
# - edm_dpm++_sde

# Example: Fast high-quality sampling
python -u inference.py \
  --task sr --version v2.1 \
  --sampler edm_dpm++_3m_sde \
  --steps 10 \
  --input inputs/demo/bsr --output results/fast

# Example: Traditional DDIM sampling
python -u inference.py \
  --task sr --version v2 \
  --sampler ddim \
  --steps 50 \
  --start_point_type cond \
  --input inputs/demo/bsr --output results/ddim
```

## Pretrained Model Versions

DiffBIR offers three model versions optimized for different use cases.

| Version | Stage-2 Model | Training Data | Best For |
|---------|---------------|---------------|----------|
| v2.1 | DiffBIR_v2.1.pt | Unsplash + LLaVA captions | General use, best quality |
| v2 | v2.pth | Laion2b-en filtered | ECCV paper reproduction |
| v1 | v1_general.pth / v1_face.pth | ImageNet-1k / FFHQ | Task-specific models |

```bash
# Download models manually (auto-downloaded during inference)
# v2.1 (recommended)
wget https://huggingface.co/lxq007/DiffBIR-v2/resolve/main/DiffBIR_v2.1.pt

# v2
wget https://huggingface.co/lxq007/DiffBIR-v2/resolve/main/v2.pth

# v1 models
wget https://huggingface.co/lxq007/DiffBIR-v2/resolve/main/v1_general.pth
wget https://huggingface.co/lxq007/DiffBIR-v2/resolve/main/v1_face.pth
```

## Summary

DiffBIR provides a unified framework for blind image restoration tasks including super-resolution, face restoration, and denoising. The two-stage pipeline combines traditional degradation removal with diffusion-based detail generation, achieving state-of-the-art results on real-world degraded images.
Key features include automatic image captioning via LLaVA for context-aware restoration, tiled sampling for processing high-resolution images on limited VRAM, and multiple sampler options for balancing speed and quality.

For typical usage, start with DiffBIR v2.1 using the `edm_dpm++_3m_sde` sampler at 10 steps with LLaVA captioning enabled for best results. Enable tiled sampling when processing large images or when running on GPUs with less than 12GB of VRAM. The framework supports both command-line batch processing and interactive web-based inference via Gradio, making it suitable for both research workflows and practical applications.
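As a starting point, the recommended settings above can be combined into a single invocation. This is a sketch assembled from the flags shown in earlier sections; the input and output paths are placeholders to replace with your own directories.

```shell
# Recommended starting point: v2.1, fast EDM sampler at 10 steps, LLaVA captions
python -u inference.py \
  --task sr \
  --upscale 4 \
  --version v2.1 \
  --sampler edm_dpm++_3m_sde \
  --steps 10 \
  --captioner llava \
  --cfg_scale 8 \
  --noise_aug 0 \
  --input path/to/your/images \
  --output results/recommended \
  --device cuda
```

If this runs out of memory on a small GPU, append the tiled-sampling flags from the "Tiled Sampling for Low VRAM" section rather than lowering the upscale factor.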