### Install DiffSynth-Studio

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/ERNIE-Image.md

Clone the repository and install DiffSynth-Studio to use ERNIE-Image. Refer to Setup Dependencies for more details.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

--------------------------------

### Install DiffSynth-Studio from Source

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Recommended installation method. Clones the repository, navigates into the directory, and installs the package in editable mode.

```bash
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

--------------------------------

### Install DiffSynth-Studio with All Dependencies

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Inference_WebUI.md

Install DiffSynth-Studio with the `[all]` extra to include all dependencies required by the Inference WebUI. This is the recommended installation method.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .[all]
```

--------------------------------

### Download Example Dataset for Stable Diffusion

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion.md

Use this command to download the example dataset required for Stable Diffusion training. Ensure you have the modelscope CLI installed.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "stable_diffusion/*" --local_dir ./data/diffsynth_example_dataset
```

--------------------------------

### Download Example Dataset for Stable Diffusion XL

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion-XL.md

Use this command to download the example dataset required for Stable Diffusion XL training. The `--include` pattern restricts the download to the `stable_diffusion_xl/` subfolder.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "stable_diffusion_xl/*" --local_dir ./data/diffsynth_example_dataset
```
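The Stable Diffusion and Stable Diffusion XL download commands differ only in the `--include` subfolder. The following is a minimal sketch that builds the same command line for any subfolder, assuming only the flags already shown (`--dataset`, `--include`, `--local_dir`). The helper name `build_download_command` is hypothetical and is not part of the modelscope CLI or DiffSynth-Studio.

```python
import shlex

DATASET_ID = "DiffSynth-Studio/diffsynth_example_dataset"

def build_download_command(subfolder, local_dir="./data/diffsynth_example_dataset"):
    """Return the argv list that downloads one subfolder of the example dataset."""
    return [
        "modelscope", "download",
        "--dataset", DATASET_ID,
        "--include", f"{subfolder}/*",
        "--local_dir", local_dir,
    ]

for name in ("stable_diffusion", "stable_diffusion_xl"):
    # shlex.join quotes the glob so the shell passes it through unexpanded.
    print(shlex.join(build_download_command(name)))
```

The returned list can be passed to `subprocess.run` as-is once the modelscope CLI is installed.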

--------------------------------

### Quick Start: Qwen-Image Inference

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Qwen-Image.md

Load the Qwen-Image model and perform inference using DiffSynth-Studio. This example demonstrates VRAM management, automatically controlling model parameter loading based on available VRAM. A minimum of 8GB VRAM is required.

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "精致肖像，水下少女，蓝裙飘逸，发丝轻扬，光影透澈，气泡环绕，面容恬静，细节精致，梦幻唯美。"
image = pipe(prompt, seed=0, num_inference_steps=40)
image.save("image.jpg")
```
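The `vram_limit` expression takes the total device memory in bytes (the second element returned by `torch.cuda.mem_get_info`), converts it to GiB, and leaves 0.5 GiB of headroom. The following is a minimal sketch of the same arithmetic in plain Python, with the byte count passed in so that it runs without a GPU. The function name `vram_limit_gib` is illustrative and not part of DiffSynth-Studio.

```python
def vram_limit_gib(total_bytes, headroom_gib=0.5):
    """Convert a device's total memory in bytes to a GiB limit with headroom."""
    return total_bytes / (1024 ** 3) - headroom_gib

# Example: a card that reports exactly 8 GiB of total memory.
print(vram_limit_gib(8 * 1024 ** 3))  # 7.5
```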

--------------------------------

### Qwen-Image Pipeline Quick Start

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Use this pipeline for generating images with the Qwen-Image model. Ensure you have the necessary libraries installed and specify the correct model configurations and device.

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
prompt = "精致肖像，水下少女，蓝裙飘逸，发丝轻扬，光影透澈，气泡环绕，面容恬静，细节精致，梦幻唯美。"
image = pipe(
    prompt, seed=0, num_inference_steps=40,
    # edit_image=Image.open("xxx.jpg").resize((1328, 1328)) # For Qwen-Image-Edit
)
image.save("image.jpg")
```
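Each `ModelConfig` selects files through `origin_file_pattern`. The sketch below illustrates which files such a pattern would pick up, assuming glob-style matching against repository-relative paths. That is an assumption based on the `*` wildcards, and the library's actual matching logic may differ. The file listing and the helper `select_files` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical repository listing, for illustration only.
repo_files = [
    "text_encoder/model-00001-of-00004.safetensors",
    "text_encoder/model-00002-of-00004.safetensors",
    "text_encoder/config.json",
    "vae/diffusion_pytorch_model.safetensors",
]

def select_files(pattern, files):
    """Return the files whose path matches a glob-style pattern."""
    return [path for path in files if fnmatchcase(path, pattern)]

# The wildcard picks up every weight shard but skips the JSON config.
print(select_files("text_encoder/model*.safetensors", repo_files))
```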

--------------------------------

### Install DiffSynth Studio with Ascend NPU Support (ARM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs DiffSynth-Studio from source with NPU support for aarch64/ARM architectures. Requires prior installation of CANN.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
# aarch64/ARM
pip install -e .[npu_aarch64]
```

--------------------------------

### Install DiffSynth Studio with Ascend NPU Support (x86)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs DiffSynth-Studio from source with NPU support for x86 architectures. Requires prior installation of CANN and pulls PyTorch from the CPU wheel index via `--extra-index-url`.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
# x86
pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"
```
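The ARM and x86 NPU installs differ in the extra name and in the additional index URL. The following is a minimal sketch of choosing between them from an architecture string such as the one returned by `platform.machine()`. The helper name `npu_install_args` and the accepted aliases (`arm64`, `amd64`) are assumptions for illustration.

```python
CPU_INDEX = "https://download.pytorch.org/whl/cpu"

def npu_install_args(machine):
    """Map a CPU architecture name to the matching pip command for NPU support."""
    arch = machine.lower()
    if arch in ("aarch64", "arm64"):
        return ["pip", "install", "-e", ".[npu_aarch64]"]
    if arch in ("x86_64", "amd64"):
        return ["pip", "install", "-e", ".[npu]", "--extra-index-url", CPU_INDEX]
    raise ValueError(f"unsupported architecture: {machine}")

print(" ".join(npu_install_args("x86_64")))
```

In practice the argument would come from `platform.machine()`, and the command would be run from inside the cloned repository.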

--------------------------------

### Install Flash Attention and Xfuser

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Install the necessary libraries for multi-GPU parallel acceleration. Ensure flash-attn is installed without build isolation.

```shell
pip install flash-attn --no-build-isolation
pip install xfuser
```

--------------------------------

### Download Example Dataset

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/ERNIE-Image.md

Command to download the example image dataset for testing purposes.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --local_dir ./data/diffsynth_example_dataset
```

--------------------------------

### Command to Start Training

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Use this command to launch the training script after setting up the environment and code.

```bash
accelerate launch examples/qwen_image/model_training/special/simple/train.py
```

--------------------------------

### LTX-2 Video Synthesis Quick Start

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Load the LTX-2 model for video synthesis with VRAM management. This example uses repackaged model configurations for efficient memory usage. The model can run with as little as 8GB of VRAM.

```python
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
# use the repackaged modelconfig from "DiffSynth-Studio/LTX-2-Repackage" to avoid redundant model loading
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# use the following modelconfig if you want to initialize the model from the official checkpoints in "Lightricks/LTX-2"
# pipe = LTX2AudioVideoPipeline.from_pretrained(
#     torch_dtype=torch.bfloat16,
#     device="cuda",
#     model_configs=[
#         ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),

```

--------------------------------

### Quick Start FLUX.2 Image Generation

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/FLUX2.md

Load the FLUX.2-dev model and perform image inference using DiffSynth-Studio. This example demonstrates VRAM management and automatic model parameter loading. A minimum of 10GB VRAM is required.

```python
from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
```

--------------------------------

### Trajectory Imitation Distillation Training Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Z-Image.md

Examples for trajectory imitation distillation training, an experimental feature, are kept in the directory referenced below. The specific scripts are located inside that directory.

```bash
# No specific code provided, link to directory:
# examples/z_image/model_training/special/trajectory_imitation/
```

--------------------------------

### Wan2.1-VACE-1.3B-Preview Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-1.3B-Preview model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='iic/VACE-Wan2.1-1.3B-Preview', device='cpu') # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)

```

--------------------------------

### Wan2.1-Fun-1.3B-Control Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-Fun-1.3B-Control model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control video (low VRAM)
pipe = pipeline(Tasks.control_video, model='PAI/Wan2.1-Fun-1.3B-Control', device='cpu') # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)

```

--------------------------------

### Wan2.1-VACE-1.3B Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-1.3B model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='Wan-AI/Wan2.1-VACE-1.3B', device='cpu') # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)

```

--------------------------------

### FLUX.1-dev-InfiniteYou Model Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/FLUX.md

Example script for full model training with FLUX.1-dev-InfiniteYou.

```python
code
```

--------------------------------

### Wan2.1-VACE-14B Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-14B model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='Wan-AI/Wan2.1-VACE-14B', device='cpu') # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)

```

--------------------------------

### Set up and Launch Training Task

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Configure and launch a supervised fine-tuning training task. This involves initializing the `Accelerator`, defining the `Dataset`, instantiating the `TrainingModule`, setting up the `ModelLogger`, and finally calling `launch_training_task` with the appropriate parameters.

```python
if __name__ == "__main__":
    accelerator = accelerate.Accelerator(
        kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=True)],
    )
    dataset = UnifiedDataset(
        base_path="data/example_image_dataset",
        metadata_path="data/example_image_dataset/metadata.csv",
        repeat=50,
        data_file_keys="image",
        main_data_operator=UnifiedDataset.default_image_operator(
            base_path="data/example_image_dataset",
            height=512,
            width=512,
            height_division_factor=16,
            width_division_factor=16,
        )
    )
    model = QwenImageTrainingModule(accelerator.device)
    model_logger = ModelLogger(
        output_path="models/toy_model",
        remove_prefix_in_ckpt="pipe.dit.",
    )
    launch_training_task(
        accelerator, dataset, model, model_logger,
        learning_rate=1e-5, num_epochs=1,
    )
```

--------------------------------

### SDXL Inference with VRAM Management

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion-XL.md

Quick start example for loading the stabilityai/stable-diffusion-xl-base-1.0 model for inference. Requires a minimum of 6GB VRAM, with automatic parameter loading based on available memory.

```python
import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion_xl import StableDiffusionXLPipeline

vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionXLPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder_2/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer/"),
    tokenizer_2_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer_2/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    negative_prompt="",
    cfg_scale=5.0,
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")
```

--------------------------------

### Full Training of Qwen-Image Model with Accelerate

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Model_Training.md

Example command for fully training the Qwen-Image model using accelerate. It specifies a configuration file, dataset paths, model identifiers, learning rate, epochs, checkpoint prefix removal, output path, trainable models, and enables gradient checkpointing and finding unused parameters.

```shell
accelerate launch --config_file examples/qwen_image/model_training/full/accelerate_config_zero2offload.yaml examples/qwen_image/model_training/train.py \
  --dataset_base_path data/example_image_dataset \
  --dataset_metadata_path data/example_image_dataset/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-5 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/Qwen-Image_full" \
  --trainable_models "dit" \
  --use_gradient_checkpointing \
  --find_unused_parameters
```
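The `--model_id_with_origin_paths` value packs several model files into one string, and `--max_pixels 1048576` equals 1024 × 1024. The following is a minimal sketch of how such a string can be split into pairs. It assumes entries are comma-separated and each has the form `model_id:origin_file_pattern`, as the value above suggests. The training script's own parser may differ, and `parse_model_paths` is a hypothetical name.

```python
def parse_model_paths(spec):
    """Split 'model_id:pattern,model_id:pattern' into (model_id, pattern) pairs."""
    pairs = []
    for item in spec.split(","):
        # Split on the first colon only, so patterns may contain further colons.
        model_id, pattern = item.split(":", 1)
        pairs.append((model_id, pattern))
    return pairs

spec = (
    "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,"
    "Qwen/Qwen-Image:text_encoder/model*.safetensors,"
    "Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors"
)
for model_id, pattern in parse_model_paths(spec):
    print(model_id, "->", pattern)
```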

--------------------------------

### Training with Accelerate Launch

Source: https://context7.com/modelscope/diffsynth-studio/llms.txt

Example command for launching training scripts using HuggingFace Accelerate. This framework supports LoRA, full fine-tuning, gradient checkpointing, and DeepSpeed ZeRO.

```bash
# Training with accelerate launch: LoRA and full fine-tuning.
# DiffSynth-Studio's training framework uses HuggingFace Accelerate as its launcher
# and supports full fine-tuning, LoRA, gradient checkpointing, gradient accumulation,
# DeepSpeed ZeRO, and two-stage split training. Training scripts accept standardized
# CLI arguments for dataset, model loading, LoRA configuration, and output.
```

--------------------------------

### JoyAI-Image Quick Start Inference

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Downloads the example dataset, then loads the JoyAI-Image-Edit model with VRAM management for inference. Model parameter loading is controlled automatically based on available VRAM; a minimum of 4GB VRAM is required.

```python
from diffsynth.pipelines.joyai_image import JoyAIImagePipeline, ModelConfig
import torch
from PIL import Image
from modelscope import dataset_snapshot_download

# Download dataset
dataset_snapshot_download(
    dataset_id="DiffSynth-Studio/diffsynth_example_dataset",
    local_dir="data/diffsynth_example_dataset",
    allow_file_pattern="joyai_image/JoyAI-Image-Edit/*"
)

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = JoyAIImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="transformer/transformer.pth", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/model*.safetensors", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="vae/Wan2.1_VAE.pth", **vram_config),
    ],
    processor_config=ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
```

--------------------------------

### Text-to-Video Generation with WanVideoPipeline

Source: https://context7.com/modelscope/diffsynth-studio/llms.txt

Generate videos from text prompts using the `WanVideoPipeline`. This example configures VAE tiling and dynamic VRAM management for efficient processing, suitable for systems with around 12 GB of VRAM. It also demonstrates optional step-skip acceleration via `tea_cache_l1_thresh`.

```python
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
import torch

vram_cfg = dict(
    offload_dtype=torch.bfloat16, offload_device="cpu",
    onload_dtype=torch.bfloat16,  onload_device="cpu",
    preparing_dtype=torch.bfloat16, preparing_device="cuda",
    computation_dtype=torch.bfloat16, computation_device="cuda",
)
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B",
                    origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_cfg),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B",
                    origin_file_pattern="models/umt5-xxl/*.safetensors", **vram_cfg),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B",
                    origin_file_pattern="Wan2.1_VAE.pth"),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B",
                                  origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024**3) - 0.5,
)
frames = pipe(
    prompt="A majestic eagle soaring over a mountain range at sunset.",
    negative_prompt="blurry, low quality",
    height=480,
    width=832,
    num_frames=81,          # ~3.4 seconds at 24 fps
    num_inference_steps=50,
    cfg_scale=5.0,
    seed=42,
    tiled=True,             # VAE tiling to reduce peak VRAM
    tile_size=(30, 52),
    tile_stride=(15, 26),
    tea_cache_l1_thresh=0.1,   # optional step-skip acceleration
    tea_cache_model_id="Wan2.1-T2V-14B",
)
from diffsynth.utils.data import save_video
save_video(frames, "output.mp4", fps=24)
```
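The clip length follows directly from the frame count and the frame rate passed to `save_video`. The following is a minimal sketch of that arithmetic. The helper name `clip_duration_seconds` is illustrative.

```python
def clip_duration_seconds(num_frames, fps):
    """Playback duration of a clip in seconds."""
    return num_frames / fps

# num_frames=81 saved at fps=24, as in the example above.
print(clip_duration_seconds(81, 24))  # 3.375
```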

--------------------------------

### Load Model with Configuration

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Developer_Guide/Integrating_Your_Model.md

Demonstrates how to load a model using its configuration details, including hashing the model file, initializing the model class, loading the state dictionary, and applying any necessary state dict conversion. Ensure the `model_hash` matches the actual file hash.

```python
from diffsynth.core import hash_model_file, load_state_dict, skip_model_initialization
from diffsynth.models.qwen_image_text_encoder import QwenImageTextEncoder
from diffsynth.utils.state_dict_converters.qwen_image_text_encoder import QwenImageTextEncoderStateDictConverter
import torch

model_hash = "8004730443f55db63092006dd9f7110e"
model_name = "qwen_image_text_encoder"
model_class = QwenImageTextEncoder
state_dict_converter = QwenImageTextEncoderStateDictConverter
extra_kwargs = {}

model_path = [
    "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
]
if hash_model_file(model_path) == model_hash:
    with skip_model_initialization():
        model = model_class(**extra_kwargs)
    state_dict = load_state_dict(model_path, torch_dtype=torch.bfloat16, device="cuda")
    state_dict = state_dict_converter(state_dict)
    model.load_state_dict(state_dict, assign=True)
    print("Done!")
```

--------------------------------

### Quick Start WanVideo Pipeline

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Initialize and use the WanVideoPipeline for video generation. Ensure CUDA is available and specify model configurations.

```python
import torch
from diffsynth.utils.data import save_video
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth"),
    ],
)

video = pipe(
    prompt="纪实摄影风格画面，一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄，两只耳朵立起，神情专注而欢快。阳光洒在它身上，使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地，偶尔点缀着几朵野花，远处隐约可见蓝天和几片白云。透视感鲜明，捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
    seed=0, tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)
```

--------------------------------

### Initializing Qwen Image Pipeline for Training

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Load the Qwen Image pipeline and switch it to training mode with LoRA configuration. Ensure VRAM management is not enabled during this process.

```python
def __init__(self, device):
    super().__init__()
    # Load the pipeline
    self.pipe = QwenImagePipeline.from_pretrained(
        torch_dtype=torch.bfloat16,
        device=device,
        model_configs=[
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
        ],
        tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    )
    # Switch to training mode
    self.switch_pipe_to_training_mode(
        self.pipe,
        lora_base_model="dit",
        lora_target_modules="to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj",
        lora_rank=32,
    )
```
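To get a feel for what `lora_rank=32` costs, the sketch below counts the trainable parameters one adapter adds to a single linear layer. It assumes the standard LoRA factorization (a `rank x in` matrix and an `out x rank` matrix, no bias). The layer width used here is hypothetical and is not the real Qwen-Image dimension.

```python
def lora_params_per_layer(in_features, out_features, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Module names taken from the lora_target_modules string above.
target_modules = "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj".split(",")

hidden = 1024  # hypothetical width, for illustration only
per_layer = lora_params_per_layer(hidden, hidden, rank=32)
print(len(target_modules), per_layer)  # 6 65536
```

The total adapter size scales with how many layers in the base model match the target module names, which this sketch does not attempt to count.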

--------------------------------

### FLUX.2 Model Inference Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Example Python script for running inference with the FLUX.2 model.

```python
from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
```

--------------------------------

### Wan2.1-Fun-V1.1-1.3B-Control-Camera Model Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Shell script for full training of the Wan2.1-Fun-V1.1-1.3B-Control-Camera model.

```bash
accelerate launch --num_processes=8 --mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing=True --enable_xformers_memory_efficient_attention=True --fsdp=full_shard --fsdp_config=./fsdp_config.json ./train_full.py \
  --pretrained_model_name_or_path="PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
  --output_dir="./Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
  --dataset_name="/mnt/datasets/WanVideo/" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --num_train_epochs=10 \
  --learning_rate=1e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --checkpointing_steps=500 \
  --validation_steps=500 \
  --validation_image="./input.png" "./control_camera.mp4" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --gradient_accumulation_steps=1 \
  --fsdp="full_shard" \
  --fsdp_config=./fsdp_config.json
```

--------------------------------

### Install xDiT Dependency

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Accelerated_Inference.md

Install the xDiT dependency with flash-attn support for multi-GPU inference. Ensure version compatibility.

```bash
pip install "xfuser[flash-attn]>=0.4.3"
```
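
To confirm that the installed version satisfies the `>=0.4.3` requirement, a check along the following lines could be used. This is a minimal sketch using only the standard library, not part of DiffSynth-Studio or xDiT. The helper names `version_tuple` and `xfuser_is_compatible` are hypothetical, and the comparison is simplified: it looks only at the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata

MINIMUM_XFUSER = (0, 4, 3)  # taken from the pip requirement above


def version_tuple(version):
    """Turn '0.4.3' or '0.4.3.post1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def xfuser_is_compatible():
    """Return True or False, or None when xfuser is not installed."""
    try:
        installed = metadata.version("xfuser")
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= MINIMUM_XFUSER


print("xfuser compatible:", xfuser_is_compatible())
```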

--------------------------------

### Install USP Libraries for NPU

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/GPU_support.md

Install these third-party libraries to use the Unified Sequence Parallel (USP) feature on NPU.

```shell
pip install git+https://github.com/feifeibear/long-context-attention.git
pip install git+https://github.com/xdit-project/xDiT.git
```

--------------------------------

### Install DiffSynth from PyPI

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs the DiffSynth package from the Python Package Index. Note that PyPI releases may lag behind the latest source.

```bash
pip install diffsynth
```

--------------------------------

### JSON Metadata Format Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/API_Reference/core/data.md

Example of a JSON file for dataset metadata. Supports list data but has a larger memory footprint.

```json
[
    {
        "image": "image_1.jpg",
        "prompt": "a dog"
    },
    {
        "image": "image_2.jpg",
        "prompt": "a cat"
    }
]
```
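
A file in this format could be read with Python's standard `json` module, as in the minimal sketch below. This is not DiffSynth-Studio's own loader, and `load_json_metadata` is a hypothetical helper name. Note that `json.load` parses the entire file into memory at once, which is one plausible reason for the larger memory footprint.

```python
import json
import os
import tempfile


def load_json_metadata(path):
    """Hypothetical helper: parse the whole metadata file into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    return records


# Write the example records to a temporary file and read them back.
example = [
    {"image": "image_1.jpg", "prompt": "a dog"},
    {"image": "image_2.jpg", "prompt": "a cat"},
]
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = os.path.join(tmp, "metadata.json")
    with open(metadata_path, "w", encoding="utf-8") as f:
        json.dump(example, f)
    records = load_json_metadata(metadata_path)

for record in records:
    print(record["image"], "->", record["prompt"])
```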

--------------------------------

### Training Module and Pipeline Setup

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Research_Tutorial/train_from_scratch.md

Defines the training module, initializes the image pipeline with specified models and configurations, and sets up the scheduler.

```python
class AAATrainingModule(DiffusionTrainingModule):
    def __init__(self, device):
        super().__init__()
        self.pipe = AAAImagePipeline.from_pretrained(
            torch_dtype=torch.bfloat16,
            device=device,
            model_configs=[
                ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="model.safetensors"),
                ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
            ],
            tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
        )
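        # Attach a newly constructed DiT and train only that component;
        # every other model in the pipeline stays frozen.
        self.pipe.dit = AAADiT().to(dtype=torch.bfloat16, device=device)
        self.pipe.freeze_except(["dit"])
        self.pipe.scheduler.set_timesteps(1000, training=True)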

    def forward(self, data):
        inputs_posi = {"prompt": data["prompt"]}
        inputs_nega = {"negative_prompt": ""}
        inputs_shared = {
            "input_image": data["image"],
            "height": data["image"].size[1],
            "width": data["image"].size[0],
            "cfg_scale": 1,
            "use_gradient_checkpointing": False,
            "use_gradient_checkpointing_offload": False,
        }
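        # Run the pipeline units in order; each one updates the three input dicts.
        for unit in self.pipe.units:
            inputs_shared, inputs_posi, inputs_nega = self.pipe.unit_runner(unit, self.pipe, inputs_shared, inputs_posi, inputs_nega)
        # The loss uses only the shared and positive-prompt inputs.
        loss = FlowMatchSFTLoss(self.pipe, **inputs_shared, **inputs_posi)
        return loss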
```

--------------------------------

### CSV Metadata Format Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/API_Reference/core/data.md

Example of a CSV file for dataset metadata. Suitable for large datasets and simple data structures.

```csv
image,prompt
image_1.jpg,"a dog"
image_2.jpg,"a cat"
```
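
Because CSV is row-oriented, such a file can be consumed one record at a time. The sketch below uses Python's standard `csv` module and is not DiffSynth-Studio's own loader. `iter_csv_metadata` is a hypothetical helper name, and the CSV text is inlined only to keep the example self-contained.

```python
import csv
import io

# Same content as the example file above, inlined for a self-contained demo.
CSV_TEXT = 'image,prompt\nimage_1.jpg,"a dog"\nimage_2.jpg,"a cat"\n'


def iter_csv_metadata(stream):
    """Hypothetical helper: yield one {column: value} dict per data row."""
    for row in csv.DictReader(stream):
        yield dict(row)


rows = list(iter_csv_metadata(io.StringIO(CSV_TEXT)))
for row in rows:
    print(row["image"], "->", row["prompt"])
```

For a file on disk, the same helper could be given a handle opened with `open(path, newline="", encoding="utf-8")`.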

--------------------------------

### Quick Start FLUX Image Pipeline

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Demonstrates how to initialize and use the FluxImagePipeline for generating images with a specified prompt and seed. Ensure CUDA is available for GPU acceleration.

```python
import torch
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig

pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)

image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
```

--------------------------------

### Differential LoRA Training Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Z-Image.md

Differential LoRA training examples are located in the `examples/z_image/model_training/special/differential_training/` directory. See the scripts in that directory for runnable commands.

```bash
# No specific code provided, link to directory:
# examples/z_image/model_training/special/differential_training/
```

--------------------------------

### Wan2.1-Fun-V1.1-1.3B-Control-Camera Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Shell script for full training of the Wan2.1-Fun-V1.1-1.3B-Control-Camera model.

```bash
#!/bin/bash

# Example training command (replace with actual parameters)
# accelerate launch full_train.py \
#     --model_name_or_path "modelscope/Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
#     --output_dir "./output/Wan2.1-Fun-V1.1-1.3B-Control-Camera-full" \
#     --dataset_name "your_dataset" \
#     --resolution 512 \
#     --train_batch_size 1 \
#     --gradient_accumulation_steps 4 \
#     --learning_rate 1e-5 \
#     --num_train_epochs 10 \
#     --enable_xformers_memory_efficient_attention \
#     --gradient_checkpointing \
#     --mixed_precision "fp16"
```

--------------------------------

### End-to-end Direct Distillation Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Example code for end-to-end direct distillation. This technique distills knowledge from a teacher model to a student model directly.

```bash
cd /mnt/DiffSynth-Studio/examples/wanvideo/model_training/special/direct_distill/
accelerate launch --num_processes=8 --mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing=True --enable_cpu_offload=True /mnt/DiffSynth-Studio/examples/wanvideo/model_training/special/direct_distill/direct_distill.py \
    --model_max_length=40 \
    --pretrained_model_name_or_path=/mnt/models/Wan2.2-Fun-A14B-InP \
    --dataset_name=/mnt/datasets/wanvideo \
    --output_dir=/mnt/outputs/wanvideo/special/direct_distill \
    --resolution=512 \
    --train_batch_size=1 \
    --num_train_epochs=10 \
    --learning_rate=1e-05 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --max_grad_norm=1 \
    --checkpointing_steps=500 \
    --validation_steps=500 \
    --validation_guidance_scale=7.5 \
    --validation_num_frames=16 \
    --validation_fps=8 \
    --seed=42
```

--------------------------------

### Quick Start Video Generation with Wan2.1-T2V-1.3B

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Load the Wan2.1-T2V-1.3B model and run video inference. VRAM management is enabled, so model parameters are loaded and offloaded automatically according to the available VRAM. At least 8 GB of VRAM is required.

```python
import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
)

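# English gloss of the prompt: documentary photography style; a lively puppy runs swiftly
# across a lush green lawn. It has brownish-yellow fur, pricked-up ears, and a focused,
# joyful expression. Sunlight makes its fur look soft and shiny. The background is an open
# meadow dotted with wildflowers, with blue sky and a few white clouds in the distance.
# Strong perspective captures the motion of the run. Medium shot, side-on tracking view.
# English gloss of the negative prompt: garish colors, overexposed, static, blurry details,
# subtitles, style, artwork, painting, picture, still, overall gray, worst quality, low
# quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands,
# poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame,
# cluttered background, three legs, many people in the background, walking backwards.
video = pipe(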
    prompt="纪实摄影风格画面，一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄，两只耳朵立起，神情专注而欢快。阳光洒在它身上，使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地，偶尔点缀着几朵野花，远处隐约可见蓝天和几片白云。透视感鲜明，捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
    seed=0, tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)
```
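
The `vram_limit` argument above is derived from `torch.cuda.mem_get_info`, which returns a `(free, total)` pair in bytes, so index `1` is the total device memory. The arithmetic can be sketched without a GPU. `vram_limit_gib` is a hypothetical helper name, and reading the subtracted 2 as headroom left for other allocations is an interpretation rather than something the source states.

```python
def vram_limit_gib(total_bytes, reserve_gib):
    """Convert total device memory from bytes to GiB and subtract a reserve."""
    return total_bytes / (1024 ** 3) - reserve_gib


# Example: a hypothetical card reporting exactly 24 GiB of total memory.
total_bytes = 24 * 1024 ** 3
limit = vram_limit_gib(total_bytes, reserve_gib=2)
print(limit)  # 22.0
```

With a real device, `total_bytes` would be `torch.cuda.mem_get_info("cuda")[1]`, as in the quick start.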