### Install DiffSynth-Studio

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/ERNIE-Image.md

Clone the repository and install DiffSynth-Studio to use ERNIE-Image. Refer to Setup Dependencies for more details.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

--------------------------------

### Install DiffSynth-Studio from Source

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Recommended installation method. Clones the repository, navigates into the directory, and installs the package in editable mode.

```bash
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

--------------------------------

### Install DiffSynth-Studio with All Dependencies

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Inference_WebUI.md

Install DiffSynth-Studio in '[all]' mode to include all necessary dependencies for the Inference WebUI. This is the recommended installation method.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .[all]
```

--------------------------------

### Download Example Dataset for Stable Diffusion

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion.md

Use this command to download the example dataset required for Stable Diffusion training. Ensure you have the modelscope CLI installed.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "stable_diffusion/*" --local_dir ./data/diffsynth_example_dataset
```

--------------------------------

### Download Example Dataset for Stable Diffusion XL

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion-XL.md

Use this command to download the example dataset required for Stable Diffusion XL training.
Ensure the dataset name and include path are correct for your setup.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "stable_diffusion_xl/*" --local_dir ./data/diffsynth_example_dataset
```

--------------------------------

### Quick Start: Qwen-Image Inference

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Qwen-Image.md

Load the Qwen-Image model and perform inference using DiffSynth-Studio. This example demonstrates VRAM management, automatically controlling model parameter loading based on available VRAM. A minimum of 8GB VRAM is required.

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "精致肖像,水下少女,蓝裙飘逸,发丝轻扬,光影透澈,气泡环绕,面容恬静,细节精致,梦幻唯美。"
image = pipe(prompt, seed=0, num_inference_steps=40)
image.save("image.jpg")
```

--------------------------------

### Qwen-Image Pipeline Quick Start

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Use this pipeline for generating images with the Qwen-Image model.
Ensure you have the necessary libraries installed and specify the correct model configurations and device.

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
prompt = "精致肖像,水下少女,蓝裙飘逸,发丝轻扬,光影透澈,气泡环绕,面容恬静,细节精致,梦幻唯美。"
image = pipe(
    prompt,
    seed=0,
    num_inference_steps=40,
    # edit_image=Image.open("xxx.jpg").resize((1328, 1328))  # For Qwen-Image-Edit
)
image.save("image.jpg")
```

--------------------------------

### Install DiffSynth Studio with Ascend NPU Support (ARM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs DiffSynth Studio from source with NPU support for aarch64/ARM architectures. Requires prior installation of CANN.

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
# aarch64/ARM
pip install -e .[npu_aarch64]
```

--------------------------------

### Install DiffSynth Studio with Ascend NPU Support (x86)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs DiffSynth Studio from source with NPU support for x86 architectures. Requires prior installation of CANN and uses a CPU-based PyTorch index.
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
# x86
pip install -e .[npu] --extra-index-url "https://download.pytorch.org/whl/cpu"
```

--------------------------------

### Install Flash Attention and Xfuser

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Install the necessary libraries for multi-GPU parallel acceleration. Ensure flash-attn is installed without build isolation.

```shell
pip install flash-attn --no-build-isolation
pip install xfuser
```

--------------------------------

### Download Example Dataset

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/ERNIE-Image.md

Command to download the example image dataset for testing purposes.

```shell
modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --local_dir ./data/diffsynth_example_dataset
```

--------------------------------

### Command to Start Training

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Use this command to launch the training script after setting up the environment and code.

```bash
accelerate launch examples/qwen_image/model_training/special/simple/train.py
```

--------------------------------

### LTX-2 Video Synthesis Quick Start

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Load the LTX-2 model for video synthesis with VRAM management. This example uses repackaged model configurations for efficient memory usage. The model can run with as little as 8GB of VRAM.
```python
import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
# use the repackaged modelconfig from "DiffSynth-Studio/LTX-2-Repackage" to avoid redundant model loading
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
# use the following modelconfig if you want to initialize model from official checkpoints from "Lightricks/LTX-2"
# pipe = LTX2AudioVideoPipeline.from_pretrained(
#     torch_dtype=torch.bfloat16,
#     device="cuda",
#     model_configs=[
#         ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
```

--------------------------------

### Quick Start FLUX.2 Image Generation

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/FLUX2.md

Load the FLUX.2-dev model and perform image inference using DiffSynth-Studio. This example demonstrates VRAM management and automatic model parameter loading. A minimum of 10GB VRAM is required.

```python
from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
```

--------------------------------

### Trajectory Imitation Distillation Training Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Z-Image.md

This is a link to a directory containing examples for trajectory imitation distillation training, an experimental feature. Specific scripts are located within this directory.

```bash
# No specific code provided, link to directory:
# examples/z_image/model_training/special/trajectory_imitation/
```

--------------------------------

### Wan2.1-VACE-1.3B-Preview Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-1.3B-Preview model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='iic/VACE-Wan2.1-1.3B-Preview', device='cpu')  # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)
```

--------------------------------

### Wan2.1-Fun-1.3B-Control Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-Fun-1.3B-Control model optimized for low VRAM environments.
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control video (low VRAM)
pipe = pipeline(Tasks.control_video, model='PAI/Wan2.1-Fun-1.3B-Control', device='cpu')  # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)
```

--------------------------------

### Wan2.1-VACE-1.3B Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-1.3B model optimized for low VRAM environments.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='Wan-AI/Wan2.1-VACE-1.3B', device='cpu')  # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)
```

--------------------------------

### FLUX.1-dev-InfiniteYou Model Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/FLUX.md

Example script for full model training with FLUX.1-dev-InfiniteYou.

```python
code
```

--------------------------------

### Wan2.1-VACE-14B Model Inference (Low VRAM)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Inference example for the Wan2.1-VACE-14B model optimized for low VRAM environments.
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for video generation with control and reference image (low VRAM)
pipe = pipeline(Tasks.control_video, model='Wan-AI/Wan2.1-VACE-14B', device='cpu')  # or specify a GPU with less memory

# Define input data
input_data = {
    'control_video': 'path/to/your/control_video.mp4',
    'reference_image': 'path/to/your/reference_image.png',
    'text': 'a dog running'
}

# Perform inference
output = pipe(input_data)

# Save the generated video
with open('output_low_vram.mp4', 'wb') as f:
    f.write(output)
```

--------------------------------

### Set up and Launch Training Task

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Configure and launch a supervised fine-tuning training task. This involves initializing the `Accelerator`, defining the `Dataset`, instantiating the `TrainingModule`, setting up the `ModelLogger`, and finally calling `launch_training_task` with the appropriate parameters.
```python
if __name__ == "__main__":
    accelerator = accelerate.Accelerator(
        kwargs_handlers=[accelerate.DistributedDataParallelKwargs(find_unused_parameters=True)],
    )
    dataset = UnifiedDataset(
        base_path="data/example_image_dataset",
        metadata_path="data/example_image_dataset/metadata.csv",
        repeat=50,
        data_file_keys="image",
        main_data_operator=UnifiedDataset.default_image_operator(
            base_path="data/example_image_dataset",
            height=512,
            width=512,
            height_division_factor=16,
            width_division_factor=16,
        )
    )
    model = QwenImageTrainingModule(accelerator.device)
    model_logger = ModelLogger(
        output_path="models/toy_model",
        remove_prefix_in_ckpt="pipe.dit.",
    )
    launch_training_task(
        accelerator, dataset, model, model_logger,
        learning_rate=1e-5, num_epochs=1,
    )
```

--------------------------------

### SDXL Inference with VRAM Management

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Stable-Diffusion-XL.md

Quick start example for loading the stabilityai/stable-diffusion-xl-base-1.0 model for inference. Requires a minimum of 6GB VRAM, with automatic parameter loading based on available memory.
```python
import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion_xl import StableDiffusionXLPipeline

vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionXLPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder_2/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer/"),
    tokenizer_2_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer_2/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    negative_prompt="",
    cfg_scale=5.0,
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")
```

--------------------------------

### Full Training of Qwen-Image Model with Accelerate

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Model_Training.md

Example command for fully training the Qwen-Image model using accelerate.
It specifies a configuration file, dataset paths, model identifiers, learning rate, epochs, checkpoint prefix removal, output path, trainable models, and enables gradient checkpointing and finding unused parameters.

```shell
accelerate launch --config_file examples/qwen_image/model_training/full/accelerate_config_zero2offload.yaml examples/qwen_image/model_training/train.py \
  --dataset_base_path data/example_image_dataset \
  --dataset_metadata_path data/example_image_dataset/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-5 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/Qwen-Image_full" \
  --trainable_models "dit" \
  --use_gradient_checkpointing \
  --find_unused_parameters
```

--------------------------------

### Training with Accelerate Launch

Source: https://context7.com/modelscope/diffsynth-studio/llms.txt

Example command for launching training scripts using HuggingFace Accelerate. This framework supports LoRA, full fine-tuning, gradient checkpointing, and DeepSpeed ZeRO.

```bash
# Training with accelerate launch — LoRA and full fine-tuning
# DiffSynth-Studio's training framework uses HuggingFace Accelerate as its launcher and supports full fine-tuning, LoRA, gradient checkpointing, gradient accumulation, DeepSpeed ZeRO, and two-stage split training.
# Training scripts accept standardized CLI arguments for dataset, model loading, LoRA configuration, and output.
```

--------------------------------

### JoyAI-Image Quick Start Inference

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Loads the JoyAI-Image-Edit model with VRAM management for inference. Requires a minimum of 4GB VRAM. Automatically controls model parameter loading based on available VRAM.
Downloads example dataset.

```python
from diffsynth.pipelines.joyai_image import JoyAIImagePipeline, ModelConfig
import torch
from PIL import Image
from modelscope import dataset_snapshot_download

# Download dataset
dataset_snapshot_download(
    dataset_id="DiffSynth-Studio/diffsynth_example_dataset",
    local_dir="data/diffsynth_example_dataset",
    allow_file_pattern="joyai_image/JoyAI-Image-Edit/*"
)

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = JoyAIImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="transformer/transformer.pth", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/model*.safetensors", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="vae/Wan2.1_VAE.pth", **vram_config),
    ],
    processor_config=ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
```

--------------------------------

### Text-to-Video Generation with WanVideoPipeline

Source: https://context7.com/modelscope/diffsynth-studio/llms.txt

Generate videos from text prompts using the `WanVideoPipeline`. This example configures VAE tiling and dynamic VRAM management for efficient processing, suitable for systems with around 12 GB of VRAM. It also demonstrates optional step-skip acceleration via `tea_cache_l1_thresh`.
```python
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from diffsynth.utils.data import save_video  # same import path as the WanVideo quick start in this section
import torch

vram_cfg = dict(
    offload_dtype=torch.bfloat16,
    offload_device="cpu",
    onload_dtype=torch.bfloat16,
    onload_device="cpu",
    preparing_dtype=torch.bfloat16,
    preparing_device="cuda",
    computation_dtype=torch.bfloat16,
    computation_device="cuda",
)
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_cfg),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="models/umt5-xxl/*.safetensors", **vram_cfg),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="Wan2.1_VAE.pth"),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-14B", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024**3) - 0.5,
)
frames = pipe(
    prompt="A majestic eagle soaring over a mountain range at sunset.",
    negative_prompt="blurry, low quality",
    height=480,
    width=832,
    num_frames=81,  # ~3 seconds at 24 fps
    num_inference_steps=50,
    cfg_scale=5.0,
    seed=42,
    tiled=True,  # VAE tiling to reduce peak VRAM
    tile_size=(30, 52),
    tile_stride=(15, 26),
    tea_cache_l1_thresh=0.1,  # optional step-skip acceleration
    tea_cache_model_id="Wan2.1-T2V-14B",
)
save_video(frames, "output.mp4", fps=24)
```

--------------------------------

### Load Model with Configuration

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Developer_Guide/Integrating_Your_Model.md

Demonstrates how to load a model using its configuration details, including hashing the model file, initializing the model class, loading the state dictionary, and applying any necessary state dict conversion. Ensure the `model_hash` matches the actual file hash.
```python
from diffsynth.core import hash_model_file, load_state_dict, skip_model_initialization
from diffsynth.models.qwen_image_text_encoder import QwenImageTextEncoder
from diffsynth.utils.state_dict_converters.qwen_image_text_encoder import QwenImageTextEncoderStateDictConverter
import torch

model_hash = "8004730443f55db63092006dd9f7110e"
model_name = "qwen_image_text_encoder"
model_class = QwenImageTextEncoder
state_dict_converter = QwenImageTextEncoderStateDictConverter
extra_kwargs = {}

model_path = [
    "models/Qwen/Qwen-Image/text_encoder/model-00001-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00002-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00003-of-00004.safetensors",
    "models/Qwen/Qwen-Image/text_encoder/model-00004-of-00004.safetensors",
]
if hash_model_file(model_path) == model_hash:
    with skip_model_initialization():
        model = model_class(**extra_kwargs)
    state_dict = load_state_dict(model_path, torch_dtype=torch.bfloat16, device="cuda")
    state_dict = state_dict_converter(state_dict)
    model.load_state_dict(state_dict, assign=True)
    print("Done!")
```

--------------------------------

### Quick Start WanVideo Pipeline

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Initialize and use the WanVideoPipeline for video generation. Ensure CUDA is available and specify model configurations.
```python
import torch
from diffsynth.utils.data import save_video
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth"),
    ],
)
video = pipe(
    prompt="纪实摄影风格画面,一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄,两只耳朵立起,神情专注而欢快。阳光洒在它身上,使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地,偶尔点缀着几朵野花,远处隐约可见蓝天和几片白云。透视感鲜明,捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
    seed=0,
    tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)
```

--------------------------------

### Initializing Qwen Image Pipeline for Training

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Training/Supervised_Fine_Tuning.md

Load the Qwen Image pipeline and switch it to training mode with LoRA configuration. Ensure VRAM management is not enabled during this process.
```python
def __init__(self, device):
    super().__init__()
    # Load the pipeline
    self.pipe = QwenImagePipeline.from_pretrained(
        torch_dtype=torch.bfloat16,
        device=device,
        model_configs=[
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
            ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
        ],
        tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    )
    # Switch to training mode
    self.switch_pipe_to_training_mode(
        self.pipe,
        lora_base_model="dit",
        lora_target_modules="to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj",
        lora_rank=32,
    )
```

--------------------------------

### FLUX.2 Model Inference Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Example Python script for running inference with the FLUX.2 model.

```python
from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
```

--------------------------------

### Wan2.1-Fun-V1.1-1.3B-Control-Camera Model Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/README.md

Shell script for full training of the Wan2.1-Fun-V1.1-1.3B-Control-Camera model.

```bash
accelerate launch --num_processes=8 --mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing=True --enable_xformers_memory_efficient_attention=True --fsdp=full_shard --fsdp_config=./fsdp_config.json ./train_full.py \
  --pretrained_model_name_or_path="PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
  --output_dir="./Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
  --dataset_name="/mnt/datasets/WanVideo/" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --num_train_epochs=10 \
  --learning_rate=1e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --checkpointing_steps=500 \
  --validation_steps=500 \
  --validation_image="./input.png" "./control_camera.mp4" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --gradient_accumulation_steps=1 \
  --fsdp="full_shard" \
  --fsdp_config=./fsdp_config.json
```

--------------------------------

### Install xDiT Dependency

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Accelerated_Inference.md

Install the xDiT dependency with flash-attn support for multi-GPU inference. Ensure version compatibility.

```bash
pip install "xfuser[flash-attn]>=0.4.3"
```

--------------------------------

### Install USP Libraries for NPU

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/GPU_support.md

Install these third-party libraries to use the Unified Sequence Parallel (USP) feature on NPU.
```shell
pip install git+https://github.com/feifeibear/long-context-attention.git
pip install git+https://github.com/xdit-project/xDiT.git
```

--------------------------------

### Install DiffSynth from PyPI

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Pipeline_Usage/Setup.md

Installs the DiffSynth package from the Python Package Index. Note that PyPI versions may have delays in updates.

```bash
pip install diffsynth
```

--------------------------------

### JSON Metadata Format Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/API_Reference/core/data.md

Example of a JSON file for dataset metadata. Supports list data but has a larger memory footprint.

```json
[
    {
        "image": "image_1.jpg",
        "prompt": "a dog"
    },
    {
        "image": "image_2.jpg",
        "prompt": "a cat"
    }
]
```

--------------------------------

### Training Module and Pipeline Setup

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Research_Tutorial/train_from_scratch.md

Defines the training module, initializes the image pipeline with specified models and configurations, and sets up the scheduler.
```python
# Assumes torch, DiffusionTrainingModule, ModelConfig, FlowMatchSFTLoss,
# AAAImagePipeline and AAADiT are already imported or defined earlier in the tutorial.
class AAATrainingModule(DiffusionTrainingModule):
    def __init__(self, device):
        super().__init__()
        # Load the pretrained text encoder and VAE; the DiT is created from scratch below.
        self.pipe = AAAImagePipeline.from_pretrained(
            torch_dtype=torch.bfloat16,
            device=device,
            model_configs=[
                ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="model.safetensors"),
                ModelConfig(model_id="black-forest-labs/FLUX.2-klein-4B", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
            ],
            tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
        )
        self.pipe.dit = AAADiT().to(dtype=torch.bfloat16, device=device)
        # Train only the DiT; every other component stays frozen.
        self.pipe.freeze_except(["dit"])
        self.pipe.scheduler.set_timesteps(1000, training=True)

    def forward(self, data):
        inputs_posi = {"prompt": data["prompt"]}
        inputs_nega = {"negative_prompt": ""}
        inputs_shared = {
            "input_image": data["image"],
            # The image size is read as (width, height).
            "height": data["image"].size[1],
            "width": data["image"].size[0],
            "cfg_scale": 1,
            "use_gradient_checkpointing": False,
            "use_gradient_checkpointing_offload": False,
        }
        # Run each pipeline unit in order to prepare the model inputs.
        for unit in self.pipe.units:
            inputs_shared, inputs_posi, inputs_nega = self.pipe.unit_runner(unit, self.pipe, inputs_shared, inputs_posi, inputs_nega)
        loss = FlowMatchSFTLoss(self.pipe, **inputs_shared, **inputs_posi)
        return loss
```

--------------------------------

### CSV Metadata Format Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/API_Reference/core/data.md

Example of a CSV file for dataset metadata. Suitable for large datasets and simple data structures.

```csv
image,prompt
image_1.jpg,"a dog"
image_2.jpg,"a cat"
```

--------------------------------

### Quick Start FLUX Image Pipeline

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Demonstrates how to initialize and use the FluxImagePipeline for generating images with a specified prompt and seed. The example loads the pipeline with `device="cuda"`, so a CUDA-capable GPU is required.
```python
import torch
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig

pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)

image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
```

--------------------------------

### Differential LoRA Training Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Z-Image.md

The differential LoRA training examples are in the directory referenced below; the individual scripts are located inside it.

```bash
# No specific code provided, link to directory:
# examples/z_image/model_training/special/differential_training/
```

--------------------------------

### Wan2.1-Fun-V1.1-1.3B-Control-Camera Training (Full)

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Shell script for full training of the Wan2.1-Fun-V1.1-1.3B-Control-Camera model.
```bash
#!/bin/bash
# Example training command (replace with actual parameters)
# accelerate launch full_train.py \
#   --model_name_or_path "modelscope/Wan2.1-Fun-V1.1-1.3B-Control-Camera" \
#   --output_dir "./output/Wan2.1-Fun-V1.1-1.3B-Control-Camera-full" \
#   --dataset_name "your_dataset" \
#   --resolution 512 \
#   --train_batch_size 1 \
#   --gradient_accumulation_steps 4 \
#   --learning_rate 1e-5 \
#   --num_train_epochs 10 \
#   --enable_xformers_memory_efficient_attention \
#   --gradient_checkpointing \
#   --mixed_precision "fp16"
```

--------------------------------

### End-to-end Direct Distillation Example

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Overview.md

Example code for end-to-end direct distillation. This technique distills knowledge from a teacher model to a student model directly.

```bash
cd /mnt/DiffSynth-Studio/examples/wanvideo/model_training/special/direct_distill/
accelerate launch --num_processes=8 --mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing=True --enable_cpu_offload=True /mnt/DiffSynth-Studio/examples/wanvideo/model_training/special/direct_distill/direct_distill.py \
  --model_max_length=40 \
  --pretrained_model_name_or_path=/mnt/models/Wan2.2-Fun-A14B-InP \
  --dataset_name=/mnt/datasets/wanvideo \
  --output_dir=/mnt/outputs/wanvideo/special/direct_distill \
  --resolution=512 \
  --train_batch_size=1 \
  --num_train_epochs=10 \
  --learning_rate=1e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --checkpointing_steps=500 \
  --validation_steps=500 \
  --validation_guidance_scale=7.5 \
  --validation_num_frames=16 \
  --validation_fps=8 \
  --seed=42
```

--------------------------------

### Quick Start Video Generation with Wan2.1-T2V-1.3B

Source: https://github.com/modelscope/diffsynth-studio/blob/main/docs/en/Model_Details/Wan.md

Load the Wan2.1-T2V-1.3B model and perform video inference.
VRAM management is enabled, automatically controlling model parameter loading based on available VRAM. A minimum of 8GB VRAM is required.

```python
import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

# Offload weights to disk, stage them on the CPU, and compute on the GPU in bfloat16.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
    # Total GPU memory in GiB, minus 2 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
)

video = pipe(
    # Prompt (English gloss): documentary-style shot of a lively brownish-yellow puppy
    # running across sunlit green grass, wildflowers and blue sky behind; medium shot, side tracking view.
    prompt="纪实摄影风格画面,一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄,两只耳朵立起,神情专注而欢快。阳光洒在它身上,使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地,偶尔点缀着几朵野花,远处隐约可见蓝天和几片白云。透视感鲜明,捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    # Negative prompt (English gloss): garish colors, overexposure, static or blurry frames, subtitles,
    # low quality, JPEG artifacts, deformed hands, faces or limbs, cluttered background, walking backwards.
    negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
    seed=0, tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)
```
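
The `vram_limit` argument in the quick start above takes the total device memory reported by `torch.cuda.mem_get_info` (in bytes), converts it to GiB, and subtracts a fixed headroom. The arithmetic can be sketched as a standalone helper. The function name `vram_limit_gib` and the clamp at zero are assumptions made for illustration; they are not part of DiffSynth-Studio.

```python
def vram_limit_gib(total_bytes: int, headroom_gib: float = 2.0) -> float:
    """Convert a total-memory byte count to GiB and reserve some headroom.

    Hypothetical helper: mirrors the expression
    `total_bytes / (1024 ** 3) - headroom` used in the quick start.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    total_gib = total_bytes / (1024 ** 3)
    # Clamp at zero so a very small device never yields a negative limit
    # (an assumption of this sketch, not behavior confirmed by the library).
    return max(total_gib - headroom_gib, 0.0)


if __name__ == "__main__":
    # Illustrative value only: a device reporting 8 GiB of total memory.
    print(vram_limit_gib(8 * 1024 ** 3))
```

With PyTorch and a CUDA device available, the byte count would come from `torch.cuda.mem_get_info("cuda")[1]`, as in the quick start; the headroom (2 GiB here, 0.5 GiB in the Qwen-Image example) is a tuning choice.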