### Run CogVideoX Full Fine-tuning Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command starts CogVideoX full fine-tuning. It points to the base CogVideoX-2B configuration and to `sft.yaml` for supervised fine-tuning.

```bash
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```

--------------------------------

### Start CogVideoX Fine-tuning

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

These commands run the fine-tuning process for CogVideoX. Use `finetune_single_gpu.sh` for single-GPU training and `finetune_multi_gpus.sh` for multi-GPU training.

```bash
bash finetune_single_gpu.sh
```

```bash
bash finetune_multi_gpus.sh
```

--------------------------------

### Run CogVideoX Lora Fine-tuning Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command starts CogVideoX LoRA fine-tuning. As written it launches 8 processes per node (`--nproc_per_node=8`), so adjust that value to the number of GPUs available. It specifies the LoRA model configuration and the `sft.yaml` configuration file.

```bash
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```

--------------------------------

### Install CogVideoX Dependencies

Source: https://github.com/thudm/cogvideo/blob/main/tools/venhancer/README.md

Installs the Python packages required to run the CogVideoX project from the provided requirements file.

```shell
pip install -r requirements.txt
```

--------------------------------

### Install VEnhancer Environment

Source: https://github.com/thudm/cogvideo/blob/main/tools/venhancer/README.md

Commands to clone the VEnhancer repository and install the necessary Python dependencies for the environment.
```shell git clone https://github.com/Vchitect/VEnhancer.git cd VEnhancer pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 ``` -------------------------------- ### SAT Training Configuration Source: https://context7.com/thudm/cogvideo/llms.txt Example YAML configuration for fine-tuning using the SAT framework. Specifies model parameters, data paths, and training settings like mixed precision. ```yaml # configs/sft.yaml - Training configuration args: model_parallel_size: 1 experiment_name: lora-custom-style mode: finetune load: "path/to/CogVideoX-2b-sat/transformer" no_load_rng: True train_iters: 1000 eval_iters: 1 eval_interval: 100 eval_batch_size: 1 save: ckpts save_interval: 100 log_interval: 20 train_data: ["path/to/train/data"] valid_data: ["path/to/val/data"] split: 1,0,0 num_workers: 8 force_train: True only_log_video_latents: True deepspeed: bf16: enabled: True # For CogVideoX-5B fp16: enabled: False ``` -------------------------------- ### Launch Gradio Web Interface for Video Generation Source: https://context7.com/thudm/cogvideo/llms.txt Initializes a Gradio web UI to interact with the CogVideoX pipeline. It includes parameter tuning for inference steps and guidance scale, and supports prompt enhancement. 
```python import gradio as gr import torch from diffusers import CogVideoXPipeline from diffusers.utils import export_to_video # Initialize pipeline pipe = CogVideoXPipeline.from_pretrained( "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16 ).to("cuda") pipe.vae.enable_slicing() pipe.vae.enable_tiling() def generate_video(prompt, num_steps, guidance_scale): video = pipe( prompt=prompt, num_videos_per_prompt=1, num_inference_steps=int(num_steps), num_frames=49, guidance_scale=guidance_scale, ).frames[0] video_path = "gradio_output.mp4" export_to_video(video, video_path) return video_path # Create Gradio interface with gr.Blocks() as demo: gr.Markdown("# CogVideoX Video Generator") with gr.Row(): with gr.Column(): prompt = gr.Textbox( label="Prompt", placeholder="Describe your video...", lines=3 ) num_steps = gr.Slider(10, 100, value=50, label="Inference Steps") guidance = gr.Slider(1.0, 15.0, value=6.0, label="Guidance Scale") generate_btn = gr.Button("Generate Video") with gr.Column(): video_output = gr.Video(label="Generated Video") generate_btn.click( generate_video, inputs=[prompt, num_steps, guidance], outputs=video_output ) demo.launch() ``` ```bash # Run with OpenAI prompt enhancement OPENAI_API_KEY=your_key python inference/gradio_web_demo.py # Or run the full composite demo with I2V and V2V support python inference/gradio_composite_demo/app.py ``` -------------------------------- ### Execute Training Commands Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md Commands to initiate training using torchrun for either Lora or full fine-tuning, and shell scripts to run training on single or multiple GPUs. 
```bash run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM" run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM" bash finetune_single_gpu.sh bash finetune_multi_gpus.sh ``` -------------------------------- ### Execute SFT Fine-tuning Scripts Source: https://github.com/thudm/cogvideo/blob/main/finetune/README.md Commands to initiate SFT fine-tuning for Text-to-Video (T2V) and Image-to-Video (I2V) tasks. These scripts require matching configurations in the DeepSpeed zero configuration files. ```bash bash train_zero_t2v.sh bash train_zero_i2v.sh ``` -------------------------------- ### CLI Inference Demonstration for CogVideoX Source: https://github.com/thudm/cogvideo/blob/main/README.md Provides a detailed CLI interface for running CogVideoX inference. It explains common parameters and configuration options for generating videos from text. ```python # Example usage of CLI inference from inference.cli_demo import run_inference run_inference(prompt="A futuristic city skyline at sunset", model_path="cogvideox-5b") ``` -------------------------------- ### Configure Fine-tuning Parameters Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md The sft.yaml file defines training parameters for fine-tuning. It supports model parallelization, iteration counts, and hardware-specific precision settings like fp16 or bf16. ```yaml model_parallel_size: 1 experiment_name: lora-disney mode: finetune load: "{your_CogVideoX-2b-sat_path}/transformer" train_iters: 1000 save: ckpts deepspeed: bf16: enabled: False fp16: enabled: True ``` -------------------------------- ### Run Inference with Fine-tuned CogVideo Model Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md This snippet shows how to modify the inference configuration file 'inference.sh' to use the fine-tuned CogVideo model. 
It sets up the environment and specifies the Python script and configuration files for running the sample video generation. The output is a generated video based on the provided parameters.

```bash
run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
```

```bash
bash inference.sh
```

--------------------------------

### Configure Fine-tuning Parameters

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

YAML configuration for full-parameter fine-tuning, specifying training iterations, data paths, and hardware acceleration settings like DeepSpeed.

```yaml
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
no_load_rng: True
train_iters: 1000
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts
save_interval: 100
log_interval: 20
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ]
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
deepspeed:
  bf16:
    enabled: False
  fp16:
    enabled: True
```

--------------------------------

### Configure CogVideoX Lora Fine-tuning (sft.yaml)

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This YAML snippet configures LoRA fine-tuning of CogVideoX. It sets LoRA-specific parameters such as the rank `r` and the target mixin, and is used together with the `sft.yaml` shown for full fine-tuning.

```yaml
model:
  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ]
  log_keys:
    - txt
  lora_config:
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```

--------------------------------

### Quantized Model Inference

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Demonstrates how to run inference on memory-constrained devices using quantized models. This script can be adapted to support FP8 precision for reduced memory usage.
```python # Example of running quantized inference from inference.cli_demo_quantization import run_quantized run_quantized(model_path="path/to/quantized/model", precision="int8") ``` -------------------------------- ### Video-to-Video Transformation with CogVideoX V2V Pipeline Source: https://context7.com/thudm/cogvideo/llms.txt Transforms an existing video based on a new text prompt using the CogVideoX Video-to-Video pipeline. It preserves the original motion and structure while applying the style or content described in the prompt. Requires loading a source video and specifying generation parameters. ```python import torch from diffusers import CogVideoXVideoToVideoPipeline, CogVideoXDPMScheduler from diffusers.utils import export_to_video, load_video # Load V2V pipeline pipe = CogVideoXVideoToVideoPipeline.from_pretrained( "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16 ) pipe.scheduler = CogVideoXDPMScheduler.from_config( pipe.scheduler.config, timestep_spacing="trailing" ) pipe.enable_sequential_cpu_offload() pipe.vae.enable_slicing() pipe.vae.enable_tiling() # Load source video video = load_video("source_video.mp4") # Transform video with new style/content video_frames = pipe( prompt="A cyberpunk cityscape with neon lights reflecting on wet streets, rain falling.", video=video, num_frames=49, # CogVideoX-5B uses 49 frames for 6 seconds at 8fps height=480, width=720, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=True, generator=torch.Generator().manual_seed(42), ).frames[0] export_to_video(video_frames, "video_to_video.mp4", fps=8) ``` -------------------------------- ### Image-to-Video Generation with CogVideoX I2V Pipeline Source: https://context7.com/thudm/cogvideo/llms.txt Converts a static image into a video based on a text prompt using the CogVideoX Image-to-Video pipeline. It animates the input image while maintaining visual consistency with the prompt. Requires loading an image and specifying generation parameters. 
```python import torch from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler from diffusers.utils import export_to_video, load_image # Load I2V pipeline pipe = CogVideoXImageToVideoPipeline.from_pretrained( "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16 ) pipe.scheduler = CogVideoXDPMScheduler.from_config( pipe.scheduler.config, timestep_spacing="trailing" ) pipe.enable_sequential_cpu_offload() pipe.vae.enable_slicing() pipe.vae.enable_tiling() # Load reference image image = load_image("path/to/your/image.png") # Generate video from image video_frames = pipe( prompt="The cat slowly turns its head and blinks while the wind gently rustles its fur.", image=image, num_frames=81, height=768, # CogVideoX1.5-5B-I2V supports custom resolutions width=1360, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=True, generator=torch.Generator().manual_seed(42), ).frames[0] export_to_video(video_frames, "image_to_video.mp4", fps=16) ``` -------------------------------- ### Configure CogVideoX Full Fine-tuning (sft.yaml) Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md This YAML configuration is used for full-parameter fine-tuning of the CogVideoX model. It specifies training parameters, data paths, and DeepSpeed settings. Ensure 'bf16' and 'fp16' are set appropriately for CogVideoX-2B or CogVideoX-5B. 
```yaml
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
no_load_rng: True
train_iters: 1000
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts
save_interval: 100
log_interval: 20
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ]
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
deepspeed:
  bf16:
    enabled: False
  fp16:
    enabled: True
```

--------------------------------

### Execute Inference Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

Run the bash script to start video generation based on the configuration defined in the YAML file.

```bash
bash inference.sh
```

--------------------------------

### Execute LoRA Fine-tuning Scripts

Source: https://github.com/thudm/cogvideo/blob/main/finetune/README.md

Commands to start LoRA fine-tuning for Text-to-Video (T2V) and Image-to-Video (I2V) tasks. Modify the configuration parameters inside the respective bash scripts before running them.

```bash
bash train_ddp_t2v.sh
bash train_ddp_i2v.sh
```

--------------------------------

### Fine-tune CogVideoX with SAT Framework

Source: https://context7.com/thudm/cogvideo/llms.txt

Fine-tune CogVideoX using the Swiss Army Transformer (SAT) framework for full-model or LoRA training. Supports distributed training and requires a specific dataset structure with labels and videos.

```bash
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
```

```bash
cd sat
bash finetune_single_gpu.sh
```

```bash
bash finetune_multi_gpus.sh
```

--------------------------------

### Text-to-Video Generation with CogVideoX Pipeline

Source: https://context7.com/thudm/cogvideo/llms.txt

Generates videos from text prompts using the CogVideoX pipeline. It supports multiple model variants and integrates with the diffusers library for inference.
Key parameters include prompt, number of frames, height, width, inference steps, and guidance scale. ```python import torch from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler from diffusers.utils import export_to_video # Load the pipeline pipe = CogVideoXPipeline.from_pretrained( "THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16 ) # Configure scheduler (DPM recommended for 5B models) pipe.scheduler = CogVideoXDPMScheduler.from_config( pipe.scheduler.config, timestep_spacing="trailing" ) # Enable memory optimizations pipe.enable_sequential_cpu_offload() pipe.vae.enable_slicing() pipe.vae.enable_tiling() # Generate video video_frames = pipe( prompt="A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat against a backdrop of soft sky and sea.", num_frames=81, # 81 frames for 5 seconds at 16fps (CogVideoX1.5) height=768, width=1360, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=True, generator=torch.Generator().manual_seed(42), ).frames[0] # Save output export_to_video(video_frames, "output.mp4", fps=16) ``` -------------------------------- ### Perform LoRA Fusion and Custom Inference Source: https://context7.com/thudm/cogvideo/llms.txt Demonstrates how to fuse LoRA weights into the base model for faster inference and configure the scheduler for custom video generation styles. 
```python
import torch
from diffusers import CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# `pipe` is assumed to be a CogVideoXPipeline with LoRA weights already loaded
# (see "Loading LoRA Fine-tuned Weights"); loading and fusing are not shown here.
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

video = pipe(
    prompt="A disney-style animated character dancing in a meadow.",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=3.0,
    use_dynamic_cfg=True,
    generator=torch.Generator(device="cpu").manual_seed(42),
).frames[0]

export_to_video(video, "lora_output.mp4", fps=8)
```

--------------------------------

### Define Fine-tuning Dataset Structure

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

The dataset must follow a specific directory structure: each video file in the `videos` folder has a corresponding text file with the same name in the `labels` folder.

```text
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
└── videos
    ├── 1.mp4
    ├── 2.mp4
```

--------------------------------

### POST /inference/generate

Source: https://context7.com/thudm/cogvideo/llms.txt

Generates a video based on a text prompt using the CogVideoX pipeline with optional LoRA fusion.

```APIDOC
## POST /inference/generate

### Description
Generates a video from a text prompt. Supports LoRA fusion for style-tuned models and custom scheduler configurations.

### Method
POST

### Endpoint
/inference/generate

### Parameters
#### Request Body
- **prompt** (string) - Required - The text description for the video.
- **num_frames** (integer) - Optional - Number of frames to generate.
- **num_inference_steps** (integer) - Optional - Number of denoising steps.
- **guidance_scale** (float) - Optional - Guidance scale for classifier-free guidance.
- **lora_scale** (float) - Optional - Scaling factor for fused LoRA weights.

### Request Example
{
  "prompt": "A disney-style animated character dancing in a meadow.",
  "num_frames": 49,
  "num_inference_steps": 50,
  "guidance_scale": 3.0
}

### Response
#### Success Response (200)
- **frames** (array) - The generated video frames.
#### Response Example { "status": "success", "output_file": "lora_output.mp4" } ``` -------------------------------- ### Configure CogVideo Inference Settings Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md The inference.yaml file controls model paths, input methods (txt/cli), sampling parameters, and output directories. Ensure the 'load' path points to your transformer model checkpoint. ```yaml args: latent_channels: 16 mode: inference load: "{absolute_path/to/your}/transformer" batch_size: 1 input_type: txt input_file: configs/test.txt sampling_num_frames: 13 sampling_fps: 8 fp16: True output_dir: outputs/ force_inference: True ``` -------------------------------- ### Prompt Optimization with LLM Source: https://context7.com/thudm/cogvideo/llms.txt Uses an LLM (GPT-4 or GLM-4) to expand short user prompts into detailed, descriptive captions suitable for high-quality video generation. Requires an OpenAI-compatible API client. ```python from openai import OpenAI sys_prompt_t2v = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say. For example, outputting "a beautiful morning in the woods with the sun peaking through the trees" will trigger your partner bot to output a video of a forest morning, as described. You will be prompted by people looking to create detailed, amazing videos. The way to accomplish this is to take their short prompts and make them extremely detailed and descriptive. 
Rules: - Output only a single video description per request - When modifications are requested, refactor the entire description to integrate suggestions - Video descriptions must be detailed but concise (around 100-150 words) """ def convert_prompt(prompt: str, retry_times: int = 3) -> str: """Convert simple prompt to detailed video description.""" client = OpenAI() for _ in range(retry_times): response = client.chat.completions.create( messages=[ {"role": "system", "content": sys_prompt_t2v}, {"role": "user", "content": f'Create an imaginative video descriptive caption for: "{prompt}"'}, ], model="gpt-4o", temperature=0.01, top_p=0.7, max_tokens=250, ) if response.choices: return response.choices[0].message.content return prompt simple_prompt = "a girl on the beach" detailed_prompt = convert_prompt(simple_prompt) print(detailed_prompt) ``` -------------------------------- ### Loading LoRA Fine-tuned Weights Source: https://context7.com/thudm/cogvideo/llms.txt Integrates custom LoRA adapters into the CogVideoX pipeline. It includes logic for setting the adapter scale based on the rank and alpha values of the trained weights. ```python import torch from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler from diffusers.utils import export_to_video pipe = CogVideoXPipeline.from_pretrained( "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16 ).to("cuda") pipe.load_lora_weights( "path/to/lora/weights", weight_name="pytorch_lora_weights.safetensors", adapter_name="custom-style", ) lora_rank = 128 lora_alpha = 1 lora_scale = lora_alpha / lora_rank pipe.set_adapters(["custom-style"], [lora_scale]) ``` -------------------------------- ### Quantized Inference for CogVideoX Source: https://context7.com/thudm/cogvideo/llms.txt Demonstrates how to reduce memory footprint by applying INT8 or FP8 quantization to the transformer, text encoder, and VAE components of the CogVideoX pipeline. Requires torchao for quantization operations. 
```python import torch from diffusers import ( AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline, CogVideoXDPMScheduler, ) from diffusers.utils import export_to_video from transformers import T5EncoderModel from torchao.quantization import quantize_, int8_weight_only from torchao.float8.inference import ActivationCasting, QuantConfig, quantize_to_float8 def quantize_model(model, scheme="int8"): """Apply quantization to model components.""" if scheme == "int8": quantize_(model, int8_weight_only()) elif scheme == "fp8": quantize_to_float8(model, QuantConfig(ActivationCasting.DYNAMIC)) return model model_path = "THUDM/CogVideoX-5b" dtype = torch.bfloat16 text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype) text_encoder = quantize_model(text_encoder, "int8") transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype) transformer = quantize_model(transformer, "int8") vae = AutoencoderKLCogVideoX.from_pretrained(model_path, subfolder="vae", torch_dtype=dtype) vae = quantize_model(vae, "int8") pipe = CogVideoXPipeline.from_pretrained( model_path, text_encoder=text_encoder, transformer=transformer, vae=vae, torch_dtype=dtype, ) pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing") pipe.enable_model_cpu_offload() pipe.vae.enable_slicing() pipe.vae.enable_tiling() video = pipe( prompt="A majestic eagle soaring through mountain peaks at sunset.", num_frames=49, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=True, generator=torch.Generator(device="cuda").manual_seed(42), ).frames[0] export_to_video(video, "quantized_output.mp4", fps=8) ``` -------------------------------- ### POST /inference/parallel Source: https://context7.com/thudm/cogvideo/llms.txt Executes video generation across multiple GPUs using the xDiT framework for high-performance parallel inference. 
```APIDOC ## POST /inference/parallel ### Description Triggers a distributed inference job across a cluster of GPUs using xFuser configuration. ### Method POST ### Endpoint /inference/parallel ### Parameters #### Request Body - **nproc_per_node** (integer) - Required - Number of GPUs to utilize. - **prompt** (string) - Required - The text description. - **ulysses_degree** (integer) - Optional - Parallelism configuration. - **ring_degree** (integer) - Optional - Parallelism configuration. ### Request Example { "nproc_per_node": 4, "prompt": "A majestic waterfall cascading through a lush rainforest.", "height": 480, "width": 720 } ### Response #### Success Response (200) - **job_id** (string) - Identifier for the distributed job. #### Response Example { "status": "processing", "job_id": "dist_gen_001" } ``` -------------------------------- ### Input Text Conversion for CogVideoX Source: https://github.com/thudm/cogvideo/blob/main/README.md Transforms user-provided prompts into long-form inputs optimized for CogVideoX training distributions. It defaults to using GLM-4 but is compatible with other LLMs. ```python # Convert short prompt to long-form input from inference.convert_demo import expand_prompt long_prompt = expand_prompt("A cat playing piano") print(long_prompt) ``` -------------------------------- ### Fine-tune CogVideoX with LoRA using Diffusers Source: https://context7.com/thudm/cogvideo/llms.txt Train custom LoRA adapters for style or domain-specific video generation using the diffusers-based training framework. Requires a specific dataset structure including prompts and video files. 
```bash bash finetune/train_ddp_t2v.sh ``` ```bash accelerate launch --config_file finetune/accelerate_config.yaml \ finetune/train.py \ --model_name cogvideox_t2v \ --model_path THUDM/CogVideoX-5b \ --training_type lora \ --data_root ./dataset \ --caption_column prompts.txt \ --video_column videos.txt \ --output_dir ./output \ --train_resolution "49x480x720" \ --train_batch_size 1 \ --gradient_accumulation_steps 4 \ --learning_rate 1e-4 \ --lora_rank 128 \ --lora_alpha 128 \ --max_train_steps 1000 \ --checkpointing_steps 100 \ --mixed_precision bf16 ``` -------------------------------- ### Configure CogVideoX Model Parameters Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md This YAML configuration defines the architecture and parameters for the CogVideoX model, including the diffusion transformer, T5 text encoder, and 3D VAE. It specifies target modules and hyperparameters necessary for model initialization and inference. ```yaml model: scale_factor: 1.55258426 disable_first_stage_autocast: true log_keys: - txt denoiser_config: target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser params: num_idx: 1000 quantize_c_noise: False weighting_config: target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting scaling_config: target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling discretization_config: target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization params: shift_scale: 3.0 network_config: target: dit_video_concat.DiffusionTransformer params: time_embed_dim: 512 elementwise_affine: True num_frames: 49 time_compressed_rate: 4 latent_width: 90 latent_height: 60 num_layers: 30 patch_size: 2 in_channels: 16 out_channels: 16 hidden_size: 1920 adm_in_channels: 256 num_attention_heads: 30 transformer_args: checkpoint_activations: True vocab_size: 1 max_sequence_length: 64 layernorm_order: pre skip_init: false model_parallel_size: 1 is_decoder: false modules: pos_embed_config: target: 
dit_video_concat.Basic3DPositionEmbeddingMixin params: text_length: 226 height_interpolation: 1.875 width_interpolation: 1.875 patch_embed_config: target: dit_video_concat.ImagePatchEmbeddingMixin params: text_hidden_size: 4096 adaln_layer_config: target: dit_video_concat.AdaLNMixin params: qk_ln: True final_layer_config: target: dit_video_concat.FinalLayerMixin conditioner_config: target: sgm.modules.GeneralConditioner params: emb_models: - is_trainable: false input_key: txt ucg_rate: 0.1 target: sgm.modules.encoders.modules.FrozenT5Embedder params: model_dir: "t5-v1_1-xxl" max_length: 226 first_stage_config: target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper params: cp_size: 1 ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" ignore_keys: [ 'loss' ] loss_config: target: torch.nn.Identity regularizer_config: target: vae_modules.regularizers.DiagonalGaussianRegularizer encoder_config: target: vae_modules.cp_enc_dec.ContextParallelEncoder3D params: double_z: true z_channels: 16 resolution: 256 in_channels: 3 out_ch: 3 ch: 128 ch_mult: [ 1, 2, 2, 4 ] attn_resolutions: [ ] num_res_blocks: 3 dropout: 0.0 gather_norm: True decoder_config: target: vae_modules.cp_enc_dec.ContextParallelDecoder3D params: double_z: True z_channels: 16 resolution: 256 in_channels: 3 out_ch: 3 ch: 128 ch_mult: [ 1, 2, 2, 4 ] attn_resolutions: [ ] num_res_blocks: 3 dropout: 0.0 gather_norm: False loss_fn_config: target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss params: offset_noise_level: 0 sigma_sampler_config: target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling params: uniform_sampling: True num_idx: 1000 discretization_config: target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization params: shift_scale: 3.0 sampler_config: target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler params: num_steps: 50 verbose: True discretization_config: target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization params: 
shift_scale: 3.0 guider_config: target: sgm.modules.diffusionmodules.guiders.DynamicCFG params: scale: 6 exp: 5 num_steps: 50
```

--------------------------------

### Convert SAT Weights to Huggingface Diffusers Format

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This script converts model weights from the SAT format to the Huggingface Diffusers format, which is needed because the two weight layouts differ. It takes the SAT weights as input and writes weights that can be loaded with the Diffusers library.

```bash
python ../tools/convert_weight_sat2hf.py
```

--------------------------------

### Configure CogVideoX Inference Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command configures the inference script for a fine-tuned CogVideoX model, specifying the LoRA configuration file and the general inference configuration.

```bash
run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
```

--------------------------------

### Multi-GPU Parallel Inference with xDiT

Source: https://context7.com/thudm/cogvideo/llms.txt

Uses the xDiT framework to distribute video generation across multiple GPUs, reducing inference time for large models.

```bash
pip install xfuser

torchrun --nproc_per_node=4 parallel_inference.py \
  --model THUDM/CogVideoX-5b \
  --ulysses_degree 1 \
  --ring_degree 2 \
  --use_cfg_parallel \
  --height 480 \
  --width 720 \
  --num_frames 49 \
  --prompt "A majestic waterfall cascading through a lush rainforest."
```

```python
import torch
from xfuser import xFuserCogVideoXPipeline, xFuserArgs
# ... 
(setup engine_config) pipe = xFuserCogVideoXPipeline.from_pretrained( pretrained_model_name_or_path=engine_config.model_config.model, engine_config=engine_config, torch_dtype=torch.bfloat16, ) output = pipe(prompt=input_config.prompt, ...).frames[0] ``` -------------------------------- ### SAT Model Inference Programmatic Usage (Python) Source: https://context7.com/thudm/cogvideo/llms.txt Programmatic usage of SAT model for video generation in Python. Loads model weights, prepares input batches, and generates video samples using PyTorch. ```python # SAT inference programmatic usage import torch from sat.model.base_model import get_model from sat.training.model_io import load_checkpoint from diffusion_video import SATVideoDiffusionEngine from arguments import get_args # Parse arguments from config args = get_args(["--base", "configs/cogvideox_5b.yaml", "configs/inference.yaml"]) # Load model model = get_model(args, SATVideoDiffusionEngine) load_checkpoint(model, args) model.eval() # Prepare batch value_dict = { "prompt": "A serene lake surrounded by autumn trees with golden leaves falling.", "negative_prompt": "", "num_frames": torch.tensor(13).unsqueeze(0), } # Generate video with torch.no_grad(): samples = model.sample( c=condition_embeddings, uc=unconditional_embeddings, batch_size=1, shape=(13, 16, 60, 90), # T, C, H, W ) ``` -------------------------------- ### Perform DDIM Inversion for Video Editing Source: https://context7.com/thudm/cogvideo/llms.txt Executes DDIM inversion to edit existing videos while maintaining structural integrity. This script requires a pre-trained CogVideoX model and specific inference parameters such as guidance scale and frame count. 
```bash
python inference/ddim_inversion.py \
  --model_path THUDM/CogVideoX-5b \
  --prompt "A cyberpunk version of the scene with neon lights" \
  --video_path input_video.mp4 \
  --output_path ./ddim_output \
  --guidance_scale 6.0 \
  --num_inference_steps 50 \
  --max_num_frames 49 \
  --width 720 \
  --height 480 \
  --fps 8 \
  --dtype bf16 \
  --seed 42
```

--------------------------------

### SAT Model Inference

Source: https://context7.com/thudm/cogvideo/llms.txt

Run inference using SAT weights directly. Requires configuring an inference YAML file and executing a bash script. Supports text-to-video and interactive modes.

```bash
# Configure configs/inference.yaml
# args:
#   load: "path/to/transformer"
#   input_type: txt  # or "cli" for interactive
#   input_file: configs/test.txt
#   sampling_num_frames: 13  # 13 for CogVideoX, 42 for CogVideoX1.5
#   sampling_fps: 8
#   bf16: True
#   output_dir: outputs/

# Run inference
cd sat
bash inference.sh
```

--------------------------------

### Export Lora Weights from SAT to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This script exports LoRA weights trained in the SAT format to the Huggingface Diffusers format. It requires the path to the SAT model weights and a directory in which to save the exported weights, so that LoRA-trained models can be used within the Huggingface ecosystem.

```bash
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```

--------------------------------

### Convert SAT Weights to Diffusers Format

Source: https://context7.com/thudm/cogvideo/llms.txt

Convert SAT-trained models to Hugging Face diffusers format for broader compatibility. This script handles different CogVideoX versions and includes options for FP16 or BF16 mixed precision.
```bash
# Convert CogVideoX-2B
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_model \
    --num_layers 30 \
    --num_attention_heads 30 \
    --scaling_factor 1.15258426 \
    --snr_shift_scale 3.0 \
    --fp16
```

```bash
# Convert CogVideoX-5B
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_5b_model \
    --num_layers 42 \
    --num_attention_heads 48 \
    --scaling_factor 0.7 \
    --snr_shift_scale 1.0 \
    --use_rotary_positional_embeddings \
    --bf16
```

```bash
# Convert CogVideoX1.5-5B-I2V
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_1.5_i2v \
    --num_layers 42 \
    --num_attention_heads 48 \
    --use_rotary_positional_embeddings \
    --i2v \
    --version 1.5 \
    --bf16
```

--------------------------------

### SAT to Huggingface Lora Weight Mapping

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This mapping illustrates the correspondence between SAT's internal Lora weight structures and Huggingface Diffusers' Lora weight structures, specifically for attention layers.
```json
{
  "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
  "attention.query_key_value.matrix_A.1": "attn1.to_k.lora_A.weight",
  "attention.query_key_value.matrix_A.2": "attn1.to_v.lora_A.weight",
  "attention.query_key_value.matrix_B.0": "attn1.to_q.lora_B.weight",
  "attention.query_key_value.matrix_B.1": "attn1.to_k.lora_B.weight",
  "attention.query_key_value.matrix_B.2": "attn1.to_v.lora_B.weight",
  "attention.dense.matrix_A.0": "attn1.to_out.0.lora_A.weight",
  "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight"
}
```

--------------------------------

### Export Lora Weights from SAT to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command exports Lora weights trained in the SAT format to the Huggingface Diffusers format. It requires the path to the saved SAT model states and a directory for the exported weights.

```bash
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```

--------------------------------

### Lora Weight Structure Mapping (SAT to HF)

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This mapping illustrates the correspondence between Lora weight structures in the SAT format and the Huggingface Diffusers format. Lora adds low-rank weights to the attention layers. This information is crucial for understanding the conversion process performed by the export script.
```python
{
    'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
    'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
    'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
    'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
    'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
    'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
    'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
    'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
}
```

--------------------------------

### Configure Lora Fine-tuning

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

Configuration specific to Lora-based fine-tuning. It freezes all base weights via `not_trainable_prefixes` and defines the `LoraMixin` target and rank (`r`) for the transformer model.

```yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ]
  log_keys:
    - txt

  lora_config:
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```

--------------------------------

### Convert SAT Weights to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This Python script converts weights from the SAT format to the Huggingface Diffusers format, which is required to use the model with the Huggingface ecosystem.

```bash
python ../tools/convert_weight_sat2hf.py
```

--------------------------------

### BibTeX Citation for CogVideo Research

Source: https://github.com/thudm/cogvideo/blob/main/README.md

This snippet provides the standard BibTeX format for citing the CogVideo and CogVideoX research papers in academic publications. It includes entries for both the original CogVideo paper and the CogVideoX-2B/5B expert transformer model paper.
```bibtex
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}
```

--------------------------------

### Parallel Inference with xDiT

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Utilizes the xDiT library to parallelize the video generation process across multiple GPUs, significantly improving performance for large-scale generation tasks.

```python
# Parallel inference setup
from tools.parallel_inference.parallel_inference_xdit import parallel_generate

parallel_generate(prompt="Cinematic mountain landscape", num_gpus=4)
```

--------------------------------

### POST /vae/process

Source: https://context7.com/thudm/cogvideo/llms.txt

Encodes video frames into latent representations or decodes latents back into video frames using the 3D Causal VAE.

```APIDOC
## POST /vae/process

### Description
Provides endpoints for VAE encoding (video to latent) and decoding (latent to video) for custom pipeline integration.

### Method
POST

### Endpoint
/vae/process

### Parameters
#### Request Body
- **action** (string) - Required - Either 'encode' or 'decode'.
- **data** (binary/tensor) - Required - Input video frames or latent tensor.
- **model_path** (string) - Required - Path to the VAE model weights.
### Request Example
{
  "action": "encode",
  "model_path": "THUDM/CogVideoX-2b/vae",
  "video_path": "input.mp4"
}

### Response
#### Success Response (200)
- **result** (tensor/file) - The processed output (latents or video frames).

#### Response Example
{
  "status": "success",
  "latent_shape": [1, 16, 12, 60, 90]
}
```

--------------------------------

### Enable Diffusers Memory Optimizations

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Configures the diffusers pipeline to reduce memory consumption during inference. These methods are recommended for NVIDIA Ampere architectures and above to balance memory usage and speed.

```python
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
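
The three calls above can be grouped into one helper so that the trade-off between memory and speed is chosen in a single place. This is a minimal sketch, not part of the CogVideo repository: the function name `apply_memory_optimizations` and the `sequential_offload` flag are hypothetical, and `pipe` is assumed to be any diffusers-style pipeline exposing these methods. `enable_model_cpu_offload()` is the coarser diffusers offload mode, which typically uses more GPU memory than sequential offload but runs faster.

```python
def apply_memory_optimizations(pipe, sequential_offload=True):
    """Apply the memory-saving calls to a diffusers-style pipeline.

    Hypothetical helper: `pipe` only needs to expose the methods used below.
    """
    if sequential_offload:
        # Offload submodule by submodule: lowest GPU memory use, slowest.
        pipe.enable_sequential_cpu_offload()
    else:
        # Offload whole components: more GPU memory, usually faster.
        pipe.enable_model_cpu_offload()

    # Decode latents in slices and tiles to cap peak VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```

With a real pipeline this would be called once after loading, for example `apply_memory_optimizations(pipe)` on a `CogVideoXPipeline` loaded with `from_pretrained`, and before the first generation call.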