### Run CogVideoX Full Fine-tuning Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command shows how to start CogVideoX full fine-tuning. It points to the base CogVideoX-2B configuration and the sft.yaml for supervised fine-tuning.

```bash
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```

--------------------------------

### Start CogVideoX Fine-tuning

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

These commands execute the fine-tuning process for CogVideoX. Use `finetune_single_gpu.sh` for single GPU training and `finetune_multi_gpus.sh` for multi-GPU training.

```bash
bash finetune_single_gpu.sh
```

```bash
bash finetune_multi_gpus.sh
```

--------------------------------

### Run CogVideoX Lora Fine-tuning Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command starts CogVideoX LoRA fine-tuning. As written, `torchrun` launches 8 processes on one node (`--nproc_per_node=8`); lower that value for a single-GPU setup. It passes the LoRA model configuration and the `sft.yaml` training configuration.

```bash
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```

--------------------------------

### Install CogVideoX Dependencies

Source: https://github.com/thudm/cogvideo/blob/main/tools/venhancer/README.md

Installs the necessary Python packages required to run the CogVideoX project from the provided requirements file.

```shell
pip install -r requirements.txt
```

--------------------------------

### Install VEnhancer Environment

Source: https://github.com/thudm/cogvideo/blob/main/tools/venhancer/README.md

Commands to clone the VEnhancer repository and install the pinned PyTorch, torchvision, and torchaudio versions for its environment.

```shell
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
```

--------------------------------

### SAT Training Configuration

Source: https://context7.com/thudm/cogvideo/llms.txt

Example YAML configuration for fine-tuning using the SAT framework. Specifies model parameters, data paths, and training settings like mixed precision.

```yaml
# configs/sft.yaml - Training configuration
args:
  model_parallel_size: 1
  experiment_name: lora-custom-style
  mode: finetune
  load: "path/to/CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 1000
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  train_data: ["path/to/train/data"]
  valid_data: ["path/to/val/data"]
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True
  deepspeed:
    bf16:
      enabled: True  # For CogVideoX-5B
    fp16:
      enabled: False
```
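The interval settings above determine how often a run evaluates, logs, and checkpoints. As a rough sketch in plain Python (the config is mirrored by hand as a dictionary rather than parsed from the file, and the helper name `summarize_schedule` is hypothetical), the following checks that only one precision mode is enabled and counts periodic events, assuming one event every `*_interval` iterations:

```python
# Values mirrored by hand from the sft.yaml example above.
sft_args = {
    "train_iters": 1000,
    "eval_interval": 100,
    "save_interval": 100,
    "log_interval": 20,
    "deepspeed": {"bf16": {"enabled": True}, "fp16": {"enabled": False}},
}


def summarize_schedule(args: dict) -> dict:
    """Check precision flags and count periodic events (hypothetical helper)."""
    ds = args["deepspeed"]
    bf16 = ds["bf16"]["enabled"]
    fp16 = ds["fp16"]["enabled"]
    if bf16 and fp16:
        raise ValueError("enable either bf16 or fp16, not both")
    iters = args["train_iters"]
    return {
        "precision": "bf16" if bf16 else "fp16" if fp16 else "fp32",
        "checkpoints": iters // args["save_interval"],
        "evaluations": iters // args["eval_interval"],
        "log_lines": iters // args["log_interval"],
    }


summary = summarize_schedule(sft_args)
print(summary)
```

With the values shown, the run would write 10 checkpoints, which is worth checking against available disk space before launching.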

--------------------------------

### Launch Gradio Web Interface for Video Generation

Source: https://context7.com/thudm/cogvideo/llms.txt

Builds a Gradio web UI around the CogVideoX pipeline, with sliders for inference steps and guidance scale. The inline example does not include prompt enhancement; that is provided by the repository's `inference/gradio_web_demo.py`, shown in the second block.

```python
import gradio as gr
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Initialize pipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

def generate_video(prompt, num_steps, guidance_scale):
    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=int(num_steps),
        num_frames=49,
        guidance_scale=guidance_scale,
    ).frames[0]

    video_path = "gradio_output.mp4"
    export_to_video(video, video_path)
    return video_path

# Create Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# CogVideoX Video Generator")

    with gr.Row():
        with gr.Column():
            prompt = gr.Textbox(
                label="Prompt",
                placeholder="Describe your video...",
                lines=3
            )
            num_steps = gr.Slider(10, 100, value=50, label="Inference Steps")
            guidance = gr.Slider(1.0, 15.0, value=6.0, label="Guidance Scale")
            generate_btn = gr.Button("Generate Video")

        with gr.Column():
            video_output = gr.Video(label="Generated Video")

    generate_btn.click(
        generate_video,
        inputs=[prompt, num_steps, guidance],
        outputs=video_output
    )

demo.launch()
```

```bash
# Run with OpenAI prompt enhancement
OPENAI_API_KEY=your_key python inference/gradio_web_demo.py

# Or run the full composite demo with I2V and V2V support
python inference/gradio_composite_demo/app.py
```

--------------------------------

### Execute Training Commands

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

Commands to initiate training using torchrun for either Lora or full fine-tuning, and shell scripts to run training on single or multiple GPUs.

```bash
# LoRA fine-tuning
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
# Full fine-tuning (an alternative; it overrides the assignment above if both are kept)
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
# Launch on a single GPU or on multiple GPUs
bash finetune_single_gpu.sh
bash finetune_multi_gpus.sh
```
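The two `run_cmd` strings differ only in the model config passed to `--base`. A minimal sketch of assembling that command in Python follows; the function `build_train_cmd` and its parameters are illustrative and not part of the repository, and the seed is drawn with `random.randint` in place of the shell's `$RANDOM`:

```python
import random
import shlex


def build_train_cmd(use_lora: bool, num_gpus: int = 8, seed=None) -> list:
    """Return the torchrun argument list for LoRA or full fine-tuning."""
    model_config = (
        "configs/cogvideox_2b_lora.yaml" if use_lora else "configs/cogvideox_2b.yaml"
    )
    if seed is None:
        seed = random.randint(0, 32767)  # same range as bash $RANDOM
    return [
        "torchrun", "--standalone", f"--nproc_per_node={num_gpus}",
        "train_video.py",
        "--base", model_config, "configs/sft.yaml",
        "--seed", str(seed),
    ]


cmd = build_train_cmd(use_lora=True, seed=42)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting concerns.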

--------------------------------

### Execute SFT Fine-tuning Scripts

Source: https://github.com/thudm/cogvideo/blob/main/finetune/README.md

Commands to initiate SFT fine-tuning for Text-to-Video (T2V) and Image-to-Video (I2V) tasks. These scripts require matching configurations in the DeepSpeed zero configuration files.

```bash
bash train_zero_t2v.sh
bash train_zero_i2v.sh
```

--------------------------------

### CLI Inference Demonstration for CogVideoX

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Provides a detailed CLI interface for running CogVideoX inference. It explains common parameters and configuration options for generating videos from text.

```bash
# The CLI demo is a script; run it from the repository root
python inference/cli_demo.py --prompt "A futuristic city skyline at sunset" --model_path THUDM/CogVideoX-5b
```

--------------------------------

### Configure Fine-tuning Parameters

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

The sft.yaml file defines training parameters for fine-tuning. It supports model parallelization, iteration counts, and hardware-specific precision settings like fp16 or bf16.

```yaml
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
train_iters: 1000
save: ckpts
deepspeed:
  bf16:
    enabled: False
  fp16:
    enabled: True
```

--------------------------------

### Run Inference with Fine-tuned CogVideo Model

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This snippet shows how to edit the `inference.sh` script so that it runs the fine-tuned CogVideoX model: set `run_cmd` to call `sample_video.py` with the LoRA model config and `configs/inference.yaml`, then run the script to generate a sample video.

```bash
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model parameters>_lora.yaml configs/inference.yaml --seed 42"
```

```bash
bash inference.sh
```

--------------------------------

### Configure Fine-tuning Parameters

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

YAML configuration for full-parameter fine-tuning, specifying training iterations, data paths, and hardware acceleration settings like DeepSpeed.

```yaml
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
no_load_rng: True
train_iters: 1000
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts
save_interval: 100
log_interval: 20
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ]
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
deepspeed:
  bf16:
    enabled: False
  fp16:
    enabled: True
```

--------------------------------

### Configure CogVideoX LoRA Fine-tuning (cogvideox_2b_lora.yaml)

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This YAML snippet configures LoRA fine-tuning of CogVideoX. It belongs in the LoRA model config (`configs/cogvideox_2b_lora.yaml`, the file passed to `--base` ahead of `configs/sft.yaml`) and sets LoRA-specific parameters such as the rank `r` and the target mixin class. Use it together with `sft.yaml`, which supplies the training arguments.

```yaml
model:
  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ]
  log_keys:
    - txt

  lora_config:
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```

--------------------------------

### Quantized Model Inference

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Demonstrates how to run inference on memory-constrained devices using quantized models. This script can be adapted to support FP8 precision for reduced memory usage.

```bash
# Run the quantization demo script from the repository root
python inference/cli_demo_quantization.py --prompt "A futuristic city skyline at sunset" --model_path THUDM/CogVideoX-5b --quantization_scheme int8
```

--------------------------------

### Video-to-Video Transformation with CogVideoX V2V Pipeline

Source: https://context7.com/thudm/cogvideo/llms.txt

Transforms an existing video based on a new text prompt using the CogVideoX Video-to-Video pipeline. It preserves the original motion and structure while applying the style or content described in the prompt. Requires loading a source video and specifying generation parameters.

```python
import torch
from diffusers import CogVideoXVideoToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video, load_video

# Load V2V pipeline
pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing"
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Load source video
video = load_video("source_video.mp4")

# Transform video with new style/content
video_frames = pipe(
    prompt="A cyberpunk cityscape with neon lights reflecting on wet streets, rain falling.",
    video=video,
    num_frames=49,           # CogVideoX-5B uses 49 frames for 6 seconds at 8fps
    height=480,
    width=720,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_video(video_frames, "video_to_video.mp4", fps=8)

```

--------------------------------

### Image-to-Video Generation with CogVideoX I2V Pipeline

Source: https://context7.com/thudm/cogvideo/llms.txt

Converts a static image into a video based on a text prompt using the CogVideoX Image-to-Video pipeline. It animates the input image while maintaining visual consistency with the prompt. Requires loading an image and specifying generation parameters.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video, load_image

# Load I2V pipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.bfloat16
)

pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing"
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Load reference image
image = load_image("path/to/your/image.png")

# Generate video from image
video_frames = pipe(
    prompt="The cat slowly turns its head and blinks while the wind gently rustles its fur.",
    image=image,
    num_frames=81,
    height=768,              # CogVideoX1.5-5B-I2V supports custom resolutions
    width=1360,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_video(video_frames, "image_to_video.mp4", fps=16)

```

--------------------------------

### Configure CogVideoX Full Fine-tuning (sft.yaml)

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This YAML configuration is used for full-parameter fine-tuning of the CogVideoX model. It specifies training parameters, data paths, and DeepSpeed settings. Enable exactly one of `bf16` and `fp16`: use fp16 for CogVideoX-2B, as shown, and bf16 for CogVideoX-5B.

```yaml
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
no_load_rng: True
train_iters: 1000
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts
save_interval: 100
log_interval: 20
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ]
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
deepspeed:
  bf16:
    enabled: False
  fp16:
    enabled: True
```

--------------------------------

### Execute Inference Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

Run the bash script to initiate the video generation process based on the configurations defined in the YAML file.

```bash
bash inference.sh
```

--------------------------------

### Execute LoRA Fine-tuning Scripts

Source: https://github.com/thudm/cogvideo/blob/main/finetune/README.md

Commands to initiate LoRA fine-tuning for Text-to-Video (T2V) and Image-to-Video (I2V) tasks. Users must modify configuration parameters within the respective bash scripts before execution.

```bash
bash train_ddp_t2v.sh
bash train_ddp_i2v.sh
```

--------------------------------

### Fine-tune CogVideoX with SAT Framework

Source: https://context7.com/thudm/cogvideo/llms.txt

Fine-tune CogVideoX using the Swiss Army Transformer (SAT) framework for full model or LoRA training. Supports distributed training and requires a specific dataset structure with labels and videos.

```bash
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
```

```bash
cd sat
bash finetune_single_gpu.sh
```

```bash
bash finetune_multi_gpus.sh
```

--------------------------------

### Text-to-Video Generation with CogVideoX Pipeline

Source: https://context7.com/thudm/cogvideo/llms.txt

Generates videos from text prompts using the CogVideoX pipeline. It supports multiple model variants and integrates with the diffusers library for inference. Key parameters include prompt, number of frames, height, width, inference steps, and guidance scale.

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
)

# Configure scheduler (DPM recommended for 5B models)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing"
)

# Enable memory optimizations
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Generate video
video_frames = pipe(
    prompt="A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat against a backdrop of soft sky and sea.",
    num_frames=81,           # 81 frames for 5 seconds at 16fps (CogVideoX1.5)
    height=768,
    width=1360,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]

# Save output
export_to_video(video_frames, "output.mp4", fps=16)

```
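The frame counts quoted in these examples follow a `seconds * fps + 1` pattern: 49 frames for 6 seconds at 8 fps, and 81 frames for 5 seconds at 16 fps. A small sketch of that arithmetic (the helper name `frames_for` is hypothetical, and the "+1" rule is inferred from the two values quoted in this document rather than taken from the library):

```python
def frames_for(seconds: int, fps: int) -> int:
    """Frame count for a clip: one initial frame plus `fps` frames per second."""
    return seconds * fps + 1


print(frames_for(6, 8))    # CogVideoX-5B example: 6 s at 8 fps
print(frames_for(5, 16))   # CogVideoX1.5 example: 5 s at 16 fps
```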

--------------------------------

### Perform LoRA Fusion and Custom Inference

Source: https://context7.com/thudm/cogvideo/llms.txt

Demonstrates how to fuse LoRA weights into the base model for faster inference and configure the scheduler for custom video generation styles.

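```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# Load the base model and a LoRA checkpoint (placeholder path), then fuse the adapter
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("path/to/lora/weights", adapter_name="custom-style")
pipe.fuse_lora(lora_scale=1.0)  # assumed scale; tune it for your adapter

pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
video = pipe(
    prompt="A disney-style animated character dancing in a meadow.",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=3.0,
    use_dynamic_cfg=True,
    generator=torch.Generator(device="cpu").manual_seed(42),
).frames[0]
export_to_video(video, "lora_output.mp4", fps=8)
```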

--------------------------------

### Define Fine-tuning Dataset Structure

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

The dataset must follow a specific directory structure where each video file in the 'videos' folder has a corresponding text file with the same name in the 'labels' folder.

```text
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
└── videos
    ├── 1.mp4
    ├── 2.mp4
```
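A mismatch between the two folders is easy to miss in a large dataset. The following sketch checks the pairing by file stem using only the standard library; the function name `find_unpaired` is hypothetical, and the check assumes `.mp4` videos and `.txt` labels as in the tree above:

```python
import tempfile
from pathlib import Path


def find_unpaired(dataset_root: Path) -> tuple:
    """Return (videos without a label, labels without a video) by file stem."""
    video_stems = {p.stem for p in (dataset_root / "videos").glob("*.mp4")}
    label_stems = {p.stem for p in (dataset_root / "labels").glob("*.txt")}
    return sorted(video_stems - label_stems), sorted(label_stems - video_stems)


# Build a throwaway dataset in which video 2 has no label.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "videos").mkdir()
    (root / "labels").mkdir()
    for name in ("1", "2"):
        (root / "videos" / f"{name}.mp4").touch()
    (root / "labels" / "1.txt").write_text("A cat playing piano")
    missing_labels, missing_videos = find_unpaired(root)

print(missing_labels, missing_videos)
```

Running such a check before training avoids discovering a missing caption partway through a run.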

--------------------------------

### POST /inference/generate

Source: https://context7.com/thudm/cogvideo/llms.txt

Generates a video based on a text prompt using the CogVideoX pipeline with optional LoRA fusion.

```APIDOC
## POST /inference/generate

### Description
Generates a video from a text prompt. Supports LoRA fusion for style-tuned models and custom scheduler configurations.

### Method
POST

### Endpoint
/inference/generate

### Parameters
#### Request Body
- **prompt** (string) - Required - The text description for the video.
- **num_frames** (integer) - Optional - Number of frames to generate.
- **num_inference_steps** (integer) - Optional - Number of denoising steps.
- **guidance_scale** (float) - Optional - Guidance scale for classifier-free guidance.
- **lora_scale** (float) - Optional - Scaling factor for fused LoRA weights.

### Request Example
{
  "prompt": "A disney-style animated character dancing in a meadow.",
  "num_frames": 49,
  "num_inference_steps": 50,
  "guidance_scale": 3.0
}

### Response
#### Success Response (200)
- **status** (string) - Outcome of the request.
- **output_file** (string) - Path of the saved video file.

#### Response Example
{
  "status": "success",
  "output_file": "lora_output.mp4"
}
```

--------------------------------

### Configure CogVideo Inference Settings

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

The inference.yaml file controls model paths, input methods (txt/cli), sampling parameters, and output directories. Ensure the 'load' path points to your transformer model checkpoint.

```yaml
args:
  latent_channels: 16
  mode: inference
  load: "{absolute_path/to/your}/transformer"
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_num_frames: 13
  sampling_fps: 8
  fp16: True
  output_dir: outputs/
  force_inference: True
```

--------------------------------

### Prompt Optimization with LLM

Source: https://context7.com/thudm/cogvideo/llms.txt

Uses an LLM (GPT-4 or GLM-4) to expand short user prompts into detailed, descriptive captions suitable for high-quality video generation. Requires an OpenAI-compatible API client.

```python
from openai import OpenAI

sys_prompt_t2v = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say.

For example, outputting "a beautiful morning in the woods with the sun peaking through the trees" will trigger your partner bot to output a video of a forest morning, as described. You will be prompted by people looking to create detailed, amazing videos. The way to accomplish this is to take their short prompts and make them extremely detailed and descriptive.

Rules:
- Output only a single video description per request
- When modifications are requested, refactor the entire description to integrate suggestions
- Video descriptions must be detailed but concise (around 100-150 words)
"""

def convert_prompt(prompt: str, retry_times: int = 3) -> str:
    """Convert simple prompt to detailed video description."""
    client = OpenAI()

    for _ in range(retry_times):
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": sys_prompt_t2v},
                {"role": "user", "content": f'Create an imaginative video descriptive caption for: "{prompt}"'},
            ],
            model="gpt-4o",
            temperature=0.01,
            top_p=0.7,
            max_tokens=250,
        )
        if response.choices:
            return response.choices[0].message.content
    return prompt

simple_prompt = "a girl on the beach"
detailed_prompt = convert_prompt(simple_prompt)
print(detailed_prompt)
```
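The loop above retries only when the response has no choices; an exception raised by the client would still propagate. One possible variant, sketched below, treats exceptions as failed attempts and falls back to the original prompt. The LLM call is injected as a plain callable so the example runs without an API key; `enhance_with_fallback`, `flaky_enhancer`, and `always_failing` are hypothetical names for illustration:

```python
from typing import Callable


def enhance_with_fallback(prompt: str, enhance: Callable[[str], str], retries: int = 3) -> str:
    """Try `enhance` up to `retries` times; fall back to the original prompt."""
    for _ in range(retries):
        try:
            result = enhance(prompt)
        except Exception:
            continue  # treat any failure as a failed attempt and retry
        if result and result.strip():
            return result.strip()
    return prompt


calls = {"count": 0}


def flaky_enhancer(prompt: str) -> str:
    """Stand-in for an LLM call: fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated timeout")
    return f"A detailed, cinematic shot of {prompt}."


def always_failing(prompt: str) -> str:
    """Stand-in for an LLM call that never succeeds."""
    raise RuntimeError("simulated outage")


print(enhance_with_fallback("a girl on the beach", flaky_enhancer))
print(enhance_with_fallback("a girl on the beach", always_failing))
```

Falling back to the unmodified prompt keeps video generation usable when the enhancement service is unavailable.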

--------------------------------

### Loading LoRA Fine-tuned Weights

Source: https://context7.com/thudm/cogvideo/llms.txt

Integrates custom LoRA adapters into the CogVideoX pipeline. It includes logic for setting the adapter scale based on the rank and alpha values of the trained weights.

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")

pipe.load_lora_weights(
    "path/to/lora/weights",
    weight_name="pytorch_lora_weights.safetensors",
    adapter_name="custom-style",
)

lora_rank = 128
lora_alpha = 1
lora_scale = lora_alpha / lora_rank
pipe.set_adapters(["custom-style"], [lora_scale])
```
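The scale passed to `set_adapters` above is simply `lora_alpha / lora_rank`. A tiny worked sketch of that ratio (the function name `lora_scale_for` is hypothetical):

```python
def lora_scale_for(alpha: float, rank: int) -> float:
    """Adapter scale as the ratio of LoRA alpha to LoRA rank."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


print(lora_scale_for(1, 128))    # values used in the snippet above
print(lora_scale_for(128, 128))  # alpha equal to rank gives a scale of 1.0
```

If the adapter was trained with a different alpha or rank, the scale should be recomputed from those training values.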

--------------------------------

### Quantized Inference for CogVideoX

Source: https://context7.com/thudm/cogvideo/llms.txt

Demonstrates how to reduce memory footprint by applying INT8 or FP8 quantization to the transformer, text encoder, and VAE components of the CogVideoX pipeline. Requires torchao for quantization operations.

```python
import torch
from diffusers import (
    AutoencoderKLCogVideoX,
    CogVideoXTransformer3DModel,
    CogVideoXPipeline,
    CogVideoXDPMScheduler,
)
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only
from torchao.float8.inference import ActivationCasting, QuantConfig, quantize_to_float8

def quantize_model(model, scheme="int8"):
    """Apply quantization to model components."""
    if scheme == "int8":
        quantize_(model, int8_weight_only())
    elif scheme == "fp8":
        quantize_to_float8(model, QuantConfig(ActivationCasting.DYNAMIC))
    return model

model_path = "THUDM/CogVideoX-5b"
dtype = torch.bfloat16

text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype)
text_encoder = quantize_model(text_encoder, "int8")

transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype)
transformer = quantize_model(transformer, "int8")

vae = AutoencoderKLCogVideoX.from_pretrained(model_path, subfolder="vae", torch_dtype=dtype)
vae = quantize_model(vae, "int8")

pipe = CogVideoXPipeline.from_pretrained(
    model_path,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=dtype,
)

pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A majestic eagle soaring through mountain peaks at sunset.",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "quantized_output.mp4", fps=8)
```
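As rough arithmetic for why weight-only quantization helps, weight storage scales with the number of bytes per parameter. The sketch below makes several assumptions: the parameter count is a round 5 billion suggested by the model name rather than a measured value, and activations, quantization scales, and the other pipeline components are ignored. The helper name `weight_gib` is hypothetical:

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "fp8": 1}


def weight_gib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GiB for a parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


assumed_params = 5_000_000_000  # round figure, not a measured count
for dtype in ("bf16", "int8"):
    print(dtype, round(weight_gib(assumed_params, dtype), 2))
```

Under these assumptions, 8-bit weights take half the storage of bf16 weights; actual peak memory also depends on offloading and VAE tiling settings.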

--------------------------------

### POST /inference/parallel

Source: https://context7.com/thudm/cogvideo/llms.txt

Executes video generation across multiple GPUs using the xDiT framework for high-performance parallel inference.

```APIDOC
## POST /inference/parallel

### Description
Triggers a distributed inference job across a cluster of GPUs using xFuser configuration.

### Method
POST

### Endpoint
/inference/parallel

### Parameters
#### Request Body
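- **nproc_per_node** (integer) - Required - Number of GPUs to utilize.
- **prompt** (string) - Required - The text description.
- **ulysses_degree** (integer) - Optional - Ulysses sequence-parallel degree.
- **ring_degree** (integer) - Optional - Ring attention parallel degree.
- **height** (integer) - Optional - Output video height in pixels.
- **width** (integer) - Optional - Output video width in pixels.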

### Request Example
{
  "nproc_per_node": 4,
  "prompt": "A majestic waterfall cascading through a lush rainforest.",
  "height": 480,
  "width": 720
}

### Response
#### Success Response (200)
- **status** (string) - Current state of the distributed job.
- **job_id** (string) - Identifier for the distributed job.

#### Response Example
{
  "status": "processing",
  "job_id": "dist_gen_001"
}
```

--------------------------------

### Input Text Conversion for CogVideoX

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Expands short user prompts into long, detailed descriptions that resemble the captions CogVideoX was trained on. It uses GLM-4 by default but works with other LLMs.

```bash
# Convert a short prompt to a long-form input (run from the repository root)
python inference/convert_demo.py --prompt "A cat playing piano" --type "t2v"
```

--------------------------------

### Fine-tune CogVideoX with LoRA using Diffusers

Source: https://context7.com/thudm/cogvideo/llms.txt

Train custom LoRA adapters for style or domain-specific video generation using the diffusers-based training framework. Requires a specific dataset structure including prompts and video files.

```bash
bash finetune/train_ddp_t2v.sh
```

```bash
accelerate launch --config_file finetune/accelerate_config.yaml \
    finetune/train.py \
    --model_name cogvideox_t2v \
    --model_path THUDM/CogVideoX-5b \
    --training_type lora \
    --data_root ./dataset \
    --caption_column prompts.txt \
    --video_column videos.txt \
    --output_dir ./output \
    --train_resolution "49x480x720" \
    --train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --lora_rank 128 \
    --lora_alpha 128 \
    --max_train_steps 1000 \
    --checkpointing_steps 100 \
    --mixed_precision bf16
```
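Two values in the command above are easy to misread: the `--train_resolution` string, which packs frames, height, and width into one argument, and the effective batch size, which also depends on the number of processes. A short sketch of both, with hypothetical helper names and an assumed process count of 8 (that value comes from the accelerate config, which is not shown here):

```python
def parse_resolution(spec: str) -> tuple:
    """Split a 'FRAMESxHEIGHTxWIDTH' string into three integers."""
    frames, height, width = (int(part) for part in spec.lower().split("x"))
    return frames, height, width


def effective_batch_size(per_device: int, grad_accum: int, num_processes: int) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_processes


frames, height, width = parse_resolution("49x480x720")
print(frames, height, width)
# num_processes is an assumed value; read it from your accelerate config.
print(effective_batch_size(per_device=1, grad_accum=4, num_processes=8))
```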

--------------------------------

### Configure CogVideoX Model Parameters

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This YAML configuration defines the architecture and parameters for the CogVideoX model, including the diffusion transformer, T5 text encoder, and 3D VAE. It specifies target modules and hyperparameters necessary for model initialization and inference.

```yaml
model:
  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False

      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30

      transformer_args:
        checkpoint_activations: True
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false

      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875

        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096

        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True

        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "t5-v1_1-xxl"
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]

      loss_config:
        target: torch.nn.Identity

      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer

      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True

      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: False

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True

      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
```
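Several numbers in this config are linked: `num_frames: 49` with `time_compressed_rate: 4` gives 13 latent frames, and `latent_height: 60` by `latent_width: 90` corresponds to a 480x720 video. The sketch below reproduces that arithmetic under two assumptions that are inferred rather than stated in the config: the first frame is encoded on its own, with each following group of `time_compressed_rate` frames becoming one latent frame, and the VAE downsamples each spatial dimension by 8. The helper names are hypothetical:

```python
def latent_shape(num_frames: int, height: int, width: int,
                 time_rate: int = 4, spatial_rate: int = 8) -> tuple:
    """Latent (frames, height, width) for a pixel-space video size."""
    latent_frames = (num_frames - 1) // time_rate + 1
    return latent_frames, height // spatial_rate, width // spatial_rate


def video_token_count(latent: tuple, patch_size: int = 2) -> int:
    """Number of video tokens after patchifying each latent frame."""
    frames, height, width = latent
    return frames * (height // patch_size) * (width // patch_size)


shape = latent_shape(49, 480, 720)
print(shape)
# Add text_length (226) for the total sequence seen by the transformer.
print(video_token_count(shape) + 226)
```

The resulting sequence length is one reason `checkpoint_activations: True` and the VAE slicing and tiling options matter for memory use.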

--------------------------------

### Convert SAT Weights to Huggingface Diffusers Format

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This script converts model weights from the SAT format to the Hugging Face Diffusers format, which differs from SAT's. It takes SAT weights as input and writes weights that the Diffusers library can load.

```bash
python ../tools/convert_weight_sat2hf.py
```

--------------------------------

### Configure CogVideoX Inference Script

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

Set this `run_cmd` in `inference.sh` to run inference with a fine-tuned CogVideoX model. It passes the LoRA model config and the general `configs/inference.yaml` to `sample_video.py`; replace `<model parameters>` with your model size (for example, `2b`).

```bash
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model parameters>_lora.yaml configs/inference.yaml --seed 42"
```

--------------------------------

### Multi-GPU Parallel Inference with xDiT

Source: https://context7.com/thudm/cogvideo/llms.txt

Uses the xDiT framework to distribute video generation tasks across multiple GPUs, significantly reducing inference time for large models.

```bash
pip install xfuser
torchrun --nproc_per_node=4 parallel_inference.py --model THUDM/CogVideoX-5b --ulysses_degree 1 --ring_degree 2 --use_cfg_parallel --height 480 --width 720 --num_frames 49 --prompt "A majestic waterfall cascading through a lush rainforest."
```

```python
import torch  # needed for torch.bfloat16 below
from xfuser import xFuserCogVideoXPipeline, xFuserArgs
# ... (setup engine_config)
pipe = xFuserCogVideoXPipeline.from_pretrained(
    pretrained_model_name_or_path=engine_config.model_config.model,
    engine_config=engine_config,
    torch_dtype=torch.bfloat16,
)
output = pipe(prompt=input_config.prompt, ...).frames[0]
```
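
The `torchrun` call above launches 4 processes with `--ulysses_degree 1 --ring_degree 2 --use_cfg_parallel`. A minimal sketch of the sanity check this implies, assuming the process count must equal the product of the two sequence-parallel degrees, doubled when classifier-free-guidance parallelism is enabled. `required_processes` is a hypothetical helper, not part of `xfuser`, and other xDiT parallel modes are ignored here.

```python
def required_processes(ulysses_degree: int, ring_degree: int, use_cfg_parallel: bool) -> int:
    """Number of processes implied by the parallel settings (assumed rule)."""
    cfg_degree = 2 if use_cfg_parallel else 1  # CFG parallelism splits cond/uncond passes
    return ulysses_degree * ring_degree * cfg_degree


nproc_per_node = 4  # value passed to torchrun in the command above
needed = required_processes(ulysses_degree=1, ring_degree=2, use_cfg_parallel=True)
if needed != nproc_per_node:
    raise ValueError(f"parallel degrees imply {needed} processes, but {nproc_per_node} were launched")
```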

--------------------------------

### SAT Model Inference Programmatic Usage (Python)

Source: https://context7.com/thudm/cogvideo/llms.txt

Shows how to run SAT inference from Python: load the model weights, prepare the input batch, and generate video samples with PyTorch.

```python
# SAT inference programmatic usage
import torch
from sat.model.base_model import get_model
from sat.training.model_io import load_checkpoint
from diffusion_video import SATVideoDiffusionEngine
from arguments import get_args

# Parse arguments from config
args = get_args(["--base", "configs/cogvideox_5b.yaml", "configs/inference.yaml"])

# Load model
model = get_model(args, SATVideoDiffusionEngine)
load_checkpoint(model, args)
model.eval()

# Prepare batch
value_dict = {
    "prompt": "A serene lake surrounded by autumn trees with golden leaves falling.",
    "negative_prompt": "",
    "num_frames": torch.tensor(13).unsqueeze(0),
}

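# Generate video
# NOTE: `condition_embeddings` and `unconditional_embeddings` are not defined in this
# snippet; they must be computed from `value_dict` before calling `model.sample`.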
with torch.no_grad():
    samples = model.sample(
        c=condition_embeddings,
        uc=unconditional_embeddings,
        batch_size=1,
        shape=(13, 16, 60, 90),  # T, C, H, W
    )
```
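
The `shape=(13, 16, 60, 90)` argument above is consistent with a 49-frame, 480x720 clip if the VAE compresses 4x in time and 8x in space and produces 16 latent channels. The sketch below works through that arithmetic. The compression factors are assumptions stated for illustration, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(num_frames, height, width, temporal_factor=4, spatial_factor=8, latent_channels=16):
    """Return (T, C, H, W) of the latent for a given pixel-space video size."""
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames must have the form temporal_factor * k + 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    # The first frame is kept on its own; the remaining frames are grouped in time.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (latent_frames, latent_channels, height // spatial_factor, width // spatial_factor)


shape = latent_shape(num_frames=49, height=480, width=720)
```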

--------------------------------

### Perform DDIM Inversion for Video Editing

Source: https://context7.com/thudm/cogvideo/llms.txt

Executes DDIM inversion to edit existing videos while maintaining structural integrity. This script requires a pre-trained CogVideoX model and specific inference parameters such as guidance scale and frame count.

```bash
python inference/ddim_inversion.py \
    --model_path THUDM/CogVideoX-5b \
    --prompt "A cyberpunk version of the scene with neon lights" \
    --video_path input_video.mp4 \
    --output_path ./ddim_output \
    --guidance_scale 6.0 \
    --num_inference_steps 50 \
    --max_num_frames 49 \
    --width 720 \
    --height 480 \
    --fps 8 \
    --dtype bf16 \
    --seed 42
```

--------------------------------

### SAT Model Inference

Source: https://context7.com/thudm/cogvideo/llms.txt

Run inference using SAT weights directly. Requires configuring an inference YAML file and executing a bash script. Supports text-to-video and interactive modes.

```bash
# Configure configs/inference.yaml
# args:
#   load: "path/to/transformer"
#   input_type: txt           # or "cli" for interactive
#   input_file: configs/test.txt
#   sampling_num_frames: 13   # 13 for CogVideoX, 42 for CogVideoX1.5
#   sampling_fps: 8
#   bf16: True
#   output_dir: outputs/

# Run inference
cd sat
bash inference.sh
```
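
With `input_type: txt`, prompts are read from the file named in `input_file`. A small sketch for preparing that file, assuming it holds one prompt per line. The helper name and the example prompts are illustrative only.

```python
def format_prompt_file(prompts):
    """Return text with one prompt per line (assumed layout of `input_file`)."""
    lines = []
    for prompt in prompts:
        # Collapse newlines and repeated spaces so each prompt stays on a single line.
        flat = " ".join(prompt.split())
        if flat:  # skip empty prompts
            lines.append(flat)
    return "\n".join(lines) + "\n"


text = format_prompt_file([
    "A serene lake surrounded by autumn trees.",
    "A majestic waterfall\ncascading through a rainforest.",
])
# To use it: pathlib.Path("configs/test.txt").write_text(text, encoding="utf-8")
```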

--------------------------------

### Export Lora Weights from SAT to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This script exports Lora weights trained in the SAT format to the Huggingface Diffusers format so they can be used within the Huggingface ecosystem. It takes the path to the SAT model weights (`--sat_pt_path`) and a directory in which to save the exported weights (`--lora_save_directory`).

```bash
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```

--------------------------------

### Convert SAT Weights to Diffusers Format

Source: https://context7.com/thudm/cogvideo/llms.txt

Convert SAT-trained models to Hugging Face diffusers format for broader compatibility. This script handles different CogVideoX versions and includes options for FP16 or BF16 mixed precision.

```bash
# Convert CogVideoX-2B
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_model \
    --num_layers 30 \
    --num_attention_heads 30 \
    --scaling_factor 1.15258426 \
    --snr_shift_scale 3.0 \
    --fp16
```

```bash
# Convert CogVideoX-5B
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_5b_model \
    --num_layers 42 \
    --num_attention_heads 48 \
    --scaling_factor 0.7 \
    --snr_shift_scale 1.0 \
    --use_rotary_positional_embeddings \
    --bf16
```

```bash
# Convert CogVideoX1.5-5B-I2V
python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path path/to/sat/transformer/1000/mp_rank_00_model_states.pt \
    --vae_ckpt_path path/to/sat/vae/3d-vae.pt \
    --output_path ./converted_1.5_i2v \
    --num_layers 42 \
    --num_attention_heads 48 \
    --use_rotary_positional_embeddings \
    --i2v \
    --version 1.5 \
    --bf16
```
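
The three commands above differ only in their model-specific flags. As a sketch, those flags can be kept in one table and the command assembled from it. The flag values are copied from the commands above; `PRESETS` and `build_convert_command` are hypothetical names, and the checkpoint paths are placeholders.

```python
import shlex

# Model-specific flags, copied from the three commands above.
PRESETS = {
    "CogVideoX-2B": ["--num_layers", "30", "--num_attention_heads", "30",
                     "--scaling_factor", "1.15258426", "--snr_shift_scale", "3.0", "--fp16"],
    "CogVideoX-5B": ["--num_layers", "42", "--num_attention_heads", "48",
                     "--scaling_factor", "0.7", "--snr_shift_scale", "1.0",
                     "--use_rotary_positional_embeddings", "--bf16"],
    "CogVideoX1.5-5B-I2V": ["--num_layers", "42", "--num_attention_heads", "48",
                            "--use_rotary_positional_embeddings", "--i2v",
                            "--version", "1.5", "--bf16"],
}


def build_convert_command(model, transformer_ckpt, vae_ckpt, output_path):
    """Return the conversion command as an argument list for the given model."""
    return ["python", "tools/convert_weight_sat2hf.py",
            "--transformer_ckpt_path", transformer_ckpt,
            "--vae_ckpt_path", vae_ckpt,
            "--output_path", output_path,
            *PRESETS[model]]


cmd = build_convert_command(
    "CogVideoX-2B",
    "path/to/sat/transformer/1000/mp_rank_00_model_states.pt",
    "path/to/sat/vae/3d-vae.pt",
    "./converted_model",
)
command_line = shlex.join(cmd)  # shell-safe string, e.g. for logging
```

The list form can be passed to `subprocess.run(cmd, check=True)` without shell quoting concerns.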

--------------------------------

### SAT to Huggingface Lora Weight Mapping

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This mapping illustrates the correspondence between SAT's internal Lora weight structures and Huggingface Diffusers' Lora weight structures, specifically for attention layers.

```json
{
  "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
  "attention.query_key_value.matrix_A.1": "attn1.to_k.lora_A.weight",
  "attention.query_key_value.matrix_A.2": "attn1.to_v.lora_A.weight",
  "attention.query_key_value.matrix_B.0": "attn1.to_q.lora_B.weight",
  "attention.query_key_value.matrix_B.1": "attn1.to_k.lora_B.weight",
  "attention.query_key_value.matrix_B.2": "attn1.to_v.lora_B.weight",
  "attention.dense.matrix_A.0": "attn1.to_out.0.lora_A.weight",
  "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight"
}
```
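
The mapping above can be applied as a suffix rename over the keys of a state dict. The sketch below shows only that step. The `layers.0.` prefix in the example is hypothetical, and the actual export script may also rewrite the prefix, which is not covered here.

```python
# Suffix mapping copied from the table above.
SAT_TO_HF_LORA = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.query_key_value.matrix_A.1": "attn1.to_k.lora_A.weight",
    "attention.query_key_value.matrix_A.2": "attn1.to_v.lora_A.weight",
    "attention.query_key_value.matrix_B.0": "attn1.to_q.lora_B.weight",
    "attention.query_key_value.matrix_B.1": "attn1.to_k.lora_B.weight",
    "attention.query_key_value.matrix_B.2": "attn1.to_v.lora_B.weight",
    "attention.dense.matrix_A.0": "attn1.to_out.0.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}


def rename_lora_key(sat_key: str) -> str:
    """Swap a SAT Lora suffix for its Diffusers counterpart, keeping the prefix."""
    for sat_suffix, hf_suffix in SAT_TO_HF_LORA.items():
        if sat_key.endswith(sat_suffix):
            return sat_key[: -len(sat_suffix)] + hf_suffix
    raise KeyError(f"no Lora mapping for key: {sat_key}")


# Hypothetical key prefix, used only to demonstrate the rename.
example = rename_lora_key("layers.0.attention.query_key_value.matrix_A.1")
```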

--------------------------------

### Export Lora Weights from SAT to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This command exports Lora weights trained in the SAT format to the Huggingface Diffusers format. It requires the path to the saved SAT model states and a directory for the exported weights.

```bash
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```

--------------------------------

### Lora Weight Structure Mapping (SAT to HF)

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

This mapping shows how Lora weight names in the SAT format correspond to names in the Huggingface Diffusers format. Lora adds low-rank weights to the attention layers, and the export script renames them as listed below.

```python
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
```

--------------------------------

### Configure Lora Fine-tuning

Source: https://github.com/thudm/cogvideo/blob/main/sat/README.md

Specific configuration for Lora-based fine-tuning, defining the LoraMixin target and rank parameters for the transformer model.

```yaml
model:
  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ]
  log_keys:
    - txt
  lora_config:
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```
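
The `r: 256` parameter sets the Lora rank. As a rough sketch of what the rank means for adapter size: a rank-`r` adapter on a `d_out x d_in` weight adds two matrices, `A` (`r x d_in`) and `B` (`d_out x r`). The hidden size of 1920 used below is an assumption for illustration, not a value taken from this config.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


hidden = 1920  # assumed hidden size, for illustration only
extra = lora_extra_params(hidden, hidden, r=256)
full = hidden * hidden  # parameters in the frozen square weight being adapted
ratio = extra / full    # adapter size relative to the frozen weight
```

Halving `r` halves `extra`, since the count is linear in the rank.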

--------------------------------

### Convert SAT Weights to Huggingface Diffusers

Source: https://github.com/thudm/cogvideo/blob/main/sat/README_ja.md

This Python script converts weights from the SAT format to the Huggingface Diffusers compatible format, which is necessary for using models with the Huggingface ecosystem.

```bash
python ../tools/convert_weight_sat2hf.py
```

--------------------------------

### BibTeX Citation for CogVideo Research

Source: https://github.com/thudm/cogvideo/blob/main/README.md

This snippet provides the standard BibTeX format for citing the CogVideo and CogVideoX research papers in academic publications. It includes entries for both the original CogVideo paper and the CogVideoX-2B/5B expert transformer model paper.

```bibtex
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}
```

--------------------------------

### Parallel Inference with xDiT

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Uses the xDiT library to parallelize video generation across multiple GPUs, reducing generation time for large-scale tasks.

```python
# Parallel inference setup
from tools.parallel_inference.parallel_inference_xdit import parallel_generate

parallel_generate(prompt="Cinematic mountain landscape", num_gpus=4)
```

--------------------------------

### POST /vae/process

Source: https://context7.com/thudm/cogvideo/llms.txt

Encodes video frames into latent representations or decodes latents back into video frames using the 3D Causal VAE.

```APIDOC
## POST /vae/process

### Description
Provides endpoints for VAE encoding (video to latent) and decoding (latent to video) for custom pipeline integration.

### Method
POST

### Endpoint
/vae/process

### Parameters
#### Request Body
- **action** (string) - Required - Either 'encode' or 'decode'.
- **data** (binary/tensor) - Required - Input video frames or latent tensor.
- **model_path** (string) - Required - Path to the VAE model weights.

### Request Example
{
  "action": "encode",
  "model_path": "THUDM/CogVideoX-2b/vae",
  "video_path": "input.mp4"
}

### Response
#### Success Response (200)
- **result** (tensor/file) - The processed output (latents or video frames).

#### Response Example
{
  "status": "success",
  "latent_shape": [1, 16, 12, 60, 90]
}
```

--------------------------------

### Enable Diffusers Memory Optimizations

Source: https://github.com/thudm/cogvideo/blob/main/README.md

Configures the diffusers pipeline to reduce memory consumption during inference. These methods are recommended for NVIDIA Ampere architectures and above to balance memory usage and speed.

```python
# Assumes `pipe` is an already-loaded diffusers CogVideoX pipeline.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```