### Environment Setup for DiTCtrl

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Shell commands that create a Conda environment and install PyTorch 2.4.1 (CUDA 12.1 wheels), the project requirements, and xFormers for the DiTCtrl project.

```bash
# Setup environment
conda create -n ditctrl python=3.10
conda activate ditctrl

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
conda install https://anaconda.org/xformers/xformers/0.0.28.post1/download/linux-64/xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1.tar.bz2
```
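
After installation, it can help to confirm that the pinned versions actually landed in the active environment. The following is a minimal sketch using only the Python standard library. The helper name `check_pins` is hypothetical, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Version pins copied from the install commands above.
EXPECTED = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.4.1",
    "xformers": "0.0.28.post1",
}

def check_pins(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from the PyTorch index may carry a local suffix such as "+cu121";
        # compare only the public part of the version string.
        public = installed.split("+")[0] if installed else None
        report[name] = (installed, public == pin)
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_pins(EXPECTED).items():
        print(f"{name:12s} installed={installed} ok={ok}")
```

Run it inside the activated `ditctrl` environment. A package reported as `installed=None` was not found at all.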

--------------------------------

### Install Git LFS

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

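Initializes Git Large File Storage (LFS) for your user account. Git LFS is recommended for fetching the large model weight files; the `git-lfs` package itself must already be installed for this command to work.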

```bash
git lfs install
```

--------------------------------

### Setup DiTCtrl Environment

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Creates and activates the 'ditctrl' conda environment, then installs PyTorch 2.4.1 built against CUDA 12.1, the project requirements, and xFormers. Run the commands from the root of the cloned DiTCtrl repository.

```bash
cd DiTCtrl

conda create -n ditctrl python=3.10
conda activate ditctrl

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

pip install -r requirements.txt

conda install https://anaconda.org/xformers/xformers/0.0.28.post1/download/linux-64/xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1.tar.bz2
```

--------------------------------

### Video Editing - Word Swap Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for video editing using word swap. The 'is_edit' flag must be set to True. This example swaps 'white' with 'red' for a vintage SUV.

```yaml
args:
  seed: 123
  output_dir: outputs/edit_case/suv
  sampling_fps: 8
  is_edit: True
  prompts:
    - "The camera captures a white vintage SUV with a black roof rack driving along a steep dirt road..."
    - "The camera captures a red vintage SUV with a black roof rack driving along a steep dirt road..."
```
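
When writing a word-swap pair, it is easy to change more than intended. The sketch below uses the standard library's `difflib` to list the words that differ between the two prompts. It is only a helper for checking a config by hand, not part of DiTCtrl; the function name `swapped_words` is hypothetical, and the prompts are shortened versions of the ones above.

```python
from difflib import SequenceMatcher

def swapped_words(source_prompt, target_prompt):
    """Return (source_words, target_words) pairs that differ between two prompts."""
    src, tgt = source_prompt.split(), target_prompt.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (" ".join(src[i1:i2]), " ".join(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

source = "The camera captures a white vintage SUV with a black roof rack"
target = "The camera captures a red vintage SUV with a black roof rack"
print(swapped_words(source, target))
```

A single pair in the result indicates a clean one-word swap. Several pairs suggest the prompts diverge in more places than planned.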

--------------------------------

### Run Multi-Prompt Video Generation (Bash)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Runs multi-prompt video generation from the 'sat' directory. The first block uses the bundled wrapper script; the second calls `sample_video.py` directly with a custom case configuration.

```bash
cd sat
bash run_multi_prompt.sh
```

```bash
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python sample_video.py \
  --base configs/cogvideox_2b.yaml configs/inference.yaml \
  --custom-config inference_case_configs/multi_prompts/rose.yaml
```

--------------------------------

### Expected Directory Structure

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Illustrates the expected directory structure after downloading and setting up the model weights and T5 encoder for the DiTCtrl project.

```text
# Expected directory structure
sat/
├── CogVideoX-2b-sat/
│   ├── transformer/
│   │   ├── 1000/
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── t5-v1_1-xxl/
│   │   ├── config.json
│   │   ├── model-00001-of-00002.safetensors
│   │   └── ...
│   └── vae/
│       └── 3d-vae.pt
├── configs/
├── inference_case_configs/
└── sample_video.py
```
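
A quick way to confirm the layout before launching inference is to check for the files named in the tree. This is a minimal sketch with `pathlib`. The helper `missing_paths` is hypothetical, and it checks only the entries that are spelled out in full above (elided entries such as `...` are skipped).

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt",
    "CogVideoX-2b-sat/transformer/latest",
    "CogVideoX-2b-sat/t5-v1_1-xxl/config.json",
    "CogVideoX-2b-sat/vae/3d-vae.pt",
    "configs",
    "inference_case_configs",
    "sample_video.py",
]

def missing_paths(sat_root):
    """Return the expected paths that do not exist under sat_root."""
    root = Path(sat_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_paths("sat")
    print("all expected paths present" if not missing else f"missing: {missing}")
```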

--------------------------------

### Download CogVideoX-2B Model Weights

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Downloads and unzips the VAE and transformer components of the CogVideoX-2B model. Arrange the downloaded files into the specified directory structure.

```bash
cd sat
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

--------------------------------

### Multi-Prompt Video Generation Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration file for multi-prompt video generation. Specifies output directory, seed, and the sequence of prompts for scene transitions. Set 'is_run_isolated' to True for comparison.

```yaml
args:
  is_run_isolated: False
  seed: 42
  output_dir: outputs/multi_prompt_case/rose
  prompts:
    - "A gentle close shot of the same rose petal, where the camera gradually pulls back to reveal the entire unfurling bloom in its perfect symmetry."
    - "A steady medium shot of the rose, where the camera continues retreating to show the full stem with its leaves and neighboring buds."
    - "A smooth full shot of the rose bush, where the camera moves further back to encompass the entire garden bed and surrounding flowering plants."
```

--------------------------------

### Generate Conditioning for Video Prompts

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Generate text conditioning for video prompts with transition interpolation using generate_conditioning_parts. Calculate the total video length based on prompt count, tile size, overlap, and transition blocks.

```python
from sample_video import generate_conditioning_parts, calculate_video_length

prompts = [
    "A rose petal in close-up view...",
    "A medium shot of the rose...",
    "A full shot of the rose bush..."
]

# Generate conditions with transition blocks.
# `model` is assumed to be an already-loaded SATVideoDiffusionEngine.
c_total, uc_total = generate_conditioning_parts(
    prompts=prompts,
    model=model,
    num_samples=[1],
    num_transition_blocks=2,    # Gradual transitions between prompts
    longer_mid_segment=0        # Extra time for middle segments
)

# Calculate total video length
video_length = calculate_video_length(
    prompts_length=len(prompts),
    tile_size=13,
    overlap_size=9,
    num_transition_blocks=2,
    longer_mid_segment=0
)
# Formula: total_segments = num_prompts + num_transition_blocks * (num_prompts - 1) + longer_mid_segment * (num_prompts - 2)
```
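
The segment formula quoted in the final comment can be worked through by hand. The sketch below reimplements it as a standalone function and evaluates it for the three-prompt example. The second helper, `sliding_window_frames`, is an assumption: it uses the usual length of an overlapping sliding window and is not confirmed to match what `calculate_video_length` returns.

```python
def total_segments(num_prompts, num_transition_blocks, longer_mid_segment):
    """Segment count, following the formula quoted in the snippet above."""
    return (
        num_prompts
        + num_transition_blocks * (num_prompts - 1)
        + longer_mid_segment * (num_prompts - 2)
    )

def sliding_window_frames(segments, tile_size, overlap_size):
    """ASSUMPTION: each extra segment contributes (tile_size - overlap_size) new frames."""
    return tile_size + (segments - 1) * (tile_size - overlap_size)

segments = total_segments(num_prompts=3, num_transition_blocks=2, longer_mid_segment=0)
frames = sliding_window_frames(segments, tile_size=13, overlap_size=9)
print(segments)  # 3 + 2 * 2 + 0 * 1 = 7
print(frames)    # 13 + 6 * 4 = 37 under the sliding-window assumption
```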

--------------------------------

### LLM Prompt Generation Instruction

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Provides instructions for an LLM to generate multi-prompt video sequences with smooth transitions. Emphasizes slight prompt variations, self-contained prompts, and word count limits.

```markdown
# prompts_gen_instruction/ditctrl.md

Copy this instruction to GPT-4 or similar LLM:

"""You are part of a team of bots that creates multi-prompt videos. You work with
an assistant bot that will draw anything you say in square brackets.

For example, outputting 'a beautiful morning in the woods with the sun peeking
through the trees' will trigger your partner bot to output a video of a forest
morning, as described.

Rules:
1. Generate prompts that differ only slightly for smooth transitions
2. Avoid words like "the same" - prompts should be self-contained
3. Maximum 226 words per prompt
4. Keep similar word count across prompts in a group

Example 3-prompt sequence:
'A dark knight rests motionless atop a majestic black horse in the middle of
a vast grassland...; A dark knight guides the majestic black horse at a steady
gallop across a snow-covered field...; A dark knight guides the majestic black
horse at a steady gallop across the vast desert expanse...'"""
```
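
The rules above are simple enough to check mechanically before a prompt group goes into a config file. The following is a small sketch of such a check. The function name and the `max_length_ratio` threshold for "similar word count" are hypothetical; only the 226-word limit and the "the same" rule come from the instruction text.

```python
def check_prompt_group(prompts, max_words=226, max_length_ratio=1.5):
    """Return a list of human-readable rule violations for a group of prompts.

    max_length_ratio is a hypothetical threshold for "similar word count".
    """
    problems = []
    counts = [len(p.split()) for p in prompts]
    for i, (prompt, count) in enumerate(zip(prompts, counts), start=1):
        if count > max_words:
            problems.append(f"prompt {i}: {count} words exceeds {max_words}")
        if "the same" in prompt.lower():
            problems.append(f"prompt {i}: contains 'the same' (not self-contained)")
    if counts and min(counts) > 0 and max(counts) / min(counts) > max_length_ratio:
        problems.append(f"word counts vary too much: {counts}")
    return problems

# The example sequence uses ';' as the separator between prompts.
raw = (
    "A dark knight rests motionless atop a majestic black horse; "
    "A dark knight guides the majestic black horse at a steady gallop"
)
group = [part.strip() for part in raw.split(";")]
print(check_prompt_group(group))
print(check_prompt_group(["A close shot of the same rose petal"]))
```

An empty list means the group passes all three checks.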

--------------------------------

### Run Video Editing Script

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for video editing tasks. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_edit_video.sh
```

--------------------------------

### Run Video Editing - Attention Reweighting (Bash)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Runs attention reweighting for video editing, which adjusts the influence of a chosen token within the configured layers and denoising steps. The script lives in the 'sat' directory.

```bash
cd sat
bash run_reweight_video.sh
```

--------------------------------

### Run Longer Single-Prompt Text-to-Video

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for generating longer videos from a single text prompt. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_single_prompt.sh
```

--------------------------------

### Run Longer Multi-Prompt Text-to-Video

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for generating longer videos from multiple text prompts. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_multi_prompt.sh
```

--------------------------------

### Visualize Attention Maps with Bash Script

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Run this bash script to visualize attention maps. Ensure you are in the 'sat' directory before execution.

```bash
cd sat
bash run_visualize.sh
```

--------------------------------

### Download T5 Encoder

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Shell commands to clone the CogVideoX-2b repository and move the T5 encoder components to the correct directory for DiTCtrl.

```bash
# Download T5 encoder
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```

--------------------------------

### Video Editing - Attention Reweighting Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for attention reweighting. Specifies the token index, reweighting scale, and the layers/steps to apply the changes. Uses 'ReWeightAdaLNMixin'.

```yaml
args:
  seed: 42
  output_dir: outputs/reweight_case/bear_0
  sampling_fps: 8
  reweight_token_idx: 0
  reweight_scale: 0
  start_step: 0
  end_step: 50
  start_layer: 0
  end_layer: 30
  adaln_mixin_names:
    - 'ReWeightAdaLNMixin'
  prompts:
    - "pink teddy bear wearing a cute pink bow tie"
    - "pink teddy bear wearing a cute pink bow tie"
```
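
The general idea behind attention reweighting is to scale the attention that every query pays to one text token. The sketch below illustrates that idea on a toy attention matrix with NumPy. It is not DiTCtrl's `ReWeightAdaLNMixin`: the function name and array layout are assumptions, and whether the real implementation renormalizes rows afterwards is not stated in the source.

```python
import numpy as np

def reweight_token(attn, token_idx, scale):
    """Scale the attention paid to one text token.

    attn: array of shape (num_queries, num_text_tokens) whose rows sum to 1.
    Returns a copy; rows are NOT renormalized (an assumption of this sketch).
    """
    out = np.array(attn, dtype=float, copy=True)
    out[:, token_idx] *= scale
    return out

attn = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3]])

# Mirrors reweight_token_idx: 0 and reweight_scale: 0 from the config above.
print(reweight_token(attn, token_idx=0, scale=0.0))
```

With a scale of 0 the chosen token is suppressed entirely, while a scale above 1 would amplify it.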

--------------------------------

### Single-Prompt Longer Video Generation Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for single-prompt longer video generation. The 'single_prompt_length' argument controls the video extension factor. Defines the prompt and output settings.

```yaml
args:
  seed: 42
  single_prompt_length: 5
  output_dir: outputs/single_prompt_case/fish
  prompts:
    - "A vibrant school of tropical fish weaves through an intricate coral reef system, their scales shimmering like jewels in the filtered sunlight. Brilliant parrotfish, angelfish, and clownfish create a living rainbow as they navigate between branches of staghorn coral and giant sea fans. Rays of sunlight pierce the crystal-clear water, creating dancing light patterns on the coral below."
```

--------------------------------

### Inference Configuration for Video Generation

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configure detailed inference parameters in inference.yaml for KV-sharing and latent blending. Adjust settings like latent channels, batch size, sampling frames, FPS, and seed for controlled generation.

```yaml
args:
  latent_channels: 16
  mode: inference
  load: "CogVideoX-2b-sat/transformer"
  batch_size: 1
  sampling_num_frames: 13
  sampling_fps: 16
  fp16: True
  seed: 42
  output_dir: outputs/multi_prompt_case

  # KV-sharing strategy selection
  adaln_mixin_names:
    - 'KVSharingAdaLNMixin'          # Basic KV-sharing
    # - 'KVSharingMaskGuidedAdaLNMixin'  # Mask-guided KV-sharing for precise control

  # Step and layer control for KV-sharing
  start_step: 2
  end_step: 25
  start_layer: 25
  end_layer: 30

  # Latent blending parameters
  overlap_size: 9              # Overlap frames between segments (9 recommended)
  num_transition_blocks: 2     # Transition blocks between prompts (2 recommended)
  longer_mid_segment: 0        # Extra segments for middle prompts

  # Mask-guided parameters
  thres: 0.3                   # Threshold for segmentation binary mask
  ref_token_idx: [0]           # Reference token indices for mask guidance
  cur_token_idx: [0]           # Current token indices
```

--------------------------------

### Execute Custom Inference Command

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Constructs and evaluates a command to run the video generation script with a custom inference configuration file. This allows for overriding default settings.

```bash
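# Single-process environment variables, as used when calling sample_video.py directly
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
inference_case_config="inference_case_configs/multi_prompts/rose.yaml"
run_cmd="$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --custom-config $inference_case_config"
echo "${run_cmd}"
eval "${run_cmd}"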
```
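
Overriding default settings here means that values in the case-specific file take precedence over those in the base files. The sketch below shows that precedence rule with plain dictionaries. The real merging is performed by the repository's argument handling, which is not shown in the source, so `merge_configs` is a hypothetical illustration of the semantics rather than the actual mechanism.

```python
def merge_configs(base, override):
    """Recursively merge two nested dicts; values in override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

# Values abbreviated from the inference defaults and the multi-prompt rose example.
base = {"args": {"seed": 42, "sampling_fps": 16, "output_dir": "outputs/multi_prompt_case"}}
case = {"args": {"output_dir": "outputs/multi_prompt_case/rose", "is_run_isolated": False}}

merged = merge_configs(base, case)
print(merged["args"])
```

Keys that appear only in the base (such as `sampling_fps`) survive, while keys set in the case config replace the defaults.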

--------------------------------

### Clone and Organize T5 Model

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Clones the CogVideoX-2b repository from Hugging Face (a ModelScope mirror is given as a commented-out alternative) and moves its T5 text encoder and tokenizer files into the 't5-v1_1-xxl' directory.

```bash
git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
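
On systems without a POSIX shell, the `mkdir` and `mv` steps can be reproduced with the Python standard library. This is a minimal sketch; the helper name `collect_t5_files` is hypothetical. Unlike `mv dir/*`, it also moves hidden files, which is a small behavioral difference.

```python
import shutil
from pathlib import Path

def collect_t5_files(repo_dir, target_dir="t5-v1_1-xxl"):
    """Move everything in text_encoder/ and tokenizer/ into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for sub in ("text_encoder", "tokenizer"):
        for item in sorted((Path(repo_dir) / sub).iterdir()):
            shutil.move(str(item), str(target / item.name))
            moved.append(item.name)
    return moved

# Usage, after cloning the repository into the current directory:
# collect_t5_files("CogVideoX-2b")
```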

--------------------------------

### Set Hugging Face Mirror Endpoint

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Sets the HF_ENDPOINT environment variable to use a mirror for downloading Hugging Face models, which can help resolve 'HeaderTooLarge' errors.

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download THUDM/CogVideoX-2b --local-dir ./CogVideoX-2b
```

--------------------------------

### SATVideoDiffusionEngine API for Sampling

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Utilize the SATVideoDiffusionEngine for video diffusion sampling. Supports single and multi-prompt generation, with options to switch between different attention layer strategies like KV-sharing and attention reweighting.

```python
from diffusion_video import SATVideoDiffusionEngine
from sat.model.base_model import get_model
from sat.training.model_io import load_checkpoint

# Initialize the model
model = get_model(args, SATVideoDiffusionEngine)
load_checkpoint(model, args)
model.eval()

# Single prompt sampling
samples = model.sample_single(
    cond=c,           # Conditioning dict with 'crossattn' key
    uc=uc,            # Unconditional conditioning
    randn=noise,      # Random noise tensor [B, T, C, H//8, W//8]
)

# Multi-prompt sampling
samples = model.sample_multi_prompt(
    cond=c_list,      # List of conditioning dicts
    uc=uc_list,       # List of unconditional conditionings
    randn=noise,      # Random noise tensor for full video
    tile_size=13,     # Frames per tile
    overlap_size=9,   # Overlap between tiles
)

# Switch attention layer strategy
model.switch_adaln_layer('KVSharingAdaLNMixin')      # Basic KV-sharing
model.switch_adaln_layer('KVSharingMaskGuidedAdaLNMixin')  # Mask-guided
model.switch_adaln_layer('ReWeightAdaLNMixin')       # Attention reweighting
model.switch_adaln_layer('BaseAdaLNMixin')           # Base implementation
```
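
The `randn` argument expects noise in latent space, with the spatial dimensions reduced by a factor of 8 as the inline comment indicates. The helper below only computes that shape. Its name is hypothetical, and the 480x720 resolution is chosen purely for illustration and does not come from the source.

```python
def noise_shape(batch_size, num_frames, latent_channels, height, width, spatial_factor=8):
    """Shape [B, T, C, H//8, W//8] of the noise tensor described above."""
    return (
        batch_size,
        num_frames,
        latent_channels,
        height // spatial_factor,
        width // spatial_factor,
    )

# batch_size=1, 13 frames and 16 latent channels follow the inference settings;
# height and width are hypothetical values.
shape = noise_shape(batch_size=1, num_frames=13, latent_channels=16, height=480, width=720)
print(shape)
```

A tensor of this shape could then be drawn with, for example, `torch.randn(*shape)` before being passed as `randn`.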

--------------------------------

### CSCV Metric Evaluation Script

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Evaluate video consistency with the CSCV (Cosine Similarity Coefficient of Variation) metric, either through the provided bash script or by running the Python script directly. The direct invocation takes the video path, target seed, feature extractor, extractor checkpoint, and image size.

```bash
# Run CSCV metric evaluation
cd metrics
bash run_cscv.sh

# Or run directly
python cscv_metric.py \
  --video_path /path/to/generated/videos \
  --target_seed 42 \
  --feature_extractor clip \
  --extractor_path openai/clip-vit-base-patch32 \
  --image_size 224
```

--------------------------------

### Calculate Uniformity Score with NumPy

Source: https://context7.com/tencentarc/ditctrl/llms.txt

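Calculates a uniformity score from the cosine similarity of adjacent frames: feature vectors are L2-normalized, the coefficient of variation (std / mean) of the adjacent-frame similarities is scaled by 10, and the score is 1 / (1 + CV), so values closer to 1 indicate a more uniform sequence. Requires NumPy for array operations.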

```python
import numpy as np

def uniformity_score(points):
    """
    Calculate uniformity score between adjacent frames using dot product similarity.

    Args:
        points: numpy array, shape (n_frames, n_dimensions) - frame features

    Returns:
        float: uniformity score between 0-1, closer to 1 means more uniform/consistent
    """
    # Normalize feature vectors
    normalized_points = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Calculate cosine similarity between adjacent frames
    similarities = np.sum(normalized_points[:-1] * normalized_points[1:], axis=1)

    # Calculate coefficient of variation CV = std/mean * 10
    cv = np.std(similarities) / np.mean(similarities) * 10

    # Score: 1/(1 + CV), higher is better
    score = 1 / (1 + cv)
    return score
```
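
To see how the score reacts to an abrupt change, it can be applied to synthetic features. The sketch below restates the function compactly so that it runs on its own, then compares a smoothly drifting sequence with one that contains a single unrelated frame. The random vectors stand in for real frame features (for example from CLIP) and are an assumption of this example.

```python
import numpy as np

def uniformity_score(points):
    # Compact restatement of the function above so this example runs on its own.
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return 1 / (1 + np.std(sims) / np.mean(sims) * 10)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))

# Smooth sequence: every frame drifts by the same small amount.
steady = base + 0.01 * np.arange(8)[:, None]

# Jumpy sequence: identical to `steady` except for one unrelated frame in the middle.
jumpy = steady.copy()
jumpy[4] = rng.normal(size=64)

print(uniformity_score(steady), uniformity_score(jumpy))
```

The smooth sequence scores close to 1, while the single outlier frame lowers two adjacent similarities and pulls the score down noticeably.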
