### Environment Setup for DiTCtrl

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Shell commands to set up a Conda environment, install PyTorch with CUDA support, and install the other dependencies for the DiTCtrl project.

```bash
# Setup environment
conda create -n ditctrl python=3.10
conda activate ditctrl

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
conda install xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1
```

--------------------------------

### Install Git LFS

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Installs Git Large File Storage (LFS), which is recommended for handling large model weight files.

```bash
git lfs install
```

--------------------------------

### Setup DiTCtrl Environment

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Creates and activates the 'ditctrl' conda environment, then installs PyTorch with CUDA 12.1 support, the remaining dependencies, and xFormers. Run the install commands only after the environment is active.

```bash
cd DiTCtrl

conda create -n ditctrl python=3.10
conda activate ditctrl

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
conda install https://anaconda.org/xformers/xformers/0.0.28.post1/download/linux-64/xformers-0.0.28.post1-py310_cu12.1.0_pyt2.4.1.tar.bz2
```

--------------------------------

### Video Editing - Word Swap Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for video editing using word swap. The 'is_edit' flag must be set to True. This example swaps 'white' with 'red' in a prompt describing a vintage SUV.

```yaml
args:
  seed: 123
  output_dir: outputs/edit_case/suv
  sampling_fps: 8
  is_edit: True
  prompts:
    - "The camera captures a white vintage SUV with a black roof rack driving along a steep dirt road..."
    - "The camera captures a red vintage SUV with a black roof rack driving along a steep dirt road..."
```

--------------------------------

### Run Multi-Prompt Video Generation (Bash)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Executes the multi-prompt video generation script. Ensure you are in the 'sat' directory. Alternatively, run directly using a custom configuration file.

```bash
cd sat
bash run_multi_prompt.sh
```

```bash
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python sample_video.py \
    --base configs/cogvideox_2b.yaml configs/inference.yaml \
    --custom-config inference_case_configs/multi_prompts/rose.yaml
```

--------------------------------

### Expected Directory Structure

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Illustrates the expected directory structure after downloading and setting up the model weights and T5 encoder for the DiTCtrl project.

```text
# Expected directory structure
sat/
├── CogVideoX-2b-sat/
│   ├── transformer/
│   │   ├── 1000/
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── t5-v1_1-xxl/
│   │   ├── config.json
│   │   ├── model-00001-of-00002.safetensors
│   │   └── ...
│   └── vae/
│       └── 3d-vae.pt
├── configs/
├── inference_case_configs/
└── sample_video.py
```

--------------------------------

### Download CogVideoX-2B Model Weights

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Downloads and unzips the VAE and transformer components of the CogVideoX-2B model. Arrange the downloaded files into the specified directory structure.
```bash
cd sat
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

--------------------------------

### Multi-Prompt Video Generation Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration file for multi-prompt video generation. Specifies the output directory, seed, and the sequence of prompts for scene transitions. Set 'is_run_isolated' to True for comparison.

```yaml
args:
  is_run_isolated: False
  seed: 42
  output_dir: outputs/multi_prompt_case/rose
  prompts:
    - "A gentle close shot of the same rose petal, where the camera gradually pulls back to reveal the entire unfurling bloom in its perfect symmetry."
    - "A steady medium shot of the rose, where the camera continues retreating to show the full stem with its leaves and neighboring buds."
    - "A smooth full shot of the rose bush, where the camera moves further back to encompass the entire garden bed and surrounding flowering plants."
```

--------------------------------

### Generate Conditioning for Video Prompts

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Generate text conditioning for video prompts with transition interpolation using generate_conditioning_parts. Calculate the total video length based on prompt count, tile size, overlap, and transition blocks.

```python
from sample_video import generate_conditioning_parts, calculate_video_length

prompts = [
    "A rose petal in close-up view...",
    "A medium shot of the rose...",
    "A full shot of the rose bush..."
]

# Generate conditions with transition blocks
c_total, uc_total = generate_conditioning_parts(
    prompts=prompts,
    model=model,
    num_samples=[1],
    num_transition_blocks=2,  # Gradual transitions between prompts
    longer_mid_segment=0      # Extra time for middle segments
)

# Calculate total video length
video_length = calculate_video_length(
    prompts_length=len(prompts),
    tile_size=13,
    overlap_size=9,
    num_transition_blocks=2,
    longer_mid_segment=0
)
# Formula: total_segments = num_prompts + num_transition_blocks * (num_prompts - 1) + longer_mid_segment * (num_prompts - 2)
```

--------------------------------

### LLM Prompt Generation Instruction

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Provides instructions for an LLM to generate multi-prompt video sequences with smooth transitions. Emphasizes slight prompt variations, self-contained prompts, and word count limits.

```markdown
# prompts_gen_instruction/ditctrl.md
Copy this instruction to GPT-4 or similar LLM:

"""You are part of a team of bots that creates multi-prompt videos. You work with an assistant bot that will draw anything you say in square brackets. For example, outputting 'a beautiful morning in the woods with the sun peaking through the trees' will trigger your partner bot to output a video of a forest morning, as described.

Rules:
1. Generate prompts that differ only slightly for smooth transitions
2. Avoid words like "the same" - prompts should be self-contained
3. Maximum 226 words per prompt
4. Keep similar word count across prompts in a group

Example 3-prompt sequence:
'A dark knight rests motionless atop a majestic black horse in the middle of a vast grassland...; A dark knight guides the majestic black horse at a steady gallop across a snow-covered field...; A dark knight guides the majestic black horse at a steady gallop across the vast desert expanse...'"""
```

--------------------------------

### Run Video Editing Script

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for video editing tasks. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_edit_video.sh
```

--------------------------------

### Run Video Editing - Attention Reweighting (Bash)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Executes the script for attention reweighting in video editing. This feature allows adjustment of token influence within specified layers and steps. The script is located in the 'sat' directory.

```bash
cd sat
bash run_reweight_video.sh
```

--------------------------------

### Run Longer Single-Prompt Text-to-Video

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for generating longer videos from a single text prompt. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_single_prompt.sh
```

--------------------------------

### Run Longer Multi-Prompt Text-to-Video

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Executes the script for generating longer videos from multiple text prompts. Ensure you are in the 'sat' directory.

```bash
cd sat
bash run_multi_prompt.sh
```

--------------------------------

### Visualize Attention Maps with Bash Script

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Run this bash script to visualize attention maps. Ensure you are in the 'sat' directory before execution.
```bash
cd sat
bash run_visualize.sh
```

--------------------------------

### Download T5 Encoder

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Shell commands to clone the CogVideoX-2b repository and move the T5 encoder components to the correct directory for DiTCtrl.

```bash
# Download T5 encoder
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```

--------------------------------

### Video Editing - Attention Reweighting Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for attention reweighting. Specifies the token index, reweighting scale, and the layers/steps to apply the changes. Uses 'ReWeightAdaLNMixin'.

```yaml
args:
  seed: 42
  output_dir: outputs/reweight_case/bear_0
  sampling_fps: 8
  reweight_token_idx: 0
  reweight_scale: 0
  start_step: 0
  end_step: 50
  start_layer: 0
  end_layer: 30
  adaln_mixin_names:
    - 'ReWeightAdaLNMixin'
  prompts:
    - "pink teddy bear wearing a cute pink bow tie"
    - "pink teddy bear wearing a cute pink bow tie"
```

--------------------------------

### Single-Prompt Longer Video Generation Configuration (YAML)

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configuration for single-prompt longer video generation. The 'single_prompt_length' argument controls the video extension factor. Defines the prompt and output settings.

```yaml
args:
  seed: 42
  single_prompt_length: 5
  output_dir: outputs/single_prompt_case/fish
  prompts:
    - "A vibrant school of tropical fish weaves through an intricate coral reef system, their scales shimmering like jewels in the filtered sunlight. Brilliant parrotfish, angelfish, and clownfish create a living rainbow as they navigate between branches of staghorn coral and giant sea fans. Rays of sunlight pierce the crystal-clear water, creating dancing light patterns on the coral below."
```

--------------------------------

### Inference Configuration for Video Generation

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Configure detailed inference parameters in configs/inference.yaml for KV-sharing and latent blending.
Adjust settings like latent channels, batch size, sampling frames, FPS, and seed for controlled generation.

```yaml
args:
  latent_channels: 16
  mode: inference
  load: "CogVideoX-2b-sat/transformer"
  batch_size: 1
  sampling_num_frames: 13
  sampling_fps: 16
  fp16: True
  seed: 42
  output_dir: outputs/multi_prompt_case

  # KV-sharing strategy selection
  adaln_mixin_names:
    - 'KVSharingAdaLNMixin'  # Basic KV-sharing
    # - 'KVSharingMaskGuidedAdaLNMixin'  # Mask-guided KV-sharing for precise control

  # Step and layer control for KV-sharing
  start_step: 2
  end_step: 25
  start_layer: 25
  end_layer: 30

  # Latent blending parameters
  overlap_size: 9            # Overlap frames between segments (9 recommended)
  num_transition_blocks: 2   # Transition blocks between prompts (2 recommended)
  longer_mid_segment: 0      # Extra segments for middle prompts

  # Mask-guided parameters
  thres: 0.3                 # Threshold for segmentation binary mask
  ref_token_idx: [0]         # Reference token indices for mask guidance
  cur_token_idx: [0]         # Current token indices
```

--------------------------------

### Execute Custom Inference Command

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Constructs and evaluates a command to run the video generation script with a custom inference configuration file. This allows for overriding default settings.

```bash
# Environment variables for a single-process run (assumed here; adjust for multi-GPU setups)
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"

inference_case_config="inference_case_configs/multi_prompts/rose.yaml"

run_cmd="$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --custom-config $inference_case_config"

echo ${run_cmd}
eval ${run_cmd}
```

--------------------------------

### Clone and Organize T5 Model

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Clones the CogVideoX-2b repository from Hugging Face and moves its text encoder and tokenizer files into the 't5-v1_1-xxl' directory for use as the T5 encoder.
```bash
git clone https://huggingface.co/THUDM/CogVideoX-2b.git  # Download model from Hugging Face
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git  # Download from ModelScope
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```

--------------------------------

### Set Hugging Face Mirror Endpoint

Source: https://github.com/tencentarc/ditctrl/blob/main/README.md

Sets the HF_ENDPOINT environment variable to use a mirror for downloading Hugging Face models, which can help resolve 'HeaderTooLarge' errors.

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download THUDM/CogVideoX-2b --local-dir ./CogVideoX-2b
```

--------------------------------

### SATVideoDiffusionEngine API for Sampling

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Utilize the SATVideoDiffusionEngine for video diffusion sampling.
Supports single and multi-prompt generation, with options to switch between different attention layer strategies like KV-sharing and attention reweighting.

```python
from diffusion_video import SATVideoDiffusionEngine
from sat.model.base_model import get_model
from sat.training.model_io import load_checkpoint

# `args`, `c`, `uc`, `c_list`, `uc_list`, and `noise` are assumed to be prepared by the caller.

# Initialize the model
model = get_model(args, SATVideoDiffusionEngine)
load_checkpoint(model, args)
model.eval()

# Single prompt sampling
samples = model.sample_single(
    cond=c,        # Conditioning dict with 'crossattn' key
    uc=uc,         # Unconditional conditioning
    randn=noise,   # Random noise tensor [B, T, C, H//8, W//8]
)

# Multi-prompt sampling
samples = model.sample_multi_prompt(
    cond=c_list,      # List of conditioning dicts
    uc=uc_list,       # List of unconditional conditionings
    randn=noise,      # Random noise tensor for full video
    tile_size=13,     # Frames per tile
    overlap_size=9,   # Overlap between tiles
)

# Switch attention layer strategy
model.switch_adaln_layer('KVSharingAdaLNMixin')            # Basic KV-sharing
model.switch_adaln_layer('KVSharingMaskGuidedAdaLNMixin')  # Mask-guided
model.switch_adaln_layer('ReWeightAdaLNMixin')             # Attention reweighting
model.switch_adaln_layer('BaseAdaLNMixin')                 # Base implementation
```

--------------------------------

### CSCV Metric Evaluation Script

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Evaluate video consistency using the CSCV metric by running the provided bash script or executing the Python script directly. Specify video paths, target seed, feature extractor, and image size for evaluation.
```bash
# Run CSCV (Cosine Similarity Coefficient of Variation) metric evaluation
cd metrics
bash run_cscv.sh

# Or run directly
python cscv_metric.py \
    --video_path /path/to/generated/videos \
    --target_seed 42 \
    --feature_extractor clip \
    --extractor_path openai/clip-vit-base-patch32 \
    --image_size 224
```

--------------------------------

### Calculate Uniformity Score with NumPy

Source: https://context7.com/tencentarc/ditctrl/llms.txt

Calculates the uniformity score between adjacent frames using normalized feature vectors and dot product similarity. Requires NumPy for array operations.

```python
import numpy as np

def uniformity_score(points):
    """
    Calculate uniformity score between adjacent frames using dot product similarity.

    Args:
        points: numpy array, shape (n_frames, n_dimensions) - frame features

    Returns:
        float: uniformity score between 0-1, closer to 1 means more uniform/consistent
    """
    # Normalize feature vectors
    normalized_points = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Calculate cosine similarity between adjacent frames
    similarities = np.sum(normalized_points[:-1] * normalized_points[1:], axis=1)

    # Coefficient of variation, scaled by 10: CV = std / mean * 10
    cv = np.std(similarities) / np.mean(similarities) * 10

    # Score: 1 / (1 + CV), higher is better
    score = 1 / (1 + cv)

    return score
```
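To make the uniformity scoring rule concrete, here is a minimal sketch that applies the same formula to toy data. The 2-D unit vectors, the step sizes, and the `frames_from_steps` helper are illustrative assumptions standing in for real per-frame features (for example CLIP embeddings); only the arithmetic inside `uniformity_score` follows the snippet above.

```python
import numpy as np


def uniformity_score(points):
    """Same arithmetic as the uniformity score above: 1 / (1 + 10 * std / mean)."""
    normalized = points / np.linalg.norm(points, axis=1, keepdims=True)
    similarities = np.sum(normalized[:-1] * normalized[1:], axis=1)
    cv = np.std(similarities) / np.mean(similarities) * 10
    return 1 / (1 + cv)


def frames_from_steps(steps):
    """Hypothetical helper: build toy 2-D frame features by rotating a unit vector.

    Each entry of `steps` is the angle (in radians) between two adjacent frames.
    """
    angles = np.concatenate([[0.0], np.cumsum(steps)])
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


# 16 frames that change by the same small amount at every step.
steady_score = uniformity_score(frames_from_steps(np.full(15, 0.05)))

# 16 frames whose change alternates between a small and a large step.
uneven_steps = np.where(np.arange(15) % 2 == 0, 0.05, 0.6)
uneven_score = uniformity_score(frames_from_steps(uneven_steps))

print(f"steady: {steady_score:.3f}, uneven: {uneven_score:.3f}")
```

In this toy setup the steady sequence scores very close to 1, while the uneven one drops to about 0.5. The score reflects how much the adjacent-frame similarity varies, not how large the change is, so a sequence that changes at a constant rate is treated as uniform.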