### Install videopython

Source: https://videopython.com/

Install the core videopython library or the full package with all AI features. GPU is recommended for AI features.

```bash
pip install videopython          # core editing
```

```bash
pip install "videopython[ai]"    # + ALL local AI features (GPU recommended)
```

--------------------------------

### Basic VideoPython Installation

Source: https://videopython.com/getting-started/installation

Installs the core VideoPython package for basic video handling and processing using pip or uv.

```bash
pip install videopython
```

```bash
# Or with uv
uv add videopython
```

--------------------------------

### Install VideoPython with All AI Features

Source: https://videopython.com/getting-started/installation

Installs VideoPython with all AI-powered features, including generation, understanding, and dubbing, using pip or uv.

```bash
pip install "videopython[ai]"
```

```bash
# Or with uv
uv add videopython --extra ai
```

--------------------------------

### Install VideoDubber with Dubbing Extras

Source: https://videopython.com/api/ai/dubbing

Install the 'dub' extra for the core dubbing pipeline. Include 'tts' for default local speech synthesis.

```bash
pip install "videopython[dub]"        # pipeline WITHOUT local TTS
pip install "videopython[dub,tts]"    # + default local voice synthesis
```

--------------------------------

### Install FFmpeg on Ubuntu/Debian

Source: https://videopython.com/getting-started/installation

Installs FFmpeg using apt-get on Ubuntu/Debian systems. FFmpeg is a required prerequisite for VideoPython.

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg
```

--------------------------------

### Install VideoPython with Specific AI Extras

Source: https://videopython.com/getting-started/installation

Installs VideoPython with specific AI capabilities like ASR, vision, separation, translation, TTS, or generation. Use granular extras for smaller, conflict-free installations.

```bash
pip install "videopython[asr]"          # just transcription
```

```bash
pip install "videopython[dub,tts]"      # dubbing with local TTS
```

--------------------------------

### Install FFmpeg on Windows

Source: https://videopython.com/getting-started/installation

Installs FFmpeg on Windows using Chocolatey. FFmpeg is a required prerequisite for VideoPython.

```bash
# Windows (with Chocolatey)
choco install ffmpeg
```

--------------------------------

### Audio Class Usage Examples

Source: https://videopython.com/api/core/audio

Demonstrates common operations with the Audio class, such as loading, creating, manipulating, and saving audio files.

```python
from videopython.audio import Audio

# Load from file
audio = Audio.from_path("music.mp3")

# Create silent track
silent = Audio.create_silent(duration_seconds=5.0, stereo=True)

# Basic operations
mono = audio.to_mono()
resampled = audio.resample(16000)
segment = audio.slice(start_seconds=1.0, end_seconds=5.0)

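# Combine audio (audio1 and audio2 are any two Audio objects; file names are placeholders)
audio1 = Audio.from_path("intro.mp3")
audio2 = Audio.from_path("outro.mp3")
combined = audio1.concat(audio2, crossfade=0.5)
mixed = audio1.overlay(audio2, position=2.0)

# Save
audio.save("output.wav")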

```

--------------------------------

### Initialize Pyannote Speaker Diarization Pipeline

Source: https://videopython.com/api/ai/understanding

Initializes the pyannote speaker diarization pipeline. Requires 'pyannote.audio' to be installed.

```python
def _init_diarization(self) -> None:
    """Initialize pyannote speaker diarization pipeline."""
    import torch

    from videopython.ai._optional import require

    Pipeline = require("pyannote.audio", "asr", feature="AudioToText diarization").Pipeline

    self._diarization_pipeline = Pipeline.from_pretrained(
        self.PYANNOTE_DIARIZATION_MODEL, revision=pinned(self.PYANNOTE_DIARIZATION_MODEL)
    )
    self._diarization_pipeline.to(torch.device(self.device))
```
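
The `require(...)` call above loads an optional dependency only when the feature is actually used. The following is a minimal sketch of that lazy-import pattern, not videopython's implementation: `require_optional`, its parameters, the error message, and the reading of the second argument as the name of the pip extra are all assumptions made for illustration.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str, feature: str) -> ModuleType:
    """Import module_name lazily, or raise an ImportError naming the extra to install.

    Hypothetical helper for illustration; not videopython's real `require`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs '{module_name}'. "
            f'Install it with: pip install "videopython[{extra}]"'
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json", "asr", feature="demo")

# A missing module produces an actionable error message.
try:
    require_optional("surely_not_installed_pkg_xyz", "asr", feature="AudioToText diarization")
except ImportError as err:
    message = str(err)
```

Deferring the import to the point of use keeps the core package importable even when the heavy optional dependency is absent.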

--------------------------------

### Initialize and Use TextToMusic for Audio Generation

Source: https://videopython.com/api/ai/generation

Initializes the MusicGen model locally and generates audio from a text description. Ensure 'transformers' is installed for this feature.

```python
class TextToMusic(ManagedPredictor):
    """Generates music from text descriptions using MusicGen."""

    def __init__(self, device: str | None = None):
        self.device = device
        self._processor: Any = None
        self._model: Any = None

    def _init_local(self) -> None:
        """Initialize local MusicGen model."""
        import os

        from videopython.ai._optional import require

        _transformers = require("transformers", "generation", feature="TextToMusic")
        AutoProcessor = _transformers.AutoProcessor
        MusicgenForConditionalGeneration = _transformers.MusicgenForConditionalGeneration

        os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

        requested_device = self.device
        device = select_device(self.device, mps_allowed=True)

        model_name = "facebook/musicgen-small"
        self._processor = AutoProcessor.from_pretrained(model_name, revision=pinned(model_name))
        self._model = MusicgenForConditionalGeneration.from_pretrained(model_name, revision=pinned(model_name))
        self._model.to(device)
        self.device = device
        log_device_initialization(
            "TextToMusic",
            requested_device=requested_device,
            resolved_device=device,
        )

    def generate_audio(self, text: str, max_new_tokens: int = 256) -> Audio:
        """Generate music audio from text description."""
        if self._model is None:
            self._init_local()

        inputs = self._processor(text=[text], padding=True, return_tensors="pt")
        inputs = {k: v.to(self.device) if hasattr(v, "to") else v for k, v in inputs.items()}
        audio_values = self._model.generate(**inputs, max_new_tokens=max_new_tokens)
        sampling_rate = self._model.config.audio_encoder.sampling_rate

        audio_data = audio_values[0, 0].cpu().float().numpy()

        metadata = AudioMetadata(
            sample_rate=sampling_rate,
            channels=1,
            sample_width=2,
            duration_seconds=len(audio_data) / sampling_rate,
            frame_count=len(audio_data),
        )
        return Audio(audio_data, metadata)

    def unload(self) -> None:
        """Release the MusicGen model so the next generate_audio() re-initializes."""
        self._model = None
        self._processor = None
        release_device_memory(self.device)

```
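
The class above loads the model on the first `generate_audio()` call and drops it again in `unload()`. A minimal sketch of that load-on-first-use pattern follows; `LazyModel` and its `loader` callable are hypothetical stand-ins and do not touch MusicGen or any videopython class.

```python
from typing import Any, Callable


class LazyModel:
    """Load an expensive resource on first use and allow explicit release.

    Hypothetical class for illustration; `loader` stands in for model loading.
    """

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._model: Any = None
        self.load_count = 0

    def _ensure_loaded(self) -> Any:
        if self._model is None:  # first call, or first call after unload()
            self._model = self._loader()
            self.load_count += 1
        return self._model

    def predict(self, text: str) -> str:
        model = self._ensure_loaded()
        return model(text)

    def unload(self) -> None:
        self._model = None  # the next predict() triggers a reload


# str.upper plays the role of the "model" here.
lazy = LazyModel(loader=lambda: str.upper)
first = lazy.predict("calm piano")
second = lazy.predict("fast drums")
loads_before_unload = lazy.load_count
lazy.unload()
third = lazy.predict("strings")
```

Two calls share one load; after `unload()` the next call pays the loading cost again, which is the trade-off for freeing memory between uses.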

--------------------------------

### Iterate Through Video Frames

Source: https://videopython.com/api/core/video

Iterating over the object calls `__iter__`, which returns a generator of `(frame_index, frame)` tuples. Frame indices are absolute positions in the original video, so they account for any `start_second` offset.

```python
def __iter__(self) -> Generator[tuple[int, np.ndarray], None, None]:
    """Yield (frame_index, frame) tuples.

    Frame indices are absolute indices in the original video,
    accounting for any start_second offset.
    """
    self._iter = self._iter_frames()
    return self._iter
```
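
To make "absolute indices" concrete, here is a small sketch with plain Python values. It assumes the first index is `round(start_second * fps)`; the source does not show how the offset is computed, and `iter_with_absolute_index` is a hypothetical helper, not part of videopython.

```python
from typing import Iterator, Sequence


def iter_with_absolute_index(
    frames: Sequence[str], fps: float, start_second: float = 0.0
) -> Iterator[tuple[int, str]]:
    """Yield (absolute_index, frame) pairs for a clip that starts at start_second.

    Assumption for this sketch: the first frame's index is round(start_second * fps).
    """
    first_index = round(start_second * fps)
    for offset, frame in enumerate(frames):
        yield first_index + offset, frame


clip = ["frame_a", "frame_b", "frame_c"]  # stand-ins for image arrays
pairs = list(iter_with_absolute_index(clip, fps=30, start_second=2.0))
```

With a 2-second offset at 30 fps, the indices start at 60 rather than 0, so they can be matched against positions in the full source video.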

--------------------------------

### Create and Execute a Video Editing Plan

Source: https://videopython.com/api/editing

Demonstrates how to define a video editing plan using a dictionary, convert it to a VideoEdit object, perform a dry-run validation, and then execute the plan to save the output file. This is the primary method for creating and running edits.

```python
from videopython.editing import VideoEdit

plan = {
    "segments": [
        {
            "source": "input.mp4",
            "start": 5.0,
            "end": 12.0,
            "operations": [
                {"op": "crop", "width": 0.5, "height": 1.0, "mode": "center"},
                {"op": "resize", "width": 1080, "height": 1920},
                {
                    "op": "blur_effect",
                    "mode": "constant",
                    "iterations": 1,
                    "window": {"start": 0.0, "stop": 1.0},
                },
            ],
        },
        {"source": "input.mp4", "start": 20.0, "end": 28.0},
    ],
    "post_operations": [
        {"op": "color_adjust", "brightness": 0.05},
    ],
}

edit = VideoEdit.from_dict(plan)
predicted = edit.validate()        # dry-run via VideoMetadata
edit.run_to_file("output.mp4", crf=20, preset="medium")  # streams to disk (constant memory, any video length)

```

--------------------------------

### Calculate Audio Levels for a Segment

Source: https://videopython.com/api/core/audio

Calculates audio levels (RMS, peak, dB) for a specified time segment. Use this to analyze the loudness of a particular part of the audio. The example shows how to get levels for the entire audio file.

```python
def get_levels(
    self,
    start_seconds: float = 0.0,
    end_seconds: float | None = None,
) -> "AudioLevels":
    """Calculate audio levels for a segment.

    Args:
        start_seconds: Start time in seconds (default: 0.0)
        end_seconds: End time in seconds (default: None, meaning end of audio)

    Returns:
        AudioLevels with RMS, peak, and dB measurements

    Example:
        >>> audio = Audio.from_path("audio.mp3")
        >>> levels = audio.get_levels()
        >>> print(f"Peak: {levels.db_peak:.1f} dB")
    """
    from videopython.audio.analysis import AudioLevels

    segment = self.slice(start_seconds, end_seconds)
    data = segment.data.flatten() if segment.metadata.channels == 2 else segment.data

    rms = float(np.sqrt(np.mean(data**2)))
    peak = float(np.max(np.abs(data)))

    # Convert to dB (avoid log of zero)
    db_rms = 20 * np.log10(max(rms, 1e-10))
    db_peak = 20 * np.log10(max(peak, 1e-10))

    return AudioLevels(rms=rms, peak=peak, db_rms=float(db_rms), db_peak=float(db_peak))
```
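
The level arithmetic above can be checked by hand on a few samples. This sketch repeats the same RMS, peak, and `20 * log10(...)` steps with the standard library only; `levels` is a hypothetical function, and it assumes samples are floats normalized to the range -1.0 to 1.0.

```python
import math


def levels(samples: list[float]) -> dict[str, float]:
    """Return RMS, peak, and their dB values for samples in [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    floor = 1e-10  # same guard as above: avoids log10(0) for silence
    return {
        "rms": rms,
        "peak": peak,
        "db_rms": 20 * math.log10(max(rms, floor)),
        "db_peak": 20 * math.log10(max(peak, floor)),
    }


half_scale = levels([0.5, -0.5, 0.5, -0.5])  # constant magnitude 0.5
silence = levels([0.0, 0.0])                 # hits the 1e-10 floor
```

A half-scale signal comes out at roughly -6 dB, and pure silence is reported as -200 dB because of the `1e-10` floor rather than as negative infinity.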

--------------------------------

### Video.__init__

Source: https://videopython.com/api/core/video

Initializes a Video object with frames, frames per second, and optional audio.

```APIDOC
## Video.__init__

### Description
Initializes a Video object with the provided frames, frames per second (fps), and an optional Audio object. If no audio is provided, a silent audio track is created.

### Method
__init__

### Parameters
* **frames** (ndarray) - The video frames.
* **fps** (int | float) - The frames per second.
* **audio** (Audio | None) - Optional audio object.
```

--------------------------------

### Implement KenBurns Effect

Source: https://videopython.com/api/effects

Use the KenBurns effect to create a cinematic pan-and-zoom animation between two crop regions. It's useful for adding motion to still images or guiding the viewer's eye. Ensure start and end regions are within bounds and have valid dimensions.

```python
class KenBurns(Effect):
    """Cinematic pan-and-zoom that smoothly animates between two crop regions.

    Creates movement by transitioning from a start region to an end region over
    the clip. Use it to add motion to still images or to guide the viewer's eye
    across a scene.
    """

    op: Literal["ken_burns"] = "ken_burns"
    streamable: ClassVar[bool] = True

    start_region: BoundingBox = Field(
        description="Starting crop region as a BoundingBox with normalized 0-1 coordinates."
    )
    end_region: BoundingBox = Field(description="Ending crop region as a BoundingBox with normalized 0-1 coordinates.")
    easing: Literal["linear", "ease_in", "ease_out", "ease_in_out"] = Field(
        "linear",
        description=(
            'Animation curve. "linear" moves at constant speed, "ease_in" starts slow, '
            '"ease_out" ends slow, "ease_in_out" starts and ends slow.'
        ),
    )

    _stream_regions: np.ndarray | None = PrivateAttr(default=None)
    _stream_target_w: int = PrivateAttr(default=0)
    _stream_target_h: int = PrivateAttr(default=0)

    @model_validator(mode="after")
    def _validate_regions(self) -> KenBurns:
        for name, region in [("start_region", self.start_region), ("end_region", self.end_region)]:
            if not (0 <= region.x <= 1 and 0 <= region.y <= 1):
                raise ValueError(f"{name} position must be in range [0, 1]!")
            if not (0 < region.width <= 1 and 0 < region.height <= 1):
                raise ValueError(f"{name} dimensions must be in range (0, 1]!")
            if region.x + region.width > 1 or region.y + region.height > 1:
                raise ValueError(f"{name} extends beyond image bounds!")
        return self

    def _crop_and_scale_frame(
        self,
        frame: np.ndarray,
        x: int,
        y: int,
        crop_w: int,
        crop_h: int,
        target_w: int,
        target_h: int,
    ) -> np.ndarray:
        cropped = frame[y : y + crop_h, x : x + crop_w]
        return cv2.resize(cropped, (target_w, target_h), interpolation=cv2.INTER_LINEAR)

    def _precompute_regions(self, n_frames: int, width: int, height: int) -> np.ndarray:
        sx = int(self.start_region.x * width)
        sy = int(self.start_region.y * height)
        sw = int(self.start_region.width * width)
        sh = int(self.start_region.height * height)
        ex = int(self.end_region.x * width)
        ey = int(self.end_region.y * height)
        ew = int(self.end_region.width * width)
        eh = int(self.end_region.height * height)

        regions = np.empty((n_frames, 4), dtype=np.int32)
        eased = ease(np.arange(n_frames, dtype=np.float64) / max(1, n_frames - 1), self.easing)
        for i in range(n_frames):
            et = float(eased[i])
            crop_w = int(sw + (ew - sw) * et)
            crop_h = int(sh + (eh - sh) * et)
            x = max(0, min(int(sx + (ex - sx) * et), width - crop_w))
            y = max(0, min(int(sy + (ey - sy) * et), height - crop_h))
            regions[i] = (x, y, crop_w, crop_h)
        return regions

    def streaming_init(self, total_frames: int, fps: float, width: int, height: int, **_context: Any) -> None:
        self._stream_regions = self._precompute_regions(total_frames, width, height)
        self._stream_target_w = width
        self._stream_target_h = height

    def process_frame(self, frame: np.ndarray, frame_index: int) -> np.ndarray:
        assert self._stream_regions is not None
        idx = min(frame_index, len(self._stream_regions) - 1)
        x, y, cw, ch = self._stream_regions[idx]
        return self._crop_and_scale_frame(frame, x, y, cw, ch, self._stream_target_w, self._stream_target_h)
```
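
The per-frame crop computed in `_precompute_regions` can be traced with plain integers. The sketch below reproduces that arithmetic for linear easing only; `Region`, `to_pixels`, and `crop_at` are hypothetical names, and regions are written as `(x, y, width, height)` tuples instead of `BoundingBox` objects.

```python
Region = tuple[float, float, float, float]  # normalized (x, y, width, height)


def to_pixels(region: Region, width: int, height: int) -> tuple[int, int, int, int]:
    x, y, w, h = region
    return int(x * width), int(y * height), int(w * width), int(h * height)


def crop_at(start: Region, end: Region, t: float, width: int, height: int) -> tuple[int, int, int, int]:
    """Pixel crop (x, y, w, h) at progress t in [0, 1], using linear easing."""
    sx, sy, sw, sh = to_pixels(start, width, height)
    ex, ey, ew, eh = to_pixels(end, width, height)
    crop_w = int(sw + (ew - sw) * t)
    crop_h = int(sh + (eh - sh) * t)
    # Clamp so the crop never leaves the frame, as in _precompute_regions.
    x = max(0, min(int(sx + (ex - sx) * t), width - crop_w))
    y = max(0, min(int(sy + (ey - sy) * t), height - crop_h))
    return x, y, crop_w, crop_h


full_frame: Region = (0.0, 0.0, 1.0, 1.0)
center_half: Region = (0.25, 0.25, 0.5, 0.5)

first = crop_at(full_frame, center_half, 0.0, 1920, 1080)
middle = crop_at(full_frame, center_half, 0.5, 1920, 1080)
last = crop_at(full_frame, center_half, 1.0, 1920, 1080)
```

Moving from the full frame to its centered half on a 1920x1080 frame passes through a 1440x810 crop at the midpoint; each crop is then scaled back to the full frame size, which produces the zoom.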

--------------------------------

### Initialize and Use TextToVideo for Video Generation

Source: https://videopython.com/api/ai/generation

Initializes the TextToVideo pipeline and generates a video from a text prompt. The pipeline is automatically initialized on the first call to generate_video if not already loaded. Ensure necessary libraries like 'diffusers' are installed.

```python
class TextToVideo(ManagedPredictor):
    """Generates videos from text descriptions using local diffusion models."""

    def __init__(self, device: str | None = None):
        self.device = device
        self._pipeline: Any = None

    def _init_local(self) -> None:
        from videopython.ai._optional import require

        CogVideoXPipeline = require("diffusers", "generation", feature="TextToVideo").CogVideoXPipeline

        requested_device = self.device
        device, dtype = _get_torch_device_and_dtype(self.device)

        model_name = "THUDM/CogVideoX1.5-5B"
        self._pipeline = CogVideoXPipeline.from_pretrained(model_name, revision=pinned(model_name), torch_dtype=dtype)
        self._pipeline.to(device)
        self.device = device
        log_device_initialization(
            "TextToVideo",
            requested_device=requested_device,
            resolved_device=device,
        )

    def generate_video(
        self,
        prompt: str,
        num_steps: int = 50,
        num_frames: int = 81,
        guidance_scale: float = 6.0,
    ) -> Video:
        """Generate video from text prompt."""
        import torch

        if self._pipeline is None:
            self._init_local()

        video_frames = self._pipeline(
            prompt=prompt,
            num_inference_steps=num_steps,
            num_frames=num_frames,
            guidance_scale=guidance_scale,
            generator=torch.Generator(device=self.device).manual_seed(42),
        ).frames[0]
        video_frames = np.asarray(video_frames, dtype=np.uint8)
        return Video.from_frames(video_frames, fps=16.0)

    def unload(self) -> None:
        """Release the diffusion pipeline so the next generate_video() re-initializes."""
        self._pipeline = None
        release_device_memory(self.device)

```
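
Because `generate_video` returns frames at a fixed `fps=16.0`, the clip length follows directly from `num_frames`. A small arithmetic sketch, using a hypothetical `clip_seconds` helper and making no claim about which frame counts the model accepts:

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Duration in seconds of a clip with num_frames frames played at fps."""
    return num_frames / fps


# Defaults shown above: 81 frames at 16 fps.
default_length = clip_seconds(81, 16.0)

# Doubling the frame count doubles the duration at the same fps.
longer_length = clip_seconds(162, 16.0)
```

With the defaults, a generated clip lasts just over five seconds.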

--------------------------------

### __init__

Source: https://videopython.com/api/ai/understanding

Initializes the semantic scene detector with configurable parameters for threshold, minimum scene length, and device selection.

```APIDOC
## __init__

### Description
Initializes the semantic scene detector with configurable parameters for threshold, minimum scene length, and device selection.

### Method
__init__

### Parameters
- **threshold** (float) - Optional - Confidence threshold for scene boundaries (0.0-1.0). Higher values = fewer, more confident boundaries. Default: 0.5
- **min_scene_length** (float) - Optional - Minimum scene duration in seconds. Default: 0.5
- **device** (str | None) - Optional - Device to run on ('cuda', 'mps', 'cpu', or None for auto). Note: MPS may have numerical inconsistencies; use 'cpu' for reproducible results. Default: None

```

--------------------------------

### Install FFmpeg on macOS

Source: https://videopython.com/getting-started/installation

Installs FFmpeg using Homebrew on macOS. FFmpeg is a required prerequisite for VideoPython.

```bash
# macOS
brew install ffmpeg
```

--------------------------------

### Initialize VideoDubber with Flat Kwargs or DubbingConfig

Source: https://videopython.com/api/ai/dubbing

Demonstrates two ways to initialize VideoDubber: using flat keyword arguments for ad-hoc calls or by creating an explicit DubbingConfig object for reusable presets. The flat kwargs approach is recommended for quick, one-off operations.

```python
from videopython.ai.dubbing import DubbingConfig, VideoDubber

# Flat kwargs (recommended for ad-hoc calls)
dubber = VideoDubber(device="cuda", low_memory=True, whisper_model="large")

# Explicit config (recommended for reusable presets)
config = DubbingConfig(
    device="cuda",
    low_memory=True,
    whisper_model="large",
    translator="qwen3",
    vocabulary=["Klarna", "Allegro"],
)
dubber = VideoDubber(config=config)
```

--------------------------------

### Use VideoEdit with Dictionary Configuration

Source: https://videopython.com/getting-started/quickstart

Demonstrates creating a VideoEdit plan using a dictionary, which mirrors the JSON wire format. This includes validation before running the edit and saving the output.

```python
from videopython.editing import VideoEdit

edit = VideoEdit.from_dict({
    "segments": [{
        "source": "input.mp4",
        "start": 0,
        "end": 10,
        "operations": [
            {"op": "resize", "width": 1280, "height": 720},
            {"op": "resample_fps", "fps": 30},
        ],
    }]
})
print(edit.validate())   # predicted VideoMetadata, no frames loaded
edit.run_to_file("output.mp4")

```
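
Since the dictionary mirrors the JSON wire format, a plan can be stored or transmitted as JSON text and restored before it is handed to `VideoEdit.from_dict`. The sketch below uses only the standard `json` module and does not call videopython; the plan contents are copied from the example above.

```python
import json

plan = {
    "segments": [
        {
            "source": "input.mp4",
            "start": 0,
            "end": 10,
            "operations": [
                {"op": "resize", "width": 1280, "height": 720},
                {"op": "resample_fps", "fps": 30},
            ],
        }
    ]
}

wire = json.dumps(plan, indent=2)   # text you would store or send
restored = json.loads(wire)         # dict you would pass to VideoEdit.from_dict
op_names = [op["op"] for op in restored["segments"][0]["operations"]]
```

The round trip is lossless because the plan only contains strings, numbers, lists, and nested dictionaries.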

--------------------------------

### Initialize SceneVLM and AudioClassifier

Source: https://videopython.com/api/ai/video_analysis

Initializes SceneVLM and AudioClassifier based on configuration. Includes error handling for initialization failures, logging warnings if components cannot be loaded.

```python
scene_vlm: SceneVLM | None
try:
    scene_vlm = SceneVLM(**self.config.get_params(SCENE_VLM)) if SCENE_VLM in enabled else None
except (ImportError, OSError, RuntimeError, ValueError):
    logger.warning("Failed to initialize SceneVLM, skipping visual understanding", exc_info=True)
    scene_vlm = None

try:
    audio_classifier = (
        AudioClassifier(**self.config.get_params(AUDIO_CLASSIFIER)) if AUDIO_CLASSIFIER in enabled else None
    )
except (ImportError, OSError, RuntimeError, ValueError):
    logger.warning("Failed to initialize AudioClassifier, skipping audio classification", exc_info=True)
    audio_classifier = None
```
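
The two `try`/`except` blocks above follow the same shape: build a component if it is enabled, and fall back to `None` with a warning if construction fails. A minimal sketch of that shape as a reusable helper follows; `init_optional` and the factories are hypothetical and are not part of videopython.

```python
import logging
from typing import Callable, Optional, TypeVar

logger = logging.getLogger("demo")
T = TypeVar("T")


def init_optional(name: str, enabled: bool, factory: Callable[[], T]) -> Optional[T]:
    """Build a component if enabled; on a known failure, log and return None."""
    if not enabled:
        return None
    try:
        return factory()
    except (ImportError, OSError, RuntimeError, ValueError):
        logger.warning("Failed to initialize %s, skipping", name, exc_info=True)
        return None


def broken_factory() -> str:
    raise RuntimeError("model weights missing")


working = init_optional("working", True, lambda: "ready")
disabled = init_optional("disabled", False, lambda: "ready")
failed = init_optional("broken", True, broken_factory)
```

Catching a fixed tuple of exception types keeps unexpected errors visible while letting the rest of the analysis continue without the failed component.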

--------------------------------

### Load Videos from File, Segment, or Image

Source: https://videopython.com/getting-started/quickstart

Demonstrates how to load video files, specific segments of videos, or create videos from static images using the Video class. Also shows how to access basic video metadata.

```python
import numpy as np

from videopython.base import Video

# Load from file
video = Video.from_path("input.mp4")

# Load a specific segment (more efficient for long videos)
video = Video.from_path("input.mp4", start_second=10, end_second=20)

# Create from a static image
image = np.zeros((1080, 1920, 3), dtype=np.uint8)  # Black frame
video = Video.from_image(image, fps=24, length_seconds=3.0)

# Check video properties (here: the 3-second, 24 fps clip created from the image)
print(video.metadata)  # resolution, fps, and duration
print(video.total_seconds)
print(video.frame_shape)  # (height, width, channels)

```

--------------------------------

### Get Per-Operation JSON Schema

Source: https://videopython.com/api/operations

Retrieve the JSON schema for a specific operation class. Use `llm_json_schema()` to get a schema suitable for LLM interactions, excluding server-only fields.

```python
from videopython.editing import Operation

cls = Operation.get("blur_effect")
schema = cls.model_json_schema()         # full (all fields)
llm_schema = cls.llm_json_schema()       # LLM-facing (llm_hidden dropped)

```

--------------------------------

### Initialize VideoDubber

Source: https://videopython.com/api/ai/dubbing

Initialize the VideoDubber with a configuration object or keyword arguments. If both are provided, a TypeError is raised. The configuration is logged upon initialization.

```python
class VideoDubber:
    """Dubs videos into different languages using the local pipeline.

    Accepts either a :class:`DubbingConfig` or the same knobs as flat kwargs
    (``device``, ``low_memory``, ``whisper_model``, ``translator``, etc.) --
    the flat path builds a ``DubbingConfig`` internally. See
    :class:`DubbingConfig` for the full knob list and defaults.
    """

    def __init__(
        self,
        config: DubbingConfig | None = None,
        *,
        tts_backend: SpeechBackend | None = None,
        **kwargs: Any,
    ):
        if config is not None and kwargs:
            raise TypeError("Pass either `config=` or knob kwargs, not both")
        self.config = config or DubbingConfig(**kwargs)
        # Optional injected speech backend. None -> the pipeline lazily builds
        # the local chatterbox-backed TextToSpeech (requires the [tts] extra).
        # Inject a SpeechBackend to dub with only [dub] installed.
        self._tts_backend = tts_backend
        self._local_pipeline: Any = None
        logger.info(
            "VideoDubber initialized with %s",
            " ".join(f"{k}={v}" for k, v in self.config.init_log_fields().items()),
        )
```
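
The "either `config=` or flat kwargs, not both" rule can be tried out in isolation. This sketch uses hypothetical `DemoConfig` and `DemoService` classes with two illustrative fields; it mirrors the constructor logic above but is not the real `DubbingConfig` or `VideoDubber`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in for a config object; fields are illustrative."""

    device: str = "cpu"
    low_memory: bool = False


class DemoService:
    def __init__(self, config: Optional[DemoConfig] = None, **kwargs: Any):
        if config is not None and kwargs:
            raise TypeError("Pass either `config=` or knob kwargs, not both")
        # Unknown kwargs fail here, because the dataclass rejects them.
        self.config = config or DemoConfig(**kwargs)


flat = DemoService(device="cuda", low_memory=True)
preset = DemoService(config=DemoConfig(device="cuda", low_memory=True))

try:
    DemoService(config=DemoConfig(), device="cuda")
    mixed_error = None
except TypeError as err:
    mixed_error = str(err)
```

Both construction paths end up with an equal config object, so the rest of the class only ever has to read `self.config`.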

--------------------------------

### Apply Zoom Effect

Source: https://videopython.com/api/effects

Progressively zooms into or out of the frame center. The zoom factor must be greater than 1. 'in' mode starts wide and zooms in, 'out' mode starts tight and zooms out.

```python
class Zoom(Effect):
    """Progressively zooms into or out of the frame center over the clip duration."""

    op: Literal["zoom_effect"] = "zoom_effect"
    streamable: ClassVar[bool] = True

    zoom_factor: float = Field(
        gt=1,
        description="How far to zoom. 1.5 is a subtle push, 2.0 is moderate, 3.0+ is dramatic. Must be greater than 1.",
    )
    mode: Literal["in", "out"] = Field(
        description='"in" starts wide and pushes into the center, "out" starts tight and pulls back.',
    )

    _stream_crops: np.ndarray | None = PrivateAttr(default=None)
    _stream_width: int = PrivateAttr(default=0)
    _stream_height: int = PrivateAttr(default=0)

    def _crop_sizes(self, n_frames: int, width: int, height: int) -> np.ndarray:
        crop_w = np.linspace(width // self.zoom_factor, width, n_frames)
        crop_h = np.linspace(height // self.zoom_factor, height, n_frames)
        if self.mode == "in":
            crop_w, crop_h = crop_w[::-1], crop_h[::-1]
        return np.stack([crop_w, crop_h], axis=1)

    def streaming_init(self, total_frames: int, fps: float, width: int, height: int, **_context: Any) -> None:
        self._stream_crops = self._crop_sizes(total_frames, width, height)
        self._stream_width = width
        self._stream_height = height

    def process_frame(self, frame: np.ndarray, frame_index: int) -> np.ndarray:
        assert self._stream_crops is not None
        idx = min(frame_index, len(self._stream_crops) - 1)
        w, h = self._stream_crops[idx]
        width, height = self._stream_width, self._stream_height
        x = width / 2 - w / 2
        y = height / 2 - h / 2
        cropped = frame[round(y) : round(y + h), round(x) : round(x + w)]
        return cv2.resize(cropped, (width, height))
```
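
The crop schedule in `_crop_sizes` is easier to follow with concrete numbers. The sketch below reproduces it without NumPy, replacing `np.linspace` with an explicit loop; `zoom_crop_sizes` is a hypothetical function name.

```python
def zoom_crop_sizes(
    n_frames: int, width: int, height: int, zoom_factor: float, mode: str
) -> list[tuple[float, float]]:
    """Per-frame (crop_w, crop_h), evenly spaced between the zoomed and full size."""
    small_w, small_h = width // zoom_factor, height // zoom_factor
    sizes = []
    for i in range(n_frames):
        t = i / max(1, n_frames - 1)
        sizes.append((small_w + (width - small_w) * t, small_h + (height - small_h) * t))
    # "in" ends on the tight crop, so reverse the small-to-large sequence.
    return sizes[::-1] if mode == "in" else sizes


zoom_out = zoom_crop_sizes(3, 1920, 1080, zoom_factor=2.0, mode="out")
zoom_in = zoom_crop_sizes(3, 1920, 1080, zoom_factor=2.0, mode="in")
```

A smaller crop scaled back to the full frame looks more zoomed in, which is why "out" starts at the small crop and "in" ends on it.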

--------------------------------

### Normalize Scene Boundaries

Source: https://videopython.com/api/ai/video_analysis

Normalizes scene boundaries based on video metadata, ensuring start and end times/frames are within valid ranges and ordered correctly. Handles edge cases where end times are less than or equal to start times.

```python
def _normalize_scene_boundaries(self, scenes: list[SceneBoundary], metadata: VideoMetadata) -> list[SceneBoundary]:
    normalized: list[SceneBoundary] = []
    max_time = float(metadata.total_seconds)
    max_frame = int(metadata.frame_count)

    for item in scenes:
        start = max(0.0, min(max_time, float(item.start)))
        end = max(0.0, min(max_time, float(item.end)))
        if end <= start:
            continue

        start_frame = int(item.start_frame)
        end_frame = int(item.end_frame)
        start_frame = max(0, min(max_frame, start_frame))
        end_frame = max(0, min(max_frame, end_frame))
        if end_frame <= start_frame:
            start_frame = int(round(start * metadata.fps))
            end_frame = max(start_frame + 1, int(round(end * metadata.fps)))
            start_frame = max(0, min(max_frame, start_frame))
            end_frame = max(0, min(max_frame, end_frame))
            if end_frame <= start_frame:
                continue

        normalized.append(
            SceneBoundary(
                start=round(start, 6),
                end=round(end, 6),
                start_frame=start_frame,
                end_frame=end_frame,
            )
        )

    normalized.sort(key=lambda scene: (scene.start, scene.end))
    return normalized
```
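
The time-based half of this normalization can be exercised on plain tuples. The sketch below keeps only the clamp, drop, and sort steps and leaves out the frame-index handling; `clamp_ranges` is a hypothetical function, not part of videopython.

```python
def clamp_ranges(ranges: list[tuple[float, float]], total_seconds: float) -> list[tuple[float, float]]:
    """Clamp (start, end) pairs into [0, total_seconds], drop empty ones, and sort."""
    kept = []
    for start, end in ranges:
        start = max(0.0, min(total_seconds, start))
        end = max(0.0, min(total_seconds, end))
        if end <= start:
            continue  # nothing left after clamping
        kept.append((round(start, 6), round(end, 6)))
    return sorted(kept)


cleaned = clamp_ranges(
    [(8.0, 12.5), (-1.0, 3.0), (11.0, 15.0), (4.0, 4.0)],
    total_seconds=10.0,
)
```

For a 10-second video, the range that overruns the end is trimmed, the one that starts before zero is moved to zero, and the two ranges that become empty are discarded.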

--------------------------------

### __init__

Source: https://videopython.com/api/ai/understanding

Initializes the face tracker. Users can configure how faces are selected, smoothing factors, detection intervals, minimum face size, and the backend (CPU/GPU).

```APIDOC
## __init__

### Description
Initializes the face tracker.

### Method
__init__

### Parameters
- **selection_strategy** (Literal['largest', 'centered', 'index']) - Optional - How to select which face to track. Options include 'largest' (default), 'centered', or 'index'.
- **face_index** (int) - Optional - Index of face to track when using the 'index' strategy. Defaults to 0.
- **smoothing** (float) - Optional - Exponential moving average factor (0-1). Higher values result in smoother tracking. Defaults to 0.8.
- **detection_interval** (int) - Optional - Run detection every N frames and interpolate between detections. Defaults to 3.
- **min_face_size** (int) - Optional - Minimum face size in pixels required for detection. Defaults to 30.
- **backend** (Literal['cpu', 'gpu', 'auto']) - Optional - Specifies the detection backend. Can be 'cpu', 'gpu', or 'auto' (default).
- **sample_rate** (int) - Optional - For GPU backend, detect every Nth frame and interpolate. Only used by track_video(). Defaults to 1.
- **batch_size** (int) - Optional - Batch size for GPU detection. Defaults to 16.
- **iou_match_threshold** (float) - Optional - Minimum IoU between consecutive detections to continue an existing per-shot track. Used by `track_shot`. Defaults to DEFAULT_IOU_MATCH_THRESHOLD.
- **max_missed_frames** (int) - Optional - Maximum number of consecutive frames a track can go without detection before being closed. Defaults to DEFAULT_MAX_MISSED_FRAMES.

```

--------------------------------

### slice

Source: https://videopython.com/api/core/audio

Extracts a portion of the audio between specified start and end times.

```APIDOC
## slice

### Description
Extracts a portion of the audio between specified start and end times.

### Method
This is a method of the Audio class.

### Parameters
- **start_seconds** (float): Start time in seconds (default: 0.0).
- **end_seconds** (float | None): End time in seconds (default: None, meaning end of audio).

### Returns
- **Audio**: New Audio instance with the extracted portion.

### Raises
- **ValueError**: If start_seconds or end_seconds are invalid.
```
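
Slicing by seconds comes down to converting times into sample indices. The sketch below shows that conversion on a plain list; `slice_samples` is a hypothetical function, and the rule it uses for an invalid range (`0 <= start <= end <= duration`) is an assumption, since the source only says that invalid values raise `ValueError`.

```python
from typing import Optional


def slice_samples(
    samples: list[float],
    sample_rate: int,
    start_seconds: float = 0.0,
    end_seconds: Optional[float] = None,
) -> list[float]:
    """Return the samples between start_seconds and end_seconds (end exclusive)."""
    duration = len(samples) / sample_rate
    end = duration if end_seconds is None else end_seconds
    # Assumed validity rule for this sketch: 0 <= start <= end <= duration.
    if not 0 <= start_seconds <= end <= duration:
        raise ValueError(f"invalid range {start_seconds}..{end} for {duration}s of audio")
    return samples[int(start_seconds * sample_rate) : int(end * sample_rate)]


track = [float(i) for i in range(10)]  # 5 seconds at 2 samples per second
middle = slice_samples(track, sample_rate=2, start_seconds=1.0, end_seconds=3.0)
tail = slice_samples(track, sample_rate=2, start_seconds=4.0)
```

Leaving `end_seconds` as `None` slices through to the end of the data, matching the default described above.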

--------------------------------

### Apply Basic Transformations with VideoEdit

Source: https://videopython.com/getting-started/quickstart

Shows how to use VideoEdit and SegmentConfig to apply transformations like resizing and frame rate resampling to a video segment. The edited video is then saved to a file.

```python
from videopython.editing import VideoEdit, SegmentConfig
from videopython.editing.transforms import Resize, ResampleFPS

edit = VideoEdit(segments=[
    SegmentConfig(
        source="input.mp4",
        start=0,    # keep only the first 10 seconds...
        end=10,     # ...via the segment range, not a cut operation
        operations=[
            Resize(width=1280, height=720),
            ResampleFPS(fps=30),
        ],
    )
])
edit.run_to_file("output.mp4")

```

--------------------------------

### slice

Source: https://videopython.com/api/core/audio

Extract a portion of the audio between specified start and end times in seconds.

```APIDOC
## slice

### Description
Extract a portion of the audio between start_seconds and end_seconds.

### Method
`slice(start_seconds: float = 0.0, end_seconds: float | None = None) -> Audio`

### Parameters
- **start_seconds** (`float`) - Optional - Start time in seconds (default: 0.0)
- **end_seconds** (`float | None`) - Optional - End time in seconds (default: None, meaning end of audio)

### Example
`audio_segment = audio_object.slice(start_seconds=10.5, end_seconds=25.0)`

### Returns
- **Audio** (`Audio`) - New Audio instance with the extracted portion. For the example above on a 44100 Hz source, the returned metadata would report a `duration_seconds` of 14.5 and a `frame_count` of 639450.

### Raises
- **ValueError**: If start_seconds or end_seconds are invalid
```
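
The slicing behaviour described above can be sketched on a plain list of samples. This is a minimal illustration, not videopython's implementation: `slice_samples` is a hypothetical helper, and the exact conditions that count as "invalid" are an assumption.

```python
def slice_samples(samples, sample_rate, start_seconds=0.0, end_seconds=None):
    """Return the samples between start_seconds and end_seconds.

    Hypothetical helper for illustration; not videopython's implementation.
    """
    duration = len(samples) / sample_rate
    if end_seconds is None:
        end_seconds = duration  # None means "until the end of the audio"
    # Assumed validity rules: range must lie inside the audio and not be reversed.
    if start_seconds < 0 or end_seconds > duration or start_seconds > end_seconds:
        raise ValueError(
            f"invalid range {start_seconds}..{end_seconds} for {duration:.2f}s of audio"
        )
    start = round(start_seconds * sample_rate)  # seconds -> sample index
    end = round(end_seconds * sample_rate)
    return samples[start:end]


# 30 seconds of silence at 44.1 kHz
samples = [0.0] * (30 * 44100)
clip = slice_samples(samples, 44100, start_seconds=10.5, end_seconds=25.0)
print(len(clip))  # 639450 samples, i.e. 14.5 s at 44.1 kHz
```

The returned list is a new object, mirroring the documented behaviour of returning a new `Audio` instance rather than modifying the original.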

--------------------------------

### CutFrames

Source: https://videopython.com/api/transforms

Cuts a video segment by specifying the start and end frame numbers.

```APIDOC
## CutFrames

### Description
Cuts a video segment by specifying the start and end frame numbers.

```

--------------------------------

### Initialize Video Object

Source: https://videopython.com/api/core/video

Constructs a Video object with frames, fps, and optional audio. If no audio is provided, a silent audio track is generated.

```python
import numpy as np

# Audio is videopython's audio class; its import is not shown in this excerpt.


class Video:
    def __init__(self, frames: np.ndarray, fps: int | float, audio: Audio | None = None):
        self.frames = frames
        self.fps = fps
        if audio:
            self.audio = audio
        else:
            self.audio = Audio.create_silent(
                duration_seconds=round(self.total_seconds, 2), stereo=True, sample_rate=44100
            )
```
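
To make the silent-track fallback concrete, the sketch below computes how many samples per channel are needed to cover a video. It assumes that `total_seconds` equals the frame count divided by `fps`; that property is not shown in the excerpt, and `silent_sample_count` is a hypothetical helper.

```python
def silent_sample_count(frame_count: int, fps: float, sample_rate: int = 44100) -> int:
    """Samples per channel needed to cover the video with silence."""
    total_seconds = frame_count / fps            # assumed definition of total_seconds
    duration_seconds = round(total_seconds, 2)   # same rounding as the constructor
    return round(duration_seconds * sample_rate)


print(silent_sample_count(frame_count=300, fps=30))  # 441000
```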

--------------------------------

### CutSeconds

Source: https://videopython.com/api/transforms

Cuts a video segment by specifying the start and end times in seconds.

```APIDOC
## CutSeconds

### Description
Cuts a video segment by specifying the start and end times in seconds.

```
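
A range in seconds and a range in frames describe the same cut once the frame rate is known. The sketch below shows one way to convert between them. `seconds_to_frame_range` is a hypothetical helper and its rounding policy is an assumption; how videopython maps seconds to frames is not stated here.

```python
def seconds_to_frame_range(start: float, end: float, fps: float) -> tuple[int, int]:
    """Convert a [start, end] range in seconds to frame indices."""
    if start < 0 or end < start:
        raise ValueError(f"invalid range: {start}..{end}")
    # Assumed policy: round to the nearest frame boundary.
    return round(start * fps), round(end * fps)


print(seconds_to_frame_range(1.5, 4.0, fps=30))  # (45, 120)
```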

--------------------------------

### Initialize Local SceneVLM Model

Source: https://videopython.com/api/ai/understanding

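Initializes the local Qwen3.5 model and its processor. `_init_local` resolves the target device, then saves and restores torch's default dtype around `from_pretrained`, because `torch_dtype="auto"` can change it and break concurrently loaded models (e.g. Whisper) that expect float32. Requesting `model_size="27b"` logs a warning when CUDA is unavailable or free VRAM is below a threshold. The excerpt uses module-level names (`SceneVLMModelSize`, `SCENE_VLM_MODEL_IDS`, `_LARGE_MODEL_VRAM_WARN_GB`) that are defined elsewhere in the source module.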

```python
import time
from typing import Any

import torch

from videopython.ai._optional import require
from videopython.core.logging import logger
from videopython.core.optional import pinned
from videopython.core.utils import select_device, log_device_initialization, release_device_memory


class SceneVLM:
    DEFAULT_MAX_IMAGE_PIXELS = 1024 * 1024

    def __init__(
        self,
        model_size: str,
        model_name: str | None = None,
        device: str | None = None,
        max_new_tokens: int = 128,
        temperature: float = 0.0,
        max_image_pixels: int | None = None,
    ) -> None:
        self.model_size: SceneVLMModelSize = model_size
        self.model_name = model_name or SCENE_VLM_MODEL_IDS[model_size]
        self.device = device
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature
        self.max_image_pixels = max_image_pixels if max_image_pixels is not None else self.DEFAULT_MAX_IMAGE_PIXELS
        self._processor: Any = None
        self._model: Any = None

        if model_size == "27b":
            self._warn_if_vram_under_large_model_floor()

    @staticmethod
    def _warn_if_vram_under_large_model_floor() -> None:
        """Loud WARNING when ``model_size='27b'`` is requested on a small card.

        Does not raise -- a knowledgeable user may run the 27B model with
        their own quantization layer or accept device off-loading. The
        warning makes the eventual OOM (deep inside ``from_pretrained``)
        easier to diagnose.
        """
        try:
            import torch

            if not torch.cuda.is_available():
                logger.warning(
                    "SceneVLM model_size='27b' requested but CUDA is not "
                    "available. 27B FP16 weights are ~54 GB; running on "
                    "CPU/MPS is likely to OOM."
                )
                return

            free_bytes, _total = torch.cuda.mem_get_info()
            free_gb = free_bytes / (1024**3)
            if free_gb < _LARGE_MODEL_VRAM_WARN_GB:
                logger.warning(
                    "SceneVLM model_size='27b' requested with %.1f GB free VRAM. "
                    "Qwen3.5-27B FP16 needs ~54 GB for weights alone -- expect "
                    "OOM during from_pretrained unless you wired up "
                    "quantization or device offloading.",
                    free_gb,
                )
        except ImportError:
            pass

    def _init_local(self) -> None:
        """Initialize local Qwen3.5 model."""
        import torch

        from videopython.ai._optional import require

        _transformers = require("transformers", "vision", feature="SceneVLM")
        AutoModelForImageTextToText = _transformers.AutoModelForImageTextToText
        AutoProcessor = _transformers.AutoProcessor

        t0 = time.perf_counter()
        requested_device = self.device
        resolved_device = select_device(self.device, mps_allowed=True)

        self._processor = AutoProcessor.from_pretrained(self.model_name, revision=pinned(self.model_name))
        # Save and restore default dtype -- transformers torch_dtype="auto" can
        # mutate torch.get_default_dtype(), which breaks concurrent models
        # (e.g. Whisper) that expect float32.
        saved_dtype = torch.get_default_dtype()
        try:
            self._model = AutoModelForImageTextToText.from_pretrained(
                self.model_name, torch_dtype="auto", revision=pinned(self.model_name)
            )
        finally:
            torch.set_default_dtype(saved_dtype)
        self._model.to(resolved_device)
        self._model.eval()
        self.device = resolved_device

        log_device_initialization(
            "SceneVLM",
            requested_device=requested_device,
            resolved_device=resolved_device,
        )
        logger.info(
            "SceneVLM(%s, model_size=%s) model weights loaded in %.2fs",
            self.model_name,
            self.model_size,
            time.perf_counter() - t0,
        )

```
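
The "~54 GB" figure in the warning text follows from 27 billion parameters at 2 bytes each in FP16. A minimal sketch of that arithmetic is below; both helpers are hypothetical, and the actual warning threshold (`_LARGE_MODEL_VRAM_WARN_GB`) is not shown in the excerpt, so it is not reproduced here.

```python
def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weights_fit(free_bytes: int, num_params: float, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in the given free memory."""
    return free_bytes >= num_params * bytes_per_param


print(weights_size_gb(27e9))             # 54.0
print(weights_fit(24 * 1024**3, 27e9))   # False: a 24 GiB card is too small
```

This counts weights only. Activations and caches need additional memory, which is why the warning says "for weights alone".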

--------------------------------

### Basic LLM Video Editing Workflow

Source: https://videopython.com/guides/llm-integration

Demonstrates the core workflow of generating a video edit plan with an LLM, validating it, and running it to a file. Ensure 'videopython.ai' is imported for AI operations.

```python
from videopython.editing import VideoEdit

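schema = VideoEdit.json_schema()
# call_your_llm is a placeholder for your own LLM client call; it should
# return a plan dict that conforms to the schema.
plan = call_your_llm(schema=schema,
                     prompt="Create a 15s highlight reel from input.mp4")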

edit = VideoEdit.from_dict(plan)
predicted = edit.validate()           # catches bad plans before any I/O
print(predicted)
edit.run_to_file("output.mp4")

```

--------------------------------

### Get Overlay Opacity

Source: https://videopython.com/api/effects

Returns the overlay's `opacity` attribute.

```python
# Body of the getter; the enclosing method signature is not shown in the source.
return self.opacity
```

--------------------------------

### Get Supported Languages for Translation

Source: https://videopython.com/api/ai/dubbing

Retrieves a dictionary of supported languages for text translation, which can be used for dubbing.

```python
@staticmethod
def get_supported_languages() -> dict[str, str]:
    from videopython.ai.generation.translation import TextTranslator

    return TextTranslator.get_supported_languages()
```

--------------------------------

### Initialize FaceTracker and Load Audio

Source: https://videopython.com/api/ai/video_analysis

Initializes the FaceTracker and attempts to load audio from a given path. Includes error handling for initialization and audio loading failures, logging warnings if issues occur.

```python
face_tracker: FaceTracker | None = None
if FACE_TRACKER in enabled:
    try:
        face_tracker = FaceTracker(**self.config.get_params(FACE_TRACKER))
    except (ImportError, OSError, RuntimeError, ValueError):
        logger.warning("Failed to initialize FaceTracker, skipping face tracks", exc_info=True)
        face_tracker = None

path_audio: Audio | None = None
if audio_classifier is not None and source_path is not None:
    try:
        path_audio = Audio.from_path(source_path)
    except (OSError, RuntimeError, ValueError):
        logger.warning(
            "Failed to load audio from path, audio classification will use clip fallback",
            exc_info=True,
        )
        path_audio = None
```
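
The excerpt above treats both the face tracker and the path-based audio as optional: a failure is logged and the value falls back to `None`. A minimal sketch of that pattern, using a hypothetical `init_optional` helper that is not part of videopython:

```python
import logging

logger = logging.getLogger(__name__)


def init_optional(factory, name):
    """Build an optional component, or return None if it cannot be created.

    Hypothetical helper for illustration; not part of videopython.
    """
    try:
        return factory()
    except (ImportError, OSError, RuntimeError, ValueError):
        # exc_info=True attaches the traceback to the log record.
        logger.warning("Failed to initialize %s, skipping", name, exc_info=True)
        return None


def broken_factory():
    raise RuntimeError("model weights missing")


tracker = init_optional(broken_factory, "FaceTracker")
print(tracker)  # None
```

Downstream code then only needs an `is not None` check before using the component.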

--------------------------------

### Initialize TextToSpeech

Source: https://videopython.com/api/ai/generation

Initializes the TextToSpeech class for generating audio from text. This is a basic setup for the TTS functionality.

```python
from videopython.ai import TextToSpeech

tts = TextToSpeech()
```

--------------------------------

### Create Audio from File Path (Deprecated)

Source: https://videopython.com/api/core/audio

Deprecated alias for `Audio.from_path()`. It emits a `DeprecationWarning` and then delegates to `from_path()`; use `Audio.from_path()` in new code.

```python
from_file(file_path: str | Path) -> Audio
```

```python
@classmethod
def from_file(cls, file_path: str | Path) -> Audio:
    """Deprecated: Use from_path() instead."""
    import warnings

    warnings.warn(
        "Audio.from_file() is deprecated, use Audio.from_path() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return cls.from_path(file_path)
```
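
The same deprecation pattern can be tried without videopython installed. The sketch below uses a hypothetical stand-in class, `Clip`, and shows how a caller or a test can observe the warning.

```python
import warnings


class Clip:
    """Hypothetical stand-in class; not part of videopython."""

    def __init__(self, path: str):
        self.path = path

    @classmethod
    def from_path(cls, path: str) -> "Clip":
        return cls(path)

    @classmethod
    def from_file(cls, path: str) -> "Clip":
        """Deprecated alias for from_path()."""
        warnings.warn(
            "Clip.from_file() is deprecated, use Clip.from_path() instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return cls.from_path(path)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    clip = Clip.from_file("voice.wav")

print(clip.path)                    # voice.wav
print(caught[0].category.__name__)  # DeprecationWarning
```

`stacklevel=2` makes the reported location the line that called `from_file()`, which is the line the user needs to change.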

--------------------------------

### Initialize and Generate Video from Image

Source: https://videopython.com/api/ai/generation

Initializes the local diffusion pipeline (`THUDM/CogVideoX1.5-5B-I2V`) on first use, then generates a video animation from a static image. The generator seed is fixed at 42 and the result is returned at 16 fps. Requires the 'diffusers' library.

```python
from typing import Any

import numpy as np
from PIL import Image

# Assuming Video and ManagedPredictor are defined elsewhere
# from videopython.core.video import Video
# from videopython.core.managed_predictor import ManagedPredictor
# from videopython.ai.generation.video import _get_torch_device_and_dtype, log_device_initialization, release_device_memory
# from videopython.core.pinned import pinned

class ImageToVideo(ManagedPredictor):
    """Generates videos from static images using local video diffusion."""

    def __init__(self, device: str | None = None):
        self.device = device
        self._pipeline: Any = None

    def _init_local(self) -> None:
        from videopython.ai._optional import require

        CogVideoXImageToVideoPipeline = require(
            "diffusers", "generation", feature="ImageToVideo"
        ).CogVideoXImageToVideoPipeline

        requested_device = self.device
        device, dtype = _get_torch_device_and_dtype(self.device)

        model_name = "THUDM/CogVideoX1.5-5B-I2V"
        self._pipeline = CogVideoXImageToVideoPipeline.from_pretrained(
            model_name, revision=pinned(model_name), torch_dtype=dtype
        )
        self._pipeline.to(device)
        self.device = device
        log_device_initialization(
            "ImageToVideo",
            requested_device=requested_device,
            resolved_device=device,
        )

    def generate_video(
        self,
        image: Image.Image,
        prompt: str = "",
        num_steps: int = 50,
        num_frames: int = 81,
        guidance_scale: float = 6.0,
    ) -> Video:
        """Generate video animation from a static image."""
        import torch

        if self._pipeline is None:
            self._init_local()

        video_frames = self._pipeline(
            prompt=prompt,
            image=image,
            num_inference_steps=num_steps,
            num_frames=num_frames,
            guidance_scale=guidance_scale,
            generator=torch.Generator(device=self.device).manual_seed(42),
        ).frames[0]
        video_frames = np.asarray(video_frames, dtype=np.uint8)
        return Video.from_frames(video_frames, fps=16.0)

    def unload(self) -> None:
        """Release the diffusion pipeline so the next generate_video() re-initializes."""
        self._pipeline = None
        release_device_memory(self.device)
```
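
The load-on-first-use and `unload()` behaviour above can be sketched in isolation. `LazyResource` is a hypothetical stand-in, not videopython's `ManagedPredictor`; the `loader` callable replaces the expensive `from_pretrained` call.

```python
class LazyResource:
    """Hypothetical stand-in for a predictor that loads a heavy model on demand."""

    def __init__(self, loader):
        self._loader = loader   # callable that builds the expensive object
        self._pipeline = None   # nothing is loaded at construction time
        self.load_count = 0

    def _init_local(self) -> None:
        self._pipeline = self._loader()
        self.load_count += 1

    def run(self, x):
        if self._pipeline is None:  # first call, or first call after unload()
            self._init_local()
        return self._pipeline(x)

    def unload(self) -> None:
        self._pipeline = None  # drop the reference so memory can be reclaimed


res = LazyResource(loader=lambda: str.upper)
res.run("a")
res.run("b")      # reuses the already loaded pipeline
res.unload()
res.run("c")      # triggers a second load
print(res.load_count)  # 2
```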

--------------------------------

### Initialize Local Whisper Model

Source: https://videopython.com/api/ai/understanding

Loads the specified Whisper model locally. Requires the `openai-whisper` package, which is imported as `whisper`.

```python
def _init_local(self) -> None:
    """Initialize local Whisper model."""
    from videopython.ai._optional import require

    whisper = require("whisper", "asr", feature="AudioToText")

    # No revision pin: openai-whisper downloads weights by name from OpenAI's
    # own CDN, not via a HF from_pretrained repo, so there is no HF commit
    # SHA to pin (see videopython.ai._revisions module docstring).
    self._model = whisper.load_model(name=self.model_name, device=self.device)
```

--------------------------------

### Initialize Audio Object

Source: https://videopython.com/api/core/audio

Initializes an Audio object with provided numpy array data and AudioMetadata. The data should be normalized between -1 and 1.

```python
def __init__(self, data: np.ndarray, metadata: AudioMetadata):
    """
    Initialize Audio object

    Args:
        data: Audio data as numpy array, normalized between -1 and 1
        metadata: AudioMetadata object containing audio properties
    """
    self.data = data
    self.metadata = metadata
```

--------------------------------

### Get Audio Sample Count

Source: https://videopython.com/api/core/audio

Returns the total number of audio samples. This method is part of the Audio class.

```python
def __len__(self) -> int:
    """Returns the number of samples"""
    return self.metadata.frame_count
```

--------------------------------

### AudioEvent Duration Property

Source: https://videopython.com/api/ai/understanding

Calculates the duration of an AudioEvent in seconds. This property is derived from the start and end times of the event.

```python
@property
def duration(self) -> float:
    """Duration of the audio event in seconds."""
    return self.end - self.start

```
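
For context, the property can be exercised on a small stand-in class. `Event` below is hypothetical and carries only the two fields the property reads; the real `AudioEvent` may have more.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in with only the fields the property needs."""

    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        """Duration of the event in seconds."""
        return self.end - self.start


print(Event(start=1.5, end=4.0).duration)  # 2.5
```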

--------------------------------

### FaceTracker Initialization

Source: https://videopython.com/api/ai/understanding

Initializes the FaceTracker with a specified backend and logs the initialization.

```python
self._detector: _FaceDetector | None = None
self._last_position: tuple[float, float] | None = None
self._last_size: tuple[float, float] | None = None
self._smoothed_position: tuple[float, float] | None = None
self._smoothed_size: tuple[float, float] | None = None
logger.info("FaceTracker initialized with backend=%s", self.backend)
```