### SyncNet Demo Output Example

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Example of the expected output values from the SyncNet demo script. Minor variations may occur based on system configuration.

```text
AV offset:      3 
Min dist:       5.353
Confidence:     10.021
```
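
The printed values can be read back programmatically. The following is a minimal sketch, not part of the repository: `parse_syncnet_output` is a hypothetical helper, and the conversion to milliseconds assumes the 25 fps frame rate used elsewhere in the pipeline.

```python
import re

def parse_syncnet_output(text, fps=25.0):
    """Parse the 'AV offset', 'Min dist' and 'Confidence' lines of the demo output.

    Hypothetical helper; fps=25.0 is an assumption taken from the pipeline default.
    """
    patterns = {
        "offset_frames": r"AV offset:\s*(-?\d+)",
        "min_dist": r"Min dist:\s*(-?\d+(?:\.\d+)?)",
        "confidence": r"Confidence:\s*(-?\d+(?:\.\d+)?)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"missing field: {key}")
        result[key] = float(match.group(1))
    result["offset_frames"] = int(result["offset_frames"])
    # One frame lasts 1000 / fps milliseconds under the fps assumption.
    result["offset_ms"] = result["offset_frames"] * 1000.0 / fps
    return result

sample = "AV offset:      3 \nMin dist:       5.353\nConfidence:     10.021\n"
parsed = parse_syncnet_output(sample)
print(parsed["offset_frames"], "frames =", parsed["offset_ms"], "ms")
```

Under that assumption, an offset of 3 frames corresponds to 120 ms.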

--------------------------------

### Install Dependencies and Download Model

Source: https://context7.com/joonson/syncnet_python/llms.txt

Installs SyncNet dependencies using Conda. Choose either the GPU (CUDA) or CPU-only environment. Downloads pretrained model weights and an example video.

```bash
# GPU (CUDA)
conda env create -f environment.yml
conda activate syncnet

# CPU only
conda env create -f environment-cpu.yml
conda activate syncnet

# Download pretrained model weights and example video
sh download_model.sh
# Downloads:
#   data/syncnet_v2.model      – SyncNet weights
#   data/example.avi           – example video
#   detectors/s3fd/weights/sfd_face.pth  – S3FD face detector weights
```
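
After the download it can be useful to confirm that the three files listed above are in place. This is a small sketch under the assumption that it runs from the repository root; `missing_files` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# Paths taken from the download_model.sh comments above.
EXPECTED_FILES = [
    "data/syncnet_v2.model",
    "data/example.avi",
    "detectors/s3fd/weights/sfd_face.pth",
]

def missing_files(repo_root="."):
    """Return the expected files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

missing = missing_files(".")
print("Missing files:", missing if missing else "none")
```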

--------------------------------

### Extract Lip Features with SyncNetInstance

Source: https://context7.com/joonson/syncnet_python/llms.txt

Call `SyncNetInstance.extract_feature` to compute lip CNN features from a video. Load the model weights with `loadParameters` first. The method returns a PyTorch tensor, which the example then saves with `torch.save`.

```python
import argparse, torch
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='data',
    save_as='data/features.pt',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

feats = s.extract_feature(opt, videofile='data/example.avi')
# feats: torch.Tensor of shape (N_windows, 512)
print('Feature shape:', feats.shape)   # e.g. torch.Size([120, 512])

torch.save(feats, opt.save_as)
# Reload later:
feats_loaded = torch.load('data/features.pt')
```

--------------------------------

### Run Quick AV Sync Demo with demo_syncnet.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Execute the full SyncNet evaluation pipeline on a single video file. Requires specifying model and video paths.

```bash
python demo_syncnet.py \
    --videofile data/example.avi \
    --tmp_dir   data/work/pytmp \
    --reference demo \
    --initial_model data/syncnet_v2.model \
    --batch_size 20 \
    --vshift 15
# Expected output:
#   INFO  Model data/syncnet_v2.model loaded.
#   INFO  AV offset:    3
#   INFO  Min dist:     5.353
#   INFO  Confidence:   10.021
```

--------------------------------

### Initialize SyncNetInstance and Evaluate AV Sync

Source: https://context7.com/joonson/syncnet_python/llms.txt

Initializes the SyncNetInstance wrapper, loads pretrained model weights, and evaluates the audio-visual synchronisation offset for a given video file. The `vshift` option defines the search window for the offset.

```python
from SyncNetInstance import SyncNetInstance
import argparse

# Build a minimal options namespace
opt = argparse.Namespace(
    batch_size=20,
    vshift=15,           # search window: ±15 frames around zero offset
    tmp_dir='data/work/pytmp',
    reference='my_video',
)

# Instantiate and load weights
s = SyncNetInstance()                        # device auto-detected (cuda / cpu)
s.loadParameters('data/syncnet_v2.model')   # loads into internal S module

# Evaluate AV sync on a cropped face-track AVI
offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/my_video/00000.avi')
# offset : numpy scalar – signed frame offset (positive = audio leads video)
# conf   : numpy scalar – median_dist - min_dist (higher = more confident)
# dists  : numpy array  – per-frame pairwise distances across the vshift window
print(f'AV offset: {int(offset):+d} frames')
print(f'Min dist:  {dists.mean(axis=0).min():.3f}')  # minimum over shifts of the frame-averaged distance
print(f'Confidence:{float(conf):.3f}')
# Expected output for the provided example:
#   AV offset: +3 frames
#   Min dist:  5.353
#   Confidence:10.021
```
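
The relation between `dists`, the offset, and the confidence can be sketched in plain Python. This is an illustration only, assuming `dists` has one row per frame with `2*vshift+1` entries and that the offset and confidence follow the same formula as the `calc_pdist` example in this document; `summarise_dists` is a hypothetical name.

```python
from statistics import median

def summarise_dists(dists, vshift):
    """Derive (offset, min_dist, conf) from per-frame distance rows.

    dists: list of rows, each holding 2*vshift+1 distances (assumed layout).
    """
    width = 2 * vshift + 1
    if any(len(row) != width for row in dists):
        raise ValueError("each row must hold 2*vshift+1 distances")
    # Average over frames to get one mean distance per shift position.
    mean_per_shift = [sum(row[j] for row in dists) / len(dists) for j in range(width)]
    min_dist = min(mean_per_shift)
    min_idx = mean_per_shift.index(min_dist)
    offset = vshift - min_idx                  # signed frame offset
    conf = median(mean_per_shift) - min_dist   # median distance minus minimum distance
    return offset, min_dist, conf

# Toy data: 2 frames, vshift=2, so 5 shift positions per frame.
toy = [
    [9.0, 8.0, 7.0, 2.0, 8.0],
    [9.0, 8.0, 7.0, 4.0, 8.0],
]
print(summarise_dists(toy, vshift=2))  # -> (-1, 3.0, 5.0)
```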

--------------------------------

### Run SyncNet Pipeline Stage 2

Source: https://context7.com/joonson/syncnet_python/llms.txt

Runs `SyncNetInstance.evaluate()` on each cropped face track to compute the sync offset. The example passes the video file, reference name, data directory, initial model, batch size, and shift window (`vshift`).

```bash
python run_syncnet.py \
    --videofile  /path/to/interview.mp4 \
    --reference  interview \
    --data_dir   data/work \
    --initial_model data/syncnet_v2.model \
    --batch_size 20 \
    --vshift 15
```

--------------------------------

### Run Full Pipeline: Face Detection & Tracking with run_pipeline.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Converts the input video, detects and tracks faces, and crops a face track for each speaker. The example sets the input video, reference name, and data directory along with the detection, cropping, and tracking parameters.

```bash
python run_pipeline.py \
    --videofile  /path/to/interview.mp4 \
    --reference  interview \
    --data_dir   data/work \
    --facedet_scale  0.25 \
    --crop_scale     0.40 \
    --min_track      100 \
    --frame_rate     25 \
    --num_failed_det 25 \
    --min_face_size  100
# Outputs:
#   data/work/pyavi/interview/video.avi        – 25fps converted video
#   data/work/pyframes/interview/*.jpg         – extracted frames
#   data/work/pywork/interview/faces.pckl      – raw face detections
```

--------------------------------

### Run SyncNet Demo

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command runs the SyncNet demo, processing a video file and outputting synchronization information. Specify the video file path and a temporary directory for processing.

```bash
python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory
```

--------------------------------

### Run Full SyncNet Pipeline - Face Detection and Tracking

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command initiates the first stage of the SyncNet pipeline, focusing on face detection and tracking within the specified video.

```bash
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

--------------------------------

### Run Lip Feature Extraction Demo with demo_feature.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Extracts 512-D lip motion features from a video and saves them to a .pt file. Requires specifying video, model, and output paths.

```bash
python demo_feature.py \
    --videofile data/example.avi \
    --tmp_dir   data \
    --save_as   data/features.pt \
    --initial_model data/syncnet_v2.model \
    --batch_size 20
# Produces: data/features.pt  (torch.Tensor, shape [N, 512])
```

```python
import torch
feats = torch.load('data/features.pt')
print(feats.shape)   # e.g. torch.Size([120, 512])
```

--------------------------------

### Run Full SyncNet Pipeline - Sync Offset Estimation

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command executes the second stage of the SyncNet pipeline, responsible for estimating the audio-visual synchronization offset.

```bash
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

--------------------------------

### Create Conda Environment with GPU Support

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Use this command to create a Conda environment with all necessary dependencies for GPU acceleration.

```bash
conda env create -f environment.yml
```

--------------------------------

### Download Pretrained Model

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Execute this shell script to download the necessary pretrained model files for SyncNet.

```bash
sh download_model.sh
```

--------------------------------

### SyncNetInstance.loadParameters

Source: https://context7.com/joonson/syncnet_python/llms.txt

Loads pretrained PyTorch state-dict weights into the internal `S` network. It uses a CPU-safe `map_location` to ensure compatibility across different devices.

```APIDOC
## SyncNetInstance.loadParameters

### Description
Loads a serialised PyTorch state-dict into the internal `S` network using a CPU-safe `map_location`, so the same checkpoint works regardless of the inference device.

### Usage
```python
from SyncNetInstance import SyncNetInstance

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')
```

### Parameters
- `weights_path` (str): Path to the serialised PyTorch state-dict file (e.g., 'data/syncnet_v2.model').
```

--------------------------------

### Run SyncNet Pipeline Stage 3

Source: https://context7.com/joonson/syncnet_python/llms.txt

Generates an annotated output video with per-frame confidence scores and color-coded bounding boxes. The example passes the video file, reference name, data directory, and frame rate.

```bash
python run_visualise.py \
    --videofile  /path/to/interview.mp4 \
    --reference  interview \
    --data_dir   data/work \
    --frame_rate 25
```

--------------------------------

### SyncNetInstance.evaluate

Source: https://context7.com/joonson/syncnet_python/llms.txt

Estimates the Audio-Visual (AV) offset by extracting frames and audio, computing lip and MFCC embeddings, and finding the lag that minimizes pairwise L2 distance over a configurable shift window.

```APIDOC
## SyncNetInstance.evaluate

### Description
Extracts frames and audio from a video via ffmpeg, computes lip embeddings with `forward_lip` and MFCC embeddings with `forward_aud`, then uses `calc_pdist` over a configurable shift window (`vshift`) to find the lag that minimises pairwise L2 distance.

### Usage
```python
import argparse
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='/tmp/syncnet_work',
    reference='clip01',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/clip01/00000.avi')

print('Frames analysed:', dists.shape[0])
print('Search window  :', dists.shape[1], 'positions')
print('Best offset    :', int(offset), 'frames')
print('Confidence     :', f'{float(conf):.3f}')
```

### Parameters
- `opt` (argparse.Namespace): Namespace containing configuration options like `batch_size`, `vshift`, and `tmp_dir`.
- `videofile` (str): Path to the cropped face-track AVI video file.

### Returns
- `offset` (numpy scalar): Signed frame offset (positive = audio leads video).
- `conf` (numpy scalar): Confidence score (median_dist - min_dist).
- `dists` (numpy array): Per-frame pairwise distances across the `vshift` window.
```

--------------------------------

### SyncNetInstance - Core Model Wrapper

Source: https://context7.com/joonson/syncnet_python/llms.txt

The `SyncNetInstance` class wraps the core neural network and provides high-level methods for loading weights, running AV-offset evaluation, and extracting visual features. It automatically selects CUDA if available, falling back to CPU.

```APIDOC
## SyncNetInstance

### Description
`SyncNetInstance` wraps the `S` neural network and provides high-level methods for loading weights, running AV-offset evaluation, and extracting visual features. It automatically selects CUDA if available, falling back to CPU.

### Usage
```python
from SyncNetInstance import SyncNetInstance
import argparse

# Build a minimal options namespace
opt = argparse.Namespace(
    batch_size=20,
    vshift=15,           # search window: ±15 frames around zero offset
    tmp_dir='data/work/pytmp',
    reference='my_video',
)

# Instantiate and load weights
s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

# Evaluate AV sync on a cropped face-track AVI
offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/my_video/00000.avi')

print(f'AV offset: {int(offset):+d} frames')
print(f'Min dist:  {dists.mean(axis=0).min():.3f}')  # minimum over shifts of the frame-averaged distance
print(f'Confidence:{float(conf):.3f}')
```

### Parameters
- `opt` (argparse.Namespace): Namespace containing configuration options like `batch_size`, `vshift`, and `tmp_dir`.
- `videofile` (str): Path to the cropped face-track AVI video file.
```

--------------------------------

### Run Full SyncNet Pipeline - Visualization

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command runs the final stage of the SyncNet pipeline, generating visualizations based on the processed video and synchronization data.

```bash
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```
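
The three README stage commands (`run_pipeline.py`, `run_syncnet.py`, `run_visualise.py`) take the same `--videofile`, `--reference`, and `--data_dir` flags, so they can be assembled by a small driver. The sketch below only builds the command lines; `build_pipeline_commands` is a hypothetical helper and the paths are the README placeholders.

```python
import shlex

def build_pipeline_commands(videofile, reference, data_dir):
    """Return the argv lists for the three pipeline stages, in order."""
    common = ["--videofile", videofile, "--reference", reference, "--data_dir", data_dir]
    scripts = ["run_pipeline.py", "run_syncnet.py", "run_visualise.py"]
    return [["python", script, *common] for script in scripts]

commands = build_pipeline_commands("/path/to/video.mp4", "name_of_video", "/path/to/output")
for cmd in commands:
    # Print each command as a shell-quoted string for inspection.
    print(shlex.join(cmd))
```

To execute the stages rather than print them, each list could be passed to `subprocess.run(cmd, check=True)` so that a failing stage stops the sequence.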

--------------------------------

### Create Conda Environment for CPU Only

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Use this command to create a Conda environment for running SyncNet on CPU only.

```bash
conda env create -f environment-cpu.yml
```

--------------------------------

### SyncNetInstance.evaluate: AV Offset Estimation

Source: https://context7.com/joonson/syncnet_python/llms.txt

Estimates the audio-visual synchronisation offset by computing lip and MFCC embeddings, then finding the lag that minimises pairwise L2 distance within a configurable shift window (`vshift`).

```python
import argparse
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='/tmp/syncnet_work',
    reference='clip01',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/clip01/00000.avi')

# dists shape: (n_frames, 2*vshift+1)
print('Frames analysed:', dists.shape[0])
print('Search window  :', dists.shape[1], 'positions')
print('Best offset    :', int(offset), 'frames')
print('Confidence     :', f'{float(conf):.3f}')
```

--------------------------------

### Load Pretrained Weights for SyncNetInstance

Source: https://context7.com/joonson/syncnet_python/llms.txt

Loads a serialised PyTorch state-dict into the internal `S` network using a CPU-safe `map_location`. This ensures the checkpoint works regardless of the inference device, making it safe to use on CPU-only machines.

```python
from SyncNetInstance import SyncNetInstance

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')
# Internally copies each parameter tensor by name into self.__S__.state_dict()
# Safe to call on a CPU-only machine even if the model was saved on GPU
```

--------------------------------

### SyncNetModel (S) Dual-Stream Architecture

Source: https://context7.com/joonson/syncnet_python/llms.txt

Core PyTorch `nn.Module` for SyncNet, which processes the audio (MFCC) and lip (RGB crop) streams in separate branches and outputs one embedding per stream. The example then computes the cosine similarity between the two embeddings.

```python
import torch
from SyncNetModel import S

model = S(num_layers_in_fc_layers=1024)

# Audio branch — input: (N, 1, 13, T) MFCC spectrogram windows
audio_in = torch.randn(8, 1, 13, 20)   # batch=8, 1 channel, 13 MFCC bins, 20 time steps
audio_emb = model.forward_aud(audio_in)
print('Audio embedding:', audio_emb.shape)   # torch.Size([8, 1024])

# Lip branch — input: (N, 3, 5, H, W) 5-frame RGB face crops
lip_in = torch.randn(8, 3, 5, 224, 224)
lip_emb = model.forward_lip(lip_in)
print('Lip embedding  :', lip_emb.shape)     # torch.Size([8, 1024])

# Lip feature (pre-FC, 512-D) — used by extract_feature
lip_feat = model.forward_lipfeat(lip_in)
print('Lip feature    :', lip_feat.shape)    # torch.Size([8, 512])

# Cosine similarity between paired audio and lip embeddings.
# With untrained weights and random inputs the value itself is not meaningful.
sim = torch.nn.functional.cosine_similarity(audio_emb, lip_emb)
print('Mean cosine similarity:', sim.mean().item())
```

--------------------------------

### SyncNet Pipeline Output Directories

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Description of the output files generated by the SyncNet pipeline, including cropped face tracks and the final output video.

```text
$DATA_DIR/pycrop/$REFERENCE/*.avi - cropped face tracks
$DATA_DIR/pyavi/$REFERENCE/video_out.avi - output video
```
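
The two path patterns above can be resolved for a concrete run with `pathlib`. This is a minimal sketch; `expected_outputs` and `list_face_tracks` are hypothetical helpers, and `data/work` and `interview` are example values borrowed from the commands in this document.

```python
from pathlib import Path

def expected_outputs(data_dir, reference):
    """Map the documented output patterns to concrete paths."""
    root = Path(data_dir)
    return {
        "crop_dir": root / "pycrop" / reference,                       # cropped face tracks (*.avi)
        "output_video": root / "pyavi" / reference / "video_out.avi",  # annotated output video
    }

def list_face_tracks(data_dir, reference):
    """Return the cropped face-track files, or an empty list if none exist yet."""
    crop_dir = expected_outputs(data_dir, reference)["crop_dir"]
    if not crop_dir.is_dir():
        return []
    return sorted(crop_dir.glob("*.avi"))

outputs = expected_outputs("data/work", "interview")
print(outputs["output_video"].as_posix())
print(len(list_face_tracks("data/work", "interview")), "face tracks found")
```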

--------------------------------

### S (SyncNetModel)

Source: https://context7.com/joonson/syncnet_python/llms.txt

The core PyTorch `nn.Module` for the dual-stream SyncNet architecture. It processes audio spectrograms and lip crops through separate CNNs, mapping them to a 1024-D embedding space for synchronization analysis.

```APIDOC
## `S` (SyncNetModel) — Dual-Stream Neural Network

`S` is the core PyTorch `nn.Module` implementing the two-stream architecture: a 2-D CNN (`netcnnaud` + `netfcaud`) for MFCC spectrograms and a 3-D CNN (`netcnnlip` + `netfclip`) for 5-frame lip crops, both mapping to a 1024-D embedding space.

```python
import torch
from SyncNetModel import S

model = S(num_layers_in_fc_layers=1024)

# Audio branch — input: (N, 1, 13, T) MFCC spectrogram windows
audio_in = torch.randn(8, 1, 13, 20)   # batch=8, 1 channel, 13 MFCC bins, 20 time steps
audio_emb = model.forward_aud(audio_in)
print('Audio embedding:', audio_emb.shape)   # torch.Size([8, 1024])

# Lip branch — input: (N, 3, 5, H, W) 5-frame RGB face crops
lip_in = torch.randn(8, 3, 5, 224, 224)
lip_emb = model.forward_lip(lip_in)
print('Lip embedding  :', lip_emb.shape)     # torch.Size([8, 1024])

# Lip feature (pre-FC, 512-D) — used by extract_feature
lip_feat = model.forward_lipfeat(lip_in)
print('Lip feature    :', lip_feat.shape)    # torch.Size([8, 512])

# Cosine similarity between paired audio and lip embeddings.
# With untrained weights and random inputs the value itself is not meaningful.
sim = torch.nn.functional.cosine_similarity(audio_emb, lip_emb)
print('Mean cosine similarity:', sim.mean().item())
```
```

--------------------------------

### Calculate Pairwise Distances with calc_pdist

Source: https://context7.com/joonson/syncnet_python/llms.txt

Standalone utility to compute L2 pairwise distance between visual and audio features over temporal shifts. Returns a list of distance tensors.

```python
import torch
from SyncNetInstance import calc_pdist

# Simulate feature tensors (e.g. from a real evaluate() run)
n_frames = 50
feat_dim = 1024
im_feat = torch.randn(n_frames, feat_dim)   # lip embeddings
cc_feat = torch.randn(n_frames, feat_dim)   # audio embeddings

dists = calc_pdist(im_feat, cc_feat, vshift=10)
# dists: list of n_frames tensors, each of length 2*vshift+1 = 21

stacked = torch.stack(dists, 1)             # shape: (21, n_frames)
mdist   = torch.mean(stacked, 1)            # mean distance per shift
minval, minidx = torch.min(mdist, 0)
offset = 10 - minidx.item()
conf   = torch.median(mdist).item() - minval.item()
print(f'Estimated offset: {offset:+d}  confidence: {conf:.3f}')
```
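
To make the windowing explicit, here is a dependency-free sketch of the same idea. It is an illustration, not the library's implementation: it assumes the audio sequence is padded with `vshift` zero vectors on each side, that both sequences have the same length, and it uses plain Euclidean distance. `pdist_over_shifts` is a hypothetical name.

```python
import math

def pdist_over_shifts(im_feat, cc_feat, vshift):
    """Per-frame L2 distances between a visual vector and a window of audio vectors.

    im_feat, cc_feat: equal-length lists of float vectors, one per frame.
    Returns one row per frame with 2*vshift+1 distances; column vshift is zero shift.
    """
    zero = [0.0] * len(cc_feat[0])
    # Assumed zero padding so that every frame has a full window.
    padded = [zero] * vshift + list(cc_feat) + [zero] * vshift
    win = 2 * vshift + 1
    dists = []
    for i, visual in enumerate(im_feat):
        window = padded[i:i + win]  # audio frames i-vshift .. i+vshift
        dists.append([math.dist(visual, audio) for audio in window])
    return dists

# Identical sequences: the zero-shift column (index vshift) is all zeros.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rows = pdist_over_shifts(feats, feats, vshift=1)
print([row[1] for row in rows])
```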

--------------------------------

### SyncNet Publication Citation

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

BibTeX entry for citing the SyncNet paper.

```bibtex
@InProceedings{Chung16a,
  author       = "Chung, J.~S. and Zisserman, A.",
  title        = "Out of time: automated lip sync in the wild",
  booktitle    = "Workshop on Multi-view Lip-reading, ACCV",
  year         = "2016",
}
```

--------------------------------

### S3FD Face Detector Integration

Source: https://context7.com/joonson/syncnet_python/llms.txt

Runs the S3FD face detector on a single image, prints each detection, and draws the bounding boxes on the image. Requires OpenCV and PyTorch.

```python
import cv2, torch
from detectors import S3FD

device = 'cuda' if torch.cuda.is_available() else 'cpu'
det = S3FD(device=device)

image_bgr = cv2.imread('frame.jpg')
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# detect_faces returns an array of [x1, y1, x2, y2, confidence]
bboxes = det.detect_faces(image_rgb, conf_th=0.9, scales=[0.25])
for bbox in bboxes:
    x1, y1, x2, y2, conf = bbox
    print(f'Face at ({int(x1)},{int(y1)})-({int(x2)},{int(y2)})  conf={conf:.3f}')
    cv2.rectangle(image_bgr, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)

cv2.imwrite('frame_det.jpg', image_bgr)
```
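
Detections in the `[x1, y1, x2, y2, confidence]` layout can be post-filtered before further processing. The sketch below is a hypothetical filter, not the pipeline's own tracking logic; the thresholds reuse the `conf_th=0.9` and `--min_face_size 100` values from this document purely as example defaults.

```python
def filter_detections(bboxes, min_conf=0.9, min_face_size=100):
    """Keep boxes whose confidence and shorter side meet the thresholds.

    bboxes: iterable of (x1, y1, x2, y2, confidence) rows.
    """
    kept = []
    for x1, y1, x2, y2, conf in bboxes:
        width, height = x2 - x1, y2 - y1
        if conf >= min_conf and min(width, height) >= min_face_size:
            kept.append((x1, y1, x2, y2, conf))
    return kept

detections = [
    (0, 0, 150, 160, 0.95),   # large and confident: kept
    (0, 0, 50, 60, 0.99),     # too small: dropped
    (0, 0, 200, 200, 0.50),   # low confidence: dropped
]
print(filter_detections(detections))
```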

--------------------------------

### calc_pdist

Source: https://context7.com/joonson/syncnet_python/llms.txt

Computes L2 pairwise distance between visual and audio feature sequences over a range of temporal shifts. It returns a list of distance tensors, one for each frame, enabling the calculation of audio-visual synchronization offset and confidence.

```APIDOC
## `calc_pdist` — Pairwise Distance Over Temporal Shifts

Standalone utility that computes L2 pairwise distance between a visual feature sequence and a temporally-padded audio feature sequence for every shift in `[-vshift, +vshift]`. Returns a list of distance tensors, one per frame.

```python
import torch
from SyncNetInstance import calc_pdist

# Simulate feature tensors (e.g. from a real evaluate() run)
n_frames = 50
feat_dim = 1024
im_feat = torch.randn(n_frames, feat_dim)   # lip embeddings
cc_feat = torch.randn(n_frames, feat_dim)   # audio embeddings

dists = calc_pdist(im_feat, cc_feat, vshift=10)
# dists: list of n_frames tensors, each of length 2*vshift+1 = 21

stacked = torch.stack(dists, 1)             # shape: (21, n_frames)
mdist   = torch.mean(stacked, 1)            # mean distance per shift
minval, minidx = torch.min(mdist, 0)
offset = 10 - minidx.item()
conf   = torch.median(mdist).item() - minval.item()
print(f'Estimated offset: {offset:+d}  confidence: {conf:.3f}')
```
```

--------------------------------

### SyncNetInstance.extract_feature

Source: https://context7.com/joonson/syncnet_python/llms.txt

Extracts visual lip embeddings from a video file. This method runs only the lip CNN to generate a `(N, 512)` feature tensor, suitable for downstream tasks like speaker diarization.

```APIDOC
## `SyncNetInstance.extract_feature` — Visual Lip Embedding Extraction

Runs only the lip CNN (`forward_lipfeat`) on every 5-frame window in the video, returning the pre-FC feature map as a `(N, 512)` tensor. Useful for building downstream speaker-diarisation or retrieval systems.

```python
import argparse, torch
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='data',
    save_as='data/features.pt',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

feats = s.extract_feature(opt, videofile='data/example.avi')
# feats: torch.Tensor of shape (N_windows, 512)
print('Feature shape:', feats.shape)   # e.g. torch.Size([120, 512])

torch.save(feats, opt.save_as)
# Reload later:
feats_loaded = torch.load('data/features.pt')
```
```
