### SyncNet Demo Output Example

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Example of the expected output values from the SyncNet demo script. Minor variations may occur based on system configuration.

```text
AV offset: 3
Min dist: 5.353
Confidence: 10.021
```

--------------------------------

### Install Dependencies and Download Model

Source: https://context7.com/joonson/syncnet_python/llms.txt

Installs SyncNet dependencies using Conda. Choose either the GPU (CUDA) or CPU-only environment, then download the pretrained model weights and an example video.

```bash
# GPU (CUDA)
conda env create -f environment.yml
conda activate syncnet

# CPU only
conda env create -f environment-cpu.yml
conda activate syncnet

# Download pretrained model weights and example video
sh download_model.sh
# Downloads:
#   data/syncnet_v2.model – SyncNet weights
#   data/example.avi – example video
#   detectors/s3fd/weights/sfd_face.pth – S3FD face detector weights
```

--------------------------------

### Extract Lip Features with SyncNetInstance

Source: https://context7.com/joonson/syncnet_python/llms.txt

Use `SyncNetInstance.extract_feature` to get lip CNN features. Model parameters must be loaded first. The example saves the returned features as a PyTorch tensor.

```python
import argparse

import torch
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='data',
    save_as='data/features.pt',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

feats = s.extract_feature(opt, videofile='data/example.avi')
# feats: torch.Tensor of shape (N_windows, 512)
print('Feature shape:', feats.shape)  # e.g. torch.Size([120, 512])

torch.save(feats, opt.save_as)

# Reload later:
feats_loaded = torch.load('data/features.pt')
```

--------------------------------

### Run Quick AV Sync Demo with demo_syncnet.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Execute the full SyncNet evaluation pipeline on a single video file.
Requires specifying model and video paths.

```bash
python demo_syncnet.py \
    --videofile data/example.avi \
    --tmp_dir data/work/pytmp \
    --reference demo \
    --initial_model data/syncnet_v2.model \
    --batch_size 20 \
    --vshift 15

# Expected output:
# INFO Model data/syncnet_v2.model loaded.
# INFO AV offset: 3
# INFO Min dist: 5.353
# INFO Confidence: 10.021
```

--------------------------------

### Initialize SyncNetInstance and Evaluate AV Sync

Source: https://context7.com/joonson/syncnet_python/llms.txt

Initializes the `SyncNetInstance` wrapper, loads pretrained model weights, and evaluates the audio-visual synchronisation offset for a given video file. The `vshift` option defines the search window for the offset.

```python
import argparse

from SyncNetInstance import SyncNetInstance

# Build a minimal options namespace
opt = argparse.Namespace(
    batch_size=20,
    vshift=15,  # search window: ±15 frames around zero offset
    tmp_dir='data/work/pytmp',
    reference='my_video',
)

# Instantiate and load weights
s = SyncNetInstance()  # device auto-detected (cuda / cpu)
s.loadParameters('data/syncnet_v2.model')  # loads into internal S module

# Evaluate AV sync on a cropped face-track AVI
offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/my_video/00000.avi')
# offset : numpy scalar – signed frame offset (positive = audio leads video)
# conf   : numpy scalar – median_dist - min_dist (higher = more confident)
# dists  : numpy array – per-frame pairwise distances across the vshift window

print(f'AV offset: {int(offset):+d} frames')
# Min dist is the smallest frame-averaged distance over all shifts,
# not the smallest single entry of dists
print(f'Min dist: {dists.mean(axis=0).min():.3f}')
print(f'Confidence:{float(conf):.3f}')

# Expected output for the provided example:
# AV offset: +3 frames
# Min dist: 5.353
# Confidence:10.021
```

--------------------------------

### Run SyncNet Pipeline Stage 2

Source: https://context7.com/joonson/syncnet_python/llms.txt

Executes `SyncNetInstance.evaluate()` on cropped face tracks to compute sync offset.
Requires video file, reference name, data directory, initial model, batch size, and video shift.

```bash
python run_syncnet.py \
    --videofile /path/to/interview.mp4 \
    --reference interview \
    --data_dir data/work \
    --initial_model data/syncnet_v2.model \
    --batch_size 20 \
    --vshift 15
```

--------------------------------

### Run Full Pipeline: Face Detection & Tracking with run_pipeline.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Converts video, detects and tracks faces, and crops face-tracks for each speaker. Requires specifying input video and various processing parameters.

```bash
python run_pipeline.py \
    --videofile /path/to/interview.mp4 \
    --reference interview \
    --data_dir data/work \
    --facedet_scale 0.25 \
    --crop_scale 0.40 \
    --min_track 100 \
    --frame_rate 25 \
    --num_failed_det 25 \
    --min_face_size 100

# Outputs:
#   data/work/pyavi/interview/video.avi – 25fps converted video
#   data/work/pyframes/interview/*.jpg – extracted frames
#   data/work/pywork/interview/faces.pckl – raw face detections
```

--------------------------------

### Run SyncNet Demo

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command runs the SyncNet demo, processing a video file and outputting synchronization information. Specify the video file path and a temporary directory for processing.

```bash
python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory
```

--------------------------------

### Run Full SyncNet Pipeline - Face Detection and Tracking

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command initiates the first stage of the SyncNet pipeline, focusing on face detection and tracking within the specified video.
```bash
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

--------------------------------

### Run Lip Feature Extraction Demo with demo_feature.py

Source: https://context7.com/joonson/syncnet_python/llms.txt

Extracts 512-D lip motion features from a video and saves them to a .pt file. Requires specifying video, model, and output paths.

```bash
python demo_feature.py \
    --videofile data/example.avi \
    --tmp_dir data \
    --save_as data/features.pt \
    --initial_model data/syncnet_v2.model \
    --batch_size 20

# Produces: data/features.pt (torch.Tensor, shape [N, 512])
```

```python
import torch

feats = torch.load('data/features.pt')
print(feats.shape)  # e.g. torch.Size([120, 512])
```

--------------------------------

### Run Full SyncNet Pipeline - Sync Offset Estimation

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command executes the second stage of the SyncNet pipeline, responsible for estimating the audio-visual synchronization offset.

```bash
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

--------------------------------

### Create Conda Environment with GPU Support

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Use this command to create a Conda environment with all necessary dependencies for GPU acceleration.

```bash
conda env create -f environment.yml
```

--------------------------------

### Download Pretrained Model

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Execute this shell script to download the necessary pretrained model files for SyncNet.

```bash
sh download_model.sh
```

--------------------------------

### SyncNetInstance.loadParameters

Source: https://context7.com/joonson/syncnet_python/llms.txt

Loads pretrained PyTorch state-dict weights into the internal `S` network.
It uses a CPU-safe `map_location` to ensure compatibility across different devices.

```APIDOC
## SyncNetInstance.loadParameters

### Description
Loads a serialised PyTorch state-dict into the internal `S` network using a CPU-safe `map_location`, so the same checkpoint works regardless of the inference device.

### Usage
```python
from SyncNetInstance import SyncNetInstance

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')
```

### Parameters
- `weights_path` (str): Path to the serialised PyTorch state-dict file (e.g., 'data/syncnet_v2.model').
```

--------------------------------

### Run SyncNet Pipeline Stage 3

Source: https://context7.com/joonson/syncnet_python/llms.txt

Generates annotated video output with per-frame confidence scores and color-coded bounding boxes. Requires video file, reference name, data directory, and frame rate.

```bash
python run_visualise.py \
    --videofile /path/to/interview.mp4 \
    --reference interview \
    --data_dir data/work \
    --frame_rate 25
```

--------------------------------

### SyncNetInstance.evaluate

Source: https://context7.com/joonson/syncnet_python/llms.txt

Estimates the audio-visual (AV) offset by extracting frames and audio, computing lip and MFCC embeddings, and finding the lag that minimizes pairwise L2 distance over a configurable shift window.

```APIDOC
## SyncNetInstance.evaluate

### Description
Extracts frames and audio from a video via ffmpeg, computes lip embeddings with `forward_lip` and MFCC embeddings with `forward_aud`, then uses `calc_pdist` over a configurable shift window (`vshift`) to find the lag that minimises pairwise L2 distance.

### Usage
```python
import argparse

from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='/tmp/syncnet_work',
    reference='clip01',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/clip01/00000.avi')

print('Frames analysed:', dists.shape[0])
print('Search window  :', dists.shape[1], 'positions')
print('Best offset    :', int(offset), 'frames')
print('Confidence     :', f'{float(conf):.3f}')
```

### Parameters
- `opt` (argparse.Namespace): Namespace containing configuration options like `batch_size`, `vshift`, and `tmp_dir`.
- `videofile` (str): Path to the cropped face-track AVI video file.

### Returns
- `offset` (numpy scalar): Signed frame offset (positive = audio leads video).
- `conf` (numpy scalar): Confidence score (median_dist - min_dist).
- `dists` (numpy array): Per-frame pairwise distances across the `vshift` window.
```

--------------------------------

### SyncNetInstance - Core Model Wrapper

Source: https://context7.com/joonson/syncnet_python/llms.txt

The `SyncNetInstance` class wraps the core neural network and provides high-level methods for loading weights, running AV-offset evaluation, and extracting visual features. It automatically selects CUDA if available, falling back to CPU.

```APIDOC
## SyncNetInstance

### Description
`SyncNetInstance` wraps the `S` neural network and provides high-level methods for loading weights, running AV-offset evaluation, and extracting visual features. It automatically selects CUDA if available, falling back to CPU.

### Usage
```python
import argparse

from SyncNetInstance import SyncNetInstance

# Build a minimal options namespace
opt = argparse.Namespace(
    batch_size=20,
    vshift=15,  # search window: ±15 frames around zero offset
    tmp_dir='data/work/pytmp',
    reference='my_video',
)

# Instantiate and load weights
s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

# Evaluate AV sync on a cropped face-track AVI
offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/my_video/00000.avi')

print(f'AV offset: {int(offset):+d} frames')
# Min dist is the smallest frame-averaged distance over all shifts
print(f'Min dist: {dists.mean(axis=0).min():.3f}')
print(f'Confidence:{float(conf):.3f}')
```

### Parameters
- `opt` (argparse.Namespace): Namespace containing configuration options like `batch_size`, `vshift`, and `tmp_dir`.
- `videofile` (str): Path to the cropped face-track AVI video file.
```

--------------------------------

### Run Full SyncNet Pipeline - Visualization

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

This command runs the final stage of the SyncNet pipeline, generating visualizations based on the processed video and synchronization data.

```bash
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

--------------------------------

### Create Conda Environment for CPU Only

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Use this command to create a Conda environment for running SyncNet on CPU only.

```bash
conda env create -f environment-cpu.yml
```

--------------------------------

### SyncNetInstance.evaluate: AV Offset Estimation

Source: https://context7.com/joonson/syncnet_python/llms.txt

Estimates the audio-visual synchronisation offset by computing lip and MFCC embeddings, then finding the lag that minimises pairwise L2 distance within a configurable shift window (`vshift`).
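
The lag search described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the helper names `window_distances` and `offset_and_confidence` are hypothetical, the features are short Python lists rather than network embeddings, and plain Euclidean distance stands in for whatever distance routine the library calls. The offset and confidence formulas (`vshift - argmin` and `median - min` of the frame-averaged distance) follow the `calc_pdist` example elsewhere in this document.

```python
import math
from statistics import median


def window_distances(lip, aud, vshift):
    """Distance from each lip frame i to audio frames i-vshift .. i+vshift.

    Audio frames outside the clip are treated as zero vectors (zero padding).
    """
    zero = [0.0] * len(aud[0])
    padded = [zero] * vshift + list(aud) + [zero] * vshift
    win = 2 * vshift + 1
    return [[math.dist(lip[i], padded[i + j]) for j in range(win)]
            for i in range(len(lip))]


def offset_and_confidence(dists, vshift):
    """Average the distances over frames, then pick the best shift."""
    n_frames = len(dists)
    mdist = [sum(row[j] for row in dists) / n_frames
             for j in range(len(dists[0]))]
    minval = min(mdist)
    offset = vshift - mdist.index(minval)
    return offset, median(mdist) - minval


# Synthetic data (assumption): each lip frame equals the audio frame 3 steps earlier.
aud = [[float(t), float(t % 4)] for t in range(30)]
lip = [aud[max(t - 3, 0)] for t in range(30)]

dists = window_distances(lip, aud, vshift=5)
offset, conf = offset_and_confidence(dists, vshift=5)
print(f'offset={offset:+d}')  # offset=+3
```

With real embeddings the minimum distance will not be zero; the confidence simply measures how far the best shift sits below the median shift.
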
```python
import argparse

from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='/tmp/syncnet_work',
    reference='clip01',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

offset, conf, dists = s.evaluate(opt, videofile='data/work/pycrop/clip01/00000.avi')
# dists shape: (n_frames, 2*vshift+1)

print('Frames analysed:', dists.shape[0])
print('Search window  :', dists.shape[1], 'positions')
print('Best offset    :', int(offset), 'frames')
print('Confidence     :', f'{float(conf):.3f}')
```

--------------------------------

### Load Pretrained Weights for SyncNetInstance

Source: https://context7.com/joonson/syncnet_python/llms.txt

Loads a serialised PyTorch state-dict into the internal `S` network using a CPU-safe `map_location`. This ensures the checkpoint works regardless of the inference device, making it safe to use on CPU-only machines.

```python
from SyncNetInstance import SyncNetInstance

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')
# Internally copies each parameter tensor by name into self.__S__.state_dict()
# Safe to call on a CPU-only machine even if the model was saved on GPU
```

--------------------------------

### SyncNetModel (S) Dual-Stream Architecture

Source: https://context7.com/joonson/syncnet_python/llms.txt

Core PyTorch `nn.Module` for SyncNet, processing audio (MFCC) and lip (RGB crop) streams separately. Each stream outputs an embedding; the example compares matched audio and lip embeddings with cosine similarity.
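
The example for this model compares embeddings with cosine similarity, while the `evaluate` description elsewhere in this document ranks lags by L2 distance. For unit-length vectors the two are tied by the identity ‖a − b‖² = 2 − 2·cos(a, b), so they rank pairs identically. The sketch below checks that identity in plain Python. The vectors are made-up numbers, and the source does not say whether SyncNet embeddings are L2-normalised, so treat the normalisation step as an assumption.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 4-D vectors standing in for an audio and a lip embedding.
audio_emb = l2_normalise([0.3, -1.2, 0.5, 2.0])
lip_emb = l2_normalise([0.1, -0.9, 0.7, 1.5])

cos = cosine_similarity(audio_emb, lip_emb)
sq_dist = math.dist(audio_emb, lip_emb) ** 2

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
identity_holds = math.isclose(sq_dist, 2 - 2 * cos, abs_tol=1e-12)
print('identity holds:', identity_holds)  # identity holds: True
```

Without normalisation the identity no longer holds, and cosine similarity and L2 distance can rank the same pairs differently.
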
```python
import torch
from SyncNetModel import S

model = S(num_layers_in_fc_layers=1024)
model.eval()  # inference mode

# Audio branch: input (N, 1, 13, T) MFCC spectrogram windows
audio_in = torch.randn(8, 1, 13, 20)  # batch=8, 1 channel, 13 MFCC bins, 20 time steps
audio_emb = model.forward_aud(audio_in)
print('Audio embedding:', audio_emb.shape)  # torch.Size([8, 1024])

# Lip branch: input (N, 3, 5, H, W) 5-frame RGB face crops
lip_in = torch.randn(8, 3, 5, 224, 224)
lip_emb = model.forward_lip(lip_in)
print('Lip embedding  :', lip_emb.shape)  # torch.Size([8, 1024])

# Lip feature (pre-FC, 512-D), used by extract_feature
lip_feat = model.forward_lipfeat(lip_in)
print('Lip feature    :', lip_feat.shape)  # torch.Size([8, 512])

# Cosine similarity between matched audio and lip pairs.
# Inputs and weights are random here, so the value itself is not meaningful.
sim = torch.nn.functional.cosine_similarity(audio_emb, lip_emb)
print('Mean cosine similarity:', sim.mean().item())
```

--------------------------------

### SyncNet Pipeline Output Directories

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

Description of the output files generated by the SyncNet pipeline, including cropped face tracks and the final output video.

```text
$DATA_DIR/pycrop/$REFERENCE/*.avi - cropped face tracks
$DATA_DIR/pyavi/$REFERENCE/video_out.avi - output video (as shown below)
```

--------------------------------

### S (SyncNetModel)

Source: https://context7.com/joonson/syncnet_python/llms.txt

The core PyTorch `nn.Module` for the dual-stream SyncNet architecture. It processes audio spectrograms and lip crops through separate CNNs, mapping them to a 1024-D embedding space for synchronization analysis.

```APIDOC
## `S` (SyncNetModel): Dual-Stream Neural Network

`S` is the core PyTorch `nn.Module` implementing the two-stream architecture: a 2-D CNN (`netcnnaud` + `netfcaud`) for MFCC spectrograms and a 3-D CNN (`netcnnlip` + `netfclip`) for 5-frame lip crops, both mapping to a 1024-D embedding space.

```python
import torch
from SyncNetModel import S

model = S(num_layers_in_fc_layers=1024)
model.eval()  # inference mode

# Audio branch: input (N, 1, 13, T) MFCC spectrogram windows
audio_in = torch.randn(8, 1, 13, 20)  # batch=8, 1 channel, 13 MFCC bins, 20 time steps
audio_emb = model.forward_aud(audio_in)
print('Audio embedding:', audio_emb.shape)  # torch.Size([8, 1024])

# Lip branch: input (N, 3, 5, H, W) 5-frame RGB face crops
lip_in = torch.randn(8, 3, 5, 224, 224)
lip_emb = model.forward_lip(lip_in)
print('Lip embedding  :', lip_emb.shape)  # torch.Size([8, 1024])

# Lip feature (pre-FC, 512-D), used by extract_feature
lip_feat = model.forward_lipfeat(lip_in)
print('Lip feature    :', lip_feat.shape)  # torch.Size([8, 512])

# Cosine similarity between matched audio and lip pairs.
# Inputs and weights are random here, so the value itself is not meaningful.
sim = torch.nn.functional.cosine_similarity(audio_emb, lip_emb)
print('Mean cosine similarity:', sim.mean().item())
```
```

--------------------------------

### Calculate Pairwise Distances with calc_pdist

Source: https://context7.com/joonson/syncnet_python/llms.txt

Standalone utility to compute L2 pairwise distance between visual and audio features over temporal shifts. Returns a list of distance tensors, one per frame.

```python
import torch
from SyncNetInstance import calc_pdist

# Simulate feature tensors (e.g. from a real evaluate() run)
n_frames = 50
feat_dim = 1024
vshift = 10
im_feat = torch.randn(n_frames, feat_dim)  # lip embeddings
cc_feat = torch.randn(n_frames, feat_dim)  # audio embeddings

dists = calc_pdist(im_feat, cc_feat, vshift=vshift)
# dists: list of n_frames tensors, each of length 2*vshift+1 = 21

stacked = torch.stack(dists, 1)  # shape: (21, n_frames)
mdist = torch.mean(stacked, 1)  # mean distance per shift
minval, minidx = torch.min(mdist, 0)
offset = vshift - minidx.item()
conf = torch.median(mdist).item() - minval.item()
print(f'Estimated offset: {offset:+d} confidence: {conf:.3f}')
```

--------------------------------

### SyncNet Publication Citation

Source: https://github.com/joonson/syncnet_python/blob/master/README.md

BibTeX entry for citing the SyncNet paper.

```bibtex
@InProceedings{Chung16a,
  author    = "Chung, J.~S. and Zisserman, A.",
  title     = "Out of time: automated lip sync in the wild",
  booktitle = "Workshop on Multi-view Lip-reading, ACCV",
  year      = "2016",
}
```

--------------------------------

### S3FD Face Detector Integration

Source: https://context7.com/joonson/syncnet_python/llms.txt

Integrates the S3FD face detector for per-frame face detection in images. Requires OpenCV and PyTorch. The example detects faces and draws their bounding boxes on the image.
```python
import cv2
import torch
from detectors import S3FD

device = 'cuda' if torch.cuda.is_available() else 'cpu'
det = S3FD(device=device)

image_bgr = cv2.imread('frame.jpg')
if image_bgr is None:  # cv2.imread returns None when the file cannot be read
    raise FileNotFoundError('frame.jpg')
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# detect_faces returns an array of [x1, y1, x2, y2, confidence]
bboxes = det.detect_faces(image_rgb, conf_th=0.9, scales=[0.25])

for bbox in bboxes:
    x1, y1, x2, y2, conf = bbox
    print(f'Face at ({int(x1)},{int(y1)})-({int(x2)},{int(y2)}) conf={conf:.3f}')
    cv2.rectangle(image_bgr, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)

cv2.imwrite('frame_det.jpg', image_bgr)
```

--------------------------------

### calc_pdist

Source: https://context7.com/joonson/syncnet_python/llms.txt

Computes L2 pairwise distance between visual and audio feature sequences over a range of temporal shifts. It returns a list of distance tensors, one for each frame, from which the audio-visual synchronization offset and confidence can be calculated.

```APIDOC
## `calc_pdist`: Pairwise Distance Over Temporal Shifts

Standalone utility that computes L2 pairwise distance between a visual feature sequence and a temporally-padded audio feature sequence for every shift in `[-vshift, +vshift]`. Returns a list of distance tensors, one per frame.

```python
import torch
from SyncNetInstance import calc_pdist

# Simulate feature tensors (e.g. from a real evaluate() run)
n_frames = 50
feat_dim = 1024
vshift = 10
im_feat = torch.randn(n_frames, feat_dim)  # lip embeddings
cc_feat = torch.randn(n_frames, feat_dim)  # audio embeddings

dists = calc_pdist(im_feat, cc_feat, vshift=vshift)
# dists: list of n_frames tensors, each of length 2*vshift+1 = 21

stacked = torch.stack(dists, 1)  # shape: (21, n_frames)
mdist = torch.mean(stacked, 1)  # mean distance per shift
minval, minidx = torch.min(mdist, 0)
offset = vshift - minidx.item()
conf = torch.median(mdist).item() - minval.item()
print(f'Estimated offset: {offset:+d} confidence: {conf:.3f}')
```
```

--------------------------------

### SyncNetInstance.extract_feature

Source: https://context7.com/joonson/syncnet_python/llms.txt

Extracts visual lip embeddings from a video file. This method runs only the lip CNN to generate a `(N, 512)` feature tensor, suitable for downstream tasks like speaker diarization.

```APIDOC
## `SyncNetInstance.extract_feature`: Visual Lip Embedding Extraction

Runs only the lip CNN (`forward_lipfeat`) on every 5-frame window in the video, returning the pre-FC feature map as a `(N, 512)` tensor. Useful for building downstream speaker-diarisation or retrieval systems.

```python
import argparse

import torch
from SyncNetInstance import SyncNetInstance

opt = argparse.Namespace(
    batch_size=20,
    vshift=15,
    tmp_dir='data',
    save_as='data/features.pt',
)

s = SyncNetInstance()
s.loadParameters('data/syncnet_v2.model')

feats = s.extract_feature(opt, videofile='data/example.avi')
# feats: torch.Tensor of shape (N_windows, 512)
print('Feature shape:', feats.shape)  # e.g. torch.Size([120, 512])

torch.save(feats, opt.save_as)

# Reload later:
feats_loaded = torch.load('data/features.pt')
```
```
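
--------------------------------

### Chain the Three Pipeline Stages from Python

The three README pipeline commands in this document (`run_pipeline.py`, `run_syncnet.py`, `run_visualise.py`) take the same `--videofile`, `--reference` and `--data_dir` arguments. A minimal sketch of a driver that builds the three commands in order might look like the following. `build_stage_commands` and `run_stages` are hypothetical helper names that are not part of the repository, and the sketch assumes it is launched from the repository root.

```python
import subprocess

# Stage scripts in the order given by the README snippets.
STAGES = ('run_pipeline.py', 'run_syncnet.py', 'run_visualise.py')


def build_stage_commands(videofile, reference, data_dir, python='python'):
    """Return one argument list per stage, all sharing the same three flags."""
    shared = ['--videofile', videofile,
              '--reference', reference,
              '--data_dir', data_dir]
    return [[python, script, *shared] for script in STAGES]


def run_stages(videofile, reference, data_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_stage_commands(videofile, reference, data_dir):
        print(' '.join(cmd))
        if not dry_run:
            # check=True stops the chain if a stage exits with an error
            subprocess.run(cmd, check=True)


commands = build_stage_commands('/path/to/video.mp4', 'name_of_video', '/path/to/output')
run_stages('/path/to/video.mp4', 'name_of_video', '/path/to/output')  # dry run: prints only
```

Keeping `dry_run=True` as the default lets the commands be inspected before anything runs; pass `dry_run=False` once the paths are real.
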