### Install imagededup

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Installs the imagededup library using pip. This is a prerequisite for running the deduplication examples.

```bash
!pip install imagededup
```

--------------------------------

### Install imagededup from GitHub Source

Source: https://github.com/idealo/imagededup/blob/master/README.md

Installs the imagededup package by cloning the GitHub repository and then installing it from the local checkout with pip. This method allows for installing the latest development version.

```bash
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install .
```

--------------------------------

### Install imagededup from PyPI

Source: https://github.com/idealo/imagededup/blob/master/README.md

Installs the imagededup package using pip. This is the recommended method for installation and requires a Python environment.

```bash
pip install imagededup
```

--------------------------------

### Prepare CIFAR10 Image Directory

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Creates a working directory and copies the CIFAR10 training and testing images into it. This organizes the images for the deduplication process.

```python
image_dir = 'cifar10_images'
!mkdir $image_dir
!cp -r '/content/cifar/train/.' $image_dir
!cp -r '/content/cifar/test/.' $image_dir
```

--------------------------------

### Download and Extract CIFAR10 Dataset

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Downloads the CIFAR10 dataset archive and extracts its contents. This prepares the dataset for further processing.

```bash
!wget http://pjreddie.com/media/files/cifar.tgz
!tar xzf cifar.tgz
```

--------------------------------

### Import Plotting Utilities

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Imports necessary libraries for plotting duplicate images, including Path, plot_duplicates, and matplotlib.pyplot. Sets the figure size for plots.

```python
from pathlib import Path
from imagededup.utils import plot_duplicates
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 10)
```

--------------------------------

### Install imagededup and TensorFlow GPU

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

Installs the imagededup library and the GPU-enabled version of TensorFlow for enhanced performance on compatible hardware. This is a prerequisite for using the CNN method with GPU acceleration.

```python
# install imagededup via PyPI
!pip install imagededup

# by default imagededup is shipped with CPU-only support for TF but let's install GPU since we have it on Google Colab
!pip install tensorflow[and-cuda] --upgrade
```

--------------------------------

### Initialize CNN with Hugging Face ViT Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet demonstrates initializing the CNN encoder with a Hugging Face Vision Transformer (ViT) model. It includes installing the `transformers` library, defining a custom `VitHgface` class that wraps the ViT model and its processor, and then configuring the CNN.

```bash
!pip install transformers
```

```python
from pathlib import Path
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from transformers import ViTModel, AutoImageProcessor
import torch
from torchvision.transforms import transforms

VIT_MODEL = "google/vit-base-patch16-224-in21k"

def vit_transform(image):
    transform = AutoImageProcessor.from_pretrained(VIT_MODEL)
    x = transform(image, return_tensors='pt')['pixel_values']
    return x

class VitHgface(torch.nn.Module):
    transform = transforms.Lambda(vit_transform)

    name = 'ViT_hgface'

    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained(VIT_MODEL)

    def forward(self, x):
        x = x.view(-1, 3, 224, 224)
        with torch.no_grad():
            out = self.vit(pixel_values=x)
        return out.pooler_output

image_dir = Path('../tests/data/mixed_images')

custom_config = CustomModel(name=VitHgface.name,
                            model=VitHgface(),
                            transform=VitHgface.transform)

cnn = CNN(model_config=custom_config)
duplicates_cnn = cnn.find_duplicates(image_dir=image_dir, scores=True)

print(duplicates_cnn)

# Encode images separately
enc = cnn.encode_images(image_dir=image_dir)
print(enc)
```

--------------------------------

### Complete Image Deduplication Workflow with imagededup

Source: https://context7.com/idealo/imagededup/llms.txt

An end-to-end Python example demonstrating a typical image deduplication pipeline using imagededup. It covers choosing between hashing (PHash, DHash) and CNN methods, finding duplicates, reviewing them, identifying files to remove, and moving duplicates to a backup folder.

```python
from imagededup.methods import PHash, CNN
from imagededup.utils import plot_duplicates
import os
import shutil

# Step 1: Choose method based on use case
# - Hashing (PHash, DHash): Fast, good for exact/near-exact duplicates
# - CNN: Slower, better for transformed images (rotations, crops, etc.)

image_dir = 'path/to/image/directory'

# Step 2: Find duplicates using PHash (fast initial pass)
phasher = PHash()
duplicates_hash = phasher.find_duplicates(
    image_dir=image_dir,
    max_distance_threshold=10,
    scores=True,
    recursive=True
)

# Step 3: Find near-duplicates using CNN (more thorough)
cnn = CNN()
duplicates_cnn = cnn.find_duplicates(
    image_dir=image_dir,
    min_similarity_threshold=0.9,
    scores=True,
    recursive=True
)

# Step 4: Review detected duplicates
for filename, dups in duplicates_cnn.items():
    if dups:
        print(f"\n{filename} has {len(dups)} duplicates:")
        plot_duplicates(image_dir, duplicates_cnn, filename)

# Step 5: Get files to remove (automated selection)
files_to_remove = phasher.find_duplicates_to_remove(
    image_dir=image_dir,
    max_distance_threshold=10
)

# Step 6: Move duplicates to a separate folder (don't delete immediately!)
duplicate_folder = 'path/to/duplicates_backup'
os.makedirs(duplicate_folder, exist_ok=True)

for file in files_to_remove:
    src = os.path.join(image_dir, file)
    dst = os.path.join(duplicate_folder, file)
    shutil.move(src, dst)
    print(f"Moved {file} to backup folder")

print(f"\nMoved {len(files_to_remove)} duplicate files to {duplicate_folder}")
```
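
Step 6 above joins each filename directly onto the backup folder. If the filenames ever contain subdirectories (an assumption about recursive scans, not something the source confirms), the destination folders would have to exist first. The following is a minimal sketch of a hypothetical helper, `move_to_backup`, that is not part of imagededup; it supports a dry run and creates nested folders on demand, and it is demonstrated on a throwaway temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def move_to_backup(image_dir, filenames, backup_dir, dry_run=True):
    """Move the listed files from image_dir to backup_dir, keeping relative paths.

    Hypothetical helper (not an imagededup API). Returns (source, destination) pairs.
    """
    image_dir, backup_dir = Path(image_dir), Path(backup_dir)
    planned = []
    for name in filenames:
        src = image_dir / name
        dst = backup_dir / name
        if not src.is_file():
            continue  # skip entries that do not exist (or were already moved)
        planned.append((src, dst))
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)  # create nested folders on demand
            shutil.move(str(src), str(dst))
    return planned


# Demo on a temporary directory so no real images are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    images = root / "images"
    backup = root / "backup"
    (images / "sub").mkdir(parents=True)
    for rel in ("a.jpg", "sub/b.jpg"):
        (images / rel).write_bytes(b"fake image bytes")

    files_to_remove = ["a.jpg", "sub/b.jpg", "missing.jpg"]
    preview = move_to_backup(images, files_to_remove, backup, dry_run=True)
    moved = move_to_backup(images, files_to_remove, backup, dry_run=False)
    remaining = sorted(p.name for p in images.rglob("*.jpg"))
    backed_up = sorted(p.relative_to(backup).as_posix() for p in backup.rglob("*.jpg"))

print(len(preview), len(moved), remaining, backed_up)
```

Running the dry run first gives a list to review before anything is moved, in keeping with the "don't delete immediately" advice in the workflow.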

--------------------------------

### Get Filenames from Test Set

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This snippet uses pathlib to get a set of all filenames ending with '.png' from the '/content/cifar/test' directory. This is often a precursor to comparing or analyzing specific subsets of the dataset.

```python
# test images are stored under '/content/cifar/test'
filenames_test = set([i.name for i in Path('/content/cifar/test').glob('*.png')])
```

--------------------------------

### Find and Plot Duplicates Between Test and Train Sets

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies test-set images that have duplicates in the training set and plots the result for the test file with the most training-set duplicates, which shows how much the two sets overlap. It relies on `duplicates`, `filenames_test`, and `filenames_train` defined in the other CIFAR10 snippets.

```python
# keep only test-set filenames and, for each, only the duplicates that are in the train set
duplicates_test_train = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_train]
        duplicates_test_train[k] = tmp
    
# sort in descending order of duplicates
duplicates_test_train = {k: v for k, v in sorted(duplicates_test_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test_train, filename=list(duplicates_test_train.keys())[0])
```

--------------------------------

### Find Duplicates using CNN in Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Finds duplicate images in the entire dataset using a CNN-based approach from the imagededup library. It first encodes images and then identifies duplicates based on these encodings.

```python
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir=image_dir)
duplicates = cnn.find_duplicates(encoding_map=encodings)
```

--------------------------------

### Find and Plot Test Set Duplicates with CNN

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies and plots duplicate images specifically within the test set using CNN encodings. It filters the previously found duplicates to include only those present in the test set and then visualizes the results for the file with the most duplicates.

```python
# test images are stored under '/content/cifar/test'
filenames_test = set([i.name for i in Path('/content/cifar/test').glob('*.png')])

duplicates_test = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_test]
        duplicates_test[k] = tmp

# sort in descending order of duplicates
duplicates_test = {k: v for k, v in sorted(duplicates_test.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test, filename=list(duplicates_test.keys())[0])
```

--------------------------------

### Find and Plot Train Set Duplicates with CNN

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies and plots duplicate images within the training set using CNN encodings. As in the test-set analysis, it filters the duplicates to include only those in the training set and visualizes the results for the file with the most duplicates.

```python
# train images are stored under '/content/cifar/train'
filenames_train = set([i.name for i in Path('/content/cifar/train').glob('*.png')])

duplicates_train = {}
for k, v in duplicates.items():
    if k in filenames_train:
        tmp = [i for i in v if i in filenames_train]
        duplicates_train[k] = tmp

# sort in descending order of duplicates
duplicates_train = {k: v for k, v in sorted(duplicates_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_train, filename=list(duplicates_train.keys())[0])
```
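
The test-only, train-only, and test-versus-train snippets all filter the same `duplicates` dictionary in the same way. As a rough sketch, that pattern could be folded into one hypothetical helper, `filter_duplicates` (not part of imagededup), shown here on toy data rather than the real CIFAR10 output.

```python
def filter_duplicates(duplicates, keys, values):
    """Keep entries whose key is in `keys`, and only the duplicates that are in `values`."""
    filtered = {
        k: [d for d in v if d in values]
        for k, v in duplicates.items()
        if k in keys
    }
    # Sort so the file with the most duplicates comes first.
    return dict(sorted(filtered.items(), key=lambda item: len(item[1]), reverse=True))


# Toy data standing in for the CNN output and the two filename sets.
duplicates = {
    "test_1.png": ["train_1.png", "test_2.png"],
    "test_2.png": ["train_1.png", "train_2.png", "test_1.png"],
    "train_1.png": ["test_1.png", "test_2.png"],
    "train_2.png": ["test_2.png"],
}
filenames_test = {"test_1.png", "test_2.png"}
filenames_train = {"train_1.png", "train_2.png"}

duplicates_test_train = filter_duplicates(duplicates, filenames_test, filenames_train)
duplicates_test = filter_duplicates(duplicates, filenames_test, filenames_test)
print(duplicates_test_train)
print(duplicates_test)
```

The first key of each result is the file with the most duplicates, which is the filename the plotting calls pick with `list(...keys())[0]`.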

--------------------------------

### Deduplicate Images with Perceptual Hashing (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This snippet demonstrates how to find and remove duplicate images in a directory using perceptual hashing. It utilizes the PHash method from the imagededup library, allowing users to set a maximum Hamming distance threshold and specify an output file for the duplicate list. Ensure the 'imagededup' library is installed.

```python
from imagededup.methods import PHash
phasher = PHash()
duplicates = phasher.find_duplicates_to_remove(image_dir='path/to/image/directory', 
                                               max_distance_threshold=12, 
                                               outfile='my_duplicates.json')
```

--------------------------------

### Initialize CNN with Pre-packaged EfficientNet Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet shows how to initialize the CNN encoder using a pre-packaged EfficientNet model. It demonstrates importing necessary classes and configuring a custom model using EfficientNet's name, model, and transform functions.

```python
from pathlib import Path
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from imagededup.utils.models import EfficientNet

image_dir = Path('../tests/data/mixed_images')

custom_config = CustomModel(name=EfficientNet.name,
                            model=EfficientNet(),
                            transform=EfficientNet.transform)

cnn_encoder = CNN(model_config=custom_config)
duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)

print(duplicates_cnn)
```

--------------------------------

### Initialize Image Directory Path

Source: https://github.com/idealo/imagededup/blob/master/examples/Evaluation.ipynb

Sets up the path to the directory containing images for deduplication. This path is used as input for subsequent deduplication functions.

```python
from pathlib import Path

import imagededup

image_dir = Path('../tests/data/mixed_images')
```

--------------------------------

### Initialize CNN with Pre-packaged Custom Model - Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/custom_model.md

Initializes the CNN deduplication method using a pre-packaged model from the imagededup library, such as EfficientNet. This involves creating a `CustomModel` object with the model's name, instance, and transformation function, then passing this configuration to the `CNN` constructor.

```python
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Get the prepackaged models from imagededup
from imagededup.utils.models import ViT, MobilenetV3, EfficientNet


# Declare a custom config with CustomModel, the prepackaged models come with a name and transform function
custom_config = CustomModel(name=EfficientNet.name,
                            model=EfficientNet(), 
                            transform=EfficientNet.transform)

# Use model_config argument to pass the custom config
cnn = CNN(model_config=custom_config)

# Use the model as usual
...
```

--------------------------------

### Prepare Image Directory for Deduplication

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code creates a working directory and copies all training and testing images from the extracted CIFAR10 dataset into this single directory. This consolidated directory is then used for image analysis.

```python
# create working directory and move all images into this directory
image_dir = 'cifar10_images'
!mkdir $image_dir
!cp -r '/content/cifar/train/.' $image_dir
!cp -r '/content/cifar/test/.' $image_dir
```

--------------------------------

### Download and Extract CIFAR10 Dataset

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This snippet downloads the CIFAR10 dataset using wget and then extracts the compressed tarball using the tar command. It's a common first step for preparing image datasets.

```shell
# download CIFAR10 dataset and untar
!wget http://pjreddie.com/media/files/cifar.tgz
!tar xzf cifar.tgz
```

--------------------------------

### Find Image Duplicates using PHash in Python

Source: https://github.com/idealo/imagededup/blob/master/README.md

Demonstrates a Python workflow to find duplicate images in a directory using the Perceptual Hashing (PHash) method. It involves encoding images, finding duplicates based on encodings, and optionally plotting the results.

```python
from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```

--------------------------------

### Deduplicate Images with CNN Features (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This snippet shows how to deduplicate images using Convolutional Neural Network (CNN) features. It employs the CNN method from imagededup, enabling deduplication based on a minimum cosine similarity threshold and saving the results to a specified JSON file. The 'imagededup' library must be installed for this to function.

```python
from imagededup.methods import CNN
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory', 
                                                   min_similarity_threshold=0.85, 
                                                   outfile='my_duplicates.json')
```
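
To make the role of `min_similarity_threshold` more concrete, here is a small sketch of cosine similarity on toy vectors. The three-element lists stand in for CNN feature vectors (real encodings are much longer), and treating a score equal to the threshold as a match is an assumption of this sketch rather than documented library behaviour.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "feature vectors" keyed by filename (illustrative values only).
features = {
    "a.jpg": [1.0, 0.0, 1.0],
    "a_copy.jpg": [0.9, 0.1, 1.0],
    "b.jpg": [0.0, 1.0, 0.0],
}
min_similarity_threshold = 0.85

# For every file, list the other files whose similarity reaches the threshold.
matches = {
    name: [
        other
        for other in features
        if other != name
        and cosine_similarity(features[name], features[other]) >= min_similarity_threshold
    ]
    for name in features
}
print(matches)
```

Raising the threshold towards 1.0 keeps only near-identical vectors, while lowering it admits looser matches.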

--------------------------------

### Initialize CNN with Custom PyTorch Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet illustrates how to initialize the CNN encoder with a custom PyTorch model. It defines a `MyModel` class inheriting from `torch.nn.Module`, specifying its transformation and forward pass, and then uses it to configure the CNN.

```python
from pathlib import Path
import torch
from torchvision.transforms import transforms
from imagededup.methods import CNN
from imagededup.utils import CustomModel

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.Resize((256, 256)),
            transforms.ToTensor()
        ]
    ) # transform must take PIL.Image as input and return a torch.Tensor

    name = 'my_custom_model' # name can be any user-defined string

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Add more operations here
        x = x.view(-1, 256*256*3) # output shape: batch_size x features
        return x

image_dir = Path('../tests/data/mixed_images')

# Initialize the CNN using model_config parameter and setting it to the custom model
custom_config = CustomModel(name=MyModel.name,
                            model=MyModel(),
                            transform=MyModel.transform)

cnn = CNN(model_config=custom_config)
duplicates_cnn = cnn.find_duplicates(image_dir=image_dir, scores=True)

print(duplicates_cnn)
```

--------------------------------

### Evaluating Deduplication Results

Source: https://context7.com/idealo/imagededup/llms.txt

Provides a framework for evaluating the performance of image deduplication algorithms. It compares the algorithm's output (retrieved duplicates) against a manually curated ground truth mapping using standard information retrieval and classification metrics.

```python
from imagededup.evaluation import evaluate

# Ground truth: manually curated duplicate mappings
ground_truth_map = {
    'image1.jpg': ['image2.jpg', 'image3.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': ['image1.jpg'],
    'image4.jpg': []
}

# Retrieved: duplicates found by algorithm
retrieved_map = {
    'image1.jpg': ['image2.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': [],
    'image4.jpg': []
}

# Compute all available metrics for the two mappings
evaluation_results = evaluate(ground_truth_map, retrieved_map)
print(evaluation_results)
```
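
As a hand-worked illustration of what comparing the two mappings means, the sketch below counts unordered duplicate pairs and derives precision and recall for the duplicate class only. It is a simplified calculation on the example data above, not the implementation behind `evaluate`, and the helper `duplicate_pairs` is hypothetical.

```python
def duplicate_pairs(mapping):
    """Collect unordered {file, duplicate} pairs from a duplicates mapping."""
    return {frozenset((key, dup)) for key, dups in mapping.items() for dup in dups}


ground_truth_map = {
    'image1.jpg': ['image2.jpg', 'image3.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': ['image1.jpg'],
    'image4.jpg': []
}
retrieved_map = {
    'image1.jpg': ['image2.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': [],
    'image4.jpg': []
}

true_pairs = duplicate_pairs(ground_truth_map)   # pairs that really are duplicates
found_pairs = duplicate_pairs(retrieved_map)     # pairs the algorithm reported
correct = true_pairs & found_pairs

precision = len(correct) / len(found_pairs) if found_pairs else 0.0
recall = len(correct) / len(true_pairs) if true_pairs else 0.0
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the algorithm reported one pair and it was correct, but it missed the `image1.jpg`/`image3.jpg` pair, so precision is perfect while recall is one half.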

--------------------------------

### Implement GCC __builtin_ffs on Various Platforms

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Provides portable implementations for GCC's `__builtin_ffs` and its variants (`ffsl`, `ffsll`, `ffs32`, `ffs64`). These functions find the first set bit in an integer and work across different compilers and older GCC versions. The implementations can utilize compiler-specific built-ins, inline assembly, or pure C code.

```c
int psnip_builtin_ffs(int);
int psnip_builtin_ffsl(long);
int psnip_builtin_ffsll(long long);
int psnip_builtin_ffs32(psnip_int32_t);
int psnip_builtin_ffs64(psnip_int64_t);
```

```c
#define PSNIP_BUILTIN_EMULATE_NATIVE
int __builtin_ffs(int);
int __builtin_ffsl(long);
int __builtin_ffsll(long long);
```
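
For readers unfamiliar with "find first set", the semantics can be sketched in a few lines of Python: the result is the 1-based position of the least significant set bit, or 0 when the input is 0. This only illustrates the behaviour of the C functions; it is not how the portable C implementations are written.

```python
def ffs(value: int) -> int:
    """Return the 1-based index of the least significant set bit, or 0 if value is 0."""
    # value & -value isolates the lowest set bit; bit_length() gives its 1-based position.
    return (value & -value).bit_length()


for v in (0, 1, 8, 12):
    print(v, ffs(v))
```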

--------------------------------

### WHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Employs Wavelet Hashing (WHash) using Haar wavelets to generate hashes, offering a balance between speed and accuracy for various image transformations. The `find_duplicates` method can encode images and find duplicates in a single step, supporting recursive directory scanning and parallel processing for both encoding and distance calculation. Dependencies include the imagededup library.

```python
from imagededup.methods import WHash

whasher = WHash()

# Encode images and find duplicates in one step
duplicates = whasher.find_duplicates(
    image_dir='path/to/images',
    max_distance_threshold=10,
    scores=True,
    recursive=True,
    num_enc_workers=4,   # Workers for encoding
    num_dist_workers=4   # Workers for distance calculation
)
```

--------------------------------

### Initialize CNN with User-Defined Custom Model - Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/custom_model.md

Initializes the CNN deduplication method using a user-defined PyTorch model. This requires defining a `torch.nn.Module` subclass with a `forward` method and optionally `name` and `transform` attributes. The custom model and its associated transformation are then wrapped in a `CustomModel` object and passed to the `CNN` constructor.

```python
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Import necessary pytorch constructs for initializing a custom feature extractor
import torch
from torchvision.transforms import transforms

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.ToTensor()
        ]
    )
    name = 'my_custom_model'

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Do something with x
        return x

custom_config = CustomModel(name=MyModel.name,
                            model=MyModel(),
                            transform=MyModel.transform)

cnn = CNN(model_config=custom_config)

# Use the model as usual
...
```

--------------------------------

### PHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Uses the PHash algorithm to generate 16-character hexadecimal hashes for images, robust to minor modifications. It can encode single images or entire directories, find duplicates based on Hamming distance, and optionally save results or list files for removal. Dependencies include the imagededup library.

```python
from imagededup.methods import PHash

# Initialize perceptual hasher
phasher = PHash()

# Generate hash for a single image
single_hash = phasher.encode_image(image_file='path/to/image.jpg')
print(f"Image hash: {single_hash}")  # Output: 16-character hex string like 'a8f0e2b145c37d90'

# Generate hashes for all images in a directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
# Returns: {'image1.jpg': 'hash1', 'image2.jpg': 'hash2', ...}

# Find duplicates with hamming distance threshold (0-64, lower = stricter)
duplicates = phasher.find_duplicates(
    image_dir='path/to/image/directory',
    max_distance_threshold=10,  # Maximum hamming distance for duplicates
    scores=True,                # Include distance scores
    outfile='duplicates.json'   # Save results to file
)
# Returns: {'image1.jpg': [('similar1.jpg', 5), ('similar2.jpg', 8)], ...}

# Get list of files to remove (keeps one from each duplicate group)
files_to_remove = phasher.find_duplicates_to_remove(
    image_dir='path/to/image/directory',
    max_distance_threshold=10
)
# Returns: ['duplicate1.jpg', 'duplicate2.jpg', ...]
```
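
With `scores=True`, each duplicate shows up twice in the result, once under each file of the pair. A possible way to collapse that into one row per pair is sketched below; `unique_pairs` is a hypothetical helper, and the input mirrors the return format shown in the comments above.

```python
def unique_pairs(duplicates_with_scores):
    """Collapse a {file: [(dup, score), ...]} mapping into sorted (file_a, file_b, score) rows."""
    pairs = {}
    for filename, matches in duplicates_with_scores.items():
        for other, score in matches:
            key = tuple(sorted((filename, other)))  # A-B and B-A are the same pair
            pairs[key] = score
    return sorted((a, b, score) for (a, b), score in pairs.items())


# Toy result in the same shape as find_duplicates(..., scores=True).
duplicates = {
    "image1.jpg": [("similar1.jpg", 5), ("similar2.jpg", 8)],
    "similar1.jpg": [("image1.jpg", 5)],
    "similar2.jpg": [("image1.jpg", 8)],
    "unique.jpg": [],
}

rows = unique_pairs(duplicates)
for row in rows:
    print(row)
```

A flat list like this is convenient for sorting by distance or exporting to a spreadsheet for manual review.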

--------------------------------

### Evaluate Deduplication Metrics in Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/evaluating_performance.md

Calculates deduplication quality metrics from a ground-truth mapping and a retrieved mapping using `evaluate` from the imagededup library. It takes `ground_truth_map`, `retrieved_map`, and an optional `metric` argument, and returns a dictionary of the calculated metrics.

```python
from imagededup.evaluation import evaluate

# Example usage:
ground_truth_map = {
  '1.jpg': ['2.jpg', '4.jpg'],
  '2.jpg': ['1.jpg'],
  '3.jpg': [],
  '4.jpg': ['1.jpg']
}

retrieved_map = {
  '1.jpg': ['2.jpg'],
  '2.jpg': ['1.jpg'],
  '3.jpg': [],
  '4.jpg': []
}

# Evaluate all metrics (default)
metrics_all = evaluate(ground_truth_map, retrieved_map)
print("All metrics:", metrics_all)

# Evaluate specific metric
metrics_map = evaluate(ground_truth_map, retrieved_map, metric='map')
print("MAP metric:", metrics_map)

# Evaluate classification metrics
metrics_classification = evaluate(ground_truth_map, retrieved_map, metric='classification')
print("Classification metrics:", metrics_classification)
```
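
To give a feel for one of these metrics, the sketch below computes a per-file Jaccard index by hand on the same example mappings and averages it. It is an illustrative calculation only: scoring an empty ground truth against an empty retrieval as 1.0 is an assumption of this sketch, and the library's own numbers may be computed differently.

```python
def jaccard(ground_truth, retrieved):
    """Size of the intersection divided by size of the union of two duplicate lists."""
    truth, found = set(ground_truth), set(retrieved)
    if not truth and not found:
        return 1.0  # assumption: nothing to find and nothing returned counts as a perfect match
    return len(truth & found) / len(truth | found)


ground_truth_map = {
    '1.jpg': ['2.jpg', '4.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': ['1.jpg']
}
retrieved_map = {
    '1.jpg': ['2.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': []
}

scores = {
    name: jaccard(ground_truth_map[name], retrieved_map.get(name, []))
    for name in ground_truth_map
}
mean_jaccard = sum(scores.values()) / len(scores)
print(scores)
print(mean_jaccard)
```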

--------------------------------

### Import Plotting Utilities

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code imports necessary libraries from pathlib, imagededup.utils, and matplotlib for visualizing duplicate images. It also sets the default figure size for plots.

```python
# do some imports for plotting
from pathlib import Path
from imagededup.utils import plot_duplicates
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 10)
```

--------------------------------

### AHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Implements Average Hashing (AHash) for fast duplicate detection, suitable for exact matches. It can encode images from NumPy arrays or directories (with recursive search), find duplicates using pre-computed encodings, and supports parallel processing for encoding. Dependencies include imagededup, numpy, and Pillow.

```python
from imagededup.methods import AHash

ahasher = AHash()

# Encode single image from numpy array
import numpy as np
from PIL import Image

image_array = np.array(Image.open('path/to/image.jpg'))
hash_from_array = ahasher.encode_image(image_array=image_array)

# Encode directory with recursive search in nested folders
encodings = ahasher.encode_images(
    image_dir='path/to/nested/directory',
    recursive=True,              # Search subdirectories
    num_enc_workers=4            # Parallel processing cores
)

# Find duplicates using pre-computed encodings
duplicates = ahasher.find_duplicates(
    encoding_map=encodings,      # Use pre-computed hashes
    max_distance_threshold=12,
    scores=False                 # Return only filenames, not scores
)
# Returns: {'image1.jpg': ['similar1.jpg', 'similar2.jpg'], ...}
```
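
The idea behind average hashing can be sketched without any image library: compare every pixel of a small grayscale grid with the grid's mean and record one bit per pixel. This is a simplified illustration on a hand-made 8x8 grid; it skips the resizing and grayscale conversion a real implementation needs, and the bit ordering used by imagededup may differ.

```python
def average_hash(pixels):
    """Hash an 8x8 grid of grayscale values: one bit per pixel, set when the pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = "".join("1" if value > mean else "0" for value in flat)
    return f"{int(bits, 2):016x}"  # 64 bits -> 16 hexadecimal characters


# Toy 8x8 image: left half dark, right half bright.
image = [[0] * 4 + [255] * 4 for _ in range(8)]
image_hash = average_hash(image)
print(image_hash)
```

Because every bit depends only on whether a pixel is brighter than the average, uniform brightness changes tend to leave the hash unchanged, which is why the method is fast but best suited to near-exact copies.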

--------------------------------

### Custom CNN Models for Feature Extraction

Source: https://context7.com/idealo/imagededup/llms.txt

Enables the use of alternative pre-trained models like Vision Transformer (ViT) or EfficientNet, or custom PyTorch models for image feature extraction within the CNN deduplication process. This allows for greater flexibility in balancing accuracy and performance.

```python
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from imagededup.utils.models import ViT, EfficientNet, MobilenetV3

# Using Vision Transformer (ViT) - better accuracy, slower
vit_config = CustomModel(
    name=ViT.name,
    model=ViT(),
    transform=ViT.transform
)
cnn_vit = CNN(model_config=vit_config)
duplicates = cnn_vit.find_duplicates(image_dir='path/to/images', min_similarity_threshold=0.9)

# Using EfficientNet B4 - good balance of speed and accuracy
effnet_config = CustomModel(
    name=EfficientNet.name,
    model=EfficientNet(),
    transform=EfficientNet.transform
)
cnn_effnet = CNN(model_config=effnet_config)

# Using a custom PyTorch model
import torch
from torchvision.transforms import transforms

class MyFeatureExtractor(torch.nn.Module):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    name = 'my_custom_extractor'

    def __init__(self):
        super().__init__()
        self.backbone = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
        self.backbone.fc = torch.nn.Identity()  # Remove classification layer

    def forward(self, x):
        return self.backbone(x)  # Returns (batch_size, 512)

custom_config = CustomModel(
    name=MyFeatureExtractor.name,
    model=MyFeatureExtractor(),
    transform=MyFeatureExtractor.transform
)
cnn_custom = CNN(model_config=custom_config)
encodings = cnn_custom.encode_images(image_dir='path/to/images')
```

--------------------------------

### Plotting Detected Image Duplicates

Source: https://context7.com/idealo/imagededup/llms.txt

Visualizes detected duplicate images to aid in manual inspection. This utility takes the output from a deduplication method (like PHash) and generates a plot showing a query image alongside its identified duplicates, optionally saving the plot to a file.

```python
from imagededup.methods import PHash
from imagededup.utils import plot_duplicates

# Find duplicates
phasher = PHash()
duplicates = phasher.find_duplicates(
    image_dir='path/to/images',
    max_distance_threshold=12,
    scores=True
)

# Plot duplicates for a specific image
plot_duplicates(
    image_dir='path/to/images',
    duplicate_map=duplicates,
    filename='query_image.jpg',  # Must be a key in duplicate_map
    outfile='duplicates_plot.png'  # Optional: save plot to file
)
# Displays original image with all detected duplicates and their scores
```

--------------------------------

### DHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Utilizes Difference Hashing (DHash) for efficient duplicate detection based on pixel gradient differences. This method is fast and effective for exact duplicates. It supports encoding images from directories, finding duplicates with parallel computation, and calculating Hamming distances between hashes. Dependencies include the imagededup library.

```python
from imagededup.methods import DHash

dhasher = DHash()

# Complete workflow: encode and find duplicates
encodings = dhasher.encode_images(image_dir='path/to/images')

# Find duplicates with parallel distance computation
duplicates = dhasher.find_duplicates(
    encoding_map=encodings,
    max_distance_threshold=8,
    scores=True,
    num_dist_workers=8  # Parallel workers for distance calculation
)

# Calculate hamming distance between two hashes manually
hash1 = dhasher.encode_image(image_file='image1.jpg')
hash2 = dhasher.encode_image(image_file='image2.jpg')
distance = dhasher.hamming_distance(hash1, hash2)
print(f"Hamming distance: {distance}")  # 0 = identical, 64 = completely different
```
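
For intuition about the distance values, the Hamming distance between two hexadecimal hash strings can be reproduced in plain Python by XOR-ing the two numbers and counting the set bits. This is a standalone sketch of the concept with made-up hash values, not the library's own implementation.

```python
def hamming_distance(hash1: str, hash2: str) -> int:
    """Count the differing bits between two equal-length hexadecimal hash strings."""
    if len(hash1) != len(hash2):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


same = hamming_distance("0f0f0f0f0f0f0f0f", "0f0f0f0f0f0f0f0f")      # identical hashes
opposite = hamming_distance("0000000000000000", "ffffffffffffffff")  # every bit differs
one_bit = hamming_distance("0f0f0f0f0f0f0f0f", "0f0f0f0f0f0f0f0e")   # last bit flipped
print(same, opposite, one_bit)
```

A `max_distance_threshold` of 8 therefore accepts pairs whose 64-bit hashes differ in at most 8 bit positions.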

--------------------------------

### Find Duplicate Images using CNN Method

Source: https://github.com/idealo/imagededup/blob/master/examples/Evaluation.ipynb

This snippet demonstrates how to find duplicate images using the CNN (Convolutional Neural Network) method from the imagededup library. It initializes the CNN encoder, specifies the image directory, and sets a similarity threshold to identify duplicates. The output is a dictionary mapping image filenames to lists of their duplicates.

```python
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, min_similarity_threshold=0.97)
```

--------------------------------

### Evaluate Image Deduplication Metrics

Source: https://context7.com/idealo/imagededup/llms.txt

Calculates and prints various evaluation metrics for image deduplication, such as Mean Average Precision (MAP), Normalized DCG, Jaccard Index, Precision, Recall, and F1-Score. It supports retrieving all metrics or specific ones like 'map' or 'classification'.

```python
from imagededup.evaluation import evaluate

# Assuming ground_truth_map and retrieved_map are defined elsewhere
# metrics = evaluate(ground_truth_map, retrieved_map, metric='all')
# print(f"Mean Average Precision (MAP): {metrics['map']:.4f}")
# print(f"Normalized DCG: {metrics['ndcg']:.4f}")
# print(f"Jaccard Index: {metrics['jaccard']:.4f}")
# print(f"Precision (per class): {metrics['precision']}")
# print(f"Recall (per class): {metrics['recall']}")
# print(f"F1-Score (per class): {metrics['f1_score']}")

# map_score = evaluate(ground_truth_map, retrieved_map, metric='map')
# classification_metrics = evaluate(ground_truth_map, retrieved_map, metric='classification')
```
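
As a rough illustration of what such a metric measures, the sketch below computes a mean Jaccard index over two maps shaped like `find_duplicates` output (filename to list of duplicates). It is a simplified stand-in, not the library's `evaluate` function: the helper name `mean_jaccard`, the toy maps, and the choice to score two empty lists as a perfect match are all assumptions.

```python
def mean_jaccard(ground_truth_map, retrieved_map):
    """Average the per-file Jaccard index between true and retrieved duplicate sets."""
    scores = []
    for name, truth in ground_truth_map.items():
        truth_set = set(truth)
        found_set = set(retrieved_map.get(name, []))
        union = truth_set | found_set
        # Assumption: a file with no true and no retrieved duplicates scores 1.0.
        scores.append(len(truth_set & found_set) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical ground truth and retrieval results for four files.
ground_truth_map = {
    "1.jpg": ["2.jpg", "3.jpg"],
    "2.jpg": ["1.jpg"],
    "3.jpg": ["1.jpg"],
    "4.jpg": [],
}
retrieved_map = {
    "1.jpg": ["2.jpg"],
    "2.jpg": ["1.jpg"],
    "3.jpg": [],
    "4.jpg": [],
}

score = mean_jaccard(ground_truth_map, retrieved_map)
print(f"Mean Jaccard: {score:.3f}")
```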

--------------------------------

### Find Duplicates from Image Directory (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

Finds duplicate images within a specified directory using a chosen deduplication method. Requires the image directory path and an optional threshold parameter. Returns a dictionary mapping image filenames to their duplicates.

```python
from imagededup.methods import <method-name>
method_object = <method-name>()
duplicates = method_object.find_duplicates(image_dir='path/to/image/directory',
                                           <threshold-parameter-value>)
```
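
The returned dictionary lists duplicates per file, so the same group can appear under several keys. One possible way to collapse it into distinct groups is sketched below; `cluster_duplicates` is a hypothetical helper and the `duplicates` map is made-up data in the shape described above (without scores).

```python
def cluster_duplicates(duplicate_map):
    """Group filenames into clusters of directly or indirectly linked duplicates."""
    seen = set()
    clusters = []
    for start in duplicate_map:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            name = stack.pop()
            if name in group:
                continue
            group.add(name)
            # Follow links; names missing as keys are treated as having no duplicates.
            stack.extend(duplicate_map.get(name, []))
        seen |= group
        if len(group) > 1:  # skip files without any duplicate
            clusters.append(sorted(group))
    return clusters


# Hypothetical result map: a, b and c are linked; d has no duplicates.
duplicates = {
    "a.jpg": ["b.jpg"],
    "b.jpg": ["a.jpg", "c.jpg"],
    "c.jpg": ["b.jpg"],
    "d.jpg": [],
}
clusters = cluster_duplicates(duplicates)
print(clusters)
```

Note that this treats duplication as transitive, which may merge more files than intended when a loose threshold is used.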

--------------------------------

### BibTeX Citation for Imagededup

Source: https://github.com/idealo/imagededup/blob/master/README.md

This snippet provides the BibTeX entry for citing the Imagededup project in academic publications. It includes author names, year, and a URL to the project's GitHub repository.

```bibtex
@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}
```

--------------------------------

### GCC Built-in Functions Implementation Status

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Lists the GCC built-in functions that have been implemented in this portable module. The status indicates whether each function (e.g., ffs, clz, ctz, popcount) and its variants for different integer sizes are supported.

```text
- [x] ffs, ffsl, ffsll, ffs32, ffs64
- [x] clz, clzl, clzll, clz32, clz64
- [x] ctz, ctzl, ctzll, ctz32, ctz64
- [x] clrsb, clrsbl, clrsbll, clrsb32, clrsb64
- [x] popcount, popcountl, popcountll, popcount32, popcount64
- [x] parity, parityl, parityll, parity32, parity64
- [x] bswap16, bswap32, bswap64
```

--------------------------------

### Find Duplicates with Perceptual Hashing (PHash) and Scores

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

Uses the PHash method to find duplicate images based on perceptual hashing. It calculates hashes for all images and then evaluates Hamming distances to identify duplicates, returning a map of duplicates along with their similarity scores.

```python
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates(image_dir=image_dir, scores=True)
```
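
With `scores=True`, each duplicate is paired with its score, which for hashing methods is a Hamming distance (lower means closer). The sketch below picks the closest match per file; `closest_duplicate` is a hypothetical helper and `scored` is made-up data in the assumed result shape.

```python
def closest_duplicate(scored_map):
    """Return, for each file with matches, the duplicate with the smallest distance."""
    return {
        name: min(matches, key=lambda match: match[1])[0]
        for name, matches in scored_map.items()
        if matches  # files without duplicates are left out
    }


# Hypothetical scored result: (duplicate filename, Hamming distance) pairs.
scored = {
    "a.jpg": [("b.jpg", 6), ("c.jpg", 2)],
    "b.jpg": [("a.jpg", 6)],
    "d.jpg": [],
}
best = closest_duplicate(scored)
print(best)
```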

--------------------------------

### Find Duplicate Files to Remove (PHash)

Source: https://context7.com/idealo/imagededup/llms.txt

Identifies duplicate image files within a directory based on perceptual hashing and a specified distance threshold. It returns a list of files marked for removal and can optionally save these to a JSON file. This method is suitable for finding visually similar images.

```python
from imagededup.methods import PHash

phasher = PHash()
files_to_remove = phasher.find_duplicates_to_remove(
    image_dir='path/to/images',
    max_distance_threshold=15,
    outfile='to_remove.json'
)
print(f"Found {len(files_to_remove)} duplicate files to remove")
```
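
The returned list only names the files; nothing is deleted. A cautious way to act on it is sketched below using only the standard library. `remove_files` and its `dry_run` flag are hypothetical and not part of imagededup; the demonstration runs on a temporary directory with made-up filenames.

```python
import tempfile
from pathlib import Path


def remove_files(image_dir, filenames, dry_run=True):
    """Delete the named files from image_dir; with dry_run=True only report them."""
    affected = []
    for name in filenames:
        path = Path(image_dir) / name
        if path.is_file():
            if not dry_run:
                path.unlink()
            affected.append(name)
    return affected


# Demonstration on a throwaway directory with hypothetical filenames.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("keep.jpg", "dup1.jpg", "dup2.jpg"):
        (Path(tmp) / name).write_bytes(b"")

    # Dry run first: nothing is deleted, the candidates are only listed.
    planned = remove_files(tmp, ["dup1.jpg", "dup2.jpg"])
    remaining_after_dry_run = sorted(p.name for p in Path(tmp).iterdir())

    # Second pass actually deletes the files.
    remove_files(tmp, ["dup1.jpg", "dup2.jpg"], dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(planned, remaining)
```

Defaulting to a dry run is a deliberate choice here, since a threshold that is too loose can flag images that are not true duplicates.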

--------------------------------

### Load and Filter Duplicates for Train Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code snippet first identifies all PNG image filenames within the '/content/cifar/train' directory and stores them in a set. It then filters a pre-existing 'duplicates' dictionary to retain only those image keys and their associated duplicates that are present in the training set. Finally, it sorts the resulting 'duplicates_train' dictionary in descending order by the number of duplicates.

```python
from pathlib import Path

# train images are stored under '/content/cifar/train'
filenames_train = {p.name for p in Path('/content/cifar/train').glob('*.png')}

# keep only keys and duplicates that are in the train set
# (`duplicates` is the map returned by an earlier find_duplicates call)
duplicates_train = {
    k: [i for i in v if i in filenames_train]
    for k, v in duplicates.items()
    if k in filenames_train
}

# sort in descending order of number of duplicates
duplicates_train = dict(sorted(duplicates_train.items(), key=lambda x: len(x[1]), reverse=True))
```

--------------------------------

### Clang Built-in Functions Implementation Status

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Details the implementation status of Clang-specific built-in functions within the portable module. This includes functions for bit manipulation and carry/borrow flags for various integer sizes.

```text
- [x] bitreverse8, bitreverse16, bitreverse32, bitreverse64
- [x] addcb, addcs, addc, addcl, addcll, addc8, addc16, addc32, addc64
- [x] subcb, subcs, subc, subcl, subcll, subc8, subc16, subc32, subc64
```

--------------------------------

### Encode Single Image using Perceptual Hashing (PHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates a perceptual hash string for a single image file using the PHash method from the imagededup library. This method is useful for identifying visually similar images.

```python
from pathlib import Path
from imagededup.methods import PHash

single_image_path = Path('../tests/data/mixed_images/ukbench00120.jpg')
phasher = PHash()
phash_string = phasher.encode_image(image_file=single_image_path)
print(phash_string)
```

--------------------------------

### Encoding generation for all images in a directory

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/encoding_generation.md

Generates encodings for all images within a specified directory. The function returns a dictionary mapping image file names to their corresponding encodings.

````APIDOC
## Encoding generation for all images in a directory

### Description
Generates encodings for all images within a specified directory. The function returns a dictionary mapping image file names to their corresponding encodings.

### Method
`encode_images` (a Python method on a hashing or CNN object, not an HTTP endpoint)

### Parameters
- **image_dir** (string) - Required - Path to the image directory for which encodings are to be generated.
- **recursive** (boolean) - Optional - Set to `True` to find images recursively in a nested directory structure. Defaults to `False`.

### Usage Example
```python
from imagededup.methods import DHash
dhasher = DHash()
encodings = dhasher.encode_images(image_dir='path/to/image/directory', recursive=True)
```

### Returns
- **encodings** (dict) - A dictionary where keys are image file names (strings) and values are their corresponding encodings (hexadecimal strings for hashing methods, numpy arrays for CNN).

#### Return Value Example
```json
{
  "image1.jpg": "<encoding-image-1>",
  "image2.jpg": "<encoding-image-2>"
}
```

#### Considerations
- If an image cannot be loaded, it will be omitted from the returned encodings dictionary.
- Supported image formats: 'JPEG', 'PNG', 'BMP', 'MPO', 'PPM', 'TIFF', 'GIF', 'SVG', 'PGM', 'PBM', 'WEBP'.
````

--------------------------------

### Find Duplicates to Remove using CNN with imagededup

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

This code snippet demonstrates how to find duplicate images to remove using a CNN-based approach. It initializes the CNN encoder and then uses it to find duplicates within a specified image directory. The output is a list of duplicate filenames.

```python
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates_list_cnn = cnn_encoder.find_duplicates_to_remove(image_dir=image_dir)
```

--------------------------------

### CNN-Based Image Deduplication and Encoding

Source: https://context7.com/idealo/imagededup/llms.txt

Utilizes Convolutional Neural Networks (CNNs), defaulting to MobileNetV3, to generate image embeddings for detecting near-duplicates. It supports encoding single images or directories, finding duplicates based on similarity thresholds, and can operate with pre-computed encodings. Multiprocessing is supported on Linux for encoding directories.

```python
from imagededup.methods import CNN

# Initialize CNN encoder (uses MobileNetV3 by default)
cnn = CNN()

# Encode single image
encoding = cnn.encode_image(image_file='path/to/image.jpg')
# Returns: numpy array of shape (576,) for MobileNetV3

# Encode directory of images
encodings = cnn.encode_images(
    image_dir='path/to/images',
    recursive=True,
    num_enc_workers=0  # 0 = no multiprocessing (multiprocessing only on Linux)
)

# Find duplicates using cosine similarity threshold (-1.0 to 1.0)
duplicates = cnn.find_duplicates(
    image_dir='path/to/images',
    min_similarity_threshold=0.9,  # Higher = stricter matching
    scores=True,
    outfile='cnn_duplicates.json'
)
# Returns: {'image1.jpg': [('similar1.jpg', 0.95), ('similar2.jpg', 0.92)], ...}

# Find duplicates using pre-computed encodings
duplicates = cnn.find_duplicates(
    encoding_map=encodings,
    min_similarity_threshold=0.85,
    num_sim_workers=8  # Parallel similarity computation
)

# Get list of files to remove
files_to_remove = cnn.find_duplicates_to_remove(
    image_dir='path/to/images',
    min_similarity_threshold=0.9
)
```
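
The role of `min_similarity_threshold` can be illustrated with a small pure-Python sketch of cosine similarity over feature vectors. This is not how imagededup computes similarities internally; `cosine_similarity`, `similar_pairs`, and the three-dimensional toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similar_pairs(encodings, min_similarity_threshold=0.9):
    """Return (first, second, score) for every pair at or above the threshold."""
    names = sorted(encodings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(encodings[first], encodings[second])
            if score >= min_similarity_threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical low-dimensional "embeddings": a and b point almost the same way.
toy_encodings = {
    "a.jpg": [1.0, 0.0, 0.0],
    "b.jpg": [0.9, 0.1, 0.0],
    "c.jpg": [0.0, 1.0, 0.0],
}
pairs = similar_pairs(toy_encodings, min_similarity_threshold=0.9)
print(pairs)
```

Raising the threshold toward 1.0 keeps only pairs whose vectors point in nearly the same direction, which is why higher values give stricter matching.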

--------------------------------

### Plot Duplicates from Test Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

Plots one group of duplicate images found within the test set. `plot_duplicates` takes the `image_dir`, the `duplicates_test` map, and the `filename` of the key whose duplicates should be shown; here that is the first key in the map. In the source notebook, the call emits a matplotlib `tight_layout` UserWarning and outputs the resulting Figure object.

```python
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test, filename=list(duplicates_test.keys())[0])
```

--------------------------------

### Find Duplicates to Remove using Generic Method

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

Demonstrates how to use the generic find_duplicates_to_remove function, which returns a list of duplicate filenames without removing them. This function can be used with various deduplication methods like PHash, AHash, DHash, WHash, and CNN.

```python
from imagededup.methods import <method-name>
method_object = <method-name>()
duplicates = method_object.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                                     <threshold-parameter-value>)
```

```python
from imagededup.methods import <method-name>
method_object = <method-name>()
duplicates = method_object.find_duplicates_to_remove(encoding_map=encoding_map,
                                                     <threshold-parameter-value>)
```

--------------------------------

### Plot Duplicates from Train Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

Plots one group of duplicate images found within the training set. `plot_duplicates` takes the `image_dir`, the `duplicates_train` map, and the `filename` of the key whose duplicates should be shown; here that is the first key in the map. In the source notebook, the call emits a matplotlib `tight_layout` UserWarning and outputs the resulting Figure object.

```python
# 70 duplicates found of same car!
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_train, filename=list(duplicates_train.keys())[0])
```

--------------------------------

### Encode All Images in Directory using Difference Hashing (DHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates difference hash strings for all images within a specified directory using the DHash method from the imagededup library. This method captures differences between adjacent pixels and is robust to minor image variations.

```python
from pathlib import Path
from imagededup.methods import DHash

image_dir = Path('../tests/data/mixed_images')
dhasher = DHash()
encodings = dhasher.encode_images(image_dir)
print(encodings)
```

--------------------------------

### Python: Fix Windows RuntimeError with multiprocessing guard

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/windows.md

This snippet demonstrates the correct way to structure Python code using the imagededup library on Windows to avoid a RuntimeError. The error arises from the `multiprocessing` module's restrictions on Windows. Encapsulating the main logic within an `if __name__ == '__main__':` block resolves this issue by ensuring proper process bootstrapping.

```python
from imagededup.methods import PHash
from imagededup.utils import plot_duplicates

if __name__ == '__main__':
    phasher = PHash()

    # Generate encodings for all images in an image directory
    encodings = phasher.encode_images(image_dir='path/to/image/directory')

    # Find duplicates using the generated encodings
    duplicates = phasher.find_duplicates(encoding_map=encodings)

    # Plot duplicates obtained for a given file using the duplicates dictionary
    plot_duplicates(image_dir='path/to/image/directory',
                    duplicate_map=duplicates,
                    filename='ukbench00120.jpg')
```

--------------------------------

### Encode All Images in Directory using CNN

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates CNN-based encodings for all images within a specified directory using the CNN method from the imagededup library. This method is suitable for capturing semantic similarities between images.

```python
from pathlib import Path
from imagededup.methods import CNN

image_dir = Path('../tests/data/mixed_images')
cnn_encoder = CNN()
cnn_encodings = cnn_encoder.encode_images(image_dir)
print(cnn_encodings)
```

--------------------------------

### Find Duplicates to Remove using Perceptual Hashing (PHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

Employs the PHash method to identify duplicate images and returns a list of filenames that can be removed. This is useful for cleaning up redundant image files.

```python
from imagededup.methods import PHash

phasher = PHash()
duplicates_list = phasher.find_duplicates_to_remove(image_dir)
```

--------------------------------

### Plot Duplicates Common to Test and Train Sets (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

Plots one group of duplicate images that appear in both the test and train sets. `plot_duplicates` takes the `image_dir`, the `duplicates_test_train` map, and the `filename` of the key whose duplicates should be shown; here that is the first key in the map. In the source notebook, the call emits a matplotlib `tight_layout` UserWarning and outputs the resulting Figure object.

```python
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test_train, filename=list(duplicates_test_train.keys())[0])
```

--------------------------------

### find_duplicates() Method

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This method identifies duplicate images within a specified directory or using pre-computed encodings. It can return just the duplicate file names or include similarity scores.

````APIDOC
## find_duplicates()

### Description
Finds duplicate images in a directory or from a map of image encodings. The results can optionally include scores.

### Method
`find_duplicates` (a Python method on a hashing or CNN object, not an HTTP endpoint). The deduplication method is chosen by the class that is instantiated (e.g., PHash, AHash, DHash, WHash, CNN).

### Parameters
- **image_dir** (string) - Optional - Path to the directory containing image files. Use this or `encoding_map`.
- **encoding_map** (dict) - Optional - A dictionary mapping file names to their pre-computed encodings. Use this or `image_dir`.
- **max_distance_threshold** (int, 0-64) - Optional - Threshold for hashing methods (PHash, AHash, DHash, WHash). Defaults vary by method.
- **min_similarity_threshold** (float, -1.0 to 1.0) - Optional - Threshold for CNN. Defaults vary by method.
- **scores** (boolean) - Optional - If true, returns scores along with duplicate file names. Defaults to false.
- **outfile** (string) - Optional - Path to a JSON file to save the results. Defaults to None.
- **recursive** (boolean) - Optional - Whether to search for images recursively in subdirectories. Defaults to false.

### Usage Example
```python
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates(
    image_dir='path/to/image/directory',
    max_distance_threshold=10,
    scores=True
)
```

### Returns
- **duplicates** (dict) - A dictionary where keys are image file names and values are either a list of duplicate file names or a list of tuples containing duplicate file names and their scores.

#### Return Value Example
```json
{
  "image1.jpg": [
    "image1_duplicate1.jpg",
    "image1_duplicate2.jpg"
  ],
  "image2.jpg": [
    "image2_duplicate1.jpg"
  ]
}
```

#### Return Value Example with Scores
Tuples are shown here as JSON arrays.
```json
{
  "image1.jpg": [
    ["image1_duplicate1.jpg", 5],
    ["image1_duplicate2.jpg", 8]
  ],
  "image2.jpg": [
    ["image2_duplicate1.jpg", 3]
  ]
}
```
````