### Install imagededup

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Installs the imagededup library using pip. This is a prerequisite for running the deduplication examples.

```bash
!pip install imagededup
```

--------------------------------

### Install imagededup from GitHub Source

Source: https://github.com/idealo/imagededup/blob/master/README.md

Installs the imagededup package by cloning the GitHub repository and running `pip install .` from the checkout. This method allows for installing the latest development version.

```bash
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install .
```

--------------------------------

### Install imagededup from PyPI

Source: https://github.com/idealo/imagededup/blob/master/README.md

Installs the imagededup package using pip. This is the recommended method for installation and requires a Python environment.

```bash
pip install imagededup
```

--------------------------------

### Prepare CIFAR10 Image Directory

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Creates a working directory and copies the CIFAR10 training and testing images into it. This organizes the images for the deduplication process.

```bash
image_dir = 'cifar10_images'
!mkdir $image_dir
!cp -r '/content/cifar/train/.' $image_dir
!cp -r '/content/cifar/test/.' $image_dir
```

--------------------------------

### Download and Extract CIFAR10 Dataset

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Downloads the CIFAR10 dataset archive and extracts its contents. This prepares the dataset for further processing.
```bash
!wget http://pjreddie.com/media/files/cifar.tgz
!tar xzf cifar.tgz
```

--------------------------------

### Import Plotting Utilities

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Imports necessary libraries for plotting duplicate images, including Path, plot_duplicates, and matplotlib.pyplot. Sets the figure size for plots.

```python
from pathlib import Path
from imagededup.utils import plot_duplicates
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 10)
```

--------------------------------

### Install imagededup and TensorFlow GPU

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

Installs the imagededup library and the GPU-enabled version of TensorFlow for enhanced performance on compatible hardware. This is a prerequisite for using the CNN method with GPU acceleration.

```python
# install imagededup via PyPI
!pip install imagededup

# by default imagededup is shipped with CPU-only support for TF but let's install GPU since we have it on Google Colab
!pip install tensorflow[and-cuda] --upgrade
```

--------------------------------

### Initialize CNN with Hugging Face ViT Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet demonstrates initializing the CNN encoder with a Hugging Face Vision Transformer (ViT) model. It includes installing the `transformers` library, defining a custom `VitHgface` class that wraps the ViT model and its processor, and then configuring the CNN.
```bash
!pip install transformers
```

```python
from pathlib import Path
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from transformers import ViTModel, AutoImageProcessor
import torch
from torchvision.transforms import transforms

VIT_MODEL = "google/vit-base-patch16-224-in21k"

def vit_transform(image):
    transform = AutoImageProcessor.from_pretrained(VIT_MODEL)
    x = transform(image, return_tensors = 'pt')['pixel_values']
    return x

class VitHgface(torch.nn.Module):
    transform = transforms.Lambda(vit_transform)
    name = 'ViT_hgface'

    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained(VIT_MODEL)

    def forward(self, x):
        x = x.view(-1, 3, 224, 224)
        with torch.no_grad():
            out = self.vit(pixel_values=x)
        return out.pooler_output

image_dir = Path('../tests/data/mixed_images')
custom_config = CustomModel(name=VitHgface.name, model=VitHgface(), transform=VitHgface.transform)
cnn = CNN(model_config=custom_config)
duplicates_cnn = cnn.find_duplicates(image_dir=image_dir, scores=True)
print(duplicates_cnn)

# Encode images separately
enc = cnn.encode_images(image_dir=image_dir)
print(enc)
```

--------------------------------

### Complete Image Deduplication Workflow with imagededup

Source: https://context7.com/idealo/imagededup/llms.txt

An end-to-end Python example demonstrating a typical image deduplication pipeline using imagededup. It covers choosing between hashing (PHash, DHash) and CNN methods, finding duplicates, reviewing them, identifying files to remove, and moving duplicates to a backup folder.

```python
from imagededup.methods import PHash, CNN
from imagededup.utils import plot_duplicates
import os
import shutil

# Step 1: Choose method based on use case
# - Hashing (PHash, DHash): Fast, good for exact/near-exact duplicates
# - CNN: Slower, better for transformed images (rotations, crops, etc.)
image_dir = 'path/to/image/directory'

# Step 2: Find duplicates using PHash (fast initial pass)
phasher = PHash()
duplicates_hash = phasher.find_duplicates(
    image_dir=image_dir,
    max_distance_threshold=10,
    scores=True,
    recursive=True
)

# Step 3: Find near-duplicates using CNN (more thorough)
cnn = CNN()
duplicates_cnn = cnn.find_duplicates(
    image_dir=image_dir,
    min_similarity_threshold=0.9,
    scores=True,
    recursive=True
)

# Step 4: Review detected duplicates
for filename, dups in duplicates_cnn.items():
    if dups:
        print(f"\n{filename} has {len(dups)} duplicates:")
        plot_duplicates(image_dir, duplicates_cnn, filename)

# Step 5: Get files to remove (automated selection)
files_to_remove = phasher.find_duplicates_to_remove(
    image_dir=image_dir,
    max_distance_threshold=10
)

# Step 6: Move duplicates to a separate folder (don't delete immediately!)
duplicate_folder = 'path/to/duplicates_backup'
os.makedirs(duplicate_folder, exist_ok=True)

for file in files_to_remove:
    src = os.path.join(image_dir, file)
    dst = os.path.join(duplicate_folder, file)
    shutil.move(src, dst)
    print(f"Moved {file} to backup folder")

print(f"\nMoved {len(files_to_remove)} duplicate files to {duplicate_folder}")
```

--------------------------------

### Get Filenames from Test Set

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This snippet uses pathlib to get a set of all filenames ending with '.png' from the '/content/cifar/test' directory. This is often a precursor to comparing or analyzing specific subsets of the dataset.

```python
# test images are stored under '/content/cifar/test'
filenames_test = set([i.name for i in Path('/content/cifar/test').glob('*.png')])
```

--------------------------------

### Find and Plot Duplicates Between Test and Train Sets

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies and plots images from the test set that have duplicates in the training set.
This helps in understanding data overlap between the two sets and visualizes the findings for a relevant file.

```python
# keep only filenames that are in the test set and have duplicates in the train set
duplicates_test_train = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_train]
        duplicates_test_train[k] = tmp

# sort in descending order of duplicates
duplicates_test_train = {k: v for k, v in sorted(duplicates_test_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test_train, filename=list(duplicates_test_train.keys())[0])
```

--------------------------------

### Find Duplicates using CNN in Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Finds duplicate images in the entire dataset using a CNN-based approach from the imagededup library. It first encodes images and then identifies duplicates based on these encodings.

```python
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir=image_dir)
duplicates = cnn.find_duplicates(encoding_map=encodings)
```

--------------------------------

### Find and Plot Test Set Duplicates with CNN

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies and plots duplicate images specifically within the test set using CNN encodings. It filters the previously found duplicates to include only those present in the test set and then visualizes the results for the file with the most duplicates.
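The filter-and-sort step described above can be tried on toy data without downloading the dataset. The sketch below is illustrative only: the helper `restrict_duplicates` and the file names are hypothetical and are not part of imagededup.

```python
def restrict_duplicates(duplicates, allowed):
    """Keep only keys and duplicates that belong to `allowed`, most duplicates first."""
    restricted = {
        name: [dup for dup in dups if dup in allowed]
        for name, dups in duplicates.items()
        if name in allowed
    }
    # sorted() is stable, so ties keep their original order
    return dict(sorted(restricted.items(), key=lambda item: len(item[1]), reverse=True))


# Hypothetical duplicate map mixing "test" files (t*) with one "train" file (x1)
toy_duplicates = {
    "t1.png": ["t2.png", "x1.png"],
    "t2.png": ["t1.png", "t3.png", "x1.png"],
    "t3.png": ["t2.png"],
    "x1.png": ["t1.png", "t2.png"],
}
toy_test_files = {"t1.png", "t2.png", "t3.png"}

toy_result = restrict_duplicates(toy_duplicates, toy_test_files)
print(list(toy_result)[0])  # file with the most duplicates inside the allowed set
```

The first key of the result is the file with the most in-set duplicates, which is the kind of entry the notebook code that follows passes to `plot_duplicates`.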
```python
# test images are stored under '/content/cifar/test'
filenames_test = set([i.name for i in Path('/content/cifar/test').glob('*.png')])

duplicates_test = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_test]
        duplicates_test[k] = tmp

# sort in descending order of duplicates
duplicates_test = {k: v for k, v in sorted(duplicates_test.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test, filename=list(duplicates_test.keys())[0])
```

--------------------------------

### Find and Plot Train Set Duplicates with CNN

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/examples/CIFAR10_deduplication.md

Identifies and plots duplicate images within the training set using CNN encodings. Similar to the test set analysis, it filters duplicates to include only those in the training set and visualizes the results for a representative file.

```python
# train images are stored under '/content/cifar/train'
filenames_train = set([i.name for i in Path('/content/cifar/train').glob('*.png')])

duplicates_train = {}
for k, v in duplicates.items():
    if k in filenames_train:
        tmp = [i for i in v if i in filenames_train]
        duplicates_train[k] = tmp

# sort in descending order of duplicates
duplicates_train = {k: v for k, v in sorted(duplicates_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_train, filename=list(duplicates_train.keys())[0])
```

--------------------------------

### Deduplicate Images with Perceptual Hashing (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This snippet demonstrates how to find and remove duplicate images in a directory using perceptual hashing.
It utilizes the PHash method from the imagededup library, allowing users to set a maximum Hamming distance threshold and specify an output file for the duplicate list. Ensure the 'imagededup' library is installed.

```python
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                               max_distance_threshold=12,
                                               outfile='my_duplicates.json')
```

--------------------------------

### Initialize CNN with Pre-packaged EfficientNet Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet shows how to initialize the CNN encoder using a pre-packaged EfficientNet model. It demonstrates importing necessary classes and configuring a custom model using EfficientNet's name, model, and transform functions.

```python
from pathlib import Path
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from imagededup.utils.models import EfficientNet

image_dir = Path('../tests/data/mixed_images')
custom_config = CustomModel(name=EfficientNet.name, model=EfficientNet(), transform=EfficientNet.transform)
cnn_encoder = CNN(model_config=custom_config)
duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, scores=True)
print(duplicates_cnn)
```

--------------------------------

### Initialize Image Directory Path

Source: https://github.com/idealo/imagededup/blob/master/examples/Evaluation.ipynb

Sets up the path to the directory containing images for deduplication. This path is used as input for subsequent deduplication functions.

```python
from pathlib import Path
import imagededup

image_dir = Path('../tests/data/mixed_images')
```

--------------------------------

### Initialize CNN with Pre-packaged Custom Model - Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/custom_model.md

Initializes the CNN deduplication method using a pre-packaged model from the imagededup library, such as EfficientNet.
This involves creating a `CustomModel` object with the model's name, instance, and transformation function, then passing this configuration to the `CNN` constructor.

```python
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Get the prepackaged models from imagededup
from imagededup.utils.models import ViT, MobilenetV3, EfficientNet

# Declare a custom config with CustomModel, the prepackaged models come with a name and transform function
custom_config = CustomModel(name=EfficientNet.name,
                            model=EfficientNet(),
                            transform=EfficientNet.transform)

# Use model_config argument to pass the custom config
cnn = CNN(model_config=custom_config)

# Use the model as usual
...
```

--------------------------------

### Prepare Image Directory for Deduplication

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code creates a working directory and copies all training and testing images from the extracted CIFAR10 dataset into this single directory. This consolidated directory is then used for image analysis.

```python
# create working directory and move all images into this directory
image_dir = 'cifar10_images'
!mkdir $image_dir
!cp -r '/content/cifar/train/.' $image_dir
!cp -r '/content/cifar/test/.' $image_dir
```

--------------------------------

### Download and Extract CIFAR10 Dataset

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This snippet downloads the CIFAR10 dataset using wget and then extracts the compressed tarball using the tar command. It's a common first step for preparing image datasets.
```shell
# download CIFAR10 dataset and untar
!wget http://pjreddie.com/media/files/cifar.tgz
!tar xzf cifar.tgz
```

--------------------------------

### Find Image Duplicates using PHash in Python

Source: https://github.com/idealo/imagededup/blob/master/README.md

Demonstrates a Python workflow to find duplicate images in a directory using the Perceptual Hashing (PHash) method. It involves encoding images, finding duplicates based on encodings, and optionally plotting the results.

```python
from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
```

--------------------------------

### Deduplicate Images with CNN Features (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This snippet shows how to deduplicate images using Convolutional Neural Network (CNN) features. It employs the CNN method from imagededup, enabling deduplication based on a minimum cosine similarity threshold and saving the results to a specified JSON file. The 'imagededup' library must be installed for this to function.
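To make the role of a minimum cosine similarity threshold concrete, here is a small standard-library sketch. It is not how imagededup computes similarities internally; the function `cosine_similarity`, the two-dimensional feature vectors, and the file names are assumptions chosen for illustration (real CNN features have many more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors for three images
features = {
    "a.jpg": [3.0, 4.0],
    "b.jpg": [4.0, 3.0],
    "c.jpg": [0.0, 1.0],
}

MIN_SIMILARITY = 0.85  # same value as the threshold in the example that follows
names = sorted(features)
similar_pairs = [
    (first, second)
    for i, first in enumerate(names)
    for second in names[i + 1:]
    if cosine_similarity(features[first], features[second]) >= MIN_SIMILARITY
]
print(similar_pairs)  # only pairs at or above the threshold count as duplicates
```

Raising the threshold toward 1.0 keeps only near-identical feature vectors, while lowering it admits looser matches. The imagededup call that applies such a threshold follows.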
```python
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                                   min_similarity_threshold=0.85,
                                                   outfile='my_duplicates.json')
```

--------------------------------

### Initialize CNN with Custom PyTorch Model

Source: https://github.com/idealo/imagededup/blob/master/examples/use_custom_model.ipynb

This snippet illustrates how to initialize the CNN encoder with a custom PyTorch model. It defines a `MyModel` class inheriting from `torch.nn.Module`, specifying its transformation and forward pass, and then uses it to configure the CNN.

```python
from pathlib import Path
import torch
from torchvision.transforms import transforms
from imagededup.methods import CNN
from imagededup.utils import CustomModel

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.Resize((256, 256)),
            transforms.ToTensor()
        ]
    )  # transform must take PIL.Image as input and return a torch.Tensor
    name = 'my_custom_model'  # name can be any user-defined string

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Add more operations here
        x = x.view(-1, 256*256*3)  # output shape: batch_size x features
        return x

image_dir = Path('../tests/data/mixed_images')

# Initialize the CNN using model_config parameter and setting it to the custom model
custom_config = CustomModel(name=MyModel.name, model=MyModel(), transform=MyModel.transform)
cnn = CNN(model_config=custom_config)
duplicates_cnn = cnn.find_duplicates(image_dir=image_dir, scores=True)
print(duplicates_cnn)
```

--------------------------------

### Evaluating Deduplication Results

Source: https://context7.com/idealo/imagededup/llms.txt

Provides a framework for evaluating the performance of image deduplication algorithms.
It compares the algorithm's output (retrieved duplicates) against a manually curated ground truth mapping using standard information retrieval and classification metrics.

```python
from imagededup.evaluation import evaluate

# Ground truth: manually curated duplicate mappings
ground_truth_map = {
    'image1.jpg': ['image2.jpg', 'image3.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': ['image1.jpg'],
    'image4.jpg': []
}

# Retrieved: duplicates found by algorithm
retrieved_map = {
    'image1.jpg': ['image2.jpg'],
    'image2.jpg': ['image1.jpg'],
    'image3.jpg': [],
    'image4.jpg': []
}

# Example usage (metrics depend on the specific evaluation function used)
# evaluation_results = evaluate(ground_truth_map, retrieved_map)
```

--------------------------------

### Implement GCC __builtin_ffs on Various Platforms

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Provides portable implementations for GCC's `__builtin_ffs` and its variants (`ffsl`, `ffsll`, `ffs32`, `ffs64`). These functions find the first set bit in an integer and work across different compilers and older GCC versions. The implementations can utilize compiler-specific built-ins, inline assembly, or pure C code.

```c
int psnip_builtin_ffs(int);
int psnip_builtin_ffsl(long);
int psnip_builtin_ffsll(long long);
int psnip_builtin_ffs32(psnip_int32_t);
int psnip_builtin_ffs64(psnip_int64_t);
```

```c
#define PSNIP_BUILTIN_EMULATE_NATIVE
int __builtin_ffs(int);
int __builtin_ffsl(long);
int __builtin_ffsll(long long);
```

--------------------------------

### WHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Employs Wavelet Hashing (WHash) using Haar wavelets to generate hashes, offering a balance between speed and accuracy for various image transformations.
This function can encode images and find duplicates in a single step, supporting recursive directory scanning and parallel processing for both encoding and distance calculation. Dependencies include the imagededup library.

```python
from imagededup.methods import WHash

whasher = WHash()

# Encode images and find duplicates in one step
duplicates = whasher.find_duplicates(
    image_dir='path/to/images',
    max_distance_threshold=10,
    scores=True,
    recursive=True,
    num_enc_workers=4,   # Workers for encoding
    num_dist_workers=4   # Workers for distance calculation
)
```

--------------------------------

### Initialize CNN with User-Defined Custom Model - Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/custom_model.md

Initializes the CNN deduplication method using a user-defined PyTorch model. This requires defining a `torch.nn.Module` subclass with a `forward` method and optionally `name` and `transform` attributes. The custom model and its associated transformation are then wrapped in a `CustomModel` object and passed to the `CNN` constructor.

```python
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Import necessary pytorch constructs for initializing a custom feature extractor
import torch
from torchvision.transforms import transforms

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.ToTensor()
        ]
    )
    name = 'my_custom_model'

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Do something with x
        return x

custom_config = CustomModel(name=MyModel.name,
                            model=MyModel(),
                            transform=MyModel.transform)

cnn = CNN(model_config=custom_config)

# Use the model as usual
...
```

--------------------------------

### PHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Uses the PHash algorithm to generate 16-character hexadecimal hashes for images, robust to minor modifications. It can encode single images or entire directories, find duplicates based on Hamming distance, and optionally save results or list files for removal. Dependencies include the imagededup library.

```python
from imagededup.methods import PHash

# Initialize perceptual hasher
phasher = PHash()

# Generate hash for a single image
single_hash = phasher.encode_image(image_file='path/to/image.jpg')
print(f"Image hash: {single_hash}")
# Output: 16-character hex string like 'a8f0e2b145c37d90'

# Generate hashes for all images in a directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
# Returns: {'image1.jpg': 'hash1', 'image2.jpg': 'hash2', ...}

# Find duplicates with hamming distance threshold (0-64, lower = stricter)
duplicates = phasher.find_duplicates(
    image_dir='path/to/image/directory',
    max_distance_threshold=10,  # Maximum hamming distance for duplicates
    scores=True,                # Include distance scores
    outfile='duplicates.json'   # Save results to file
)
# Returns: {'image1.jpg': [('similar1.jpg', 5), ('similar2.jpg', 8)], ...}

# Get list of files to remove (keeps one from each duplicate group)
files_to_remove = phasher.find_duplicates_to_remove(
    image_dir='path/to/image/directory',
    max_distance_threshold=10
)
# Returns: ['duplicate1.jpg', 'duplicate2.jpg', ...]
```

--------------------------------

### Evaluate Deduplication Metrics in Python

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/evaluating_performance.md

Calculates deduplication quality metrics using ground truth and retrieved mappings. Requires 'imagededup' library. Accepts 'ground_truth_map', 'retrieved_map', and 'metric' as input. Returns a dictionary of calculated metrics.
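As intuition for what comparing a retrieved mapping against a ground truth mapping measures, the following standard-library sketch computes a simplified pair-level precision and recall. This is an assumption-laden illustration, not the library's metric implementation: the helpers `duplicate_pairs` and `pair_precision_recall` are hypothetical, and the maps mirror the ones used in the example that follows.

```python
def duplicate_pairs(duplicate_map):
    """Collect unordered {file, duplicate} pairs from a duplicate mapping."""
    return {
        frozenset((query, dup))
        for query, dups in duplicate_map.items()
        for dup in dups
    }


def pair_precision_recall(ground_truth_map, retrieved_map):
    """Simplified pair-level precision and recall (not imagededup's definition)."""
    truth = duplicate_pairs(ground_truth_map)
    found = duplicate_pairs(retrieved_map)
    correct = len(truth & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


ground_truth = {'1.jpg': ['2.jpg', '4.jpg'], '2.jpg': ['1.jpg'], '3.jpg': [], '4.jpg': ['1.jpg']}
retrieved = {'1.jpg': ['2.jpg'], '2.jpg': ['1.jpg'], '3.jpg': [], '4.jpg': []}

precision, recall = pair_precision_recall(ground_truth, retrieved)
print(precision, recall)  # 1.0 0.5
```

Here every retrieved pair is correct, but one of the two true pairs (1.jpg and 4.jpg) was missed. The library's own `evaluate` call is shown in the example that follows.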
```python
from imagededup.evaluation import evaluate

# Example usage:
ground_truth_map = {
    '1.jpg': ['2.jpg', '4.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': ['1.jpg']
}

retrieved_map = {
    '1.jpg': ['2.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': []
}

# Evaluate all metrics (default)
metrics_all = evaluate(ground_truth_map, retrieved_map)
print("All metrics:", metrics_all)

# Evaluate specific metric
metrics_map = evaluate(ground_truth_map, retrieved_map, metric='map')
print("MAP metric:", metrics_map)

# Evaluate classification metrics
metrics_classification = evaluate(ground_truth_map, retrieved_map, metric='classification')
print("Classification metrics:", metrics_classification)
```

--------------------------------

### Import Plotting Utilities

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code imports necessary libraries from pathlib, imagededup.utils, and matplotlib for visualizing duplicate images. It also sets the default figure size for plots.

```python
# do some imports for plotting
from pathlib import Path
from imagededup.utils import plot_duplicates
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 10)
```

--------------------------------

### AHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Implements Average Hashing (AHash) for fast duplicate detection, suitable for exact matches. It can encode images from NumPy arrays or directories (with recursive search), find duplicates using pre-computed encodings, and supports parallel processing for encoding. Dependencies include imagededup, numpy, and Pillow.
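For readers unfamiliar with average hashing, this is a minimal sketch of the general technique on an already-downscaled 8x8 grayscale grid. It is not imagededup's implementation: the function `average_hash` is hypothetical, and the resizing and grayscale conversion a real hasher performs first are assumed to have happened already.

```python
def average_hash(pixels):
    """Hash an 8x8 grid of grayscale values into a 16-character hex string."""
    flat = [value for row in pixels for value in row]
    if len(flat) != 64:
        raise ValueError("expected an 8x8 grid of grayscale values")
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if the pixel is brighter than the mean, else 0
    bits = "".join("1" if value > mean else "0" for value in flat)
    return f"{int(bits, 2):016x}"


# Hypothetical image: bright top half, dark bottom half
toy_image = [[200] * 8 for _ in range(4)] + [[50] * 8 for _ in range(4)]
toy_hash = average_hash(toy_image)
print(toy_hash)  # ffffffff00000000
```

Because each bit only records whether a pixel is above the mean, small brightness changes usually leave the hash unchanged, which is why the method is fast but best suited to near-exact matches. The imagededup `AHash` usage follows.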
```python
from imagededup.methods import AHash

ahasher = AHash()

# Encode single image from numpy array
import numpy as np
from PIL import Image

image_array = np.array(Image.open('path/to/image.jpg'))
hash_from_array = ahasher.encode_image(image_array=image_array)

# Encode directory with recursive search in nested folders
encodings = ahasher.encode_images(
    image_dir='path/to/nested/directory',
    recursive=True,     # Search subdirectories
    num_enc_workers=4   # Parallel processing cores
)

# Find duplicates using pre-computed encodings
duplicates = ahasher.find_duplicates(
    encoding_map=encodings,  # Use pre-computed hashes
    max_distance_threshold=12,
    scores=False             # Return only filenames, not scores
)
# Returns: {'image1.jpg': ['similar1.jpg', 'similar2.jpg'], ...}
```

--------------------------------

### Custom CNN Models for Feature Extraction

Source: https://context7.com/idealo/imagededup/llms.txt

Enables the use of alternative pre-trained models like Vision Transformer (ViT) or EfficientNet, or custom PyTorch models for image feature extraction within the CNN deduplication process. This allows for greater flexibility in balancing accuracy and performance.
```python
from imagededup.methods import CNN
from imagededup.utils import CustomModel
from imagededup.utils.models import ViT, EfficientNet, MobilenetV3

# Using Vision Transformer (ViT) - better accuracy, slower
vit_config = CustomModel(
    name=ViT.name,
    model=ViT(),
    transform=ViT.transform
)
cnn_vit = CNN(model_config=vit_config)
duplicates = cnn_vit.find_duplicates(image_dir='path/to/images', min_similarity_threshold=0.9)

# Using EfficientNet B4 - good balance of speed and accuracy
effnet_config = CustomModel(
    name=EfficientNet.name,
    model=EfficientNet(),
    transform=EfficientNet.transform
)
cnn_effnet = CNN(model_config=effnet_config)

# Using a custom PyTorch model
import torch
from torchvision.transforms import transforms

class MyFeatureExtractor(torch.nn.Module):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    name = 'my_custom_extractor'

    def __init__(self):
        super().__init__()
        self.backbone = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
        self.backbone.fc = torch.nn.Identity()  # Remove classification layer

    def forward(self, x):
        return self.backbone(x)  # Returns (batch_size, 512)

custom_config = CustomModel(
    name=MyFeatureExtractor.name,
    model=MyFeatureExtractor(),
    transform=MyFeatureExtractor.transform
)
cnn_custom = CNN(model_config=custom_config)
encodings = cnn_custom.encode_images(image_dir='path/to/images')
```

--------------------------------

### Plotting Detected Image Duplicates

Source: https://context7.com/idealo/imagededup/llms.txt

Visualizes detected duplicate images to aid in manual inspection. This utility takes the output from a deduplication method (like PHash) and generates a plot showing a query image alongside its identified duplicates, optionally saving the plot to a file.
```python
from imagededup.methods import PHash
from imagededup.utils import plot_duplicates

# Find duplicates
phasher = PHash()
duplicates = phasher.find_duplicates(
    image_dir='path/to/images',
    max_distance_threshold=12,
    scores=True
)

# Plot duplicates for a specific image
plot_duplicates(
    image_dir='path/to/images',
    duplicate_map=duplicates,
    filename='query_image.jpg',    # Must be a key in duplicate_map
    outfile='duplicates_plot.png'  # Optional: save plot to file
)
# Displays original image with all detected duplicates and their scores
```

--------------------------------

### DHash: Encode and Find Duplicate Images with Python

Source: https://context7.com/idealo/imagededup/llms.txt

Utilizes Difference Hashing (DHash) for efficient duplicate detection based on pixel gradient differences. This method is fast and effective for exact duplicates. It supports encoding images from directories, finding duplicates with parallel computation, and calculating Hamming distances between hashes. Dependencies include the imagededup library.
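To make the Hamming-distance comparison concrete, here is a minimal, library-independent sketch. It assumes hashes are 16-character hexadecimal strings (64 bits), as described for PHash elsewhere in this document; the function `hex_hamming_distance` and the sample hashes are hypothetical and not part of imagededup.

```python
def hex_hamming_distance(hash1: str, hash2: str) -> int:
    """Count the differing bits between two equal-length hexadecimal hashes."""
    if len(hash1) != len(hash2):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree
    differing_bits = int(hash1, 16) ^ int(hash2, 16)
    return bin(differing_bits).count("1")


# Hypothetical 64-bit hashes that differ only in the last hex digit (0x0 vs 0x3)
toy_distance = hex_hamming_distance("a8f0e2b145c37d90", "a8f0e2b145c37d93")
print(toy_distance)  # 2
```

A distance of 0 means the hashes are identical, and for 64-bit hashes a distance of 64 means every bit differs. The imagededup calls that perform this comparison follow.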
```python
from imagededup.methods import DHash

dhasher = DHash()

# Complete workflow: encode and find duplicates
encodings = dhasher.encode_images(image_dir='path/to/images')

# Find duplicates with parallel distance computation
duplicates = dhasher.find_duplicates(
    encoding_map=encodings,
    max_distance_threshold=8,
    scores=True,
    num_dist_workers=8  # Parallel workers for distance calculation
)

# Calculate hamming distance between two hashes manually
hash1 = dhasher.encode_image(image_file='image1.jpg')
hash2 = dhasher.encode_image(image_file='image2.jpg')
distance = dhasher.hamming_distance(hash1, hash2)
print(f"Hamming distance: {distance}")  # 0 = identical, 64 = completely different
```

--------------------------------

### Find Duplicate Images using CNN Method

Source: https://github.com/idealo/imagededup/blob/master/examples/Evaluation.ipynb

This snippet demonstrates how to find duplicate images using the CNN (Convolutional Neural Network) method from the imagededup library. It initializes the CNN encoder, specifies the image directory, and sets a similarity threshold to identify duplicates. The output is a dictionary mapping image filenames to lists of their duplicates.

```python
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates_cnn = cnn_encoder.find_duplicates(image_dir=image_dir, min_similarity_threshold=0.97)
```

--------------------------------

### Evaluate Image Deduplication Metrics

Source: https://context7.com/idealo/imagededup/llms.txt

Calculates and prints various evaluation metrics for image deduplication, such as Mean Average Precision (MAP), Normalized DCG, Jaccard Index, Precision, Recall, and F1-Score. It supports retrieving all metrics or specific ones like 'map' or 'classification'.
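Of the metrics listed, the Jaccard index is the simplest to reproduce by hand. The sketch below averages a per-file Jaccard score over a ground truth mapping using only the standard library. It is an illustration rather than the library's code: the helpers `jaccard` and `mean_jaccard` are hypothetical, and treating two empty duplicate lists as a perfect match (score 1.0) is an assumed convention.

```python
def jaccard(truth, found):
    """Jaccard index of two duplicate lists: |intersection| / |union|."""
    truth, found = set(truth), set(found)
    if not truth and not found:
        return 1.0  # assumed convention: two empty lists agree perfectly
    return len(truth & found) / len(truth | found)


def mean_jaccard(ground_truth_map, retrieved_map):
    """Average the per-file Jaccard index over all ground truth keys."""
    scores = [
        jaccard(dups, retrieved_map.get(name, []))
        for name, dups in ground_truth_map.items()
    ]
    return sum(scores) / len(scores)


toy_truth = {'1.jpg': ['2.jpg', '4.jpg'], '2.jpg': ['1.jpg'], '3.jpg': [], '4.jpg': ['1.jpg']}
toy_found = {'1.jpg': ['2.jpg'], '2.jpg': ['1.jpg'], '3.jpg': [], '4.jpg': []}

toy_score = mean_jaccard(toy_truth, toy_found)
print(toy_score)  # 0.625, from per-file scores 0.5, 1.0, 1.0, 0.0
```

The library's own `evaluate` usage is shown in the example that follows.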
```python
from imagededup.evaluation import evaluate

# Assuming ground_truth_map and retrieved_map are defined elsewhere
# metrics = evaluate(ground_truth_map, retrieved_map, metric='all')

# print(f"Mean Average Precision (MAP): {metrics['map']:.4f}")
# print(f"Normalized DCG: {metrics['ndcg']:.4f}")
# print(f"Jaccard Index: {metrics['jaccard']:.4f}")
# print(f"Precision (per class): {metrics['precision']}")
# print(f"Recall (per class): {metrics['recall']}")
# print(f"F1-Score (per class): {metrics['f1-score']}")

# map_score = evaluate(ground_truth_map, retrieved_map, metric='map')
# classification_metrics = evaluate(ground_truth_map, retrieved_map, metric='classification')
```

--------------------------------

### Find Duplicates from Image Directory (Python)

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

Finds duplicate images within a specified directory using a chosen deduplication method. Requires the image directory path and an optional threshold parameter. Returns a dictionary mapping image filenames to their duplicates. In the template below, `<method-name>` and `<threshold-parameter-value>` are placeholders to be replaced with a concrete method (for example `PHash`) and its threshold argument.

```python
from imagededup.methods import <method-name>
method_object = <method-name>()
duplicates = method_object.find_duplicates(image_dir='path/to/image/directory',
                                           <threshold-parameter-value>)
```

--------------------------------

### BibTeX Citation for Imagededup

Source: https://github.com/idealo/imagededup/blob/master/README.md

This snippet provides the BibTeX entry for citing the Imagededup project in academic publications. It includes author names, year, and a URL to the project's GitHub repository.
```bibtex
@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}
```

--------------------------------

### GCC Built-in Functions Implementation Status

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Lists the GCC built-in functions that have been implemented in this portable module. The status indicates whether each function (e.g., ffs, clz, ctz, popcount) and its variants for different integer sizes are supported.

```text
- [x] ffs, ffsl, ffsll, ffs32, ffs64
- [x] clz, clzl, clzll, clz32, clz64
- [x] ctz, ctzl, ctzll, ctz32, ctz64
- [x] clrsb, clrsbl, clrsbll, clrsb32, clrsb64
- [x] popcount, popcountl, popcountll, popcount32, popcount64
- [x] parity, parityl, parityll, parity32, parity64
- [x] bswap16, bswap32, bswap64
```

--------------------------------

### Find Duplicates with Perceptual Hashing (PHash) and Scores

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

Uses the PHash method to find duplicate images based on perceptual hashing. It calculates hashes for all images and then evaluates Hamming distances to identify duplicates, returning a map of duplicates along with their similarity scores.

```python
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates(image_dir=image_dir, scores=True)
```

--------------------------------

### Find Duplicate Files to Remove (PHash)

Source: https://context7.com/idealo/imagededup/llms.txt

Identifies duplicate image files within a directory based on perceptual hashing and a specified distance threshold. It returns a list of files marked for removal and can optionally save these to a JSON file. This method is suitable for finding visually similar images.
```python
from imagededup.methods import PHash

phasher = PHash()

files_to_remove = phasher.find_duplicates_to_remove(
    image_dir='path/to/images',
    max_distance_threshold=15,
    outfile='to_remove.json'
)
print(f"Found {len(files_to_remove)} duplicate files to remove")
```

--------------------------------

### Load and Filter Duplicates for Train Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This code snippet first identifies all PNG image filenames within the '/content/cifar/train' directory and stores them in a set. It then filters a pre-existing 'duplicates' dictionary to retain only those image keys and their associated duplicates that are present in the training set. Finally, it sorts the resulting 'duplicates_train' dictionary in descending order by the number of duplicates.

```python
from pathlib import Path

# train images are stored under '/content/cifar/train'
filenames_train = set([i.name for i in Path('/content/cifar/train').glob('*.png')])

# keep only filenames that are in train set
# ('duplicates' is the map returned by an earlier find_duplicates call)
duplicates_train = {}
for k, v in duplicates.items():
    if k in filenames_train:
        tmp = [i for i in v if i in filenames_train]
        duplicates_train[k] = tmp

# sort in descending order of duplicates
duplicates_train = {k: v for k, v in sorted(duplicates_train.items(), key=lambda x: len(x[1]), reverse=True)}
```

--------------------------------

### Clang Built-in Functions Implementation Status

Source: https://github.com/idealo/imagededup/blob/master/imagededup/handlers/search/builtin/README.md

Details the implementation status of Clang-specific built-in functions within the portable module. This includes functions for bit manipulation and carry/borrow flags for various integer sizes.
```text
- [x] bitreverse8, bitreverse16, bitreverse32, bitreverse64
- [x] addcb, addcs, addc, addcl, addcll, addc8, addc16, addc32, addc64
- [x] subcb, subcs, subc, subcl, subcll, subc8, subc16, subc32, subc64
```

--------------------------------

### Encode Single Image using Perceptual Hashing (PHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates a perceptual hash string for a single image file using the PHash method from the imagededup library. This method is useful for identifying visually similar images.

```python
from pathlib import Path
from imagededup.methods import PHash

single_image_path = Path('../tests/data/mixed_images/ukbench00120.jpg')

phasher = PHash()
phash_string = phasher.encode_image(image_file=single_image_path)
print(phash_string)
```

--------------------------------

### Encoding generation for all images in a directory

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/encoding_generation.md

Generates encodings for all images within a specified directory. The function returns a dictionary mapping image file names to their corresponding encodings.

```APIDOC
## Encoding generation for all images in a directory

### Description
Generates encodings for all images within a specified directory. The function returns a dictionary mapping image file names to their corresponding encodings.

### Method
POST (or relevant method for encoding generation)

### Endpoint
`/encode_images`

### Parameters
#### Query Parameters
- **image_dir** (string) - Required - Path to the image directory for which encodings are to be generated.
- **recursive** (boolean) - Optional - Set to `True` to find images recursively in a nested directory structure. Defaults to `False`.

### Request Example
```python
from imagededup.methods import DHash

dhasher = DHash()
encodings = dhasher.encode_images(image_dir='path/to/image/directory', recursive=True)
```

### Response
#### Success Response (200)
- **encodings** (dict) - A dictionary where keys are image file names (strings) and values are their corresponding encodings (hexadecimal strings for hashing methods, numpy arrays for CNN).

#### Response Example
```json
{
  "image1.jpg": "",
  "image2.jpg": ""
}
```

#### Considerations
- If an image cannot be loaded, it will be omitted from the returned encodings dictionary.
- Supported image formats: 'JPEG', 'PNG', 'BMP', 'MPO', 'PPM', 'TIFF', 'GIF', 'SVG', 'PGM', 'PBM', 'WEBP'.
```

--------------------------------

### Find Duplicates to Remove using CNN with imagededup

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

This code snippet demonstrates how to find duplicate images to remove using a CNN-based approach. It initializes the CNN encoder and then uses it to find duplicates within a specified image directory. The output is a list of duplicate filenames.

```python
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates_list_cnn = cnn_encoder.find_duplicates_to_remove(image_dir=image_dir)
```

--------------------------------

### CNN-Based Image Deduplication and Encoding

Source: https://context7.com/idealo/imagededup/llms.txt

Utilizes Convolutional Neural Networks (CNNs), defaulting to MobileNetV3, to generate image embeddings for detecting near-duplicates. It supports encoding single images or directories, finding duplicates based on similarity thresholds, and can operate with pre-computed encodings. Multiprocessing is supported on Linux for encoding directories.
```python
from imagededup.methods import CNN

# Initialize CNN encoder (uses MobileNetV3 by default)
cnn = CNN()

# Encode single image
encoding = cnn.encode_image(image_file='path/to/image.jpg')
# Returns: numpy array of shape (576,) for MobileNetV3

# Encode directory of images
encodings = cnn.encode_images(
    image_dir='path/to/images',
    recursive=True,
    num_enc_workers=0  # 0 = no multiprocessing (multiprocessing only on Linux)
)

# Find duplicates using cosine similarity threshold (-1.0 to 1.0)
duplicates = cnn.find_duplicates(
    image_dir='path/to/images',
    min_similarity_threshold=0.9,  # Higher = stricter matching
    scores=True,
    outfile='cnn_duplicates.json'
)
# Returns: {'image1.jpg': [('similar1.jpg', 0.95), ('similar2.jpg', 0.92)], ...}

# Find duplicates using pre-computed encodings
duplicates = cnn.find_duplicates(
    encoding_map=encodings,
    min_similarity_threshold=0.85,
    num_sim_workers=8  # Parallel similarity computation
)

# Get list of files to remove
files_to_remove = cnn.find_duplicates_to_remove(
    image_dir='path/to/images',
    min_similarity_threshold=0.9
)
```

--------------------------------

### Plot Duplicates from Test Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This function call visualizes a specific group of duplicate images identified within the test set. It requires the 'image_dir', the 'duplicates_test' map, and the filename of the key to plot. The output indicates a UserWarning related to tight_layout rendering and the resulting matplotlib Figure object.

```python
plot_duplicates(image_dir=image_dir,
                duplicate_map=duplicates_test,
                filename=list(duplicates_test.keys())[0])
```

--------------------------------

### Find Duplicates to Remove using Generic Method

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

Demonstrates how to use the generic find_duplicates_to_remove function, which returns a list of duplicate filenames without removing them.
This function can be used with various deduplication methods like PHash, AHash, DHash, WHash, and CNN.

```python
# PHash is used as a concrete example; AHash, DHash, WHash, or CNN can be substituted
from imagededup.methods import PHash

method_object = PHash()

# The threshold argument is optional: max_distance_threshold for hashing methods,
# min_similarity_threshold for CNN
duplicates = method_object.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                                     max_distance_threshold=10)
```

```python
from imagededup.methods import PHash

method_object = PHash()

# encoding_map maps file names to encodings, e.g. as returned by encode_images
encoding_map = method_object.encode_images(image_dir='path/to/image/directory')
duplicates = method_object.find_duplicates_to_remove(encoding_map=encoding_map,
                                                     max_distance_threshold=10)
```

--------------------------------

### Plot Duplicates from Train Set (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This function call visualizes a specific group of duplicate images identified within the training set. It requires the 'image_dir', the 'duplicates_train' map, and the filename of the key to plot. The output includes a UserWarning regarding tight_layout and the resulting matplotlib Figure object.

```python
# 70 duplicates found of same car!
plot_duplicates(image_dir=image_dir,
                duplicate_map=duplicates_train,
                filename=list(duplicates_train.keys())[0])
```

--------------------------------

### Encode All Images in Directory using Difference Hashing (DHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates difference hash strings for all images within a specified directory using the DHash method from the imagededup library. This method captures differences between adjacent pixels and is robust to minor image variations.

```python
from pathlib import Path
from imagededup.methods import DHash

image_dir = Path('../tests/data/mixed_images')

dhasher = DHash()
encodings = dhasher.encode_images(image_dir)
print(encodings)
```

--------------------------------

### Python: Fix Windows RuntimeError with multiprocessing guard

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/windows.md

This snippet demonstrates the correct way to structure Python code using the imagededup library on Windows to avoid a RuntimeError.
The error arises from the `multiprocessing` module's restrictions on Windows. Encapsulating the main logic within an `if __name__ == '__main__':` block resolves this issue by ensuring proper process bootstrapping.

```python
from imagededup.methods import PHash

if __name__ == '__main__':
    phasher = PHash()

    # Generate encodings for all images in an image directory
    encodings = phasher.encode_images(image_dir='path/to/image/directory')

    # Find duplicates using the generated encodings
    duplicates = phasher.find_duplicates(encoding_map=encodings)

    # plot duplicates obtained for a given file using the duplicates dictionary
    from imagededup.utils import plot_duplicates
    plot_duplicates(image_dir='path/to/image/directory',
                    duplicate_map=duplicates,
                    filename='ukbench00120.jpg')
```

--------------------------------

### Encode All Images in Directory using CNN

Source: https://github.com/idealo/imagededup/blob/master/examples/Encoding_generation.ipynb

Generates CNN-based encodings for all images within a specified directory using the CNN method from the imagededup library. This method is suitable for capturing semantic similarities between images.

```python
from pathlib import Path
from imagededup.methods import CNN

image_dir = Path('../tests/data/mixed_images')

cnn_encoder = CNN()
cnn_encodings = cnn_encoder.encode_images(image_dir)
print(cnn_encodings)
```

--------------------------------

### Find Duplicates to Remove using Perceptual Hashing (PHash)

Source: https://github.com/idealo/imagededup/blob/master/examples/Finding_duplicates.ipynb

Employs the PHash method to identify duplicate images and returns a list of filenames that can be removed. This is useful for cleaning up redundant image files.
```python
from imagededup.methods import PHash

phasher = PHash()
duplicates_list = phasher.find_duplicates_to_remove(image_dir)
```

--------------------------------

### Plot Duplicates Common to Test and Train Sets (Python)

Source: https://github.com/idealo/imagededup/blob/master/examples/CIFAR10_duplicates.ipynb

This function call visualizes a specific group of duplicate images that are present in both the test and train sets. It requires the 'image_dir', the 'duplicates_test_train' map, and the filename of the key to plot. The output includes a UserWarning regarding tight_layout and the resulting matplotlib Figure object.

```python
plot_duplicates(image_dir=image_dir,
                duplicate_map=duplicates_test_train,
                filename=list(duplicates_test_train.keys())[0])
```

--------------------------------

### find_duplicates() Method

Source: https://github.com/idealo/imagededup/blob/master/mkdocs/docs/user_guide/finding_duplicates.md

This method identifies duplicate images within a specified directory or using pre-computed encodings. It can return just the duplicate file names or include similarity scores.

```APIDOC
## POST /find_duplicates

### Description
Finds duplicate images in a directory or from a map of image encodings. The results can optionally include similarity scores.

### Method
POST

### Endpoint
`/find_duplicates`

### Parameters
#### Query Parameters
- **method_name** (string) - Required - The deduplication method to use (e.g., PHash, AHash, DHash, WHash, CNN).
- **scores** (boolean) - Optional - If true, returns similarity scores along with duplicate file names. Defaults to false.
- **outfile** (string) - Optional - Path to a JSON file to save the results. Defaults to None.
- **recursive** (boolean) - Optional - Whether to search for images recursively in subdirectories. Defaults to false.

#### Request Body
- **image_dir** (string) - Optional - Path to the directory containing image files. Use this or `encoding_map`.
- **encoding_map** (object) - Optional - A dictionary mapping file names to their pre-computed encodings. Use this or `image_dir`.
- **threshold** (number) - Optional - The similarity threshold. For hashing methods (PHash, AHash, DHash, WHash), this is `max_distance_threshold` (int, 0-64). For CNN, this is `min_similarity_threshold` (float, -1.0 to 1.0). Defaults vary by method.

### Request Example
```json
{
  "image_dir": "path/to/image/directory",
  "method_name": "PHash",
  "threshold": 10,
  "scores": true
}
```

### Response
#### Success Response (200)
- **duplicates** (object) - A dictionary where keys are image file names and values are either a list of duplicate file names or a list of tuples containing duplicate file names and their scores.

#### Response Example
```json
{
  "duplicates": {
    "image1.jpg": [
      "image1_duplicate1.jpg",
      "image1_duplicate2.jpg"
    ],
    "image2.jpg": [
      "image2_duplicate1.jpg"
    ]
  }
}
```

#### Response Example with Scores
```json
{
  "duplicates": {
    "image1.jpg": [
      ["image1_duplicate1.jpg", 5],
      ["image1_duplicate2.jpg", 8]
    ],
    "image2.jpg": [
      ["image2_duplicate1.jpg", 3]
    ]
  }
}
```
```
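
--------------------------------

### Sketch: How a Hamming-Distance Threshold Selects Duplicates

The `max_distance_threshold` (0-64) described above for hashing methods can be illustrated without the library. The following is a minimal standard-library sketch, not imagededup's implementation: the helper names `hamming_distance` and `group_duplicates`, the made-up hash values, and the inclusive `<=` comparison are all assumptions chosen for illustration. It shows how 64-bit hashes written as hexadecimal strings could be compared pairwise to build a duplicates map shaped like the scored output of `find_duplicates`.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming_distance(hash1: str, hash2: str) -> int:
    """Count differing bits between two equal-length hexadecimal hash strings."""
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


def group_duplicates(
    encoding_map: Dict[str, str], max_distance_threshold: int = 10
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each file name to (other file name, distance) pairs within the threshold.

    Hypothetical helper: the inclusive comparison is an assumption of this sketch.
    """
    duplicates: Dict[str, List[Tuple[str, int]]] = {name: [] for name in encoding_map}
    for (name_a, hash_a), (name_b, hash_b) in combinations(encoding_map.items(), 2):
        distance = hamming_distance(hash_a, hash_b)
        if distance <= max_distance_threshold:
            duplicates[name_a].append((name_b, distance))
            duplicates[name_b].append((name_a, distance))
    return duplicates


# Made-up 64-bit hashes (16 hex characters each), for illustration only.
example_encodings = {
    "image1.jpg": "ffff0000ffff0000",
    "image1_copy.jpg": "ffff0000ffff0001",  # differs from image1.jpg in 1 bit
    "image2.jpg": "0000ffff0000ffff",  # differs from image1.jpg in all 64 bits
}

result = group_duplicates(example_encodings, max_distance_threshold=10)
print(result)
```

A lower threshold keeps only near-identical hashes, while a threshold of 64 would pair every image with every other. The pairwise loop is quadratic in the number of images, which is acceptable for a sketch but not for large collections.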