Copy detection
Copy detection is the task of identifying modified copies of images or videos.
Copy detection is deployed at scale in content moderation, allowing automated systems to act on copies of content that moderators have previously removed. This is especially important for classes of problematic content that are difficult to assess without human judgement, such as misinformation.
In foundation model training
A recent application of copy detection fingerprints, and SSCD specifically, is to remove clusters of duplicated images from foundation model training datasets. This practice has become an industry standard across several modalities:
Image + video generation. Stable Diffusion 3 uses SSCD to deduplicate the training dataset, demonstrating a …
Self-supervised learning models. DINOv2 uses SSCD to deduplicate its training dataset. DINOv3 does the same, inheriting this step from DINOv2.
Multimodal LLMs. Llama 3 similarly uses SSCD to remove duplicate images from its dataset of image-text pairs.
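The deduplication step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used by any of the projects named here: it assumes each image already has a non-zero fingerprint vector, compares fingerprints by cosine similarity, and merges every pair above a threshold into a cluster using union-find. The function names and the 0.75 threshold are hypothetical, and the all-pairs loop stands in for the approximate nearest-neighbor search that a large dataset would need.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def duplicate_clusters(descriptors, threshold=0.75):
    """Group items whose cosine similarity reaches `threshold` (a hypothetical value)."""
    vectors = [l2_normalize(v) for v in descriptors]
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force comparison of every pair; fine for a sketch, quadratic in dataset size.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            similarity = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            if similarity >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Attach the larger root to the smaller, so each root is its cluster's lowest index.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def deduplicate(descriptors, threshold=0.75):
    """Return the indices to keep: the lowest index from each duplicate cluster."""
    return sorted(min(cluster) for cluster in duplicate_clusters(descriptors, threshold))
```

Clustering by connected components means that a chain of near-duplicates collapses into one cluster even when its two ends fall below the threshold, which is one possible design choice among several.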
Selected work
SSCD CVPR 2022
A Self-Supervised Descriptor for Image Copy Detection presents our SSCD fingerprint model.
This work adapts contrastive learning to the task of copy detection, and demonstrates how an entropy regularization technique significantly improves copy detection accuracy.
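As a rough illustration of what an entropy regularizer of this kind can look like, the sketch below uses a Kozachenko-Leonenko style nearest-neighbor estimate: it penalizes embeddings that sit very close to their nearest neighbor in the batch, which pushes descriptors to spread out. It is a simplified stand-in rather than the released SSCD training code. The function name and the `eps` constant are assumptions, and this version excludes only each embedding itself from the neighbor search.

```python
import math


def entropy_regularizer(embeddings, eps=1e-8):
    """Negative mean log distance from each embedding to its nearest neighbor in the batch.

    Lower values mean the batch is more spread out. Requires at least two embeddings.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Distance to the closest other embedding in the batch.
        nearest = min(
            math.dist(embeddings[i], embeddings[j]) for j in range(n) if j != i
        )
        # eps (hypothetical constant) guards against log(0) for exact duplicates.
        total += math.log(nearest + eps)
    return -total / n
```

In training, a term like this would typically be added to the contrastive loss with a weight that controls the strength of the regularization.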
This work highlights important tradeoffs between semantic (i.e. linear classification) and copy detection accuracy.
As the strength of entropy regularization (
A production system based on an earlier version of this work, SimSearchNet++, is deployed on Facebook and Instagram.
SSCD has since become an industry standard for removing clusters of duplicate images from training datasets for foundation models.
See the open-source SSCD codebase, which includes released model weights, to reproduce both training and evaluation of our method.
Visual Copy Detection Workshop (VCDW) CVPR 2023
I was the lead organizer of the Visual Copy Detection Workshop, held at CVPR 2023 in Vancouver. This workshop hosted presentations by Video Similarity Challenge participants, as well as invited talks on copy detection methods and use cases.
Video Similarity Challenge 2022-2023
The Video Similarity Dataset and Challenge is similar in spirit to the Image Similarity Challenge, adapted to the video domain. Participants must predict matching segments between query videos and a database of reference videos. Predicted segments must be temporally localized within each video. This challenge studies partial video copy detection end-to-end, including the search step.
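To make the idea of a temporally localized prediction concrete, here is a small sketch of one way to represent a predicted segment pair and compare it with a ground-truth pair. The field names are hypothetical, and the overlap score (the smaller of the two temporal IoUs) is a simple illustration, not the challenge's official evaluation metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMatch:
    """A matched pair of segments; times are in seconds. Field names are hypothetical."""
    query_id: str
    ref_id: str
    query_start: float
    query_end: float
    ref_start: float
    ref_end: float
    score: float = 1.0


def interval_iou(a_start, a_end, b_start, b_end):
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


def segment_overlap(pred, truth):
    """Overlap between a predicted and a true match, limited by the worse-localized side."""
    if (pred.query_id, pred.ref_id) != (truth.query_id, truth.ref_id):
        return 0.0  # different video pair, so no overlap at all
    return min(
        interval_iou(pred.query_start, pred.query_end, truth.query_start, truth.query_end),
        interval_iou(pred.ref_start, pred.ref_end, truth.ref_start, truth.ref_end),
    )
```

Taking the minimum means a prediction only scores well when it is localized correctly in both the query and the reference video.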
Our report on the challenge was published in CVIU in June 2024. See also: an arXiv preprint of the report, and the challenge website. Presentations of the challenge and participant solutions can be found on the VCDW website.
Image Similarity Challenge NeurIPS 2021
The 2021 Image Similarity Challenge (ISC; see also the dataset paper) was an image copy detection challenge using a new dataset, DISC2021.
DISC2021 is a large dataset with robustly transformed matches. The challenge uses a needle-in-haystack setting inspired by real production systems, where most queries have no matches in the dataset.
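In a needle-in-haystack setting, confident predictions for queries that have no match must count against the system. One way to capture this is micro-average precision, which ranks the predictions from all queries in a single list by score. The sketch below is a simplified illustration with hypothetical names and input shapes, not the challenge's official scoring code, and it does not treat tied scores specially.

```python
def micro_average_precision(predictions, ground_truth):
    """Average precision over one global ranking of predictions from all queries.

    predictions: iterable of (query_id, ref_id, score) tuples.
    ground_truth: set of (query_id, ref_id) pairs that are true matches.
    """
    if not ground_truth:
        return 0.0
    # One ranking across all queries, most confident prediction first.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    seen = set()
    true_positives = 0
    precision_sum = 0.0
    rank = 0
    for query_id, ref_id, _score in ranked:
        pair = (query_id, ref_id)
        if pair in seen:
            continue  # count each (query, reference) pair only once
        seen.add(pair)
        rank += 1
        if pair in ground_truth:
            true_positives += 1
            precision_sum += true_positives / rank
    # Dividing by all true matches penalizes matches that were never predicted.
    return precision_sum / len(ground_truth)
```

Because the ranking is global, a high-scoring prediction for a distractor query lowers the precision of every true match ranked below it, so scores need to be calibrated across queries.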
Example matches from the Image Similarity Challenge.