SCOT: Self-Supervised Contrastive Pretraining for Zero-Shot Compositional Retrieval

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Scopus citation

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor-intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR. Our code and models are available at https://github.com/yahoo/SCOT.
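
The proxy-target idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear `compose` function is a hypothetical stand-in for the embedding composition network, the InfoNCE-style loss and the temperature of 0.07 are assumptions, and random vectors stand in for embeddings from a pretrained vision-language model. The point it shows is only that the composed (image + modification text) embedding is contrasted against a target text embedding rather than a target image embedding.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def compose(image_emb, mod_text_emb, weight):
    """Hypothetical composition network: one linear map over the
    concatenated image and modification-text embeddings."""
    joint = np.concatenate([image_emb, mod_text_emb], axis=-1)
    return l2_normalize(joint @ weight)

def info_nce(composed, target_text_emb, temperature=0.07):
    """InfoNCE-style loss (an assumption): row i of `composed` should match
    row i of `target_text_emb`; other rows in the batch act as negatives."""
    logits = composed @ l2_normalize(target_text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Random stand-ins for frozen vision-language model embeddings.
rng = np.random.default_rng(0)
batch, dim = 4, 8
image_emb = l2_normalize(rng.normal(size=(batch, dim)))
mod_text_emb = l2_normalize(rng.normal(size=(batch, dim)))
# Proxy supervision: the embedding of a target caption, not a target image.
target_text_emb = l2_normalize(rng.normal(size=(batch, dim)))

weight = rng.normal(size=(2 * dim, dim)) * 0.1
loss = info_nce(compose(image_emb, mod_text_emb, weight), target_text_emb)
```

In a real training loop the loss would be minimized with respect to the composition network's parameters; the sketch stops at computing it once.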

Original language: English
Title of host publication: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5509-5519
Number of pages: 11
ISBN (Electronic): 9798331510831
State: Published - 2025
Event: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 - Tucson, United States
Duration: Feb 28 2025 to Mar 4 2025

Publication series

Name: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025

Conference

Conference: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Country/Territory: United States
City: Tucson
Period: 02/28/25 to 03/4/25

Keywords

  • compositional
  • computer-vision
  • language
  • llm
  • retrieval
  • self-supervised
  • vision
  • zero-shot
