
HOVER: Hyperbolic Video-Text Retrieval

  • Jun Wen
  • Yufeng Chen
  • Ruiqi Shi
  • Wei Ji
  • Menglin Yang
  • Difei Gao
  • Junsong Yuan
  • Roger Zimmermann

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Video-text retrieval is a crucial task in numerous computer vision applications. In this paper, we focus on video-text retrieval involving complex action compositions, where a single video encompasses multiple primitive actions such as “sitting up”, “opening door”, “cooking food”, and “eating”. Although common in real-world scenarios, such action-compositional videos have received limited research attention, and existing retrieval methods often suffer significant performance degradation on them. To address this challenge, we present Hyperbolic Video-tExt Retrieval (HOVER), which models the hierarchical semantic relationships between videos and texts by embedding them in a low-dimensional hyperbolic space. Because hyperbolic space provides a geometric prior that naturally aligns with hierarchical data, it allows for more efficient and generalizable representations of video-text semantic hierarchies. HOVER first longitudinally decomposes each video into a hierarchical action tree, in which primitive mono-actions are represented as leaf nodes and increasingly complex action compositions as parent nodes. The semantic structures and temporal dependencies of videos and texts are then encoded in hyperbolic space by exploiting hyperbolic distance, norm, and relative cosine similarity. Experimental results show that HOVER significantly outperforms traditional Euclidean-based methods, particularly in scenarios with limited training labels, where it achieves a notable performance improvement of 28.83%. Additionally, the hyperbolic video-text embeddings learned by HOVER demonstrate strong generalization across new datasets containing videos with varying levels of action complexity.
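
As an illustration of the hyperbolic distance and norm mentioned in the abstract, the following is a minimal sketch that assumes the Poincaré ball model with curvature -1. The abstract does not state which model of hyperbolic space HOVER uses, and the function names here are hypothetical, so this should be read as a generic example rather than the authors' implementation.

```python
import math


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: curvature -1 and points given as plain sequences of floats
    with Euclidean norm strictly below 1.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Clamp the denominator so points very close to the boundary stay finite.
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def hyperbolic_norm(u):
    """Hyperbolic distance of a point from the origin: 2 * artanh(||u||)."""
    r = math.sqrt(sum(x * x for x in u))
    return 2.0 * math.atanh(min(r, 1.0 - 1e-9))


# Hypothetical 2-D embeddings: one point near the origin, one near the boundary.
near_origin = [0.1, 0.0]
near_boundary = [0.9, 0.0]

print(hyperbolic_norm(near_origin))
print(hyperbolic_norm(near_boundary))
print(poincare_distance(near_origin, near_boundary))
```

In this model, distances grow rapidly toward the boundary of the ball, which is why tree-like data can be embedded with low distortion in few dimensions. Which levels of the action tree sit closer to the origin is a modelling choice that the abstract does not specify.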

Original language: English
Pages (from-to): 6192-6203
Number of pages: 12
Journal: IEEE Transactions on Image Processing
Volume: 34
State: Published - 2025

Keywords

  • Video-text retrieval
  • hyperbolic representation
  • multimodal learning
