Abstract
Video-text retrieval is a crucial task in numerous computer vision applications. In this paper, we focus on video-text retrieval involving complex action compositions, where a single video encompasses multiple primitive actions such as “sitting up”, “opening door”, “cooking food”, and “eating.” Although such action-compositional videos are common in real-world scenarios, they have received limited research attention, and existing retrieval methods often suffer significant performance degradation on them. To address this challenge, we present Hyperbolic Video-tExt Retrieval (HOVER), which models the hierarchical semantic relationships between videos and texts by embedding them in a low-dimensional hyperbolic space. Because hyperbolic space provides a geometric prior that naturally aligns with hierarchical data, it allows for more efficient and generalizable representations of video-text semantic hierarchies. HOVER first longitudinally decomposes each video into a hierarchical action tree, in which primitive mono-actions are represented as leaf nodes and increasingly complex action compositions as parent nodes. The semantic structures and temporal dependencies of videos and texts are then encoded in hyperbolic space by exploiting hyperbolic distance, norm, and relative cosine similarity. Experimental results show that HOVER significantly outperforms traditional Euclidean-based methods, particularly in scenarios with limited training labels, where it achieves a performance improvement of 28.83%. Additionally, the hyperbolic video-text embeddings learned by HOVER generalize well to new datasets containing videos with varying levels of action complexity.
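As a rough illustration of the hyperbolic distance and norm mentioned in the abstract, the sketch below computes both in the Poincaré ball model. The abstract does not say which model of hyperbolic space or which curvature HOVER uses, so the Poincaré ball with curvature -1 and the function names `poincare_distance` and `hyperbolic_norm` are assumptions made for illustration, not the authors' implementation.

```python
import math


def sq_norm(x):
    """Squared Euclidean norm of a point given as a list of floats."""
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between u and v in the Poincare ball (curvature -1).

    Assumed model for illustration; both points must lie strictly
    inside the unit ball.
    """
    nu, nv = sq_norm(u), sq_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


def hyperbolic_norm(x):
    """Hyperbolic distance of x from the origin: 2 * artanh(||x||)."""
    return 2.0 * math.atanh(math.sqrt(sq_norm(x)))


if __name__ == "__main__":
    # Two pairs with the same Euclidean gap (0.1) but different radii.
    near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
    near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
    print(f"near origin:   {near_origin:.4f}")
    print(f"near boundary: {near_boundary:.4f}")
```

The same Euclidean gap corresponds to a larger hyperbolic distance close to the boundary of the ball. This growth of available volume with radius is what lets tree-like structures embed with low distortion in few dimensions.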
| Original language | English |
|---|---|
| Pages (from-to) | 6192-6203 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Image Processing |
| Volume | 34 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Video-text retrieval
- hyperbolic representation
- multimodal learning