Skip to main navigation Skip to search Skip to main content

SBAT: Video captioning with sparse boundary-aware transformer

  • Zhejiang University
  • Baidu Inc
  • Alibaba Group Holding Ltd.

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

36 Scopus citations

Abstract

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer is proposed for uni-modal language generation task such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs boundary-aware pooling operation for scores from multihead attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.

Original languageEnglish
Title of host publicationProceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
EditorsChristian Bessiere
PublisherInternational Joint Conferences on Artificial Intelligence
Pages630-636
Number of pages7
ISBN (Electronic)9780999241165
StatePublished - 2020
Event29th International Joint Conference on Artificial Intelligence, IJCAI 2020 - Yokohama, Japan
Duration: Jan 1 2021 → …

Publication series

NameIJCAI International Joint Conference on Artificial Intelligence
Volume2021-January

Conference

Conference29th International Joint Conference on Artificial Intelligence, IJCAI 2020
Country/TerritoryJapan
CityYokohama
Period01/1/21 → …

Fingerprint

Dive into the research topics of 'SBAT: Video captioning with sparse boundary-aware transformer'. Together they form a unique fingerprint.

Cite this