TEACHING HUMAN BEHAVIOR IMPROVES CONTENT UNDERSTANDING ABILITIES OF VLMS

  • Somesh Singh
  • S. I. Harini
  • Yaman K. Singla
  • Changyou Chen
  • Rajiv Ratn Shah
  • Veeky Baths
  • Balaji Krishnamurthy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Communication is defined as “Who says what to whom with what effect.” A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Although it carries signals about the message, the behavior signal is often ignored when training vision language models (VLMs). We show that training VLMs on receiver behavior can improve their content-understanding abilities. We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content-understanding tasks. We show this improvement across 6 types of behavior and 46 tasks covering image, video, text, and audio over 26 benchmark datasets, in both zero-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. Because receiver behavior is collected by default on the internet and needs no human annotation to be useful, the performance improvement obtained from training on this data is essentially a free lunch. We also release BLIFT, our Behaviour-LLaVA IFT dataset, comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models. The dataset and code are available at behavior-in-the-wild.github.io/behavior-llava.

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 35019-35050
Number of pages: 32
ISBN (Electronic): 9798331320850
State: Published - 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: Apr 24 2025 to Apr 28 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 04/24/25 to 04/28/25
