
OpenEQA: Embodied Question Answering in the Era of Foundation Models

  • Arjun Majumdar
  • Anurag Ajay
  • Xiaohan Zhang
  • Pranav Putta
  • Sriram Yenamandra
  • Mikael Henaff
  • Sneha Silwal
  • Paul McVay
  • Oleksandr Maksymets
  • Sergio Arnaud
  • Karmesh Yadav
  • Qiyang Li
  • Ben Newman
  • Mohit Sharma
  • Vincent Berges
  • Shiqi Zhang
  • Pulkit Agrawal
  • Yonatan Bisk
  • Dhruv Batra
  • Mrinal Kalakrishnan
  • Franziska Meier
  • Chris Paxton
  • Alexander Sax
  • Aravind Rajeswaran

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

83 Scopus citations

Abstract

We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding either by drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA – the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 16488-16498
Number of pages: 11
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353006
DOIs
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: Jun 16 2024 – Jun 22 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 06/16/24 – 06/22/24

Keywords

  • Embodied AI
  • Embodied Question Answering
  • Vision-Language Models
