TY - GEN
T1 - OpenEQA: Embodied Question Answering in the Era of Foundation Models
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Majumdar, Arjun
AU - Ajay, Anurag
AU - Zhang, Xiaohan
AU - Putta, Pranav
AU - Yenamandra, Sriram
AU - Henaff, Mikael
AU - Silwal, Sneha
AU - McVay, Paul
AU - Maksymets, Oleksandr
AU - Arnaud, Sergio
AU - Yadav, Karmesh
AU - Li, Qiyang
AU - Newman, Ben
AU - Sharma, Mohit
AU - Berges, Vincent
AU - Zhang, Shiqi
AU - Agrawal, Pulkit
AU - Bisk, Yonatan
AU - Batra, Dhruv
AU - Kalakrishnan, Mrinal
AU - Meier, Franziska
AU - Paxton, Chris
AU - Sax, Alexander
AU - Rajeswaran, Aravind
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA – the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.
AB - We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA – the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.
KW - Embodied AI
KW - Embodied Question Answering
KW - Vision-Language Models
UR - https://www.scopus.com/pages/publications/85197386124
U2 - 10.1109/CVPR52733.2024.01560
DO - 10.1109/CVPR52733.2024.01560
M3 - Conference contribution
SN - 9798350353006
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 16488
EP - 16498
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -