
SOCRATIC MODELS: COMPOSING ZERO-SHOT MULTIMODAL REASONING WITH LANGUAGE

  • Andy Zeng
  • Maria Attarian
  • Brian Ichter
  • Krzysztof Choromanski
  • Adrian Wong
  • Stefan Welker
  • Federico Tombari
  • Aveek Purohit
  • Michael S. Ryoo
  • Vikas Sindhwani
  • Johnny Lee
  • Vincent Vanhoucke
  • Pete Florence

Alphabet Inc.

Research output: Contribution to conference › Paper › peer-review

80 Scopus citations

Abstract

We investigate how multimodal prompt engineering can use language as the intermediate representation to combine complementary knowledge from different pretrained (potentially multimodal) language models for a variety of tasks. This approach is both distinct from and complementary to the dominant paradigm of joint multimodal training. It also recalls a traditional systems-building view as in classical NLP pipelines, but with prompting large pretrained multimodal models. We refer to these as Socratic Models (SMs): a modular class of systems in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to capture new multimodal capabilities, without additional finetuning. We show that these systems provide competitive state-of-the-art performance for zero-shot image captioning and video-to-text retrieval, and also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes), and (iii) robot perception and planning. We hope this work provides (a) results for stronger zero-shot baseline performance with analysis also highlighting their limitations, (b) new perspectives for building multimodal systems powered by large pretrained models, and (c) practical application advantages in certain regimes limited by data scarcity, training compute, or model access.
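The composition pattern the abstract describes can be sketched in a few lines: a visual-language model (VLM) ranks candidate text descriptions of an image, and its top matches are stitched into a language-only prompt that a language model (LM) then completes. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the helpers `vlm_rank`, `build_caption_prompt`, and `lm_complete` are invented names, and both model calls are stubbed (a real system would use, e.g., a CLIP-style VLM and a GPT-style LM).

```python
def vlm_rank(image, candidates):
    """Stub VLM: rank candidate texts by a mocked image-text similarity.

    A real system would embed the image and each candidate text and rank
    by cosine similarity; here the "image" is a dict with a set of
    visible words, and the score is simple word overlap.
    """
    scores = {c: len(set(c.split()) & image["visible"]) for c in candidates}
    return sorted(candidates, key=lambda c: -scores[c])


def build_caption_prompt(places, objects):
    """Compose ranked VLM outputs into a language-only prompt for the LM.

    Language is the intermediate representation: the VLM's visual
    detections become plain English that any LM can condition on.
    """
    return (
        f"I think this photo was taken at a {places[0]}. "
        f"I see {', '.join(objects[:3])}. "
        "A short caption for this photo:"
    )


def lm_complete(prompt):
    """Stub LM: a real system would send the prompt to a large LM."""
    return prompt + " [LM-generated caption]"


# Zero-shot composition: no finetuning, just chained prompting.
image = {"visible": {"kitchen", "pot", "stove"}}
places = vlm_rank(image, ["kitchen", "beach", "office"])
objects = vlm_rank(image, ["a pot", "a surfboard", "a stove"])
caption = lm_complete(build_caption_prompt(places, objects))
```

The key design choice the paper argues for is visible even in this toy form: each model stays frozen and communicates only through text, so models can be swapped without any joint training.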

Original language: English
State: Published - 2023
Event: 11th International Conference on Learning Representations, ICLR 2023 - Kigali, Rwanda
Duration: May 1, 2023 – May 5, 2023

Conference

Conference: 11th International Conference on Learning Representations, ICLR 2023
Country/Territory: Rwanda
City: Kigali
Period: 05/1/23 – 05/5/23
