Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)

  • University of California at San Diego

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.

Original language: English
Pages (from-to): 803-819
Number of pages: 17
Journal: Transactions of the Association for Computational Linguistics
Volume: 12
State: Published - Jun 25 2024
