Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)

Abstract
We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.
| Original language | English |
|---|---|
| Pages (from-to) | 803-819 |
| Number of pages | 17 |
| Journal | Transactions of the Association for Computational Linguistics |
| Volume | 12 |
| State | Published - Jun 25 2024 |