Tap&Say: Touch Location-Informed Large Language Model for Multimodal Text Correction on Smartphones

  • Maozheng Zhao
  • Michael Xuelin Huang
  • Nathan G. Huang
  • Shanqing Cai
  • Henry Huang
  • Michael G. Huang
  • Shumin Zhai
  • I. V. Ramakrishnan
  • Xiaojun Bi

  • Stony Brook University
  • Alphabet Inc.
  • Westlake High School
  • Harvard University
  • University of Texas at Austin

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

2 Scopus citations

Abstract

While voice input offers a convenient alternative to traditional text editing on mobile devices, practical implementations face two key challenges: 1) reliably distinguishing between editing commands and content dictation, and 2) effortlessly pinpointing the intended edit location. We propose Tap&Say, a novel multimodal system that combines touch interactions with Large Language Models (LLMs) for accurate text correction. By tapping near an error, users signal their edit intent and location, addressing both challenges. Then, the user speaks the correction text. Tap&Say utilizes the touch location, voice input, and existing text to generate contextually relevant correction suggestions. We propose a novel touch location-informed attention layer that integrates the tap location into the LLM's attention mechanism, enabling it to utilize the tap location for text correction. We fine-tuned the touch location-informed LLM on synthetic touch locations and correction commands, achieving significantly higher correction accuracy than the state-of-the-art method VT [45]. A 16-person user study demonstrated that Tap&Say outperforms VT [45] with shorter task completion time and fewer keyboard clicks and is preferred by users.
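
The abstract does not specify how the touch location-informed attention layer is built. As a rough illustration of the general idea only, the sketch below adds a distance-based bias to raw attention scores so that tokens rendered closer to the tap receive more weight. Everything here is an assumption made for illustration: the function names, the 1-D token positions, the Gaussian-shaped penalty, and the `sigma` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tap_biased_attention(scores, token_centers, tap_x, sigma=1.0):
    """Bias raw attention scores toward tokens near a tap, then normalize.

    scores        -- raw attention logits for one query, one per token
    token_centers -- hypothetical horizontal position of each token on screen
    tap_x         -- horizontal tap position, in the same units
    sigma         -- assumed spread; smaller values focus more tightly on the tap
    """
    # Subtract a squared-distance penalty (hypothetical choice, not the paper's).
    biased = [
        s - ((c - tap_x) ** 2) / (2 * sigma ** 2)
        for s, c in zip(scores, token_centers)
    ]
    return softmax(biased)


if __name__ == "__main__":
    tokens = ["I", "went", "to", "the", "stare"]
    centers = [0.0, 1.0, 2.0, 3.0, 4.0]
    # With uniform logits, the tap position alone decides the weighting.
    weights = tap_biased_attention([0.0] * len(tokens), centers, tap_x=3.8)
    for tok, w in zip(tokens, weights):
        print(f"{tok:>6}: {w:.3f}")
```

With uniform logits and a tap near the last word, most of the weight lands on that word. A trained model would presumably learn how strongly to weigh the tap against the spoken correction and the existing text rather than use a fixed penalty like this one.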

Original language: English
Title of host publication: CHI 2025 - Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems
Publisher: Association for Computing Machinery
ISBN (Electronic): 9798400713941
State: Published - Apr 26 2025
Event: 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025 - Yokohama, Japan
Duration: Apr 26 2025 to May 1 2025

Publication series

Name: Conference on Human Factors in Computing Systems - Proceedings

Conference

Conference: 2025 CHI Conference on Human Factors in Computing Systems, CHI 2025
Country/Territory: Japan
City: Yokohama
Period: 04/26/25 to 05/1/25

Keywords

  • LLMs
  • multi-modal
  • text correction
  • voice input
