
Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions

  • Yucheng Ding
  • Chaoyue Niu
  • Fan Wu
  • Shaojie Tang
  • Chengfei Lyu
  • Guihai Chen
  • Shanghai Jiao Tong University
  • Alibaba Group Holding Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Many billion-scale large language models (LLMs) have been released for resource-constrained mobile devices to provide local LLM inference when powerful cloud-based LLMs are unavailable. However, the capabilities of current on-device LLMs still lag behind those of cloud-based LLMs, and effectively and efficiently enhancing on-device LLM inference has become a practical requirement. We therefore propose to collect the user's historical interactions with the cloud-based LLM and build an external datastore on the mobile device, enhancing local inference through nearest-neighbor search. Nevertheless, the full datastore improves the quality of token generation at the unacceptable cost of much slower generation. To balance performance and efficiency, we propose to select an optimal subset of the full datastore within a given size limit; the optimization objective is proven to be submodular. We further design an offline algorithm, which selects the subset after the full datastore has been constructed, as well as an online algorithm, which performs selection over the stream of interactions and can be flexibly scheduled. We theoretically analyze the performance guarantees and the time complexity of the offline and online designs to demonstrate their effectiveness and scalability. We finally evaluate on three ChatGPT-related dialogue datasets with four different on-device LLMs. Evaluation results show that the proposed designs significantly improve LLM performance in terms of perplexity while maintaining fast token generation. Practical overhead testing on a smartphone demonstrates the efficiency of on-device datastore subset selection in terms of memory usage and computation overhead.
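The enhancement mechanism described above matches the well-known kNN-LM pattern (interpolating the LM's next-token distribution with a distribution induced by nearest neighbors in a datastore), and the offline subset selection admits the classic greedy algorithm for monotone submodular maximization, with its (1 - 1/e) approximation guarantee. The sketch below illustrates both ideas; it is not the authors' implementation. The function names, the facility-location surrogate objective, and all parameters (k, lam, temp, the probe set) are illustrative assumptions; the paper's actual objective, distance metric, and online variant differ in detail.

```python
# Minimal sketch (NOT the paper's code) of (1) kNN-LM-style enhancement from
# an on-device datastore and (2) greedy subset selection under a size budget.
import numpy as np

def knn_enhanced_probs(p_lm, query, keys, values, vocab_size, k=8, lam=0.25, temp=1.0):
    """Interpolate the on-device LM distribution p_lm with a kNN distribution
    built from the k nearest datastore entries (L2 distance, kNN-LM style)."""
    d = np.linalg.norm(keys - query, axis=1)      # distance from query to every key
    idx = np.argsort(d)[:k]                       # indices of the k nearest neighbors
    w = np.exp(-d[idx] / temp)                    # softmax weights over -distance
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for token, weight in zip(values[idx], w):     # put mass on the neighbors' tokens
        p_knn[token] += weight
    return (1 - lam) * p_lm + lam * p_knn         # standard kNN-LM interpolation

def greedy_subset(n_entries, budget, gain):
    """Classic greedy for monotone submodular maximization under a cardinality
    budget; achieves a (1 - 1/e) approximation of the optimal subset."""
    selected, remaining = [], set(range(n_entries))
    while len(selected) < budget and remaining:
        best = max(remaining, key=lambda i: gain(selected, i))  # largest marginal gain
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, D, N = 100, 16, 500                        # vocab size, key dim, datastore size
    keys = rng.normal(size=(N, D))                # stand-in for cached hidden states
    values = rng.integers(0, V, size=N)           # stand-in for cached next tokens

    # Facility-location surrogate: f(S) = sum over probes of the best (clipped)
    # similarity any selected key offers -- a standard monotone submodular choice.
    probes = rng.normal(size=(64, D))
    sim = np.maximum(probes @ keys.T, 0.0)        # clip to keep f monotone

    def fl_value(sel):
        return sim[:, sel].max(axis=1).sum() if sel else 0.0

    def gain(sel, i):
        return fl_value(sel + [i]) - fl_value(sel)

    subset = greedy_subset(N, budget=50, gain=gain)
    q = rng.normal(size=D)                        # fake decoder hidden state
    p_lm = rng.dirichlet(np.ones(V))              # fake on-device LM distribution
    p = knn_enhanced_probs(p_lm, q, keys[subset], values[subset], V)
    print("argmax token:", int(p.argmax()), "| total mass:", round(float(p.sum()), 6))
```

An online counterpart would replace the greedy loop with a per-arrival keep/drop rule (e.g., a marginal-gain threshold), so selection can run incrementally over the interaction stream instead of waiting for the full datastore to be built.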

Original language: English
Title of host publication: KDD 2024 - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 597-608
Number of pages: 12
ISBN (Electronic): 9798400704901
DOIs
State: Published - Aug 24 2024
Event: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024 - Barcelona, Spain
Duration: Aug 25 2024 – Aug 29 2024

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024
Country/Territory: Spain
City: Barcelona
Period: 08/25/24 – 08/29/24

Keywords

  • datastore subset selection
  • device-cloud hybrid service
  • on-device llm enhancement
