TY - GEN
T1 - Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions
AU - Ding, Yucheng
AU - Niu, Chaoyue
AU - Wu, Fan
AU - Tang, Shaojie
AU - Lyu, Chengfei
AU - Chen, Guihai
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/8/24
Y1 - 2024/8/24
N2 - Many billion-scale large language models (LLMs) have been released for resource-constraint mobile devices to provide local LLM inference service when cloud-based powerful LLMs are not available. However, the capabilities of current on-device LLMs still lag behind those of cloud-based LLMs, and how to effectively and efficiently enhance on-device LLM inference becomes a practical requirement. We thus propose to collect the user's historical interactions with the cloud-based LLM and build an external datastore on the mobile device for enhancement using nearest neighbors search. Nevertheless, the full datastore improves the quality of token generation at the unacceptable expense of much slower generation speed. To balance performance and efficiency, we propose to select an optimal subset of the full datastore within the given size limit, the optimization objective of which is proven to be submodular. We further design an offline algorithm, which selects the subset after the construction of the full datastore, as well as an online algorithm, which performs selection over the stream and can be flexibly scheduled. We theoretically analyze the performance guarantee and the time complexity of the offline and the online designs to demonstrate effectiveness and scalability. We finally take three ChatGPT related dialogue datasets and four different on-device LLMs for evaluation. Evaluation results show that the proposed designs significantly enhance LLM performance in terms of perplexity while maintaining fast token generation speed. Practical overhead testing on the smartphone reveal the efficiency of on-device datastore subset selection from memory usage and computation overhead.
AB - Many billion-scale large language models (LLMs) have been released for resource-constraint mobile devices to provide local LLM inference service when cloud-based powerful LLMs are not available. However, the capabilities of current on-device LLMs still lag behind those of cloud-based LLMs, and how to effectively and efficiently enhance on-device LLM inference becomes a practical requirement. We thus propose to collect the user's historical interactions with the cloud-based LLM and build an external datastore on the mobile device for enhancement using nearest neighbors search. Nevertheless, the full datastore improves the quality of token generation at the unacceptable expense of much slower generation speed. To balance performance and efficiency, we propose to select an optimal subset of the full datastore within the given size limit, the optimization objective of which is proven to be submodular. We further design an offline algorithm, which selects the subset after the construction of the full datastore, as well as an online algorithm, which performs selection over the stream and can be flexibly scheduled. We theoretically analyze the performance guarantee and the time complexity of the offline and the online designs to demonstrate effectiveness and scalability. We finally take three ChatGPT related dialogue datasets and four different on-device LLMs for evaluation. Evaluation results show that the proposed designs significantly enhance LLM performance in terms of perplexity while maintaining fast token generation speed. Practical overhead testing on the smartphone reveal the efficiency of on-device datastore subset selection from memory usage and computation overhead.
KW - datastore subset selection
KW - device-cloud hybrid service
KW - on-device llm enhancement
UR - https://www.scopus.com/pages/publications/85203712551
U2 - 10.1145/3637528.3671679
DO - 10.1145/3637528.3671679
M3 - Conference contribution
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 597
EP - 608
BT - KDD 2024 - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024
Y2 - 25 August 2024 through 29 August 2024
ER -