Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

  • Zi Yang
  • Samridhi Choudhary
  • Siegfried Kunzmann
  • Zheng Zhang
  • Amazon.com, Inc.
  • University of California at Santa Barbara

Research output: Contribution to journal, Conference article, peer-reviewed

1 Scopus citation

Abstract

Fine-tuned transformer models have shown superior performance in many natural language tasks. However, their large size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, the number of arithmetic operations, and ultimately the runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing up to a 63× compression ratio, little accuracy loss, and remarkable inference and training speedups.
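
The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a tensor-train-matrix style factorization (the abstract says only "low-rank tensor cores") and a symmetric uniform quantizer with a single scale factor. The function names, the 768×768 layer shape, the factor dimensions, and the ranks are all hypothetical values chosen for illustration.

```python
from math import prod


def dense_params(in_dims, out_dims):
    """Parameter count of an uncompressed weight matrix whose input and
    output sizes are the products of the given factor dimensions."""
    return prod(in_dims) * prod(out_dims)


def ttm_params(in_dims, out_dims, ranks):
    """Parameter count when the same matrix is stored as a chain of cores,
    with core k of shape (ranks[k], in_dims[k], out_dims[k], ranks[k + 1]).
    Assumed format: tensor-train matrix; the boundary ranks must be 1."""
    assert len(in_dims) == len(out_dims) == len(ranks) - 1
    assert ranks[0] == 1 and ranks[-1] == 1
    return sum(
        ranks[k] * in_dims[k] * out_dims[k] * ranks[k + 1]
        for k in range(len(in_dims))
    )


def fake_quantize(x, scale, bits=8):
    """Forward pass of a symmetric uniform quantizer: divide by the scale,
    round, clamp to the signed integer range, then rescale. During training
    the scale would be a learnable parameter; that part is omitted here."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical 768 x 768 linear layer, factorized as (12*8*8) x (8*8*12).
in_dims, out_dims, ranks = (12, 8, 8), (8, 8, 12), (1, 30, 30, 1)
dense = dense_params(in_dims, out_dims)
compressed = ttm_params(in_dims, out_dims, ranks)
print(dense, compressed, round(dense / compressed, 2))

# 4-bit quantization with an illustrative scale of 0.25.
print(fake_quantize(0.30, scale=0.25, bits=4))   # snaps to the nearest level
print(fake_quantize(10.0, scale=0.25, bits=4))   # clamps at the top level
```

The ratio printed for this toy layer reflects only the illustrative shapes and ranks chosen here and is unrelated to the 63× figure reported in the abstract, which also accounts for the low-precision representation.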

Original language: English
Pages (from-to): 3292-3296
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: Aug 20 2023 to Aug 24 2023

Keywords

  • model compression
  • natural language understanding
  • quantization
  • tensor decomposition
