Abstract
Fine-tuned transformer models have shown superior performances in many natural language tasks. However, the large model size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces model parameters. A quantization-aware training with learnable scale factors is used to further obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve the convergence, a layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated in two natural language understanding tasks, showing up to 63× compression ratio, little accuracy loss and remarkable inference and training speedup.
| Original language | English |
|---|---|
| Pages (from-to) | 3292-3296 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2023-August |
| DOIs | |
| State | Published - 2023 |
| Event | 24th Annual conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland Duration: Aug 20 2023 → Aug 24 2023 |
Keywords
- model compression
- natural language understanding
- quantization
- tensor decomposition
Fingerprint
Dive into the research topics of 'Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver