Multi-Stage Compression of Machine Learning Models

Nahyeon Kim and Prabhat Mishra
University of Florida


Abstract

Deploying machine learning (ML) models on edge devices is challenging due to strict memory and computational constraints. To mitigate these limitations, various model compression techniques, such as pruning, tensor decomposition, quantization, and Huffman coding, have been explored. However, most prior efforts perform compression on trained ML models, which requires significant fine-tuning to recover accuracy. In this work, we employ a multi-stage compression strategy that effectively combines pre-training pruning and tensor decomposition with post-training quantization and Huffman coding. The pre-training compression eliminates the need for costly fine-tuning to preserve model accuracy. It also reduces both training and inference memory requirements. To reduce the memory requirements further, we perform post-training quantization and Huffman coding. Unlike conventional quantization approaches that use a single scale factor per layer, we assign a different scale factor to each decomposed core, thereby minimizing information loss. Extensive experimental evaluation demonstrates that our approach significantly reduces the model size (145 times for ResNet-101, 114 times for ResNet-50) with minor accuracy loss.
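To illustrate the per-core quantization idea mentioned above, the following is a minimal sketch (not the paper's implementation): it compares symmetric 8-bit quantization with a single scale factor shared across all cores of a decomposed layer against one scale factor per core, on synthetic cores with very different value ranges. The core shapes, distributions, and the quantize/dequantize helpers are illustrative assumptions.

import numpy as np

def quantize(tensor, num_bits=8):
    """Symmetric uniform quantization; the scale is derived from this tensor alone."""
    qmax = 2 ** (num_bits - 1) - 1               # e.g., 127 for int8
    scale = np.abs(tensor).max() / qmax          # per-tensor scale factor
    q = np.clip(np.round(tensor / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy "decomposed layer": three cores (hypothetical decomposition output)
# with deliberately different magnitudes.
rng = np.random.default_rng(0)
cores = [rng.normal(0, s, size=(16, 16)) for s in (0.02, 0.5, 3.0)]

# (a) Single scale for the whole layer: small-magnitude cores lose precision.
layer = np.concatenate([c.ravel() for c in cores])
q_all, s_all = quantize(layer)
err_single = np.abs(dequantize(q_all, s_all) - layer).mean()

# (b) One scale per core: each core uses the full int8 range.
err_per_core = np.mean([np.abs(dequantize(*quantize(c)) - c).mean() for c in cores])

print(f"mean abs error, single scale  : {err_single:.6f}")
print(f"mean abs error, per-core scale: {err_per_core:.6f}")

On such synthetic data, the per-core scales yield a noticeably lower reconstruction error for the small-magnitude cores, which is the intuition behind assigning a separate scale factor to each decomposed core.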