tinyML Talks on November 3, 2020 “A technique for extreme compression of LSTM models using sparse structured additive matrices” by Urmish Thakker

We held our next tinyML Talks webcast. Urmish Thakker from SambaNova Systems presented “A technique for extreme compression of LSTM models using sparse structured additive matrices” on November 3, 2020.

Structured matrices, such as those derived from Kronecker products (KP), are effective at compressing neural networks, but can lead to unacceptable accuracy loss when applied to large models. In this paper, we propose the notion of doping: the addition of an extremely sparse matrix to a structured matrix. Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure. To train LSTMs with doped structured matrices, we introduce the additional parameter matrix while slowly annealing its sparsity level. However, we find that performance degrades as we slowly sparsify the doping matrix, due to co-matrix adaptation (CMA) between the structured and the sparse matrices. We address this overdependence on the sparse matrix using a co-matrix dropout regularization (CMR) scheme. We provide empirical evidence to show that doping, CMA and CMR are concepts generally applicable to multiple structured matrices (Kronecker Product, LMF, Hybrid Matrix Decomposition). Additionally, results with doped Kronecker product matrices demonstrate state-of-the-art accuracy at large compression factors (10-25x) across four natural language processing applications with minor loss in accuracy. The doped KP compression technique outperforms previous state-of-the-art compression results by achieving a 1.3-2.4x higher compression factor at similar accuracy, while also beating strong alternatives such as pruning and low-rank methods by a large margin (8% or more). Additionally, we show that doped KP can be deployed on commodity hardware using the current software stack, achieving a 2.5-5.5x inference run-time speed-up over the baseline.
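As a rough illustration of the doping idea described in the abstract, a doped Kronecker-product layer might look like the following PyTorch sketch. The class name, factor shapes, magnitude-based annealing schedule, and the simplified co-matrix dropout (which only drops the sparse branch) are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DopedKroneckerLinear(nn.Module):
    """Sketch of a doped Kronecker-product layer: W = kron(A, B) + S,
    where S is an extremely sparse "doping" matrix (illustrative only)."""

    def __init__(self, a_shape, b_shape, cmr_p=0.3):
        super().__init__()
        out_features = a_shape[0] * b_shape[0]
        in_features = a_shape[1] * b_shape[1]
        self.A = nn.Parameter(torch.randn(a_shape) * 0.1)  # small Kronecker factor
        self.B = nn.Parameter(torch.randn(b_shape) * 0.1)  # small Kronecker factor
        self.S = nn.Parameter(torch.zeros(out_features, in_features))  # doping matrix
        # Binary mask whose density is annealed toward the target sparsity level.
        self.register_buffer("mask", torch.ones(out_features, in_features))
        self.cmr_p = cmr_p  # probability of dropping the sparse branch (simplified CMR)

    def forward(self, x):
        w_struct = torch.kron(self.A, self.B)  # structured component
        s = self.S * self.mask                 # sparse (doped) component
        # Simplified co-matrix dropout: occasionally silence the sparse branch so the
        # structured matrix does not become overly dependent on it during training.
        if self.training and torch.rand(()) < self.cmr_p:
            s = torch.zeros_like(s)
        return F.linear(x, w_struct + s)

    @torch.no_grad()
    def anneal_sparsity(self, keep_frac):
        """Keep only the largest-magnitude fraction of entries in S (magnitude pruning)."""
        flat = self.S.abs().flatten()
        k = max(1, int(keep_frac * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        self.mask.copy_((self.S.abs() >= threshold).float())
```

In such a sketch, a training loop would call anneal_sparsity with a gradually decreasing keep_frac (for example from 1.0 down to a few percent), so that only a small number of doping parameters survive and the compression factor is dominated by the small Kronecker factors.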

Urmish Thakker is a Deep Learning Researcher at SambaNova Systems. Before joining SambaNova, he worked with Arm Research, AMD, Texas Instruments and Broadcom. His research has primarily focused on the efficient execution of neural networks on resource-constrained devices. Specifically, he has worked on model quantization, pruning, structured matrices and low-rank decomposition. His work has led to patents, publications and contributions to various products across multiple companies. Urmish completed his Master’s in Computer Science at UW Madison in the US and his Bachelor’s at BITS Pilani in India.

==========================

Watch on YouTube:
Urmish Thakker

Download presentation slides:
Urmish Thakker

Feel free to ask your questions on this thread and keep the conversation going!

Hi Urmish, thanks for the interesting talk. I just wanted to expand on some of the discussions here, and follow up on some of the questions.

It seems that the optimization (during training) with the (Kronecker Product + Sparse) structure on the LSTM matrices suffers from some challenges, and you presented some nice solutions to those.

I wanted to ask about other ways of inducing such structures during optimization. In convex optimization problems there are loss terms which are known to lead to sparse (L_p norms, with 0 < p <= 1) or low-rank (trace norm or nuclear norm) structures in matrices, without imposing that structure explicitly. In fact, in low-rank decomposition problems, imposing the low-rank structure directly/structurally is known to introduce the possibility of getting stuck in local minima.

From your presentation, I had the feeling that the gradient descent and backpropagation approach to matrix decomposition has a similar drawback, and I would be interested to know whether such loss terms could be used instead of pruning and/or directly imposing the Kronecker product structure during training.
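Concretely, I was imagining something along the lines of the sketch below, where an L1 penalty encourages sparsity in the doping matrix and a nuclear-norm penalty encourages low rank in a dense weight, without any hard structural parameterization. The function name and penalty weights here are just placeholders on my part, not anything from the talk.

```python
import torch

def regularized_loss(task_loss, doping_matrix, weight_matrix,
                     l1_weight=1e-4, nuclear_weight=1e-4):
    """Penalty-based alternative to hard structural constraints (illustrative):
    an L1 term encourages sparsity, a nuclear-norm term encourages low rank."""
    l1_penalty = doping_matrix.abs().sum()                       # promotes sparsity
    nuclear_penalty = torch.linalg.svdvals(weight_matrix).sum()  # sum of singular values
    return task_loss + l1_weight * l1_penalty + nuclear_weight * nuclear_penalty
```

The penalty weights would of course need tuning, and the nuclear-norm term requires an SVD per step, which is the usual practical caveat with this kind of approach.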