tinyML Talks on January 19, 2021 “Running Binarized Neural Networks on Microcontrollers” by Lukas Geiger

We held our next tinyML Talks webcast. Lukas Geiger from Plumerai presented “Running Binarized Neural Networks on Microcontrollers” on January 19, 2021.


Today’s deep learning methods limit the use of microcontrollers to only very basic machine learning tasks. In this talk, Lukas explains how real-time deep learning for complex tasks is achieved on microcontrollers with the help of Binarized Neural Networks (BNNs) - in which weights and activations are encoded not using 32 or 8 bits, but using only 1 bit.

BNNs allow for much lower memory requirements and extremely efficient execution, but require new training algorithms and custom software for inference. Our integrated approach tackles these issues. Built on top of Keras and TFLite, our open-source libraries (https://larq.dev) make it possible to build, train and benchmark BNNs on ARMv8-A architectures, and we show how this work exposes the inconsistencies between published research and real-world results.

Finally, we demonstrate the world’s first BNN running live on an ARM Cortex-M4 microcontroller using Plumerai’s software stack to bring unmatched efficiency to TinyML.

Lukas Geiger is a deep learning researcher at Plumerai working on new training methods and architectures for improving accuracy and efficiency of Binarized Neural Networks (BNNs). He is the author of the open-source Larq training library and core developer of the Plumerai software stack for deploying BNNs on embedded platforms.

==========================

Watch on YouTube:
Lukas Geiger

Download presentation slides:
Lukas Geiger

Feel free to ask your questions on this thread and keep the conversation going!

Hi Lukas, thanks again for the great talk. For the results you reported, did you have native popcount instructions? Would the addition of that or other simple nonstandard custom instructions to the ISA improve your performance significantly?
Cheers,
Tim C.

Thank you very much for the great questions; it is amazing to see so many people interested in Binarized Neural Networks (BNNs). We did our best to answer all of your questions below. Please let us know if you have more follow-up questions. We are very happy to answer all of them!

General questions

What application today should already be using TinyML (current tradeoffs make sense)?

TinyML is already used in many devices where microcontrollers perform relatively easy tasks such as classifying vibrations and sounds. However, it is much more difficult to use microcontrollers for more complex tasks, like person presence detection. Our goal is to make deep learning so efficient that, when put on cheap and low-power microcontrollers, it can be used for a whole new class of devices. The person presence detection example that we showed during the talk can be implemented to trigger push notifications for smart home cameras, wake up devices when a person is detected, and detect the occupancy of meeting rooms. With more work, similar technology can be used to count the number of people waiting in a queue, detect whether a shelf in a store is empty, classify hand gestures to control appliances, and much more.

Is the neuron model like McCulloch-Pitts?

Yes, in its most basic form BNNs are similar to McCulloch-Pitts models, although usually activations are restricted to {-1, 1} instead of {0,1} and in BNNs the weights are binary as well. However, as with high-precision networks, one shouldn’t lean too heavily on the similarity to biological systems. For example, the comparison breaks down when bringing in high-precision residual connections, which are often crucial for good performance of BNNs.
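To make the comparison concrete, here is a minimal sketch (ours, in NumPy, not from the talk) of a classic McCulloch-Pitts threshold unit next to a BNN-style unit with weights and activations in {-1, +1}:

```python
import numpy as np

def mcculloch_pitts(inputs, weights, threshold):
    # Classic threshold unit: inputs and output in {0, 1}.
    return int(np.dot(weights, inputs) >= threshold)

def binarized_unit(inputs, weights):
    # BNN-style unit: inputs, weights and output all in {-1, +1}.
    return 1 if np.dot(weights, inputs) >= 0 else -1

x01 = np.array([1, 0, 1])      # {0, 1} inputs for the McCulloch-Pitts unit
x_pm = 2 * x01 - 1             # the same pattern mapped to {-1, +1}
w_pm = np.array([1, -1, 1])    # binary weights

print(mcculloch_pitts(x01, np.array([1, 1, 1]), threshold=2))  # -> 1
print(binarized_unit(x_pm, w_pm))                              # -> 1
```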

What’s the performance and computation gap between int8-quantized and binarized neural networks?

This question was partly addressed during the live recording and is very dependent on the type of model and the exact type of hardware used. In (Bannink et al., 2021) we present a detailed analysis of our open-source ARMv8-A binary convolution implementation, with a measured speedup of between 9x and 12x on the Pixel 1 phone. For a full model comparison on the Cortex-M4, please see slide 34.

How is batch normalization done in a BNN if the output is just +/-1?

The output of a binary convolution is an integer and not a single bit value since there is an accumulation after the elementwise XNOR operation. Therefore, we can directly apply batch normalization after binarized layers. The output activations will then be binarized before the following binary layer.
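To make this concrete, here is a minimal sketch (ours, in NumPy, not Plumerai's implementation) showing that the XNOR-and-popcount formulation yields the same integer accumulation as a {-1, +1} dot product, which batch normalization can then act on directly:

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)

# Binary weights and activations in {-1, +1}.
w = rng.choice([-1, 1], size=n)
x = rng.choice([-1, 1], size=n)

# Reference: plain integer dot product.
acc_reference = int(np.dot(w, x))

# Bit-packed view: encode +1 as bit 1 and -1 as bit 0.
w_bits = (w > 0).astype(np.uint8)
x_bits = (x > 0).astype(np.uint8)

# XNOR marks the positions where the signs agree; popcount sums them.
agree = np.count_nonzero(~(w_bits ^ x_bits) & 1)

# Recover the signed accumulation: agreements count +1, disagreements -1.
acc_xnor = 2 * agree - n

assert acc_reference == acc_xnor
# Batch normalization is then applied to this integer, and the result is
# re-binarized before the next binary layer.
```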

Based on your example where shifting the image by one pixel changes the model's prediction: is it possible that the binarization process affects the translation invariance of the usual convolutional neural network?

This is an effect that is not exclusive to BNNs. Common full precision CNNs also break translational invariance due to the use of zero padding, pooling or striding. For more information on this effect please see the analysis presented in (Zhang, R., 2019).
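A tiny sketch (ours, in plain NumPy) shows the effect in isolation: a stride-2 max pool already produces a different output when the input is shifted by a single sample, with no binarization involved:

```python
import numpy as np

def max_pool_stride2(x):
    # Non-overlapping stride-2 max pooling on a 1-D signal.
    return np.maximum(x[0::2], x[1::2])

signal = np.array([0, 9, 0, 0, 0, 9, 0, 0])
shifted = np.roll(signal, 1)  # the same signal shifted by one sample

print(max_pool_stride2(signal))   # [9 0 9 0]
print(max_pool_stride2(shifted))  # [0 9 0 9] -> a one-sample shift changes the output
```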

Quantization introduces biases in the overall model, and we need to make sure that the model performs well on various slices of the data. Did you test the BNN (for person detection, for example) against that?

Yes, we tested our BNN against that, both during training and as part of our model validation process. In the case of BNNs, quantization already happens during training (quantization-aware training) rather than after the fact (post-training quantization), so the model can already adapt to additional biases introduced by the quantization.

Are BNNs more suitable for microcontroller-based applications rather than high-end applications?

BNNs can require less memory than 8-bit deep learning models. This makes them a relatively good fit for microcontrollers since they often have limited amounts of memory. However, it also depends on the architecture of the microcontroller and the instructions that are available.
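As a rough, back-of-the-envelope illustration (our numbers, not from the talk), weight storage alone shrinks by 8x when moving from 8-bit to 1-bit weights:

```python
# Illustrative weight-storage comparison only; real models also need
# activation buffers and typically keep some layers in higher precision.
num_weights = 1_000_000
int8_storage_bytes = num_weights * 8 // 8    # 1,000,000 bytes (~1 MB)
binary_storage_bytes = num_weights * 1 // 8  # 125,000 bytes  (~125 KB)
print(int8_storage_bytes, binary_storage_bytes)
```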

Can you share some tricks for implementing efficient BNN inference on ARM Cortex?

Our BNN inference stack for ARM Cortex-M is proprietary, but our implementation for ARM Cortex-A is open source and available on GitHub at larq/compute-engine (a highly optimized inference engine for Binarized Neural Networks). This should also serve as a good starting point for efficient implementations on other platforms.

Regarding datasets for a binarised NN environment: I was thinking of icon-sized images (e.g. 96 x 96). Is this possible with a binarised NN?

This is possible. In fact, the benchmarks shown on slide 34 use an input resolution of 96 x 96.

Questions regarding Larq

Is that [Larq] better than TFMOT's 1-bit quantization results? How is it different?

I wasn't aware that TFMOT allows for 1-bit quantization. One big difference compared to Larq is that TFMOT doesn't offer a way to deploy BNNs efficiently. From my personal experience when I tried TFMOT for 8-bit quantization, I think Larq and TFMOT follow two different API design philosophies. TFMOT tries to make quantization-aware training very simple for beginners by converting an existing Keras model into a quantized version. This works very well if your use case matches exactly what TFMOT was designed for, but for me it felt very cumbersome as soon as models contained operations that TFMOT does not support directly or the quantization had to be customized.
Larq, on the other hand, provides quantizable layers that allow users to build quantized models from first principles and aims for a progressive disclosure of complexity, from simple ready-to-use models all the way to completely custom implementations of quantizers and layers. This approach means that it integrates seamlessly with all TensorFlow Keras features, including advanced uses of multi-GPU and automatic FP16 training on GPUs, which to my knowledge aren't supported by TFMOT.
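As an illustration of that layer-based approach, here is a minimal sketch of a small binarized Keras model built with Larq (adapted from the patterns shown in the public Larq documentation; see https://larq.dev for the exact, current API):

```python
import tensorflow as tf
import larq as lq

# Shared quantization settings: a sign quantizer with a straight-through
# estimator for inputs and kernels, plus weight clipping on the latent weights.
kwargs = dict(
    input_quantizer="ste_sign",
    kernel_quantizer="ste_sign",
    kernel_constraint="weight_clip",
)

model = tf.keras.Sequential([
    # The first layer keeps full-precision inputs; only its kernel is binarized.
    lq.layers.QuantConv2D(
        32, (3, 3),
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip",
        use_bias=False,
        input_shape=(96, 96, 3),
    ),
    tf.keras.layers.BatchNormalization(scale=False),

    lq.layers.QuantConv2D(64, (3, 3), use_bias=False, **kwargs),
    tf.keras.layers.BatchNormalization(scale=False),
    tf.keras.layers.MaxPooling2D((2, 2)),

    tf.keras.layers.Flatten(),
    lq.layers.QuantDense(2, use_bias=False, **kwargs),
    tf.keras.layers.Activation("softmax"),
])

lq.models.summary(model)  # summary including binarized parameter counts
```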

In Larq Zoo, the QuickNet models consist of conv layers with 32-bit MACs. How do you decide on a layer's quantization?

We briefly discussed this question in the live Q&A. For QuickNet the decision was made by observing the accuracy versus latency tradeoff: we only binarized the layers where the bulk of the computation is spent. Therefore, 1x1 convolutions are often not quantized, since the small latency improvements from binarization wouldn't justify a potential loss in accuracy. Please note that the remaining full-precision operations can usually be quantized to 8-bit without a loss in accuracy.

Will the Larq training library support a PyTorch backend in the near future?

We do not plan to support PyTorch in the near future. Our aim with Larq is to provide a highly integrated and easy to use end-to-end solution for binarized and other extremely quantized neural networks covering both research and deployment. Therefore we are focusing on supporting a single framework the best we can without compromising usability. We made this decision very early on in the development of Larq and do not plan to change this any time soon.

Have you planned to apply ternarisation with your tool?

Ternarisation is already supported natively in Larq on the training side, and you can very easily provide custom quantizers. However, we currently do not have any plans to support ternary networks in Larq Compute Engine for fast inference.
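As an illustration of the custom-quantizer route, here is a minimal sketch (ours) of a ternary {-1, 0, +1} quantizer with a straight-through gradient passed to a Larq layer; Larq also ships built-in quantizers, so check the Larq documentation before rolling your own:

```python
import tensorflow as tf
import larq as lq

def ste_ternary(threshold=0.05):
    """Ternary quantizer to {-1, 0, +1} with a straight-through gradient."""
    @tf.custom_gradient
    def quantizer(x):
        quantized = tf.where(
            x > threshold, tf.ones_like(x),
            tf.where(x < -threshold, -tf.ones_like(x), tf.zeros_like(x)),
        )
        def grad(dy):
            # Straight-through estimator: pass gradients where |x| <= 1.
            return dy * tf.cast(tf.abs(x) <= 1.0, dy.dtype)
        return quantized, grad
    return quantizer

# A quantized dense layer using the custom ternary quantizer for both
# its inputs and its kernel (illustrative only).
layer = lq.layers.QuantDense(
    64,
    input_quantizer=ste_ternary(),
    kernel_quantizer=ste_ternary(),
    kernel_constraint="weight_clip",
)
```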

Questions regarding the live demo

[What is the] size of person detector model?

The details for the models used during the demo are shown on slide 38.

Concerning data, do you automate the detection and identification of failure cases, or is it done with a human in the loop?

We use a mix of both approaches, where the detection can be automated if the failure is already part of our test set, but further analysis is usually done manually. However, we are continuously working on streamlining and automating this process to speed up model iteration.

Why don't you use LIME for the analysis of false negatives?

We didn't do any extensive comparisons between interpretability methods for analysing failure cases. In the example shown, we used Randomized Input Sampling, which provided the insights we were looking for.
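For readers unfamiliar with the method, here is a simplified sketch (ours) of the core idea behind Randomized Input Sampling (RISE): random binary masks are applied to the input, and the saliency map is the score-weighted average of those masks. The scoring function and image below are placeholders, not the demo's person detector:

```python
import numpy as np

def rise_saliency(score_fn, image, num_masks=500, grid=7, p=0.5, seed=0):
    """Score-weighted average of random masks (the core idea of RISE)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    saliency = np.zeros((h, w))
    cell = (h // grid + 1, w // grid + 1)
    for _ in range(num_masks):
        # Coarse random binary grid, upsampled (here by simple repetition)
        # to image resolution.
        coarse = (rng.random((grid, grid)) < p).astype(float)
        mask = np.kron(coarse, np.ones(cell))[:h, :w]
        score = score_fn(image * mask[..., None])  # mask the input, query the model
        saliency += score * mask
    return saliency / (num_masks * p)

# Usage with a placeholder scoring function standing in for a detector's confidence.
dummy_image = np.random.rand(96, 96, 3)
dummy_score = lambda img: float(img.mean())
heatmap = rise_saliency(dummy_score, dummy_image)
```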

From what I understood, you trained the Plumerai network on many corner cases where it didn’t perform well. Is MobileNet also trained this way? If not, is it a fair comparison between the Plumerai model and the base MobileNet?

Was the MobileNet model in the demo trained on the same dataset as the Plumerai model?

No, MobileNet was taken straight from the TensorFlow Lite Micro example. We showed the performance of MobileNet here to put our results into perspective and were not aiming to give a quantitative comparison between the two models. For a direct comparison between both models trained only on the COCO Visual Wake Words dataset, please see slide 34.

Was the Binarized model trained on images from the office?

Yes, the model was finetuned on a diverse set of images including ones from office environments.

Compared to the MobileNet baseline, the binarized network has a similar latency (it is even slower) and not even a 3x smaller footprint. I would have expected a stronger impact from binary neural networks. How much potential for optimizing this network is still there?

The binarized NN at 858 ms is not faster than the TFLite Micro version (713 ms). Can you explain?

The two models use different architectures so the comparison does not show the impact of binarization in isolation. There are still a lot of improvements possible both in terms of novel model architectures and training algorithms for binary networks as well as improved hardware that can accelerate binary convolutions natively.

I’m not familiar with ARM Cortex ISA – is there an efficient popcount? Would your BNNs benefit greatly from the addition of one or more popcount instructions?

Indeed, the Cortex-M4 ISA does not include a dedicated popcount instruction, so we need to emulate it in software using multiple instructions, which adds latency. An efficient popcount instruction would greatly improve performance on the Cortex-M4.
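For illustration, the arithmetic behind a typical software popcount emulation is the classic parallel bit-count shown below (a Python sketch of ours; on a Cortex-M4 the same steps map to a handful of C or assembly instructions per 32-bit word):

```python
def popcount32(x):
    """Count the set bits of a 32-bit word with the classic parallel bit-count trick."""
    x = x - ((x >> 1) & 0x55555555)                 # sums of adjacent bit pairs
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)  # sums of 4-bit groups
    x = (x + (x >> 4)) & 0x0F0F0F0F                 # sums of 8-bit groups
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24    # add the four byte sums

assert popcount32(0) == 0
assert popcount32(0b1011) == 3
assert popcount32(0xFFFFFFFF) == 32
```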

The real-time demo was quite interesting, thank you. It was nice to show things in real life, but, just like the accuracy, it feels like limited evidence (albeit in favor of the binarized model). Do you have any performance figures to at least give an idea of how the performance of the models changes on large-scale data with binarization?

For a quantitative comparison please see slide 34, where we evaluate two models on the MS COCO Visual Wake Words dataset, which is a common benchmark task for microcontrollers.

What size FPGA do you need for the person detection model you showed?

This mainly depends on your latency requirements. Slide 38 gives you more information about the size of the model used during the live demo which might help you to find the right FPGA.