tinyML Talks on January 4, 2022 “Demoing the world’s fastest inference engine for Arm Cortex-M” by Cedric Nugteren

We held our next tinyML Talks webcast. Cedric Nugteren from Plumerai presented Demoing the world’s fastest inference engine for Arm Cortex-M on January 4, 2022.


Recently we announced Plumerai’s inference engine for 8-bit deep learning models on Arm Cortex-M microcontrollers. We showed that it is the world’s most efficient on MobileNetV2, beating TensorFlow Lite for Microcontrollers with CMSIS-NN kernels by 40% in terms of latency and 49% in terms of RAM usage with no loss in accuracy. However, that was just on a single network and it might have been cherry-picked. Therefore, we will give a live demonstration of a new service that you can use to test your own models with our inference engine. In this talk we will explain what we did to get these speedups and memory improvements and we will show benchmarks for the most important publicly available neural network models.

Cedric Nugteren is a software engineer focused on writing efficient code for deep learning applications. After he received his MSc and PhD from Eindhoven University of Technology, he optimized GPU and CPU code for various companies using C++, OpenCL, and CUDA. He then worked for 4 years on deep learning for autonomous driving at TomTom, after which he joined Plumerai, where he is now writing fast code for the smallest microcontrollers.

=========================

Watch on YouTube:
Cedric Nugteren: https://www.youtube.com/watch?v=ComEgcN7KfY

Download presentation slides:
Cedric Nugteren: https://cms.tinyml.org/wp-content/uploads/talks2021/tinyML_Talks_Cedric_Nugteren_220104.pdf

Feel free to ask your questions on this thread and keep the conversation going!


Thank you very much for the talk, Cedric. Does your inference engine support any way to run models using gates? Things like dynamically skipping a layer, adjusting the number of filters…

Thank you for your interest. I do not think this is possible with either Plumerai’s inference engine or TensorFlow Lite for Microcontrollers. If you have any specific model or .tflite file in mind then please let us know. You could of course consider splitting your network into multiple parts, implementing the ‘gating’ yourself (e.g. in C++), and running the different network parts separately on the inference engine.
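As a rough illustration of that suggestion (this is not Plumerai’s API; the toy model, layer shapes, and file names below are made up for the example), a network can be split into two separately converted .tflite parts, with the gating decision left to the application code that invokes the inference engine on each part:

```python
import tensorflow as tf

# Toy two-stage network; in practice you would split your own architecture.
inp = tf.keras.Input(shape=(32, 32, 3))
stage1 = tf.keras.layers.Conv2D(8, 3, activation="relu")(inp)
part1 = tf.keras.Model(inp, stage1, name="part1")  # always runs

inp2 = tf.keras.Input(shape=part1.output_shape[1:])
stage2 = tf.keras.layers.Conv2D(16, 3, activation="relu")(inp2)
part2 = tf.keras.Model(inp2, stage2, name="part2")  # runs only when the gate is open

# Convert each part to its own .tflite file; the "gate" (run part2 or not)
# then becomes plain if/else logic in the application code on the device.
for part in (part1, part2):
    converter = tf.lite.TFLiteConverter.from_keras_model(part)
    with open(f"{part.name}.tflite", "wb") as f:
        f.write(converter.convert())
```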

Are similar gains to be expected when running on the recent ARM MCUs (Cortex-M55) with MVE extensions?

We have not tested our inference engine on Cortex-M55. We will do this in the near future as such MCUs become more available. We do expect similar gains, although it will of course depend on how well other inference engines such as TensorFlow Lite for Microcontrollers perform on such MCUs.

For the models on which Plumerai’s inference engine didn’t perform well, could you please elaborate on what you perceive the reasons to be?

We showed a few models for which Plumerai’s inference engine used slightly more RAM than TFLM, but did a lot better in terms of speed. These are actually very small models with only 3 layers, and in this particular situation we can trade off RAM usage for latency. So if a particular use case requires it, we can tune this knob and do better on RAM at the cost of some of the latency gains.

Do you have a more detailed look at the performance for different tunings of kernel size, number of filters, and type of convolution (depth-wise vs. spatial or dense), which would provide a finer-grained comparison with TensorFlow Lite for Microcontrollers? Thanks

Indeed, performance varies for different kinds of layers and with different configurations. There are, however, quite a lot of different situations and there is no simple relation between speed-up and layer parameters, so we suggest just trying it out with your own model using our Plumerai Benchmark service at https://plumerai.com/benchmark.
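As an illustration of how such an exploration could be set up (the input shape and layer variants below are arbitrary examples, not an official Plumerai tool), one can generate tiny single-layer models for the configurations of interest and submit each resulting .tflite file to the benchmark service:

```python
import tensorflow as tf

def single_layer_model(layer):
    # Wrap a single layer in a model so it can be converted on its own.
    inp = tf.keras.Input(shape=(96, 96, 8))
    return tf.keras.Model(inp, layer(inp))

variants = {
    "conv2d_3x3_16": tf.keras.layers.Conv2D(16, 3, padding="same"),
    "conv2d_5x5_16": tf.keras.layers.Conv2D(16, 5, padding="same"),
    "depthwise_3x3": tf.keras.layers.DepthwiseConv2D(3, padding="same"),
    "dense_64":      tf.keras.layers.Dense(64),
}

for name, layer in variants.items():
    converter = tf.lite.TFLiteConverter.from_keras_model(single_layer_model(layer))
    with open(f"{name}.tflite", "wb") as f:
        f.write(converter.convert())  # upload each file to plumerai.com/benchmark
```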

What happens to the models once processed on your server?

We do store models for internal use, to enable us to improve our inference engine further. We will not reuse, reverse engineer, or steal your model, and will not sell or provide your data to third parties.

If you want us to test your model without storing it, or if you have other concerns, please contact us at hello@plumerai.com.

Did you benchmark against the GLOW compiler?

We did not compare with the Glow compiler yet. However, this is a good suggestion and we will try it out soon.

Any memory limits for the models to be benchmarked?

Our public benchmarking service runs on two specific boards. Both boards have 2MB of flash, while the RAM sizes are 640KB and 1MB respectively. See https://plumerai.com/benchmark for details.

What input data do you use? Zeroes? Random noise? Something else?

We use random input data for the models. However, this should not matter, since we do not test for accuracy, and latency and RAM usage are deterministic with respect to the input provided.
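As an illustration, shape- and dtype-matched random input like this is easy to generate for the desktop TFLite interpreter as well (this is not the MCU engine itself, and "model.tflite" is a placeholder filename):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
shape = tuple(inp["shape"])

# Random data matching the model's input shape and dtype.
if np.issubdtype(inp["dtype"], np.integer):
    info = np.iinfo(inp["dtype"])
    data = np.random.randint(info.min, info.max + 1, size=shape, dtype=inp["dtype"])
else:
    data = np.random.random_sample(shape).astype(inp["dtype"])

interpreter.set_tensor(inp["index"], data)
interpreter.invoke()
```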

Does Plumerai “replace” all 3 tasks of the inference flow that you mentioned with its proprietary engine, or do you still make use of CMSIS-NN and only have your own engine for the other 2 tasks covered by TF?

Plumerai’s inference engine is built on top of TensorFlow Lite for Microcontrollers and does still make use of it (and thus also CMSIS-NN) for less common layers. The main advantage here is that any model that runs on TFLM runs with our inference engine.

Is your engine compatible with full static memory allocation like TensorFlow Lite Micro (no malloc)?

Yes, there is no malloc inside our inference engine; all memory allocation is static. Most of it is done by our memory planner.

Are you exploring different quantization parameters, such as 4 or 2-bit?

This question is beyond the scope of this presentation, but yes, at Plumerai we are not only considering INT8 and 1-bit layers but also looking at other quantization parameter combinations for both weights and activations. We use whatever it takes to build the most accurate and efficient deep learning models for embedded hardware.

Would Cedric consider running these models in the future on AWS ARM-Cortex virtual MCUs to provide elastic scaling of this service for people wanting to compare their models for inference performance?

The AWS virtual MCUs are definitely something we are considering using. Not just for easier scaling but also to access newer MCUs that are not yet widely available on development boards, such as the Cortex-M55.

Can your inference engine run on any of the Risc-V cores? If so, do you have any benchmarking data?

Plumerai’s inference engine also runs on RISC-V. However, we do not have any recent benchmarks for this. If you are interested in this, I encourage you to contact us at hello@plumerai.com.

Will these slides be available offline?

Yes, you can find the slides here: https://cms.tinyml.org/wp-content/uploads/talks2021/tinyML_Talks_Cedric_Nugteren_220104.pdf

And a recording of the presentation can be found here: https://www.youtube.com/watch?v=ComEgcN7KfY

When reporting peak memory usage, does the Plumerai engine also consider the extra memory needed for the im2col operation?

Yes. In our benchmark results we include the memory needed for model tensors, for temporary buffers (such as for im2col), and for any other objects stored in RAM. The only thing we do not include is stack usage, which should be around 1 or 2KB. For other inference engines we do not include stack usage either, so the comparison is fair.

Are your benchmarks of the 50 models run with model-agnostic or model-specific optimizations?

The benchmarks shown in the slide are with Plumerai’s inference engine with everything enabled. So indeed, for every run the compiler performs model-specific optimizations.

Hi Cedric,

I’m still very new to the field of TinyML and TensorFlow. As you already thought, it is not so easy to deactivate e.g. single channels. However, I am currently writing my thesis on this specific topic and managed to get something like this working in TensorFlow Lite. So far these are just some proof-of-concept models, like a tiny MNIST CNN. For each inference you input a data tensor and a configuration tensor which sets the channel widths for the conv layers.

Can I let you know once I get a larger model (maybe something like MobileNet V1) running? It would be very interesting to see the performance difference.
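A purely illustrative sketch of the idea described above (the actual thesis implementation may well differ): a second “configuration” input acts as a per-channel mask on a conv layer’s output. Note that zeroing channels this way keeps the computation shape fixed, so on its own it does not reduce latency.

```python
import tensorflow as tf

data_in = tf.keras.Input(shape=(28, 28, 1), name="data")
mask_in = tf.keras.Input(shape=(16,), name="channel_mask")  # one 0/1 entry per channel

x = tf.keras.layers.Conv2D(16, 3, activation="relu")(data_in)
mask = tf.keras.layers.Reshape((1, 1, 16))(mask_in)
x = tf.keras.layers.Multiply()([x, mask])       # zero out the disabled channels
x = tf.keras.layers.GlobalAveragePooling2D()(x)
out = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model([data_in, mask_in], out)
model.summary()
```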

Sure! If you test it with TensorFlow Lite for Microcontrollers locally and it works there, then you can send the model to me (‘myfirstname’@plumerai.com) or directly run it on our public benchmarking service. If it doesn’t work with TensorFlow Lite for Microcontrollers then most likely it won’t work with Plumerai’s inference engine either.

Hi Cedric, as you already initially imagined, it’s not that straightforward to run these dynamic models with TensorFlow Lite for Microcontrollers. The model I have come up with so far requires the “FlexConv2D” op, which is not supported by tf.lite.OpsSet.TFLITE_BUILTINS, only by tf.lite.OpsSet.SELECT_TF_OPS, which however does not seem to work with TFLM. I’m currently looking into the internals of TFLM and how I might implement these ops or fix the model to work with a normal Conv2D.
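For reference, the converter setup described here corresponds roughly to the following sketch (the model is a trivial placeholder for the dynamic one); SELECT_TF_OPS makes the converter emit “Flex” ops such as FlexConv2D, which TFLM cannot run:

```python
import tensorflow as tf

# Trivial placeholder standing in for the dynamic model described above.
inp = tf.keras.Input(shape=(28, 28, 1))
model = tf.keras.Model(inp, tf.keras.layers.Conv2D(4, 3)(inp))

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # builtin TFLite ops: these run on TFLM
    tf.lite.OpsSet.SELECT_TF_OPS,    # Flex ops (e.g. FlexConv2D): not supported by TFLM
]
tflite_model = converter.convert()
```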