Are similar gains to be expected when running on the recent ARM MCUs (Cortex-M55) with MVE extensions?
We have not tested our inference engine on Cortex-M55. We will do this in the near future as such MCUs become more available. We do expect similar gains, although it will of course depend on how well other inference engines such as TensorFlow Lite for Microcontrollers perform on such MCUs.
For the models on which Plumerai’s inference engine didn’t perform well, could you elaborate on what you perceive the reasons to be?
We showed a few models for which Plumerai’s inference engine used slightly more RAM than TFLM but did a lot better in terms of speed. These are actually very small models with only 3 layers, and in this particular situation we can trade off RAM usage for latency. So if a particular situation requires it, we can tune this knob and use less RAM at the cost of some of the latency gains.
Is there a fine-grained look at performance for different tunings of kernel size, number of filters, and type of convolution (depthwise vs. spatial vs. dense), providing a more detailed comparison with TensorFlow Lite for Microcontrollers? Thanks
Indeed, performance varies across different kinds of layers and configurations. However, there are quite a lot of different situations and no simple relation between speed-up and layer parameters, so we suggest just trying it out for your own model using our Plumerai Benchmark service at https://plumerai.com/benchmark.
What happens to the models once processed on your server?
We do store models for internal use, to enable us to improve our inference engine further. We will not reuse, reverse engineer, or steal your model, and will not sell or provide your data to third parties.
If you want us to test your model without storing it, or if you have other concerns, please contact us at hello@plumerai.com.
Did you benchmark against the GLOW compiler?
We have not compared with the Glow compiler yet. However, this is a good suggestion and we will try it out soon.
Any memory limits for the models to be benchmarked?
Our public benchmarking service runs on two specific boards. Both boards have 2MB of flash, while their RAM sizes are 640KB and 1MB. See https://plumerai.com/benchmark for details.
What input data do you use? Zeroes? Random noise? Something else?
We use random input data for the models. However, this should not matter, since we do not test for accuracy, and the latency and RAM usage numbers are deterministic with respect to the input provided.
Does Plumerai “replace” all 3 tasks belonging to inference that you mentioned with a proprietary engine, or do you still make use of CMSIS-NN and only have your own engine for the other 2 tasks covered by TF?
Plumerai’s inference engine is built on top of TensorFlow Lite for Microcontrollers and does still make use of it (and thus also CMSIS-NN) for less common layers. The main advantage here is that any model that runs on TFLM runs with our inference engine.
Is your engine compatible with full static memory allocation like TensorFlow Lite Micro (no malloc)?
Yes, there is no malloc inside our inference engine; all memory allocation is static. Most of it is done by our memory planner.
Are you exploring different quantization parameters, such as 4 or 2-bit?
This question is beyond the scope of this presentation, but yes, at Plumerai we are not only considering INT8 and 1-bit layers but also looking at other quantization parameter combinations for both weights and activations. We use whatever it takes to build the most accurate and efficient deep learning models for embedded hardware.
Would Cedric consider running these models in the future on AWS ARM-Cortex virtual MCUs to provide elastic scaling of this service for people wanting to compare their models for inference performance?
The AWS virtual MCUs are definitely something we are considering using. Not just for easier scaling but also to access newer MCUs that are not yet widely available on development boards, such as the Cortex-M55.
Can your inference engine run on any of the Risc-V cores? If so, do you have any benchmarking data?
Plumerai’s inference engine also runs on RISC-V. However, we do not have any recent benchmarks for this. If you are interested in this, I encourage you to contact us at hello@plumerai.com.
Will these slides be available offline?
Yes, you can find the slides here: https://cms.tinyml.org/wp-content/uploads/talks2021/tinyML_Talks_Cedric_Nugteren_220104.pdf
And a recording of the presentation can be found here: https://www.youtube.com/watch?v=ComEgcN7KfY
When reporting peak memory usage, does the Plumerai engine also consider the extra memory needed for the im2col operation?
Yes. In our benchmark results we include memory needed for model tensors, for temporary buffers (such as for im2col), and for any other objects stored in the RAM. The only thing we do not include is the stack usage, which should be around 1 or 2KB. For other inference engines we do not include stack usage either, so the comparison is fair.
Are your benchmarks of the 50 models run with model-agnostic or model-specific optimizations?
The benchmarks shown in the slide are with Plumerai’s inference engine with everything enabled. So indeed, for every run the compiler performs model-specific optimizations.