Two tinyML Talks on May 28, 2020: 1) "Low-cost neural network inferencing on the edge with" by Laszlo Kindrat (XMOS); 2) "Low Power Embedded Gesture Recognition Using Novel Short-Range Radar Sensors" by Michele Magno (ETH Zurich)

We held our sixth tinyML Talks webcast with two presentations:
Laszlo Kindrat from XMOS presented "Low-cost neural network inferencing on the edge with" and Michele Magno from ETH Zurich presented "Low Power Embedded Gesture Recognition Using Novel Short-Range Radar Sensors" on May 28, 2020 at 8:00 AM and 8:30 AM Pacific Time, respectively.

Laszlo Kindrat (left) and Michele Magno (right)

XMOS recently launched its next-generation crossover processor, featuring a novel vector unit designed for low-precision integer and binarized neural network inference. In this session, we introduce the core ideas behind this vector unit and explain how it deviates from a traditional load-store architecture to enable high throughput when calculating convolutions. We go on to outline the software tools and libraries that enable users to take full advantage of the hardware. Lastly, we demonstrate our optimization and deployment tools, based on TensorFlow Lite for Microcontrollers, by converting and analyzing a variant of MobileNet.

Laszlo Kindrat is a Machine Learning Engineer at XMOS in Hampton, NH. He contributed to the design of the XS3 architecture’s vector unit and plays a key role in developing inference tools for the xCORE-AI platform. Laszlo also has experience with traditional and machine learning algorithm development. He holds a PhD in applied mathematics from the University of New Hampshire.

Human-computer interface (HCI) is an attractive scenario, and a wide range of solutions, strategies, and technologies have been proposed recently. A promising novel sensing technology is high-frequency short-range Doppler radar. This talk presents a low-power, high-accuracy embedded hand-gesture recognition system using low-power short-range radar sensors from Acconeer. A 2D Convolutional Neural Network (CNN) using range-frequency Doppler features is combined with a Temporal Convolutional Neural Network (TCN) for time sequence prediction. The final algorithm has a model size of only 45723 parameters, yielding a memory footprint of only 91 kB. We acquired two datasets containing 11 challenging hand gestures performed by 26 different people, with a total of 20210 gesture instances. The algorithm achieved an accuracy of up to 92% on the 11 hand gestures. Furthermore, we implemented the prediction algorithm on the GAP8 parallel ultra-low-power RISC-V processor and on ARM Cortex-M processors. The hardware-software solution matches the requirements for battery-operated wearable devices.
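
As a quick sanity check on the footprint quoted above, the sketch below multiplies the parameter count by a storage width. The 2 bytes per parameter (for example 16-bit fixed point) is an assumption inferred from the two numbers; the abstract itself does not state the parameter format.

```python
# Rough check of the model footprint quoted in the abstract.
# ASSUMPTION: each parameter is stored in 2 bytes (e.g. 16-bit fixed point);
# the abstract gives only the parameter count and the total footprint.
NUM_PARAMETERS = 45_723      # from the abstract
BYTES_PER_PARAMETER = 2      # assumed, not stated in the abstract


def footprint_bytes(num_parameters: int, bytes_per_parameter: int) -> int:
    """Return the storage needed for the parameters alone, in bytes."""
    return num_parameters * bytes_per_parameter


total = footprint_bytes(NUM_PARAMETERS, BYTES_PER_PARAMETER)
print(f"{total} bytes = {total / 1000:.1f} kB")
```

Under that assumption the parameters alone account for roughly the quoted 91 kB; activations and code would come on top of this.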

In this talk, I will also explain how to access our labeled hand gesture recognition database, collected with the short-range, low-power radar from Acconeer.

Michele Magno is a senior scientist and head of the Project-based Learning Centre at ETH Zurich. He has been working at ETH since 2013 and has been a visiting lecturer or professor at several universities, namely the University of Nice Sophia, France; Enssat Lannion, France; the University of Bologna; and Mid Sweden University.

Dr. Magno is a Senior Member of IEEE, a finalist for the ETH Spark Award 2018, and a recipient of many other awards and grants. His background is in computer science and electrical engineering.


Watch on YouTube:
Laszlo Kindrat
Michele Magno

Download presentation slides:
Laszlo Kindrat
Michele Magno

Feel free to ask your questions on this thread and keep the conversation going!

When will the slides be available and where can I get a copy? Thanks

The slides and video will be available on the forum tomorrow.
Thank you.

FYI, links to the videos and slides are now posted at the bottom of the original post.

Here are the audience questions that we didn’t have time to get to during the session:

  • Which kind of workload can you efficiently run on the Logical Cores besides CNNs or matmul-like kernels?
  • What is the inference time for MobileNetV2?
  • Can you do backpropagation as well (e.g. on some of the top layers)?
  • Any toolkits or books to recommend?

Here are more questions and answers from the webinar:

Q: Does the chip support both floating point and int8 operators?

A: Our port of the runtime supports all built-in operators that have reference implementations in the TFLite for Microcontrollers (TFLM) project. This includes a variety of floating point as well as int8 quantized operators. The vector unit, on the other hand, implements only integer arithmetic, so for maximal performance we recommend using int8 quantized models on our platform.
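
To make "int8 quantized" concrete, here is a minimal sketch of the affine quantization scheme used by TFLite-style quantized models, where a real value is approximated as `scale * (q - zero_point)`. The helper names are illustrative only and are not part of any XMOS or TFLM API; in practice the TFLite converter chooses these parameters for you.

```python
# Affine int8 quantization: real_value ~= scale * (quantized_value - zero_point).
# Helper names are hypothetical; they only illustrate the arithmetic.
from typing import List, Tuple

INT8_MIN, INT8_MAX = -128, 127


def clamp_int8(value: int) -> int:
    """Saturate an integer to the int8 range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def choose_params(lo: float, hi: float) -> Tuple[float, int]:
    """Pick scale and zero point so that [lo, hi] maps onto the int8 range."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must contain 0.0
    if hi == lo:
        raise ValueError("value range must not be empty")
    scale = (hi - lo) / (INT8_MAX - INT8_MIN)
    zero_point = clamp_int8(round(INT8_MIN - lo / scale))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map real values to int8 codes, saturating at the range limits."""
    return [clamp_int8(round(v / scale) + zero_point) for v in values]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 codes back to approximate real values."""
    return [scale * (c - zero_point) for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = choose_params(min(weights), max(weights))
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step, and 0.0 is represented exactly, which matters for zero padding in convolutions.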

Q: What is the size of the TFLite for Microcontrollers runtime?

A: The size of the runtime depends on which subset of operators is registered, which is in turn governed by which model(s) you want to run on it. This can range from as little as ~20KB to as much as ~140KB when all built-in and xcore-specific custom operators are registered. The runtime overhead for the model in the demo was under 100KB.

It is worth noting that the TFLite flatbuffer model does not include allocated memory for input/output data and activation tensors. Moreover, the additional application overhead will depend on what other components are needed for the full system (e.g. FreeRTOS, mic array, HW interfaces).

Q: Do you have a demo for wakeword detection?

A: You can deploy any TFLite model on our platform whose operators are supported by TFLM, so you can train/convert your own wakeword model or use a pretrained one available online. We have partnered with multiple wakeword providers for our commercial applications. Please contact us if you would be interested in a demo of one of these.

Q: Can you say more about the deployment workflow?

A: Stay tuned for a follow-up post on our blog where, among other things, we will give some more detail on the deployment workflow.

Q: Which kind of workload can you efficiently run on the logical cores besides CNNs or matmul-like kernels?

A: In addition to specialized instructions designed to speed up long sequences of dot products (and therefore CNNs, and matrix-vector multiplications), the vector unit also implements vectorized integer operations (e.g. ADD, SUB, MUL, MACC), as well as butterfly operations to speed up FFTs. Using our math and DSP libraries you can speed up a wide variety of computationally intensive applications. Moreover, the xcore architecture has been designed to be blazing fast at IO processing tasks where hard real-time guarantees are necessary.
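
For readers unfamiliar with the term, the sketch below shows what a butterfly operation is in a radix-2 FFT: one twiddle multiply followed by an add and a subtract. This is plain floating-point Python for illustration only; it does not model the fixed-point vector instructions or the XMOS DSP library, and the function name is made up for this example.

```python
import cmath
from typing import List, Sequence


def fft_radix2(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        # Butterfly: combine one even and one odd term into two outputs.
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


impulse_spectrum = fft_radix2([1, 0, 0, 0, 0, 0, 0, 0])
constant_spectrum = fft_radix2([1, 1, 1, 1])
```

A unit impulse transforms to a flat spectrum and a constant signal to a single DC bin, which is a handy check that the butterflies are wired correctly. Hardware support for the add/subtract pair is what makes the inner loop cheap.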

Q: Does each of the logical cores have an ALU that includes a multiply-accumulate unit?

A: The 8 logical cores on a tile share an ALU, which includes the floating point and integer vector units. Due to the 5-stage instruction execution pipeline in the ALU, at any given time any 5 of the 8 cores can execute instructions (e.g. a vectorized multiply-accumulate) simultaneously. Each of the 8 cores has its own set of registers, so the execution of instructions in one core will never interfere with another.

Q: Does your converter perform quantization or is it the user’s responsibility?

A: The quantization is performed by the TFLite converter, and we provide a helper that configures and calls the converter in a single line of code. In the future we will wrap this and our optimizer/converter so that the conversion from a Keras model becomes a one-stop shop.

Q: What is the inference time for MobileNetV2?

A: We are working on benchmarking several sample models, including different variants of MobileNet. More performance metrics will be published as development kits become available to the general public.

Q: How is the optimization connected to compilation in the converter?

A: The model optimization step happens before compilation. After the optimizer has transformed the model, you can use it the same way as you would use any other model in a TFLM project. Typically, you would integrate the model and the runtime with the rest of your embedded application (e.g. a camera interface or FreeRTOS instance), before you compile it.

Q: Can you do backpropagation as well (e.g. on some of the top layers)?

A: We are currently only targeting inference, particularly for low bit-width (e.g. quantized and binarized) models.

Q: Any idea how much demo boards will cost and when they’ll be available?

A: The first dev kits will be available to the public at the end of 2020Q3. Register your interest, and we will keep you up to date.

Q: Does this chip support LSTMs?

A: The TFLM project currently does not support recurrent networks, but support is planned. As soon as the reference operators are introduced in TFLM, we will look at how we can speed them up on

Q: Is this a totally new CPU architecture or is it based on an existing one?

A: The chip implements the third generation of the xcore architecture, which is based on XMOS’s proprietary instruction set.

Q: Can you give further insight on the mapping process? It seems that there could be millions of transformations: how does XMOS find the most efficient implementation?

A: The optimizer/converter works analogously to a pre-compilation optimization tool. Similarly to a compiler, the implementation itself consists of a sequence of transformation passes (canonicalization, optimization, cleanup, etc.) applied in multiple stages. These passes mutate the computational graph of the model, just like a compiler would mutate an intermediate representation.

Since the set of possible optimizations is indeed very large, we are constantly working on expanding the optimizations we implement. Our approach is to focus our efforts on optimizing the operators that require the most computational resources and/or are used most often. To improve the performance, we test various implementations of the operator, then implement the necessary transformation passes to enable these new implementations.
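
The pass-based structure described above can be sketched roughly as follows. This is a toy illustration, not the XMOS optimizer: the graph is a flat list of operator dictionaries, and the two passes and all names are hypothetical.

```python
from typing import Callable, Dict, List

Op = Dict[str, object]        # e.g. {"type": "CONV_2D"}; toy representation
Graph = List[Op]              # toy linear graph; real model graphs are DAGs
Pass = Callable[[Graph], Graph]


def remove_identity(graph: Graph) -> Graph:
    """Canonicalization pass: drop no-op IDENTITY ops so later patterns match."""
    return [dict(op) for op in graph if op["type"] != "IDENTITY"]


def fuse_relu(graph: Graph) -> Graph:
    """Optimization pass: fold a RELU into the CONV_2D directly before it."""
    out: Graph = []
    for op in graph:
        can_fuse = (
            op["type"] == "RELU"
            and bool(out)
            and out[-1]["type"] == "CONV_2D"
            and out[-1].get("fused_activation") is None
        )
        if can_fuse:
            out[-1] = {**out[-1], "fused_activation": "RELU"}
        else:
            out.append(dict(op))
    return out


def run_passes(graph: Graph, passes: List[Pass]) -> Graph:
    """Apply each pass in order; every pass returns a new graph."""
    for graph_pass in passes:
        graph = graph_pass(graph)
    return graph


model = [{"type": "CONV_2D"}, {"type": "IDENTITY"}, {"type": "RELU"}, {"type": "SOFTMAX"}]
optimized = run_passes(model, [remove_identity, fuse_relu])
```

Pass order matters here: the fusion pattern only matches once the identity op between the convolution and the activation has been removed, which is one reason such tools run canonicalization before optimization.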

Q: Is there a throughput number, e.g., TOPS, TOPS/W, or inferences/sec on reference network?

A: The theoretical peak throughput of the chip is 51.2 GMACC/s in int8 mode, under full utilization of the vector unit on all cores, across both tiles. We are working on benchmarking several sample models and we will publish performance metrics as soon as they are available.
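
Until benchmarks are published, the peak figure can at least be turned into a lower bound on latency. The sketch below does that arithmetic; the model size of 300 million multiply-accumulates and the 50% utilization are hypothetical inputs chosen for illustration, not measurements, and real inference times will be higher than this bound.

```python
PEAK_MACS_PER_SECOND = 51.2e9   # theoretical int8 peak quoted in the answer


def min_latency_ms(model_macs: float, utilization: float = 1.0) -> float:
    """Lower bound on inference time in milliseconds for a given MAC count.

    utilization is the assumed fraction of peak throughput actually achieved.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return model_macs / (PEAK_MACS_PER_SECOND * utilization) * 1e3


# HYPOTHETICAL workload: a model needing 300 million multiply-accumulates.
HYPOTHETICAL_MODEL_MACS = 300e6

best_case_ms = min_latency_ms(HYPOTHETICAL_MODEL_MACS)
half_utilization_ms = min_latency_ms(HYPOTHETICAL_MODEL_MACS, utilization=0.5)
print(f"best case: {best_case_ms:.2f} ms, at 50% utilization: {half_utilization_ms:.2f} ms")
```

The bound ignores memory traffic, non-MAC operators, and scheduling overhead, so it is only useful for ruling out workloads that cannot fit a latency budget even in the ideal case.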

Q: What is the typical power consumption during inference?

A: The chip has just recently come back from the foundry, so precise power benchmarks will be available when the characterization has been finished. The power consumption will be a few hundred mW at 700MHz, but the chip can also be clocked down.

Q: How are the model parameters stored and accessed? There is 512 KB of RAM but that is enough only for a very small model. Can it execute from Flash? How big is the Flash/ROM?

A: There is 512KB of SRAM on each tile, and there are two tiles on a chip. The chip supports execution from FLASH, as well as LPDDR1 memory, the size of which depends on what’s required by the application. Our deployment tools will allow models to store coefficients in either type of external memory and efficiently access those at runtime. If the model fits in SRAM, however, the vector unit has single cycle access to all coefficients.
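
As a rough illustration of the trade-off described above, the sketch below decides whether a model's coefficients could sit in one tile's SRAM alongside everything else that has to live there. The function, the single-tile simplification, and the example sizes are assumptions made for this sketch; the actual deployment tools may place data differently, for example by splitting it across tiles.

```python
# ASSUMPTION: "512KB" means 512 * 1024 bytes, and everything must fit on one tile.
SRAM_PER_TILE_BYTES = 512 * 1024


def coefficient_placement(coeff_bytes: int, other_sram_bytes: int) -> str:
    """Return "sram" if the coefficients fit next to the other SRAM users, else "external".

    other_sram_bytes covers runtime code, activation tensors, and application data.
    """
    if coeff_bytes < 0 or other_sram_bytes < 0:
        raise ValueError("sizes must be non-negative")
    fits = coeff_bytes + other_sram_bytes <= SRAM_PER_TILE_BYTES
    return "sram" if fits else "external"


KIB = 1024
# Hypothetical budgets: 100 KiB runtime plus 150 KiB of activations and app data.
other = 100 * KIB + 150 * KIB
small_model = coefficient_placement(200 * KIB, other)
large_model = coefficient_placement(400 * KIB, other)
```

In this example the 200 KiB model stays in SRAM, where the vector unit has single-cycle access to the coefficients, while the 400 KiB model would need flash or LPDDR1.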