Two tinyML Talks on June 9, 2020: 1) “SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers” by Igor Fedorov (Arm); 2) “tinyML doesn’t need Big Data, it needs Great Data” by Dominic Binks (Audio Analytic)

We held our seventh tinyML Talks webcast on June 9, 2020, with two presentations:
Igor Fedorov from Arm ML Research presented SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers at 8:00 AM Pacific Time, and Dominic Binks from Audio Analytic presented tinyML doesn’t need Big Data, it needs Great Data at 8:30 AM Pacific Time.

Igor Fedorov (left) and Dominic Binks (right)

The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 7.4x smaller than previous approaches, while meeting the strict MCU working memory constraint.
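To make the "strict MCU working memory constraint" concrete, here is a minimal sketch of a budget check. It is not the paper's code. The memory model (model size as the count of nonzero weights plus biases, working memory as the largest input-plus-output feature map pair of any layer), the `ConvLayer` fields, and the byte sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    """One convolution layer; 'same' padding and stride 1 are assumed."""
    in_channels: int
    out_channels: int
    kernel: int
    density: float = 1.0  # fraction of weights kept after pruning (assumed)


def model_size_bytes(layers, bytes_per_weight=1):
    """Storage for nonzero weights and biases; sparse index overhead is ignored."""
    total = 0
    for layer in layers:
        weights = layer.kernel * layer.kernel * layer.in_channels * layer.out_channels
        total += int(weights * layer.density) + layer.out_channels
    return total * bytes_per_weight


def working_memory_bytes(layers, height, width, bytes_per_activation=1):
    """Peak RAM if each layer's input and output feature maps must coexist."""
    peak = 0
    for layer in layers:
        in_bytes = height * width * layer.in_channels
        out_bytes = height * width * layer.out_channels
        peak = max(peak, in_bytes + out_bytes)
    return peak * bytes_per_activation


def fits(layers, height, width, flash_budget, ram_budget):
    """True if the model meets both the storage and the working-memory budget."""
    return (model_size_bytes(layers) <= flash_budget
            and working_memory_bytes(layers, height, width) <= ram_budget)


if __name__ == "__main__":
    net = [ConvLayer(1, 8, 3), ConvLayer(8, 16, 3, density=0.25)]
    print(model_size_bytes(net), working_memory_bytes(net, 28, 28))
    print(fits(net, 28, 28, flash_budget=32 * 1024, ram_budget=2 * 1024))
```

In this toy case the pruned weights fit easily, but the activations of the second layer do not fit in 2 KB of RAM. Under these assumptions, pruning the weights alone is not enough, and the search also has to shrink the feature maps.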

Igor Fedorov is a member of the Arm Machine Learning Lab, working on neural network optimization for Arm hardware. His work covers network pruning, quantization, and architecture search methods. Prior to Arm, Igor completed a PhD in Electrical Engineering at the University of California, San Diego, working on Bayesian learning algorithms for signal processing.

Data is the fuel which drives ML. Good quality, realistic, diverse data is essential to train and evaluate tinyML models. Obtaining good quality data, even for something as pervasive as audio, is not as easy as it may seem. This talk will discuss some of the challenges of obtaining and processing good quality audio data for sound recognition tasks and the ways Audio Analytic has overcome those problems.
Topics covered include:

  • what are good sources and bad sources
  • how to gather good quality audio data
  • employing complex labelling strategies
  • using the data to evaluate performance.

While this is not specifically a tinyML problem, the challenges of running at the edge across disparate devices make it more acute, and it is shared by other tinyML applications.

As an expert in embedded software, Dominic Binks is chiefly responsible for designing and overseeing the architecture of the company’s successful ai3™ sound recognition software platform, which was demonstrated running on a Cortex-M0+ chip at tinyML Summit in February 2020. Before joining the company he held a number of positions at Qualcomm in Cambridge, UK and San Diego, where he was part of Qualcomm’s core Android porting team.


Watch on YouTube:
Igor Fedorov
Dominic Binks

Download presentation slides:
Igor Fedorov
Dominic Binks

Feel free to ask your questions on this thread and keep the conversation going!

What is the exact time for this webinar? On this page it is 8:30am PT, but on the register page it is 8:00am PT. Thank you!

Hi Renzizhe,
The webinar will be from 8 am to 9 am PDT. It will include two presentations:
“SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers” presented by Igor Fedorov will start at 8:00 am Pacific time
“tinyML doesn’t need Big Data, it needs Great Data” presented by Dominic Binks will start at 8:30 AM Pacific time.
Thank you.

Thank you Olga! My bad. I just saw the time for second talk. :joy:

About SpArSe architecture searching, how/when will ARM make such techniques easy to try? (Assuming we have a GPU enabled machine or server farm to use for it.)

Not a problem, I hope you enjoyed the presentation.

Enjoyed the talk, a couple of questions for @DominicBinks:

  1. What feature extraction is appropriate for tinyML audio recognition models, balancing compute and memory requirements against detection and rejection performance?
  2. What do you think is the role of synthetic data in audio recognition tasks?

Appreciate your inputs, thanks!

Thanks for the question @hsanghvi. I discussed this with our training experts and these were the answers I got.

  1. The choice of features is one of the results of the optimisation process across real data. There is not a one-size-fits-all answer as the optimal features depend on the optimal model.

  2. The use of synthetic data is accepted for training but must be avoided for validation or testing. As such, synthetic training data could be viewed as one of the training parameters. However, it is an intractably large space to search if considering the optimisation of data synthesis to improve classification performance. It is therefore much more efficient and tractable to train on real data of the type mentioned in our talk. In all cases, validation and testing can only guarantee product performance when using real field data that is representative of the application, such as the one we use to develop our products.
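The rule in the second answer (synthetic data may be used for training, never for validation or testing) can be enforced when the dataset is split. The sketch below is only an illustration, not Audio Analytic's pipeline; the sample schema with a boolean `synthetic` key and the split fractions are assumptions.

```python
import random


def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Split samples so that validation and test sets contain real data only.

    samples: list of dicts with a boolean 'synthetic' key (hypothetical schema).
    """
    real = [s for s in samples if not s["synthetic"]]
    synthetic = [s for s in samples if s["synthetic"]]

    rng = random.Random(seed)
    rng.shuffle(real)

    n_val = int(len(real) * val_fraction)
    n_test = int(len(real) * test_fraction)
    val = real[:n_val]
    test = real[n_val:n_val + n_test]
    # Synthetic samples are only ever appended to the training set.
    train = real[n_val + n_test:] + synthetic
    return train, val, test


if __name__ == "__main__":
    data = [{"id": i, "synthetic": i >= 10} for i in range(15)]
    train, val, test = split_dataset(data)
    print(len(train), len(val), len(test))
```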

Here are the questions we received for Igor’s session:

  1. What devices are actually considered edge devices? The RPi 3B+ has more than enough memory, but it is considered an edge device for edge inference, so should we consider an Arduino with 2KB flash as an edge device and the rest of them (RPi, NVIDIA Jetson Nano) as more consumer-level products?
    [Igor] We are considering devices like the Arduino Uno (2KB RAM) or the micro:bit (16KB RAM).

  2. Did you experiment with random network mutations in the morphism?
    [Igor] At the beginning, we sample completely randomly. It is a slow process, but we are bound to find an answer.
    In the following stages, we only sample configurations from the best known configurations.

  3. Is your framework available online? How extensible is it for new optimization targets?
    [Igor] Currently, it is not open sourced, but you can follow the paper and extend it to other targets.

  4. Mainstream AI applications still focus on video and audio. Why are sensor-data-level AI applications not popular yet? Does anyone have some insight into this phenomenon?

  5. Given that a simple 2-layer CNN with GAP and softmax can achieve over 98% accuracy on MNIST with under 5k parameters, which fits in 5k RAM, where do you see the value of Bayesian search for smaller networks? Is a 3x improvement good enough?

  6. How do you measure the power? Is it only the MCU or the complete system (MCU + memory)?
    [Igor] We use a development board and connect a power meter to it for measurement.

  7. Given the limited storage and RAM on tinyML platforms, how does one set a limit to keep the model size from growing too large, and for self-pruning?

  8. Can you comment about the potential in your view of using pruning in combination with extreme quantization (e.g. binarization). Have you tried going beyond 8 bit to 2 or even 1 bit? Thank you in advance.
    [Igor] We are using 8 bit. In the context of such small models, it is pretty much equivalent to floating point. There is already some work in the literature on combining pruning and quantization. In the meantime, we want to make sure the model can be deployed with standard tools.

  9. For the motivation and the use cases in general, what is the $ amount increase in price to couple MCUs with 420 KB flash and 391 KB RAM?

  10. Are there use cases that demand such severely constrained small footprints, or is the motivation to utilize available HW?

  11. When you talk about quantization, do you refer to post-training or pre-training quantization?
    [Igor] During training.

  12. Is there any plan on making the work open-source?
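Igor's answer to question 2 describes a two-stage sampling strategy: sample completely at random first, then sample only from the best known configurations. The sketch below is a heavily simplified illustration of that idea, not the SpArSe implementation (which is not open source). The search space `WIDTHS`, the `mutate` step, and especially `toy_score`, which stands in for training and evaluating a real network, are all hypothetical.

```python
import random

# Hypothetical search space: channel widths for a two-layer network.
WIDTHS = [4, 8, 16, 32, 64]


def random_config(rng):
    """Draw one configuration uniformly at random."""
    return (rng.choice(WIDTHS), rng.choice(WIDTHS))


def mutate(config, rng):
    """Move one layer's width to a neighbouring value in WIDTHS (clamped at the ends)."""
    widths = list(config)
    i = rng.randrange(len(widths))
    pos = WIDTHS.index(widths[i]) + rng.choice([-1, 1])
    widths[i] = WIDTHS[min(max(pos, 0), len(WIDTHS) - 1)]
    return tuple(widths)


def toy_score(config):
    """Stand-in for 'train and evaluate': rewards capacity, penalises size."""
    capacity = sum(config)
    size = config[0] * config[1]
    return capacity - 0.05 * size


def search(n_random=20, n_refine=40, top_k=3, seed=0):
    rng = random.Random(seed)
    scored = {}
    # Stage 1: sample completely at random.
    for _ in range(n_random):
        config = random_config(rng)
        scored[config] = toy_score(config)
    # Stage 2: only sample around the best known configurations.
    for _ in range(n_refine):
        best = sorted(scored, key=scored.get, reverse=True)[:top_k]
        config = mutate(rng.choice(best), rng)
        scored[config] = toy_score(config)
    return max(scored, key=scored.get)


if __name__ == "__main__":
    best = search()
    print(best, toy_score(best))
```

In a real search each call to the scoring function means training a network, which is why the purely random first stage is slow.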

Thanks @dbinks, appreciate your response.

Here are the questions from Dominic’s session:

What are the current (most important) research questions in audio and tinyML?
[Dominic] From a machine learning perspective our researchers are looking into how you help machines to understand the context of sounds over the medium to long term using LA-LSTM networks. Two of our researchers recently blogged about their work in this space (
From a tinyML perspective the question is always about marrying compactness and performance. If one negatively impacts the other then you don’t have a practical solution for consumer devices. We’ve done amazingly well in this regard, as demonstrated by the M0+ demo back at the tinyML Summit in February but we keep pushing the boundaries.

Is the dataset open source? Why not use open source datasets like AudioSet and FreeSound?
[Dominic] Our Alexandria dataset, which we built for the specific task of sound recognition, is not open source.
We don’t use the types of datasets that you mention because they are not suitable for giving consumer devices a sense of hearing, for quite a few significant legal, ethical and technical reasons.
You can condense the reasons into three points. The first is that in order to train models you need permission from the content owner to use the underlying audio data commercially – audio and video content uploaded to YouTube, for example, is still owned by the person who uploaded it.
The second and third reasons are technical in nature and linked. They are quality and diversity. To train effective models you need to understand everything about where a recording comes from: what devices were used to capture it, what the conditions were in the environment in which it was recorded, and whether any codecs have been applied. Data from these types of websites are unknown quantities; you’d be building a model based on data that you don’t fully understand.
As highlighted in my webcast, sound recognition is a dedicated field in its own right, like speech and image recognition are. As a result it needs specific approaches to data collection, model training, model evaluation and compression. So in order to do it properly we had to build our own dataset, which now includes over 15 million labelled sound events, across over 700 label types and with 200 million metadata points.

If you train a model using extremely good data and then deploy it on a system that has one of the microphones you showed (much worse frequency response), does it still work as well?
[Dominic] Yes, we use a range of techniques to ensure a degree of robustness. We train our model on a lot of data. It comes not only from high-quality microphones, but also from different devices and different instances of devices.

How many samples do you need to sufficiently represent a sound (e.g. baby crying)?
[Dominic] In short, a lot - especially if you factor in the need to think about sound globally and the differences in each country that have a significant impact on sound. However, it varies from sound to sound. Some sounds are more challenging than others. For example, take a smoke or CO alarm. It is a fairly simple product that emits a beep in a set pattern. And then look at a window glass break. The smoke alarm appears easier to acquire and record, whereas the glass break recording requires the facilities to break lots of different types of glass, in lots of different frames, using lots of different tools and in lots of different environments. With glass break it is easy to quickly comprehend the sheer scale of operation required. Building a smoke alarm model requires less data, but still a lot, as not all smoke or CO alarms in the world work to the same standard; for example, outside of the US many products don’t follow the T3/T4 standard. If you train using only a narrow range of alarms then it won’t necessarily work for all customers who depend on the detection of a smoke alarm to protect their property.
As a result, there is no hard and fast rule. Each sound is unique to train and requires a deep knowledge of the applications and the acoustic space.

Dominic, how do you capture data sets which include environmental effects such as reverberation?
[Dominic] We collect a huge amount of audio data from real environments, as well as data collected in our anechoic and semi-anechoic sound labs. The audio collected in our labs is clean of environmental effects such as reverb, but because it has been collected using something akin to a ‘green screen’ we can then augment that data using some really clever auralisation techniques. With the augmented data we deliberately apply real environmental effects to provide us with a challenging, realistic and robust training set.
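One common, generic way to add reverberation to a clean ("dry") recording is to convolve it with a room impulse response. The sketch below shows only that textbook idea and makes no claim to reflect Audio Analytic's auralisation techniques. The synthetic impulse response (exponentially decaying noise parameterised by an RT60 time) is an assumption; a real pipeline would more likely use measured or simulated room responses.

```python
import math
import random


def synthetic_impulse_response(sample_rate, rt60_seconds, seed=0):
    """Exponentially decaying noise as a crude stand-in for a room response."""
    rng = random.Random(seed)
    n = int(sample_rate * rt60_seconds)
    # The envelope falls by 60 dB (a factor of 1000 in amplitude) over n samples.
    decay = math.log(1000.0) / n
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * i) for i in range(n)]
    ir[0] = 1.0  # direct sound arrives first at full level
    return ir


def convolve(signal, kernel):
    """Direct-form linear convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def add_reverb(dry, ir):
    """Convolve a dry signal with an impulse response and normalise the peak to 1."""
    wet = convolve(dry, ir)
    peak = max(abs(x) for x in wet) or 1.0
    return [x / peak for x in wet]


if __name__ == "__main__":
    sample_rate = 100  # deliberately tiny so the pure-Python loop stays fast
    dry = [1.0] + [0.0] * 9  # a single click
    ir = synthetic_impulse_response(sample_rate, rt60_seconds=0.25)
    wet = add_reverb(dry, ir)
    print(len(dry), len(wet))
```

For real audio rates an FFT-based convolution would be the practical choice; the direct loop is used here only to keep the example dependency-free.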

When you say you never needed to quantize your models because they are already very small, what size range are we talking about? As an example, what size is a model coming out of your pipeline if, let’s say, that model is able to detect 10 different sounds?
[Dominic] The baby crying model I showed is 10KB, which certainly works for the 32-bit MCUs.
The model has to be of a certain size in order to fit into the MCU. We size the model according to the needs we have. We have to use various techniques to generate a model that is right for the platform.

Do you train using overlapping sounds with different labels?
[Dominic] Yes.

If you know that your data came from a non-standard microphone, and you deployed a model trained on those data, would you then recommend to try to tune the response of the end-device to match that of the microphone where the data comes from?
[Dominic] See my earlier answer; our models are trained to withstand variations in microphone quality.

How does one capture anomalous data? It is easier to replicate normal data, but it is the anomalous data that can break a model.
[Dominic] The key is a solid understanding of the end application, and then you design your data collection around what you know and what you learn. For example, if you are training a baby cry model for use in a child’s nursery there are already sounds that you can eliminate because they won’t be present. We use our taxonomy of sound to guide this process. From there we know the target sounds as well as the non-target sounds. I think a lot of people fail to understand that sound recognition is as much about training a system to ignore the sounds you don’t want as it is about recognising the sounds you want to identify.
That’s why you need experts and a rich, large and diverse dataset and why public sources just don’t cut it.

If you’re capturing massive amounts of audio data, how long does it take to train a model on it?
[Dominic] It depends on the complexity of the target sound and the non-target sounds, as well as the target application.

Have you tried wavelet transforms (instead of Short Time Fourier Transforms) in embedded devices for spectral analysis of your audio signal?
[Dominic] Over the last 10 years we’ve applied a wide range of techniques and we continue to do so.

What is the data collection strategy for rejecting non-baby sounds in a baby cry detection model?
[Dominic] As mentioned before, it comes down to carefully constructing a data collection plan around the target sound, non-target sounds and the application.

What is the smallest platform (MCU, RAM, Flash) where you have implemented your solution and what is the use case?
[Dominic] We run our demo on a 72 MHz Cortex-M0+ device (96 KB RAM, 128 KB Flash). Most customer devices are more powerful than this, like Cortex-M4/M7 devices. You do need a certain amount of horsepower to get through the audio processing part.

Have you worked in the area of preventive maintenance in manufacturing, e.g. factory equipment failure?
[Dominic] Yes.

Do you utilize any data augmentation techniques to enhance the dataset to cover variations of a particular sound that may not be captured live?
[Dominic] Yes, augmentation is amongst the techniques we use. We have an extensive understanding of environmental factors and this enables us to create realistic and diverse augmented data. In a practical sense this means understanding the differences between a dog bark in a brick house in Cambridge, UK versus a wooden house in San Francisco. It saves on the hassle of having to fly a lot of dogs around the world.