Here are the questions from Dominic’s session:
What are the current (most important) research questions in audio and tinyML?
[Dominic] From a machine learning perspective, our researchers are looking into how to help machines understand the context of sounds over the medium to long term using LA-LSTM networks. Two of our researchers recently blogged about their work in this space (https://www.audioanalytic.com/putting-sound-into-context/).
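The LA-LSTM specifics are covered in the linked blog post. As a rough, hypothetical illustration of the general idea – aggregating frame-level audio features over a longer window to model context – a plain LSTM sketch in PyTorch might look like this (the layer sizes and the `ContextModel` name are illustrative assumptions, not Audio Analytic's architecture):

```python
import torch
import torch.nn as nn

class ContextModel(nn.Module):
    """Illustrative only: an LSTM that aggregates per-frame audio
    features over a long window to classify the acoustic context."""
    def __init__(self, n_features=40, hidden=64, n_contexts=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_contexts)

    def forward(self, x):              # x: (batch, frames, n_features)
        out, _ = self.lstm(x)          # per-frame hidden states
        return self.head(out[:, -1])   # last state as a long-term summary

# e.g. ~30 s of audio as 40-dim feature frames at ~30 frames/s
logits = ContextModel()(torch.randn(1, 900, 40))
```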
From a tinyML perspective, the question is always about marrying compactness and performance. If one negatively impacts the other, then you don’t have a practical solution for consumer devices. We’ve done amazingly well in this regard, as demonstrated by the M0+ demo at the tinyML Summit in February, but we keep pushing the boundaries.
Is the dataset open source? Why not use open source datasets like AudioSet and FreeSound?
[Dominic] Our Alexandria dataset, which we built for the specific task of sound recognition, is not open source.
We don’t use the types of datasets you mention because they are not suitable for giving consumer devices a sense of hearing, for quite a few significant legal, ethical and technical reasons.
You can condense the reasons into three points. The first is that in order to train models you need permission from the content owner to use the underlying audio data commercially – audio and video content uploaded to YouTube, for example, is still owned by the person who uploaded it.
The second and third reasons are technical in nature and linked: 1 – quality, and 2 – diversity. To train effective models you need to understand everything about where a recording comes from: what devices were used to capture it, what the conditions were in the environment in which it was recorded and whether any codecs have been applied. Data from these types of websites are unknown quantities; you’d be building a model based on data that you don’t fully understand.
As highlighted in my webcast, sound recognition is a dedicated field in its own right, just as speech and image recognition are. As a result, it needs specific approaches to data collection, model training, model evaluation and compression. So in order to do it properly we had to build our own dataset, which now includes over 15m labelled sound events across over 700 label types, with 200m metadata points.
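Alexandria's schema is not public, but to make "labelled sound events with metadata points" concrete, a single record might look something like the following purely hypothetical sketch (every field name here is an assumption):

```python
# Hypothetical record shape -- illustrative only, not Alexandria's actual schema.
sound_event = {
    "label": "smoke_alarm",             # one of the 700+ label types
    "audio_file": "recordings/0001.wav",
    "start_s": 3.2, "end_s": 7.9,       # event boundaries within the file
    "metadata": {                       # provenance-style metadata points
        "microphone": "MEMS, known model",
        "device": "smart-speaker prototype",
        "environment": "semi-anechoic lab",
        "codec": "none (raw PCM)",
        "country": "UK",
    },
}
```

This is exactly the provenance information (device, environment, codec) that is unknowable for scraped datasets.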
If you train a model using extremely good data and then deploy it on a system that has one of the microphones you showed (much worse frequency response), does it still work as well?
[Dominic] Yes, there is a range of techniques we use to ensure a degree of robustness. We train our models on a lot of data, captured not only with high-quality microphones but across different devices and different instances of the same device.
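Audio Analytic has not published these techniques; one common way to build in this kind of robustness is to band-limit clean training audio so it resembles what a cheaper microphone would capture. A minimal sketch, with guessed cutoff frequencies:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simulate_cheap_mic(audio, sr=16000, low_hz=150.0, high_hz=6000.0):
    """Crudely band-limit clean audio to mimic a lower-quality microphone's
    narrower frequency response. Cutoffs are illustrative guesses."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, audio)

clean = np.random.randn(16000)          # stand-in for 1 s of clean audio
degraded = simulate_cheap_mic(clean)    # train on both versions
```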
How many samples do you need to sufficiently represent a sound (e.g. baby crying)?
[Dominic] In short, a lot – especially if you factor in the need to think about sound globally and the country-to-country differences that have a significant impact on sound. However, it varies from sound to sound; some sounds are more challenging than others. For example, take a smoke or CO alarm. It is a fairly simple product that emits a beep in a set pattern. Then look at a window glass break. The smoke alarm appears easier to acquire and record, whereas the glass break recording requires the facilities to break lots of different types of glass, in lots of different frames, using lots of different tools and in lots of different environments. With glass break it is easy to quickly comprehend the sheer scale of operation required. Building a smoke alarm model requires less data, but still a lot, as not all smoke or CO alarms in the world work to the same standard – for example, outside of the US many products don’t follow the T3/T4 standard. If you train only using a narrow range of alarms, then it won’t necessarily work for all customers who depend on the detection of a smoke alarm to protect their property.
As a result, there is no hard and fast rule. Each sound is unique to train and requires a deep knowledge of the applications and the acoustic space.
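To see why the alarm case sits at the "easier" end of the spectrum, the T3 temporal pattern itself can be synthesised in a few lines (the tone frequency below is a guess – real products vary by model and region, which is exactly why broad data collection still matters):

```python
import numpy as np

def t3_pattern(sr=16000, tone_hz=3100.0):
    """One cycle of the T3 temporal pattern: three 0.5 s beeps with
    0.5 s gaps, followed by a 1.5 s pause."""
    beep = np.sin(2 * np.pi * tone_hz * np.arange(int(0.5 * sr)) / sr)
    gap = np.zeros(int(0.5 * sr))
    pause = np.zeros(int(1.5 * sr))
    return np.concatenate([beep, gap, beep, gap, beep, pause])

cycle = t3_pattern()   # ~4 s of audio
```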
Dominic, how do you capture data sets which include environmental effects such as reverberation?
[Dominic] We collect a huge amount of audio data from real environments, as well as data collected in our anechoic and semi-anechoic sound labs. The audio collected in our labs is free of environmental effects such as reverb, but because it has been collected using something akin to a ‘green screen’ we can then augment that data using some really clever auralisation techniques. With the augmented data we deliberately apply real environmental effects to provide us with a challenging, realistic and robust training set.
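The auralisation techniques themselves are proprietary, but a standard building block for this kind of augmentation is convolving a clean "green screen" recording with a room impulse response (RIR). A minimal sketch using a toy synthetic RIR:

```python
import numpy as np
from scipy.signal import fftconvolve

def auralise(clean, rir):
    """Apply a room impulse response to a clean lab recording to add
    realistic reverberation; output is rescaled to the input's level."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-9) * np.max(np.abs(clean))

clean = np.random.randn(16000)                                  # stand-in clean clip
rir = np.random.randn(4000) * np.exp(-np.linspace(0, 8, 4000))  # toy decaying RIR
reverberant = auralise(clean, rir)
```

In practice the RIRs would be measured in (or modelled on) real rooms rather than synthesised.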
When you say you never needed to quantize your models because they are already very small, what size range are we talking about? As an example, what size does a model coming out of your pipeline have, let’s say if that model is able to detect 10 different sounds?
[Dominic] The baby crying model I showed is 10KB, which certainly works for 32-bit MCUs.
The model has to be of a certain size in order to fit into the MCU. We size the model according to the needs we have. We have to use various techniques to generate a model that is right for the platform.
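For intuition on how little a 10KB budget buys, some back-of-the-envelope arithmetic for a hypothetical tiny classifier (the layer sizes are assumptions, not the actual model):

```python
# Rough parameter-count arithmetic -- illustrative layer sizes only.
layers = [(40, 64), (64, 32), (32, 10)]       # (inputs, outputs) per dense layer
params = sum(i * o + o for i, o in layers)    # weights + biases = 5,034
print(params * 1 / 1024, "KB at int8")        # ~4.9 KB
print(params * 4 / 1024, "KB at float32")     # ~19.7 KB
```

Hence sizing the model to the platform from the start, rather than shrinking a large model after the fact.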
Do you train using overlapping sounds with different labels?
[Dominic] Yes.
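How this is done internally is not specified; the standard formulation for overlapping sounds is multi-label training, where each example carries a multi-hot target and each class gets an independent sigmoid. A minimal PyTorch sketch with made-up class names:

```python
import torch
import torch.nn as nn

classes = ["baby_cry", "dog_bark", "glass_break"]  # made-up label set
features = torch.randn(1, 40)                      # one feature frame
target = torch.tensor([[1.0, 1.0, 0.0]])           # baby cry AND dog bark overlap

model = nn.Linear(40, len(classes))                # stand-in classifier
loss = nn.BCEWithLogitsLoss()(model(features), target)  # per-class sigmoids
```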
If you know that your data came from a non-standard microphone, and you deployed a model trained on those data, would you then recommend trying to tune the response of the end device to match that of the microphone the data came from?
[Dominic] See my earlier answer – our models are trained to withstand variations in microphone quality.
How does one capture anomalous data? It is easier to replicate normal data, but it is the anomalous data that can break a model.
[Dominic] The key is a solid understanding of the end application; you then design your data collection around what you know and what you learn. For example, if you are training a baby cry model for use in a child’s nursery, there are already sounds that you can eliminate because they won’t be present. We use our taxonomy of sound to guide this process. From there we know the target sounds as well as the non-target sounds. I think a lot of people fail to understand that sound recognition is as much about training a system to ignore the sounds you don’t want as it is about recognising the sounds you do.
That’s why you need experts and a rich, large and diverse dataset and why public sources just don’t cut it.
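As a toy illustration of taxonomy-guided planning (all entries below are invented, and the real taxonomy is far richer), the non-target list falls out once you eliminate sounds the application context rules out:

```python
# Toy illustration of deriving target / non-target sets from a taxonomy.
taxonomy = {"baby_cry", "adult_speech", "tv_audio", "dog_bark",
            "glass_break", "lawnmower", "car_horn"}
targets = {"baby_cry"}
ruled_out_by_context = {"lawnmower", "car_horn"}    # won't occur in a nursery
non_targets = taxonomy - targets - ruled_out_by_context

# Collect data for both sets: the model must learn to recognise the
# targets and to actively ignore the confusable non-targets.
```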
If you’re capturing massive amounts of audio data, how long does it take to train a model on it?
[Dominic] It depends on the complexity of the target sound and the non-target sounds, as well as the target application.
Have you tried wavelet transforms (instead of Short Time Fourier Transforms) in embedded devices for spectral analysis of your audio signal?
[Dominic] Over the last 10 years we’ve applied a wide range of techniques and we continue to do so.
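For readers who want to compare the two transforms the question mentions, here is a minimal sketch using SciPy and PyWavelets (purely for illustration, not a statement about what runs on Audio Analytic’s devices):

```python
import numpy as np
from scipy.signal import stft
import pywt  # PyWavelets

sr = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)   # 1 s test tone

# Short-Time Fourier Transform: fixed time-frequency resolution.
f, t, Z = stft(x, fs=sr, nperseg=256)

# Continuous wavelet transform: resolution varies with scale.
coefs, freqs = pywt.cwt(x, scales=np.arange(1, 64), wavelet="morl",
                        sampling_period=1 / sr)
```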
What is the data collection strategy for rejecting non-baby sounds in a baby cry detection model?
[Dominic] As mentioned before, it comes down to carefully constructing a data collection plan around the target sound, non-target sounds and the application.
What is the smallest platform (MCU, RAM, Flash) where you have implemented your solution and what is the use case?
[Dominic] We run our demo on a 72MHz Cortex-M0+ device (96KB RAM, 128KB Flash). Most customer devices are more powerful than this, such as Cortex-M4/M7 devices. You do need a certain amount of horsepower to get through the audio processing part.
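A rough RAM budget shows why this is feasible on the quoted part (every buffer size below is an illustrative assumption):

```python
# Back-of-the-envelope RAM budget for a 96 KB device -- assumed figures.
audio_buffer_kb = 16000 * 2 / 1024    # 1 s of 16 kHz, 16-bit audio ≈ 31.3 KB
model_kb = 10                         # e.g. the 10 KB baby-cry model
scratch_kb = 8                        # FFT/feature working memory (guess)
used = audio_buffer_kb + model_kb + scratch_kb
print(f"~{used:.0f} KB used of 96 KB")   # ~49 KB, leaving stack headroom
```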
Have you worked in the area of preventive maintenance in manufacturing, e.g. factory equipment failure?
[Dominic] Yes.
Do you utilize any data augmentation techniques to enhance the dataset to cover variations of a particular sound that may not be captured live?
[Dominic] Yes, augmentation is among the techniques we use. We have an extensive understanding of environmental factors, and this enables us to create realistic and diverse augmented data. In a practical sense this means understanding the differences between a dog bark in a brick house in Cambridge, UK versus a wooden house in San Francisco. It saves on the hassle of having to fly a lot of dogs around the world.
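The augmentation pipeline itself is proprietary; one widely used technique in this family is mixing a target sound into different backgrounds at controlled signal-to-noise ratios, sketched here:

```python
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Mix a target clip into a background at a chosen SNR -- a common
    augmentation for covering acoustic variation without new recordings."""
    t_pow = np.mean(target ** 2)
    b_pow = np.mean(background ** 2)
    gain = np.sqrt(t_pow / (b_pow * 10 ** (snr_db / 10)))
    return target + gain * background

bark = np.random.randn(16000)   # stand-in for a dog-bark clip
room = np.random.randn(16000)   # stand-in for a room/ambience recording
augmented = mix_at_snr(bark, room, snr_db=10)
```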