tinyML Talks on July 6, 2021 “Cracking a 600 million year old secret to fit computer vision on the edge” by Shivy Yohanandan

We held our next tinyML Talks webcast. Shivy Yohanandan from Xailient presented Cracking a 600 million year old secret to fit computer vision on the edge on July 6, 2021.


The ultimate goal of AI IoT is to be aware of our surroundings through sensors and to respond in real time, so we can be more selective in how we use and manage our limited resources, reducing both business and environmental costs. But the big problem with AI IoT is a paradox: current AI uses more energy to process IoT data than the energy it is trying to save. The main culprits are expensive algorithm families like YOLO, SSD, R-CNN, and their derivatives, which account for most of the computer vision algorithms in use today!

YOLOs and SSDs do object detection (a staple of most computer vision) by shrinking the full-resolution image to 416x416 or 300x300 and then doing both localization and classification on this shrunken image. But you have now lost over 95% of the information in the original image, which is why accuracy, robustness, and generalizability seem to be poor, especially when trying to scale across many IoT sensors (e.g. cameras). On top of this inherent design flaw, these models are huge and computationally expensive, which is why everyone tries to fit them on the edge by shrinking the models themselves. However, this often costs even more accuracy on a model that was already inaccurate to begin with!
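As a rough back-of-the-envelope check (assuming a 4K camera frame, purely for illustration), the fraction of pixels discarded when resizing to a typical 416x416 detector input is easy to compute:

```python
# Illustrative only: how much pixel information a fixed 416x416 detector
# input discards from a full-resolution frame (4K assumed here).
orig_w, orig_h = 3840, 2160      # assumed 4K camera frame
det_w, det_h = 416, 416          # typical YOLO input resolution

orig_pixels = orig_w * orig_h    # 8,294,400
det_pixels = det_w * det_h       # 173,056
loss = 1 - det_pixels / orig_pixels

print(f"pixels kept: {det_pixels:,} of {orig_pixels:,} ({loss:.1%} discarded)")
```

For a 1080p frame the figure is closer to 92%, so the exact loss depends on the sensor resolution, but the trend is the same: the higher the camera resolution, the more the detector throws away.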

Xailient solved this problem by cracking a 600-million-year-old secret of biological vision: selective attention and salience. This mechanism shows how to split object detection into two separate models: detection and classification. The result is that Xailient's detector is only 44 KB – 5000x smaller than YOLO! You can then use your own flavor of classifier to process each detected ROI one by one, except now using a crop from the original image, preserving more information for better accuracy. So we've solved both model size and accuracy in one hit!
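The two-stage idea can be sketched as follows. This is not Xailient's actual (proprietary) pipeline; `tiny_localizer` and `classify_crop` are hypothetical stand-ins for real models. The point is that localization runs on a cheap low-resolution copy, while classification runs on crops taken from the original full-resolution frame, so the classifier loses nothing to downsampling:

```python
import numpy as np

def tiny_localizer(small_frame):
    """Stand-in detector: returns boxes in normalized [0, 1] coordinates."""
    return [(0.25, 0.25, 0.5, 0.5)]  # (x, y, w, h) -- one fake detection

def classify_crop(crop):
    """Stand-in classifier: labels a full-resolution crop."""
    return "person"

def detect_and_classify(frame, det_size=160):
    h, w = frame.shape[:2]
    # Stage 1: localize on a cheap, low-resolution copy (nearest-neighbour
    # subsampling here; a real pipeline would use proper resizing).
    ys = np.linspace(0, h - 1, det_size).astype(int)
    xs = np.linspace(0, w - 1, det_size).astype(int)
    small = frame[np.ix_(ys, xs)]
    results = []
    for (nx, ny, nw, nh) in tiny_localizer(small):
        # Stage 2: map the normalized box back to full resolution and crop
        # the ORIGINAL frame, preserving all pixel information in the ROI.
        x0, y0 = int(nx * w), int(ny * h)
        x1, y1 = int((nx + nw) * w), int((ny + nh) * h)
        crop = frame[y0:y1, x0:x1]
        results.append(((x0, y0, x1, y1), classify_crop(crop)))
    return results

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # fake 1080p frame
print(detect_and_classify(frame))
```

Because the crops are small relative to the full frame, the classifier stays cheap even though it operates at native resolution, which is what lets the two stages together beat a single monolithic detector on both size and accuracy.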

This allows Xailient to fit object detection on ultra-low-power devices, which is exactly what we need to break the paradox above. And now we have built a platform giving everyone easy access to this new kind of computer vision that is far more efficient and accurate, with no need for model compression! In this talk I will share some example use cases of ultra-low-power, aware AI IoT.

Dr. Shivy Yohanandan is the co-founder and Chief Technology Officer at Xailient – the computer vision platform that is revolutionizing Artificial Intelligence by teaching algorithms how to process images and video like humans! He holds a PhD in Artificial Intelligence and Computer Science but started his career as a Neuroscientist and Bioengineer from the University of Melbourne. Passionate about vision, Shivy spent 4 years bringing vision to the blind by helping build Australia’s first bionic eye as a Research Engineer. Then, during his PhD, he made a breakthrough in Neuroscience; discovering for the first time the precise mechanism behind a 600-million-year-old secret on how animals are capable of processing vast amounts of visual information very efficiently. Realising a significant gap in the inefficient way modern AI processes images and video, he mapped nature’s secret vision formulae into algorithms that now provide the core behind Xailient’s visual AI, which are now being used by companies across the world like Sony. Previously, Shivy worked as a research scientist for 3 years at IBM Research in AI for healthcare including computer vision in medical imaging and building a brain-machine interface to decode brainwaves for controlling a robotic arm.


Watch on YouTube:
Shivy Yohanandan

Download presentation slides:
Shivy Yohanandan

Feel free to ask your questions on this thread and keep the conversation going!

Does 44 KB refer to the number of parameters of the entire network, or only of the detector (excluding the backbone)?
44 KB is the size of the TFLite file.

Why does the reading app need a sticker? why can’t it just use the page?
I think it's because page numbers are typically at the bottom of the page, and the reader can only operate at the top (otherwise it would obstruct the reader). You also cannot invert the book, since it needs to be right side up for reading. Please check with the Be With Me Reader creator.

Is the Detectum algorithm available to research?
It's proprietary, so we cannot disclose it.

How does the YOLO algorithm detect objects? That is, how can convolving images help classify objects?
Please check out the YOLO paper referenced in these slides for a detailed explanation.

Are there any nature-inspired neural network models for processing analog sensors?
If you're referring to analog sensors like temperature, pressure, etc., then there definitely are biological neural networks that process those signals, and the pea-sized structure we discovered and reverse-engineered would be processing this kind of information as well.

How many layers does the model have?
It's proprietary, so we cannot disclose that.

Could you tell something about the mAP for the last table comparing all these methods?
Please refer to the “Traffic Whitepaper” here: https://www.xailient.com/technical-details

Is there an analysis on the detection/localization accuracy of your model compared to the other methods? (i.e. from Haar Cascade to YOLO/R-CNN?)
Yes, but these are on sensitive/private custom datasets from customers, so we are unable to disclose the details at this stage.

What kind of control is provided over the number and type of region proposals?
That's proprietary, so we cannot disclose it.

Which dataset is 64.2 mAP reported on?
You can read the details here: https://www.xailient.com/post/challenges-of-running-deep-learning-computer-vision-on-computationally-limited-devices

Were all the other models modified for 1 class during the comparisons?
If you're referring to the YOLOv5-s benchmark slide produced by one of our partners, please refer to their talk at Embedded Vision Summit 2021.

Do you have a thread of work on audio pattern recognition?
No, Xailient is a computer vision company. We have not done any work with audio… yet.

Were you able to address the problem of YOLO/RCNN/SSD etc. having to downsample the input image in detectum?

Can you adapt the model to classify other modalities? e.g, soundscapes?
Yes, as long as the input is an image (with at least 1 channel).