Two tinyML Talks on May 14, 2020: 1) Embedded Computer Vision Hardware through the Eyes of AR/VR - Hans Reyserhove; 2) Using TensorFlow Lite for Microcontrollers for High-Efficiency NN Inference on Ultra-Low Power Processors - Jamie Campbell

We held our fourth tinyML Talks webcast with two presentations:
Hans Reyserhove from Facebook presented "Embedded Computer Vision Hardware through the Eyes of AR/VR" and Jamie Campbell from Synopsys presented "Using TensorFlow Lite for Microcontrollers for High-Efficiency NN Inference on Ultra-Low Power Processors" on May 14, 2020 at 8:00 AM and 8:30 AM Pacific Time, respectively.


Hans Reyserhove (left) and Jamie Campbell (right)

“Embedded Computer Vision Hardware through the Eyes of AR/VR”
Hans Reyserhove

Postdoctoral Research Scientist, Facebook Reality Labs

Augmented reality is an emerging technology that requires pushing the curve on almost all relevant fronts: computer vision algorithms and ML pipelines, sensing and processing hardware, memories, power consumption and system form factor. This talk will dig deeper into the technological challenges that are being solved today to make augmented reality happen. Although there are parallels with other embedded CV systems, a few key differentiators set AR apart. Central to it all is the technology stack: image sensors, interfaces and processing hardware are a few of the blocks under consideration that ultimately guide the system-level trade-offs. These trade-offs are further illustrated by applying them to the key always-on computer vision and ML pipelines necessary for augmented reality. Many of these considerations translate to the broader tinyML and embedded computer vision design space.

Hans Reyserhove is a Postdoctoral Research Scientist at Facebook Reality Labs. His research focuses on intelligent vision systems and sensing technologies for Augmented & Virtual Reality. He holds a PhD from the University of Leuven, Belgium, focused on the design of energy-efficient microcontroller systems and better-than-worst-case silicon systems. He has an M.S. degree focused on CMOS image sensors with pixel-level A/D conversion for extreme parallelism. His main interests lie in the design, prototyping & optimization of silicon systems, including image sensors, hardware accelerators and computer vision applications.

"Using TensorFlow Lite for Microcontrollers for High-Efficiency NN Inference on Ultra-Low Power Processors"
Jamie Campbell
Software Engineering Manager, Synopsys

Deeply-embedded AIoT applications doing neural network (NN) inference need to achieve specified real-time performance requirements on systems with limited memory and power budget. Meanwhile, developers want a convenient way of migrating their NN graph designs to an embedded environment. In this talk, we will describe how specific hardware extensions on embedded processors can vastly improve the performance of NN inference operations, which allows targets to be met while consuming less power. We will then show how optimized NN inference libraries can be integrated with well-known ML front-ends to facilitate development flows.
To illustrate these concepts, we’ll show the Synopsys MLI Machine Learning Inference library running on a DSP-enhanced DesignWare® ARC® EM processor and explain how it was integrated with TensorFlow Lite for Microcontrollers (TFLM). To conclude, we will showcase Himax Technologies’ WE-I Plus silicon, a very low-power SoC targeted at AIoT applications that supports both MLI and TFLM.

Jamie Campbell is a Software Engineering Manager at Synopsys, leading teams responsible for the development of the Machine Learning Inference (MLI) library for Synopsys ARC processors and the creation of compelling demos and reference applications for the Synopsys Embedded Vision processors. Prior to focusing on machine learning, Jamie worked in various capacities as an embedded software specialist, including as an R&D engineer, Field Applications Engineer and Corporate Applications Engineer at Precise Software Technologies, ARC International, Virage Logic and now Synopsys. Jamie holds a Bachelor of Science in Electrical Engineering from the University of Calgary, Canada.

==========================

Watch on YouTube:
Hans Reyserhove
Jamie Campbell

Download presentation slides:
Hans Reyserhove
Jamie Campbell

Feel free to ask your questions on this thread and keep the conversation going!

@hansreyserhove @JamieC

Here are some common questions that I’ll post here first:

  1. Do you see the accuracy of tinyML (smaller model sizes) coming close to ML with the multi-MB models? Would there have to be 2-stage ML?
  2. How much memory would you need to run the most memory-intensive ML model on one frame of the 640x480 camera stream?
  3. What are the best SoCs on the market for low-power computer vision?
  4. For edge computing, which one is preferred: NEON, GPU, DSP or NPU?

@JamieC - here are the questions regarding your talk:

  1. Do you have TCM or L2/L1 cache for prefetch?
  2. Can you give some approximate numbers regarding improvements in clock cycles, or energy, or power?
  3. When are we going to see ARC in MLPerf?
  4. It is interesting that you are using an FPGA in your dev kit - are there any specific challenges and/or advantages to using this rather than an MCU?
  5. Are you looking to support ONNX soon?
  6. Is Synopsys providing the “tailored” optimized TFLite Micro to the ARC architecture? Or can we use the mainline TFLite Micro from Github?
  7. What level of network complexity can TensorFlow Lite Micro handle? Tagging @PeteWarden @dansitu also for this question

@hansreyserhove - here are the questions regarding your talk:

  1. Do you think that we now have the technology to build an outdoor Visual Positioning System feature on AR glasses (1 or 2 cameras)? If not, what are the challenges (autonomy, performance, precision)?
  2. Are you using any automated model optimization tools to target a model to specific ML accelerators?
  3. If you had a specialized processor with Intel processors on the same memory loop, what function would you have it doing?
  4. Would you use a traditional ISP for processing raw images? Would that end up being a bigger power hog than even ML?
  5. For the system component energy consumption breakdown, is system idle energy considered?
  6. What kind of requirements do you have for NN compute? TOPS/watt/area?
  7. Where do you believe sensor power will land if scaled to the latest process node (e.g. 5nm)?
  8. Do you take any learnings from the failure of Google Glass, and might you be working on it?
  9. Can inference be done on a compressed image? How will this affect the precision?
  10. Why are you working with 30 fps rather than the film standard of 23.976 fps, which has been found to be how our brains process visually? Would that lower the amount of data that needs to be processed?
  11. Any outlooks for significant progress on the imager side?
  12. Would you consider the image sensor as the main bottleneck for AR/VR?
  13. For higher resolutions than VGA, would the energy breakdown among different modules change? Or will it still be dominated by the image sensor?
  14. Compared to the sensor, the power for compute is quite small - so is the compute even a bottleneck?
  15. Wouldn’t a set of Time-of-Flight sensors be useful for solving semantic segmentation and SLAM problems?
  16. Are you looking at event camera sensors to solve dynamic range, global shutter, and perhaps power consumption?
  17. Do you think event cameras that only send data if the pixel value changes are a technology to solve this image sensor problem you are facing for the moment?

Regarding question 7 from @JamieC’s talk:

There’s no limit to the size of the network TensorFlow Lite Micro can support, besides hardware constraints. As long as you have the RAM, a large network will just take longer to run inference on. In terms of support for different types of deep learning operators, the framework supports a subset of TensorFlow operator kernels. You can see the following source file for a list of them all:
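
For readers new to TFLM, here is a minimal sketch (mine, not from the thread) of how an application declares only the operators its graph needs and bounds working memory with a fixed tensor arena. The model symbol, operator set and arena size are hypothetical, and the exact MicroInterpreter constructor signature varies slightly between TFLM versions:

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Hypothetical flatbuffer produced by the TFLite converter.
extern const unsigned char g_model_data[];

// All tensors (activations and scratch buffers) live in this fixed arena,
// so RAM use is bounded up front rather than growing with graph size.
constexpr int kArenaSize = 100 * 1024;
static uint8_t tensor_arena[kArenaSize];

bool RunOnce(const int8_t* input_pixels, int input_len) {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the kernels this graph actually uses; an unsupported
  // operator fails here, independent of how large the graph is.
  static tflite::MicroMutableOpResolver<3> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return false;

  TfLiteTensor* input = interpreter.input(0);
  const int n = input_len < static_cast<int>(input->bytes)
                    ? input_len
                    : static_cast<int>(input->bytes);
  for (int i = 0; i < n; ++i) {
    input->data.int8[i] = input_pixels[i];
  }
  return interpreter.Invoke() == kTfLiteOk;
}
```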

Hi everyone,
Thanks for dialing in and watching the talk!
Unfortunately we didn’t get to answer all the questions during the livestream, so I’ll make an attempt to do that here! All great questions by the way!

  1. Do you think that we now have the technology to build an outdoor Visual Positioning System feature on AR glasses (1 or 2 cameras)? If not, what are the challenges (autonomy, performance, precision)?

It depends on what functionality you would want from a Visual Positioning System. Oculus Quest has a form of visual positioning. The hardware needed for that is not significantly more challenging than any other ML at the edge. If you want to localize against prebuilt maps, those just don’t exist today. Autonomy, performance and precision are all a matter of how far you want to push it.

  2. Are you using any automated model optimization tools to target a model to specific ML accelerators?

Facebook has published some work in that space. An interesting paper is here: https://arxiv.org/abs/1812.08934

  3. If you had a specialized processor with Intel processors on the same memory loop, what function would you have it doing?

I’m having a hard time visualizing what is meant here. If the person in question could elaborate below, that would be helpful!

  4. Would you use a traditional ISP for processing raw images? Would that end up being a bigger power hog than even ML?

I don’t think it would be a bigger power hog than ML, depending on how advanced you would want that ISP to be. If it is doing a known correction, I expect the power/quality trade-off to be better with a simple ISP.

  5. For the system component energy consumption breakdown, is system idle energy considered?

Somewhat, but not entirely. I agree there would be some more idle energy there, but it would depend heavily on the power gating scheme, PMIC and other things.

  6. What kind of requirements do you have for NN compute? TOPS/watt/area?

As much as possible with zero area and zero power :slight_smile: . The more you can fit, the more complex models people would use and the better performance you would get.

  7. Where do you believe sensor power will land if scaled to the latest process node (e.g. 5nm)?

Great question. It would definitely improve a lot, if that were possible. Analog circuits unfortunately don’t scale that well, although it would be interesting to see what could be done in something like a 5nm super advanced finFET node. Also note that pixel size is bounded by the wavelength of the light, which means a certain silicon area, which means a certain power. Overall, something on the order of 1 mW would be great!

  8. Do you take any learnings from the failure of Google Glass, and might you be working on it?

This requires a less technical answer: I can’t really comment on their hardware, since I don’t know. Whether a product like that succeeds or fails depends heavily on market readiness and use case ecosystem. AR does not succeed if that does not exist.

  9. Can inference be done on a compressed image? How will this affect the precision?

Yes, it can, if you train for it. I think you can recover most of the precision loss there.

  10. Why are you working with 30 fps rather than the film standard of 23.976 fps, which has been found to be how our brains process visually? Would that lower the amount of data that needs to be processed?

Frame rate is directly proportional to the amount of data you need to process, so yes. However, I wouldn’t say 23.976fps is entirely in line with our brains. Some posts from Michael Abrash here dig deeper on what display frame rate you would need for VR:
http://blogs.valvesoftware.com/abrash/my-steam-developers-day-talk/
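
As a quick back-of-the-envelope on the “directly proportional” point (my numbers, assuming an 8-bit monochrome VGA stream, not from the talk):

```cpp
#include <cstdio>

// Raw data rate of a hypothetical 8-bit monochrome stream (1 byte/pixel).
double StreamBytesPerSecond(int width, int height, double fps) {
  return static_cast<double>(width) * height * fps;
}

int main() {
  // 640 x 480 x 1 byte = 307,200 bytes per frame.
  std::printf("30.000 fps VGA: %.1f MB/s\n",
              StreamBytesPerSecond(640, 480, 30.0) / 1e6);    // ~9.2 MB/s
  std::printf("23.976 fps VGA: %.1f MB/s\n",
              StreamBytesPerSecond(640, 480, 23.976) / 1e6);  // ~7.4 MB/s
}
```

So dropping from 30 fps to the film rate would save roughly 20% of the raw pixel data, linearly with frame rate.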

  11. Any outlooks for significant progress on the imager side?

Yes, just yesterday Sony announced an interesting sensor:
https://www.prnewswire.com/news-releases/sony-to-release-worlds-first-intelligent-vision-sensors-with-ai-processing-functionality-301059083.html

  12. Would you consider the image sensor as the main bottleneck for AR/VR?

Not completely, but image sensors are typically optimized for other use cases, which does pose some difficulties.

  13. For higher resolutions than VGA, would the energy breakdown among different modules change? Or will it still be dominated by the image sensor?

If you would use the full image resolution in your processing, then yes, ML will grow significantly, which will probably exceed sensor energy. The same goes for multiple algorithms reusing the same frame.

  14. Compared to the sensor, the power for compute is quite small - so is the compute even a bottleneck?

Also see the answer to 13. For a single application it probably won’t be. The more ML functionality you add, the more the balance will switch.

  15. Wouldn’t a set of Time-of-Flight sensors be useful for solving semantic segmentation and SLAM problems?

Yes, a more fundamental depth sensing modality is high stakes nowadays. Other than Apple’s recent addition to the iPad I haven’t seen groundbreaking stuff yet:
https://www.i-micronews.com/with-the-apple-ipad-lidar-chip-sony-landed-on-the-moon-without-us-knowing/

  16. Are you looking at event camera sensors to solve dynamic range, global shutter, and perhaps power consumption?
  17. Do you think event cameras that only send data if the pixel value changes are a technology to solve this image sensor problem you are facing for the moment?

I agree this is an interesting field, but there are lots of issues. The entire computer vision community focuses on frame-based algorithms. I don’t think you can qualify an event sensor as global shutter, but yes, it is quite instantaneous. Dynamic range is also hard to compare. The power cost of event sensing is that it is in fact always on, as opposed to a normal sensor where you sense-transmit-sleep. This might compromise power. Finally, I haven’t actually seen someone put an event camera on their head. All the examples I see use a semi-static camera position. It would be interesting to see some results with a lot of fast camera motion!

Well, that’s it. Thanks everyone! This is actually quite fun! We should do it again :grin:
Cheers,
Hans


Hello Everyone,
Thank you for your interest and for attending the talk yesterday. Since we didn’t get a chance to address all of your questions during the webinar, I’ll answer the outstanding ones below.

Take care,
Jamie Campbell

  1. Do you have TCM or L2/L1 cache for prefetch?

ARC EM processors support a feature called “Closely Coupled Memories” or CCMs. You can have both code and data CCMs. These are single-cycle access memories. In addition to those, you can add XY memory to improve DSP performance, as I explained in the webinar. Certain configurations of ARC EM processors can have both instruction and data caches. Other families of ARC processors, like the ARC HS, have the option of adding an L2 cache.

  2. Can you give some approximate numbers regarding improvements in clock cycles, or energy, or power?

In the webinar, I showed how we could execute an inference from the TensorFlow Lite for Microcontrollers “Person Detect” example in 14M cycles, using the optimized kernels from the embARC MLI library. For comparison, running the same Person Detect graph with TFLM’s reference kernels consumes 87M cycles, so the MLI library offers a roughly 6x speedup for that graph.
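
If you want to reproduce this kind of comparison yourself, here is a rough sketch of timing a single inference; ReadCycleCounter() is a hypothetical platform hook (on ARC EM it would typically wrap a hardware timer or cycle-count register), not a TFLM or MLI API:

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Hypothetical platform hook returning a free-running cycle count.
extern uint64_t ReadCycleCounter();

// Returns elapsed cycles for one inference, or 0 on failure.
uint64_t TimeOneInference(tflite::MicroInterpreter& interpreter) {
  const uint64_t start = ReadCycleCounter();
  if (interpreter.Invoke() != kTfLiteOk) {
    return 0;
  }
  return ReadCycleCounter() - start;
}
```

Running this once with the reference kernels and once with the MLI-optimized kernels gives the kind of cycle numbers quoted above.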

  3. When are we going to see ARC in MLPerf?

We have begun monitoring tinyMLPerf activities and hope to include ARC in those benchmarks soon.

  4. It is interesting that you are using an FPGA in your dev kit - are there any specific challenges and/or advantages to using this rather than an MCU?

Synopsys is an IP company. We license our solutions to our customers who then produce SoCs. Since the ARC processors are so highly configurable, it’s important to allow our users to experiment with many different configs in order to understand the various design tradeoffs before they produce silicon. By supplying development systems that are based on FPGAs, we can include several FPGA configs to allow for easy experimentation. FPGAs are generally slower than an ASIC, but this is usually not an issue for development systems.

  5. Are you looking to support ONNX soon?

The Synopsys EV6x Embedded Vision processor already supports ONNX as a way of importing graphs. In the future, we’ll consider ways of taking that support to the deeply embedded processors like EM as well.

  6. Is Synopsys providing the “tailored” optimized TFLite Micro to the ARC architecture? Or can we use the mainline TFLite Micro from Github?

We plan to contribute all ARC-specific changes to TFLM/TFLite Micro to the master GitHub repository. You will be able to read more about the ARC support once it goes live (expected within a couple of days) at this URL: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/kernels/arc_mli

  7. What level of network complexity can TensorFlow Lite Micro handle?

Please see @dansitu’s reply below. Synopsys would be interested in seeing support for LSTM- and RNN-style kernels added in the future.

  8. How much memory would you need to run the most memory-intensive ML model on one frame of the 640x480 camera stream?

Unfortunately, it’s not really possible to give an upper bound here. Memory utilization is determined by many factors, including the types of kernels used by your graph, the size of the weights for each layer, and the sizes of each layer's inputs and outputs. If a graph topology is known (the type of kernel used in each layer, IO sizes, and weight sizes), then it’s possible to determine the data memory requirements.
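
To make that concrete, here is a hypothetical back-of-the-envelope (my numbers, not Jamie's) for a single int8 convolution fed a full 640x480 monochrome frame:

```cpp
// Hypothetical first layer: 640x480x1 int8 input, 3x3 kernel, stride 2,
// "same" padding, 8 output channels.
constexpr int kInputBytes  = 640 * 480 * 1;  // ~300 KB of input activations
constexpr int kOutputBytes = 320 * 240 * 8;  // ~600 KB of output activations
constexpr int kWeightBytes = 3 * 3 * 1 * 8;  // 72 bytes of weights

// A memory planner that keeps the input and output alive at the same time
// needs roughly kInputBytes + kOutputBytes (~900 KB) for this one layer,
// which is why full-resolution frames are rarely fed directly to a tinyML
// model on a device with only a few hundred KB of SRAM.
```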
