tinyML Talks on September 22, 2021 “Hardware-aware Edge AI using the parameterizable ML accelerator UltraTrail” by Paul Palomero Bernardo

We held our next tinyML Talks webcast on September 22, 2021: Paul Palomero Bernardo from the University of Tübingen presented "Hardware-aware Edge AI using the parameterizable ML accelerator UltraTrail."



Specialized hardware accelerators for machine learning (ML) tasks help bring intelligent data processing to edge devices. To fully leverage their potential, efficient mapping of the software task onto the target hardware is required. One way to achieve this is through a joint design optimization of both hardware and software. This talk presents the parameterizable ML accelerator UltraTrail and its use in the hardware/software co-design framework HANNAH. We introduce the accelerator's architecture and show how a hardware-aware neural architecture search can be used to automatically search for optimal hardware configurations. The advantages of this approach over a handcrafted solution are demonstrated on an audio use case. Finally, a generator-based approach is outlined that aims at further automating the design process of such hardware accelerators to increase performance and design efficiency.
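The co-design idea above can be pictured as a search that scores each candidate network/hardware pair by both task quality and hardware cost. The following is a minimal illustrative sketch in Python, not the actual HANNAH implementation; all names, weights, and numbers are assumptions.

```python
# Illustrative hardware-aware search objective (hypothetical names/values).
from dataclasses import dataclass

@dataclass
class Candidate:
    accuracy: float    # validation accuracy of the trained network
    energy_uj: float   # estimated energy per inference on the target HW
    area_mm2: float    # estimated accelerator area for this configuration

def score(c: Candidate, w_energy: float = 0.3, w_area: float = 0.2) -> float:
    # Lower is better: task error plus weighted hardware cost terms.
    return (1.0 - c.accuracy) + w_energy * c.energy_uj + w_area * c.area_mm2

candidates = [
    Candidate(accuracy=0.94, energy_uj=4.2, area_mm2=0.20),
    Candidate(accuracy=0.95, energy_uj=7.9, area_mm2=0.35),
]
best = min(candidates, key=score)
print(best)
```

In a real flow, each candidate's accuracy comes from training (or a proxy), and the energy and area terms come from models of the target accelerator configuration rather than fixed numbers.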


Paul Palomero Bernardo was born in Tübingen, Germany, in 1996. He received the B.S. and M.S. degrees in computer science from the University of Tübingen, Tübingen, Germany, in 2017 and 2020, respectively, where he is currently pursuing a Ph.D. at the Department of Computer Science. His current research interests include neural network hardware and design optimization.
=========================

Watch on YouTube:
Paul Palomero Bernardo

Download presentation slides:
Paul Palomero Bernardo

Feel free to ask your questions on this thread and keep the conversation going!

  1. Q: Is the training framework standard (PyTorch, TensorFlow, etc.) or is it also custom for the workflow?
    A: The training framework is based on PyTorch. The quantization-aware training (QAT) and neural architecture search (NAS) are custom implementations on top of it. (A minimal PyTorch sketch of the QAT idea appears after this thread.)

  2. Q: Are the memories all on-chip? And what type?
    A: Yes, they are all on-chip. For the results shown we used ultra-low-leakage, single-port SRAMs.

  3. Q: Any estimate of the scalability of the current solution? If it now takes two days for a 64K-parameter NN (not sure, just a guess from the slides), will it take twenty days for a 640K-parameter NN (i.e., a NN with ten times the parameters)?
    A: The largest networks in our design space contain around 720K parameters, and training times do not scale linearly with the number of parameters, as the training times for the smaller networks are largely I/O and memory bound. Assuming the design space for the hypothetical bigger network does not change, I would expect the search times to be only slightly worse. Direct NAS does, of course, have scalability issues, especially when the neural network or hardware architecture design spaces are increased significantly, although the scaling should still be sublinear considering the size of the design space.

  4. Q: How difficult is porting TVM to RISC-V?
    A: TVM does support RISC-V using, for example, the LLVM backend. There is also a lot of progress on the microTVM end, which might be interesting as well: microTVM: TVM on bare-metal — tvm 0.8.dev0 documentation. (A minimal RISC-V cross-compilation sketch appears after this thread.)

  5. Q: Is it possible to train a NN with PyTorch (Brevitas, maybe) and then deploy it on UltraTrail? Or is it currently a must to use the custom QAT?
    A: Yes, this is also possible. The benefit of the custom QAT is that it considers all quantizations and approximations present in UltraTrail, which leads to bit-exact inference between the trained and deployed NN.

  6. Q: What is the size of the ASIC, and can it be used with an MCU?
    A: The accelerator UltraTrail alone has a size of 0.20 mm² (8×8 array, ~61 kB memory). It was integrated as a coprocessor into a Pulpissimo-based SoC (total area: 1.56 mm²). It would generally be possible to combine a standalone UltraTrail with an MCU, but we have not yet tested this.

  7. Q: How large is the memory?
    A: The memory size of an UltraTrail instance depends on the target use case. For the presented keyword spotting task, there are ~50 kB for NN parameters (WMEM, BMEM) and ~11 kB for the feature maps (FMEM0-2, LMEM).

  8. Q: Are you using an RTOS like Zephyr, or plain C?
    A: We are using PULP-OS for the Pulpissimo-based SoC.

  9. Q: What standard benchmarks are you using?
    A: We used the Google Speech Commands Dataset (GSCDv2) for keyword spotting, UWNU/TUT for voice activity detection, and Hey Snips for wake word detection.

  10. Q: Did you try vision datasets, or is there any plan to use MLPerf?
    A: Since UltraTrail is currently optimized for 1D convolutions, we have not tried vision datasets. We plan to support this, together with the MLPerf benchmarks, in our next iteration, UltraTrail v2.
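
Following up on Q1 and Q5: quantization-aware training typically inserts "fake quantization" into the forward pass while gradients flow through a straight-through estimator. The sketch below is a minimal, generic PyTorch illustration of that idea under an assumed 8-bit symmetric scheme; it is not UltraTrail's exact arithmetic, and FakeQuant/QConv1d are hypothetical names.

```python
# Generic QAT sketch (hypothetical names; 8-bit symmetric scheme assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, bits=8):
        qmax = 2 ** (bits - 1) - 1
        # Quantize to integers, clamp to the representable range,
        # then dequantize back to float for the rest of the graph.
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: pass the gradient through unchanged.
        return grad_out, None, None

class QConv1d(nn.Conv1d):
    """Conv1d whose weights are fake-quantized during training."""
    def forward(self, x):
        scale = self.weight.abs().max() / 127.0 + 1e-8
        w_q = FakeQuant.apply(self.weight, scale)
        return F.conv1d(x, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = QConv1d(in_channels=40, out_channels=24, kernel_size=9)
out = layer(torch.randn(1, 40, 101))  # e.g., 40 MFCC features, 101 frames
print(out.shape)
```

A co-design-specific QAT, as described in the talk, would additionally model the accelerator's internal bit widths and approximations so that training-time and on-chip inference match bit-exactly.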
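Following up on Q4: TVM can target RISC-V through its LLVM backend by passing a RISC-V target triple. Below is a minimal sketch, assuming TVM ~0.8 built with LLVM support; the toy Relay network and the exact target string are illustrative and must match your toolchain.

```python
# Sketch: compiling a Relay module for RISC-V via TVM's LLVM backend.
import tvm
from tvm import relay

# Toy network: a single dense layer (shapes are illustrative).
data = relay.var("data", shape=(1, 64), dtype="float32")
weight = relay.var("weight", shape=(10, 64), dtype="float32")
net = relay.nn.dense(data, weight)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], net))

# Cross-compile through LLVM for a 64-bit RISC-V Linux target
# (assumed triple; adjust to your toolchain and board).
target = "llvm -mtriple=riscv64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

# lib.export_library(...) would then emit a shared object using a
# RISC-V cross-compiler such as riscv64-linux-gnu-g++.
```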