Two tinyML Talks on September 1, 2020 by Suren Jayasuriya from Arizona State University and Kristofor Carlson from BrainChip Inc

We held our tinyML Talks webcast with two presentations on September 1, 2020: Suren Jayasuriya from Arizona State University presented "Towards Software-Defined Imaging: Adaptive Video Subsampling for Energy-Efficient Object Tracking" at 8:00 AM Pacific Time, and Kristofor Carlson from BrainChip Inc. presented "The Akida Neural Processor: Low Power CNN Inference and Learning at the Edge" at 8:30 AM Pacific Time.


Suren Jayasuriya (left) and Kristofor Carlson (right)

CMOS image sensors have become more computational in nature, including region-of-interest (ROI) readout, high dynamic range (HDR) functionality, and burst photography capabilities. Software-defined imaging is the new paradigm, mirroring similar advances in radio technology, where image sensors are increasingly programmable and configurable to meet application-specific needs. In this talk, we present a suite of software-defined imaging algorithms that leverage the ROI capabilities of CMOS sensors for energy-efficient object tracking. In particular, we discuss how adaptive video subsampling can learn to jointly track objects and subsample future image frames in an online fashion. We present software results as well as FPGA-accelerated algorithms that achieve video-rate latency. Further, we highlight emerging work on using deep reinforcement learning to perform adaptive video subsampling during object tracking. All of this work points to the software-hardware co-design of intelligent image sensors in the future.

Suren Jayasuriya is an assistant professor at Arizona State University, jointly appointed in the School of Arts, Media and Engineering and the School of Electrical, Computer and Energy Engineering. Before that, he was a postdoctoral fellow at the Robotics Institute at Carnegie Mellon University, and he received his Ph.D. in electrical and computer engineering from Cornell University in 2017. His research interests are in computational imaging and photography, computer vision/graphics and machine learning, and CMOS image sensors.

The Akida event-based neural processor is a high-performance, low-power SoC targeting edge applications. In this session, we discuss the key distinguishing factors of Akida's computing architecture, which include aggressive 1- to 4-bit weight and activation quantization, event-based implementation of machine-learning operations, and the distribution of computation across many small neural processing units (NPUs). We show how these architectural changes result in a 50% reduction in MACs, parameter memory usage, and peak bandwidth requirements when compared with non-event-based 8-bit machine learning accelerators. Finally, we describe how Akida performs on-chip learning with a proprietary bio-inspired learning algorithm. We present state-of-the-art few-shot learning in both visual (MobileNet on mini-ImageNet) and auditory (6-layer CNN on Google Speech Commands) domains.

Kristofor Carlson is a senior research scientist at BrainChip Inc. Previously, he worked as a postdoctoral scholar in Jeff Krichmar's cognitive robotics laboratory at UC Irvine, where he studied unsupervised learning rules in spiking neural networks (SNNs), the application of evolutionary algorithms to SNNs, and neuromorphic computing. Afterwards, he worked as a postdoctoral appointee at Sandia National Laboratories, where he applied uncertainty quantification to computational neural models and helped develop neuromorphic systems. In his current role, he is involved in the design and optimization of both the machine learning algorithms and the hardware architecture of BrainChip's latest system on a chip, Akida.

==========================

Watch on YouTube:
Suren Jayasuriya
Kristofor Carlson

Download presentation slides:
Suren Jayasuriya
Kristofor Carlson

Feel free to ask your questions on this thread and keep the conversation going!

@sjayasur - here are the audience questions we couldn’t get to during your talk:

  1. Is there an equivalent of an FPGA (for compute) in 2D vision sensors, such that the entire circuit is controllable as custom logic? We’ve heard on tinyML Talks about sensors that have incorporated AI circuits in 2D vision chips, but I’m thinking custom below the level of a GA and a custom ASIC!
  2. How does subsampling efficiency compare when you predict the ROI on H.264 I-frames, for instance?
  3. Why not try for low-power on-chip encoders that can be integrated with this tracking - any constraints observed?
  4. Can you speak a little more about the CNN implementation on FPGA?
  5. Is there a tradeoff between the reduced number of pixels and the energy cost to run the algorithm to estimate the ROI?
  6. How much overhead is there in reconfiguring the sensor for each frame? How fast can this be done? Can this become the bottleneck?
  7. How does ROI handle partial occlusions?
  8. In terms of cost, is tracking or classification more expensive?

@kristoforcarlson - here are the audience questions we couldn’t get to during your talk:

  1. What application scenario is the chip targeting assuming that low-bit precision will degrade accuracy?
  2. What is your definition of an event? What information does it have? Temporal, spatial, gradient? What is the total #bits to represent 1 event?
  3. What framework do you use for 4-bit QAT?
  4. Does the event-based calculation help in reducing inference time or just in reducing power?
  5. Is the work scheduling for NPUs split up by frame or do all NPUs work on one frame at a time? To add: If it is one frame at a time, is it image quadrant or model layer based?
  6. Is the activation regularization performed in-situ?
  7. Can Akida do on-the-fly activity regularization for retraining?
  8. The one-shot learning was really cool. I was wondering: if I take a photo of an elephant in a room and inference is done in a jungle at night, will it work? Does the network learn to see the elephant in the forest darkness?
  9. Do new classes work if background is different from training background?
  10. In Akida’s current state, what would be a real-world application that it would most easily integrate with?
  11. What benefit is there in chaining multiple chips to achieve a greater number of NPUs available to the network?

Q1: What application scenario is the chip targeting assuming that low-bit precision will degrade accuracy?

A1: Actually, part of the talk was spent showing that the accuracy degradation for 4-bit precision is very small (~1-2%) in most cases. Note also that we are talking about a very small accuracy degradation on challenging benchmark tasks like 1000-class classification. We expect models that use 4-bit quantization to perform even better for practical, well-constrained machine learning applications, as opposed to the benchmarking tasks I previously mentioned. If you were referring to models that use bit precisions lower than 4 bits (e.g. 1- or 2-bit precision), then the situation is a little different. There are some practical machine-learning tasks that lose very little accuracy even down to 1- or 2-bit precision, but I think those trade-offs are less interesting. A more interesting trade-off is using lower-bit precision models with a high degree of sparsity to do ‘preprocessing’ for a more accurate (and power-hungry) model. The most obvious example that comes to mind is using an extremely efficient low-bit precision model to generate region proposals so that a larger, more powerful model can perform object detection using these region proposals. The combination of very low bit precision and activation sparsity really allows for some interesting trade-offs.
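
For illustration, here is a minimal sketch of that kind of two-stage cascade in plain Python/NumPy. The stub functions and their names are hypothetical placeholders, not Akida API calls; the point is only the structure of the trade-off (cheap proposer, expensive detector run only on proposals).

```python
import numpy as np

# Hypothetical stand-ins (not Akida API calls): a cheap low-bit proposal
# model and a larger, more accurate detector.
def tiny_quantized_proposer(frame):
    # Placeholder: a real implementation would run a sparse 1- or 2-bit CNN.
    return [(0, 0, 64, 64)]          # list of candidate (x, y, w, h) boxes

def large_detector(crop):
    # Placeholder: a real implementation would run a full detection model.
    return "object", 0.9             # (label, confidence)

def cascade(frame, score_threshold=0.5):
    """Spend the expensive model only on regions the cheap model proposes."""
    detections = []
    for (x, y, w, h) in tiny_quantized_proposer(frame):
        label, score = large_detector(frame[y:y + h, x:x + w])
        if score >= score_threshold:
            detections.append(((x, y, w, h), label, score))
    return detections

print(cascade(np.zeros((480, 640, 3), dtype=np.uint8)))
```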

If you’re asking more generally what applications we are targeting, the answer is: edge applications where solutions are severely constrained in size, weight, and power are always a good start. Those that require keyword spotting, person detection, object classification, or multiple CNNs that perform similar tasks but require sensor fusion would probably be a very natural fit for Akida. Applications that have 3D point cloud inputs, like LiDAR and DVS cameras, also translate to Akida in a straightforward way.

Q2: What is your definition of an event? What information does it have? Temporal, spatial, gradient? What is the total #bits to represent 1 event?

A2:
• We define an event as a non-zero activation. By this I mean that events can be viewed as non-zero entries in a feature (aka activation) map. We use QReLUs (quantized rectified linear units) as activations and therefore don’t have negative values in our feature maps, so events can equivalently be described as the positive entries in a feature map (see the sketch after this list).
• The event itself carries the same information any element in a feature map carries. If we are talking about 4-bit activations, then each element is a single 4-bit value that represents the activity of a feature at a particular (x, y) location for a particular input. It does not carry temporal or gradient information. Because of its location, it carries spatial information. As in any other CNN, a single element of a feature map could use the channel dimension to encode information about temporal inputs, as some researchers have done.
• Our hardware supports 1-bit, 2-bit, and 4-bit events. That’s the size of each event.
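
For illustration only, here is a minimal NumPy sketch of "events as the non-zero entries of a quantized feature map." The tensor shape and the sparsity level are made-up values, not Akida internals.

```python
import numpy as np

# Minimal sketch: each non-zero entry of a 4-bit quantized feature map
# becomes an "event" of the form (y, x, channel, value).
rng = np.random.default_rng(0)
fmap = rng.integers(0, 16, size=(32, 32, 64))    # 4-bit QReLU outputs, values 0..15
fmap[rng.random(fmap.shape) < 0.8] = 0           # assume ~80% activation sparsity

ys, xs, cs = np.nonzero(fmap)                    # event coordinates
events = np.stack([ys, xs, cs, fmap[ys, xs, cs]], axis=1)

print(f"{len(events)} events out of {fmap.size} feature-map entries "
      f"({100 * len(events) / fmap.size:.1f}% kept)")
```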

Q3: What framework do you use for 4-bit QAT?

A3: For those who may be unfamiliar with this term, QAT refers to quantization-aware training. The idea is that QAT simulates low-precision computation in the forward pass during training so that the back-propagated error and the resulting weight updates account for the quantized values. That way, when the fully quantized model is executed during inference, the quantized parameters (and possibly activations) will produce the correct prediction and classification accuracy will be maintained. For those who are interested, here is a link to a Google blog entry that discusses QAT.

We use a custom QAT framework built specifically to be efficient on our Akida NSoC. However, I should note that our QAT framework is built on the TensorFlow Keras API for users’ convenience. Those who are familiar with QAT frameworks should be pretty comfortable with our implementation. Here is a link to our quantization aware training documentation.
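
For readers who want a concrete picture, here is a minimal, generic sketch of 4-bit quantization-aware training of activations in Keras using a straight-through estimator. This is an assumed illustration of the general idea, not BrainChip's framework; the clipping range and layer choices are arbitrary.

```python
import tensorflow as tf

def fake_quant_4bit(x, max_val=6.0):
    """Fake-quantize to unsigned 4-bit in the forward pass."""
    levels = 2 ** 4 - 1                                  # 15 quantization steps
    x_clipped = tf.clip_by_value(x, 0.0, max_val)
    x_quant = tf.round(x_clipped / max_val * levels) * max_val / levels
    # Straight-through estimator: forward uses x_quant, backward sees identity.
    return x + tf.stop_gradient(x_quant - x)

class QReLU(tf.keras.layers.Layer):
    """ReLU followed by 4-bit fake quantization, mimicking a quantized activation."""
    def call(self, inputs):
        return fake_quant_4bit(tf.nn.relu(inputs))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", input_shape=(32, 32, 3)),
    QReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```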

Q4: Does the event-based calculation help in reducing inference time or just in reducing power?

A4: Good question. The event-based calculation reduces both inference time and power. Inference time is also reduced because we built the hardware to be event-based from the ground up. This means Akida doesn’t waste time scanning each feature map to find the non-zero activations; instead, it communicates information by sending only events rather than an entire feature map (which would contain a large number of zero-valued activations that would have to be searched through and discarded).
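
As a toy NumPy comparison (not the actual Akida datapath): both versions below produce the same result, but the event-driven one only performs MACs for the non-zero activations. The sparsity level and sizes are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
acts = rng.integers(0, 16, size=1024).astype(np.int32)
acts[rng.random(1024) < 0.8] = 0                         # assume 80% sparsity
weights = rng.integers(-8, 8, size=(1024, 256)).astype(np.int32)

dense_out = acts @ weights                               # touches every entry, zero or not

nz = np.nonzero(acts)[0]                                 # event-driven: skip zeros entirely
event_out = (acts[nz, None] * weights[nz]).sum(axis=0)

assert np.array_equal(dense_out, event_out)              # same answer, far fewer MACs
print(f"MAC count: dense={acts.size * 256}, event-driven={nz.size * 256}")
```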

Q5: Is the work scheduling for NPUs split up by frame or do all NPUs work on one frame at a time? To add: If it is one frame at a time, is it image quadrant or model layer based?

A5: The Akida architecture is very flexible in this regard. NPUs are organized by layer, but a layer can be further subdivided by inputs (e.g. image quadrant) or by filters. We also have a processing mode that allows NPUs to work on different frames and layers in parallel. For example, frame 1, layer 4 can be processed by one set of NPUs in parallel with another set of NPUs that are processing frame 2, layer 3. Depending on the application, this can be very useful.

Q6: Is the activation regularization performed in-situ?

A6: No, the activation regularization is performed during training. This needs to occur during training because activity regularization is built into the loss function.
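
For context, training-time activity regularization typically looks like the following generic Keras pattern (an assumed illustration, not BrainChip's exact recipe): an L1 penalty on a layer's outputs is added to the loss, which pushes activations toward zero and increases event sparsity.

```python
import tensorflow as tf

# Generic sketch: the activity_regularizer adds an L1 penalty on the layer's
# outputs to the training loss, encouraging sparse (mostly zero) activations.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        16, 3, padding="same", activation="relu", input_shape=(32, 32, 3),
        activity_regularizer=tf.keras.regularizers.l1(1e-4)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# During each forward pass the L1 activity penalty is added alongside the task loss.
```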

Q7: Can Akida do on-the-fly activity regularization for retraining?

A7: No, not currently.

Q8: The one-shot learning was really cool. I was wondering: if I take a photo of an elephant in a room and inference is done in a jungle at night, will it work? Does the network learn to see the elephant in the forest darkness?

A8: The short answer to this is “probably not”. The long answer is that this scenario is a bit unlikely because of all the preprocessing that occurs in an image classification pipeline. The type of one-shot learning Akida performs requires a high-quality feature vector as an input. Without preprocessing to enhance or transform a very dark image into a reasonably ‘well-lit’ image, the feature vector will be poor and thus classification will be poor. However, modern image-processing pipelines often take care of this sort of thing much earlier, before the dark image even gets to the CNN, so it’s not as much of an issue as one might think. One could probably use or generate a data set that contains lots of animals in the dark and train on that directly, but some tricks might be needed to handle images taken at night versus during the day.

Q9: Do new classes work if background is different from training background?

A9: Good question. The answer is: yes. We just didn’t want the video to show some random class like ‘Tiger’ or ‘Police Car’ when the background was showing.

Q10: In Akida’s current state, what would be a real-world application that it would most easily integrate with?

A10: Edge applications where solutions are severely constrained in size, weight, and power are always a good start. Those that require keyword spotting, person detection, object classification, or multiple CNNs that perform similar tasks but require sensor fusion would probably be a very natural fit for Akida. Applications that have 3D point cloud inputs, like LiDAR and DVS cameras, also translate to Akida in a straightforward way.

Q11: What benefit is there in chaining multiple chips to achieve a greater number of NPUs available to the network?

A11: Chaining multiple chips simply gives the user the ability to tackle more computationally challenging machine-learning tasks. For example, using two Akida chips may allow a user to reach a frame rate that a single Akida chip wouldn’t be able to provide.

1. Is there an equivalent of an FPGA (for compute) in 2D vision sensors, such that the entire circuit is controllable as custom logic? We’ve heard on tinyML Talks about sensors that have incorporated AI circuits in 2D vision chips, but I’m thinking custom below the level of a GA and a custom ASIC!

A neuromorphic analog VLSI sensor may be a substitute for the FPGA (e.g. Indiveri, G., “Neuromorphic analog VLSI sensor for visual tracking: Circuits and application examples,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 46, no. 11, pp. 1337–1347, Nov. 1999).

2. How does subsampling efficiency compare when you predict the ROI on H.264 I-frames, for instance?

H.264 breaks an image up into macroblocks that can be used for motion estimation. Feeding the estimated motion vectors to the Kalman filter, in addition to the location information from the previous frame, may enhance its prediction accuracy. And the closer the predicted bounding-box location is to the actual region of interest, the higher the subsampling efficiency will be.
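
As a rough illustration of the idea, here is a minimal constant-velocity Kalman predict/update step in NumPy, where the measurement combines the previous ROI centre with a velocity estimate derived from motion vectors. The noise values are hypothetical and this is not the exact filter from the talk.

```python
import numpy as np

dt = 1.0
# State is [cx, cy, vx, vy]; constant-velocity motion model.
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.eye(4)                                    # we "measure" both position and velocity
Q = np.eye(4) * 1e-2                             # process noise (hypothetical value)
R = np.eye(4) * 1.0                              # measurement noise (hypothetical value)

x = np.array([320.0, 240.0, 0.0, 0.0])           # initial ROI centre and velocity
P = np.eye(4)

def step(x, P, roi_centre, motion_vector):
    x_pred, P_pred = F @ x, F @ P @ F.T + Q      # predict where the ROI will be
    z = np.concatenate([roi_centre, motion_vector])
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)        # correct with the measurement
    return x_new, (np.eye(4) - K @ H) @ P_pred

x, P = step(x, P, roi_centre=np.array([324.0, 238.0]), motion_vector=np.array([4.0, -2.0]))
print(x[:2])                                     # updated estimate of the ROI centre
```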

3. Why not try for low-power on-chip encoders that can be integrated with this tracking - any constraints observed?

Since the tracking is being implemented on the programmable logic of the ZCU102 board, it makes sense to use the output of the tracking to pre-emptively switch pixels off in subsequent frames. The overhead associated with the sensor reconfiguration should not raise too much concern, as the FPGA supports parallel processing and does not remain idle while the sensor is being reset.

4. Can you speak a little more about the CNN implementation on FPGA?

We are currently working towards porting a CNN to an FPGA. We are making use of Xilinx’s Vitis AI along with their ZCU102 board for accelerating neural networks.

5. Is there a tradeoff between the reduced number of pixels and the energy cost to run the algorithm to estimate the ROI?

Since the ROI estimation is an overhead cost, it is important to make sure that the energy saved by the pixels turned off exceeds the cost of the ROI estimation itself. However, since we assume tracking is being performed, that cost is just the Kalman filter prediction, which is a small matrix computation that is accelerated on the FPGA.
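
A back-of-the-envelope version of that break-even check, with purely hypothetical energy numbers chosen only for illustration:

```python
# Subsampling only pays off if the readout energy saved by the pixels that
# are switched off exceeds the energy spent estimating the ROI.
energy_per_pixel_nj = 0.5          # hypothetical per-pixel readout/ADC energy
roi_estimation_nj = 20_000.0       # hypothetical cost of the Kalman predict step
pixels_switched_off = 200_000      # e.g. reading an ROI instead of a full frame

saved = pixels_switched_off * energy_per_pixel_nj
print(f"saved {saved:.0f} nJ vs {roi_estimation_nj:.0f} nJ overhead "
      f"-> worthwhile: {saved > roi_estimation_nj}")
```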

6. How much overhead is there in reconfiguring the sensor for each frame? How fast can this be done? Can this become the bottleneck?

Sensor reconfiguration may result in increased latency, by as much as 280 ms. However, since the FPGA supports parallel processing, it keeps operating on other data while the sensor is reset. Furthermore, if the overhead turns out to be a cause for concern, certain techniques can be adopted to eliminate the bottleneck, such as the Banner framework (Hu, J., Shearer, A., Rajagopalan, S., LiKamWa, R., “Banner: An image sensor reconfiguration framework for seamless resolution-based tradeoffs,” in Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, Jun. 2019, pp. 236–248).

7. How does ROI handle partial occlusions?

Classical techniques do not fare well where occlusions are concerned. Tracking accuracy degrades even in the case of partial occlusions. Neural network-based solutions are the key to solving such complex problems.

8. In terms of cost, is tracking or classification more expensive?

Neural network-based object classification will be reasonably expensive. On the other hand, we have used mean shift in conjunction with a Kalman filter for tracking, which is a less expensive algorithm than any neural network-based solution.
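
For reference, the classic mean-shift tracking step can be sketched with OpenCV as below. The parameters are illustrative and this is not the exact pipeline from our work; it only shows why the tracking side stays cheap compared with a neural network.

```python
import cv2

def roi_histogram(frame_bgr, window):
    """Hue histogram of the initial ROI, used as the mean-shift target model."""
    x, y, w, h = window
    hsv_roi = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    return cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

def track_roi(frame_bgr, window, roi_hist):
    """One mean-shift update of the (x, y, w, h) search window."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(back_proj, window, criteria)
    return window              # the new window centre can then feed the Kalman filter
```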