Q1: What application scenario is the chip targeting, assuming that low-bit precision will degrade accuracy?
A1: Actually, part of the talk was spent showing that the accuracy degradation for 4-bit precision was very small (~1-2%) in most cases. Note also that we are talking about a very small accuracy degradation on challenging benchmark tasks like 1000-class classification. We expect models that use 4-bit quantization to perform even better on practical, well-constrained machine-learning applications, as opposed to the benchmarking tasks I just mentioned. If you were referring to models that use bit precisions lower than 4-bit (e.g. 1- or 2-bit precision), then the situation is a little different. There are some practical machine-learning tasks that lose very little accuracy even down to 1- or 2-bit precision, but I think those trade-offs are less interesting. A more interesting trade-off is using lower-bit-precision models with a high degree of sparsity to do ‘preprocessing’ for a more accurate (and power-hungry) model. The most obvious example that comes to mind is using an extremely efficient low-bit-precision model to generate region proposals so that a larger, more powerful model can perform object detection using those proposals. The combination of very low bit precision and activation sparsity really allows for some interesting trade-offs.
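To make the cascade idea concrete, here is a minimal, purely illustrative sketch (the model functions below are hypothetical stand-ins, not Akida APIs): a cheap low-bit model screens the frame and proposes regions, and the larger, more accurate model runs only on those regions.

```python
import numpy as np

# Hypothetical two-stage cascade (names, shapes, and boxes are illustrative only):
# a cheap low-bit "proposal" model screens the frame, and the expensive detector
# only runs on the regions the proposal model flags.

def lowbit_proposal_model(frame):
    # Stand-in for an extremely efficient 1- or 2-bit CNN: returns candidate
    # bounding boxes (x, y, w, h) where something interesting may be present.
    return [(16, 16, 64, 64), (96, 32, 64, 64)]

def heavy_detector(crop):
    # Stand-in for the larger, more accurate (and more power-hungry) model.
    return {"label": "object", "score": 0.9}

frame = np.zeros((224, 224, 3), dtype=np.uint8)
detections = []
for (x, y, w, h) in lowbit_proposal_model(frame):
    crop = frame[y:y + h, x:x + w]
    detections.append(heavy_detector(crop))
print(detections)
```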
If you’re asking more generally what applications we are targeting, the answer is: edge applications whose solutions are severely constrained by size, weight, and power are always a good start. Those that require keyword spotting, person detection, object classification, or multiple CNNs that perform similar tasks but require sensor fusion would be a very natural fit for Akida. Applications with 3D point-cloud inputs, like LiDAR and DVS cameras, also translate to Akida in a straightforward way.
Q2: What is your definition of an event? What information does it have? Temporal, spatial, gradient? What is the total #bits to represent 1 event?
A2:
• We define an event as a non-zero activation. By this I mean that events can be viewed as non-zero entries in a feature (aka activation) map. We use QReLUs (quantized rectified linear units) as activations and therefore don’t have negative values in our feature maps, so we could refer to events as positive, non-zero entries in a feature map (a small illustrative sketch follows this list).
• The event itself carries the same information as any element in a feature map. If we are talking about 4-bit activations, then each event is a single 4-bit value that represents the activity of a feature at a particular x, y location for a particular input. It does not carry temporal or gradient information; because of its location, it carries spatial information. As in any other CNN, an element of a feature map could use the channel dimension to encode information about temporal inputs, as some researchers have done.
• Our hardware supports 1-bit, 2-bit, and 4-bit events. That’s the size of each event.
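As a rough illustration of the points above (this is not Akida’s internal event format), the non-zero entries of a quantized feature map can be extracted as events:

```python
import numpy as np

# Illustrative only (not Akida's internal event format): the non-zero entries of a
# 4-bit QReLU feature map can be viewed as events carrying (x, y, channel, value).
feature_map = np.random.randint(0, 16, size=(8, 8, 4))      # 4-bit activations, 0..15
feature_map[np.random.rand(*feature_map.shape) < 0.8] = 0   # typical activation sparsity

events = [(x, y, c, int(feature_map[y, x, c]))
          for y, x, c in np.argwhere(feature_map > 0)]

# Only these events need to be communicated downstream; the zero-valued entries
# of the feature map are simply never sent.
print(f"{len(events)} events out of {feature_map.size} feature-map entries")
```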
Q3: What framework do you use for 4-bit QAT?
A3: For those who may be unfamiliar with the term, QAT refers to quantization-aware training. The idea is that QAT simulates low-precision computation in the forward pass during training, so that the back-propagated error, and hence the weight updates, reflect the quantized computation. That way, when the fully quantized model is executed during inference, the quantized parameters (and possibly activations) still produce the correct predictions and classification accuracy is maintained. For those who are interested, here is a link to a Google blog entry that discusses QAT.
We use a custom QAT framework built specifically to be efficient on our Akida NSoC. However, I should note that our QAT framework is built on the TensorFlow Keras API for users’ convenience. Those who are familiar with QAT frameworks should be pretty comfortable with our implementation. Here is a link to our quantization aware training documentation.
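For readers who want to see the fake-quantization idea in code, here is a minimal TensorFlow/Keras sketch; the layer, quantization range, and toy model below are assumptions for illustration and are not our actual QAT API:

```python
import tensorflow as tf

# Hypothetical 4-bit fake-quantization layer for QAT (not BrainChip's actual API).
# The forward pass quantizes activations to 16 levels; the fake-quant op uses a
# straight-through estimator so gradients still flow during training.
class FakeQuant4Bit(tf.keras.layers.Layer):
    def call(self, x):
        return tf.quantization.fake_quant_with_min_max_args(
            x, min=0.0, max=15.0, num_bits=4)

# Usage: insert after a ReLU so activations are quantized during training.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    FakeQuant4Bit(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```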
Q4: Does the event-based calculation help in reducing inference time or just in reducing power?
A4: Good question. Event-based calculation reduces both inference time and power. Inference time is also reduced because we have built the hardware to be event-based from the ground up: Akida doesn’t waste time searching each feature map for the non-zero activations but instead communicates information by sending only events, rather than an entire feature map (which would contain a large number of zero-valued activations that would have to be scanned and discarded).
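A toy comparison of the arithmetic involved (a sketch of the idea, not of the hardware implementation) shows why skipping zero-valued activations saves work:

```python
import numpy as np

# Toy comparison (not the hardware implementation): a dense 1x1 "layer" touches
# every feature-map entry, while an event-driven version only touches the
# non-zero activations that were actually sent as events.
activations = np.random.randint(0, 16, size=(32, 32, 16))
activations[np.random.rand(*activations.shape) < 0.9] = 0   # high activation sparsity
weights = np.random.randint(-8, 8, size=(16, 8))             # 16 in-channels -> 8 out

# Dense: every (x, y) position multiplies all 16 channels, zeros included.
dense_macs = activations.shape[0] * activations.shape[1] * weights.size

# Event-driven: each event (non-zero activation) contributes to 8 outputs.
num_events = np.count_nonzero(activations)
event_macs = num_events * weights.shape[1]

print(f"dense MACs: {dense_macs}, event-driven MACs: {event_macs}")
```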
Q5: Is the work scheduling for NPUs split up by frame or do all NPUs work on one frame at a time? To add: If it is one frame at a time, is it image quadrant or model layer based?
A5: The Akida architecture is very flexible in this regard. NPUs are organized by layer, but a layer can be further subdivided by inputs (e.g. image quadrant) or by filters. We also have a processing mode that allows NPUs to work on different frames and layers in parallel. For example, layer 4 of frame 1 can be processed by one set of NPUs in parallel with another set of NPUs processing layer 3 of frame 2. Depending on the application, this can be very useful.
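The parallel mode described above can be visualized with a small, purely conceptual schedule (this is not Akida’s scheduler, just an illustration of frames pipelined through layers):

```python
# Conceptual sketch of pipelined frame/layer processing: at each step, one group
# of processors works on frame i at layer L while another group works on
# frame i+1 at layer L-1, and so on.
frames = ["frame1", "frame2", "frame3"]
num_layers = 4

for step in range(len(frames) + num_layers - 1):
    in_flight = [(f, step - i + 1) for i, f in enumerate(frames)
                 if 1 <= step - i + 1 <= num_layers]
    print(f"step {step}: " + ", ".join(f"{f} layer {l}" for f, l in in_flight))
```

Running this prints, for example, a step where frame1/layer 4 and frame2/layer 3 are processed concurrently, matching the example in the answer above.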
Q6: Is the activation regularization performed in-situ?
A6: No, activity regularization is performed during training. It has to occur during training because the activity-regularization term is built into the loss function.
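For readers unfamiliar with the concept, here is a generic Keras example of activity regularization (the penalty type and coefficient are illustrative assumptions, not our training recipe): an L1 penalty on a layer’s activations is added to the loss, pushing activations toward zero and therefore increasing event sparsity.

```python
import tensorflow as tf

# Generic example of activity regularization (not our exact training setup):
# the L1 penalty on this layer's output activations is added to the training
# loss, encouraging activations (and hence events) to be sparse.
layer = tf.keras.layers.Conv2D(
    16, 3, activation="relu",
    activity_regularizer=tf.keras.regularizers.l1(1e-4))
```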
Q7: Can Akida do on the fly activity regularization for retraining?
A7: No, not currently.
Q8: The one-shot learning was really cool. I was wondering, if I take a photo of an elephant in the room and inference is done in a jungle at night, will it work? Does the network learn to see the elephant in the forest darkness?
A8: The short answer is “probably not”. The long answer is that this scenario is a bit unlikely because of all the preprocessing that occurs in an image-classification pipeline. The type of one-shot learning Akida performs requires a high-quality feature vector as input. Without preprocessing to enhance or transform a very dark image into a reasonably ‘well-lit’ image, the feature vector will be poor and thus classification will be poor. However, modern image-processing pipelines often take care of this sort of thing well before the dark image even reaches the CNN, so it’s not as much of an issue as one might think. One could probably use or generate a data set that contains lots of animals in the dark and train on that directly, but some tricks might be needed to handle both night-time and daytime images.
Q9: Do new classes work if background is different from training background?
A9: Good question. The answer is: yes. We just didn’t want the video to show some random class like ‘Tiger’ or ‘Police Car’ when the background was showing.
Q10: In Akida’s current state, what would be a real-world application that it would most easily integrate with?
A10: Edge applications whose solutions are severely constrained by size, weight, and power are always a good start. Those that require keyword spotting, person detection, object classification, or multiple CNNs that perform similar tasks but require sensor fusion would be a very natural fit for Akida. Applications with 3D point-cloud inputs, like LiDAR and DVS cameras, also translate to Akida in a straightforward way.
Q11: What benefit is there in chaining multiple chips to make a greater number of NPUs available to the network?
A11: Chaining multiple chips simply gives the user the ability to tackle more computationally challenging machine-learning tasks. Using two Akida chips may allow a user to reach a frame rate that a single Akida chip wouldn’t be able to provide.