Here are more questions and answers from the webinar on xcore.ai:
Q: Does the chip support both floating point and int8 operators?
A: Our port of the runtime supports all built-in operators that have reference implementations in the TFLite for Microcontrollers (TFLM) project. This includes a variety of floating point as well as int8 quantized operators. The xcore.ai vector unit, on the other hand, implements only integer arithmetic, so for maximal performance we recommend using int8 quantized models on our platform.
Q: What is the size of the TFLite for Microcontrollers runtime?
A: The size of the runtime depends on which subset of operators is registered, which is in turn governed by the model(s) you want to run. This can range from as little as ~20KB to as much as ~140KB when all built-in and xcore-specific custom operators are registered. The runtime overhead for the model in the demo was under 100KB.
It is worth noting that the TFLite flatbuffer model does not include the memory allocated for input/output data and activation tensors. Moreover, the additional application overhead will depend on what other components are needed for the full system (e.g. FreeRTOS, the mic array, HW interfaces).
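To make the distinction concrete, here is a minimal Python sketch (illustrative only, not XMOS tooling; the model and tensor sizes are made up): the memory needed for activations is set by the peak sum of simultaneously live tensors, which is independent of the size of the flatbuffer file itself.

```python
# Activations live in a separate runtime-managed buffer; its size is governed
# by the peak sum of simultaneously live tensor sizes, not by the model file.
def peak_activation_bytes(schedule):
    """schedule: list of sets of (tensor_name, size_bytes) live at each op."""
    return max(sum(size for _, size in live) for live in schedule)

# Hypothetical 3-op model:
schedule = [
    {("input", 3072), ("conv_out", 8192)},   # op 1: conv
    {("conv_out", 8192), ("pool_out", 2048)},  # op 2: pool
    {("pool_out", 2048), ("logits", 40)},    # op 3: fully connected
]
print(peak_activation_bytes(schedule))  # 11264 bytes, peaking at op 1
```

This peak, plus the runtime and the flatbuffer itself, is what has to fit in the memory budget.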
Q: Do you have a demo for wakeword detection?
A: You can deploy any TFLite model on our platform whose operators are supported by TFLM, so you can train/convert your own wakeword model or use a pretrained one available online. We have partnered with multiple wakeword providers for our commercial applications; please contact us at firstname.lastname@example.org if you are interested in a demo of one of these.
Q: Can you say more about the deployment workflow?
A: Stay tuned for a follow-up post on our blog where, among other things, we will give some more detail on the deployment workflow.
Q: Which kinds of workloads can you run efficiently on the logical cores besides CNNs or matmul-like kernels?
A: In addition to specialized instructions designed to speed up long sequences of dot products (and therefore CNNs, and matrix-vector multiplications), the xcore.ai vector unit also implements vectorized integer operations (e.g. ADD, SUB, MUL, MACC), as well as butterfly operations to speed up FFTs. Using our math and DSP libraries you can speed up a wide variety of computationally intensive applications. Moreover, the xcore architecture has been designed to be blazing fast at IO processing tasks where hard real-time guarantees are necessary.
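As a purely illustrative sketch (plain Python, not xcore instructions), this is the scalar form of the dot-product inner loop that the vector unit's multiply-accumulate instructions compute many int8 lanes of per cycle:

```python
def dot_int8(a, b):
    """Scalar int8 dot product -- the inner loop that vectorized
    multiply-accumulate (MACC) instructions parallelize across lanes."""
    acc = 0  # products are accumulated at a wider precision than int8
    for x, y in zip(a, b):
        assert -128 <= x <= 127 and -128 <= y <= 127, "int8 range"
        acc += x * y
    return acc

print(dot_int8([1, -2, 3], [4, 5, -6]))  # 4 - 10 - 18 = -24
```

Convolutions and matrix-vector products reduce to long sequences of exactly this operation, which is why speeding it up benefits CNNs directly.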
Q: Does each of the logical cores have an ALU that includes a multiply-accumulate unit?
A: The 8 logical cores on a tile share an ALU, which includes the floating point and integer vector units. Due to the 5-stage instruction execution pipeline in the ALU, at any given time any 5 of the 8 cores can execute instructions (e.g. a vectorized multiply-accumulate) simultaneously. Each of the 8 cores has its own set of registers, so the execution of instructions on one core never interferes with another.
Q: Does your converter perform quantization or is it the user’s responsibility?
A: Quantization is performed by the TFLite converter, and we provide a helper that configures and calls the converter in a single line of code. In the future we will wrap this together with our optimizer/converter so that converting a Keras model becomes a one-stop shop.
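For readers unfamiliar with what the converter does under the hood: TFLite uses an affine (scale and zero-point) int8 quantization scheme. A minimal pure-Python sketch of that mapping (the example scale and zero-point values here are arbitrary, chosen only to illustrate the arithmetic):

```python
def quantize(x, scale, zero_point):
    """Affine int8 quantization as used by TFLite:
    q = round(x / scale) + zero_point, clamped to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Recover an approximate real value from the int8 code."""
    return (q - zero_point) * scale

# Example: map the range [-1.0, 1.0] onto int8 with zero_point 0.
scale = 2.0 / 255
print(quantize(0.5, scale, 0))    # 64
print(dequantize(64, scale, 0))   # ~0.502 -- close to, not exactly, 0.5
```

In practice you do not write this yourself: the TFLite converter picks scales and zero-points from calibration data (via its `representative_dataset` setting) and bakes them into the model.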
Q: What is the inference time for MobileNetV2?
A: We are working on benchmarking several sample models, including different variants of MobileNet. More performance metrics will be published as development kits become available to the general public.
Q: How is the optimization connected to compilation in the converter?
A: The model optimization step happens before compilation. After the optimizer has transformed the model, you can use it the same way as you would use any other model in a TFLM project. Typically, you would integrate the model and the runtime with the rest of your embedded application (e.g. a camera interface or FreeRTOS instance), before you compile it.
Q: Can you do backpropagation as well (e.g. on some of the top layers)?
A: We are currently only targeting inference, particularly for low bit-width (e.g. quantized and binarized) models.
Q: Any idea how much demo boards will cost and when they’ll be available?
A: The first dev kits will be available to the public at the end of 2020 Q3. Register your interest at xcore.ai and we will keep you up to date.
Q: Does this chip support LSTMs?
A: The TFLM project currently does not support recurrent networks, but support is planned. As soon as the reference operators are introduced in TFLM, we will look at how we can speed them up on xcore.ai.
Q: Is xcore.ai a totally new CPU architecture or is it based on an existing one?
A: The xcore.ai chip implements the third generation of the xcore architecture, which is based on XMOS’s proprietary instruction set.
Q: Can you give further insight on the mapping process? It seems that there could be millions of transformations: how does XMOS find the most efficient implementation?
A: The xcore.ai optimizer/converter works analogously to a pre-compilation optimization tool. Similarly to a compiler, the implementation itself consists of a sequence of transformation passes (canonicalization, optimization, cleanup, etc.) applied in multiple stages. These passes mutate the computational graph of the model, just like a compiler would mutate an intermediate representation.
Since the set of possible optimizations is indeed very large, we are constantly working on expanding the optimizations we implement. Our approach is to focus our efforts on optimizing the operators that require the most computational resources and/or are used most often. To improve the performance, we test various implementations of the operator, then implement the necessary transformation passes to enable these new implementations.
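To give a flavor of what a transformation pass looks like (a toy sketch only; the graph representation and op names here are simplified stand-ins, not our actual tooling), here is a pass that fuses a convolution with a following ReLU into a single fused operator:

```python
# A toy computational graph as an ordered list of (op_name, attrs) tuples.
# This pass fuses each CONV_2D followed by RELU into one fused op, much
# like a compiler pass rewrites an intermediate representation.
def fuse_conv_relu(graph):
    out = []
    i = 0
    while i < len(graph):
        if (i + 1 < len(graph)
                and graph[i][0] == "CONV_2D"
                and graph[i + 1][0] == "RELU"):
            attrs = dict(graph[i][1], activation="RELU")
            out.append(("CONV_2D", attrs))  # fused replacement
            i += 2  # consume both original ops
        else:
            out.append(graph[i])
            i += 1
    return out

model = [("CONV_2D", {"stride": 1}), ("RELU", {}), ("SOFTMAX", {})]
print(fuse_conv_relu(model))
# [('CONV_2D', {'stride': 1, 'activation': 'RELU'}), ('SOFTMAX', {})]
```

A real optimizer chains many such passes, each small and individually testable, which is how a very large space of possible rewrites stays manageable.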
Q: Is there a throughput number, e.g., TOPS, TOPS/W, or inferences/sec on reference network?
A: The theoretical peak throughput of the chip is 51.2 GMACC/s in int8 mode, under full utilization of the vector units on all cores across both tiles. We are working on benchmarking several sample models and will publish performance metrics as soon as they are available.
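As a back-of-envelope exercise, the peak figure gives a lower bound on inference time from a model's MACC count alone (the ~300M MACC figure below is an approximate, commonly cited number for MobileNetV2 at 224x224, used here only as an illustration; real models never reach 100% utilization):

```python
PEAK_MACCS_PER_S = 51.2e9  # theoretical int8 peak across both tiles

def min_inference_time_ms(model_maccs, utilization=1.0):
    """Lower-bound inference time implied by the MACC count alone.
    Memory traffic and non-MACC ops mean real times will be higher."""
    return model_maccs / (PEAK_MACCS_PER_S * utilization) * 1e3

# E.g. a model with ~300M MACCs (roughly MobileNetV2 at 224x224):
print(round(min_inference_time_ms(300e6), 2))  # 5.86 ms at full utilization
```

Treat this strictly as a floor; the published benchmarks will reflect measured, not theoretical, performance.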
Q: What is the typical power consumption during inference?
A: The chip has only recently come back from the foundry, so precise power benchmarks will be available once characterization is finished. We expect power consumption of a few hundred mW at 700MHz, and the chip can also be clocked down.
Q: How are the model parameters stored and accessed? There is 512 KB of RAM but that is enough only for a very small model. Can it execute from Flash? How big is the Flash/ROM?
A: There is 512KB of SRAM on each tile, and there are two tiles on a chip. The chip supports execution from FLASH, as well as LPDDR1 memory, the size of which depends on what’s required by the application. Our deployment tools will allow models to store coefficients in either type of external memory and efficiently access those at runtime. If the model fits in SRAM, however, the vector unit has single cycle access to all coefficients.