There is a need for easy-to-use tools. Instead of separate training and optimization tools, it is desirable to have an integrated tool that takes the training data and produces simple C code for a deep net that runs as-is on a low-end MCU or DSP.
Google’s, ST’s, NXP’s, and ARM’s libraries for MCU-based inference are currently far more cumbersome than they need to be.
Another important issue has to do with training objectives. Currently, most approaches convert a high-precision, trained network to a version that runs on simpler hardware. There is a need for training regimes with dual-purpose objective functions that try to reduce the prediction error and weight-matrix complexity simultaneously, starting from the very first epoch.
I find this post confusing. It mixes the need to automatically architect a tiny NN (something that, e.g., AutoML/AutoKeras cannot do) with the need to efficiently deploy NNs on heterogeneous micro hardware at FP32/INT8 (something the libraries mentioned can do). True that the sky is blue, but it depends on the lens one uses. Criticism is always an easy job.
This is our goal at Edge Impulse—taking the amazing low-level tools that the community has created and making them easy to use for embedded developers who don’t necessarily have machine learning experience but have a deep understanding of the domain in which they work. This allows domain experts to train, evaluate, and deploy models.
I strongly feel that this is the right approach; the alternatives are that we expect all domain experts to become deep learning experts, or that we expect deep learning experts to become domain experts!
We’re continually evolving the product towards this goal, and I’d love to hear your feedback if you have the opportunity to try it!
Indeed, Dan. The original post was a criticism of how cumbersome the current set of tools available to embedded-system developers is. The post offered a remedy as well: it pointed toward a way to automatically generate hardware-agnostic nets that are easy to include in embedded systems. That way is a training regime that simultaneously optimizes for error as well as hardware efficiency, and spits out the result in the form of simple C code.
We have had some success with the dual-optimization approach. The resultant DNN inference engine takes only 110 bytes on a Cortex-M4 (example). It can be made to run faster on MCUs/DSPs that support SIMD/VLIW, and it can also be easily transformed into a very compact, lightning-fast FPGA implementation.