Going mobile: AI in embedded devices
When you talk to Alexa, Google Assistant or Siri, your voice is recorded and sent to the cloud. All the processing – speech recognition, natural language understanding and the generation of an answer or action – is performed in a large data centre. Similar processes can be found in many other AI-enabled products. However, this approach is not feasible for all kinds of applications. If an AI-enabled device has to work in remote areas with a limited internet connection, if it requires very high reliability or very low latency, or if data privacy reasons prohibit the use of cloud compute, the processing has to be performed on the device itself. This is why many autonomous robots, drones and self-driving vehicles rely on embedded hardware for various critical tasks: a machine learning model is developed and trained using standard hardware or cloud services, but then the model is deployed on the mobile device itself and performs “inference at the edge”.
Hardware for inference at the edge
Of course, there are very specific requirements for hardware that can be used in this kind of application. While GPUs or other specialised hardware accelerators like Google’s TPUs are great for training a machine learning model, they are heavy and require a lot of power. To explore what other options are available for machine learning at the edge, Machine Intelligence Garage organised a meetup with the NVIDIA Jetson team.
The NVIDIA Jetson is a small, lightweight and low-powered platform designed for embedded inference using deep neural networks. In the TX2 version, the Jetson’s CPU consists of a dual-core NVIDIA Denver2 processor and a quad-core Arm Cortex-A57 processor. It is accelerated by an NVIDIA Pascal GPU with 256 CUDA cores, giving the Jetson TX2 a theoretical peak performance of 1.33 TFLOPS for half precision (16 bit) inference. It has 8GB of memory and 32GB of data storage available. Overall, the module weighs 85 grams and consumes 7.5 to 15 watts, making it deployable in a wide range of scenarios.
Other products for embedded inference include the Intel Movidius product family, which offers processors that are specialised to accelerate computer vision workloads at the edge. Like NVIDIA Jetson products, Intel Movidius chips enable high performance inference while at the same time being lightweight and highly energy efficient. Similarly, the UK based semiconductor company Arm develops hardware for embedded AI applications, the Arm Machine Learning Processor. Google has also announced its entry into the thriving market of machine learning optimised chips for inference at the edge with its Edge TPU. At the same time, several startups worldwide, such as Cambricon and Kneron, are developing innovative chip designs for AI at the edge.
The importance of this development is underlined by the fact that several manufacturers of processors for mobile applications, in particular smartphones, have started to optimise their products for deep learning workloads. This includes the latest Qualcomm Snapdragon processors, Apple’s A11 Bionic chip and the Huawei Kirin 970 (using the Cambricon NPU). These chips all make use of a coprocessor that is specialised for fast and energy efficient inference at a reduced precision of usually 16 bit.
Generating a small model for embedded use: pruning and regularisation
Although there are powerful hardware solutions for devices at the edge, there are always trade-offs with mobile compute solutions. Compared to an up-to-date GPU, both the throughput and the memory of mobile processors are considerably smaller. Therefore, machine learning models usually require certain adaptations for performing inference at the edge. A critical parameter to consider is the size of the model, i.e. the combined size of all weights and parameters. Not only does the model have to fit within the available memory, but every memory operation also requires time and power. Thus, a sparse model (one without neurons that have negligible contribution to the outcome) is desirable to decrease the memory footprint, accelerate the computation and reduce the power consumption. Moreover, training a sparse model is a good way to prevent overfitting and create models that generalise well.
Methods to obtain sparse neural networks, i.e. networks with a small number of non-zero parameters, are known as pruning (figure 2). Several pruning techniques are in use. A regularising term is often added to the loss function during the training process to enforce simple models that don’t overfit. A regularising term based on the L1-norm additionally helps to generate a memory efficient model. This regularisation is often referred to as LASSO (least absolute shrinkage and selection operator) [1]. Since LASSO adds a term of the form 𝜆∑∣𝛽∣ to the loss function, it penalises non-zero weights 𝛽 and usually results in a sparse model with a small memory footprint. Besides the L1-norm, other norms are used with the regularising term as well: while the L2-norm (“ridge regression”) does not promote sparsity, the use of the L0-norm, which simply counts non-zero weights, can yield even more sparsity than LASSO. However, since the L0-norm is non-differentiable at 0 and its derivative is 0 everywhere else, gradient descent cannot be used directly in a model with L0 regularisation. Workarounds for this problem have been explored recently in [2]. For additional details on regularisation techniques in general see for example [3, Chapter 7].
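The sparsifying effect of the L1 term can be illustrated with a small sketch. It is not taken from any of the cited works: it assumes a plain linear regression on synthetic data and uses proximal gradient descent (soft-thresholding) as one common way to handle the non-smooth penalty. The names `soft_threshold` and `fit_linear`, the data and the value of `lam` are all chosen for illustration only.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink every weight towards 0 by t; weights smaller than t become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_linear(X, y, lam=0.0, steps=2000):
    """Minimise 1/(2n) * ||X beta - y||^2 + lam * sum(|beta|) by proximal gradient descent."""
    n, d = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the gradient of the squared loss.
    lr = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ beta - y) / n                    # gradient of the squared loss only
        beta = soft_threshold(beta - lr * grad, lr * lam)  # L1 penalty handled by shrinkage
    return beta

# Synthetic data (assumption): only 3 of the 20 inputs actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[:3] = [2.0, -3.0, 1.5]
y = X @ true_beta + 0.1 * rng.normal(size=200)

dense = fit_linear(X, y, lam=0.0)    # no regularisation
sparse = fit_linear(X, y, lam=0.1)   # L1 regularisation
print("non-zero weights without L1:", np.count_nonzero(dense))
print("non-zero weights with L1:   ", np.count_nonzero(sparse))
```

Only the non-zero weights of the regularised model would need to be stored, which is where the smaller memory footprint comes from.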
As an alternative or complementary approach to reduce the size of a machine learning model, pruning can be applied to a pre-trained model by identifying and removing neurons that have little effect on the results (figure 2). This idea was pioneered by Yann LeCun and colleagues [4]. After removing the least important neurons from the network, the rest of the model is usually fine-tuned to adjust to this change and the process is iterated (figure 3). The critical step is the identification of neurons to be pruned, and a large variety of methods have been developed to efficiently rank neurons by their importance to the network outcome [5,6,7]. In regression models, the easiest method is to simply prune all connections with an absolute weight below some threshold [8]. Neurons without inputs or outputs and their remaining connections can also be removed (figure 3). However, in most cases, better results can be obtained by ranking connections by the steepness of their gradients. By averaging the importance of neurons within feature maps, pruning can be applied not only to single neurons in regression models but also to bigger components like convolutional filters in CNNs [5,7]. This is particularly interesting because it can significantly reduce the computation time: while in typical CNNs the majority of parameters are used in fully connected layers, most compute operations have to be performed in the convolutional layers.
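A minimal sketch of the simplest criterion mentioned above, pruning by absolute weight, might look as follows. It assumes a single weight matrix of shape (inputs, outputs) and shows one pruning step only; the fine-tuning and iteration are omitted. The function name `prune_by_magnitude` and the `sparsity` parameter are hypothetical and not part of any particular library.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Set the given fraction of weights with the smallest absolute value to zero.

    Returns the pruned weights and a boolean mask marking the surviving connections.
    """
    k = int(sparsity * weights.size)             # number of connections to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest absolute weight serves as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                      # toy layer: 8 inputs, 4 output neurons
W_pruned, mask = prune_by_magnitude(W, sparsity=0.5)

# Output neurons whose column is entirely zero have no inputs left and could be removed.
dead_outputs = ~mask.any(axis=0)
print("remaining connections:", np.count_nonzero(W_pruned))
print("output neurons without inputs:", np.flatnonzero(dead_outputs))
```

In an iterative scheme the mask would be kept fixed during fine-tuning so that pruned connections stay at zero.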
Reducing the size of weights and activations: quantisation of neural networks
A model’s memory footprint and, given suitable implementation and hardware, the required computation time can be significantly reduced by constraining weights and activations to fixed point values. This quantisation is most commonly used for inference only, since training quantised networks requires additional adjustments to the gradient descent algorithm. In many models, weights and activations that are limited to 8 bit precision (256 different values) are successfully used for inference without having a strong effect on the accuracy of the model. However, even experiments with weights and activations that are constrained to just 2 values (1 bit precision) have been published [11].
When quantising a model, a sensible choice of the range of values that are represented by the quantised parameters is crucial. For 8 bit models the obvious choice is to represent the minimal and maximal value by 0 and 255 (or -128 and +127) respectively and to distribute the float values linearly over this interval. However, different modifications to this representation might yield superior results. For instance, choosing a range of values that is smaller than the maximal interval, a non-linear distribution, a representation that exactly encodes the float value 0, or a representation that is symmetric around 0 have been shown to be beneficial in different use cases.
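The linear 8 bit mapping described above can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the scheme of any specific framework: it maps the observed minimum and maximum to 0 and 255 and extends the range to include 0 so that the float value 0 is encoded exactly. The names `quantise`, `dequantise`, `scale` and `zero_point` are chosen for this example.

```python
import numpy as np

def quantise(x):
    """Map float values linearly onto the integers 0..255 (unsigned 8 bit)."""
    lo = min(float(x.min()), 0.0)         # extend the range so that it contains 0
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 if all values are 0
    zero_point = int(round(-lo / scale))  # the integer that represents the float value 0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    """Recover approximate float values from the 8 bit representation."""
    return scale * (q.astype(np.float64) - zero_point)

weights = np.array([-1.0, -0.25, 0.0, 0.5, 2.0])
q, scale, zero_point = quantise(weights)
restored = dequantise(q, scale, zero_point)
print("quantised:", q)
print("largest round-trip error:", np.abs(restored - weights).max())
```

Each value now takes one byte instead of four, at the cost of a rounding error that is bounded by the step size `scale`.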
Designing models for efficient computation
Recently, further methods have been developed to reduce the model size. One particularly efficient approach replaces full convolutions from M input channels to N output channels with a k*k sized filter by depthwise separable convolutions [9,10]. Here, only a single convolution with a k*k filter is applied to every input channel. The resulting M feature maps are subsequently combined into N output channels by 1*1 convolutions (figure 4). This reduces the number of required parameters and operations for a convolution to a fraction of 1/N + 1/k² of the original while still yielding a competitive accuracy. This approach is more easily understood with an example: when applied to ImageNet data, the first convolutional layer of the VGG-16 architecture [12] uses a 3*3 convolutional kernel to generate N = 64 feature maps from M = 3 channels of a 224*224 pixel image. This requires 3² * 224² * 3 * 64 = 86,704,128 multiply-add operations. A depthwise separable convolution first performs depthwise (per-channel spatial) convolutions on the 3 input channels (3² * 224² * 3 = 1,354,752 operations), followed by pointwise convolutions with a 1*1 kernel (224² * 3 * 64 = 9,633,792 operations), resulting in an 87% reduction of floating point operations.
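The arithmetic of the VGG-16 example can be checked with a short script. It assumes stride 1 and padding that keeps the output at 224*224 pixels, and it counts multiply-add operations only; the function names are hypothetical helpers for this illustration.

```python
def conv_mult_adds(k, m, n, height, width):
    """Multiply-adds of a full k*k convolution from m input to n output channels."""
    return k * k * m * n * height * width

def separable_mult_adds(k, m, n, height, width):
    """Multiply-adds of a depthwise separable convolution with the same shapes."""
    depthwise = k * k * m * height * width   # one k*k filter per input channel
    pointwise = m * n * height * width       # 1*1 convolution mixing the channels
    return depthwise + pointwise

# First convolutional layer of VGG-16 on a 224*224 RGB image: k = 3, M = 3, N = 64.
full = conv_mult_adds(3, 3, 64, 224, 224)
separable = separable_mult_adds(3, 3, 64, 224, 224)

print("full convolution:     ", full)
print("separable convolution:", separable)
print("remaining fraction:   ", separable / full)
print("1/N + 1/k^2:          ", 1 / 64 + 1 / 3 ** 2)
```

The last two printed values agree, which confirms that the saving depends only on the number of output channels and the kernel size, not on the image size or the number of input channels.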
Every machine learning application can benefit from the techniques presented here for reducing the size of a network, since a smaller network increases the throughput and decreases the risk of overfitting. Pruning methods are, however, especially important in situations with limited compute resources such as embedded applications.
Where are embedded systems used?
Many autonomous systems need to perform at least a part of their computations locally. During the meetup we explored some of these use cases together with innovative SMEs that offer solutions to some very interesting problems.
BotsAndUs is a UK based startup that develops service robots that can assist customers in a range of scenarios. The applied machine learning technology includes image processing for navigation and the recognition of humans and facial expressions as well as speech recognition and NLP for the interaction with humans. A critical parameter in this use case is the battery lifetime, imposing the need for low powered compute resources.
Not only the power consumption of hardware but also its weight is a crucial factor when it comes to applications in aerial drones. BeTomorrow develops software for drones that are used in entertainment and in industry. Both applications require advanced, low latency SLAM (simultaneous localisation and mapping) techniques that use data from cameras or LIDARs. This enables the drones to coordinate with other drones or conduct high precision measurements and inspections of their surroundings, used for example in architectural quality control.
Since AI that is deployed in autonomous and semi-autonomous vehicles needs to work with both extremely high reliability and low latency, AI solutions at the edge can be found in several application fields. The startup Letos focuses on a very particular use case, the detection and interpretation of human emotions from video and audio recordings. This can be used to infer the attention of the driver of a semi-autonomous vehicle, but several other applications are conceivable as well. Letos therefore leverages the NVIDIA Jetson platform to develop a low-latency, universally applicable product.
To learn more about the latest hardware and software developments for machine learning as well as fascinating new use cases, get in touch with Machine Intelligence Garage and visit our events. Machine Intelligence Garage is a programme delivered by Digital Catapult to help UK based startups in the machine learning field overcome one of the largest barriers they face: access to computation power. This is accompanied by expertise in a wide range of hardware resources and well-founded support with the choice of the most suitable hardware and its best utilisation.
[1] Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, 1996
[2] Louizos, Welling and Kingma, Learning Sparse Neural Networks through L_0 Regularization, ICLR 2018
[3] Goodfellow, Bengio and Courville, Deep Learning, MIT Press, 2016 http://www.deeplearningbook.org
[4] LeCun, Denker and Solla, Optimal brain damage, Advances in neural information processing systems, 1990
[5] Molchanov et al., Pruning convolutional neural networks for resource efficient inference, ICLR 2017
[6] Han, Mao and Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv:1510.00149, 2015
[7] Li et al., Pruning Filters for Efficient ConvNets, ICLR 2017
[8] Han et al., Learning both Weights and Connections for Efficient Neural Networks, arXiv:1506.02626, 2015
[9] Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv:1704.04861, 2017
[10] Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, arXiv:1610.02357, 2016
[11] Courbariaux et al., Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, arXiv:1602.02830, 2016
[12] Simonyan and Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014