High performance computing for AI
We are living in the age of Big Data. Every day, around 2.5 million terabytes of data are generated, and this process is accelerating: according to recent estimates, 90% of all data on the internet has been generated in the last two years [1]. One answer to the problem of extracting insights from this flood of data is the breathtaking development that machine learning, and in particular deep learning, has undergone in the last decade. Great efforts are being made to improve the efficiency of machine learning models. However, the increasing amount of training data, as well as the growing demands on the accuracy of these models, have led to rapidly growing computational complexity.
Why use high performance computing for the training of AI?
The increasing hardware needs resulting from this development are being addressed from different angles: because GPUs can efficiently perform the type of computations used in deep learning, such as dense matrix multiplications and convolutions, they have become the standard hardware for deep learning workloads. Additionally, various novel chips specifically designed to accelerate machine learning applications are being developed, such as Google's TPUs and Graphcore's IPUs. Cloud computing providers are constantly expanding their offerings with hardware and software that better meet the needs of machine learning applications. More generally, techniques that are commonly used in high performance computing (HPC) for distributing workloads over multiple nodes are now being adopted for machine learning applications [2]. Consequently, in recent years the AI field has progressed to using hardware with multiple nodes, each accelerated by GPUs (figure 1). Yet few people think of HPC when it comes to solutions for the computation bottleneck in artificial intelligence.
Figure 1: Development of deep learning hardware layouts. Figure adapted from [3]
Against this background, we explored the use of HPC for machine learning applications at a Machine Intelligence Garage meetup hosted by Digital Catapult together with experts from two of the most established companies in the HPC field: Adrian Jackson from EPCC, a world leading HPC centre, and Rajesh Anantharaman from the global supercomputing leader Cray inc. They were joined by Nicolas Tonello, founder of Constelcom, a startup with the mission to make using HPC as simple as using cloud compute.
HPC: The basics
The performance of a single processor is limited by physical constraints around its size, its energy consumption, and its heat production. Consequently, the key to HPC is the parallelism of many processors. According to Adrian Jackson, any parallel system consists of four principal technologies that determine its speed:
- Processors and processing accelerators
- Temporary memory, including the main memory (RAM) and CPU caches
- Data storage
- Interconnects for communication between components
Frequently, it is the speed of memory and storage access, and of the communication between components, that is the bottleneck for the overall performance, rather than the processor speed. Hence, the full system architecture has to be factored in to maximise the hardware performance for machine learning applications.
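The interplay between processor speed and memory bandwidth is often summarised with the roofline model, which bounds attainable throughput by the minimum of the compute peak and the product of memory bandwidth and arithmetic intensity (FLOPs performed per byte moved). A minimal sketch in Python, using illustrative figures rather than those of any specific machine:

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline model: attainable FLOP/s is capped either by the compute
    peak or by how fast memory can feed the processor."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

# Illustrative figures, not those of any specific accelerator:
PEAK = 7.0e12  # compute peak in FLOP/s
BW = 9.0e11    # memory bandwidth in bytes/s

# Element-wise vector addition moves ~12 bytes (two loads and one store
# of 4-byte floats) per single FLOP -- memory-bound, far below the peak.
print(attainable_flops(PEAK, BW, 1 / 12))

# Large dense matrix multiplications reuse each loaded value many times,
# so their high arithmetic intensity lets them reach the compute peak.
print(attainable_flops(PEAK, BW, 100.0))
```

With these illustrative numbers, the vector addition is limited to 75 GFLOP/s of the 7 TFLOP/s peak, while the matrix multiplication is compute-bound; this is one reason dense operations map so well onto GPUs.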
The parallel hardware architecture comes in two different flavours: in a shared memory architecture, multiple processors share a common memory space, making programs relatively simple to implement. However, the scaling of these systems is limited. In a distributed memory architecture, on the other hand, each processor has its own memory space (figure 2). The communication required via interconnects makes programming applications for distributed memory hardware more complex, yet this architecture has excellent scalability properties.
Figure 2: Different parallel hardware architectures. Figure adapted from Adrian Jackson, EPCC
Real-world applications generally employ a hybrid architecture: so-called distributed shared memory systems, which consist of several nodes whose processors share memory within a node.
In modern HPC architectures, nodes are supplemented with accelerator chips such as GPGPUs, which take over much of the computation. An example of such a system is the Cray XC50 Piz Daint at the Swiss National Supercomputing Centre (CSCS), currently the third fastest supercomputer in the world.
How to use HPC for deep learning?
When using any HPC infrastructure to train a deep learning model, the critical question is how the workload can be distributed over multiple nodes without generating too much communication overhead. By their design, deep neural networks allow different ways of implementing parallelism on HPC hardware. The most straightforward and popular approach is data parallelism, where the full model is stored on every worker and the workload is partitioned by the input data (figure 3a). If N workers are available for training, a mini-batch is split into N equal parts and each part is assigned to one worker. In model parallelism, different parts of the model are processed on different workers (figure 3b), and layer pipelining uses different workers simultaneously for different layers of the model (figure 3c). Model parallelism and layer pipelining can process larger models since only a subset of the weights has to be stored on a single worker.
Figure 3: Ways of distributing workloads in a neural network. Figure adapted from [3]
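The data-parallel scheme can be made concrete with a toy, single-process simulation for a scalar linear model (all names and numbers below are illustrative): every "worker" holds the full model, computes the gradient on its shard of the mini-batch, and the shard gradients are averaged before a shared update. For equally sized shards, this average is exactly the gradient over the full mini-batch.

```python
def grad(w, xs, ys):
    """d/dw of the mean squared error mean((w*x - y)^2) over the samples."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, n_workers, lr=0.01):
    """One simulated data-parallel SGD step."""
    shard = len(xs) // n_workers  # assume the batch divides evenly
    grads = [grad(w, xs[i * shard:(i + 1) * shard],
                     ys[i * shard:(i + 1) * shard])
             for i in range(n_workers)]  # one gradient per worker
    g = sum(grads) / n_workers           # all-reduce: average the gradients
    return w - lr * g                    # identical update on every worker

# Toy data generated by w_true = 2:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, xs, ys, n_workers=2)
print(round(w, 3))  # → 2.0
```

Because each worker needs only its shard of the data but a full copy of the model, this approach scales with the batch size rather than with the model size, matching the trade-off described above.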
When using a distributed memory hardware architecture, different kinds of information have to be exchanged between the nodes via interconnects. MPI (Message Passing Interface), a standard that is well established in high performance computing, is often the method of choice to guarantee fast and reliable communication between the nodes. Frameworks such as Horovod [4] or Cray's PE DL Plugin automate the distribution of deep learning models over multiple nodes to support the end user. They ensure good scaling efficiency by managing the communication and balancing the workloads.
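Under the hood, such frameworks typically average gradients with a ring all-reduce, in which each worker only ever exchanges one chunk of its gradient vector with a neighbour per step, keeping the communication volume per worker independent of the number of workers. The following single-process simulation sketches the idea; it is a didactic sketch, not Horovod's actual implementation:

```python
def ring_allreduce(grads):
    """Average equally shaped gradient vectors the way a ring all-reduce
    would: a reduce-scatter phase sums one chunk per worker, then an
    all-gather phase circulates the summed chunks."""
    n = len(grads)                   # number of (simulated) workers
    chunk = len(grads[0]) // n       # assume the vector divides evenly
    bufs = [list(g) for g in grads]  # per-worker working buffers

    # Reduce-scatter: after n-1 steps, worker (c+n-1) % n holds the
    # fully summed chunk c.
    for step in range(n - 1):
        sends = [bufs[r][((r - step) % n) * chunk:
                         ((r - step) % n + 1) * chunk] for r in range(n)]
        for r in range(n):
            c, dst = (r - step) % n, (r + 1) % n
            for i in range(chunk):
                bufs[dst][c * chunk + i] += sends[r][i]

    # All-gather: circulate the summed chunks for another n-1 steps so
    # every worker ends up with the complete summed vector.
    for step in range(n - 1):
        sends = [bufs[r][((r + 1 - step) % n) * chunk:
                         ((r + 1 - step) % n + 1) * chunk] for r in range(n)]
        for r in range(n):
            c, dst = (r + 1 - step) % n, (r + 1) % n
            for i in range(chunk):
                bufs[dst][c * chunk + i] = sends[r][i]

    return [[v / n for v in b] for b in bufs]  # averaged copy per worker

# Three workers, each with a local gradient vector:
print(ring_allreduce([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0],
                      [7.0, 8.0, 9.0]]))  # every worker gets [4.0, 5.0, 6.0]
```

In a real cluster, each of these chunk exchanges would be an MPI send/receive between neighbouring nodes; the simulation only mimics the data movement pattern.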
What are the differences to cloud computing?
HPC offers a couple of advantages over classical cloud computing, particularly for computation-intensive R&D activities. HPC systems are optimised for large-scale applications. This includes a tight coupling of the different components and processes, as well as a software stack that is highly optimised for the hardware architecture. In particular, computations that profit from low-latency communication and high-bandwidth access to memory can receive a significant performance boost when executed on HPC hardware. Finally, great efforts have been made recently by HPC providers as well as third parties to make accessing and using HPC as easy as using local resources or cloud computing [6,7,8].
However, especially when compared to commercial cloud compute providers, barriers to getting access to HPC infrastructure remain. Machine Intelligence Garage facilitates HPC use by UK-based AI startups and provides access, free of charge, to HPC facilities hosted by Cray and by two major HPC centres in the UK, EPCC and STFC Hartree.
Machine Intelligence Garage is a programme delivered by Digital Catapult to help UK-based startups in the machine learning field overcome one of the largest barriers they face: access to computation power. This is accompanied by expertise on a wide range of hardware resources and well-founded support in choosing the most suitable hardware and making the best use of it.
[1] IBM, 10 Key Marketing Trends for 2017, https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement-watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf
[2] Baidu Research, Bringing HPC Techniques to Deep Learning, http://research.baidu.com/bringing-hpc-techniques-deep-learning/
[3] Tal Ben-Nun and Torsten Hoefler, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv:1802.09941, 2018
[4] Alex Sergeev and Mike Del Balso, Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow, https://eng.uber.com/horovod/