Bloomsbury AI’s Cape, in partnership with Cray and Digital Catapult, sets record in open domain machine reading
Digital Catapult and Cray have partnered to support AI company Bloomsbury AI in setting a record in open domain machine reading. Cray’s partnership on the programme enables the Machine Intelligence Garage to offer companies the supercomputing power and expertise required to develop and build machine learning and AI solutions. The Cape story from Bloomsbury AI shows how this can be done.
As part of the programme, Bloomsbury AI leveraged a Cray CS-Storm system in the Cray Accel AI lab to train and optimize the deep learning models within Cape, an open source technology that can answer questions about information contained in text documents.
The result of the collaboration is a new standard for reading comprehension performance on the TriviaQA competition, where a set of questions is used to measure a system’s ability to answer accurately using a large body of text as reference (Wikipedia articles and web pages). Question answering is a challenging Natural Language Processing (NLP) task and is used as a milestone for artificial intelligence systems. The reading comprehension competition allows participants to measure question answering prowess using a publicly available dataset (TriviaQA: A Large Scale Dataset for Reading Comprehension and Question Answering) and questions unknown to the competitor in advance.
The Bloomsbury AI Cape model took first place in the competition standings with scores of 67.32% EM (exact match) and 72.35% F1, setting a new state of the art in Wikipedia open-domain question answering. On the human-verified subset of the test set, the model answers an impressive ~80% of TriviaQA’s questions correctly.
As Bloomsbury AI were developing the Cape model, they used the Cray Accel AI lab, accessed via Digital Catapult’s Machine Intelligence Garage, and its Cray CS-Storm GPU-accelerated systems. The time needed for a model to come within 5% of its peak accuracy fell from over 42 hours using cloud resources to under four hours using distributed training on the Cray CS-Storm system. Training time reduction is an important contributor to Bloomsbury AI’s success, as model development is an iterative process: shorter training times allow more experiments and more hyperparameter tuning, which in turn leads to more accurate models.
Bloomsbury AI and Cape
Cape, from Bloomsbury AI, is a complete end-to-end solution for a question answering system, including models, back-end servers and front-end admin interfaces. The solution’s core functionality allows users to upload text documents to the system and then pose questions (in natural language) to a document or collection of documents. Cape then extracts the strings from the document(s) that best answer the question.
In addition, Cape can be used as a Python library, allowing developers and hobbyists to integrate natural language question answering components into their systems. The solution can run in inference mode on a variety of different hardware, including locally and on cheap CPU-only servers in the cloud.
At the heart of Bloomsbury AI’s solution, Cape is a deep neural network that predicts the probability that a short span of text is the answer to an inputted question and document pair. The task is known as machine comprehension, machine reading, or extractive question answering in the literature. Cape’s model is based on the document-QA model from Clark et al. 2017, augmented with an ELMo language model (Peters et al. 2018), and several other architectural tweaks. Cape is multipurpose, and specifically trained in order to perform well on several diverse datasets simultaneously.
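To make the idea of scoring answer spans concrete, the following is a minimal sketch of span selection in extractive question answering. It assumes the model emits one start score and one end score per token and that a span’s probability factorises into a start probability times an end probability. This is a common formulation for this task, but the scores, the `best_span` function and the `max_span_len` parameter are hypothetical and are not taken from Cape’s code.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_scores, end_scores, max_span_len=8):
    """Return (start, end, probability) of the most probable short span.

    Assumption: P(span) = P(start) * P(end), with end >= start and the
    span limited to max_span_len tokens.
    """
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_span_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best


# Hypothetical per-token scores for the question "Who built Cape?"
tokens = "Cape was built by Bloomsbury AI".split()
start_scores = [0.1, 0.0, 0.2, 0.1, 4.0, 0.5]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.4, 3.5]

start, end, prob = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Limiting the span length keeps the search cheap and reflects the fact that answers in this setting are short strings.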
The inclusion of the ELMo language model is highly beneficial in terms of accuracy, especially on datasets such as SQuAD (Rajpurkar et al. 2016), but at the cost of significantly increasing the model’s training time, especially for longer documents.
Cape uses an innovative caching strategy that allows it to answer quickly at inference time on documents it has seen before, but the team were experiencing training times of over 200 hours on typical cloud computing hardware (an NVIDIA Tesla K80 GPU on a system with 40GB of CPU RAM).
This training time severely limited the experiments that Bloomsbury AI’s machine learning engineers could perform, whilst also incurring significant server costs for the business.
Partnership with Digital Catapult and Cray
Through Machine Intelligence Garage, Digital Catapult’s programme to provide UK-based machine learning startups with access to cutting-edge compute resources, Bloomsbury AI accessed a powerful NVIDIA DGX-1 deep learning machine, which the team initially used to develop the model.
As a partner of Machine Intelligence Garage, Cray works with UK deep tech startups to help them access the compute resources they need to realise their business goals.
Cray’s Accel AI lab was able to provide Bloomsbury AI with access to a larger CS-Storm cluster system, and the support of one of their in-house experts. The Cray CS-Storm 500NX system is specifically built for AI workloads. The cluster used for this collaboration consists of 4 nodes, each with two 18-core Intel Xeon “Broadwell” CPUs, 512GB RAM and 8 NVIDIA Tesla V100 GPUs. The nodes are connected by InfiniBand, and the system is supported by a 55TB Cray ClusterStor Lustre filesystem.
The partnership’s goals were to demonstrate how Cray’s AI machines could be used to provide real business value, and to showcase the raw power of the CS-Storm systems. The models trained on the CS-Storm system would be required to work with the rest of Cape’s technology. This provided the team with two constraints:
- The model architecture couldn’t significantly change.
- The model’s size (parameters, number of layers etc) had to be constrained so that it could still run on regular hardware.
The team decided to focus on how the training dynamics could be modified to improve the model, and on accelerating the training process.
Cape is trained on several different datasets, some in-house and others publicly available. Of particular interest to the Bloomsbury AI team is the TriviaQA dataset from Joshi et al. 2017.
TriviaQA tests a machine reading system’s ability to answer factual questions on full web and Wikipedia articles. The dataset itself is used in a distantly supervised paradigm. It consists of natural language questions, their answers, and a list of several Wikipedia or web articles that should contain the answers. These articles are split into chunks of a predefined number of words, and the model is trained on each chunk independently.
Chunks that happen to contain the answer to the inputted question may not contain sufficient information to infer that the answer is indeed the correct answer. This is why the training regime for TriviaQA is said to be distantly supervised, as in practice, although most of these data points are high quality, there is a significant amount of noise.
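The chunking and distant labelling described above can be sketched as follows. This is an illustration under stated assumptions rather than Cape’s actual preprocessing: the functions `chunk_words` and `distant_labels`, the whitespace tokenisation and the tiny example article are all hypothetical.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def distant_labels(text, answer, chunk_size):
    """Pair each chunk with the word-index spans where the answer string occurs.

    Every occurrence is treated as a positive label, whether or not the
    surrounding words actually support the answer. That is the source of noise.
    """
    answer_words = answer.lower().split()
    n = len(answer_words)
    labelled = []
    for chunk in chunk_words(text, chunk_size):
        lowered = [w.lower().strip(".,;:!?") for w in chunk]
        spans = [(i, i + n - 1)
                 for i in range(len(lowered) - n + 1)
                 if lowered[i:i + n] == answer_words]
        labelled.append((" ".join(chunk), spans))
    return labelled


# Hypothetical article for the question "Which city is the capital of France?"
article = ("Paris is the capital of France. "
           "The film was shot on location in Paris last summer.")

labelled = distant_labels(article, "Paris", chunk_size=8)
for chunk_text, spans in labelled:
    print(spans, chunk_text)
```

Both chunks receive a positive label, but only the first contains enough context to justify the answer. The second is an example of the label noise that distant supervision introduces. Longer chunks make it more likely that a labelled answer appears alongside its supporting context.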
The results below represent experiments using TriviaQA Wiki, the subset of TriviaQA that is concerned with answering questions on Wikipedia articles. This subset contains 77582 questions and 48822 Wikipedia articles, with an average of 1.79 articles per question.
There were two main avenues open for experimentation that would make good use of the CS-Storm whilst staying within the design constraints:
- More powerful hardware (in particular larger GPU memory) allows for the size of text chunks to be increased: This should improve performance by allowing the model to learn longer-distance dependencies, and reduce the amount of the noise in the dataset.
- Take advantage of multiple GPUs to train in a data-parallel paradigm: Using the Horovod framework from Sergeev and Del Balso, 2018, we can distribute training on to many GPUs and nodes.  The benefit of this is twofold. Firstly, it increases the training speed, and secondly, by increasing the effective mini-batch size, the average update is less noisy, which is especially important in our distant supervision training regime.
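The effect of data-parallel training on the effective mini-batch size can be illustrated with a small simulation. The sketch below does not use Horovod. It averages the gradients of two simulated workers in a single process and checks that the result matches the gradient one worker would compute on the combined batch. The one-parameter linear model, the synthetic data and the function `mse_gradient` are assumptions made purely for illustration.

```python
import random


def mse_gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Synthetic data: y is roughly 3x with a little noise (hypothetical task).
random.seed(0)
data = []
for _ in range(128):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + random.gauss(0.0, 0.1)))

w = 0.5  # current parameter value, shared by all workers

# Two workers, each computing a gradient on its own mini-batch of 64 examples.
worker_batches = [data[:64], data[64:]]
worker_grads = [mse_gradient(w, batch) for batch in worker_batches]

# Averaging across workers is the result an allreduce would hand back to each worker.
averaged_grad = sum(worker_grads) / len(worker_grads)

# One worker computing the gradient on all 128 examples at once.
single_worker_grad = mse_gradient(w, data)

print(abs(averaged_grad - single_worker_grad) < 1e-9)
```

Because the workers’ batches are the same size, the average of their gradients equals the gradient of the combined batch, so each update reflects more examples and is less sensitive to individual noisy ones.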
Results – Convergence speed
The orange curve is for the control model, trained on short chunk lengths on a normal cloud server. The time taken to reach the optimal development accuracy was over 200 hours. The green curve shows an equivalent model trained using only one of the CS-Storm’s V100 GPUs. The time to peak performance is reduced to about 40 hours – a 5x speedup over the cloud server.
The blue curve shows the learning curve of the final model, trained using 16 GPUs distributed across two of the CS-Storm’s nodes. This system not only breaks the state of the art for this dataset, it also takes on average only 20 hours to train. It is also worth noting that the final model achieved an accuracy within 5% of its peak accuracy in less than 4 hours of training, whereas this milestone took over 42 hours on the K80.
Results – Accuracy
Two metrics used to assess model performance are Exact Match (EM) and F1 score. The EM score is the percentage of extracted answers that exactly match the gold reference answer, and the F1 metric is the mean word overlap F1 between the extracted answer and gold reference answer.
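As a rough illustration of the two metrics, the sketch below computes EM and word-overlap F1 for a single prediction. The normalisation steps (lowercasing, stripping punctuation and English articles) follow a convention that is common in question answering evaluation, but they are an assumption here. The official TriviaQA evaluation script may differ in its details, for example in how it handles multiple acceptable answers.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalised strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (prediction, gold answer) pairs.
pairs = [("The Eiffel Tower", "Eiffel Tower"),
         ("Paris, France", "Paris"),
         ("London", "Paris")]

em = 100 * sum(exact_match(p, g) for p, g in pairs) / len(pairs)
f1 = 100 * sum(f1_score(p, g) for p, g in pairs) / len(pairs)
print(f"EM = {em:.2f}%, F1 = {f1:.2f}%")
```

The second pair shows why F1 is the more forgiving metric: the prediction contains the gold answer plus an extra word, so it scores zero on EM but receives partial credit on F1.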
Using two of the CS-Storm nodes and 16 workers allowed for effective mini-batch sizes of 960, compared to only 30 on the cloud server. Not only did this enable very fast training, it also compensated for the noise in the dataset considerably. When using data parallelism, the research community recommends linearly scaling the learning rate with the number of workers, following the example of Goyal et al. 2017. However, the focus of their research was a supervised learning task, with very little noise in the dataset. Bloomsbury AI found that linearly scaling the learning rate for their task led to instability and poor performance. However, not modifying the learning rate did not lead to accelerated training. A good compromise was to scale the learning rate by a factor of the square root of the number of workers, which empirically provided excellent performance and training speed.
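The three learning rate options discussed above can be compared in a short sketch. The base learning rate of 0.001 is a hypothetical value, since the text does not state the one used. The per-worker batch size of 60 is inferred from the stated effective batch size of 960 across 16 workers.

```python
import math


def scaled_learning_rate(base_lr, num_workers, rule="sqrt"):
    """Scale a single-worker learning rate for data-parallel training.

    rule="none"   keeps the single-worker rate,
    rule="linear" multiplies by the number of workers (Goyal et al. 2017),
    rule="sqrt"   multiplies by the square root of the number of workers.
    """
    if rule == "none":
        return base_lr
    if rule == "linear":
        return base_lr * num_workers
    if rule == "sqrt":
        return base_lr * math.sqrt(num_workers)
    raise ValueError(f"unknown scaling rule: {rule}")


num_workers = 16          # two nodes with 8 GPUs each
per_worker_batch = 60     # inferred: 960 / 16
base_lr = 0.001           # hypothetical single-worker learning rate

effective_batch = per_worker_batch * num_workers
rates = {rule: scaled_learning_rate(base_lr, num_workers, rule)
         for rule in ("none", "sqrt", "linear")}

print(effective_batch)
for rule, lr in rates.items():
    print(f"{rule:>6}: {lr:.4f}")
```

With 16 workers, square root scaling multiplies the learning rate by 4 rather than 16, which sits between the unscaled rate that did not accelerate training and the linear rule that proved unstable on this noisy task.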
The team also experimented with very long document chunk sizes. Because of the extreme length of these chunks, training was considerably slower. These models looked very promising from an accuracy perspective, but had two major drawbacks: they took over 120 hours to train, even on the CS-Storm, and their performance degraded considerably when they were used in a production environment with shorter document chunk sizes.
It is apparent that hyperparameter tuning enabled by fast training has allowed significant performance improvement. Furthermore, training on longer document chunks looks promising, but requires further experimentation.  The effect of large effective batch sizes can also be seen here, with increases of 0.72% EM and 1.12% F1 against an equivalent model trained on a single worker.
The best model was submitted to the official TriviaQA model leaderboard, which can be viewed here. The test set answers are withheld from the public release, allowing for fair testing of model performance. Cape’s model received scores of 67.32% EM and 72.35% F1, setting a new state of the art in Wikipedia open-domain question answering. The test set also contains a verified subset, which has been curated by human annotators to be noise-free. On this subset the model scores 75.68% EM and 79.26% F1, indicating that it answers an impressive ~80% of TriviaQA’s questions correctly.
Since this project concluded, Bloomsbury AI has joined Facebook and has released Cape as open source to the community, including the best model from this project. The team hope that it will be of benefit to many people, empowering them to create great, intelligent products. To try Cape out, please check out the project here. Pull requests for functionality, demos, tutorials and documentation are encouraged! The team would like to thank Cray and Digital Catapult for their support via the Machine Intelligence Garage programme, and for the expertise and hardware resources without which this work would not have been possible.
: for further details on training regime, see Clark et al. 2017
: Each worker maintains a copy of the model being trained and computes parameter updates for a mini-batch. The parameter updates for all the workers are then averaged together using a highly efficient ring-allreduce method. All the workers’ models are then updated with the averaged parameter update. The overall effect is to increase the effective mini-batch size by a factor of the number of workers, i.e. 2 workers each computing updates with a mini-batch size of 64 training examples is equivalent to 1 worker with a mini-batch size of 128 training examples.
: Training runs were terminated after 120 hours. Learning curves for the Long chunk models indicated that training had not yet converged after this time, suggesting the long context model could have ended up performing best. The drawbacks of this model have already been mentioned however, and it could not be put into production, so further experiments with it were stopped.