Abhilash Majumder

Deep learning has emerged as one of the most widely applicable technical paradigms in computing. In financial products in particular, it has been instrumental in delivering more accurate predictions, better portfolio management, time series analysis, and market stability. However, deep learning models prototyped in research settings are not flexible or scalable enough to be used directly in the financial world. The reason is that the volume of financial data circulating through large indices (such as Dow Jones, MSCI, and S&P) and stock exchanges (BSE, NYSE, LSE) is enormous (terabytes per day), and these statistics are updated at an exceptionally high frequency (a few thousand updates per millisecond).

Moreover, financial data is highly restricted, so no corpus of confidential financial information is publicly available for training large-scale models. With the advent of the deep learning architecture called the Transformer, transfer learning on financial data has become instrumental. There has been a massive surge in applying transformer architectures (for example, BERT) to fine-tuning and inference tasks such as stock recommendations from social media feeds (e.g., Twitter), equity and asset management, dividend validation from stock exchange sites, and determining hedge fund trends.

The BERT-based variants of Transformers are a class of deep learning models that are massively scaled and have millions of trainable parameters. These large models require a considerable amount of space (gigabytes) and time for training and inference.

As shown in Figure 1-1, the BERT base and large variants have 110 million and 340 million parameters, respectively; BERT base is the most straightforward variation.

Figure 1-1: Parameter Size of BERT base and large models
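
The parameter figures quoted above can be reproduced with a rough back-of-the-envelope count. The sketch below is only an illustration: the vocabulary size (30,522), maximum sequence length (512), hidden sizes (768 and 1,024), feed-forward sizes (3,072 and 4,096), and layer counts (12 and 24) are assumed from the published BERT configurations rather than taken from this article, and the function name is hypothetical.

```python
def bert_parameter_count(hidden, layers, ffn, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate trainable parameters of a BERT encoder, including the pooler."""
    layer_norm = 2 * hidden  # one scale and one bias vector
    # Token, position, and segment embeddings, followed by a layer norm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus a layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), plus a layer norm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_parameter_count(hidden=768, layers=12, ffn=3072)
large = bert_parameter_count(hidden=1024, layers=24, ffn=4096)
print(f"BERT base:  {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to roughly 109.5 million parameters for BERT base and 335.1 million for BERT large, which is close to the commonly quoted 110 and 340 million.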

Training them on financial data that is updated incessantly, with millions of new entries per second, for inferential downstream tasks such as question answering, named-entity recognition, and text-to-text paraphrasing takes a tremendous amount of time. Financial industries employ different methods to tackle this problem.

Mini Batch Processing With Parameter Updates

Since the data is enormous, if a classic BERT base model is required to find relevant financial answers on a public stock exchange site (question-answering BERT), the first standard approach to scaling is to use large mini-batches. Because the model is large and contains many sophisticated internal neural components (called encoder stacks), mini-batch processing with proper parameter updates (such as linearly increasing the learning rate during training) can significantly speed up training. The "12-layer" label for BERT base uncased in Figure 1-1 refers to the 12 encoder components inside BERT.

Figure 1-2 shows a single encoder unit; BERT has 12 of them stacked on top of each other:

Figure 1-2: BERT base uncased Model (encoder representation) [Source: https://arxiv.org/abs/1706.03762 ]
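
One common reading of "increasing the learning rate linearly" is the linear scaling rule: when the mini-batch size is multiplied by some factor, the learning rate is multiplied by the same factor. The sketch below illustrates that reading only. The base learning rate (2e-5), the base batch size (32), the placeholder data, and both function names are assumptions made for the example, not values from this article.

```python
def iter_minibatches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def linearly_scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: multiply the learning rate by the batch-size ratio."""
    return base_lr * batch_size / base_batch_size


samples = list(range(1000))  # placeholder for tokenized financial records
for batch_size in (32, 256):
    steps = sum(1 for _ in iter_minibatches(samples, batch_size))
    lr = linearly_scaled_lr(base_lr=2e-5, base_batch_size=32, batch_size=batch_size)
    print(f"batch_size={batch_size}: {steps} steps per epoch, lr={lr:.1e}")
```

Larger mini-batches mean fewer parameter updates per pass over the data, which is why the learning rate is usually raised alongside the batch size.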

Reuse Pretrained Model Configurations 

Another way to scale such massive training on local hardware, such as Nvidia Tesla GPUs or CPUs, is to use the pre-trained weights of the models in inference mode rather than explicitly training the model on a new sample of data. This leads to a confidence gain in the model: since a large chunk of the data in the financial domain revolves around stock prices (decimal/floating-point values), running inference with trained weights reduces both cost and time.

Since these models are already pre-trained on large public benchmark corpora for general deep learning tasks, leveraging the pre-trained weights through a simple multi-layer perceptron (neural unit) is easier and faster.

This is shown in Figure 1-3.

Figure 1-3: Using Pretrained model configurations in Inference mode
[Source: https://github.com/NVIDIA/DeepLearningExamples]
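
The idea of reusing frozen pre-trained weights and training only a small layer on top can be sketched in a few lines. This is a toy illustration under stated assumptions, not the setup from the figure: `PRETRAINED_WEIGHTS` is a hypothetical fixed projection standing in for a real encoder such as BERT, the inputs and labels are made up, and the head is a single logistic unit rather than a full multi-layer perceptron, to keep the example short.

```python
import math

# Hypothetical "pretrained" weights: a fixed projection standing in for a real encoder.
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]


def frozen_encoder(features):
    """Inference-only forward pass; PRETRAINED_WEIGHTS is never updated."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(embeddings, labels, epochs=200, lr=0.5):
    """Train only a small logistic head on top of the frozen embeddings."""
    head, bias = [0.0] * len(embeddings[0]), 0.0
    for _ in range(epochs):
        for emb, label in zip(embeddings, labels):
            error = sigmoid(sum(h * e for h, e in zip(head, emb)) + bias) - label
            head = [h - lr * error * e for h, e in zip(head, emb)]
            bias -= lr * error
    return head, bias


def predict(features, head, bias):
    emb = frozen_encoder(features)
    return sigmoid(sum(h * e for h, e in zip(head, emb)) + bias)


# Toy inputs (made up for illustration): label 1 when the first feature dominates.
inputs = [[2.0, 0.0], [1.5, 0.2], [0.0, 2.0], [0.3, 1.6]]
labels = [1, 1, 0, 0]
embeddings = [frozen_encoder(x) for x in inputs]  # computed once, no encoder training
head, bias = train_head(embeddings, labels)
print([round(predict(x, head, bias), 3) for x in inputs])
```

Because the encoder output is computed once and never back-propagated through, the only trainable parameters are those of the head, which is where the cost and time savings come from.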

Quantization of Large Scale Model And Warmup Strategy

Since both natural language processing and computer vision are used in this sector, vision transformers are used alongside NLP models like BERT. Although the above methods help cut time when continually training on terabytes of data, they do not optimize space. Spinning up virtual machines or Kubernetes clusters (cloud compute containers) may also end up incurring a huge cost for the organization. To optimize space, a variation of quantized weights is used. This is because decimal (floating-point) weights stored in configuration files during training take up a lot of space as the number of trainable parameters increases non-linearly.
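
As a rough illustration of how quantized storage saves space, the sketch below applies symmetric 8-bit quantization with a single shared scale, which is one simple scheme among several and is an assumption here rather than the specific method used in production. The weight values and function names are made up for the example.

```python
from array import array


def quantize_int8(weights):
    """Symmetric quantization: map floats to int8 using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.731, -0.052, 0.004, -1.270, 0.388, 0.915]  # made-up example weights
as_float32 = array("f", weights)
as_int8, scale = quantize_int8(weights)
restored = dequantize(as_int8, scale)

print("float32 bytes:", len(as_float32.tobytes()))
print("int8 bytes:   ", len(as_int8.tobytes()))
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Storing one byte per weight instead of four cuts the footprint to a quarter, at the price of a rounding error of at most half the scale per weight.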

To reduce the memory footprint, a generic typecast to integer weights is used to save bits. At large scale, this helps avoid enormous costs. The warmup strategy refers to steadily increasing the learning rate as the sample sizes increase, bounded by a square-root function. This allows an aggressive training pattern on large corpora using the same samples but in fewer training steps.

Figure 1-4 shows the warmup strategy when applied to ImageNet (a computer vision benchmark dataset):

Figure 1-4: Warmup Strategy on ImageNet
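
The warmup strategy described above can be sketched as a schedule with two parts: a peak learning rate that grows with the square root of the batch-size ratio, and a linear ramp up to that peak over the first training steps. This is one plausible interpretation, not the exact schedule behind the figure. The base learning rate, batch sizes, warmup length, and function names are all assumptions chosen for illustration.

```python
import math


def sqrt_scaled_peak_lr(base_lr, base_batch_size, batch_size):
    """Scale the peak learning rate with the square root of the batch-size ratio."""
    return base_lr * math.sqrt(batch_size / base_batch_size)


def warmup_lr(step, peak_lr, warmup_steps):
    """Ramp linearly up to peak_lr during warmup, then hold at peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


peak = sqrt_scaled_peak_lr(base_lr=1e-3, base_batch_size=256, batch_size=4096)
for step in (0, 249, 499, 999, 5000):
    print(step, f"{warmup_lr(step, peak, warmup_steps=1000):.2e}")
```

Starting from a small learning rate avoids unstable early updates, while the square-root bound keeps the peak from growing as fast as the batch size itself.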

Distributed Learning in Cloud

While the previous approaches are used extensively during training, inference also takes a significant amount of time and space. Deep learning frameworks like PyTorch, TensorFlow, and MXNet are highly scalable because they employ computation graphs for gradient processing during training. With the advent of distributed computing, running inference on a large-scale transformer like BERT with a framework like PyTorch is best done on GPUs. The significant advantages of GPU-based distributed training are as follows:

1. Concurrent Model Execution: Multiple model variants, or multiple instances of the same model, can be scaled simultaneously. This allows multithreading inside the Kubernetes cluster, which forms the compute unit in a cloud service. Nvidia DGX optimizers, which are present in Nvidia GPUs, perform distributed optimization of gradients during training. This is useful because the mini-batch sampling discussed earlier is inherently concurrent inside the GPU kernels. The task of the distributed optimizer is to spread the update across the GPUs during the training phase, so that gradient updates from the different GPU kernels are applied and synchronized concurrently.

Figure 1-5 shows this:

Figure 1-5: Distributed optimizer in Nvidia GPUs

2. Inference Server: One of the significant implications of distributed computing is the ability to leverage server redundancy. Most cloud providers, such as Microsoft Azure, Google Cloud Platform, and Amazon Web Services, provide storage buckets that can store the pre-trained model configurations. Deep learning tools like the TensorRT server or TensorFlow Hub can be used as Infrastructure as a Service (IaaS) to run inference on these models. For example, running distributed inference through the PyTorch framework on the TensorRT server (Nvidia) provides a considerable speed boost.

3. CUDA Graphs and Multi-GPU Support: Since most cloud compute uses Nvidia GPUs, there is a significant bottleneck in processing speed. Since GPUs work in parallel, multiple GPUs may have parameter updates ready to be sent to the CPU simultaneously. However, the virtual CPU becomes overburdened with the data received, which produces a lag in the inference/training steps. Since the CPUs cannot adjust the rate at which data is sent to the GPUs, CUDA graphs play an essential role by automatically producing dependency graphs on the CPU. These graphs automatically adjust the data load sent to the GPUs and scale according to the depth of the neural network.

4. Sharding and Parallelism: In most cases, the space allocated for the models becomes insufficient to store the ever-increasing parameters coming from a financial system. Since the parameters are updated every few seconds, space constraints often become an important issue. To resolve this, model configurations are sharded and replicated across different cold storage locations. This provides redundancy by removing fragmentation errors in memory, and in the event of server downtime, replica shards can be used to retrieve the model configurations. Apart from all of these sophisticated hardware methods to boost speed, the most important one is optimized program logic and removing redundant gradient steps (computations) wherever possible.
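
The sharding and replication idea from the last item can be sketched as follows. This is a minimal in-memory simulation under stated assumptions: the stores are plain dictionaries standing in for cold storage locations, the parameter names and values are hypothetical, and placing each shard on two consecutive stores is just one simple replication policy.

```python
import zlib


def shard_id(name, num_shards):
    """Deterministically assign a parameter name to a shard."""
    return zlib.crc32(name.encode("utf-8")) % num_shards


def place_shards(params, num_stores, replicas=2):
    """Write each parameter to its primary store and to the next replicas - 1 stores."""
    stores = [dict() for _ in range(num_stores)]
    for name, value in params.items():
        primary = shard_id(name, num_stores)
        for offset in range(replicas):
            stores[(primary + offset) % num_stores][name] = value
    return stores


def recover(stores, down=()):
    """Rebuild the full parameter set from every store that is still reachable."""
    recovered = {}
    for index, store in enumerate(stores):
        if index not in down:
            recovered.update(store)
    return recovered


# Hypothetical parameter names and values, for illustration only.
params = {f"encoder.layer.{i}.weight": float(i) for i in range(12)}
stores = place_shards(params, num_stores=4)
print(recover(stores, down={2}) == params)
```

With two replicas per shard, losing any single store still leaves one copy of every parameter, which is the redundancy property described in the list above.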

A parallelized architecture can be visualized as shown in Figure 1-6:

Figure 1-6: Distributed Infrastructure for Deep Learning
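
The synchronized gradient updates performed by a distributed optimizer can be simulated without any GPU. The sketch below is a simplified stand-in, not Nvidia's implementation: each "worker" is an ordinary function call, the all-reduce is a plain average, and the one-parameter model and toy data are made up for illustration.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def distributed_step(weight, shards, lr):
    """One synchronous data-parallel update across all simulated workers."""
    grads = [local_gradient(weight, shard) for shard in shards]  # parallel on real hardware
    synced = all_reduce_mean(grads)
    return weight - lr * synced[0]


# Toy data following y = 3x, split evenly across four simulated GPUs.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

weight = 0.0
for _ in range(50):
    weight = distributed_step(weight, shards, lr=0.01)
print(round(weight, 4))
```

Because the shards are equally sized, the averaged gradient matches the gradient over the full batch, so every worker applies the same update and the model replicas stay in sync.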

All the steps mentioned above are essential for scaling large models with millions to billions of trainable parameters for any financial task, from downstream stock recommendation to question answering on stock prices and image segmentation of visual financial data. Although all these models work on the transfer learning principle, the compute and storage required in this sector increase sharply because of the model size and the number of trainable weights. This is because financial data is unique and has to be highly accurate. Accurate results and predictions can only be attained when deep learning models leverage distributed cloud computing infrastructure.

Abhilash Majumder

Abhilash Majumder is a Research Scientist at Morgan Stanley (MSCI) and a former research engineer for HSBC Holdings Plc. He is the author of the book "Deep Reinforcement Learning in Unity" (Springer) and a deep learning mentor for Udacity. He is a former intern at Unity Technologies and a current moderator in the field of AI. He is a maintainer and contributor for Google Research and has spoken at PyData and Unite events. He is a graduate of NIT Durgapur.

Posted in Guest Articles By Abhilash Majumder   Date May 18, 2021
