Training Deep Learning Models Efficiently on the Cloud
1. Storing data efficiently on the cloud
With Neural Concept Shape, you can use 3D numerical simulations as input to train your deep learning models. Taking aerodynamic simulations as an example, CFD simulation results are usually much larger files than images or text (a single result can reach several hundred GB). Storing a large number of them can therefore become an issue in the long term, as it requires regularly scaling up the hardware infrastructure.
The engineer might then struggle to stream through the files and run relevant analyses, which becomes a real limitation and bottleneck for the use of deep learning in engineering applications. At Neural Concept, we are aware of this issue and have addressed it by evaluating different solutions over time.
Our initial setup was an NFS (Network File System) solution. It is very convenient because it lets us access files the same way we would on the local storage of a machine, and it can be shared by several users in a team. However, this solution does not scale well as data accumulates and as the number of experiments run simultaneously by users increases. Moreover, it can quickly become expensive.
As an alternative, we chose to store the data in a secure cloud environment, using FUSE libraries, which lets us access data as easily as if it were on a local computer while benefiting from the powerful and flexible cloud architecture. FUSE, which stands for Filesystem in Userspace, is an interface for Unix-like operating systems that lets users create their own file systems. Using a FUSE library, it is possible to mount a cloud storage bucket onto the local filesystem, so that applications can access files in cloud storage as if they were on a local file system. Users of Neural Concept Shape can now train models directly from secure cloud storage with no impact on computation speed, as confirmed by our internal benchmarks.
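To illustrate what accessing cloud files "as if they were local" looks like in application code, here is a minimal Python sketch. It is not Neural Concept's implementation: it assumes a bucket has already been mounted by some FUSE tool at a hypothetical mount point such as /mnt/simulations, that results use a hypothetical .vtk suffix, and the function names are invented for this example. Only standard file APIs appear, which is the point of the approach.

```python
from pathlib import Path


def iter_simulation_files(mount_point, suffix=".vtk"):
    """Yield result files under a mounted directory, in sorted path order.

    `mount_point` is assumed to be where a FUSE tool mounted the bucket
    (hypothetical example: /mnt/simulations).
    """
    root = Path(mount_point)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not an accessible directory")
    yield from sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


def read_in_chunks(path, chunk_size=8 * 1024 * 1024):
    """Stream a large file chunk by chunk instead of loading it all at once."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk
```

Because the mount behaves like a directory, the same two functions work unchanged on a local folder, which makes it easy to test code locally before pointing it at the mounted bucket.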
2. Improving the training speed of deep learning models
Over the past years, the performance of GPUs has improved drastically, and they are now widely used in deep learning applications. They offer fast, efficient computation and large amounts of available memory. As a result, the most modern GPUs can tackle complex physics-based deep learning challenges and handle very refined geometries.
With Neural Concept Shape, we use 3D simulation data to train our models, and these files can be very heavy when the simulation is extremely detailed. Most people would tend to think that the main bottleneck with such data is the GPU itself, but this is not always the case. The main cause of slowdown (which can be critical for some applications) is sometimes the streaming of data to the GPU. For large files, and especially when the data is fetched over the network from cloud storage, this can slow down the training process drastically. It can then become a real limitation for the use of deep learning in engineering applications.
This is why we use the cache functionality of the TensorFlow tf.data API (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache). We cache the dataset to a local SSD during training, which allows very efficient retrieval of the data: because cached data is stored locally, it can be accessed quickly. The cache is populated during the initial pass over the dataset, and subsequent iterations go much faster.
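The caching step can be sketched with tf.data as follows. This is a minimal illustration under stated assumptions, not the pipeline used inside Neural Concept Shape: the loader simply decodes each file as raw float32 values, all samples are assumed to have the same length so they can be batched, and the function name, the cache file prefix "train_cache" and the batch size are hypothetical.

```python
import os
import tempfile

import tensorflow as tf


def make_cached_dataset(file_paths, cache_dir, batch_size=2):
    """Build a tf.data pipeline whose decoded samples are cached on local disk."""

    def load_sample(path):
        # Hypothetical loader: read the file and decode it as float32 values.
        raw = tf.io.read_file(path)
        return tf.io.decode_raw(raw, tf.float32)

    dataset = tf.data.Dataset.from_tensor_slices(file_paths)
    dataset = dataset.map(load_sample, num_parallel_calls=tf.data.AUTOTUNE)
    # Passing a filename makes cache() write to disk (e.g. a local SSD)
    # instead of memory; the first full pass fills it, later passes read it.
    dataset = dataset.cache(os.path.join(cache_dir, "train_cache"))
    # Shuffle after caching so each epoch sees a different order.
    dataset = dataset.shuffle(buffer_size=len(file_paths))
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

Placing cache() after the expensive read and decode steps, but before shuffle(), means the slow network fetch happens once while the sample order still changes between epochs. Note that the cache is only finalized once a full pass over the dataset has completed.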
In the graph, we see that after the initial 150 steps (after which the data has been cached to the local SSD), the training speed increases and remains steady when using the cache. This enables engineers using Neural Concept Shape to run experiments very efficiently, even when dealing with large datasets or very complex simulations.
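A speed-up of this kind can be checked by timing successive full passes over the input pipeline with no model attached. The helper below is a small sketch for that purpose; its name is hypothetical, and it accepts any re-iterable object, such as a tf.data.Dataset or a plain list.

```python
import time


def time_passes(dataset, num_passes=3):
    """Return the wall-clock duration, in seconds, of each full pass."""
    durations = []
    for _ in range(num_passes):
        start = time.perf_counter()
        for _batch in dataset:
            pass  # a real training step would run here
        durations.append(time.perf_counter() - start)
    return durations
```

With a disk cache in place, one would expect the first duration to be dominated by fetching the data and the later ones to be shorter, though the actual ratio depends on file sizes, network bandwidth and disk speed.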
© Neural Concept: Saswata Chakravarty, Lead Software Engineer.