One of the most common challenges anyone working on a machine learning or data science project faces is the computing time and memory required to process a vast amount of data: huge datasets demand correspondingly large computing resources.
We worked on a project related to fish tracking. A fish tag, an electronic device implanted into a fish, emits pulses of sound waves. These signals are recorded as pings by a receiver, and the pings are analyzed to identify which tag generated them. The uniqueness of a tag lies in the time gaps between its transmitted pulses. These pings form acoustic time series data. To identify the tags with a machine learning approach, we generate images from the pings and perform image segmentation using a U-Net neural network. Let us go through the time and memory challenges we faced over the course of the project and how we resolved them.
Challenges with training time
The dataset we worked on is huge, containing nearly 21.5 million pings. We took a subset of the data to generate images and train a neural network. Training the network with nearly 85,000 images on a CPU took about an hour per epoch. When the same images were used to train on a GPU machine, training was nearly ten times faster. Although a CPU can train a deep learning model, a GPU accelerates training considerably, making it the better choice for training deep learning models efficiently on very large datasets.
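The article does not name the training framework, so as an illustration only, here is a minimal sketch of how this device choice is typically made in PyTorch (an assumption on our part): select the GPU when one is available and fall back to the CPU otherwise.

```python
# Sketch assuming PyTorch; the framework choice is illustrative, not from the article.
try:
    import torch
    # Prefer the GPU when CUDA is available; otherwise train on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
except ImportError:
    device = "cpu"  # PyTorch not installed; training would run on the CPU

print(device)
```

The model and each batch of data are then moved to `device` before training, so the same script runs on either kind of machine.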
Challenges with memory
Besides training time, another issue one can experience when working with huge datasets is memory. Loading the data frame containing 21.5 million records and generating the images occupied a lot of memory; even 500 GB of RAM was not sufficient. To overcome this, the data frame can be loaded into memory in chunks, with each chunk processed to create images, which uses comparatively little memory. Pandas provides a 'chunksize' parameter for this when reading the saved data frame. For example:
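A minimal sketch of chunked reading with pandas follows; the column names and chunk size are illustrative stand-ins for the project's ping data, and a small in-memory CSV stands in for the file on disk.

```python
import io

import pandas as pd

# Stand-in for the saved ping data frame; in the project this would be
# the on-disk file holding the 21.5 million pings.
csv_data = io.StringIO(
    "tag_id,timestamp\n" + "\n".join(f"{i % 3},{i}" for i in range(10))
)

total_rows = 0
# Passing chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is resident in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    # Process the chunk here (e.g. generate images), then let it be freed.
    total_rows += len(chunk)

print(total_rows)  # → 10
```

Each chunk is an ordinary DataFrame, so the existing image-generation code can run on it unchanged.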
Similarly, another memory issue that may occur when working with an image dataset is keeping the generated images in memory. Storing the generated images in a database table and clearing them from RAM solves this. PyTables can be used to store the images in an HDF5 database file. It manages hierarchical datasets and is designed to work efficiently and easily with extremely large amounts of data, optimizing memory and disk resources so that data takes up far less space. Its flush feature helps free valuable memory so it can be used for other things.
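As a sketch of this pattern (the file name, array name, and image shape are assumptions, not the project's actual values), an extendable PyTables array lets batches of generated images be appended to disk and flushed out of RAM as they are produced:

```python
import os
import tempfile

import numpy as np
import tables

# Illustrative output path; the project would use its own HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "images.h5")

with tables.open_file(path, mode="w") as h5:
    # An EArray is extendable along its first axis, so images can be
    # appended batch by batch instead of being held in RAM all at once.
    images = h5.create_earray(
        h5.root, "images", atom=tables.Float32Atom(), shape=(0, 64, 64)
    )
    for _ in range(3):  # stand-in for the image-generation loop
        batch = np.zeros((10, 64, 64), dtype=np.float32)  # hypothetical batch
        images.append(batch)
        h5.flush()  # push buffers to disk, freeing that memory for reuse
```

After the loop, the HDF5 file holds all the images while RAM only ever held one batch at a time.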
Another important point to keep in mind concerns how the model is trained. Instead of loading all the images into memory at once, subsets of images can be loaded from the saved HDF5 file using data generators. This also helps avoid memory-related issues.
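A simple data generator for this purpose can be sketched as follows; `image_batches` is a hypothetical helper, and a NumPy array stands in for the PyTables array that would be opened from the saved HDF5 file.

```python
import numpy as np

def image_batches(h5_array, batch_size):
    """Yield successive batches so only batch_size images sit in RAM at a time."""
    for start in range(0, len(h5_array), batch_size):
        yield np.asarray(h5_array[start:start + batch_size])

# Works on any sliceable array, including a PyTables EArray opened from
# the saved HDF5 file; a small NumPy array stands in for it here.
data = np.arange(25).reshape(25, 1)
sizes = [len(batch) for batch in image_batches(data, 10)]
print(sizes)  # → [10, 10, 5]
```

Training frameworks can consume such a generator directly, pulling one batch at a time from disk rather than the full dataset from memory.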
Overall, training and testing the model on a GPU saved a lot of time in identifying the fish tags, while PyTables and data generators resolved the memory issues caused by storing images in memory.
Reference: PyTables Documentation. https://www.pytables.org/usersguide/index.html