Part 1: Building your own dataset for machine learning

The following blog is Part 1 of a 3 Part Blog Series! Find Part 2 here: 

Watch Amit’s Discovery Session on how to build an image based dataset: 

We live in the information age, a period in human history characterized by a shift from an economy based on industrial production to one based on information. Humans and other animals make decisions based on the information available in their environment and their memories. But in the information age, we face the conundrum of information overload. For individuals and organizations alike, this barrage of information can be paralyzing. Our brains cannot make sense of, or find patterns in, very large amounts of information. This is why we use mathematical and statistical methods to process the abundant data available in our physical and digital environments and make more informed decisions. 

Data is at the heart of the information age. A dataset is a structured collection of data, most commonly in tabular (row and column) format, where rows correspond to observations or datapoints and columns correspond to features or variables. Image datasets can also have a tabular format, with image file paths or URLs in one column and image metadata in other columns. Alternatively, they can take the form of a file structure in which each directory corresponds to a particular image class. Although there are many public datasets and data repositories, they may not suit our particular question or problem. It may be that the data we need is available but scattered across several datasets. In that case, we can collect the individual datasets, merge them and keep the data we need. In other cases, we may want to build our own dataset from scratch. 
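To make the two image dataset layouts concrete, here is a minimal sketch that converts a directory-per-class file structure into a tabular index. The function names, the column names (path, label) and the list of image extensions are assumptions chosen for illustration, not part of any particular project.

```python
import csv
from pathlib import Path

# Assumed set of image file types; adjust for your own data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def index_image_folder(root):
    """Return one row per image found under root/<class_name>/<file>."""
    rows = []
    # Each sub-directory name is treated as the class label.
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                rows.append({"path": str(image_path), "label": class_dir.name})
    return rows


def write_index(rows, csv_path):
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling index_image_folder on a folder that contains, say, coral/ and sand/ sub-directories yields one row per image with the directory name as its label, and write_index saves that table as a CSV file. Further metadata columns such as coordinates could be added to each row in the same way.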

During my master’s internship at DeepSense, I had the privilege to work with the BEcoME (Benthic Ecosystem Mapping and Engagement) project. One of the goals of the project was to train neural network models for different supervised and unsupervised tasks. To train models capable of learning the unique and varied features of seafloor images from around the world, we needed an image dataset with representative samples from various parts of the seafloor. My role was to investigate several benthic image repositories and write scrapers to collect image data and associated metadata such as the coordinates, image label (if available), elevation, temperature, salinity etc. I will share some of the lessons learnt from that endeavor and some things to consider while making your own dataset: 

What is the purpose of the dataset? 

The first thing to consider is what we are trying to achieve by creating the dataset. We must consider how the dataset will be used and who the target audience is. It is also important to consider whether the dataset will be made public or used privately within an organization. The answers to these questions determine what kind of data we should collect and how to format and store it. 

How to collect the data? 

Traditionally, datasets were generated from surveys or by taking measurements in the field. Today we have another huge repository of unstructured data, namely the World Wide Web. Web scraping is the process of collecting data from websites. There are a number of open-source libraries for making web requests, parsing the response, and retrieving relevant data. In particular, the Python programming language has many helpful libraries for both web scraping and data science. Many websites and data repositories also provide convenient APIs. 
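As a rough illustration of the request, parse and retrieve steps, the sketch below uses only the Python standard library in place of the third-party scraping libraries. The class and function names, the User-Agent string and the example.org addresses are hypothetical; a real scraper would also need to respect each site's terms of use and handle network errors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page address.
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Parse an HTML string and return the image URLs it references."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


def fetch_html(url, user_agent="dataset-builder-example"):
    """Make the web request and return the response body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical page content; swap in fetch_html(url) for a live page.
    page = '<img src="/img/a.jpg"><img src="https://cdn.example.org/b.png">'
    for image_url in extract_image_urls(page, "https://example.org/gallery/"):
        print(image_url)
```

Keeping the parsing step separate from the request step means the parser can be checked against saved HTML without touching the network. When a repository offers an API, calling it is usually simpler and more reliable than parsing HTML.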

Identifying data sources 

After determining the goal and the collection methods, it is necessary to identify the particular websites or repositories from which to retrieve the data. It is advisable to have several alternatives, since the owners of some sources may not be willing to share their data. 

Check back next week for Part 2 of 3!
