Part 3: Data Volume, Cleaning and Documentation

This is the final part of a three-part series on how to build a dataset for machine learning. Find the previous parts below:

Part 1: https://deepsense.ca/2022/02/09/blog-building-your-own-dataset-for-machine-learning/

Part 2: https://deepsense.ca/2022/02/17/part-2-acquiring-external-data/

Watch Amit’s Discovery Session on how to build an image-based dataset: https://www.youtube.com/watch?v=3TT4hySJVvA

~~~~~~~

How much data is enough?

Web scraping is actually quite fun, and our data scientist instincts tell us that the more data we have, the better. So there is a natural tendency to keep searching for and scraping more and more data. However, we must also weigh the marginal returns of collecting more data against the cost in time and effort. This is where we might return to the original purpose of the dataset and determine how large a dataset is enough. In our case, we had to ask questions such as: do we have enough labelled images? Do we have representative samples from various sites? How many missing values are there in each of the mandatory and optional columns? What is the spatial distribution of the images? The answers to these questions determine when we should stop collecting data and move on to the next steps. (Stay tuned for future blogs on this topic)
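To make these questions concrete, here is a minimal sketch of a coverage summary in plain Python. It is not the code we used; the field names ("site", "label", "latitude", "longitude"), the helper summarize_coverage and the sample records are all made up for illustration.

```python
from collections import Counter


def summarize_coverage(records, columns):
    """Return simple counts that help judge whether a dataset is large enough."""
    # Count rows where each column is absent, None, or an empty string.
    missing = {
        col: sum(1 for r in records if r.get(col) in (None, ""))
        for col in columns
    }
    # A record counts as labelled when its hypothetical "label" field is filled in.
    labelled = sum(1 for r in records if r.get("label") not in (None, ""))
    # Number of records contributed by each source site.
    per_site = Counter(
        r["site"] for r in records if r.get("site") not in (None, "")
    )
    return {
        "total": len(records),
        "labelled": labelled,
        "per_site": dict(per_site),
        "missing": missing,
    }


# Made-up records, only to show the shape of the input.
records = [
    {"site": "site_a", "label": "class_1", "latitude": 44.6, "longitude": -63.6},
    {"site": "site_a", "label": None, "latitude": 44.7, "longitude": -63.5},
    {"site": "site_b", "label": "class_2", "latitude": None, "longitude": None},
]

summary = summarize_coverage(records, ["site", "label", "latitude", "longitude"])
print(summary)
```

Rerunning a summary like this after each round of scraping shows whether the last batch actually improved the counts we care about, which is one way to notice diminishing returns.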

Data cleaning

After the data collection phase, we must clean and format the data. If we are combining data from various sources, we must ensure that the same column uses the same format across all of them. For example, two different sources may express length measurements in different units, such as meters and kilometers; in this phase, we should convert them to a common unit. We should also check for duplicate entries. Even if the scraping code is flawless, websites sometimes hold multiple copies of one item. The aim of this phase is to improve the overall data quality and prepare the data for final use. (Stay tuned for future blogs on this topic)
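The two cleaning steps mentioned above, unit conversion and duplicate removal, could look roughly like this. This is a sketch under assumptions: the column names ("item_id", "length", "length_unit"), the unit codes and both helper functions are hypothetical, and a real dataset will likely need a more careful definition of what counts as a duplicate.

```python
# Conversion factors to meters; the unit codes are assumptions for this sketch.
TO_METERS = {"m": 1.0, "km": 1000.0}


def normalize_length(row):
    """Return a copy of the row with its length expressed in meters."""
    factor = TO_METERS[row["length_unit"]]  # raises KeyError on an unknown unit
    cleaned = dict(row)
    cleaned["length"] = row["length"] * factor
    cleaned["length_unit"] = "m"
    return cleaned


def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the same item appears twice, reported in different units.
rows = [
    {"item_id": "a1", "length": 1.5, "length_unit": "km"},
    {"item_id": "a2", "length": 250.0, "length_unit": "m"},
    {"item_id": "a1", "length": 1500.0, "length_unit": "m"},
]

cleaned = drop_duplicates([normalize_length(r) for r in rows], ["item_id"])
print(cleaned)
```

Converting units before deduplicating matters here: once both copies of an item are in meters, it is easier to confirm that they really describe the same thing.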

Documentation

One of the most important but overlooked aspects of building a dataset is documentation. Whether the dataset is used internally or publicly, good documentation is always helpful. A well-documented dataset (this also applies to code) can save the end user a lot of time. A basic dataset description includes an explanation of what data it contains, the purpose of the dataset, its target audience, column descriptions, the names of the authors/contributors, and the date it was last updated. Some organizations may also choose to publish their dataset and the methods of compiling it in a journal article. (Stay tuned for future blogs on this topic)
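One lightweight way to keep such a description next to the data is a small machine-readable file. The sketch below is only one possible layout, not a standard: the field names, the helper missing_fields and every value in the example card are placeholders.

```python
import json
from datetime import date

# The items a basic dataset description should cover; the key names are assumptions.
REQUIRED_FIELDS = [
    "description",
    "purpose",
    "target_audience",
    "columns",
    "authors",
    "last_updated",
]


def missing_fields(card):
    """List required fields that are absent or empty in the dataset description."""
    return [field for field in REQUIRED_FIELDS if not card.get(field)]


# Placeholder content, to be replaced with the real details of the dataset.
card = {
    "description": "Placeholder: what data the dataset contains.",
    "purpose": "Placeholder: why the dataset was built.",
    "target_audience": "Placeholder: who is expected to use it.",
    "columns": {
        "image_path": "Placeholder: relative path to the image file.",
        "label": "Placeholder: class assigned to the image.",
    },
    "authors": ["Placeholder Author"],
    "last_updated": date.today().isoformat(),
}

print(missing_fields(card))  # an empty list means every required field is filled in
card_json = json.dumps(card, indent=2)  # text that could be saved alongside the data
print(card_json)
```

A check like missing_fields can run whenever the dataset is updated, so the description does not quietly fall out of date.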

Remember, this is an iterative process

Finally, it is important to remind ourselves that building a dataset is an iterative process. We may collect some data, analyze it and determine where to focus further efforts. We may find that a certain source places restrictions on its data (e.g., some sites allow you to view but not download data/images, while others may allow downloads but not redistribution). We may have a particular vision in the initial data collection plan, but as we move through the process, we have to remain flexible and adapt to new information.

Hopefully, this has helped you, dear reader, to make more informed decisions on data collection in the age of information without succumbing to information overload.