We are often asked how much data one needs when using artificial intelligence/machine learning techniques. Unfortunately, there is no simple answer. It depends heavily on the type of data you have and the methods you are using.
Is it possible to have too little data?
Yes, but there can be some easy fixes. The first is to collect more data. Another is to use other sources of data: there are lots of freely shared datasets on the internet, covering all different types of data. For example, the NASA Open Data Portal has thousands of datasets available for public use. Be sure to check out our future blog “What does my data need to look like to do AI/ML?” to learn more about types of data.
Is it possible to have too much data?
Yes, but what a wonderful predicament to be in. You don’t need to use all of your data to train a machine learning model. In fact, using too much data will make training the model very slow, often for little gain in accuracy. Large volume alone does not usually cause a model to overfit, but data that is noisy or skewed towards one part of the problem can. Overfitting means the model learns not only the detail of the data, but also the noise, which can negatively impact its ability to classify data it hasn’t seen before. There are different strategies for using subsets of your data that you can take advantage of. Whatever subset you use, make sure it is representative of your data as a whole.
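One common way to keep a subset representative is stratified sampling: take the same fraction from each category, so that rare categories are not lost. The following is a minimal sketch using only the Python standard library. The function name `stratified_sample`, the `label_of` parameter, and the toy "cat"/"dog" rows are all made up for illustration and are not from any particular library or dataset.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Return roughly `fraction` of rows, sampled separately within each label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    sample = []
    for members in groups.values():
        # Keep at least one row per label so no category disappears entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy data: 900 "cat" rows and 100 "dog" rows.
rows = [("cat", i) for i in range(900)] + [("dog", i) for i in range(100)]
subset = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.1)

counts = {label: sum(1 for r in subset if r[0] == label) for label in ("cat", "dog")}
print(counts)  # {'cat': 90, 'dog': 10}
```

The subset keeps the original 9:1 balance between the two categories. A plain random sample would get close to that on average, but could under-represent the smaller category by chance.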
A better question to ask is how much clean data do you have?
Data cleaning is a necessary step in any problem. Most people could tell that a grainy image or an audio clip with static is not clean. Not everyone would realise that unclean data means more than just data with noise. It can also mean misspelled text, data in the wrong field of a spreadsheet, or mistakes in transcribing. If you’re collecting data from different sites, they may not all have the same data fields, or some fields may be missing. While you generally want as much data as you can gather, quality needs to be stressed as much as quantity. Some data may need to be left out as a result of the cleaning, which is yet another reason to make sure you have enough.
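To make the point about losing data during cleaning concrete, here is a small sketch in plain Python. It tidies up inconsistent field names and stray whitespace, then drops any record that is missing a required field. The function `clean_records` and the sample records are hypothetical, and real cleaning usually involves many more checks than this.

```python
def clean_records(records, required_fields):
    """Normalise field names and values, and drop records with missing fields."""
    cleaned = []
    for record in records:
        tidy = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip()  # remove stray whitespace
            tidy[key.strip().lower()] = value  # "Name", "NAME" -> "name"
        # Treat an absent field, None, or an empty string as missing.
        if all(tidy.get(field) not in (None, "") for field in required_fields):
            cleaned.append(tidy)
    return cleaned

# Hypothetical records gathered from sources with inconsistent fields.
raw = [
    {"Name": " Ada ", "Age": "36"},
    {"name": "Grace", "age": ""},     # age left blank
    {"name": "Alan"},                 # age field absent
    {"NAME": "Edsger", "AGE": "72"},
]
cleaned = clean_records(raw, required_fields=["name", "age"])
print(f"kept {len(cleaned)} of {len(raw)} records")  # kept 2 of 4 records
```

Even in this tiny example, half of the records do not survive cleaning, which is why it helps to start with more data than you think you need.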
The short answer to the question is that you likely need thousands of entries: definitely not fewer than hundreds, and ideally on the order of hundreds of thousands. Of course, it will depend on the type of data you have. For example, if your data is in a spreadsheet, it may be easy to get hundreds of thousands of entries, if not millions. On the other hand, if you are looking to use audio or video, you may need fewer entries, as each one encodes a lot of data.
The harder the problem, or the more complex the data, the more you’ll need.