What does my data need to look like to do AI/ML?

Do you have data, but are unsure if it is suitable for AI/ML? There are many different types of data, but they generally boil down to a few categories. Typically, your data will be:

numerical
categorical
time series
text

Numerical

Numerical data can also be called quantitative. It can be a continuous variable, like temperature or wind speed. It could also be restricted to discrete values, such as the number of students in a class, or the number of units sold. Images are another type of numerical data. A grayscale image can be represented as a matrix of numerical values, one per pixel. Colour images are represented using 3 values for each pixel, corresponding to the levels of red, green and blue. You can think of a colour image as a 3-dimensional matrix (tensor), with each value being numerical.
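
As a rough illustration of the image representation described above, here is a minimal Python sketch using plain nested lists. The pixel values are made up, and the shape helper is a hypothetical function written for this example; in practice an array library would typically be used instead.

```python
# A tiny 2x3 grayscale image: one brightness value (0-255) per pixel.
grayscale = [
    [0, 128, 255],
    [64, 192, 32],
]

# A colour image of the same size: each pixel holds [red, green, blue].
colour = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[255, 255, 0], [0, 0, 0], [255, 255, 255]],
]


def shape(nested):
    """Return the dimensions of a regularly nested list (hypothetical helper)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


gray_shape = shape(grayscale)   # (rows, columns)
colour_shape = shape(colour)    # (rows, columns, channels)
```

The grayscale image has two dimensions (rows and columns), while the colour image has a third dimension for the red, green and blue channels.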

Categorical

On the other hand, you may have categorical, or qualitative, data. These tend to be a classification of an object, such as colour or species. You may even create categories from numerical data by binning. Age is a numerical value, but often in surveys you’ll find age groups (bins) such as 18-25, or 65+. These categories are used when the actual value isn’t as important as overall trends.
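
Binning can be sketched in a few lines of Python. The bin edges and labels below are assumptions chosen to loosely match the survey example (only 18-25 and 65+ come from the text above), and age_group is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed bin edges: each number is the lowest age in the next group.
BIN_EDGES = [18, 26, 45, 65]
BIN_LABELS = ["under 18", "18-25", "26-44", "45-64", "65+"]


def age_group(age):
    """Map a numerical age to its categorical age group."""
    # bisect_right counts how many edges are <= age, which is the label index.
    return BIN_LABELS[bisect_right(BIN_EDGES, age)]


ages = [17, 18, 25, 26, 64, 65, 90]
groups = [age_group(a) for a in ages]
```

After binning, each exact age is replaced by a group label, which is often enough for spotting overall trends.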

Time series

Another important type of data is time series. Generally, these are data values with an associated timestamp. For example, sensors could record temperature or wind speed every minute (or hour, or day). These are most often used when examining historical data in an attempt to predict future events. Audio and video recordings are another type of time series, since they record sound or images across time. For example, whales are often studied using acoustics.
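
To make the idea of timestamped values concrete, here is a small Python sketch that stores sensor readings as (timestamp, value) pairs and averages them by hour. The readings, the 20-minute interval and the hourly_means helper are all invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Made-up temperature readings (degrees C), one every 20 minutes.
start = datetime(2024, 1, 1, 0, 0)
values = [4.0, 4.2, 4.4, 5.0, 5.2, 5.4]
readings = [(start + timedelta(minutes=20 * i), v) for i, v in enumerate(values)]


def hourly_means(series):
    """Group (timestamp, value) pairs by hour and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in series:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


hourly = hourly_means(readings)
```

Summarising to a coarser interval like this is a common first step before looking for patterns in historical data.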

Text

The other category of data is text. This can mean everything from webpages, to tweets, to books. In the ocean sector, text data can include shipping manifests, vessel logs, reporting required by companies or regulators, and emails or social media posts.

These categories aren’t mutually exclusive. Indeed, audio recordings are stored numerically, across time, so they are both time series and quantitative.

Another consideration is whether you have structured or unstructured data. Structured data, as the name suggests, has clearly defined data types housed in a way that makes them easily searchable. It may be stored in a relational database (using SQL, for example), or a spreadsheet. Unstructured data typically represents everything else. This type of data is not easily searchable, like audio/video or social media posts. Both of these types of data can be used for ML. Structured data may be easier to use; however, most data is unstructured.
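
The difference in searchability can be sketched with Python's built-in sqlite3 module. The table, column names, values and log entry below are all made up for this example.

```python
import sqlite3

# Structured: typed columns in a relational table (all values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temperature_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 4.1), ("B", 7.3), ("A", 5.0)],
)
# A precise question maps directly onto a query.
warm = conn.execute(
    "SELECT station, temperature_c FROM readings "
    "WHERE temperature_c > 4.5 ORDER BY temperature_c"
).fetchall()
conn.close()

# Unstructured: free text has no columns to query, so even a simple
# question needs ad hoc text processing.
log_entry = "Departed at dawn. Water temp felt warmer than usual near station B."
mentions_station_b = "station b" in log_entry.lower()
```

The structured query returns exact rows, while the text search can only say that a phrase appears somewhere in the entry.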

Assuming your data falls into (at least) one of these categories, you can probably use it for AI/ML. Which model you employ will depend on the type of data you have and the prediction you want to make. Image classification often uses convolutional neural nets, while time series models often use recurrent neural nets. The choice also depends on the type of question you want to answer. Sometimes standard statistical methods are all that is required.

Keep an eye out for our upcoming blog post, “Can I answer all of our problems with AI/ML?”