I recently completed an MSc in Computer Science at Dalhousie University. For my thesis, I analyzed sets of AIS* data I had not used before starting my Master’s degree. I want to share some lessons I learned about how to start working with unfamiliar data.
*AIS: The automatic identification system (AIS) is an automatic tracking system for ships. It uses transponders and satellites to report each vessel’s identification, position, course, and speed, and many vessels use it to avoid collisions at sea.
There are dozens of different data formats that data scientists analyze, including images, sound, and structured CSV data. While a data scientist may be familiar with processes to clean, manipulate, and analyze data, the source and quality of the data itself may require a lot of learning.
When starting a new project, you may not be familiar with the data that needs to be analyzed, and it can be intimidating to work with a type of data you have no experience with. Here are some ways to overcome this problem:
What is the first step?
The first thing to do when you have a new dataset is to take the time to get to know the data. That means exploring every feature of the data to understand what it represents. Looking at the distribution of each feature, whether it is numerical or categorical, and understanding how its values change are good ways to do this. While learning about the data, you also need to evaluate its quality. If you are not an expert in this particular data, you may not be sure whether a pattern or outlier is normal or a sign of a data quality issue. You may have to reach out to a subject matter expert or look online to understand these nuances better.
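As a rough illustration of this first look, here is a minimal sketch using only the Python standard library. The records, the field names (`speed_knots`, `vessel_type`), and the values are all made up for the example, and the 1.5 × IQR rule is just one common heuristic for flagging values worth a closer look. It cannot tell you whether a flagged value is a real event or a data quality issue; that still takes domain knowledge.

```python
import statistics
from collections import Counter

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"speed_knots": 9.8, "vessel_type": "cargo"},
    {"speed_knots": 10.1, "vessel_type": "tanker"},
    {"speed_knots": 10.2, "vessel_type": "cargo"},
    {"speed_knots": 10.5, "vessel_type": "fishing"},
    {"speed_knots": 10.7, "vessel_type": "cargo"},
    {"speed_knots": 10.9, "vessel_type": "tanker"},
    {"speed_knots": 11.0, "vessel_type": "fishing"},
    {"speed_knots": 102.3, "vessel_type": "cargo"},  # suspiciously high
]


def summarize_numeric(values):
    """Return basic distribution statistics for a numerical feature."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def flag_outliers(values, k=1.5):
    """Return values outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


speeds = [r["speed_knots"] for r in records]
type_counts = Counter(r["vessel_type"] for r in records)

print(summarize_numeric(speeds))   # distribution of a numerical feature
print(type_counts.most_common())   # distribution of a categorical feature
print(flag_outliers(speeds))       # values to check with a subject matter expert
```

On a real dataset you would run the same kind of summary over every feature, then take the flagged values to someone who knows the domain.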
Where to get the information?
Once you know your data, you can start to explore the techniques that have been applied to the specific data type you are working with. For example, text data requires specific preprocessing steps to extract a vector representation of your choice for each sample, while vision data can often be used without extracting a separate representation. Take some time to explore what others have done by seeking out previous work on this data type in papers on Google Scholar, GitHub repositories, Stack Overflow topics, etc.
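To make the text example concrete, here is a minimal sketch of one of the simplest vector representations, a bag-of-words count vector, written with only the Python standard library. The helper names and the two sample sentences are hypothetical, and a real project would more likely use an established library and a richer representation. The point is only that text needs an explicit step like this before most algorithms can use it.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(text, vocabulary):
    """Return a term-count vector for one document; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Made-up sample documents.
documents = ["The ship left the port.", "The port was busy."]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]

print(list(vocabulary))  # column order of the vectors
print(vectors)           # one count vector per document
```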
Each source can be used in different ways. When you want to start coding, sources like GitHub and Stack Overflow can provide more practical examples. On the other hand, recent research papers can suggest techniques you may want to use in your project.
How to be successful when you are not experienced?
Experience helps when you want to implement an idea on a data type you have worked with before. However, a lack of experience can also be an advantage: because your mind is not yet trained in the traditional ways of working with the data type, you are more likely to bring innovative and unique ideas that help you perform well on the project. For example, when I started working with AIS data, all of the previous works treated it as point-based data. I took a new approach to the problems related to this data type by converting the data to images.
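To give a feel for what turning point-based data into an image can mean, here is a minimal sketch that counts position reports per grid cell, which produces a small density "image". This is only an assumed illustration of the general idea, not the method used in my thesis. The function name, the bounding box, the grid size, and the sample coordinates are all hypothetical.

```python
def rasterize(points, lat_range, lon_range, height, width):
    """Count how many (lat, lon) points fall into each cell of a height x width grid."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = [[0] * width for _ in range(height)]
    for lat, lon in points:
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue  # skip points outside the bounding box
        # Row 0 is the northern edge, as in image coordinates.
        row = min(int((lat_max - lat) / (lat_max - lat_min) * height), height - 1)
        col = min(int((lon - lon_min) / (lon_max - lon_min) * width), width - 1)
        grid[row][col] += 1
    return grid


# Made-up position reports as (latitude, longitude) pairs.
points = [
    (44.65, -63.57),
    (44.65, -63.57),
    (44.90, -63.10),
    (50.00, -63.00),  # outside the bounding box, so it is ignored
]
image = rasterize(points, lat_range=(44.0, 45.0), lon_range=(-64.0, -63.0),
                  height=4, width=4)

for row in image:
    print(row)
```

Once the data is in this form, techniques developed for images become candidates for the problem.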
In addition, many algorithms that work on a specific data type will work on another one with minor changes. However, because many researchers prefer to stay in their comfort zone and only work with the data types they have experience with, the efficacy of these algorithms on other data types often goes untested. As a result, you can apply your knowledge and expertise to many other data types as well.
Overall, the challenges you face when working with new data can help you learn and apply your knowledge in a new field. Try to use your experience and bring new techniques to solve the problems related to that new data type, and don’t forget to enjoy the challenge.
Interested in finding out more about AIS data? Take a look at our latest DeepSense Discovery Session with Casey Hilliard from Meridian.