A crucial step in data management is the creation of a data dictionary. In general, a data dictionary is a central location where information about data is stored. A data dictionary can be defined quite formally, but implementing one doesn’t need to be formal. Really, you just want a place to store information for future users of the data.
Why do we need a data dictionary?
If you are collecting data, then you will be intimately familiar with the details of the data. It is vital to record these details not only for yourself but for other users of the data. You may end up re-running analyses over time, or have to go back to old data to run new analyses. You may move to a new job in the same company, or move to another company and leave your data for your replacement. We often get data shared with us from outside companies. Every time someone new looks at your data, they will have questions. To save yourself lots of headaches, it is best to write down the information as you are collecting the data.
What sort of information should a data dictionary contain?
Metadata is important to understanding data. There are many types of metadata required for a dataset. No matter what type of data you are collecting, it is pretty standard to record the what, when, and where pertaining to the data. This includes the date and time it was collected. In the oceans sector, the data is typically geospatial, and so GPS coordinates are often recorded as well. Details like this are normally recorded for each data point, and found in a spreadsheet saved with the dataset.
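As a rough illustration of recording the what, when, and where for each data point, here is a minimal Python sketch. The field names and the sample values are made up for the example, not taken from any particular dataset or standard, so adapt them to your own data.

```python
from datetime import datetime, timezone

# Hypothetical field names for the "what, when, and where" of a data point.
REQUIRED_FIELDS = ("what", "timestamp", "latitude", "longitude")


def missing_fields(record):
    """Return the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


# Illustrative record; the values are invented for this example.
record = {
    "what": "sea_surface_temperature",
    # Storing the time in UTC, in ISO 8601 format, avoids ambiguous datestamps.
    "timestamp": datetime(2021, 6, 1, 14, 30, tzinfo=timezone.utc).isoformat(),
    "latitude": 47.5615,
    "longitude": -52.7126,
}

print(missing_fields(record))            # nothing missing in the complete record
print(missing_fields({"what": "depth"}))  # lists the fields that were left out
```

A check like this can be run over every row of the spreadsheet before the dataset is saved, so gaps are caught while the person who collected the data can still fill them in.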
There are many other less obvious details that should be recorded. People often overlook the who, why, and how of data collection. These tend to be broader categories, but are no less important. The answer to a question like “Why is this data being collected?” is obvious to the person collecting the data, but not to future users. Likewise, a data scientist may not understand what is being studied, even though it may be obvious to subject matter experts what the data means.
You should also record how the data is collected and from which device, system, or application. For example, the make and model of sensors are important in case you start using new types of sensors. Changing the data source or collection method can make data analysis complicated. If data is collected with two different types of sensors, the analysis may be invalid because the data is no longer comparable. Knowing who collected the data is also important, especially if it is collected manually, as people may do it in slightly different ways. They may also use different formats, especially for things like datestamps. Units of measure are particularly important, as the data could be useless without them. Data collectors probably record the typical metadata in a spreadsheet, but may also have handwritten field notes. These should be converted to a digital format and preserved as well.
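To make this concrete, here is one possible way to keep these details in a simple data dictionary, sketched in Python. Everything in it is an assumption made for illustration: the `VariableEntry` fields, the helper `entries_missing_units`, and the sample entries (including the placeholder sensor model) are hypothetical, and a spreadsheet or text file with the same columns would work just as well.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VariableEntry:
    """One row of a simple data dictionary (hypothetical layout)."""
    name: str          # column name as it appears in the dataset
    description: str   # what is being measured, and why
    units: str         # units of measure
    collected_by: str  # who collected the data
    method: str        # how the data was collected
    instrument: str    # device, system, or application, with make and model


def entries_missing_units(entries):
    """Return the names of variables whose units were never written down."""
    return [e.name for e in entries if not e.units.strip()]


# Illustrative entries; all values are placeholders.
dictionary = [
    VariableEntry(
        name="water_temp",
        description="Water temperature at 1 m depth",
        units="degrees Celsius",
        collected_by="Field team A",
        method="Moored sensor, logged every 10 minutes",
        instrument="ExampleCo T-100 (placeholder make and model)",
    ),
    VariableEntry(
        name="salinity",
        description="Salinity of the water sample",
        units="",  # left blank on purpose to show the check below
        collected_by="Field team A",
        method="Manual sample, measured on deck",
        instrument="Handheld meter (model not recorded)",
    ),
]

print(entries_missing_units(dictionary))

# Serialize the dictionary so it can be saved next to the dataset.
as_json = json.dumps([asdict(e) for e in dictionary], indent=2)
```

Keeping the dictionary in a plain format such as JSON or CSV means it can travel with the dataset and be read without any special software.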
There can be a large cost in collecting data. However, data is only valuable when it is used. If you have a dataset without any metadata, it may have little to no real value. Or, it may be the perfect addition to someone’s work, but in order to use it, they will have to ask you each and every one of these questions to fully understand the dataset. What’s easier: writing this simple information down as you’re collecting, or having your inbox flooded next year when a new data user is looking for these details? It doesn’t have to be fancy, but taking detailed notes will make your dataset easier to work with for everyone. For more info on basic data management, see my first data management seminar here: https://www.youtube.com/watch?v=TgWjaUtC7BM