Having developed business understanding and a deep knowledge of the problem you are trying to solve, the next step in the CRISP-DM framework is to develop that same level of understanding around the data itself. This step isn’t analysis, but rather looking at the structure and shape of the data in order to determine what information is available and how to go about building your analysis.

This work might seem tedious at first since it involves lengthy documentation of data sources, but it pays dividends in future projects by increasing the organization’s knowledge and understanding of its data sources. That knowledge helps identify opportunities to improve data collection and quality, and it helps subsequent projects progress faster.

Collect initial data

To start the journey towards data understanding, you need the raw data. Here you obtain the data listed in the project resources document, or at least access to it. If you use specialized tools for data exploration, this step also includes loading the data into those tools. If you have multiple sources of data, you’ll need to decide whether to combine them up front or later in the process.

Initial data collection report

As you collect the data, you’ll want to document the data sources you acquired, along with their locations, their format, and any issues you encountered along the way. Recording this information makes it easier for future projects using these data sets and creates a FAQ document for common issues and solutions that you identify when sourcing the data.
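
One lightweight way to keep this record is a small structured inventory that can be saved alongside the project. The sketch below is only an illustration: the DataSource fields, the source names, and the locations are all hypothetical, and a shared spreadsheet or wiki page would serve the same purpose.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataSource:
    """One entry in the initial data collection report (hypothetical layout)."""
    name: str
    location: str
    data_format: str
    issues: list = field(default_factory=list)  # problems met while sourcing


def collection_report(sources):
    """Serialize the inventory as JSON so it can be stored and shared."""
    return json.dumps([asdict(s) for s in sources], indent=2)


# Made-up example entries; replace with your own sources.
sources = [
    DataSource("orders", "warehouse.sales.orders", "SQL table"),
    DataSource(
        "web_logs",
        "logs/2023/",
        "gzipped CSV",
        issues=["Timestamps are in local time, not UTC"],
    ),
]

report = collection_report(sources)
print(report)
```

Keeping the issues next to each source is what lets the report double as the FAQ described above.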

Describe data

Once you have the data in hand, examine it at a surface level to get a sense of its size and shape. Your goal is to understand the fields, the format of each field, and the record count for each table. Once you have completed this step, you should have a good idea of whether or not the data satisfies your requirements for the project.
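
As a rough sketch of what this surface-level look can involve, the snippet below reads a small CSV extract, counts the records, and guesses a simple type for each field. The sample data and the function names (infer_type, describe) are invented for illustration, and the type guessing is deliberately naive; a real project would more likely lean on its database or exploration tool for this.

```python
import csv
import io

# Made-up sample extract; in practice you would open a real file.
SAMPLE_CSV = """customer_id,signup_date,plan,monthly_spend
1,2023-01-05,basic,19.99
2,2023-01-09,pro,49.00
3,2023-02-11,,19.99
"""


def infer_type(values):
    """Guess a simple type from the non-empty string values of one field."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue  # not every value parses; try the next type
    return "text"


def describe(csv_text):
    """Return the record count and a guessed type for each field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = reader.fieldnames or []
    return {
        "record_count": len(rows),
        "fields": {f: infer_type([r[f] for r in rows]) for f in fields},
    }


summary = describe(SAMPLE_CSV)
print(summary)
```

Note that the date column comes back as plain text here. That kind of finding is worth writing down, since it tells you a conversion will be needed later.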

Data description report

As you look at the data, you will want to begin documenting the features of the data for use later in the project. This might involve building data dictionaries for previously undocumented data sources.

Explore data

Now it’s time to dive into the data itself and begin looking at the records. This task is where you start acquainting yourself with the data by querying, aggregating, reporting, and visualizing it. You’ll want to look at summary statistics, relationships between variables or between data sources, and begin identifying any subsets of interest within the data. The insight gained during these activities can begin answering business questions, inform the data quality report, and form the basis for the transformations and later work required in the project.
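
To make the idea of summary statistics and subsets concrete, here is a minimal sketch that summarizes one numeric field overall and then by group. The records, field names, and helper functions are hypothetical; the same questions could just as easily be asked with SQL or a visualization tool.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up records standing in for a table of orders.
records = [
    {"region": "north", "order_value": 120.0},
    {"region": "north", "order_value": 80.0},
    {"region": "south", "order_value": 200.0},
    {"region": "south", "order_value": 40.0},
    {"region": "south", "order_value": 60.0},
]


def summarize(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def summarize_by(rows, group_field, value_field):
    """Summarize value_field separately for each value of group_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {key: summarize(vals) for key, vals in groups.items()}


overall = summarize([r["order_value"] for r in records])
by_region = summarize_by(records, "region", "order_value")
print(overall)
print(by_region)
```

In this toy data both regions have the same mean order value, but the medians differ, which hints that one large order is pulling the south's average up. Spotting patterns like that is the point of the exploration task.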

Data exploration report

Over the course of the data exploration, document the results of your work for reference. These notes will inform the upcoming analysis and transformation phases of the project.

Verify data quality

The last task within data understanding is assessing the quality of the data. During your data exploration you probably found some outliers or strange results. Now you need to confirm whether these outliers and oddities are legitimate data points or errors. You will also want to look at the completeness of the data: does it cover the full time period you are interested in, and are all fields consistently present? Lastly, take note of how missing data is represented and whether that representation is used consistently.
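
The three checks described here (outliers, time coverage, and missing-value representation) can be sketched in a few lines. Everything below is an assumption made for illustration: the list of missing-value markers, the 1.5 × IQR rule for flagging outliers, and the function names. Flagged values are candidates to investigate, not confirmed errors.

```python
from datetime import date, timedelta
from statistics import quantiles

# Assumed set of markers; adjust to whatever your sources actually use.
MISSING_MARKERS = {"", "NA", "N/A", "null", "-999"}


def missing_markers_used(values):
    """Count how often each missing-value marker appears in a field."""
    counts = {}
    for v in values:
        if v in MISSING_MARKERS:
            counts[v] = counts.get(v, 0) + 1
    return counts


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (a common heuristic)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def missing_days(days, start, end):
    """List the dates between start and end (inclusive) with no records."""
    present = set(days)
    total = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(total))
    return [d for d in candidates if d not in present]


# Made-up examples.
markers = missing_markers_used(["basic", "", "NA", "pro", "NA"])
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
gaps = missing_days(
    [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 4)],
    start=date(2023, 1, 1),
    end=date(2023, 1, 5),
)
print(markers, outliers, gaps)
```

If more than one marker shows up in the same field, as with the empty string and "NA" in the example, that inconsistency belongs in the data quality report.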

Data quality report

Based on the results of your verification of the data quality, create a report describing the accuracy and completeness of the data. Document any issues you encountered and list the suggestions and workarounds for these issues.

Having completed the background work for data understanding, you should have detailed documentation of the data sources, their structure, their content, and their accuracy. This will help inform the upcoming steps of data preparation and modeling and is invaluable information for future projects.

Need help managing your data science project or improving your planning methodology? Get in touch with me using the contact form and learn how I can help.
