After developing business understanding and data understanding, the next big objective in the CRISP-DM methodology is to prepare the data for modelling and analysis. This involves selecting, cleaning and transforming the data that will be used for the project. While this isn’t flashy work, it typically accounts for 60% to 80% of the effort on a project.
By the end of this stage you’ll have a dataset that is ready to be used for modelling, but there is often a lot of work that needs to occur in order to prepare raw data for analysis. The adage ‘garbage in, garbage out’ still applies – the quality of your analysis cannot exceed the quality of your data.
The first task is to select the data you will use, given the knowledge you just acquired in the data understanding stage of the CRISP-DM process. Often you’ll have to choose between multiple potential data sources each with varying attributes and quality. In order to guide this decision, you’ll want to keep in mind the overall goals for the project, the quality of the data sources, and the data types within each data source (since some data types will potentially cause technical constraints, depending on the analysis techniques you use).
Within a specific data source, you’ll also need to determine whether you want to use the entire data set, or to subset it to include only certain fields (columns) or records (rows). For the fields, their inclusion will depend on whether they are required for your analysis; and for the records, their inclusion tends to be a matter of either project scope or data quality.
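In pandas, both kinds of subsetting are one-liners. A minimal sketch – the table, field names and the “East region” scope are all hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical sales extract; field and value names are illustrative only.
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["East", "West", "East", "South"],
    "amount": [120.0, 85.5, 310.0, 42.0],
    "internal_note": ["n/a", "rush", "n/a", "gift"],
})

# Keep only the fields (columns) required for the analysis...
selected = sales[["order_id", "region", "amount"]]

# ...and only the records (rows) within the project scope, say the East region.
selected = selected[selected["region"] == "East"]
```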
Rationale for inclusion / exclusion
Once you’ve decided which data source to use and what subset (if any) will be taken, you’ll want to document your decisions and the reasons for your decisions.
Rarely is raw data ready for modelling without any intervention. The next step is to look at the data and determine what cleaning actions need to be performed in order to support your analysis techniques. This often includes identifying clean subsets of the data, updating blank values to suitable default values, or estimating missing values.
This step needs to be taken with care and consideration since the selection of default values or estimates could potentially impact the results of your analysis.
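Two of the cleaning actions mentioned above – keeping a clean subset, and estimating a missing value – might look like this in pandas. The data and the choice of the median as the estimate are illustrative assumptions, not a recommendation for any particular project:

```python
import pandas as pd

# Illustrative data with a blank (NaN) value in the amount field.
orders = pd.DataFrame({"customer": ["A", "B", "C"],
                       "amount": [100.0, None, 300.0]})

# Option 1: keep only the clean subset of records.
clean = orders.dropna(subset=["amount"])

# Option 2: estimate the missing value, here using the median of the field.
# Note this choice can affect downstream results, so document it.
filled = orders.assign(amount=orders["amount"].fillna(orders["amount"].median()))
```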
Data cleaning report
As you clean the data, you’ll want to document all the issues that you encountered and the steps you took to remedy each issue. This should be an extension of the data quality report which was developed as part of the data understanding stage.
Often you’ll need to add additional information and / or transform existing information in order to support your analysis. This takes two forms: creating derived attributes and generating new records.
Derived attributes are new calculated fields which are added to the data set in order to support your analysis. Often these are simple calculations which are created by adding or multiplying other fields from the same record.
Occasionally you will need to generate entire records in order to fill in missing data for your analysis. These are often created to represent events which didn’t occur, such as customers who did not make a purchase.
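Both construction steps can be sketched in pandas. The purchases table and customer list are hypothetical; the derived attribute is a simple product of two fields on the same record, and the generated records represent customers with no purchases:

```python
import pandas as pd

# Illustrative purchases table.
purchases = pd.DataFrame({"customer": ["A", "A", "C"],
                          "quantity": [2, 1, 5],
                          "unit_price": [10.0, 20.0, 4.0]})

# Derived attribute: a simple calculation from fields on the same record.
purchases["total"] = purchases["quantity"] * purchases["unit_price"]

# Generated records: reindex against the full customer list so non-events
# (here, customer "B" who bought nothing) appear in the data set.
all_customers = ["A", "B", "C"]
totals = (purchases.groupby("customer")["total"].sum()
          .reindex(all_customers, fill_value=0.0)
          .reset_index())
```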
Data integration covers activities where data from multiple data sources is combined and / or summarized.
The output from integration is the merged data: one or more data sets built from a combination of multiple data sources. Often this involves denormalizing data stored in databases, combining the dimension and fact tables to provide a view of events along with their relevant context.
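The fact/dimension join described above maps directly onto a pandas merge. The two tables here are toy examples of a fact table (events) and a dimension table (context):

```python
import pandas as pd

# Illustrative fact table (events) and dimension table (context).
fact = pd.DataFrame({"product_id": [1, 2, 1], "units": [3, 1, 2]})
dim = pd.DataFrame({"product_id": [1, 2], "category": ["toys", "books"]})

# Denormalize: join the dimension onto the fact table so that every
# event record carries its context.
merged = fact.merge(dim, on="product_id", how="left")
```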
Merging data also covers creating aggregates by grouping the data and then summing and / or averaging the resulting records. Depending on the analysis techniques you plan on using, this can help reduce the size of the data set which will help speed your analysis and provide more intuitive results.
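Aggregation in pandas is a groupby followed by summing and / or averaging, exactly as described. The transaction data and region grouping here are made up for illustration:

```python
import pandas as pd

# Illustrative transaction-level data.
tx = pd.DataFrame({"region": ["East", "East", "West"],
                   "amount": [10.0, 30.0, 5.0]})

# Aggregate to one record per region: a smaller data set with
# sums and averages that are easier to interpret.
summary = tx.groupby("region", as_index=False).agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
)
```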
Formatting data is the step where the data is reorganized to conform to the specifications of the modelling tools. Some modelling tools require that the first field be a unique identifier, or that the last field be the outcome you want to predict. The changes in this step do not change the meaning of the data, only the way it is represented. Other examples of formatting include ordering the records in a specific way, or making syntactic changes such as trimming long strings or replacing characters that have special meaning.
The outcome of the format data step is a reformatted data set which is ready to be used in a modelling tool.
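Two of the formatting changes mentioned – reordering fields for a tool that expects the identifier first and the target last, and trimming strings – can be sketched as follows. The tool’s layout requirement and all field names are hypothetical:

```python
import pandas as pd

# Illustrative data set, in an arbitrary field order.
df = pd.DataFrame({"target": [0, 1],
                   "feature": [3.5, 1.2],
                   "id": [101, 102],
                   "comment": ["  needs trim  ", "ok "]})

# Syntactic change: trim whitespace from strings.
df["comment"] = df["comment"].str.strip()

# Reorder fields: identifier first, outcome last. The meaning of the
# data is unchanged; only its representation is.
df = df[["id", "feature", "comment", "target"]]
```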
Over the course of this stage, we began actually working with the data, massaging it into a state where it is ready to be loaded into whatever modelling tools we wish to use. In the next stage, modelling, the analysis begins in earnest.
Have any questions about loading or transforming data, or need some help wrangling the data for your project? Use the contact form to reach out to me – I’d love to hear from you.