CRISP-DM in depth: modeling

Once the data preparation phase is complete, it’s time to move on to the fun part of the CRISP-DM framework: modeling. Here you’ll choose which modeling technique to use, design tests to assess the accuracy of your model, build the model, and then assess it using the tests you designed.

Select modeling technique

The first step in modeling is to decide which technique you want to use. The choice is driven by your business and data understanding, which together indicate which technique best suits the problem you are trying to solve. In the project documentation, capture not only your technique but also the rationale for selecting it and any assumptions you’re making.

Modeling technique

It’s important to document the specific technique (for example, k-nearest neighbours, random forest, or neural network) that you are using.

Modeling assumptions

Many techniques involve assumptions around tuning the model, handling missing data values, converting between data types, and whether the inputs are assumed to have a specific statistical distribution. You’ll want to document these items since they all impact the final result.

Generate test design

To determine the effectiveness of your model, you need a way of testing its validity. It’s customary to separate your data into two sets: one used for training the model and a second used for testing it.

Test design

Describe your plan for training, testing and evaluating the model. List the specific measures you are testing and what criteria you are using to determine if the model is successful or not. This includes the plan for dividing the data between training, testing and validation data sets.
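As a rough illustration of the data-division part of the plan, here is a minimal sketch using only the Python standard library. The function name split_dataset, the 60/20/20 proportions, and the seed value are assumptions made for this example rather than anything CRISP-DM prescribes; pick proportions that suit your own data volume and record them in the test design.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions and the seed are illustrative defaults (assumptions),
    not values recommended by CRISP-DM.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out for testing
    return train, validation, test

records = list(range(100))  # stand-in for 100 prepared records
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))  # 60 20 20
```

Fixing the seed means the same split can be regenerated later, which makes the documented test design repeatable.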

Build model

At last it’s time to break out the modeling tool. Run it on the prepared dataset and generate the model(s) you need.

Parameter settings

Modeling tools often expose multiple parameters that affect the final results of the model. List the parameters, the values you chose, and your rationale for choosing them.
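One lightweight way to keep that list is to store it as structured data alongside the model. The sketch below is only a suggestion: the ParameterSetting class, the parameter names, and the rationales are hypothetical examples for a k-nearest neighbours model, not settings taken from a real project.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ParameterSetting:
    """One documented parameter: its name, chosen value, and the reason why."""
    name: str
    value: object
    rationale: str

# Hypothetical settings for a k-nearest neighbours model.
settings = [
    ParameterSetting("n_neighbors", 5, "odd value avoids ties in a two-class vote"),
    ParameterSetting("distance_metric", "euclidean", "inputs are scaled numeric features"),
]

# Serialise the log so it can be saved with the project documentation.
parameter_log = json.dumps([asdict(s) for s in settings], indent=2)
print(parameter_log)
```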

Models

The key output of this process is a model that can be tested and, if it passes validation, used on an ongoing basis.

Model description

Lastly, once the models are created, describe the models themselves. Provide an interpretation of the models and document any issues encountered while generating the model.

Assess model

Now that the models are created, they need to be evaluated based on knowledge of the business and data, the data mining success criteria, and the results of testing the model. The initial evaluation is purely technical; subsequently, business analysts and domain experts are brought in to evaluate the models against the business context. This task pertains strictly to the models; the evaluation phase that follows provides a more holistic review of the project.

If multiple models are created, rank them against each other based on the evaluation criteria and select the best model for the project.
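Ranking can be as simple as sorting the candidates by the agreed evaluation measure. In the sketch below the model names and scores are made-up placeholders, and a single accuracy figure is assumed to be the only criterion; in practice you may need to weigh several measures.

```python
# Hypothetical validation accuracies for three candidate models (made-up numbers).
candidates = {
    "k_nearest_neighbours": 0.81,
    "random_forest": 0.88,
    "neural_network": 0.86,
}

# Sort from best to worst on the chosen evaluation measure.
ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score:.2f}")

best_model = ranked[0][0]  # the top-ranked candidate
```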

Model assessment

Document the results of the model assessment, including the results of testing, the accuracy of the model, and its rank against other models if multiple models exist.
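For a two-class model, the figures to document might be computed along these lines. This is a minimal sketch: the assess function, the choice of accuracy, precision, and recall as measures, and the sample labels are all assumptions for illustration, and your test design may call for different measures.

```python
def assess(actual, predicted, positive=1):
    """Return accuracy, precision, and recall for a binary classifier."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    tn = sum(a != positive and p != positive for a, p in pairs)  # true negatives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels: what actually happened versus what the model predicted.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
result = assess(actual, predicted)
print(result)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```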

Revised parameter settings

Based on the assessment of the model, determine whether the model’s parameters should be updated, and if so, re-run the Build model task with the revised parameters. Continue iterating between model building and assessment until you have a model that satisfies the success criteria and that you believe is the best you can develop. Document your revisions and assessments along the way.
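The build-and-assess loop can be sketched as follows. Everything here is a toy stand-in: a hand-written one-feature k-nearest neighbours classifier, made-up data points, a hypothetical target accuracy of 0.9, and k as the only parameter being revised. The point is the shape of the loop (build, assess, record, stop when the criterion is met), not the specific model.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    """Share of validation points that the model labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)

# Made-up data with two deliberately noisy training points (3.4 and 6.6).
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (3.4, "high"),
         (6.0, "high"), (7.0, "high"), (8.0, "high"), (6.6, "low")]
validation = [(3.3, "low"), (6.5, "high"), (1.5, "low"), (7.5, "high")]

TARGET_ACCURACY = 0.9  # hypothetical success criterion
history = []           # documents every revision and its assessment
chosen_k = None

for k in (1, 3, 5):    # candidate parameter revisions, tried in order
    score = validation_accuracy(train, validation, k)
    history.append({"k": k, "accuracy": score})
    if score >= TARGET_ACCURACY:
        chosen_k = k   # success criterion met, so stop iterating
        break

print(history)   # [{'k': 1, 'accuracy': 0.5}, {'k': 3, 'accuracy': 1.0}]
print(chosen_k)  # 3
```

Because the parameter is tuned against the validation data, the final check of the chosen model should use the separate test set.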

Have any questions about building or evaluating models, or need some help wrangling the data for your project? Use the contact form to reach out to me – I’d love to hear from you.