While analysis tools and algorithms have evolved at a rapid pace, the overall business process for analytics has remained remarkably stable. One seminal work on the analytic process is the Cross-Industry Standard Process for Data Mining (CRISP-DM), first published by an industry consortium that included SPSS (now part of IBM). At over 20 years old, it remains a relevant and useful tool for describing the overall data science workflow.
CRISP-DM is the most widely used analytics model, a result of its industry-, tool-, and application-agnostic definition. It is also an open standard that is free to use and flexible enough to cover a number of different analytic styles or approaches.
Overall, the model is composed of six core activities:
- Business understanding: gathering the business requirements and intended outcomes for the project, and creating a problem definition.
- Data understanding: collecting initial data, understanding the scope, attributes, and veracity of the data, and reviewing the data to help form initial insights and hypotheses.
- Data preparation: transforming, cleaning, and preparing the final data set for ingestion by the modeling tools.
- Modeling: testing and adjusting a number of modeling techniques and algorithms to find the best results. Numerous approaches can be used, so this activity involves a high degree of exploration.
- Evaluation: subjecting the final model to rigorous review to ensure that it meets the business requirements and does not cause any unintended consequences.
- Presentation / deployment: sharing models that pass evaluation within the organization and implementing them.
The six activities are not a linear process but are part of an ongoing cycle of analytic work. Depending on the results of each step, it may be necessary to revisit prior steps to gain additional understanding, enhance the data, or improve models. Despite its cyclical nature, the work ultimately leads to models that can be integrated into day-to-day business activities.
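The non-linear flow described above can be sketched in a few lines of Python. This is only an illustration, not part of the CRISP-DM standard: the names `Phase`, `allowed_next`, and `is_valid_path` are hypothetical, and the transition rule (advance one phase, or revisit any earlier phase) is a simplifying assumption drawn from the description here rather than from the official process diagram.

```python
from __future__ import annotations

from enum import IntEnum


class Phase(IntEnum):
    """The six activities, numbered in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def allowed_next(current: Phase) -> set[Phase]:
    """Return the phases reachable from `current` under the assumed rule."""
    # Assumption: any earlier phase may be revisited at any time.
    options = {p for p in Phase if p < current}
    # Assumption: forward progress happens one phase at a time.
    if current is not Phase.DEPLOYMENT:
        options.add(Phase(current + 1))
    return options


def is_valid_path(path: list[Phase]) -> bool:
    """Check that every consecutive pair of phases is an allowed move."""
    return all(b in allowed_next(a) for a, b in zip(path, path[1:]))


# A project that loops back from modeling to data preparation once.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,  # revisit: the model exposed a data gap
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]
print(is_valid_path(project))
```

Under these assumed rules, a path that jumps straight from business understanding to modeling is rejected, while any number of loops back to earlier phases is accepted.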
When executed properly, the CRISP-DM framework also yields additional business and data insights that help future analytics efforts. This builds a virtuous cycle in which results compound across projects and initiatives.