A successful data scientist needs to draw on skills from many disciplines, and one of the core skill sets is knowledge of relational databases and how to query them using Structured Query Language (SQL). Relational databases are the most common way to store structured data, so a firm understanding of databases is key to obtaining data and performing simple analysis and reporting quickly.

If you are following the CRISP-DM framework, the second and third phases of the data mining process are data understanding and data preparation. Knowledge of databases and SQL supports both phases in a number of ways.

First of all, knowledge of relational databases allows the data scientist to understand the scope and extent of the data. Relational databases store information across a number of separate tables, and knowledge of database systems allows users to build meaningful connections across those different data sets. This allows the data to be enriched by providing additional attributes or showing other data points of interest.
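As a minimal sketch of what connecting tables looks like in practice, the example below uses Python's built-in sqlite3 module with an in-memory database. The customers and orders tables, their columns, and every row in them are hypothetical and exist only for illustration; a real schema will differ.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, invented purely for illustration.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        region      TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Acme', 'East'), (2, 'Globex', 'West');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 2, 75.5), (12, 1, 40.0);
""")

# Enrich each order with attributes stored in the customers table.
enriched = conn.execute("""
    SELECT o.order_id, o.amount, c.name, c.region
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in enriched:
    print(row)

conn.close()
```

Each order row now carries the customer's name and region, even though those attributes are stored in a separate table.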

Secondly, SQL allows the data scientist to write queries to analyze and extract data from the database. These queries could take a number of different forms:

  • To build understanding of the data, the data scientist could extract some sample records.
  • If the data scientist has a simple hypothesis that can be expressed in terms of fields in the database, they could write a query to quickly extract the data needed to support or refute their hunch.
  • When more complex algorithms and tools are going to be used, the data scientist could write a query to extract the relevant records and prepare them for additional analysis.
  • Simple recurring reports can be built almost entirely as a database query and run whenever the report needs to be updated.
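The four kinds of query listed above might look something like the following. This is only a sketch using Python's built-in sqlite3 module: the sales table, its columns, its rows, and the monthly_report helper are all hypothetical, chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration only.
conn.executescript("""
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        region     TEXT,
        sale_month TEXT,
        amount     REAL
    );
    INSERT INTO sales VALUES
        (1, 'East', '2024-01', 100.0),
        (2, 'West', '2024-01', 300.0),
        (3, 'East', '2024-02', 150.0),
        (4, 'West', '2024-02', 350.0);
""")

# 1. Build understanding: pull a few sample records.
sample = conn.execute(
    "SELECT * FROM sales ORDER BY sale_id LIMIT 2"
).fetchall()

# 2. Check a simple hypothesis, e.g. "West has a higher average sale than East".
avg_by_region = dict(conn.execute(
    "SELECT region, AVG(amount) FROM sales GROUP BY region"
).fetchall())

# 3. Extract only the relevant records for further analysis in another tool.
east_rows = conn.execute(
    "SELECT sale_month, amount FROM sales WHERE region = ? ORDER BY sale_id",
    ("East",),
).fetchall()

# 4. A simple recurring report: rerun the same query whenever it is needed.
def monthly_report(connection):
    """Return total sales per month (hypothetical helper)."""
    return connection.execute(
        "SELECT sale_month, SUM(amount) FROM sales "
        "GROUP BY sale_month ORDER BY sale_month"
    ).fetchall()

report = monthly_report(conn)

print(sample)
print(avg_by_region)
print(east_rows)
print(report)

conn.close()
```

The filter value in the third query is passed as a parameter (the ? placeholder) rather than pasted into the SQL string, which is a good habit once queries start taking user-supplied values.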

Luckily, relational databases are a simple concept to master. If you are familiar with more advanced Excel formulas like =VLOOKUP(...), then you are well on your way to understanding how to build SQL queries that join tables together. There is a wealth of courses online that provide a good introduction to these concepts, and I also provide in-person and on-site training on relational databases and SQL.
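To make the VLOOKUP comparison concrete, here is a rough sketch of the SQL equivalent: a LEFT JOIN that looks up a price for each order line. It again uses Python's built-in sqlite3 module, and the order_lines and products tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: order_lines plays the role of the sheet containing the
# formula, and products plays the role of the lookup range.
conn.executescript("""
    CREATE TABLE order_lines (
        line_id      INTEGER PRIMARY KEY,
        product_code TEXT,
        quantity     INTEGER
    );
    CREATE TABLE products (
        product_code TEXT PRIMARY KEY,
        unit_price   REAL
    );
    INSERT INTO order_lines VALUES (1, 'A100', 2), (2, 'B200', 1), (3, 'Z999', 5);
    INSERT INTO products VALUES ('A100', 9.5), ('B200', 20.0);
""")

# LEFT JOIN keeps every order line, even when no matching product exists.
looked_up = conn.execute("""
    SELECT l.line_id, l.product_code, p.unit_price
    FROM order_lines AS l
    LEFT JOIN products AS p ON p.product_code = l.product_code
    ORDER BY l.line_id
""").fetchall()

for row in looked_up:
    print(row)

conn.close()
```

The unmatched code Z999 comes back with a price of None (SQL NULL), much as an exact-match VLOOKUP would show #N/A. One difference to keep in mind is that VLOOKUP returns only the first match, whereas a join returns one row for every match, so duplicate keys in the lookup table will multiply rows.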

Want to learn more about training for your team? Reach me using the contact form and I’ll be in touch.
