One of the best ways to understand data is through the use of descriptive statistics, figuring out the minimum and maximum values, the median value, and the quartiles. When you’re working with smaller datasets this is easy, but with larger datasets you need to parse a lot of data to get these metrics. Luckily, you can use SQL to get descriptive statistics for your data directly from the database.
- Background and source data
- What next
Background and source data
My goal is to get the minimum, 1st quartile, median, 3rd quartile, and maximum property values for each neighborhood / community in Calgary. To get the raw data, I’m going to be working with property assessment values from the City of Calgary’s open data portal. Based on the link above, you can download the assessed property value for each address within the city. The data also contains the community name for each address so we have an idea which neighborhood each property is in.
Here’s a sample of the raw data:
|AUBURN BAY||535 AUBURN BAY HT SE||441,000|
|AUBURN BAY||531 AUBURN BAY HT SE||447,000|
|AUBURN BAY||527 AUBURN BAY HT SE||380,000|
|AUBURN BAY||5 AUBURN BAY PA SE||496,000|
|AUBURN BAY||9 AUBURN BAY PA SE||470,500|
|BEDDINGTON HEIGHTS||48 BEACONSFIELD RD NW||295,000|
|BEDDINGTON HEIGHTS||52 BEACONSFIELD RD NW||379,000|
|BEDDINGTON HEIGHTS||1419 BERKLEY DR NW||289,500|
|BEDDINGTON HEIGHTS||1417 BERKLEY DR NW||331,000|
|BEDDINGTON HEIGHTS||1415 BERKLEY DR NW||313,000|
I’m going to approach this in two steps. First, I’m going to write a subquery which assigns each address to a quartile based on it’s community. After that, I’ll write another query which will take this data and summarize it by community.
Step 1: Using a query to assign quartiles to data
Let’s start with the subquery. Using SQL’s analytic functions and
NTILE() we can assign each address to a quartile based on it’s community. This is pretty simple in code:
SELECT -- Get the community name CommunityName, -- Get the assessed value AssessedValue, -- Bucket the assessed value for each property into quartiles based on its community NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile FROM Assessments
Running this query results in one row for each property, showing the community name, assessed property value, and the quartile (based on the assessed property value) for that community:
Step 2: Summarizing the quartile data by grouping
Now we need to summarize this by community. To do that, we’ll group by the community, use
MAX() to get the upper / lower boundaries of each quartile, and we’ll use
CASE to pull each quartile into it’s own column. The full, final query to do this is:
SELECT -- Show the community name CommunityName, -- Get the minimum value for the community MIN(AssessedValue) Minimum, -- Get the 1st quartile boundary for the community, which is the highest value in the 1st quartile MAX(CASE WHEN Quartile = 1 THEN AssessedValue END) 1Quartile, -- Get the median for the community, which is the highest value in the 2nd quartile MAX(CASE WHEN Quartile = 2 THEN AssessedValue END) Median, -- Get the 3rd quartile boundary for the community, which is the highest value in the 3rd quartile MAX(CASE WHEN Quartile = 3 THEN AssessedValue END) 3Quartile, -- Get the maximum value for the community MAX(AssessedValue) Maximum, -- Get a count of the total properties in the community COUNT(Quartile) AS Count FROM ( SELECT -- Get the community name Community.CommunityName, -- Get the assessed value AssessedValue, -- Bucket the assessed value for each property into quartiles based on its community NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile FROM Assessments ) Vals GROUP BY CommunityName ORDER BY CommunityName
There we go! When we run this query, it provides the descriptive statistics for each community as a single row, with the statistics running across the columns like this:
Now that you can calculate the min, max, median, and quartiles by group in SQL, it’s easy to have a look at the data across different dimensions and get a feel for the distribution. While I grouped by the community above in the example, if the data had other dimensions such as “number of bedrooms” or “number of bathrooms” I could quickly change the query to get an overview of how those factors impacted the distribution of property values.
Try using this query to explore your data and see what insights you can uncover. If you find something interesting, I’d love to hear about it in the comments below.