One of the best ways to understand data is through the use of descriptive statistics, figuring out the minimum and maximum values, the median value, and the quartiles.  When you’re working with smaller datasets this is easy, but with larger datasets you need to parse a lot of data to get these metrics.  Luckily, you can use SQL to get descriptive statistics for your data directly from the database.

Contents

Background and source data

My goal is to get the minimum, 1st quartile, median, 3rd quartile, and maximum property values for each neighborhood / community in Calgary. To get the raw data, I’m going to be working with property assessment values from the City of Calgary’s open data portal. Based on the link above, you can download the assessed property value for each address within the city. The data also contains the community name for each address so we have an idea which neighborhood each property is in.

Here’s a sample of the raw data:

CommunityNameAddressAssessedValue
AUBURN BAY535 AUBURN BAY HT SE441,000
AUBURN BAY531 AUBURN BAY HT SE447,000
AUBURN BAY527 AUBURN BAY HT SE380,000
AUBURN BAY5 AUBURN BAY PA SE496,000
AUBURN BAY9 AUBURN BAY PA SE470,500
BEDDINGTON HEIGHTS48 BEACONSFIELD RD NW295,000
BEDDINGTON HEIGHTS52 BEACONSFIELD RD NW379,000
BEDDINGTON HEIGHTS1419 BERKLEY DR NW289,500
BEDDINGTON HEIGHTS1417 BERKLEY DR NW331,000
BEDDINGTON HEIGHTS1415 BERKLEY DR NW313,000

Tutorial

I’m going to approach this in two steps. First, I’m going to write a subquery which assigns each address to a quartile based on it’s community. After that, I’ll write another query which will take this data and summarize it by community.

Step 1: Using a query to assign quartiles to data

Let’s start with the subquery. Using SQL’s analytic functions and NTILE() we can assign each address to a quartile based on it’s community. This is pretty simple in code:

SELECT 
	-- Get the community name
	CommunityName,
	-- Get the assessed value
	AssessedValue,
	-- Bucket the assessed value for each property into quartiles based on its community
	NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile
FROM
	Assessments

Running this query results in one row for each property, showing the community name, assessed property value, and the quartile (based on the assessed property value) for that community:

CommunityNameAssessedValueQuartile
AUBURN BAY254,0001
AUBURN BAY254,0001
AUBURN BAY254,0001
AUBURN BAY254,0002
AUBURN BAY254,0002
BEDDINGTON HEIGHTS384,5003
BEDDINGTON HEIGHTS385,0003
BEDDINGTON HEIGHTS385,0003
BEDDINGTON HEIGHTS385,0004
BEDDINGTON HEIGHTS385,0004

Step 2: Summarizing the quartile data by grouping

Now we need to summarize this by community. To do that, we’ll group by the community, use MIN() and MAX() to get the upper / lower boundaries of each quartile, and we’ll use CASE to pull each quartile into it’s own column. The full, final query to do this is:

SELECT
	-- Show the community name
	CommunityName,
	-- Get the minimum value for the community 
	MIN(AssessedValue) Minimum,
	-- Get the 1st quartile boundary for the community, which is the highest value in the 1st quartile
	MAX(CASE WHEN Quartile = 1 THEN AssessedValue END) 1Quartile,
	-- Get the median for the community, which is the highest value in the 2nd quartile
	MAX(CASE WHEN Quartile = 2 THEN AssessedValue END) Median,
	-- Get the 3rd quartile boundary for the community, which is the highest value in the 3rd quartile
	MAX(CASE WHEN Quartile = 3 THEN AssessedValue END) 3Quartile,
	-- Get the maximum value for the community
	MAX(AssessedValue) Maximum,
	-- Get a count of the total properties in the community 
	COUNT(Quartile) AS Count
FROM (
	SELECT
		-- Get the community name
		Community.CommunityName,
		-- Get the assessed value
		AssessedValue,
		-- Bucket the assessed value for each property into quartiles based on its community
		NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile
	FROM
		Assessments
) Vals
GROUP BY
	CommunityName
ORDER BY
	CommunityName

There we go! When we run this query, it provides the descriptive statistics for each community as a single row, with the statistics running across the columns like this:

CommunityNameMinimum1QuartileMedian3QuartileMaximumCount
ACADIA6,250197,500386,500417,50054,760,0003,826
ALTADORE3,050392,500722,000878,0008,480,0003,079
ARBOUR LAKE10,000230,000430,000510,00026,080,0004,381
ASPEN WOODS4,050375,000751,0001,000,00025,100,0003,599
AUBURN BAY6,000254,000405,000532,00051,030,0006,904
BEDDINGTON HEIGHTS10,000283,000340,000385,00012,210,0004,074
BOWNESS10,000291,000358,500442,50018,950,0004,060
BRIDGELAND/RIVERSIDE16,31022,000269,000456,50041,120,0004,045
BRENTWOOD10,000216,000466,500530,50038,060,0003,438
BRIDLEWOOD3,050186,000353,000404,5002,260,0004,836

What next

Now that you can calculate the min, max, median, and quartiles by group in SQL, it’s easy to have a look at the data across different dimensions and get a feel for the distribution. While I grouped by the community above in the example, if the data had other dimensions such as “number of bedrooms” or “number of bathrooms” I could quickly change the query to get an overview of how those factors impacted the distribution of property values.

Try using this query to explore your data and see what insights you can uncover. If you find something interesting, I’d love to hear about it in the comments below.

Leave a Reply

Your email address will not be published.