Calculating medians and quartiles across groups in SQL

One of the best ways to understand data is through the use of descriptive statistics, figuring out the minimum and maximum values, the median value, and the quartiles. When you’re working with smaller datasets this is easy, but with larger datasets you need to parse a lot of data to get these metrics. Luckily, you can use SQL to get descriptive statistics for your data directly from the database.

Background and source data
Tutorial
- Step 1: Using a query to assign quartiles to data
- Step 2: Summarizing the quartile data by grouping
What next

Background and source data

My goal is to get the minimum, 1st quartile, median, 3rd quartile, and maximum property values for each neighborhood / community in Calgary. To get the raw data, I’m going to be working with property assessment values from the City of Calgary’s open data portal. Based on the link above, you can download the assessed property value for each address within the city. The data also contains the community name for each address so we have an idea which neighborhood each property is in.

Here’s a sample of the raw data:

CommunityName	Address	AssessedValue
AUBURN BAY	535 AUBURN BAY HT SE	441,000
AUBURN BAY	531 AUBURN BAY HT SE	447,000
AUBURN BAY	527 AUBURN BAY HT SE	380,000
AUBURN BAY	5 AUBURN BAY PA SE	496,000
AUBURN BAY	9 AUBURN BAY PA SE	470,500
BEDDINGTON HEIGHTS	48 BEACONSFIELD RD NW	295,000
BEDDINGTON HEIGHTS	52 BEACONSFIELD RD NW	379,000
BEDDINGTON HEIGHTS	1419 BERKLEY DR NW	289,500
BEDDINGTON HEIGHTS	1417 BERKLEY DR NW	331,000
BEDDINGTON HEIGHTS	1415 BERKLEY DR NW	313,000

Tutorial

I’m going to approach this in two steps. First, I’m going to write a subquery which assigns each address to a quartile based on it’s community. After that, I’ll write another query which will take this data and summarize it by community.

Step 1: Using a query to assign quartiles to data

Let’s start with the subquery. Using SQL’s analytic functions and NTILE() we can assign each address to a quartile based on it’s community. This is pretty simple in code:

SELECT 
	-- Get the community name
	CommunityName,
	-- Get the assessed value
	AssessedValue,
	-- Bucket the assessed value for each property into quartiles based on its community
	NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile
FROM
	Assessments

Running this query results in one row for each property, showing the community name, assessed property value, and the quartile (based on the assessed property value) for that community:

CommunityName	AssessedValue	Quartile
AUBURN BAY	254,000	1
AUBURN BAY	254,000	1
AUBURN BAY	254,000	1
AUBURN BAY	254,000	2
AUBURN BAY	254,000	2
BEDDINGTON HEIGHTS	384,500	3
BEDDINGTON HEIGHTS	385,000	3
BEDDINGTON HEIGHTS	385,000	3
BEDDINGTON HEIGHTS	385,000	4
BEDDINGTON HEIGHTS	385,000	4

Step 2: Summarizing the quartile data by grouping

Now we need to summarize this by community. To do that, we’ll group by the community, use MIN() and MAX() to get the upper / lower boundaries of each quartile, and we’ll use CASE to pull each quartile into it’s own column. The full, final query to do this is:

SELECT
	-- Show the community name
	CommunityName,
	-- Get the minimum value for the community 
	MIN(AssessedValue) Minimum,
	-- Get the 1st quartile boundary for the community, which is the highest value in the 1st quartile
	MAX(CASE WHEN Quartile = 1 THEN AssessedValue END) 1Quartile,
	-- Get the median for the community, which is the highest value in the 2nd quartile
	MAX(CASE WHEN Quartile = 2 THEN AssessedValue END) Median,
	-- Get the 3rd quartile boundary for the community, which is the highest value in the 3rd quartile
	MAX(CASE WHEN Quartile = 3 THEN AssessedValue END) 3Quartile,
	-- Get the maximum value for the community
	MAX(AssessedValue) Maximum,
	-- Get a count of the total properties in the community 
	COUNT(Quartile) AS Count
FROM (
	SELECT
		-- Get the community name
		Community.CommunityName,
		-- Get the assessed value
		AssessedValue,
		-- Bucket the assessed value for each property into quartiles based on its community
		NTILE(4) OVER (PARTITION BY Community.Community ORDER BY AssessedValue) AS Quartile
	FROM
		Assessments
) Vals
GROUP BY
	CommunityName
ORDER BY
	CommunityName

There we go! When we run this query, it provides the descriptive statistics for each community as a single row, with the statistics running across the columns like this:

CommunityName	Minimum	1Quartile	Median	3Quartile	Maximum	Count
ACADIA	6,250	197,500	386,500	417,500	54,760,000	3,826
ALTADORE	3,050	392,500	722,000	878,000	8,480,000	3,079
ARBOUR LAKE	10,000	230,000	430,000	510,000	26,080,000	4,381
ASPEN WOODS	4,050	375,000	751,000	1,000,000	25,100,000	3,599
AUBURN BAY	6,000	254,000	405,000	532,000	51,030,000	6,904
BEDDINGTON HEIGHTS	10,000	283,000	340,000	385,000	12,210,000	4,074
BOWNESS	10,000	291,000	358,500	442,500	18,950,000	4,060
BRIDGELAND/RIVERSIDE	16,310	22,000	269,000	456,500	41,120,000	4,045
BRENTWOOD	10,000	216,000	466,500	530,500	38,060,000	3,438
BRIDLEWOOD	3,050	186,000	353,000	404,500	2,260,000	4,836

What next

Now that you can calculate the min, max, median, and quartiles by group in SQL, it’s easy to have a look at the data across different dimensions and get a feel for the distribution. While I grouped by the community above in the example, if the data had other dimensions such as “number of bedrooms” or “number of bathrooms” I could quickly change the query to get an overview of how those factors impacted the distribution of property values.

Try using this query to explore your data and see what insights you can uncover. If you find something interesting, I’d love to hear about it in the comments below.