Improve your SQL with these templates for formatting and documentation
Reduce the pain of code review and knowledge transfer with better SQL format and documentation.
Introduction
Have you ever come across a complicated SQL query for analysis?
Did you struggle to understand the code itself and the business logic underneath?
I did. And sometimes, it was with my queries in the past!
To save time for everyone who read my code (including myself), I have tried to apply two templates to my query and found that they are very helpful to:
- Increase the quality of the code itself
- Reduce code review time
- And improve knowledge transfer
What you can expect
In this article, I will share with you these two templates that I use.
Template #1: Document context and assumptions in your SQL query
Template #2: Format SQL query
To demonstrate their usage, I will go through an example using MySQL to summarize sales before and after COVID-19.
I hope these templates come in handy! This is especially when remote work is our new normal after COVID-19 started, increasing the importance of over-communication to make sure everyone is on the same page.
Template #1: Document context and assumptions in SQL query
1. Template
List this information before you write a query:
- Important business context for this query
- Expectations for query result
- Any assumptions made for the business logic and data
/*
CONTEXT:
- add a brief description of why we need this queryRESULT
EXPECTATION
- add a brief description of your expectations for the query result
ASSUMPTION:
- add assumption about business logic- add assumption about data
*/
2. Template in Action
Apply this template to our sales summary example:
/*
CONTEXT:
- Our company wants to understand if COVID has any impact on sales in stores around Chicago
RESULT EXPECTATION:
- This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01
ASSUMPTION:
- Dates before 2020-03-01 are considered "Before COVID"- Each transaction has a unique id, so we do not expect duplications in our transaction table
- There are some spam transactions we have identified after COVID, so we will filter these out
*/
3. Benefits
Having a brief description with business context, result expectation, and assumptions before our queries has many benefits:
- It keeps us focused on the main goal of the query while writing it
- It helps readers quickly establish a high-level understanding of the value and expectations of our queries
- It assists code reviewers to create initial tests for the queries based on the expectations
Remember, this step is an iterating process. We might have to go back and forth to improve the documentation as we write the query.
Template #2: Format SQL query
1. Template Rules
There are many rules out there to format the SQL query. To keep it simple, these are the main rules I follow:
- Highlight reserved keywords (eg. SELECT, WHERE) using upper-case
- Clearly show where a query or subquery starts and ends using indentation
- For long and complicated query, include a comment before any major sub-queries or joins for context
- Give reference for where columns come from with their source table or descriptive table alias
2. Template in Action
This query is for demonstration only
3. Benefits
There are many benefits to readability with this query format. Here are some:
- Understand the overall structure of the query (ie. what columns are chosen, how many joins are there, which filters are applied)
- Save time to identify the start and end of subqueries for testing, because we can see the opening and closing parentheses of a subquery in the same vertical line
- Avoid getting lost in a complicated query with comments throughout the query
Combining Template #1 and Template #2
Combining these two templates in our example, we will have:
/*
CONTEXT:
- Our company wants to understand if COVID has any impact on sales in stores around Chicago
RESULT EXPECTATION
- This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01
ASSUMPTION:
- Dates before 2020-03-01 are considered "Before COVID"
- Each transaction has a unique id, so we do not expect duplications
in our transaction table
- There are some spam transactions we have identified after COVID,
so we will filter these out
*/
SELECT
store_info.id,
store_info.name AS store_name,
DATE_FORMAT(transactions.date, "%Y-%m") AS transaction_month,
SUM(transactions.total_amount) AS total_amount
FROM
transactions
LEFT JOIN
-- get all stores in Chicago
(
SELECT
id,
name
FROM
stores
WHERE
city = 'Chicago'
) AS store_info
ON
transactions.branch_id = store_info.id
WHERE
transactions.date >= '2019-03-01'
-- filter spam transactions
AND transactions.id NOT IN
(
SELECT
id
FROM
spam_transactions
)
GROUP BY
store_info.id,
store_info.name,
DATE_FORMAT(transactions.date, "%Y-%m")
BONUS Tip
These templates are not the only templates out there. Finding out what works best for you and your team is a trial and error process.
After you finalize a formatting style, manually styling every query can be tiring. Many SQL IDE has an option to automate this process for you. However, if it still doesn’t meet your need, there are tools out there. Try searching with these keywords: “SQL formatter”
For Python users, I’ve been playing around with sqlparse. You can get more information here.
Photo by Isaac Smith on Unsplash