Improve your SQL with these templates for formatting and documentation

Reduce the pain of code review and knowledge transfer with better SQL format and documentation.

Introduction

Have you ever come across a complicated SQL query for analysis?

Did you struggle to understand the code itself and the business logic underneath?

I did. And sometimes, it was with my queries in the past!

To save time for everyone who read my code (including myself), I have tried to apply two templates to my query and found that they are very helpful to:

  • Increase the quality of the code itself
  • Reduce code review time
  • And improve knowledge transfer

What you can expect

In this article, I will share with you these two templates that I use.

Template #1: Document context and assumptions in your SQL query

Template #2: Format SQL query

To demonstrate their usage, I will go through an example using MySQL to summarize sales before and after COVID-19.

I hope these templates come in handy! This is especially when remote work is our new normal after COVID-19 started, increasing the importance of over-communication to make sure everyone is on the same page.

Template #1: Document context and assumptions in SQL query

1. Template

List this information before you write a query:

  1. Important business context for this query
  2. Expectations for query result
  3. Any assumptions made for the business logic and data
/* 
CONTEXT: 
- add a brief description of why we need this queryRESULT 

EXPECTATION
- add a brief description of your expectations for the query result

ASSUMPTION:
- add assumption about business logic- add assumption about data
*/

2. Template in Action

Apply this template to our sales summary example:

/*  
CONTEXT:  
- Our company wants to understand if COVID has any impact on sales in stores around Chicago

RESULT EXPECTATION: 
- This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01

ASSUMPTION:
- Dates before 2020-03-01 are considered "Before COVID"- Each transaction has a unique id, so we do not expect duplications in our transaction table
- There are some spam transactions we have identified after COVID, so we will filter these out
*/

3. Benefits

Having a brief description with business context, result expectation, and assumptions before our queries has many benefits:

  1. It keeps us focused on the main goal of the query while writing it
  2. It helps readers quickly establish a high-level understanding of the value and expectations of our queries
  3. It assists code reviewers to create initial tests for the queries based on the expectations

Remember, this step is an iterating process. We might have to go back and forth to improve the documentation as we write the query.

Template #2: Format SQL query

1. Template Rules

There are many rules out there to format the SQL query. To keep it simple, these are the main rules I follow:

  1. Highlight reserved keywords (eg. SELECT, WHERE) using upper-case
  2. Clearly show where a query or subquery starts and ends using indentation
  3. For long and complicated query, include a comment before any major sub-queries or joins for context
  4. Give reference for where columns come from with their source table or descriptive table alias

2. Template in Action

This query is for demonstration only

3. Benefits

There are many benefits to readability with this query format. Here are some:

  • Understand the overall structure of the query (ie. what columns are chosen, how many joins are there, which filters are applied)
  • Save time to identify the start and end of subqueries for testing, because we can see the opening and closing parentheses of a subquery in the same vertical line
  • Avoid getting lost in a complicated query with comments throughout the query

Combining Template #1 and Template #2

Combining these two templates in our example, we will have:

/* 
CONTEXT: 
- Our company wants to understand if COVID has any impact on sales in stores around Chicago

RESULT EXPECTATION
- This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01

ASSUMPTION:
- Dates before 2020-03-01 are considered "Before COVID"
- Each transaction has a unique id, so we do not expect duplications
in our transaction table
- There are some spam transactions we have identified after COVID,
so we will filter these out
*/

SELECT 
  store_info.id,
  store_info.name AS store_name,
  DATE_FORMAT(transactions.date, "%Y-%m") AS transaction_month,
  SUM(transactions.total_amount) AS total_amount
FROM 
  transactions
LEFT JOIN 
  -- get all stores in Chicago
  (
    SELECT 
      id,
      name
    FROM 
      stores
    WHERE
      city = 'Chicago'
  ) AS store_info
ON 
  transactions.branch_id = store_info.id
WHERE
  transactions.date >= '2019-03-01'
  -- filter spam transactions
  AND transactions.id NOT IN 
    (
      SELECT 
        id 
      FROM 
        spam_transactions
    )
GROUP BY 
  store_info.id,
  store_info.name,
  DATE_FORMAT(transactions.date, "%Y-%m")

BONUS Tip

These templates are not the only templates out there. Finding out what works best for you and your team is a trial and error process.

After you finalize a formatting style, manually styling every query can be tiring. Many SQL IDE has an option to automate this process for you. However, if it still doesn’t meet your need, there are tools out there. Try searching with these keywords: “SQL formatter”

For Python users, I’ve been playing around with sqlparse. You can get more information here.

Photo by Isaac Smith on Unsplash