Quick dbt Patterns — Transform, Test, Publish
Here’s a conundrum for you. We write tests to prevent bad data from reaching production, but our dbt tests run after our models are created. How do we prevent bad data from reaching our downstream consumers? Traditional ETL/ELT platforms have a multitude of solutions, but many do not fit dbt. In this article we will look at a simple, well-known pattern to ensure bad data never reaches our consumers.
To better understand the solution, we need to define the problem. With dbt, data quality testing happens in this order:
- Dbt creates or updates a model.
- Dbt runs tests on the model after it is materialized.
- If a test fails, downstream consumers get bad data.
- We scramble to fix everything.
Do you see the timing issue? Tests only run after the model has been built, so we cannot prevent the failing data from being materialized. Fortunately, there are several solutions, ranging from dead simple to very complex. This article will stick with the dead simple.
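To make the timing concrete, in the typical two-step workflow the table already holds the new data before any test has a chance to fail. The model name here is just an example:
dbt run --select accounts    # the table is created or updated first
dbt test --select accounts   # tests only run afterwards and cannot undo the build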
Transform, Test, Publish
Once we consider the impact of testing data after it is materialized, a simple solution is quite intuitive. Add a view between downstream consumers and the model being tested. The steps are as follows.
- Create a view between the tested model and downstream consumers.
- Give the view a where clause based on a date or timestamp.
- Each time dbt builds the view, render the current date or timestamp into that where clause.
- If tests fail, the dependent view will not be rebuilt, so it retains the last known good date or timestamp.
Here is an example. Consider the following model, which simply appends new rows to a table.
models/final/accounts.sql
{# No unique key means the table will simply append whatever the query produces #}
{# This allows us to easily create duplicates from the command line #}
{{
    config(
        materialized = 'incremental',
    )
}}

select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    current_date() as updated_at
from {{ ref('stg_companies') }}
where updated_at = current_date()
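As the comments above suggest, one simple way to fabricate bad data for this demo is to run the model twice on the same day; with no unique key, the second run appends today's rows again. The selector is just this model's name:
dbt run --select accounts   # first run appends today's rows
dbt run --select accounts   # second run appends the same rows again, creating duplicates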
Next, we add some tests to the model.
models/final/accounts.yml
version: 2
models:
  - name: accounts
    description: Append-only dimension of companies signed up to order our products
    config:
      tags: ['accounts-team']
    meta:
      owner: '@Leo_Godin'
      tags: ['#accounts-team']
    columns:
      - name: account_id
        description: Internal ID of the company
        tests:
          - duplicate_accounts:
              name: Duplicate Accounts Found
              config:
                severity: error
                meta:
                  description: Duplicate accounts were found
          - not_null:
              name: Null account id found
              config:
                severity: error
                meta:
                  description: Null account ids were found
      - name: company_name
        description: Legal name of the company
        tests:
          - not_null
      - name: company_slogan
        description: Slogan of the company
      - name: company_mission
        description: Mission statement of the company
      - name: created_at
        description: Timestamp the record was inserted
      - name: updated_at
        description: Timestamp the record was updated
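Note that duplicate_accounts is not a built-in dbt test, and its definition is not shown in this article. A minimal sketch, assuming it simply flags repeated values in the tested column, could live in a generic test file like the following (the file path and logic are assumptions):
tests/generic/duplicate_accounts.sql
{% test duplicate_accounts(model, column_name) %}

select
    {{ column_name }},
    count(*) as occurrences
from {{ model }}
group by {{ column_name }}
having count(*) > 1

{% endtest %}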
We can then create a simple view that only reads data up to the last time the table was built and all of its tests passed. If the tests pass, dbt rebuilds the view and renders the current date into its filter. If they fail, dbt skips the view, so its definition keeps the last known good date and the newly appended bad rows stay hidden.
models/reporting/accounts.sql
{{
    config(
        materialized = 'view',
    )
}}

select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    updated_at
from {{ ref('accounts') }}
{# run_started_at is rendered when dbt builds the view, so the filter is stored as a literal date, #}
{# not as a function evaluated at query time. That is what preserves the last known good date. #}
where updated_at <= '{{ run_started_at.strftime("%Y-%m-%d") }}'
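Because the Jinja renders at build time, the warehouse stores a literal date in the view definition rather than a function call. After a successful run, the stored definition would look roughly like this (the schema name and date are illustrative, not from the article):
-- Approximate view definition stored in the warehouse after a successful run
select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    updated_at
from analytics.accounts
where updated_at <= '2024-05-01'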
A few things to consider
- This pattern requires upstream test severity to be error. A severity of warn will not stop the view from being rebuilt.
- It also assumes tests run between the model and the view, as they do with dbt build, which skips downstream nodes when an error-severity test fails. With separate dbt run and dbt test invocations, the view is rebuilt before the tests ever execute.
- The above example is very simple. In truth, we rarely have such simple models, so the date/time logic may need to be enhanced; one possible variant is sketched after this list.
- This pattern depends on a reporting layer for downstream consumption. If users query the table directly instead of the view, they will still see bad data.
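As one example of enhancing the date logic, the view's filter could read from a project variable so an operator can override the publish-through date at run time. The variable name publish_through_date is a hypothetical choice, not something from the article:
{# Hypothetical variant: allow an override via --vars, otherwise publish through the run start date #}
where updated_at <= '{{ var("publish_through_date", run_started_at.strftime("%Y-%m-%d")) }}'
An operator could then pin what consumers see with something like dbt build --vars '{publish_through_date: 2024-04-30}'.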
Need fake data that updates daily for practicing dbt? Check this out.