Quick dbt Patterns — Transform, Test, Publish

Stop Sending Bad Data to Production

Leo Godin
Sep 23, 2023

Here’s a conundrum for you. We write tests to prevent bad data from reaching production, but dbt tests run only after our models are built. So how do we prevent bad data from reaching our downstream consumers? Traditional ETL/ELT platforms have a multitude of solutions, but many of them do not fit dbt. In this article, we will look at a simple, well-known pattern that ensures bad data never reaches our consumers.

[Figure: dbt runs a model, then tests it, allowing bad data to reach production. DQ tests fail too late (image by author)]

To better understand the solution, we need to define the problem. Data quality tests work in this order:

  1. dbt creates or updates a model.
  2. dbt runs tests on the model after it is materialized.
  3. If a test fails, downstream consumers get bad data.
  4. We scramble to fix everything.

Do you see the timing issue? Tests only run after the model has run, whether you invoke dbt run followed by dbt test or a single dbt build, so we cannot prevent the failing data from being materialized. Fortunately, there are several solutions ranging from dead simple to very complex. This article will stick with the dead simple.

Transform, Test, Publish

[Figure: run the model, then test it; if tests pass, rebuild the view and update its where-clause date; if tests fail, do not update the view. Transform, test, publish (image by author)]

Once we consider the impact of testing data after it is materialized, a simple solution is quite intuitive. Add a view between downstream consumers and the model being tested. The steps are as follows.

  1. Create a view between the tested model and downstream consumers.
  2. Give the view a where clause based on a date or timestamp.
  3. Whenever the view runs, update the date or timestamp.
  4. If tests fail, the dependent view will not run, thus retaining the last known good date or timestamp. (This is exactly how dbt build behaves: when a test with error severity fails, all downstream nodes are skipped.)

Here is an example. Consider the following model, which simply appends new rows to a table.

models/final/accounts.sql

{# No unique key means the table will simply append whatever the query produces #}
{# This allows us to easily create duplicates from the command line #}

{{
  config(
    materialized = 'incremental',
  )
}}

select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    current_date() as updated_at
from {{ ref('stg_companies') }}
where updated_at = current_date()
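
Because the materialization is incremental with no unique_key, every run after the first appends the query’s results to the existing table. Simplifying away the intermediate temp relation dbt actually builds, an incremental run boils down to something like the following, where the analytics schema name is purely illustrative.

-- Hypothetical compiled SQL for an incremental run (schema name is illustrative)
insert into analytics.accounts
select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    current_date() as updated_at
from analytics.stg_companies
where updated_at = current_date()

Running the model twice on the same day inserts the same rows twice, which is the easy command-line route to duplicates mentioned in the comments.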

Next, we add some tests.

models/final/accounts.yml

version: 2

models:
  - name: accounts
    description: Append-only dimension of companies signed up to order our products
    config:
      tags: ['accounts-team']
    meta:
      owner: '@Leo_Godin'
      tags: ['#accounts-team']
    columns:
      - name: account_id
        description: Internal ID of the company
        tests:
          - duplicate_accounts:
              name: Duplicate Accounts Found
              config:
                severity: error
                meta:
                  description: Duplicate accounts were found
          - not_null:
              name: Null account id found
              config:
                severity: error
                meta:
                  description: Null account ids were found
      - name: company_name
        description: Legal name of the company
        tests:
          - not_null
      - name: company_slogan
        description: Slogan of the company
      - name: company_mission
        description: Mission statement of the company
      - name: updated_at
        description: Timestamp the record was updated
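
Note that duplicate_accounts is a custom generic test whose definition is not shown above. A minimal sketch, assuming it lives in tests/generic/duplicate_accounts.sql, could look like this:

{% test duplicate_accounts(model, column_name) %}

-- Fail if any value of the tested column appears more than once
select
    {{ column_name }}
from {{ model }}
group by {{ column_name }}
having count(*) > 1

{% endtest %}

The test fails whenever any account_id appears more than once, which is exactly what our append-only model produces if it runs twice in one day.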

We can then create a simple view that only reads data through the last run in which all tests passed. If tests pass, the view is rebuilt and the run date is baked into its filter as a literal. If tests fail, the view is not rebuilt and keeps filtering at the last known good date. One caveat: dbt requires unique model names, so the reporting view cannot also be called accounts; here it is named rpt_accounts.

models/reporting/rpt_accounts.sql

{{
  config(
    materialized = 'view',
  )
}}

select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    updated_at
from {{ ref('accounts') }}
-- run_started_at renders as a literal date when the view is (re)built,
-- so a failed upstream test leaves the previous date in place
where updated_at <= '{{ run_started_at.strftime("%Y-%m-%d") }}'
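
To see why this freezes the last known good date, look at what the warehouse actually stores. After a successful run, the compiled view DDL contains the run date as a literal (the schema and date below are illustrative):

create or replace view analytics.rpt_accounts as
select
    account_id,
    company_name,
    company_slogan,
    company_mission,
    updated_at
from analytics.accounts
where updated_at <= '2023-09-22'

If the next day’s run appends bad rows and the tests fail, the view is never recreated, so consumers keep querying with the 2023-09-22 filter and never see the newly appended rows.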

A few things to consider

  • This pattern requires upstream test severity to be error. A severity of warn will not stop the view from running.
  • The above example is very simple. In truth, we rarely have such simple models, so the date/time logic may need to be enhanced.
  • This pattern depends on a reporting layer for downstream consumption. If users are querying the table instead of the view, they will see bad data. (The grants sketch below shows one way to enforce this.)
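
One way to enforce the reporting layer is to grant warehouse access only on the view. dbt’s grants config can do this from the model itself; the role name below is purely illustrative.

{{
  config(
    materialized = 'view',
    grants = {'select': ['reporting_role']},
  )
}}

With select granted only on the view, and never granted on the underlying table, consumers cannot accidentally query unpublished data.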

Need fake data that updates daily for practicing dbt? Check this out.


Leo Godin

I’m Leo and I love data! Recovering mansplainer, currently working as a lead data engineer at New Relic. BS in computer science and an MS in data