Learn DBT with Real (Fake) Data

Introducing DBT-Fake

Leo Godin
4 min read · Aug 27, 2023

Learning data technologies is hard enough when we already have quality datasets to work with. Unfortunately, that’s often not the case. That huge dataset at work doesn’t help you at home when you first install DBT and get through the initial Jaffleshop tutorial.

Edit: Running the dbt project might be too much overhead for some use cases, so we now run it nightly to generate fake data on BigQuery. The four tables are public, so anyone can query them. Get full table names here.

DBT DAG of employees, companies, products and orders

Jaffleshop is great. It is succinct and helps us learn the basics, but data changes over time and our practice datasets should do the same. To that end, we have dbt-fake, a DBT project that generates a history of fake data, with the ability to generate updates daily.

TL;DR

  • Dbt-fake allows you to generate a history of fake data and update it daily.
  • This type of dataset is best for practice, tutorials, videos and articles.
  • The output of dbt-fake can be used as sources in your practice DBT project.

Why dbt-fake?

Few data professionals work with completely static data. We track sales, accounts, marketing, and more, and these datasets update frequently. In fact, most of us probably spend our time creating or enabling daily reports. Changing data presents challenges static data cannot imitate, so we need our practice datasets to update over time.

With a dataset like this, we can easily mimic real-world problems. Instead of simply trying to understand what a dbt snapshot is, we can implement one from an appropriate dataset. We will find challenges we didn’t even know about. Furthermore, dbt-fake will add new types of data in different formats to provide tougher problems to solve.
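To make the snapshot idea concrete, here is a minimal Python sketch of the type-2 bookkeeping a snapshot performs when a record changes between runs. It is a simplification for illustration, not dbt's implementation and not part of dbt-fake. The function `apply_snapshot` and the company fields are hypothetical. The column names `dbt_valid_from` and `dbt_valid_to` mirror the ones dbt snapshots add, though dbt stores timestamps where this sketch uses dates.

```python
from datetime import date


def apply_snapshot(history, current_rows, as_of):
    """Update a type-2 history in place from the rows seen on `as_of`.

    Hypothetical helper: each row is {"id": ..., "attrs": {...}}.
    """
    # Index the currently open version of each record.
    open_rows = {r["id"]: r for r in history if r["dbt_valid_to"] is None}
    for row in current_rows:
        existing = open_rows.get(row["id"])
        if existing is not None and existing["attrs"] == row["attrs"]:
            continue  # nothing changed, keep the open version as-is
        if existing is not None:
            existing["dbt_valid_to"] = as_of  # close the old version
        history.append(
            {
                "id": row["id"],
                "attrs": dict(row["attrs"]),
                "dbt_valid_from": as_of,
                "dbt_valid_to": None,  # open-ended: this is the current version
            }
        )
    return history


# Three daily runs: the company is unchanged on day 2 and changes tier on day 3.
history = []
apply_snapshot(history, [{"id": 1, "attrs": {"name": "Acme", "tier": "basic"}}], date(2023, 8, 1))
apply_snapshot(history, [{"id": 1, "attrs": {"name": "Acme", "tier": "basic"}}], date(2023, 8, 2))
apply_snapshot(history, [{"id": 1, "attrs": {"name": "Acme", "tier": "pro"}}], date(2023, 8, 3))

for version in history:
    print(version["attrs"]["tier"], version["dbt_valid_from"], version["dbt_valid_to"])
```

With static data, the second and third runs would never add a row, which is why a dataset that changes daily is needed to practice this.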

Right now, we have a fairly standard model that includes a set of customer companies, their employees, products, and orders. Tracking sales and creating invoices is excellent beginner-level practice. Soon, we will add tiered pricing, data in key-value pairs, status with start and end date, random data-quality issues, and more. All of these add complexity and require new solutions.

Whether you are learning, teaching, or creating tutorials and videos, the datasets you can generate with dbt-fake will make it much easier.

How to Use Dbt-fake

First and foremost, dbt-fake is intended for learning intermediate and advanced topics. If you are starting from scratch, complete the free DBT tutorials from DBT Labs. They are short and excellent. This is how you learn the basics. Start with fundamentals, then materializations, then Jinja and macros. These will give you enough to start practicing on your own.

The intent of dbt-fake is to provide tables that can be added as sources to your own practice project. This is the preferred way to use the dataset. However, the free tier of DBT Cloud only allows a single project. In that case, you can build your practice models inside the dbt-fake project itself and declare its output models as sources. This is not good practice in DBT, but is acceptable for this use case.

Right now, dbt-fake only supports BigQuery. This is the easiest free cloud database I could find. This tutorial will get you started. While the tutorial states no credit card is needed, you will need to add one to your billing account, because dbt-fake uses incremental tables that require DML. BigQuery allows up to 1 TB of processing per month, so this should easily fit into the free tier with plenty of room for your own modeling.

In simple terms:

  1. Clone or download https://github.com/leogodin217/dbt-fake.
  2. Configure your BigQuery account in your DBT profile.
  3. Run dbt seed. This only needs to be run once.
  4. Create a history following the examples here.
  5. Follow one of the challenges or make up your own.

Challenges

  • Create a history of companies, employees and orders.
  • Add more orders for specific date ranges to show increased activity.
  • Configure this project to run in DBT Cloud, Airflow, GitHub Actions, or any orchestrator so the data updates daily.
  • Create a new project and add the tables created by this project as sources.
  • Use snapshots to create SCD type-2 dimensions for companies and customers.
  • Separate enterprise_orders_base into category and product dimensions.
  • Create a view that generates invoices for all orders.
  • Create a seed with company ids and discounts (maybe as percentages). Apply those discounts to invoices.
  • Imagine a customer called in to get special pricing for specific items. How would you override prices?
  • Delete a random company, but leave the employees. How would you handle that situation where employees have no company?
  • Use a different database for this project. It will fail. Try to fix it.
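
As a starting point for the invoice, discount, and deleted-company challenges, here is a rough sketch of the join logic using Python's built-in sqlite3 module, so it runs without a warehouse. Every table and column name below is a hypothetical stand-in, not the actual dbt-fake schema, and SQLite syntax will differ in places from BigQuery. In a real project the same queries would live in DBT models and a seed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified stand-ins for the dbt-fake outputs.
    create table companies (company_id integer, name text);
    create table employees (employee_id integer, company_id integer);
    create table products (product_id integer, price real);
    create table orders (order_id integer, company_id integer,
                         product_id integer, quantity integer);
    -- Plays the role of a seed with discounts as percentages.
    create table company_discounts (company_id integer, discount_pct real);

    insert into companies values (10, 'Acme'), (20, 'Globex');
    insert into employees values (1, 10), (2, 20);
    insert into products values (100, 50.0);
    insert into orders values (1, 10, 100, 2), (2, 20, 100, 1);
    insert into company_discounts values (10, 10.0);

    -- Left join so companies without a discount row still get invoiced.
    create view invoices as
    select
        o.order_id,
        o.company_id,
        o.quantity * p.price as gross_amount,
        coalesce(d.discount_pct, 0) as discount_pct,
        o.quantity * p.price
            * (1 - coalesce(d.discount_pct, 0) / 100.0) as net_amount
    from orders o
    join products p on p.product_id = o.product_id
    left join company_discounts d on d.company_id = o.company_id;
    """
)

invoices = conn.execute(
    "select order_id, gross_amount, net_amount from invoices order by order_id"
).fetchall()

# Delete a company but leave its employees, then find the orphans.
conn.execute("delete from companies where company_id = 20")
orphans = conn.execute(
    """
    select e.employee_id
    from employees e
    left join companies c on c.company_id = e.company_id
    where c.company_id is null
    order by e.employee_id
    """
).fetchall()

print(invoices)
print(orphans)
```

The left joins are the design choice worth noticing. An inner join to the discount table would silently drop invoices for companies without a discount, and an inner join to companies would hide the orphaned employees instead of surfacing them.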

Conclusion

Learning data skills requires appropriate datasets, and dbt-fake provides enough to learn intermediate and advanced DBT concepts. While the project is small, more functionality will be added soon. So stop using Jaffleshop and start using real (fake) data.

Leo Godin

I’m Leo and I love data! Recovering mansplainer, currently working as a lead data engineer at New Relic. BS in computer science and a MS in data