Learn ELT/ETL with Real (Fake Data)
TL;DR
- Free BigQuery dataset mimicking companies, employees, products and orders.
- This dataset updates nightly with new records.
- Documentation is here.
- Example queries below.
So, you want to learn a new tool. Maybe Spark, dbt, or a SaaS ETL platform. The tutorials are there. So many beginner tutorials out there. Which one to choose. Your goal is learning how to extract, transform and load data, but there’s a catch. We don’t work with static data that never changes, we work with data that updates regularly. Why do all these tutorials provide static CSV files to work with?
I’ll tell you why. Because there aren’t a lot of good datasets out there that update regularly and mimic what we’d see on the job. Trust me. I searched and searched. Being lazy, I much prefer to use something that exists rather than making it myself. I found nada. Zilch. Noodle. So….
Shaking off that laziness, I created dbt-fake — a dbt project that generates fake data. Great for me. Now I quickly generate data for articles and videos, but what about those who aren’t using dbt? I’m already generating the data on BigQuery each night. Why not make it available without the dependency of knowing dbt? Why not indeed.
It turns out BigQuery makes this very easy. Click a few buttons, accept the security warnings, and bam! — public data. Now anyone can access fake data on companies, employees, products, and sales. Random data is added nightly, which makes this dataset useful for learning ELT, ETL, or just T if that’s your bag.
The dataset is small enough that it easily fits in the free tier of BigQuery for me. Though, you’ll want to monitor it closely if you are doing a lot of work. At the same time, it is big enough to be useful for learning. Data starts from 07/17/2020 and will always be updated to the current date-1.
How About Some Example Queries?
Employees per company
select
companies.name,
companies.slogan,
companies.purpose,
count(employees.id) as num_employees
from `leogodin217-dbt-tutorial`.fake_product_company_sources.companies_base as companies
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.employees_base as employees
on employees.company_id = companies.id
group by
companies.name,
companies.slogan,
companies.purpose
Orders by date
select
date,
count(*) as num_orders
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base
group by date
order by date desc
Sales by Date
select
orders.date,
sum(products.price * orders.num_items) as total_sales
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base as orders
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.products_base as products
on products.id = orders.product_id
group by orders.date
order by orders.date desc
Sales by Company
select
companies.name,
sum(orders.num_items * products.price) as total_sales
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base as orders
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.products_base as products
on products.id = orders.product_id
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.employees_base as employees
on employees.id = orders.employee_id
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.companies_base as companies
on companies.id = employees.company_id
group by companies.name