Learn ELT/ETL with Real (Fake Data)

Fake Company Data Public on BigQuery

Leo Godin
3 min readOct 6, 2023

TL;DR

  • Free BigQuery dataset mimicking companies, employees, products and orders.
  • This dataset updates nightly with new records.
  • Documentation is here.
  • Example queries below.
Image by author

So, you want to learn a new tool. Maybe Spark, dbt, or a SaaS ETL platform. The tutorials are there. So many beginner tutorials out there. Which one to choose. Your goal is learning how to extract, transform and load data, but there’s a catch. We don’t work with static data that never changes, we work with data that updates regularly. Why do all these tutorials provide static CSV files to work with?

I’ll tell you why. Because there aren’t a lot of good datasets out there that update regularly and mimic what we’d see on the job. Trust me. I searched and searched. Being lazy, I much prefer to use something that exists rather than making it myself. I found nada. Zilch. Noodle. So….

Shaking off that laziness, I created dbt-fake — a dbt project that generates fake data. Great for me. Now I quickly generate data for articles and videos, but what about those who aren’t using dbt? I’m already generating the data on BigQuery each night. Why not make it available without the dependency of knowing dbt? Why not indeed.

It turns out BigQuery makes this very easy. Click a few buttons, accept the security warnings, and bam! — public data. Now anyone can access fake data on companies, employees, products, and sales. Random data is added nightly, which makes this dataset useful for learning ELT, ETL, or just T if that’s your bag.

The dataset is small enough that it easily fits in the free tier of BigQuery for me. Though, you’ll want to monitor it closely if you are doing a lot of work. At the same time, it is big enough to be useful for learning. Data starts from 07/17/2020 and will always be updated to the current date-1.

How About Some Example Queries?

Employees per company

select 
companies.name,
companies.slogan,
companies.purpose,
count(employees.id) as num_employees
from `leogodin217-dbt-tutorial`.fake_product_company_sources.companies_base as companies
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.employees_base as employees
on employees.company_id = companies.id
group by
companies.name,
companies.slogan,
companies.purpose
Image by author

Orders by date

select 
date,
count(*) as num_orders
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base
group by date
order by date desc
Image by author

Sales by Date

select 
orders.date,
sum(products.price * orders.num_items) as total_sales
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base as orders
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.products_base as products
on products.id = orders.product_id
group by orders.date
order by orders.date desc
Image by author

Sales by Company

select 
companies.name,
sum(orders.num_items * products.price) as total_sales
from `leogodin217-dbt-tutorial`.fake_product_company_sources.enterprise_orders_base as orders
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.products_base as products
on products.id = orders.product_id
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.employees_base as employees
on employees.id = orders.employee_id
left join `leogodin217-dbt-tutorial`.fake_product_company_sources.companies_base as companies
on companies.id = employees.company_id
group by companies.name
image by author

--

--

Leo Godin

I’m Leo and I love data! Recovering mansplainer, currently working as a lead data engineer at New Relic. BS in computer science and a MS in data