Deploy your first automated Analytics Engineering pipeline with Fivetran, Snowflake and dbt Core
As part of becoming a better analyst and a beginning data engineer, I took part in a dbt Core deployment project under the guidance of the one and only Dmitry Anoshin.
The result of your work will be a dashboard presenting the basic performance numbers of an ecommerce store:
The data project infrastructure you will build looks like this:
Prerequisites - all the links can be found in my GitHub project:
Knowing how to set up Python virtual environments
Python 3.11 or lower
The dbt Fundamentals course
A Snowflake trial account
A Fivetran account
The Brazilian ecommerce (Olist) dataset
A basic understanding of how GitHub Actions work
By the end you will have learned how to create a database, stream data into it, configure it, build the necessary models and then automate the model runs!
Full Github project is available here:
https://github.com/KirillJora/dbt-stream-project
Let’s get cracking!
Snowflake configuration
The code below creates virtual warehouses (the compute) and databases for us to upload and transform the data:
-------------------------------------------
-- dbt credentials
-------------------------------------------
USE ROLE securityadmin;
-- dbt roles
CREATE OR REPLACE ROLE dbt_dev_role;
CREATE OR REPLACE ROLE dbt_prod_role;
------------------------------------------- Please replace with your dbt user password
CREATE OR REPLACE USER dbt_user PASSWORD = '<mysecretpassword>';
GRANT ROLE dbt_dev_role,dbt_prod_role TO USER dbt_user;
GRANT ROLE dbt_dev_role,dbt_prod_role TO ROLE sysadmin;
-------------------------------------------
-- dbt objects
-------------------------------------------
USE ROLE sysadmin;
CREATE OR REPLACE WAREHOUSE dbt_dev_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 1 INITIALLY_SUSPENDED = TRUE;
CREATE OR REPLACE WAREHOUSE dbt_dev_heavy_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 1 INITIALLY_SUSPENDED = TRUE;
CREATE OR REPLACE WAREHOUSE dbt_prod_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 1 INITIALLY_SUSPENDED = TRUE;
CREATE OR REPLACE WAREHOUSE dbt_prod_heavy_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 1 INITIALLY_SUSPENDED = TRUE;
GRANT ALL ON WAREHOUSE dbt_dev_wh TO ROLE dbt_dev_role;
GRANT ALL ON WAREHOUSE dbt_dev_heavy_wh TO ROLE dbt_dev_role;
GRANT ALL ON WAREHOUSE dbt_prod_wh TO ROLE dbt_prod_role;
GRANT ALL ON WAREHOUSE dbt_prod_heavy_wh TO ROLE dbt_prod_role;
CREATE OR REPLACE DATABASE dbt_hol_dev;
CREATE OR REPLACE DATABASE dbt_hol_prod;
GRANT ALL ON DATABASE dbt_hol_dev TO ROLE dbt_dev_role;
GRANT ALL ON DATABASE dbt_hol_prod TO ROLE dbt_prod_role;
GRANT ALL ON ALL SCHEMAS IN DATABASE dbt_hol_dev TO ROLE dbt_dev_role;
GRANT ALL ON ALL SCHEMAS IN DATABASE dbt_hol_prod TO ROLE dbt_prod_role;
As a result, we get two empty databases (prod and dev), two pairs of warehouses (for prod and dev workloads), a pair of roles, and one user.
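If you want a quick sanity check before moving on, you can switch to the new role and confirm that the objects exist and the grants are in place. This is optional and not part of the setup script above:
-- Optional sanity check (not part of the setup script above):
-- switch to the dev role and confirm the warehouse and database are usable.
USE ROLE dbt_dev_role;
USE WAREHOUSE dbt_dev_wh;
USE DATABASE dbt_hol_dev;
-- List everything granted to the dev role.
SHOW GRANTS TO ROLE dbt_dev_role;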
dbt Core configuration
Important - create a virtual environment with Python 3.11 or lower, and install dbt-snowflake with the following command:
python -m pip install dbt-snowflake
Then you need to create a new dbt project in any local folder by running the following commands:
$ dbt init dbt_hol
$ cd dbt_hol
Open ~/.dbt/profiles.yml and add the following section. I also suggest placing a copy of the profiles.yml config file in the git repository, since the GitHub Actions workflow below copies it from there (just avoid committing the plain-text password; the workflow passes credentials in via GitHub secrets).
dbt_hol:
  target: dev
  outputs:
    dev:
      type: snowflake
      ######## Please replace with your Snowflake account name
      account: <your_snowflake_trial_account>
      user: dbt_user
      ######## Please replace with your Snowflake dbt user password
      password: <mysecretpassword>
      role: dbt_dev_role
      database: dbt_hol_dev
      warehouse: dbt_dev_wh
      schema: public
      threads: 200
    prod:
      type: snowflake
      ######## Please replace with your Snowflake account name
      account: <your_snowflake_trial_account>
      user: dbt_user
      ######## Please replace with your Snowflake dbt user password
      password: <mysecretpassword>
      role: dbt_prod_role
      database: dbt_hol_prod
      warehouse: dbt_prod_wh
      schema: public
      threads: 200
Open dbt_project.yml and update the following sections:
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'dbt_hol'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_hol'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this config, we tell dbt to build all models in the mart/, staging/ and
# transform/ directories as views. These settings can be overridden in the
# individual model files using the `{{ config(...) }}` macro.
models:
  dbt_hol:
    mart:
      schema: mart
      materialized: view
    staging:
      schema: staging
      materialized: view
    transform:
      schema: transform
      materialized: view
Validate the configuration by running the following command in the dbt_hol folder:
$ dbt debug
Next, open the project folder and install the dbt_utils package by adding the code below to the packages.yml file:
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.0
To install the package, run the following command:
dbt deps
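With dbt_utils installed, its macros become available inside any model. As a purely hypothetical illustration (this snippet is not part of the project's code, and it relies on the source declared in the staging section below), you could generate a surrogate key for order items like this:
-- Hypothetical illustration of a dbt_utils macro; not part of the project's models.
-- Assumes the fivetran_database source declared in the staging section below.
select
    {{ dbt_utils.generate_surrogate_key(['order_id', 'order_item_id']) }} as order_item_key,
    *
from {{ source('fivetran_database', 'OLIST_ORDER_ITEMS_DATASET') }}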
Modelling:
The final architecture will look like this:
Start by creating the model folders, one per layer:
mkdir models/mart
mkdir models/staging
mkdir models/transform
Staging
Declare the source tables by creating a sources.yml file that points to the Snowflake database and tables loaded by Fivetran:
version: 2
sources:
  - name: fivetran_database
    database: FIVETRAN_DATABASE
    schema: RAW_DATA
    tables:
      - name: OLIST_ORDER_PAYMENTS_DATASET
      - name: OLIST_ORDERS_DATASET
      - name: OLIST_ORDER_ITEMS_DATASET
      - name: PRODUCT_CATEGORY_NAME_TRANSLATION
      - name: OLIST_PRODUCTS_DATASET
      - name: OLIST_CUSTOMERS_DATASET
Then create the staging models as SQL files prefixed with stg_:
stg_customers.sql
stg_order_by_product.sql
stg_orders_revenue.sql
stg_orders.sql
stg_products.sql
The code for all of these files is available in the GitHub repo.
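For orientation, here is a minimal sketch of what a staging model such as stg_orders.sql might look like: it simply selects columns from the declared source. Treat it as illustrative only; the actual files in the repo may differ.
-- Minimal illustrative sketch of a staging model (the repo version may differ).
-- Column names follow the Olist orders dataset.
select
    order_id,
    customer_id,
    order_status,
    order_purchase_timestamp,
    order_delivered_customer_date
from {{ source('fivetran_database', 'OLIST_ORDERS_DATASET') }}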
Transform
Same approach as in staging, but with three files (a rough sketch follows the list):
tr_prod_report.sql
tr_refunds.sql
tr_revenue_reports.sql
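As a rough, hypothetical sketch (the real tr_revenue_reports.sql in the repo may be structured differently), a transform model typically joins and aggregates the staging views, for example daily revenue:
-- Hypothetical sketch of a transform-layer model; the repo version may differ.
-- Assumes the staging models stg_orders and stg_orders_revenue exist with these columns.
select
    date_trunc('day', o.order_purchase_timestamp) as order_date,
    sum(r.payment_value) as revenue
from {{ ref('stg_orders') }} o
join {{ ref('stg_orders_revenue') }} r
    on o.order_id = r.order_id
group by 1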
Mart
This is the most important layer, as it feeds the dashboard. It contains two files:
mart_product_report.sql
mart_revenue_report.sql
Again, feel free to copy the SQL from the GitHub repo.
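Mart models are often thin selects on top of the transform layer, exposing only what the dashboard needs. A hypothetical sketch of mart_revenue_report.sql (the repo version may expose more columns or add final business logic):
-- Hypothetical sketch of a mart-layer model; the repo version may differ.
-- Assumes tr_revenue_reports exposes order_date and revenue as sketched above.
select
    order_date,
    revenue
from {{ ref('tr_revenue_reports') }}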
Then run everything against the dev target first, before pushing to prod. This is done with the following command:
dbt run
And once you’ve debugged it, you can deploy to production with:
dbt run --target=prod
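Once the prod run completes, you can query the resulting views directly in Snowflake to confirm everything landed. Note that with dbt's default schema naming, a custom schema like mart is appended to the target schema from profiles.yml, so the views typically end up in public_mart; adjust the query if your project overrides generate_schema_name.
-- Quick check of the production output; the schema name assumes dbt's default
-- custom-schema behaviour (target schema + "_" + custom schema).
select *
from dbt_hol_prod.public_mart.mart_revenue_report
limit 10;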
Automate via GitHub Actions:
First, create a requirements.txt file in the root of the project and put the following text in it:
dbt-core
dbt-snowflake
In the .github folder, create a workflows folder, and inside it create a file called dbt_prod.yml. Add the following code to the file:
name: Daily dbt production run
on:
  push:
    branches: [ main ]
env:
  DBT_PROFILE_TARGET: prod
  DBT_PROFILE_USER: ${{ secrets.DBT_PROFILE_USER }}
  DBT_PROFILE_PASSWORD: ${{ secrets.DBT_PROFILE_PASSWORD }}
jobs:
  dbt_run:
    name: Scheduled dbt refresh and test
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Python environment
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt
        # working-directory: ./
      - name: Prepare DBT profiles directory
        run: |
          mkdir -p ~/.dbt
          cp profiles.yml ~/.dbt/profiles.yml
        working-directory: ./
      - name: Install dbt packages
        run: dbt deps
        # optionally use this parameter
        # to set a main directory of dbt project:
        working-directory: ./dbt/dbt_hol
      - name: Run the models
        run: dbt run --profile dbt_core_snowflake_github --target prod
        working-directory: ./dbt/dbt_hol
      - name: Run tests
        run: dbt test --profile dbt_core_snowflake_github --target prod
        working-directory: ./dbt/dbt_hol
# Alternative: run dbt inside Docker instead of the steps above
# name: Run DBT in Docker
# on: [push]
# jobs:
#   build-and-run-dbt:
#     runs-on: ubuntu-latest
#     steps:
#       - name: Checkout repository
#         uses: actions/checkout@v3
#       - name: Build Docker image
#         run: docker build -t dbt-docker .
#       - name: Copy DBT profiles file
#         run: cp profiles_github.yml ~/.dbt/profiles.yml
#       - name: Run DBT Project
#         run: |
#           docker run \
#             -e DBT_PROFILES_DIR=/dbt \
#             -e SNOWFLAKE_USER=${{ secrets.DBT_PROFILE_USER }} \
#             -e SNOWFLAKE_PASSWORD=${{ secrets.DBT_PROFILE_PASSWORD }} \
#             dbt-docker dbt run --profile dbt_core_snowflake_github --target prod
#       - name: Test DBT Project
#         run: |
#           docker run \
#             -e DBT_PROFILES_DIR=/dbt \
#             -e SNOWFLAKE_USER=${{ secrets.DBT_PROFILE_USER }} \
#             -e SNOWFLAKE_PASSWORD=${{ secrets.DBT_PROFILE_PASSWORD }} \
#             dbt-docker dbt test --profile dbt_core_snowflake_github --target prod
Please remember to add the DBT_PROFILE_USER and DBT_PROFILE_PASSWORD secrets with the dbt user credentials in your GitHub repository settings! Also note that the --profile name used in the workflow must match the profile name defined in the profiles.yml committed to the repo.
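If you want to confirm that the automated run actually reached Snowflake, you can inspect the recent query history for dbt_user (an optional check, assuming your role can access the database's information schema table function):
-- Optional check that the GitHub Actions run executed queries in Snowflake.
select query_text, warehouse_name, start_time
from table(information_schema.query_history_by_user(user_name => 'DBT_USER'))
order by start_time desc
limit 20;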
That’s it! Now every time there is a push to the main branch, the models will run automatically together with all specified tests. Congratulations on finishing: you should now have a clean dataset ready for visualisation!
P.S. I’d like to share the credit with the Snowflake team, who built the amazing tutorial that served as the basis for this project. I wouldn’t have been able to achieve the above result in the given time without it.