Introduction

After introducing dbt and setting up a basic project in the previous two posts, in this post I will explain and practice some of the dbt components you need to know.

Components

dbt_project.yml

This file must sit in the project's root directory. It holds the project's configuration.

First, name and profile identify the project and the connection profile it uses. version and config-version are conventionally set as well, although recent dbt releases treat them as optional. Since a dbt project consists mainly of models, as explained in the previous part, model-paths matters too: it is the relative path to the folder that contains the models, and it defaults to models when omitted.

By default, dbt writes each transformed model to a VIEW in the warehouse. The +materialized: table setting makes it write to a TABLE instead. Other materializations, such as MATERIALIZED VIEW, are also available, and the materialization can be configured separately for each sub-directory of the models folder.

name: 'dbt_lab'
version: '1.1.0'
config-version: 2
profile: "{{ env_var('DBT_PROFILE', 'tuandz') }}"
model-paths: ["models"]
test-paths: ["tests"]
macro-paths: ["macros"]
models:
  dbt_lab:
    +materialized: table
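The profile field in the configuration above reads an environment variable with a fallback value. The lookup rule can be sketched in Python as below. This only mimics the behavior (use the variable if set, otherwise the default, otherwise fail); it is not dbt's own code, and the environ parameter is an assumption added to make the function easy to try out.

```python
import os


def env_var(name, default=None, environ=os.environ):
    """Return the variable if it is set, else the default, else raise."""
    if name in environ:
        return environ[name]
    if default is not None:
        return default
    raise KeyError(f"Env var required but not provided: '{name}'")


# With DBT_PROFILE unset, the fallback profile name is used.
profile = env_var("DBT_PROFILE", "tuandz", environ={})
```

Passing a dictionary such as {"DBT_PROFILE": "prod"} as environ returns "prod" instead, which is how the same project can switch profiles without editing dbt_project.yml.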

profiles.yml

This file holds the dbt profiles. Each profile contains what dbt needs to connect to a data platform: host, port, user, password and so on. Why more than one connection? Just like dev, test and prod environments, each one points to its own warehouse. Inside a profile these connections are listed under outputs, and target names the one used by default. The project picks its profile through the profile field in dbt_project.yml.

tuandz:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      connect_timeout: 10 # default 10 seconds
      retries: 3
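To make the relationship between a profile, its outputs and the active target concrete, here is a small Python sketch. It is an illustration rather than dbt's implementation: the prod output and the resolve_target helper are hypothetical, and the profile is written as a plain dictionary instead of being loaded from profiles.yml.

```python
profiles = {
    "tuandz": {
        "target": "dev",  # default target
        "outputs": {
            "dev": {"type": "postgres", "host": "localhost", "schema": "public"},
            # Hypothetical second output, not part of the original profile.
            "prod": {"type": "postgres", "host": "warehouse.internal", "schema": "analytics"},
        },
    }
}


def resolve_target(profiles, profile_name, target_override=None):
    """Pick the connection settings for the override, or the profile's default target."""
    profile = profiles[profile_name]
    target_name = target_override or profile["target"]
    return target_name, profile["outputs"][target_name]


default_name, default_conn = resolve_target(profiles, "tuandz")
prod_name, prod_conn = resolve_target(profiles, "tuandz", target_override="prod")
```

On the command line, the override corresponds to passing --target to a dbt command.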

Models

As covered in part 2, a model is a SQL file placed under the model-paths directory configured in dbt_project.yml. Models are the core of every dbt project.

ref & source

ref and source replace hard-coded table names of the form <database name>.<schema name>.<table name>.

ref: stands in for the target table of one of the existing models
- example: with two models model1.sql and model2.sql, ref('model1') or ref('model2') resolves to the target table of the corresponding model
source: stands in for a table declared in the files under models/source/

ref and source are not mandatory, but they are recommended because they let dbt build the dependency graph of the models. In other words, dbt can work out which tables each model X takes as input.
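The dependency idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: dbt discovers ref calls by rendering the Jinja, whereas the sketch uses a regular expression, and the three model bodies and the raw source name are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model files: model name -> raw SQL.
models = {
    "model1": "select * from {{ source('raw', 'orders') }}",
    "model2": "select * from {{ ref('model1') }}",
    "model3": "select * from {{ ref('model1') }} join {{ ref('model2') }} using (id)",
}

REF_PATTERN = re.compile(r"ref\(\s*['\"](\w+)['\"]\s*\)")


def find_refs(sql):
    """Return the sorted, de-duplicated model names referenced through ref()."""
    return sorted(set(REF_PATTERN.findall(sql)))


# Map each model to the models it depends on, then order them so that
# every model comes after its inputs.
graph = {name: find_refs(sql) for name, sql in models.items()}
run_order = list(TopologicalSorter(graph).static_order())
```

A model that used a hard-coded table name instead of ref would simply not appear as an edge in graph, which is why dbt could not order it correctly.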

Configuring variables in a model

You can define a variable inside a model and use it throughout that model, which is handy when the same value is needed in several places.

Example:

{% set time_gap = dbt_metadata_envs.get('TIME_GAP') if 'TIME_GAP' in dbt_metadata_envs else "interval '90' day" %}
SELECT *
FROM {{ ref('model1') }}
WHERE created_at >= current_date - {{ time_gap }}

dbt_metadata_envs is a key-value map built from environment variables, similar to Python's os.environ, but it only contains variables whose names start with DBT_ENV_CUSTOM_ENV_, and that prefix is stripped from the keys. Setting DBT_ENV_CUSTOM_ENV_TIME_GAP therefore makes TIME_GAP available in the example above. In dbt Cloud, environment variable names must start with DBT_, DBT_ENV_SECRET_ or DBT_ENV_CUSTOM_ENV_. Read more at https://docs.getdbt.com/docs/build/environment-variables
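How dbt_metadata_envs relates to the real environment can be illustrated with a short Python sketch. It only imitates the documented behavior (keep variables carrying the DBT_ENV_CUSTOM_ENV_ prefix and strip that prefix from the key); the metadata_envs function and the sample environment are hypothetical.

```python
PREFIX = "DBT_ENV_CUSTOM_ENV_"


def metadata_envs(environ):
    """Keep only prefixed variables and drop the prefix from their names."""
    return {
        key[len(PREFIX):]: value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


# Hypothetical environment, used instead of os.environ to keep the example repeatable.
fake_environ = {
    "DBT_ENV_CUSTOM_ENV_TIME_GAP": "interval '30' day",
    "DBT_PROFILE": "tuandz",
    "HOME": "/root",
}

envs = metadata_envs(fake_environ)
time_gap = envs.get("TIME_GAP", "interval '90' day")
```

Only TIME_GAP ends up in envs; DBT_PROFILE and HOME are ignored because they lack the prefix.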

Macro

A macro can be thought of as a function in programming. As the dbt documentation defines it: "Macros in Jinja are pieces of code that can be reused multiple times". When the same piece of SQL appears in many places, a macro lets you write it once and reuse it.

Note: used this way, a macro expands to a piece of SQL code at compile time. It does not execute SQL and return the result.

Example:

Given this macro

-- macro
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

and this model

-- model
select id as payment_id, {{ cents_to_dollars('amount') }} as amount_usd, ...
from {{ ref('model_x') }}

the model compiles to the following SQL

-- compiled model
select id as payment_id, (amount / 100)::numeric(16, 2) as amount_usd, ...
from example_schema.table_of_model_x
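The point that a macro produces SQL text rather than query results can be shown with a Python analogy. This is only a sketch: the functions below build strings the way the macro expansion does, the render_model helper is hypothetical, and the elided columns of the original model are left out.

```python
def cents_to_dollars(column_name):
    """Return a SQL expression as text; nothing is executed here."""
    return f"({column_name} / 100)::numeric(16, 2)"


def render_model(relation):
    """Assemble the model's SQL, substituting the macro output and the resolved relation."""
    return (
        f"select id as payment_id, {cents_to_dollars('amount')} as amount_usd\n"
        f"from {relation}"
    )


compiled = render_model("example_schema.table_of_model_x")
print(compiled)
```

The value of compiled is plain SQL, and it only touches the warehouse once something runs it.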

dbt tests

dbt has two kinds of tests: data tests and unit tests.

Data test: verifies the data after a model has run, to ensure data quality. For example, it can check that a column holds unique values, or that a column only holds values from the list ('placed', 'shipped', 'completed', 'returned'). Below is the test configuration for the orders model

version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

unique: every order_id value is unique
not_null: order_id contains no null values
accepted_values: status only takes values from the list ('placed', 'shipped', 'completed', 'returned')
relationships: every customer_id matches a value in the id column of the customers table (referential integrity)
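What the four checks above look for can be sketched over in-memory rows. In dbt each data test is compiled to a SQL query that returns the failing rows; the Python functions below follow the same "return the offenders" idea but are not dbt's code, and the sample rows are made up for illustration.

```python
from collections import Counter

# Made-up sample data containing one violation per check.
orders = [
    {"order_id": 1, "status": "placed", "customer_id": 10},
    {"order_id": 2, "status": "shipped", "customer_id": 11},
    {"order_id": 2, "status": "lost", "customer_id": 99},
    {"order_id": None, "status": "returned", "customer_id": 10},
]
customers = [{"id": 10}, {"id": 11}]


def unique(rows, column):
    """Rows whose non-null value appears more than once."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return [r for r in rows if r[column] is not None and counts[r[column]] > 1]


def not_null(rows, column):
    """Rows where the column is null."""
    return [r for r in rows if r[column] is None]


def accepted_values(rows, column, values):
    """Rows whose non-null value is outside the allowed list."""
    return [r for r in rows if r[column] is not None and r[column] not in values]


def relationships(rows, column, parent_rows, field):
    """Rows whose non-null value has no match in the parent table."""
    parent_keys = {p[field] for p in parent_rows}
    return [r for r in rows if r[column] is not None and r[column] not in parent_keys]


failures = {
    "unique": unique(orders, "order_id"),
    "not_null": not_null(orders, "order_id"),
    "accepted_values": accepted_values(
        orders, "status", ["placed", "shipped", "completed", "returned"]
    ),
    "relationships": relationships(orders, "customer_id", customers, "id"),
}
```

A test passes when its list of failing rows is empty. With the sample data every check reports at least one offender.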

Unit test: like a unit test in programming, it verifies the model's logic by mocking the input data and comparing the output with the expected data. For example

unit_tests:
  - name: test_is_valid_email_address
    model: dim_customers
    given:
      - input: ref('stg_customers')
        rows:
          - {email: cool@example.com, email_top_level_domain: example.com}
          - {email: cool@unknown.com, email_top_level_domain: unknown.com}
          - {email: badgmail.com, email_top_level_domain: gmail.com}
          - {email: missingdot@gmailcom, email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect:
      rows:
        - {email: cool@example.com, is_valid_email_address: true}
        - {email: cool@unknown.com, is_valid_email_address: false}
        - {email: badgmail.com, is_valid_email_address: false}
        - {email: missingdot@gmailcom, is_valid_email_address: false}

Note: only inputs referenced through ref and source can be mocked. Hard-coded table names cannot.
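The same given/expect structure can be mirrored in Python to show what the unit test asserts. The logic of dim_customers is not shown in this post, so the rule below (the email matches a simple pattern and its domain is in the known list) is an assumption chosen to be consistent with the expected rows, not the model's real SQL.

```python
import re

# Assumed validity rule: something@domain.tld with a known domain.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def dim_customers(stg_customers, top_level_email_domains):
    """Stand-in for the model: flag each email as valid or not."""
    known_domains = {row["tld"] for row in top_level_email_domains}
    return [
        {
            "email": row["email"],
            "is_valid_email_address": bool(EMAIL_PATTERN.match(row["email"]))
            and row["email_top_level_domain"] in known_domains,
        }
        for row in stg_customers
    ]


# given: mocked inputs, copied from the unit test definition.
stg_customers = [
    {"email": "cool@example.com", "email_top_level_domain": "example.com"},
    {"email": "cool@unknown.com", "email_top_level_domain": "unknown.com"},
    {"email": "badgmail.com", "email_top_level_domain": "gmail.com"},
    {"email": "missingdot@gmailcom", "email_top_level_domain": "gmail.com"},
]
top_level_email_domains = [{"tld": "example.com"}, {"tld": "gmail.com"}]

# expect: the rows the model should produce.
expected = [
    {"email": "cool@example.com", "is_valid_email_address": True},
    {"email": "cool@unknown.com", "is_valid_email_address": False},
    {"email": "badgmail.com", "is_valid_email_address": False},
    {"email": "missingdot@gmailcom", "is_valid_email_address": False},
]

actual = dim_customers(stg_customers, top_level_email_domains)
```

The test passes when actual equals expected. No warehouse data is read, because both inputs are mocked.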

Commands

dbt ls

Lists the resources of the dbt project. Like other commands that parse the project, it also produces the models' dependency information, saved at target/manifest.json.

From this file you can find the dependencies of each model, for example which refs, sources and macros model X needs.
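Reading those dependencies back out of the manifest can be sketched as follows. The nodes, resource_type and depends_on keys follow the manifest layout, but the sample manifest is a trimmed-down, hypothetical stand-in and the helper names are my own.

```python
import json
from pathlib import Path


def load_manifest(path="target/manifest.json"):
    """Load the manifest written by dbt (the path is relative to the project root)."""
    return json.loads(Path(path).read_text())


def model_dependencies(manifest):
    """Map each model's unique id to the nodes and macros it depends on."""
    deps = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        depends_on = node.get("depends_on", {})
        deps[unique_id] = {
            "nodes": depends_on.get("nodes", []),
            "macros": depends_on.get("macros", []),
        }
    return deps


# Hypothetical, heavily trimmed manifest for illustration.
sample_manifest = {
    "nodes": {
        "model.dbt_lab.model1": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.dbt_lab.raw.orders"], "macros": []},
        },
        "model.dbt_lab.model2": {
            "resource_type": "model",
            "depends_on": {
                "nodes": ["model.dbt_lab.model1"],
                "macros": ["macro.dbt_lab.cents_to_dollars"],
            },
        },
        "test.dbt_lab.unique_orders_order_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.dbt_lab.model2"], "macros": []},
        },
    }
}

deps = model_dependencies(sample_manifest)
```

In a real project, model_dependencies(load_manifest()) would give the same view over the actual target/manifest.json.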

dbt parse

Validates the content of the project's files. This command checks the Jinja and YAML syntax of every file in the project.

dbt compile

Compiles the models in the project to produce the compiled SQL of each model. The compiled SQL is saved under target/compiled/

dbt run

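This command covers the work of the commands above: dbt run first parses the project and compiles the models, as dbt parse and dbt compile do, and only then runs the compiled SQL against the warehouse.

The SQL that is actually run is saved under target/run/

Note: add the --select "<list of model names>" flag to run only certain models. Separate the names with spaces, since a comma means intersection in dbt's selection syntax. For example, dbt run --select "model1 model2 model3"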
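The difference between space-separated and comma-separated selectors is easy to get wrong, so here is a small Python sketch of the set logic. It is deliberately simplified: it only handles exact model names (no tags, paths or graph operators), and the select function is a hypothetical stand-in for dbt's selector.

```python
def select(spec, all_models):
    """Spaces combine groups by union; commas inside a group combine by intersection."""
    selected = set()
    for group in spec.split():
        matches = set(all_models)
        for name in group.split(","):
            matches &= {m for m in all_models if m == name}
        selected |= matches
    return sorted(selected)


all_models = ["model1", "model2", "model3", "model4"]

union_result = select("model1 model2 model3", all_models)
intersection_result = select("model1,model2,model3", all_models)
```

union_result contains the three named models, while intersection_result is empty because no single model is model1, model2 and model3 at the same time.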

dbt docs

Generates a website for browsing the models

dbt docs generate

Once the site is generated, dbt docs serve opens it locally. The compiled code of each model can also be found on this site.

[dbt basic] [P3] Understanding and practicing the essential dbt concepts
