Skip to content

emilyriederer/dbtplyr

Folders and files

NameName
Last commit message
Last commit date
May 1, 2022
Feb 6, 2021
Feb 6, 2021
May 1, 2022
Oct 27, 2022
Oct 27, 2022
Mar 11, 2022
Feb 6, 2021
Feb 6, 2021
May 1, 2022
Aug 22, 2024
Sep 5, 2021
Oct 27, 2022

Repository files navigation

dbtplyr

This add-on package enhances dbt by providing macros which programmatically select columns based on their column names. It is inspired by the across() function and the select helpers in the R package dplyr.

dplyr (>= 1.0.0) has helpful semantics for selecting and applying transformations to variables based on their names. For example, if one wishes to take the sum of all variables with name prefixes of N and the mean of all variables with name prefixes of IND in the dataset mydata, they may write:

summarize(
  mydata, 
  across( starts_with('N'), sum),
  across( starts_with('IND', mean)
)

This package enables us to similarly write dbt data models with commands like:

{% set cols = dbtplyr.get_column_names( ref('mydata') ) %}
{% set cols_n = dbtplyr.starts_with('N', cols) %}
{% set cols_ind = dbtplyr.starts_with('IND', cols) %}

select

  {{ dbtplyr.across(cols_n, "sum({{var}}) as {{var}}_tot") }},
  {{ dbtplyr.across(cols_ind, "mean({{var}}) as {{var}}_avg") }}

from {{ ref('mydata') }}

which dbt then compiles to standard SQL.

Alternatively, to protect against cases where no column names matched the pattern provided (e.g. no variables start with n so cols_n is an empty list), one may instead internalize the final comma so that it is only compiled to SQL when relevant by using the final_comma parameter of across.

  {{ dbtplyr.across(cols_n, "sum({{var}}) as {{var}}_tot", final_comma = true) }}

Note that, slightly more dplyr-like, you may also write:

select

  {{ dbtplyr.across(dbtplyr.starts_with('N', ref('mydata')), "sum({{var}}) as {{var}}_tot") }},
  {{ dbtplyr.across(dbtplyr.starts_with('IND', ref('mydata')), "mean({{var}}) as {{var}}_avg") }}

from {{ ref('mydata') }}

But, as each function call is a bit longer than the equivalent dplyr code, I personally find the first form more readable.

Macros

The complete list of macros included are:

Functions to apply operation across columns

  • across(var_list, script_string, final_comma)
  • c_across(var_list, script_string)

Functions to evaluation condition across columns

  • if_any(var_list, script_string)
  • if_all(var_list, script_string)

Functions to subset columns by naming conventions

  • starts_with(string, relation or list)
  • ends_with(string, relation or list)
  • contains(string, relation or list)
  • not_contains(string, relation or list)
  • one_of(string_list, relation or list)
  • not_one_of(string_list, relation or list)
  • matches(string, relation)
  • everything(relation)
  • where(fn, relation) where fn is the string name of a Column type-checker (e.g. "is_number")

Note that all of the select-helper functions that take a relation as an argument can optionally be passed a list of names instead.

Documentation for these functions is available on the package website and in the macros/macro.yml file.