jtcohen6

data, theatre, vegetable eater

Member Since 6 years ago

@dbt-labs, Marseille

32 followers
11 following
44 stars
10 repos

1994 contributions in the last year

Pinned
⚡ dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
Activity
Jan
24
17 hours ago
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-84] [Bug] Models are not correctly quoted when materialized as views in dbt 1.0

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Models with quoted columns (e.g. containing spaces) fail to run when materialized as a view.

It looks like dbt 1.0.1 is explicitly referencing column names in the create or replace view statement:

e.g.

create or replace  view MODEL_DEV.dbt_XXXX.finance_concepts    
(
     App
   , 
     Account Name
)
  as (
WITH finance_concepts AS (

where 0.21.1 did not:

  create or replace  view MODEL_DEV.dbt_XXXX.finance_concepts  as (
WITH finance_concepts AS (

Materializing as a table in 1.0.1 does not reference columns and works correctly:

      create or replace transient table MODEL_DEV.dbt_XXXX.finance_concepts  as
      (
WITH finance_concepts AS (

Expected Behavior

Either correctly quote columns or stop explicitly referencing columns in create or replace view

Steps To Reproduce

  1. in dbt 1.0.1
  2. create model with spaces in column name
  3. materialize as view
  4. dbt run model will fail
  5. materialize as table
  6. dbt run model will succeed

Relevant log output

No response

Environment

- OS: dbt Cloud
- dbt: 1.0.1

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6
jtcohen6

@rossserven Thanks for opening the issue!

Have you enabled column-level persist_docs for the view model in the example above? In dbt-snowflake==1.0.0, we introduced the capability to persist descriptions on a view model's columns as column-level comments. This requires a little bit of fancy footwork, whereby dbt infers the column schema of the model in advance, so as to include that column schema in its create view statement, along with each column's comments:

-- models/my_view.sql
select 1 as id
# models/views.yml
version: 2
models:
  - name: my_view
    config:
      persist_docs:
        columns: true
    columns:
      - name: id
        description: My primary key
-- target/run/.../my_view.sql
  create or replace  view analytics.dbt_jcohen.my_view 
( 
      ID
)
   as (
    select 1 as id
  );

If your column names are quoted, their identifiers will be case-sensitive—not something we recommend on Snowflake, since it runs into problems like this one, though it is necessary if you need column names containing mixed casing or special characters. All the same, I believe this will work just by telling dbt that your column name needs to be quoted, via the quote column property:

-- models/my_view.sql
select 1 as "My ID Column"
# models/views.yml
version: 2
models:
  - name: my_view
    config:
      persist_docs:
        columns: true
    columns:
      - name: "My ID Column"
        quote: true
        description: My primary key
-- target/run/.../my_view.sql
  create or replace  view analytics.dbt_jcohen.my_view 
( 
  "My ID Column" COMMENT $$My primary key$$
)
   as (
    select 1 as id
  );

Voila!

I'm going to transfer this issue to the dbt-snowflake repo. If there's any change we need to make, it might be to the get_persist_docs_column_list macro over there. But I think we might be able to close this issue as resolved.

Jan
21
3 days ago
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

drop support for Python 3.7.0 and 3.7.1

resolves #4584

Running dbt v1.0.0 and v0.21 (and possibly other dbt versions) with these Python versions causes a Python exception to be thrown. This also prevents installing dbt when running pip with these Python versions.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change
jtcohen6
jtcohen6

@nathaniel-may I realized we actually want this change to core/setup.py, not setup.py (which is for the now-deprecated dbt package)!

I'm fine with the reasoning above, so I switched #4584 to the 1.1.0 milestone.

Could we make sure this gets a changelog entry for 1.1.0? We shouldn't expect this to affect many people, but any time we raise a lower bound (especially the Python version), we should record it IMO

Jan
20
4 days ago
pull request

jtcohen6 pull request dbt-labs/docs.getdbt.com

jtcohen6
jtcohen6

Fix broken link in dbt-api

  • Fix events link to events-logging
  • Add redirect for anyone who's saved the old bad link (I think this is the right thing to do here?)
created branch

jtcohen6 in dbt-labs/docs.getdbt.com create branch fix/broken-link-events-logging

createdAt 4 days ago
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-55] [Bug] var not injected in dbt run snapshot from dbt run invocation in 1.0.0+

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In 1.0.0+ it appears that we are no longer able to inject var values into a dbt snapshot like so:

dbt snapshot --select dummy_snapshot --vars 'run_date: "2021-11-11"'

where the snapshot is defined as follows:

{% snapshot dummy_snapshot %}

{{
    config(


        target_schema = 'potato',
        unique_key = 'uniqueid',
        strategy = 'check',
        check_cols = 'all',
        updated_at = "cast ('{}' as timestamp)".format(var('run_date')),
        file_format = 'delta',
        invalidate_hard_deletes = true
    )
}}

select 1 as uniqueid, current_timestamp() as ts

{% endsnapshot %}

the logs confirm that a 'none' is being injected rather than the value passed ("2021-11-11").

However, creating the same snapshot and running it the same way in an environment on 0.21.0 works (the date is injected)

snapshot_logs_100.txt snapshot_logs_0210.txt

Expected Behavior

specified var is injected to the snapshot the same in 0.21.0 as in 1.0.0

Steps To Reproduce

  1. go to dbt environment on 1.0.0+
  2. create a snapshot as follows:
{% snapshot dummy_snapshot %}

{{
    config(
        target_schema = 'potato',
        unique_key = 'uniqueid',
        strategy = 'check',
        check_cols = 'all',
        updated_at = "cast ('{}' as timestamp)".format(var('run_date')),
        file_format = 'delta',
        invalidate_hard_deletes = true
    )
}}

select 1 as uniqueid, current_timestamp() as ts

{% endsnapshot %}
  3. invoke the snapshot using this command:
dbt snapshot --select dummy_snapshot --vars 'run_date: "2021-11-11"'
  4. observe 'none' being injected in the logs rather than the run_date passed in the dbt run call

Relevant log output

2022-01-18T17:40:30.685676Z: 17:40:30  On snapshot.my_new_project.dummy_snapshot: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "user", "target_name": "default", "node_id": "snapshot.my_new_project.dummy_snapshot"} */

      

      create or replace transient table DEMO_DB.potato.dummy_snapshot  as
      (

    select *,
        md5(coalesce(cast(uniqueid as varchar ), '')
         || '|' || coalesce(cast(cast ('None' as timestamp) as varchar ), '')
        ) as dbt_scd_id,
        cast ('None' as timestamp) as dbt_updated_at,
        cast ('None' as timestamp) as dbt_valid_from,
        nullif(cast ('None' as timestamp), cast ('None' as timestamp)) as dbt_valid_to
    from (
        



select 1 as uniqueid, current_timestamp() as ts

    ) sbq



      );

Versus:

2022-01-18T17:51:04.596006Z: On snapshot.jaffle_shop.dummy_snapshot: /* {"app": "dbt", "dbt_version": "0.21.0", "profile_name": "user", "target_name": "default", "node_id": "snapshot.jaffle_shop.dummy_snapshot"} */

      

      create or replace transient table DEMO_DB.potato.dummy_snapshot  as
      (

    select *,
        md5(coalesce(cast(uniqueid as varchar ), '')
         || '|' || coalesce(cast(cast ('2021-11-11' as timestamp) as varchar ), '')
        ) as dbt_scd_id,
        cast ('2021-11-11' as timestamp) as dbt_updated_at,
        cast ('2021-11-11' as timestamp) as dbt_valid_from,
        nullif(cast ('2021-11-11' as timestamp), cast ('2021-11-11' as timestamp)) as dbt_valid_to
    from (
        



select 1 as uniqueid, current_timestamp() as ts

    ) sbq


      );


Environment

- OS:
- Python:
- dbt: 1.0.0 vs 0.21.0 (dbt Cloud)

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6
jtcohen6

Thanks for the speedy investigation! I'd be in favor of one of the simpler fixes, and defer to you on which one — whichever is clearer to understand / adds less tech debt for the future

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-47] [Bug] Flag `execute` is set to true during the doc generation

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I'm not sure if it's a bug or if it's by design but the execute flag is set to true during the doc generation.

That means that blocks protected by {% if execute %}...{% endif %} are actually executed which, in my opinion, shouldn't be the case when generating the documentation.

This can (and does in my case) cause the doc generation to fail in particular conditions.

Expected Behavior

The execute flag shouldn't be set to true during the doc generation.

The doc generation shouldn't trigger any query to be run against the database for (at least) the following reasons:

  • access rights: the doc generation shouldn't need the rights to execute queries and access table data
  • delays: executing queries can take some time and slow down the doc generation
  • costs: executing queries on large tables can generate non-negligible costs

Steps To Reproduce

Here is a minimal example to better understand the issue.

  1. Suppose we have a model named model_1.
  2. Suppose we have a macro macro_1 using the run_query function to extract something from model_1.
  3. Finally, we use this macro in another model.

Now if we generate the documentation, the macro will actually be executed and the query will be run against the database to extract the required information from model_1. If we have selected a profile+target for which the model_1 table/view doesn't exist in the database, the doc generation will fail because the table/view is not present.

In my case, this causes the failure of the CI build that generates and publishes the dbt documentation, as this pipeline is launched with its dedicated profile/target.

Relevant log output

No response

Environment

- dbt: 0.20.0 / 1.0.1

What database are you using dbt with?

bigquery

Additional Context

No response

jtcohen6
jtcohen6

Really cool conversation here @pcasteran and @emmyoop! I hope the following adds more clarity than confusion, but please let me know if it yields the opposite effect.

There are three distinct "phases" that we're discussing:

  1. Parse-time Jinja rendering. This is when dbt first reads through your models, pulling out the values of ref + source + config to build its internal manifest/DAG. At this point, no database connection is required, no queries are run, and execute is set to False. (Partial parsing enables dbt to fully or partially skip this step if files/inputs are unchanged between invocations.)
  2. Model SQL compilation. This happens only at runtime, and execute is set to True. A database connection is required, and any introspective queries (statement, run_query, get_columns_in_relation, etc) that are required as necessary inputs to template out model SQL will be run against the database.
  3. Model materialization. Using the compiled model SQL produced in step 2, dbt will wrap that SQL in the appropriate materialization logic (create, merge, etc) and execute that logic to create/update database objects.

So:

  • Tasks that just parse the project and return an internal manifest, such as dbt parse + dbt ls, perform step 1 only.
  • Tasks that seek to return fully compiled model SQL, such as dbt compile + dbt docs generate, perform steps 1 + 2.
  • Tasks that materialize objects in the database, such as dbt run + dbt build, perform steps 1-3. (Specifically, they perform step 2-3 on a node-by-node basis, compiling-then-executing each selected resource in DAG order.)

It sounds, for good reasons, like you want your docs generate step to run metadata queries (required to generate catalog.json), but to avoid other introspective queries required to compile model SQL. I think @emmyoop's suggestion is right on: You can dbt docs generate --no-compile, and grab the manifest.json from a previous run/build invocation to use for serving docs instead. Alternatively, you could use dbt ls to generate that manifest without running any database queries, knowing that each model's "Compiled SQL" will be missing from the resulting docs.
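
A minimal sketch of that workflow (assuming the default target/ paths):

dbt run                          # or dbt build: compiles models and writes target/manifest.json
dbt docs generate --no-compile   # runs the metadata queries for catalog.json, skipping model compilation
# serve the docs using the manifest.json produced by the earlier run/build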

One final option: You can use other conditional logic, such as flags.WHICH (which returns the name of the current task/command), to gate statements/queries that you don't want to run during compile or docs generate; there's a short sketch of this at the end of this comment. In response to your final concern:

  • Imagine we have a macro performing DDL or DML statements. Now the compile phase (and thus the generation of the documentation), which should be a non-intrusive operation, will mutate the schema and/or data in the database!

Indeed, this is an anti-pattern in dbt. All queries required for templating model SQL should be introspective only. All DDL/DML should be contained within the materialization logic, in order to preserve the guarantee of idempotence. The state of the database should be identical before and after dbt compile (from the perspective of dbt)—this is just as important as ensuring that dbt run && dbt run yields the same results as dbt run.
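
As promised above, a minimal sketch of gating on flags.WHICH (the macro name and query are made up purely for illustration):

{% macro log_run_metadata() %}
    {# hypothetical macro: only run this query during `dbt run` or `dbt build`,
       never during `dbt compile` or `dbt docs generate` #}
    {% if execute and flags.WHICH in ('run', 'build') %}
        {% set results = run_query("select count(*) from my_schema.my_table") %}
    {% endif %}
{% endmacro %}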

Jan
19
5 days ago
issue

jtcohen6 issue dbt-labs/dbt-core

jtcohen6
jtcohen6

Restore previous behavior when a test depends on a disabled (vs. missing) resource

Prompted by Slack threads here + here

Folks who've newly upgraded to v1.0 are seeing a lot of warn-level (stdout) messages about tests defined on disabled models. This is what our internal analytics project looks like, and I don't think it's an uncommon experience:

Screenshot 2022-01-19 at 19 16 48

(There's many more below that)

It is desirable to warn about tests that have "missed" the correct model, such as due to a typo in a ref or yaml property (e.g. see https://github.com/dbt-labs/dbt-core/issues/4532). It's also a very common pattern to disable unused/irrelevant models in packages, and the result is a quite cluttered terminal. There are some useful warnings in there (missing node for patch), lost as needles in haystack-traces.

What I proposed in one of the threads linked above:

If a test depends on a disabled model, log this as DEBUG-level (shows up in logs/dbt.log). If a test depends on a missing model, log this as WARN-level (shows up in standard CLI output).

I looked quickly at the code, and it turns out this was actually our previous behavior: https://github.com/dbt-labs/dbt-core/blob/08b2f16e10eb3420ecffee3b4d18ac2f45e601ea/core/dbt/parser/manifest.py#L790-L803

We've since replaced the first of those with a call to the InvalidRefInTestNode event, so this would just look like changing: https://github.com/dbt-labs/dbt-core/blob/a588607ec6546de441812e215fbbf4e9ff20101b/core/dbt/events/types.py#L1186-L1192

To:

@dataclass
class InvalidRefInTestNode(DebugLevel):
    msg: str
    code: str = "I051"

    def message(self) -> str:
        return self.msg

Voila, just the good stuff:

Screenshot 2022-01-19 at 19 19 35
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

drop support for Python 3.7.0 and 3.7.1

resolves #4584

Running dbt v1.0.0 and v0.21 (and possibly other dbt versions) with these Python versions causes a Python exception to be thrown. This also prevents installing dbt when running pip with these Python versions.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

Rename data directory to seeds

resolves #4588

Description

The sample project data directory had the wrong name. Now it doesn't!

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change
jtcohen6
jtcohen6

Nice catch + quick work, team! Let's make sure this one gets backported to 1.0.latest, for inclusion in 1.0.2

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-55] [Bug] var not injected in dbt run snapshot from dbt run invocation in 1.0.0+

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In 1.0.0+ it appears that we are no longer able to inject var values into a dbt snapshot like so:

dbt snapshot --select dummy_snapshot --vars 'run_date: "2021-11-11"'

where the snapshot is defined as follows:

{% snapshot dummy_snapshot %}

{{
    config(


        target_schema = 'potato',
        unique_key = 'uniqueid',
        strategy = 'check',
        check_cols = 'all',
        updated_at = "cast ('{}' as timestamp)".format(var('run_date')),
        file_format = 'delta',
        invalidate_hard_deletes = true
    )
}}

select 1 as uniqueid, current_timestamp() as ts

{% endsnapshot %}

the logs confirm that a 'none' is being injected rather than the value passed ("2021-11-11").

However, creating the same snapshot and running it the same way in an environment on 0.21.0 works (the date is injected)

snapshot_logs_100.txt snapshot_logs_0210.txt

Expected Behavior

specified var is injected to the snapshot the same in 0.21.0 as in 1.0.0

Steps To Reproduce

  1. go to dbt environment on 1.0.0+
  2. create a snapshot as follows:
{% snapshot dummy_snapshot %}

{{
    config(
        target_schema = 'potato',
        unique_key = 'uniqueid',
        strategy = 'check',
        check_cols = 'all',
        updated_at = "cast ('{}' as timestamp)".format(var('run_date')),
        file_format = 'delta',
        invalidate_hard_deletes = true
    )
}}

select 1 as uniqueid, current_timestamp() as ts

{% endsnapshot %}
  3. invoke the snapshot using this command:
dbt snapshot --select dummy_snapshot --vars 'run_date: "2021-11-11"'
  4. observe 'none' being injected in the logs rather than the run_date passed in the dbt run call

Relevant log output

2022-01-18T17:40:30.685676Z: 17:40:30  On snapshot.my_new_project.dummy_snapshot: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "user", "target_name": "default", "node_id": "snapshot.my_new_project.dummy_snapshot"} */

      

      create or replace transient table DEMO_DB.potato.dummy_snapshot  as
      (

    select *,
        md5(coalesce(cast(uniqueid as varchar ), '')
         || '|' || coalesce(cast(cast ('None' as timestamp) as varchar ), '')
        ) as dbt_scd_id,
        cast ('None' as timestamp) as dbt_updated_at,
        cast ('None' as timestamp) as dbt_valid_from,
        nullif(cast ('None' as timestamp), cast ('None' as timestamp)) as dbt_valid_to
    from (
        



select 1 as uniqueid, current_timestamp() as ts

    ) sbq



      );

Versus:

2022-01-18T17:51:04.596006Z: On snapshot.jaffle_shop.dummy_snapshot: /* {"app": "dbt", "dbt_version": "0.21.0", "profile_name": "user", "target_name": "default", "node_id": "snapshot.jaffle_shop.dummy_snapshot"} */

      

      create or replace transient table DEMO_DB.potato.dummy_snapshot  as
      (

    select *,
        md5(coalesce(cast(uniqueid as varchar ), '')
         || '|' || coalesce(cast(cast ('2021-11-11' as timestamp) as varchar ), '')
        ) as dbt_scd_id,
        cast ('2021-11-11' as timestamp) as dbt_updated_at,
        cast ('2021-11-11' as timestamp) as dbt_valid_from,
        nullif(cast ('2021-11-11' as timestamp), cast ('2021-11-11' as timestamp)) as dbt_valid_to
    from (
        



select 1 as uniqueid, current_timestamp() as ts

    ) sbq


      );


Environment

- OS:
- Python:
- dbt: 1.0.0 vs 0.21.0 (dbt Cloud)

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6
jtcohen6

Here's what I know about this issue:

  • This is an issue with the RPC server + partial parsing
  • @jeremyyeo detailed the same issue over in https://github.com/dbt-labs/dbt-rpc/issues/48#issuecomment-997537912
  • This represents a regression in v1.0 for folks who use the dbt Cloud IDE, which relies on the RPC server
  • The RPC server should be fully re-parsing all project files whenever --vars are passed into a cli_args method, and storing the new values of config(). That's what happens on the CLI, and why we're returning the correct results there—but it clearly isn't happening here.
  • We're not planning to support dbt-rpc for the indefinite future. I believe that this won't be an issue for the new Server, given how it will work with parsing in particular. But it would not feel good to document this as a known limitation, given that it's a regression from functionality that worked in previous versions.

Here's what I don't know:

  • What's causing the mismatch between dbt-core and dbt-rpc, as far as triggering that full re-parse?
  • How big of a lift would it be to fix?
  • How big is that lift, compared to the lift of improving the behavior of --vars + partial parsing in ways we know we want to? Namely, only files that call a changed var() should actually need to be re-parsed, rather than triggering a full re-parse any time any --vars are passed — i.e. exactly the same as the improvement we made for env_var() + partial parsing

@leahwicz @gshank Given the unknowns here, I think we need a time-boxed investigation, to try to answer the question of what would be required to fix this.


Putting all that aside, and returning to the specifics of this bug report

@saraleon1 Could you say a bit more about the actual use case here? Based on the dummy example, it looks like the user wants to run the snapshot multiple times in sequence, once for each day of (historical?) data. Is the mismatch between ts and updated_at intentional? Would it be possible to refactor the snapshot code to simply be:

{% snapshot dummy_snapshsot %}

{{
    config(
        target_schema = 'potato',
        unique_key = 'uniqueid',
        strategy = 'check',
        check_cols = 'all',
        updated_at = 'updated_at',
        file_format = 'delta',
        invalidate_hard_deletes = true
    )
}}

select 1 as uniqueid, cast({{ var('run_date', 'current_timestamp') }} as timestamp) as updated_at

{% endsnapshot %}

If that accomplishes the use case, I believe that should work. Only config() values are resolved at parse time, and will be affected by this partial parsing + RPC bug. Compiled SQL for nodes (including snaphots) is always re-rendered when the node is actually executed.

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

drop support for Python 3.7.0 and 3.7.1

resolves #4584

Running dbt v1.0.0 and v0.21 (and possibly other dbt versions) with these Python versions causes a Python exception to be thrown. This also prevents installing dbt when running pip with these Python versions.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change
jtcohen6
jtcohen6

Should this go in for 1.0.2? Generally, we shouldn't raise dependency lower bounds in patch releases, though it would be better to raise a more helpful error than the one currently encountered. My sense, though, is that bumping python_requires will only prevent installation if users are running 3.7.0/3.7.1; it won't actually raise a more helpful error at runtime if already installed.

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

Support specifying specific data tests with "dbt test --data"

Feature

Feature description

I'd like to be able to run:

dbt test --data my_custom_test

This would be useful when building new custom tests so I don't have to run all of them when testing. And it would encourage more testing since it's easier to iterate

Who will this benefit?

Everybody who uses custom tests!

Jan
18
6 days ago
issue

jtcohen6 issue comment dbt-labs/docs.getdbt.com

jtcohen6
jtcohen6

Docs: docker image

via https://hub.docker.com/r/fishtownanalytics/dbt

Let's:

  • update the dockerhub page
  • add info about the existence of these images, and how you can use them
issue

jtcohen6 issue dbt-labs/docs.getdbt.com

jtcohen6
jtcohen6

Docs: docker image

via https://hub.docker.com/r/fishtownanalytics/dbt

Let's:

  • update the dockerhub page
  • add info about the existence of these images, and how you can use them
issue

jtcohen6 issue dbt-labs/docs.getdbt.com

jtcohen6
jtcohen6

Update: Use Docker to install dbt

This page is a placeholder currently: https://docs.getdbt.com/dbt-cli/install/docker

Update docs on installing via Docker with pointers to official Dockerfile + public images once available

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[Bug] dbt deps sometimes takes >90sec to complete

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Over the last couple of weeks, I've noticed that dbt deps sometimes takes >90 seconds between Running with dbt==0.21.0 and the first Installing... line.

It's sporadic, and when I re-run to try and capture it, performance improves, which makes me suspect it's some sort of cold-cache thing.

Expected Behavior

Let's go faster

Steps To Reproduce

With this packages.yml file (from the internal-analytics project):

packages:
  - local: '../dbt-utils'
  - package: dbt-labs/snowplow
    version: 0.13.0
  - package: dbt-labs/codegen
    version: 0.4.0
  - package: tailsdotcom/dbt_artifacts
    version: 0.5.0
  - package: fivetran/hubspot
    version: 0.4.0
  - package: fivetran/zendesk_source
    version: 0.4.0
  - package: dbt-labs/audit_helper
    version: 0.4.0

run dbt deps. I was on the VPN, but from memory I've experienced it off the VPN as well.

Note that this combination will ultimately result in a duplicate package error due to codegen depending on utils as well, but I don't think that's relevant.

Relevant log output

https://www.loom.com/share/95a1c1f62c304673a4649d609aa98781

Environment

- OS: macOS 11.6.1
- Python: Python 3.8.10
- dbt: I've noticed this on 0.21.0 as well as various prerelease versions of 1.0

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6
jtcohen6

I don't feel so strongly about where version/dependency resolution happens. It's a bit strange that the logic for it lives in dbt-core; it might be a little bit stranger if it lived within hub.getdbt.com. We've talked about splitting this out into a separate package (dbt-deps?), anticipating a future in which we need its capabilities in multiple places and want to avoid duplicating code.

From a functional standpoint, if all version/dependency resolution were to happen via a single batch call to the Hub API, what would this look like? Send up the full contents of packages.yml, and receive back the link to all relevant tarballs? I'm worried that this might not work for package specifications that mix local packages, git packages, and Hub-hosted package entries—thinking, too, that the first two can declare their own dependencies on the third.
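
For reference, here's what a packages.yml mixing all three source types can look like (package names and versions here are purely illustrative):

packages:
  # Hub-hosted: resolved against the Hub API
  - package: dbt-labs/dbt_utils
    version: [">=0.8.0", "<0.9.0"]
  # Git: dbt clones the repo and reads its own packages.yml for further dependencies
  - git: "https://github.com/dbt-labs/dbt-codegen.git"
    revision: 0.5.0
  # Local: resolved from the filesystem, and may itself depend on Hub packages
  - local: ../dbt-utils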

issue

jtcohen6 issue dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-54] [Bug] dbt --version doesn’t identify patch releases as candidates for upgrade

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I run dbt --version with 1.0.0 installed, I get:

(dbt-prod) [email protected] GitHub % dbt --version
installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - snowflake: 1.0.0
  - postgres: 1.0.0

Even though 1.0.1 is out.

When I run (dbt-prod) [email protected] GitHub % pip install dbt-core dbt-postgres --upgrade, followed by dbt --version, I then get

(dbt-prod) [email protected] GitHub % dbt --version                              
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - snowflake: 1.0.0
  - postgres: 1.0.1

Expected Behavior

It should not say Up to date! when the 1.0.1 patch of Core is available.

not sure how it should behave when there's a patch for an adapter but not core (would that ever happen?)

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS: MacOS
- Python:
- dbt: 1.0.0

What database are you using dbt with?

snowflake

Additional Context

No response

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-54] [Bug] dbt --version doesn’t identify patch releases as candidates for upgrade

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I run dbt --version with 1.0.0 installed, I get:

(dbt-prod) [email protected] GitHub % dbt --version
installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - snowflake: 1.0.0
  - postgres: 1.0.0

Even though 1.0.1 is out.

When I run (dbt-prod) [email protected] GitHub % pip install dbt-core dbt-postgres --upgrade, followed by dbt --version, I then get

(dbt-prod) [email protected] GitHub % dbt --version                              
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - snowflake: 1.0.0
  - postgres: 1.0.1

Expected Behavior

It should not say Up to date! when the 1.0.1 patch of Core is available.

not sure how it should behave when there's a patch for an adapter but not core (would that ever happen?)

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS: MacOS
- Python:
- dbt: 1.0.0

What database are you using dbt with?

snowflake

Additional Context

No response

jtcohen6
jtcohen6

I think this was resolved in https://github.com/dbt-labs/dbt-core/pull/4434 (i.e. in v1.0.1). There was a bug in v1.0.0 code such that it was still checking PyPi for the latest available version of a package named dbt, rather than dbt-core.


That said, the fixed code in v1.0.1 is still only going to report "Your version of dbt is out of date!" for dbt-core. It won't check whether there's a newer version of specific plugins. So, for instance, if/when we release dbt-snowflake==1.0.1, you'd still see:

(dbt-prod) [email protected] GitHub % dbt --version                              
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - snowflake: 1.0.0
  - postgres: 1.0.1

https://github.com/dbt-labs/dbt-core/pull/4565 adds messaging around whether the installed version of an adapter plugin is compatible with the installed version of dbt-core. We could also look into adding checks for each installed plugin, to see if they have a newer compatible version available—at the teensy cost of a bit more semver logic, and a few additional requests to the PyPi API.

For now, I think I'm going to close this issue as resolved. If you think it's important to get plugin-specific "up to date" / "out of date" checks, could I ask you to comment over in #4438 / #4565?

issue

jtcohen6 issue comment dbt-labs/dbt-spark

jtcohen6
jtcohen6

Workaround for some limitations due to `list_relations_without_caching` method

Describe the feature

I am currently facing an issue using dbt with Spark in an AWS/Glue/EMR environment, as already discussed in https://github.com/dbt-labs/dbt-spark/issues/215 (but originally raised here: https://github.com/dbt-labs/dbt-spark/issues/93).

The current issue is about the adapter's method list_relations_without_caching:

https://github.com/dbt-labs/dbt/blob/HEAD/core/dbt/include/global_project/macros/adapters/common.sql#L240

which in the Spark Adapter implementation is:

https://github.com/dbt-labs/dbt-spark/blob/a8a85c54d10920af1c5efcbb4d2a51eb7cfcad11/dbt/include/spark/macros/adapters.sql#L133-L139

In this case you can see that the command show table extended in {{ relation }} like '*' is executed. It will force Spark to go through all the tables' info in the schema (as Spark has no information schema layer) in a sort of "discovery mode", and this approach produces two main issues:

  1. Bad performance: some environments can have hundreds or even thousands of tables generated not only by DBT but also by other processes in the same schema. In that case this operation can be very costly, especially when you have different DBT processes that run some updates at different times on a few tables.

  2. Instability, as I verified in an AWS/Glue/EMR environment, where you can have views without an "S3 Location" defined, like an Athena/Presto view, which will make a dbt process running SparkSQL on EMR crash with errors like:

show table extended in my_schema like '*'
  ' with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:00:22 INFO SparkExecuteStatementOperation: Running query with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 INFO DAGScheduler: Asked to cancel job group 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 ERROR SparkExecuteStatementOperation: Error executing query with 62bb6394-b13a-4b79-92dd-2e0918831cf3, currentState RUNNING,
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string

Describe alternatives you've considered

I do not see the reason why the dbt process should care about the "rest of the world", like the Athena views from before or tables created by other processes in the same schema.

So ideally I would think to replace the query:

show table extended in <schema> like '*'

with something like:

show table extended in <schema> like ('<table1>|<table2>|…')

where my <table1>, <table2>, etc. are determined automatically when I run a command like

dbt run --models my_folder

where my_folder contains the files: table1.sql, table2.sql, etc

but in the current method interface, only the schema param can be passed.

Two questions here: How can I automatically infer the names of the tables involved when a command like dbt run --models my_folder runs, and how can I eventually pass them to list_relations_without_caching?

Additional context

I found it relevant for Spark in an AWS environment, but it can potentially be a similar issue for other implementations.

Who will this benefit?

On dbt's Slack channel I talked to another user "affected" by a similar issue, but probably anyone who uses Spark in a distributed environment can be affected by this (AWS or not).

Are you interested in contributing this feature?

Sure, both coding and testing.

jtcohen6
jtcohen6

@jeremyyeo I think this approach would indeed require implementing _get_cache_schemas(), i.e. a change to the python methods in dbt-spark—so there's no way to test this out with purely user-space code in the meantime.

An alternative approach that either I just remembered, or which just occurred to me, in the threads where we're discussing this internally:

  • The reason we initially opted (back in https://github.com/dbt-labs/dbt-spark/issues/49) for show table extended (verbose) over show tables (concise) is that the concise version lacks a field/specifier that lets us tell views apart from tables. (Contrary to the nomenclature, show tables includes both views + tables.)
  • We could look into running two concise queries rather than one verbose one, by running (for each relevant database) show views in <database> + show tables in <database>. Then, we can reasonably infer that every object returned by the former is a view, and every object returned by the latter and not in former is a table.
  • This bare-bones approach would be enough for the initial adapter cache that powers materializations—though it wouldn't let us do incidentally clever things like returning get_columns_in_relation from the cache if available.

It might be worth experimenting with both approaches, and seeing which one yields greater benefits. Does the slowness have more to do with the verbosity of show table extended (contextual information we don't strictly need)? Or more to do with trying to return metadata for thousands of extraneous objects that dbt doesn't care about, but which happen to live alongside relevant objects in the same database/schema?
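
A rough sketch of that second approach, following the shape of the existing spark__list_relations_without_caching macro (illustrative only; merging the two result sets into the structure the cache expects is left out):

{% macro spark__list_relations_without_caching(relation) %}
  {# sketch: two concise queries instead of `show table extended in ... like '*'` #}
  {% call statement('list_tables', fetch_result=True) -%}
    show tables in {{ relation }}
  {% endcall %}
  {% call statement('list_views', fetch_result=True) -%}
    show views in {{ relation }}
  {% endcall %}
  {# objects returned by `list_views` are views; objects in `list_tables` but not
     in `list_views` can be inferred to be tables #}
  {% do return(load_result('list_tables').table) %}
{% endmacro %}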

Jan
17
1 week ago
issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[Bug] dbt-postgres doesn't install like the other adapters.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Unlike the rest of the adapters, which live in their own repos, dbt-postgres lives inside the codebase for dbt-core. This leads to inconsistencies and was recently the root cause of (or a large contributor to) a multi-hour outage in Cloud. (see incident #39)

Expected Behavior

All dbt db adapters operate in the same way.

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

What database are you using dbt with?

No response

Additional Context

Some additional context as to why dbt-postgres is a special snowflake: https://github.com/dbt-labs/dbt-core/issues/3968#issuecomment-941081542

jtcohen6
jtcohen6

@iknox-fa Could you say more about this:

we have to do special things with setuptools in dbt-postgres's setup.py in order to make pip work correctly with the different code location.

Following @kwigley's explanations in the incident channel, my sense is that:

  • dbt-postgres checks an environment variable at build time, DBT_PSYCOPG2_NAME, which can be either psycopg2-binary (default) or psycopg2. For some light background reading on this: https://github.com/dbt-labs/dbt-core/issues/1221#issuecomment-511546464
  • dbt Cloud sets this env var to be psycopg2 in its base image. Since this env var (with non-default value) must be checked at build time, we use pip install --no-binary dbt-postgres, i.e. installing from source rather than from wheel/binary
  • The breaking change in setuptools causes it to use its own vendored version of distutils, rather than the stdlib version, and it sticks source-built code in a different location (site-packages instead of dist-packages)

That is, we would have run into this issue whether the dbt-postgres code was located in dbt-core or a separate repo. The determining factor is the need to build from source, given the need to check an env var at build time to identify the right psycopg2 package dependency.
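
For reference, the build-from-source path described above looks roughly like this (a sketch based on the details in this thread, not the exact Cloud build steps):

# default: install dbt-postgres from a wheel, depending on psycopg2-binary
pip install dbt-postgres

# Cloud-style: pick the psycopg2 source package and build dbt-postgres from source,
# so that DBT_PSYCOPG2_NAME is read at build time
export DBT_PSYCOPG2_NAME=psycopg2
pip install --no-binary dbt-postgres dbt-postgres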

issue

jtcohen6 issue dbt-labs/dbt-core

jtcohen6
jtcohen6

dbt-docs bug fixes for v1.0.2

Fixes:

Once those fixes are made, we'll need to transfer the compiled index.html to this repo for distribution

issue

jtcohen6 issue comment dbt-labs/dbt-docs

jtcohen6
jtcohen6

Meta configs in model details are not rendered in docs site

Describe the bug

The model-level meta config is not being rendered in the docs site. Column-level meta configs, however, are rendered just fine.

Steps To Reproduce

With a very basic project and the following schema.yml file:

version: 2

models:
  - name: my_model
    description: "Test description"
    config:
      meta:
        contains_pii: "true"
    columns:
      - name: user_name
        meta:
          contains_pii: "true"
  - name: my_model_2
    config:
      meta:
        my_meta: "test"
    columns:
      - name: user_name
        meta:
          my_meta: "test"

Generate docs (dbt docs generate) and then inspect them.

Expected behavior

Config meta key-values should render under model details just like column details.

Screenshots and log output

Screen Shot 2022-01-17 at 11 49 21 AM

Gif: 2022-01-17 11 00 01

The output of dbt --version:

1.0.1

Tested in both core and cloud but same behaviour.

Additional context

This behaviour is also reproduced when using the meta config within the model file itself:

-- my_model.sql

{{ config(meta = {'contains_pii': 'true'}) }}

select 1 as user_name

Note that this doesn't affect dbt execution in terms of selecting models to run via the meta config (e.g. dbt run -s config.meta.contains_pii:true); it's more of a UI rendering issue with the docs site.

Best guess currently is due to this: https://github.com/dbt-labs/dbt-docs/blob/main/src/app/components/table_details/table_details.js#L23 (though to be honest, not really that familiar with the docs site / angular).

jtcohen6
jtcohen6

Thanks @jeremyyeo! I think this is actually a dbt-core bug: https://github.com/dbt-labs/dbt-core/issues/4459

The contents of node.config.meta are not successfully copied over to node.meta, even though this was our intention, so they're missing from node.meta in manifest.json. While the docs site code could work around this in the meantime by peeking into both node.config.meta and node.meta, I'd prefer to resolve this by achieving more consistent behavior in the metadata artifacts produced by dbt-core.

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-45] [Feature] Expose Snowflake query id in case of dbt test failure

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

When we get a dbt test failure, I often look through the debug logs to find the exact query which was run. Having the dbt_internal_test query with failures, should_warn, and should_error is an amazing first step, but the issue is that when I rerun the query, it often passes. If the debug logs (or the friendly logs) contained the sfqid, then I could easily modify the query to use Snowflake's time travel feature, so I can recreate the failing test and debug it further.

Right now, I see the following info is logged:

2022-01-14 00:14:37.419997 (Thread-39): 00:14:37  On test.[redacted].c630bf19e2: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "user", "target_name": "default", "node_id": "[redacted]"} */

Related issue: https://github.com/dbt-labs/dbt-core/pull/2358

Describe alternatives you've considered

Right now, I look through the snowflake query history to find the sfqid, which works but is a bit tedious.
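
For illustration, the kind of time-travel query this would enable once the sfqid is surfaced (the table name and query id below are made up):

-- query the table as it was when the failing test statement ran, using its query id
select *
from analytics.my_model at (statement => '01a1b2c3-0000-1234-0000-000000000001');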

Who will this benefit?

Analytics engineers running dbt tests being able to track down issues faster. Speed is critical because the standard edition of snowflake only allows for 1 day of timetravel, so you need to debug it quickly.

The easier it is to track down intermittent test issues the better, so that people are encouraged to write more.

Are you interested in contributing this feature?

Sure! I have a bit of experience with python (mostly ruby/javascript), but could figure it out if I know:

  • people think this is a good idea (so it's not a waste of everyone's time)
  • a point in the right direction

Anything else?

I'm loving dbt! Thanks for all of the hard work.

jtcohen6
jtcohen6

@tommyh Thanks for opening, and for linking the prior art in #2358!

I think there's some additional overlap with the proposal in https://github.com/dbt-labs/dbt-snowflake/issues/7, which would include the Snowflake query id in the result object / run_results.json. I think that would work well for test results, including test failures, since dbt's results object colocates a lot of useful contextual information for tracking test results over time.

I'm going to transfer this issue over to the dbt-snowflake repo, since that's where any associated code change would need to happen. We might also decide that the proposal in https://github.com/dbt-labs/dbt-snowflake/issues/7 / implementation in https://github.com/dbt-labs/dbt-snowflake/pull/40 would get you enough of the way there.

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-48] [Bug] Logger throws recursion errors because of Deprecation Warnings

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

dbt debug shows that everything is green. Then, when running dbt run (or seed or deps), I get a recursion error (see log output section).

Solution: When I did some low-key investigation, it seemed that this is triggered by deprecation warnings. I then took the liberty of adding the line

warnings.filterwarnings("ignore", category=DeprecationWarning) 

to dbt/logger.py and now it works. (Not sure if this is the best solution though)

Expected Behavior

It should not get the recursion error.

Steps To Reproduce

git clone [email protected]:dbt-labs/jaffle_shop.git
cd jaffle_shop
pyenv local 3.10.1
python -m venv .venv
source .venv/bin/activate
pip install dbt-postgres
dbt run

Relevant log output

Traceback (most recent call last):
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 216, in handle
    self.emit(record)
  File "jaffle_shop/.venv/lib/python3.10/site-packages/dbt/logger.py", line 467, in emit
    super().emit(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 836, in emit
    msg = self.format(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/dbt/logger.py", line 454, in format
    msg = super().format(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 195, in format
    return self.formatter(record, self)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 387, in __call__
    line = self.format_record(record, handler)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 371, in format_record
    return self._formatter.format(record=record, handler=handler)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/helpers.py", line 283, in __get__
    value = self.func(obj)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/base.py", line 675, in thread_name
    return thread_get_name()
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/concurrency.py", line 141, in thread_get_name
    return currentThread().getName()
  File "/home/xx/.pyenv/versions/3.10.1/lib/python3.10/threading.py", line 1442, in currentThread
    warnings.warn('currentThread() is deprecated, use current_thread() instead',
RecursionError: maximum recursion depth exceeded


Environment

- OS: Ubuntu 20.04
- Python: 3.10.1
- dbt: 1.0.1

What database are you using dbt with?

postgres

Additional Context

No response

jtcohen6
jtcohen6

Hey @senickel, thanks for the detailed write-up!

I believe this is a duplicate of https://github.com/dbt-labs/dbt-core/issues/4537. dbt v1.0 does not officially support Python 3.10; we're looking to add support for it in v1.1 (https://github.com/dbt-labs/dbt-core/issues/4562), which will require (at minimum) addressing the flood of deprecation warnings related to the legacy logger / logbook.

Until then, the official resolution will be to downgrade your Python version to 3.9 while using dbt.
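
Concretely, following the repro steps above, the workaround is just to recreate the virtualenv on a supported Python version (3.9.x shown here):

pyenv local 3.9.9
python -m venv .venv
source .venv/bin/activate
pip install dbt-postgres
dbt run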

issue

jtcohen6 issue dbt-labs/dbt-core

jtcohen6
jtcohen6

[CT-48] [Bug] Logger throws recursion errors because of Deprecation Warnings

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

dbt debug shows that everything is green. Then, when running dbt run (or seed or deps), I get a recursion error (see log output section).

Solution: When I did some low-key investigation, it seemed that this is triggered by deprecation warnings. I then took the liberty of adding the line

warnings.filterwarnings("ignore", category=DeprecationWarning) 

to dbt/logger.py and now it works. (Not sure if this is the best solution though)

Expected Behavior

It should not get the recursion error.

Steps To Reproduce

git clone [email protected]:dbt-labs/jaffle_shop.git
cd jaffle_shop
pyenv local 3.10.1
python -m venv .venv
source .venv/bin/activate
pip install dbt-postgres
dbt run

Relevant log output

Traceback (most recent call last):
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 216, in handle
    self.emit(record)
  File "jaffle_shop/.venv/lib/python3.10/site-packages/dbt/logger.py", line 467, in emit
    super().emit(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 836, in emit
    msg = self.format(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/dbt/logger.py", line 454, in format
    msg = super().format(record)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 195, in format
    return self.formatter(record, self)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 387, in __call__
    line = self.format_record(record, handler)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/handlers.py", line 371, in format_record
    return self._formatter.format(record=record, handler=handler)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/helpers.py", line 283, in __get__
    value = self.func(obj)
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/base.py", line 675, in thread_name
    return thread_get_name()
  File "/jaffle_shop/.venv/lib/python3.10/site-packages/logbook/concurrency.py", line 141, in thread_get_name
    return currentThread().getName()
  File "/home/xx/.pyenv/versions/3.10.1/lib/python3.10/threading.py", line 1442, in currentThread
    warnings.warn('currentThread() is deprecated, use current_thread() instead',
RecursionError: maximum recursion depth exceeded


Environment

- OS: Ubuntu 20.04
- Python: 3.10.1
- dbt: 1.0.1

What database are you using dbt with?

postgres

Additional Context

No response

Jan
12
1 week ago
issue

jtcohen6 issue comment dbt-labs/docs.getdbt.com

jtcohen6
jtcohen6

dbt init profile behavior changed in dbt 1.0.0, 'default' no longer default 'profile' value

Contributions

  • I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/reference/project-configs/profile

What part(s) of the page would you like to see updated?

If you are working on more than one project, do not use profile: default as your profile name (as set by the dbt init command), as it will become hard to manage multiple profiles.

In dbt 1.0.0 I'm seeing different behavior. The dbt init command sets the profile to the project name, e.g. dbt init -s test_project results in profile: 'test_project', not profile: 'default'.

Additional information

I'm not quite sure what's the best way to update this line in the document, but it definitely appears invalid now that 1.0.0 is released.

jtcohen6
jtcohen6

Absolutely! The init behavior is better now, I'd say :) so that parenthetical can go away

issue

jtcohen6 issue comment dbt-msft/dbt-synapse

jtcohen6
jtcohen6

dbt found two macros named "synapse__drop_relation" in the project

Hello,

With versions dbt-core==1.0.0, dbt-sqlserver==1.0.0, dbt-synapse==1.0.1, when I run a dbt command such as dbt snapshot, I get the error

20:32:03  Running with dbt=1.0.0
20:32:04  Partial parse save file not found. Starting full parse.
20:32:04  Encountered an error:
Compilation Error
  dbt found two macros named "synapse__drop_relation" in the project
  "dbt_synapse".
   To fix this error, rename or remove one of the following macros:
      - macros/adapters/relation.sql
      - macros/adapters.sql

I do not get this error with versions dbt-core==0.20.2, dbt-sqlserver==0.20.1, dbt-synapse==0.20.0. I'm using Python version 3.9.9.

jtcohen6
jtcohen6

@swanderz Given the error message, and that #72 moved this macro from macros/adapters.sql to macros/adapters/relation.sql... If you install dbt-synapse pre-v1.0, and then upgrade to v1.0, is it possible that the old file keeps "hanging out" in the source installation?

I quickly tried replicating this locally, and I wasn't able to.
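
If a stale source install is the culprit, one way to rule that out (assuming a pip-managed environment; the versions are the ones reported in this thread):

pip uninstall -y dbt-synapse dbt-sqlserver
pip install --force-reinstall dbt-synapse==1.0.1 dbt-sqlserver==1.0.0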

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

Ability to set log levels

Describe the feature

Allow configuring the log level, e.g. from INFO to WARN or ERROR. Could be configurable as a global CLI flag.

Additional context

Some of our SQL occasionally has sensitive information that we wouldn't like to be logged.

Who will this benefit?

  • People who want less "chatty" logging
  • People using Spark external tables with credentials.
jtcohen6
jtcohen6

@Gunnnn That sounds like it may be caused by something unrelated. Shot in the dark: Any chance you're using python 3.10? It sounds like folks are seeing tons of extra log lines related to some deprecation warnings in a dependency (https://github.com/dbt-labs/dbt-core/issues/4537), something we plan to fix when we add official support (https://github.com/dbt-labs/dbt-core/issues/4562)

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

Reinstatement of of ssh tunnel

Hi

I'm looking at : https://github.com/fishtown-analytics/dbt/issues/93 which suggests that once upon a time ssh-host would have been a valid property under a target of profiles.yml where a user can set up an ssh tunnel to connect to the data source.

I would love to be able to use this feature again.

Some context

In our case, the scenario is that our data is brought in by Stitch across many schemas. I'd like to give a data engineer local access to some of that data, but that Redshift cluster is secured using a data bastion sitting in front of it.

Some issues we're trying to resolve

In our case, we can develop a lot of the scripts locally against postgres, however there are limitations:

  • Speed is not greatly representative locally (anonymised production data / cut-down data sets, etc.)
  • Redshift & Postgres differ in function names - we use jinja to abstract that away in the model files, but we can't test portions of jinja until we're in Redshift itself and using production data.

What we would like to do

  • Allow a local user to connect to Redshift through our data bastion. They have access to all the source models, but can only write to a 'user' schema (say 'chris').
  • They can then test the actual dbt scripts per se.
jtcohen6
jtcohen6

Hey folks, I know we all have varying levels of experience with SSH-based connections, bastion hosts, etc. This confused me the first time I had to do it a few years ago.

We decided to keep this capability separate from dbt-core because it's not really specific to dbt. There are good tools for solving this problem, specific to the machine you're running on.

The basic idea, as I remember it:

  • You register a public SSH key with the remote location, tied to a private key that lives on your machine
  • You use a CLI tool (e.g. ssh, autossh) to "forward" a local port to the remote location (bastion host)
  • In profiles.yml, instead of putting the host/port of a remote database, you put localhost and the number of the "forwarding" port
  • Voila! Your connection is forwarded to the bastion host, authenticated via SSH, and passed along to the database

Does anyone have any good walkthroughs / blog posts with examples that they recommend when onboarding a new colleague to this sort of setup?
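
As a bare-bones example of those steps (host names, user, and port are placeholders, not a recommendation for any particular tool):

# forward local port 5439 to the Redshift endpoint, via the bastion host
ssh -N -L 5439:my-redshift-endpoint.example.com:5439 my_user@my-bastion.example.com

# profiles.yml (only the relevant fields shown): point dbt at the forwarded port
my_profile:
  outputs:
    dev:
      type: redshift
      host: localhost   # instead of the remote Redshift host
      port: 5439        # the local forwarding port from the ssh command above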

issue

jtcohen6 issue comment dbt-labs/dbt-core

jtcohen6
jtcohen6

Make it possible to generate a DBT docs subset

Describe the feature

As we are building a data platform for multiple, different clients, we would like to generate a subset of the complete dbt docs for them. Reasons being:

  • Why expose documentation that will never be relevant for a client?
  • ETL logic / privacy / legal reasons

My initial thought is to make it possible to generate 'sub-sites' using the already existing model selector methods syntax:

Examples: This should generate docs based on a tag and all further outgoing nodes: dbt docs --models tag:client_x+

This should generate docs based on a tag, the incoming nodes, all outgoing nodes and the parents of the children: dbt docs --models +tag:[email protected]

etc..

Support for exclude is also important for us here.

Describe alternatives you've considered

James Weakley showed this DIY method on dbt Slack:

import json

data = None
with open('target/manifest.json') as f:
    data = json.load(f)

# iterate over a copy of the keys, since entries are deleted while looping
for node in list(data['nodes']):
    print(f"Checking node: {node}")
    if 'some_tag' in data['nodes'][node]['tags']:
        del data['nodes'][node]

with open('target/manifest.json', 'w') as f:
    json.dump(data, f)

While that could partially work, it will not be as complete as what we would like.

Additional context

This is not database specific, I guess the metadata is there in DBT to make this possible

Who will this benefit?

Anyone that has the same use case as us: reducing the amount of info for a consumer and privacy / legal reasons

Are you interested in contributing this feature?

My team can code in python / they could potentially help

jtcohen6
jtcohen6

Just to clarify that point: The --select argument to dbt docs generate does perform a function, by selecting which resources will be compiled. It does not affect which resources will be included in manifest.json (all enabled resources in the project + packages), but it does affect which of those resources include their compiled_sql.

If you skip all compilation, dbt docs generate --no-compile, then the inclusion of a --select flag wouldn't have any effect.
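
To make that concrete (the selector is just the one from the example above):

# all enabled resources still land in manifest.json, but only the selected ones get compiled_sql
dbt docs generate --select tag:client_x+

# skip compilation entirely; a --select flag would have no effect here
dbt docs generate --no-compile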
