Changelog#

1.2.1 (core) / 0.18.1 (libraries)#

Bugfixes#

  • Fixed a bug with postgres storage where daemon heartbeats were failing on instances that had not been migrated with dagster instance migrate after upgrading to 1.2.0.

1.2.0 (core) / 0.18.0 (libraries)#

Major Changes since 1.1.0 (core) / 0.17.0 (libraries)#

Core#

  • Added a new dagster dev command that can be used to run both Dagit and the Dagster daemon in the same process during local development. [docs]
  • Config and Resources
  • Repository > Definitions [docs]
  • Declarative scheduling
    • The asset reconciliation sensor is now 100x more performant in many situations, meaning that it can handle more assets and more partitions.
    • You can now set freshness policies on time-partitioned assets.
    • You can now hover over a stale asset to learn why that asset is considered stale.
  • Partitions
    • DynamicPartitionsDefinition allows partitioning assets dynamically - you can add and remove partitions without reloading your definitions (experimental). [docs]
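      A minimal sketch of the new API, assuming a hypothetical "customers" partition set whose keys are discovered by a sensor (all names here are illustrative):

      from dagster import (
          DynamicPartitionsDefinition,
          RunRequest,
          asset,
          define_asset_job,
          sensor,
      )

      # Hypothetical dynamic partitions definition keyed by customer name.
      customers_partitions = DynamicPartitionsDefinition(name="customers")

      @asset(partitions_def=customers_partitions)
      def customer_report(context):
          context.log.info(f"Building report for {context.partition_key}")

      customer_report_job = define_asset_job(
          "customer_report_job", selection="customer_report"
      )

      @sensor(job=customer_report_job)
      def customer_sensor(context):
          new_keys = ["acme"]  # stand-in for keys discovered externally
          # Register new partition keys without reloading definitions.
          context.instance.add_dynamic_partitions("customers", new_keys)
          return [RunRequest(partition_key=key) for key in new_keys]
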
    • The asset graph in the UI now displays the number of materialized, missing, and failed partitions for each partitioned asset.
    • Asset partitions can now depend on earlier time partitions of the same asset. Backfills and the asset reconciliation sensor respect these dependencies when requesting runs [example].
    • TimeWindowPartitionMapping now accepts start_offset and end_offset arguments that allow specifying that time partitions depend on earlier or later time partitions of upstream assets [docs].
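      For example, a minimal sketch of a daily asset that depends on the previous day's partition of its upstream (asset names are illustrative):

      from dagster import (
          AssetIn,
          DailyPartitionsDefinition,
          TimeWindowPartitionMapping,
          asset,
      )

      daily = DailyPartitionsDefinition(start_date="2023-01-01")

      @asset(partitions_def=daily)
      def events():
          ...

      @asset(
          partitions_def=daily,
          ins={
              # Each partition reads the prior day's partition of `events`.
              "events": AssetIn(
                  partition_mapping=TimeWindowPartitionMapping(
                      start_offset=-1, end_offset=-1
                  )
              )
          },
      )
      def enriched_events(events):
          ...
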
  • Backfills
    • Dagster now allows backfills that target assets with different partitions, such as a daily asset which rolls up into a weekly asset, as long as the root assets in the selection are partitioned in the same way.
    • You can now choose to pass a range of asset partitions to a single run rather than launching a backfill with a run per partition [instructions].

Integrations#

  • Weights and Biases - A new integration dagster-wandb with Weights & Biases allows you to orchestrate your MLOps pipelines and maintain ML assets with Dagster. [docs]
  • Snowflake + PySpark - A new integration dagster-snowflake-pyspark allows you to store and load PySpark DataFrames as Snowflake tables using the snowflake_pyspark_io_manager. [docs]
  • Google BigQuery - A new BigQuery I/O manager and new integrations dagster-gcp-pandas and dagster-gcp-pyspark allow you to store and load Pandas and PySpark DataFrames as BigQuery tables using the bigquery_pandas_io_manager and bigquery_pyspark_io_manager. [docs]
  • Airflow - The dagster-airflow integration library was bumped to 1.x.x. With that major bump, the library has been refocused on enabling migration from Airflow to Dagster. Refer to the docs for an in-depth migration guide.
  • Databricks - Changes:
    • Added op factories to create ops for running existing Databricks jobs (create_databricks_run_now_op), as well as submitting one-off Databricks jobs (create_databricks_submit_run_op).
    • Added a new Databricks guide.
    • The previous create_databricks_job_op op factory is now deprecated.

Docs#

  • Automating pipelines guide - Check out the best practices for automating your Dagster data pipelines with this new guide. Learn when to use different Dagster tools, such as schedules and sensors, using this guide and its included cheatsheet.
  • Structuring your Dagster project guide - Need some help structuring your Dagster project? Learn about our recommendations for getting started and scaling sustainably.
  • Tutorial revamp - Goodbye cereals and hello HackerNews! We’ve overhauled our intro to assets tutorial to not only focus on a more realistic example, but to touch on more Dagster concepts as you build your first end-to-end pipeline in Dagster. Check it out here.

Stay tuned, as this is only the first part of the overhaul. We’ll be adding more chapters - including automating materializations, using resources, using I/O managers, and more - in the next few weeks.

Since 1.1.21 (core) / 0.17.21 (libraries)#

New#

  • Freshness policies can now be assigned to assets constructed with @graph_asset and @graph_multi_asset (see the sketch at the end of this list).
  • The project_fully_featured example now uses the built in DuckDB and Snowflake I/O managers.
  • A new “failed” state on asset partitions makes it more clear which partitions did not materialize successfully. The number of failed partitions is shown on the asset graph and a new red state appears on asset health bars and status dots.
  • Hovering over “Stale” asset tags in the Dagster UI now explains why the annotated assets are stale. Reasons can include more recent upstream data, changes to code versions, and more.
  • [dagster-airflow] Support for persisting Airflow DB state has been added with make_persistent_airflow_db_resource. This enables Airflow features like pools and cross-DAG-run state sharing; in particular, retry-from-failure now works for jobs generated from Airflow DAGs.
  • [dagster-gcp-pandas] The BigQueryPandasTypeHandler now uses the google.cloud.bigquery.Client methods load_table_from_dataframe and query rather than the pandas_gbq library to store and fetch DataFrames.
  • [dagster-k8s] The Dagster Helm chart now only overrides args instead of both command and args for user code deployments, allowing you to include a custom ENTRYPOINT in the Dockerfile that loads your code.
  • The protobuf<4 pin in Dagster has been removed. Both protobuf 3 and protobuf 4 now work with Dagster.
  • [dagster-fivetran] Added the ability to specify op_tags to build_fivetran_assets (thanks @Sedosa!)
  • @graph_asset and @graph_multi_asset now support passing metadata (thanks @askvinni)!
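
A minimal sketch combining the two @graph_asset additions above (a freshness policy and metadata), using placeholder ops:

    from dagster import FreshnessPolicy, graph_asset, op

    @op
    def fetch_raw():
        return [1, 2, 3]

    @op
    def transform(raw):
        return [x * 2 for x in raw]

    @graph_asset(
        freshness_policy=FreshnessPolicy(maximum_lag_minutes=30),
        metadata={"owner": "data-eng"},  # illustrative metadata
    )
    def my_graph_asset():
        return transform(fetch_raw())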

Bugfixes#

  • Fixed a bug that caused descriptions supplied to @graph_asset and @graph_multi_asset to be ignored.
  • Fixed a bug that caused serialization errors when using TableRecord.
  • Fixed an issue where partitions definitions passed to @multi_asset and other functions would register as type errors for mypy and other static analyzers.
  • [dagster-aws] Fixed an issue where the EcsRunLauncher failed to launch runs for Windows tasks.
  • [dagster-airflow] Fixed an issue where pendulum timezone strings for Airflow DAG start_date would not be converted correctly causing runs to fail.
  • [dagster-airbyte] Fixed an issue where attaching I/O managers to Airbyte assets would result in errors.
  • [dagster-fivetran] Fixed an issue where attaching I/O managers to Fivetran assets would result in errors.

Database migration#

  • Optional database schema migrations, which can be run via dagster instance migrate:
    • Improves Dagit performance by adding a database index which should speed up job run views.
    • Enables dynamic partitions definitions by creating a database table to store partition keys. This feature is experimental and may require future migrations.
    • Adds a primary key id column to the kvs, daemon_heartbeats and instance_info tables, enforcing that all tables have a primary key.

Breaking Changes#

  • The minimum grpcio version supported by Dagster has been increased to 1.44.0 so that Dagster can support both protobuf 3 and protobuf 4. Similarly, the minimum protobuf version supported by Dagster has been increased to 3.20.0. We are working closely with the gRPC team on resolving the upstream issues keeping the upper-bound grpcio pin in place in Dagster, and hope to be able to remove it very soon.

  • Prior to 0.9.19, asset keys were serialized in a legacy format. This release removes support for querying asset events serialized with this legacy format. Contact #dagster-support for tooling to migrate legacy events to the supported version. Users who began using assets after 0.9.19 will not be affected by this change.

  • [dagster-snowflake] The execute_query and execute_queries methods of the SnowflakeResource now have consistent behavior based on the values of the fetch_results and use_pandas_result parameters. If fetch_results is True, the standard Snowflake result will be returned. If fetch_results and use_pandas_result are both True, a pandas DataFrame will be returned. If fetch_results is False and use_pandas_result is True, an error will be raised. If both are False, no result will be returned.

  • [dagster-snowflake] The execute_queries command now returns a list of DataFrames when use_pandas_result is True, rather than appending the results of each query to a single DataFrame.
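
    A sketch of the resulting behavior, assuming an op with a configured snowflake resource (resource setup omitted):

    from dagster import op

    @op(required_resource_keys={"snowflake"})
    def run_queries(context):
        snowflake = context.resources.snowflake
        # fetch_results=True: the standard Snowflake result is returned.
        rows = snowflake.execute_query("select 1", fetch_results=True)
        # fetch_results=True and use_pandas_result=True: a DataFrame is returned.
        df = snowflake.execute_query(
            "select 1", fetch_results=True, use_pandas_result=True
        )
        # execute_queries now returns a list of DataFrames, one per query.
        dfs = snowflake.execute_queries(
            ["select 1", "select 2"], fetch_results=True, use_pandas_result=True
        )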

  • [dagster-shell] The default behavior of the execute and execute_shell_command functions is now to include any environment variables in the calling op. To restore the previous behavior, you can pass in env={} to these functions.
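
    For example, a sketch of restoring the previous isolated-environment behavior (the op name is hypothetical):

    from dagster import op
    from dagster_shell import execute_shell_command

    @op
    def run_script(context):
        # Pass env={} to opt out of inheriting the op's environment variables.
        output, return_code = execute_shell_command(
            "echo $HOME", output_logging="STREAM", log=context.log, env={}
        )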

  • [dagster-k8s] Several Dagster features that were previously disabled by default in the Dagster Helm chart are now enabled by default. These features are:

    • The run queue (by default, without a limit). Runs will now always be launched from the Daemon.
    • Run queue parallelism - by default, up to 4 runs can now be pulled off of the queue at a time (as long as the global run limit or tag-based concurrency limits are not exceeded).
    • Run retries - runs will now retry if they have the dagster/max_retries tag set. You can configure a global number of retries in the Helm chart by setting run_retries.max_retries to a value greater than the default of 0.
    • Schedule and sensor parallelism - by default, the daemon will now run up to 4 sensors and up to 4 schedules in parallel.
    • Run monitoring - Dagster will detect hanging runs and move them into a FAILURE state for you (or start a retry for you if the run is configured to allow retries). By default, runs that have been in STARTING for more than 5 minutes will be assumed to be hanging and will be terminated.

    Each of these features can be disabled in the Helm chart to restore the previous behavior.

  • [dagster-k8s] The experimental k8s_job_op op and execute_k8s_job functions no longer automatically include configuration from a dagster-k8s/config tag on the Dagster job in the launched Kubernetes job. To include raw Kubernetes configuration in a k8s_job_op, you can set the container_config, pod_template_spec_metadata, pod_spec_config, or job_metadata config fields on the k8s_job_op (or arguments to the execute_k8s_job function).
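
    For example, a sketch of passing raw Kubernetes configuration explicitly (values are illustrative):

    from dagster import op
    from dagster_k8s import execute_k8s_job

    @op
    def run_in_k8s(context):
        execute_k8s_job(
            context,
            image="busybox",
            command=["/bin/sh", "-c"],
            args=["echo HELLO"],
            # Raw k8s config must now be passed explicitly rather than being
            # read from the job's dagster-k8s/config tag.
            container_config={"resources": {"limits": {"cpu": "500m"}}},
        )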

  • [dagster-databricks] The integration has now been refactored to support the official Databricks API.

    • create_databricks_job_op is now deprecated. To submit one-off runs of Databricks tasks, you must now use the create_databricks_submit_run_op.
    • The Databricks host that is passed to the databricks_client resource must now begin with https://.

Changes to experimental APIs#

  • [experimental] LogicalVersion has been renamed to DataVersion and LogicalVersionProvenance has been renamed to DataProvenance.
  • [experimental] Methods on the experimental DynamicPartitionsDefinition to add, remove, and check for existence of partitions have been removed. Refer to documentation for updated API methods.

Removal of deprecated APIs#

  • [previously deprecated, 0.15.0] Static constructors on MetadataEntry have been removed.
  • [previously deprecated, 1.0.0] DagsterTypeMaterializer, DagsterTypeMaterializerContext, and @dagster_type_materializer have been removed.
  • [previously deprecated, 1.0.0] PartitionScheduleDefinition has been removed.
  • [previously deprecated, 1.0.0] RunRecord.pipeline_run has been removed (use RunRecord.dagster_run).
  • [previously deprecated, 1.0.0] DependencyDefinition.solid has been removed (use DependencyDefinition.node).
  • [previously deprecated, 1.0.0] The pipeline_run argument to build_resources has been removed (use dagster_run)

Community Contributions#

  • Deprecated iteritems usage was removed and replaced with the recommended items within dagster-snowflake-pandas (thanks @sethkimmel3)!
  • Refactored to simplify the new @asset_graph decorator (thanks @simonvanderveldt)!

Experimental#

  • User-computed DataVersions can now be returned on Output
  • Asset provenance info can be accessed via OpExecutionContext.get_asset_provenance

Documentation#

  • The Asset Versioning and Caching Guide now includes a section on user-provided data versions
  • The community contributions doc block Picking a github issue was not rendering correctly; this has been fixed (thanks @Sedosa)!

1.1.21 (core) / 0.17.21 (libraries)#

New#

  • Further performance improvements for build_asset_reconciliation_sensor.

  • Dagster now allows you to backfill asset selections that include mapped partition definitions, such as a daily asset which rolls up into a weekly asset, as long as the root assets in your selection share a partition definition.

  • Dagit now includes information about the cause of an asset’s staleness.

  • Improved the error message for non-matching cron schedules in TimeWindowPartitionMappings with offsets. (Thanks Sean Han!)

  • [dagster-aws] The EcsRunLauncher now allows you to configure the runtimePlatform field for the task definitions of the runs that it launches, allowing it to launch runs using Windows Docker images.

  • [dagster-azure] Added support for DefaultAzureCredential for the adls2_resource (Thanks Martin Picard!)

  • [dagster-databricks] Added op factories to create ops for running existing Databricks jobs (create_databricks_run_now_op), as well as submitting one-off Databricks jobs (create_databricks_submit_run_op). See the new Databricks guide for more details.
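
    A sketch of triggering an existing Databricks job (the job ID and workspace values are placeholders):

    from dagster import job
    from dagster_databricks import create_databricks_run_now_op, databricks_client

    run_existing_job = create_databricks_run_now_op(
        databricks_job_id=123,  # placeholder Databricks job ID
    )

    @job(
        resource_defs={
            "databricks": databricks_client.configured(
                {
                    "host": "https://my-workspace.cloud.databricks.com",
                    "token": {"env": "DATABRICKS_TOKEN"},
                }
            )
        }
    )
    def my_databricks_job():
        run_existing_job()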

  • [dagster-duckdb-polars] Added a dagster-duckdb-polars library that includes a DuckDBPolarsTypeHandler for use with build_duckdb_io_manager, which allows loading / storing Polars DataFrames from/to DuckDB. (Thanks Pezhman Zarabadi-Poor!)

  • [dagster-gcp-pyspark] New PySpark TypeHandler for the BigQuery I/O manager. Store and load your PySpark DataFrames in BigQuery using bigquery_pyspark_io_manager.

  • [dagster-snowflake][dagster-duckdb] The Snowflake and DuckDB IO managers can now load multiple partitions in a single step - e.g. when a non-partitioned asset depends on a partitioned asset or a single partition of an asset depends on multiple partitions of an upstream asset. Loading occurs using a single SQL query and returns a single DataFrame.
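
    For instance, a sketch of a non-partitioned asset reading every partition of an upstream daily asset in a single query (asset names are illustrative):

    import pandas as pd
    from dagster import DailyPartitionsDefinition, asset

    @asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
    def daily_orders() -> pd.DataFrame:
        ...

    @asset
    def orders_summary(daily_orders: pd.DataFrame) -> pd.DataFrame:
        # With the Snowflake/DuckDB I/O managers, `daily_orders` arrives as
        # one DataFrame containing all partitions.
        return daily_orders.describe()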

  • [dagster-k8s] The Helm chart now supports the full kubernetes env var spec for user code deployments. Example:

    dagster-user-deployments:
      deployments:
        - name: my-code
          env:
            - name: FOO
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
    

    If includeConfigInLaunchedRuns is enabled, these env vars will also be applied to the containers for launched runs.

Bugfixes#

  • Previously, if an AssetSelection which matched no assets was passed into define_asset_job, the resulting job would target all assets in the repository. This has been fixed.
  • Fixed a bug that caused the UI to show an error if you tried to preview a future schedule tick for a schedule built using build_schedule_from_partitioned_job.
  • When a non-partitioned non-asset job has an input that comes from a partitioned SourceAsset, we now load all partitions of that asset.
  • Updated the fs_io_manager to store multipartitioned materializations in directory levels by dimension. This resolves a bug on Windows where multipartitioned materializations could not be stored with the fs_io_manager.
  • Schedules and sensors previously timed out when attempting to yield many multipartitioned run requests. This has been fixed.
  • Fixed a bug where context.partition_key would raise an error when executing on a partition range within a single run via Dagit.
  • Fixed a bug that caused the default IO manager to incorrectly raise type errors in some situations with partitioned inputs.
  • [ui] Fixed a bug where partition health would fail to display for certain time window partitions definitions with positive offsets.
  • [ui] Always show the “Reload all” button on the code locations list page, to avoid an issue where the button was not available when adding a second location.
  • [ui] Fixed a bug where users running multiple replicas of dagit would see repeated Definitions reloaded messages on fresh page loads.
  • [ui] The asset graph now shows only the last path component of linked assets for better readability.
  • [ui] The op metadata panel no longer capitalizes metadata keys.
  • [ui] The asset partitions page, asset sidebar, and materialization dialog are significantly smoother when viewing assets with a large number of partitions (100k+).
  • [dagster-gcp-pandas] The Pandas TypeHandler for BigQuery now respects user provided location information.
  • [dagster-snowflake] ProgrammingError was imported from the wrong library, this has been fixed. Thanks @herbert-allium!

Experimental#

  • You can now set an explicit logical version on Output objects rather than using Dagster’s auto-generated versions.
  • New get_asset_provenance method on OpExecutionContext allows fetching logical version provenance for an arbitrary asset key.
  • [ui] You can now create dynamic partitions from the partition selection UI when materializing a dynamically partitioned asset.

1.1.20 (core) / 0.17.20 (libraries)#

New#

  • The new @graph_asset and @graph_multi_asset decorators make it more ergonomic to define graph-backed assets.

  • Dagster will auto-infer dependency relationships between single-dimensionally partitioned assets and multipartitioned assets, when the single-dimensional partitions definition is a dimension of the MultiPartitionsDefinition.
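
    For example, a sketch where the daily dimension of a multipartitioned asset lines up with an upstream daily asset (names are illustrative):

    from dagster import (
        DailyPartitionsDefinition,
        MultiPartitionsDefinition,
        StaticPartitionsDefinition,
        asset,
    )

    daily = DailyPartitionsDefinition(start_date="2023-01-01")

    @asset(partitions_def=daily)
    def events():
        ...

    @asset(
        partitions_def=MultiPartitionsDefinition(
            {"date": daily, "region": StaticPartitionsDefinition(["us", "eu"])}
        )
    )
    def events_by_region(events):
        # The dependency on the matching "date" dimension is inferred.
        ...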

  • New Test sensor and Test schedule buttons allow you to perform a dry run of your sensor or schedule. Check out the docs on this functionality here for sensors and here for schedules.

  • [dagit] Added (back) tag autocompletion in the runs filter, now with improved query performance.

  • [dagit] The Dagster libraries and their versions that were used when loading definitions can now be viewed in the actions menu for each code location.

  • The new bigquery_pandas_io_manager can store and load Pandas DataFrames in BigQuery.
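
    A sketch of wiring it up as the default I/O manager (the config keys shown are assumptions; see the BigQuery I/O manager docs for the exact schema):

    import pandas as pd
    from dagster import Definitions, asset
    from dagster_gcp_pandas import bigquery_pandas_io_manager

    @asset
    def users() -> pd.DataFrame:
        return pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

    defs = Definitions(
        assets=[users],
        resources={
            "io_manager": bigquery_pandas_io_manager.configured(
                {"project": "my-gcp-project", "dataset": "my_dataset"}
            )
        },
    )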

  • [dagster-snowflake, dagster-duckdb] SnowflakeIOManagers and DuckDBIOManagers can now default to loading inputs as a specified type if a type annotation does not exist for the input.

  • [dagster-dbt] Added the ability to use the “state:” selector

  • [dagster-k8s] The Helm chart now supports the full kubernetes env var spec for Dagit and the Daemon. E.g.

    dagit:
      env:
        - name: FOO
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
    

Bugfixes#

  • Previously, graphs would fail to resolve an input with a custom type and an input manager key. This has been fixed.
  • Fixed a bug where negative partition counts were displayed in the asset graph.
  • Previously, when an asset sensor did not yield run requests, it returned an empty result. This has been updated to yield a meaningful message.
  • Fixed an issue where a non-partitioned asset downstream of a partitioned asset with self-dependencies caused a GQL error in Dagit.
  • [dagster-snowflake-pyspark] Fixed a bug where the PySparkTypeHandler was incorrectly loading partitioned data.
  • [dagster-k8s] Fixed an issue where run monitoring sometimes failed to detect that the kubernetes job for a run had stopped, leaving the run hanging.

Documentation#

  • Updated contributor docs to reference our new toolchain (ruff, pyright).
  • Added documentation for the dynamic partitions definition (experimental).
  • [dagster-snowflake] The Snowflake I/O Manager reference page now includes information on working with partitioned assets.

1.1.19 (core) / 0.17.19 (libraries)#

New#

  • The FreshnessPolicy object now supports a cron_schedule_timezone argument.
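    For example (schedule values are illustrative):

    from dagster import FreshnessPolicy, asset

    @asset(
        freshness_policy=FreshnessPolicy(
            maximum_lag_minutes=60,
            cron_schedule="0 9 * * *",
            cron_schedule_timezone="America/Chicago",
        )
    )
    def morning_report():
        ...
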
  • AssetsDefinition.from_graph now supports a freshness_policies_by_output_name parameter.
  • The @asset_sensor will now display an informative SkipReason when no new materializations have been created since the last sensor tick.
  • AssetsDefinition now has a to_source_asset method, which returns a representation of this asset as a SourceAsset.
  • You can now designate assets as inputs to ops within a graph or graph-based job. E.g.

    from dagster import asset, job, op

    @asset
    def emails_to_send():
        ...

    @op
    def send_emails(emails) -> None:
        ...

    @job
    def send_emails_job():
        send_emails(emails_to_send.to_source_asset())

  • Added a --dagit-host/-h argument to the dagster dev command to allow customization of the host where Dagit runs.
  • [dagster-snowflake, dagster-duckdb] Database I/O managers (Snowflake, DuckDB) now support static partitions, multi-partitions, and dynamic partitions.

Bugfixes#

  • Previously, if a description was provided for an op that backed a multi-asset, the op’s description would override the descriptions in Dagit for the individual assets. This has been fixed.
  • Sometimes, when applying an input_manager_key to an asset’s input, incorrect resource config could be used when loading that input. This has been fixed.
  • Previously, the backfill page errored when partitions definitions changed for assets that had been backfilled. This has been fixed.
  • When displaying materialized partitions for multipartitioned assets, Dagit would error if a dimension had zero partitions. This has been fixed.
  • [dagster-k8s] Fixed an issue where setting runK8sConfig in the Dagster Helm chart would not pass configuration through to pods launched using the k8s_job_executor.
  • [dagster-k8s] Previously, using the execute_k8s_job op downstream of a dynamic output would result in k8s jobs with duplicate names being created. This has been fixed.
  • [dagster-snowflake] Previously, if the schema for storing outputs didn’t exist, the Snowflake I/O manager would fail. Now it creates the schema.

Breaking Changes#

  • Removed the experimental, undocumented asset_key, asset_partitions, and asset_partitions_defs arguments on Out.
  • @multi_asset no longer accepts Out values in the dictionary passed to its outs argument. This was experimental and deprecated. Instead, use AssetOut.
  • The experimental, undocumented top_level_resources argument to the repository decorator has been renamed to _top_level_resources to emphasize that it should not be set manually.

Community Contributions#

  • load_asset_values now accepts resource configuration (thanks @Nintorac!)
  • Previously, when using the UPathIOManager, paths with the "." character in them would be incorrectly truncated, which could result in multiple distinct objects being written to the same path. This has been fixed. (Thanks @spenczar!)

Experimental#

  • [dagster-dbt] Added documentation to our dbt Cloud integration to cache the loading of software-defined assets from a dbt Cloud job.

Documentation#

  • Revamped the introduction to the Partitions concepts page to make it clear that non-time-window partitions are equally encouraged.
  • In Navigation, moved the Partitions and Backfill concept pages to their own section underneath Concepts.
  • Moved the Running Dagster locally guide from Deployment to Guides to reflect that OSS and Cloud users can follow it.
  • Added a new guide covering asset versioning and caching.