
Unity / Databricks

The rivet-databricks plugin provides a compute engine (databricks) and two catalog plugins (unity for Unity Catalog REST API, databricks for Databricks-managed catalogs).

```shell
pip install 'rivetsql[databricks]'
```

Engine Configuration

The Databricks engine executes SQL via the Statement Execution API against a SQL warehouse.
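Concretely, a statement submission boils down to one REST call. The sketch below builds such a request for the Statement Execution API endpoint (POST /api/2.0/sql/statements); the `build_statement_request` helper and payload defaults are illustrative assumptions, not the plugin's internals:

```python
import json

# Illustrative sketch of the request shape sent to the Databricks
# Statement Execution API. Helper name and defaults are assumptions;
# the plugin's actual internals may differ.
def build_statement_request(workspace_url, warehouse_id, sql, token,
                            wait_timeout="30s"):
    url = f"{workspace_url}/api/2.0/sql/statements"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "warehouse_id": warehouse_id,
        "statement": sql,
        "wait_timeout": wait_timeout,      # server-side wait before polling
        "disposition": "EXTERNAL_LINKS",   # stream large results as chunks
        "format": "ARROW_STREAM",
    }
    return url, headers, json.dumps(body)
```

The returned triple would be handed to any HTTP client; statements that exceed `wait_timeout` are then polled by statement ID.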

```yaml
default:
  engines:
    - name: dbx
      type: databricks
      options:
        warehouse_id: abc123def456
        workspace_url: https://my-workspace.cloud.databricks.com
        token: ${DATABRICKS_TOKEN}
        wait_timeout: "30s"
        max_rows_per_chunk: 100000
        concurrency_limit: 4
      catalogs: [unity_catalog, dbx_catalog]
```
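The `${DATABRICKS_TOKEN}` reference above is resolved from the environment when the config is loaded. A minimal sketch of that substitution, assuming the common `${VAR}` convention (the `expand_env` helper is illustrative, not part of the plugin API):

```python
import os
import re

# Matches ${VAR} placeholders like ${DATABRICKS_TOKEN} in config values.
_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace each ${VAR} with os.environ[VAR]; leave unset vars intact."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
```

Leaving unset variables untouched (rather than substituting an empty string) makes missing-credential errors surface loudly at connect time.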

Engine Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| warehouse_id | yes | str | | SQL warehouse ID |
| workspace_url | yes | str | | Workspace URL (https://...) |
| token | yes | str | | Personal access token |
| wait_timeout | no | str | "30s" | Statement execution timeout |
| max_rows_per_chunk | no | int | 100000 | Max rows per Arrow chunk |
| concurrency_limit | no | int | 1 | Max fused groups executing in parallel on this engine. Set higher to run independent pipeline branches concurrently against the warehouse. See Parallel Execution. |
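The effect of concurrency_limit can be pictured as a semaphore gating how many fused groups are in flight against the warehouse at once. A minimal sketch (`run_groups` and its arguments are hypothetical names, not the plugin's scheduler):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Sketch: a semaphore bounds how many fused groups execute at once.
# Illustrative only; the plugin's real scheduler may differ.
def run_groups(groups, execute, concurrency_limit=1):
    gate = threading.Semaphore(concurrency_limit)
    results = {}

    def run_one(name, sql):
        with gate:                 # blocks once the limit is reached
            results[name] = execute(sql)

    with ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
        for name, sql in groups.items():
            pool.submit(run_one, name, sql)
    # exiting the with-block waits for all submitted groups
    return results
```

With `concurrency_limit: 1` (the default) groups serialize; raising it lets independent branches overlap their warehouse round-trips.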

Supported Write Strategies

All seven: append, replace, truncate_insert, merge, delete_insert, incremental_append, scd2
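To make the merge strategy concrete, here is the general shape of the MERGE statement it implies on Databricks SQL. The `build_merge_sql` helper is illustrative; the plugin's generated SQL may differ in detail:

```python
# Sketch of the MERGE statement shape implied by the merge write
# strategy on Databricks SQL. Illustrative helper, not plugin API.
def build_merge_sql(target, source, keys, columns):
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    sets = ", ".join(f"t.{c} = s.{c}" for c in columns if c not in keys)
    cols = ", ".join(columns)
    vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

The other strategies reduce to plainer statements (INSERT INTO for append, CREATE OR REPLACE plus INSERT for replace, DELETE plus INSERT for delete_insert, and so on).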

Cross-Engine Adapters

| Adapter | Requires | Description |
| --- | --- | --- |
| DatabricksUnityAdapter | | Read/write Unity tables through Databricks (supports native SQL write) |
| DatabricksAdapter | | Read/write Databricks-managed tables (supports native SQL write) |
| DatabricksDuckDBAdapter | duckdb | Read Databricks/Unity tables from local DuckDB |
| DatabricksCrossJointAdapter | | Cross-engine joins |

Both DatabricksAdapter and DatabricksUnityAdapter support native SQL write for replace, append, and truncate_insert strategies — the fused SQL is executed directly on the Databricks SQL Warehouse, eliminating the Arrow round-trip. See Native SQL Write Optimization for details.


Unity Catalog

```yaml
default:
  catalogs:
    - name: unity_catalog
      type: unity
      options:
        host: https://my-workspace.cloud.databricks.com
        catalog_name: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```

Unity Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| host | yes | str | | Unity Catalog server URL |
| catalog_name | yes | str | | Catalog name |
| schema | no | str | None | Default schema. When set, restricts explore/REPL and source declarations to this schema only. |

Complex Type Support

Unity Catalog supports complex types through schema introspection:

  • Arrays: array<T> syntax (e.g., array<string>, array<bigint>)
  • Structs: struct<field:type,...> syntax (e.g., struct<name:string,age:int>)
  • Nested types: Arbitrary nesting supported (e.g., array<struct<...>>, struct<field:array<...>>)

Complex types are automatically mapped to Arrow types during schema introspection. See Complex Type Support for details.
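The mapping can be sketched with a small recursive parser over these type strings. This version returns plain Python tuples instead of Arrow types to stay dependency-free; `parse_type` and `split_top` are illustrative names, not plugin API:

```python
# Illustrative parser for Unity Catalog complex type strings such as
# array<string> or struct<name:string,age:int>. The plugin maps these
# to Arrow types; plain tuples keep this sketch self-contained.
def parse_type(s: str):
    s = s.strip()
    if s.startswith("array<") and s.endswith(">"):
        return ("array", parse_type(s[6:-1]))
    if s.startswith("struct<") and s.endswith(">"):
        fields = []
        for part in split_top(s[7:-1]):
            name, _, typ = part.partition(":")
            fields.append((name.strip(), parse_type(typ)))
        return ("struct", fields)
    return s  # primitive: string, bigint, int, ...

def split_top(s: str):
    """Split on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts
```

Tracking bracket depth in `split_top` is what makes arbitrary nesting (array<struct<...>>, struct<field:array<...>>) fall out of the recursion for free.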

Unity Credentials

| Option | Description |
| --- | --- |
| token | Personal access token (env: DATABRICKS_TOKEN) |
| client_id | OAuth M2M client ID |
| client_secret | OAuth M2M client secret |

Auth types: pat, oauth_m2m, azure_cli, gcp_login. Resolves via: explicit options → env vars → ~/.databrickscfg → cloud-native auth.
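That resolution order can be sketched as follows. The `resolve_token` helper is hypothetical and omits the cloud-native steps (azure_cli, gcp_login) for brevity:

```python
import configparser
import os

# Sketch of the resolution order described above:
# explicit option -> DATABRICKS_TOKEN env var -> ~/.databrickscfg.
# Illustrative helper; cloud-native auth fallbacks are omitted.
def resolve_token(explicit=None, cfg_path="~/.databrickscfg", profile="DEFAULT"):
    if explicit:
        return explicit
    if os.environ.get("DATABRICKS_TOKEN"):
        return os.environ["DATABRICKS_TOKEN"]
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(cfg_path))
    # configparser surfaces the [DEFAULT] section via cfg.defaults()
    if profile == "DEFAULT":
        return cfg.defaults().get("token")
    if cfg.has_section(profile):
        return cfg[profile].get("token")
    return None
```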


Databricks Catalog

```yaml
default:
  catalogs:
    - name: dbx_catalog
      type: databricks
      options:
        workspace_url: https://my-workspace.cloud.databricks.com
        catalog: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```

Databricks Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| workspace_url | yes | str | | Workspace URL |
| catalog | yes | str | | Catalog name |
| schema | no | str | None | Default schema for writes. When set, restricts explore/REPL and source declarations to this schema only. |
| http_path | no | str | None | SQL warehouse HTTP path |

Usage Examples

Source from Unity

As SQL annotations:

```sql
-- rivet:name: raw_sales
-- rivet:type: source
-- rivet:catalog: unity_catalog
-- rivet:table: main.sales.transactions
```

As YAML:

```yaml
name: raw_sales
type: source
catalog: unity_catalog
table: main.sales.transactions
```

Or in Python:

```python
from rivet_core.models import Joint

raw_sales = Joint(
    name="raw_sales",
    joint_type="source",
    catalog="unity_catalog",
    table="main.sales.transactions",
)
```

Sink to Databricks

As SQL annotations:

```sql
-- rivet:name: write_summary
-- rivet:type: sink
-- rivet:upstream: daily_summary
-- rivet:catalog: dbx_catalog
-- rivet:table: main.analytics.daily_summary
-- rivet:write_strategy: merge
-- rivet:merge_keys: date_key
```

As YAML:

```yaml
name: write_summary
type: sink
upstream: daily_summary
catalog: dbx_catalog
table: main.analytics.daily_summary
write_strategy:
  mode: merge
  key_columns: [date_key]
```

Or in Python:

```python
from rivet_core.models import Joint

write_summary = Joint(
    name="write_summary",
    joint_type="sink",
    upstream=["daily_summary"],
    catalog="dbx_catalog",
    table="main.analytics.daily_summary",
    write_strategy="merge",
)
```

Known Limitations

  • Network dependency — Requires Databricks workspace connectivity. Subject to warehouse auto-scaling delays.
  • Warehouse startup — Serverless/auto-stopped warehouses may take 30-120s. Adjust wait_timeout.
  • Result size — large result sets are fetched via the EXTERNAL_LINKS disposition, which streams Arrow IPC chunks through pre-signed URLs.
  • Auth complexity — Multiple methods (PAT, OAuth M2M, Azure AD, GCP). Ensure correct credentials for your workspace type.
  • Three-part table names — Unity Catalog uses catalog.schema.table. Include all three parts when needed.
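For the last point, resolving partial names against catalog defaults can be sketched like this (`split_table_name` is an illustrative helper, not plugin API):

```python
# Sketch: split a Unity three-part name into (catalog, schema, table),
# filling omitted parts from catalog defaults. Illustrative only.
def split_table_name(name, default_catalog=None, default_schema=None):
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return (default_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (default_catalog, default_schema, parts[0])
    raise ValueError(f"invalid table name: {name!r}")
```

A one- or two-part name only works when the catalog config supplies the missing pieces; fully qualified names like main.sales.transactions are always unambiguous.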