
Unity / Databricks

The rivet-databricks plugin provides a compute engine (databricks) and two catalog plugins (unity for Unity Catalog REST API, databricks for Databricks-managed catalogs).

```shell
pip install 'rivetsql[databricks]'
```

Engine Configuration

The Databricks engine executes SQL via the Statement Execution API against a SQL warehouse.
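Concretely, a statement submission boils down to one REST call. The sketch below builds such a request for the Statement Execution API endpoint (POST /api/2.0/sql/statements); the `build_statement_request` helper and payload defaults are illustrative assumptions, not the plugin's internals:

```python
import json

# Illustrative sketch of the request shape sent to the Databricks
# Statement Execution API. Helper name and defaults are assumptions;
# the plugin's actual internals may differ.
def build_statement_request(workspace_url, warehouse_id, sql, token,
                            wait_timeout="30s"):
    url = f"{workspace_url}/api/2.0/sql/statements"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "warehouse_id": warehouse_id,
        "statement": sql,
        "wait_timeout": wait_timeout,      # server-side wait before polling
        "disposition": "EXTERNAL_LINKS",   # stream large results as chunks
        "format": "ARROW_STREAM",
    }
    return url, headers, json.dumps(body)
```

The returned triple would be handed to any HTTP client; statements that exceed `wait_timeout` are then polled by statement ID.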

```yaml
default:
  engines:
    - name: dbx
      type: databricks
      options:
        warehouse_id: abc123def456
        workspace_url: https://my-workspace.cloud.databricks.com
        token: ${DATABRICKS_TOKEN}
        wait_timeout: "30s"
        max_rows_per_chunk: 100000
        concurrency_limit: 4
      catalogs: [unity_catalog, dbx_catalog]
```
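The `${DATABRICKS_TOKEN}` reference above is resolved from the environment when the config is loaded. A minimal sketch of that substitution, assuming the common `${VAR}` convention (the `expand_env` helper is illustrative, not part of the plugin API):

```python
import os
import re

# Matches ${VAR} placeholders like ${DATABRICKS_TOKEN} in config values.
_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace each ${VAR} with os.environ[VAR]; leave unset vars intact."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
```

Leaving unset variables untouched (rather than substituting an empty string) makes missing-credential errors surface loudly at connect time.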

Engine Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| warehouse_id | yes | str | | SQL warehouse ID |
| workspace_url | yes | str | | Workspace URL (https://...) |
| token | yes | str | | Personal access token |
| wait_timeout | no | str | "30s" | Statement execution timeout |
| max_rows_per_chunk | no | int | 100000 | Max rows per Arrow chunk |
| concurrency_limit | no | int | 1 | Max fused groups executing in parallel on this engine. Set higher to run independent pipeline branches concurrently against the warehouse. See Parallel Execution. |
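The effect of concurrency_limit can be pictured as a semaphore gating how many fused groups are in flight against the warehouse at once. A minimal sketch (`run_groups` and its arguments are hypothetical names, not the plugin's scheduler):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Sketch: a semaphore bounds how many fused groups execute at once.
# Illustrative only; the plugin's real scheduler may differ.
def run_groups(groups, execute, concurrency_limit=1):
    gate = threading.Semaphore(concurrency_limit)
    results = {}

    def run_one(name, sql):
        with gate:                 # blocks once the limit is reached
            results[name] = execute(sql)

    with ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
        for name, sql in groups.items():
            pool.submit(run_one, name, sql)
    # exiting the with-block waits for all submitted groups
    return results
```

With `concurrency_limit: 1` (the default) groups serialize; raising it lets independent branches overlap their warehouse round-trips.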

Supported Write Strategies

All seven: append, replace, truncate_insert, merge, delete_insert, incremental_append, scd2
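To make the merge strategy concrete, here is the general shape of the MERGE statement it implies on Databricks SQL. The `build_merge_sql` helper is illustrative; the plugin's generated SQL may differ in detail:

```python
# Sketch of the MERGE statement shape implied by the merge write
# strategy on Databricks SQL. Illustrative helper, not plugin API.
def build_merge_sql(target, source, keys, columns):
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    sets = ", ".join(f"t.{c} = s.{c}" for c in columns if c not in keys)
    cols = ", ".join(columns)
    vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

The other strategies reduce to plainer statements (INSERT INTO for append, CREATE OR REPLACE plus INSERT for replace, DELETE plus INSERT for delete_insert, and so on).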

Cross-Engine Adapters

| Adapter | Requires | Description |
| --- | --- | --- |
| DatabricksUnityAdapter | | Read/write Unity tables through Databricks (supports native SQL write) |
| DatabricksAdapter | | Read/write Databricks-managed tables (supports native SQL write) |
| DatabricksDuckDBAdapter | duckdb | Read Databricks/Unity tables from local DuckDB |
| DatabricksCrossJointAdapter | | Cross-engine joins |

Both DatabricksAdapter and DatabricksUnityAdapter support native SQL write for replace, append, and truncate_insert strategies — the fused SQL is executed directly on the Databricks SQL Warehouse, eliminating the Arrow round-trip. See Native SQL Write Optimization for details.


Unity Catalog

```yaml
default:
  catalogs:
    - name: unity_catalog
      type: unity
      options:
        host: https://my-workspace.cloud.databricks.com
        catalog_name: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```

Unity Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| host | yes | str | | Unity Catalog server URL |
| catalog_name | yes | str | | Catalog name |
| schema | no | str | None | Default schema. When set, restricts explore/REPL and source declarations to this schema only. |

Complex Type Support

Unity Catalog supports complex types through schema introspection:

  • Arrays: array<T> syntax (e.g., array<string>, array<bigint>)
  • Structs: struct<field:type,...> syntax (e.g., struct<name:string,age:int>)
  • Nested types: Arbitrary nesting supported (e.g., array<struct<...>>, struct<field:array<...>>)

Complex types are automatically mapped to Arrow types during schema introspection. See Complex Type Support for details.
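The mapping can be sketched with a small recursive parser over these type strings. This version returns plain Python tuples instead of Arrow types to stay dependency-free; `parse_type` and `split_top` are illustrative names, not plugin API:

```python
# Illustrative parser for Unity Catalog complex type strings such as
# array<string> or struct<name:string,age:int>. The plugin maps these
# to Arrow types; plain tuples keep this sketch self-contained.
def parse_type(s: str):
    s = s.strip()
    if s.startswith("array<") and s.endswith(">"):
        return ("array", parse_type(s[6:-1]))
    if s.startswith("struct<") and s.endswith(">"):
        fields = []
        for part in split_top(s[7:-1]):
            name, _, typ = part.partition(":")
            fields.append((name.strip(), parse_type(typ)))
        return ("struct", fields)
    return s  # primitive: string, bigint, int, ...

def split_top(s: str):
    """Split on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts
```

Tracking bracket depth in `split_top` is what makes arbitrary nesting (array<struct<...>>, struct<field:array<...>>) fall out of the recursion for free.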

Unity Credentials

| Option | Description |
| --- | --- |
| token | Personal access token (env: DATABRICKS_TOKEN) |
| client_id | OAuth M2M client ID |
| client_secret | OAuth M2M client secret |

Auth types: pat, oauth_m2m, azure_cli, gcp_login. Resolves via: explicit options → env vars → ~/.databrickscfg → cloud-native auth.
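That resolution order can be sketched as follows. The `resolve_token` helper is hypothetical and omits the cloud-native steps (azure_cli, gcp_login) for brevity:

```python
import configparser
import os

# Sketch of the resolution order described above:
# explicit option -> DATABRICKS_TOKEN env var -> ~/.databrickscfg.
# Illustrative helper; cloud-native auth fallbacks are omitted.
def resolve_token(explicit=None, cfg_path="~/.databrickscfg", profile="DEFAULT"):
    if explicit:
        return explicit
    if os.environ.get("DATABRICKS_TOKEN"):
        return os.environ["DATABRICKS_TOKEN"]
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(cfg_path))
    # configparser surfaces the [DEFAULT] section via cfg.defaults()
    if profile == "DEFAULT":
        return cfg.defaults().get("token")
    if cfg.has_section(profile):
        return cfg[profile].get("token")
    return None
```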


Databricks Catalog

```yaml
default:
  catalogs:
    - name: dbx_catalog
      type: databricks
      options:
        workspace_url: https://my-workspace.cloud.databricks.com
        catalog: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```

Databricks Options

| Option | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| workspace_url | yes | str | | Workspace URL |
| catalog | yes | str | | Catalog name |
| schema | no | str | None | Default schema for writes. When set, restricts explore/REPL and source declarations to this schema only. |
| http_path | no | str | None | SQL warehouse HTTP path |

Usage Examples

Source from Unity

As SQL annotations:

```sql
-- rivet:name: raw_sales
-- rivet:type: source
-- rivet:catalog: unity_catalog
-- rivet:table: main.sales.transactions
```

As YAML:

```yaml
name: raw_sales
type: source
catalog: unity_catalog
table: main.sales.transactions
```

Or in Python:

```python
from rivet_core.models import Joint

raw_sales = Joint(
    name="raw_sales",
    joint_type="source",
    catalog="unity_catalog",
    table="main.sales.transactions",
)
```

Sink to Databricks

As SQL annotations:

```sql
-- rivet:name: write_summary
-- rivet:type: sink
-- rivet:upstream: daily_summary
-- rivet:catalog: dbx_catalog
-- rivet:table: main.analytics.daily_summary
-- rivet:write_strategy: merge
-- rivet:merge_keys: date_key
```

As YAML:

```yaml
name: write_summary
type: sink
upstream: daily_summary
catalog: dbx_catalog
table: main.analytics.daily_summary
write_strategy:
  mode: merge
  key_columns: [date_key]
```

Or in Python:

```python
from rivet_core.models import Joint

write_summary = Joint(
    name="write_summary",
    joint_type="sink",
    upstream=["daily_summary"],
    catalog="dbx_catalog",
    table="main.analytics.daily_summary",
    write_strategy="merge",
)
```

Known Limitations

  • Network dependency — Requires Databricks workspace connectivity. Subject to warehouse auto-scaling delays.
  • Warehouse startup — Serverless/auto-stopped warehouses may take 30-120s. Adjust wait_timeout.
  • Result size — large result sets are fetched via the EXTERNAL_LINKS disposition, which streams Arrow IPC chunks through pre-signed URLs.
  • Auth complexity — Multiple methods (PAT, OAuth M2M, Azure AD, GCP). Ensure correct credentials for your workspace type.
  • Three-part table names — Unity Catalog uses catalog.schema.table. Include all three parts when needed.
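For the last point, resolving partial names against catalog defaults can be sketched like this (`split_table_name` is an illustrative helper, not plugin API):

```python
# Sketch: split a Unity three-part name into (catalog, schema, table),
# filling omitted parts from catalog defaults. Illustrative only.
def split_table_name(name, default_catalog=None, default_schema=None):
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return (default_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (default_catalog, default_schema, parts[0])
    raise ValueError(f"invalid table name: {name!r}")
```

A one- or two-part name only works when the catalog config supplies the missing pieces; fully qualified names like main.sales.transactions are always unambiguous.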