# Unity / Databricks
The `rivet-databricks` plugin provides a compute engine (`databricks`) and two catalog plugins: `unity` for the Unity Catalog REST API and `databricks` for Databricks-managed catalogs.
## Engine Configuration
The Databricks engine executes SQL via the Statement Execution API against a SQL warehouse.
```yaml
default:
  engines:
    - name: dbx
      type: databricks
      options:
        warehouse_id: abc123def456
        workspace_url: https://my-workspace.cloud.databricks.com
        token: ${DATABRICKS_TOKEN}
        wait_timeout: "30s"
        max_rows_per_chunk: 100000
        concurrency_limit: 4
      catalogs: [unity_catalog, dbx_catalog]
```
### Engine Options
| Option | Required | Type | Default | Description |
|---|---|---|---|---|
| `warehouse_id` | yes | str | — | SQL warehouse ID |
| `workspace_url` | yes | str | — | Workspace URL (`https://...`) |
| `token` | yes | str | — | Personal access token |
| `wait_timeout` | no | str | `"30s"` | Statement execution timeout |
| `max_rows_per_chunk` | no | int | `100000` | Max rows per Arrow chunk |
| `concurrency_limit` | no | int | `1` | Max fused groups executing in parallel on this engine. Set higher to run independent pipeline branches concurrently against the warehouse. See Parallel Execution. |
## Supported Write Strategies
All seven strategies are supported: `append`, `replace`, `truncate_insert`, `merge`, `delete_insert`, `incremental_append`, `scd2`.
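As an illustration, a sink using one of these strategies might be declared as follows; the `models`, `write`, and `unique_key` key names are assumptions for this sketch, not confirmed by this page:

```yaml
# Sketch only: the models/write/unique_key key names are
# illustrative, not taken from this page
models:
  - name: orders_silver
    engine: dbx                      # the engine defined above
    write:
      strategy: merge                # any of the seven strategies
      unique_key: [order_id]         # merge-style strategies need a key
      target: main.prod_silver.orders
```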
## Cross-Engine Adapters
| Adapter | Requires | Description |
|---|---|---|
| `DatabricksUnityAdapter` | — | Read/write Unity tables through Databricks (supports native SQL write) |
| `DatabricksAdapter` | — | Read/write Databricks-managed tables (supports native SQL write) |
| `DatabricksDuckDBAdapter` | `duckdb` | Read Databricks/Unity tables from local DuckDB |
| `DatabricksCrossJointAdapter` | — | Cross-engine joins |
Both `DatabricksAdapter` and `DatabricksUnityAdapter` support native SQL write for the `replace`, `append`, and `truncate_insert` strategies — the fused SQL is executed directly on the Databricks SQL warehouse, eliminating the Arrow round-trip. See Native SQL Write Optimization for details.
## Unity Catalog
```yaml
default:
  catalogs:
    - name: unity_catalog
      type: unity
      options:
        host: https://my-workspace.cloud.databricks.com
        catalog_name: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```
### Unity Options
| Option | Required | Type | Default | Description |
|---|---|---|---|---|
| `host` | yes | str | — | Unity Catalog server URL |
| `catalog_name` | yes | str | — | Catalog name |
| `schema` | no | str | `None` | Default schema. When set, restricts explore/REPL and source declarations to this schema only. |
### Complex Type Support
Unity Catalog supports complex types through schema introspection:
- **Arrays**: `array<T>` syntax (e.g., `array<string>`, `array<bigint>`)
- **Structs**: `struct<field:type,...>` syntax (e.g., `struct<name:string,age:int>`)
- **Nested types**: arbitrary nesting supported (e.g., `array<struct<...>>`, `struct<field:array<...>>`)
Complex types are automatically mapped to Arrow types during schema introspection. See Complex Type Support for details.
### Unity Credentials
| Option | Description |
|---|---|
| `token` | Personal access token (env: `DATABRICKS_TOKEN`) |
| `client_id` | OAuth M2M client ID |
| `client_secret` | OAuth M2M client secret |
Auth types: `pat`, `oauth_m2m`, `azure_cli`, `gcp_login`. Credentials resolve in order: explicit options → environment variables → `~/.databrickscfg` → cloud-native auth.
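For example, the Unity catalog could authenticate with OAuth M2M instead of a PAT. This is a sketch that assumes `client_id` and `client_secret` are accepted in the same `options` block as `token`:

```yaml
# Sketch: OAuth M2M credentials in place of a PAT; assumes
# client_id/client_secret sit alongside the other options
default:
  catalogs:
    - name: unity_catalog
      type: unity
      options:
        host: https://my-workspace.cloud.databricks.com
        catalog_name: main
        client_id: ${DATABRICKS_CLIENT_ID}
        client_secret: ${DATABRICKS_CLIENT_SECRET}
```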
## Databricks Catalog
```yaml
default:
  catalogs:
    - name: dbx_catalog
      type: databricks
      options:
        workspace_url: https://my-workspace.cloud.databricks.com
        catalog: main
        # schema: prod_silver  # optional — restricts explore/sources to this schema
        token: ${DATABRICKS_TOKEN}
```
### Databricks Options
| Option | Required | Type | Default | Description |
|---|---|---|---|---|
| `workspace_url` | yes | str | — | Workspace URL |
| `catalog` | yes | str | — | Catalog name |
| `schema` | no | str | `None` | Default schema for writes. When set, restricts explore/REPL and source declarations to this schema only. |
| `http_path` | no | str | `None` | SQL warehouse HTTP path |
## Usage Examples
### Source from Unity
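A minimal sketch of a source declaration that references the Unity catalog defined above by name; the `sources`, `catalog`, and `table` key names are illustrative, not taken from this page:

```yaml
# Sketch of a source declaration; key names are illustrative
sources:
  - name: raw_orders
    catalog: unity_catalog            # the Unity catalog defined above
    table: main.prod_bronze.orders    # three-part name: catalog.schema.table
```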
### Sink to Databricks
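A hypothetical sink writing through the Databricks catalog might look like this; the `models` and `write` keys are assumptions for illustration, not documented on this page:

```yaml
# Sketch of a sink; key names are illustrative
models:
  - name: orders_clean
    engine: dbx
    catalog: dbx_catalog
    write:
      strategy: truncate_insert       # eligible for native SQL write
      target: main.prod_silver.orders_clean
```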
## Known Limitations
- **Network dependency** — requires Databricks workspace connectivity and is subject to warehouse auto-scaling delays.
- **Warehouse startup** — serverless and auto-stopped warehouses may take 30–120 s to start. Adjust `wait_timeout` accordingly.
- **Result size** — the `EXTERNAL_LINKS` disposition streams Arrow IPC chunks via pre-signed URLs.
- **Auth complexity** — multiple methods (PAT, OAuth M2M, Azure AD, GCP). Ensure the credentials match your workspace type.
- **Three-part table names** — Unity Catalog uses `catalog.schema.table`. Include all three parts when needed.