Compilation¶

Compilation transforms a set of joint declarations into an immutable CompiledAssembly — the single source of truth that drives execution, CLI display, testing, and inspection.

Compilation is pure

compile() performs no data reads, no data writes, and no runtime introspection. Given the same inputs, it always produces the same output.

The Pipeline¶

graph LR
    A[Config Parsing] --> B[Bridge Forward]
    B --> C[Assembly Building]
    C --> D[Compilation]
    D --> E[Execution]

    style A fill:#6c63ff,color:#fff,stroke:none
    style B fill:#7c74ff,color:#fff,stroke:none
    style C fill:#818cf8,color:#fff,stroke:none
    style D fill:#3b82f6,color:#fff,stroke:none
    style E fill:#2563eb,color:#fff,stroke:none

Stage	What happens
Config Parsing	Read `rivet.yaml` and `profiles.yaml`, resolve profile, validate schemas
Bridge Forward	Instantiate catalog and engine objects, resolve plugin entry points
Assembly Building	Collect joints, resolve upstream references, build the DAG
Compilation	Validate DAG, assign execution order, fuse adjacent joints, produce `CompiledAssembly`
Execution	Follow `execution_order` exactly — no re-resolution at runtime

Stage 1: Config Parsing¶

Rivet reads two configuration files:

rivet.yaml — project manifest: directory paths for sources, joints, sinks, tests, and profiles
profiles.yaml — environment-specific: catalogs, engines, default engine, credentials

The active profile is selected by --profile flag or RIVET_PROFILE environment variable.

Stage 2: Bridge Forward¶

rivet_bridge converts raw config into live objects:

Catalog configs become Catalog model instances
Engine configs become ComputeEngine instances with resolved plugin types
Plugin entry points are resolved via the plugin registry

This is the only stage where plugin code is loaded. After bridge forward, the rest of the pipeline works with pure data models.

Stage 3: Assembly Building¶

The Assembly validates structural integrity of the joint DAG:

Joint names must be globally unique
All upstream references must resolve to existing joints
Source joints must have no upstream
Sink joints must have at least one upstream
The graph must be acyclic

Violations raise an AssemblyError with a structured RivetError containing the error code, message, and remediation hint.

SQLYAMLRivet API

-- rivet:name: raw_orders
-- rivet:type: source
-- rivet:catalog: local
-- rivet:table: orders

-- rivet:name: daily_revenue
-- rivet:type: sql
-- rivet:upstream: raw_orders
SELECT order_date, SUM(amount) AS revenue
FROM raw_orders
GROUP BY order_date

-- rivet:name: revenue_sink
-- rivet:type: sink
-- rivet:upstream: daily_revenue
-- rivet:catalog: warehouse
-- rivet:table: daily_revenue
-- rivet:write_strategy: replace

# sources/raw_orders.yaml
name: raw_orders
type: source
catalog: local
table: orders

# joints/daily_revenue.yaml
name: daily_revenue
type: sql
upstream: [raw_orders]
sql: |
  SELECT order_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY order_date

# sinks/revenue_sink.yaml
name: revenue_sink
type: sink
upstream: daily_revenue
catalog: warehouse
table: daily_revenue
write_strategy: replace

from rivet_core.models import Joint

raw_orders = Joint(
    name="raw_orders",
    joint_type="source",
    catalog="local",
    table="orders",
)

daily_revenue = Joint(
    name="daily_revenue",
    joint_type="sql",
    upstream=["raw_orders"],
    sql="SELECT order_date, SUM(amount) AS revenue FROM raw_orders GROUP BY order_date",
)

revenue_sink = Joint(
    name="revenue_sink",
    joint_type="sink",
    upstream=["daily_revenue"],
    catalog="warehouse",
    table="daily_revenue",
    write_strategy="replace",
)

Stage 4: Compilation¶

compile() takes an Assembly and produces a CompiledAssembly.

Engine Resolution¶

Each joint is assigned an engine. Resolution order:

Joint-level engine override (highest priority)
Profile-level default_engine

SQL Fusion¶

Adjacent SQL joints on the same engine instance are fused into a single query using CTEs:

graph LR
    A[raw_orders<br/><small>source</small>] --> B[filter_orders<br/><small>sql</small>]
    B --> C[daily_revenue<br/><small>sql</small>]
    C --> D[revenue_sink<br/><small>sink</small>]

    style B fill:#6c63ff,color:#fff,stroke:none
    style C fill:#6c63ff,color:#fff,stroke:none

Fusion is broken by:

A Python joint between two SQL joints
An engine instance change
An explicit eager: true flag
An assertion (quality check) on the upstream joint

Multi-Upstream Fusion¶

When a multi-input joint such as a SQL JOIN references several upstream joints, the optimizer merges all eligible upstream groups into a single fused group. This avoids wasteful standalone executions — for example, a bare SELECT * FROM ... adapter read for a source whose data is already available server-side inside the fused query.

graph LR
    S1[customers<br/><small>source</small>] --> J[customer_orders<br/><small>sql JOIN</small>]
    S2[orders<br/><small>source</small>] --> J
    J --> K[sink<br/><small>sink</small>]

    style S1 fill:#6c63ff,color:#fff,stroke:none
    style S2 fill:#6c63ff,color:#fff,stroke:none
    style J fill:#6c63ff,color:#fff,stroke:none

In the diagram above, both source joints and the JOIN are fused into a single CTE chain executed as one query. Without multi-upstream fusion, each source would run as a separate standalone group.

The same fusion-breaking conditions apply to each upstream independently — if one upstream is on a different engine or has eager: true, only that upstream stays in its own group while the remaining eligible upstreams are still merged.

Execution Order¶

compile() produces a topologically sorted execution_order list. The executor follows this list exactly — it never re-resolves or re-orders at runtime.

Introspection (Best-Effort)¶

During compilation, Rivet optionally introspects source schemas from catalogs. This improves SQL validation and column lineage tracking, but introspection failures never block compilation.

Sink Schema Inference¶

After all upstream joints are compiled, Rivet automatically infers output schemas for sink joints based on their upstream data flow. This provides visibility into what data structure is being written and enables schema validation at write time.

Single upstream: When a sink has exactly one upstream joint, the sink inherits that upstream's output schema directly.

Multiple upstreams: When a sink has multiple upstream joints:

If all upstream schemas are identical (same columns, types, nullability, and order), the sink uses that shared schema
If upstream schemas differ or any upstream has no schema, the sink's output schema is set to None and a warning is emitted

Schema confidence: Sinks inherit schema confidence from their upstream joints:

introspected — upstream data came from catalog introspection
inferred — upstream schema was inferred from SQL analysis
partial — schema merging failed due to conflicts
none — no schema information available

Schema confidence follows the ranking: introspected > inferred > partial > none. When multiple upstreams have different confidence levels, the sink inherits the highest confidence present.

Example output:

✓ Compiled 3 joints in 0.12s

Execution order:
  1. raw_orders (source, duckdb, introspected schema: id: int64, amount: float64)
  2. revenue_sink (sink, duckdb, introspected schema: id: int64, amount: float64)

Schema confidence: introspected

When schemas conflict:

⚠ Warning: Sink 'combined_sink' has conflicting upstream schemas from joints: 'source1', 'source2'.
Schema inference failed. Sink output_schema set to None.

The CompiledAssembly¶

CompiledAssembly is an immutable frozen dataclass:

Field	Description
`success`	Whether compilation succeeded
`joints`	`CompiledJoint` objects with resolved engine, adapter, SQL, and schema
`fused_groups`	Groups of SQL joints fused into single queries
`execution_order`	Topologically sorted list of group IDs and standalone joint names
`materializations`	Where intermediate results are materialized and why
`engine_boundaries`	Engine type changes between adjacent groups
`errors`	Structured `RivetError` list (non-empty when `success=False`)
`warnings`	Non-fatal issues (missing schemas, skipped introspection)

Compilation Errors¶

If compilation fails, success is False and the executor refuses to run. Common errors:

Code	Cause
`RVT-301`	Duplicate joint name
`RVT-302`	Unknown upstream reference
`RVT-303`	Source joint has upstream
`RVT-304`	Sink joint has no upstream
`RVT-305`	Cyclic dependency in DAG