AWS (S3 + Glue)

The `rivet-aws` plugin provides two catalog types: `s3` for S3 object storage and `glue` for the AWS Glue Data Catalog.

pip install 'rivetsql[aws]'

S3 Catalog

The S3 catalog treats an S3 bucket (with optional prefix) as a data store for file-based tables.

default:
  catalogs:
    - name: lake
      type: s3
      options:
        bucket: my-data-lake
        prefix: raw/
        region: us-east-1
        format: parquet

S3 Options

Option	Required	Type	Default	Description
`bucket`	yes	`str`	—	S3 bucket name
`prefix`	no	`str`	`""`	Key prefix
`region`	no	`str`	`"us-east-1"`	AWS region
`endpoint_url`	no	`str`	`None`	Custom endpoint URL (e.g., MinIO, LocalStack). Passed through unchanged, except that the scheme is stripped for DuckDB's `httpfs`.
`format`	no	`str`	`"parquet"`	Default format (`parquet`, `csv`, `json`, `orc`, `delta`). Auto-detected from file extension when the table name includes one (e.g., `customers.csv`).
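
The format rule in the last row can be sketched as follows. This is a minimal illustration, not the plugin's code: `resolve_format` and `SUPPORTED_FORMATS` are hypothetical names, and the assumption is that a recognised extension on the table name takes precedence over the catalog-level default.

```python
from pathlib import PurePosixPath

# Formats listed in the S3 options table.
SUPPORTED_FORMATS = {"parquet", "csv", "json", "orc", "delta"}


def resolve_format(table: str, default_format: str = "parquet") -> str:
    """Pick a file format for `table` (hypothetical helper, not plugin API).

    A recognised extension on the table name wins; otherwise the
    catalog-level default is returned.
    """
    suffix = PurePosixPath(table).suffix.lstrip(".").lower()
    if suffix in SUPPORTED_FORMATS:
        return suffix
    return default_format


print(resolve_format("customers.csv"))   # csv
print(resolve_format("events"))          # parquet
print(resolve_format("events", "json"))  # json
```

An unrecognised extension (for example `archive.tar`) falls through to the default in this sketch; the plugin may handle that case differently.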

S3 with Custom Endpoint (MinIO, LocalStack)

default:
  catalogs:
    - name: local_s3
      type: s3
      options:
        bucket: rivet-data
        endpoint_url: http://localhost:9000
        path_style_access: true
        access_key_id: minioadmin
        secret_access_key: minioadmin123
        format: csv
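
To illustrate why the scheme is stripped for DuckDB, the sketch below splits an `endpoint_url` into the pieces DuckDB's `httpfs` extension takes as separate settings (`s3_endpoint`, `s3_use_ssl`, `s3_url_style`). The helper name `duckdb_s3_settings` is hypothetical, and how the plugin performs this translation internally is an assumption.

```python
from urllib.parse import urlparse


def duckdb_s3_settings(endpoint_url: str, path_style_access: bool = False) -> dict:
    """Translate catalog options into DuckDB httpfs settings (illustrative only)."""
    parsed = urlparse(endpoint_url)
    if not parsed.netloc:
        raise ValueError(f"endpoint_url must include a scheme: {endpoint_url!r}")
    return {
        # DuckDB expects host[:port] here, without "http://" or "https://".
        "s3_endpoint": parsed.netloc,
        # The scheme is carried separately as a TLS on/off flag.
        "s3_use_ssl": parsed.scheme == "https",
        # MinIO and LocalStack are usually addressed path-style.
        "s3_url_style": "path" if path_style_access else "vhost",
    }


settings = duckdb_s3_settings("http://localhost:9000", path_style_access=True)
print(settings)
```

Each key could then be applied with a DuckDB `SET` statement, for example `SET s3_endpoint = 'localhost:9000'`.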

S3 Credentials

Option	Description
`access_key_id`	AWS access key ID
`secret_access_key`	AWS secret access key
`session_token`	Temporary session token (STS)
`profile`	AWS CLI profile from `~/.aws/credentials`
`role_arn`	IAM role ARN to assume
`auth_type`	Authentication method: `iam_keys`, `profile`, `assume_role`, `web_identity`, or `default`

If no explicit credentials are provided, the plugin falls back to the default AWS credential chain.
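
One way to picture how these options relate to `auth_type` is the sketch below. The function `infer_auth_type` and its precedence order are assumptions made for illustration; the plugin's actual resolution order is not documented here and may differ.

```python
def infer_auth_type(options: dict) -> str:
    """Guess an auth_type from the credential options that are set.

    Hypothetical helper: the precedence below is an assumption,
    not the plugin's documented behaviour.
    """
    explicit = options.get("auth_type")
    if explicit:
        return explicit
    if options.get("role_arn"):
        return "assume_role"
    if options.get("access_key_id") and options.get("secret_access_key"):
        return "iam_keys"
    if options.get("profile"):
        return "profile"
    # Nothing explicit: defer to the default AWS credential chain.
    return "default"


print(infer_auth_type({"bucket": "my-data-lake"}))                # default
print(infer_auth_type({"profile": "analytics"}))                  # profile
print(infer_auth_type({"access_key_id": "minioadmin",
                       "secret_access_key": "minioadmin123"}))    # iam_keys
```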

Glue Catalog

The Glue catalog connects to the AWS Glue Data Catalog for schema-managed access to S3-backed tables.

default:
  catalogs:
    - name: glue_db
      type: glue
      options:
        database: analytics
        region: us-east-1
        catalog_id: "123456789012"

Glue Options

Option	Required	Type	Default	Description
`database`	no	`str`	`None`	Glue database name
`region`	no	`str`	`"us-east-1"`	AWS region
`catalog_id`	no	`str`	`None`	AWS account ID that owns the catalog, for cross-account access
`lf_enabled`	no	`bool`	`false`	Use Lake Formation vended credentials
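
As a rough sketch of how these options could map onto a Glue API request, the helper below builds the keyword arguments for boto3's Glue `get_table` call (`DatabaseName`, `Name`, and optionally `CatalogId`). The name `glue_table_request` is hypothetical, and requiring a database here is a simplification for the example rather than the plugin's behaviour.

```python
def glue_table_request(options: dict, table: str) -> dict:
    """Build keyword arguments for boto3's Glue `get_table` call (illustrative)."""
    database = options.get("database")
    if database is None:
        raise ValueError("a Glue database name is required to look up a table")
    request = {"DatabaseName": database, "Name": table}
    # CatalogId is only sent for cross-account access; when it is omitted,
    # Glue uses the caller's own account.
    if options.get("catalog_id"):
        request["CatalogId"] = str(options["catalog_id"])
    return request


request = glue_table_request(
    {"database": "analytics", "region": "us-east-1", "catalog_id": "123456789012"},
    "events",
)
print(request)
# With boto3 installed, this could be used as:
#   boto3.client("glue", region_name="us-east-1").get_table(**request)
```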

Complex Type Support

Glue Catalog supports complex types through schema introspection:

Arrays: `array<T>` syntax (e.g., `array<string>`, `array<int>`)
Structs: `struct<field:type,...>` syntax (e.g., `struct<name:string,value:double>`)
Nested types: Arbitrary nesting is supported (e.g., `array<struct<...>>`)

Complex types are automatically mapped to Arrow types during schema introspection. See Complex Type Support for details.
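
The type syntax above can be parsed with a small recursive function. The sketch below is an illustration only and is not the plugin's parser: `parse_glue_type` and `_split_top_level` are hypothetical names, the output is a plain nested Python structure rather than Arrow types, and only the array and struct forms listed above are handled.

```python
def _split_top_level(body: str) -> list:
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts


def parse_glue_type(text: str):
    """Parse a Glue type string into a nested Python structure.

    Primitives stay as lowercase strings, arrays become ("array", element)
    and structs become ("struct", [(field_name, field_type), ...]).
    """
    text = text.strip()
    lowered = text.lower()
    if lowered.startswith("array<") and text.endswith(">"):
        return ("array", parse_glue_type(text[len("array<"):-1]))
    if lowered.startswith("struct<") and text.endswith(">"):
        fields = []
        for part in _split_top_level(text[len("struct<"):-1]):
            # Only the first colon separates the field name from its type.
            name, _, field_type = part.partition(":")
            fields.append((name.strip(), parse_glue_type(field_type)))
        return ("struct", fields)
    return lowered


print(parse_glue_type("array<string>"))
print(parse_glue_type("array<struct<name:string,value:double>>"))
```

From such a tree, a mapping to Arrow could use `pyarrow.list_(...)` for arrays and `pyarrow.struct([...])` for structs.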

The Glue catalog uses the same credential options as the S3 catalog.

Engine

The AWS plugin does not provide a compute engine. Pair S3 or Glue catalogs with an engine that has cross-catalog adapters:

Engine	S3	Glue
DuckDB	(via `httpfs`)
Polars
PySpark	(via Hadoop S3A)

Usage Examples

S3 source

SQL:

-- rivet:name: raw_events
-- rivet:type: source
-- rivet:catalog: lake
-- rivet:table: events

YAML:

name: raw_events
type: source
catalog: lake
table: events

Rivet API:

from rivet_core.models import Joint

raw_events = Joint(
    name="raw_events",
    joint_type="source",
    catalog="lake",
    table="events",
)

S3 sink

SQL:

-- rivet:name: write_results
-- rivet:type: sink
-- rivet:upstream: transformed
-- rivet:catalog: lake
-- rivet:table: results
-- rivet:write_strategy: append

YAML:

name: write_results
type: sink
upstream: transformed
catalog: lake
table: results
write_strategy: append

Rivet API:

from rivet_core.models import Joint

write_results = Joint(
    name="write_results",
    joint_type="sink",
    upstream=["transformed"],
    catalog="lake",
    table="results",
    write_strategy="append",
)

Lake Formation

When `lf_enabled: true` is set, the plugin uses Lake Formation to obtain temporary, table-level credentials:

default:
  catalogs:
    - name: governed
      type: glue
      options:
        database: analytics
        region: us-east-1
        lf_enabled: true
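
Because these credentials are temporary, a long-running job may want to check how close they are to expiring before reusing them. The sketch below shows one such check. `needs_refresh` and the five-minute margin are hypothetical choices, and whether the plugin refreshes credentials on its own is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_refresh(
    expiration: datetime,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True once `now` is within `margin` of the credential expiry."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expiration - margin


expiry = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(expiry, now=datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc)))  # False
print(needs_refresh(expiry, now=datetime(2024, 1, 1, 11, 56, tzinfo=timezone.utc)))  # True
```

The safety margin leaves time for requests that are already in flight when the check runs.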

Known Limitations

No compute engine: pair the catalogs with DuckDB, Polars, or PySpark.
Credential complexity: multiple auth methods are supported. Start with explicit keys for testing, then move to IAM roles.
S3 listing: listing operations on prefixes that contain many objects may be slow.
Glue API rate limits: heavy `list_tables`/`get_schema` usage may hit throttling. Consider caching the results.
Lake Formation: LF-vended credentials have a limited lifetime, so long-running pipelines may need to refresh them.
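
For the Glue rate-limit item, a small time-based cache in front of metadata lookups is one possible mitigation. The sketch below is an assumption-laden illustration: `TTLCache`, `get_or_load`, and `fake_get_schema` are hypothetical names, and the plugin may already cache metadata in its own way.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny time-based cache for metadata lookups (illustrative, not plugin API)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: skip the remote call
        value = loader()   # stands in for a Glue metadata request
        self._entries[key] = (now, value)
        return value


calls = []


def fake_get_schema() -> dict:
    """Stand-in for a remote schema lookup; records each invocation."""
    calls.append(1)
    return {"columns": ["id", "ts"]}


cache = TTLCache(ttl_seconds=300)
cache.get_or_load(("analytics", "events"), fake_get_schema)
cache.get_or_load(("analytics", "events"), fake_get_schema)
print(len(calls))  # 1
```

A short TTL keeps schemas reasonably current while collapsing repeated lookups within a single pipeline run.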