AWS (S3 + Glue)
The rivet-aws plugin provides two catalog plugins: s3 for S3 object storage and glue for the AWS Glue Data Catalog.
S3 Catalog
The S3 catalog treats an S3 bucket (with optional prefix) as a data store for file-based tables.
default:
  catalogs:
    - name: lake
      type: s3
      options:
        bucket: my-data-lake
        prefix: raw/
        region: us-east-1
        format: parquet
S3 Options
| Option | Required | Type | Default | Description |
|---|---|---|---|---|
| bucket | yes | str | n/a | S3 bucket name |
| prefix | no | str | "" | Key prefix |
| region | no | str | "us-east-1" | AWS region |
| endpoint_url | no | str | None | Custom endpoint (MinIO, LocalStack). Passed as-is; scheme is stripped for DuckDB's httpfs. |
| format | no | str | "parquet" | Default format (parquet, csv, json, orc, delta). Auto-detected from file extension when the table name includes one (e.g., customers.csv). |
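The extension-based auto-detection described for the format option can be pictured with a small sketch. The helper name resolve_format and the extension table are hypothetical, not taken from the plugin source; the only behaviour assumed is what the table states, namely that an extension in the table name overrides the catalog default.

```python
from pathlib import PurePosixPath

# Hypothetical mapping for illustration. "delta" has no file extension,
# so in this sketch it can only come from the catalog-level default.
EXTENSION_FORMATS = {
    ".parquet": "parquet",
    ".csv": "csv",
    ".json": "json",
    ".orc": "orc",
}


def resolve_format(table_name: str, default_format: str = "parquet") -> str:
    """Return the format implied by the table name's extension, else the default."""
    suffix = PurePosixPath(table_name).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, default_format)


if __name__ == "__main__":
    print(resolve_format("customers.csv"))       # extension present: it decides
    print(resolve_format("customers", "delta"))  # no extension: default is used
```

Unknown extensions fall back to the default in this sketch; whether the plugin does the same or raises an error is not stated in this page.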
S3 with Custom Endpoint (MinIO, LocalStack)
default:
  catalogs:
    - name: local_s3
      type: s3
      options:
        bucket: rivet-data
        endpoint_url: http://localhost:9000
        path_style_access: true
        access_key_id: minioadmin
        secret_access_key: minioadmin123
        format: csv
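To illustrate why the scheme is stripped for DuckDB, here is a minimal sketch that turns endpoint_url and path_style_access into DuckDB httpfs settings. The setting names s3_endpoint, s3_use_ssl, and s3_url_style are DuckDB's own; the function duckdb_s3_settings is hypothetical, and whether the plugin emits exactly these statements is an assumption.

```python
def duckdb_s3_settings(endpoint_url: str, path_style_access: bool = False) -> list[str]:
    """Build DuckDB SET statements for a custom S3 endpoint (illustrative only)."""
    scheme, sep, rest = endpoint_url.partition("://")
    if not sep:
        # No scheme given: treat the whole value as host[:port] and assume TLS.
        scheme, rest = "https", endpoint_url
    # httpfs expects host[:port] only; drop any trailing path.
    host = rest.split("/", 1)[0]
    use_ssl = "false" if scheme.lower() == "http" else "true"
    url_style = "path" if path_style_access else "vhost"
    return [
        f"SET s3_endpoint='{host}';",
        f"SET s3_use_ssl={use_ssl};",
        f"SET s3_url_style='{url_style}';",
    ]


if __name__ == "__main__":
    for statement in duckdb_s3_settings("http://localhost:9000", path_style_access=True):
        print(statement)
```

The scheme is not discarded outright in this sketch: it decides the value of s3_use_ssl, which is the reason an http:// MinIO endpoint needs TLS turned off explicitly.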
S3 Credentials
| Option | Description |
|---|---|
| access_key_id | AWS access key ID |
| secret_access_key | AWS secret access key |
| session_token | Temporary session token (STS) |
| profile | AWS CLI profile from ~/.aws/credentials |
| role_arn | IAM role ARN to assume |
| auth_type | iam_keys, profile, assume_role, web_identity, default |
If no explicit credentials are provided, the plugin falls back to the default AWS credential chain.
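One way to picture the fallback is a resolver that picks an auth_type from whichever options are set. This is a sketch only: the function infer_auth_type and the precedence order (explicit auth_type, then keys, then role, then profile) are assumptions for illustration, not the plugin's documented behaviour.

```python
def infer_auth_type(options: dict) -> str:
    """Pick an auth_type from catalog options (assumed precedence, illustrative)."""
    explicit = options.get("auth_type")
    if explicit:
        # An explicit setting always wins; web_identity can only be chosen this way here.
        return explicit
    if options.get("access_key_id") and options.get("secret_access_key"):
        return "iam_keys"
    if options.get("role_arn"):
        return "assume_role"
    if options.get("profile"):
        return "profile"
    # Nothing explicit: defer to the default AWS credential chain.
    return "default"


if __name__ == "__main__":
    print(infer_auth_type({"bucket": "my-data-lake"}))
    print(infer_auth_type({"profile": "analytics"}))
```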
Glue Catalog
The Glue catalog connects to the AWS Glue Data Catalog for schema-managed access to S3-backed tables.
default:
  catalogs:
    - name: glue_db
      type: glue
      options:
        database: analytics
        region: us-east-1
        catalog_id: "123456789012"
Glue Options
| Option | Required | Type | Default | Description |
|---|---|---|---|---|
| database | no | str | None | Glue database name |
| region | no | str | "us-east-1" | AWS region |
| catalog_id | no | str | None | AWS account ID for cross-account access |
| lf_enabled | no | bool | false | Use Lake Formation vended credentials |
Complex Type Support
Glue Catalog supports complex types through schema introspection:
- Arrays: array<T> syntax (e.g., array<string>, array<int>)
- Structs: struct<field:type,...> syntax (e.g., struct<name:string,value:double>)
- Nested types: Arbitrary nesting supported (e.g., array<struct<...>>)
Complex types are automatically mapped to Arrow types during schema introspection. See Complex Type Support for details.
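The type-string syntax above can be parsed with a short recursive-descent routine. The sketch below is not the plugin's implementation: parse_glue_type is a hypothetical name, and it returns plain nested tuples instead of Arrow types so that it runs without dependencies. With pyarrow available, the array and struct nodes would correspond to pyarrow.list_ and pyarrow.struct.

```python
def parse_glue_type(type_str: str):
    """Parse a Glue type string such as array<struct<a:int>> into nested tuples."""
    text = type_str.replace(" ", "")
    node, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing input: {text[pos:]!r}")
    return node


def _parse(text: str, pos: int):
    # Read the type name; parentheses are tracked so decimal(10,2) stays whole.
    start, depth = pos, 0
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and ch in "<>,:":
            break
        pos += 1
    name = text[start:pos].lower()
    if not name:
        raise ValueError(f"expected a type name at position {start}")
    if pos >= len(text) or text[pos] != "<":
        return name, pos  # primitive type
    pos += 1  # consume '<'
    if name == "array":
        element, pos = _parse(text, pos)
        node = ("array", element)
    elif name == "struct":
        fields = []
        while True:
            colon = text.index(":", pos)  # raises ValueError if the field has no type
            field_type, pos_after = _parse(text, colon + 1)
            fields.append((text[pos:colon], field_type))
            pos = pos_after
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        node = ("struct", fields)
    else:
        raise ValueError(f"unsupported parameterized type: {name}")
    if pos >= len(text) or text[pos] != ">":
        raise ValueError("missing closing '>'")
    return node, pos + 1


if __name__ == "__main__":
    print(parse_glue_type("array<struct<name:string,value:double>>"))
```

Only the two constructs listed above are handled; any other parameterized type raises ValueError in this sketch.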
The Glue catalog uses the same credential options as S3.
Engine
The AWS plugin does not provide a compute engine. Pair S3 or Glue catalogs with an engine that has cross-catalog adapters:
| Engine | S3 | Glue |
|---|---|---|
| DuckDB | (via httpfs) | |
| Polars | | |
| PySpark | (via Hadoop S3A) | |
Usage Examples
S3 source
S3 sink
Lake Formation
When lf_enabled is true, the plugin requests temporary table-level credentials from Lake Formation:
default:
  catalogs:
    - name: governed
      type: glue
      options:
        database: analytics
        region: us-east-1
        lf_enabled: true
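Because vended credentials expire, a long-running pipeline typically needs to re-request them before they lapse. The sketch below shows one way to do that with a refresh margin. TempCredentials and RefreshingCredentials are hypothetical names, and the fetch callable is injected; in practice it might wrap boto3's Lake Formation get_temporary_glue_table_credentials call, but that wiring is an assumption and is not shown.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TempCredentials:
    access_key_id: str
    secret_access_key: str
    session_token: str
    expires_at: float  # Unix timestamp at which the credentials stop working


class RefreshingCredentials:
    """Hand out cached credentials, refetching shortly before they expire."""

    def __init__(
        self,
        fetch: Callable[[], TempCredentials],
        margin_seconds: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._current: Optional[TempCredentials] = None

    def get(self) -> TempCredentials:
        # Refresh on first use, or once we are inside the safety margin.
        expired = (
            self._current is None
            or self._clock() >= self._current.expires_at - self._margin
        )
        if expired:
            self._current = self._fetch()
        return self._current
```

Calling get() before each table read keeps the refresh logic in one place, and the margin leaves room for a read that starts just before expiry.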
Known Limitations
- No compute engine: Pair with DuckDB, Polars, or PySpark.
- Credential complexity: Multiple auth methods are supported. Start with explicit keys for testing, then move to IAM roles.
- S3 listing: Listing operations on prefixes with many objects may be slow.
- Glue API rate limits: Heavy list_tables/get_schema usage may hit throttling. Consider caching.
- Lake Formation: LF-vended credentials have a limited lifetime. Long pipelines may need to refresh them.
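The caching suggested for Glue rate limits could be as simple as a time-to-live cache in front of schema lookups. The following is a minimal sketch under that assumption: TTLCache is a hypothetical helper, and the loader stands in for whatever function actually calls Glue.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache loader results per key and reload them after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no API call
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

A short TTL trades some staleness after schema changes for fewer Glue API calls; the right value depends on how often table definitions change.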