Ducklake
Overview
Section titled “Overview”Ducklake enables storage of large datasets in S3 while maintaining efficient SQL query capabilities. It’s part of Orvanta’s persistent storage solutions for data management.
Getting started
Section titled “Getting started”Prerequisites include:
- Configured workspace storage
- A Postgres or MySQL resource (unless you’re a superuser with access to a custom instance database)
Configuration occurs in workspace settings under “Object storage (S3)” where you can set up a Ducklake instance.
Usage in scripts
Section titled “Usage in scripts”Ducklakes are referenced by name, with main as the default that can be omitted.
TypeScript example:
import * as orvanta from 'orvanta-client';
export async function main(user_id: string) { let sql = orvanta.ducklake(); let friend = await sql`SELECT * FROM friend WHERE id = ${user_id}`.fetchOne(); return friend;}Python example:
import orvanta
def main(user_id: str): dl = orvanta.ducklake() friend = dl.query('SELECT * FROM friend WHERE id = $id', id=user_id).fetch_one() return friendDuckDB example:
ATTACH 'ducklake' AS dl;USE dl;SELECT * FROM friend WHERE id = $user_id;Database manager
Section titled “Database manager”The “Explore” button in Ducklake settings opens a database manager for CRUD operations via UI or SQL REPL.
Performance comparison: Ducklake vs Snowflake
Section titled “Performance comparison: Ducklake vs Snowflake”Benchmarking used TPC-DS across three dataset scales:
- SF10 (10 GB): Orvanta with 2× m6a.8xlarge completed in 67 seconds (21% faster than Snowflake Small) at 2.5x lower cost.
- SF100 (100 GB): Orvanta with 3× m6a.16xlarge (157 seconds, $0.36) outperformed Snowflake Large (194 seconds, $0.86) while costing 2.4x less.
- SF1000 (1 TB): Snowflake Large completed in 28 minutes versus Orvanta’s 70 minutes, though Orvanta cost 15% less ($6.35 vs $7.43).
Pandas comparison
Section titled “Pandas comparison”Testing revealed that Airflow + Pandas takes 47 minutes compared to Orvanta’s 67 seconds — a 42x slowdown — at 10 GB scale due to memory constraints and single-threaded execution limitations.
Backend architecture
Section titled “Backend architecture”Ducklake stores metadata in catalog tables and maintains data as columnar Parquet files in S3, enabling efficient analytical processing without loading everything into RAM.
Related
Section titled “Related”- Data tables — Relational data with minimal setup.
- Large data files — Object storage for unstructured data.