Skip to content

Ducklake

Ducklake enables storage of large datasets in S3 while maintaining efficient SQL query capabilities. It’s part of Orvanta’s persistent storage solutions for data management.

Prerequisites include:

  • Configured workspace storage
  • A Postgres or MySQL resource (unless you’re a superuser with access to a custom instance database)

Configuration occurs in workspace settings under “Object storage (S3)” where you can set up a Ducklake instance.

Ducklakes are referenced by name, with main as the default that can be omitted.

TypeScript example:

import * as orvanta from 'orvanta-client';
export async function main(user_id: string) {
let sql = orvanta.ducklake();
let friend = await sql`SELECT * FROM friend WHERE id = ${user_id}`.fetchOne();
return friend;
}

Python example:

import orvanta
def main(user_id: str):
dl = orvanta.ducklake()
friend = dl.query('SELECT * FROM friend WHERE id = $id', id=user_id).fetch_one()
return friend

DuckDB example:

ATTACH 'ducklake' AS dl;
USE dl;
SELECT * FROM friend WHERE id = $user_id;

The “Explore” button in Ducklake settings opens a database manager for CRUD operations via UI or SQL REPL.

Performance comparison: Ducklake vs Snowflake

Section titled “Performance comparison: Ducklake vs Snowflake”

Benchmarking used TPC-DS across three dataset scales:

  • SF10 (10 GB): Orvanta with 2× m6a.8xlarge completed in 67 seconds (21% faster than Snowflake Small) at 2.5x lower cost.
  • SF100 (100 GB): Orvanta with 3× m6a.16xlarge (157 seconds, $0.36) outperformed Snowflake Large (194 seconds, $0.86) while costing 2.4x less.
  • SF1000 (1 TB): Snowflake Large completed in 28 minutes versus Orvanta’s 70 minutes, though Orvanta cost 15% less ($6.35 vs $7.43).

Testing revealed that Airflow + Pandas takes 47 minutes compared to Orvanta’s 67 seconds — a 42x slowdown — at 10 GB scale due to memory constraints and single-threaded execution limitations.

Ducklake stores metadata in catalog tables and maintains data as columnar Parquet files in S3, enabling efficient analytical processing without loading everything into RAM.