Ducklake

Overview

Ducklake enables storage of large datasets in S3 while maintaining efficient SQL query capabilities. It’s part of Orvanta’s persistent storage solutions for data management.

Getting started

Prerequisites include:

Configured workspace storage
A Postgres or MySQL resource (unless you’re a superuser with access to a custom instance database)

Configuration occurs in workspace settings under “Object storage (S3)” where you can set up a Ducklake instance.

Usage in scripts

Ducklakes are referenced by name, with main as the default that can be omitted.

TypeScript example:

import * as orvanta from 'orvanta-client';

export async function main(user_id: string) {
  let sql = orvanta.ducklake();
  let friend = await sql`SELECT * FROM friend WHERE id = ${user_id}`.fetchOne();
  return friend;
}

Python example:

import orvanta

def main(user_id: str):
  dl = orvanta.ducklake()
  friend = dl.query('SELECT * FROM friend WHERE id = $id', id=user_id).fetch_one()
  return friend

DuckDB example:

ATTACH 'ducklake' AS dl;
USE dl;
SELECT * FROM friend WHERE id = $user_id;

Database manager

The “Explore” button in Ducklake settings opens a database manager for CRUD operations via UI or SQL REPL.

Performance comparison: Ducklake vs Snowflake

Benchmarking used TPC-DS across three dataset scales:

SF10 (10 GB): Orvanta with 2× m6a.8xlarge completed in 67 seconds (21% faster than Snowflake Small) at 2.5x lower cost.
SF100 (100 GB): Orvanta with 3× m6a.16xlarge (157 seconds, $0.36) outperformed Snowflake Large (194 seconds, $0.86) while costing 2.4x less.
SF1000 (1 TB): Snowflake Large completed in 28 minutes versus Orvanta’s 70 minutes, though Orvanta cost 15% less ($6.35 vs $7.43).

Pandas comparison

Testing revealed that Airflow + Pandas takes 47 minutes compared to Orvanta’s 67 seconds at 10 GB scale, a 42x slowdown caused by memory constraints and single-threaded execution limitations.

Backend architecture

Ducklake stores metadata in catalog tables and maintains data as columnar Parquet files in S3, enabling efficient analytical processing without loading everything into RAM.

Data tables: Relational data with minimal setup.
Large data files: Object storage for unstructured data.