Large data files

This page is part of the section on Persistent storage & databases, which covers where to store and manage the data that Orvanta works with.

For heavier data objects and unstructured data, use an object storage service: Amazon S3 (Simple Storage Service), its S3-compatible alternatives Cloudflare R2 and MinIO, Azure Blob Storage, or Google Cloud Storage. These services are scalable and durable, and they suit a wide range of data types and use cases.

Orvanta integrates natively with S3, Azure Blob, and Google Cloud Storage, making them the recommended storage for large objects such as files and binary data.

Workspace object storage

Connect your Orvanta workspace to your S3 bucket, Azure Blob storage, or Google Cloud Storage so that users can read from and write to the connected storage without needing access to the credentials.

Orvanta integration with Polars and DuckDB for data pipelines

Orvanta's integration with Polars and DuckDB makes it straightforward to implement ETLs on tabular data. You don't need to interact with the S3 bucket manually: Polars and DuckDB read datasets from S3 and write them back natively and efficiently.

DuckDB (SQL)

-- $file1 (s3object)
-- Run queries directly on an S3 parquet file passed as an argument
SELECT * FROM read_parquet($file1)
-- Or using an explicit path in a workspace storage
SELECT * FROM read_json('s3:///demo/data.json')
-- You can also specify a secondary workspace storage
SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv')
-- Write the result of a query to a different parquet file on S3
COPY (
    SELECT COUNT(*) FROM read_parquet($file1)
) TO 's3:///demo/output.pq' (FORMAT 'parquet');
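
The SQL above uses two URI shapes: `s3:///...` with an empty storage name for the workspace storage, and `s3://secondary_storage/...` for a named secondary storage. The following is a minimal sketch of how those shapes split into a storage name and an object key. `StorageRef` and `parse_storage_uri` are hypothetical names used for illustration only, and how Orvanta itself resolves these URIs is not shown here.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class StorageRef(NamedTuple):
    storage: Optional[str]  # None is read here as "the default workspace storage"
    key: str                # object key inside that storage


def parse_storage_uri(uri: str) -> StorageRef:
    """Split an s3:// URI of the shape used above into storage name and key."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    key = parts.path.lstrip("/")
    if not key:
        raise ValueError(f"URI has no object key: {uri!r}")
    # An empty authority (s3:///...) means no storage name was given.
    return StorageRef(storage=parts.netloc or None, key=key)


if __name__ == "__main__":
    print(parse_storage_uri("s3:///demo/data.json"))
    print(parse_storage_uri("s3://secondary_storage/demo/data.csv"))
```

Returning `None` for the empty storage name keeps the "default storage" case explicit instead of encoding it as an empty string.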

Polars (Python)

#requirements:
#polars==0.20.2
#s3fs==2023.12.0
#orvanta>=1.229.0
import orvanta
from orvanta import S3Object
import polars as pl
import s3fs

def main(input_file: S3Object):
    bucket = orvanta.get_resource("<PATH_TO_S3_RESOURCE>")["bucket"]
    # this will default to the workspace S3 resource
    storage_options = orvanta.polars_connection_settings().storage_options
    # this will use the designated resource
    # storage_options = orvanta.polars_connection_settings("<PATH_TO_S3_RESOURCE>").storage_options
    # input is a parquet file: read it with read_parquet, then switch to the lazy API with .lazy()
    input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
    input_df = pl.read_parquet(input_uri, storage_options=storage_options).lazy()
    # process the Polars dataframe
    output_df = input_df.collect()
    print(output_df)
    # To write back the result to S3, Polars needs an s3fs connection
    s3 = s3fs.S3FileSystem(**orvanta.polars_connection_settings().s3fs_args)
    output_file = "output/result.parquet"
    output_uri = "s3://{}/{}".format(bucket, output_file)
    with s3.open(output_uri, mode="wb") as output_s3:
        # persist the output dataframe back to S3 and return it
        output_df.write_parquet(output_s3)
    return S3Object(s3=output_file)

DuckDB (Python)

#requirements:
#orvanta>=1.229.0
#duckdb==0.9.1
import orvanta
from orvanta import S3Object
import duckdb

def main(input_file: S3Object):
    bucket = orvanta.get_resource("u/admin/orvanta-cloud-demo")["bucket"]
    # create a DuckDB database in memory
    conn = duckdb.connect()
    # this will default to the workspace S3 resource
    args = orvanta.duckdb_connection_settings().connection_settings_str
    # this will use the designated resource
    # args = orvanta.duckdb_connection_settings("<PATH_TO_S3_RESOURCE>").connection_settings_str
    # connect DuckDB to the S3 bucket using the connection settings built above
    conn.execute(args)
    input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
    output_file = "output/result.parquet"
    output_uri = "s3://{}/{}".format(bucket, output_file)
    # Run queries directly on the parquet file
    query_result = conn.sql(
        """
        SELECT * FROM read_parquet('{}')
    """.format(
            input_uri
        )
    )
    query_result.show()
    # Write the result of a query to a different parquet file on S3
    conn.execute(
        """
        COPY (
            SELECT COUNT(*) FROM read_parquet('{input_uri}')
        ) TO '{output_uri}' (FORMAT 'parquet');
    """.format(
            input_uri=input_uri, output_uri=output_uri
        )
    )
    conn.close()
    return S3Object(s3=output_file)
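
Both Python examples build S3 URIs with `str.format`, and the DuckDB one also interpolates them into SQL text. The sketch below is one possible way to make that step a little more robust. `build_s3_uri`, `sql_string_literal`, and `copy_count_sql` are hypothetical helpers, not part of the `orvanta` package. It assumes standard SQL quoting, where a single quote inside a string literal is escaped by doubling it.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    if not bucket:
        raise ValueError("bucket must not be empty")
    # Strip leading slashes so "/output/x" and "output/x" give the same URI.
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


def sql_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def copy_count_sql(input_uri: str, output_uri: str) -> str:
    """Build the COPY statement from the example above with quoted URIs."""
    return "COPY (SELECT COUNT(*) FROM read_parquet({})) TO {} (FORMAT 'parquet');".format(
        sql_string_literal(input_uri), sql_string_literal(output_uri)
    )


if __name__ == "__main__":
    # Hypothetical bucket and keys, for illustration only.
    src = build_s3_uri("my-bucket", "input/data.parquet")
    dst = build_s3_uri("my-bucket", "/output/result.parquet")
    print(copy_count_sql(src, dst))
```

The resulting string could be passed to `conn.execute(...)` in place of the inline `.format` call, so that an object key containing a single quote does not break the statement.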

For more information on data pipelines in Orvanta, see the Data pipelines documentation.

Use Amazon S3, R2, MinIO, Azure Blob, and Google Cloud Storage directly

Amazon S3, Cloudflare R2, and MinIO all follow the same API schema and therefore share a common Orvanta resource type. Azure Blob and Google Cloud Storage have slightly different APIs from S3, but they also work with Orvanta through their own dedicated resource types.
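
To illustrate why one resource type can cover Amazon S3, Cloudflare R2, and MinIO, here is a small sketch in which the three differ only in endpoint and region. `S3CompatibleConfig` and its field names are hypothetical and do not reflect the actual schema of the Orvanta resource type. The endpoint patterns are the publicly documented ones for AWS and R2, and the MinIO host is whatever your own deployment exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3CompatibleConfig:
    """Hypothetical connection settings shared by S3-compatible services."""
    bucket: str
    endpoint_url: str
    region: str


def aws_s3(bucket: str, region: str) -> S3CompatibleConfig:
    # Regional AWS endpoint pattern.
    return S3CompatibleConfig(bucket, f"https://s3.{region}.amazonaws.com", region)


def cloudflare_r2(bucket: str, account_id: str) -> S3CompatibleConfig:
    # R2 endpoints are scoped per account and use the region name "auto".
    return S3CompatibleConfig(
        bucket, f"https://{account_id}.r2.cloudflarestorage.com", "auto"
    )


def minio(bucket: str, host: str, region: str = "us-east-1") -> S3CompatibleConfig:
    # MinIO is self-hosted, so the endpoint is simply your server's address.
    # The default region here is an assumption; adjust it to your deployment.
    return S3CompatibleConfig(bucket, host.rstrip("/"), region)


if __name__ == "__main__":
    for cfg in (
        aws_s3("demo", "eu-west-1"),
        cloudflare_r2("demo", "<ACCOUNT_ID>"),
        minio("demo", "http://localhost:9000/"),
    ):
        print(cfg)
```

Because all three configurations have the same shape, any client that speaks the S3 API can be pointed at each of them by changing the endpoint alone.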