Large data files
This page is part of the section on Persistent storage & databases, which covers where and how to store and manage the data manipulated by Orvanta.
For larger data objects and unstructured data, object storage is the better fit. Amazon S3 (Simple Storage Service), its S3-compatible alternatives Cloudflare R2 and MinIO, Azure Blob Storage, and Google Cloud Storage are scalable, durable object storage services that provide secure and cost-effective storage for a wide range of data types and use cases.
Orvanta integrates natively with S3, Azure Blob, and Google Cloud Storage, which makes them the recommended storage for large objects such as files and binary data.
Workspace object storage
Connect your Orvanta workspace to your S3 bucket, Azure Blob storage, or Google Cloud Storage so that users can read from and write to that storage without needing access to the credentials.
Orvanta integration with Polars and DuckDB for data pipelines
ETLs can be implemented in Orvanta using its integration with Polars and DuckDB, which simplifies working with tabular data. In this case, you don't need to interact with the S3 bucket manually: Polars and DuckDB read and write datasets on S3 natively and efficiently.
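The SQL example in the next subsection addresses objects with two forms of `s3://` URI: one with an empty storage name (`s3:///demo/data.json`) for the workspace storage, and one that names a secondary storage (`s3://secondary_storage/demo/data.csv`). The sketch below only illustrates how those two forms can be told apart. `StorageRef` and `parse_storage_uri` are hypothetical names introduced here and are not part of Orvanta's API.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class StorageRef(NamedTuple):
    """Hypothetical parsed form of an s3:// URI as written in the SQL example."""
    storage: Optional[str]  # None stands for the default workspace storage
    key: str                # object key inside that storage


def parse_storage_uri(uri: str) -> StorageRef:
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    # Assumption: an empty authority (s3:///...) means the default workspace
    # storage, and a non-empty one names a secondary storage.
    storage = parts.netloc or None
    key = parts.path.lstrip("/")
    if not key:
        raise ValueError(f"no object key in {uri!r}")
    return StorageRef(storage, key)


if __name__ == "__main__":
    print(parse_storage_uri("s3:///demo/data.json"))
    print(parse_storage_uri("s3://secondary_storage/demo/data.csv"))
```

The parser relies only on the standard library's `urlsplit`, so it makes no claim about how Orvanta itself resolves these URIs.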
DuckDB (SQL)
    -- $file1 (s3object)

    -- Run queries directly on an S3 parquet file passed as an argument
    SELECT * FROM read_parquet($file1);

    -- Or using an explicit path in a workspace storage
    SELECT * FROM read_json('s3:///demo/data.json');

    -- You can also specify a secondary workspace storage
    SELECT * FROM read_csv('s3://secondary_storage/demo/data.csv');

    -- Write the result of a query to a different parquet file on S3
    COPY (
        SELECT COUNT(*) FROM read_parquet($file1)
    ) TO 's3:///demo/output.pq' (FORMAT 'parquet');

Polars (Python)
    #requirements:
    #polars==0.20.2
    #s3fs==2023.12.0
    #orvanta>=1.229.0

    import orvanta
    from orvanta import S3Object
    import polars as pl
    import s3fs


    def main(input_file: S3Object):
        bucket = orvanta.get_resource("<PATH_TO_S3_RESOURCE>")["bucket"]

        # this will default to the workspace S3 resource
        storage_options = orvanta.polars_connection_settings().storage_options
        # this will use the designated resource
        # storage_options = orvanta.polars_connection_settings("<PATH_TO_S3_RESOURCE>").storage_options

        # input is a parquet file, we use read_parquet in lazy mode.
        input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
        input_df = pl.read_parquet(input_uri, storage_options=storage_options).lazy()

        # process the Polars dataframe
        output_df = input_df.collect()
        print(output_df)

        # To write back the result to S3, Polars needs an s3fs connection
        s3 = s3fs.S3FileSystem(**orvanta.polars_connection_settings().s3fs_args)
        output_file = "output/result.parquet"
        output_uri = "s3://{}/{}".format(bucket, output_file)
        with s3.open(output_uri, mode="wb") as output_s3:
            # persist the output dataframe back to S3 and return it
            output_df.write_parquet(output_s3)

        return S3Object(s3=output_file)

DuckDB (Python)
    #requirements:
    #orvanta>=1.229.0
    #duckdb==0.9.1

    import orvanta
    from orvanta import S3Object
    import duckdb


    def main(input_file: S3Object):
        bucket = orvanta.get_resource("u/admin/orvanta-cloud-demo")["bucket"]

        # create a DuckDB database in memory
        conn = duckdb.connect()

        # this will default to the workspace S3 resource
        args = orvanta.duckdb_connection_settings().connection_settings_str
        # this will use the designated resource
        # args = orvanta.duckdb_connection_settings("<PATH_TO_S3_RESOURCE>").connection_settings_str

        # connect DuckDB to the S3 bucket - this will default to the workspace S3 resource
        conn.execute(args)

        input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
        output_file = "output/result.parquet"
        output_uri = "s3://{}/{}".format(bucket, output_file)

        # Run queries directly on the parquet file
        query_result = conn.sql(
            """
            SELECT * FROM read_parquet('{}')
            """.format(input_uri)
        )
        query_result.show()

        # Write the result of a query to a different parquet file on S3
        conn.execute(
            """
            COPY (
                SELECT COUNT(*) FROM read_parquet('{input_uri}')
            ) TO '{output_uri}' (FORMAT 'parquet');
            """.format(input_uri=input_uri, output_uri=output_uri)
        )

        conn.close()
        return S3Object(s3=output_file)

For more information on data pipelines in Orvanta, see the Data pipelines documentation.
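Both Python examples above receive an `S3Object`, read its `s3` key, and combine it with the bucket name into an `s3://bucket/key` URI. The sketch below isolates that step so it can be run without Orvanta, Polars, or DuckDB installed. The `S3Object` class here is a stand-in that assumes only the single `s3` key the examples use; the real type may carry more. `object_uri` and the bucket name are hypothetical.

```python
from typing import TypedDict


class S3Object(TypedDict):
    """Stand-in for orvanta.S3Object: refers to an object by key, not by content."""
    s3: str


def object_uri(bucket: str, obj: S3Object) -> str:
    """Build the s3://bucket/key URI that the examples pass to Polars or DuckDB."""
    key = obj["s3"].lstrip("/")  # avoid a double slash if the key starts with "/"
    return f"s3://{bucket}/{key}"


def main(input_file: S3Object) -> S3Object:
    bucket = "example-bucket"  # hypothetical; the examples read it from a resource
    print("reading", object_uri(bucket, input_file))
    output = S3Object(s3="output/result.parquet")
    print("writing", object_uri(bucket, output))
    # Return a reference to the written file rather than its contents.
    return output


if __name__ == "__main__":
    main(S3Object(s3="input/data.parquet"))
```

Returning a reference to the output file instead of the file contents keeps large payloads in object storage, which matches what both examples do in their final `return` statement.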
Use Amazon S3, R2, MinIO, Azure Blob, and Google Cloud Storage directly
Amazon S3, Cloudflare R2, and MinIO all follow the same API schema and therefore share a common Orvanta resource type. Azure Blob and Google Cloud Storage have slightly different APIs from S3 but also work with Orvanta through their dedicated resource types.
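To illustrate why one resource type can cover S3, R2, and MinIO, here is a minimal sketch of a single configuration shape in which only the endpoint and addressing style change between providers. The class and its field names are assumptions made for illustration and do not reflect the actual schema of Orvanta's S3 resource type. `ACCOUNT_ID` is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3CompatibleResource:
    """Hypothetical resource shape; field names are illustrative only."""
    bucket: str
    endpoint: str              # host[:port] of the S3-compatible service
    region: str = "us-east-1"
    use_ssl: bool = True
    path_style: bool = False   # True puts the bucket in the path, not the host

    def object_url(self, key: str) -> str:
        scheme = "https" if self.use_ssl else "http"
        key = key.lstrip("/")
        if self.path_style:
            return f"{scheme}://{self.endpoint}/{self.bucket}/{key}"
        return f"{scheme}://{self.bucket}.{self.endpoint}/{key}"


if __name__ == "__main__":
    resources = {
        "aws": S3CompatibleResource(
            bucket="my-bucket",
            endpoint="s3.eu-west-1.amazonaws.com",
            region="eu-west-1",
        ),
        "r2": S3CompatibleResource(
            bucket="my-bucket",
            endpoint="ACCOUNT_ID.r2.cloudflarestorage.com",
            region="auto",
            path_style=True,
        ),
        "minio": S3CompatibleResource(
            bucket="my-bucket",
            endpoint="localhost:9000",
            use_ssl=False,
            path_style=True,
        ),
    }
    for name, resource in resources.items():
        print(name, resource.object_url("demo/data.csv"))
```

Azure Blob and Google Cloud Storage would not fit this shape as-is, which is consistent with their having dedicated resource types.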