Skip to content

Conversation

@shaofengshi
Copy link
Contributor

@shaofengshi shaofengshi commented Jan 7, 2026

Changes Made

Add Gravitino virtual file system (gvfs://) support write operation. So that user can use "gvfs://filesets/catalog/schema/fileset_name" path to write s3 (and other cloud storages in the future) location.

To ensure the functionality, the integration test cases has been updated with write operations with a MinIO service to simulate s3.

Related Issues

This is the third pr for #5503

@codecov
Copy link

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 84.87805% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.13%. Comparing base (e06cbec) to head (675b1f0).
⚠️ Report is 37 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-io/src/python.rs 65.11% 15 Missing ⚠️
src/daft-writers/src/csv_writer.rs 0.00% 10 Missing ⚠️
daft/filesystem.py 0.00% 4 Missing ⚠️
src/daft-io/src/gravitino.rs 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5965      +/-   ##
==========================================
- Coverage   72.52%   72.13%   -0.40%     
==========================================
  Files         970      971       +1     
  Lines      126303   127451    +1148     
==========================================
+ Hits        91598    91933     +335     
- Misses      34705    35518     +813     
Files with missing lines Coverage Δ
daft/io/gravitino_filesystem.py 100.00% <100.00%> (ø)
src/daft-io/src/lib.rs 77.82% <100.00%> (ø)
src/daft-io/src/mock.rs 92.30% <100.00%> (ø)
src/daft-io/src/gravitino.rs 62.58% <0.00%> (-0.87%) ⬇️
daft/filesystem.py 40.78% <0.00%> (-0.70%) ⬇️
src/daft-writers/src/csv_writer.rs 71.42% <0.00%> (-6.16%) ⬇️
src/daft-io/src/python.rs 78.30% <65.11%> (-9.00%) ⬇️

... and 29 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@shaofengshi shaofengshi changed the title Gravitino split3 [WIP] feat: Gravitino split3 [WIP] Jan 7, 2026
@github-actions github-actions bot added the feat label Jan 7, 2026
@shaofengshi shaofengshi changed the title feat: Gravitino split3 [WIP] feat: Add Apache Gravitino virtual file system (gvfs://) write support in io module Jan 7, 2026
@shaofengshi shaofengshi marked this pull request as ready for review January 7, 2026 08:58
@shaofengshi
Copy link
Contributor Author

shaofengshi commented Jan 7, 2026

@kevinzwang Kevin, this is the third PR for #5503 , please review. Thanks!

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 7, 2026

Greptile Summary

This PR adds write support for Apache Gravitino's virtual filesystem (gvfs:// protocol), enabling users to write Parquet, CSV, and JSON files directly to Gravitino filesets backed by S3 storage.

Key Changes:

  • Implemented GravitinoFileSystem and GravitinoOutputStream in Python to handle PyArrow write operations by buffering data in memory before flushing to storage
  • Added io_put Rust function with async runtime detection to handle write operations from Python
  • Extended GravitinoSource in Rust with put and create_multipart_writer methods that delegate to underlying storage
  • Updated CSV writer to support Gravitino source type
  • Added comprehensive integration tests with MinIO to verify write operations and roundtrip read/write cycles
  • Updated documentation with GVFS write examples

Architecture:
The write flow uses PyArrow's filesystem abstraction: PyArrow writes to GravitinoOutputStream which buffers data, then on close() calls the Rust io_put function. The Rust layer parses the gvfs:// URL, loads fileset metadata and credentials from Gravitino, translates to the underlying storage path (e.g., S3), and performs the actual write operation.

Minor Issue:

  • Inline import logging statement violates project style guidelines (should be at module top)

Confidence Score: 4/5

  • This PR is safe to merge with one minor style fix recommended
  • The implementation is well-architected with proper delegation between Python and Rust layers, includes comprehensive integration tests with MinIO that verify write operations work correctly, and follows existing patterns in the codebase. The write flow properly handles async runtime contexts and buffers data appropriately. Only issue is a minor style violation with an inline import statement.
  • daft/io/gravitino_filesystem.py - has inline import that should be moved to module top

Important Files Changed

Filename Overview
daft/io/gravitino_filesystem.py new PyArrow filesystem implementation for gvfs:// write operations, buffers data in memory before writing
src/daft-io/src/python.rs added io_put Python binding with runtime detection logic for async context handling
src/daft-io/src/gravitino.rs added put and create_multipart_writer methods to delegate write operations to underlying storage
tests/integration/gravitino/test_gravitino_fileset_s3.py comprehensive integration tests for write operations with MinIO, includes parquet/csv/json and roundtrip verification
src/daft-writers/src/csv_writer.rs added Gravitino source type support to CSV writer with object storage backend
daft/filesystem.py added gvfs protocol support to filesystem inference logic, creates GravitinoFileSystem for write operations
docs/connectors/gravitino.md comprehensive documentation update covering GVFS write operations with code examples

Sequence Diagram

sequenceDiagram
    participant User
    participant DataFrame
    participant PyArrow
    participant GravitinoFileSystem
    participant GravitinoOutputStream
    participant io_put
    participant GravitinoSource
    participant UnderlyingStorage

    User->>DataFrame: write_parquet(gvfs://...)
    DataFrame->>PyArrow: Write data to filesystem
    PyArrow->>GravitinoFileSystem: Resolve filesystem for gvfs://
    PyArrow->>GravitinoFileSystem: open_output_stream(path)
    GravitinoFileSystem->>GravitinoOutputStream: Create stream
    GravitinoOutputStream-->>PyArrow: Return stream
    
    loop Write chunks
        PyArrow->>GravitinoOutputStream: write(data)
        GravitinoOutputStream->>GravitinoOutputStream: Buffer data in memory
    end
    
    PyArrow->>GravitinoOutputStream: close()
    GravitinoOutputStream->>GravitinoOutputStream: Get buffered data
    GravitinoOutputStream->>io_put: Call Rust function
    io_put->>GravitinoSource: Parse gvfs:// URL
    GravitinoSource->>GravitinoSource: Extract catalog/schema/fileset
    GravitinoSource->>GravitinoSource: Load fileset & credentials
    GravitinoSource->>GravitinoSource: Translate to storage path
    GravitinoSource->>UnderlyingStorage: put(s3_path, data)
    UnderlyingStorage-->>GravitinoSource: Success
    GravitinoSource-->>io_put: Success
    io_put-->>GravitinoOutputStream: Success
    GravitinoOutputStream-->>PyArrow: Success
    PyArrow-->>DataFrame: Success
    DataFrame-->>User: Write complete
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. daft/io/gravitino_filesystem.py, line 213 (link)

    style: move import to top of file per project guidelines (rule 430ffc3f)

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met... (source)

10 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@shaofengshi
Copy link
Contributor Author

Additional Comments (1)

  1. daft/io/gravitino_filesystem.py, line 213 (link)
    style: move import to top of file per project guidelines (rule 430ffc3f)
    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
    Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met... (source)

10 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

fixed in the latest commit.

@kevinzwang kevinzwang self-requested a review January 8, 2026 21:27
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not closely review the Gravitino-specific logic because I am not very familiar with the protocols, but I trust that you have a good grasp of that and have implemented it correctly. Everything else looks good to me.

Let me know when this is ready to be merged!

@shaofengshi
Copy link
Contributor Author

Hi Kevin, yes it is ready and safe to merge. Thanks for your effort! @kevinzwang

@kevinzwang kevinzwang merged commit 2855a34 into Eventual-Inc:main Jan 10, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants