feat: Add Apache Gravitino virtual file system (gvfs://) write support in io module #5965
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #5965      +/-   ##
==========================================
- Coverage   72.52%   72.13%    -0.40%
==========================================
  Files         970      971        +1
  Lines      126303   127451     +1148
==========================================
+ Hits        91598    91933      +335
- Misses      34705    35518      +813
```
Force-pushed from 7b677dc to ff2a16c
Force-pushed from ff2a16c to c6268d7
@kevinzwang Kevin, this is the third PR for #5503, please review. Thanks!
Greptile Summary

This PR adds write support for Apache Gravitino's virtual filesystem (gvfs://).

Key Changes:

Architecture:

Minor Issue:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DataFrame
    participant PyArrow
    participant GravitinoFileSystem
    participant GravitinoOutputStream
    participant io_put
    participant GravitinoSource
    participant UnderlyingStorage
    User->>DataFrame: write_parquet(gvfs://...)
    DataFrame->>PyArrow: Write data to filesystem
    PyArrow->>GravitinoFileSystem: Resolve filesystem for gvfs://
    PyArrow->>GravitinoFileSystem: open_output_stream(path)
    GravitinoFileSystem->>GravitinoOutputStream: Create stream
    GravitinoOutputStream-->>PyArrow: Return stream
    loop Write chunks
        PyArrow->>GravitinoOutputStream: write(data)
        GravitinoOutputStream->>GravitinoOutputStream: Buffer data in memory
    end
    PyArrow->>GravitinoOutputStream: close()
    GravitinoOutputStream->>GravitinoOutputStream: Get buffered data
    GravitinoOutputStream->>io_put: Call Rust function
    io_put->>GravitinoSource: Parse gvfs:// URL
    GravitinoSource->>GravitinoSource: Extract catalog/schema/fileset
    GravitinoSource->>GravitinoSource: Load fileset & credentials
    GravitinoSource->>GravitinoSource: Translate to storage path
    GravitinoSource->>UnderlyingStorage: put(s3_path, data)
    UnderlyingStorage-->>GravitinoSource: Success
    GravitinoSource-->>io_put: Success
    io_put-->>GravitinoOutputStream: Success
    GravitinoOutputStream-->>PyArrow: Success
    PyArrow-->>DataFrame: Success
    DataFrame-->>User: Write complete
```
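The buffered-write half of this flow can be sketched in plain Python. This is an illustrative stand-in, not the PR's actual code: `io_put` and `GravitinoOutputStream` are named after the diagram's participants, while the `STORE` dictionary is a hypothetical substitute for the underlying storage.

```python
import io

# Hypothetical stand-in for the underlying storage (e.g. S3/MinIO).
STORE = {}


def io_put(path: str, data: bytes) -> None:
    """Stand-in for the Rust io_put: parse the gvfs:// URL and upload.

    The real flow also loads the fileset and credentials, then translates
    the virtual path to the underlying storage path before uploading.
    """
    prefix = "gvfs://filesets/"
    assert path.startswith(prefix)
    # Extract catalog/schema/fileset plus the relative file path.
    parts = path[len(prefix):].split("/", 3)
    catalog, schema, fileset = parts[0], parts[1], parts[2]
    rel_path = parts[3] if len(parts) > 3 else ""
    STORE[(catalog, schema, fileset, rel_path)] = data


class GravitinoOutputStream:
    """Buffers all writes in memory, then uploads once on close()."""

    def __init__(self, path: str):
        self.path = path
        self._buf = io.BytesIO()
        self.closed = False

    def write(self, data: bytes) -> int:
        # Each chunk written by PyArrow is appended to the in-memory buffer.
        return self._buf.write(data)

    def close(self) -> None:
        # On close, hand the whole buffer to the (Rust) put routine once.
        if not self.closed:
            io_put(self.path, self._buf.getvalue())
            self.closed = True


stream = GravitinoOutputStream("gvfs://filesets/cat/sch/fs/part-0.parquet")
stream.write(b"hello ")
stream.write(b"world")
stream.close()
print(STORE[("cat", "sch", "fs", "part-0.parquet")])  # b'hello world'
```

Buffering everything and uploading once on close matches the diagram's single `put(s3_path, data)` call; a streaming multipart upload would be a different design.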
Additional Comments (1)
- daft/io/gravitino_filesystem.py, line 213 — style: move import to top of file per project guidelines (rule 430ffc3f)

  Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  Context Used: Rule from dashboard — Import statements should be placed at the top of the file rather than inline within functions or methods.
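For illustration, the flagged pattern and its fix look like this. The class and method names below are borrowed from the PR's filesystem shim, but the body is hypothetical, not the actual code at line 213.

```python
# Before (flagged by the rule): the import is re-resolved inside the method.
#
#     class GravitinoFileSystem:
#         def open_output_stream(self, path):
#             import base64  # inline import, flagged
#             ...
#
# After: hoist the import to the top of the file, once, at module level.
import base64


class GravitinoFileSystem:
    def open_output_stream(self, path: str) -> str:
        # base64 is now bound at module import time, not per call.
        return base64.b64encode(path.encode()).decode()


fs = GravitinoFileSystem()
print(fs.open_output_stream("a"))  # YQ==
```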
10 files reviewed, 1 comment
Fixed in the latest commit.
kevinzwang left a comment
Did not closely review the Gravitino-specific logic because I am not very familiar with the protocols, but I trust that you have a good grasp of that and have implemented it correctly. Everything else looks good to me.
Let me know when this is ready to be merged!
Hi Kevin, yes, it is ready and safe to merge. Thanks for your effort! @kevinzwang
Changes Made
Adds write operation support for the Gravitino virtual file system (gvfs://), so that users can use a "gvfs://filesets/catalog/schema/fileset_name" path to write to an S3 location (and, in the future, other cloud storages).
To verify the functionality, the integration test cases have been updated with write operations against a MinIO service that simulates S3.
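The core of the feature is mapping a virtual gvfs:// path onto the fileset's real storage location. A minimal sketch of that translation, assuming the storage location has already been looked up (the real code resolves it, along with credentials, from the Gravitino server):

```python
def translate_gvfs_path(gvfs_path: str, storage_locations: dict) -> str:
    """Map gvfs://filesets/<catalog>/<schema>/<fileset>/<rel> to storage.

    storage_locations is a hypothetical lookup table standing in for the
    fileset metadata fetched from Gravitino.
    """
    prefix = "gvfs://filesets/"
    if not gvfs_path.startswith(prefix):
        raise ValueError(f"not a gvfs fileset path: {gvfs_path}")
    catalog, schema, fileset, *rest = gvfs_path[len(prefix):].split("/")
    base = storage_locations[(catalog, schema, fileset)]  # e.g. "s3://bucket/fs"
    # Append the relative file path, if any, under the storage location.
    return "/".join([base.rstrip("/"), *rest]) if rest else base


locations = {("cat", "sch", "fs"): "s3://bucket/data/fs"}
print(translate_gvfs_path("gvfs://filesets/cat/sch/fs/part-0.parquet", locations))
# s3://bucket/data/fs/part-0.parquet
```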
Related Issues
This is the third PR for #5503.