feat: Add NpyCodec for lazy-loading numpy arrays #1331
Merged
Conversation
Add `migrate_external()` and `migrate_filepath()` to the `datajoint.migrate` module for safe migration of 0.x external storage columns to the 2.0 JSON format.

Migration strategy:
1. Add new `<column>_v2` columns with JSON type
2. Copy and convert data from the old columns
3. User verifies data is accessible via DataJoint 2.0
4. Finalize: rename columns (old → `_v1`, new → original)

This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered.

Functions:
- `migrate_external(schema, dry_run=True, finalize=False)`
- `migrate_filepath(schema, dry_run=True, finalize=False)`
- `_find_external_columns(schema)` - detect 0.x external columns
- `_find_filepath_columns(schema)` - detect 0.x filepath columns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
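A minimal sketch of the finalize step (step 4 above): the converted `<column>_v2` column takes over the original name and the original column is kept as `<column>_v1` for rollback. The `finalize_renames` helper is hypothetical, for illustration only; it is not the actual `datajoint.migrate` API.

```python
# Hypothetical sketch of the finalize rename step: not the real
# datajoint.migrate implementation, just the naming logic it describes.

def finalize_renames(columns):
    """Given current column names, return (old_name, new_name) rename pairs."""
    renames = []
    for col in columns:
        if col.endswith("_v2"):
            original = col[:-3]                           # "data_v2" -> "data"
            renames.append((original, original + "_v1"))  # old column kept for rollback
            renames.append((col, original))               # converted column -> original name
    return renames

# During migration, "data" (0.x format) coexists with "data_v2" (2.0 JSON):
print(finalize_renames(["data", "data_v2"]))
# [('data', 'data_v1'), ('data_v2', 'data')]
```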
Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17
Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
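The lazy-loading idea can be sketched with a small self-contained class: read only the `.npy` header to expose `shape` and `dtype`, and materialize the array on demand via the `__array__` protocol. `LazyNpy` below is illustrative, not the actual `NpyRef` implementation.

```python
# Hedged sketch of a lazy .npy reference (not the real NpyRef class).
import os
import tempfile
import numpy as np

class LazyNpy:
    """Lazy reference to a .npy file: metadata now, data on demand."""

    def __init__(self, path):
        self.path = path
        with open(path, "rb") as f:
            version = np.lib.format.read_magic(f)       # advance past magic, e.g. (1, 0)
            # Header parse only; the array data is never read here.
            self.shape, self.fortran_order, self.dtype = \
                np.lib.format.read_array_header_1_0(f)

    def __array__(self, dtype=None, copy=None):
        arr = np.load(self.path)                        # materialize on demand
        return arr if dtype is None else arr.astype(dtype)

path = os.path.join(tempfile.mkdtemp(), "x.npy")
np.save(path, np.arange(12).reshape(3, 4))

ref = LazyNpy(path)
print(ref.shape)          # (3, 4) -- known without loading the data
arr = np.asarray(ref)     # __array__ triggers the actual load
print(arr.sum())          # 66
```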
The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema,
_table, _field, and primary key values to construct schema-addressed
storage paths. Previously, key=None was passed, resulting in
"unknown/unknown" paths.
Now builds proper context dict from table metadata and row values,
enabling navigable paths like:
{schema}/{table}/objects/{pk_path}/{attribute}.npy
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
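The path construction described above can be sketched as a pure function over a context dict. The key names and the `key=value` join rule are assumptions for illustration, not the exact `SchemaCodec` implementation.

```python
# Hypothetical sketch of schema-addressed path building (not the real
# SchemaCodec code): context comes from table metadata and row values.

def build_schema_path(ctx, attribute, ext=".npy"):
    """Build {schema}/{table}/objects/{pk_path}/{attribute}{ext}."""
    # Primary-key values joined as key=value path segments, sorted for stability.
    pk_path = "/".join(f"{k}={v}" for k, v in sorted(ctx["key"].items()))
    return f"{ctx['schema']}/{ctx['table']}/objects/{pk_path}/{attribute}{ext}"

ctx = {"schema": "lab", "table": "recording", "key": {"subject_id": 7, "session": 2}}
print(build_schema_path(ctx, "waveform"))
# lab/recording/objects/session=2/subject_id=7/waveform.npy
```

With a proper context dict, the path is navigable by humans browsing the store, instead of the "unknown/unknown" placeholder produced when `key=None` was passed.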
…to feature/npy-codec
Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address reviewer feedback from PR #1330: attr should never be None, since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
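The local-store case above amounts to NumPy's own memory mapping: only the header is read eagerly, and data pages are faulted in on access. A minimal sketch:

```python
# Sketch of mmap-based loading for a local .npy file using numpy's
# mmap_mode; the file and values here are just for demonstration.
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

arr = np.load(path, mmap_mode="r")   # header read only; no bulk copy into RAM
print(type(arr).__name__)            # memmap
print(arr[123456])                   # touching an element pages in that region
```

For a remote store, the same call would run against the locally cached copy after download.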
…orage

Major changes to the hash-addressed storage model:
- Rename content_registry.py → hash_registry.py for clarity
- Always store the full path in metadata (protects against config changes)
- Use the stored path directly for retrieval (no path regeneration)
- Add delete_path() as the primary function, deprecate delete_hash()
- Add get_size() as the primary function, deprecate get_hash_size()
- Update gc.py to work with paths instead of hashes
- Update builtin_codecs.py HashCodec to use the new API

This design enables seamless migration from v0.14:
- Legacy data keeps old paths in metadata
- New data uses the new path structure
- GC compares stored paths against the filesystem

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
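The store-the-full-path idea can be sketched as follows: the path is computed once at write time and persisted in metadata, so retrieval never regenerates it even if subfolding or layout configuration later changes. Both function bodies below are hypothetical illustrations (including the choice of hash algorithm), not the actual `hash_registry` API.

```python
# Hypothetical sketch of path-based hash-addressed storage; the
# "_hash/{schema}/{hash}" layout follows the PR description, but the
# algorithm and subfolding scheme here are assumptions.
import hashlib

def compute_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_hash_path(schema: str, digest: str, subfold=(2, 2)) -> str:
    # Optional subfolding spreads files across directories: ab/cd/abcdef...
    parts, pos = [], 0
    for width in subfold:
        parts.append(digest[pos:pos + width])
        pos += width
    return "/".join(["_hash", schema] + parts + [digest])

digest = compute_hash(b"example payload")
path = build_hash_path("lab", digest)      # stored verbatim in metadata at write time
print(path.startswith("_hash/lab/"))       # True
```

Because the stored path, not the config, is authoritative at read time, legacy v0.14 data with old-style paths remains retrievable alongside new data.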
- Remove uuid_from_buffer from hash.py (dead code)
- connection.py now uses hashlib.md5().hexdigest() directly
- Update test_hash.py to test key_hash instead

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
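The direct call mentioned above is standard library only; hashing a byte buffer to a hex digest needs no helper:

```python
# The stdlib call that replaces the removed uuid_from_buffer helper.
import hashlib

digest = hashlib.md5(b"key contents").hexdigest()
print(len(digest))  # 32 hex characters
```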
Remove dead code that was only tested but never used in production:
- hash_exists (gc uses set operations on paths)
- delete_hash (gc uses delete_path directly)
- get_size (gc collects sizes during walk)
- get_hash_size (wrapper for get_size)

Remaining API: compute_hash, build_hash_path, get_store_backend, get_store_subfolding, put_hash, get_hash, delete_path

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary

Introduces the `<npy@>` codec for schema-addressed NumPy array storage with lazy loading, and refactors hash-addressed storage to use path-based retrieval.

Key Features

NpyCodec (`<npy@>`):
- Lazy loading: inspect `shape` and `dtype` without downloading
- Memory-mapped access via `mmap_mode`
- NumPy integration via the `__array__` protocol
- Safe bulk fetch: returns `NpyRef` objects instead of downloading all arrays
- Plain `.npy` files readable by NumPy, MATLAB, etc.
- Schema-addressed paths (`{schema}/{table}/{pk}/{attr}.npy`)
Hash Registry Refactoring

- Hash-addressed paths stored under `_hash/{schema}/{hash}`
Codec Types

Built-in codecs: `<blob>`, `<blob@store>`, `<attach@store>`, `<npy@store>`, `<object@store>`

Plugin codecs (separate packages, coming soon):
- `<zarr@store>` - Zarr arrays
- `<tiff@store>` - TIFF images
- `<parquet@store>` - Parquet tables

Addressing Schemes
Usage
Changes

New:
- `hash_registry.py` - Refactored from `content_registry.py` with path-based storage
- `SchemaCodec` - Abstract base class for schema-addressed codecs
- `NpyRef` - Lazy reference with metadata access
- `NpyCodec` - Codec implementation using the `.npy` format

Refactoring:
- `ObjectCodec` now inherits from `SchemaCodec`
- `is_external` → `is_store` throughout codebase
- `hash_registry` functions use stored paths for retrieval
- `gc.py` updated to work with paths instead of hashes

Test Plan
- `NpyRef` metadata and `mmap_mode`
Documentation

See `datajoint-docs` (`docs-2.0-migration` branch).

Co-Authored-By: Claude Code <noreply@anthropic.com>