Added Custom Ref Handlers

CLAUDE.md

@@ -84,19 +84,36 @@ pip install -e .[dev]

- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: `__object__`, `__id__`, `__tuple__`, `__set__`, `__ref__`, `__digest__`, `__enum__` (illustrated in the sketch after this list)
- Objects can define a `use_refs()` method to specify fields that should be pickled instead of JSON-serialized
  - `__ref__`: Used for the `use_refs()` system (pickle-based storage)
  - `__digest__`: Used by `BaseRefHandler` for custom binary formats (numpy, etc.)

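For orientation, a hypothetical snapshot fragment showing where these tags appear. Only the tag keys come from this document; the class name, field names, values, and exact payload shapes are invented assumptions, not the engine's actual output:

```python
# Hypothetical snapshot fragment; the tag keys are real, everything else
# (names, values, payload shapes) is an illustrative assumption.
snapshot = {
    "__object__": "myapp.models.Order",   # assumed: marks a serialized object
    "__id__": 1,                          # assumed: ID for circular-reference tracking
    "items": {"__tuple__": [1, 2, 3]},    # assumed: tuple encoded as a tagged list
    "labels": {"__set__": ["a", "b"]},    # assumed: set encoded as a tagged list
    "blob": {"__ref__": "3f5a..."},       # use_refs() field, pickled into refs/
}
```
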
**Handlers** (`src/dbengine/handlers.py`)

- Extensible handler system for custom type serialization
- Three-tier hierarchy (sketched below):
  - `BaseHandler`: Base interface with `is_eligible_for()` and `tag()`
  - `BaseInlineHandler`: For JSON-inline storage (e.g., DateHandler); implements `serialize(obj) → dict` and `deserialize(dict) → obj`
  - `BaseRefHandler`: For custom binary formats stored in `refs/` (e.g., DataFrames); implements `serialize_to_bytes(obj) → bytes` and `deserialize_from_bytes(bytes) → obj`
- Currently implements `DateHandler` (a `BaseInlineHandler`) for `datetime.date` objects
- Use `handlers.register_handler()` to add custom handlers

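A minimal sketch of this hierarchy, assuming ABC-style base classes (the method names come from this document; the abstract-class structure itself is an assumption, not the actual source):

```python
# Sketch only: method names are from the documentation above; the ABC
# structure is an assumption about how the real base classes are defined.
from abc import ABC, abstractmethod

class BaseHandler(ABC):
    @abstractmethod
    def is_eligible_for(self, obj) -> bool:
        """Return True if this handler can serialize obj."""

    @abstractmethod
    def tag(self) -> str:
        """Identifier stored under the "__special__" key."""

class BaseInlineHandler(BaseHandler):
    @abstractmethod
    def serialize(self, obj) -> dict:
        """Return a JSON-compatible dict stored inline in the snapshot."""

    @abstractmethod
    def deserialize(self, data: dict) -> object:
        """Rebuild the object from its inline dict."""

class BaseRefHandler(BaseHandler):
    @abstractmethod
    def serialize_to_bytes(self, obj) -> bytes:
        """Return bytes stored in refs/ under their SHA-256 digest."""

    @abstractmethod
    def deserialize_from_bytes(self, data: bytes) -> object:
        """Rebuild the object from bytes loaded from refs/."""
```
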
**Utils** (`src/dbengine/utils.py`)

- Type checking utilities: `is_primitive()`, `is_dictionary()`, `is_list()`, etc.
- Class introspection: `get_full_qualified_name()`, `importable_name()`, `get_class()`
- SHA-256 digest computation: `compute_digest_from_stream()`, `compute_digest_from_bytes()` (sketched below)

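A plausible shape for these helpers; this is a sketch assuming straightforward hashlib usage, not the actual bodies in `src/dbengine/utils.py`:

```python
# Sketch of the digest helpers named above; bodies are assumptions based on
# the documented SHA-256 usage.
import hashlib
from typing import BinaryIO

def compute_digest_from_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def compute_digest_from_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    h = hashlib.sha256()
    # Read in chunks so large streams never have to fit in memory
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()
```
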
**RefHelper and PickleRefHelper** (`src/dbengine/dbengine.py`)

- `RefHelper`: Base class for content-addressable storage in the `refs/` directory (see the sketch after this list)
  - `save_ref_from_bytes(data: bytes) → digest`: Store raw bytes
  - `load_ref_to_bytes(digest) → bytes`: Load raw bytes
  - Used by `BaseRefHandler` for custom binary formats
- `PickleRefHelper(RefHelper)`: Adds a pickle serialization layer
  - `save_ref(obj) → digest`: Pickle and store an object
  - `load_ref(digest) → obj`: Load and unpickle an object
  - Used by the `use_refs()` system and the `Serializer`

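A minimal sketch of how the two layers fit together, assuming the `refs/{digest_prefix}/{full_digest}` layout described under Storage Architecture (the constructor and file handling are invented for illustration):

```python
# Sketch only: method names and the refs/ layout come from this document;
# the constructor and directory handling are illustrative assumptions.
import hashlib
import pickle
from pathlib import Path

class RefHelper:
    def __init__(self, refs_dir: str):
        self.refs_dir = Path(refs_dir)

    def save_ref_from_bytes(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.refs_dir / digest[:24] / digest   # {digest_prefix}/{full_digest}
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)   # content-addressable: same bytes, same path
        return digest

    def load_ref_to_bytes(self, digest: str) -> bytes:
        return (self.refs_dir / digest[:24] / digest).read_bytes()

class PickleRefHelper(RefHelper):
    def save_ref(self, obj) -> str:
        return self.save_ref_from_bytes(pickle.dumps(obj))   # pickle layer on top

    def load_ref(self, digest: str):
        return pickle.loads(self.load_ref_to_bytes(digest))
```
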
### Storage Architecture
@@ -107,11 +124,15 @@ pip install -e .[dev]
│   └── objects/
│       └── {digest_prefix}/        # First 24 chars of digest
│           └── {full_digest}       # JSON snapshot with metadata
└── refs/                           # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}           # Pickle or custom binary format
```

**Note**: The `refs/` directory stores binary data in content-addressable format:

- Pickled objects (via `use_refs()` or `PickleRefHelper`)
- Custom binary formats (via `BaseRefHandler`, e.g., numpy arrays)

### Metadata System
Each snapshot includes automatic metadata fields:
@@ -163,13 +184,17 @@ Objects can opt into pickle-based storage for specific fields:

### Custom Type Handlers

MyDbEngine supports two types of custom handlers for serializing types:

#### 1. BaseInlineHandler - For JSON Storage

Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).

**Example: Custom date handler**

```python
from dbengine.handlers import BaseInlineHandler, handlers

class MyCustomHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)

@@ -178,26 +203,85 @@ class MyCustomHandler(BaseHandler):

    def serialize(self, obj) -> dict:
        return {
            "__special__": self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])
```

**Register the handler:**

```python
from dbengine.handlers import handlers

# Register the handler
handlers.register_handler(MyCustomHandler())
```

**When to use BaseInlineHandler:**

- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., datetime.date)

#### 2. BaseRefHandler - For Binary Storage

Use when data should be stored in an optimized binary format in the `refs/` directory (large datasets, better compression).

**Example: pandas DataFrame handler**

```python
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df) -> bytes:
        """Convert DataFrame to compact binary format"""
        # Store JSON metadata (columns, index, dtype) followed by raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()

        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data: bytes) -> object:
        """Reconstruct DataFrame from binary format"""
        # Read the 4-byte length prefix, then the JSON metadata
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]

        # Reconstruct the array, then the DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))

        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

# Register the handler
handlers.register_handler(DataFrameHandler())
```

**When to use BaseRefHandler:**

- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from a compact custom encoding (e.g., numpy's raw buffer format)
- Types that lose information in JSON conversion
- Shared data across snapshots (automatic deduplication via SHA-256)

**Key differences:**

- `BaseInlineHandler`: Data stored in the JSON snapshot → `{"__special__": "Tag", "data": {...}}`
- `BaseRefHandler`: Data stored in `refs/` → `{"__special__": "Tag", "__digest__": "abc123..."}`
- `BaseRefHandler` provides automatic deduplication and smaller JSON snapshots

### Using References (use_refs)
@@ -215,16 +299,23 @@ class MyDataObject:
        return {"large_dataframe"}
```

**When to use use_refs():**

- Quick solution for large nested objects without writing a custom handler (see the sketch below)
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)

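A complete (hypothetical) version of the `MyDataObject` example whose ending is shown above; only `large_dataframe` and the `use_refs()` contract come from this document, the other fields are invented:

```python
# Hypothetical class opting into refs for one field; illustrative only.
import pandas as pd

class MyDataObject:
    def __init__(self, name: str, large_dataframe: pd.DataFrame):
        self.name = name                         # serialized inline as JSON
        self.large_dataframe = large_dataframe   # pickled into refs/ instead

    def use_refs(self):
        # Fields named here are stored as {"__ref__": "<digest>"} in the snapshot
        return {"large_dataframe"}
```
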
**use_refs() vs BaseRefHandler:**

- `use_refs()`: Uses pickle (via `PickleRefHelper`); simple but less optimized
- `BaseRefHandler`: Custom binary format (e.g., numpy); optimized but requires handler code
- Both store in `refs/` and get automatic SHA-256 deduplication
- `use_refs()` generates `{"__ref__": "digest"}` tags
- `BaseRefHandler` generates `{"__special__": "Tag", "__digest__": "digest"}` tags

**Trade-offs:**

- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (`use_refs()` only)

## Testing Notes