Added Custom Ref Handlers

2025-12-21 17:42:17 +01:00
parent b17fc450a2
commit 618e21e012
8 changed files with 655 additions and 181 deletions

CLAUDE.md (145 changed lines)

@@ -84,19 +84,36 @@ pip install -e .[dev]
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: `__object__`, `__id__`, `__tuple__`, `__set__`, `__ref__`, `__digest__`, `__enum__`
- Objects can define `use_refs()` method to specify fields that should be pickled instead of JSON-serialized
- `__ref__`: Used for `use_refs()` system (pickle-based storage)
- `__digest__`: Used by BaseRefHandler for custom binary formats (numpy, etc.)
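As an illustration of those tags, a snapshot dict might take shapes like the following. This is a hypothetical sketch: the field layouts and digest values are illustrative, not the serializer's exact output.

```python
# Hypothetical serialized shapes for the tag families listed above.
# Field layouts and digest strings are illustrative, not the engine's exact format.

# An object with identity tracking (enables circular-reference handling):
snapshot = {"__object__": "mypkg.Point", "__id__": 1, "x": 1, "y": 2}

# Containers that JSON cannot represent natively:
as_tuple = {"__tuple__": [1, 2, 3]}
as_set = {"__set__": ["a", "b"]}

# A field pickled via use_refs() -> content-addressable digest:
pickled_ref = {"__ref__": "ab12..."}

# A value handled by a BaseRefHandler -> handler tag plus digest:
binary_ref = {"__special__": "DataFrame", "__digest__": "ab12..."}
```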
**Handlers** (`src/dbengine/handlers.py`)
- Extensible handler system for custom type serialization
- Three-tier hierarchy:
  - `BaseHandler`: Base interface with `is_eligible_for()` and `tag()`
  - `BaseInlineHandler`: For JSON-inline storage (e.g., DateHandler); implements `serialize(obj) → dict` and `deserialize(dict) → obj`
  - `BaseRefHandler`: For custom binary formats stored in `refs/` (e.g., DataFrames); implements `serialize_to_bytes(obj) → bytes` and `deserialize_from_bytes(bytes) → obj`
- Currently implements `DateHandler` (BaseInlineHandler) for datetime.date objects
- Use `handlers.register_handler()` to add custom handlers
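The hierarchy above can be sketched as abstract base classes. Method names follow this document; the ABC bodies are illustrative, not the actual source.

```python
from abc import ABC, abstractmethod

# Sketch of the three-tier handler hierarchy described above.
class BaseHandler(ABC):
    @abstractmethod
    def is_eligible_for(self, obj) -> bool: ...

    @abstractmethod
    def tag(self) -> str: ...

class BaseInlineHandler(BaseHandler):
    # Inline handlers round-trip through JSON-compatible dicts.
    @abstractmethod
    def serialize(self, obj) -> dict: ...

    @abstractmethod
    def deserialize(self, data: dict) -> object: ...

class BaseRefHandler(BaseHandler):
    # Ref handlers round-trip through raw bytes stored in refs/.
    @abstractmethod
    def serialize_to_bytes(self, obj) -> bytes: ...

    @abstractmethod
    def deserialize_from_bytes(self, data: bytes) -> object: ...
```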
**Utils** (`src/dbengine/utils.py`)
- Type checking utilities: `is_primitive()`, `is_dictionary()`, `is_list()`, etc.
- Class introspection: `get_full_qualified_name()`, `importable_name()`, `get_class()`
- Digest computation with SHA-256: `compute_digest_from_stream()`, `compute_digest_from_bytes()`
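Assuming these helpers are thin wrappers over `hashlib` (the actual implementation is not shown in this section), they would look roughly like:

```python
import hashlib
import io

# Plausible implementations of the two digest helpers named above,
# assuming plain SHA-256 hex digests; the real code may differ.
def compute_digest_from_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def compute_digest_from_stream(stream, chunk_size: int = 65536) -> str:
    # Hash incrementally so large refs never need to fit in memory.
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()
```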
**RefHelper and PickleRefHelper** (`src/dbengine/dbengine.py`)
- `RefHelper`: Base class for content-addressable storage in `refs/` directory
- `save_ref_from_bytes(data: bytes) → digest`: Store raw bytes
- `load_ref_to_bytes(digest) → bytes`: Load raw bytes
- Used by `BaseRefHandler` for custom binary formats
- `PickleRefHelper(RefHelper)`: Adds pickle serialization layer
- `save_ref(obj) → digest`: Pickle and store object
- `load_ref(digest) → obj`: Load and unpickle object
- Used by `use_refs()` system and `Serializer`
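A minimal sketch of this split, assuming digests map to `refs/{first 24 chars}/{digest}` as described under Storage Architecture. Method names follow the document; bodies are illustrative.

```python
import hashlib
import pickle
from pathlib import Path

class RefHelper:
    """Illustrative content-addressable byte store under refs/."""
    def __init__(self, root):
        self.root = Path(root) / "refs"

    def save_ref_from_bytes(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest[:24] / digest
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # content-addressable: identical bytes dedupe
            path.write_bytes(data)
        return digest

    def load_ref_to_bytes(self, digest: str) -> bytes:
        return (self.root / digest[:24] / digest).read_bytes()

class PickleRefHelper(RefHelper):
    """Adds a pickle layer on top of the raw byte store."""
    def save_ref(self, obj) -> str:
        return self.save_ref_from_bytes(pickle.dumps(obj))

    def load_ref(self, digest: str):
        return pickle.loads(self.load_ref_to_bytes(digest))
```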
### Storage Architecture
@@ -107,11 +124,15 @@ pip install -e .[dev]
│ └── objects/
│ └── {digest_prefix}/ # First 24 chars of digest
│ └── {full_digest} # JSON snapshot with metadata
└── refs/ # Shared binary references (cross-tenant)
└── {digest_prefix}/
└── {full_digest} # Pickle or custom binary format
```
**Note**: The `refs/` directory stores binary data in content-addressable format:
- Pickled objects (via `use_refs()` or `PickleRefHelper`)
- Custom binary formats (via `BaseRefHandler`, e.g., numpy arrays)
### Metadata System
Each snapshot includes automatic metadata fields:
@@ -163,13 +184,17 @@ Objects can opt into pickle-based storage for specific fields:
### Custom Type Handlers
To serialize custom types that aren't handled by default, MyDbEngine supports two kinds of custom handlers:
#### 1. BaseInlineHandler - For JSON Storage
Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).
**Example: custom inline handler**
```python
from dbengine.handlers import BaseInlineHandler, handlers

class MyCustomHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)
@@ -178,26 +203,85 @@ class MyCustomHandler(BaseHandler):
    def serialize(self, obj) -> dict:
        return {
            "__special__": self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])
```
**Register the handler:**
```python
from dbengine.handlers import handlers
# Register the handler
handlers.register_handler(MyCustomHandler())
```
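`register_handler()`'s internals aren't shown in this file; a plausible registry, assuming handlers are scanned in registration order via `is_eligible_for()`, would be:

```python
class HandlerRegistry:
    """Illustrative stand-in for the module-level `handlers` object."""
    def __init__(self):
        self._handlers = []

    def register_handler(self, handler):
        self._handlers.append(handler)

    def find_handler(self, obj):
        # First eligible handler wins; None means default serialization.
        for handler in self._handlers:
            if handler.is_eligible_for(obj):
                return handler
        return None
```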
**When to use BaseInlineHandler:**
- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., datetime.date)
#### 2. BaseRefHandler - For Binary Storage
Use when data should be stored in an optimized binary format in the `refs/` directory (large datasets, better compression).
**Example: pandas DataFrame handler**
```python
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df) -> bytes:
        """Convert DataFrame to a compact binary format.

        Assumes a homogeneous dtype across all columns.
        """
        # Store a length-prefixed JSON metadata header, then raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.to_numpy().dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()
        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data: bytes) -> object:
        """Reconstruct DataFrame from the binary format"""
        import numpy as np
        # Read the length-prefixed metadata header
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]
        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))
        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

# Register the handler
handlers.register_handler(DataFrameHandler())
```
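The 4-byte length-prefix framing used by the handler above can be exercised without pandas; a stdlib-only sketch of the same pack/unpack scheme:

```python
import json

# Same framing as the DataFrame handler: 4-byte big-endian header length,
# then a JSON metadata header, then the raw payload bytes.
def pack(metadata: dict, payload: bytes) -> bytes:
    meta = json.dumps(metadata).encode("utf-8")
    return len(meta).to_bytes(4, "big") + meta + payload

def unpack(blob: bytes):
    n = int.from_bytes(blob[:4], "big")
    metadata = json.loads(blob[4:4 + n].decode("utf-8"))
    return metadata, blob[4 + n:]
```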
**When to use BaseRefHandler:**
- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from custom compression (e.g., numpy's compact format)
- Types that lose information in JSON conversion
- Shared data across snapshots (automatic deduplication via SHA-256)
**Key differences:**
- `BaseInlineHandler`: Data stored in JSON snapshot → `{"__special__": "Tag", "data": {...}}`
- `BaseRefHandler`: Data stored in `refs/` → `{"__special__": "Tag", "__digest__": "abc123..."}`
- BaseRefHandler provides automatic deduplication and smaller JSON snapshots
### Using References (use_refs)
@@ -215,16 +299,23 @@ class MyDataObject:
return {"large_dataframe"}
```
**When to use use_refs():**
- Quick solution for large nested objects without writing custom handler
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)
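For reference, a complete (hypothetical) object using the pattern, with plain attributes so it stands alone; only the field names returned by `use_refs()` are pickled to `refs/`.

```python
class ExperimentRun:
    """Hypothetical object: 'results' is pickled to refs/, the rest stays JSON."""
    def __init__(self, name, results):
        self.name = name        # small field, kept in the JSON snapshot
        self.results = results  # large field, stored via pickle in refs/

    def use_refs(self):
        # Field names listed here are pickled instead of JSON-serialized.
        return {"results"}
```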
**use_refs() vs BaseRefHandler:**
- `use_refs()`: Uses pickle (via `PickleRefHelper`), simple but less optimized
- `BaseRefHandler`: Custom binary format (e.g., numpy), optimized but requires handler code
- Both store in `refs/` and get automatic SHA-256 deduplication
- `use_refs()` generates `{"__ref__": "digest"}` tags
- `BaseRefHandler` generates `{"__special__": "Tag", "__digest__": "digest"}` tags
**Trade-offs:**
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (use_refs only)
## Testing Notes