Added Custom Ref Handlers

2025-12-21 17:42:17 +01:00
parent b17fc450a2
commit 618e21e012
8 changed files with 655 additions and 181 deletions

README.md

@@ -1,187 +1,352 @@
# MyDbEngine
A lightweight, git-inspired versioned database engine for Python with content-addressable storage and complete history tracking.
## What is MyDbEngine?
MyDbEngine is a file-based versioned database that treats data like Git treats code. Every modification creates an immutable snapshot with a SHA-256 digest, enabling complete history tracking, deduplication, and multi-tenant isolation.
**Key Features:**
- **Immutable Snapshots**: Every change creates a new version, never modifying existing data
- **Content-Addressable Storage**: Identical objects stored only once, referenced by SHA-256 digest
- **Multi-Tenant**: Isolated storage per tenant with shared deduplication in `refs/`
- **Extensible Serialization**: Custom handlers for optimized storage (JSON, binary, pickle)
- **Thread-Safe**: Built-in RLock for concurrent access
- **Zero Dependencies**: Pure Python with no runtime dependencies (pytest only for dev)
**When to Use:**
- Version tracking for configuration, user data, or application state
- Multi-tenant applications requiring isolated data with shared deduplication
- Scenarios where you need both human-readable JSON and optimized binary storage
**When NOT to Use:**
- High-frequency writes (creates a snapshot per modification)
- Relational queries (no SQL, no joins)
- Large-scale production databases (file-based, not optimized for millions of records)
## Installation
```bash
pip install mydbengine
```
## Quick Start
```python
from dbengine.dbengine import DbEngine
# Initialize engine and tenant
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")
# Save and load data
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
print(data) # {"theme": "dark", "lang": "en"}
```
## Core Concepts
### Immutable Snapshots
Each `save()` or `put()` operation creates a new snapshot with automatic metadata:
- `__parent__`: List containing digest of previous version (or `[None]` for first)
- `__user_id__`: User ID who created the snapshot
- `__date__`: Timestamp in `YYYYMMDD HH:MM:SS %z` format
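For illustration, a snapshot for the `config` entry after one update might look like this (field values are hypothetical; the exact on-disk layout is an implementation detail):
```python
# Hypothetical snapshot contents; values are made up for illustration
snapshot = {
    "theme": "dark",
    "lang": "en",
    "__parent__": ["9f2b1c..."],           # digest of the previous version; [None] on first save
    "__user_id__": "user_1",               # who created this snapshot
    "__date__": "20251221 17:42:17 +0100"  # YYYYMMDD HH:MM:SS %z
}
```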
### Storage Architecture
```
.mytools_db/
├── {tenant_id}/
│   ├── head                      # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/      # First 24 chars of digest
│           └── {full_digest}     # JSON snapshot with metadata
└── refs/                         # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}         # Pickle or custom binary format
```
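As a rough sketch of the content addressing above (the helper and the canonical-JSON digesting are illustrative assumptions, not the library's API), a snapshot path could be derived from its digest like this:
```python
import hashlib
import json
import os

def object_path(root: str, tenant_id: str, payload: dict) -> str:
    """Illustrative only: map a snapshot to its sharded storage path."""
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    # Shard by the first 24 characters of the digest, as in the tree above
    return os.path.join(root, tenant_id, "objects", digest[:24], digest)
```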
### Two Usage Patterns
**Pattern 1: Snapshot-based** - Store complete object states
```python
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
config = engine.load("tenant_1", "config")
```
**Pattern 2: Record-based** - Incremental updates to collections
```python
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users") # Returns list of all users
```
**Important:** Do not mix patterns for the same entry - they use different data structures.
## Basic Usage
### Save and Load Complete Snapshots
```python
tenant_id = "my_company"
db.init(tenant_id)
```
# Save any Python object
data = {"users": ["alice", "bob"], "count": 2}
digest = engine.save("tenant_1", "user_1", "session", data)
### Save Data
```python
# Save a complete object
user_id = "john_doe"
entry = "users"
data = {"name": "John", "age": 30}
digest = db.save(tenant_id, user_id, entry, data)
```
### Load Data
```python
# Load latest version
data = db.load(tenant_id, entry="users")
session = engine.load("tenant_1", "session")
# Load specific version by digest
data = db.load(tenant_id, entry="users", digest="abc123...")
old_session = engine.load("tenant_1", "session", digest=digest)
```
### Incremental Record Updates
```python
# Add/update single record
engine.put("tenant_1", "user_1", "users", "alice", {"name": "Alice", "role": "admin"})

# Add/update multiple records
users = {
    "bob": {"name": "Bob", "role": "user"},
    "charlie": {"name": "Charlie", "role": "user"}
}
engine.put_many("tenant_1", "user_1", "users", users)

# Get specific record
alice = engine.get("tenant_1", "users", key="alice")

# Get all records as list
all_users = engine.get("tenant_1", "users")
```
### History Navigation
```python
# Get history chain (list of digests, newest first)
history = engine.history("tenant_1", "config", max_items=10)

# Load previous version
previous = engine.load("tenant_1", "config", digest=history[1])

# Check if entry exists
if engine.exists("tenant_1", "config"):
    print("Entry exists")
```
## Custom Serialization
MyDbEngine supports three approaches for custom serialization:
### 1. BaseInlineHandler - JSON Storage
For small data types that should be human-readable in snapshots:
```python
from dbengine.handlers import BaseInlineHandler, handlers
import datetime

class DateHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, datetime.date)

    def tag(self):
        return "Date"

    def serialize(self, obj):
        return {
            "__special__": self.tag(),
            "year": obj.year,
            "month": obj.month,
            "day": obj.day
        }

    def deserialize(self, data):
        return datetime.date(year=data["year"], month=data["month"], day=data["day"])
handlers.register_handler(DateHandler())
```
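Once registered, the handler is picked up transparently; a minimal usage sketch, assuming registered inline handlers are applied during `save()` and `load()`:
```python
# Dates inside saved objects now round-trip through the custom handler
engine.save("tenant_1", "user_1", "release", {"ship_date": datetime.date(2025, 12, 21)})
release = engine.load("tenant_1", "release")
print(release["ship_date"])  # datetime.date(2025, 12, 21)
```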
### 2. BaseRefHandler - Optimized Binary Storage
For large data structures that benefit from custom binary formats:
```python
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df):
        """Convert DataFrame to a compact binary format."""
        # Store length-prefixed JSON metadata followed by raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()
        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data):
        """Reconstruct a DataFrame from the binary format."""
        # Read the length-prefixed metadata header
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]
        # Reconstruct the array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))
        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

handlers.register_handler(DataFrameHandler())
# Now DataFrames are automatically stored in optimized binary format
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
engine.save("tenant_1", "user_1", "data", df)
```
**Result:**
- JSON snapshot contains: `{"__special__": "DataFrame", "__digest__": "abc123..."}`
- Binary data stored in `refs/abc123...` (more compact than pickle)
- Automatic deduplication across tenants
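A quick round-trip check, reusing `df` from the example above and assuming `load()` consults the same handler to rebuild the object from `refs/`:
```python
# save() writes the binary blob to refs/; load() reconstructs the DataFrame
digest = engine.save("tenant_1", "user_1", "data", df)
restored = engine.load("tenant_1", "data")
assert restored.equals(df)
```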
### 3. use_refs() - Selective Pickle Storage
For objects with specific fields that should be pickled:
```python
import numpy as np

class MyDataObject:
    def __init__(self, metadata, large_array):
        self.metadata = metadata
        self.large_array = large_array  # Large numpy array or similar

    @staticmethod
    def use_refs():
        """Fields to pickle instead of JSON-serialize."""
        return {"large_array"}

# metadata goes to JSON, large_array goes to refs/ (pickled)
obj = MyDataObject({"name": "dataset_1"}, np.zeros((1000, 1000)))
engine.save("tenant_1", "user_1", "my_data", obj)
```
**Comparison:**
| Approach | Storage | Format | Use Case |
|----------|---------|--------|----------|
| `BaseInlineHandler` | JSON snapshot | Custom dict | Small data, human-readable |
| `BaseRefHandler` | `refs/` directory | Custom binary | Large data, optimized format |
| `use_refs()` | `refs/` directory | Pickle | Quick solution, no handler needed |
## API Reference
### Initialization
| Method | Description |
|--------|-------------|
| `DbEngine(root: str = ".mytools_db")` | Initialize engine with storage root |
| `init(tenant_id: str)` | Create tenant directory structure |
| `is_initialized(tenant_id: str) -> bool` | Check if tenant is initialized |
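A typical idempotent-setup sketch using these methods:
```python
# Initialize the tenant only once; skip init() on later runs
engine = DbEngine(root=".mytools_db")
if not engine.is_initialized("tenant_2"):
    engine.init("tenant_2")
```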
### Data Operations
| Method | Description |
|--------|-------------|
| `save(tenant_id, user_id, entry, obj) -> str` | Save complete snapshot, returns digest |
| `load(tenant_id, entry, digest=None) -> object` | Load snapshot (latest if digest=None) |
| `put(tenant_id, user_id, entry, key, value) -> bool` | Add/update single record |
| `put_many(tenant_id, user_id, entry, items) -> bool` | Add/update multiple records |
| `get(tenant_id, entry, key=None, digest=None) -> object` | Get record(s) |
| `exists(tenant_id, entry) -> bool` | Check if entry exists |
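One combination the table allows but earlier examples do not show is reading a single record from an older snapshot; a sketch:
```python
# Fetch "alice" as she existed one snapshot ago
history = engine.history("tenant_1", "users", max_items=2)
if len(history) > 1:
    old_alice = engine.get("tenant_1", "users", key="alice", digest=history[1])
```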
### History
| Method | Description |
|--------|-------------|
| `history(tenant_id, entry, digest=None, max_items=1000) -> list` | Get history chain of digests |
| `get_digest(tenant_id, entry) -> str` | Get current digest for entry |
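`get_digest()` is useful for pinning a version; a sketch:
```python
# Pin the current version, then read that exact version later
pinned = engine.get_digest("tenant_1", "config")
config_at_pin = engine.load("tenant_1", "config", digest=pinned)
```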
## Performance & Limitations
**Strengths:**
- ✅ Deduplication: Identical objects stored once (SHA-256 content addressing)
- ✅ History: Complete audit trail with zero overhead for unchanged data
- ✅ Custom formats: Binary handlers optimize storage (e.g., numpy vs pickle)
**Limitations:**
- ❌ **File-based**: Not suitable for high-throughput applications
- ❌ **No indexing**: No SQL queries, no complex filtering
- ❌ **Snapshot overhead**: Each change creates a new snapshot
- ❌ **History chains**: Long histories require multiple file reads
**Performance Tips:**
- Use `put_many()` instead of multiple `put()` calls - one snapshot instead of many (see the sketch below)
- Use `BaseRefHandler` for large binary data instead of pickle
- Limit history traversal with `max_items` parameter
- Consider archiving old snapshots for long-running entries
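A minimal sketch of the batching tip (snapshot counts follow from the one-snapshot-per-modification rule):
```python
# Two put() calls -> two snapshots in the history of "users"
engine.put("tenant_1", "user_1", "users", "dave", {"role": "user"})
engine.put("tenant_1", "user_1", "users", "erin", {"role": "user"})

# One put_many() call -> a single snapshot covering both updates
engine.put_many("tenant_1", "user_1", "users", {
    "dave": {"role": "user"},
    "erin": {"role": "user"},
})
```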
## Development
### Running Tests
```bash
# All tests
pytest
# Specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py
# Single test
pytest tests/test_dbengine.py::test_i_can_save_and_load
```
### Building Package
```bash
# Build distribution
python -m build
# Clean build artifacts
make clean
```
### Project Structure
```
src/dbengine/
├── dbengine.py      # Main DbEngine and RefHelper classes
├── serializer.py    # JSON serialization with handlers
├── handlers.py      # BaseHandler, BaseInlineHandler, BaseRefHandler
└── utils.py         # Type checking and digest computation

tests/
├── test_dbengine.py    # DbEngine functionality tests
└── test_serializer.py  # Serialization and handler tests
```
## Contributing
This is a personal implementation. For bug reports or feature requests, please contact the author.
## License
See LICENSE file for details.
## Version History
- 0.1.0 - Initial release
- 0.2.0 - Added custom reference handlers