# MyDbEngine

A lightweight, git-inspired versioned database engine for Python with content-addressable storage and complete history tracking.

## What is MyDbEngine?

MyDbEngine is a file-based versioned database that treats data like Git treats code. Every modification creates an immutable snapshot with a SHA-256 digest, enabling complete history tracking, deduplication, and multi-tenant isolation.
**Key Features:**

- **Immutable Snapshots**: Every change creates a new version, never modifying existing data
- **Content-Addressable Storage**: Identical objects stored only once, referenced by SHA-256 digest
- **Multi-Tenant**: Isolated storage per tenant with shared deduplication in `refs/`
- **Extensible Serialization**: Custom handlers for optimized storage (JSON, binary, pickle)
- **Thread-Safe**: Built-in `RLock` for concurrent access
- **Zero Dependencies**: Pure Python with no runtime dependencies (pytest only for dev)

**When to Use:**

- Version tracking for configuration, user data, or application state
- Multi-tenant applications requiring isolated data with shared deduplication
- Scenarios where you need both human-readable JSON and optimized binary storage

**When NOT to Use:**

- High-frequency writes (creates a snapshot per modification)
- Relational queries (no SQL, no joins)
- Large-scale production databases (file-based, not optimized for millions of records)
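The content-addressing idea behind deduplication can be sketched with the standard library. This is an illustration only: the canonical JSON encoding and the `digest_of` helper are assumptions for the sketch, not MyDbEngine's actual internals.

```python
import hashlib
import json

def digest_of(obj):
    # Hypothetical digest routine: SHA-256 over a canonical JSON encoding.
    # Sorting keys makes logically identical dicts hash identically.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

a = {"theme": "dark", "lang": "en"}
b = {"lang": "en", "theme": "dark"}  # same content, different key order

# Identical content yields an identical digest, so it needs storing only once
assert digest_of(a) == digest_of(b)
```

Because the digest depends only on content, saving the same object twice (even from different tenants, in `refs/`) costs no extra space.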
## Installation

```bash
pip install mydbengine
```
## Quick Start

```python
from dbengine.dbengine import DbEngine

# Initialize engine and tenant
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Save and load data
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
print(data)  # {'theme': 'dark', 'lang': 'en'}
```
## Core Concepts

### Immutable Snapshots

Each `save()` or `put()` operation creates a new snapshot with automatic metadata:

- `__parent__`: List containing the digest of the previous version (or `[None]` for the first)
- `__user_id__`: User ID who created the snapshot
- `__date__`: Timestamp in the format `YYYYMMDD HH:MM:SS %z`
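Putting those pieces together, a snapshot on disk might look roughly like the following (a hypothetical, illustrative example; the exact on-disk layout is an assumption here):

```json
{
  "__parent__": ["9f2c..."],
  "__user_id__": "user_1",
  "__date__": "20250101 12:00:00 +0000",
  "theme": "dark",
  "lang": "en"
}
```

The `__parent__` digest is what lets `history()` walk the chain of versions backwards.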
### Storage Architecture

```
.mytools_db/
├── {tenant_id}/
│   ├── head                  # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/  # First 24 chars of digest
│           └── {full_digest} # JSON snapshot with metadata
└── refs/                     # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}     # Pickle or custom binary format
```
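A digest maps onto this layout by sharding on its first 24 characters. The sketch below shows the idea; `object_path` is a hypothetical helper for illustration, not part of the public API.

```python
import hashlib
from pathlib import Path

def object_path(root, tenant_id, digest):
    # Mirror of the tree above: the first 24 hex chars form the shard
    # directory, and the full digest is the file name.
    return Path(root) / tenant_id / "objects" / digest[:24] / digest

digest = hashlib.sha256(b'{"theme": "dark"}').hexdigest()
print(object_path(".mytools_db", "tenant_1", digest))
```

Sharding keeps any single directory from accumulating an unbounded number of object files.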
### Two Usage Patterns

**Pattern 1: Snapshot-based** - Store complete object states

```python
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
config = engine.load("tenant_1", "config")
```

**Pattern 2: Record-based** - Incremental updates to collections

```python
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users
```

**Important:** Do not mix patterns for the same entry - they use different data structures.
## Basic Usage

### Save and Load Complete Snapshots

```python
# Save any Python object
data = {"users": ["alice", "bob"], "count": 2}
digest = engine.save("tenant_1", "user_1", "session", data)

# Load latest version
session = engine.load("tenant_1", "session")

# Load specific version by digest
old_session = engine.load("tenant_1", "session", digest=digest)
```
### Incremental Record Updates

```python
# Add/update single record
engine.put("tenant_1", "user_1", "users", "alice", {"name": "Alice", "role": "admin"})

# Add/update multiple records
users = {
    "bob": {"name": "Bob", "role": "user"},
    "charlie": {"name": "Charlie", "role": "user"}
}
engine.put_many("tenant_1", "user_1", "users", users)

# Get specific record
alice = engine.get("tenant_1", "users", key="alice")

# Get all records as list
all_users = engine.get("tenant_1", "users")
```
### History Navigation

```python
# Get history chain (list of digests, newest first)
history = engine.history("tenant_1", "config", max_items=10)

# Load previous version
previous = engine.load("tenant_1", "config", digest=history[1])

# Check if entry exists
if engine.exists("tenant_1", "config"):
    print("Entry exists")
```
## Custom Serialization

MyDbEngine supports three approaches to custom serialization:

### 1. BaseInlineHandler - JSON Storage

For small data types that should be human-readable in snapshots:
```python
from dbengine.handlers import BaseInlineHandler, handlers
import datetime

class DateHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, datetime.date)

    def tag(self):
        return "Date"

    def serialize(self, obj):
        return {
            "__special__": self.tag(),
            "year": obj.year,
            "month": obj.month,
            "day": obj.day
        }

    def deserialize(self, data):
        return datetime.date(year=data["year"], month=data["month"], day=data["day"])

handlers.register_handler(DateHandler())
```
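The contract such a handler must satisfy is a lossless round trip. The standalone sketch below mirrors `DateHandler`'s dict format with plain functions (no `dbengine` import) purely to illustrate that contract:

```python
import datetime

# Plain-function mirror of DateHandler.serialize / .deserialize
def serialize(obj):
    return {"__special__": "Date", "year": obj.year, "month": obj.month, "day": obj.day}

def deserialize(data):
    return datetime.date(year=data["year"], month=data["month"], day=data["day"])

d = datetime.date(2024, 5, 17)
assert deserialize(serialize(d)) == d  # lossless round trip
```

The `__special__` tag is what lets the deserializer recognize which handler produced a given dict.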
### 2. BaseRefHandler - Optimized Binary Storage

For large data structures that benefit from custom binary formats:
```python
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df):
        """Convert DataFrame to compact binary format"""
        # Store metadata + numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()

        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data):
        """Reconstruct DataFrame from binary format"""
        # Read metadata
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]

        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))

        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

handlers.register_handler(DataFrameHandler())

# Now DataFrames are automatically stored in optimized binary format
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
engine.save("tenant_1", "user_1", "data", df)
```
**Result:**

- JSON snapshot contains: `{"__special__": "DataFrame", "__digest__": "abc123..."}`
- Binary data stored in `refs/abc123...` (more compact than pickle)
- Automatic deduplication across tenants
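The length-prefixed framing used by the handler can be exercised on its own, without pandas. This standalone sketch frames JSON metadata ahead of an arbitrary payload and parses it back:

```python
import json

def frame(metadata, payload):
    # 4-byte big-endian length prefix, then JSON metadata, then the raw payload
    meta = json.dumps(metadata).encode("utf-8")
    return len(meta).to_bytes(4, "big") + meta + payload

def unframe(data):
    # Read the length prefix, split metadata from payload
    n = int.from_bytes(data[:4], "big")
    metadata = json.loads(data[4:4 + n].decode("utf-8"))
    return metadata, data[4 + n:]

blob = frame({"dtype": "int64", "shape": [2, 3]}, b"\x01\x02\x03")
meta, payload = unframe(blob)
assert meta == {"dtype": "int64", "shape": [2, 3]}
assert payload == b"\x01\x02\x03"
```

The fixed-width prefix avoids any delimiter that could collide with bytes in the payload.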
### 3. use_refs() - Selective Pickle Storage

For objects with specific fields that should be pickled:
```python
import numpy as np

class MyDataObject:
    def __init__(self, metadata, large_array):
        self.metadata = metadata
        self.large_array = large_array  # Large numpy array or similar

    @staticmethod
    def use_refs():
        """Fields to pickle instead of JSON-serialize"""
        return {"large_array"}

# metadata goes to JSON, large_array goes to refs/ (pickled)
obj = MyDataObject({"name": "dataset_1"}, np.zeros((1000, 1000)))
engine.save("tenant_1", "user_1", "my_data", obj)
```
**Comparison:**

| Approach | Storage | Format | Use Case |
|----------|---------|--------|----------|
| `BaseInlineHandler` | JSON snapshot | Custom dict | Small data, human-readable |
| `BaseRefHandler` | `refs/` directory | Custom binary | Large data, optimized format |
| `use_refs()` | `refs/` directory | Pickle | Quick solution, no handler needed |
## API Reference

### Initialization

| Method | Description |
|--------|-------------|
| `DbEngine(root: str = ".mytools_db")` | Initialize engine with storage root |
| `init(tenant_id: str)` | Create tenant directory structure |
| `is_initialized(tenant_id: str) -> bool` | Check if tenant is initialized |

### Data Operations

| Method | Description |
|--------|-------------|
| `save(tenant_id, user_id, entry, obj) -> str` | Save complete snapshot, returns digest |
| `load(tenant_id, entry, digest=None) -> object` | Load snapshot (latest if digest=None) |
| `put(tenant_id, user_id, entry, key, value) -> bool` | Add/update single record |
| `put_many(tenant_id, user_id, entry, items) -> bool` | Add/update multiple records |
| `get(tenant_id, entry, key=None, digest=None) -> object` | Get record(s) |
| `exists(tenant_id, entry) -> bool` | Check if entry exists |
### History

| Method | Description |
|--------|-------------|
| `history(tenant_id, entry, digest=None, max_items=1000) -> list` | Get history chain of digests |
| `get_digest(tenant_id, entry) -> str` | Get current digest for entry |
## Performance & Limitations

**Strengths:**

- ✅ Deduplication: Identical objects stored once (SHA-256 content addressing)
- ✅ History: Complete audit trail with zero overhead for unchanged data
- ✅ Custom formats: Binary handlers optimize storage (e.g., numpy vs pickle)

**Limitations:**

- ❌ **File-based**: Not suitable for high-throughput applications
- ❌ **No indexing**: No SQL queries, no complex filtering
- ❌ **Snapshot overhead**: Each change creates a new snapshot
- ❌ **History chains**: Long histories require multiple file reads

**Performance Tips:**

- Use `put_many()` instead of multiple `put()` calls (creates one snapshot)
- Use `BaseRefHandler` for large binary data instead of pickle
- Limit history traversal with the `max_items` parameter
- Consider archiving old snapshots for long-running entries
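The first tip can be made concrete with a toy model (not MyDbEngine itself) that simply counts snapshots, showing why one batched call beats N single calls:

```python
# Toy model for illustration only: each put() creates a snapshot,
# while put_many() creates one snapshot for the whole batch.
class ToyStore:
    def __init__(self):
        self.records = {}
        self.snapshots = 0

    def put(self, key, value):
        self.records[key] = value
        self.snapshots += 1  # one snapshot per call

    def put_many(self, items):
        self.records.update(items)
        self.snapshots += 1  # one snapshot for the batch

a, b = ToyStore(), ToyStore()
for k in ("x", "y", "z"):
    a.put(k, 1)
b.put_many({"x": 1, "y": 1, "z": 1})
assert (a.snapshots, b.snapshots) == (3, 1)  # same data, 3x the snapshots
```

With immutable snapshots, every extra snapshot is extra files on disk and a longer `__parent__` chain to traverse later.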
## Development

### Running Tests

```bash
# All tests
pytest

# Specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Single test
pytest tests/test_dbengine.py::test_i_can_save_and_load
```
### Building Package

```bash
# Build distribution
python -m build

# Clean build artifacts
make clean
```
### Project Structure

```
src/dbengine/
├── dbengine.py        # Main DbEngine and RefHelper classes
├── serializer.py      # JSON serialization with handlers
├── handlers.py        # BaseHandler, BaseInlineHandler, BaseRefHandler
└── utils.py           # Type checking and digest computation

tests/
├── test_dbengine.py   # DbEngine functionality tests
└── test_serializer.py # Serialization and handler tests
```
## Exceptions

- `DbException`: Raised for database-related errors (missing entries, invalid parameters, etc.)

## Contributing

This is a personal implementation. For bug reports or feature requests, please contact the author.

## License

See LICENSE file for details.

## Version History

* 0.1.0 - Initial release
* 0.2.0 - Added custom reference handlers