MyDbEngine
A lightweight, git-inspired versioned database engine for Python with content-addressable storage and complete history tracking.
What is MyDbEngine?
MyDbEngine is a file-based versioned database that treats data like Git treats code. Every modification creates an immutable snapshot with a SHA-256 digest, enabling complete history tracking, deduplication, and multi-tenant isolation.
Key Features:
- Immutable Snapshots: Every change creates a new version, never modifying existing data
- Content-Addressable Storage: Identical objects stored only once, referenced by SHA-256 digest
- Multi-Tenant: Isolated storage per tenant with shared deduplication in refs/
- Extensible Serialization: Custom handlers for optimized storage (JSON, binary, pickle)
- Thread-Safe: Built-in RLock for concurrent access
- Zero Dependencies: Pure Python with no runtime dependencies (pytest only for dev)
When to Use:
- Version tracking for configuration, user data, or application state
- Multi-tenant applications requiring isolated data with shared deduplication
- Scenarios where you need both human-readable JSON and optimized binary storage
When NOT to Use:
- High-frequency writes (creates a snapshot per modification)
- Relational queries (no SQL, no joins)
- Large-scale production databases (file-based, not optimized for millions of records)
Installation
pip install mydbengine
Quick Start
from dbengine.dbengine import DbEngine
# Initialize engine and tenant
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")
# Save and load data
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
print(data) # {"theme": "dark", "lang": "en"}
Core Concepts
Immutable Snapshots
Each save() or put() operation creates a new snapshot with automatic metadata:
- __parent__: List containing the digest of the previous version (or [None] for the first snapshot)
- __user_id__: ID of the user who created the snapshot
- __date__: ISO timestamp in YYYYMMDD HH:MM:SS %z format
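A minimal sketch of the resulting version chain, using only the public API shown in this README (entry names and values are illustrative):
engine.save("tenant_1", "user_1", "config", {"theme": "light"})
new_digest = engine.save("tenant_1", "user_1", "config", {"theme": "dark"})
# history() follows the __parent__ chain, newest first
digests = engine.history("tenant_1", "config")
assert digests[0] == new_digest
# Earlier snapshots remain readable; nothing is overwritten
print(engine.load("tenant_1", "config", digest=digests[1]))  # {"theme": "light"}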
Storage Architecture
.mytools_db/
├── {tenant_id}/
│   ├── head                     # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/     # First 24 chars of digest
│           └── {full_digest}    # JSON snapshot with metadata
└── refs/                        # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}        # Pickle or custom binary format
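Based on this layout, a snapshot file can be located on disk from its digest (the first 24 characters form the directory prefix). The sketch below continues from the Quick Start example; the path construction is an assumption inferred from the diagram, not part of the public API:
import json
import os
digest = engine.get_digest("tenant_1", "config")
snapshot_path = os.path.join(".mytools_db", "tenant_1", "objects", digest[:24], digest)
with open(snapshot_path) as f:
    raw = json.load(f)
print(raw["__parent__"], raw["__user_id__"], raw["__date__"])  # snapshot metadata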
Two Usage Patterns
Pattern 1: Snapshot-based - Store complete object states
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
config = engine.load("tenant_1", "config")
Pattern 2: Record-based - Incremental updates to collections
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users") # Returns list of all users
Important: Do not mix patterns for the same entry - they use different data structures.
Basic Usage
Save and Load Complete Snapshots
# Save any Python object
data = {"users": ["alice", "bob"], "count": 2}
digest = engine.save("tenant_1", "user_1", "session", data)
# Load latest version
session = engine.load("tenant_1", "session")
# Load specific version by digest
old_session = engine.load("tenant_1", "session", digest=digest)
Incremental Record Updates
# Add/update single record
engine.put("tenant_1", "user_1", "users", "alice", {"name": "Alice", "role": "admin"})
# Add/update multiple records
users = {
    "bob": {"name": "Bob", "role": "user"},
    "charlie": {"name": "Charlie", "role": "user"}
}
engine.put_many("tenant_1", "user_1", "users", users)
# Get specific record
alice = engine.get("tenant_1", "users", key="alice")
# Get all records as list
all_users = engine.get("tenant_1", "users")
History Navigation
# Get history chain (list of digests, newest first)
history = engine.history("tenant_1", "config", max_items=10)
# Load previous version
previous = engine.load("tenant_1", "config", digest=history[1])
# Check if entry exists
if engine.exists("tenant_1", "config"):
    print("Entry exists")
Custom Serialization
MyDbEngine supports three approaches for custom serialization:
1. BaseInlineHandler - JSON Storage
For small data types that should be human-readable in snapshots:
from dbengine.handlers import BaseInlineHandler, handlers
import datetime
class DateHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, datetime.date)

    def tag(self):
        return "Date"

    def serialize(self, obj):
        return {
            "__special__": self.tag(),
            "year": obj.year,
            "month": obj.month,
            "day": obj.day
        }

    def deserialize(self, data):
        return datetime.date(year=data["year"], month=data["month"], day=data["day"])
handlers.register_handler(DateHandler())
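Once the handler is registered, eligible objects round-trip through save() and load() automatically; a short sketch with an illustrative entry name:
engine.save("tenant_1", "user_1", "release", {"shipped_on": datetime.date(2024, 1, 15)})
release = engine.load("tenant_1", "release")
print(type(release["shipped_on"]))  # <class 'datetime.date'>, rebuilt by DateHandler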
2. BaseRefHandler - Optimized Binary Storage
For large data structures that benefit from custom binary formats:
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json
class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df):
        """Convert DataFrame to compact binary format"""
        # Store metadata + numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()
        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data):
        """Reconstruct DataFrame from binary format"""
        # Read metadata
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4+metadata_length].decode('utf-8'))
        numpy_bytes = data[4+metadata_length:]
        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))
        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])
handlers.register_handler(DataFrameHandler())
# Now DataFrames are automatically stored in optimized binary format
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
engine.save("tenant_1", "user_1", "data", df)
Result:
- JSON snapshot contains: {"__special__": "DataFrame", "__digest__": "abc123..."}
- Binary data stored in refs/abc123... (more compact than pickle)
- Automatic deduplication across tenants
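Loading the entry back should rebuild the DataFrame through the registered handler; a minimal round-trip check (assuming the handler above is registered):
df_loaded = engine.load("tenant_1", "data")
print(df_loaded.equals(df))  # expected True if the handler round-trips faithfully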
3. use_refs() - Selective Pickle Storage
For objects with specific fields that should be pickled:
class MyDataObject:
    def __init__(self, metadata, large_array):
        self.metadata = metadata
        self.large_array = large_array  # Large numpy array or similar

    @staticmethod
    def use_refs():
        """Fields to pickle instead of JSON-serialize"""
        return {"large_array"}
# metadata goes to JSON, large_array goes to refs/ (pickled)
obj = MyDataObject({"name": "dataset_1"}, np.zeros((1000, 1000)))
engine.save("tenant_1", "user_1", "my_data", obj)
Comparison:
| Approach | Storage | Format | Use Case |
|---|---|---|---|
| BaseInlineHandler | JSON snapshot | Custom dict | Small data, human-readable |
| BaseRefHandler | refs/ directory | Custom binary | Large data, optimized format |
| use_refs() | refs/ directory | Pickle | Quick solution, no handler needed |
API Reference
Initialization
| Method | Description |
|---|---|
| DbEngine(root: str = ".mytools_db") | Initialize engine with storage root |
| init(tenant_id: str) | Create tenant directory structure |
| is_initialized(tenant_id: str) -> bool | Check if tenant is initialized |
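A short sketch combining these calls; guarding init() with is_initialized() is an assumption about typical usage, not a documented requirement:
from dbengine.dbengine import DbEngine
engine = DbEngine(root=".mytools_db")
if not engine.is_initialized("tenant_1"):
    engine.init("tenant_1")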
Data Operations
| Method | Description |
|---|---|
| save(tenant_id, user_id, entry, obj) -> str | Save complete snapshot, returns digest |
| load(tenant_id, entry, digest=None) -> object | Load snapshot (latest if digest=None) |
| put(tenant_id, user_id, entry, key, value) -> bool | Add/update single record |
| put_many(tenant_id, user_id, entry, items) -> bool | Add/update multiple records |
| get(tenant_id, entry, key=None, digest=None) -> object | Get record(s) |
| exists(tenant_id, entry) -> bool | Check if entry exists |
History
| Method | Description |
|---|---|
| history(tenant_id, entry, digest=None, max_items=1000) -> list | Get history chain of digests |
| get_digest(tenant_id, entry) -> str | Get current digest for entry |
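get_digest() is the only method above not shown earlier; a brief sketch of how it relates to history() and load():
head = engine.get_digest("tenant_1", "config")
# The head digest is the newest entry in the history chain
assert head == engine.history("tenant_1", "config")[0]
# Loading by the head digest is equivalent to loading the latest version
assert engine.load("tenant_1", "config", digest=head) == engine.load("tenant_1", "config")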
Performance & Limitations
Strengths:
- ✅ Deduplication: Identical objects stored once (SHA-256 content addressing)
- ✅ History: Complete audit trail with zero overhead for unchanged data
- ✅ Custom formats: Binary handlers optimize storage (e.g., numpy vs pickle)
Limitations:
- ❌ File-based: Not suitable for high-throughput applications
- ❌ No indexing: No SQL queries, no complex filtering
- ❌ Snapshot overhead: Each change creates a new snapshot
- ❌ History chains: Long histories require multiple file reads
Performance Tips:
- Use put_many() instead of multiple put() calls (creates one snapshot), as shown in the sketch after this list
- Use BaseRefHandler for large binary data instead of pickle
- Limit history traversal with the max_items parameter
- Consider archiving old snapshots for long-running entries
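A sketch of the first tip: batching with put_many() adds a single snapshot to the entry's history, while a loop of put() calls adds one per record (entry names and counts are illustrative):
records = {f"user_{i}": {"name": f"User {i}"} for i in range(100)}
# One snapshot for the whole batch
engine.put_many("tenant_1", "user_1", "users_batched", records)
# One snapshot per record: 100 history entries
for key, value in records.items():
    engine.put("tenant_1", "user_1", "users_looped", key, value)
print(len(engine.history("tenant_1", "users_batched")))  # 1
print(len(engine.history("tenant_1", "users_looped")))   # 100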
Development
Running Tests
# All tests
pytest
# Specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py
# Single test
pytest tests/test_dbengine.py::test_i_can_save_and_load
Building Package
# Build distribution
python -m build
# Clean build artifacts
make clean
Project Structure
src/dbengine/
├── dbengine.py # Main DbEngine and RefHelper classes
├── serializer.py # JSON serialization with handlers
├── handlers.py # BaseHandler, BaseInlineHandler, BaseRefHandler
└── utils.py # Type checking and digest computation
tests/
├── test_dbengine.py # DbEngine functionality tests
└── test_serializer.py # Serialization and handler tests
Contributing
This is a personal implementation. For bug reports or feature requests, please contact the author.
License
See LICENSE file for details.
Version History
- 0.1.0 - Initial release
- 0.2.0 - Added custom reference handlers