# MyDbEngine

A lightweight, git-inspired versioned database engine for Python with content-addressable storage and complete history tracking.

## What is MyDbEngine?

MyDbEngine is a file-based versioned database that treats data like Git treats code. Every modification creates an immutable snapshot identified by a SHA-256 digest, enabling complete history tracking, deduplication, and multi-tenant isolation.

**Key Features:**

- **Immutable Snapshots**: Every change creates a new version, never modifying existing data
- **Content-Addressable Storage**: Identical objects are stored only once, referenced by SHA-256 digest
- **Multi-Tenant**: Isolated storage per tenant with shared deduplication in `refs/`
- **Extensible Serialization**: Custom handlers for optimized storage (JSON, binary, pickle)
- **Thread-Safe**: Built-in RLock for concurrent access
- **Zero Dependencies**: Pure Python with no runtime dependencies (pytest is needed only for development)

**When to Use:**

- Version tracking for configuration, user data, or application state
- Multi-tenant applications requiring isolated data with shared deduplication
- Scenarios where you need both human-readable JSON and optimized binary storage

**When NOT to Use:**

- High-frequency writes (every modification creates a new snapshot)
- Relational queries (no SQL, no joins)
- Large-scale production databases (file-based, not optimized for millions of records)

## Installation

```bash
pip install mydbengine
```

## Quick Start

```python
from dbengine.dbengine import DbEngine

# Initialize engine and tenant
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Save and load data
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
print(data)  # {"theme": "dark", "lang": "en"}
```

## Core Concepts

### Immutable Snapshots

Each `save()` or `put()` operation creates a new snapshot with automatic metadata:

- `__parent__`: List containing the digest of the previous version (or `[None]` for the first version)
- `__user_id__`: ID of the user who created the snapshot
- `__date__`: Timestamp in `YYYYMMDD HH:MM:SS %z` format

### Storage Architecture

```
.mytools_db/
├── {tenant_id}/
│   ├── head                    # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/    # First 24 chars of digest
│           └── {full_digest}   # JSON snapshot with metadata
└── refs/                       # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}       # Pickle or custom binary format
```

### Two Usage Patterns

**Pattern 1: Snapshot-based** - Store complete object states

```python
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
config = engine.load("tenant_1", "config")
```

**Pattern 2: Record-based** - Incremental updates to collections

```python
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users
```

**Important:** Do not mix patterns for the same entry - they use different data structures.
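To see the snapshot model end to end, here is a minimal sketch (illustrative only; it uses just the `save()`/`load()` calls documented in this README, and the entry name `config` is arbitrary):

```python
# Illustrative sketch of the immutability guarantee: every save() returns a
# new digest, and earlier versions remain loadable by digest.
from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

d1 = engine.save("tenant_1", "user_1", "config", {"theme": "light"})
d2 = engine.save("tenant_1", "user_1", "config", {"theme": "dark"})

assert d1 != d2  # each change produces a distinct snapshot digest

print(engine.load("tenant_1", "config"))             # latest: {"theme": "dark"}
print(engine.load("tenant_1", "config", digest=d1))  # previous: {"theme": "light"}
```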
## Basic Usage

### Save and Load Complete Snapshots

```python
# Save any Python object
data = {"users": ["alice", "bob"], "count": 2}
digest = engine.save("tenant_1", "user_1", "session", data)

# Load latest version
session = engine.load("tenant_1", "session")

# Load specific version by digest
old_session = engine.load("tenant_1", "session", digest=digest)
```

### Incremental Record Updates

```python
# Add/update a single record
engine.put("tenant_1", "user_1", "users", "alice", {"name": "Alice", "role": "admin"})

# Add/update multiple records
users = {
    "bob": {"name": "Bob", "role": "user"},
    "charlie": {"name": "Charlie", "role": "user"}
}
engine.put_many("tenant_1", "user_1", "users", users)

# Get a specific record
alice = engine.get("tenant_1", "users", key="alice")

# Get all records as a list
all_users = engine.get("tenant_1", "users")
```

### History Navigation

```python
# Get the history chain (list of digests, newest first)
history = engine.history("tenant_1", "config", max_items=10)

# Load the previous version
previous = engine.load("tenant_1", "config", digest=history[1])

# Check whether an entry exists
if engine.exists("tenant_1", "config"):
    print("Entry exists")
```

## Custom Serialization

MyDbEngine supports three approaches to custom serialization:

### 1. BaseInlineHandler - JSON Storage

For small data types that should remain human-readable in snapshots:

```python
from dbengine.handlers import BaseInlineHandler, handlers
import datetime


class DateHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, datetime.date)

    def tag(self):
        return "Date"

    def serialize(self, obj):
        return {
            "__special__": self.tag(),
            "year": obj.year,
            "month": obj.month,
            "day": obj.day
        }

    def deserialize(self, data):
        return datetime.date(year=data["year"], month=data["month"], day=data["day"])


handlers.register_handler(DateHandler())
```
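As a quick sanity check (illustrative only; the entry name `event` is made up here, and the sketch assumes registered inline handlers are applied automatically on save and load, as described above):

```python
import datetime

# A date nested inside a saved object should round-trip back as datetime.date.
engine.save("tenant_1", "user_1", "event", {"name": "launch", "when": datetime.date(2024, 1, 15)})
event = engine.load("tenant_1", "event")
print(type(event["when"]))  # expected: <class 'datetime.date'>
```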
### 2. BaseRefHandler - Optimized Binary Storage

For large data structures that benefit from custom binary formats:

```python
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json


class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df):
        """Convert a DataFrame to a compact binary format (assumes a single uniform dtype)."""
        # Store metadata + numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()
        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data):
        """Reconstruct a DataFrame from the binary format."""
        # Read metadata
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]

        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))
        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])


handlers.register_handler(DataFrameHandler())

# Now DataFrames are automatically stored in the optimized binary format
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
engine.save("tenant_1", "user_1", "data", df)
```

**Result:**

- The JSON snapshot contains: `{"__special__": "DataFrame", "__digest__": "abc123..."}`
- The binary data is stored in `refs/abc123...` (more compact than pickle)
- Automatic deduplication across tenants
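Loading goes back through the same handler, so the DataFrame is rebuilt by `deserialize_from_bytes()`. A minimal sketch (illustrative only; it assumes the handler registered above is still active and that handler dispatch happens on load as well as on save):

```python
# Round-trip check for the DataFrame saved above under the "data" entry.
restored = engine.load("tenant_1", "data")
print(type(restored))  # expected: <class 'pandas.core.frame.DataFrame'>
print(restored.shape)  # expected: (3, 2)
```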
### 3. use_refs() - Selective Pickle Storage

For objects where specific fields should be pickled rather than JSON-serialized:

```python
import numpy as np


class MyDataObject:
    def __init__(self, metadata, large_array):
        self.metadata = metadata
        self.large_array = large_array  # Large numpy array or similar

    @staticmethod
    def use_refs():
        """Fields to pickle instead of JSON-serialize."""
        return {"large_array"}


# metadata goes to the JSON snapshot, large_array goes to refs/ (pickled)
obj = MyDataObject({"name": "dataset_1"}, np.zeros((1000, 1000)))
engine.save("tenant_1", "user_1", "my_data", obj)
```

**Comparison:**

| Approach | Storage | Format | Use Case |
|----------|---------|--------|----------|
| `BaseInlineHandler` | JSON snapshot | Custom dict | Small data, human-readable |
| `BaseRefHandler` | `refs/` directory | Custom binary | Large data, optimized format |
| `use_refs()` | `refs/` directory | Pickle | Quick solution, no handler needed |

## API Reference

### Initialization

| Method | Description |
|--------|-------------|
| `DbEngine(root: str = ".mytools_db")` | Initialize engine with storage root |
| `init(tenant_id: str)` | Create tenant directory structure |
| `is_initialized(tenant_id: str) -> bool` | Check if tenant is initialized |

### Data Operations

| Method | Description |
|--------|-------------|
| `save(tenant_id, user_id, entry, obj) -> str` | Save complete snapshot, returns digest |
| `load(tenant_id, entry, digest=None) -> object` | Load snapshot (latest if `digest=None`) |
| `put(tenant_id, user_id, entry, key, value) -> bool` | Add/update single record |
| `put_many(tenant_id, user_id, entry, items) -> bool` | Add/update multiple records |
| `get(tenant_id, entry, key=None, digest=None) -> object` | Get record(s) |
| `exists(tenant_id, entry) -> bool` | Check if entry exists |

### History

| Method | Description |
|--------|-------------|
| `history(tenant_id, entry, digest=None, max_items=1000) -> list` | Get history chain of digests |
| `get_digest(tenant_id, entry) -> str` | Get current digest for entry |

## Performance & Limitations

**Strengths:**

- ✅ Deduplication: Identical objects stored once (SHA-256 content addressing)
- ✅ History: Complete audit trail with no storage overhead for unchanged data
- ✅ Custom formats: Binary handlers optimize storage (e.g., raw numpy bytes vs. pickle)

**Limitations:**

- ❌ **File-based**: Not suitable for high-throughput applications
- ❌ **No indexing**: No SQL queries, no complex filtering
- ❌ **Snapshot overhead**: Each change creates a new snapshot
- ❌ **History chains**: Long histories require multiple file reads

**Performance Tips:**

- Use `put_many()` instead of multiple `put()` calls (it creates a single snapshot)
- Use `BaseRefHandler` for large binary data instead of pickle
- Limit history traversal with the `max_items` parameter
- Consider archiving old snapshots for long-running entries

## Development

### Running Tests

```bash
# All tests
pytest

# Specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Single test
pytest tests/test_dbengine.py::test_i_can_save_and_load
```

### Building Package

```bash
# Build distribution
python -m build

# Clean build artifacts
make clean
```

### Project Structure

```
src/dbengine/
├── dbengine.py     # Main DbEngine and RefHelper classes
├── serializer.py   # JSON serialization with handlers
├── handlers.py     # BaseHandler, BaseInlineHandler, BaseRefHandler
└── utils.py        # Type checking and digest computation

tests/
├── test_dbengine.py    # DbEngine functionality tests
└── test_serializer.py  # Serialization and handler tests
```

## Contributing

This is a personal implementation. For bug reports or feature requests, please contact the author.

## License

See the LICENSE file for details.

## Version History

* 0.1.0 - Initial release
* 0.2.0 - Added custom reference handlers
* 0.2.1 - A handler can only be registered once