MyDbEngine

A lightweight, git-inspired versioned database engine for Python with content-addressable storage and complete history tracking.

What is MyDbEngine?

MyDbEngine is a file-based versioned database that treats data like Git treats code. Every modification creates an immutable snapshot with a SHA-256 digest, enabling complete history tracking, deduplication, and multi-tenant isolation.

Key Features:

  • Immutable Snapshots: Every change creates a new version, never modifying existing data
  • Content-Addressable Storage: Identical objects stored only once, referenced by SHA-256 digest
  • Multi-Tenant: Isolated storage per tenant with shared deduplication in refs/
  • Extensible Serialization: Custom handlers for optimized storage (JSON, binary, pickle)
  • Thread-Safe: Built-in RLock for concurrent access
  • Zero Dependencies: Pure Python with no runtime dependencies (pytest only for dev)

When to Use:

  • Version tracking for configuration, user data, or application state
  • Multi-tenant applications requiring isolated data with shared deduplication
  • Scenarios where you need both human-readable JSON and optimized binary storage

When NOT to Use:

  • High-frequency writes (every modification creates a new snapshot)
  • Relational queries (no SQL, no joins)
  • Large-scale production databases (file-based, not optimized for millions of records)

Installation

pip install mydbengine

Quick Start

from dbengine.dbengine import DbEngine

# Initialize engine and tenant
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Save and load data
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
print(data)  # {"theme": "dark", "lang": "en"}

Core Concepts

Immutable Snapshots

Each save() or put() operation creates a new snapshot with automatic metadata:

  • __parent__: List containing the digest of the previous version (or [None] for the first snapshot)
  • __user_id__: ID of the user who created the snapshot
  • __date__: Creation timestamp in YYYYMMDD HH:MM:SS %z format
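
Because each snapshot records its parent digest, versions form a chain that can be walked back from the newest entry. A minimal sketch, assuming the engine and tenant from Quick Start:

d1 = engine.save("tenant_1", "user_1", "config", {"theme": "light"})
d2 = engine.save("tenant_1", "user_1", "config", {"theme": "dark"})

# history() walks the __parent__ chain, newest first
chain = engine.history("tenant_1", "config", max_items=10)
print(chain[0] == d2, chain[1] == d1)  # True True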

Storage Architecture

.mytools_db/
├── {tenant_id}/
│   ├── head                          # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/         # First 24 chars of digest
│           └── {full_digest}        # JSON snapshot with metadata
└── refs/                            # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}            # Pickle or custom binary format
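
As a rough illustration of how this layout relates to the API (assuming the head file format shown above and the Quick Start setup), the digest returned by save() should match both the head entry and get_digest():

import json
from pathlib import Path

digest = engine.save("tenant_1", "user_1", "config", {"theme": "dark"})
head = json.loads(Path(".mytools_db/tenant_1/head").read_text())
print(head["config"] == digest == engine.get_digest("tenant_1", "config"))  # True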

Two Usage Patterns

Pattern 1: Snapshot-based - Store complete object states

engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
config = engine.load("tenant_1", "config")

Pattern 2: Record-based - Incremental updates to collections

engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users

Important: Do not mix patterns for the same entry - they use different data structures.

Basic Usage

Save and Load Complete Snapshots

# Save any Python object
data = {"users": ["alice", "bob"], "count": 2}
digest = engine.save("tenant_1", "user_1", "session", data)

# Load latest version
session = engine.load("tenant_1", "session")

# Load specific version by digest
old_session = engine.load("tenant_1", "session", digest=digest)

Incremental Record Updates

# Add/update single record
engine.put("tenant_1", "user_1", "users", "alice", {"name": "Alice", "role": "admin"})

# Add/update multiple records
users = {
    "bob": {"name": "Bob", "role": "user"},
    "charlie": {"name": "Charlie", "role": "user"}
}
engine.put_many("tenant_1", "user_1", "users", users)

# Get specific record
alice = engine.get("tenant_1", "users", key="alice")

# Get all records as list
all_users = engine.get("tenant_1", "users")

History Navigation

# Get history chain (list of digests, newest first)
history = engine.history("tenant_1", "config", max_items=10)

# Load previous version
previous = engine.load("tenant_1", "config", digest=history[1])

# Check if entry exists
if engine.exists("tenant_1", "config"):
    print("Entry exists")

Custom Serialization

MyDbEngine supports three approaches for custom serialization:

1. BaseInlineHandler - JSON Storage

For small data types that should be human-readable in snapshots:

from dbengine.handlers import BaseInlineHandler, handlers
import datetime

class DateHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, datetime.date)

    def tag(self):
        return "Date"

    def serialize(self, obj):
        return {
            "__special__": self.tag(),
            "year": obj.year,
            "month": obj.month,
            "day": obj.day
        }

    def deserialize(self, data):
        return datetime.date(year=data["year"], month=data["month"], day=data["day"])

handlers.register_handler(DateHandler())
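
Once the handler is registered, dates round-trip through the normal save/load calls. A short usage sketch, assuming the engine and tenant from Quick Start:

engine.save("tenant_1", "user_1", "release", {"ship_date": datetime.date(2025, 1, 15)})
restored = engine.load("tenant_1", "release")
print(type(restored["ship_date"]))  # <class 'datetime.date'>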

2. BaseRefHandler - Optimized Binary Storage

For large data structures that benefit from custom binary formats:

from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df):
        """Convert DataFrame to compact binary format"""
        # Store metadata + numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()

        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data):
        """Reconstruct DataFrame from binary format"""
        # Read metadata
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4+metadata_length].decode('utf-8'))
        numpy_bytes = data[4+metadata_length:]

        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))

        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

handlers.register_handler(DataFrameHandler())

# Now DataFrames are automatically stored in optimized binary format
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
engine.save("tenant_1", "user_1", "data", df)

Result:

  • JSON snapshot contains: {"__special__": "DataFrame", "__digest__": "abc123..."}
  • Binary data stored in refs/abc123... (more compact than pickle)
  • Automatic deduplication across tenants

3. use_refs() - Selective Pickle Storage

For objects with specific fields that should be pickled:

class MyDataObject:
    def __init__(self, metadata, large_array):
        self.metadata = metadata
        self.large_array = large_array  # Large numpy array or similar

    @staticmethod
    def use_refs():
        """Fields to pickle instead of JSON-serialize"""
        return {"large_array"}

# metadata goes to JSON, large_array goes to refs/ (pickled)
obj = MyDataObject({"name": "dataset_1"}, np.zeros((1000, 1000)))
engine.save("tenant_1", "user_1", "my_data", obj)

Comparison:

  • BaseInlineHandler: stored in the JSON snapshot as a custom dict; small data, human-readable
  • BaseRefHandler: stored in the refs/ directory in a custom binary format; large data, optimized format
  • use_refs(): stored in the refs/ directory as pickle; quick solution, no handler needed

API Reference

Initialization

  • DbEngine(root: str = ".mytools_db"): Initialize engine with storage root
  • init(tenant_id: str): Create tenant directory structure
  • is_initialized(tenant_id: str) -> bool: Check if tenant is initialized
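
A typical startup only needs to initialize a tenant once; a minimal sketch:

engine = DbEngine(root=".mytools_db")
if not engine.is_initialized("tenant_1"):
    engine.init("tenant_1")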

Data Operations

  • save(tenant_id, user_id, entry, obj) -> str: Save complete snapshot, returns digest
  • load(tenant_id, entry, digest=None) -> object: Load snapshot (latest if digest=None)
  • put(tenant_id, user_id, entry, key, value) -> bool: Add/update single record
  • put_many(tenant_id, user_id, entry, items) -> bool: Add/update multiple records
  • get(tenant_id, entry, key=None, digest=None) -> object: Get record(s)
  • exists(tenant_id, entry) -> bool: Check if entry exists

History

  • history(tenant_id, entry, digest=None, max_items=1000) -> list: Get history chain of digests (newest first)
  • get_digest(tenant_id, entry) -> str: Get current digest for entry
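
Since history() returns digests newest first, its first element should match the current digest; a small sketch:

current = engine.get_digest("tenant_1", "config")
chain = engine.history("tenant_1", "config", max_items=5)
print(chain[0] == current)  # True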

Performance & Limitations

Strengths:

  • Deduplication: Identical objects stored once (SHA-256 content addressing)
  • History: Complete audit trail with zero overhead for unchanged data
  • Custom formats: Binary handlers optimize storage (e.g., numpy vs pickle)

Limitations:

  • File-based: Not suitable for high-throughput applications
  • No indexing: No SQL queries, no complex filtering
  • Snapshot overhead: Each change creates a new snapshot
  • History chains: Long histories require multiple file reads

Performance Tips:

  • Use put_many() instead of multiple put() calls (creates one snapshot; see the sketch after this list)
  • Use BaseRefHandler for large binary data instead of pickle
  • Limit history traversal with max_items parameter
  • Consider archiving old snapshots for long-running entries
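
As an illustration of the first tip (the entry names and records here are hypothetical), a loop of put() calls creates one snapshot per call, while put_many() batches the whole update into a single snapshot:

records = {"alice": {"role": "admin"}, "bob": {"role": "user"}}

# Slow: one snapshot (and one history entry) per call
for key, value in records.items():
    engine.put("tenant_1", "user_1", "team_slow", key, value)

# Fast: a single snapshot for the whole batch
engine.put_many("tenant_1", "user_1", "team_fast", records)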

Development

Running Tests

# All tests
pytest

# Specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Single test
pytest tests/test_dbengine.py::test_i_can_save_and_load

Building Package

# Build distribution
python -m build

# Clean build artifacts
make clean

Project Structure

src/dbengine/
├── dbengine.py          # Main DbEngine and RefHelper classes
├── serializer.py        # JSON serialization with handlers
├── handlers.py          # BaseHandler, BaseInlineHandler, BaseRefHandler
└── utils.py             # Type checking and digest computation

tests/
├── test_dbengine.py     # DbEngine functionality tests
└── test_serializer.py   # Serialization and handler tests

Contributing

This is a personal implementation. For bug reports or feature requests, please contact the author.

License

See LICENSE file for details.

Version History

  • 0.1.0 - Initial release
  • 0.2.0 - Added custom reference handlers