CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Available Personas
This project uses specialized personas for different types of work. Use these commands to switch modes:
- /developer - Full development mode with validation workflow (options-first, wait for approval before coding)
- /unit-tester - Specialized mode for writing comprehensive unit tests for existing code
- /technical-writer - User documentation writing mode (README, guides, tutorials)
- /reset - Return to default Claude Code mode
Each persona has specific rules and workflows defined in .claude/ directory. See the respective files for detailed guidelines.
Project Overview
MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.
Quick Start Example
from dbengine.dbengine import DbEngine
# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")
# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users") # Returns list of all users
Development Commands
Testing
# Run all tests
pytest
# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py
# Run single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load
Building and Packaging
# Build package
python -m build
# Clean build artifacts
make clean
# Clean package artifacts only
make clean-package
Installation
# Install in development mode with test dependencies
pip install -e .[dev]
Architecture
Core Components
DbEngine (src/dbengine/dbengine.py)
- Main database engine class using RLock for thread safety
- Manages tenant-specific storage in the .mytools_db/{tenant_id}/ structure
- Tracks latest versions via the head file (JSON mapping entry names to digests)
- Stores objects in content-addressable format: objects/{digest_prefix}/{full_digest}
- Shared refs/ directory for cross-tenant pickle-based references
Serializer (src/dbengine/serializer.py)
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: __object__, __id__, __tuple__, __set__, __ref__, __digest__, __enum__
- Objects can define a use_refs() method to specify fields that should be pickled instead of JSON-serialized
- __ref__: Used by the use_refs() system (pickle-based storage)
- __digest__: Used by BaseRefHandler for custom binary formats (numpy, etc.)
Handlers (src/dbengine/handlers.py)
- Extensible handler system for custom type serialization
- Three-tier hierarchy:
  - BaseHandler: Base interface with is_eligible_for() and tag()
  - BaseInlineHandler: For JSON-inline storage (e.g., DateHandler)
  - BaseRefHandler: For custom binary formats stored in refs/ (e.g., DataFrames)
- BaseInlineHandler: Implements serialize(obj) → dict and deserialize(dict) → obj
- BaseRefHandler: Implements serialize_to_bytes(obj) → bytes and deserialize_from_bytes(bytes) → obj
- Currently implements DateHandler (BaseInlineHandler) for datetime.date objects
- Use handlers.register_handler() to add custom handlers
Utils (src/dbengine/utils.py)
- Type checking utilities: is_primitive(), is_dictionary(), is_list(), etc.
- Class introspection: get_full_qualified_name(), importable_name(), get_class()
- Digest computation: compute_digest_from_stream(), compute_digest_from_bytes() (content-addressing sketch below)
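For intuition, here is a minimal sketch of the content-addressing scheme these utilities support, written directly against hashlib rather than the actual utils.py implementation:

import hashlib

data = b'{"theme": "dark", "lang": "en"}'     # any serialized payload
digest = hashlib.sha256(data).hexdigest()     # full SHA-256 digest (64 hex characters)

# Objects are stored under objects/{digest_prefix}/{full_digest},
# where the prefix is the first 24 characters of the digest.
prefix = digest[:24]
print(f"objects/{prefix}/{digest}")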
RefHelper and PickleRefHelper (src/dbengine/dbengine.py)
- RefHelper: Base class for content-addressable storage in the refs/ directory
  - save_ref_from_bytes(data: bytes) → digest: Store raw bytes
  - load_ref_to_bytes(digest) → bytes: Load raw bytes
  - Used by BaseRefHandler for custom binary formats
- PickleRefHelper(RefHelper): Adds a pickle serialization layer
  - save_ref(obj) → digest: Pickle and store an object
  - load_ref(digest) → obj: Load and unpickle an object
  - Used by the use_refs() system and the Serializer
Storage Architecture
.mytools_db/
├── {tenant_id}/
│ ├── head # JSON: {"entry_name": "latest_digest"}
│ └── objects/
│ └── {digest_prefix}/ # First 24 chars of digest
│ └── {full_digest} # JSON snapshot with metadata
└── refs/ # Shared binary references (cross-tenant)
└── {digest_prefix}/
└── {full_digest} # Pickle or custom binary format
Note: The refs/ directory stores binary data in content-addressable format:
- Pickled objects (via use_refs() or PickleRefHelper)
- Custom binary formats (via BaseRefHandler, e.g., numpy arrays)
Metadata System
Each snapshot includes automatic metadata fields:
- __parent__: List containing the digest of the previous version (or [None] for the first)
- __user_id__: User ID of who created the snapshot (was __user__ in the TAG constant)
- __date__: ISO timestamp, YYYYMMDD HH:MM:SS %z
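As a shape sketch only (the exact on-disk layout is defined in dbengine.py), a snapshot with these fields added might look like this:

# Shape sketch only, not the literal on-disk format
snapshot = {
    "theme": "dark",                          # user data from save()/put()
    "__parent__": [None],                     # first version; later versions hold [previous_digest]
    "__user_id__": "user_1",                  # user who created the snapshot
    "__date__": "20240101 12:00:00 +0000",    # YYYYMMDD HH:MM:SS %z
}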
Two Usage Patterns
Pattern 1: Snapshot-based (save()/load())
- Save complete object states
- Best for configuration objects or complete state snapshots
- Direct control over what gets saved
Pattern 2: Record-based (put()/put_many()/get())
- Incremental updates to dictionary-like collections
- Automatically creates snapshots only when data changes
- Returns True/False indicating whether a snapshot was created (see the sketch below)
- Best for managing collections of items
Important: Do not mix patterns for the same entry - they expect different data structures.
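A hedged sketch of the change-detection behaviour described above, using the same calls as the Quick Start (the commented values reflect the documented True/False contract):

from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# First write of this record: data changed, so a snapshot is created -> True
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(created)

# Identical data again: nothing changed, so no snapshot is created -> False
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(created)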
Common Pitfalls
⚠️ Mixing save() and put() on the same entry
- save() expects to store complete snapshots (any object)
- put() expects dictionary-like structures with key-value pairs
- Using both on the same entry will cause data structure conflicts
⚠️ Refs are shared across tenants
- Objects stored via use_refs() go to the shared refs/ directory
- Not isolated per tenant: identical objects are reused across all tenants
- Good for deduplication, but be aware of cross-tenant sharing
- Good for deduplication, but be aware of cross-tenant sharing
⚠️ Parent digest is always a list
- The __parent__ field is stored as [digest] or [None]
- Always access it as data[TAG_PARENT][0], not data[TAG_PARENT]
- This allows future support of multiple parents (merge scenarios)
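A minimal access sketch (TAG_PARENT is assumed to map to the "__parent__" key; check the actual constant in dbengine.py):

data = {"__parent__": ["ab12cd34..."], "value": 42}   # loaded snapshot, shape sketch only

parent_digest = data["__parent__"][0]   # correct: unwrap the single-element list
# parent_digest is None for the first version, otherwise the previous snapshot's digest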
Reference System
Objects can opt into pickle-based storage for specific fields:
- Define a use_refs() method returning a set of field names
- The Serializer stores those fields in the shared refs/ directory
- Reduces JSON snapshot size and enables cross-tenant deduplication
- Example: DummyObjWithRef in test_dbengine.py
Extension Points
Custom Type Handlers
MyDbEngine supports two types of custom handlers for serializing types:
1. BaseInlineHandler - For JSON Storage
Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).
Example: Custom date handler
from dbengine.handlers import BaseInlineHandler, handlers
class MyCustomHandler(BaseInlineHandler):
def is_eligible_for(self, obj):
return isinstance(obj, MyCustomType)
def tag(self):
return "MyCustomType"
def serialize(self, obj) -> dict:
return {
"__special__": self.tag(),
"data": obj.to_dict()
}
def deserialize(self, data: dict) -> object:
return MyCustomType.from_dict(data["data"])
# Register the handler
handlers.register_handler(MyCustomHandler())
When to use BaseInlineHandler:
- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., datetime.date)
2. BaseRefHandler - For Binary Storage
Use when data should be stored in optimized binary format in refs/ directory (large datasets, better compression).
Example: pandas DataFrame handler
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import json
class DataFrameHandler(BaseRefHandler):
def is_eligible_for(self, obj):
return isinstance(obj, pd.DataFrame)
def tag(self):
return "DataFrame"
def serialize_to_bytes(self, df) -> bytes:
"""Convert DataFrame to compact binary format"""
import numpy as np
# Store metadata + numpy bytes
metadata = {
"columns": df.columns.tolist(),
"index": df.index.tolist(),
"dtype": str(df.values.dtype)
}
metadata_bytes = json.dumps(metadata).encode('utf-8')
metadata_length = len(metadata_bytes).to_bytes(4, 'big')
numpy_bytes = df.to_numpy().tobytes()
return metadata_length + metadata_bytes + numpy_bytes
def deserialize_from_bytes(self, data: bytes) -> object:
"""Reconstruct DataFrame from binary format"""
import numpy as np
# Read metadata
metadata_length = int.from_bytes(data[:4], 'big')
metadata = json.loads(data[4:4+metadata_length].decode('utf-8'))
numpy_bytes = data[4+metadata_length:]
# Reconstruct array and DataFrame
array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
array = array.reshape(len(metadata['index']), len(metadata['columns']))
return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])
# Register the handler
handlers.register_handler(DataFrameHandler())
When to use BaseRefHandler:
- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from custom compression (e.g., numpy's compact format)
- Types that lose information in JSON conversion
- Shared data across snapshots (automatic deduplication via SHA-256)
Key differences:
- BaseInlineHandler: Data stored in the JSON snapshot → {"__special__": "Tag", "data": {...}}
- BaseRefHandler: Data stored in refs/ → {"__special__": "Tag", "__digest__": "abc123..."}
- BaseRefHandler provides automatic deduplication and smaller JSON snapshots
Using References (use_refs)
For objects with large nested data structures that should be pickled instead of JSON-serialized:
class MyDataObject:
def __init__(self, metadata, large_dataframe):
self.metadata = metadata
self.large_dataframe = large_dataframe # pandas DataFrame, for example
@staticmethod
def use_refs():
"""Return set of field names to pickle instead of JSON-serialize"""
return {"large_dataframe"}
When to use use_refs():
- Quick solution for large nested objects without writing custom handler
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)
use_refs() vs BaseRefHandler:
- use_refs(): Uses pickle (via PickleRefHelper), simple but less optimized
- BaseRefHandler: Custom binary format (e.g., numpy), optimized but requires handler code
- Both store in refs/ and get automatic SHA-256 deduplication
- use_refs() generates {"__ref__": "digest"} tags
- BaseRefHandler generates {"__special__": "Tag", "__digest__": "digest"} tags
Trade-offs:
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (use_refs only)
Testing Notes
- Test fixtures use DB_ENGINE_ROOT = "TestDBEngineRoot" for isolation
- Tests clean up temp directories using shutil.rmtree() in fixtures (see the fixture sketch below)
- Test classes like DummyObj, DummyObjWithRef, DummyObjWithKey demonstrate usage patterns
- Thread safety is built-in via RLock but not explicitly tested
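A hedged sketch of such a fixture (the actual fixtures in tests/ may differ in detail):

import shutil
import pytest
from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"

@pytest.fixture
def engine():
    """Provide an isolated engine and remove its directory afterwards."""
    db = DbEngine(root=DB_ENGINE_ROOT)
    db.init("tenant_1")
    yield db
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)   # clean up the temp directory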
Key Design Decisions
- Immutability: Snapshots never modified after creation (git-style)
- Content Addressing: Identical objects stored only once (deduplication via SHA-256)
- Change Detection: put() and put_many() skip saving if data is unchanged
- Thread Safety: All DbEngine operations protected by RLock
- No Dependencies: Core engine has zero runtime dependencies (pytest only for dev)
Development Workflow and Guidelines
Development Process
Code must always be testable. Before writing any code:
- Explain available options first - Present different approaches to solve the problem
- Wait for validation - Ensure mutual understanding of requirements before implementation
- No code without approval - Only proceed after explicit validation
Collaboration Style
Ask questions to clarify understanding or suggest alternative approaches:
- Ask questions one at a time
- Wait for complete answer before asking the next question
- Indicate progress: "Question 1/5" if multiple questions are needed
- Never assume - always clarify ambiguities
Communication
- Conversations: French or English
- Code, documentation, comments: English only
Code Standards
Follow PEP 8 conventions strictly:
- Variable and function names: snake_case
- Explicit, descriptive naming
- No emojis in code
- No emojis in code
Documentation:
- Use Google or NumPy docstring format
- Document all public functions and classes
- Include type hints where applicable
Dependency Management
When introducing new dependencies:
- List all external dependencies explicitly
- Propose alternatives using Python standard library when possible
- Explain why each dependency is needed
Unit Testing with pytest
Test naming patterns:
- Passing tests: test_i_can_xxx (tests that should succeed)
- Failing tests: test_i_cannot_xxx (edge cases that should raise errors/exceptions)
Test structure:
- Use functions, not classes (unless inheritance is required)
- Before writing tests, list all planned tests with explanations
- Wait for validation before implementing tests
Example:
import pytest
from dbengine.dbengine import DbEngine, DbException  # DbException import path assumed; adjust if it is defined elsewhere

def test_i_can_save_and_load_object():
"""Test that an object can be saved and loaded successfully."""
engine = DbEngine(root="test_db")
engine.init("tenant_1")
digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
assert digest is not None
def test_i_cannot_save_with_empty_tenant_id():
"""Test that saving with empty tenant_id raises DbException."""
engine = DbEngine(root="test_db")
with pytest.raises(DbException):
engine.save("", "user_1", "entry_1", {"key": "value"})
File Management
Always specify the full file path when adding or modifying files:
✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py
Error Handling
When errors occur:
- Explain the problem clearly first
- Do not propose a fix immediately
- Wait for validation that the diagnosis is correct
- Only then propose solutions