CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Available Personas

This project uses specialized personas for different types of work. Use these commands to switch modes:

  • /developer - Full development mode with validation workflow (options-first, wait for approval before coding)
  • /unit-tester - Specialized mode for writing comprehensive unit tests for existing code
  • /technical-writer - User documentation writing mode (README, guides, tutorials)
  • /reset - Return to default Claude Code mode

Each persona has specific rules and workflows defined in the .claude/ directory. See the respective files for detailed guidelines.

Project Overview

MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains a complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.

Quick Start Example

from dbengine.dbengine import DbEngine

# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")

# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users

Development Commands

Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Run single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load

Building and Packaging

# Build package
python -m build

# Clean build artifacts
make clean

# Clean package artifacts only
make clean-package

Installation

# Install in development mode with test dependencies
pip install -e .[dev]

Architecture

Core Components

DbEngine (src/dbengine/dbengine.py)

  • Main database engine class using RLock for thread safety
  • Manages tenant-specific storage in .mytools_db/{tenant_id}/ structure
  • Tracks latest versions via head file (JSON mapping entry names to digests)
  • Stores objects in content-addressable format: objects/{digest_prefix}/{full_digest}
  • Shared refs/ directory for cross-tenant pickle-based references

Serializer (src/dbengine/serializer.py)

  • Converts Python objects to/from JSON-compatible dictionaries
  • Handles circular references using object ID tracking
  • Supports custom serialization via handlers (see handlers.py)
  • Special tags: __object__, __id__, __tuple__, __set__, __ref__, __digest__, __enum__
  • Objects can define use_refs() method to specify fields that should be pickled instead of JSON-serialized
  • __ref__: Used for use_refs() system (pickle-based storage)
  • __digest__: Used by BaseRefHandler for custom binary formats (numpy, etc.)
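
A hypothetical sketch of what tagged output could look like for an object with a tuple field (the layout and field meanings are assumptions for illustration; serializer.py is authoritative):

# Hypothetical tagged form (layout assumed, not confirmed)
serialized = {
    "__object__": "myapp.models.Point",  # assumed: fully qualified class name
    "__id__": 1,                         # assumed: object ID used for circular-reference tracking
    "coords": {"__tuple__": [1, 2]},     # tuple value wrapped in its special tag
}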

Handlers (src/dbengine/handlers.py)

  • Extensible handler system for custom type serialization
  • Three-tier hierarchy:
    • BaseHandler: Base interface with is_eligible_for() and tag()
    • BaseInlineHandler: For JSON-inline storage (e.g., DateHandler); implements serialize(obj) → dict and deserialize(dict) → obj
    • BaseRefHandler: For custom binary formats stored in refs/ (e.g., DataFrames); implements serialize_to_bytes(obj) → bytes and deserialize_from_bytes(bytes) → obj
  • Currently implements DateHandler (BaseInlineHandler) for datetime.date objects
  • Use handlers.register_handler() to add custom handlers

Utils (src/dbengine/utils.py)

  • Type checking utilities: is_primitive(), is_dictionary(), is_list(), etc.
  • Class introspection: get_full_qualified_name(), importable_name(), get_class()
  • Digest computation: compute_digest_from_stream(), compute_digest_from_bytes()
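
A small usage sketch (argument and return types are assumptions inferred from the names above):

from dbengine.utils import is_primitive, get_full_qualified_name, compute_digest_from_bytes

assert is_primitive(42)                             # assumed: ints count as primitives
name = get_full_qualified_name(dict)                # assumed: returns an importable name like "builtins.dict"
digest = compute_digest_from_bytes(b"hello world")  # assumed: returns a SHA-256 hex digest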

RefHelper and PickleRefHelper (src/dbengine/dbengine.py)

  • RefHelper: Base class for content-addressable storage in refs/ directory
    • save_ref_from_bytes(data: bytes) → digest: Store raw bytes
    • load_ref_to_bytes(digest) → bytes: Load raw bytes
    • Used by BaseRefHandler for custom binary formats
  • PickleRefHelper(RefHelper): Adds pickle serialization layer
    • save_ref(obj) → digest: Pickle and store object
    • load_ref(digest) → obj: Load and unpickle object
    • Used by use_refs() system and Serializer
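
A hypothetical round trip (how a PickleRefHelper is constructed is an assumption; in normal use it is driven internally by the Serializer and the use_refs() system):

from dbengine.dbengine import PickleRefHelper

# Hypothetical sketch: constructor arguments are assumptions
helper = PickleRefHelper(".mytools_db/refs")

digest = helper.save_ref({"large": "object"})  # pickle, then store under the content digest
restored = helper.load_ref(digest)             # load bytes, then unpickle
assert restored == {"large": "object"}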

Storage Architecture

.mytools_db/
├── {tenant_id}/
│   ├── head                           # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/          # First 24 chars of digest
│           └── {full_digest}         # JSON snapshot with metadata
└── refs/                             # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}             # Pickle or custom binary format

Note: The refs/ directory stores binary data in content-addressable format:

  • Pickled objects (via use_refs() or PickleRefHelper)
  • Custom binary formats (via BaseRefHandler, e.g., numpy arrays)

Metadata System

Each snapshot includes automatic metadata fields:

  • __parent__: List containing the digest of the previous version (or [None] for the first snapshot)
  • __user_id__: ID of the user who created the snapshot (historically __user__ in the TAG constant)
  • __date__: Creation timestamp formatted as YYYYMMDD HH:MM:SS %z
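
An illustrative snapshot payload (values are made up; the exact on-disk layout may differ):

snapshot = {
    "__parent__": [None],                   # or ["<digest-of-previous-version>"]
    "__user_id__": "user_1",                # who created the snapshot
    "__date__": "20250101 12:00:00 +0000",  # creation timestamp
    "theme": "dark",                        # user payload
    "lang": "en",
}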

Two Usage Patterns

Pattern 1: Snapshot-based (save()/load())

  • Save complete object states
  • Best for configuration objects or complete state snapshots
  • Direct control over what gets saved

Pattern 2: Record-based (put()/put_many()/get())

  • Incremental updates to dictionary-like collections
  • Automatically creates snapshots only when data changes
  • put() and put_many() return True/False indicating whether a snapshot was created (see the sketch below)
  • Best for managing collections of items
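
A short sketch of this change detection, assuming a freshly initialized engine (see the Quick Start example for setup):

changed = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
# True on the first write: a new snapshot was created

changed = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
# False on an identical write: data unchanged, no snapshot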

Important: Do not mix patterns for the same entry - they expect different data structures.

Common Pitfalls

⚠️ Mixing save() and put() on the same entry

  • save() expects to store complete snapshots (any object)
  • put() expects dictionary-like structures with key-value pairs
  • Using both on the same entry will cause data structure conflicts

⚠️ Refs are shared across tenants

  • Objects stored via use_refs() go to shared refs/ directory
  • Not isolated per tenant - identical objects are reused across all tenants
  • Good for deduplication, but be aware of cross-tenant sharing

⚠️ Parent digest is always a list

  • __parent__ field is stored as [digest] or [None]
  • Always access as data[TAG_PARENT][0], not data[TAG_PARENT]
  • This allows for future support of multiple parents (merge scenarios)
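
For example:

# Illustrative snapshot dict (digest value made up)
data = {"__parent__": ["a1b2c3..."]}

parent_digest = data["__parent__"][0]  # unwrap the single-element list; None for the first version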

Reference System

Objects can opt into pickle-based storage for specific fields:

  1. Define use_refs() method returning set of field names
  2. Serializer stores those fields in shared refs/ directory
  3. Reduces JSON snapshot size and enables cross-tenant deduplication
  4. Example: DummyObjWithRef in test_dbengine.py

Extension Points

Custom Type Handlers

MyDbEngine supports two types of custom handlers for serializing types:

1. BaseInlineHandler - For JSON Storage

Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).

Example: Custom type handler

from dbengine.handlers import BaseInlineHandler, handlers

class MyCustomType:
    """Minimal illustrative type that round-trips through a dict."""
    def __init__(self, value):
        self.value = value

    def to_dict(self):
        return {"value": self.value}

    @classmethod
    def from_dict(cls, data):
        return cls(data["value"])

class MyCustomHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)

    def tag(self):
        return "MyCustomType"

    def serialize(self, obj) -> dict:
        return {
            "__special__": self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])

# Register the handler
handlers.register_handler(MyCustomHandler())

When to use BaseInlineHandler:

  • Small data structures that fit well in JSON
  • Types requiring human-readable storage
  • Types needing validation during deserialization
  • Simple external library types (e.g., datetime.date)

2. BaseRefHandler - For Binary Storage

Use when data should be stored in optimized binary format in refs/ directory (large datasets, better compression).

Example: pandas DataFrame handler

from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import numpy as np
import json

class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df) -> bytes:
        """Convert DataFrame to compact binary format"""
        # Store JSON metadata (columns, index, dtype) followed by the raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()

        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data: bytes) -> object:
        """Reconstruct DataFrame from binary format"""
        # Read the metadata header
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4+metadata_length].decode('utf-8'))
        numpy_bytes = data[4+metadata_length:]

        # Reconstruct array and DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype']).copy()  # copy: frombuffer returns a read-only view
        array = array.reshape(len(metadata['index']), len(metadata['columns']))

        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])

# Register the handler
handlers.register_handler(DataFrameHandler())

When to use BaseRefHandler:

  • Large binary data (DataFrames, numpy arrays, images)
  • Data that benefits from custom compression (e.g., numpy's compact format)
  • Types that lose information in JSON conversion
  • Shared data across snapshots (automatic deduplication via SHA-256)

Key differences:

  • BaseInlineHandler: Data stored in JSON snapshot → {"__special__": "Tag", "data": {...}}
  • BaseRefHandler: Data stored in refs/; the JSON snapshot contains {"__special__": "Tag", "__digest__": "abc123..."}
  • BaseRefHandler provides automatic deduplication and smaller JSON snapshots

Using References (use_refs)

For objects with large nested data structures that should be pickled instead of JSON-serialized:

class MyDataObject:
    def __init__(self, metadata, large_dataframe):
        self.metadata = metadata
        self.large_dataframe = large_dataframe  # pandas DataFrame, for example

    @staticmethod
    def use_refs():
        """Return set of field names to pickle instead of JSON-serialize"""
        return {"large_dataframe"}

When to use use_refs():

  • Quick solution for large nested objects without writing a custom handler
  • Works with any picklable object
  • Per-object control (some fields in JSON, others pickled)

use_refs() vs BaseRefHandler:

  • use_refs(): Uses pickle (via PickleRefHelper), simple but less optimized
  • BaseRefHandler: Custom binary format (e.g., numpy), optimized but requires handler code
  • Both store in refs/ and get automatic SHA-256 deduplication
  • use_refs() generates {"__ref__": "digest"} tags
  • BaseRefHandler generates {"__special__": "Tag", "__digest__": "digest"} tags

Trade-offs:

  • Pro: Smaller JSON snapshots
  • Pro: Cross-tenant deduplication
  • Con: Less human-readable (binary format)
  • Con: Python version compatibility concerns with pickle (use_refs only)

Testing Notes

  • Test fixtures use DB_ENGINE_ROOT = "TestDBEngineRoot" for isolation
  • Tests clean up temp directories using shutil.rmtree() in fixtures
  • Test classes like DummyObj, DummyObjWithRef, DummyObjWithKey demonstrate usage patterns
  • Thread safety is built-in via RLock but not explicitly tested
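
A fixture following these notes might look like this (a sketch; the actual fixtures in tests/ are authoritative):

import shutil
import pytest
from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"

@pytest.fixture
def engine():
    db = DbEngine(root=DB_ENGINE_ROOT)
    db.init("tenant_1")
    yield db
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)  # clean up the temp directory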

Key Design Decisions

  • Immutability: Snapshots never modified after creation (git-style)
  • Content Addressing: Identical objects stored only once (deduplication via SHA-256)
  • Change Detection: put() and put_many() skip saving if data unchanged
  • Thread Safety: All DbEngine operations protected by RLock
  • No Dependencies: Core engine has zero runtime dependencies (pytest only for dev)

Development Workflow and Guidelines

Development Process

Code must always be testable. Before writing any code:

  1. Explain available options first - Present different approaches to solve the problem
  2. Wait for validation - Ensure mutual understanding of requirements before implementation
  3. No code without approval - Only proceed after explicit validation

Collaboration Style

Ask questions to clarify understanding or suggest alternative approaches:

  • Ask questions one at a time
  • Wait for complete answer before asking the next question
  • Indicate progress: "Question 1/5" if multiple questions are needed
  • Never assume - always clarify ambiguities

Communication

  • Conversations: French or English
  • Code, documentation, comments: English only

Code Standards

Follow PEP 8 conventions strictly:

  • Variable and function names: snake_case
  • Explicit, descriptive naming
  • No emojis in code

Documentation:

  • Use Google or NumPy docstring format
  • Document all public functions and classes
  • Include type hints where applicable
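
For example, a Google-style docstring (function body shown for completeness; the real utility lives in src/dbengine/utils.py):

import hashlib

def compute_digest_from_bytes(data: bytes) -> str:
    """Compute the SHA-256 digest of a byte string.

    Args:
        data: Raw bytes to hash.

    Returns:
        Hexadecimal digest string.
    """
    return hashlib.sha256(data).hexdigest()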

Dependency Management

When introducing new dependencies:

  • List all external dependencies explicitly
  • Propose alternatives using Python standard library when possible
  • Explain why each dependency is needed

Unit Testing with pytest

Test naming patterns:

  • Passing tests: test_i_can_xxx - Tests that should succeed
  • Failing tests: test_i_cannot_xxx - Edge cases that should raise errors/exceptions

Test structure:

  • Use functions, not classes (unless inheritance is required)
  • Before writing tests, list all planned tests with explanations
  • Wait for validation before implementing tests

Example:

import pytest
from dbengine.dbengine import DbEngine, DbException  # import location of DbException assumed

def test_i_can_save_and_load_object():
    """Test that an object can be saved and loaded successfully."""
    engine = DbEngine(root="test_db")
    engine.init("tenant_1")
    digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
    assert digest is not None

def test_i_cannot_save_with_empty_tenant_id():
    """Test that saving with empty tenant_id raises DbException."""
    engine = DbEngine(root="test_db")
    with pytest.raises(DbException):
        engine.save("", "user_1", "entry_1", {"key": "value"})

File Management

Always specify the full file path when adding or modifying files:

✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py

Error Handling

When errors occur:

  1. Explain the problem clearly first
  2. Do not propose a fix immediately
  3. Wait for validation that the diagnosis is correct
  4. Only then propose solutions