# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Available Personas

This project uses specialized personas for different types of work. Use these commands to switch modes:

- **`/developer`** - Full development mode with validation workflow (options-first, wait for approval before coding)
- **`/unit-tester`** - Specialized mode for writing comprehensive unit tests for existing code
- **`/technical-writer`** - User documentation writing mode (README, guides, tutorials)
- **`/reset`** - Return to default Claude Code mode

Each persona has specific rules and workflows defined in the `.claude/` directory. See the respective files for detailed guidelines.

## Project Overview

MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains a complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.

### Quick Start Example

```python
from dbengine.dbengine import DbEngine

# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")

# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users
```

## Development Commands

### Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Run single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load
```

### Building and Packaging

```bash
# Build package
python -m build

# Clean build artifacts
make clean

# Clean package artifacts only
make clean-package
```

### Installation

```bash
# Install in development mode with test dependencies
pip install -e .[dev]
```

## Architecture

### Core Components

**DbEngine** (`src/dbengine/dbengine.py`)
- Main database engine class, using an RLock for thread safety
- Manages tenant-specific storage in a `.mytools_db/{tenant_id}/` structure
- Tracks latest versions via a `head` file (JSON mapping entry names to digests)
- Stores objects in content-addressable format: `objects/{digest_prefix}/{full_digest}`
- Shared `refs/` directory for cross-tenant pickle-based references

**Serializer** (`src/dbengine/serializer.py`)
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: `__object__`, `__id__`, `__tuple__`, `__set__`, `__ref__`, `__digest__`, `__enum__`
- Objects can define a `use_refs()` method to specify fields that should be pickled instead of JSON-serialized
  - `__ref__`: Used by the `use_refs()` system (pickle-based storage)
  - `__digest__`: Used by `BaseRefHandler` for custom binary formats (numpy, etc.)
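For orientation, here is a hedged sketch of how these tags might appear inside a serialized snapshot. The `__ref__` and `__digest__` shapes match the forms described later in this file; the `__tuple__`, `__set__`, and `__special__` layouts are illustrative assumptions (see `serializer.py` for the authoritative format):

```python
# Illustrative shapes only -- the exact layouts live in serializer.py.
serialized_examples = {
    "tuple_field": {"__tuple__": [1, 2, 3]},           # assumed shape
    "set_field": {"__set__": ["a", "b"]},              # assumed shape
    "pickled_field": {"__ref__": "<sha256-digest>"},   # use_refs() fields
    "handler_field": {"__special__": "DataFrame",      # BaseRefHandler types
                      "__digest__": "<sha256-digest>"},
}
```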
**Handlers** (`src/dbengine/handlers.py`)
- Extensible handler system for custom type serialization
- Three-tier hierarchy:
  - `BaseHandler`: Base interface with `is_eligible_for()` and `tag()`
  - `BaseInlineHandler`: For JSON-inline storage (e.g., DateHandler)
  - `BaseRefHandler`: For custom binary formats stored in `refs/` (e.g., DataFrames)
- `BaseInlineHandler`: Implements `serialize(obj) → dict` and `deserialize(dict) → obj`
- `BaseRefHandler`: Implements `serialize_to_bytes(obj) → bytes` and `deserialize_from_bytes(bytes) → obj`
- Currently ships with `DateHandler` (a `BaseInlineHandler`) for `datetime.date` objects
- Use `handlers.register_handler()` to add custom handlers

**Utils** (`src/dbengine/utils.py`)
- Type-checking utilities: `is_primitive()`, `is_dictionary()`, `is_list()`, etc.
- Class introspection: `get_full_qualified_name()`, `importable_name()`, `get_class()`
- Digest computation: `compute_digest_from_stream()`, `compute_digest_from_bytes()`

**RefHelper and PickleRefHelper** (`src/dbengine/dbengine.py`)
- `RefHelper`: Base class for content-addressable storage in the `refs/` directory
  - `save_ref_from_bytes(data: bytes) → digest`: Store raw bytes
  - `load_ref_to_bytes(digest) → bytes`: Load raw bytes
  - Used by `BaseRefHandler` for custom binary formats
- `PickleRefHelper(RefHelper)`: Adds a pickle serialization layer
  - `save_ref(obj) → digest`: Pickle and store an object
  - `load_ref(digest) → obj`: Load and unpickle an object
  - Used by the `use_refs()` system and the `Serializer`

### Storage Architecture

```
.mytools_db/
├── {tenant_id}/
│   ├── head                    # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/    # First 24 chars of digest
│           └── {full_digest}   # JSON snapshot with metadata
└── refs/                       # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}       # Pickle or custom binary format
```

**Note**: The `refs/` directory stores binary data in content-addressable format:
- Pickled objects (via `use_refs()` or `PickleRefHelper`)
- Custom binary formats (via `BaseRefHandler`, e.g., numpy arrays)

### Metadata System

Each snapshot includes automatic metadata fields:

- `__parent__`: List containing the digest of the previous version (or `[None]` for the first)
- `__user_id__`: ID of the user who created the snapshot (was `__user__` in the TAG constant)
- `__date__`: Timestamp in `YYYYMMDD HH:MM:SS %z` format

### Two Usage Patterns

**Pattern 1: Snapshot-based (`save()`/`load()`)**
- Saves complete object states
- Best for configuration objects or complete state snapshots
- Direct control over what gets saved

**Pattern 2: Record-based (`put()`/`put_many()`/`get()`)**
- Incremental updates to dictionary-like collections
- Automatically creates snapshots only when data changes
- Returns `True`/`False` indicating whether a snapshot was created (see the sketch below)
- Best for managing collections of items

**Important**: Do not mix patterns on the same entry - they expect different data structures.
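Because `put()` reports whether a snapshot was actually written, change detection can be observed directly. A minimal sketch, assuming the boolean return value described above:

```python
from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# First write creates a snapshot
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
assert created

# Re-writing identical data is detected as unchanged: no new snapshot
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
assert not created
```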
### Common Pitfalls

⚠️ **Mixing save() and put() on the same entry**
- `save()` expects to store complete snapshots (any object)
- `put()` expects dictionary-like structures with key-value pairs
- Using both on the same entry will cause data structure conflicts

⚠️ **Refs are shared across tenants**
- Objects stored via `use_refs()` go to the shared `refs/` directory
- Not isolated per tenant - identical objects are reused across all tenants
- Good for deduplication, but be aware of the cross-tenant sharing

⚠️ **Parent digest is always a list**
- The `__parent__` field is stored as `[digest]` or `[None]`
- Always access it as `data[TAG_PARENT][0]`, not `data[TAG_PARENT]`
- This leaves room for future support of multiple parents (merge scenarios)

### Reference System

Objects can opt into pickle-based storage for specific fields:

1. Define a `use_refs()` method returning a set of field names
2. The Serializer stores those fields in the shared `refs/` directory
3. This reduces JSON snapshot size and enables cross-tenant deduplication
4. Example: `DummyObjWithRef` in `tests/test_dbengine.py`

## Extension Points

### Custom Type Handlers

MyDbEngine supports two kinds of custom handlers for serializing types:

#### 1. BaseInlineHandler - For JSON Storage

Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).

**Example: Custom date handler**

```python
from dbengine.handlers import BaseInlineHandler, handlers


class MyCustomHandler(BaseInlineHandler):
    # MyCustomType is a placeholder for your own type
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)

    def tag(self):
        return "MyCustomType"

    def serialize(self, obj) -> dict:
        return {
            "__special__": self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])


# Register the handler
handlers.register_handler(MyCustomHandler())
```

**When to use BaseInlineHandler:**
- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., datetime.date)
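Since `DateHandler` ships with the engine, `datetime.date` values already round-trip through the normal API with no extra registration. A quick sketch, assuming the default handler registry is active:

```python
import datetime

from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# DateHandler (a BaseInlineHandler) serializes the date inline in the JSON snapshot
engine.save("tenant_1", "user_1", "calendar", {"release": datetime.date(2024, 6, 1)})

# On load, the serializer dispatches on the handler tag and rebuilds the date
restored = engine.load("tenant_1", "calendar")
```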
#### 2. BaseRefHandler - For Binary Storage

Use when data should be stored in an optimized binary format in the `refs/` directory (large datasets, better compression).

**Example: pandas DataFrame handler**

```python
import json

import numpy as np
import pandas as pd

from dbengine.handlers import BaseRefHandler, handlers


class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df) -> bytes:
        """Convert a DataFrame to a compact binary format.

        Assumes a homogeneous numeric dtype; a mixed-dtype frame would
        yield an object array and fail in tobytes().
        """
        # Length-prefixed JSON metadata followed by the raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.to_numpy().dtype)
        }
        metadata_bytes = json.dumps(metadata).encode("utf-8")
        metadata_length = len(metadata_bytes).to_bytes(4, "big")
        numpy_bytes = df.to_numpy().tobytes()
        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data: bytes) -> object:
        """Reconstruct a DataFrame from the binary format."""
        # Read the length-prefixed metadata header
        metadata_length = int.from_bytes(data[:4], "big")
        metadata = json.loads(data[4:4 + metadata_length].decode("utf-8"))
        numpy_bytes = data[4 + metadata_length:]

        # Reconstruct the array, then the DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata["dtype"])
        array = array.reshape(len(metadata["index"]), len(metadata["columns"]))
        return pd.DataFrame(array, columns=metadata["columns"], index=metadata["index"])


# Register the handler
handlers.register_handler(DataFrameHandler())
```

**When to use BaseRefHandler:**
- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from custom compression (e.g., numpy's compact format)
- Types that lose information in JSON conversion
- Data shared across snapshots (automatic deduplication via SHA-256)

**Key differences:**
- `BaseInlineHandler`: Data stored in the JSON snapshot → `{"__special__": "Tag", "data": {...}}`
- `BaseRefHandler`: Data stored in `refs/` → `{"__special__": "Tag", "__digest__": "abc123..."}`
- `BaseRefHandler` provides automatic deduplication and smaller JSON snapshots

### Using References (use_refs)

For objects with large nested data structures that should be pickled instead of JSON-serialized:

```python
class MyDataObject:
    def __init__(self, metadata, large_dataframe):
        self.metadata = metadata
        self.large_dataframe = large_dataframe  # a pandas DataFrame, for example

    @staticmethod
    def use_refs():
        """Return the set of field names to pickle instead of JSON-serialize."""
        return {"large_dataframe"}
```

**When to use use_refs():**
- Quick solution for large nested objects without writing a custom handler
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)

**use_refs() vs BaseRefHandler:**
- `use_refs()`: Uses pickle (via `PickleRefHelper`); simple but less optimized
- `BaseRefHandler`: Custom binary format (e.g., numpy); optimized but requires handler code
- Both store in `refs/` and get automatic SHA-256 deduplication
- `use_refs()` generates `{"__ref__": "digest"}` tags
- `BaseRefHandler` generates `{"__special__": "Tag", "__digest__": "digest"}` tags

**Trade-offs:**
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (`use_refs()` only)

## Testing Notes

- Test fixtures use `DB_ENGINE_ROOT = "TestDBEngineRoot"` for isolation
- Tests clean up temp directories using `shutil.rmtree()` in fixtures (see the fixture sketch below)
- Test classes like `DummyObj`, `DummyObjWithRef`, and `DummyObjWithKey` demonstrate usage patterns
- Thread safety is built in via RLock but not explicitly tested
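A hedged sketch of that fixture pattern, assuming a simple setup/teardown shape (the actual fixture in the test suite may differ):

```python
import shutil

import pytest

from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"


@pytest.fixture
def engine():
    """Provide an isolated engine and remove its directory afterwards."""
    db = DbEngine(root=DB_ENGINE_ROOT)
    db.init("tenant_1")
    yield db
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)
```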
## Key Design Decisions

- **Immutability**: Snapshots are never modified after creation (git-style)
- **Content Addressing**: Identical objects are stored only once (deduplication via SHA-256)
- **Change Detection**: `put()` and `put_many()` skip saving if the data is unchanged
- **Thread Safety**: All DbEngine operations are protected by an RLock
- **No Dependencies**: The core engine has zero runtime dependencies (pytest only for dev)

## Development Workflow and Guidelines

### Development Process

**Code must always be testable.** Before writing any code:

1. **Explain available options first** - Present different approaches to solving the problem
2. **Wait for validation** - Ensure mutual understanding of the requirements before implementation
3. **No code without approval** - Only proceed after explicit validation

### Collaboration Style

**Ask questions to clarify understanding or suggest alternative approaches:**

- Ask questions **one at a time**
- Wait for a complete answer before asking the next question
- Indicate progress ("Question 1/5") when multiple questions are needed
- Never assume - always clarify ambiguities

### Communication

- **Conversations**: French or English
- **Code, documentation, comments**: English only

### Code Standards

**Follow PEP 8** conventions strictly:

- Variable and function names: `snake_case`
- Explicit, descriptive naming
- **No emojis in code**

**Documentation**:

- Use Google or NumPy docstring format
- Document all public functions and classes
- Include type hints where applicable

### Dependency Management

**When introducing new dependencies:**

- List all external dependencies explicitly
- Propose alternatives from the Python standard library when possible
- Explain why each dependency is needed

### Unit Testing with pytest

**Test naming patterns:**

- Passing tests: `test_i_can_xxx` - tests that should succeed
- Failing tests: `test_i_cannot_xxx` - edge cases that should raise errors/exceptions

**Test structure:**

- Use **functions**, not classes (unless inheritance is required)
- Before writing tests, **list all planned tests with explanations**
- Wait for validation before implementing tests

**Example:**

```python
import pytest

from dbengine.dbengine import DbEngine, DbException  # DbException import path assumed


def test_i_can_save_and_load_object():
    """Test that an object can be saved and loaded successfully."""
    engine = DbEngine(root="test_db")
    engine.init("tenant_1")
    digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
    assert digest is not None


def test_i_cannot_save_with_empty_tenant_id():
    """Test that saving with an empty tenant_id raises DbException."""
    engine = DbEngine(root="test_db")
    with pytest.raises(DbException):
        engine.save("", "user_1", "entry_1", {"key": "value"})
```

### File Management

**Always specify the full file path** when adding or modifying files:

```
✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py
```

### Error Handling

**When errors occur:**

1. **Explain the problem clearly first**
2. **Do not propose a fix immediately**
3. **Wait for validation** that the diagnosis is correct
4. Only then propose solutions