CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Available Personas
This project uses specialized personas for different types of work. Use these commands to switch modes:
- /developer - Full development mode with validation workflow (options-first, wait for approval before coding)
- /unit-tester - Specialized mode for writing comprehensive unit tests for existing code
- /technical-writer - User documentation writing mode (README, guides, tutorials)
- /reset - Return to default Claude Code mode
Each persona has specific rules and workflows defined in .claude/ directory. See the respective files for detailed guidelines.
Project Overview
MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.
Quick Start Example
from dbengine.dbengine import DbEngine
# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")
# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users") # Returns list of all users
Development Commands
Testing
# Run all tests
pytest
# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py
# Run single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load
Building and Packaging
# Build package
python -m build
# Clean build artifacts
make clean
# Clean package artifacts only
make clean-package
Installation
# Install in development mode with test dependencies
pip install -e .[dev]
Architecture
Core Components
DbEngine (src/dbengine/dbengine.py)
- Main database engine class using RLock for thread safety
- Manages tenant-specific storage in the .mytools_db/{tenant_id}/ structure
- Tracks latest versions via the head file (JSON mapping entry names to digests)
- Stores objects in content-addressable format: objects/{digest_prefix}/{full_digest}
- Shared refs/ directory for cross-tenant pickle-based references
Serializer (src/dbengine/serializer.py)
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: __object__, __id__, __tuple__, __set__, __ref__, __digest__, __enum__
- Objects can define a use_refs() method to specify fields that should be pickled instead of JSON-serialized
- __ref__: Used by the use_refs() system (pickle-based storage)
- __digest__: Used by BaseRefHandler for custom binary formats (numpy, etc.)
Handlers (src/dbengine/handlers.py)
- Extensible handler system for custom type serialization
- Three-tier hierarchy:
  - BaseHandler: Base interface with is_eligible_for() and tag()
  - BaseInlineHandler: For JSON-inline storage (e.g., DateHandler)
  - BaseRefHandler: For custom binary formats stored in refs/ (e.g., DataFrames)
- BaseInlineHandler: Implements serialize(obj) → dict and deserialize(dict) → obj
- BaseRefHandler: Implements serialize_to_bytes(obj) → bytes and deserialize_from_bytes(bytes) → obj
- Currently implements DateHandler (BaseInlineHandler) for datetime.date objects
- Use handlers.register_handler() to add custom handlers
Utils (src/dbengine/utils.py)
- Type checking utilities: is_primitive(), is_dictionary(), is_list(), etc.
- Class introspection: get_full_qualified_name(), importable_name(), get_class()
- Digest computation: compute_digest_from_stream(), compute_digest_from_bytes() (content-addressing sketch below)
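For intuition, here is a minimal sketch of the content-addressing scheme these utilities support, written directly against hashlib rather than the actual utils.py implementation:

import hashlib

data = b'{"theme": "dark", "lang": "en"}'     # any serialized payload
digest = hashlib.sha256(data).hexdigest()     # full SHA-256 digest (64 hex characters)

# Objects are stored under objects/{digest_prefix}/{full_digest},
# where the prefix is the first 24 characters of the digest.
prefix = digest[:24]
print(f"objects/{prefix}/{digest}")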
RefHelper and PickleRefHelper (src/dbengine/dbengine.py)
- RefHelper: Base class for content-addressable storage in the refs/ directory
  - save_ref_from_bytes(data: bytes) → digest: Store raw bytes
  - load_ref_to_bytes(digest) → bytes: Load raw bytes
  - Used by BaseRefHandler for custom binary formats
- PickleRefHelper(RefHelper): Adds a pickle serialization layer
  - save_ref(obj) → digest: Pickle and store an object
  - load_ref(digest) → obj: Load and unpickle an object
  - Used by the use_refs() system and the Serializer
Storage Architecture
.mytools_db/
├── {tenant_id}/
│ ├── head # JSON: {"entry_name": "latest_digest"}
│ └── objects/
│ └── {digest_prefix}/ # First 24 chars of digest
│ └── {full_digest} # JSON snapshot with metadata
└── refs/ # Shared binary references (cross-tenant)
└── {digest_prefix}/
└── {full_digest} # Pickle or custom binary format
Note: The refs/ directory stores binary data in content-addressable format:
- Pickled objects (via use_refs() or PickleRefHelper)
- Custom binary formats (via BaseRefHandler, e.g., numpy arrays)
Metadata System
Each snapshot includes automatic metadata fields:
- __parent__: List containing the digest of the previous version (or [None] for the first)
- __user_id__: User ID of who created the snapshot (was __user__ in the TAG constant)
- __date__: ISO timestamp, YYYYMMDD HH:MM:SS %z
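As a shape sketch only (the exact on-disk layout is defined in dbengine.py), a snapshot with these fields added might look like this:

# Shape sketch only, not the literal on-disk format
snapshot = {
    "theme": "dark",                          # user data from save()/put()
    "__parent__": [None],                     # first version; later versions hold [previous_digest]
    "__user_id__": "user_1",                  # user who created the snapshot
    "__date__": "20240101 12:00:00 +0000",    # YYYYMMDD HH:MM:SS %z
}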
Two Usage Patterns
Pattern 1: Snapshot-based (save()/load())
- Save complete object states
- Best for configuration objects or complete state snapshots
- Direct control over what gets saved
Pattern 2: Record-based (put()/put_many()/get())
- Incremental updates to dictionary-like collections
- Automatically creates snapshots only when data changes
- Returns True/False indicating whether a snapshot was created (see the sketch below)
- Best for managing collections of items
Important: Do not mix patterns for the same entry - they expect different data structures.
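A hedged sketch of the change-detection behaviour described above, using the same calls as the Quick Start (the commented values reflect the documented True/False contract):

from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# First write of this record: data changed, so a snapshot is created -> True
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(created)

# Identical data again: nothing changed, so no snapshot is created -> False
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(created)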
Common Pitfalls
⚠️ Mixing save() and put() on the same entry
- save() expects to store complete snapshots (any object)
- put() expects dictionary-like structures with key-value pairs
- Using both on the same entry will cause data structure conflicts
⚠️ Refs are shared across tenants
- Objects stored via use_refs() go to the shared refs/ directory
- Not isolated per tenant: identical objects are reused across all tenants
- Good for deduplication, but be aware of cross-tenant sharing
- Good for deduplication, but be aware of cross-tenant sharing
⚠️ Parent digest is always a list
- The __parent__ field is stored as [digest] or [None]
- Always access it as data[TAG_PARENT][0], not data[TAG_PARENT]
- This allows future support of multiple parents (merge scenarios)
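A minimal access sketch (TAG_PARENT is assumed to map to the "__parent__" key; check the actual constant in dbengine.py):

data = {"__parent__": ["ab12cd34..."], "value": 42}   # loaded snapshot, shape sketch only

parent_digest = data["__parent__"][0]   # correct: unwrap the single-element list
# parent_digest is None for the first version, otherwise the previous snapshot's digest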
Reference System
Objects can opt into pickle-based storage for specific fields:
- Define a use_refs() method returning a set of field names
- The Serializer stores those fields in the shared refs/ directory
- Reduces JSON snapshot size and enables cross-tenant deduplication
- Example: DummyObjWithRef in test_dbengine.py
Extension Points
Custom Type Handlers
MyDbEngine supports two types of custom handlers for serializing types:
1. BaseInlineHandler - For JSON Storage
Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).
Example: Custom date handler
from dbengine.handlers import BaseInlineHandler, handlers
class MyCustomHandler(BaseInlineHandler):
def is_eligible_for(self, obj):
return isinstance(obj, MyCustomType)
def tag(self):
return "MyCustomType"
def serialize(self, obj) -> dict:
return {
"__special__": self.tag(),
"data": obj.to_dict()
}
def deserialize(self, data: dict) -> object:
return MyCustomType.from_dict(data["data"])
# Register the handler
handlers.register_handler(MyCustomHandler())
When to use BaseInlineHandler:
- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., datetime.date)
2. BaseRefHandler - For Binary Storage
Use when data should be stored in optimized binary format in refs/ directory (large datasets, better compression).
Example: pandas DataFrame handler
from dbengine.handlers import BaseRefHandler, handlers
import pandas as pd
import json
class DataFrameHandler(BaseRefHandler):
def is_eligible_for(self, obj):
return isinstance(obj, pd.DataFrame)
def tag(self):
return "DataFrame"
def serialize_to_bytes(self, df) -> bytes:
"""Convert DataFrame to compact binary format"""
import numpy as np
# Store metadata + numpy bytes
metadata = {
"columns": df.columns.tolist(),
"index": df.index.tolist(),
"dtype": str(df.values.dtype)
}
metadata_bytes = json.dumps(metadata).encode('utf-8')
metadata_length = len(metadata_bytes).to_bytes(4, 'big')
numpy_bytes = df.to_numpy().tobytes()
return metadata_length + metadata_bytes + numpy_bytes
def deserialize_from_bytes(self, data: bytes) -> object:
"""Reconstruct DataFrame from binary format"""
import numpy as np
# Read metadata
metadata_length = int.from_bytes(data[:4], 'big')
metadata = json.loads(data[4:4+metadata_length].decode('utf-8'))
numpy_bytes = data[4+metadata_length:]
# Reconstruct array and DataFrame
array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
array = array.reshape(len(metadata['index']), len(metadata['columns']))
return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])
# Register the handler
handlers.register_handler(DataFrameHandler())
When to use BaseRefHandler:
- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from custom compression (e.g., numpy's compact format)
- Types that lose information in JSON conversion
- Shared data across snapshots (automatic deduplication via SHA-256)
Key differences:
- BaseInlineHandler: Data stored in the JSON snapshot → {"__special__": "Tag", "data": {...}}
- BaseRefHandler: Data stored in refs/ → {"__special__": "Tag", "__digest__": "abc123..."}
- BaseRefHandler provides automatic deduplication and smaller JSON snapshots
Using References (use_refs)
For objects with large nested data structures that should be pickled instead of JSON-serialized:
class MyDataObject:
def __init__(self, metadata, large_dataframe):
self.metadata = metadata
self.large_dataframe = large_dataframe # pandas DataFrame, for example
@staticmethod
def use_refs():
"""Return set of field names to pickle instead of JSON-serialize"""
return {"large_dataframe"}
When to use use_refs():
- Quick solution for large nested objects without writing custom handler
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)
use_refs() vs BaseRefHandler:
- use_refs(): Uses pickle (via PickleRefHelper), simple but less optimized
- BaseRefHandler: Custom binary format (e.g., numpy), optimized but requires handler code
- Both store in refs/ and get automatic SHA-256 deduplication
- use_refs() generates {"__ref__": "digest"} tags
- BaseRefHandler generates {"__special__": "Tag", "__digest__": "digest"} tags
Trade-offs:
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (use_refs only)
Testing Notes
- Test fixtures use DB_ENGINE_ROOT = "TestDBEngineRoot" for isolation
- Tests clean up temp directories using shutil.rmtree() in fixtures (see the fixture sketch below)
- Test classes like DummyObj, DummyObjWithRef, DummyObjWithKey demonstrate usage patterns
- Thread safety is built-in via RLock but not explicitly tested
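A hedged sketch of such a fixture (the actual fixtures in tests/ may differ in detail):

import shutil
import pytest
from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"

@pytest.fixture
def engine():
    """Provide an isolated engine and remove its directory afterwards."""
    db = DbEngine(root=DB_ENGINE_ROOT)
    db.init("tenant_1")
    yield db
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)   # clean up the temp directory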
Key Design Decisions
- Immutability: Snapshots never modified after creation (git-style)
- Content Addressing: Identical objects stored only once (deduplication via SHA-256)
- Change Detection: put() and put_many() skip saving if data is unchanged
- Thread Safety: All DbEngine operations protected by RLock
- No Dependencies: Core engine has zero runtime dependencies (pytest only for dev)
Development Workflow and Guidelines
Development Process
Code must always be testable. Before writing any code:
- Explain available options first - Present different approaches to solve the problem
- Wait for validation - Ensure mutual understanding of requirements before implementation
- No code without approval - Only proceed after explicit validation
Collaboration Style
Ask questions to clarify understanding or suggest alternative approaches:
- Ask questions one at a time
- Wait for complete answer before asking the next question
- Indicate progress: "Question 1/5" if multiple questions are needed
- Never assume - always clarify ambiguities
Communication
- Conversations: French or English
- Code, documentation, comments: English only
Code Standards
Follow PEP 8 conventions strictly:
- Variable and function names: snake_case
- Explicit, descriptive naming
- No emojis in code
- No emojis in code
Documentation:
- Use Google or NumPy docstring format
- Document all public functions and classes
- Include type hints where applicable
Dependency Management
When introducing new dependencies:
- List all external dependencies explicitly
- Propose alternatives using Python standard library when possible
- Explain why each dependency is needed
Unit Testing with pytest
Test naming patterns:
- Passing tests: test_i_can_xxx (tests that should succeed)
- Failing tests: test_i_cannot_xxx (edge cases that should raise errors/exceptions)
Test structure:
- Use functions, not classes (unless inheritance is required)
- Before writing tests, list all planned tests with explanations
- Wait for validation before implementing tests
Example:
import pytest
from dbengine.dbengine import DbEngine, DbException  # DbException import path assumed; adjust if it is defined elsewhere

def test_i_can_save_and_load_object():
"""Test that an object can be saved and loaded successfully."""
engine = DbEngine(root="test_db")
engine.init("tenant_1")
digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
assert digest is not None
def test_i_cannot_save_with_empty_tenant_id():
"""Test that saving with empty tenant_id raises DbException."""
engine = DbEngine(root="test_db")
with pytest.raises(DbException):
engine.save("", "user_1", "entry_1", {"key": "value"})
File Management
Always specify the full file path when adding or modifying files:
✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py
Error Handling
When errors occur:
- Explain the problem clearly first
- Do not propose a fix immediately
- Wait for validation that the diagnosis is correct
- Only then propose solutions