# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Available Personas

This project uses specialized personas for different types of work. Use these commands to switch modes:

- **`/developer`** - Full development mode with validation workflow (options-first, wait for approval before coding)
- **`/unit-tester`** - Specialized mode for writing comprehensive unit tests for existing code
- **`/technical-writer`** - User documentation writing mode (README, guides, tutorials)
- **`/reset`** - Return to default Claude Code mode

Each persona has specific rules and workflows defined in the `.claude/` directory. See the respective files for detailed guidelines.

## Project Overview

MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains a complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.

### Quick Start Example

```python
from dbengine.dbengine import DbEngine

# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")

# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users")  # Returns list of all users
```

## Development Commands

### Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py

# Run a single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load
```

### Building and Packaging

```bash
# Build package
python -m build

# Clean build artifacts
make clean

# Clean package artifacts only
make clean-package
```

### Installation

```bash
# Install in development mode with test dependencies
# (quoted so the extras specifier also works in zsh)
pip install -e ".[dev]"
```

## Architecture

### Core Components

**DbEngine** (`src/dbengine/dbengine.py`)
- Main database engine class using an RLock for thread safety
- Manages tenant-specific storage in the `.mytools_db/{tenant_id}/` structure
- Tracks latest versions via a `head` file (JSON mapping entry names to digests)
- Stores objects in content-addressable format: `objects/{digest_prefix}/{full_digest}`
- Shared `refs/` directory for cross-tenant pickle-based references

**Serializer** (`src/dbengine/serializer.py`)
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see `handlers.py`)
- Special tags: `__object__`, `__id__`, `__tuple__`, `__set__`, `__ref__`, `__digest__`, `__enum__`
- Objects can define a `use_refs()` method to specify fields that should be pickled instead of JSON-serialized
- `__ref__`: used by the `use_refs()` system (pickle-based storage)
- `__digest__`: used by `BaseRefHandler` for custom binary formats (numpy, etc.)
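
To make the tags concrete, here is an illustrative sketch of what a serialized object might look like; the exact field layout produced by `Serializer` may differ:

```python
# Illustrative only -- the exact layout produced by Serializer may differ.
serialized = {
    "__object__": "mypackage.mymodule.Point",  # assumed: fully qualified class name
    "__id__": 1,                               # object ID used for circular references
    "coords": {"__tuple__": [1, 2]},           # tuples tagged to survive the JSON round-trip
    "tags": {"__set__": ["a", "b"]},           # sets likewise
}
```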

**Handlers** (`src/dbengine/handlers.py`)
- Extensible handler system for custom type serialization
- Three-tier hierarchy:
  - `BaseHandler`: base interface with `is_eligible_for()` and `tag()`
  - `BaseInlineHandler`: for JSON-inline storage (e.g., `DateHandler`); implements `serialize(obj) → dict` and `deserialize(dict) → obj`
  - `BaseRefHandler`: for custom binary formats stored in `refs/` (e.g., DataFrames); implements `serialize_to_bytes(obj) → bytes` and `deserialize_from_bytes(bytes) → obj`
- Currently implements `DateHandler` (a `BaseInlineHandler`) for `datetime.date` objects
- Use `handlers.register_handler()` to add custom handlers (see Extension Points below for full examples)

**Utils** (`src/dbengine/utils.py`)
- Type checking utilities: `is_primitive()`, `is_dictionary()`, `is_list()`, etc.
- Class introspection: `get_full_qualified_name()`, `importable_name()`, `get_class()`
- Digest computation: `compute_digest_from_stream()`, `compute_digest_from_bytes()`
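
A minimal usage sketch of the digest utilities, assuming `compute_digest_from_bytes()` returns the SHA-256 hex digest used for content addressing (check `utils.py` for the exact signature):

```python
from dbengine.utils import compute_digest_from_bytes

# Assumption: the return value is the SHA-256 hex digest string
# used as the content address; verify against utils.py.
digest = compute_digest_from_bytes(b"hello world")
```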

**RefHelper and PickleRefHelper** (`src/dbengine/dbengine.py`)
- `RefHelper`: base class for content-addressable storage in the `refs/` directory
  - `save_ref_from_bytes(data: bytes) → digest`: store raw bytes
  - `load_ref_to_bytes(digest) → bytes`: load raw bytes
  - Used by `BaseRefHandler` for custom binary formats
- `PickleRefHelper(RefHelper)`: adds a pickle serialization layer
  - `save_ref(obj) → digest`: pickle and store an object
  - `load_ref(digest) → obj`: load and unpickle an object
  - Used by the `use_refs()` system and `Serializer`
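
A hedged sketch of the round trip; the constructor argument is an assumption, so check `dbengine.py` for the real signature:

```python
from dbengine.dbengine import PickleRefHelper

# Hypothetical: the constructor argument (path to the shared refs/
# directory) is an assumption; check dbengine.py for the real one.
helper = PickleRefHelper(".mytools_db/refs")

digest = helper.save_ref({"any": "picklable object"})  # pickle and store
restored = helper.load_ref(digest)                     # load and unpickle
```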

### Storage Architecture

```
.mytools_db/
├── {tenant_id}/
│   ├── head                     # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/     # First 24 chars of digest
│           └── {full_digest}    # JSON snapshot with metadata
└── refs/                        # Shared binary references (cross-tenant)
    └── {digest_prefix}/
        └── {full_digest}        # Pickle or custom binary format
```

**Note**: The `refs/` directory stores binary data in content-addressable format:
- Pickled objects (via `use_refs()` or `PickleRefHelper`)
- Custom binary formats (via `BaseRefHandler`, e.g., numpy arrays)
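
For illustration, this is how an object path is derived under the layout above (the engine computes these paths internally):

```python
import os

# Illustrative only: the engine builds these paths itself.
digest = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
object_path = os.path.join(
    ".mytools_db", "tenant_1", "objects",
    digest[:24],  # digest prefix directory (first 24 chars)
    digest,       # full digest as the file name
)
```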

### Metadata System

Each snapshot includes automatic metadata fields:
- `__parent__`: list containing the digest of the previous version (or `[None]` for the first)
- `__user_id__`: user ID that created the snapshot (previously `__user__` in the TAG constant)
- `__date__`: timestamp in `YYYYMMDD HH:MM:SS %z` format
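
A hedged sketch of reading these fields, assuming `load()` returns the snapshot dict including the metadata keys (adjust if the actual API differs):

```python
# Assumption: load() returns the stored snapshot dict with metadata keys.
data = engine.load("tenant_1", "config")
parent = data["__parent__"][0]   # digest of the previous version, or None
author = data["__user_id__"]     # user ID that created the snapshot
saved_at = data["__date__"]      # "YYYYMMDD HH:MM:SS %z" timestamp
```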

### Two Usage Patterns

**Pattern 1: Snapshot-based (`save()`/`load()`)**
- Save complete object states
- Best for configuration objects or complete state snapshots
- Direct control over what gets saved

**Pattern 2: Record-based (`put()`/`put_many()`/`get()`)**
- Incremental updates to dictionary-like collections
- Automatically creates snapshots only when data changes
- Returns `True`/`False` indicating whether a snapshot was created (see the example after this list)
- Best for managing collections of items
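
Continuing the Quick Start engine, `put()` reports whether a new snapshot was written:

```python
changed = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(changed)  # True -- a new snapshot was created

changed = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
print(changed)  # False -- identical data, no new snapshot
```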

**Important**: Do not mix patterns for the same entry - they expect different data structures.

### Common Pitfalls

⚠️ **Mixing save() and put() on the same entry**
- `save()` stores complete snapshots (any object)
- `put()` expects dictionary-like structures with key-value pairs
- Using both on the same entry will cause data structure conflicts, as sketched below
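
A sketch of the anti-pattern to avoid:

```python
# Anti-pattern -- do not mix save() and put() on one entry.
engine.save("tenant_1", "user_1", "stuff", {"theme": "dark"})   # complete snapshot
engine.put("tenant_1", "user_1", "stuff", "key", {"value": 1})  # record-style update
# The entry now holds conflicting structures: put() expects a
# key-value collection, not an arbitrary snapshot object.
```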

⚠️ **Refs are shared across tenants**
- Objects stored via `use_refs()` go to the shared `refs/` directory
- Not isolated per tenant - identical objects are reused across all tenants
- Good for deduplication, but be aware of the cross-tenant sharing

⚠️ **Parent digest is always a list**
- The `__parent__` field is stored as `[digest]` or `[None]`
- Always access it as `data[TAG_PARENT][0]`, not `data[TAG_PARENT]`
- This allows future support for multiple parents (merge scenarios)

### Reference System

Objects can opt into pickle-based storage for specific fields:

1. Define a `use_refs()` method returning a set of field names
2. The serializer stores those fields in the shared `refs/` directory
3. This reduces JSON snapshot size and enables cross-tenant deduplication
4. Example: `DummyObjWithRef` in `tests/test_dbengine.py`

## Extension Points

### Custom Type Handlers

MyDbEngine supports two types of custom handlers for serializing types:

#### 1. BaseInlineHandler - For JSON Storage

Use when data should be stored directly in the JSON snapshot (human-readable, smaller datasets).

**Example: Custom inline handler**
```python
from dbengine.handlers import BaseInlineHandler, handlers


class MyCustomHandler(BaseInlineHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)

    def tag(self):
        return "MyCustomType"

    def serialize(self, obj) -> dict:
        return {
            "__special__": self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])


# Register the handler
handlers.register_handler(MyCustomHandler())
```

**When to use BaseInlineHandler:**
- Small data structures that fit well in JSON
- Types requiring human-readable storage
- Types needing validation during deserialization
- Simple external library types (e.g., `datetime.date`)

#### 2. BaseRefHandler - For Binary Storage

Use when data should be stored in an optimized binary format in the `refs/` directory (large datasets, better compression).

**Example: pandas DataFrame handler**
```python
import json

import numpy as np
import pandas as pd

from dbengine.handlers import BaseRefHandler, handlers


class DataFrameHandler(BaseRefHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, pd.DataFrame)

    def tag(self):
        return "DataFrame"

    def serialize_to_bytes(self, df) -> bytes:
        """Convert a DataFrame to a compact binary format."""
        # Store a JSON metadata header followed by the raw numpy bytes
        metadata = {
            "columns": df.columns.tolist(),
            "index": df.index.tolist(),
            "dtype": str(df.values.dtype)
        }
        metadata_bytes = json.dumps(metadata).encode('utf-8')
        metadata_length = len(metadata_bytes).to_bytes(4, 'big')
        numpy_bytes = df.to_numpy().tobytes()

        return metadata_length + metadata_bytes + numpy_bytes

    def deserialize_from_bytes(self, data: bytes) -> object:
        """Reconstruct a DataFrame from the binary format."""
        # Read the metadata header
        metadata_length = int.from_bytes(data[:4], 'big')
        metadata = json.loads(data[4:4 + metadata_length].decode('utf-8'))
        numpy_bytes = data[4 + metadata_length:]

        # Reconstruct the array and the DataFrame
        array = np.frombuffer(numpy_bytes, dtype=metadata['dtype'])
        array = array.reshape(len(metadata['index']), len(metadata['columns']))

        return pd.DataFrame(array, columns=metadata['columns'], index=metadata['index'])


# Register the handler
handlers.register_handler(DataFrameHandler())
```

**When to use BaseRefHandler:**
- Large binary data (DataFrames, numpy arrays, images)
- Data that benefits from custom compression (e.g., numpy's compact format)
- Types that lose information in JSON conversion
- Shared data across snapshots (automatic deduplication via SHA-256)

**Key differences:**
- `BaseInlineHandler`: data stored in the JSON snapshot → `{"__special__": "Tag", "data": {...}}`
- `BaseRefHandler`: data stored in `refs/` → `{"__special__": "Tag", "__digest__": "abc123..."}`
- `BaseRefHandler` provides automatic deduplication and smaller JSON snapshots

### Using References (use_refs)

For objects with large nested data structures that should be pickled instead of JSON-serialized:

```python
class MyDataObject:
    def __init__(self, metadata, large_dataframe):
        self.metadata = metadata
        self.large_dataframe = large_dataframe  # pandas DataFrame, for example

    @staticmethod
    def use_refs():
        """Return set of field names to pickle instead of JSON-serialize."""
        return {"large_dataframe"}
```
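
A hedged usage sketch, continuing the Quick Start engine (the `runs` entry name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
obj = MyDataObject(metadata={"name": "run_42"}, large_dataframe=df)

# large_dataframe is pickled into the shared refs/ directory and
# replaced in the JSON snapshot by a {"__ref__": "<digest>"} tag;
# metadata stays inline in the JSON.
engine.save("tenant_1", "user_1", "runs", obj)
```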

**When to use use_refs():**
- Quick solution for large nested objects without writing a custom handler
- Works with any picklable object
- Per-object control (some fields in JSON, others pickled)

**use_refs() vs BaseRefHandler:**
- `use_refs()`: uses pickle (via `PickleRefHelper`); simple but less optimized
- `BaseRefHandler`: custom binary format (e.g., numpy); optimized but requires handler code
- Both store in `refs/` and get automatic SHA-256 deduplication
- `use_refs()` generates `{"__ref__": "digest"}` tags
- `BaseRefHandler` generates `{"__special__": "Tag", "__digest__": "digest"}` tags

**Trade-offs:**
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary format)
- ❌ Python version compatibility concerns with pickle (`use_refs()` only)

## Testing Notes

- Test fixtures use `DB_ENGINE_ROOT = "TestDBEngineRoot"` for isolation
- Tests clean up temp directories using `shutil.rmtree()` in fixtures
- Test classes like `DummyObj`, `DummyObjWithRef`, and `DummyObjWithKey` demonstrate usage patterns
- Thread safety is built in via RLock but not explicitly tested
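
A sketch of an isolated-engine fixture in this style (the actual fixtures in `tests/` may differ):

```python
import shutil

import pytest

from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"


@pytest.fixture
def engine():
    """Provide an isolated engine and clean up its directory afterwards."""
    engine = DbEngine(root=DB_ENGINE_ROOT)
    engine.init("tenant_1")
    yield engine
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)
```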

## Key Design Decisions

- **Immutability**: Snapshots are never modified after creation (git-style)
- **Content Addressing**: Identical objects are stored only once (deduplication via SHA-256)
- **Change Detection**: `put()` and `put_many()` skip saving if data is unchanged
- **Thread Safety**: All DbEngine operations are protected by an RLock
- **No Dependencies**: The core engine has zero runtime dependencies (pytest only for dev)

## Development Workflow and Guidelines

### Development Process

**Code must always be testable**. Before writing any code:

1. **Explain available options first** - Present different approaches to solve the problem
2. **Wait for validation** - Ensure mutual understanding of requirements before implementation
3. **No code without approval** - Only proceed after explicit validation

### Collaboration Style

**Ask questions to clarify understanding or suggest alternative approaches:**
- Ask questions **one at a time**
- Wait for a complete answer before asking the next question
- Indicate progress ("Question 1/5") if multiple questions are needed
- Never assume - always clarify ambiguities

### Communication

- **Conversations**: French or English
- **Code, documentation, comments**: English only

### Code Standards

**Follow PEP 8** conventions strictly:
- Variable and function names: `snake_case`
- Explicit, descriptive naming
- **No emojis in code**

**Documentation**:
- Use Google or NumPy docstring format
- Document all public functions and classes
- Include type hints where applicable

### Dependency Management

**When introducing new dependencies:**
- List all external dependencies explicitly
- Propose alternatives using the Python standard library when possible
- Explain why each dependency is needed

### Unit Testing with pytest

**Test naming patterns:**
- Passing tests: `test_i_can_xxx` - tests that should succeed
- Failing tests: `test_i_cannot_xxx` - edge cases that should raise errors/exceptions

**Test structure:**
- Use **functions**, not classes (unless inheritance is required)
- Before writing tests, **list all planned tests with explanations**
- Wait for validation before implementing tests

**Example:**
```python
import pytest

# DbException import path assumed; adjust if it lives elsewhere.
from dbengine.dbengine import DbEngine, DbException


def test_i_can_save_and_load_object():
    """Test that an object can be saved and loaded successfully."""
    engine = DbEngine(root="test_db")
    engine.init("tenant_1")
    digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
    assert digest is not None


def test_i_cannot_save_with_empty_tenant_id():
    """Test that saving with an empty tenant_id raises DbException."""
    engine = DbEngine(root="test_db")
    with pytest.raises(DbException):
        engine.save("", "user_1", "entry_1", {"key": "value"})
```

### File Management

**Always specify the full file path** when adding or modifying files:
```
✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py
```

### Error Handling

**When errors occur:**
1. **Explain the problem clearly first**
2. **Do not propose a fix immediately**
3. **Wait for validation** that the diagnosis is correct
4. Only then propose solutions