# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Available Personas
This project uses specialized personas for different types of work. Use these commands to switch modes:
- **`/developer`** - Full development mode with validation workflow (options-first, wait for approval before coding)
- **`/unit-tester`** - Specialized mode for writing comprehensive unit tests for existing code
- **`/technical-writer`** - User documentation writing mode (README, guides, tutorials)
- **`/reset`** - Return to default Claude Code mode
Each persona has specific rules and workflows defined in `.claude/` directory. See the respective files for detailed guidelines.
## Project Overview
MyDbEngine is a lightweight, git-inspired versioned database engine for Python. It maintains a complete history of all data modifications using immutable snapshots with SHA-256 content addressing. The project supports multi-tenant storage with thread-safe operations.
### Quick Start Example
```python
from dbengine.dbengine import DbEngine
# Initialize engine
engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")
# Pattern 1: Snapshot-based (complete state saves)
engine.save("tenant_1", "user_1", "config", {"theme": "dark", "lang": "en"})
data = engine.load("tenant_1", "config")
# Pattern 2: Record-based (incremental updates)
engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
engine.put("tenant_1", "user_1", "users", "jane", {"name": "Jane", "age": 25})
all_users = engine.get("tenant_1", "users") # Returns list of all users
```
## Development Commands
### Testing
```bash
# Run all tests
pytest
# Run specific test file
pytest tests/test_dbengine.py
pytest tests/test_serializer.py
# Run single test function
pytest tests/test_dbengine.py::test_i_can_save_and_load
```
### Building and Packaging
```bash
# Build package
python -m build
# Clean build artifacts
make clean
# Clean package artifacts only
make clean-package
```
### Installation
```bash
# Install in development mode with test dependencies
pip install -e .[dev]
```
## Architecture
### Core Components
**DbEngine** (`src/dbengine/dbengine.py`)
- Main database engine class using RLock for thread safety
- Manages tenant-specific storage in `.mytools_db/{tenant_id}/` structure
- Tracks latest versions via `head` file (JSON mapping entry names to digests)
- Stores objects in content-addressable format: `objects/{digest_prefix}/{full_digest}`
- Shared `refs/` directory for cross-tenant pickle-based references
**Serializer** (`src/dbengine/serializer.py`)
- Converts Python objects to/from JSON-compatible dictionaries
- Handles circular references using object ID tracking
- Supports custom serialization via handlers (see handlers.py)
- Special tags: `__object__`, `__id__`, `__tuple__`, `__set__`, `__ref__`, `__enum__`
- Objects can define `use_refs()` method to specify fields that should be pickled instead of JSON-serialized
**Handlers** (`src/dbengine/handlers.py`)
- Extensible handler system for custom type serialization
- BaseHandler interface: `is_eligible_for()`, `tag()`, `serialize()`, `deserialize()`
- Currently implements DateHandler for datetime.date objects
- Use `handlers.register_handler()` to add custom handlers
**Utils** (`src/dbengine/utils.py`)
- Type checking utilities: `is_primitive()`, `is_dictionary()`, `is_list()`, etc.
- Class introspection: `get_full_qualified_name()`, `importable_name()`, `get_class()`
- Stream digest computation with SHA-256
### Storage Architecture
```
.mytools_db/
├── {tenant_id}/
│   ├── head                     # JSON: {"entry_name": "latest_digest"}
│   └── objects/
│       └── {digest_prefix}/     # First 24 chars of digest
│           └── {full_digest}    # JSON snapshot with metadata
└── refs/                        # Shared pickled references
    └── {digest_prefix}/
        └── {full_digest}
```
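The object paths follow a git-style content-addressing scheme. Below is a minimal sketch of how such a path could be derived, assuming the digest is the SHA-256 of the serialized snapshot (the real logic lives in `dbengine.py`/`utils.py` and may differ in detail):
```python
import hashlib
import json
from pathlib import Path

def snapshot_path(root: str, tenant_id: str, snapshot: dict) -> Path:
    """Illustrative only: map a snapshot to its content-addressed location."""
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    # The first 24 characters of the digest form the directory prefix (see tree above)
    return Path(root) / tenant_id / "objects" / digest[:24] / digest

print(snapshot_path(".mytools_db", "tenant_1", {"theme": "dark"}))
```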
### Metadata System
Each snapshot includes automatic metadata fields (illustrated by the sketch below):
- `__parent__`: List containing the digest of the previous version (or `[None]` for the first snapshot)
- `__user_id__`: ID of the user who created the snapshot (was `__user__` in the TAG constant)
- `__date__`: Creation timestamp in `YYYYMMDD HH:MM:SS %z` format
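For illustration, a second version of a `config` entry could be stored roughly like this (values and exact layout are assumptions for the sketch; the actual format is defined in `dbengine.py`):
```python
# Hypothetical stored snapshot (second version of the "config" entry)
snapshot = {
    "theme": "dark",
    "lang": "en",
    "__parent__": ["<digest-of-previous-version>"],  # [None] if this were the first version
    "__user_id__": "user_1",
    "__date__": "20240101 12:00:00 +0000",
}
```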
### Two Usage Patterns
**Pattern 1: Snapshot-based (`save()`/`load()`)**
- Save complete object states
- Best for configuration objects or complete state snapshots
- Direct control over what gets saved
**Pattern 2: Record-based (`put()`/`put_many()`/`get()`)**
- Incremental updates to dictionary-like collections
- Automatically creates snapshots only when data changes
- Returns `True`/`False` indicating whether a snapshot was created (see the sketch after this section)
- Best for managing collections of items
**Important**: Do not mix patterns for the same entry - they expect different data structures.
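A minimal sketch of the change-detection behaviour of `put()` described above (the boolean return values follow the description in this file; verify against the actual implementation):
```python
from dbengine.dbengine import DbEngine

engine = DbEngine(root=".mytools_db")
engine.init("tenant_1")

# First write: data is new, so a snapshot is created
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
assert created is True

# Identical data again: no change detected, no new snapshot
created = engine.put("tenant_1", "user_1", "users", "john", {"name": "John", "age": 30})
assert created is False
```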
### Common Pitfalls
⚠️ **Mixing save() and put() on the same entry**
- `save()` expects to store complete snapshots (any object)
- `put()` expects dictionary-like structures with key-value pairs
- Using both on the same entry will cause data structure conflicts
⚠️ **Refs are shared across tenants**
- Objects stored via `use_refs()` go to shared `refs/` directory
- Not isolated per tenant - identical objects reused across all tenants
- Good for deduplication, but be aware of cross-tenant sharing
⚠️ **Parent digest is always a list**
- `__parent__` field is stored as `[digest]` or `[None]`
- Always access it as `data[TAG_PARENT][0]`, not `data[TAG_PARENT]` (see the sketch below)
- This allows for future support of multiple parents (merge scenarios)
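A minimal sketch of the parent-access pattern (the snapshot dict is illustrative; in the codebase the key is referenced via the `TAG_PARENT` constant):
```python
# Illustrative raw snapshot as stored on disk
snapshot = {"value": 42, "__parent__": ["abc123"]}  # or {"__parent__": [None]} for a first version

parent_digest = snapshot["__parent__"][0]  # unwrap the single-element list
if parent_digest is None:
    print("This is the first version of the entry")
```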
### Reference System
Objects can opt into pickle-based storage for specific fields:
1. Define `use_refs()` method returning set of field names
2. Serializer stores those fields in shared `refs/` directory
3. Reduces JSON snapshot size and enables cross-tenant deduplication
4. Example: `DummyObjWithRef` in test_dbengine.py
## Extension Points
### Custom Type Handlers
To serialize custom types that aren't handled by default serialization:
**1. Create a handler class:**
```python
from dbengine.handlers import BaseHandler, TAG_SPECIAL


class MyCustomHandler(BaseHandler):
    def is_eligible_for(self, obj):
        return isinstance(obj, MyCustomType)

    def tag(self):
        return "MyCustomType"

    def serialize(self, obj) -> dict:
        return {
            TAG_SPECIAL: self.tag(),
            "data": obj.to_dict()
        }

    def deserialize(self, data: dict) -> object:
        return MyCustomType.from_dict(data["data"])
```
**2. Register the handler:**
```python
from dbengine.handlers import handlers
handlers.register_handler(MyCustomHandler())
```
**When to use handlers:**
- Complex types that need custom serialization logic
- Types that can't be pickled reliably
- Types requiring validation during deserialization
- External library types (datetime.date example in handlers.py)
### Using References (use_refs)
For objects with large nested data structures that should be pickled instead of JSON-serialized:
```python
class MyDataObject:
    def __init__(self, metadata, large_dataframe):
        self.metadata = metadata
        self.large_dataframe = large_dataframe  # pandas DataFrame, for example

    @staticmethod
    def use_refs():
        """Return the set of field names to pickle instead of JSON-serializing."""
        return {"large_dataframe"}
```
**When to use refs:**
- Large data structures (DataFrames, numpy arrays)
- Objects that lose information in JSON conversion
- Data shared across multiple snapshots/tenants (deduplication benefit)
**Trade-offs:**
- ✅ Smaller JSON snapshots
- ✅ Cross-tenant deduplication
- ❌ Less human-readable (binary pickle format)
- ❌ Python version compatibility concerns with pickle
## Testing Notes
- Test fixtures use `DB_ENGINE_ROOT = "TestDBEngineRoot"` for isolation
- Tests clean up temporary directories using `shutil.rmtree()` in fixtures (sketched after this list)
- Test classes like `DummyObj`, `DummyObjWithRef`, `DummyObjWithKey` demonstrate usage patterns
- Thread safety is built-in via RLock but not explicitly tested
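A sketch of the fixture pattern described above (fixture name and setup are assumptions; the real fixtures live in the test files):
```python
import shutil

import pytest

from dbengine.dbengine import DbEngine

DB_ENGINE_ROOT = "TestDBEngineRoot"  # isolated root used by the test suite


@pytest.fixture
def engine():
    """Provide a fresh engine and remove its storage directory afterwards."""
    db = DbEngine(root=DB_ENGINE_ROOT)
    db.init("tenant_1")
    yield db
    shutil.rmtree(DB_ENGINE_ROOT, ignore_errors=True)
```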
## Key Design Decisions
- **Immutability**: Snapshots never modified after creation (git-style)
- **Content Addressing**: Identical objects stored only once (deduplication via SHA-256)
- **Change Detection**: `put()` and `put_many()` skip saving if data unchanged
- **Thread Safety**: All DbEngine operations protected by RLock
- **No Dependencies**: Core engine has zero runtime dependencies (pytest only for dev)
## Development Workflow and Guidelines
### Development Process
**Code must always be testable**. Before writing any code:
1. **Explain available options first** - Present different approaches to solve the problem
2. **Wait for validation** - Ensure mutual understanding of requirements before implementation
3. **No code without approval** - Only proceed after explicit validation
### Collaboration Style
**Ask questions to clarify understanding or suggest alternative approaches:**
- Ask questions **one at a time**
- Wait for complete answer before asking the next question
- Indicate progress: "Question 1/5" if multiple questions are needed
- Never assume - always clarify ambiguities
### Communication
**Conversations**: French or English
**Code, documentation, comments**: English only
### Code Standards
**Follow PEP 8** conventions strictly:
- Variable and function names: `snake_case`
- Explicit, descriptive naming
- **No emojis in code**
**Documentation**:
- Use Google or NumPy docstring format
- Document all public functions and classes
- Include type hints where applicable
### Dependency Management
**When introducing new dependencies:**
- List all external dependencies explicitly
- Propose alternatives using Python standard library when possible
- Explain why each dependency is needed
### Unit Testing with pytest
**Test naming patterns:**
- Passing tests: `test_i_can_xxx` - Tests that should succeed
- Failing tests: `test_i_cannot_xxx` - Edge cases that should raise errors/exceptions
**Test structure:**
- Use **functions**, not classes (unless inheritance is required)
- Before writing tests, **list all planned tests with explanations**
- Wait for validation before implementing tests
**Example:**
```python
import pytest

from dbengine.dbengine import DbEngine, DbException  # DbException's module is assumed here


def test_i_can_save_and_load_object():
    """Test that an object can be saved and loaded successfully."""
    engine = DbEngine(root="test_db")
    engine.init("tenant_1")
    digest = engine.save("tenant_1", "user_1", "entry_1", {"key": "value"})
    assert digest is not None


def test_i_cannot_save_with_empty_tenant_id():
    """Test that saving with an empty tenant_id raises DbException."""
    engine = DbEngine(root="test_db")
    with pytest.raises(DbException):
        engine.save("", "user_1", "entry_1", {"key": "value"})
```
### File Management
**Always specify the full file path** when adding or modifying files:
```
✅ Modifying: src/dbengine/dbengine.py
✅ Creating: tests/test_new_feature.py
```
### Error Handling
**When errors occur:**
1. **Explain the problem clearly first**
2. **Do not propose a fix immediately**
3. **Wait for validation** that the diagnosis is correct
4. Only then propose solutions