# Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

## Project Overview

### Context

- **Target vault size**: ~1900 files, 480 MB
- **Deployment**: 100% local (no external APIs)
- **Usage**: Command-line interface (CLI)
- **Language**: Python 3.12

### Phase 1 Scope (Current)

Semantic search system that:

- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata

**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.

## Features

### Indexing

- Manual, on-demand indexing
- Processes all `.md` files in the vault
- Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
  - Short sections (≤200 tokens): indexed as-is
  - Long sections (>200 tokens): split with a sliding window (200-token windows, 30-token overlap)
- Robust error handling: continues indexing even if individual files fail
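
The structure extraction step can be sketched as follows. This is a minimal illustration, not the project's `markdown_parser.py`: the `Section` dataclass and `parse_sections` function are hypothetical names, only ATX headings (`#` to `######`) are recognized, and fenced code blocks are tracked so that a `#` inside code is not mistaken for a heading.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


@dataclass
class Section:
    title: str        # heading text, or "" for content before the first heading
    line_start: int   # 1-based line number where the section begins
    line_end: int     # 1-based line number of the section's last line
    text: str         # section body, heading line excluded


def parse_sections(markdown: str) -> list[Section]:
    """Split markdown text into heading-delimited sections (hypothetical helper)."""
    sections: list[Section] = []
    title, start, body = "", 1, []
    in_fence = False

    def flush(end: int) -> None:
        # Drop an empty preamble; keep every section that has a heading.
        if title or any(line.strip() for line in body):
            sections.append(Section(title, start, end, "\n".join(body).strip()))

    lines = markdown.splitlines()
    for number, line in enumerate(lines, start=1):
        if FENCE_RE.match(line):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(number - 1)  # close the previous section on the line before
            title, start, body = match.group(2), number, []
        else:
            body.append(line)
    flush(len(lines))
    return sections
```

Each returned `Section` carries the line range that later ends up in the chunk metadata.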

### Search Results

Each search result includes:

- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)
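
One possible shape for a result, as a sketch only: the `SearchResult` dataclass, its field names, and the `format_result` layout are assumptions made for illustration and are not taken from `searcher.py` or `cli.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    file_path: str       # relative to the vault root
    score: float         # similarity score, higher is more relevant
    excerpt: str         # text of the matching chunk
    section_title: str   # heading of the section containing the chunk
    line_start: int      # first line of the chunk in the file


def format_result(result: SearchResult, width: int = 80) -> str:
    """Render one result as a header line plus a truncated excerpt."""
    excerpt = " ".join(result.excerpt.split())  # collapse newlines and runs of spaces
    if len(excerpt) > width:
        excerpt = excerpt[: width - 3] + "..."
    header = f"{result.score:.2f}  {result.file_path}:{result.line_start}  [{result.section_title}]"
    return f"{header}\n    {excerpt}"
```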

## Architecture

```
obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py             # Generate embeddings and vector index
│   ├── searcher.py            # Perform semantic search
│   └── cli.py                 # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md
```

## Technical Choices

### Technology Stack

| Component     | Technology                                 | Rationale                                    |
|---------------|--------------------------------------------|----------------------------------------------|
| Embeddings    | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80MB), good performance |
| Vector Store  | ChromaDB                                   | Simple, persistent, good Python integration  |
| CLI Framework | Typer                                      | Modern, type-safe, excellent UX              |
| Testing       | pytest                                     | Standard, powerful, good ecosystem           |

### Design Decisions

1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
2. **Local-only**: All processing happens on the local machine; no data is sent to external services
3. **Manual indexing**: The user triggers re-indexing when needed (incremental updates are deferred to future phases)
4. **Hybrid chunking**: Preserves small sections intact and splits large sections with a sliding window
5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
7. **Extensible design**: Architecture prepared for future LLM integration
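
Decision 6 can be illustrated with a short sketch. The function name `index_vault` and the `index_file` callback (assumed to index one file and return its chunk count) are hypothetical; the point is only the pattern of collecting per-file failures instead of aborting the run.

```python
from pathlib import Path
from typing import Callable


def index_vault(vault: Path, index_file: Callable[[Path], int]) -> tuple[int, dict[str, str]]:
    """Index every .md file under vault; collect failures instead of aborting."""
    total_chunks = 0
    errors: dict[str, str] = {}
    for path in sorted(vault.rglob("*.md")):
        try:
            total_chunks += index_file(path)
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            errors[str(path.relative_to(vault))] = f"{type(exc).__name__}: {exc}"
    return total_chunks, errors
```

The returned `errors` mapping (relative path to error message) is what a CLI could print as the detailed error report once indexing finishes.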

### Chunking Strategy Details

The indexer uses a hybrid approach:

- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- **Long sections** (>200 tokens): Split using a sliding window with:
  - Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
  - Overlap: 30 tokens (15% of the window, to preserve context at boundaries)
  - Token counting: Uses sentence-transformers' native tokenizer for accuracy
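
A minimal sketch of the hybrid split described above. It works on a list of tokens that has already been produced; in the indexer these would come from the embedding model's tokenizer. The function name `chunk_tokens` and its parameters are illustrative assumptions, not the project's API.

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 200, overlap: int = 30) -> list[list[str]]:
    """Return token windows: one chunk if short, overlapping windows otherwise."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if len(tokens) <= max_tokens:
        return [tokens] if tokens else []  # short section: indexed as-is
    step = max_tokens - overlap  # each window advances by 170 tokens with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the section
    return chunks
```

For a quick experiment without loading the model, `chunk_tokens(section_text.split())` uses a whitespace split as a rough stand-in for real tokenization. With the defaults, a 450-token section yields three chunks starting at tokens 0, 170, and 340.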

### Metadata Structure

Each chunk stored in ChromaDB includes:

```python
{
    "file_path": str,        # Relative path from vault root
    "section_title": str,    # Markdown section heading
    "line_start": int,       # Starting line number in file
    "line_end": int          # Ending line number in file
}
```
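
As a sketch of how chunks with this metadata might be prepared for storage, the helper below builds the three parallel lists that ChromaDB's `collection.add` accepts as its `ids`, `documents`, and `metadatas` keyword arguments. The helper name `build_add_payload`, the `ChunkMetadata` type, and the ID scheme are assumptions for illustration; the project may generate IDs differently.

```python
from typing import TypedDict


class ChunkMetadata(TypedDict):
    file_path: str
    section_title: str
    line_start: int
    line_end: int


def build_add_payload(chunks: list[tuple[str, ChunkMetadata]]) -> dict[str, list]:
    """Turn (text, metadata) pairs into keyword arguments for collection.add."""
    ids, documents, metadatas = [], [], []
    for position, (text, meta) in enumerate(chunks):
        if meta["line_start"] > meta["line_end"]:
            raise ValueError(f"invalid line range in {meta['file_path']}")
        # Hypothetical ID scheme: path, line range, and position in the batch,
        # so overlapping windows from the same section still get distinct IDs.
        ids.append(f"{meta['file_path']}:{meta['line_start']}-{meta['line_end']}:{position}")
        documents.append(text)
        metadatas.append(dict(meta))
    return {"ids": ids, "documents": documents, "metadatas": metadatas}
```

With a ChromaDB collection at hand, the result could be passed as `collection.add(**payload)`.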

## Dependencies

### Required

```bash
sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb              # Vector database
typer                 # CLI framework
rich                  # Terminal formatting (Typer dependency)
```

### Development

```bash
pytest                # Testing framework
pytest-cov           # Test coverage
```

### Installation

```bash
pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
```

## Usage (Planned)

```bash
# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
```
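
The `--limit` and `--min-score` options could map onto a small post-processing step like the one below. This is a sketch under assumptions: the function name `select_results`, the `(score, file_path)` tuple shape, and the default values are hypothetical, since the planned CLI does not specify them.

```python
def select_results(
    scored: list[tuple[float, str]],
    limit: int = 5,
    min_score: float = 0.0,
) -> list[tuple[float, str]]:
    """Keep results scoring at least min_score, best first, at most limit of them."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    kept = [item for item in scored if item[0] >= min_score]
    kept.sort(key=lambda item: item[0], reverse=True)  # highest similarity first
    return kept[:limit]
```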

## Development Standards

### Code Style

- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English

### Testing Strategy

- Unit tests with pytest
- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
- Prefer functions over classes unless inheritance is required
- Test plan validation before implementation
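
The naming convention can be illustrated with a toy example. `split_title` is a hypothetical function written only for this illustration and is not part of the project. The error-case test uses plain `try`/`except` so the snippet runs without extra imports; under pytest, `pytest.raises(ValueError)` would be the usual way to write it.

```python
def split_title(heading: str) -> tuple[int, str]:
    """Toy function under test (hypothetical): return (level, title) of an ATX heading."""
    hashes, _, title = heading.partition(" ")
    if not hashes or set(hashes) != {"#"} or len(hashes) > 6 or not title.strip():
        raise ValueError(f"not a markdown heading: {heading!r}")
    return len(hashes), title.strip()


def test_i_can_split_a_level_two_heading():
    # Passing case: named test_i_can_xxx
    assert split_title("## Project Overview") == (2, "Project Overview")


def test_i_cannot_split_a_line_without_hashes():
    # Error case: named test_i_cannot_xxx
    try:
        split_title("Project Overview")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```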

### File Management

- All file modifications documented with full file path
- Clear separation of concerns across modules

## Project Status

- [x] Requirements gathering
- [x] Architecture design
- [x] Chunking strategy validation
- [ ] Implementation
    - [x] `markdown_parser.py`
    - [x] `indexer.py`
    - [x] `searcher.py`
    - [x] `cli.py`
- [ ] Unit tests
    - [x] `test_markdown_parser.py`
    - [x] `test_indexer.py` (tests written, debugging in progress)
    - [x] `test_searcher.py`
    - [ ] `test_cli.py`
- [ ] Integration testing
- [ ] Documentation
- [ ] Phase 2: LLM integration

## License

[To be determined]

## Contributing

[To be determined]