Initial commit

2025-12-12 11:31:44 +01:00
commit d4925f7969
21 changed files with 2957 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,191 @@
+# Obsidian RAG Backend
+
+A local, semantic search backend for Obsidian markdown files.
+
+## Project Overview
+
+### Context
+
+- **Target vault size**: ~1900 files, 480 MB
+- **Deployment**: 100% local (no external APIs)
+- **Usage**: Command-line interface (CLI)
+- **Language**: Python 3.12
+
+### Phase 1 Scope (Current)
+
+Semantic search system that:
+
+- Indexes markdown files from an Obsidian vault
+- Performs semantic search using local embeddings
+- Returns relevant results with metadata
+
+**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.
+
+## Features
+
+### Indexing
+
+- Manual, on-demand indexing
+- Processes all `.md` files in the vault
+- Extracts document structure (sections, line numbers)
+- Hybrid chunking strategy:
+  - Short sections (≤200 tokens): indexed as-is
+  - Long sections (>200 tokens): split with a sliding window (200-token chunks, 30-token overlap)
+- Robust error handling: continues indexing even if individual files fail
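
The structure-extraction step listed above (sections with line numbers) could look roughly like the sketch below. It is not the code in `markdown_parser.py`: the `Section` dataclass, the `split_sections` name, the 1-based line numbering, and the choice to ignore headings inside fenced code blocks are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" through "###### Title"). Obsidian tags such as
# "#tag" have no space after the hashes, so they are not matched.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker (hypothetical helper constant)


@dataclass
class Section:
    title: str       # heading text, "" for text before the first heading
    line_start: int  # 1-based line of the heading (1 for the preamble)
    line_end: int    # 1-based, inclusive
    text: str        # section body without the heading line


def split_sections(markdown: str) -> list[Section]:
    """Split a markdown document into heading-delimited sections."""
    lines = markdown.splitlines()
    sections: list[Section] = []
    title, start, body = "", 1, []
    in_fence = False
    for number, line in enumerate(lines, start=1):
        # Track fenced code blocks so "# comment" lines are not read as headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or body:
                sections.append(Section(title, start, number - 1, "\n".join(body).strip()))
            title, start, body = match.group(2), number, []
        else:
            body.append(line)
    if title or body:
        sections.append(Section(title, start, len(lines), "\n".join(body).strip()))
    return sections
```

Each returned `Section` carries the line range that later ends up in the chunk metadata, which is why the sketch records it at parse time.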
+
+### Search Results
+
+Each search result includes:
+
+- File path (relative to vault root)
+- Similarity score
+- Relevant text excerpt
+- Location in file (section and line number)
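
One way to model a result record with the four fields above is a small dataclass plus a selection helper that mirrors the `--limit` and `--min-score` options from the usage section. The names `SearchResult` and `select_results` are hypothetical, and the sketch assumes that a higher score means a closer match, which the README does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    file_path: str      # relative to the vault root
    score: float        # similarity score (assumed: higher is better)
    excerpt: str        # relevant text excerpt
    section_title: str  # location: section heading
    line_start: int     # location: line number in the file


def select_results(
    results: list[SearchResult], limit: int = 5, min_score: float = 0.0
) -> list[SearchResult]:
    """Drop results below min_score, then return the best `limit` by score."""
    kept = [r for r in results if r.score >= min_score]
    kept.sort(key=lambda r: r.score, reverse=True)
    return kept[:limit]
```

Filtering before truncating means `limit` applies only to results that already pass the score threshold.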
+
+## Architecture
+
+```
+obsidian_rag/
+├── obsidian_rag/
+│   ├── __init__.py
+│   ├── markdown_parser.py    # Parse .md files, extract structure
+│   ├── indexer.py             # Generate embeddings and vector index
+│   ├── searcher.py            # Perform semantic search
+│   └── cli.py                 # Typer CLI interface
+├── tests/
+│   ├── __init__.py
+│   ├── test_markdown_parser.py
+│   ├── test_indexer.py
+│   └── test_searcher.py
+├── pyproject.toml
+└── README.md
+```
+
+## Technical Choices
+
+### Technology Stack
+
+| Component     | Technology                                 | Rationale                                    |
+|---------------|--------------------------------------------|----------------------------------------------|
+| Embeddings    | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80MB), good performance |
+| Vector Store  | ChromaDB                                   | Simple, persistent, good Python integration  |
+| CLI Framework | Typer                                      | Modern, type-safe, excellent UX              |
+| Testing       | pytest                                     | Standard, powerful, good ecosystem           |
+
+### Design Decisions
+
+1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
+2. **Local-only**: All processing happens on the local machine; no data is sent to external services
+3. **Manual indexing**: The user triggers re-indexing when needed (incremental updates are deferred to future phases)
+4. **Hybrid chunking**: Preserves small sections intact while splitting large sections with a sliding window
+5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
+6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
+7. **Extensible design**: Architecture prepared for future LLM integration
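
Design decision 6 (keep indexing when a file fails, then report the errors) can be sketched as a loop that catches per-file exceptions. This is an illustration only: `index_vault` and the `index_file` callback are hypothetical names, and the real `indexer.py` may structure this differently.

```python
from pathlib import Path
from typing import Callable


def index_vault(
    vault: Path, index_file: Callable[[Path], int]
) -> tuple[int, dict[str, str]]:
    """Index every .md file under `vault`, collecting errors instead of stopping.

    `index_file` is assumed to return the number of chunks it stored.
    Returns (total chunks indexed, {relative path: error description}).
    """
    indexed_chunks = 0
    errors: dict[str, str] = {}
    for path in sorted(vault.rglob("*.md")):
        try:
            indexed_chunks += index_file(path)
        except Exception as exc:  # one bad file must not abort the whole run
            relative = path.relative_to(vault).as_posix()
            errors[relative] = f"{type(exc).__name__}: {exc}"
    return indexed_chunks, errors
```

Returning the error map rather than printing it leaves the reporting format to the caller, for example the CLI layer.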
+
+### Chunking Strategy Details
+
+The indexer uses a hybrid approach:
+
+- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
+- **Long sections** (>200 tokens): Split using a sliding window with:
+  - Maximum chunk size: 200 tokens (a safe margin under the model's 256-token limit)
+  - Overlap: 30 tokens (15% of the chunk size, to preserve context at boundaries)
+  - Token counting: Uses sentence-transformers' native tokenizer for accuracy
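
The hybrid rule above can be sketched on a plain token sequence. The function name `chunk_tokens` is hypothetical, and the sketch works on any list of tokens; in the project the tokens would come from the embedding model's tokenizer, which is not shown here.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(
    tokens: Sequence[T], max_tokens: int = 200, overlap: int = 30
) -> list[list[T]]:
    """Return one chunk for short inputs, overlapping windows for long ones."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    if not tokens:
        return []
    if len(tokens) <= max_tokens:
        return [list(tokens)]  # short section: indexed as-is
    step = max_tokens - overlap  # each window starts 170 tokens after the last
    chunks: list[list[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the section
    return chunks
```

With the default settings, a 500-token section yields windows starting at tokens 0, 170, and 340, so the last 30 tokens of each window reappear at the start of the next.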
+
+### Metadata Structure
+
+Each chunk stored in ChromaDB includes:
+
+```python
+{
+    "file_path": str,        # Relative path from vault root
+    "section_title": str,    # Markdown section heading
+    "line_start": int,       # Starting line number in file
+    "line_end": int          # Ending line number in file
+}
+```
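
A typed helper is one way to keep every chunk's metadata in this shape. The sketch below is illustrative: `ChunkMetadata` and `build_metadata` are hypothetical names, and the 1-based, inclusive line numbering is an assumption the README does not confirm.

```python
from pathlib import Path
from typing import TypedDict


class ChunkMetadata(TypedDict):
    file_path: str
    section_title: str
    line_start: int
    line_end: int


def build_metadata(
    vault_root: Path, file_path: Path, section_title: str, line_start: int, line_end: int
) -> ChunkMetadata:
    """Build the per-chunk metadata dict with a vault-relative path."""
    if line_start < 1 or line_end < line_start:
        raise ValueError("expected 1 <= line_start <= line_end")
    # relative_to raises ValueError if the file is not inside the vault.
    relative = Path(file_path).relative_to(vault_root).as_posix()
    return ChunkMetadata(
        file_path=relative,
        section_title=section_title,
        line_start=line_start,
        line_end=line_end,
    )
```

Storing the path in POSIX form keeps the index portable if the vault is later opened from a different operating system.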
+
+## Dependencies
+
+### Required
+
+```text
+sentence-transformers  # Local embeddings model (includes tokenizer)
+chromadb              # Vector database
+typer                 # CLI framework
+rich                  # Terminal formatting (Typer dependency)
+```
+
+### Development
+
+```text
+pytest                # Testing framework
+pytest-cov           # Test coverage
+```
+
+### Installation
+
+```bash
+pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
+```
+
+## Usage (Planned)
+
+```bash
+# Index vault
+obsidian-rag index /path/to/vault
+
+# Search
+obsidian-rag search "your query here"
+
+# Search with options
+obsidian-rag search "query" --limit 10 --min-score 0.5
+```
+
+## Development Standards
+
+### Code Style
+
+- Follow PEP 8 conventions
+- Use snake_case for variables and functions
+- Docstrings in Google or NumPy format
+- All code, comments, and documentation in English
+
+### Testing Strategy
+
+- Unit tests with pytest
+- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
+- Prefer test functions over test classes unless inheritance is required
+- Validate the test plan before implementation
+
+### File Management
+
+- All file modifications documented with full file path
+- Clear separation of concerns across modules
+
+## Project Status
+
+- [x] Requirements gathering
+- [x] Architecture design
+- [x] Chunking strategy validation
+- [ ] Implementation
+    - [x] `markdown_parser.py`
+    - [x] `indexer.py`
+    - [x] `searcher.py`
+    - [x] `cli.py`
+- [ ] Unit tests
+    - [x] `test_markdown_parser.py`
+    - [x] `test_indexer.py` (tests written, debugging in progress)
+    - [x] `test_searcher.py`
+    - [ ] `test_cli.py`
+- [ ] Integration testing
+- [ ] Documentation
+- [ ] Phase 2: LLM integration
+
+## License
+
+[To be determined]
+
+## Contributing
+
+[To be determined]