# Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

## Project Overview

### Context

- **Target vault size**: ~1900 files, 480 MB
- **Deployment**: 100% local (no external APIs)
- **Usage**: Command-line interface (CLI)
- **Language**: Python 3.12

### Phase 1 Scope (Current)

Semantic search system that:

- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata

**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.

## Features

### Indexing

- Manual, on-demand indexing
- Processes all `.md` files in the vault
- Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
  - Short sections (≤200 tokens): indexed as-is
  - Long sections: split with a sliding window (200 tokens, 30-token overlap)
- Robust error handling: continues indexing even if individual files fail

### Search Results

Each search result includes:

- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)

## Architecture

```
obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py            # Generate embeddings and vector index
│   ├── searcher.py           # Perform semantic search
│   └── cli.py                # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md
```

## Technical Choices

### Technology Stack

| Component     | Technology                                 | Rationale                                    |
|---------------|--------------------------------------------|----------------------------------------------|
| Embeddings    | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80MB), good performance |
| Vector Store  | ChromaDB                                   | Simple, persistent, good Python integration  |
| CLI Framework | Typer                                      | Modern, type-safe, excellent UX              |
| Testing       | pytest                                     | Standard, powerful, good ecosystem           |

### Design Decisions

1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
2. **Local-only**: All processing happens on the local machine; no data is sent to external services
3. **Manual indexing**: The user triggers re-indexing when needed (incremental updates deferred to future phases)
4. **Hybrid chunking**: Preserves small sections intact while handling large sections with a sliding window
5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
7. **Extensible design**: Architecture prepared for future LLM integration

### Chunking Strategy Details

The indexer uses a hybrid approach:

- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- **Long sections** (>200 tokens): Split using a sliding window with:
  - Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
  - Overlap: 30 tokens (~15% overlap to preserve context at boundaries)
  - Token counting: Uses sentence-transformers' native tokenizer for accuracy

### Metadata Structure

Each chunk stored in ChromaDB includes:

```python
{
    "file_path": str,      # Relative path from vault root
    "section_title": str,  # Markdown section heading
    "line_start": int,     # Starting line number in file
    "line_end": int        # Ending line number in file
}
```

## Dependencies

### Required

```bash
sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb               # Vector database
typer                  # CLI framework
rich                   # Terminal formatting (Typer dependency)
```

### Development

```bash
pytest      # Testing framework
pytest-cov  # Test coverage
```

### Installation

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
```

## Usage (Planned)

```bash
# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
```

## Development Standards

### Code Style

- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English

### Testing Strategy

- Unit tests with pytest
- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
- Functions over classes unless inheritance is required
- Test plan validation before implementation

### File Management

- All file modifications documented with full file path
- Clear separation of concerns across modules

## Project Status

- [x] Requirements gathering
- [x] Architecture design
- [x] Chunking strategy validation
- [ ] Implementation
  - [x] `markdown_parser.py`
  - [x] `indexer.py`
  - [x] `searcher.py`
  - [x] `cli.py`
- [ ] Unit tests
  - [x] `test_markdown_parser.py`
  - [x] `test_indexer.py` (tests written, debugging in progress)
  - [x] `test_searcher.py`
  - [ ] `test_cli.py`
- [ ] Integration testing
- [ ] Documentation
- [ ] Phase 2: LLM integration

## License

[To be determined]

## Contributing

[To be determined]
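
## Appendix: Chunking Sketch

The hybrid chunking rule described under "Chunking Strategy Details" (sections of up to 200 tokens kept whole, longer sections split into 200-token windows with a 30-token overlap) can be illustrated with a minimal sketch. This is not the code in `indexer.py`: the function names `chunk_tokens` and `chunk_section` are hypothetical, and whitespace splitting is used as a stand-in for the sentence-transformers tokenizer so the example runs without any dependency.

```python
from typing import List

MAX_TOKENS = 200  # maximum chunk size stated in this README
OVERLAP = 30      # overlap between consecutive windows stated in this README


def chunk_tokens(
    tokens: List[str],
    max_tokens: int = MAX_TOKENS,
    overlap: int = OVERLAP,
) -> List[List[str]]:
    """Return the section as one chunk if short, else overlapping windows.

    Hypothetical helper for illustration only.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if not tokens:
        return []
    # Short section: keep it intact to preserve semantic coherence.
    if len(tokens) <= max_tokens:
        return [tokens]

    # Long section: slide a window, advancing by (max_tokens - overlap).
    step = max_tokens - overlap
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the section
        start += step
    return chunks


def chunk_section(text: str) -> List[str]:
    """Chunk a section's text.

    ASSUMPTION: whitespace splitting stands in for the model's tokenizer,
    so token counts here differ from the real tokenizer's counts.
    """
    return [" ".join(chunk) for chunk in chunk_tokens(text.split())]
```

With these defaults each window starts 170 tokens after the previous one, so the last 30 tokens of a chunk are repeated at the start of the next. A real implementation would count tokens with the embedding model's own tokenizer, as the design decisions above state.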