5.6 KiB
5.6 KiB
Obsidian RAG Backend
A local, semantic search backend for Obsidian markdown files.
Project Overview
Context
- Target vault size: ~1900 files, 480 MB
- Deployment: 100% local (no external APIs)
- Usage: Command-line interface (CLI)
- Language: Python 3.12
Phase 1 Scope (Current)
Semantic search system that:
- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata
Phase 2 (Future): Add LLM integration for answer generation using Phase 1 search results.
Features
Indexation
- Manual, on-demand indexing
- Processes all
.mdfiles in vault - Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
- Short sections (≤200 tokens): indexed as-is
- Long sections: split with sliding window (200 tokens, 30 tokens overlap)
- Robust error handling: continues indexing even if individual files fail
Search Results
Each search result includes:
- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)
Architecture
obsidian_rag/
├── obsidian_rag/
│ ├── __init__.py
│ ├── markdown_parser.py # Parse .md files, extract structure
│ ├── indexer.py # Generate embeddings and vector index
│ ├── searcher.py # Perform semantic search
│ └── cli.py # Typer CLI interface
├── tests/
│ ├── __init__.py
│ ├── test_markdown_parser.py
│ ├── test_indexer.py
│ └── test_searcher.py
├── pyproject.toml
└── README.md
Technical Choices
Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
Local, lightweight (~80MB), good performance |
| Vector Store | ChromaDB | Simple, persistent, good Python integration |
| CLI Framework | Typer | Modern, type-safe, excellent UX |
| Testing | pytest | Standard, powerful, good ecosystem |
Design Decisions
- Modular architecture: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
- Local-only: All processing happens on local machine, no data sent to external services
- Manual indexing: User triggers re-indexing when needed (incremental updates deferred to future phases)
- Hybrid chunking: Preserves small sections intact while handling large sections with sliding window
- Token-based chunking: Uses model's tokenizer for precise chunk sizing (max 200 tokens, 30 tokens overlap)
- Robust error handling: Indexing continues even if individual files fail, with detailed error reporting
- Extensible design: Architecture prepared for future LLM integration
Chunking Strategy Details
The indexer uses a hybrid approach:
- Short sections (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- Long sections (>200 tokens): Split using sliding window with:
- Maximum chunk size: 200 tokens (safe margin under model's 256 token limit)
- Overlap: 30 tokens (~15% overlap to preserve context at boundaries)
- Token counting: Uses sentence-transformers' native tokenizer for accuracy
Metadata Structure
Each chunk stored in ChromaDB includes:
{
"file_path": str, # Relative path from vault root
"section_title": str, # Markdown section heading
"line_start": int, # Starting line number in file
"line_end": int # Ending line number in file
}
Dependencies
Required
sentence-transformers # Local embeddings model (includes tokenizer)
chromadb # Vector database
typer # CLI framework
rich # Terminal formatting (Typer dependency)
Development
pytest # Testing framework
pytest-cov # Test coverage
Installation
pip install sentence-transformers chromadb typer[all] pytest pytest-cov
Usage (Planned)
# Index vault
obsidian-rag index /path/to/vault
# Search
obsidian-rag search "your query here"
# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
Development Standards
Code Style
- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English
Testing Strategy
- Unit tests with pytest
- Test function naming:
test_i_can_xxx(passing tests) ortest_i_cannot_xxx(error cases) - Functions over classes unless inheritance required
- Test plan validation before implementation
File Management
- All file modifications documented with full file path
- Clear separation of concerns across modules
Project Status
- Requirements gathering
- Architecture design
- Chunking strategy validation
- Implementation
markdown_parser.pyindexer.pysearcher.pycli.py
- Unit tests
test_markdown_parser.pytest_indexer.py(tests written, debugging in progress)test_searcher.pytest_cli.py
- Integration testing
- Documentation
- Phase 2: LLM integration
License
[To be determined]
Contributing
[To be determined]