# Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

## Project Overview

### Context

- **Target vault size**: ~1900 files, 480 MB
- **Deployment**: 100% local (no external APIs)
- **Usage**: Command-line interface (CLI)
- **Language**: Python 3.12

### Phase 1 Scope (Current)

A semantic search system that:

- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata

**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.

## Features

### Indexing

- Manual, on-demand indexing
- Processes all `.md` files in the vault
- Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
  - Short sections (≤200 tokens): indexed as-is
  - Long sections: split with a sliding window (200 tokens, 30-token overlap)
- Robust error handling: continues indexing even if individual files fail

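As an illustration of the structure-extraction step, here is a minimal sketch that splits a markdown document into heading-delimited sections and records 1-based line numbers. The names `Section` and `extract_sections` are hypothetical and not taken from `markdown_parser.py`; the real parser may handle more cases (front matter, Setext headings, and so on).

```python
import re
from dataclasses import dataclass

# ATX headings only: 1-6 '#' characters followed by whitespace and a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    title: str        # Heading text, or "" for text before the first heading
    line_start: int   # 1-based line number where the section begins
    line_end: int     # 1-based line number of the section's last line
    text: str         # Section body, excluding the heading line


def extract_sections(markdown: str) -> list[Section]:
    """Split markdown text into heading-delimited sections with line numbers."""
    lines = markdown.splitlines()
    sections: list[Section] = []
    title, start, body = "", 1, []

    def flush(end: int) -> None:
        # Skip an empty preamble, but keep headings that have no body.
        if title or any(line.strip() for line in body):
            sections.append(Section(title, start, end, "\n".join(body).strip()))

    in_fence = False
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside fenced code are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(number - 1)
            title, start, body = match.group(2), number, []
        else:
            body.append(line)
    flush(len(lines))
    return sections
```

Because the heading pattern requires whitespace after the `#` characters, Obsidian tags such as `#project` are not mistaken for headings.
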
### Search Results

Each search result includes:

- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)

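One way to model such a result is a small immutable record plus a formatter for terminal output. This is only a sketch: `SearchResult`, `format_result`, and the output layout are assumptions for illustration, not the actual types in `searcher.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    file_path: str      # Path relative to the vault root
    score: float        # Similarity score (assumed: higher means more similar)
    excerpt: str        # Relevant text from the matching chunk
    section_title: str  # Heading of the section containing the match
    line_start: int     # First line of the matching chunk in the file


def format_result(result: SearchResult, max_excerpt: int = 80) -> str:
    """Render one result as a two-line, terminal-friendly summary."""
    excerpt = " ".join(result.excerpt.split())  # Collapse whitespace and newlines
    if len(excerpt) > max_excerpt:
        excerpt = excerpt[: max_excerpt - 3] + "..."
    location = f"{result.file_path}:{result.line_start}"
    return f"[{result.score:.2f}] {location} ({result.section_title})\n    {excerpt}"
```

The `path:line` form is chosen here because many editors and terminals can jump straight to that location.
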
## Architecture

```
obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py            # Generate embeddings and vector index
│   ├── searcher.py           # Perform semantic search
│   └── cli.py                # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md
```

## Technical Choices

### Technology Stack

| Component     | Technology                                 | Rationale                                     |
|---------------|--------------------------------------------|-----------------------------------------------|
| Embeddings    | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80 MB), good performance |
| Vector Store  | ChromaDB                                   | Simple, persistent, good Python integration   |
| CLI Framework | Typer                                      | Modern, type-safe, excellent UX               |
| Testing       | pytest                                     | Standard, powerful, good ecosystem            |

### Design Decisions

1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
2. **Local-only**: All processing happens on the local machine; no data is sent to external services
3. **Manual indexing**: The user triggers re-indexing when needed (incremental updates are deferred to future phases)
4. **Hybrid chunking**: Preserves small sections intact while handling large sections with a sliding window
5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
7. **Extensible design**: Architecture prepared for future LLM integration

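The error-handling decision (item 6) can be sketched as a loop that records failures per file instead of aborting. `IndexReport`, `index_vault`, and the `index_file` callback are hypothetical names used for illustration; the real `indexer.py` may report errors differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class IndexReport:
    indexed: list[Path] = field(default_factory=list)
    errors: dict[Path, str] = field(default_factory=dict)  # path -> error message


def index_vault(vault: Path, index_file: Callable[[Path], None]) -> IndexReport:
    """Index every .md file under `vault`, recording failures instead of stopping."""
    report = IndexReport()
    for path in sorted(vault.rglob("*.md")):
        try:
            index_file(path)
        except Exception as exc:  # Deliberately broad: one bad file must not abort the run
            report.errors[path] = f"{type(exc).__name__}: {exc}"
        else:
            report.indexed.append(path)
    return report
```

Returning a report rather than printing inside the loop keeps the indexing logic testable and leaves presentation to the CLI layer.
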
### Chunking Strategy Details

The indexer uses a hybrid approach:

- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- **Long sections** (>200 tokens): Split using a sliding window with:
  - Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
  - Overlap: 30 tokens (~15% overlap to preserve context at boundaries)
  - Token counting: Uses sentence-transformers' native tokenizer for accuracy

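The windowing logic can be sketched on a plain list of tokens. This is an illustration under stated assumptions: `chunk_tokens` is a hypothetical name, and the tokens here are arbitrary strings, whereas the indexer is described as counting tokens with the model's own tokenizer.

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 200, overlap: int = 30) -> list[list[str]]:
    """Split a token sequence into overlapping windows of at most `max_tokens`."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    if len(tokens) <= max_tokens:
        return [tokens] if tokens else []  # Short section: keep as a single chunk
    step = max_tokens - overlap  # With the defaults, each window starts 170 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # This window already reaches the end of the section
    return chunks
```

For a 450-token section with the default settings, the windows start at tokens 0, 170, and 340, so consecutive chunks share 30 tokens and the last chunk is shorter than the others.
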
### Metadata Structure

Each chunk stored in ChromaDB includes:

```python
{
    "file_path": str,      # Relative path from vault root
    "section_title": str,  # Markdown section heading
    "line_start": int,     # Starting line number in file
    "line_end": int        # Ending line number in file
}
```

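A typed version of this structure might look like the sketch below. ChromaDB expects a unique ID for each stored item, so the example also builds one; the ID scheme, `ChunkMetadata`, and `make_chunk_record` are assumptions made for illustration and are not taken from the project code.

```python
from typing import TypedDict


class ChunkMetadata(TypedDict):
    file_path: str      # Relative path from vault root
    section_title: str  # Markdown section heading
    line_start: int     # Starting line number in file
    line_end: int       # Ending line number in file


def make_chunk_record(file_path: str, section_title: str, line_start: int,
                      line_end: int, chunk_index: int) -> tuple[str, ChunkMetadata]:
    """Return a (chunk_id, metadata) pair for one chunk."""
    if line_end < line_start:
        raise ValueError("line_end must not precede line_start")
    # Hypothetical ID scheme: path, starting line, and position within the section.
    chunk_id = f"{file_path}:{line_start}:{chunk_index}"
    metadata: ChunkMetadata = {
        "file_path": file_path,
        "section_title": section_title,
        "line_start": line_start,
        "line_end": line_end,
    }
    return chunk_id, metadata
```

All four values are plain strings or integers, which keeps the metadata within the primitive types ChromaDB accepts.
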
## Dependencies

### Required

```bash
sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb               # Vector database
typer                  # CLI framework
rich                   # Terminal formatting (Typer dependency)
```

### Development

```bash
pytest      # Testing framework
pytest-cov  # Test coverage
```

### Installation

```bash
# Quote the extra so shells such as zsh do not expand the brackets as a glob
pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
```

## Usage (Planned)

```bash
# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
```

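The intended effect of `--limit` and `--min-score` can be sketched as a filter over scored results. This assumes scores are similarities where higher is better; ChromaDB queries report distances, so a conversion is assumed to happen before this step. `select_results` and its default `limit` are hypothetical.

```python
def select_results(scored: list[tuple[str, float]], limit: int = 5,
                   min_score: float = 0.0) -> list[tuple[str, float]]:
    """Keep results scoring at least `min_score`, best first, capped at `limit`."""
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)  # Highest similarity first
    return kept[:limit]
```

Filtering before truncating means a strict `--min-score` can return fewer than `--limit` results, or none at all.
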
## Development Standards

### Code Style

- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English

### Testing Strategy

- Unit tests with pytest
- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
- Prefer functions over classes unless inheritance is required
- Validate the test plan before implementation

### File Management

- All file modifications are documented with the full file path
- Clear separation of concerns across modules

## Project Status

- [x] Requirements gathering
- [x] Architecture design
- [x] Chunking strategy validation
- [ ] Implementation
  - [x] `markdown_parser.py`
  - [x] `indexer.py`
  - [x] `searcher.py`
  - [x] `cli.py`
- [ ] Unit tests
  - [x] `test_markdown_parser.py`
  - [x] `test_indexer.py` (tests written, debugging in progress)
  - [x] `test_searcher.py`
  - [ ] `test_cli.py`
- [ ] Integration testing
- [ ] Documentation
- [ ] Phase 2: LLM integration

## License

[To be determined]

## Contributing

[To be determined]