Kodjo Sossouvi d4925f7969 Initial commit
2025-12-12 11:31:44 +01:00


# Obsidian RAG Backend
A local, semantic search backend for Obsidian markdown files.
## Project Overview
### Context
- **Target vault size**: ~1900 files, 480 MB
- **Deployment**: 100% local (no external APIs)
- **Usage**: Command-line interface (CLI)
- **Language**: Python 3.12
### Phase 1 Scope (Current)
A semantic search system that:
- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata

**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.
## Features
### Indexing
- Manual, on-demand indexing
- Processes all `.md` files in vault
- Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
  - Short sections (≤200 tokens): indexed as-is
  - Long sections: split with a sliding window (200 tokens, 30-token overlap)
- Robust error handling: continues indexing even if individual files fail
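
The structure extraction described above could look roughly like the following. This is a minimal sketch, not the actual `markdown_parser.py`: the `Section` dataclass and `split_sections` function are hypothetical names, and it assumes sections are delimited by ATX headings (`#` to `######`) and that headings inside fenced code blocks should be ignored.

```python
import re
from dataclasses import dataclass

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    title: str       # heading text, "" for content before the first heading
    line_start: int  # 1-based, inclusive
    line_end: int    # 1-based, inclusive
    text: str


def split_sections(markdown: str) -> list[Section]:
    """Split a markdown document into heading-delimited sections."""
    lines = markdown.splitlines()
    sections: list[Section] = []
    title, start, in_fence = "", 1, False

    def flush(end: int) -> None:
        # Store the section that runs from `start` to `end`, if non-empty.
        body = "\n".join(lines[start - 1:end])
        if body.strip():
            sections.append(Section(title, start, end, body))

    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # entering or leaving a code fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(number - 1)
            title, start = match.group(2), number
    flush(len(lines))
    return sections
```

Each section keeps its heading line, so the line range reported in a search result points at the heading the excerpt belongs to.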
### Search Results
Each search result includes:
- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)
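
As an illustration of the fields listed above, a result could be modelled as a small immutable record. The `SearchResult` class and `format_result` helper are hypothetical names chosen for this sketch, and the one-line output layout is only an assumption about how the CLI might print a hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    file_path: str      # relative to the vault root
    score: float        # similarity score, higher is more relevant
    excerpt: str        # text of the matching chunk
    section_title: str  # heading of the enclosing section
    line_start: int     # first line of the chunk in the file


def format_result(result: SearchResult, max_chars: int = 80) -> str:
    """Render one result as a single line, truncating long excerpts."""
    excerpt = " ".join(result.excerpt.split())  # collapse newlines and runs of spaces
    if len(excerpt) > max_chars:
        excerpt = excerpt[: max_chars - 3] + "..."
    return (
        f"{result.score:.2f}  {result.file_path}:{result.line_start}  "
        f"[{result.section_title}]  {excerpt}"
    )
```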
## Architecture
```
obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py   # Parse .md files, extract structure
│   ├── indexer.py           # Generate embeddings and vector index
│   ├── searcher.py          # Perform semantic search
│   └── cli.py               # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md
```
## Technical Choices
### Technology Stack
| Component | Technology | Rationale |
|---------------|--------------------------------------------|----------------------------------------------|
| Embeddings | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80MB), good performance |
| Vector Store | ChromaDB | Simple, persistent, good Python integration |
| CLI Framework | Typer | Modern, type-safe, excellent UX |
| Testing | pytest | Standard, powerful, good ecosystem |
### Design Decisions
1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
2. **Local-only**: All processing happens on the local machine; no data is sent to external services
3. **Manual indexing**: User triggers re-indexing when needed (incremental updates deferred to future phases)
4. **Hybrid chunking**: Preserves small sections intact while handling large sections with sliding window
5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
7. **Extensible design**: Architecture prepared for future LLM integration
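
The error-handling decision above (item 6) can be sketched as a loop that records failures instead of aborting. This is an assumption about the shape of the logic, not the actual `indexer.py`: `index_vault` and the `index_file` callback are hypothetical names, and the callback stands in for the real parse, embed, and store step.

```python
from pathlib import Path
from typing import Callable


def index_vault(
    vault: Path, index_file: Callable[[Path], None]
) -> tuple[int, dict[str, str]]:
    """Index every .md file under `vault`, collecting errors per file.

    Returns the number of files indexed and a mapping of
    relative path -> error description for the files that failed.
    """
    indexed = 0
    errors: dict[str, str] = {}
    for path in sorted(vault.rglob("*.md")):
        try:
            index_file(path)
            indexed += 1
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            relative = path.relative_to(vault).as_posix()
            errors[relative] = f"{type(exc).__name__}: {exc}"
    return indexed, errors
```

Returning the error map rather than logging inside the loop leaves the detailed reporting to the caller, for example the CLI.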
### Chunking Strategy Details
The indexer uses a hybrid approach:
- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- **Long sections** (>200 tokens): Split using a sliding window with:
  - Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
  - Overlap: 30 tokens (~15% overlap to preserve context at boundaries)
  - Token counting: Uses sentence-transformers' native tokenizer for accuracy
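
The hybrid rule above can be sketched as a function over an already-tokenized section. This is a minimal illustration rather than the project's indexer: `chunk_tokens` is a hypothetical name, and the token list is assumed to come from the embedding model's tokenizer (any list of tokens works for the logic itself).

```python
def chunk_tokens(
    tokens: list[str], max_tokens: int = 200, overlap: int = 30
) -> list[list[str]]:
    """Return a short section as one chunk, or a long one as overlapping windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    if len(tokens) <= max_tokens:
        # Short section: keep it intact (or nothing at all if it is empty).
        return [tokens] if tokens else []

    step = max_tokens - overlap  # each window starts 170 tokens after the previous one
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the section
    return chunks
```

With the default values, a 500-token section yields windows starting at tokens 0, 170, and 340, and the last 30 tokens of each window are repeated at the start of the next one.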
### Metadata Structure
Each chunk stored in ChromaDB includes:
```python
{
    "file_path": str,      # Relative path from vault root
    "section_title": str,  # Markdown section heading
    "line_start": int,     # Starting line number in file
    "line_end": int,       # Ending line number in file
}
```
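
One way to keep this structure consistent across modules is a `TypedDict` plus a small constructor. The names `ChunkMetadata` and `make_metadata` are hypothetical and not taken from the project; the sketch assumes file paths are stored with forward slashes and line numbers are 1-based.

```python
from pathlib import PurePath
from typing import TypedDict


class ChunkMetadata(TypedDict):
    file_path: str      # relative path from vault root
    section_title: str  # markdown section heading
    line_start: int     # starting line number in file
    line_end: int       # ending line number in file


def make_metadata(
    vault_root: str, file_path: str, section_title: str, line_start: int, line_end: int
) -> ChunkMetadata:
    """Build the metadata dict for one chunk, validating the line range."""
    if line_start < 1 or line_end < line_start:
        raise ValueError("expected 1 <= line_start <= line_end")
    # relative_to raises ValueError if file_path is not inside vault_root.
    relative = PurePath(file_path).relative_to(vault_root).as_posix()
    return ChunkMetadata(
        file_path=relative,
        section_title=section_title,
        line_start=line_start,
        line_end=line_end,
    )
```

All four values are plain strings or integers, which keeps the dict compatible with vector stores that only accept scalar metadata values.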
## Dependencies
### Required
```bash
sentence-transformers # Local embeddings model (includes tokenizer)
chromadb # Vector database
typer # CLI framework
rich # Terminal formatting (Typer dependency)
```
### Development
```bash
pytest # Testing framework
pytest-cov # Test coverage
```
### Installation
```bash
# Quote the extra so shells such as zsh do not try to expand the brackets
pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
```
## Usage (Planned)
```bash
# Index vault
obsidian-rag index /path/to/vault
# Search
obsidian-rag search "your query here"
# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
```
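
The `--limit` and `--min-score` options shown above suggest a post-processing step along these lines. This is a sketch under assumptions: `select_results` is a hypothetical helper, results are modelled as `(score, file_path)` tuples, and the default values are placeholders since the planned defaults are not stated here.

```python
def select_results(
    scored: list[tuple[float, str]], limit: int = 5, min_score: float = 0.0
) -> list[tuple[float, str]]:
    """Drop results below `min_score`, then keep the `limit` best ones."""
    kept = [item for item in scored if item[0] >= min_score]
    kept.sort(key=lambda item: item[0], reverse=True)  # highest score first
    return kept[:limit]
```

Filtering before truncating means `--limit` counts only results that already passed the score threshold.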
## Development Standards
### Code Style
- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English
### Testing Strategy
- Unit tests with pytest
- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
- Functions over classes unless inheritance required
- Test plan validation before implementation
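
The naming convention above might look like this in practice. The `count_words` function is a hypothetical stand-in for project code and exists only to give the two tests something to exercise.

```python
import pytest


def count_words(text: str) -> int:
    """Toy function under test: count whitespace-separated words."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return len(text.split())


# Passing case: named test_i_can_xxx
def test_i_can_count_words_in_a_sentence():
    assert count_words("local semantic search") == 3


# Error case: named test_i_cannot_xxx
def test_i_cannot_count_words_of_non_string_input():
    with pytest.raises(TypeError):
        count_words(None)
```

Both tests are plain functions rather than methods of a test class, in line with the "functions over classes" rule.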
### File Management
- All file modifications documented with full file path
- Clear separation of concerns across modules
## Project Status
- [x] Requirements gathering
- [x] Architecture design
- [x] Chunking strategy validation
- [ ] Implementation
  - [x] `markdown_parser.py`
  - [x] `indexer.py`
  - [x] `searcher.py`
  - [x] `cli.py`
- [ ] Unit tests
  - [x] `test_markdown_parser.py`
  - [x] `test_indexer.py` (tests written, debugging in progress)
  - [x] `test_searcher.py`
  - [ ] `test_cli.py`
- [ ] Integration testing
- [ ] Documentation
- [ ] Phase 2: LLM integration
## License
[To be determined]
## Contributing
[To be determined]