Initial commit
README.md
# Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

## Project Overview

### Context

- **Target vault size**: ~1900 files, 480 MB
- **Deployment**: 100% local (no external APIs)
- **Usage**: Command-line interface (CLI)
- **Language**: Python 3.12

### Phase 1 Scope (Current)

Semantic search system that:

- Indexes markdown files from an Obsidian vault
- Performs semantic search using local embeddings
- Returns relevant results with metadata

**Phase 2 (Future)**: Add LLM integration for answer generation using Phase 1 search results.

## Features

### Indexing

- Manual, on-demand indexing
- Processes all `.md` files in the vault
- Extracts document structure (sections, line numbers)
- Hybrid chunking strategy:
  - Short sections (≤200 tokens): indexed as-is
  - Long sections: split with a sliding window (200 tokens, 30-token overlap)
- Robust error handling: continues indexing even if individual files fail
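
The structure-extraction step could look roughly like the sketch below. It is an illustration only, not the project's `markdown_parser.py`: the `Section` dataclass, the `parse_sections` function, and the rule that headings inside fenced code blocks are ignored are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title"); an Obsidian tag such as "#tag" has no space and is not matched.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    title: str       # Heading text, or "" for content before the first heading
    line_start: int  # 1-based line number where the section starts
    line_end: int    # 1-based line number of the section's last line
    text: str        # Section body, without the heading line


def parse_sections(markdown: str) -> list[Section]:
    """Split markdown text into sections delimited by ATX headings (hypothetical helper)."""
    lines = markdown.splitlines()
    sections: list[Section] = []
    title, start, body = "", 1, []
    in_fence = False

    def flush(end: int) -> None:
        # Drop a preamble that contains nothing but blank lines.
        if title or any(line.strip() for line in body):
            sections.append(Section(title, start, end, "\n".join(body).strip()))

    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # Do not treat "# ..." inside code fences as headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(number - 1)
            title, start, body = match.group(2), number, []
        else:
            body.append(line)
    flush(len(lines))
    return sections
```

Each `Section` carries the line range that later ends up in the chunk metadata, so a search hit can point back to a precise location in the note.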

### Search Results

Each search result includes:

- File path (relative to vault root)
- Similarity score
- Relevant text excerpt
- Location in file (section and line number)

## Architecture

```
obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py            # Generate embeddings and vector index
│   ├── searcher.py           # Perform semantic search
│   └── cli.py                # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md
```

## Technical Choices

### Technology Stack

| Component     | Technology                                 | Rationale                                    |
|---------------|--------------------------------------------|----------------------------------------------|
| Embeddings    | sentence-transformers (`all-MiniLM-L6-v2`) | Local, lightweight (~80MB), good performance |
| Vector Store  | ChromaDB                                   | Simple, persistent, good Python integration  |
| CLI Framework | Typer                                      | Modern, type-safe, excellent UX              |
| Testing       | pytest                                     | Standard, powerful, good ecosystem           |

### Design Decisions

1. **Modular architecture**: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
2. **Local-only**: All processing happens on the local machine; no data is sent to external services
3. **Manual indexing**: The user triggers re-indexing when needed (incremental updates are deferred to future phases)
4. **Hybrid chunking**: Preserves small sections intact while handling large sections with a sliding window
5. **Token-based chunking**: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
6. **Robust error handling**: Indexing continues even if individual files fail, with detailed error reporting
7. **Extensible design**: Architecture prepared for future LLM integration
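
Design decision 6 can be pictured with the minimal sketch below. It is an assumption about how such a loop might be written, not the actual `indexer.py`: `IndexReport`, `index_vault`, and the `index_file` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class IndexReport:
    """Outcome of one indexing run (hypothetical structure)."""
    indexed: list[Path] = field(default_factory=list)
    failed: dict[Path, str] = field(default_factory=dict)  # relative path -> error message


def index_vault(vault: Path, index_file: Callable[[Path], None]) -> IndexReport:
    """Index every .md file under `vault`, recording failures instead of aborting."""
    report = IndexReport()
    for path in sorted(vault.rglob("*.md")):
        relative = path.relative_to(vault)
        try:
            index_file(path)
        except Exception as exc:  # Deliberately broad: one bad file must not stop the run
            report.failed[relative] = f"{type(exc).__name__}: {exc}"
        else:
            report.indexed.append(relative)
    return report
```

Passing the per-file work in as a callback keeps the loop testable without loading an embedding model, and the returned report gives the CLI enough detail to list which files failed and why.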

### Chunking Strategy Details

The indexer uses a hybrid approach:

- **Short sections** (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
- **Long sections** (>200 tokens): Split using a sliding window with:
  - Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
  - Overlap: 30 tokens (15% of the chunk size, to preserve context at boundaries)
  - Token counting: Uses sentence-transformers' native tokenizer for accuracy
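
The hybrid rule above can be sketched as follows. This is a simplified illustration, not the project's code: `chunk_tokens` and `chunk_section` are hypothetical names, and whitespace splitting stands in for the model's tokenizer, so the token counts here are only approximate.

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 200, overlap: int = 30) -> list[list[str]]:
    """Return the tokens as one chunk if short, else as overlapping windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    if len(tokens) <= max_tokens:
        return [tokens] if tokens else []  # Short section: keep it intact

    step = max_tokens - overlap  # Each window starts `step` tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # This window already reaches the end of the section
    return chunks


def chunk_section(text: str, max_tokens: int = 200, overlap: int = 30) -> list[str]:
    """Chunk a section's text; str.split() is a stand-in for the real tokenizer."""
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, max_tokens, overlap)]
```

With the default settings, a 450-token section yields three chunks of 200, 200, and 110 tokens, and the last 30 tokens of each chunk reappear at the start of the next one.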

### Metadata Structure

Each chunk stored in ChromaDB includes:

```python
{
    "file_path": str,      # Relative path from vault root
    "section_title": str,  # Markdown section heading
    "line_start": int,     # Starting line number in file
    "line_end": int        # Ending line number in file
}
```
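
As a sketch of how one such record might be assembled, the example below types the structure with a `TypedDict` and fills it from a file path and a section's line range. The names `ChunkMetadata` and `build_metadata` are hypothetical, and storing the path in POSIX form is an assumption made here so that records look the same on every platform.

```python
from pathlib import Path
from typing import TypedDict


class ChunkMetadata(TypedDict):
    file_path: str      # Relative path from vault root
    section_title: str  # Markdown section heading
    line_start: int     # Starting line number in file
    line_end: int       # Ending line number in file


def build_metadata(
    vault_root: Path, file: Path, section_title: str, line_start: int, line_end: int
) -> ChunkMetadata:
    """Build the metadata record for one chunk (hypothetical helper)."""
    return ChunkMetadata(
        file_path=file.relative_to(vault_root).as_posix(),
        section_title=section_title,
        line_start=line_start,
        line_end=line_end,
    )
```

All four values are plain strings or integers, which keeps the record flat and easy to store alongside each embedding.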

## Dependencies

### Required

```bash
sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb               # Vector database
typer                  # CLI framework
rich                   # Terminal formatting (Typer dependency)
```

### Development

```bash
pytest      # Testing framework
pytest-cov  # Test coverage
```

### Installation

```bash
# Quote the extras specifier so shells such as zsh do not treat the brackets as a glob
pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov
```

## Usage (Planned)

```bash
# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
```
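
The `--limit` and `--min-score` options suggest a post-filtering step along the lines of the sketch below. It is an assumption about the intended behavior, not the actual `searcher.py`: `SearchResult`, `select_results`, and the default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One search hit (hypothetical structure mirroring the result fields listed earlier)."""
    file_path: str
    section_title: str
    line_start: int
    score: float
    excerpt: str


def select_results(
    results: list[SearchResult], limit: int = 5, min_score: float = 0.0
) -> list[SearchResult]:
    """Keep results scoring at least `min_score`, best first, capped at `limit`."""
    kept = [r for r in results if r.score >= min_score]
    kept.sort(key=lambda r: r.score, reverse=True)
    return kept[:limit]
```

Filtering before truncating means a high `--min-score` can return fewer than `--limit` results, which is usually preferable to padding the output with weak matches.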

## Development Standards

### Code Style

- Follow PEP 8 conventions
- Use snake_case for variables and functions
- Docstrings in Google or NumPy format
- All code, comments, and documentation in English

### Testing Strategy

- Unit tests with pytest
- Test function naming: `test_i_can_xxx` (passing tests) or `test_i_cannot_xxx` (error cases)
- Functions over classes unless inheritance is required
- Test plan validation before implementation
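
The naming convention might look like this in practice. The `validate_query` helper is invented purely for this illustration and is not part of the project; the error case uses plain `try`/`except` so the snippet runs without any dependency.

```python
def validate_query(query: str) -> str:
    """Hypothetical helper: strip a search query and reject empty input."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return cleaned


def test_i_can_validate_a_query_with_surrounding_whitespace():
    assert validate_query("  local embeddings ") == "local embeddings"


def test_i_cannot_validate_an_empty_query():
    try:
        validate_query("   ")
    except ValueError:
        return  # Expected: blank queries are rejected
    raise AssertionError("expected ValueError for a blank query")
```

In a pytest suite, the error case is more commonly written with `pytest.raises(ValueError)`. Both tests are plain functions, in line with the "functions over classes" guideline.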

### File Management

- All file modifications documented with full file path
- Clear separation of concerns across modules

## Project Status

- [x] Requirements gathering
- [x] Architecture design
- [x] Chunking strategy validation
- [ ] Implementation
  - [x] `markdown_parser.py`
  - [x] `indexer.py`
  - [x] `searcher.py`
  - [x] `cli.py`
- [ ] Unit tests
  - [x] `test_markdown_parser.py`
  - [x] `test_indexer.py` (tests written, debugging in progress)
  - [x] `test_searcher.py`
  - [ ] `test_cli.py`
- [ ] Integration testing
- [ ] Documentation
- [ ] Phase 2: LLM integration

## License

[To be determined]

## Contributing

[To be determined]