kodjo/MyObsidianAI

Fork 0

Files

Kodjo Sossouvi d4925f7969 Initial commit

2025-12-12 11:31:44 +01:00

5.6 KiB

Raw Permalink Blame History

Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

Project Overview

Context

Target vault size: ~1900 files, 480 MB
Deployment: 100% local (no external APIs)
Usage: Command-line interface (CLI)
Language: Python 3.12

Phase 1 Scope (Current)

Semantic search system that:

Indexes markdown files from an Obsidian vault
Performs semantic search using local embeddings
Returns relevant results with metadata

Phase 2 (Future): Add LLM integration for answer generation using Phase 1 search results.

Features

Indexation

Manual, on-demand indexing
Processes all .md files in vault
Extracts document structure (sections, line numbers)
Hybrid chunking strategy:
- Short sections (≤200 tokens): indexed as-is
- Long sections: split with sliding window (200 tokens, 30 tokens overlap)
Robust error handling: continues indexing even if individual files fail

Search Results

Each search result includes:

File path (relative to vault root)
Similarity score
Relevant text excerpt
Location in file (section and line number)

Architecture

obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py             # Generate embeddings and vector index
│   ├── searcher.py            # Perform semantic search
│   └── cli.py                 # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md

Technical Choices

Technology Stack

Component	Technology	Rationale
Embeddings	sentence-transformers (`all-MiniLM-L6-v2`)	Local, lightweight (~80MB), good performance
Vector Store	ChromaDB	Simple, persistent, good Python integration
CLI Framework	Typer	Modern, type-safe, excellent UX
Testing	pytest	Standard, powerful, good ecosystem

Design Decisions

Modular architecture: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
Local-only: All processing happens on local machine, no data sent to external services
Manual indexing: User triggers re-indexing when needed (incremental updates deferred to future phases)
Hybrid chunking: Preserves small sections intact while handling large sections with sliding window
Token-based chunking: Uses model's tokenizer for precise chunk sizing (max 200 tokens, 30 tokens overlap)
Robust error handling: Indexing continues even if individual files fail, with detailed error reporting
Extensible design: Architecture prepared for future LLM integration

Chunking Strategy Details

The indexer uses a hybrid approach:

Short sections (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
Long sections (>200 tokens): Split using sliding window with:
- Maximum chunk size: 200 tokens (safe margin under model's 256 token limit)
- Overlap: 30 tokens (~15% overlap to preserve context at boundaries)
- Token counting: Uses sentence-transformers' native tokenizer for accuracy

Metadata Structure

Each chunk stored in ChromaDB includes:

{
    "file_path": str,        # Relative path from vault root
    "section_title": str,    # Markdown section heading
    "line_start": int,       # Starting line number in file
    "line_end": int          # Ending line number in file
}

Dependencies

Required

sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb              # Vector database
typer                 # CLI framework
rich                  # Terminal formatting (Typer dependency)

Development

pytest                # Testing framework
pytest-cov           # Test coverage

Installation

pip install sentence-transformers chromadb typer[all] pytest pytest-cov

Usage (Planned)

# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5

Development Standards

Code Style

Follow PEP 8 conventions
Use snake_case for variables and functions
Docstrings in Google or NumPy format
All code, comments, and documentation in English

Testing Strategy

Unit tests with pytest
Test function naming: test_i_can_xxx (passing tests) or test_i_cannot_xxx (error cases)
Functions over classes unless inheritance required
Test plan validation before implementation

File Management

All file modifications documented with full file path
Clear separation of concerns across modules

Project Status

Requirements gathering
Architecture design
Chunking strategy validation
Implementation
- markdown_parser.py
- indexer.py
- searcher.py
- cli.py
Unit tests
- test_markdown_parser.py
- test_indexer.py (tests written, debugging in progress)
- test_searcher.py
- test_cli.py
Integration testing
Documentation
Phase 2: LLM integration

License

[To be determined]

Contributing