MyObsidianAI/README.md
Kodjo Sossouvi d4925f7969 Initial commit
2025-12-12 11:31:44 +01:00

Obsidian RAG Backend

A local, semantic search backend for Obsidian markdown files.

Project Overview

Context

  • Target vault size: ~1900 files, 480 MB
  • Deployment: 100% local (no external APIs)
  • Usage: Command-line interface (CLI)
  • Language: Python 3.12

Phase 1 Scope (Current)

Semantic search system that:

  • Indexes markdown files from an Obsidian vault
  • Performs semantic search using local embeddings
  • Returns relevant results with metadata

Phase 2 (Future): Add LLM integration for answer generation using Phase 1 search results.

Features

Indexing

  • Manual, on-demand indexing
  • Processes all .md files in the vault
  • Extracts document structure (sections, line numbers)
  • Hybrid chunking strategy:
    • Short sections (≤200 tokens): indexed as-is
    • Long sections (>200 tokens): split with a sliding window (200-token window, 30-token overlap)
  • Robust error handling: continues indexing even if individual files fail
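
The structure extraction step could look roughly like the sketch below. It is an illustration only, not the actual markdown_parser.py: the names Section and parse_sections are hypothetical, line numbers are assumed to be 1-based, and only ATX headings (lines starting with #) are treated as section boundaries.

```python
import re
from dataclasses import dataclass

# ATX headings: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


@dataclass
class Section:  # hypothetical container, not the project's actual type
    title: str       # heading text, "" for content before the first heading
    line_start: int  # 1-based number of the first line of the section
    line_end: int    # 1-based number of the last line of the section
    text: str        # raw section text, heading line included


def parse_sections(markdown: str) -> list[Section]:
    """Split markdown text into sections delimited by ATX headings."""
    lines = markdown.splitlines()
    sections: list[Section] = []
    title, start, in_fence = "", 1, False

    def close(end: int) -> None:
        # Emit the section that spans lines start..end, skipping blank ones.
        body = "\n".join(lines[start - 1:end])
        if body.strip():
            sections.append(Section(title, start, end, body))

    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside fenced code are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            close(number - 1)
            title, start = match.group(2), number
    close(len(lines))
    return sections
```

Each returned section carries the title and line range that the search results later report as the location in the file.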

Search Results

Each search result includes:

  • File path (relative to vault root)
  • Similarity score
  • Relevant text excerpt
  • Location in file (section and line number)
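
As an illustration, the fields above could be carried in a small record like the one below. SearchResult and format_result are hypothetical names, and the output layout is an assumption, not the CLI's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:  # hypothetical record mirroring the fields listed above
    file_path: str      # relative to the vault root
    score: float        # similarity score, higher means more similar
    excerpt: str        # text of the matching chunk
    section_title: str  # heading of the section containing the chunk
    line_start: int     # first line of the chunk in the file
    line_end: int       # last line of the chunk in the file


def format_result(result: SearchResult, max_chars: int = 80) -> str:
    """Render one result as a two-line, terminal-friendly string."""
    excerpt = " ".join(result.excerpt.split())  # collapse runs of whitespace
    if len(excerpt) > max_chars:
        excerpt = excerpt[: max_chars - 3] + "..."
    location = f"{result.file_path}:{result.line_start}-{result.line_end}"
    return f"[{result.score:.2f}] {location} ({result.section_title})\n    {excerpt}"
```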

Architecture

obsidian_rag/
├── obsidian_rag/
│   ├── __init__.py
│   ├── markdown_parser.py    # Parse .md files, extract structure
│   ├── indexer.py             # Generate embeddings and vector index
│   ├── searcher.py            # Perform semantic search
│   └── cli.py                 # Typer CLI interface
├── tests/
│   ├── __init__.py
│   ├── test_markdown_parser.py
│   ├── test_indexer.py
│   └── test_searcher.py
├── pyproject.toml
└── README.md

Technical Choices

Technology Stack

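| Component     | Technology                               | Rationale                                    |
|---------------|------------------------------------------|----------------------------------------------|
| Embeddings    | sentence-transformers (all-MiniLM-L6-v2) | Local, lightweight (~80MB), good performance |
| Vector Store  | ChromaDB                                 | Simple, persistent, good Python integration  |
| CLI Framework | Typer                                    | Modern, type-safe, excellent UX              |
| Testing       | pytest                                   | Standard, powerful, good ecosystem           |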

Design Decisions

  1. Modular architecture: Separate concerns (parsing, indexing, searching, CLI) for maintainability and testability
  2. Local-only: All processing happens on the local machine; no data is sent to external services
  3. Manual indexing: User triggers re-indexing when needed (incremental updates deferred to future phases)
  4. Hybrid chunking: Preserves small sections intact while splitting large sections with a sliding window
  5. Token-based chunking: Uses the model's tokenizer for precise chunk sizing (max 200 tokens, 30-token overlap)
  6. Robust error handling: Indexing continues even if individual files fail, with detailed error reporting
  7. Extensible design: Architecture prepared for future LLM integration
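
The error-handling decision (item 6) can be sketched as a loop that records failures instead of aborting. This is a minimal illustration under assumptions: index_vault is a hypothetical name, and index_file stands for whatever callable parses, chunks, and stores a single file.

```python
from pathlib import Path
from typing import Callable


def index_vault(
    vault: Path, index_file: Callable[[Path], None]
) -> tuple[int, list[tuple[Path, str]]]:
    """Index every .md file under `vault`, collecting failures instead of aborting.

    `index_file` is a hypothetical callable that processes one file and
    raises on failure. Returns (number of files indexed, list of errors).
    """
    indexed = 0
    errors: list[tuple[Path, str]] = []
    for path in sorted(vault.rglob("*.md")):
        try:
            index_file(path)
            indexed += 1
        except Exception as exc:  # deliberate catch-all: one bad file must not stop the run
            errors.append((path.relative_to(vault), f"{type(exc).__name__}: {exc}"))
    return indexed, errors
```

The returned error list keeps the relative path and exception type for each failed file, which is enough to print a detailed report once the run finishes.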

Chunking Strategy Details

The indexer uses a hybrid approach:

  • Short sections (≤200 tokens): Indexed as a single chunk to preserve semantic coherence
  • Long sections (>200 tokens): Split using a sliding window with:
    • Maximum chunk size: 200 tokens (safe margin under the model's 256-token limit)
    • Overlap: 30 tokens (15% of the window, to preserve context at chunk boundaries)
    • Token counting: Uses sentence-transformers' native tokenizer for accuracy
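
The hybrid rule can be sketched on a plain list of token IDs. This is a simplified illustration, not the actual indexer.py: chunk_tokens is a hypothetical name, and the token IDs are assumed to come from the embedding model's tokenizer, which is left out here to keep the example dependency-free.

```python
def chunk_tokens(
    tokens: list[int], max_tokens: int = 200, overlap: int = 30
) -> list[list[int]]:
    """Return `tokens` as one chunk if short, else as overlapping windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    if len(tokens) <= max_tokens:
        # Short section: keep it intact (an empty section yields no chunk).
        return [tokens] if tokens else []

    step = max_tokens - overlap  # each window starts 170 tokens after the previous one
    chunks: list[list[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the section
    return chunks


# A 450-token section yields windows starting at tokens 0, 170, and 340.
chunks = chunk_tokens(list(range(450)))
print([len(c) for c in chunks])  # [200, 200, 110]
```

With a 200-token window and a 30-token overlap, the last 30 tokens of each chunk are repeated at the start of the next one.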

Metadata Structure

Each chunk stored in ChromaDB includes:

{
    "file_path": str,        # Relative path from vault root
    "section_title": str,    # Markdown section heading
    "line_start": int,       # Starting line number in file
    "line_end": int          # Ending line number in file
}
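
One way to build this dictionary for a chunk is sketched below. ChunkMetadata and build_metadata are hypothetical names; 1-based line numbers and POSIX-style separators in file_path are assumptions made for the example.

```python
from pathlib import Path
from typing import TypedDict


class ChunkMetadata(TypedDict):  # hypothetical type mirroring the structure above
    file_path: str
    section_title: str
    line_start: int
    line_end: int


def build_metadata(
    vault_root: Path, file: Path, section_title: str, line_start: int, line_end: int
) -> ChunkMetadata:
    """Build the per-chunk metadata, storing the path relative to the vault root."""
    if line_start < 1 or line_end < line_start:
        raise ValueError("expected 1 <= line_start <= line_end")
    return ChunkMetadata(
        # as_posix() keeps the stored path identical across operating systems
        file_path=file.relative_to(vault_root).as_posix(),
        section_title=section_title,
        line_start=line_start,
        line_end=line_end,
    )
```

All four values are plain strings or integers, which matches the scalar metadata values ChromaDB accepts.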

Dependencies

Required

sentence-transformers  # Local embeddings model (includes tokenizer)
chromadb              # Vector database
typer                 # CLI framework
rich                  # Terminal formatting (Typer dependency)

Development

pytest                # Testing framework
pytest-cov           # Test coverage

Installation

pip install sentence-transformers chromadb "typer[all]" pytest pytest-cov

Usage (Planned)

# Index vault
obsidian-rag index /path/to/vault

# Search
obsidian-rag search "your query here"

# Search with options
obsidian-rag search "query" --limit 10 --min-score 0.5
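
The --limit and --min-score options could be applied to scored results roughly as follows. This is a sketch under assumptions: select_results is a hypothetical helper, the default values are illustrative, and scores are assumed to be similarities where higher is better.

```python
def select_results(
    scored: list[tuple[float, str]], limit: int = 5, min_score: float = 0.0
) -> list[tuple[float, str]]:
    """Keep results scoring at least `min_score`, best first, at most `limit` of them.

    Each item is a (similarity score, file path) pair; both defaults are
    illustrative assumptions, not the CLI's actual defaults.
    """
    kept = [item for item in scored if item[0] >= min_score]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:limit]


hits = [(0.9, "a.md"), (0.4, "b.md"), (0.7, "c.md")]
print(select_results(hits, limit=10, min_score=0.5))  # [(0.9, 'a.md'), (0.7, 'c.md')]
```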

Development Standards

Code Style

  • Follow PEP 8 conventions
  • Use snake_case for variables and functions
  • Docstrings in Google or NumPy format
  • All code, comments, and documentation in English

Testing Strategy

  • Unit tests with pytest
  • Test function naming: test_i_can_xxx (passing tests) or test_i_cannot_xxx (error cases)
  • Functions over classes unless inheritance required
  • Test plan validation before implementation

File Management

  • All file modifications documented with full file path
  • Clear separation of concerns across modules

Project Status

  • Requirements gathering
  • Architecture design
  • Chunking strategy validation
  • Implementation
    • markdown_parser.py
    • indexer.py
    • searcher.py
    • cli.py
  • Unit tests
    • test_markdown_parser.py
    • test_indexer.py (tests written, debugging in progress)
    • test_searcher.py
    • test_cli.py
  • Integration testing
  • Documentation
  • Phase 2: LLM integration

License

[To be determined]

Contributing

[To be determined]