Adding document service

2025-09-19 22:59:41 +02:00
parent e8b306ac4a
commit f1b551d243
13 changed files with 1734 additions and 24 deletions
--- a/Readme.md
+++ b/Readme.md
@@ -232,19 +232,11 @@ Stores file metadata and extracted content:
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
-  "mime_type": "application/pdf",
-  "file_size": 2048576,
-  "content": "extracted text content...",
-  "encoding": "utf-8",
-  "extraction_method": "direct_text",
-  // direct_text, ocr, hybrid
+  "extraction_method": "direct_text", // direct_text, ocr, hybrid
  "metadata": {
-    "page_count": 15,
-    // for PDFs
-    "word_count": 250,
-    // for text files  
-    "image_dimensions": {
-      // for images
+    "page_count": 15,        // for PDFs
+    "word_count": 250,       // for text files  
+    "image_dimensions": {    // for images
      "width": 1920,
      "height": 1080
    }
@@ -253,6 +245,19 @@ Stores file metadata and extracted content:
  "file_hash": "sha256_hash_value"
 }
 ```
+#### Document Contents Collection
+
+Stores actual file content and technical metadata:
+```json
+{
+  "_id": "ObjectId",
+  "file_hash": "sha256_hash_value",
+  "content": "extracted text content...",
+  "encoding": "utf-8",
+  "file_size": 2048576,
+  "mime_type": "application/pdf"
+}
+```

 #### Processing Jobs Collection

@@ -272,6 +277,25 @@ Tracks processing status and lifecycle:
 }
 ```

+### Data Storage Strategy
+
+- **Choice**: Three separate collections for files, content, and processing status
+- **Rationale**: Normalization prevents content duplication when multiple files have identical content
+- **Benefits**:
+    - Content deduplication via SHA256 hash
+    - Faster metadata queries, since they no longer load large `content` fields, and content searches that run against a dedicated collection
+    - Clear separation of concerns between file metadata, content, and processing lifecycle
+    - Multiple files can reference the same content (e.g., identical copies in different locations)
+
+### Content Storage Location
+
+- **Choice**: Store extracted content in separate `document_contents` collection
+- **Rationale**: Content normalization and deduplication
+- **Benefits**: 
+    - Single content storage per unique file hash
+    - Multiple file entries can reference same content
+    - Efficient storage for duplicate files
+
 ### Supported File Types (Initial Implementation)

 - **Text Files** (`.txt`): Direct content reading
@@ -323,6 +347,87 @@ Tracks processing status and lifecycle:
 - **Extensible Metadata**: Flexible metadata storage per file type
 - **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches

+## Document Service Architecture
+
+### Service Overview
+
+The document service exposes a single interface to file documents and their content, coordinating the `FileDocument` and `DocumentContent` repositories.
+
+### Service Design
+
+- **Architecture Pattern**: Service orchestration with separate repositories
+- **Transaction Support**: MongoDB ACID transactions for data consistency
+- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
+- **Error Handling**: Standard MongoDB exceptions are raised, and the transaction is rolled back on failure
+
+### Document Service (`document_service.py`)
+
+Orchestrates operations between file and content repositories while maintaining data consistency.
+
+#### Core Functionality
+
+##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
+
+Creates a new document with automatic attribute calculation and content deduplication.
+
+**Automatic Calculations:**
+- `file_hash`: SHA256 hash of file bytes
+- `file_type`: Detected from the file extension
+- `mime_type`: Detected via the `python-magic` library
+- `file_size`: Length of provided bytes
+- `detected_at`: Current timestamp
+- `metadata`: Empty dictionary (reserved for future extension)
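The automatic calculations listed above can be sketched as follows. This is a minimal illustration, not the service's actual code: the helper name `compute_attributes` and the `mimetypes` fallback (used only when `python-magic` is not installed) are assumptions made for the example.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

try:
    import magic  # python-magic; optional in this sketch
except ImportError:
    magic = None


def compute_attributes(file_path: str, file_bytes: bytes) -> dict:
    """Derive the attributes listed above (hypothetical helper)."""
    path = Path(file_path)
    if magic is not None:
        # Content-based detection from the raw bytes.
        mime_type = magic.from_buffer(file_bytes, mime=True)
    else:
        # Assumption: fall back to an extension-based guess from the stdlib.
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return {
        "filename": path.name,
        "filepath": file_path,
        "file_hash": hashlib.sha256(file_bytes).hexdigest(),
        "file_type": path.suffix.lower().lstrip("."),
        "mime_type": mime_type,
        "file_size": len(file_bytes),
        "detected_at": datetime.now(timezone.utc),
        "metadata": {},  # reserved for future extension
    }
```

Hashing the raw bytes rather than the decoded text means two files deduplicate only when they are byte-identical.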
+
+**Deduplication Logic:**
+1. Calculate SHA256 hash of file content
+2. Check if `DocumentContent` with this hash already exists
+3. If it exists: create only a `FileDocument` that references the existing content
+4. If it does not exist: create both the `FileDocument` and the `DocumentContent` in a single transaction
+
+**Transaction Flow:**
+```
+BEGIN TRANSACTION
+  IF content_exists(file_hash):
+    CREATE FileDocument with content reference
+  ELSE:
+    CREATE DocumentContent
+    CREATE FileDocument with content reference
+COMMIT TRANSACTION
+```
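The transaction flow above might look like this with `pymongo` sessions. It is a sketch under stated assumptions: the `files` collection name is hypothetical (only `document_contents` is named in this README), the function takes the client and database as arguments instead of going through the repositories, and content is stored by simply decoding the bytes. Note that MongoDB multi-document transactions require a replica set or sharded cluster.

```python
import hashlib
from datetime import datetime, timezone
from typing import Any


def create_document(client: Any, db: Any, file_path: str,
                    file_bytes: bytes, encoding: str = "utf-8") -> Any:
    """Insert a file entry, creating its content only if the hash is new."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    with client.start_session() as session:
        # The context manager commits on normal exit and aborts on exception.
        with session.start_transaction():
            contents = db["document_contents"]
            existing = contents.find_one({"file_hash": file_hash}, session=session)
            if existing is None:
                contents.insert_one(
                    {
                        "file_hash": file_hash,
                        # Simplification: real extraction depends on file type.
                        "content": file_bytes.decode(encoding, errors="replace"),
                        "encoding": encoding,
                        "file_size": len(file_bytes),
                    },
                    session=session,
                )
            # "files" is a hypothetical collection name for FileDocument entries.
            result = db["files"].insert_one(
                {
                    "filepath": file_path,
                    "file_hash": file_hash,  # reference to the shared content
                    "detected_at": datetime.now(timezone.utc),
                    "metadata": {},
                },
                session=session,
            )
    return result.inserted_id
```

Because both branches end by creating the `FileDocument`, the sketch performs that insert once after the conditional content insert.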
+
+#### Available Methods
+
+- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
+- `get_document_by_id(document_id)`: Retrieve by document ID
+- `get_document_by_hash(file_hash)`: Retrieve by file hash
+- `get_document_by_filepath(filepath)`: Retrieve by file path
+- `list_documents(skip, limit)`: Paginated document listing
+- `count_documents()`: Total document count
+- `update_document(document_id, update_data)`: Update document metadata
+- `delete_document(document_id)`: Remove document and orphaned content
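One possible reading of "remove document and orphaned content" in `delete_document` is sketched below: the content is deleted only when no remaining file entry references its hash. The `files` collection name and the direct use of collections instead of the repositories are assumptions made for illustration.

```python
from typing import Any


def delete_document(client: Any, db: Any, document_id: Any) -> bool:
    """Delete a file entry and its content if nothing else references it."""
    with client.start_session() as session:
        with session.start_transaction():
            # "files" is a hypothetical collection name for FileDocument entries.
            doc = db["files"].find_one_and_delete({"_id": document_id}, session=session)
            if doc is None:
                return False  # nothing to delete
            remaining = db["files"].count_documents(
                {"file_hash": doc["file_hash"]}, session=session
            )
            if remaining == 0:
                # Last reference is gone, so the content is orphaned.
                db["document_contents"].delete_one(
                    {"file_hash": doc["file_hash"]}, session=session
                )
    return True
```

Running the reference count and the content delete in the same transaction as the file delete avoids leaving content behind if one of the steps fails.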
+
+### Repository Dependencies
+
+The document service coordinates two existing repositories:
+
+#### File Repository (`file_repository.py`)
+- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
+- `find_document_by_filepath()`, `find_document_by_name()`
+- `list_documents()`, `count_documents()`
+- `update_document()`, `delete_document()`
+
+#### Document Content Repository (`document_content_repository.py`)
+- `create_document_content()`, `find_document_content_by_id()`
+- `find_document_content_by_file_hash()`, `content_exists()`
+- `update_document_content()`, `delete_document_content()`
+- `list_document_contents()`, `count_document_contents()`
+
+### Dependencies
+
+- `python-magic`: MIME type detection
+- `hashlib`: SHA256 hashing (standard library)
+- `pymongo`: MongoDB driver with transaction support
+
 ## Key Implementation Notes

 ### Python Standards