Adding document service

2025-09-19 22:59:41 +02:00
parent e8b306ac4a
commit f1b551d243
13 changed files with 1734 additions and 24 deletions

Readme.md

@@ -232,19 +232,11 @@ Stores file metadata and extracted content:
 "filename": "document.pdf",
 "filepath": "/watched_files/document.pdf",
 "file_type": "pdf",
-"mime_type": "application/pdf",
-"file_size": 2048576,
-"content": "extracted text content...",
-"encoding": "utf-8",
-"extraction_method": "direct_text",
-// direct_text, ocr, hybrid
+"extraction_method": "direct_text", // direct_text, ocr, hybrid
 "metadata": {
-"page_count": 15,
-// for PDFs
-"word_count": 250,
-// for text files
-"image_dimensions": {
-// for images
+"page_count": 15, // for PDFs
+"word_count": 250, // for text files
+"image_dimensions": { // for images
 "width": 1920,
 "height": 1080
 }
@@ -253,6 +245,19 @@ Stores file metadata and extracted content:
"file_hash": "sha256_hash_value"
}
```
#### Document Contents Collection
Stores actual file content and technical metadata:
```json
{
"_id": "ObjectId",
"file_hash": "sha256_hash_value",
"content": "extracted text content...",
"encoding": "utf-8",
"file_size": 2048576,
"mime_type": "application/pdf"
}
```
#### Processing Jobs Collection
@@ -272,6 +277,25 @@ Tracks processing status and lifecycle:
}
```
### Data Storage Strategy
- **Choice**: Three separate collections for files, content, and processing status
- **Rationale**: Normalization prevents content duplication when multiple files have identical content
- **Benefits**:
  - Content deduplication via SHA256 hash
  - Faster metadata queries, since they no longer load the extracted content
  - Clear separation of concerns between file metadata, content, and processing lifecycle
  - Multiple files can reference the same content (e.g., identical copies in different locations)
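
The deduplication benefit can be illustrated with a minimal in-memory sketch. The two containers below only stand in for the files and `document_contents` collections, and the `store` helper is hypothetical; the one detail taken from the description above is that content is keyed by its SHA256 hash.

```python
import hashlib

# In-memory stand-ins for the two MongoDB collections (illustrative only).
contents: dict[str, dict] = {}  # keyed by file_hash, one entry per unique content
files: list[dict] = []          # one entry per file on disk


def store(filepath: str, data: bytes) -> str:
    """Record a file and store its content only if the hash is new."""
    file_hash = hashlib.sha256(data).hexdigest()
    # setdefault leaves an existing entry untouched, so identical bytes
    # are stored exactly once.
    contents.setdefault(file_hash, {"file_hash": file_hash, "file_size": len(data)})
    files.append({"filepath": filepath, "file_hash": file_hash})
    return file_hash


h1 = store("/watched_files/a/report.txt", b"same bytes")
h2 = store("/watched_files/b/report_copy.txt", b"same bytes")
h3 = store("/watched_files/notes.txt", b"different bytes")
```

After these three calls there are three file entries but only two content entries, because the first two files hash to the same value.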
### Content Storage Location
- **Choice**: Store extracted content in separate `document_contents` collection
- **Rationale**: Content normalization and deduplication
- **Benefits**:
  - Content is stored once per unique file hash
  - Multiple file entries can reference the same content
  - Efficient storage for duplicate files
### Supported File Types (Initial Implementation)
- **Text Files** (`.txt`): Direct content reading
@@ -323,6 +347,87 @@ Tracks processing status and lifecycle:
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
## Document Service Architecture
### Service Overview
The document service provides a single interface to file documents and their content, coordinating the `FileDocument` and `DocumentContent` repositories.
### Service Design
- **Architecture Pattern**: Service orchestration with separate repositories
- **Transaction Support**: MongoDB ACID transactions for data consistency
- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
- **Error Handling**: Standard MongoDB driver exceptions are propagated, and the transaction is rolled back on failure
### Document Service (`document_service.py`)
Orchestrates operations between file and content repositories while maintaining data consistency.
#### Core Functionality
##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
Creates a new document with automatic attribute calculation and content deduplication.
**Automatic Calculations:**
- `file_hash`: SHA256 hash of file bytes
- `file_type`: Detection based on file extension
- `mime_type`: Detection via `python-magic` library
- `file_size`: Length of provided bytes
- `detected_at`: Current timestamp
- `metadata`: Empty dictionary (reserved for future extension)
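
As a rough sketch of these calculations, the hypothetical helper below derives each field from the path and bytes. Two assumptions are worth flagging: the MIME type is guessed from the file name with the standard-library `mimetypes` module as a stand-in for `python-magic`, and the timestamp is taken in UTC, which the description does not specify.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def compute_attributes(file_path: str, file_bytes: bytes) -> dict:
    """Derive the automatically calculated fields for a new document."""
    # Stand-in: guess the MIME type from the file name. The service is
    # described as using python-magic, which inspects the bytes instead.
    mime_type, _ = mimetypes.guess_type(file_path)
    return {
        "filename": Path(file_path).name,
        "filepath": file_path,
        "file_hash": hashlib.sha256(file_bytes).hexdigest(),
        # Extension without the dot, lower-cased: ".TXT" -> "txt"
        "file_type": Path(file_path).suffix.lower().lstrip("."),
        "mime_type": mime_type or "application/octet-stream",
        "file_size": len(file_bytes),
        "detected_at": datetime.now(timezone.utc),  # UTC is an assumption
        "metadata": {},
    }


attrs = compute_attributes("/watched_files/notes.TXT", b"hello")
```

With `python-magic` installed, the stand-in line could be replaced by `magic.from_buffer(file_bytes, mime=True)`, which detects the type from the content rather than the name.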
**Deduplication Logic:**
1. Calculate SHA256 hash of file content
2. Check if `DocumentContent` with this hash already exists
3. If EXISTS: Create only `FileDocument` referencing existing content
4. If NOT EXISTS: Create both `FileDocument` and `DocumentContent` in transaction
**Transaction Flow:**
```
BEGIN TRANSACTION
    IF content_exists(file_hash):
        CREATE FileDocument with content reference
    ELSE:
        CREATE DocumentContent
        CREATE FileDocument with content reference
COMMIT TRANSACTION
```
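
The flow above might map onto PyMongo's session API roughly as follows. This is a sketch, not the service's actual code: the function name and parameters are assumptions, `client` is expected to behave like a `pymongo.MongoClient`, and `files` / `contents` like PyMongo collections. Note that MongoDB multi-document transactions require a replica set or sharded cluster, so this would not work against a standalone server.

```python
def create_document_tx(client, files, contents, file_doc: dict, content_doc: dict):
    """Insert a file document, creating its content only when the hash is new.

    All parameter names are hypothetical; the objects are duck-typed so the
    function works with any client/collection exposing the PyMongo methods used.
    """

    def txn(session):
        existing = contents.find_one(
            {"file_hash": file_doc["file_hash"]}, session=session
        )
        if existing is None:
            contents.insert_one(content_doc, session=session)
        # The file document is created in both branches of the flow.
        return files.insert_one(file_doc, session=session).inserted_id

    # with_transaction starts the transaction, commits when txn returns,
    # and aborts if txn raises.
    with client.start_session() as session:
        return session.with_transaction(txn)
```

Because `with_transaction` may call the callback more than once on transient errors, the callback is kept free of side effects outside the database.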
#### Available Methods
- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
- `get_document_by_id(document_id)`: Retrieve by document ID
- `get_document_by_hash(file_hash)`: Retrieve by file hash
- `get_document_by_filepath(filepath)`: Retrieve by file path
- `list_documents(skip, limit)`: Paginated document listing
- `count_documents()`: Total document count
- `update_document(document_id, update_data)`: Update document metadata
- `delete_document(document_id)`: Remove the document, and its content if no other document still references it
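
The orphan cleanup in `delete_document` can be sketched with plain dictionaries. The function below is a hypothetical in-memory version, assuming content counts as orphaned once no remaining file document carries its `file_hash`; the real service would perform the same check through its repositories.

```python
def delete_document(files: dict, contents: dict, document_id: str) -> bool:
    """Remove a file document; drop its content if nothing else references it."""
    doc = files.pop(document_id, None)
    if doc is None:
        return False  # nothing to delete
    file_hash = doc["file_hash"]
    still_referenced = any(d["file_hash"] == file_hash for d in files.values())
    if not still_referenced:
        contents.pop(file_hash, None)
    return True


files = {
    "1": {"filepath": "/watched_files/a.txt", "file_hash": "abc"},
    "2": {"filepath": "/watched_files/b.txt", "file_hash": "abc"},
}
contents = {"abc": {"content": "shared text"}}

delete_document(files, contents, "1")  # content kept: document "2" still uses it
kept = "abc" in contents
delete_document(files, contents, "2")  # last reference gone: content removed
removed = "abc" not in contents
```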
### Repository Dependencies
The document service coordinates two existing repositories:
#### File Repository (`file_repository.py`)
- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
- `find_document_by_filepath()`, `find_document_by_name()`
- `list_documents()`, `count_documents()`
- `update_document()`, `delete_document()`
#### Document Content Repository (`document_content_repository.py`)
- `create_document_content()`, `find_document_content_by_id()`
- `find_document_content_by_file_hash()`, `content_exists()`
- `update_document_content()`, `delete_document_content()`
- `list_document_contents()`, `count_document_contents()`
### Dependencies
- `python-magic`: MIME type detection
- `hashlib`: SHA256 hashing (standard library)
- `pymongo`: MongoDB driver, used here for transaction support
## Key Implementation Notes
### Python Standards