Adding document service
129
Readme.md
@@ -232,19 +232,11 @@ Stores file metadata and extracted content:
   "filename": "document.pdf",
   "filepath": "/watched_files/document.pdf",
   "file_type": "pdf",
-  "mime_type": "application/pdf",
-  "file_size": 2048576,
-  "content": "extracted text content...",
-  "encoding": "utf-8",
-  "extraction_method": "direct_text",
-  // direct_text, ocr, hybrid
+  "extraction_method": "direct_text", // direct_text, ocr, hybrid
   "metadata": {
-    "page_count": 15,
-    // for PDFs
-    "word_count": 250,
-    // for text files
-    "image_dimensions": {
-    // for images
+    "page_count": 15, // for PDFs
+    "word_count": 250, // for text files
+    "image_dimensions": { // for images
       "width": 1920,
       "height": 1080
     }
@@ -253,6 +245,19 @@ Stores file metadata and extracted content:
   "file_hash": "sha256_hash_value"
 }
 ```
+#### Document Contents Collection
+
+Stores actual file content and technical metadata:
+```json
+{
+  "_id": "ObjectId",
+  "file_hash": "sha256_hash_value",
+  "content": "extracted text content...",
+  "encoding": "utf-8",
+  "file_size": 2048576,
+  "mime_type": "application/pdf"
+}
+```
 
 #### Processing Jobs Collection
 
@@ -272,6 +277,25 @@ Tracks processing status and lifecycle:
 }
 ```
 
+### Data Storage Strategy
+
+- **Choice**: Three separate collections for files, content, and processing status
+- **Rationale**: Normalization prevents content duplication when multiple files have identical content
+- **Benefits**:
+  - Content deduplication via SHA256 hash
+  - Better query performance for metadata vs content searches
+  - Clear separation of concerns between file metadata, content, and processing lifecycle
+  - Multiple files can reference the same content (e.g., identical copies in different locations)
+
+### Content Storage Location
+
+- **Choice**: Store extracted content in separate `document_contents` collection
+- **Rationale**: Content normalization and deduplication
+- **Benefits**:
+  - Single content storage per unique file hash
+  - Multiple file entries can reference same content
+  - Efficient storage for duplicate files
+
 ### Supported File Types (Initial Implementation)
 
 - **Text Files** (`.txt`): Direct content reading
@@ -323,6 +347,87 @@ Tracks processing status and lifecycle:
 - **Extensible Metadata**: Flexible metadata storage per file type
 - **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
 
+## Document Service Architecture
+
+### Service Overview
+
+The document service provides orchestrated access to file documents and their content through a single interface that coordinates between `FileDocument` and `DocumentContent` repositories.
+
+### Service Design
+
+- **Architecture Pattern**: Service orchestration with separate repositories
+- **Transaction Support**: MongoDB ACID transactions for data consistency
+- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
+- **Error Handling**: MongoDB standard exceptions with transaction rollback
+
+### Document Service (`document_service.py`)
+
+Orchestrates operations between file and content repositories while maintaining data consistency.
+
+#### Core Functionality
+
+##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
+
+Creates a new document with automatic attribute calculation and content deduplication.
+
+**Automatic Calculations:**
+- `file_hash`: SHA256 hash of file bytes
+- `file_type`: Detection based on file extension
+- `mime_type`: Detection via `python-magic` library
+- `file_size`: Length of provided bytes
+- `detected_at`: Current timestamp
+- `metadata`: Empty dictionary (reserved for future extension)
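
The automatic calculations listed above can be sketched with the standard library alone. This is only an illustration, not the code in `document_service.py`: the helper name `compute_attributes` is hypothetical, and `mimetypes.guess_type` (which looks at the file name) stands in for `python-magic` (which inspects the bytes) so the example runs without extra dependencies.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def compute_attributes(file_path: str, file_bytes: bytes) -> dict:
    """Derive the automatically calculated attributes for a new document.

    Hypothetical helper; the real service may structure this differently.
    """
    path = Path(file_path)
    # Assumption: stand-in for python-magic, guessing from the file name only.
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_hash": hashlib.sha256(file_bytes).hexdigest(),
        "file_type": path.suffix.lower().lstrip("."),  # extension-based
        "mime_type": mime_type or "application/octet-stream",
        "file_size": len(file_bytes),
        "detected_at": datetime.now(timezone.utc),
        "metadata": {},  # reserved for future extension
    }


attrs = compute_attributes("/watched_files/notes.txt", b"hello")
```

Because the hash depends only on the bytes, two files with different paths but identical content produce the same `file_hash`, which is what makes the deduplication step possible.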
+
+**Deduplication Logic:**
+1. Calculate SHA256 hash of file content
+2. Check if `DocumentContent` with this hash already exists
+3. If EXISTS: Create only `FileDocument` referencing existing content
+4. If NOT EXISTS: Create both `FileDocument` and `DocumentContent` in transaction
+
+**Transaction Flow:**
+```
+BEGIN TRANSACTION
+  IF content_exists(file_hash):
+    CREATE FileDocument with content reference
+  ELSE:
+    CREATE DocumentContent
+    CREATE FileDocument with content reference
+COMMIT TRANSACTION
+```
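
To make the flow above concrete, here is a minimal in-memory sketch. It does not use MongoDB: `InMemoryStore` and its snapshot-based `transaction()` are hypothetical stand-ins that only emulate commit and rollback, and the empty-path check is an invented failure used to trigger a rollback. A real implementation would presumably run the same steps inside a `pymongo` client session.

```python
import copy
import hashlib
from contextlib import contextmanager


class InMemoryStore:
    """Toy stand-in for the file and content collections (assumption)."""

    def __init__(self):
        self.files = []     # FileDocument-like records
        self.contents = {}  # DocumentContent-like records keyed by file_hash

    @contextmanager
    def transaction(self):
        # Snapshot the state; restore it if the block raises (rollback).
        snapshot = (copy.deepcopy(self.files), copy.deepcopy(self.contents))
        try:
            yield
        except Exception:
            self.files, self.contents = snapshot
            raise


def create_document(store, file_path, file_bytes, encoding):
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    with store.transaction():
        # Create the content record only if this hash is not stored yet.
        if file_hash not in store.contents:
            store.contents[file_hash] = {
                "file_hash": file_hash,
                "content": file_bytes.decode(encoding),
                "encoding": encoding,
            }
        # Hypothetical validation, placed here to show a mid-flow failure.
        if not file_path:
            raise ValueError("file_path must not be empty")
        file_doc = {"filepath": file_path, "file_hash": file_hash}
        store.files.append(file_doc)
    return file_doc


store = InMemoryStore()
create_document(store, "/watched_files/a.txt", b"same", "utf-8")
create_document(store, "/watched_files/copy/a.txt", b"same", "utf-8")
```

After these two calls the store holds two file records but a single content record. If the second step fails after the content was written, the snapshot restore removes the content again, which mirrors what a rolled-back transaction would do.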
+
+#### Available Methods
+
+- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
+- `get_document_by_id(document_id)`: Retrieve by document ID
+- `get_document_by_hash(file_hash)`: Retrieve by file hash
+- `get_document_by_filepath(filepath)`: Retrieve by file path
+- `list_documents(skip, limit)`: Paginated document listing
+- `count_documents()`: Total document count
+- `update_document(document_id, update_data)`: Update document metadata
+- `delete_document(document_id)`: Remove document and orphaned content
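
One plausible reading of "remove document and orphaned content" is that content is deleted only when no remaining file record references its hash. The sketch below illustrates that reading with plain dictionaries; the function signature and data layout are assumptions, not the service's actual API.

```python
def delete_document(files: dict, contents: dict, document_id: str) -> bool:
    """Remove a file record, then its content if nothing else references it."""
    file_doc = files.pop(document_id, None)
    if file_doc is None:
        return False  # unknown document ID, nothing to delete
    file_hash = file_doc["file_hash"]
    # Content is orphaned when no remaining file shares its hash.
    still_referenced = any(f["file_hash"] == file_hash for f in files.values())
    if not still_referenced:
        contents.pop(file_hash, None)
    return True


files = {
    "1": {"filepath": "/watched_files/a.txt", "file_hash": "abc"},
    "2": {"filepath": "/watched_files/copy/a.txt", "file_hash": "abc"},
}
contents = {"abc": {"content": "same"}}

delete_document(files, contents, "1")  # content kept: file "2" still uses it
delete_document(files, contents, "2")  # last reference gone: content removed
```

In the real service both steps would need to run inside the same transaction, so that a concurrent `create_document` call cannot reference content that is about to be removed.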
+
+### Repository Dependencies
+
+The document service coordinates two existing repositories:
+
+#### File Repository (`file_repository.py`)
+- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
+- `find_document_by_filepath()`, `find_document_by_name()`
+- `list_documents()`, `count_documents()`
+- `update_document()`, `delete_document()`
+
+#### Document Content Repository (`document_content_repository.py`)
+- `create_document_content()`, `find_document_content_by_id()`
+- `find_document_content_by_file_hash()`, `content_exists()`
+- `update_document_content()`, `delete_document_content()`
+- `list_document_contents()`, `count_document_contents()`
+
+### Dependencies
+
+- `python-magic`: MIME type detection
+- `hashlib`: SHA256 hashing (standard library)
+- `pymongo`: MongoDB transactions support
+
 ## Key Implementation Notes
 
 ### Python Standards