Implemented default pipeline

This commit is contained in:
2025-09-26 22:08:39 +02:00
parent f1b551d243
commit 4de732b0ae
56 changed files with 4534 additions and 2837 deletions

Readme.md

@@ -13,7 +13,7 @@ architecture with Redis for task queuing and MongoDB for data persistence.
- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library
@@ -95,25 +95,32 @@ MyDocManager/
│ │ ├── requirements.txt
│ │ ├── app/
│ │ │ ├── main.py
│ │ │ ├── file_watcher.py # FileWatcher class with observer thread
│ │ │ ├── celery_app.py # Celery configuration
│ │ │ ├── config/
│ │ │ │ ├── __init__.py
│ │ │ │ └── settings.py # JWT, MongoDB config
│ │ │ ├── models/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user.py # User Pydantic models
│ │ │ │ ├── auth.py # Auth Pydantic models
│ │ │ │ ├── document.py # Document Pydantic models
│ │ │ │ ├── job.py # Job Processing Pydantic models
│ │ │ │ └── types.py # PyObjectId and other useful types
│ │ │ ├── database/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── connection.py # MongoDB connection (pymongo)
│ │ │ │ └── repositories/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user_repository.py # User CRUD operations (synchronous)
│ │ │ │ ├── document_repository.py # Document CRUD operations (synchronous)
│ │ │ │ └── job_repository.py # Job CRUD operations (synchronous)
│ │ │ ├── services/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── auth_service.py # JWT & password logic (synchronous)
│ │ │ │ ├── user_service.py # User business logic (synchronous)
│ │ │ │ ├── document_service.py # Document business logic (synchronous)
│ │ │ │ ├── job_service.py # Job processing logic (synchronous)
│ │ │ │ └── init_service.py # Admin creation at startup
│ │ │ ├── api/
│ │ │ │ ├── __init__.py
@@ -125,7 +132,7 @@ MyDocManager/
│ │ │ └── utils/
│ │ │ ├── __init__.py
│ │ │ ├── security.py # Password utilities
│ │ │ └── document_matching.py # Fuzzy matching algorithms
│ ├── worker/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
@@ -133,7 +140,13 @@ MyDocManager/
│ └── frontend/
│ ├── Dockerfile
│ ├── package.json
│ ├── index.html
│ └── src/
│ ├── assets/
│ ├── App.css
│ ├── App.jsx
│ ├── main.css
│ └── main.jsx
├── tests/
│ ├── file-processor/
│ │ ├── test_auth/
@@ -224,78 +237,76 @@ On first startup, the application automatically creates a default admin user:
#### Files Collection

Stores file metadata and extracted content using Pydantic models. Example document:

```json
{
  "_id": "ObjectId",
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
  "extraction_method": "direct_text", // direct_text, ocr, hybrid
  "metadata": {
    "page_count": 15,                 // for PDFs
    "word_count": 250,                // for text files
    "image_dimensions": {             // for images
      "width": 1920,
      "height": 1080
    }
  },
  "detected_at": "2024-01-15T10:29:00Z",
  "file_hash": "sha256_hash_value"
}
```

Corresponding Pydantic model:

```python
class FileDocument(BaseModel):
    """
    Model for file documents stored in the 'files' collection.

    Represents a file detected in the watched directory with its
    metadata and extracted content.
    """
    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    filename: str = Field(..., description="Original filename")
    filepath: str = Field(..., description="Full path to the file")
    file_type: FileType = Field(..., description="Type of the file")
    extraction_method: Optional[ExtractionMethod] = Field(default=None, description="Method used to extract content")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="File-specific metadata")
    detected_at: Optional[datetime] = Field(default=None, description="Timestamp when file was detected")
    file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
    encoding: str = Field(default="utf-8", description="Character encoding for text files")
    file_size: int = Field(..., ge=0, description="File size in bytes")
    mime_type: str = Field(..., description="MIME type detected")

    @field_validator('filepath')
    @classmethod
    def validate_filepath(cls, v: str) -> str:
        """Validate filepath format."""
        if not v.strip():
            raise ValueError("Filepath cannot be empty")
        return v.strip()

    @field_validator('filename')
    @classmethod
    def validate_filename(cls, v: str) -> str:
        """Validate filename format."""
        if not v.strip():
            raise ValueError("Filename cannot be empty")
        return v.strip()
```

#### Document Contents Collection

Stores actual file content and technical metadata:

```json
{
  "_id": "ObjectId",
  "file_hash": "sha256_hash_value",
  "content": "extracted text content...",
  "encoding": "utf-8",
  "file_size": 2048576,
  "mime_type": "application/pdf"
}
```

#### Processing Jobs Collection

Tracks processing status and lifecycle. Example document:

```json
{
  "_id": "ObjectId",
  "file_id": "reference_to_files_collection",
  "status": "completed",             // pending, processing, completed, failed
  "task_id": "celery_task_uuid",
  "created_at": "2024-01-15T10:29:00Z",
  "started_at": "2024-01-15T10:29:30Z",
  "completed_at": "2024-01-15T10:30:00Z",
  "error_message": null
}
```

Corresponding Pydantic model:

```python
class ProcessingJob(BaseModel):
    """
    Model for processing jobs stored in the 'processing_jobs' collection.

    Tracks the lifecycle and status of document processing tasks.
    """
    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    file_id: PyObjectId = Field(..., description="Reference to file document")
    status: ProcessingStatus = Field(default=ProcessingStatus.PENDING, description="Current processing status")
    task_id: Optional[str] = Field(default=None, description="Celery task UUID")
    created_at: Optional[datetime] = Field(default=None, description="Timestamp when job was created")
    started_at: Optional[datetime] = Field(default=None, description="Timestamp when processing started")
    completed_at: Optional[datetime] = Field(default=None, description="Timestamp when processing completed")
    error_message: Optional[str] = Field(default=None, description="Error message if processing failed")

    @field_validator('error_message')
    @classmethod
    def validate_error_message(cls, v: Optional[str]) -> Optional[str]:
        """Clean up error message."""
        if v is not None:
            return v.strip() if v.strip() else None
        return v
```
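
For reference, a minimal sketch of how such a validated model could be persisted with synchronous pymongo. The connection URI, database name, and the `FileType.PDF` member are assumptions; the real wiring lives in `database/connection.py` and the repository layer.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Assumed connection details -- the real values come from config/settings.py
client = MongoClient("mongodb://localhost:27017")
files_collection = client["mydocmanager"]["files"]

# Validators run on construction (filename/filepath are stripped and checked)
file_doc = FileDocument(
    filename="document.pdf",
    filepath="/watched_files/document.pdf",
    file_type=FileType.PDF,            # assumed enum member name
    file_size=2_048_576,
    mime_type="application/pdf",
    detected_at=datetime.now(timezone.utc),
)

# Serialize with the MongoDB alias (_id) and insert synchronously.
# Assumes FileType/ExtractionMethod are str-based enums so pymongo can encode them.
result = files_collection.insert_one(file_doc.model_dump(by_alias=True, exclude_none=True))
print(result.inserted_id)
```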
### Data Storage Strategy

- **Choice**: Three separate collections for files, content, and processing status
- **Rationale**: Normalization prevents content duplication when multiple files have identical content
- **Benefits**:
  - Content deduplication via SHA256 hash
  - Better query performance for metadata vs content searches
  - Clear separation of concerns between file metadata, content, and processing lifecycle
  - Multiple files can reference the same content (e.g., identical copies in different locations)

### Content Storage Location

- **Choice**: Store extracted content in separate `document_contents` collection
- **Rationale**: Content normalization and deduplication
- **Benefits**:
  - Single content storage per unique file hash
  - Multiple file entries can reference same content
  - Efficient storage for duplicate files
### Supported File Types (Initial Implementation)
- **Text Files** (`.txt`): Direct content reading
@@ -306,7 +317,7 @@ Tracks processing status and lifecycle:
#### Watchdog Implementation
- **Choice**: Dedicated observer thread
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in a separate thread from FastAPI (see the sketch below)
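
A minimal sketch of that pattern; the handler behavior and default path are assumptions, and the real class lives in `app/file_watcher.py`:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class NewFileHandler(FileSystemEventHandler):
    """React to files appearing in the watched directory."""

    def on_created(self, event):
        if not event.is_directory:
            # Hypothetical dispatch -- the real code enqueues a Celery task here
            print(f"Detected new file: {event.src_path}")


class FileWatcher:
    """Run a watchdog observer alongside FastAPI; the observer has its own thread."""

    def __init__(self, watch_path: str = "/watched_files"):
        self._observer = Observer()
        self._observer.schedule(NewFileHandler(), watch_path, recursive=True)

    def start(self) -> None:
        self._observer.start()      # non-blocking: starts the observer thread

    def stop(self) -> None:
        self._observer.stop()
        self._observer.join()
```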
@@ -327,17 +338,94 @@ Tracks processing status and lifecycle:
#### Content Storage Location

- **Choice**: Store raw files in the file system, using the SHA256 hash as the filename (see the sketch below)
- **Rationale**: MongoDB is not designed for storing large files; keeping files on disk performs better and leaves them easily accessible
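
A short sketch of this storage convention; the storage root is an assumption:

```python
import hashlib
from pathlib import Path

STORAGE_ROOT = Path("/data/files")  # assumed location of the file store


def store_file(file_bytes: bytes) -> str:
    """Write the raw bytes under their SHA256 hash and return the hash."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    target = STORAGE_ROOT / file_hash
    if not target.exists():         # identical content is stored only once
        target.write_bytes(file_bytes)
    return file_hash
```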
#### Repository and Services Implementation

- **Choice**: Synchronous implementation using pymongo
- **Rationale**: Full compatibility with Celery workers and a simplified workflow
- **Implementation**: All repositories and services operate synchronously for seamless integration (see the sketch below)
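
A minimal sketch of what the synchronous connection helper might look like; the URI and database name are assumptions, and the real code lives in `database/connection.py`:

```python
from pymongo import MongoClient
from pymongo.database import Database

# Assumed settings -- the real values come from config/settings.py
MONGODB_URI = "mongodb://mongo:27017"
DATABASE_NAME = "mydocmanager"

_client: MongoClient | None = None


def get_database() -> Database:
    """Return a synchronous pymongo database handle, creating the client once."""
    global _client
    if _client is None:
        _client = MongoClient(MONGODB_URI)
    return _client[DATABASE_NAME]
```

Because pymongo is blocking, the same helper can be called from FastAPI request handlers and from Celery worker processes without an event loop.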
### Implementation Status

1. ✅ Pydantic models for MongoDB collections
2. ✅ Repository layer for data access (files + processing_jobs + users + documents) - synchronous
3. ✅ Service layer for business logic (auth, user, document, job) - synchronous
4. ✅ Celery tasks for document processing
5. ✅ Watchdog file monitoring implementation
6. ✅ FastAPI integration and startup coordination
## Job Management Layer
### Repository Pattern Implementation
The job management system follows the repository pattern for clean separation between data access and business logic.
#### JobRepository
Handles direct MongoDB operations for processing jobs using synchronous pymongo:
**CRUD Operations:**
- `create_job()` - Create new processing job with automatic `created_at` timestamp
- `get_job_by_id()` - Retrieve job by ObjectId
- `update_job_status()` - Update job status with automatic timestamp management
- `delete_job()` - Remove job from database
- `get_jobs_by_file_id()` - Get all jobs for specific file
- `get_jobs_by_status()` - Get jobs filtered by processing status
**Automatic Timestamp Management** (see the sketch after this list):
- `created_at`: Set automatically during job creation
- `started_at`: Set automatically when status changes to PROCESSING
- `completed_at`: Set automatically when status changes to COMPLETED or FAILED
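
A sketch of how the timestamp handling might be implemented; the collection name and status values come from the sections above, everything else is an assumption:

```python
from datetime import datetime, timezone

from bson import ObjectId
from pymongo.database import Database


class JobRepository:
    """Synchronous data access for the 'processing_jobs' collection."""

    def __init__(self, db: Database):
        self._collection = db["processing_jobs"]

    def update_job_status(self, job_id: ObjectId, status: str,
                          error_message: str | None = None) -> bool:
        """Update the status and set the matching timestamp automatically."""
        update: dict = {"status": status}
        now = datetime.now(timezone.utc)
        if status == "processing":
            update["started_at"] = now
        elif status in ("completed", "failed"):
            update["completed_at"] = now
        if error_message is not None:
            update["error_message"] = error_message
        result = self._collection.update_one({"_id": job_id}, {"$set": update})
        return result.modified_count == 1
```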
#### JobService
Provides synchronous business logic layer with strict status transition validation:
**Status Transition Methods:**
- `mark_job_as_started()` - PENDING → PROCESSING
- `mark_job_as_completed()` - PROCESSING → COMPLETED
- `mark_job_as_failed()` - PROCESSING → FAILED
**Validation Rules:**
- Strict status transitions (invalid transitions raise exceptions)
- Job existence verification before any operation
- Automatic timestamp management through repository layer
#### Custom Exceptions

- **InvalidStatusTransitionError**: Raised for invalid status transitions
- **JobRepositoryError**: Raised for MongoDB operation failures
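
Both can be small `Exception` subclasses; a minimal sketch (module path per the file structure below, constructor details assumed):

```python
# exceptions/job_exceptions.py (sketch)

class JobRepositoryError(Exception):
    """Raised when a MongoDB operation on the processing_jobs collection fails."""


class InvalidStatusTransitionError(Exception):
    """Raised when a job status change violates the allowed transition rules."""

    def __init__(self, current_status: str, target_status: str):
        super().__init__(f"Cannot transition job from {current_status} to {target_status}")
        self.current_status = current_status
        self.target_status = target_status
```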
#### Valid Status Transitions
```
PENDING → PROCESSING (via mark_job_as_started)
PROCESSING → COMPLETED (via mark_job_as_completed)
PROCESSING → FAILED (via mark_job_as_failed)
```
All other transitions are forbidden and will raise `InvalidStatusTransitionError`.
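
A sketch of how the service layer could enforce these rules, building on the repository and exception sketches above; method names follow the lists in this section, the rest is assumed:

```python
class JobService:
    """Synchronous business logic for processing jobs with strict transitions."""

    # Allowed (current -> target) pairs
    _ALLOWED_TRANSITIONS = {
        ("pending", "processing"),
        ("processing", "completed"),
        ("processing", "failed"),
    }

    def __init__(self, repository: JobRepository):
        self._repository = repository

    def _transition(self, job_id, target: str, error_message: str | None = None) -> None:
        job = self._repository.get_job_by_id(job_id)
        if job is None:
            raise JobRepositoryError(f"Job {job_id} does not exist")
        current = getattr(job.status, "value", job.status)  # enum or plain string
        if (current, target) not in self._ALLOWED_TRANSITIONS:
            raise InvalidStatusTransitionError(current, target)
        self._repository.update_job_status(job_id, target, error_message)

    def mark_job_as_started(self, job_id) -> None:
        self._transition(job_id, "processing")

    def mark_job_as_completed(self, job_id) -> None:
        self._transition(job_id, "completed")

    def mark_job_as_failed(self, job_id, error_message: str) -> None:
        self._transition(job_id, "failed", error_message)
```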
### File Structure
```
src/file-processor/app/
├── database/repositories/
│ ├── job_repository.py # JobRepository class (synchronous)
│ ├── user_repository.py # UserRepository class (synchronous)
│ ├── document_repository.py # DocumentRepository class (synchronous)
│ └── file_repository.py # FileRepository class (synchronous)
├── services/
│ ├── job_service.py # JobService class (synchronous)
│ ├── auth_service.py # AuthService class (synchronous)
│ ├── user_service.py # UserService class (synchronous)
│ └── document_service.py # DocumentService class (synchronous)
└── exceptions/
└── job_exceptions.py # Custom exceptions
```
### Processing Pipeline Features
@@ -346,87 +434,7 @@ Tracks processing status and lifecycle:
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
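
A hedged sketch of how a worker task could tie these features together; the task name, broker URLs, and the commented-out service calls are assumptions, not the project's actual code:

```python
from pathlib import Path

from celery import Celery

# Assumed broker/backend URLs -- the real configuration lives in celery_app.py
app = Celery("worker", broker="redis://redis:6379/0", backend="redis://redis:6379/1")


@app.task(name="process_document")
def process_document(job_id: str, filepath: str) -> str:
    """Extract text from one file and report the outcome on its processing job."""
    # The real pipeline would call JobService.mark_job_as_started(job_id) first.
    try:
        # Direct text extraction only in this sketch; PDF (PyMuPDF/pdfplumber),
        # DOCX (python-docx) and OCR (EasyOCR) branches would be selected per file type.
        text = Path(filepath).read_text(encoding="utf-8", errors="ignore")
        # ... store extracted content, then JobService.mark_job_as_completed(job_id) ...
        return text
    except Exception:
        # ... JobService.mark_job_as_failed(job_id, error_message) ...
        raise
```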
## Document Service Architecture
### Service Overview
The document service provides orchestrated access to file documents and their content through a single interface that coordinates between `FileDocument` and `DocumentContent` repositories.
### Service Design
- **Architecture Pattern**: Service orchestration with separate repositories
- **Transaction Support**: MongoDB ACID transactions for data consistency
- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
- **Error Handling**: MongoDB standard exceptions with transaction rollback
### Document Service (`document_service.py`)
Orchestrates operations between file and content repositories while maintaining data consistency.
#### Core Functionality
##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
Creates a new document with automatic attribute calculation and content deduplication.
**Automatic Calculations:**
- `file_hash`: SHA256 hash of file bytes
- `file_type`: Detection based on file extension
- `mime_type`: Detection via `python-magic` library
- `file_size`: Length of provided bytes
- `detected_at`: Current timestamp
- `metadata`: Empty dictionary (reserved for future extension)
**Deduplication Logic:**
1. Calculate SHA256 hash of file content
2. Check if `DocumentContent` with this hash already exists
3. If EXISTS: Create only `FileDocument` referencing existing content
4. If NOT EXISTS: Create both `FileDocument` and `DocumentContent` in transaction
**Transaction Flow:**
```
BEGIN TRANSACTION
    IF content_exists(file_hash):
        CREATE FileDocument with content reference
    ELSE:
        CREATE DocumentContent
        CREATE FileDocument with content reference
COMMIT TRANSACTION
```
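
A minimal sketch of this flow with pymongo sessions. It assumes MongoDB runs as a replica set (required for multi-document transactions); collection names follow the schema above, connection details are assumptions, and the extracted-text step is omitted:

```python
import hashlib
from pathlib import Path

import magic  # python-magic, used for MIME type detection
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")  # assumed URI
db = client["mydocmanager"]                    # assumed database name


def create_document(file_path: str, file_bytes: bytes, encoding: str = "utf-8") -> dict:
    """Create a FileDocument entry, reusing existing DocumentContent when the hash matches."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    mime_type = magic.from_buffer(file_bytes, mime=True)

    with client.start_session() as session:
        with session.start_transaction():
            exists = db["document_contents"].find_one({"file_hash": file_hash}, session=session)
            if exists is None:
                db["document_contents"].insert_one(
                    {
                        "file_hash": file_hash,
                        "encoding": encoding,
                        "file_size": len(file_bytes),
                        "mime_type": mime_type,
                        # extracted "content" would be added here after text extraction
                    },
                    session=session,
                )
            file_doc = {
                "filename": Path(file_path).name,
                "filepath": file_path,
                "file_hash": file_hash,
                "file_size": len(file_bytes),
                "mime_type": mime_type,
            }
            db["files"].insert_one(file_doc, session=session)
    return file_doc
```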
#### Available Methods
- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
- `get_document_by_id(document_id)`: Retrieve by document ID
- `get_document_by_hash(file_hash)`: Retrieve by file hash
- `get_document_by_filepath(filepath)`: Retrieve by file path
- `list_documents(skip, limit)`: Paginated document listing
- `count_documents()`: Total document count
- `update_document(document_id, update_data)`: Update document metadata
- `delete_document(document_id)`: Remove document and orphaned content
### Repository Dependencies
The document service coordinates two existing repositories:
#### File Repository (`file_repository.py`)
- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
- `find_document_by_filepath()`, `find_document_by_name()`
- `list_documents()`, `count_documents()`
- `update_document()`, `delete_document()`
#### Document Content Repository (`document_content_repository.py`)
- `create_document_content()`, `find_document_content_by_id()`
- `find_document_content_by_file_hash()`, `content_exists()`
- `update_document_content()`, `delete_document_content()`
- `list_document_contents()`, `count_document_contents()`
### Dependencies
- `python-magic`: MIME type detection
- `hashlib`: SHA256 hashing (standard library)
- `pymongo`: MongoDB transactions support
- **Synchronous Operations**: All database operations use pymongo for Celery compatibility
## Key Implementation Notes
@@ -449,6 +457,7 @@ The document service coordinates two existing repositories:
- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible
- **Database Driver**: pymongo for synchronous MongoDB operations
### Testing Strategy
@@ -473,6 +482,7 @@ The document service coordinates two existing repositories:
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility
### Development Process Requirements
@@ -483,21 +493,15 @@ The document service coordinates two existing repositories:
### Next Implementation Steps
1. ✅ Create docker-compose.yml with all services => Done
2. ✅ Define user management and authentication architecture => Done
3. ✅ Implement user models and authentication services =>
   1. models/user.py => Done
   2. models/auth.py => Done
   3. database/repositories/user_repository.py => Done
4. ✅ Add automatic admin user creation if it does not exist => Done
5. **TODO**: Complete file processing pipeline =>
   1. ✅ Create Pydantic models for files and processing_jobs collections
   2. ✅ Implement repository layer for file and processing job data access (synchronous)
   3. ✅ Implement service layer for business logic (synchronous)
   4. ✅ Create Celery tasks for document processing (.txt, .pdf, .docx)
   5. ✅ Implement Watchdog file monitoring with dedicated observer
   6. ✅ Integrate file watcher with FastAPI startup
6. Create protected API routes for user management
7. Build React monitoring interface with authentication
## Annexes
@@ -586,4 +590,4 @@ docker-compose up --scale worker=3
- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`