Implemented default pipeline

This commit is contained in:
2025-09-26 22:08:39 +02:00
parent f1b551d243
commit 4de732b0ae
56 changed files with 4534 additions and 2837 deletions

Readme.md

@@ -13,7 +13,7 @@ architecture with Redis for task queuing and MongoDB for data persistence.
- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library
@@ -95,25 +95,32 @@ MyDocManager/
│ │ ├── requirements.txt
│ │ ├── app/
│ │ │ ├── main.py
│ │ │ ├── file_watcher.py # FileWatcher class with observer thread
│ │ │ ├── celery_app.py # Celery configuration
│ │ │ ├── config/
│ │ │ │ ├── __init__.py
│ │ │ │ └── settings.py # JWT, MongoDB config
│ │ │ ├── models/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user.py # User Pydantic models
│ │ │ │ ├── auth.py # Auth Pydantic models
│ │ │ │ ├── document.py # Document Pydantic models
│ │ │ │ ├── job.py # Job Processing Pydantic models
│ │ │ │ └── types.py # PyObjectId and other useful types
│ │ │ ├── database/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── connection.py # MongoDB connection (pymongo)
│ │ │ │ └── repositories/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user_repository.py # User CRUD operations (synchronous)
│ │ │ │ ├── document_repository.py # Document CRUD operations (synchronous)
│ │ │ │ └── job_repository.py # Job CRUD operations (synchronous)
│ │ │ ├── services/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── auth_service.py # JWT & password logic (synchronous)
│ │ │ │ ├── user_service.py # User business logic (synchronous)
│ │ │ │ ├── document_service.py # Document business logic (synchronous)
│ │ │ │ ├── job_service.py # Job processing logic (synchronous)
│ │ │ │ └── init_service.py # Admin creation at startup
│ │ │ ├── api/
│ │ │ │ ├── __init__.py
@@ -125,7 +132,7 @@ MyDocManager/
│ │ │ └── utils/
│ │ │ ├── __init__.py
│ │ │ ├── security.py # Password utilities
│ │ │ └── document_matching.py # Fuzzy matching algorithms
│ ├── worker/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
@@ -133,7 +140,13 @@ MyDocManager/
│ └── frontend/
│ ├── Dockerfile
│ ├── package.json
│ ├── index.html
│ └── src/
│ ├── assets/
│ ├── App.css
│ ├── App.jsx
│ ├── main.css
│ └── main.jsx
├── tests/
│ ├── file-processor/
│ │ ├── test_auth/
@@ -224,78 +237,76 @@ On first startup, the application automatically creates a default admin user:
#### Files Collection

Stores file metadata and extracted content using Pydantic models. Example document:

```json
{
  "_id": "ObjectId",
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
  "extraction_method": "direct_text", // direct_text, ocr, hybrid
  "metadata": {
    "page_count": 15,                 // for PDFs
    "word_count": 250,                // for text files
    "image_dimensions": {             // for images
      "width": 1920,
      "height": 1080
    }
  },
  "detected_at": "2024-01-15T10:29:00Z",
  "file_hash": "sha256_hash_value"
}
```

Corresponding Pydantic model:

```python
class FileDocument(BaseModel):
    """
    Model for file documents stored in the 'files' collection.

    Represents a file detected in the watched directory with its
    metadata and extracted content.
    """
    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    filename: str = Field(..., description="Original filename")
    filepath: str = Field(..., description="Full path to the file")
    file_type: FileType = Field(..., description="Type of the file")
    extraction_method: Optional[ExtractionMethod] = Field(default=None, description="Method used to extract content")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="File-specific metadata")
    detected_at: Optional[datetime] = Field(default=None, description="Timestamp when file was detected")
    file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
    encoding: str = Field(default="utf-8", description="Character encoding for text files")
    file_size: int = Field(..., ge=0, description="File size in bytes")
    mime_type: str = Field(..., description="MIME type detected")

    @field_validator('filepath')
    @classmethod
    def validate_filepath(cls, v: str) -> str:
        """Validate filepath format."""
        if not v.strip():
            raise ValueError("Filepath cannot be empty")
        return v.strip()

    @field_validator('filename')
    @classmethod
    def validate_filename(cls, v: str) -> str:
        """Validate filename format."""
        if not v.strip():
            raise ValueError("Filename cannot be empty")
        return v.strip()
```

#### Document Contents Collection

Stores actual file content and technical metadata:

```json
{
  "_id": "ObjectId",
  "file_hash": "sha256_hash_value",
  "content": "extracted text content...",
  "encoding": "utf-8",
  "file_size": 2048576,
  "mime_type": "application/pdf"
}
```

#### Processing Jobs Collection

Tracks processing status and lifecycle. Example document:

```json
{
  "_id": "ObjectId",
  "file_id": "reference_to_files_collection",
  "status": "completed",             // pending, processing, completed, failed
  "task_id": "celery_task_uuid",
  "created_at": "2024-01-15T10:29:00Z",
  "started_at": "2024-01-15T10:29:30Z",
  "completed_at": "2024-01-15T10:30:00Z",
  "error_message": null
}
```

Corresponding Pydantic model:

```python
class ProcessingJob(BaseModel):
    """
    Model for processing jobs stored in the 'processing_jobs' collection.

    Tracks the lifecycle and status of document processing tasks.
    """
    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    file_id: PyObjectId = Field(..., description="Reference to file document")
    status: ProcessingStatus = Field(default=ProcessingStatus.PENDING, description="Current processing status")
    task_id: Optional[str] = Field(default=None, description="Celery task UUID")
    created_at: Optional[datetime] = Field(default=None, description="Timestamp when job was created")
    started_at: Optional[datetime] = Field(default=None, description="Timestamp when processing started")
    completed_at: Optional[datetime] = Field(default=None, description="Timestamp when processing completed")
    error_message: Optional[str] = Field(default=None, description="Error message if processing failed")

    @field_validator('error_message')
    @classmethod
    def validate_error_message(cls, v: Optional[str]) -> Optional[str]:
        """Clean up error message."""
        if v is not None:
            return v.strip() if v.strip() else None
        return v
```
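
For reference, a minimal sketch of how such a validated model could be persisted with synchronous pymongo. The connection URI, database name, and the `FileType.PDF` member are assumptions; the real wiring lives in `database/connection.py` and the repository layer.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Assumed connection details -- the real values come from config/settings.py
client = MongoClient("mongodb://localhost:27017")
files_collection = client["mydocmanager"]["files"]

# Validators run on construction (filename/filepath are stripped and checked)
file_doc = FileDocument(
    filename="document.pdf",
    filepath="/watched_files/document.pdf",
    file_type=FileType.PDF,            # assumed enum member name
    file_size=2_048_576,
    mime_type="application/pdf",
    detected_at=datetime.now(timezone.utc),
)

# Serialize with the MongoDB alias (_id) and insert synchronously.
# Assumes FileType/ExtractionMethod are str-based enums so pymongo can encode them.
result = files_collection.insert_one(file_doc.model_dump(by_alias=True, exclude_none=True))
print(result.inserted_id)
```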
### Data Storage Strategy

- **Choice**: Three separate collections for files, content, and processing status
- **Rationale**: Normalization prevents content duplication when multiple files have identical content
- **Benefits**:
  - Content deduplication via SHA256 hash
  - Better query performance for metadata vs content searches
  - Clear separation of concerns between file metadata, content, and processing lifecycle
  - Multiple files can reference the same content (e.g., identical copies in different locations)

### Content Storage Location

- **Choice**: Store extracted content in separate `document_contents` collection
- **Rationale**: Content normalization and deduplication
- **Benefits**:
  - Single content storage per unique file hash
  - Multiple file entries can reference same content
  - Efficient storage for duplicate files
### Supported File Types (Initial Implementation)
- **Text Files** (`.txt`): Direct content reading
@@ -306,7 +317,7 @@ Tracks processing status and lifecycle:
#### Watchdog Implementation
- **Choice**: Dedicated observer thread
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in a separate thread from FastAPI (see the sketch below)
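
A minimal sketch of that pattern; the handler behavior and default path are assumptions, and the real class lives in `app/file_watcher.py`:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class NewFileHandler(FileSystemEventHandler):
    """React to files appearing in the watched directory."""

    def on_created(self, event):
        if not event.is_directory:
            # Hypothetical dispatch -- the real code enqueues a Celery task here
            print(f"Detected new file: {event.src_path}")


class FileWatcher:
    """Run a watchdog observer alongside FastAPI; the observer has its own thread."""

    def __init__(self, watch_path: str = "/watched_files"):
        self._observer = Observer()
        self._observer.schedule(NewFileHandler(), watch_path, recursive=True)

    def start(self) -> None:
        self._observer.start()      # non-blocking: starts the observer thread

    def stop(self) -> None:
        self._observer.stop()
        self._observer.join()
```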
@@ -327,17 +338,94 @@ Tracks processing status and lifecycle:
#### Content Storage Location

- **Choice**: Store raw files in the file system, using the SHA256 hash as the filename (see the sketch below)
- **Rationale**: MongoDB is not designed for storing large files; keeping files on disk performs better and leaves them easily accessible
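
A short sketch of this storage convention; the storage root is an assumption:

```python
import hashlib
from pathlib import Path

STORAGE_ROOT = Path("/data/files")  # assumed location of the file store


def store_file(file_bytes: bytes) -> str:
    """Write the raw bytes under their SHA256 hash and return the hash."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    target = STORAGE_ROOT / file_hash
    if not target.exists():         # identical content is stored only once
        target.write_bytes(file_bytes)
    return file_hash
```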
#### Repository and Services Implementation

- **Choice**: Synchronous implementation using pymongo
- **Rationale**: Full compatibility with Celery workers and a simplified workflow
- **Implementation**: All repositories and services operate synchronously for seamless integration (see the sketch below)
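
A minimal sketch of what the synchronous connection helper might look like; the URI and database name are assumptions, and the real code lives in `database/connection.py`:

```python
from pymongo import MongoClient
from pymongo.database import Database

# Assumed settings -- the real values come from config/settings.py
MONGODB_URI = "mongodb://mongo:27017"
DATABASE_NAME = "mydocmanager"

_client: MongoClient | None = None


def get_database() -> Database:
    """Return a synchronous pymongo database handle, creating the client once."""
    global _client
    if _client is None:
        _client = MongoClient(MONGODB_URI)
    return _client[DATABASE_NAME]
```

Because pymongo is blocking, the same helper can be called from FastAPI request handlers and from Celery worker processes without an event loop.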
### Implementation Status

1. ✅ Pydantic models for MongoDB collections
2. ✅ Repository layer for data access (files + processing_jobs + users + documents) - synchronous
3. ✅ Service layer for business logic (auth, user, document, job) - synchronous
4. ✅ Celery tasks for document processing
5. ✅ Watchdog file monitoring implementation
6. ✅ FastAPI integration and startup coordination
## Job Management Layer
### Repository Pattern Implementation
The job management system follows the repository pattern for clean separation between data access and business logic.
#### JobRepository
Handles direct MongoDB operations for processing jobs using synchronous pymongo:
**CRUD Operations:**
- `create_job()` - Create new processing job with automatic `created_at` timestamp
- `get_job_by_id()` - Retrieve job by ObjectId
- `update_job_status()` - Update job status with automatic timestamp management
- `delete_job()` - Remove job from database
- `get_jobs_by_file_id()` - Get all jobs for specific file
- `get_jobs_by_status()` - Get jobs filtered by processing status
**Automatic Timestamp Management** (see the sketch after this list):
- `created_at`: Set automatically during job creation
- `started_at`: Set automatically when status changes to PROCESSING
- `completed_at`: Set automatically when status changes to COMPLETED or FAILED
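
A sketch of how the timestamp handling might be implemented; the collection name and status values come from the sections above, everything else is an assumption:

```python
from datetime import datetime, timezone

from bson import ObjectId
from pymongo.database import Database


class JobRepository:
    """Synchronous data access for the 'processing_jobs' collection."""

    def __init__(self, db: Database):
        self._collection = db["processing_jobs"]

    def update_job_status(self, job_id: ObjectId, status: str,
                          error_message: str | None = None) -> bool:
        """Update the status and set the matching timestamp automatically."""
        update: dict = {"status": status}
        now = datetime.now(timezone.utc)
        if status == "processing":
            update["started_at"] = now
        elif status in ("completed", "failed"):
            update["completed_at"] = now
        if error_message is not None:
            update["error_message"] = error_message
        result = self._collection.update_one({"_id": job_id}, {"$set": update})
        return result.modified_count == 1
```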
#### JobService
Provides synchronous business logic layer with strict status transition validation:
**Status Transition Methods:**
- `mark_job_as_started()` - PENDING → PROCESSING
- `mark_job_as_completed()` - PROCESSING → COMPLETED
- `mark_job_as_failed()` - PROCESSING → FAILED
**Validation Rules:**
- Strict status transitions (invalid transitions raise exceptions)
- Job existence verification before any operation
- Automatic timestamp management through repository layer
#### Custom Exceptions

- **InvalidStatusTransitionError**: Raised for invalid status transitions
- **JobRepositoryError**: Raised for MongoDB operation failures
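
Both can be small `Exception` subclasses; a minimal sketch (module path per the file structure below, constructor details assumed):

```python
# exceptions/job_exceptions.py (sketch)

class JobRepositoryError(Exception):
    """Raised when a MongoDB operation on the processing_jobs collection fails."""


class InvalidStatusTransitionError(Exception):
    """Raised when a job status change violates the allowed transition rules."""

    def __init__(self, current_status: str, target_status: str):
        super().__init__(f"Cannot transition job from {current_status} to {target_status}")
        self.current_status = current_status
        self.target_status = target_status
```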
#### Valid Status Transitions
```
PENDING → PROCESSING (via mark_job_as_started)
PROCESSING → COMPLETED (via mark_job_as_completed)
PROCESSING → FAILED (via mark_job_as_failed)
```
All other transitions are forbidden and will raise `InvalidStatusTransitionError`.
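
A sketch of how the service layer could enforce these rules, building on the repository and exception sketches above; method names follow the lists in this section, the rest is assumed:

```python
class JobService:
    """Synchronous business logic for processing jobs with strict transitions."""

    # Allowed (current -> target) pairs
    _ALLOWED_TRANSITIONS = {
        ("pending", "processing"),
        ("processing", "completed"),
        ("processing", "failed"),
    }

    def __init__(self, repository: JobRepository):
        self._repository = repository

    def _transition(self, job_id, target: str, error_message: str | None = None) -> None:
        job = self._repository.get_job_by_id(job_id)
        if job is None:
            raise JobRepositoryError(f"Job {job_id} does not exist")
        current = getattr(job.status, "value", job.status)  # enum or plain string
        if (current, target) not in self._ALLOWED_TRANSITIONS:
            raise InvalidStatusTransitionError(current, target)
        self._repository.update_job_status(job_id, target, error_message)

    def mark_job_as_started(self, job_id) -> None:
        self._transition(job_id, "processing")

    def mark_job_as_completed(self, job_id) -> None:
        self._transition(job_id, "completed")

    def mark_job_as_failed(self, job_id, error_message: str) -> None:
        self._transition(job_id, "failed", error_message)
```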
### File Structure
```
src/file-processor/app/
├── database/repositories/
│ ├── job_repository.py # JobRepository class (synchronous)
│ ├── user_repository.py # UserRepository class (synchronous)
│ ├── document_repository.py # DocumentRepository class (synchronous)
│ └── file_repository.py # FileRepository class (synchronous)
├── services/
│ ├── job_service.py # JobService class (synchronous)
│ ├── auth_service.py # AuthService class (synchronous)
│ ├── user_service.py # UserService class (synchronous)
│ └── document_service.py # DocumentService class (synchronous)
└── exceptions/
└── job_exceptions.py # Custom exceptions
```
### Processing Pipeline Features
@@ -346,87 +434,7 @@ Tracks processing status and lifecycle:
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
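
A hedged sketch of how a worker task could tie these features together; the task name, broker URLs, and the commented-out service calls are assumptions, not the project's actual code:

```python
from pathlib import Path

from celery import Celery

# Assumed broker/backend URLs -- the real configuration lives in celery_app.py
app = Celery("worker", broker="redis://redis:6379/0", backend="redis://redis:6379/1")


@app.task(name="process_document")
def process_document(job_id: str, filepath: str) -> str:
    """Extract text from one file and report the outcome on its processing job."""
    # The real pipeline would call JobService.mark_job_as_started(job_id) first.
    try:
        # Direct text extraction only in this sketch; PDF (PyMuPDF/pdfplumber),
        # DOCX (python-docx) and OCR (EasyOCR) branches would be selected per file type.
        text = Path(filepath).read_text(encoding="utf-8", errors="ignore")
        # ... store extracted content, then JobService.mark_job_as_completed(job_id) ...
        return text
    except Exception:
        # ... JobService.mark_job_as_failed(job_id, error_message) ...
        raise
```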
## Document Service Architecture
### Service Overview
The document service provides orchestrated access to file documents and their content through a single interface that coordinates between `FileDocument` and `DocumentContent` repositories.
### Service Design
- **Architecture Pattern**: Service orchestration with separate repositories
- **Transaction Support**: MongoDB ACID transactions for data consistency
- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
- **Error Handling**: MongoDB standard exceptions with transaction rollback
### Document Service (`document_service.py`)
Orchestrates operations between file and content repositories while maintaining data consistency.
#### Core Functionality
##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
Creates a new document with automatic attribute calculation and content deduplication.
**Automatic Calculations:**
- `file_hash`: SHA256 hash of file bytes
- `file_type`: Detection based on file extension
- `mime_type`: Detection via `python-magic` library
- `file_size`: Length of provided bytes
- `detected_at`: Current timestamp
- `metadata`: Empty dictionary (reserved for future extension)
**Deduplication Logic:**
1. Calculate SHA256 hash of file content
2. Check if `DocumentContent` with this hash already exists
3. If EXISTS: Create only `FileDocument` referencing existing content
4. If NOT EXISTS: Create both `FileDocument` and `DocumentContent` in transaction
**Transaction Flow:**
```
BEGIN TRANSACTION
    IF content_exists(file_hash):
        CREATE FileDocument with content reference
    ELSE:
        CREATE DocumentContent
        CREATE FileDocument with content reference
COMMIT TRANSACTION
```
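
A minimal sketch of this flow with pymongo sessions. It assumes MongoDB runs as a replica set (required for multi-document transactions); collection names follow the schema above, connection details are assumptions, and the extracted-text step is omitted:

```python
import hashlib
from pathlib import Path

import magic  # python-magic, used for MIME type detection
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")  # assumed URI
db = client["mydocmanager"]                    # assumed database name


def create_document(file_path: str, file_bytes: bytes, encoding: str = "utf-8") -> dict:
    """Create a FileDocument entry, reusing existing DocumentContent when the hash matches."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    mime_type = magic.from_buffer(file_bytes, mime=True)

    with client.start_session() as session:
        with session.start_transaction():
            exists = db["document_contents"].find_one({"file_hash": file_hash}, session=session)
            if exists is None:
                db["document_contents"].insert_one(
                    {
                        "file_hash": file_hash,
                        "encoding": encoding,
                        "file_size": len(file_bytes),
                        "mime_type": mime_type,
                        # extracted "content" would be added here after text extraction
                    },
                    session=session,
                )
            file_doc = {
                "filename": Path(file_path).name,
                "filepath": file_path,
                "file_hash": file_hash,
                "file_size": len(file_bytes),
                "mime_type": mime_type,
            }
            db["files"].insert_one(file_doc, session=session)
    return file_doc
```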
#### Available Methods
- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
- `get_document_by_id(document_id)`: Retrieve by document ID
- `get_document_by_hash(file_hash)`: Retrieve by file hash
- `get_document_by_filepath(filepath)`: Retrieve by file path
- `list_documents(skip, limit)`: Paginated document listing
- `count_documents()`: Total document count
- `update_document(document_id, update_data)`: Update document metadata
- `delete_document(document_id)`: Remove document and orphaned content
### Repository Dependencies
The document service coordinates two existing repositories:
#### File Repository (`file_repository.py`)
- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
- `find_document_by_filepath()`, `find_document_by_name()`
- `list_documents()`, `count_documents()`
- `update_document()`, `delete_document()`
#### Document Content Repository (`document_content_repository.py`)
- `create_document_content()`, `find_document_content_by_id()`
- `find_document_content_by_file_hash()`, `content_exists()`
- `update_document_content()`, `delete_document_content()`
- `list_document_contents()`, `count_document_contents()`
### Dependencies
- `python-magic`: MIME type detection
- `hashlib`: SHA256 hashing (standard library)
- `pymongo`: MongoDB transactions support
- **Synchronous Operations**: All database operations use pymongo for Celery compatibility
## Key Implementation Notes
@@ -449,6 +457,7 @@ The document service coordinates two existing repositories:
- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible
- **Database Driver**: pymongo for synchronous MongoDB operations
### Testing Strategy
@@ -473,6 +482,7 @@ The document service coordinates two existing repositories:
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility
### Development Process Requirements
@@ -483,21 +493,15 @@ The document service coordinates two existing repositories:
### Next Implementation Steps
1. ✅ Create docker-compose.yml with all services => Done
2. ✅ Define user management and authentication architecture => Done
3. ✅ Implement user models and authentication services =>
   1. models/user.py => Done
   2. models/auth.py => Done
   3. database/repositories/user_repository.py => Done
4. ✅ Add automatic admin user creation if it does not exist => Done
5. **TODO**: Complete file processing pipeline =>
   1. ✅ Create Pydantic models for files and processing_jobs collections
   2. ✅ Implement repository layer for file and processing job data access (synchronous)
   3. ✅ Implement service layer for business logic (synchronous)
   4. ✅ Create Celery tasks for document processing (.txt, .pdf, .docx)
   5. ✅ Implement Watchdog file monitoring with dedicated observer
   6. ✅ Integrate file watcher with FastAPI startup
6. Create protected API routes for user management
7. Build React monitoring interface with authentication
## Annexes
@@ -586,4 +590,4 @@ docker-compose up --scale worker=3
- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`