# Implemented default pipeline

- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker (see the configuration sketch below)
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library

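A hedged sketch of how this pairing is typically wired; the broker/backend URLs and module names are assumptions, not taken from this repository:

```python
# celery_app.py (sketch): a Celery app backed by the Redis broker.
from celery import Celery

celery_app = Celery(
    "file_processor",
    broker="redis://redis:6379/0",   # Redis service name from docker-compose (assumed)
    backend="redis://redis:6379/1",  # result backend, also Redis (assumed)
)

# Discover task modules by package name (assumed layout).
celery_app.autodiscover_tasks(["tasks"])
```
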
```
MyDocManager/
...
│ │ ├── requirements.txt
│ │ ├── app/
│ │ │   ├── main.py
│ │ │   ├── file_watcher.py          # FileWatcher class with observer thread
│ │ │   ├── celery_app.py            # Celery configuration
│ │ │   ├── config/
│ │ │   │   ├── __init__.py
│ │ │   │   └── settings.py          # JWT, MongoDB config
│ │ │   ├── models/
│ │ │   │   ├── __init__.py
│ │ │   │   ├── user.py              # User Pydantic models
│ │ │   │   ├── auth.py              # Auth Pydantic models
│ │ │   │   ├── document.py          # Document Pydantic models
│ │ │   │   ├── job.py               # Job processing Pydantic models
│ │ │   │   └── types.py             # PyObjectId and other useful types
│ │ │   ├── database/
│ │ │   │   ├── __init__.py
│ │ │   │   ├── connection.py        # MongoDB connection (pymongo)
│ │ │   │   └── repositories/
│ │ │   │       ├── __init__.py
│ │ │   │       ├── user_repository.py      # User CRUD operations (synchronous)
│ │ │   │       ├── document_repository.py  # Document CRUD operations (synchronous)
│ │ │   │       └── job_repository.py       # Job CRUD operations (synchronous)
│ │ │   ├── services/
│ │ │   │   ├── __init__.py
│ │ │   │   ├── auth_service.py      # JWT & password logic (synchronous)
│ │ │   │   ├── user_service.py      # User business logic (synchronous)
│ │ │   │   ├── document_service.py  # Document business logic (synchronous)
│ │ │   │   ├── job_service.py       # Job processing logic (synchronous)
│ │ │   │   └── init_service.py      # Admin creation at startup
│ │ │   ├── api/
│ │ │   │   ├── __init__.py
│ │ │   │   ...
│ │ │   └── utils/
│ │ │       ├── __init__.py
│ │ │       ├── security.py          # Password utilities
│ │ │       └── document_matching.py # Fuzzy matching algorithms
│ ├── worker/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ ...
│ └── frontend/
│     ├── Dockerfile
│     ├── package.json
│     ├── index.html
│     └── src/
│         ├── assets/
│         ├── App.css
│         ├── App.jsx
│         ├── main.css
│         └── main.jsx
├── tests/
│   ├── file-processor/
│   │   ├── test_auth/
...
```

#### Files Collection

Stores file metadata and extracted content using Pydantic models:

```python
from datetime import datetime
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field, field_validator

# PyObjectId, FileType and ExtractionMethod are project-local types
# (see app/models/); the exact import path is assumed here.
from app.models.types import PyObjectId, FileType, ExtractionMethod


class FileDocument(BaseModel):
    """
    Model for file documents stored in the 'files' collection.

    Represents a file detected in the watched directory with its
    metadata and extracted content.
    """

    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    filename: str = Field(..., description="Original filename")
    filepath: str = Field(..., description="Full path to the file")
    file_type: FileType = Field(..., description="Type of the file")
    extraction_method: Optional[ExtractionMethod] = Field(default=None, description="Method used to extract content")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="File-specific metadata")
    detected_at: Optional[datetime] = Field(default=None, description="Timestamp when file was detected")
    file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
    encoding: str = Field(default="utf-8", description="Character encoding for text files")
    file_size: int = Field(..., ge=0, description="File size in bytes")
    mime_type: str = Field(..., description="MIME type detected")

    @field_validator('filepath')
    @classmethod
    def validate_filepath(cls, v: str) -> str:
        """Validate filepath format."""
        if not v.strip():
            raise ValueError("Filepath cannot be empty")
        return v.strip()

    @field_validator('filename')
    @classmethod
    def validate_filename(cls, v: str) -> str:
        """Validate filename format."""
        if not v.strip():
            raise ValueError("Filename cannot be empty")
        return v.strip()
```

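A hedged usage example (Pydantic v2 API; `FileType.PDF` and the values shown are illustrative assumptions):

```python
# Illustrative only: construct a FileDocument and serialize it for MongoDB.
doc = FileDocument(
    filename="report.pdf",
    filepath="/watched_files/report.pdf",
    file_type=FileType.PDF,           # assumed enum member
    file_size=2_048_576,
    mime_type="application/pdf",
)
print(doc.model_dump(by_alias=True))  # keeps the `_id` alias for pymongo
```
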
#### Processing Jobs Collection

Tracks processing status and lifecycle (`pending`, `processing`, `completed`, `failed`):

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field, field_validator

# PyObjectId and ProcessingStatus are project-local types
# (see app/models/); the exact import path is assumed here.
from app.models.types import PyObjectId, ProcessingStatus


class ProcessingJob(BaseModel):
    """
    Model for processing jobs stored in the 'processing_jobs' collection.

    Tracks the lifecycle and status of document processing tasks.
    """

    id: Optional[PyObjectId] = Field(default=None, alias="_id")
    file_id: PyObjectId = Field(..., description="Reference to file document")
    status: ProcessingStatus = Field(default=ProcessingStatus.PENDING, description="Current processing status")
    task_id: Optional[str] = Field(default=None, description="Celery task UUID")
    created_at: Optional[datetime] = Field(default=None, description="Timestamp when job was created")
    started_at: Optional[datetime] = Field(default=None, description="Timestamp when processing started")
    completed_at: Optional[datetime] = Field(default=None, description="Timestamp when processing completed")
    error_message: Optional[str] = Field(default=None, description="Error message if processing failed")

    @field_validator('error_message')
    @classmethod
    def validate_error_message(cls, v: Optional[str]) -> Optional[str]:
        """Clean up error message."""
        if v is not None:
            return v.strip() if v.strip() else None
        return v
```

### Data Storage Strategy

- **Choice**: Three separate collections for files, content, and processing status
- **Rationale**: Normalization prevents content duplication when multiple files have identical content
- **Benefits**:
  - Content deduplication via SHA256 hash (see the hashing sketch below)
  - Better query performance for metadata vs content searches
  - Clear separation of concerns between file metadata, content, and processing lifecycle
  - Multiple files can reference the same content (e.g., identical copies in different locations)

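The deduplication key is simply the SHA256 digest of the raw file bytes; a minimal sketch (the function name is illustrative):

```python
import hashlib

def content_hash(file_bytes: bytes) -> str:
    """Return the SHA256 hex digest used as the deduplication key."""
    return hashlib.sha256(file_bytes).hexdigest()
```
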
### Content Storage Location

- **Choice**: Store extracted content in a separate `document_contents` collection
- **Rationale**: Content normalization and deduplication
- **Benefits**:
  - Single content record per unique file hash
  - Multiple file entries can reference the same content
  - Efficient storage for duplicate files

### Supported File Types (Initial Implementation)

- **Text Files** (`.txt`): Direct content reading

#### Watchdog Implementation

- **Choice**: Dedicated observer thread
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in a separate thread from FastAPI (see the sketch below)

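A hedged sketch of the dedicated-observer approach; watchdog's `Observer` already runs in its own thread, and the handler, path, and task name here are assumptions:

```python
from celery import Celery
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

celery_app = Celery(broker="redis://redis:6379/0")  # assumed broker URL

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            # Hand the path to a Celery task instead of processing inline;
            # the task name is an assumption.
            celery_app.send_task("tasks.process_document", args=[event.src_path])

observer = Observer()
observer.schedule(NewFileHandler(), "/watched_files", recursive=True)
observer.start()  # the observer runs in its own thread
```
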
#### Content Storage Location

- **Choice**: Store files in the file system, using the SHA256 hash as the filename
- **Rationale**: MongoDB is not designed for large files; the file system performs better and keeps files directly accessible. A sketch of this hash-addressed layout follows.

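A minimal sketch of that layout; the storage path and function name are illustrative assumptions:

```python
import hashlib
from pathlib import Path

STORAGE_ROOT = Path("/data/files")  # assumed storage mount

def store_file(file_bytes: bytes) -> Path:
    """Persist bytes under their SHA256 digest; identical files share one object."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    target = STORAGE_ROOT / digest
    if not target.exists():  # deduplication: content already stored
        target.write_bytes(file_bytes)
    return target
```
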
#### Repository and Services Implementation

- **Choice**: Synchronous implementation using pymongo
- **Rationale**: Full compatibility with Celery workers and a simplified workflow
- **Implementation**: All repositories and services operate synchronously for seamless integration (a minimal connection sketch follows)

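A minimal sketch of the shared synchronous connection; the URI and database name are assumptions:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")  # assumed docker-compose service name
db = client["mydocmanager"]                    # assumed database name

def get_collection(name: str):
    """Return a synchronous collection handle, usable from Celery tasks."""
    return db[name]
```
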
### Implementation Status

1. ✅ Pydantic models for MongoDB collections
2. ✅ Repository layer for data access (files + processing_jobs + users + documents) - synchronous
3. ✅ Service layer for business logic (auth, user, document, job) - synchronous
4. ✅ Celery tasks for document processing
5. ✅ Watchdog file monitoring implementation
6. ✅ FastAPI integration and startup coordination

## Job Management Layer

### Repository Pattern Implementation

The job management system follows the repository pattern for a clean separation between data access and business logic.

#### JobRepository

Handles direct MongoDB operations for processing jobs using synchronous pymongo:

**CRUD Operations:**

- `create_job()` - Create a new processing job with an automatic `created_at` timestamp
- `get_job_by_id()` - Retrieve a job by ObjectId
- `update_job_status()` - Update a job's status with automatic timestamp management
- `delete_job()` - Remove a job from the database
- `get_jobs_by_file_id()` - Get all jobs for a specific file
- `get_jobs_by_status()` - Get jobs filtered by processing status

**Automatic Timestamp Management** (sketched below)**:**

- `created_at`: Set automatically during job creation
- `started_at`: Set automatically when status changes to PROCESSING
- `completed_at`: Set automatically when status changes to COMPLETED or FAILED

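A hedged sketch of how `update_job_status()` might stamp these fields; the signature and status literals are assumptions, only the behaviour is fixed by this README:

```python
from datetime import datetime, timezone

from bson import ObjectId
from pymongo.collection import Collection

class JobRepository:
    """Minimal sketch; only the timestamp behaviour is shown."""

    def __init__(self, collection: Collection):
        self.collection = collection

    def update_job_status(self, job_id: ObjectId, new_status: str) -> None:
        fields = {"status": new_status}
        now = datetime.now(timezone.utc)
        if new_status == "processing":
            fields["started_at"] = now        # set when work begins
        elif new_status in ("completed", "failed"):
            fields["completed_at"] = now      # set on terminal states
        self.collection.update_one({"_id": job_id}, {"$set": fields})
```
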
#### JobService

Provides a synchronous business-logic layer with strict status-transition validation:

**Status Transition Methods:**

- `mark_job_as_started()` - PENDING → PROCESSING
- `mark_job_as_completed()` - PROCESSING → COMPLETED
- `mark_job_as_failed()` - PROCESSING → FAILED

**Validation Rules:**

- Strict status transitions (invalid transitions raise exceptions)
- Job existence verification before any operation
- Automatic timestamp management through the repository layer

#### Custom Exceptions

- **InvalidStatusTransitionError**: Raised for invalid status transitions
- **JobRepositoryError**: Raised for MongoDB operation failures

#### Valid Status Transitions

```
PENDING    → PROCESSING (via mark_job_as_started)
PROCESSING → COMPLETED  (via mark_job_as_completed)
PROCESSING → FAILED     (via mark_job_as_failed)
```

All other transitions are forbidden and will raise `InvalidStatusTransitionError`, as sketched below.

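A hedged sketch of the guard implied by this table; the `ProcessingStatus` values and exception name follow the README, while the dict-based implementation is an assumption:

```python
from enum import Enum

class ProcessingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

# Each status maps to the set of statuses it may legally move to.
VALID_TRANSITIONS = {
    ProcessingStatus.PENDING: {ProcessingStatus.PROCESSING},
    ProcessingStatus.PROCESSING: {ProcessingStatus.COMPLETED, ProcessingStatus.FAILED},
}

class InvalidStatusTransitionError(Exception):
    pass

def validate_transition(current: ProcessingStatus, new: ProcessingStatus) -> None:
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise InvalidStatusTransitionError(f"{current.value} -> {new.value} is not allowed")
```
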
### File Structure

```
src/file-processor/app/
├── database/repositories/
│   ├── job_repository.py       # JobRepository class (synchronous)
│   ├── user_repository.py      # UserRepository class (synchronous)
│   ├── document_repository.py  # DocumentRepository class (synchronous)
│   └── file_repository.py      # FileRepository class (synchronous)
├── services/
│   ├── job_service.py          # JobService class (synchronous)
│   ├── auth_service.py         # AuthService class (synchronous)
│   ├── user_service.py         # UserService class (synchronous)
│   └── document_service.py     # DocumentService class (synchronous)
└── exceptions/
    └── job_exceptions.py       # Custom exceptions
```

### Processing Pipeline Features

- **Status Tracking**: Real-time processing status via the `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches

## Document Service Architecture

### Service Overview

The document service provides orchestrated access to file documents and their content through a single interface that coordinates the `FileDocument` and `DocumentContent` repositories.

### Service Design

- **Architecture Pattern**: Service orchestration with separate repositories
- **Transaction Support**: MongoDB ACID transactions for data consistency
- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
- **Error Handling**: MongoDB standard exceptions with transaction rollback

### Document Service (`document_service.py`)

Orchestrates operations between the file and content repositories while maintaining data consistency.

#### Core Functionality

##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`

Creates a new document with automatic attribute calculation and content deduplication.

**Automatic Calculations:**

- `file_hash`: SHA256 hash of the file bytes
- `file_type`: Detected from the file extension
- `mime_type`: Detected via the `python-magic` library
- `file_size`: Length of the provided bytes
- `detected_at`: Current timestamp
- `metadata`: Empty dictionary (reserved for future extension)

**Deduplication Logic:**

1. Calculate the SHA256 hash of the file content
2. Check whether a `DocumentContent` with this hash already exists
3. If it EXISTS: create only a `FileDocument` referencing the existing content
4. If it does NOT exist: create both `FileDocument` and `DocumentContent` in a transaction

**Transaction Flow:**

```
BEGIN TRANSACTION
    IF content_exists(file_hash):
        CREATE FileDocument with content reference
    ELSE:
        CREATE DocumentContent
        CREATE FileDocument with content reference
COMMIT TRANSACTION
```

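A hedged pymongo rendering of this flow; the collection names follow the README, while the URI, database name, and document shapes are illustrative, and multi-document transactions require a replica set:

```python
import hashlib

from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")  # assumed URI
db = client["mydocmanager"]                    # assumed database name

def create_document(file_path: str, file_bytes: bytes, encoding: str = "utf-8") -> None:
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    with client.start_session() as session:
        with session.start_transaction():  # rolls back on any exception
            if db.document_contents.find_one({"file_hash": file_hash}, session=session) is None:
                # Content not seen before: store it once.
                db.document_contents.insert_one(
                    {"file_hash": file_hash, "encoding": encoding},
                    session=session,
                )
            # Always record the file entry, referencing content by hash.
            db.files.insert_one(
                {"filepath": file_path, "file_hash": file_hash},
                session=session,
            )
```
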
#### Available Methods

- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
- `get_document_by_id(document_id)`: Retrieve by document ID
- `get_document_by_hash(file_hash)`: Retrieve by file hash
- `get_document_by_filepath(filepath)`: Retrieve by file path
- `list_documents(skip, limit)`: Paginated document listing
- `count_documents()`: Total document count
- `update_document(document_id, update_data)`: Update document metadata
- `delete_document(document_id)`: Remove document and orphaned content

### Repository Dependencies

The document service coordinates two existing repositories:

#### File Repository (`file_repository.py`)

- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
- `find_document_by_filepath()`, `find_document_by_name()`
- `list_documents()`, `count_documents()`
- `update_document()`, `delete_document()`

#### Document Content Repository (`document_content_repository.py`)

- `create_document_content()`, `find_document_content_by_id()`
- `find_document_content_by_file_hash()`, `content_exists()`
- `update_document_content()`, `delete_document_content()`
- `list_document_contents()`, `count_document_contents()`

### Dependencies

- `python-magic`: MIME type detection (example below)
- `hashlib`: SHA256 hashing (standard library)
- `pymongo`: MongoDB transaction support
- **Synchronous Operations**: All database operations use pymongo for Celery compatibility

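For reference, MIME detection with `python-magic` reduces to a one-liner; the wrapper name is illustrative:

```python
import magic  # python-magic

def detect_mime(file_bytes: bytes) -> str:
    """Detect the MIME type from raw bytes (e.g. 'application/pdf')."""
    return magic.from_buffer(file_bytes, mime=True)
```
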
## Key Implementation Notes

- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer the standard library when possible
- **Database Driver**: pymongo for synchronous MongoDB operations

### Testing Strategy

12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility

### Development Process Requirements

### Next Implementation Steps

1. **TODO**: Complete file processing pipeline =>
    1. ✅ Create Pydantic models for files and processing_jobs collections
    2. ✅ Implement repository layer for file and processing job data access (synchronous)
    3. ✅ Implement service layer for business logic (synchronous)
    4. ✅ Create Celery tasks for document processing (.txt, .pdf, .docx)
    5. ✅ Implement Watchdog file monitoring with dedicated observer
    6. ✅ Integrate file watcher with FastAPI startup
2. Create protected API routes for user management
3. Build React monitoring interface with authentication

## Annexes

```
docker-compose up --scale worker=3
```

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`