Working on API

2025-09-25 22:58:31 +02:00
parent 48f5b009ae
commit 1f7ef200e7
16 changed files with 618 additions and 63 deletions


@@ -13,7 +13,7 @@ architecture with Redis for task queuing and MongoDB for data persistence.
- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
-- **Database**: MongoDB
+- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library
@@ -109,16 +109,18 @@ MyDocManager/
│ │ │ │ └── types.py # PyObjectId and other useful types
│ │ │ ├── database/
│ │ │ │ ├── __init__.py
-│ │ │ │ ├── connection.py # MongoDB connection
+│ │ │ │ ├── connection.py # MongoDB connection (pymongo)
│ │ │ │ └── repositories/
│ │ │ │ ├── __init__.py
-│ │ │ │ ├── user_repository.py # User CRUD operations
-│ │ │ │ └── document_repository.py # User CRUD operations
+│ │ │ │ ├── user_repository.py # User CRUD operations (synchronous)
+│ │ │ │ ├── document_repository.py # Document CRUD operations (synchronous)
+│ │ │ │ └── job_repository.py # Job CRUD operations (synchronous)
│ │ │ ├── services/
│ │ │ │ ├── __init__.py
-│ │ │ │ ├── auth_service.py # JWT & password logic
-│ │ │ │ ├── user_service.py # User business logic
-│ │ │ │ ├── document_service.py # Document business logic
+│ │ │ │ ├── auth_service.py # JWT & password logic (synchronous)
+│ │ │ │ ├── user_service.py # User business logic (synchronous)
+│ │ │ │ ├── document_service.py # Document business logic (synchronous)
+│ │ │ │ ├── job_service.py # Job processing logic (synchronous)
│ │ │ │ └── init_service.py # Admin creation at startup
│ │ │ ├── api/
│ │ │ │ ├── __init__.py
@@ -334,13 +336,20 @@ class ProcessingJob(BaseModel):
- **Rationale**: MongoDB is not designed for storing large files, and keeping them on disk performs better. Files remain in the file system for easy access.
-### Implementation Order
+#### Repository and Services Implementation
+- **Choice**: Synchronous implementation using pymongo
+- **Rationale**: Full compatibility with Celery workers and a simplified workflow
+- **Implementation**: All repositories and services operate synchronously for seamless integration (see the connection sketch below)
+
+### Implementation Status
1. ✅ Pydantic models for MongoDB collections
-2. UNDER PROGRESS : Repository layer for data access (files + processing_jobs)
-3. TODO : Celery tasks for document processing
-4. TODO : Watchdog file monitoring implementation
-5. TODO : FastAPI integration and startup coordination
+2. Repository layer for data access (files + processing_jobs + users + documents) - synchronous
+3. ✅ Service layer for business logic (auth, user, document, job) - synchronous
+4. ✅ Celery tasks for document processing
+5. ✅ Watchdog file monitoring implementation
+6. ✅ FastAPI integration and startup coordination
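To ground the synchronous choice, a minimal sketch of what a pymongo-based `connection.py` could look like; the module layout, environment variable names, and `get_database()` helper are assumptions, not the project's confirmed API:

```python
# Hypothetical connection.py: one synchronous MongoClient shared by the
# FastAPI app and the Celery workers; names here are illustrative only.
import os

from pymongo import MongoClient
from pymongo.database import Database

_client: MongoClient | None = None


def get_database() -> Database:
    """Lazily create the client and return a handle to the app database."""
    global _client
    if _client is None:
        _client = MongoClient(os.environ.get("MONGO_URL", "mongodb://localhost:27017"))
    return _client[os.environ.get("MONGO_DB", "mydocmanager")]
```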
## Job Management Layer
@@ -350,7 +359,7 @@ The job management system follows the repository pattern for clean separation be
#### JobRepository
-Handles direct MongoDB operations for processing jobs:
+Handles direct MongoDB operations for processing jobs using synchronous pymongo:
**CRUD Operations:**
- `create_job()` - Create new processing job with automatic `created_at` timestamp
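As an illustration of that timestamp behaviour, a hedged sketch of `create_job()`; the schema fields and constructor are assumptions, while the `processing_jobs` collection name comes from this document:

```python
# Hypothetical JobRepository.create_job(); field names are assumptions.
from datetime import datetime, timezone

from bson import ObjectId
from pymongo.database import Database


class JobRepository:
    def __init__(self, db: Database) -> None:
        self._jobs = db["processing_jobs"]

    def create_job(self, file_id: ObjectId) -> ObjectId:
        """Insert a new PENDING job, stamping created_at automatically."""
        document = {
            "file_id": file_id,
            "status": "PENDING",
            "created_at": datetime.now(timezone.utc),
        }
        return self._jobs.insert_one(document).inserted_id
```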
@@ -367,7 +376,7 @@ Handles direct MongoDB operations for processing jobs:
#### JobService
-Provides business logic layer with strict status transition validation:
+Provides synchronous business logic layer with strict status transition validation:
**Status Transition Methods:**
- `mark_job_as_started()` - PENDING → PROCESSING
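A minimal sketch of how such a transition method could enforce PENDING → PROCESSING; the repository helpers `get_job()` and `update_job_status()` are assumed names:

```python
# Hypothetical JobService.mark_job_as_started(); repository method names
# and the import path are assumptions based on the layout shown below.
from app.exceptions.job_exceptions import InvalidStatusTransitionError, JobNotFoundError


class JobService:
    def __init__(self, repository) -> None:
        self._repository = repository

    def mark_job_as_started(self, job_id) -> None:
        job = self._repository.get_job(job_id)  # assumed lookup helper
        if job is None:
            raise JobNotFoundError(job_id)
        if job["status"] != "PENDING":
            # Only PENDING -> PROCESSING is a legal start transition
            raise InvalidStatusTransitionError(job["status"], "PROCESSING")
        self._repository.update_job_status(job_id, "PROCESSING")  # assumed update helper
```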
@@ -381,7 +390,6 @@ Provides business logic layer with strict status transition validation:
#### Custom Exceptions
**JobNotFoundError**: Raised when job ID doesn't exist
**InvalidStatusTransitionError**: Raised for invalid status transitions
**JobRepositoryError**: Raised for MongoDB operation failures
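These three exceptions might be declared along these lines; the constructor signatures and message formats are assumptions:

```python
# Hypothetical job_exceptions.py; messages are illustrative only.
class JobNotFoundError(Exception):
    """Raised when a job ID doesn't exist."""

    def __init__(self, job_id) -> None:
        super().__init__(f"Job not found: {job_id}")


class InvalidStatusTransitionError(Exception):
    """Raised for invalid status transitions."""

    def __init__(self, current: str, requested: str) -> None:
        super().__init__(f"Cannot transition job from {current} to {requested}")


class JobRepositoryError(Exception):
    """Raised for MongoDB operation failures."""
```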
@@ -400,11 +408,17 @@ All other transitions are forbidden and will raise `InvalidStatusTransitionError
```
src/file-processor/app/
├── database/repositories/
-│ └── job_repository.py # JobRepository class
+│ ├── job_repository.py # JobRepository class (synchronous)
+│ ├── user_repository.py # UserRepository class (synchronous)
+│ ├── document_repository.py # DocumentRepository class (synchronous)
+│ └── file_repository.py # FileRepository class (synchronous)
├── services/
-│ └── job_service.py # JobService class
+│ ├── job_service.py # JobService class (synchronous)
+│ ├── auth_service.py # AuthService class (synchronous)
+│ ├── user_service.py # UserService class (synchronous)
+│ └── document_service.py # DocumentService class (synchronous)
└── exceptions/
    └── job_exceptions.py # Custom exceptions
```
### Processing Pipeline Features
@@ -414,6 +428,7 @@ src/file-processor/app/
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
+- **Synchronous Operations**: All database operations use pymongo for Celery compatibility
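As a sketch of how the direct/OCR/hybrid split could work for PDFs, using the stack's own libraries; the `extract_text()` helper and its per-page fallback rule are assumptions:

```python
# Hedged sketch: prefer the PDF text layer, fall back to OCR per page.
import easyocr  # OCR engine from the tech stack
import fitz  # PyMuPDF
import numpy as np


def extract_text(path: str) -> str:
    """Hybrid extraction: direct text layer first, EasyOCR as fallback."""
    reader = easyocr.Reader(["en"])
    parts: list[str] = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if text.strip():
                parts.append(text)  # direct extraction succeeded
                continue
            pix = page.get_pixmap()  # render the page for the OCR fallback
            img = np.frombuffer(pix.samples, dtype=np.uint8)
            img = img.reshape(pix.height, pix.width, pix.n)
            parts.append("\n".join(reader.readtext(img, detail=0)))
    return "\n".join(parts)
```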
## Key Implementation Notes
@@ -436,6 +451,7 @@ src/file-processor/app/
- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible
+- **Database Driver**: pymongo for synchronous MongoDB operations
### Testing Strategy
@@ -460,6 +476,7 @@ src/file-processor/app/
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
+15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility
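A compact sketch of how the SHA256 duplicate check could run before dispatching a task; the `sha256` field name on the files collection is an assumption:

```python
# Hedged sketch of SHA256 duplicate detection; streams the file in
# chunks so large documents do not exhaust memory.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(files_collection, path: Path) -> bool:
    # 'sha256' field name is an assumption about the files collection schema
    return files_collection.find_one({"sha256": sha256_of(path)}) is not None
```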
### Development Process Requirements
@@ -470,12 +487,13 @@ src/file-processor/app/
### Next Implementation Steps
-1. **IN PROGRESS**: Implement file processing pipeline =>
-   1. Create Pydantic models for files and processing_jobs collections
-   2. Implement repository layer for file and processing job data access
-   3. Create Celery tasks for document processing (.txt, .pdf, .docx)
-   4. Implement Watchdog file monitoring with dedicated observer
-   5. Integrate file watcher with FastAPI startup
+1. **TODO**: Complete file processing pipeline =>
+   1. Create Pydantic models for files and processing_jobs collections
+   2. Implement repository layer for file and processing job data access (synchronous)
+   3. ✅ Implement service layer for business logic (synchronous)
+   4. ✅ Create Celery tasks for document processing (.txt, .pdf, .docx)
+   5. ✅ Implement Watchdog file monitoring with dedicated observer (sketch after this list)
+   6. ✅ Integrate file watcher with FastAPI startup
2. Create protected API routes for user management
3. Build React monitoring interface with authentication
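A minimal sketch of the dedicated observer dispatching Celery tasks directly, matching the "Direct Task Dispatch" decision above; `process_document`, its import path, and the watched directory are assumptions:

```python
# Hedged sketch of watchdog monitoring with direct Celery dispatch.
# The task import and watch directory are assumptions, not confirmed names.
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from tasks import process_document  # assumed Celery task


class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            # Hand the new file straight to the worker queue
            process_document.delay(event.src_path)


observer = Observer()
observer.schedule(NewFileHandler(), "/watched", recursive=True)
observer.start()
```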
@@ -566,4 +584,4 @@ docker-compose up --scale worker=3
- **file-processor**: Hot-reload enabled via `--reload` flag
- Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
- Code changes in `src/worker/tasks/` require: `docker-compose restart worker`