Files
MyDocManager/Readme.md
2025-09-26 23:58:50 +02:00

658 lines
25 KiB
Markdown

# MyDocManager
## Overview
MyDocManager is a real-time document processing application that automatically detects files in a monitored directory,
processes them asynchronously, and stores the results in a database. The application uses a modern microservices
architecture with Redis for task queuing and MongoDB for data persistence.
## Architecture
### Technology Stack
- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library
### Services Architecture
┌─────────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Frontend │ │ file- │ │ Redis │ │ Worker │ │ MongoDB │
│ (React) │◄──►│ processor │───►│ (Broker) │◄──►│ (Celery) │───►│ (Results) │
│ │ │ (FastAPI + │ │ │ │ │ │ │
│ │ │ watchdog) │ │ │ │ │ │ │
└─────────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
### Docker Services
1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access
## Data Flow
1. **File Detection**: Watchdog monitors target directory in real-time
2. **Task Creation**: FastAPI creates Celery task for each detected file
3. **Task Processing**: Worker processes document (OCR, text extraction)
4. **Result Storage**: Processed data stored in MongoDB
5. **Monitoring**: React frontend displays processing status and results
## Document Processing Capabilities
### Supported File Types
- **PDF**: Direct text extraction + OCR for scanned documents
- **Word Documents**: .docx text extraction
- **Images**: OCR text recognition (JPG, PNG, etc.)
### Processing Libraries
- **EasyOCR**: Modern OCR engine (80+ languages, deep learning-based)
- **PyMuPDF**: PDF text extraction and manipulation
- **python-docx**: Word document processing
- **pdfplumber**: Advanced PDF text extraction
## Development Environment
### Container-Based Development
The application is designed for container-based development with hot-reload capabilities:
- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity
### Key Features
- **Real-time Processing**: Immediate file detection and processing
- **Horizontal Scaling**: Multiple workers can be added easily
- **Fault Tolerance**: Celery provides automatic retry mechanisms
- **Monitoring**: Built-in task status tracking
- **Hot Reload**: Development changes reflected instantly in containers
### Docker Services
1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access
## Project Structure
```
MyDocManager/
├── docker-compose.yml
├── src/
│ ├── file-processor/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ ├── app/
│ │ │ ├── main.py
│ │ │ ├── file_watcher.py # FileWatcher class with observer thread
│ │ │ ├── celery_app.py # Celery Configuration
│ │ │ ├── config/
│ │ │ │ ├── __init__.py
│ │ │ │ └── settings.py # JWT, MongoDB config
│ │ │ ├── models/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user.py # User Pydantic models
│ │ │ │ ├── auth.py # Auth Pydantic models
│ │ │ │ ├── document.py # Document Pydantic models
│ │ │ │ ├── job.py # Job Processing Pydantic models
│ │ │ │ └── types.py # PyObjectId and other useful types
│ │ │ ├── database/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── connection.py # MongoDB connection (pymongo)
│ │ │ │ └── repositories/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── user_repository.py # User CRUD operations (synchronous)
│ │ │ │ ├── document_repository.py # Document CRUD operations (synchronous)
│ │ │ │ └── job_repository.py # Job CRUD operations (synchronous)
│ │ │ ├── services/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── auth_service.py # JWT & password logic (synchronous)
│ │ │ │ ├── user_service.py # User business logic (synchronous)
│ │ │ │ ├── document_service.py # Document business logic (synchronous)
│ │ │ │ ├── job_service.py # Job processing logic (synchronous)
│ │ │ │ └── init_service.py # Admin creation at startup
│ │ │ ├── api/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── dependencies.py # Auth dependencies
│ │ │ │ └── routes/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── auth.py # Authentication routes
│ │ │ │ └── users.py # User management routes
│ │ │ └── utils/
│ │ │ ├── __init__.py
│ │ │ ├── security.py # Password utilities
│ │ │ └── document_matching.py # Fuzzy matching Algorithms
│ ├── worker/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ └── tasks/
│ └── frontend/
│ ├── Dockerfile
│ ├── package.json
│ ├── index.html
│ └── src/
│ ├── assets/
│ ├── App.css
│ ├── App.jsx
│ ├── main.css
│ └── main.jsx
├── tests/
│ ├── file-processor/
│ │ ├── test_auth/
│ │ ├── test_users/
│ │ └── test_services/
│ └── worker/
├── volumes/
│ └── watched_files/
└── README.md
```
## Authentication & User Management
### Security Features
- **JWT Authentication**: Stateless authentication with 24-hour token expiration
- **Password Security**: bcrypt hashing with automatic salting
- **Role-Based Access**: Admin and User roles with granular permissions
- **Protected Routes**: All user management APIs require valid authentication
- **Auto Admin Creation**: Default admin user created on first startup
### User Roles
- **Admin**: Full access to user management (create, read, update, delete users)
- **User**: Limited access (view own profile, access document processing features)
### Authentication Flow
1. **Login**: User provides credentials → Server validates → Returns JWT token
2. **API Access**: Client includes JWT in Authorization header
3. **Token Validation**: Server verifies token signature and expiration
4. **Role Check**: Server validates user permissions for requested resource
### User Management APIs
```
POST /auth/login # Generate JWT token
GET /users # List all users (admin only)
POST /users # Create new user (admin only)
PUT /users/{user_id} # Update user (admin only)
DELETE /users/{user_id} # Delete user (admin only)
GET /users/me # Get current user profile (authenticated users)
```
### Useful Service URLs
- **FastAPI API**: http://localhost:8000
- **FastAPI Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
- **Redis**: localhost:6379
- **MongoDB**: localhost:27017
### Testing Commands
```bash
# Test FastAPI health
curl http://localhost:8000/health
# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
-H "Content-Type: application/json" \
-d '{"message": "Hello from test!"}'
# Monitor Celery tasks
docker-compose logs -f worker
```
## Default Admin User
On first startup, the application automatically creates a default admin user:
- **Username**: `admin`
- **Password**: `admin`
- **Role**: `admin`
- **Email**: `admin@mydocmanager.local`
**⚠️ Important**: Change the default admin password immediately after first login in production environments.
## File Processing Architecture
### Document Processing Flow
1. **File Detection**: Watchdog monitors `/volumes/watched_files/` directory in real-time
2. **Task Creation**: File watcher creates Celery task for each detected file
3. **Document Processing**: Celery worker processes the document and extracts content
4. **Database Storage**: Processed data stored in MongoDB collections
### MongoDB Collections Design
#### Files Collection
Stores file metadata and extracted content using Pydantic models:
```python
class FileDocument(BaseModel):
"""
Model for file documents stored in the 'files' collection.
Represents a file detected in the watched directory with its
metadata and extracted content.
"""
id: Optional[PyObjectId] = Field(default=None, alias="_id")
filename: str = Field(..., description="Original filename")
filepath: str = Field(..., description="Full path to the file")
file_type: FileType = Field(..., description="Type of the file")
extraction_method: Optional[ExtractionMethod] = Field(default=None, description="Method used to extract content")
metadata: Dict[str, Any] = Field(default_factory=dict, description="File-specific metadata")
detected_at: Optional[datetime] = Field(default=None, description="Timestamp when file was detected")
file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
encoding: str = Field(default="utf-8", description="Character encoding for text files")
file_size: int = Field(..., ge=0, description="File size in bytes")
mime_type: str = Field(..., description="MIME type detected")
@field_validator('filepath')
@classmethod
def validate_filepath(cls, v: str) -> str:
"""Validate filepath format."""
if not v.strip():
raise ValueError("Filepath cannot be empty")
return v.strip()
@field_validator('filename')
@classmethod
def validate_filename(cls, v: str) -> str:
"""Validate filename format."""
if not v.strip():
raise ValueError("Filename cannot be empty")
return v.strip()
```
#### Processing Jobs Collection
Tracks processing status and lifecycle:
```python
class ProcessingJob(BaseModel):
"""
Model for processing jobs stored in the 'processing_jobs' collection.
Tracks the lifecycle and status of document processing tasks.
"""
id: Optional[PyObjectId] = Field(default=None, alias="_id")
file_id: PyObjectId = Field(..., description="Reference to file document")
status: ProcessingStatus = Field(default=ProcessingStatus.PENDING, description="Current processing status")
task_id: Optional[str] = Field(default=None, description="Celery task UUID")
created_at: Optional[datetime] = Field(default=None, description="Timestamp when job was created")
started_at: Optional[datetime] = Field(default=None, description="Timestamp when processing started")
completed_at: Optional[datetime] = Field(default=None, description="Timestamp when processing completed")
error_message: Optional[str] = Field(default=None, description="Error message if processing failed")
@field_validator('error_message')
@classmethod
def validate_error_message(cls, v: Optional[str]) -> Optional[str]:
"""Clean up error message."""
if v is not None:
return v.strip() if v.strip() else None
return v
```
### Supported File Types (Initial Implementation)
- **Text Files** (`.txt`): Direct content reading
- **PDF Documents** (`.pdf`): Text extraction via PyMuPDF/pdfplumber
- **Word Documents** (`.docx`): Content extraction via python-docx
### File Processing Architecture Decisions
#### Watchdog Implementation
- **Choice**: Dedicated observer thread
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in separate thread from FastAPI
#### Task Dispatch Strategy
- **Choice**: Direct Celery task creation from file watcher
- **Rationale**: Minimal latency, straightforward flow
- **Implementation**: File detected → Immediate Celery task dispatch
#### Data Storage Strategy
- **Choice**: Separate collections for files and processing status
- **Rationale**: Clean separation of file data vs processing lifecycle
- **Benefits**:
- Better query performance
- Clear data model boundaries
- Easy processing status tracking
#### Content Storage Location
- **Choice**: Store files in the file system, using the SHA256 hash as filename
- **Rationale**: MongoDB is not meant for large files, better performance. Files remain in the file system for easy
access.
#### Repository and Services Implementation
- **Choice**: Synchronous implementation using pymongo
- **Rationale**: Full compatibility with Celery workers and simplified workflow
- **Implementation**: All repositories and services operate synchronously for seamless integration
## Job Management Layer
### Repository Pattern Implementation
The job management system follows the repository pattern for clean separation between data access and business logic.
#### JobRepository
Handles direct MongoDB operations for processing jobs using synchronous pymongo:
**CRUD Operations:**
- `create_job()` - Create new processing job with automatic `created_at` timestamp
- `get_job_by_id()` - Retrieve job by ObjectId
- `update_job_status()` - Update job status with automatic timestamp management
- `delete_job()` - Remove job from database
- `get_jobs_by_file_id()` - Get all jobs for specific file
- `get_jobs_by_status()` - Get jobs filtered by processing status
**Automatic Timestamp Management:**
- `created_at`: Set automatically during job creation
- `started_at`: Set automatically when status changes to PROCESSING
- `completed_at`: Set automatically when status changes to COMPLETED or FAILED
#### JobService
Provides synchronous business logic layer with strict status transition validation:
**Status Transition Methods:**
- `mark_job_as_started()` - PENDING → PROCESSING
- `mark_job_as_completed()` - PROCESSING → COMPLETED
- `mark_job_as_failed()` - PROCESSING → FAILED
**Validation Rules:**
- Strict status transitions (invalid transitions raise exceptions)
- Job existence verification before any operation
- Automatic timestamp management through repository layer
#### Custom Exceptions
**InvalidStatusTransitionError**: Raised for invalid status transitions
**JobRepositoryError**: Raised for MongoDB operation failures
#### Valid Status Transitions
```
PENDING → PROCESSING (via mark_job_as_started)
PROCESSING → COMPLETED (via mark_job_as_completed)
PROCESSING → FAILED (via mark_job_as_failed)
```
All other transitions are forbidden and will raise `InvalidStatusTransitionError`.
### File Structure
```
src/file-processor/app/
├── database/repositories/
│ ├── job_repository.py # JobRepository class (synchronous)
│ ├── user_repository.py # UserRepository class (synchronous)
│ ├── document_repository.py # DocumentRepository class (synchronous)
│ └── file_repository.py # FileRepository class (synchronous)
├── services/
│ ├── job_service.py # JobService class (synchronous)
│ ├── auth_service.py # AuthService class (synchronous)
│ ├── user_service.py # UserService class (synchronous)
│ └── document_service.py # DocumentService class (synchronous)
└── exceptions/
└── job_exceptions.py # Custom exceptions
```
### Processing Pipeline Features
- **Duplicate Detection**: SHA256 hashing prevents reprocessing same files
- **Error Handling**: Failed processing tracked with error messages
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
- **Synchronous Operations**: All database operations use pymongo for Celery compatibility
## Key Implementation Notes
### Python Standards
- **Style**: PEP 8 compliance
- **Documentation**: Google/NumPy docstring format
- **Naming**: snake_case for variables and functions
- **Testing**: pytest with test_i_can_xxx / test_i_cannot_xxx patterns
### Security Best Practices
- **Password Storage**: Never store plain text passwords, always use bcrypt hashing
- **JWT Secrets**: Use strong, randomly generated secret keys in production
- **Token Expiration**: 24-hour expiration with secure signature validation
- **Role Validation**: Server-side role checking for all protected endpoints
### Dependencies Management
- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible
- **Database Driver**: pymongo for synchronous MongoDB operations
### Testing Strategy
- All code must be testable
- Unit tests for each authentication and user management function
- Integration tests for complete authentication flow
- Tests validated before implementation
### Critical Architecture Decisions Made
1. **JWT Authentication**: Simple token-based auth with 24-hour expiration
2. **Role-Based Access**: Admin/User roles for granular permissions
3. **bcrypt Password Hashing**: Industry-standard password security
4. **MongoDB User Storage**: Centralized user management in main database
5. **Auto Admin Creation**: Automatic setup for first-time deployment
6. **Single FastAPI Service**: Handles both API and file watching with authentication
7. **Celery with Redis**: Chosen over other async patterns for scalability
8. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
9. **Container Development**: Hot-reload setup required for development workflow
10. **Dedicated Watchdog Observer**: Thread-based file monitoring for reliability
11. **Separate MongoDB Collections**: Files and processing jobs stored separately
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility
### Development Process Requirements
1. **Collaborative Validation**: All options must be explained before coding
2. **Test-First Approach**: Test cases defined and validated before implementation
3. **Incremental Development**: Start simple, extend functionality progressively
4. **Error Handling**: Clear problem explanation required before proposing fixes
### Next Implementation Steps
1. Build React Login Page
2. Build React Registration Page
3. Build React Default Dashboard
4. Build React User Management Pages
#### Validated Folders and files
```
src/frontend/src/
├── components/
│ ├── auth/
│ │ ├── LoginForm.jsx # Composant formulaire de login => Done
│ │ └── AuthLayout.jsx # Layout pour les pages d'auth => Done
│ └── common/
│ ├── Header.jsx # Header commun => TODO
│ ├── Layout.jsx # Header commun => TODO
│ └── ProtectedRoutes.jsx # Done
├── contexts/
│ └── AuthContext.jsx # Done
├── pages/
│ ├── LoginPage.jsx # Page complète de login => Done
│ └── DashboardPage.jsx # Page tableau de bord (exemple) => TODO
├── services/
│ └── authService.js # Service API pour auth => Done
├── hooks/
│ └── useAuth.js # Hook React pour gestion auth => TODO
├── utils/
│ └── api.js # Configuration axios/fetch => Done
├── App.jsx # Needs to be updated => TODO
```
#### Choices already made
* Pour la gestion des requêtes API et de l'état d'authentification, je propose
* axios (plus de fonctionnalités) :
* Installation d'axios pour les requêtes HTTP
* Intercepteurs pour gestion automatique du token
* Gestion d'erreurs centralisée
* Pour la gestion de l'état d'authentification et la navigation : Option A + C en même temps
* Option A - Context React + React Router :
* React Context pour l'état global d'auth (user, token, isAuthenticated)
* React Router pour la navigation entre pages
* Routes protégées automatiques
* Option C - Context + localStorage pour persistance :
* Token sauvegardé en localStorage pour rester connecté
* Context qui se recharge au démarrage de l'app
* CSS : Utilisation de daisyUI
#### Package.json
```
{
"name": "frontend",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"lint": "eslint .",
"preview": "vite preview"
},
"dependencies": {
"@tailwindcss/vite": "^4.1.13",
"axios": "^1.12.2",
"react": "^19.1.1",
"react-dom": "^19.1.1",
"react-router-dom": "^7.9.3"
},
"devDependencies": {
"@eslint/js": "^9.33.0",
"@types/react": "^19.1.10",
"@types/react-dom": "^19.1.7",
"@vitejs/plugin-react": "^5.0.0",
"autoprefixer": "^10.4.21",
"daisyui": "^5.1.23",
"eslint": "^9.33.0",
"eslint-plugin-react-hooks": "^5.2.0",
"eslint-plugin-react-refresh": "^0.4.20",
"globals": "^16.3.0",
"postcss": "^8.5.6",
"tailwindcss": "^4.1.13",
"vite": "^7.1.2"
}
}
```
## Annexes
### Docker Commands Reference
#### Initial Setup & Build
```bash
# Build and start all services (first time)
docker-compose up --build
# Build and start in background
docker-compose up --build -d
# Build specific service
docker-compose build file-processor
docker-compose build worker
```
#### Development Workflow
```bash
# Start all services
docker-compose up
# Start in background (detached mode)
docker-compose up -d
# Stop all services
docker-compose down
# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v
# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```
#### Monitoring & Debugging
```bash
# View logs of all services
docker-compose logs
# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb
# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker
# View running containers
docker-compose ps
# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```
#### Service Management
```bash
# Start only specific services
docker-compose up redis mongodb file-processor
# Stop specific service
docker-compose stop worker
docker-compose stop file-processor
# Remove stopped containers
docker-compose rm
# Scale workers (multiple instances)
docker-compose up --scale worker=3
```
### Hot-Reload Configuration
- **file-processor**: Hot-reload enabled via `--reload` flag
- Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
- Code changes in `src/worker/tasks/` require: `docker-compose restart worker`