# MyDocManager

## Overview

MyDocManager is a real-time document processing application that automatically detects files in a monitored directory,
processes them asynchronously, and stores the results in a database. The application uses a modern microservices
architecture with Redis for task queuing and MongoDB for data persistence.

## Architecture

### Technology Stack

- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB (pymongo)
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library

### Services Architecture

    ┌─────────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │   Frontend      │    │ file-       │    │    Redis    │    │   Worker    │    │  MongoDB    │
    │   (React)       │◄──►│ processor   │───►│  (Broker)   │◄──►│  (Celery)   │───►│ (Results)   │
    │                 │    │ (FastAPI +  │    │             │    │             │    │             │
    │                 │    │ watchdog)   │    │             │    │             │    │             │
    └─────────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

### Docker Services

1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access

## Data Flow

1. **File Detection**: Watchdog monitors target directory in real-time
2. **Task Creation**: FastAPI creates Celery task for each detected file
3. **Task Processing**: Worker processes document (OCR, text extraction)
4. **Result Storage**: Processed data stored in MongoDB
5. **Monitoring**: React frontend displays processing status and results

## Document Processing Capabilities

### Supported File Types

- **PDF**: Direct text extraction + OCR for scanned documents
- **Word Documents**: .docx text extraction
- **Images**: OCR text recognition (JPG, PNG, etc.)

### Processing Libraries

- **EasyOCR**: Modern OCR engine (80+ languages, deep learning-based)
- **PyMuPDF**: PDF text extraction and manipulation
- **python-docx**: Word document processing
- **pdfplumber**: Advanced PDF text extraction

## Development Environment

### Container-Based Development

The application is designed for container-based development with hot-reload capabilities:

- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity

### Key Features

- **Real-time Processing**: Immediate file detection and processing
- **Horizontal Scaling**: Multiple workers can be added easily
- **Fault Tolerance**: Celery provides automatic retry mechanisms
- **Monitoring**: Built-in task status tracking
- **Hot Reload**: Development changes reflected instantly in containers

## Project Structure

```
MyDocManager/
├── docker-compose.yml
├── src/
│   ├── file-processor/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── app/
│   │   │   ├── main.py
│   │   │   ├── file_watcher.py             # FileWatcher class with observer thread
│   │   │   ├── celery_app.py               # Celery Configuration
│   │   │   ├── config/
│   │   │   │   ├── __init__.py
│   │   │   │   └── settings.py              # JWT, MongoDB config
│   │   │   ├── models/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── user.py                  # User Pydantic models
│   │   │   │   ├── auth.py                  # Auth Pydantic models
│   │   │   │   ├── document.py              # Document Pydantic models
│   │   │   │   ├── job.py                   # Job Processing Pydantic models
│   │   │   │   └── types.py                 # PyObjectId and other useful types
│   │   │   ├── database/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── connection.py            # MongoDB connection (pymongo)
│   │   │   │   └── repositories/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── user_repository.py      # User CRUD operations (synchronous)
│   │   │   │       ├── document_repository.py  # Document CRUD operations (synchronous)
│   │   │   │       └── job_repository.py       # Job CRUD operations (synchronous)
│   │   │   ├── services/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth_service.py          # JWT & password logic (synchronous)
│   │   │   │   ├── user_service.py          # User business logic (synchronous)
│   │   │   │   ├── document_service.py      # Document business logic (synchronous)
│   │   │   │   ├── job_service.py           # Job processing logic (synchronous)
│   │   │   │   └── init_service.py          # Admin creation at startup
│   │   │   ├── api/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── dependencies.py          # Auth dependencies
│   │   │   │   └── routes/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── auth.py              # Authentication routes
│   │   │   │       └── users.py             # User management routes
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── security.py             # Password utilities
│   │   │       └── document_matching.py    # Fuzzy matching algorithms
│   ├── worker/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── tasks/
│   └── frontend/
│       ├── Dockerfile
│       ├── package.json
│       ├── index.html
│       └── src/
│           ├── assets/
│           ├── App.css
│           ├── App.jsx
│           ├── main.css
│           └── main.jsx
├── tests/
│   ├── file-processor/
│   │   ├── test_auth/
│   │   ├── test_users/
│   │   └── test_services/
│   └── worker/
├── volumes/
│   └── watched_files/
└── README.md
```

## Authentication & User Management

### Security Features

- **JWT Authentication**: Stateless authentication with 24-hour token expiration
- **Password Security**: bcrypt hashing with automatic salting
- **Role-Based Access**: Admin and User roles with granular permissions
- **Protected Routes**: All user management APIs require valid authentication
- **Auto Admin Creation**: Default admin user created on first startup

### User Roles

- **Admin**: Full access to user management (create, read, update, delete users)
- **User**: Limited access (view own profile, access document processing features)

### Authentication Flow

1. **Login**: User provides credentials → Server validates → Returns JWT token
2. **API Access**: Client includes JWT in Authorization header
3. **Token Validation**: Server verifies token signature and expiration
4. **Role Check**: Server validates user permissions for requested resource

### User Management APIs

```
POST /auth/login              # Generate JWT token
GET  /users                   # List all users (admin only)
POST /users                   # Create new user (admin only)
PUT  /users/{user_id}         # Update user (admin only)
DELETE /users/{user_id}       # Delete user (admin only)
GET  /users/me                # Get current user profile (authenticated users)
```
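
As a rough illustration of calling a protected endpoint, the sketch below builds a request that carries the token in the `Authorization` header. The `Bearer` scheme and the helper name `build_authorized_request` are assumptions; the base URL is the local FastAPI address used elsewhere in this README.

```python
import urllib.request

API_BASE_URL = "http://localhost:8000"  # local FastAPI address


def build_authorized_request(
    path: str, token: str, method: str = "GET"
) -> urllib.request.Request:
    """Build a request that carries the JWT in the Authorization header.

    The "Bearer" scheme is an assumption; this README only states that the
    token travels in the Authorization header.
    """
    return urllib.request.Request(
        url=f"{API_BASE_URL}{path}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


# Example: prepare a call to the current-user endpoint (not sent here).
request = build_authorized_request("/users/me", token="<jwt-from-login>")
```

Passing the request to `urllib.request.urlopen` would send it once the stack is running.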

### Useful Service URLs

- **FastAPI API**: http://localhost:8000
- **FastAPI Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
- **Redis**: localhost:6379
- **MongoDB**: localhost:27017

### Testing Commands

```bash
# Test FastAPI health
curl http://localhost:8000/health

# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello from test!"}'

# Monitor Celery tasks
docker-compose logs -f worker
```

## Default Admin User

On first startup, the application automatically creates a default admin user:

- **Username**: `admin`
- **Password**: `admin`
- **Role**: `admin`
- **Email**: `admin@mydocmanager.local`

**⚠️ Important**: Change the default admin password immediately after first login in production environments.

## File Processing Architecture

### Document Processing Flow

1. **File Detection**: Watchdog monitors `/volumes/watched_files/` directory in real-time
2. **Task Creation**: File watcher creates Celery task for each detected file
3. **Document Processing**: Celery worker processes the document and extracts content
4. **Database Storage**: Processed data stored in MongoDB collections

### MongoDB Collections Design

#### Files Collection

Stores file metadata and extracted content using Pydantic models:

```python
class FileDocument(BaseModel):
  """
  Model for file documents stored in the 'files' collection.

  Represents a file detected in the watched directory with its
  metadata and extracted content.
  """

  id: Optional[PyObjectId] = Field(default=None, alias="_id")
  filename: str = Field(..., description="Original filename")
  filepath: str = Field(..., description="Full path to the file")
  file_type: FileType = Field(..., description="Type of the file")
  extraction_method: Optional[ExtractionMethod] = Field(default=None, description="Method used to extract content")
  metadata: Dict[str, Any] = Field(default_factory=dict, description="File-specific metadata")
  detected_at: Optional[datetime] = Field(default=None, description="Timestamp when file was detected")
  file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
  encoding: str = Field(default="utf-8", description="Character encoding for text files")
  file_size: int = Field(..., ge=0, description="File size in bytes")
  mime_type: str = Field(..., description="MIME type detected")

  @field_validator('filepath')
  @classmethod
  def validate_filepath(cls, v: str) -> str:
    """Validate filepath format."""
    if not v.strip():
      raise ValueError("Filepath cannot be empty")
    return v.strip()

  @field_validator('filename')
  @classmethod
  def validate_filename(cls, v: str) -> str:
    """Validate filename format."""
    if not v.strip():
      raise ValueError("Filename cannot be empty")
    return v.strip()
```

#### Processing Jobs Collection

Tracks processing status and lifecycle:

```python
class ProcessingJob(BaseModel):
  """
  Model for processing jobs stored in the 'processing_jobs' collection.

  Tracks the lifecycle and status of document processing tasks.
  """

  id: Optional[PyObjectId] = Field(default=None, alias="_id")
  file_id: PyObjectId = Field(..., description="Reference to file document")
  status: ProcessingStatus = Field(default=ProcessingStatus.PENDING, description="Current processing status")
  task_id: Optional[str] = Field(default=None, description="Celery task UUID")
  created_at: Optional[datetime] = Field(default=None, description="Timestamp when job was created")
  started_at: Optional[datetime] = Field(default=None, description="Timestamp when processing started")
  completed_at: Optional[datetime] = Field(default=None, description="Timestamp when processing completed")
  error_message: Optional[str] = Field(default=None, description="Error message if processing failed")

  @field_validator('error_message')
  @classmethod
  def validate_error_message(cls, v: Optional[str]) -> Optional[str]:
    """Clean up error message."""
    if v is not None:
      return v.strip() if v.strip() else None
    return v
```

### Supported File Types (Initial Implementation)

- **Text Files** (`.txt`): Direct content reading
- **PDF Documents** (`.pdf`): Text extraction via PyMuPDF/pdfplumber
- **Word Documents** (`.docx`): Content extraction via python-docx
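
One possible way to route a file to the right extraction backend is a lookup on its extension. The sketch below is an assumption about how this could be organized: the backend labels and function names are hypothetical, and only the `.txt` path is implemented because it needs nothing beyond the standard library.

```python
from pathlib import Path

# Extension -> extraction backend, following the list above.
# The labels are placeholders, not identifiers from the codebase.
EXTRACTORS = {
    ".txt": "direct_read",
    ".pdf": "pymupdf_or_pdfplumber",
    ".docx": "python_docx",
}


def select_extractor(path: str) -> str:
    """Return the backend label for `path`, or raise for unsupported types."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def read_text_file(path: str, encoding: str = "utf-8") -> str:
    """Direct content reading for .txt files."""
    return Path(path).read_text(encoding=encoding)
```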

### File Processing Architecture Decisions

#### Watchdog Implementation

- **Choice**: Dedicated observer thread
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in separate thread from FastAPI

#### Task Dispatch Strategy

- **Choice**: Direct Celery task creation from file watcher
- **Rationale**: Minimal latency, straightforward flow
- **Implementation**: File detected → Immediate Celery task dispatch
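
The direct-dispatch idea can be sketched as an event handler that forwards each new file to a task queue. To stay dependency-free, the class below does not import watchdog; in the real service it would presumably subclass `watchdog.events.FileSystemEventHandler`, whose events expose `is_directory` and `src_path`. The class name and the `dispatch` callable (standing in for a Celery task's `.delay`) are hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, List


class NewFileHandler:
    """Dispatches one task per newly created file."""

    def __init__(self, dispatch: Callable[[str], None]) -> None:
        # `dispatch` stands in for something like `process_document.delay`
        # (a hypothetical Celery task name).
        self._dispatch = dispatch

    def on_created(self, event) -> None:
        """Called from the observer thread when a path appears."""
        if event.is_directory:
            return  # only files are processed
        self._dispatch(event.src_path)


# Usage with fake events and an in-memory list acting as the queue.
queued: List[str] = []
handler = NewFileHandler(dispatch=queued.append)
handler.on_created(
    SimpleNamespace(is_directory=False, src_path="/volumes/watched_files/report.pdf")
)
handler.on_created(
    SimpleNamespace(is_directory=True, src_path="/volumes/watched_files/archive")
)
print(queued)  # only the file path was queued
```

Injecting the dispatch callable keeps the handler testable without a running broker.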

#### Data Storage Strategy

- **Choice**: Separate collections for files and processing status
- **Rationale**: Clean separation of file data vs processing lifecycle
- **Benefits**:
    - Better query performance
    - Clear data model boundaries
    - Easy processing status tracking

#### Content Storage Location

- **Choice**: Store files in the file system, using the SHA256 hash as filename
- **Rationale**: MongoDB is not meant for storing large files, so keeping them on disk performs better and leaves them
  easy to access directly.
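
A minimal sketch of hash-named storage, assuming a flat storage directory; the function names and the directory layout are illustrative and not taken from the codebase. Since identical content always maps to the same name, the existence check also works as a duplicate check.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def sha256_of_file(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Hash the file in chunks so large documents are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_by_hash(source: Path, storage_dir: Path) -> Tuple[str, bool]:
    """Copy `source` into `storage_dir`, named after its SHA256 hash.

    Returns (file_hash, newly_stored). `newly_stored` is False when a file
    with identical content is already present.
    """
    file_hash = sha256_of_file(source)
    target = storage_dir / file_hash
    if target.exists():
        return file_hash, False
    storage_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return file_hash, True
```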

#### Repository and Services Implementation

- **Choice**: Synchronous implementation using pymongo
- **Rationale**: Full compatibility with Celery workers and simplified workflow
- **Implementation**: All repositories and services operate synchronously for seamless integration


## Job Management Layer

### Repository Pattern Implementation

The job management system follows the repository pattern for clean separation between data access and business logic.

#### JobRepository

Handles direct MongoDB operations for processing jobs using synchronous pymongo:

**CRUD Operations:**
- `create_job()` - Create new processing job with automatic `created_at` timestamp
- `get_job_by_id()` - Retrieve job by ObjectId
- `update_job_status()` - Update job status with automatic timestamp management
- `delete_job()` - Remove job from database
- `get_jobs_by_file_id()` - Get all jobs for specific file
- `get_jobs_by_status()` - Get jobs filtered by processing status

**Automatic Timestamp Management:**
- `created_at`: Set automatically during job creation
- `started_at`: Set automatically when status changes to PROCESSING
- `completed_at`: Set automatically when status changes to COMPLETED or FAILED
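
The timestamp rules above could be implemented by deriving the fields to write from the target status. The sketch below is an assumption about the shape of that logic: `build_status_update` is a hypothetical helper, and the enum's string values are guesses, since this README only names the members.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class ProcessingStatus(str, Enum):
    # String values are assumed; the README only names the members.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def build_status_update(
    status: ProcessingStatus,
    error_message: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Return the fields to write when a job moves to `status`."""
    timestamp = now if now is not None else datetime.now(timezone.utc)
    fields: Dict[str, Any] = {"status": status.value}
    if status is ProcessingStatus.PROCESSING:
        fields["started_at"] = timestamp
    elif status in (ProcessingStatus.COMPLETED, ProcessingStatus.FAILED):
        fields["completed_at"] = timestamp
    if error_message is not None:
        fields["error_message"] = error_message
    return fields
```

With pymongo, the returned dictionary could be applied as `collection.update_one({"_id": job_id}, {"$set": fields})`.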

#### JobService

Provides synchronous business logic layer with strict status transition validation:

**Status Transition Methods:**
- `mark_job_as_started()` - PENDING → PROCESSING
- `mark_job_as_completed()` - PROCESSING → COMPLETED
- `mark_job_as_failed()` - PROCESSING → FAILED

**Validation Rules:**
- Strict status transitions (invalid transitions raise exceptions)
- Job existence verification before any operation
- Automatic timestamp management through repository layer

#### Custom Exceptions

- **InvalidStatusTransitionError**: Raised for invalid status transitions
- **JobRepositoryError**: Raised for MongoDB operation failures

#### Valid Status Transitions

```
PENDING → PROCESSING    (via mark_job_as_started)
PROCESSING → COMPLETED  (via mark_job_as_completed)
PROCESSING → FAILED     (via mark_job_as_failed)
```

All other transitions are forbidden and will raise `InvalidStatusTransitionError`.
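
A transition table makes this rule easy to enforce in one place. The sketch below shows one way to do it; the enum values, the `ALLOWED_TRANSITIONS` table, and `validate_transition` are illustrative assumptions, while the exception name matches the one listed above.

```python
from enum import Enum


class ProcessingStatus(str, Enum):
    # String values are assumed; the README only names the members.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


class InvalidStatusTransitionError(Exception):
    """Raised when a job is moved along a forbidden transition."""

    def __init__(self, current: ProcessingStatus, target: ProcessingStatus) -> None:
        super().__init__(f"Cannot move job from {current.value} to {target.value}")
        self.current = current
        self.target = target


# Only the three transitions listed above are allowed.
ALLOWED_TRANSITIONS = {
    ProcessingStatus.PENDING: {ProcessingStatus.PROCESSING},
    ProcessingStatus.PROCESSING: {ProcessingStatus.COMPLETED, ProcessingStatus.FAILED},
}


def validate_transition(current: ProcessingStatus, target: ProcessingStatus) -> None:
    """Raise InvalidStatusTransitionError unless current -> target is allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidStatusTransitionError(current, target)
```

COMPLETED and FAILED have no entry in the table, so they behave as terminal states.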

### File Structure

```
src/file-processor/app/
├── database/repositories/
│   ├── job_repository.py           # JobRepository class (synchronous)
│   ├── user_repository.py          # UserRepository class (synchronous)
│   ├── document_repository.py      # DocumentRepository class (synchronous)
│   └── file_repository.py          # FileRepository class (synchronous)
├── services/
│   ├── job_service.py              # JobService class (synchronous)
│   ├── auth_service.py             # AuthService class (synchronous)
│   ├── user_service.py             # UserService class (synchronous)
│   └── document_service.py         # DocumentService class (synchronous)
└── exceptions/
    └── job_exceptions.py           # Custom exceptions
```

### Processing Pipeline Features

- **Duplicate Detection**: SHA256 hashing prevents reprocessing same files
- **Error Handling**: Failed processing tracked with error messages
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
- **Synchronous Operations**: All database operations use pymongo for Celery compatibility

## Key Implementation Notes

### Python Standards

- **Style**: PEP 8 compliance
- **Documentation**: Google/NumPy docstring format
- **Naming**: snake_case for variables and functions
- **Testing**: pytest with test_i_can_xxx / test_i_cannot_xxx patterns
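
To illustrate the test naming pattern, here is a small pair of tests around a hypothetical `clean_filename` helper that mirrors the filename validator of the `FileDocument` model. Plain `try`/`except` is used so the snippet runs without pytest installed; in the project itself `pytest.raises` would be the more idiomatic choice.

```python
def clean_filename(value: str) -> str:
    """Hypothetical helper: strip whitespace and reject empty names."""
    if not value.strip():
        raise ValueError("Filename cannot be empty")
    return value.strip()


def test_i_can_clean_a_padded_filename():
    assert clean_filename("  report.pdf ") == "report.pdf"


def test_i_cannot_use_an_empty_filename():
    try:
        clean_filename("   ")
    except ValueError as exc:
        assert str(exc) == "Filename cannot be empty"
    else:
        raise AssertionError("Expected ValueError")
```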

### Security Best Practices

- **Password Storage**: Never store plain text passwords, always use bcrypt hashing
- **JWT Secrets**: Use strong, randomly generated secret keys in production
- **Token Expiration**: 24-hour expiration with secure signature validation
- **Role Validation**: Server-side role checking for all protected endpoints

### Dependencies Management

- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible
- **Database Driver**: pymongo for synchronous MongoDB operations

### Testing Strategy

- All code must be testable
- Unit tests for each authentication and user management function
- Integration tests for complete authentication flow
- Tests validated before implementation

### Critical Architecture Decisions Made

1. **JWT Authentication**: Simple token-based auth with 24-hour expiration
2. **Role-Based Access**: Admin/User roles for granular permissions
3. **bcrypt Password Hashing**: Industry-standard password security
4. **MongoDB User Storage**: Centralized user management in main database
5. **Auto Admin Creation**: Automatic setup for first-time deployment
6. **Single FastAPI Service**: Handles both API and file watching with authentication
7. **Celery with Redis**: Chosen over other async patterns for scalability
8. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
9. **Container Development**: Hot-reload setup required for development workflow
10. **Dedicated Watchdog Observer**: Thread-based file monitoring for reliability
11. **Separate MongoDB Collections**: Files and processing jobs stored separately
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files
15. **Synchronous Implementation**: All repositories and services use pymongo for Celery compatibility

### Development Process Requirements

1. **Collaborative Validation**: All options must be explained before coding
2. **Test-First Approach**: Test cases defined and validated before implementation
3. **Incremental Development**: Start simple, extend functionality progressively
4. **Error Handling**: Clear problem explanation required before proposing fixes

### Next Implementation Steps

1. Build React Login Page
2. Build React Registration Page
3. Build React Default Dashboard
4. Build React User Management Pages

#### Validated Folders and Files

```
src/frontend/src/
├── components/
│   ├── auth/
│   │   ├── LoginForm.jsx          # Login form component => Done
│   │   └── AuthLayout.jsx         # Layout for auth pages => Done
│   └── common/
│       ├── Header.jsx             # Shared header => TODO
│       ├── Layout.jsx             # Shared layout => TODO
│       └── ProtectedRoutes.jsx    # Done
├── contexts/
│   └── AuthContext.jsx            # Done
├── pages/
│   ├── LoginPage.jsx              # Full login page => Done
│   └── DashboardPage.jsx          # Dashboard page (example) => TODO
├── services/
│   └── authService.js             # API service for auth => Done
├── hooks/
│   └── useAuth.js                 # React hook for auth handling => TODO
├── utils/
│   └── api.js                     # axios/fetch configuration => Done
└── App.jsx                        # Needs to be updated => TODO
```

#### Choices already made

* For API requests and authentication state handling: axios (more features)
  * axios installed for HTTP requests
  * Interceptors for automatic token handling
  * Centralized error handling
* For authentication state and navigation: Options A and C combined
  * Option A - React Context + React Router:
    * React Context for global auth state (user, token, isAuthenticated)
    * React Router for navigation between pages
    * Automatically protected routes
  * Option C - Context + localStorage for persistence:
    * Token saved in localStorage to stay logged in
    * Context reloaded when the app starts
* CSS: daisyUI

#### package.json

```json
{
  "name": "frontend",
  "private": true,
  "version": "0.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "lint": "eslint .",
    "preview": "vite preview"
  },
  "dependencies": {
    "@tailwindcss/vite": "^4.1.13",
    "axios": "^1.12.2",
    "react": "^19.1.1",
    "react-dom": "^19.1.1",
    "react-router-dom": "^7.9.3"
  },
  "devDependencies": {
    "@eslint/js": "^9.33.0",
    "@types/react": "^19.1.10",
    "@types/react-dom": "^19.1.7",
    "@vitejs/plugin-react": "^5.0.0",
    "autoprefixer": "^10.4.21",
    "daisyui": "^5.1.23",
    "eslint": "^9.33.0",
    "eslint-plugin-react-hooks": "^5.2.0",
    "eslint-plugin-react-refresh": "^0.4.20",
    "globals": "^16.3.0",
    "postcss": "^8.5.6",
    "tailwindcss": "^4.1.13",
    "vite": "^7.1.2"
  }
}
```

## Annexes

### Docker Commands Reference

#### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

#### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

#### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

#### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
    - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
    - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`