# MyDocManager

## Overview

MyDocManager is a real-time document processing application that automatically detects files in a monitored directory,
processes them asynchronously, and stores the results in a database. The application uses a modern microservices
architecture with Redis for task queuing and MongoDB for data persistence.

## Architecture

### Technology Stack

- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library

### Services Architecture

    ┌─────────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │    Frontend     │    │    file-    │    │    Redis    │    │   Worker    │    │   MongoDB   │
    │    (React)      │◄──►│  processor  │───►│  (Broker)   │◄──►│  (Celery)   │───►│  (Results)  │
    │                 │    │ (FastAPI +  │    │             │    │             │    │             │
    │                 │    │  watchdog)  │    │             │    │             │    │             │
    └─────────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

### Docker Services

1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access

## Data Flow

1. **File Detection**: Watchdog monitors the target directory in real time
2. **Task Creation**: FastAPI creates a Celery task for each detected file
3. **Task Processing**: A worker processes the document (OCR, text extraction)
4. **Result Storage**: Processed data is stored in MongoDB
5. **Monitoring**: The React frontend displays processing status and results

## Document Processing Capabilities

### Supported File Types

- **PDF**: Direct text extraction + OCR for scanned documents
- **Word Documents**: .docx text extraction
- **Images**: OCR text recognition (JPG, PNG, etc.)

### Processing Libraries

- **EasyOCR**: Modern OCR engine (80+ languages, deep learning-based)
- **PyMuPDF**: PDF text extraction and manipulation
- **python-docx**: Word document processing
- **pdfplumber**: Advanced PDF text extraction

## Development Environment

### Container-Based Development

The application is designed for container-based development with hot-reload capabilities:

- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity

### Key Features

- **Real-time Processing**: Immediate file detection and processing
- **Horizontal Scaling**: Multiple workers can be added easily
- **Fault Tolerance**: Celery provides automatic retry mechanisms
- **Monitoring**: Built-in task status tracking
- **Hot Reload**: Development changes reflected instantly in containers

## Project Structure

```
MyDocManager/
├── docker-compose.yml
├── src/
│   ├── file-processor/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── app/
│   │   │   ├── main.py
│   │   │   ├── file_watcher.py
│   │   │   ├── celery_app.py
│   │   │   ├── config/
│   │   │   │   ├── __init__.py
│   │   │   │   └── settings.py             # JWT, MongoDB config
│   │   │   ├── models/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── user.py                 # User Pydantic models
│   │   │   │   └── auth.py                 # Auth Pydantic models
│   │   │   ├── database/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── connection.py           # MongoDB connection
│   │   │   │   └── repositories/
│   │   │   │       ├── __init__.py
│   │   │   │       └── user_repository.py  # User CRUD operations
│   │   │   ├── services/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── auth_service.py         # JWT & password logic
│   │   │   │   ├── user_service.py         # User business logic
│   │   │   │   └── init_service.py         # Admin creation at startup
│   │   │   ├── api/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── dependencies.py         # Auth dependencies
│   │   │   │   └── routes/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── auth.py             # Authentication routes
│   │   │   │       └── users.py            # User management routes
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── security.py             # Password utilities
│   │   │       └── exceptions.py           # Custom exceptions
│   ├── worker/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── tasks/
│   └── frontend/
│       ├── Dockerfile
│       ├── package.json
│       └── src/
├── tests/
│   ├── file-processor/
│   │   ├── test_auth/
│   │   ├── test_users/
│   │   └── test_services/
│   └── worker/
├── volumes/
│   └── watched_files/
└── README.md
```

## Authentication & User Management

### Security Features

- **JWT Authentication**: Stateless authentication with 24-hour token expiration
- **Password Security**: bcrypt hashing with automatic salting
- **Role-Based Access**: Admin and User roles with granular permissions
- **Protected Routes**: All user management APIs require valid authentication
- **Auto Admin Creation**: Default admin user created on first startup

### User Roles

- **Admin**: Full access to user management (create, read, update, delete users)
- **User**: Limited access (view own profile, access document processing features)

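The two roles above can be expressed as a simple permission lookup. This is a minimal sketch only: the permission names (`users:create`, `profile:read`, and so on) and the `is_allowed` helper are hypothetical, and granting admins access to document processing is an assumption not stated in this README.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "admin"
    USER = "user"


# Hypothetical permission names; the README only describes the two roles.
PERMISSIONS = {
    Role.ADMIN: {
        "users:create", "users:read", "users:update", "users:delete",
        "profile:read", "documents:process",  # assumption: admins keep user-level access
    },
    Role.USER: {"profile:read", "documents:process"},
}


def is_allowed(role: Role, permission: str) -> bool:
    """Return True when the given role grants the given permission."""
    return permission in PERMISSIONS.get(role, set())
```

A route guard would call `is_allowed(current_user.role, "users:delete")` on the server before performing the operation.
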
### Authentication Flow

1. **Login**: User provides credentials → Server validates → Returns JWT token
2. **API Access**: Client includes JWT in Authorization header
3. **Token Validation**: Server verifies token signature and expiration
4. **Role Check**: Server validates user permissions for requested resource

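To make steps 1 and 3 concrete, the sketch below issues and verifies an HS256-signed token using only the Python standard library. It is an illustration of the mechanism, not the project's `auth_service.py`, which may well rely on a JWT library instead. `SECRET_KEY`, `create_token`, `verify_token`, and the `sub`/`role` claim names are assumptions made for this example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"change-me"         # assumption: production loads a strong random secret
TOKEN_TTL_SECONDS = 24 * 60 * 60  # 24-hour expiration, as described above


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET_KEY, signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(username: str, role: str, now: Optional[float] = None) -> str:
    """Step 1: build a signed token after the credentials were validated."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "role": role, "exp": issued_at + TOKEN_TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input))}"


def verify_token(token: str, now: Optional[float] = None) -> dict:
    """Step 3: check signature and expiration, then return the claims."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(f"{header_b64}.{payload_b64}")
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload
```

The `role` claim returned by `verify_token` is what step 4 would check against the permissions required by the requested route.
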
### User Management APIs

```
POST   /auth/login        # Generate JWT token
GET    /users             # List all users (admin only)
POST   /users             # Create new user (admin only)
PUT    /users/{user_id}   # Update user (admin only)
DELETE /users/{user_id}   # Delete user (admin only)
GET    /users/me          # Get current user profile (authenticated users)
```

### Useful Service URLs

- **FastAPI API**: http://localhost:8000
- **FastAPI Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
- **Redis**: localhost:6379
- **MongoDB**: localhost:27017

### Testing Commands

```bash
# Test FastAPI health
curl http://localhost:8000/health

# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello from test!"}'

# Monitor Celery tasks
docker-compose logs -f worker
```

## Default Admin User

On first startup, the application automatically creates a default admin user:

- **Username**: `admin`
- **Password**: `admin`
- **Role**: `admin`
- **Email**: `admin@mydocmanager.local`

**⚠️ Important**: Change the default admin password immediately after first login in production environments.

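The "create only if missing" behaviour can be sketched as an idempotent startup step. Everything below is illustrative: `InMemoryUserRepository` stands in for the MongoDB-backed repository, `ensure_default_admin` is a hypothetical name, and the password hasher is injected so the sketch does not depend on bcrypt being installed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class User:
    username: str
    email: str
    role: str
    hashed_password: str


class InMemoryUserRepository:
    """Stand-in for the MongoDB user repository (illustration only)."""

    def __init__(self) -> None:
        self._users: Dict[str, User] = {}

    def find_by_username(self, username: str) -> Optional[User]:
        return self._users.get(username)

    def insert(self, user: User) -> None:
        self._users[user.username] = user


def ensure_default_admin(
    repo: InMemoryUserRepository, hash_password: Callable[[str], str]
) -> bool:
    """Create the default admin only when it is missing.

    Returns:
        True if the admin was created, False if it already existed.
    """
    if repo.find_by_username("admin") is not None:
        return False
    repo.insert(
        User(
            username="admin",
            email="admin@mydocmanager.local",
            role="admin",
            hashed_password=hash_password("admin"),  # never store the plain password
        )
    )
    return True
```

Because the function checks before inserting, calling it on every startup is safe and leaves an existing admin (including a changed password) untouched.
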
## File Processing Architecture

### Document Processing Flow

1. **File Detection**: Watchdog monitors `/volumes/watched_files/` directory in real-time
2. **Task Creation**: File watcher creates Celery task for each detected file
3. **Document Processing**: Celery worker processes the document and extracts content
4. **Database Storage**: Processed data stored in MongoDB collections

### MongoDB Collections Design

#### Files Collection

Stores file metadata and extracted content:

```json
{
  "_id": "ObjectId",
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
  "mime_type": "application/pdf",
  "file_size": 2048576,
  "content": "extracted text content...",
  "encoding": "utf-8",
  "extraction_method": "direct_text",  // direct_text, ocr, hybrid
  "metadata": {
    "page_count": 15,                  // for PDFs
    "word_count": 250,                 // for text files
    "image_dimensions": {              // for images
      "width": 1920,
      "height": 1080
    }
  },
  "detected_at": "2024-01-15T10:29:00Z",
  "file_hash": "sha256_hash_value"
}
```

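As a rough illustration of how this document shape might be modelled in code, the sketch below uses a standard-library dataclass. The project itself lists Pydantic models for its collections, so treat `FileDocument` and `to_document` as hypothetical stand-ins that only mirror the fields shown above.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

EXTRACTION_METHODS = {"direct_text", "ocr", "hybrid"}


@dataclass
class FileDocument:
    """Mirror of one entry in the files collection (without `_id`)."""

    filename: str
    filepath: str
    file_type: str
    mime_type: str
    file_size: int
    content: str
    file_hash: str
    encoding: str = "utf-8"
    extraction_method: str = "direct_text"
    # Free-form, per-type metadata: page_count, word_count, image_dimensions, ...
    metadata: Dict[str, Any] = field(default_factory=dict)
    detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        if self.extraction_method not in EXTRACTION_METHODS:
            raise ValueError(f"unknown extraction method: {self.extraction_method}")

    def to_document(self) -> Dict[str, Any]:
        """Return a plain dict ready to be inserted; MongoDB assigns `_id`."""
        return asdict(self)
```

Keeping `metadata` as an open dictionary matches the "extensible metadata" idea: a PDF can carry `page_count` while an image carries `image_dimensions`, without a schema change.
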
#### Processing Jobs Collection

Tracks processing status and lifecycle:

```json
{
  "_id": "ObjectId",
  "file_id": "reference_to_files_collection",
  "status": "completed",  // pending, processing, completed, failed
  "task_id": "celery_task_uuid",
  "created_at": "2024-01-15T10:29:00Z",
  "started_at": "2024-01-15T10:29:30Z",
  "completed_at": "2024-01-15T10:30:00Z",
  "error_message": null
}
```

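The four status values suggest a small state machine. The sketch below is one possible reading of that lifecycle: the allowed transitions, the `transition` helper, and the choice of which timestamp each transition sets are assumptions, since the README only lists the statuses and fields.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed lifecycle: pending -> processing -> completed | failed.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def transition(
    job: Dict[str, Any],
    new_status: JobStatus,
    when: datetime,
    error_message: Optional[str] = None,
) -> Dict[str, Any]:
    """Return an updated copy of the job, or raise on an illegal move."""
    current = JobStatus(job["status"])
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new_status.value}")
    updated = dict(job, status=new_status.value)
    if new_status is JobStatus.PROCESSING:
        updated["started_at"] = when
    elif new_status is JobStatus.COMPLETED:
        updated["completed_at"] = when
    elif new_status is JobStatus.FAILED:
        updated["error_message"] = error_message
    return updated
```

Returning a copy rather than mutating in place keeps the function easy to test; a repository method would then persist the returned fields.
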
### Supported File Types (Initial Implementation)

- **Text Files** (`.txt`): Direct content reading
- **PDF Documents** (`.pdf`): Text extraction via PyMuPDF/pdfplumber
- **Word Documents** (`.docx`): Content extraction via python-docx

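A worker task typically begins by routing a file to the right extractor based on its extension. The following sketch shows that routing for the three types listed above. The function names are hypothetical, and only the `.txt` branch is implemented because the PDF and Word branches depend on third-party libraries.

```python
from pathlib import Path

# Extension -> (file_type, extraction approach named in the list above).
SUPPORTED_TYPES = {
    ".txt": ("txt", "direct read"),
    ".pdf": ("pdf", "PyMuPDF/pdfplumber"),
    ".docx": ("docx", "python-docx"),
}


def detect_file_type(path: Path) -> str:
    """Return the file type for a supported extension, else raise ValueError."""
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_TYPES[suffix][0]


def extract_text(path: Path) -> str:
    """Extract content; only the .txt branch is implemented in this sketch."""
    file_type = detect_file_type(path)
    if file_type == "txt":
        return path.read_text(encoding="utf-8")
    # The PDF and Word branches would call the third-party libraries here.
    raise NotImplementedError(f"{file_type} extraction is not part of this sketch")
```

Rejecting unsupported extensions early lets the job be marked as failed with a clear error message instead of failing somewhere inside an extractor.
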
### File Processing Architecture Decisions

#### Watchdog Implementation

- **Choice**: Dedicated observer thread (Option A)
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in separate thread from FastAPI

#### Task Dispatch Strategy

- **Choice**: Direct Celery task creation from file watcher
- **Rationale**: Minimal latency, straightforward flow
- **Implementation**: File detected → Immediate Celery task dispatch

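The "detect, then dispatch immediately" pattern can be illustrated without any external services. In the sketch below, a standard-library `queue.Queue` stands in for the Redis broker and a plain thread stands in for a Celery worker. Neither is how the project is wired, and the names `on_file_detected` and `worker_loop` are hypothetical.

```python
import queue
import threading
from typing import List

STOP = None  # sentinel telling the stand-in worker to exit


def on_file_detected(path: str, broker: "queue.Queue") -> None:
    """Called by the watcher: enqueue right away, with no batching or polling."""
    broker.put(path)


def worker_loop(broker: "queue.Queue", results: List[str]) -> None:
    """Stand-in worker: consume paths until the stop sentinel arrives."""
    while True:
        path = broker.get()
        try:
            if path is STOP:
                return
            results.append(f"processed:{path}")
        finally:
            broker.task_done()


if __name__ == "__main__":
    broker: "queue.Queue" = queue.Queue()
    results: List[str] = []
    worker = threading.Thread(target=worker_loop, args=(broker, results))
    worker.start()
    on_file_detected("/watched_files/document.pdf", broker)
    broker.put(STOP)
    worker.join()
    print(results)
```

The watcher never waits for processing to finish; it only hands the path to the broker, which is what keeps detection latency low in the real Celery setup as well.
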
#### Data Storage Strategy

- **Choice**: Separate collections for files and processing status
- **Rationale**: Clean separation of file data vs processing lifecycle
- **Benefits**:
  - Better query performance
  - Clear data model boundaries
  - Easy processing status tracking

#### Content Storage Location

- **Choice**: Store extracted content in `files` collection
- **Rationale**: Content is intrinsic property of the file
- **Benefits**: Single query to get file + content, simpler data model

### Implementation Order

1. ✅ Pydantic models for MongoDB collections
2. ✅ Repository layer for data access (files + processing_jobs)
3. ✅ Celery tasks for document processing
4. ✅ Watchdog file monitoring implementation
5. ✅ FastAPI integration and startup coordination

### Processing Pipeline Features

- **Duplicate Detection**: SHA256 hashing prevents reprocessing same files
- **Error Handling**: Failed processing tracked with error messages
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches

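Duplicate detection by SHA256 can be sketched in a few lines of standard-library Python. The helper names are hypothetical, and the in-memory `known_hashes` set stands in for a lookup on the `file_hash` field in MongoDB.

```python
import hashlib
from pathlib import Path
from typing import Set


def compute_file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large documents are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_hashes: Set[str]) -> bool:
    """Return True if identical content was seen before; otherwise record it."""
    file_hash = compute_file_hash(path)
    if file_hash in known_hashes:
        return True
    known_hashes.add(file_hash)
    return False
```

Because the hash covers content rather than the filename, a renamed copy of an already processed file is still recognised as a duplicate.
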
## Key Implementation Notes

### Python Standards

- **Style**: PEP 8 compliance
- **Documentation**: Google/NumPy docstring format
- **Naming**: snake_case for variables and functions
- **Testing**: pytest with test_i_can_xxx / test_i_cannot_xxx patterns

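As an illustration of these conventions, the sketch below pairs a small function carrying a Google-style docstring with one `test_i_can_...` and one `test_i_cannot_...` test. `normalize_username` is a hypothetical helper invented for this example, not part of the project.

```python
def normalize_username(raw: str) -> str:
    """Normalize a username for storage and lookup.

    Args:
        raw: The username as typed by the user.

    Returns:
        The username stripped of surrounding whitespace and lowercased.

    Raises:
        ValueError: If the username is empty after stripping.
    """
    cleaned = raw.strip().lower()
    if not cleaned:
        raise ValueError("username must not be empty")
    return cleaned


def test_i_can_normalize_a_username():
    assert normalize_username("  Admin ") == "admin"


def test_i_cannot_normalize_an_empty_username():
    try:
        normalize_username("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

pytest collects both functions through the `test_` prefix. In a real test module the second test would more idiomatically use `pytest.raises(ValueError)`; plain `try`/`except` is used here only to keep the sketch dependency-free.
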
### Security Best Practices

- **Password Storage**: Never store plain text passwords, always use bcrypt hashing
- **JWT Secrets**: Use strong, randomly generated secret keys in production
- **Token Expiration**: 24-hour expiration with secure signature validation
- **Role Validation**: Server-side role checking for all protected endpoints

### Dependencies Management

- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible

### Testing Strategy

- All code must be testable
- Unit tests for each authentication and user management function
- Integration tests for complete authentication flow
- Tests validated before implementation

### Critical Architecture Decisions Made

1. **JWT Authentication**: Simple token-based auth with 24-hour expiration
2. **Role-Based Access**: Admin/User roles for granular permissions
3. **bcrypt Password Hashing**: Industry-standard password security
4. **MongoDB User Storage**: Centralized user management in main database
5. **Auto Admin Creation**: Automatic setup for first-time deployment
6. **Single FastAPI Service**: Handles both API and file watching with authentication
7. **Celery with Redis**: Chosen over other async patterns for scalability
8. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
9. **Container Development**: Hot-reload setup required for development workflow
10. **Dedicated Watchdog Observer**: Thread-based file monitoring for reliability
11. **Separate MongoDB Collections**: Files and processing jobs stored separately
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files

### Development Process Requirements

1. **Collaborative Validation**: All options must be explained before coding
2. **Test-First Approach**: Test cases defined and validated before implementation
3. **Incremental Development**: Start simple, extend functionality progressively
4. **Error Handling**: Clear problem explanation required before proposing fixes

### Next Implementation Steps

1. ✅ Create docker-compose.yml with all services => Done
2. ✅ Define user management and authentication architecture => Done
3. ✅ Implement user models and authentication services =>
    1. `models/user.py` => Done
    2. `models/auth.py` => Done
    3. `database/repositories/user_repository.py` => Done
4. ✅ Add automatic admin user creation if it does not exist => Done
5. **IN PROGRESS**: Implement file processing pipeline =>
    1. Create Pydantic models for files and processing_jobs collections
    2. Implement repository layer for file and processing job data access
    3. Create Celery tasks for document processing (.txt, .pdf, .docx)
    4. Implement Watchdog file monitoring with dedicated observer
    5. Integrate file watcher with FastAPI startup
6. Create protected API routes for user management
7. Build React monitoring interface with authentication

## Annexes

### Docker Commands Reference

#### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

#### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

#### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

#### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`