# MyDocManager

## Overview

MyDocManager is a real-time document processing application that automatically detects files in a monitored directory, processes them asynchronously, and stores the results in a database. The application uses a modern microservices architecture with Redis for task queuing and MongoDB for data persistence.

## Architecture

### Technology Stack

- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library

### Services Architecture

    ┌─────────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │    Frontend     │    │   file-     │    │    Redis    │    │   Worker    │    │   MongoDB   │
    │    (React)      │◄──►│  processor  │───►│  (Broker)   │◄──►│  (Celery)   │───►│  (Results)  │
    │                 │    │ (FastAPI +  │    │             │    │             │    │             │
    │                 │    │  watchdog)  │    │             │    │             │    │             │
    └─────────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

### Docker Services

1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access

## Data Flow

1. **File Detection**: Watchdog monitors target directory in real-time
2. **Task Creation**: FastAPI creates Celery task for each detected file
3. **Task Processing**: Worker processes document (OCR, text extraction)
4. **Result Storage**: Processed data stored in MongoDB
5. **Monitoring**: React frontend displays processing status and results

## Document Processing Capabilities

### Supported File Types

- **PDF**: Direct text extraction + OCR for scanned documents
- **Word Documents**: .docx text extraction
- **Images**: OCR text recognition (JPG, PNG, etc.)
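
As a rough illustration of how the supported file types above might be routed to a processing path, the sketch below maps file extensions to a route name. The names `ROUTES` and `pick_route` are hypothetical and not part of the project, and the `.jpeg` entry is an assumption; the list above only names PDF, .docx, JPG and PNG explicitly.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to processing route.
# ".jpeg" is an assumed alias for JPG; adjust the table to the real requirements.
ROUTES = {
    ".pdf": "pdf",     # direct text extraction, OCR fallback for scans
    ".docx": "docx",   # Word text extraction
    ".jpg": "image",   # OCR
    ".jpeg": "image",  # OCR (assumed)
    ".png": "image",   # OCR
}


def pick_route(path: str) -> Optional[str]:
    """Return the processing route for a file, or None if the type is unsupported."""
    # Compare on the lower-cased suffix so "REPORT.PDF" and "report.pdf" match.
    return ROUTES.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ("report.PDF", "notes.docx", "scan.png", "archive.zip"):
        print(name, "->", pick_route(name))
```

Returning `None` for unknown extensions leaves it to the caller to decide whether an unsupported file is skipped or reported.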
### Processing Libraries

- **EasyOCR**: Modern OCR engine (80+ languages, deep learning-based)
- **PyMuPDF**: PDF text extraction and manipulation
- **python-docx**: Word document processing
- **pdfplumber**: Advanced PDF text extraction

## Development Environment

### Container-Based Development

The application is designed for container-based development with hot-reload capabilities:

- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity

### Key Features

- **Real-time Processing**: Immediate file detection and processing
- **Horizontal Scaling**: Multiple workers can be added easily
- **Fault Tolerance**: Celery provides automatic retry mechanisms
- **Monitoring**: Built-in task status tracking
- **Hot Reload**: Development changes reflected instantly in containers

## Project Structure (To be implemented)

    MyDocManager/
    ├── docker-compose.yml
    ├── src/
    │   ├── file-processor/
    │   │   ├── Dockerfile
    │   │   ├── requirements.txt
    │   │   ├── app/
    │   │   │   ├── main.py
    │   │   │   ├── file_watcher.py
    │   │   │   ├── celery_app.py
    │   │   │   └── api/
    │   ├── worker/
    │   │   ├── Dockerfile
    │   │   ├── requirements.txt
    │   │   └── tasks/
    │   └── frontend/
    │       ├── Dockerfile
    │       ├── package.json
    │       └── src/
    ├── tests/
    │   ├── file-processor/
    │   └── worker/
    ├── volumes/
    │   └── watched_files/
    └── README.md

## Docker Commands Reference

### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`

### Useful Service URLs

- **FastAPI API**: http://localhost:8000
- **FastAPI Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
- **Redis**: localhost:6379
- **MongoDB**: localhost:27017

### Testing Commands

```bash
# Test FastAPI health
curl http://localhost:8000/health

# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello from test!"}'

# Monitor Celery tasks
docker-compose logs -f worker
```

## Key Implementation Notes

### Python Standards

- **Style**: PEP 8 compliance
- **Documentation**: Google/NumPy docstring format
- **Naming**: snake_case for variables and functions
- **Testing**: pytest with test_i_can_xxx / test_i_cannot_xxx patterns

### Dependencies Management

- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible

### Testing Strategy

- All code must be testable
- Unit tests for each processing function
- Integration tests for file processing workflow
- Tests validated before implementation

### Critical Architecture Decisions Made

1. **Option Selected**: Single FastAPI service handles both API and file watching
2. **Celery with Redis**: Chosen over other async patterns for scalability
3. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
4. **Container Development**: Hot-reload setup required for development workflow

### Development Process Requirements

1. **Collaborative Validation**: All options must be explained before coding
2. **Test-First Approach**: Test cases defined and validated before implementation
3. **Incremental Development**: Start simple, extend functionality progressively
4. **Error Handling**: Clear problem explanation required before proposing fixes

### Next Implementation Steps

1. Create docker-compose.yml with all services
2. Implement basic FastAPI service structure
3. Add watchdog file monitoring
4. Create Celery task structure
5. Implement document processing tasks
6. Build React monitoring interface
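
To make the test-first approach and the `test_i_can_xxx` / `test_i_cannot_xxx` naming pattern concrete, here is a minimal sketch of what a first test pair might look like. The helper `build_task_payload` and the payload fields are hypothetical, invented only for illustration; the actual task structure has not been defined yet. The tests use plain `assert` statements, so pytest can collect them without extra imports.

```python
from pathlib import Path


def build_task_payload(path: str) -> dict:
    """Hypothetical helper: describe a detected file for a processing task."""
    file_path = Path(path)
    if not file_path.suffix:
        # Without an extension the file type cannot be determined.
        raise ValueError(f"cannot determine file type of {path!r}")
    return {"filename": file_path.name, "extension": file_path.suffix.lower()}


def test_i_can_build_payload_for_pdf():
    payload = build_task_payload("volumes/watched_files/Invoice.PDF")
    assert payload == {"filename": "Invoice.PDF", "extension": ".pdf"}


def test_i_cannot_build_payload_without_extension():
    try:
        build_task_payload("volumes/watched_files/README")
    except ValueError:
        return  # expected: the helper rejects files without an extension
    raise AssertionError("expected ValueError for a file without extension")
```

Under the test-first rule, the two test functions would be written and agreed on before the body of the helper is implemented.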