MyDocManager
Overview
MyDocManager is a real-time document processing application that automatically detects files in a monitored directory, processes them asynchronously, and stores the results in a database. The application uses a modern microservices architecture with Redis for task queuing and MongoDB for data persistence.
Architecture
Technology Stack
- Backend API: FastAPI (Python 3.12)
- Task Processing: Celery with Redis broker
- Document Processing: EasyOCR, PyMuPDF, python-docx, pdfplumber
- Database: MongoDB
- Frontend: React
- Containerization: Docker & Docker Compose
- File Monitoring: Python watchdog library
Services Architecture
┌─────────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Frontend     │    │    file-    │    │    Redis    │    │   Worker    │    │   MongoDB   │
│     (React)     │◄──►│  processor  │───►│  (Broker)   │◄──►│  (Celery)   │───►│  (Results)  │
│                 │    │ (FastAPI +  │    │             │    │             │    │             │
│                 │    │  watchdog)  │    │             │    │             │    │             │
└─────────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
Docker Services
- file-processor: FastAPI + real-time file monitoring + Celery task dispatch
- worker: Celery workers for document processing (OCR, text extraction)
- redis: Message broker for Celery tasks
- mongodb: Final database for processing results
- frontend: React interface for monitoring and file access
Data Flow
- File Detection: Watchdog monitors target directory in real-time
- Task Creation: FastAPI creates Celery task for each detected file
- Task Processing: Worker processes document (OCR, text extraction)
- Result Storage: Processed data stored in MongoDB
- Monitoring: React frontend displays processing status and results
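The five steps above can be traced end to end with a small in-process sketch. It is not the project's implementation: `queue.Queue` stands in for the Redis broker, a plain dict stands in for MongoDB, and the names `detect_files`, `process_document`, and `run_pipeline` are hypothetical. Detection here is a one-shot directory scan rather than watchdog events.

```python
import queue
import tempfile
from pathlib import Path


def detect_files(watched_dir: Path, seen: set[str]) -> list[Path]:
    """Return files in watched_dir that have not been seen yet (one-shot scan)."""
    new_files = [
        p for p in sorted(watched_dir.iterdir())
        if p.is_file() and p.name not in seen
    ]
    seen.update(p.name for p in new_files)
    return new_files


def process_document(path: Path) -> dict:
    """Placeholder for the worker step; real code would run OCR or text extraction."""
    return {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "status": "processed",
    }


def run_pipeline(watched_dir: Path) -> dict[str, dict]:
    """Detect files, enqueue one task per file, process them, store the results."""
    tasks: queue.Queue = queue.Queue()  # stand-in for the Redis broker
    results: dict[str, dict] = {}       # stand-in for MongoDB
    seen: set[str] = set()

    for path in detect_files(watched_dir, seen):  # step 1: file detection
        tasks.put(path)                           # step 2: task creation

    while not tasks.empty():
        path = tasks.get()
        # steps 3 and 4: processing, then result storage
        results[path.name] = process_document(path)

    return results  # step 5: what a monitoring UI would read
```

Calling `run_pipeline(Path("volumes/watched_files"))` on a directory containing files returns one result record per file, keyed by file name. In the real stack each stage runs in a different container, so the queue and the result store must be network services rather than in-memory objects.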
Document Processing Capabilities
Supported File Types
- PDF: Direct text extraction + OCR for scanned documents
- Word Documents: .docx text extraction
- Images: OCR text recognition (JPG, PNG, etc.)
Processing Libraries
- EasyOCR: Modern OCR engine (80+ languages, deep learning-based)
- PyMuPDF: PDF text extraction and manipulation
- python-docx: Word document processing
- pdfplumber: Advanced PDF text extraction
Development Environment
Container-Based Development
The application is designed for container-based development with hot-reload capabilities:
- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity
Key Features
- Real-time Processing: Immediate file detection and processing
- Horizontal Scaling: Multiple workers can be added easily
- Fault Tolerance: Celery supports task retries, including automatic retries for configured exceptions
- Monitoring: Built-in task status tracking
- Hot Reload: Development changes reflected instantly in containers
Project Structure (To be implemented)
MyDocManager/
├── docker-compose.yml
├── src/
│   ├── file-processor/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── app/
│   │   │   ├── main.py
│   │   │   ├── file_watcher.py
│   │   │   ├── celery_app.py
│   │   │   └── api/
│   ├── worker/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── tasks/
│   └── frontend/
│       ├── Dockerfile
│       ├── package.json
│       └── src/
├── tests/
│   ├── file-processor/
│   └── worker/
├── volumes/
│   └── watched_files/
└── README.md
Docker Commands Reference
Initial Setup & Build
# Build and start all services (first time)
docker-compose up --build
# Build and start in background
docker-compose up --build -d
# Build specific service
docker-compose build file-processor
docker-compose build worker
Development Workflow
# Start all services
docker-compose up
# Start in background (detached mode)
docker-compose up -d
# Stop all services
docker-compose down
# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v
# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
Monitoring & Debugging
# View logs of all services
docker-compose logs
# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb
# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker
# View running containers
docker-compose ps
# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
Service Management
# Start only specific services
docker-compose up redis mongodb file-processor
# Stop specific service
docker-compose stop worker
docker-compose stop file-processor
# Remove stopped containers
docker-compose rm
# Scale workers (multiple instances)
docker-compose up --scale worker=3
Hot-Reload Configuration
- file-processor: Hot-reload enabled via the --reload flag
  - Code changes in src/file-processor/app/ automatically restart FastAPI
- worker: No hot-reload (manual restart required for stability)
  - Code changes in src/worker/tasks/ require: docker-compose restart worker
Useful Service URLs
- FastAPI API: http://localhost:8000
- FastAPI Docs: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- Redis: localhost:6379
- MongoDB: localhost:27017
Testing Commands
# Test FastAPI health
curl http://localhost:8000/health
# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
-H "Content-Type: application/json" \
-d '{"message": "Hello from test!"}'
# Monitor Celery tasks
docker-compose logs -f worker
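The same two HTTP checks can be scripted from Python with only the standard library. This is a sketch under the assumption that the `/health` and `/test-task` endpoints behave as the curl commands above imply; the helper names `build_health_request`, `build_test_task_request`, and `send` are hypothetical. Nothing touches the network until `send` is called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # taken from the service URL list


def build_health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the health endpoint."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def build_test_task_request(
    message: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the JSON POST that mirrors the curl test-task command."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/test-task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> tuple[int, str]:
    """Send a prepared request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

With the stack running, `send(build_health_request())` and `send(build_test_task_request("Hello from test!"))` correspond to the two curl calls. Keeping request construction separate from sending makes the helpers easy to unit test without a live server.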
Key Implementation Notes
Python Standards
- Style: PEP 8 compliance
- Documentation: Google/NumPy docstring format
- Naming: snake_case for variables and functions
- Testing: pytest with test_i_can_xxx / test_i_cannot_xxx patterns
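As an illustration of these conventions together, the sketch below pairs a Google-style docstring with tests named after the `test_i_can_xxx` / `test_i_cannot_xxx` pattern. The function `normalize_extension` and its behavior are hypothetical and exist only to give the tests something to exercise. Plain `assert` and `try`/`except` are used so the example runs without pytest installed; under pytest, `pytest.raises(ValueError)` would be the usual way to write the second test.

```python
from pathlib import Path


def normalize_extension(filename: str) -> str:
    """Return the lowercase extension of a file name, without the leading dot.

    Args:
        filename: File name or path, for example "Scan.PDF".

    Returns:
        The extension in lowercase, for example "pdf".

    Raises:
        ValueError: If the file name has no extension.
    """
    suffix = Path(filename).suffix
    if not suffix:
        raise ValueError(f"no extension in {filename!r}")
    return suffix[1:].lower()


def test_i_can_normalize_an_uppercase_extension():
    assert normalize_extension("Scan.PDF") == "pdf"


def test_i_cannot_normalize_a_name_without_extension():
    try:
        normalize_extension("README")
    except ValueError:
        return  # expected failure path
    raise AssertionError("expected ValueError")
```

The "can" test documents behavior the function must support, and the "cannot" test documents an input it must reject, so the test names read as a short specification.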
Dependencies Management
- Package Manager: pip (standard)
- External Dependencies: Listed in each service's requirements.txt
- Standard Library First: Prefer standard library when possible
Testing Strategy
- All code must be testable
- Unit tests for each processing function
- Integration tests for file processing workflow
- Tests validated before implementation
Critical Architecture Decisions Made
- Option Selected: Single FastAPI service handles both API and file watching
- Celery with Redis: Chosen over other async patterns for scalability
- EasyOCR Preferred: Selected over Tesseract for modern OCR needs
- Container Development: Hot-reload setup required for development workflow
Development Process Requirements
- Collaborative Validation: All options must be explained before coding
- Test-First Approach: Test cases defined and validated before implementation
- Incremental Development: Start simple, extend functionality progressively
- Error Handling: Clear problem explanation required before proposing fixes
Next Implementation Steps
- Create docker-compose.yml with all services
- Implement basic FastAPI service structure
- Add watchdog file monitoring
- Create Celery task structure
- Implement document processing tasks
- Build React monitoring interface