First commit. Docker Compose is working
Readme.md
# MyDocManager

## Overview

MyDocManager is a real-time document processing application that automatically detects files in a monitored directory, processes them asynchronously, and stores the results in a database. The application uses a modern microservices architecture with Redis for task queuing and MongoDB for data persistence.

## Architecture

### Technology Stack
- **Backend API**: FastAPI (Python 3.12)
- **Task Processing**: Celery with Redis broker
- **Document Processing**: EasyOCR, PyMuPDF, python-docx, pdfplumber
- **Database**: MongoDB
- **Frontend**: React
- **Containerization**: Docker & Docker Compose
- **File Monitoring**: Python watchdog library

### Services Architecture

    ┌─────────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │    Frontend     │    │    file-    │    │    Redis    │    │   Worker    │    │   MongoDB   │
    │    (React)      │◄──►│  processor  │───►│  (Broker)   │◄──►│  (Celery)   │───►│  (Results)  │
    │                 │    │ (FastAPI +  │    │             │    │             │    │             │
    │                 │    │  watchdog)  │    │             │    │             │    │             │
    └─────────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

### Docker Services
1. **file-processor**: FastAPI + real-time file monitoring + Celery task dispatch
2. **worker**: Celery workers for document processing (OCR, text extraction)
3. **redis**: Message broker for Celery tasks
4. **mongodb**: Final database for processing results
5. **frontend**: React interface for monitoring and file access

## Data Flow

1. **File Detection**: Watchdog monitors the target directory in real time
2. **Task Creation**: FastAPI creates a Celery task for each detected file
3. **Task Processing**: A worker processes the document (OCR, text extraction)
4. **Result Storage**: Processed data is stored in MongoDB
5. **Monitoring**: The React frontend displays processing status and results
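
The five steps above can be sketched with in-memory stand-ins. This is only an illustration of the flow, not the project's implementation: a `queue.Queue` plays the role of the Redis broker, a plain dict plays the role of MongoDB, and every function name here is hypothetical.

```python
from pathlib import Path
from queue import Queue

# In-memory stand-ins (assumptions): a Queue plays the Redis broker,
# a dict plays the MongoDB results collection.
task_queue: Queue = Queue()
results: dict = {}


def on_file_detected(path: Path) -> None:
    """Steps 1-2: a detected file becomes one queued task."""
    task_queue.put({"path": str(path)})


def process_document(task: dict) -> dict:
    """Step 3: placeholder for the real OCR / text extraction work."""
    suffix = Path(task["path"]).suffix.lower()
    return {"path": task["path"], "type": suffix, "status": "processed"}


def run_worker_once() -> None:
    """Steps 3-4: take one task, process it, and store the result."""
    task = task_queue.get()
    results[task["path"]] = process_document(task)
    task_queue.task_done()


def get_status(path: str) -> str:
    """Step 5: what a monitoring UI could ask for."""
    return results.get(path, {}).get("status", "pending")
```

In the real stack the queue and the store live in separate containers, which is what lets the worker run and scale independently of the API.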

## Document Processing Capabilities

### Supported File Types
- **PDF**: Direct text extraction + OCR for scanned documents
- **Word Documents**: .docx text extraction
- **Images**: OCR text recognition (JPG, PNG, etc.)
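
One way a worker could decide how to handle an incoming file is a lookup on the file extension. The sketch below is an assumption about how such routing might look; the `ROUTES` table, the route names, and `choose_route` are hypothetical, and the extension list only covers the types named above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical routing table, limited to the file types listed above.
ROUTES = {
    ".pdf": "pdf",      # direct text extraction, OCR fallback for scans
    ".docx": "docx",    # Word text extraction
    ".jpg": "image",    # OCR
    ".jpeg": "image",
    ".png": "image",
}


def choose_route(filename: str) -> Optional[str]:
    """Return the processing route for a file, or None if unsupported."""
    return ROUTES.get(Path(filename).suffix.lower())
```

Lowercasing the suffix means `scan.PDF` and `scan.pdf` take the same route, and unsupported files return `None` so the caller can skip or log them.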

### Processing Libraries
- **EasyOCR**: Modern OCR engine (80+ languages, deep learning-based)
- **PyMuPDF**: PDF text extraction and manipulation
- **python-docx**: Word document processing
- **pdfplumber**: Advanced PDF text extraction

## Development Environment

### Container-Based Development
The application is designed for container-based development with hot-reload capabilities:
- Source code mounted as volumes for real-time updates
- All services orchestrated via Docker Compose
- Development and production parity

### Key Features
- **Real-time Processing**: Immediate file detection and processing
- **Horizontal Scaling**: Multiple workers can be added easily
- **Fault Tolerance**: Celery provides automatic retry mechanisms
- **Monitoring**: Built-in task status tracking
- **Hot Reload**: Development changes reflected instantly in containers

## Project Structure (To be implemented)

    MyDocManager/
    ├── docker-compose.yml
    ├── src/
    │   ├── file-processor/
    │   │   ├── Dockerfile
    │   │   ├── requirements.txt
    │   │   ├── app/
    │   │   │   ├── main.py
    │   │   │   ├── file_watcher.py
    │   │   │   ├── celery_app.py
    │   │   │   └── api/
    │   ├── worker/
    │   │   ├── Dockerfile
    │   │   ├── requirements.txt
    │   │   └── tasks/
    │   └── frontend/
    │       ├── Dockerfile
    │       ├── package.json
    │       └── src/
    ├── tests/
    │   ├── file-processor/
    │   └── worker/
    ├── volumes/
    │   └── watched_files/
    └── README.md

## Docker Commands Reference

### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`

### Useful Service URLs

- **FastAPI API**: http://localhost:8000
- **FastAPI Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
- **Redis**: localhost:6379
- **MongoDB**: localhost:27017

### Testing Commands

```bash
# Test FastAPI health
curl http://localhost:8000/health

# Test Celery task dispatch
curl -X POST http://localhost:8000/test-task \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello from test!"}'

# Monitor Celery tasks
docker-compose logs -f worker
```
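
The same task-dispatch check can be scripted from Python with only the standard library, which may be handy in an integration test. This is a sketch mirroring the `curl` call above; the helper names are hypothetical, and the response format of `/test-task` is not assumed, so the body is returned as raw text.

```python
import json
import urllib.request


def build_test_task_request(
    message: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request as the curl example above."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/test-task",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the stack running, `send(build_test_task_request("Hello from test!"))` should reach the file-processor service, and the resulting task should show up in the worker logs.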

## Key Implementation Notes

### Python Standards
- **Style**: PEP 8 compliance
- **Documentation**: Google/NumPy docstring format
- **Naming**: snake_case for variables and functions
- **Testing**: pytest with test_i_can_xxx / test_i_cannot_xxx patterns
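
As an illustration of these conventions, here is a small hypothetical helper with a Google-style docstring and one test for each naming pattern. The function itself is invented for the example; in a real pytest suite the failure case would more likely use `pytest.raises`, which is avoided here only to keep the sketch dependency-free.

```python
from pathlib import Path


def normalize_extension(filename: str) -> str:
    """Return the lowercase extension of a filename.

    Args:
        filename: Name of the file, with or without a directory part.

    Returns:
        The extension including the leading dot, for example ".pdf".

    Raises:
        ValueError: If the filename has no extension.
    """
    suffix = Path(filename).suffix
    if not suffix:
        raise ValueError(f"no extension in {filename!r}")
    return suffix.lower()


def test_i_can_normalize_an_uppercase_extension():
    assert normalize_extension("Report.PDF") == ".pdf"


def test_i_cannot_normalize_a_file_without_extension():
    try:
        normalize_extension("README")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

The `test_i_can_` / `test_i_cannot_` prefixes make the expected success and failure cases visible directly in the test report.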

### Dependencies Management
- **Package Manager**: pip (standard)
- **External Dependencies**: Listed in each service's requirements.txt
- **Standard Library First**: Prefer standard library when possible

### Testing Strategy
- All code must be testable
- Unit tests for each processing function
- Integration tests for the file processing workflow
- Test cases are validated before implementation

### Critical Architecture Decisions Made
1. **Option Selected**: Single FastAPI service handles both API and file watching
2. **Celery with Redis**: Chosen over other async patterns for scalability
3. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
4. **Container Development**: Hot-reload setup required for development workflow

### Development Process Requirements
1. **Collaborative Validation**: All options must be explained before coding
2. **Test-First Approach**: Test cases defined and validated before implementation
3. **Incremental Development**: Start simple, extend functionality progressively
4. **Error Handling**: Clear problem explanation required before proposing fixes

### Next Implementation Steps
1. Create docker-compose.yml with all services
2. Implement basic FastAPI service structure
3. Add watchdog file monitoring
4. Create Celery task structure
5. Implement document processing tasks
6. Build React monitoring interface