Working on document repository
315
Readme.md
@@ -178,93 +178,6 @@ DELETE /users/{user_id} # Delete user (admin only)
GET /users/me # Get current user profile (authenticated users)
```

## Docker Commands Reference

### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`

### Useful Service URLs

- **FastAPI API**: http://localhost:8000
@@ -298,6 +211,118 @@ On first startup, the application automatically creates a default admin user:
- **Email**: `admin@mydocmanager.local`
**⚠️ Important**: Change the default admin password immediately after first login in production environments.

## File Processing Architecture

### Document Processing Flow

1. **File Detection**: Watchdog monitors `/volumes/watched_files/` directory in real-time
2. **Task Creation**: File watcher creates Celery task for each detected file
3. **Document Processing**: Celery worker processes the document and extracts content
4. **Database Storage**: Processed data stored in MongoDB collections
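
The four steps above can be sketched with standard-library stand-ins. This is only an illustration under stated assumptions: `queue.Queue` stands in for the Celery/Redis broker, a plain list stands in for the MongoDB collection, and the function names (`on_file_detected`, `process_document`, `run_worker_once`) are hypothetical, not taken from the project.

```python
import queue
from pathlib import Path

# Stand-ins (assumptions): a Queue instead of Celery/Redis, a list instead of MongoDB.
task_queue = queue.Queue()
files_collection = []


def on_file_detected(path):
    """Steps 1-2: the watcher saw a file, so enqueue one task for it."""
    task_queue.put(Path(path))


def process_document(path):
    """Step 3: extract content (this sketch only reads plain text)."""
    return {
        "filename": path.name,
        "filepath": str(path),
        "content": path.read_text(encoding="utf-8"),
    }


def run_worker_once():
    """Steps 3-4: drain the queue and store each processed document."""
    while not task_queue.empty():
        files_collection.append(process_document(task_queue.get()))
```

In the real service the watcher and the worker run in different processes; here they share memory only to keep the sketch self-contained.
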

### MongoDB Collections Design

#### Files Collection

Stores file metadata and extracted content:

```json
{
  "_id": "ObjectId",
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
  "mime_type": "application/pdf",
  "file_size": 2048576,
  "content": "extracted text content...",
  "encoding": "utf-8",
  "extraction_method": "direct_text", // direct_text, ocr, hybrid
  "metadata": {
    "page_count": 15, // for PDFs
    "word_count": 250, // for text files
    "image_dimensions": { // for images
      "width": 1920,
      "height": 1080
    }
  },
  "detected_at": "2024-01-15T10:29:00Z",
  "file_hash": "sha256_hash_value"
}
```
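
As a rough illustration of how the metadata fields of such a document could be filled in, the sketch below builds a dictionary from a file on disk using only the standard library. The function name `build_file_document` is hypothetical, and the project itself is described as using Pydantic models, so treat this as a simplified stand-in rather than the actual model. The `content`, `encoding`, `extraction_method`, and `metadata` fields are left out because they depend on the extractor.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def build_file_document(path):
    """Assemble the metadata fields of a `files` document for one file on disk."""
    path = Path(path)
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "filename": path.name,
        "filepath": str(path),
        "file_type": path.suffix.lstrip(".").lower(),
        "mime_type": mime_type,  # None when the extension is not recognised
        "file_size": len(data),
        # ISO 8601 in UTC; isoformat() renders the offset as +00:00 rather than Z
        "detected_at": datetime.now(timezone.utc).isoformat(),
        "file_hash": hashlib.sha256(data).hexdigest(),
    }
```
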

#### Processing Jobs Collection

Tracks processing status and lifecycle:

```json
{
  "_id": "ObjectId",
  "file_id": "reference_to_files_collection",
  "status": "completed", // pending, processing, completed, failed
  "task_id": "celery_task_uuid",
  "created_at": "2024-01-15T10:29:00Z",
  "started_at": "2024-01-15T10:29:30Z",
  "completed_at": "2024-01-15T10:30:00Z",
  "error_message": null
}
```
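
The status lifecycle can be sketched as a small state machine. The four status values come from the document above; the allowed transitions (pending → processing → completed or failed) and the choice to set `completed_at` on failure are assumptions made for this example, as are the helper names `new_job` and `advance`.

```python
from datetime import datetime, timezone
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition rules; the source only lists the four status values.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def _now():
    return datetime.now(timezone.utc).isoformat()


def new_job(file_id, task_id):
    """Create a processing job document in the pending state."""
    return {
        "file_id": file_id,
        "status": JobStatus.PENDING.value,
        "task_id": task_id,
        "created_at": _now(),
        "started_at": None,
        "completed_at": None,
        "error_message": None,
    }


def advance(job, new_status, error_message=None):
    """Move a job to new_status, rejecting transitions that are not allowed."""
    current = JobStatus(job["status"])
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot go from {current.value} to {new_status.value}")
    job["status"] = new_status.value
    if new_status is JobStatus.PROCESSING:
        job["started_at"] = _now()
    else:
        job["completed_at"] = _now()
        job["error_message"] = error_message
    return job
```
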

### Supported File Types (Initial Implementation)

- **Text Files** (`.txt`): Direct content reading
- **PDF Documents** (`.pdf`): Text extraction via PyMuPDF/pdfplumber
- **Word Documents** (`.docx`): Content extraction via python-docx
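
One way to route files to the right extractor is a lookup table keyed by extension. The sketch below is a minimal example under that assumption: only the `.txt` extractor is implemented, so it stays runnable without third-party packages, and the names `EXTRACTORS` and `extract_content` are hypothetical. Extractors for `.pdf` and `.docx` built on the libraries listed above would be registered in the same table.

```python
from pathlib import Path


def extract_txt(path):
    """Direct content reading for plain-text files."""
    return path.read_text(encoding="utf-8")


# Extension -> extractor. PDF and DOCX entries are omitted in this sketch
# because they need third-party libraries.
EXTRACTORS = {
    ".txt": extract_txt,
}


def extract_content(path):
    """Pick an extractor by file extension (case-insensitive) and run it."""
    path = Path(path)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return extractor(path)
```
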

### File Processing Architecture Decisions

#### Watchdog Implementation

- **Choice**: Dedicated observer thread (Option A)
- **Rationale**: Standard approach, clean separation of concerns
- **Implementation**: Watchdog observer runs in separate thread from FastAPI
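
To illustrate the "dedicated thread" idea without depending on the Watchdog package, the sketch below uses a simple polling loop in its own thread. It is a stand-in, not the Watchdog observer: the class name `PollingWatcher`, the polling interval, and the callback signature are all assumptions made for this example.

```python
import threading
from pathlib import Path


class PollingWatcher(threading.Thread):
    """Scan a directory from a background thread and report new files once."""

    def __init__(self, directory, on_new_file, interval=0.05):
        super().__init__(daemon=True)
        self.directory = Path(directory)
        self.on_new_file = on_new_file
        self.interval = interval
        self._seen = set()
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            for path in sorted(self.directory.iterdir()):
                if path.is_file() and path not in self._seen:
                    self._seen.add(path)
                    self.on_new_file(path)
            # Sleep until the next scan, waking early if stop() is called.
            self._stop_event.wait(self.interval)

    def stop(self):
        """Ask the loop to exit and wait for the thread to finish."""
        self._stop_event.set()
        self.join()
```

The main thread (FastAPI in the real service) would call `start()` during startup and `stop()` during shutdown, so file monitoring never blocks request handling.
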

#### Task Dispatch Strategy

- **Choice**: Direct Celery task creation from file watcher
- **Rationale**: Minimal latency, straightforward flow
- **Implementation**: File detected → Immediate Celery task dispatch

#### Data Storage Strategy

- **Choice**: Separate collections for files and processing status
- **Rationale**: Clean separation of file data vs processing lifecycle
- **Benefits**:
  - Better query performance
  - Clear data model boundaries
  - Easy processing status tracking

#### Content Storage Location

- **Choice**: Store extracted content in `files` collection
- **Rationale**: Content is intrinsic property of the file
- **Benefits**: Single query to get file + content, simpler data model

### Implementation Order

1. ✅ Pydantic models for MongoDB collections
2. ✅ Repository layer for data access (files + processing_jobs)
3. ✅ Celery tasks for document processing
4. ✅ Watchdog file monitoring implementation
5. ✅ FastAPI integration and startup coordination

### Processing Pipeline Features

- **Duplicate Detection**: SHA256 hashing prevents reprocessing same files
- **Error Handling**: Failed processing tracked with error messages
- **Status Tracking**: Real-time processing status via `processing_jobs` collection
- **Extensible Metadata**: Flexible metadata storage per file type
- **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches
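
The duplicate-detection feature can be sketched as follows. This is a minimal example, assuming known hashes are held in an in-memory set; in the project they would presumably be looked up through the `file_hash` field of the `files` collection. The helper names `sha256_of_file` and `is_duplicate` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large documents are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path, known_hashes):
    """Return True if the file's content was already seen; otherwise record it."""
    file_hash = sha256_of_file(path)
    if file_hash in known_hashes:
        return True
    known_hashes.add(file_hash)
    return False
```

Because the hash covers content only, a renamed copy of an already processed file is still treated as a duplicate.
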

## Key Implementation Notes

### Python Standards
@@ -338,6 +363,11 @@ On first startup, the application automatically creates a default admin user:
7. **Celery with Redis**: Chosen over other async patterns for scalability
8. **EasyOCR Preferred**: Selected over Tesseract for modern OCR needs
9. **Container Development**: Hot-reload setup required for development workflow
10. **Dedicated Watchdog Observer**: Thread-based file monitoring for reliability
11. **Separate MongoDB Collections**: Files and processing jobs stored separately
12. **Content in Files Collection**: Extracted content stored with file metadata
13. **Direct Task Dispatch**: File watcher directly creates Celery tasks
14. **SHA256 Duplicate Detection**: Prevents reprocessing identical files

### Development Process Requirements

@@ -351,13 +381,104 @@ On first startup, the application automatically creates a default admin user:
1. ✅ Create docker-compose.yml with all services => Done
2. ✅ Define user management and authentication architecture => Done
3. ✅ Implement user models and authentication services =>
   1. models/user.py => Done
   2. models/auth.py => Done
   3. database/repositories/user_repository.py => Done
4. Add automatic admin user creation if it does not exist
5. Create protected API routes for user management
6. Implement basic FastAPI service structure
7. Add watchdog file monitoring
8. Create Celery task structure
9. Implement document processing tasks
10. Build React monitoring interface with authentication
   1. models/user.py => Done
   2. models/auth.py => Done
   3. database/repositories/user_repository.py => Done
4. ✅ Add automatic admin user creation if it does not exist => Done
5. **IN PROGRESS**: Implement file processing pipeline =>
   1. Create Pydantic models for files and processing_jobs collections
   2. Implement repository layer for file and processing job data access
   3. Create Celery tasks for document processing (.txt, .pdf, .docx)
   4. Implement Watchdog file monitoring with dedicated observer
   5. Integrate file watcher with FastAPI startup
6. Create protected API routes for user management
7. Build React monitoring interface with authentication

## Annexes

### Docker Commands Reference

#### Initial Setup & Build

```bash
# Build and start all services (first time)
docker-compose up --build

# Build and start in background
docker-compose up --build -d

# Build specific service
docker-compose build file-processor
docker-compose build worker
```

#### Development Workflow

```bash
# Start all services
docker-compose up

# Start in background (detached mode)
docker-compose up -d

# Stop all services
docker-compose down

# Stop and remove volumes (⚠️ deletes MongoDB data)
docker-compose down -v

# Restart specific service
docker-compose restart file-processor
docker-compose restart worker
docker-compose restart redis
docker-compose restart mongodb
```

#### Monitoring & Debugging

```bash
# View logs of all services
docker-compose logs

# View logs of specific service
docker-compose logs file-processor
docker-compose logs worker
docker-compose logs redis
docker-compose logs mongodb

# Follow logs in real-time
docker-compose logs -f
docker-compose logs -f worker

# View running containers
docker-compose ps

# Execute command in running container
docker-compose exec file-processor bash
docker-compose exec worker bash
docker-compose exec mongodb mongosh
```

#### Service Management

```bash
# Start only specific services
docker-compose up redis mongodb file-processor

# Stop specific service
docker-compose stop worker
docker-compose stop file-processor

# Remove stopped containers
docker-compose rm

# Scale workers (multiple instances)
docker-compose up --scale worker=3
```

### Hot-Reload Configuration

- **file-processor**: Hot-reload enabled via `--reload` flag
  - Code changes in `src/file-processor/app/` automatically restart FastAPI
- **worker**: No hot-reload (manual restart required for stability)
  - Code changes in `src/worker/tasks/` require: `docker-compose restart worker`