Adding document service

Fixed unit tests
2025-09-19 22:59:41 +02:00 · 2025-09-19 21:06:09 +02:00
15 changed files with 1927 additions and 88 deletions
--- a/Readme.md
+++ b/Readme.md
@@ -232,19 +232,11 @@ Stores file metadata and extracted content:
  "filename": "document.pdf",
  "filepath": "/watched_files/document.pdf",
  "file_type": "pdf",
-  "mime_type": "application/pdf",
-  "file_size": 2048576,
-  "content": "extracted text content...",
-  "encoding": "utf-8",
-  "extraction_method": "direct_text",
-  // direct_text, ocr, hybrid
+  "extraction_method": "direct_text", // direct_text, ocr, hybrid
  "metadata": {
-    "page_count": 15,
-    // for PDFs
-    "word_count": 250,
-    // for text files  
-    "image_dimensions": {
-      // for images
+    "page_count": 15,        // for PDFs
+    "word_count": 250,       // for text files  
+    "image_dimensions": {    // for images
      "width": 1920,
      "height": 1080
    }
@@ -253,6 +245,19 @@ Stores file metadata and extracted content:
  "file_hash": "sha256_hash_value"
 }
 ```
+#### Document Contents Collection
+
+Stores actual file content and technical metadata:
+```json
+{
+  "_id": "ObjectId",
+  "file_hash": "sha256_hash_value",
+  "content": "extracted text content...",
+  "encoding": "utf-8",
+  "file_size": 2048576,
+  "mime_type": "application/pdf"
+}
+```

 #### Processing Jobs Collection

@@ -272,6 +277,25 @@ Tracks processing status and lifecycle:
 }
 ```

+### Data Storage Strategy
+
+- **Choice**: Three separate collections for files, content, and processing status
+- **Rationale**: Normalization prevents content duplication when multiple files have identical content
+- **Benefits**:
+    - Content deduplication via SHA256 hash
+    - Better query performance for metadata vs content searches
+    - Clear separation of concerns between file metadata, content, and processing lifecycle
+    - Multiple files can reference the same content (e.g., identical copies in different locations)
+
+### Content Storage Location
+
+- **Choice**: Store extracted content in separate `document_contents` collection
+- **Rationale**: Content normalization and deduplication
+- **Benefits**: 
+    - Single content storage per unique file hash
+    - Multiple file entries can reference same content
+    - Efficient storage for duplicate files
+
 ### Supported File Types (Initial Implementation)

 - **Text Files** (`.txt`): Direct content reading
@@ -323,6 +347,87 @@ Tracks processing status and lifecycle:
 - **Extensible Metadata**: Flexible metadata storage per file type
 - **Multiple Extraction Methods**: Support for direct text, OCR, and hybrid approaches

+## Document Service Architecture
+
+### Service Overview
+
+The document service provides orchestrated access to file documents and their content through a single interface that coordinates between `FileDocument` and `DocumentContent` repositories.
+
+### Service Design
+
+- **Architecture Pattern**: Service orchestration with separate repositories
+- **Transaction Support**: MongoDB ACID transactions for data consistency
+- **Content Deduplication**: Multiple files can reference the same content via SHA256 hash
+- **Error Handling**: MongoDB standard exceptions with transaction rollback
+
+### Document Service (`document_service.py`)
+
+Orchestrates operations between file and content repositories while maintaining data consistency.
+
+#### Core Functionality
+
+##### `create_document(file_path: str, file_bytes: bytes, encoding: str)`
+
+Creates a new document with automatic attribute calculation and content deduplication.
+
+**Automatic Calculations:**
+- `file_hash`: SHA256 hash of file bytes
+- `file_type`: Detection based on file extension 
+- `mime_type`: Detection via `python-magic` library
+- `file_size`: Length of provided bytes
+- `detected_at`: Current timestamp
+- `metadata`: Empty dictionary (reserved for future extension)
+
+**Deduplication Logic:**
+1. Calculate SHA256 hash of file content
+2. Check if `DocumentContent` with this hash already exists
+3. If EXISTS: Create only `FileDocument` referencing existing content
+4. If NOT EXISTS: Create both `FileDocument` and `DocumentContent` in transaction
+
+**Transaction Flow:**
+```
+BEGIN TRANSACTION
+  IF content_exists(file_hash):
+    CREATE FileDocument with content reference
+  ELSE:
+    CREATE DocumentContent
+    CREATE FileDocument with content reference
+COMMIT TRANSACTION
+```
+
+#### Available Methods
+
+- `create_document(file_path, file_bytes, encoding)`: Create with deduplication
+- `get_document_by_id(document_id)`: Retrieve by document ID
+- `get_document_by_hash(file_hash)`: Retrieve by file hash
+- `get_document_by_filepath(filepath)`: Retrieve by file path
+- `list_documents(skip, limit)`: Paginated document listing
+- `count_documents()`: Total document count
+- `update_document(document_id, update_data)`: Update document metadata
+- `delete_document(document_id)`: Remove document and orphaned content
+
+### Repository Dependencies
+
+The document service coordinates two existing repositories:
+
+#### File Repository (`file_repository.py`)
+- `create_document()`, `find_document_by_id()`, `find_document_by_hash()`
+- `find_document_by_filepath()`, `find_document_by_name()`
+- `list_documents()`, `count_documents()`
+- `update_document()`, `delete_document()`
+
+#### Document Content Repository (`document_content_repository.py`)
+- `create_document_content()`, `find_document_content_by_id()`
+- `find_document_content_by_file_hash()`, `content_exists()`
+- `update_document_content()`, `delete_document_content()`
+- `list_document_contents()`, `count_document_contents()`
+
+### Dependencies
+
+- `python-magic`: MIME type detection
+- `hashlib`: SHA256 hashing (standard library)
+- `pymongo`: MongoDB transactions support
+
 ## Key Implementation Notes

 ### Python Standards
--- a/requirements.txt
+++ b/requirements.txt
@@ -32,6 +32,7 @@ pytest-asyncio==1.2.0
 pytest-mock==3.15.1
 python-dateutil==2.9.0.post0
 python-dotenv==1.1.1
+python-magic==0.4.27
 pytz==2025.2
 PyYAML==6.0.2
 sentinels==1.1.1
--- a/src/file-processor/app/database/repositories/document_content_repository.py
+++ b/src/file-processor/app/database/repositories/document_content_repository.py
@@ -0,0 +1,214 @@
+from typing import List, Optional
+from datetime import datetime
+from motor.motor_asyncio import AsyncIOMotorDatabase, AsyncIOMotorCollection
+from pymongo.errors import DuplicateKeyError, PyMongoError
+from bson import ObjectId
+
+from app.models.document import DocumentContent
+
+
+class DocumentContentRepository:
+  """
+  Repository class for document content CRUD operations in MongoDB.
+
+  This class handles all database operations related to document content,
+  following the repository pattern with dependency injection and async/await.
+  """
+  
+  def __init__(self, database: AsyncIOMotorDatabase):
+    """
+    Initialize repository with database dependency.
+
+    Args:
+        database (AsyncIOMotorDatabase): MongoDB database instance
+    """
+    self.db = database
+    self.collection: AsyncIOMotorCollection = database.document_contents
+    self._ensure_indexes()
+  
+  async def initialize(self):
+    """
+    Initialize repository by ensuring required indexes exist.
+
+    Should be called after repository instantiation to setup database indexes.
+    """
+    await self._ensure_indexes()
+  
+  async def _ensure_indexes(self):
+    """
+    Ensure required database indexes exist.
+
+    Creates unique index on file_hash field to prevent duplicates.
+    """
+    try:
+      await self.collection.create_index("file_hash", unique=True)
+    except PyMongoError:
+      # Index might already exist, ignore error
+      pass
+  
+  async def create_document_content(self, document_content: DocumentContent) -> DocumentContent:
+    """
+    Create a new document content in the database.
+
+    Args:
+        document_content (DocumentContent): Document content data
+
+    Returns:
+        DocumentContent: Created document content with database ID
+
+    Raises:
+        DuplicateKeyError: If file_hash already exists
+        ValueError: If document content creation fails due to validation
+    """
+    document_dict = document_content.model_dump(by_alias=True, exclude_unset=True)
+    
+    # Remove _id if it's None to let MongoDB generate it
+    if document_dict.get("_id") is None:
+      document_dict.pop("_id", None)
+    
+    try:
+      result = await self.collection.insert_one(document_dict)
+      document_dict["_id"] = result.inserted_id
+      return DocumentContent(**document_dict)
+    except DuplicateKeyError as e:
+      raise DuplicateKeyError(f"Document content with file_hash '{document_content.file_hash}' already exists: {e}")
+    except PyMongoError as e:
+      raise ValueError(f"Failed to create document content: {e}")
+  
+  async def find_document_content_by_id(self, document_id: str) -> Optional[DocumentContent]:
+    """
+    Find document content by ID.
+
+    Args:
+        document_id (str): Document content ID to search for
+
+    Returns:
+        DocumentContent or None: Document content if found, None otherwise
+    """
+    try:
+      if not ObjectId.is_valid(document_id):
+        return None
+      
+      document_doc = await self.collection.find_one({"_id": ObjectId(document_id)})
+      if document_doc:
+        return DocumentContent(**document_doc)
+      return None
+    except PyMongoError:
+      return None
+  
+  async def find_document_content_by_file_hash(self, file_hash: str) -> Optional[DocumentContent]:
+    """
+    Find document content by file hash.
+
+    Args:
+        file_hash (str): File hash to search for
+
+    Returns:
+        DocumentContent or None: Document content if found, None otherwise
+    """
+    try:
+      document_doc = await self.collection.find_one({"file_hash": file_hash})
+      if document_doc:
+        return DocumentContent(**document_doc)
+      return None
+    except PyMongoError:
+      return None
+  
+  async def content_exists(self, file_hash: str) -> bool:
+    """
+    Check if document content exists by file hash.
+
+    Args:
+        file_hash (str): File hash to check
+
+    Returns:
+        bool: True if document content exists, False otherwise
+    """
+    try:
+      count = await self.collection.count_documents({"file_hash": file_hash})
+      return count > 0
+    except PyMongoError:
+      return False
+  
+  async def update_document_content(self, document_id: str, update_data: dict) -> Optional[DocumentContent]:
+    """
+    Update document content information.
+
+    Args:
+        document_id (str): Document content ID to update
+        update_data (dict): Updated document content data
+
+    Returns:
+        DocumentContent or None: Updated document content if found, None otherwise
+    """
+    try:
+      if not ObjectId.is_valid(document_id):
+        return None
+      
+      # Remove None values and _id from update data
+      clean_update_data = {k: v for k, v in update_data.items() if v is not None and k != "_id"}
+      
+      if not clean_update_data:
+        return await self.find_document_content_by_id(document_id)
+      
+      result = await self.collection.find_one_and_update(
+        {"_id": ObjectId(document_id)},
+        {"$set": clean_update_data},
+        return_document=True
+      )
+      
+      if result:
+        return DocumentContent(**result)
+      return None
+    
+    except PyMongoError:
+      return None
+  
+  async def delete_document_content(self, document_id: str) -> bool:
+    """
+    Delete document content from database.
+
+    Args:
+        document_id (str): Document content ID to delete
+
+    Returns:
+        bool: True if document content was deleted, False otherwise
+    """
+    try:
+      if not ObjectId.is_valid(document_id):
+        return False
+      
+      result = await self.collection.delete_one({"_id": ObjectId(document_id)})
+      return result.deleted_count > 0
+    except PyMongoError:
+      return False
+  
+  async def list_document_contents(self, skip: int = 0, limit: int = 100) -> List[DocumentContent]:
+    """
+    List document contents with pagination.
+
+    Args:
+        skip (int): Number of document contents to skip (default: 0)
+        limit (int): Maximum number of document contents to return (default: 100)
+
+    Returns:
+        List[DocumentContent]: List of document contents
+    """
+    try:
+      cursor = self.collection.find({}).skip(skip).limit(limit).sort("_id", -1)
+      document_docs = await cursor.to_list(length=limit)
+      return [DocumentContent(**document_doc) for document_doc in document_docs]
+    except PyMongoError:
+      return []
+  
+  async def count_document_contents(self) -> int:
+    """
+    Count total number of document contents.
+
+    Returns:
+        int: Total number of document contents in database
+    """
+    try:
+      return await self.collection.count_documents({})
+    except PyMongoError:
+      return 0
--- a/src/file-processor/app/database/repositories/document_repository.py
+++ b/src/file-processor/app/database/repositories/document_repository.py
@@ -8,10 +8,22 @@ in MongoDB with proper error handling and type safety.
 from typing import Optional, List
 from bson import ObjectId
 from pymongo.errors import DuplicateKeyError, PyMongoError
-from difflib import SequenceMatcher
-from motor.motor_asyncio import AsyncIOMotorCollection
+from motor.motor_asyncio import AsyncIOMotorCollection, AsyncIOMotorDatabase
 from app.models.document import FileDocument
-from app.database.connection import get_database
+from app.utils.document_matching import fuzzy_matching, subsequence_matching
+
+
+class MatchMethodBase:
+  pass
+
+
+class SubsequenceMatching(MatchMethodBase):
+  pass
+
+
+class FuzzyMatching(MatchMethodBase):
+  def __init__(self, threshold: float = 0.6):
+    self.threshold = threshold


 class FileDocumentRepository:
@@ -22,12 +34,20 @@ class FileDocumentRepository:
  with proper error handling and data validation.
  """
  
-  def __init__(self):
+  def __init__(self, database: AsyncIOMotorDatabase):
    """Initialize file repository with database connection."""
-    self.db = get_database()
+    self.db = database
    self.collection: AsyncIOMotorCollection = self.db.files
    self._ensure_indexes()
  
+  async def initialize(self):
+    """
+    Initialize repository by ensuring required indexes exist.
+
+    Should be called after repository instantiation to setup database indexes.
+    """
+    await self._ensure_indexes()
+  
  async def _ensure_indexes(self):
    """
    Ensure required database indexes exist.
@@ -64,7 +84,7 @@ class FileDocumentRepository:
      return file_data
    
    except DuplicateKeyError as e:
-      raise DuplicateKeyError(f"File with same hash already exists: {e}")
+      raise DuplicateKeyError(f"File with same file path already exists: {e}")
    except PyMongoError as e:
      raise ValueError(f"Failed to create file document: {e}")
  
@@ -128,13 +148,13 @@ class FileDocumentRepository:
    except PyMongoError:
      return None
  
-  async def find_document_by_name(self, filename: str, similarity_threshold: float = 0.6) -> List[FileDocument]:
+  async def find_document_by_name(self, filename: str, matching_method: MatchMethodBase = None) -> List[FileDocument]:
    """
    Find file documents by filename using fuzzy matching.
    
    Args:
        filename (str): Filename to search for
-        similarity_threshold (float): Minimum similarity ratio (0.0 to 1.0)
+        matching_method (MatchMethodBase): Minimum similarity ratio (0.0 to 1.0)
        
    Returns:
        List[FileDocument]: List of matching files sorted by similarity score
@@ -143,21 +163,12 @@ class FileDocumentRepository:
      # Get all files from database
      cursor = self.collection.find({})
      all_files = await cursor.to_list(length=None)
+      all_documents = [FileDocument(**file_doc) for file_doc in all_files]
      
-      matches = []
-      for file_doc in all_files:
-        file_obj = FileDocument(**file_doc)
-        # Calculate similarity between search term and filename
-        similarity = SequenceMatcher(None, filename.lower(), file_obj.filename.lower()).ratio()
-        
-        if similarity >= similarity_threshold:
-          matches.append((file_obj, similarity))
+      if isinstance(matching_method, FuzzyMatching):
+        return fuzzy_matching(filename, all_documents, matching_method.threshold)
      
-      # Sort by similarity score (highest first)
-      matches.sort(key=lambda x: x[1], reverse=True)
-      
-      # Return only the FileDocument objects
-      return [match[0] for match in matches]
+      return subsequence_matching(filename, all_documents)
    
    except PyMongoError:
      return []
--- a/src/file-processor/app/database/repositories/user_repository.py
+++ b/src/file-processor/app/database/repositories/user_repository.py
@@ -34,6 +34,14 @@ class UserRepository:
    self.collection: AsyncIOMotorCollection = database.users
    self._ensure_indexes()
  
+  async def initialize(self):
+    """
+    Initialize repository by ensuring required indexes exist.
+
+    Should be called after repository instantiation to setup database indexes.
+    """
+    await self._ensure_indexes()
+  
  async def _ensure_indexes(self):
    """
    Ensure required database indexes exist.
--- a/src/file-processor/app/models/document.py
+++ b/src/file-processor/app/models/document.py
@@ -86,7 +86,7 @@ class DocumentContent(BaseModel):
  """Model for document content."""
  
  id: Optional[PyObjectId] = Field(default=None, alias="_id")
-  file_hash: Optional[str] = Field(..., description="SHA256 hash of file content")
+  file_hash: Optional[str] = Field(default=None, description="SHA256 hash of file content")
  content: str = Field(..., description="File content")
  encoding: str = Field(default="utf-8", description="Character encoding for text files")
  file_size: int = Field(..., ge=0, description="File size in bytes")
--- a/src/file-processor/app/models/job.py
+++ b/src/file-processor/app/models/job.py
--- a/src/file-processor/app/services/document_service.py
+++ b/src/file-processor/app/services/document_service.py
@@ -0,0 +1,380 @@
+"""
+Document service for orchestrated file and content management.
+
+This service coordinates between FileDocument and DocumentContent repositories
+while maintaining data consistency through MongoDB transactions.
+"""
+
+import hashlib
+import magic
+from datetime import datetime
+from pathlib import Path
+from typing import List, Optional, Dict, Any, Tuple
+
+from motor.motor_asyncio import AsyncIOMotorClientSession
+from pymongo.errors import PyMongoError
+
+from app.database.connection import get_database
+from app.database.repositories.document_repository import FileDocumentRepository
+from app.database.repositories.document_content_repository import DocumentContentRepository
+from app.models.document import (
+  FileDocument,
+  DocumentContent,
+  FileType,
+  ProcessingStatus
+)
+from app.models.types import PyObjectId
+
+
+class DocumentService:
+  """
+  Service for orchestrated document and content management.
+
+  Provides high-level operations that coordinate between file documents
+  and their content while ensuring data consistency through transactions.
+  """
+  
+  def __init__(self):
+    """Initialize the document service with repository dependencies."""
+    self.db = get_database()
+    self.file_repository = FileDocumentRepository(self.db)
+    self.content_repository = DocumentContentRepository(self.db)
+  
+  def _calculate_file_hash(self, file_bytes: bytes) -> str:
+    """
+    Calculate SHA256 hash of file content.
+
+    Args:
+        file_bytes: Raw file content as bytes
+
+    Returns:
+        Hexadecimal SHA256 hash string
+    """
+    return hashlib.sha256(file_bytes).hexdigest()
+  
+  def _detect_file_type(self, file_path: str) -> FileType:
+    """
+    Detect file type from file extension.
+
+    Args:
+        file_path: Path to the file
+
+    Returns:
+        FileType enum value
+
+    Raises:
+        ValueError: If file type is not supported
+    """
+    extension = Path(file_path).suffix.lower().lstrip('.')
+    
+    try:
+      return FileType(extension)
+    except ValueError:
+      raise ValueError(f"Unsupported file type: {extension}")
+  
+  def _detect_mime_type(self, file_bytes: bytes) -> str:
+    """
+    Detect MIME type from file content.
+
+    Args:
+        file_bytes: Raw file content as bytes
+
+    Returns:
+        MIME type string
+    """
+    return magic.from_buffer(file_bytes, mime=True)
+  
+  async def create_document(
+      self,
+      file_path: str,
+      file_bytes: bytes,
+      encoding: str = "utf-8"
+  ) -> FileDocument:
+    """
+    Create a new document with automatic deduplication.
+
+    This method handles the creation of both FileDocument and DocumentContent
+    with proper deduplication based on file hash. If content with the same
+    hash already exists, only a new FileDocument is created.
+
+    Args:
+        file_path: Full path to the file
+        file_bytes: Raw file content as bytes
+        encoding: Character encoding for text content
+
+    Returns:
+        Created FileDocument instance
+
+    Raises:
+        ValueError: If file type is not supported
+        PyMongoError: If database operation fails
+    """
+    # Calculate automatic attributes
+    file_hash = self._calculate_file_hash(file_bytes)
+    file_type = self._detect_file_type(file_path)
+    mime_type = self._detect_mime_type(file_bytes)
+    file_size = len(file_bytes)
+    filename = Path(file_path).name
+    detected_at = datetime.utcnow()
+    
+    # Start MongoDB transaction
+    async with await self.db.client.start_session() as session:
+      async with session.start_transaction():
+        try:
+          # Check if content already exists
+          existing_content = await self.content_repository.find_document_content_by_file_hash(
+            file_hash, session=session
+          )
+          
+          # Create DocumentContent if it doesn't exist
+          if not existing_content:
+            content_data = DocumentContent(
+              file_hash=file_hash,
+              content="",  # Will be populated by processing workers
+              encoding=encoding,
+              file_size=file_size,
+              mime_type=mime_type
+            )
+            await self.content_repository.create_document_content(
+              content_data, session=session
+            )
+          
+          # Create FileDocument
+          file_data = FileDocument(
+            filename=filename,
+            filepath=file_path,
+            file_type=file_type,
+            extraction_method=None,  # Will be set by processing workers
+            metadata={},  # Empty for now
+            detected_at=detected_at,
+            file_hash=file_hash
+          )
+          
+          created_file = await self.file_repository.create_document(
+            file_data, session=session
+          )
+          
+          return created_file
+        
+        except Exception as e:
+          # Transaction will automatically rollback
+          raise PyMongoError(f"Failed to create document: {str(e)}")
+  
+  async def get_document_by_id(self, document_id: PyObjectId) -> Optional[FileDocument]:
+    """
+    Retrieve a document by its ID.
+
+    Args:
+        document_id: Document ObjectId
+
+    Returns:
+        FileDocument if found, None otherwise
+    """
+    return await self.file_repository.find_document_by_id(document_id)
+  
+  async def get_document_by_hash(self, file_hash: str) -> Optional[FileDocument]:
+    """
+    Retrieve a document by its file hash.
+
+    Args:
+        file_hash: SHA256 hash of file content
+
+    Returns:
+        FileDocument if found, None otherwise
+    """
+    return await self.file_repository.find_document_by_hash(file_hash)
+  
+  async def get_document_by_filepath(self, filepath: str) -> Optional[FileDocument]:
+    """
+    Retrieve a document by its file path.
+
+    Args:
+        filepath: Full path to the file
+
+    Returns:
+        FileDocument if found, None otherwise
+    """
+    return await self.file_repository.find_document_by_filepath(filepath)
+  
+  async def get_document_with_content(
+      self,
+      document_id: PyObjectId
+  ) -> Optional[Tuple[FileDocument, DocumentContent]]:
+    """
+    Retrieve a document with its associated content.
+
+    Args:
+        document_id: Document ObjectId
+
+    Returns:
+        Tuple of (FileDocument, DocumentContent) if found, None otherwise
+    """
+    document = await self.get_document_by_id(document_id)
+    if not document:
+      return None
+    
+    content = await self.content_repository.find_document_content_by_file_hash(
+      document.file_hash
+    )
+    if not content:
+      return None
+    
+    return (document, content)
+  
+  async def list_documents(
+      self,
+      skip: int = 0,
+      limit: int = 100
+  ) -> List[FileDocument]:
+    """
+    List documents with pagination.
+
+    Args:
+        skip: Number of documents to skip
+        limit: Maximum number of documents to return
+
+    Returns:
+        List of FileDocument instances
+    """
+    return await self.file_repository.list_documents(skip=skip, limit=limit)
+  
+  async def count_documents(self) -> int:
+    """
+    Get total number of documents.
+
+    Returns:
+        Total document count
+    """
+    return await self.file_repository.count_documents()
+  
+  async def update_document(
+      self,
+      document_id: PyObjectId,
+      update_data: Dict[str, Any]
+  ) -> Optional[FileDocument]:
+    """
+    Update document metadata.
+
+    Args:
+        document_id: Document ObjectId
+        update_data: Dictionary with fields to update
+
+    Returns:
+        Updated FileDocument if found, None otherwise
+    """
+    return await self.file_repository.update_document(document_id, update_data)
+  
+  async def delete_document(self, document_id: PyObjectId) -> bool:
+    """
+    Delete a document and its orphaned content.
+
+    This method removes the FileDocument and checks if the associated
+    DocumentContent is orphaned (no other files reference it). If orphaned,
+    the content is also deleted.
+
+    Args:
+        document_id: Document ObjectId
+
+    Returns:
+        True if document was deleted, False otherwise
+
+    Raises:
+        PyMongoError: If database operation fails
+    """
+    # Start MongoDB transaction
+    async with await self.db.client.start_session() as session:
+      async with session.start_transaction():
+        try:
+          # Get document to find its hash
+          document = await self.file_repository.find_document_by_id(
+            document_id, session=session
+          )
+          if not document:
+            return False
+          
+          # Delete the document
+          deleted = await self.file_repository.delete_document(
+            document_id, session=session
+          )
+          if not deleted:
+            return False
+          
+          # Check if content is orphaned
+          remaining_files = await self.file_repository.find_document_by_hash(
+            document.file_hash, session=session
+          )
+          
+          # If no other files reference this content, delete it
+          if not remaining_files:
+            content = await self.content_repository.find_document_content_by_file_hash(
+              document.file_hash, session=session
+            )
+            if content:
+              await self.content_repository.delete_document_content(
+                content.id, session=session
+              )
+          
+          return True
+        
+        except Exception as e:
+          # Transaction will automatically rollback
+          raise PyMongoError(f"Failed to delete document: {str(e)}")
+  
+  async def content_exists(self, file_hash: str) -> bool:
+    """
+    Check if content with given hash exists.
+
+    Args:
+        file_hash: SHA256 hash of file content
+
+    Returns:
+        True if content exists, False otherwise
+    """
+    return await self.content_repository.content_exists(file_hash)
+  
+  async def get_content_by_hash(self, file_hash: str) -> Optional[DocumentContent]:
+    """
+    Retrieve content by file hash.
+
+    Args:
+        file_hash: SHA256 hash of file content
+
+    Returns:
+        DocumentContent if found, None otherwise
+    """
+    return await self.content_repository.find_document_content_by_file_hash(file_hash)
+  
+  async def update_document_content(
+      self,
+      file_hash: str,
+      content: str,
+      encoding: str = "utf-8"
+  ) -> Optional[DocumentContent]:
+    """
+    Update the extracted content for a document.
+
+    This method is typically called by processing workers to store
+    the extracted text content.
+
+    Args:
+        file_hash: SHA256 hash of file content
+        content: Extracted text content
+        encoding: Character encoding
+
+    Returns:
+        Updated DocumentContent if found, None otherwise
+    """
+    existing_content = await self.content_repository.find_document_content_by_file_hash(
+      file_hash
+    )
+    if not existing_content:
+      return None
+    
+    update_data = {
+        "content": content,
+        "encoding": encoding
+    }
+    
+    return await self.content_repository.update_document_content(
+      existing_content.id, update_data
+    )
--- a/src/file-processor/app/utils/document_matching.py
+++ b/src/file-processor/app/utils/document_matching.py
@@ -0,0 +1,60 @@
+from difflib import SequenceMatcher
+
+from app.models.document import FileDocument
+
+
+def _is_subsequence(query: str, target: str) -> tuple[bool, float]:
+  """
+  Check if query is a subsequence of target (case-insensitive).
+  Returns (match, score).
+  Score is higher when the query letters are closer together in the target.
+  """
+  query = query.lower()
+  target = target.lower()
+  
+  positions = []
+  idx = 0
+  
+  for char in query:
+    idx = target.find(char, idx)
+    if idx == -1:
+      return False, 0.0
+    positions.append(idx)
+    idx += 1
+  
+  # Smallest window containing all matched chars
+  window_size = positions[-1] - positions[0] + 1
+  
+  # Score: ratio of query length vs window size (compactness)
+  score = len(query) / window_size
+  
+  return True, score
+
+def fuzzy_matching(filename: str, documents: list[FileDocument], similarity_threshold: float = 0.7):
+  matches = []
+  for file_doc in documents:
+    # Calculate similarity between search term and filename
+    similarity = SequenceMatcher(None, filename.lower(), file_doc.filename.lower()).ratio()
+    
+    if similarity >= similarity_threshold:
+      matches.append((file_doc, similarity))
+  
+  # Sort by similarity score (highest first)
+  matches.sort(key=lambda x: x[1], reverse=True)
+  
+  # Return only the FileDocument objects
+  return [match[0] for match in matches]
+  
+
+def subsequence_matching(query: str, documents: list[FileDocument]):
+  matches = []
+  for file_doc in documents:
+    matched, score = _is_subsequence(query, file_doc.filename)
+    if matched:
+      matches.append((file_doc, score))
+  
+  # Sort by score (highest first)
+  matches.sort(key=lambda x: x[1], reverse=True)
+  
+  # Return only the FileDocument objects
+  return [match[0] for match in matches]
--- a/src/file-processor/requirements.txt
+++ b/src/file-processor/requirements.txt
@@ -8,3 +8,4 @@ pymongo==4.15.0
 pydantic==2.11.9
 redis==6.4.0
 uvicorn==0.35.0
+python-magic==0.4.27
--- a/tests/test_document_content_repository.py
+++ b/tests/test_document_content_repository.py
@@ -0,0 +1,311 @@
+"""
+Test suite for DocumentContentRepository with async/await support.
+
+This module contains comprehensive tests for all DocumentContentRepository methods
+using mongomock-motor for in-memory MongoDB testing.
+"""
+
+import pytest
+import hashlib
+from datetime import datetime
+
+import pytest_asyncio
+from bson import ObjectId
+from pymongo.errors import DuplicateKeyError
+from mongomock_motor import AsyncMongoMockClient
+
+from app.database.repositories.document_content_repository import DocumentContentRepository
+from app.models.document import DocumentContent
+
+
+@pytest_asyncio.fixture
+async def in_memory_repository():
+  """Create an in-memory DocumentContentRepository for testing."""
+  client = AsyncMongoMockClient()
+  db = client.test_database
+  repo = DocumentContentRepository(db)
+  await repo.initialize()
+  return repo
+
+
+@pytest.fixture
+def sample_document_content():
+  """Sample DocumentContent data for testing."""
+  content = "This is sample document content for testing purposes."
+  file_hash = hashlib.sha256(content.encode()).hexdigest()
+  
+  return DocumentContent(
+    file_hash=file_hash,
+    content=content,
+    encoding="utf-8",
+    file_size=len(content.encode()),
+    mime_type="text/plain"
+  )
+
+
+@pytest.fixture
+def another_document_content():
+  """Another sample DocumentContent data for testing."""
+  content = "This is another sample document with different content."
+  file_hash = hashlib.sha256(content.encode()).hexdigest()
+  
+  return DocumentContent(
+    file_hash=file_hash,
+    content=content,
+    encoding="utf-8",
+    file_size=len(content.encode()),
+    mime_type="text/plain"
+  )
+
+
+class TestDocumentContentRepositoryCreation:
+  """Tests for document content creation functionality."""
+  
+  @pytest.mark.asyncio
+  async def test_i_can_create_document_content(self, in_memory_repository, sample_document_content):
+    """Test successful document content creation."""
+    # Act
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Assert
+    assert created_content is not None
+    assert created_content.file_hash == sample_document_content.file_hash
+    assert created_content.content == sample_document_content.content
+    assert created_content.encoding == sample_document_content.encoding
+    assert created_content.file_size == sample_document_content.file_size
+    assert created_content.mime_type == sample_document_content.mime_type
+    assert created_content.id is not None
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_create_document_content_with_duplicate_file_hash(self, in_memory_repository,
+                                                                           sample_document_content):
+    """Test that creating document content with duplicate file_hash raises DuplicateKeyError."""
+    # Arrange
+    await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Act & Assert
+    with pytest.raises(DuplicateKeyError) as exc_info:
+      await in_memory_repository.create_document_content(sample_document_content)
+    
+    assert "already exists" in str(exc_info.value)
+
+
+class TestDocumentContentRepositoryFinding:
+  """Tests for document content finding functionality."""
+  
+  @pytest.mark.asyncio
+  async def test_i_can_find_document_content_by_id(self, in_memory_repository, sample_document_content):
+    """Test finding document content by valid ID."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Act
+    found_content = await in_memory_repository.find_document_content_by_id(str(created_content.id))
+    
+    # Assert
+    assert found_content is not None
+    assert found_content.id == created_content.id
+    assert found_content.file_hash == created_content.file_hash
+    assert found_content.content == created_content.content
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_find_document_content_by_invalid_id(self, in_memory_repository):
+    """Test that invalid ObjectId returns None."""
+    # Act
+    found_content = await in_memory_repository.find_document_content_by_id("invalid_id")
+    
+    # Assert
+    assert found_content is None
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_find_document_content_by_nonexistent_id(self, in_memory_repository):
+    """Test that nonexistent but valid ObjectId returns None."""
+    # Arrange
+    nonexistent_id = str(ObjectId())
+    
+    # Act
+    found_content = await in_memory_repository.find_document_content_by_id(nonexistent_id)
+    
+    # Assert
+    assert found_content is None
+  
+  @pytest.mark.asyncio
+  async def test_i_can_find_document_content_by_file_hash(self, in_memory_repository, sample_document_content):
+    """Test finding document content by file hash."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Act
+    found_content = await in_memory_repository.find_document_content_by_file_hash(sample_document_content.file_hash)
+    
+    # Assert
+    assert found_content is not None
+    assert found_content.file_hash == created_content.file_hash
+    assert found_content.id == created_content.id
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_find_document_content_by_nonexistent_file_hash(self, in_memory_repository):
+    """Test that nonexistent file hash returns None."""
+    # Act
+    found_content = await in_memory_repository.find_document_content_by_file_hash("nonexistent_hash")
+    
+    # Assert
+    assert found_content is None
+
+
+class TestDocumentContentRepositoryUpdate:
+  """Tests for document content update functionality."""
+  
+  @pytest.mark.asyncio
+  async def test_i_can_update_document_content(self, in_memory_repository, sample_document_content):
+    """Test successful document content update."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    update_data = {
+        "content": "Updated content for testing",
+        "encoding": "utf-16",
+        "mime_type": "text/html"
+    }
+    
+    # Act
+    updated_content = await in_memory_repository.update_document_content(str(created_content.id), update_data)
+    
+    # Assert
+    assert updated_content is not None
+    assert updated_content.content == update_data["content"]
+    assert updated_content.encoding == update_data["encoding"]
+    assert updated_content.mime_type == update_data["mime_type"]
+    assert updated_content.id == created_content.id
+    assert updated_content.file_hash == created_content.file_hash  # Should remain unchanged
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_update_document_content_with_invalid_id(self, in_memory_repository):
+    """Test that updating with invalid ID returns None."""
+    # Act
+    result = await in_memory_repository.update_document_content("invalid_id", {"content": "test"})
+    
+    # Assert
+    assert result is None
+  
+  @pytest.mark.asyncio
+  async def test_i_can_update_document_content_with_partial_data(self, in_memory_repository, sample_document_content):
+    """Test updating document content with partial data."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    partial_update = {"encoding": "iso-8859-1"}
+    
+    # Act
+    updated_content = await in_memory_repository.update_document_content(str(created_content.id), partial_update)
+    
+    # Assert
+    assert updated_content is not None
+    assert updated_content.encoding == "iso-8859-1"
+    assert updated_content.content == created_content.content  # Should remain unchanged
+    assert updated_content.mime_type == created_content.mime_type  # Should remain unchanged
+  
+  @pytest.mark.asyncio
+  async def test_i_can_update_document_content_with_empty_data(self, in_memory_repository, sample_document_content):
+    """Test updating document content with empty data returns current content."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    empty_update = {}
+    
+    # Act
+    result = await in_memory_repository.update_document_content(str(created_content.id), empty_update)
+    
+    # Assert
+    assert result is not None
+    assert result.content == created_content.content
+    assert result.encoding == created_content.encoding
+    assert result.mime_type == created_content.mime_type
+
+
+class TestDocumentContentRepositoryDeletion:
+  """Tests for document content deletion functionality."""
+  
+  @pytest.mark.asyncio
+  async def test_i_can_delete_document_content(self, in_memory_repository, sample_document_content):
+    """Test successful document content deletion."""
+    # Arrange
+    created_content = await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Act
+    deletion_result = await in_memory_repository.delete_document_content(str(created_content.id))
+    
+    # Assert
+    assert deletion_result is True
+    
+    # Verify content is actually deleted
+    found_content = await in_memory_repository.find_document_content_by_id(str(created_content.id))
+    assert found_content is None
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_delete_document_content_with_invalid_id(self, in_memory_repository):
+    """Test that deleting with invalid ID returns False."""
+    # Act
+    result = await in_memory_repository.delete_document_content("invalid_id")
+    
+    # Assert
+    assert result is False
+  
+  @pytest.mark.asyncio
+  async def test_i_cannot_delete_nonexistent_document_content(self, in_memory_repository):
+    """Test that deleting nonexistent document content returns False."""
+    # Arrange
+    nonexistent_id = str(ObjectId())
+    
+    # Act
+    result = await in_memory_repository.delete_document_content(nonexistent_id)
+    
+    # Assert
+    assert result is False
+
+
+class TestDocumentContentRepositoryUtilities:
+  """Tests for utility methods."""
+  
+  @pytest.mark.asyncio
+  async def test_i_can_check_content_exists(self, in_memory_repository, sample_document_content):
+    """Test checking if document content exists by file hash."""
+    # Arrange
+    await in_memory_repository.create_document_content(sample_document_content)
+    
+    # Act
+    exists = await in_memory_repository.content_exists(sample_document_content.file_hash)
+    not_exists = await in_memory_repository.content_exists("nonexistent_hash")
+    
+    # Assert
+    assert exists is True
+    assert not_exists is False
+  
+  @pytest.mark.asyncio
+  async def test_i_can_list_document_contents(self, in_memory_repository, sample_document_content,
+                                              another_document_content):
+    """Test listing document contents with pagination."""
+    # Arrange
+    await in_memory_repository.create_document_content(sample_document_content)
+    await in_memory_repository.create_document_content(another_document_content)
+    
+    # Act
+    all_contents = await in_memory_repository.list_document_contents()
+    limited_contents = await in_memory_repository.list_document_contents(skip=0, limit=1)
+    
+    # Assert
+    assert len(all_contents) == 2
+    assert len(limited_contents) == 1
+    assert all(isinstance(content, DocumentContent) for content in all_contents)
+  
+  @pytest.mark.asyncio
+  async def test_i_can_count_document_contents(self, in_memory_repository, sample_document_content,
+                                               another_document_content):
+    """Test counting document contents."""
+    # Arrange
+    initial_count = await in_memory_repository.count_document_contents()
+    await in_memory_repository.create_document_content(sample_document_content)
+    await in_memory_repository.create_document_content(another_document_content)
+    
+    # Act
+    final_count = await in_memory_repository.count_document_contents()
+    
+    # Assert
+    assert final_count == initial_count + 2
--- a/tests/test_document_repository.py
+++ b/tests/test_document_repository.py
@@ -23,9 +23,10 @@ async def in_memory_repository():
  """Create an in-memory FileDocumentRepository for testing."""
  client = AsyncMongoMockClient()
  db = client.test_database
-  repo = FileDocumentRepository()
-  repo.db = db
-  repo.collection = db.files
+  repo = FileDocumentRepository(db)
+  # repo.db = db
+  # repo.collection = db.files
+  await repo.initialize()
  return repo


@@ -86,11 +87,15 @@ class TestFileDocumentRepositoryInitialization:
  async def test_i_can_initialize_repository(self):
    """Test repository initialization."""
    # Arrange
-    repo = FileDocumentRepository()
+    client = AsyncMongoMockClient()
+    db = client.test_database
+    repo = FileDocumentRepository(db)
+    await repo.initialize()
    
    # Act & Assert (should not raise any exception)
    assert repo.db is not None
    assert repo.collection is not None
+    # TODO : check that the indexes are create


 class TestFileDocumentRepositoryCreation:
@@ -276,48 +281,6 @@ class TestFileDocumentRepositoryFuzzySearch:
    assert "document1.pdf" in filenames
    assert "similar_document.pdf" in filenames
  
-  @pytest.mark.asyncio
-  async def test_i_can_find_documents_with_custom_threshold(self, in_memory_repository, multiple_sample_documents):
-    """Test finding documents with custom similarity threshold."""
-    # Arrange
-    for doc in multiple_sample_documents:
-      await in_memory_repository.create_document(doc)
-    
-    # Act - Very high threshold should only match exact or very similar names
-    found_docs = await in_memory_repository.find_document_by_name("document1.pdf", similarity_threshold=0.9)
-    
-    # Assert
-    assert len(found_docs) == 1
-    assert found_docs[0].filename == "document1.pdf"
-  
-  @pytest.mark.asyncio
-  async def test_i_can_find_documents_sorted_by_similarity(self, in_memory_repository, multiple_sample_documents):
-    """Test that documents are sorted by similarity score (highest first)."""
-    # Arrange
-    for doc in multiple_sample_documents:
-      await in_memory_repository.create_document(doc)
-    
-    # Act
-    found_docs = await in_memory_repository.find_document_by_name("document1", similarity_threshold=0.3)
-    
-    # Assert
-    assert len(found_docs) >= 1
-    # First result should be the most similar (document1.pdf)
-    assert found_docs[0].filename == "document1.pdf"
-  
-  @pytest.mark.asyncio
-  async def test_i_cannot_find_documents_below_threshold(self, in_memory_repository, multiple_sample_documents):
-    """Test that no documents are returned when similarity is below threshold."""
-    # Arrange
-    for doc in multiple_sample_documents:
-      await in_memory_repository.create_document(doc)
-    
-    # Act
-    found_docs = await in_memory_repository.find_document_by_name("xyz", similarity_threshold=0.6)
-    
-    # Assert
-    assert len(found_docs) == 0
-  
  @pytest.mark.asyncio
  async def test_i_cannot_find_documents_by_name_with_pymongo_error(self, in_memory_repository, mocker):
    """Test handling of PyMongo errors during name search."""
@@ -377,11 +340,13 @@ class TestFileDocumentRepositoryListing:
    # Create documents with different timestamps
    doc1 = sample_file_document.model_copy()
    doc1.filename = "oldest.pdf"
+    doc1.filepath = f"/path/to/{doc1.filename}"
    doc1.file_hash = "hash1" + "0" * 58
    doc1.detected_at = datetime.now() - timedelta(hours=2)
    
    doc2 = sample_file_document.model_copy()
    doc2.filename = "newest.pdf"
+    doc2.filepath = f"/path/to/{doc2.filename}"
    doc2.file_hash = "hash2" + "0" * 58
    doc2.detected_at = datetime.now()
    
@@ -433,7 +398,6 @@ class TestFileDocumentRepositoryUpdate:
    
    # Assert
    assert updated_doc is not None
-    assert updated_doc.tags == sample_update_data["tags"]
    assert updated_doc.file_type == sample_update_data["file_type"]
    assert updated_doc.id == created_doc.id
    assert updated_doc.filename == created_doc.filename  # Unchanged fields remain
@@ -443,30 +407,30 @@ class TestFileDocumentRepositoryUpdate:
    """Test updating document with partial data."""
    # Arrange
    created_doc = await in_memory_repository.create_document(sample_file_document)
-    partial_update = {"tags": ["new_tag"]}
+    partial_update = {"file_type": FileType("txt")}
    
    # Act
    updated_doc = await in_memory_repository.update_document(str(created_doc.id), partial_update)
    
    # Assert
    assert updated_doc is not None
-    assert updated_doc.tags == ["new_tag"]
+    assert updated_doc.file_type == FileType("txt")
    assert updated_doc.filename == created_doc.filename  # Should remain unchanged
-    assert updated_doc.file_type == created_doc.file_type  # Should remain unchanged
+    assert updated_doc.filepath == created_doc.filepath  # Should remain unchanged
  
  @pytest.mark.asyncio
  async def test_i_can_update_document_filtering_none_values(self, in_memory_repository, sample_file_document):
    """Test that None values are filtered out from update data."""
    # Arrange
    created_doc = await in_memory_repository.create_document(sample_file_document)
-    update_with_none = {"tags": ["new_tag"], "file_type": None}
+    update_with_none = {"metadata": {"tags": ["updated", "document"]}, "file_type": None}
    
    # Act
    updated_doc = await in_memory_repository.update_document(str(created_doc.id), update_with_none)
    
    # Assert
    assert updated_doc is not None
-    assert updated_doc.tags == ["new_tag"]
+    assert updated_doc.metadata == {"tags": ["updated", "document"]}
    assert updated_doc.file_type == created_doc.file_type  # Should remain unchanged (None filtered out)
  
  @pytest.mark.asyncio
@@ -483,7 +447,7 @@ class TestFileDocumentRepositoryUpdate:
    assert result is not None
    assert result.filename == created_doc.filename
    assert result.file_hash == created_doc.file_hash
-    assert result.tags == created_doc.tags
+    assert result.metadata == created_doc.metadata
  
  @pytest.mark.asyncio
  async def test_i_cannot_update_document_with_invalid_id(self, in_memory_repository, sample_update_data):
--- a/tests/test_document_service.py
+++ b/tests/test_document_service.py
@@ -0,0 +1,697 @@
+"""
+Unit tests for DocumentService using in-memory MongoDB.
+
+Tests the orchestration logic with real MongoDB operations
+using mongomock for better integration testing.
+"""
+
+import pytest
+import pytest_asyncio
+from unittest.mock import Mock, patch
+from datetime import datetime
+from bson import ObjectId
+from pathlib import Path
+
+from mongomock_motor import AsyncMongoMockClient
+
+from app.services.document_service import DocumentService
+from app.database.repositories.document_repository import FileDocumentRepository
+from app.database.repositories.document_content_repository import DocumentContentRepository
+from app.models.document import FileDocument, DocumentContent, FileType, ExtractionMethod
+from app.models.types import PyObjectId
+
+
+@pytest_asyncio.fixture
+async def in_memory_file_repository():
+    """Create an in-memory FileDocumentRepository for testing."""
+    client = AsyncMongoMockClient()
+    db = client.test_database
+    repo = FileDocumentRepository(db)
+    await repo.initialize()
+    return repo
+
+
+@pytest_asyncio.fixture
+async def in_memory_content_repository():
+    """Create an in-memory DocumentContentRepository for testing."""
+    client = AsyncMongoMockClient()
+    db = client.test_database
+    repo = DocumentContentRepository(db)
+    await repo.initialize()
+    return repo
+
+
+@pytest_asyncio.fixture
+async def in_memory_database():
+    """Create an in-memory database for testing."""
+    client = AsyncMongoMockClient()
+    return client.test_database
+
+
+@pytest_asyncio.fixture
+async def document_service(in_memory_file_repository, in_memory_content_repository, in_memory_database):
+    """Create DocumentService with in-memory repositories."""
+    with patch('app.services.document_service.get_database', return_value=in_memory_database):
+        service = DocumentService()
+        service.file_repository = in_memory_file_repository
+        service.content_repository = in_memory_content_repository
+        return service
+
+
+@pytest.fixture
+def sample_file_bytes():
+    """Sample file content as bytes."""
+    return b"This is a test PDF content"
+
+
+@pytest.fixture
+def sample_text_bytes():
+    """Sample text file content as bytes."""
+    return b"This is a test text file content"
+
+
+@pytest.fixture
+def sample_file_hash():
+    """Expected SHA256 hash for sample file bytes."""
+    import hashlib
+    return hashlib.sha256(b"This is a test PDF content").hexdigest()
+
+
+@pytest.fixture
+def sample_file_document():
+    """Sample FileDocument for testing."""
+    return FileDocument(
+        id=ObjectId(),
+        filename="test.pdf",
+        filepath="/test/test.pdf",
+        file_type=FileType.PDF,
+        extraction_method=None,
+        metadata={},
+        detected_at=datetime(2024, 1, 15, 10, 30, 0),
+        file_hash="test_hash"
+    )
+
+
+class TestCreateDocument:
+    """Tests for create_document method."""
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @patch('app.services.document_service.datetime')
+    @pytest.mark.asyncio
+    async def test_i_can_create_document_with_new_content(
+        self,
+        mock_datetime,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test creating document when content doesn't exist yet."""
+        # Setup mocks
+        fixed_time = datetime(2024, 1, 15, 10, 30, 0)
+        mock_datetime.utcnow.return_value = fixed_time
+        mock_magic.return_value = "application/pdf"
+        
+        # Execute
+        result = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Verify document creation
+        assert result is not None
+        assert result.filename == "test.pdf"
+        assert result.filepath == "/test/test.pdf"
+        assert result.file_type == FileType.PDF
+        assert result.detected_at == fixed_time
+        assert result.file_hash == document_service._calculate_file_hash(sample_file_bytes)
+        
+        # Verify content was created
+        content = await document_service.content_repository.find_document_content_by_file_hash(
+            result.file_hash
+        )
+        assert content is not None
+        assert content.file_hash == result.file_hash
+        assert content.file_size == len(sample_file_bytes)
+        assert content.mime_type == "application/pdf"
+        assert content.encoding == "utf-8"
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @patch('app.services.document_service.datetime')
+    @pytest.mark.asyncio
+    async def test_i_can_create_document_with_existing_content(
+        self,
+        mock_datetime,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test creating document when content already exists (deduplication)."""
+        # Setup mocks
+        fixed_time = datetime(2024, 1, 15, 10, 30, 0)
+        mock_datetime.utcnow.return_value = fixed_time
+        mock_magic.return_value = "application/pdf"
+        
+        # Create first document
+        first_doc = await document_service.create_document(
+            "/test/first.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Create second document with same content
+        second_doc = await document_service.create_document(
+            "/test/second.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Verify both documents exist but share same hash
+        assert first_doc.file_hash == second_doc.file_hash
+        assert first_doc.filename != second_doc.filename
+        assert first_doc.filepath != second_doc.filepath
+        
+        # Verify only one content document exists
+        all_content = await document_service.content_repository.list_document_content()
+        content_for_hash = [c for c in all_content if c.file_hash == first_doc.file_hash]
+        assert len(content_for_hash) == 1
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_create_document_with_different_encodings(
+        self,
+        mock_magic,
+        document_service,
+        sample_text_bytes
+    ):
+        """Test creating documents with different text encodings."""
+        # Setup
+        mock_magic.return_value = "text/plain"
+        
+        # Test with different encodings
+        encodings = ["utf-8", "latin-1", "ascii"]
+        
+        for i, encoding in enumerate(encodings):
+            result = await document_service.create_document(
+                f"/test/test{i}.txt",
+                sample_text_bytes,
+                encoding
+            )
+            
+            # Verify document was created
+            assert result is not None
+            assert result.file_type == FileType.TXT
+            
+            # Verify content has correct encoding
+            content = await document_service.content_repository.find_document_content_by_file_hash(
+                result.file_hash
+            )
+            assert content.encoding == encoding
+    
+    @pytest.mark.asyncio
+    async def test_i_cannot_create_document_with_unsupported_file_type(
+        self,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test that unsupported file types raise ValueError."""
+        with pytest.raises(ValueError, match="Unsupported file type"):
+            await document_service.create_document(
+                "/test/test.xyz",  # Unsupported extension
+                sample_file_bytes,
+                "utf-8"
+            )
+    
+    @pytest.mark.asyncio
+    async def test_i_cannot_create_document_with_empty_file_path(
+        self,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test that empty file path raises ValueError."""
+        with pytest.raises(ValueError):
+            await document_service.create_document(
+                "",  # Empty path
+                sample_file_bytes,
+                "utf-8"
+            )
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_create_document_with_empty_bytes(
+        self,
+        mock_magic,
+        document_service
+    ):
+        """Test behavior with empty file bytes."""
+        # Setup
+        mock_magic.return_value = "text/plain"
+        
+        # Execute with empty bytes
+        result = await document_service.create_document(
+            "/test/empty.txt",
+            b"",  # Empty bytes
+            "utf-8"
+        )
+        
+        # Should still work but with zero file size
+        assert result is not None
+        content = await document_service.content_repository.find_document_content_by_file_hash(
+            result.file_hash
+        )
+        assert content.file_size == 0
+
+
+class TestGetMethods:
+    """Tests for document retrieval methods."""
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_get_document_by_id(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test retrieving document by ID."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Execute
+        result = await document_service.get_document_by_id(created_doc.id)
+        
+        # Verify
+        assert result is not None
+        assert result.id == created_doc.id
+        assert result.filename == created_doc.filename
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_get_document_by_hash(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test retrieving document by file hash."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Execute
+        result = await document_service.get_document_by_hash(created_doc.file_hash)
+        
+        # Verify
+        assert result is not None
+        assert result.file_hash == created_doc.file_hash
+        assert result.filename == created_doc.filename
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_get_document_by_filepath(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test retrieving document by file path."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        test_path = "/test/unique_test.pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            test_path,
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Execute
+        result = await document_service.get_document_by_filepath(test_path)
+        
+        # Verify
+        assert result is not None
+        assert result.filepath == test_path
+        assert result.id == created_doc.id
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_get_document_with_content(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test retrieving document with associated content."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Execute
+        result = await document_service.get_document_with_content(created_doc.id)
+        
+        # Verify
+        assert result is not None
+        document, content = result
+        assert document.id == created_doc.id
+        assert content is not None
+        assert content.file_hash == created_doc.file_hash
+    
+    @pytest.mark.asyncio
+    async def test_i_cannot_get_nonexistent_document_by_id(
+        self,
+        document_service
+    ):
+        """Test that nonexistent document returns None."""
+        # Execute with random ObjectId
+        result = await document_service.get_document_by_id(ObjectId())
+        
+        # Verify
+        assert result is None
+    
+    @pytest.mark.asyncio
+    async def test_i_cannot_get_nonexistent_document_by_hash(
+        self,
+        document_service
+    ):
+        """Test that nonexistent document hash returns None."""
+        # Execute
+        result = await document_service.get_document_by_hash("nonexistent_hash")
+        
+        # Verify
+        assert result is None
+
+
+class TestPaginationAndCounting:
+    """Tests for document listing and counting."""
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_list_documents_with_pagination(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test document listing with pagination parameters."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create multiple documents
+        for i in range(5):
+            await document_service.create_document(
+                f"/test/test{i}.pdf",
+                sample_file_bytes + bytes(str(i), 'utf-8'),  # Make each file unique
+                "utf-8"
+            )
+        
+        # Execute with pagination
+        result = await document_service.list_documents(skip=1, limit=2)
+        
+        # Verify
+        assert len(result) == 2
+        
+        # Test counting
+        total_count = await document_service.count_documents()
+        assert total_count == 5
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_count_documents(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test document counting."""
+        # Setup
+        mock_magic.return_value = "text/plain"
+        
+        # Initially should be 0
+        initial_count = await document_service.count_documents()
+        assert initial_count == 0
+        
+        # Create some documents
+        for i in range(3):
+            await document_service.create_document(
+                f"/test/test{i}.txt",
+                sample_file_bytes + bytes(str(i), 'utf-8'),
+                "utf-8"
+            )
+        
+        # Execute
+        final_count = await document_service.count_documents()
+        
+        # Verify
+        assert final_count == 3
+
+
+class TestUpdateAndDelete:
+    """Tests for document update and deletion operations."""
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_update_document_metadata(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test updating document metadata."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Execute update
+        update_data = {"metadata": {"page_count": 5}}
+        result = await document_service.update_document(created_doc.id, update_data)
+        
+        # Verify
+        assert result is not None
+        assert result.metadata.get("page_count") == 5
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_delete_document_and_orphaned_content(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test deleting document with orphaned content cleanup."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Verify content exists
+        content_before = await document_service.content_repository.find_document_content_by_file_hash(
+            created_doc.file_hash
+        )
+        assert content_before is not None
+        
+        # Execute deletion
+        result = await document_service.delete_document(created_doc.id)
+        
+        # Verify document and content are deleted
+        assert result is True
+        
+        deleted_doc = await document_service.get_document_by_id(created_doc.id)
+        assert deleted_doc is None
+        
+        content_after = await document_service.content_repository.find_document_content_by_file_hash(
+            created_doc.file_hash
+        )
+        assert content_after is None
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_delete_document_without_affecting_shared_content(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test deleting document without removing shared content."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create two documents with same content
+        doc1 = await document_service.create_document(
+            "/test/test1.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        doc2 = await document_service.create_document(
+            "/test/test2.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # They should share the same hash
+        assert doc1.file_hash == doc2.file_hash
+        
+        # Delete first document
+        result = await document_service.delete_document(doc1.id)
+        assert result is True
+        
+        # Verify first document is deleted but content still exists
+        deleted_doc = await document_service.get_document_by_id(doc1.id)
+        assert deleted_doc is None
+        
+        remaining_doc = await document_service.get_document_by_id(doc2.id)
+        assert remaining_doc is not None
+        
+        content = await document_service.content_repository.find_document_content_by_file_hash(
+            doc2.file_hash
+        )
+        assert content is not None
+
+
+class TestUtilityMethods:
+    """Tests for utility methods."""
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_check_content_exists(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test checking if content exists by hash."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Initially content doesn't exist
+        test_hash = "nonexistent_hash"
+        exists_before = await document_service.content_exists(test_hash)
+        assert exists_before is False
+        
+        # Create a document
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Now content should exist
+        exists_after = await document_service.content_exists(created_doc.file_hash)
+        assert exists_after is True
+    
+    @patch('app.services.document_service.magic.from_buffer')
+    @pytest.mark.asyncio
+    async def test_i_can_update_document_content(
+        self,
+        mock_magic,
+        document_service,
+        sample_file_bytes
+    ):
+        """Test updating extracted document content."""
+        # Setup
+        mock_magic.return_value = "application/pdf"
+        
+        # Create a document first
+        created_doc = await document_service.create_document(
+            "/test/test.pdf",
+            sample_file_bytes,
+            "utf-8"
+        )
+        
+        # Update content
+        new_content = "Updated extracted content"
+        result = await document_service.update_document_content(
+            created_doc.file_hash,
+            new_content
+        )
+        
+        # Verify update
+        assert result is not None
+        assert result.content == new_content
+        
+        # Verify persistence
+        updated_content = await document_service.content_repository.find_document_content_by_file_hash(
+            created_doc.file_hash
+        )
+        assert updated_content.content == new_content
+
+
+class TestHashCalculation:
+    """Tests for file hash calculation utility."""
+    
+    def test_i_can_calculate_consistent_file_hash(self, document_service):
+        """Test that file hash calculation is consistent."""
+        test_bytes = b"Test content for hashing"
+        
+        # Calculate hash multiple times
+        hash1 = document_service._calculate_file_hash(test_bytes)
+        hash2 = document_service._calculate_file_hash(test_bytes)
+        
+        # Should be identical
+        assert hash1 == hash2
+        assert len(hash1) == 64  # SHA256 produces 64-character hex string
+    
+    def test_i_get_different_hashes_for_different_content(self, document_service):
+        """Test that different content produces different hashes."""
+        content1 = b"First content"
+        content2 = b"Second content"
+        
+        hash1 = document_service._calculate_file_hash(content1)
+        hash2 = document_service._calculate_file_hash(content2)
+        
+        assert hash1 != hash2
+
+
+class TestFileTypeDetection:
+    """Tests for file type detection."""
+    
+    def test_i_can_detect_pdf_file_type(self, document_service):
+        """Test PDF file type detection."""
+        file_type = document_service._detect_file_type("/path/to/document.pdf")
+        assert file_type == FileType.PDF
+    
+    def test_i_can_detect_txt_file_type(self, document_service):
+        """Test text file type detection."""
+        file_type = document_service._detect_file_type("/path/to/document.txt")
+        assert file_type == FileType.TXT
+    
+    def test_i_can_detect_docx_file_type(self, document_service):
+        """Test DOCX file type detection."""
+        file_type = document_service._detect_file_type("/path/to/document.docx")
+        assert file_type == FileType.DOCX
+    
+    def test_i_cannot_detect_unsupported_file_type(self, document_service):
+        """Test unsupported file type raises ValueError."""
+        with pytest.raises(ValueError, match="Unsupported file type"):
+            document_service._detect_file_type("/path/to/document.xyz")
--- a/tests/test_user_repository.py
+++ b/tests/test_user_repository.py
@@ -23,7 +23,7 @@ async def in_memory_repository():
  client = AsyncMongoMockClient()
  db = client.test_database
  repo = UserRepository(db)
-  #await repo.initialize()
+  await repo.initialize()
  return repo


--- a/tests/test_utils_document_matching.py
+++ b/tests/test_utils_document_matching.py
@@ -0,0 +1,87 @@
+import os
+from datetime import datetime
+
+import pytest
+from app.models.document import FileDocument, FileType
+from app.utils.document_matching import fuzzy_matching, subsequence_matching
+
+
+def get_doc(filename: str = None):
+  """Sample FileDocument data for testing."""
+  return FileDocument(
+    filename=f"{filename}",
+    filepath=f"/path/to/{filename}",
+    file_hash="a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
+    file_type=FileType(os.path.splitext(filename)[1].lstrip(".") or "txt"),
+    detected_at=datetime.now(),
+  )
+
+
+class TestFuzzyMatching:
+  def test_i_can_find_exact_match_with_fuzzy(self):
+    # Exact match should always pass
+    docs = [get_doc(filename="hello.txt")]
+    result = fuzzy_matching("hello.txt", docs)
+    assert len(result) == 1
+    assert result[0].filename == "hello.txt"
+  
+  def test_i_can_find_close_match_with_fuzzy(self):
+    # "helo.txt" should match "hello.txt" with high similarity
+    docs = [get_doc(filename="hello.txt")]
+    result = fuzzy_matching("helo.txt", docs, similarity_threshold=0.7)
+    assert len(result) == 1
+    assert result[0].filename == "hello.txt"
+  
+  def test_i_cannot_find_dissimilar_match_with_fuzzy(self):
+    # "world.txt" should not match "hello.txt"
+    docs = [get_doc(filename="hello.txt")]
+    result = fuzzy_matching("world.txt", docs, similarity_threshold=0.7)
+    assert len(result) == 0
+  
+  def test_i_can_sort_by_similarity_in_fuzzy(self):
+    # "helo.txt" is closer to "hello.txt" than "hllll.txt"
+    docs = [
+        get_doc(filename="hello.txt"),
+        get_doc(filename="hllll.txt"),
+    ]
+    result = fuzzy_matching("helo.txt", docs, similarity_threshold=0.5)
+    assert result[0].filename == "hello.txt"
+
+
+class TestSubsequenceMatching:
+  def test_i_can_match_subsequence_simple(self):
+    # "ifb" should match "ilFaitBeau.txt"
+    docs = [get_doc(filename="ilFaitBeau.txt")]
+    result = subsequence_matching("ifb", docs)
+    assert len(result) == 1
+    assert result[0].filename == "ilFaitBeau.txt"
+  
+  def test_i_cannot_match_wrong_order_subsequence(self):
+    # "fib" should not match "ilFaitBeau.txt" because the order is wrong
+    docs = [get_doc(filename="ilFaitBeau.txt")]
+    result = subsequence_matching("bfi", docs)
+    assert len(result) == 0
+  
+  def test_i_can_match_multiple_documents_subsequence(self):
+    # "ifb" should match both filenames, but "ilFaitBeau.txt" has a higher score
+    docs = [
+        get_doc(filename="ilFaitBeau.txt"),
+        get_doc(filename="information_base.txt"),
+    ]
+    result = subsequence_matching("ifb", docs)
+    assert len(result) == 2
+    assert result[0].filename == "ilFaitBeau.txt"
+    assert result[1].filename == "information_base.txt"
+  
+  def test_i_cannot_match_unrelated_subsequence(self):
+    # "xyz" should not match any file
+    docs = [get_doc(filename="ilFaitBeau.txt")]
+    result = subsequence_matching("xyz", docs)
+    assert len(result) == 0
+  
+  def test_i_can_handle_case_insensitivity_in_subsequence(self):
+    # Matching should be case-insensitive
+    docs = [get_doc(filename="HelloWorld.txt")]
+    result = subsequence_matching("hw", docs)
+    assert len(result) == 1
+    assert result[0].filename == "HelloWorld.txt"
Author	SHA1	Message	Date
Kodjo Sossouvi	f1b551d243	Adding document service	2025-09-19 22:59:41 +02:00
Kodjo Sossouvi	e8b306ac4a	Fixed unit tests	2025-09-19 21:06:09 +02:00