# Document Service

Generic document management service with S3 storage and PDF field discovery.

## Features

- **Multi-format support**: PDF, DOCX, XLSX, JPG, JPEG, PNG, GIF
- **S3 storage**: Configurable S3-compatible storage (MinIO, AWS S3, etc.)
- **PDF field discovery**: Extract form fields from PDF documents
- **Organization-based access control**: Documents scoped to organizations
- **File size limits**: Configurable per document type
- **Content type detection**: Automatic detection using python-magic
- **Comprehensive logging**: All operations logged for audit trail

## API Endpoints

### Upload Document

```
POST /api/documents/upload
Content-Type: multipart/form-data
Authorization: Bearer

Form data:
- file: (required) Document file
- uploaded_by: (optional) User who uploaded the document

Response:
{
  "document_id": "uuid",
  "metadata": {...},
  "download_url": "presigned-url"
}
```

### Rewrite Document

```
PUT /api/documents/{document_id}
Content-Type: multipart/form-data
Authorization: Bearer

Form data:
- file: (required) New document file
- uploaded_by: (optional) User who uploaded the document

Response:
{
  "document_id": "uuid",
  "metadata": {...},
  "download_url": "presigned-url"
}
```

### Get Document Metadata

```
GET /api/documents/{document_id}
Authorization: Bearer

Response:
{
  "document_id": "uuid",
  "org_id": "org-id",
  "uploaded_by": "user",
  "document_type": "pdf",
  "filename": "document.pdf",
  "content_type": "application/pdf",
  "file_size": 12345,
  "s3_key": "documents/org-id/uuid/document.pdf",
  "created_at": "2024-01-01T00:00:00",
  "updated_at": "2024-01-01T00:00:00"
}
```

### Get Download URL

```
GET /api/documents/{document_id}/download-url?expires_in=3600
Authorization: Bearer

Response:
{
  "download_url": "presigned-url",
  "s3_key": "documents/org-id/uuid/document.pdf",
  "expires_in": 3600
}
```

### Get PDF Fields

```
GET /api/documents/{document_id}/fields
Authorization: Bearer

Response:
{
  "document_id": "uuid",
  "document_type": "pdf",
  "fields": [
    {
      "field": "field_name",
      "label": "Field Name",
      "type": "string",
      "required": false,
      "options": null
    }
  ]
}
```

### Delete Document

```
DELETE /api/documents/{document_id}
Authorization: Bearer

Response:
{
  "message": "Document deleted successfully"
}
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `S3_ENDPOINT` | S3 endpoint URL | `http://localhost:9000` |
| `S3_ACCESS_KEY` | S3 access key | `minioadmin` |
| `S3_SECRET_KEY` | S3 secret key | `minioadmin` |
| `S3_BUCKET` | S3 bucket name | `document-bucket` |
| `S3_REGION` | S3 region | `us-east-1` |
| `HOST` | Service host | `0.0.0.0` |
| `PORT` | Service port | `8082` |
| `TEST_UPLOADER` | Default uploader for testing | `test-user` |
| `LOG_LEVEL` | Logging level | `INFO` |

### File Size Limits

| Document Type | Default Limit |
|---------------|---------------|
| PDF | 50MB |
| DOCX | 25MB |
| XLSX | 25MB |
| JPG/JPEG | 10MB |
| PNG | 10MB |
| GIF | 10MB |
| Other | 10MB |

## Authentication

The service uses JWT tokens for authentication. The `org_id` is extracted from the token claims and used for organization-based access control.

**Note**: Currently, the auth middleware includes a mock implementation for testing. In production, this should be replaced with proper Zitadel integration.

## Development

### Setup

This project uses [uv2nix](https://pyproject-nix.github.io/uv2nix/) for reproducible Python dependency management with Nix.

```bash
# Enter the development shell (uses uv2nix)
nix develop

# The development shell includes:
# - Python with all dependencies from uv.lock
# - uv tool for package management
# - pyright for type checking
# - file package (provides libmagic for content type detection)
```

### Running the Service

```bash
# Start the development server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8082

# Access API documentation
open http://localhost:8082/docs
```

### Adding Dependencies

```bash
# Add a new dependency
uv add

# Add a development dependency
uv add --dev

# Update the lock file
uv lock
```

### Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=app
```

### Linting

```bash
# Run ruff
ruff check app/

# Format code
ruff format app/
```

### Building Production Package

```bash
# Build the production package
nix build

# The package will be available at ./result
```

## Deployment

### Using Helm

```bash
# Install chart
helm install document-service ./ops/chart

# Upgrade chart
helm upgrade document-service ./ops/chart

# Uninstall
helm uninstall document-service
```

### Configuration

Edit `ops/chart/values.yaml` to customize deployment settings.

## S3 Path Structure

Documents are stored in S3 using the following path structure:

```
documents/{org_id}/{document_id}/{filename}
```

Example:

```
documents/org-123/abc-456-def-789/policy_document.pdf
```

## Logging

All operations are logged with the following information:

- Operation type (upload, download, delete, etc.)
- Document ID
- Organization ID
- User ID
- Timestamp
- Success/failure status

## Error Handling

The service returns appropriate HTTP status codes:

- `200` - Success
- `201` - Created
- `400` - Bad Request
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `413` - Payload Too Large (file size exceeded)
- `415` - Unsupported Media Type
- `500` - Internal Server Error

## TODO

- [ ] Implement proper Zitadel authentication
- [ ] Add document listing endpoint
- [ ] Add document search functionality
- [ ] Add document versioning support
- [ ] Add document conversion capabilities
- [ ] Add comprehensive test coverage
- [ ] Add API rate limiting
- [ ] Add metrics and monitoring
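
## Example: S3 Keys and Size Limits

The S3 path structure and the default file size limits documented in this README can be sketched in a few lines of Python. This is an illustration only, not the service's actual code: the names `build_s3_key`, `exceeds_limit`, `SIZE_LIMITS`, and `DEFAULT_LIMIT` are hypothetical, the sketch assumes 1 MB means 1024 * 1024 bytes, and it assumes only the final path component of a filename is kept.

```python
from pathlib import PurePosixPath

# Assumption: the README's "MB" is taken to mean 1024 * 1024 bytes.
MB = 1024 * 1024

# Default limits copied from the "File Size Limits" table (hypothetical name).
SIZE_LIMITS = {
    "pdf": 50 * MB,
    "docx": 25 * MB,
    "xlsx": 25 * MB,
    "jpg": 10 * MB,
    "jpeg": 10 * MB,
    "png": 10 * MB,
    "gif": 10 * MB,
}
DEFAULT_LIMIT = 10 * MB  # the "Other" row of the table


def build_s3_key(org_id: str, document_id: str, filename: str) -> str:
    """Build a key of the form documents/{org_id}/{document_id}/{filename}."""
    # Assumption: keep only the last path component so a filename such as
    # "a/b/report.pdf" cannot add extra levels to the key.
    safe_name = PurePosixPath(filename).name
    return f"documents/{org_id}/{document_id}/{safe_name}"


def exceeds_limit(document_type: str, file_size: int) -> bool:
    """Return True when file_size (in bytes) is above the limit for the type."""
    limit = SIZE_LIMITS.get(document_type.lower(), DEFAULT_LIMIT)
    return file_size > limit


if __name__ == "__main__":
    print(build_s3_key("org-123", "abc-456-def-789", "policy_document.pdf"))
    print(exceeds_limit("pdf", 60 * MB))
```

In a handler built along these lines, a `True` result from `exceeds_limit` would correspond to the `413` response listed under Error Handling.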