# Document Service

Generic document management service with S3 storage and PDF field discovery.

## Features

- **Multi-format support**: PDF, DOCX, XLSX, JPG, JPEG, PNG, GIF
- **S3 storage**: Configurable S3-compatible storage (MinIO, AWS S3, etc.)
- **PDF field discovery**: Extract form fields from PDF documents
- **Organization-based access control**: Documents scoped to organizations
- **File size limits**: Configurable per document type
- **Content type detection**: Automatic detection using python-magic
- **Comprehensive logging**: All operations logged for audit trail

## API Endpoints

### Upload Document
```
POST /api/documents/upload
Content-Type: multipart/form-data
Authorization: Bearer <token>

Form data:
- file: (required) Document file
- uploaded_by: (optional) User who uploaded the document

Response:
{
  "document_id": "uuid",
  "metadata": {...},
  "download_url": "presigned-url"
}
```
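As a rough client-side illustration, the sketch below builds the upload request using only the Python standard library. The base URL `http://localhost:8082`, the helper name `build_upload_request`, and the `application/octet-stream` part type are assumptions made for the example, not part of the service. The request is constructed but not sent.

```python
import uuid
import urllib.request
from typing import Optional


def build_upload_request(
    base_url: str,
    token: str,
    filename: str,
    data: bytes,
    uploaded_by: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    chunks = []

    # Optional text field: uploaded_by
    if uploaded_by is not None:
        chunks.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="uploaded_by"\r\n'
                "\r\n"
                f"{uploaded_by}\r\n"
            ).encode("utf-8")
        )

    # Required file field: file
    chunks.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n"
            "\r\n"
        ).encode("utf-8")
        + data
        + b"\r\n"
    )

    # Closing boundary
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/api/documents/upload",
        data=b"".join(chunks),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. The JSON response should then carry `document_id`, `metadata`, and `download_url` as listed above.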

### Replace Document
```
PUT /api/documents/{document_id}
Content-Type: multipart/form-data
Authorization: Bearer <token>

Form data:
- file: (required) New document file
- uploaded_by: (optional) User who uploaded the document

Response:
{
  "document_id": "uuid",
  "metadata": {...},
  "download_url": "presigned-url"
}
```

### Get Document Metadata
```
GET /api/documents/{document_id}
Authorization: Bearer <token>

Response:
{
  "document_id": "uuid",
  "org_id": "org-id",
  "uploaded_by": "user",
  "document_type": "pdf",
  "filename": "document.pdf",
  "content_type": "application/pdf",
  "file_size": 12345,
  "s3_key": "documents/org-id/uuid/document.pdf",
  "created_at": "2024-01-01T00:00:00",
  "updated_at": "2024-01-01T00:00:00"
}
```

### Get Download URL
```
GET /api/documents/{document_id}/download-url?expires_in=3600
Authorization: Bearer <token>

Response:
{
  "download_url": "presigned-url",
  "s3_key": "documents/org-id/uuid/document.pdf",
  "expires_in": 3600
}
```
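A minimal client-side sketch of this call, assuming a base URL of `http://localhost:8082` and that `expires_in` is a lifetime in seconds (the unit is not stated here). The helper names `build_download_url_request` and `parse_download_url_response` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_download_url_request(base_url, token, document_id, expires_in=3600):
    """Build (but do not send) the GET request for a presigned download URL."""
    query = urlencode({"expires_in": expires_in})
    return urllib.request.Request(
        f"{base_url}/api/documents/{document_id}/download-url?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def parse_download_url_response(body: str) -> str:
    """Return the presigned URL from a response shaped like the one above."""
    payload = json.loads(body)
    return payload["download_url"]
```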

### Get PDF Fields
```
GET /api/documents/{document_id}/fields
Authorization: Bearer <token>

Response:
{
  "document_id": "uuid",
  "document_type": "pdf",
  "fields": [
    {
      "field": "field_name",
      "label": "Field Name",
      "type": "string",
      "required": false,
      "options": null
    }
  ]
}
```
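On the client side, the response above could be mapped onto a small typed structure. The sketch below is one way to do that; the `PdfField` dataclass and `parse_fields_response` helper are hypothetical names, and the type of `options` (a list when present) is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdfField:
    field: str                      # internal field name
    label: str                      # human-readable label
    type: str                       # e.g. "string"
    required: bool = False
    options: Optional[list] = None  # assumed to be a list of choices, or null


def parse_fields_response(body: str) -> List[PdfField]:
    """Parse the JSON body of GET /api/documents/{document_id}/fields."""
    payload = json.loads(body)
    return [PdfField(**item) for item in payload["fields"]]
```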

### Delete Document
```
DELETE /api/documents/{document_id}
Authorization: Bearer <token>

Response:
{
  "message": "Document deleted successfully"
}
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `S3_ENDPOINT` | S3 endpoint URL | `http://localhost:9000` |
| `S3_ACCESS_KEY` | S3 access key | `minioadmin` |
| `S3_SECRET_KEY` | S3 secret key | `minioadmin` |
| `S3_BUCKET` | S3 bucket name | `document-bucket` |
| `S3_REGION` | S3 region | `us-east-1` |
| `HOST` | Service host | `0.0.0.0` |
| `PORT` | Service port | `8082` |
| `TEST_UPLOADER` | Default uploader for testing | `test-user` |
| `LOG_LEVEL` | Logging level | `INFO` |
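To illustrate how these variables and defaults fit together, here is a minimal sketch of a settings loader. The `Settings` class and `load_settings` function are hypothetical and not necessarily how the service reads its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    s3_endpoint: str
    s3_access_key: str
    s3_secret_key: str
    s3_bucket: str
    s3_region: str
    host: str
    port: int
    test_uploader: str
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        s3_endpoint=env.get("S3_ENDPOINT", "http://localhost:9000"),
        s3_access_key=env.get("S3_ACCESS_KEY", "minioadmin"),
        s3_secret_key=env.get("S3_SECRET_KEY", "minioadmin"),
        s3_bucket=env.get("S3_BUCKET", "document-bucket"),
        s3_region=env.get("S3_REGION", "us-east-1"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8082")),
        test_uploader=env.get("TEST_UPLOADER", "test-user"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```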

### File Size Limits

| Document Type | Default Limit |
|---------------|---------------|
| PDF | 50MB |
| DOCX | 25MB |
| XLSX | 25MB |
| JPG/JPEG | 10MB |
| PNG | 10MB |
| GIF | 10MB |
| Other | 10MB |
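The table can be read as a lookup with a fallback for unlisted types. A minimal sketch, assuming 1 MB means 1024 * 1024 bytes (the exact unit is not specified here) and using hypothetical names:

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Default limits from the table above, keyed by lowercase document type
DEFAULT_LIMITS = {
    "pdf": 50 * MB,
    "docx": 25 * MB,
    "xlsx": 25 * MB,
    "jpg": 10 * MB,
    "jpeg": 10 * MB,
    "png": 10 * MB,
    "gif": 10 * MB,
}
FALLBACK_LIMIT = 10 * MB  # the "Other" row


def max_size_for(document_type: str) -> int:
    """Return the size limit in bytes for a document type."""
    return DEFAULT_LIMITS.get(document_type.lower(), FALLBACK_LIMIT)


def exceeds_limit(document_type: str, size_bytes: int) -> bool:
    """True if a file of this size should be rejected (HTTP 413)."""
    return size_bytes > max_size_for(document_type)
```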

## Authentication

The service authenticates requests with JWT bearer tokens. The `org_id` is extracted from the token claims and used to scope document access to the caller's organization.

**Note**: The auth middleware currently contains a mock implementation intended for testing only. Replace it with a proper Zitadel integration before running in production.
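To show what extracting `org_id` from the token claims involves, the sketch below decodes a JWT payload with the standard library. It deliberately does not verify the signature, so it is suitable only for illustration or mock setups. The claim key `org_id` and the function names are assumptions; a real identity provider may use a different claim name.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWTs strip base64 padding; restore it before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def org_id_from_token(token: str) -> str:
    """Return the organization ID claim (assumed to be named 'org_id')."""
    org_id = unverified_claims(token).get("org_id")
    if not org_id:
        raise ValueError("token has no org_id claim")
    return org_id
```

A production implementation would validate the signature, issuer, and expiry before trusting any claim.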

## Development

### Setup

This project uses [uv2nix](https://pyproject-nix.github.io/uv2nix/) for reproducible Python dependency management with Nix.

```bash
# Enter the development shell (uses uv2nix)
nix develop

# The development shell includes:
# - Python with all dependencies from uv.lock
# - uv tool for package management
# - pyright for type checking
# - file package (provides libmagic for content type detection)
```

### Running the Service

```bash
# Start the development server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8082

# Access API documentation (`open` is macOS-specific; use xdg-open on Linux)
open http://localhost:8082/docs
```

### Adding Dependencies

```bash
# Add a new dependency
uv add <package-name>

# Add a development dependency
uv add --dev <package-name>

# Regenerate the lock file after editing pyproject.toml by hand
# (uv add already updates it)
uv lock
```

### Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=app
```

### Linting

```bash
# Run ruff
ruff check app/

# Format code
ruff format app/
```

### Building Production Package

```bash
# Build the production package
nix build

# The package will be available at ./result
```

## Deployment

### Using Helm

```bash
# Install chart
helm install document-service ./ops/chart

# Upgrade chart
helm upgrade document-service ./ops/chart

# Uninstall
helm uninstall document-service
```

### Configuration

Edit `ops/chart/values.yaml` to customize deployment settings.

## S3 Path Structure

Documents are stored in S3 using the following path structure:

```
documents/{org_id}/{document_id}/{filename}
```

Example:
```
documents/org-123/abc-456-def-789/policy_document.pdf
```
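The path structure above can be expressed as a pair of small helpers. This is a sketch with hypothetical function names, assuming `org_id` and `document_id` never contain a `/`.

```python
from typing import Tuple


def build_s3_key(org_id: str, document_id: str, filename: str) -> str:
    """Compose the S3 object key for a document."""
    return f"documents/{org_id}/{document_id}/{filename}"


def parse_s3_key(key: str) -> Tuple[str, str, str]:
    """Split a key back into (org_id, document_id, filename)."""
    parts = key.split("/", 3)
    if len(parts) != 4 or parts[0] != "documents":
        raise ValueError(f"unexpected S3 key layout: {key!r}")
    _, org_id, document_id, filename = parts
    return org_id, document_id, filename
```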

## Logging

All operations are logged with the following information:
- Operation type (upload, download, delete, etc.)
- Document ID
- Organization ID
- User ID
- Timestamp
- Success/failure status
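One way to capture the fields listed above is a structured (JSON) log line per operation. The sketch below is illustrative only: the logger name, key names, and the helpers `build_audit_record` and `log_operation` are assumptions, not the service's actual log format.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("document_service.audit")  # hypothetical logger name


def build_audit_record(
    operation: str,
    document_id: str,
    org_id: str,
    user_id: str,
    success: bool,
    timestamp: Optional[datetime] = None,
) -> dict:
    """Assemble one audit entry with the fields listed above."""
    ts = timestamp if timestamp is not None else datetime.now(timezone.utc)
    return {
        "operation": operation,
        "document_id": document_id,
        "org_id": org_id,
        "user_id": user_id,
        "timestamp": ts.isoformat(),
        "status": "success" if success else "failure",
    }


def log_operation(**fields) -> dict:
    """Emit the audit entry as a single JSON log line and return it."""
    record = build_audit_record(**fields)
    logger.info(json.dumps(record))
    return record
```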

## Error Handling

The service uses the following HTTP status codes:

- `200` - Success
- `201` - Created
- `400` - Bad Request
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `413` - Payload Too Large (file size exceeded)
- `415` - Unsupported Media Type
- `500` - Internal Server Error
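A common way to keep such a mapping consistent is to attach a status code to each domain exception and translate in one place. The following is a sketch under that assumption; the exception classes, `to_http_response`, and the `{"detail": ...}` body shape are hypothetical and may not match the service's actual code.

```python
from typing import Tuple


class DocumentServiceError(Exception):
    status_code = 500
    detail = "Internal Server Error"


class DocumentNotFound(DocumentServiceError):
    status_code = 404
    detail = "Not Found"


class AccessDenied(DocumentServiceError):
    status_code = 403
    detail = "Forbidden"


class FileTooLarge(DocumentServiceError):
    status_code = 413
    detail = "Payload Too Large"


class UnsupportedDocumentType(DocumentServiceError):
    status_code = 415
    detail = "Unsupported Media Type"


def to_http_response(exc: Exception) -> Tuple[int, dict]:
    """Translate an exception into (status_code, JSON body)."""
    if isinstance(exc, DocumentServiceError):
        # Prefer the specific message if one was given, else the generic detail
        return exc.status_code, {"detail": str(exc) or exc.detail}
    # Anything unexpected is reported as a generic 500
    return 500, {"detail": "Internal Server Error"}
```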

## TODO

- [ ] Implement proper Zitadel authentication
- [ ] Add document listing endpoint
- [ ] Add document search functionality
- [ ] Add document versioning support
- [ ] Add document conversion capabilities
- [ ] Add comprehensive test coverage
- [ ] Add API rate limiting
- [ ] Add metrics and monitoring