# Edge GitOps - Complete Setup Guide
## Quick Start
1. **Configure the setup:**
```bash
./configure.sh
```
2. **Bootstrap FluxCD:**
```bash
make bootstrap
```
3. **Monitor the deployment:**
```bash
make status
```
## Directory Structure
```
edge-gitops/
├── bootstrap.sh                 # FluxCD bootstrap script
├── configure.sh                 # Configuration wizard
├── Makefile                     # Convenient commands
├── README.md                    # Main documentation
├── .gitignore                   # Git ignore rules
├── .env                         # Environment variables (not committed)
├── clusters/
│   └── k3s-dgx/                 # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/         # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/         # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/              # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
├── apps/                        # Reusable application manifests
└── infrastructure/              # Base infrastructure components
```
## Component Details
### FluxCD
- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for GitRepository, 10 minutes for Kustomization
- **Repository:** Gitea (configurable)
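The sync intervals above correspond to Flux's source and apply objects. A minimal sketch of the two, assuming the intervals listed here; the repository URL and branch are placeholders for your Gitea instance, not the repository's actual values:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m                   # poll Gitea for new commits every minute
  url: https://gitea.example.com/your-org/edge-gitops.git   # placeholder
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m                  # re-apply the cluster path every ten minutes
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```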
### NVIDIA GPU Operator
- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery
### KServe
- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
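The GPU serving runtime referenced above (`gpu-serving-runtime.yaml`) is typically a KServe `ClusterServingRuntime`. A minimal sketch, in which the runtime name, container image, and model format are illustrative rather than the repository's actual values:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: gpu-serving-runtime      # illustrative name
spec:
  supportedModelFormats:
    - name: huggingface          # matches the modelFormat used by InferenceServices
      version: "1"
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"
```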
### Example Model
- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU
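In Kubernetes terms, the ranges above map to a requests/limits split. A sketch (not the repository's actual manifest); note that GPU request and limit must be equal, since `nvidia.com/gpu` is an extended resource:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: "1"
```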
## Common Tasks
### Add a New Model
1. Create a new InferenceService:
```bash
cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
```
2. Update the kustomization:
```bash
echo " - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
```
3. Commit and push:
```bash
git add clusters/k3s-dgx/apps/
git commit -m "Add new model deployment"
git push
```
### Update GPU Resources
Edit the resource requests and limits in your InferenceService:
```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```
Note that `nvidia.com/gpu` is an extended resource, so its request must always equal its limit.
### Monitor Model Performance
```bash
# Get model endpoint
kubectl get inferenceservice your-model -n kserve
# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f
# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```
## Troubleshooting
### FluxCD Not Syncing
```bash
# Check FluxCD status
flux check
# View logs
flux logs
# Force sync
flux reconcile kustomization flux-system --with-source
```
### GPU Not Available
```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
# Check GPU operator
kubectl get pods -n gpu-operator
# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```
### KServe Issues
```bash
# Check KServe pods
kubectl get pods -n kserve
# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager
# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve
```
## Security Considerations
1. **Secrets Management:**
- Never commit secrets to git
- Use Kubernetes secrets for sensitive data
- Consider using Sealed Secrets or External Secrets Operator
2. **Network Policies:**
- Review and restrict network access
- Use Istio for service mesh security
3. **RBAC:**
- Review FluxCD service account permissions
- Implement principle of least privilege
## Performance Optimization
### GPU Optimization
- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation
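MIG is driven by a node label that the operator's MIG Manager watches. A sketch of what `gpu-node-labels.yaml` might carry to enable it; the node name is a placeholder, and the profile string depends on the GPU model (e.g. `all-1g.5gb` on an A100 40GB):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: dgx-node-1               # placeholder node name
  labels:
    nvidia.com/mig.config: all-1g.5gb   # MIG Manager repartitions the GPU to match
```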
### Storage Optimization
- Use fast storage for model cache
- Consider using ReadWriteMany for multi-pod access
- Implement model caching strategies
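A shared model cache accessible from multiple pods needs a `ReadWriteMany` claim. A sketch, assuming an RWX-capable provisioner is installed; the claim name and storage class are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache       # hypothetical name
  namespace: kserve
spec:
  accessModes:
    - ReadWriteMany              # requires an RWX-capable provisioner (e.g. NFS)
  storageClassName: nfs-client   # placeholder storage class
  resources:
    requests:
      storage: 50Gi
```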
### Network Optimization
- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider using gRPC for internal communication
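For the timeout point above, KServe exposes a per-component timeout (in seconds) directly on the InferenceService spec; the value below is illustrative:

```yaml
spec:
  predictor:
    timeout: 600   # seconds; large models may need a generous first-response budget
```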
## Scaling
### Horizontal Scaling
```yaml
# Add to the InferenceService predictor spec
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 3
```
### Vertical Scaling
```yaml
# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```
## Monitoring
### Metrics Collection
- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance
### Logging
- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies
### Alerting
- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts
## Backup and Recovery
### GitOps Backup
- All configuration is in git
- Easy rollback with git revert
- Branch-based testing
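The revert-based rollback can be exercised locally before trying it against the live repo. This self-contained sketch creates a throwaway repository, makes a bad change, and reverts it; the file name, values, and commit messages are illustrative:

```shell
set -e
# Throwaway repository standing in for the GitOps repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "ci@example.com"
git config user.name "ci"

echo "replicas: 1" > app.yaml
git add app.yaml && git commit -qm "initial deployment"

echo "replicas: 3" > app.yaml
git commit -aqm "bad change"

# Revert creates a new commit undoing the bad one, so history
# stays append-only -- important for GitOps auditing.
git revert --no-edit HEAD >/dev/null
grep "replicas:" app.yaml    # prints: replicas: 1
```

Because the rollback is itself a commit, Flux applies it like any other change on its next sync.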
### Data Backup
- Model storage backup
- Configuration backup
- Disaster recovery plan
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## Support
For issues and questions:
- Check the troubleshooting section
- Review component documentation
- Check FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository