# Edge GitOps - Complete Setup Guide

## Quick Start

1. **Configure the setup:**

   ```bash
   ./configure.sh
   ```

2. **Bootstrap FluxCD:**

   ```bash
   make bootstrap
   ```

3. **Monitor the deployment:**

   ```bash
   make status
   ```

## Directory Structure

```
edge-gitops/
├── bootstrap.sh                      # FluxCD bootstrap script
├── configure.sh                      # Configuration wizard
├── Makefile                          # Convenient commands
├── README.md                         # Main documentation
├── .gitignore                        # Git ignore rules
├── .env                              # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                      # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/              # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/              # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/                   # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                     # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                             # Reusable application manifests
└── infrastructure/                   # Base infrastructure components
```

## Component Details

### FluxCD

- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for the GitRepository, 10 minutes for the Kustomization
- **Repository:** Gitea (configurable)

### NVIDIA GPU Operator

- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery

### KServe

- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
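The `gpu-operator-helmrelease.yaml` listed in the directory structure is reconciled by the helm-controller. As a rough sketch of what such a HelmRelease might contain, pinned to the versions above — the `HelmRepository` named `nvidia` in `flux-system` is an assumption, not something stated in this guide:

```yaml
# Hypothetical sketch of clusters/k3s-dgx/gpu-support/gpu-operator-helmrelease.yaml.
# Assumes a HelmRepository named "nvidia" exists in the flux-system namespace.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1            # GPU Operator version pinned above
      sourceRef:
        kind: HelmRepository
        name: nvidia
        namespace: flux-system
  values:
    driver:
      version: "535.129.03"       # driver version pinned above
```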
### Example Model

- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU

## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```

### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```

### Monitor Model Performance

```bash
# Get the model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage (substitute the predictor pod name)
kubectl exec -n kserve <predictor-pod> -- nvidia-smi
```

## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force a sync
flux reconcile kustomization flux-system --with-source
```

### GPU Not Available

```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check the GPU operator
kubectl get pods -n gpu-operator

# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```

### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check the KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe the InferenceService
kubectl describe inferenceservice your-model -n kserve
```
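The plain `echo >>` in step 2 of "Add a New Model" appends a duplicate entry if it is run twice. A small sketch of an idempotent variant — `add_resource` is a hypothetical helper, not part of this repository:

```shell
# add_resource: append a manifest to a kustomization's resources list,
# skipping the entry if it is already present. Hypothetical convenience
# wrapper; the guide's raw `echo >>` would duplicate the line on reruns.
add_resource() {
  kustomization=$1
  manifest=$2
  if grep -qxF "  - ${manifest}" "$kustomization"; then
    echo "${manifest} already listed in ${kustomization}"
  else
    printf '  - %s\n' "$manifest" >> "$kustomization"
    echo "added ${manifest} to ${kustomization}"
  fi
}

# Usage, with the paths from the repository layout:
# add_resource clusters/k3s-dgx/apps/kustomization.yaml your-model.yaml
```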
## Security Considerations

1. **Secrets Management:**
   - Never commit secrets to git
   - Use Kubernetes Secrets for sensitive data
   - Consider using Sealed Secrets or the External Secrets Operator

2. **Network Policies:**
   - Review and restrict network access
   - Use Istio for service mesh security

3. **RBAC:**
   - Review FluxCD service account permissions
   - Implement the principle of least privilege

## Performance Optimization

### GPU Optimization

- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with the DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization

- Use fast storage for the model cache
- Consider ReadWriteMany volumes for multi-pod access
- Implement model caching strategies

### Network Optimization

- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider gRPC for internal communication

## Scaling

### Horizontal Scaling

```yaml
# Add to the InferenceService predictor spec
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
```

### Vertical Scaling

```yaml
# Update the resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```

## Monitoring

### Metrics Collection

- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging

- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies

### Alerting

- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts

## Backup and Recovery

### GitOps Backup

- All configuration lives in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup

- Model storage backup
- Configuration backup
- Disaster recovery plan

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## Support

For issues and questions:

- Check the Troubleshooting section
- Review the component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository
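The Horizontal Scaling example above can be combined with KServe's concurrency-based autoscaling. A sketch, assuming the serverless (Knative-backed) deployment mode; the `scaleTarget` value is illustrative, not a recommendation from this guide:

```yaml
# Hypothetical autoscaling variant of the example InferenceService.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    minReplicas: 1            # keep one warm replica (0 would allow scale-to-zero)
    maxReplicas: 3
    scaleMetric: concurrency
    scaleTarget: 5            # target in-flight requests per replica
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
      resources:
        limits:
          nvidia.com/gpu: "1"
```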