# Edge GitOps - Complete Setup Guide
## Quick Start

1. Configure the setup:

   ```bash
   ./configure.sh
   ```

2. Bootstrap FluxCD:

   ```bash
   make bootstrap
   ```

3. Monitor the deployment:

   ```bash
   make status
   ```
## Directory Structure

```text
edge-gitops/
├── bootstrap.sh          # FluxCD bootstrap script
├── configure.sh          # Configuration wizard
├── Makefile              # Convenient commands
├── README.md             # Main documentation
├── .gitignore            # Git ignore rules
├── .env                  # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                        # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/                # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/                # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/                     # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                       # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                 # Reusable application manifests
└── infrastructure/       # Base infrastructure components
```
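The cluster entry point ties these directories together. A minimal sketch of what `clusters/k3s-dgx/kustomization.yaml` might contain, assuming each subdirectory carries its own kustomization (the exact resource list is illustrative; `flux-system` is typically reconciled by its own Flux Kustomization rather than listed here):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gpu-support
  - kserve
  - apps
```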
## Component Details

### FluxCD

- Version: Latest stable
- Components: source-controller, kustomize-controller, helm-controller, notification-controller
- Sync Interval: 1 minute for the GitRepository, 10 minutes for the Kustomization
- Repository: Gitea (configurable)
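Those sync intervals usually live in `gotk-sync.yaml`. A sketch of what that file typically contains, assuming a Gitea remote (the URL and branch are placeholders, not the actual repository address):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitea.example.com/ops/edge-gitops.git  # placeholder Gitea URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```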
### NVIDIA GPU Operator

- Version: v23.9.1
- Driver: 535.129.03
- Components:
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery
### KServe

- Version: v0.12.0
- Components:
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
### Example Model

- Name: Huihui-granite-4.1-30b-abliterated
- Source: Hugging Face
- Resources:
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU
## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```
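The kustomization edit in step 2 can be rehearsed in a scratch directory before touching the real repo. A sketch, with an illustrative kustomization matching the `apps/` directory above:

```shell
# Work in a scratch directory so the real repo is untouched.
APPS_DIR=$(mktemp -d)

# Illustrative kustomization, mirroring clusters/k3s-dgx/apps/.
cat > "$APPS_DIR/kustomization.yaml" << 'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
EOF

# Append the new model manifest, matching the existing list indentation.
echo "  - your-model.yaml" >> "$APPS_DIR/kustomization.yaml"

cat "$APPS_DIR/kustomization.yaml"
```

The appended entry must use the same indentation as the existing `resources` items, or the YAML list will not parse.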
### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```

Note that `nvidia.com/gpu` requests and limits must match: Kubernetes does not allow overcommitting extended resources such as GPUs.
### Monitor Model Performance

```bash
# Get the model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```
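Raw `nvidia-smi` output is easier to track over time in CSV form (`--query-gpu=... --format=csv`). A small sketch that summarizes one line per GPU; it uses a canned sample so it runs without a GPU, and in practice you would pipe from the `kubectl exec` command above instead of the `echo`:

```shell
# Canned sample of: nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
SAMPLE='index, utilization.gpu [%], memory.used [MiB]
0, 87 %, 40960 MiB
1, 12 %, 2048 MiB'

# Skip the header row and print a compact per-GPU summary.
echo "$SAMPLE" | awk -F', ' 'NR > 1 { printf "GPU %s: util %s, mem %s\n", $1, $2, $3 }'
```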
## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source
```
### GPU Not Available

```bash
# Check GPU allocatable on nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check GPU Operator pods
kubectl get pods -n gpu-operator

# View GPU Operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```
### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check the KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe the InferenceService
kubectl describe inferenceservice your-model -n kserve
```
## Security Considerations

- Secrets Management:
  - Never commit secrets to git
  - Use Kubernetes Secrets for sensitive data
  - Consider using Sealed Secrets or the External Secrets Operator
- Network Policies:
  - Review and restrict network access
  - Use Istio for service mesh security
- RBAC:
  - Review FluxCD service account permissions
  - Apply the principle of least privilege
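For example, Hugging Face credentials can live in a Kubernetes Secret created out-of-band instead of being committed. A sketch (the secret name, key, and namespace are assumptions for illustration, not names this repository defines):

```yaml
# Apply from a local, uncommitted file (or manage via Sealed Secrets):
#   kubectl apply -f hf-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-credentials
  namespace: kserve
type: Opaque
stringData:
  HF_TOKEN: <paste-token-here>   # never commit a real token value
```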
## Performance Optimization

### GPU Optimization

- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with the DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization

- Use fast storage for the model cache
- Consider the ReadWriteMany access mode for multi-pod access
- Implement model caching strategies

### Network Optimization

- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider gRPC for internal communication
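If you manage an Istio route directly (outside what KServe generates for you), a request timeout can be raised like this. The hostname and destination service are hypothetical placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: your-model-timeout
  namespace: kserve
spec:
  hosts:
    - your-model.example.com               # hypothetical external host
  http:
    - timeout: 300s                        # large models can take minutes per request
      route:
        - destination:
            host: your-model-predictor.kserve.svc.cluster.local  # hypothetical service name
```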
## Scaling

### Horizontal Scaling

Set replica bounds on the InferenceService predictor (KServe uses `minReplicas`/`maxReplicas` rather than a plain `replicas` field):

```yaml
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 3
```

### Vertical Scaling

Update the resource limits:

```yaml
resources:
  limits:
    nvidia.com/gpu: "2"
```
## Monitoring

### Metrics Collection

- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging

- Structured logging for all components
- Centralized logging with Loki or ELK
- Log retention policies

### Alerting

- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts
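A GPU utilization alert can be expressed as a Prometheus rule over the DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL` metric. A sketch assuming the Prometheus Operator is installed; the rule name, namespace, and threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUHighUtilization
          expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} above 90% utilization for 10 minutes"
```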
## Backup and Recovery

### GitOps Backup

- All configuration is in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup

- Model storage backup
- Configuration backup
- Disaster recovery plan
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## Support

For issues and questions:

- Check the troubleshooting section above
- Review the component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository