# Edge GitOps - Complete Setup Guide
## Quick Start
1. **Configure the setup:**
```bash
./configure.sh
```
2. **Bootstrap FluxCD:**
```bash
make bootstrap
```
3. **Monitor the deployment:**
```bash
make status
```
## Directory Structure
```
edge-gitops/
├── bootstrap.sh                 # FluxCD bootstrap script
├── configure.sh                 # Configuration wizard
├── Makefile                     # Convenient commands
├── README.md                    # Main documentation
├── .gitignore                   # Git ignore rules
├── .env                         # Environment variables (not committed)
├── clusters/
│   └── k3s-dgx/                 # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/         # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/         # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/              # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
├── apps/                        # Reusable application manifests
└── infrastructure/              # Base infrastructure components
```
## Component Details
### FluxCD
- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for GitRepository, 10 minutes for Kustomization
- **Repository:** Gitea (configurable)
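The sync intervals above correspond to Flux's source and apply objects. A minimal sketch of the two, assuming the intervals listed here; the repository URL and branch are placeholders for your Gitea instance, not the repository's actual values:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m                   # poll Gitea for new commits every minute
  url: https://gitea.example.com/your-org/edge-gitops.git   # placeholder
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m                  # re-apply the cluster path every ten minutes
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```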
### NVIDIA GPU Operator
- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery
### KServe
- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
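The GPU serving runtime referenced above (`gpu-serving-runtime.yaml`) is typically a KServe `ClusterServingRuntime`. A minimal sketch, in which the runtime name, container image, and model format are illustrative rather than the repository's actual values:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: gpu-serving-runtime      # illustrative name
spec:
  supportedModelFormats:
    - name: huggingface          # matches the modelFormat used by InferenceServices
      version: "1"
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"
```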
### Example Model
- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU
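In Kubernetes terms, the ranges above map to a requests/limits split. A sketch (not the repository's actual manifest); note that GPU request and limit must be equal, since `nvidia.com/gpu` is an extended resource:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: "1"
```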
## Common Tasks
### Add a New Model
1. Create a new InferenceService:
```bash
cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
```
2. Update the kustomization:
```bash
echo " - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
```
3. Commit and push:
```bash
git add clusters/k3s-dgx/apps/
git commit -m "Add new model deployment"
git push
```
### Update GPU Resources
Edit the resource requests and limits in your InferenceService:
```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```
Note that `nvidia.com/gpu` is an extended resource, so its request must always equal its limit.
### Monitor Model Performance
```bash
# Get model endpoint
kubectl get inferenceservice your-model -n kserve
# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f
# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```
## Troubleshooting
### FluxCD Not Syncing
```bash
# Check FluxCD status
flux check
# View logs
flux logs
# Force sync
flux reconcile kustomization flux-system --with-source
```
### GPU Not Available
```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
# Check GPU operator
kubectl get pods -n gpu-operator
# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```
### KServe Issues
```bash
# Check KServe pods
kubectl get pods -n kserve
# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager
# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve
```
## Security Considerations
1. **Secrets Management:**
- Never commit secrets to git
- Use Kubernetes secrets for sensitive data
- Consider using Sealed Secrets or External Secrets Operator
2. **Network Policies:**
- Review and restrict network access
- Use Istio for service mesh security
3. **RBAC:**
- Review FluxCD service account permissions
- Implement principle of least privilege
## Performance Optimization
### GPU Optimization
- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation
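MIG is driven by a node label that the operator's MIG Manager watches. A sketch of what `gpu-node-labels.yaml` might carry to enable it; the node name is a placeholder, and the profile string depends on the GPU model (e.g. `all-1g.5gb` on an A100 40GB):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: dgx-node-1               # placeholder node name
  labels:
    nvidia.com/mig.config: all-1g.5gb   # MIG Manager repartitions the GPU to match
```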
### Storage Optimization
- Use fast storage for model cache
- Consider using ReadWriteMany for multi-pod access
- Implement model caching strategies
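A shared model cache accessible from multiple pods needs a `ReadWriteMany` claim. A sketch, assuming an RWX-capable provisioner is installed; the claim name and storage class are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache       # hypothetical name
  namespace: kserve
spec:
  accessModes:
    - ReadWriteMany              # requires an RWX-capable provisioner (e.g. NFS)
  storageClassName: nfs-client   # placeholder storage class
  resources:
    requests:
      storage: 50Gi
```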
### Network Optimization
- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider using gRPC for internal communication
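For the timeout point above, KServe exposes a per-component timeout (in seconds) directly on the InferenceService spec; the value below is illustrative:

```yaml
spec:
  predictor:
    timeout: 600   # seconds; large models may need a generous first-response budget
```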
## Scaling
### Horizontal Scaling
```yaml
# Add to the InferenceService predictor spec
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 3
```
### Vertical Scaling
```yaml
# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```
## Monitoring
### Metrics Collection
- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance
### Logging
- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies
### Alerting
- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts
## Backup and Recovery
### GitOps Backup
- All configuration is in git
- Easy rollback with git revert
- Branch-based testing
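The revert-based rollback can be exercised locally before trying it against the live repo. This self-contained sketch creates a throwaway repository, makes a bad change, and reverts it; the file name, values, and commit messages are illustrative:

```shell
set -e
# Throwaway repository standing in for the GitOps repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "ci@example.com"
git config user.name "ci"

echo "replicas: 1" > app.yaml
git add app.yaml && git commit -qm "initial deployment"

echo "replicas: 3" > app.yaml
git commit -aqm "bad change"

# Revert creates a new commit undoing the bad one, so history
# stays append-only -- important for GitOps auditing.
git revert --no-edit HEAD >/dev/null
grep "replicas:" app.yaml    # prints: replicas: 1
```

Because the rollback is itself a commit, Flux applies it like any other change on its next sync.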
### Data Backup
- Model storage backup
- Configuration backup
- Disaster recovery plan
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## Support
For issues and questions:
- Check the troubleshooting section
- Review component documentation
- Check FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository