# Edge GitOps - Complete Setup Guide
## Quick Start

1. **Configure the setup:**

   ```bash
   ./configure.sh
   ```

2. **Bootstrap FluxCD:**

   ```bash
   make bootstrap
   ```

3. **Monitor the deployment:**

   ```bash
   make status
   ```
## Directory Structure

```
edge-gitops/
├── bootstrap.sh                  # FluxCD bootstrap script
├── configure.sh                  # Configuration wizard
├── Makefile                      # Convenient commands
├── README.md                     # Main documentation
├── .gitignore                    # Git ignore rules
├── .env                          # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                  # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/          # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/          # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/               # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                 # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                         # Reusable application manifests
└── infrastructure/               # Base infrastructure components
```
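As a sketch of how the `apps/` directory in the tree above is wired into Flux, a minimal `clusters/k3s-dgx/apps/kustomization.yaml` might look like this (the resource list is illustrative and simply names the manifest shown in the tree):

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
```

New model manifests are added to this `resources` list, as described under "Add a New Model" below.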
## Component Details
### FluxCD
- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for GitRepository, 10 minutes for Kustomization
- **Repository:** Gitea (configurable)
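The sync intervals above correspond to `interval` fields on the Flux resources generated during bootstrap. A sketch of the relevant fields (the names follow Flux's `flux-system` defaults; the repository URL is a placeholder):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s    # how often Flux fetches the git repository
  url: https://gitea.example.com/org/edge-gitops.git  # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s   # how often Flux re-applies the manifests
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```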
### NVIDIA GPU Operator
- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery
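In this layout the version pins above live in `gpu-operator-helmrelease.yaml`. A sketch of the relevant fields (the `HelmRepository` name and the values block are illustrative; the chart and driver versions match the pins above):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1        # pinned operator version
      sourceRef:
        kind: HelmRepository
        name: nvidia          # assumed HelmRepository name
  values:
    driver:
      version: "535.129.03"   # pinned driver version
```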
### KServe
- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
### Example Model
- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU
## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo " - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```
### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```
### Monitor Model Performance

```bash
# Get model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```
## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source
```
### GPU Not Available

```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check GPU operator
kubectl get pods -n gpu-operator

# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```
### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve
```
## Security Considerations

1. **Secrets Management:**
   - Never commit secrets to git
   - Use Kubernetes Secrets for sensitive data
   - Consider using Sealed Secrets or External Secrets Operator

2. **Network Policies:**
   - Review and restrict network access
   - Use Istio for service mesh security

3. **RBAC:**
   - Review FluxCD service account permissions
   - Apply the principle of least privilege
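As one concrete starting point for the network-policy item, a default-deny ingress policy for the `kserve` namespace that only admits traffic from the Istio gateway might look like this (a sketch; it assumes Istio runs in an `istio-system` namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kserve-ingress-only
  namespace: kserve
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system  # allow only the Istio gateway
```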
## Performance Optimization

### GPU Optimization
- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization
- Use fast storage for the model cache
- Consider using ReadWriteMany for multi-pod access
- Implement model caching strategies

### Network Optimization
- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider using gRPC for internal communication
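For the ReadWriteMany point above, the model-storage PVC can be declared with an RWX access mode so several predictor pods share the same cache. A sketch (the claim name mirrors `model-storage-pvc.yaml`; the storage class is a placeholder and must actually support RWX):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: kserve
spec:
  accessModes:
    - ReadWriteMany             # multiple pods may mount the volume read/write
  storageClassName: nfs-client  # placeholder: any RWX-capable class
  resources:
    requests:
      storage: 50Gi             # matches the 50Gi PVC noted earlier
```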
## Scaling

### Horizontal Scaling
```yaml
# Add to InferenceService
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
```

### Vertical Scaling
```yaml
# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```
## Monitoring

### Metrics Collection
- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging
- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies

### Alerting
- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts
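If Prometheus Operator is running in the cluster, GPU metrics from DCGM Exporter can be scraped with a ServiceMonitor along these lines (a sketch; the `app` label and metrics port name are assumptions about how the GPU Operator exposes the exporter service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter  # assumed service label
  endpoints:
    - port: gpu-metrics          # assumed metrics port name
      interval: 30s
```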
## Backup and Recovery

### GitOps Backup
- All configuration is in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup
- Model storage backup
- Configuration backup
- Disaster recovery plan
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## Support

For issues and questions:
- Check the troubleshooting section
- Review component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository