# Edge GitOps - Complete Setup Guide

## Quick Start

1. **Configure the setup:**

   ```bash
   ./configure.sh
   ```

2. **Bootstrap FluxCD:**

   ```bash
   make bootstrap
   ```

3. **Monitor the deployment:**

   ```bash
   make status
   ```

## Directory Structure

```
edge-gitops/
├── bootstrap.sh                  # FluxCD bootstrap script
├── configure.sh                  # Configuration wizard
├── Makefile                      # Convenient commands
├── README.md                     # Main documentation
├── .gitignore                    # Git ignore rules
├── .env                          # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                  # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/          # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/          # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/               # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                 # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                         # Reusable application manifests
└── infrastructure/               # Base infrastructure components
```

## Component Details

### FluxCD
- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for GitRepository, 10 minutes for Kustomization
- **Repository:** Gitea (configurable)

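The sync intervals above correspond to the `GitRepository` and `Kustomization` objects that the bootstrap generates in `gotk-sync.yaml`. A sketch of what those objects would look like — the repository URL and branch here are placeholders, not taken from the actual configuration:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s                    # poll Gitea for new commits every minute
  url: https://gitea.example.com/org/edge-gitops.git   # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s                   # re-apply cluster manifests every ten minutes
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```
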
### NVIDIA GPU Operator
- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery

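The pinned versions above would typically live in `gpu-operator-helmrelease.yaml`. A hedged sketch — the `HelmRepository` name and namespace are assumptions, not taken from the repository:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1              # pinned operator version from above
      sourceRef:
        kind: HelmRepository
        name: nvidia                # assumed HelmRepository name
        namespace: flux-system
  values:
    driver:
      version: "535.129.03"         # pinned driver version from above
```
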
### KServe
- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)

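The 50Gi model store corresponds to `model-storage-pvc.yaml` in the directory tree. A minimal sketch — the claim name, access mode, and storage class are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage               # assumed name, matching model-storage-pvc.yaml
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce                 # switch to ReadWriteMany for multi-pod access
  resources:
    requests:
      storage: 50Gi
```
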
### Example Model
- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU

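Putting those numbers together, `huihui-granite-inference.yaml` plausibly resembles the following. The metadata name and the Hugging Face repo path are guesses for illustration; only the resource figures come from the list above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huihui-granite              # assumed name
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/Huihui-granite-4.1-30b-abliterated"  # org is a placeholder
      resources:
        requests:
          cpu: "2"                  # lower bound from the resource list
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"                  # upper bound from the resource list
          memory: 16Gi
          nvidia.com/gpu: "1"
```
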
## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```

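After step 2, the apps kustomization would look roughly like this (the existing entry is taken from the directory tree; the overall layout is an assumption):

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
  - your-model.yaml                 # appended by the echo in step 2
```
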
### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```

### Monitor Model Performance

```bash
# Get model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```

## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source
```

### GPU Not Available

```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check GPU operator
kubectl get pods -n gpu-operator

# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```

### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve
```

## Security Considerations

1. **Secrets Management:**
   - Never commit secrets to git
   - Use Kubernetes secrets for sensitive data
   - Consider using Sealed Secrets or External Secrets Operator

2. **Network Policies:**
   - Review and restrict network access
   - Use Istio for service mesh security

3. **RBAC:**
   - Review FluxCD service account permissions
   - Implement the principle of least privilege

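As one example of keeping credentials out of git: a Hugging Face access token for gated models is usually stored as a Kubernetes Secret created out-of-band rather than committed. A generic sketch — the Secret name and key are placeholders, not the actual contents of `storage-config.yaml`:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-credentials              # placeholder name
  namespace: kserve
type: Opaque
stringData:
  HF_TOKEN: "<token>"               # placeholder; set via kubectl, never commit the real value
```
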
## Performance Optimization

### GPU Optimization
- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization
- Use fast storage for model cache
- Consider using ReadWriteMany for multi-pod access
- Implement model caching strategies

### Network Optimization
- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider using gRPC for internal communication

## Scaling

### Horizontal Scaling
```yaml
# Add to the InferenceService predictor; KServe scales between these bounds
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
```

### Vertical Scaling
```yaml
# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```

## Monitoring

### Metrics Collection
- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging
- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies

### Alerting
- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts

## Backup and Recovery

### GitOps Backup
- All configuration is in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup
- Model storage backup
- Configuration backup
- Disaster recovery plan

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## Support

For issues and questions:
- Check the troubleshooting section
- Review component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository