# Edge GitOps - Complete Setup Guide

## Quick Start

1. **Configure the setup:**

   ```bash
   ./configure.sh
   ```

2. **Bootstrap FluxCD:**

   ```bash
   make bootstrap
   ```

3. **Monitor the deployment:**

   ```bash
   make status
   ```

## Directory Structure

```
edge-gitops/
├── bootstrap.sh                  # FluxCD bootstrap script
├── configure.sh                  # Configuration wizard
├── Makefile                      # Convenient commands
├── README.md                     # Main documentation
├── .gitignore                    # Git ignore rules
├── .env                          # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                  # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/          # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/          # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/               # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                 # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                         # Reusable application manifests
└── infrastructure/               # Base infrastructure components
```

## Component Details

### FluxCD
- **Version:** Latest stable
- **Components:** source-controller, kustomize-controller, helm-controller, notification-controller
- **Sync Interval:** 1 minute for GitRepository, 10 minutes for Kustomization
- **Repository:** Gitea (configurable)

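The sync intervals above correspond to the `GitRepository` and `Kustomization` objects that the bootstrap generates in `gotk-sync.yaml`. A sketch of what those objects would look like — the repository URL and branch here are placeholders, not taken from the actual configuration:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s                    # poll Gitea for new commits every minute
  url: https://gitea.example.com/org/edge-gitops.git   # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s                   # re-apply cluster manifests every ten minutes
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```
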
### NVIDIA GPU Operator
- **Version:** v23.9.1
- **Driver:** 535.129.03
- **Components:**
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery

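The pinned versions above would typically live in `gpu-operator-helmrelease.yaml`. A hedged sketch — the `HelmRepository` name and namespace are assumptions, not taken from the repository:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1              # pinned operator version from above
      sourceRef:
        kind: HelmRepository
        name: nvidia                # assumed HelmRepository name
        namespace: flux-system
  values:
    driver:
      version: "535.129.03"         # pinned driver version from above
```
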
### KServe
- **Version:** v0.12.0
- **Components:**
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)

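The 50Gi model store corresponds to `model-storage-pvc.yaml` in the directory tree. A minimal sketch — the claim name, access mode, and storage class are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage               # assumed name, matching model-storage-pvc.yaml
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce                 # switch to ReadWriteMany for multi-pod access
  resources:
    requests:
      storage: 50Gi
```
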
### Example Model
- **Name:** Huihui-granite-4.1-30b-abliterated
- **Source:** Hugging Face
- **Resources:**
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU

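Putting those numbers together, `huihui-granite-inference.yaml` plausibly resembles the following. The metadata name and the Hugging Face repo path are guesses for illustration; only the resource figures come from the list above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huihui-granite              # assumed name
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/Huihui-granite-4.1-30b-abliterated"  # org is a placeholder
      resources:
        requests:
          cpu: "2"                  # lower bound from the resource list
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"                  # upper bound from the resource list
          memory: 16Gi
          nvidia.com/gpu: "1"
```
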
## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```

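After step 2, the apps kustomization would look roughly like this (the existing entry is taken from the directory tree; the overall layout is an assumption):

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
  - your-model.yaml                 # appended by the echo in step 2
```
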
### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```

### Monitor Model Performance

```bash
# Get model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```

## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source
```

### GPU Not Available

```bash
# Check GPU nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check GPU operator
kubectl get pods -n gpu-operator

# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```

### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve
```

## Security Considerations

1. **Secrets Management:**
   - Never commit secrets to git
   - Use Kubernetes secrets for sensitive data
   - Consider using Sealed Secrets or External Secrets Operator

2. **Network Policies:**
   - Review and restrict network access
   - Use Istio for service mesh security

3. **RBAC:**
   - Review FluxCD service account permissions
   - Implement the principle of least privilege

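As one example of keeping credentials out of git: a Hugging Face access token for gated models is usually stored as a Kubernetes Secret created out-of-band rather than committed. A generic sketch — the Secret name and key are placeholders, not the actual contents of `storage-config.yaml`:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-credentials              # placeholder name
  namespace: kserve
type: Opaque
stringData:
  HF_TOKEN: "<token>"               # placeholder; set via kubectl, never commit the real value
```
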
## Performance Optimization

### GPU Optimization
- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization
- Use fast storage for model cache
- Consider using ReadWriteMany for multi-pod access
- Implement model caching strategies

### Network Optimization
- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider using gRPC for internal communication

## Scaling

### Horizontal Scaling
```yaml
# Add to the InferenceService predictor; KServe scales between these bounds
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
```

### Vertical Scaling
```yaml
# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"
```

## Monitoring

### Metrics Collection
- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging
- Structured logging for all components
- Centralized logging with Loki/ELK
- Log retention policies

### Alerting
- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts

## Backup and Recovery

### GitOps Backup
- All configuration is in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup
- Model storage backup
- Configuration backup
- Disaster recovery plan

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## Support

For issues and questions:
- Check the troubleshooting section
- Review component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository