# Edge GitOps - KServe on k3s with GPU
GitOps setup for deploying ML models using KServe on a k3s cluster with GPU support (DGX Spark).
## Prerequisites
- k3s cluster with GPU support
- kubectl configured to access the cluster
- Gitea instance for GitOps repository
- FluxCD CLI installed
## Architecture
```
edge-gitops/
├── clusters/
│   └── k3s-dgx/
│       ├── flux-system/     # FluxCD installation
│       ├── gpu-support/     # NVIDIA GPU Operator
│       ├── kserve/          # KServe installation
│       └── apps/            # ML model deployments
├── apps/                    # Reusable app manifests
└── infrastructure/          # Base infrastructure
```
## Setup Instructions
### 1. Bootstrap FluxCD
```bash
flux bootstrap git \
  --url=ssh://git@gitea.example.com/edge-gitops/edge-gitops.git \
  --branch=main \
  --path=clusters/k3s-dgx \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller
```
### 2. Configure Gitea SSH Key
Generate an SSH key for FluxCD:
```bash
ssh-keygen -t ed25519 -N "" -f flux-gitea-key
```
Add the public key to your Gitea repository as a deploy key.
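If FluxCD was bootstrapped before this key existed, one option is to store the private key in the cluster yourself with the Flux CLI. A sketch, assuming the default secret name `flux-system` (which the bootstrap-generated `GitRepository` references) and the example Gitea URL from step 1:

```shell
# Store the SSH private key as the Git credentials secret Flux uses.
# "flux-system" is the secret name the bootstrapped GitRepository expects by default.
flux create secret git flux-system \
  --url=ssh://git@gitea.example.com/edge-gitops/edge-gitops.git \
  --private-key-file=flux-gitea-key \
  --namespace=flux-system
```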
### 3. Update Repository Configuration
Edit `clusters/k3s-dgx/flux-system/gotk-sync.yaml` to match your Gitea URL:
```yaml
url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```
### 4. Deploy the Stack
Commit and push the changes:
```bash
git add .
git commit -m "Initial GitOps setup for KServe on k3s"
git push origin main
```
FluxCD will automatically sync the changes to your cluster.
## Components
### GPU Support
- NVIDIA GPU Operator (v23.9.1)
- NVIDIA Device Plugin
- DCGM Exporter for monitoring
- GPU Node Feature Discovery
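To confirm the GPU stack is healthy after Flux reconciles it, a quick check (assuming the operator runs in the conventional `gpu-operator` namespace; adjust if your kustomization installs it elsewhere):

```shell
# Operator pods (device plugin, DCGM exporter, NFD workers) should all be Running
kubectl get pods -n gpu-operator

# The node should now advertise allocatable nvidia.com/gpu resources
kubectl describe node | grep nvidia.com/gpu
```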
### KServe
- KServe Core (v0.12.0)
- GPU-enabled Serving Runtime
- Istio Gateway for networking
- Model Storage (PVC)
### Example Model
- Huihui-granite-4.1-30b-abliterated (Hugging Face)
- GPU-accelerated inference
- REST API endpoint
## Usage
### Deploy a New Model
1. Create a new InferenceService in `clusters/k3s-dgx/apps/`:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
      resources:
        limits:
          nvidia.com/gpu: "1"
```
2. Commit and push changes
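Once Flux has reconciled the commit, you can watch the new service come up. A sketch, using the placeholder name `your-model` from the manifest above:

```shell
# Wait for the InferenceService to report READY=True
kubectl get inferenceservice your-model -n kserve -w

# If the predictor pod never appears, inspect the events
kubectl describe inferenceservice your-model -n kserve
```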
### Test the Model
```bash
# Get the service URL
kubectl get inferenceservice huihui-granite -n kserve

# Test inference
curl -X POST http://your-service-url/v1/models/huihui-granite:predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "text", "shape": [1], "datatype": "BYTES", "data": ["Hello world"]}]}'
```
## Monitoring
Check FluxCD status:
```bash
flux get all --all-namespaces
```
Check GPU status:
```bash
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```
Check KServe services:
```bash
kubectl get inferenceservices -n kserve
```
## Troubleshooting
### GPU Not Available
```bash
kubectl describe node | grep -A 5 nvidia.com/gpu
```
### KServe Pods Not Starting
```bash
kubectl logs -n kserve deployment/kserve-controller-manager
kubectl get pods -n kserve
```
### FluxCD Sync Issues
```bash
flux reconcile kustomization flux-system --with-source
flux logs
```
## Customization
### GPU Resources
Edit `clusters/k3s-dgx/apps/huihui-granite-inference.yaml` to adjust GPU allocation.
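For example, to pin the predictor to two GPUs, adjust the `resources` block of the InferenceService. A sketch (field names follow the KServe `InferenceService` schema; the values are illustrative):

```yaml
spec:
  predictor:
    model:
      resources:
        requests:
          nvidia.com/gpu: "2"   # request and limit should match for GPUs
        limits:
          nvidia.com/gpu: "2"
```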
### Storage
Modify `clusters/k3s-dgx/kserve/model-storage-pvc.yaml` for different storage requirements.
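A typical change is resizing the volume or switching the storage class. A sketch with illustrative values (`local-path` is the default provisioner shipped with k3s; large LLM weights need generous headroom):

```yaml
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
```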
### Networking
Update `clusters/k3s-dgx/kserve/istio-gateway.yaml` for custom ingress configuration.
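For instance, to serve models on a dedicated hostname, the Gateway's `hosts` list can be changed. A sketch (the hostname is hypothetical; the selector matches the default Istio ingress gateway labels):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: kserve-gateway
  namespace: kserve
spec:
  selector:
    istio: ingressgateway   # default Istio ingress gateway deployment
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "models.example.com"   # hypothetical hostname, replace with yours
```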