# Edge GitOps - KServe on k3s with GPU

GitOps setup for deploying ML models using KServe on a k3s cluster with GPU support (DGX Spark).

## Prerequisites

- k3s cluster with GPU support
- kubectl configured to access the cluster
- Gitea instance hosting the GitOps repository
- FluxCD CLI installed

## Architecture

```
edge-gitops/
├── clusters/
│   └── k3s-dgx/
│       ├── flux-system/   # FluxCD installation
│       ├── gpu-support/   # NVIDIA GPU Operator
│       ├── kserve/        # KServe installation
│       └── apps/          # ML model deployments
├── apps/                  # Reusable app manifests
└── infrastructure/        # Base infrastructure
```
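
The split between `clusters/` and the reusable `apps/` tree is typically wired together with Kustomize. A minimal sketch, assuming a hypothetical `apps/huihui-granite/` directory holding the reusable manifests (point the path at whatever directories your repo actually contains):

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ../../../ walks from clusters/k3s-dgx/apps/ back to the repo root
  - ../../../apps/huihui-granite
```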

## Setup Instructions

### 1. Bootstrap FluxCD

```bash
flux bootstrap git \
  --url=ssh://git@gitea.example.com/edge-gitops/edge-gitops.git \
  --branch=main \
  --path=clusters/k3s-dgx \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller
```

### 2. Configure Gitea SSH Key

Generate an SSH key for FluxCD:

```bash
ssh-keygen -t ed25519 -N "" -f flux-gitea-key
```

Add the public key (`flux-gitea-key.pub`) to your Gitea repository as a deploy key.
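
`flux bootstrap` normally creates the corresponding cluster-side credentials itself. If you ever need to (re)create them manually (for example after rotating the key), the secret FluxCD reads takes roughly this shape; the placeholder values are yours to fill in:

```yaml
# Sketch of the Git credentials secret used by the source-controller.
apiVersion: v1
kind: Secret
metadata:
  name: flux-system
  namespace: flux-system
stringData:
  identity: |
    # contents of flux-gitea-key (the private key)
  identity.pub: |
    # contents of flux-gitea-key.pub
  known_hosts: |
    # output of: ssh-keyscan gitea.example.com
```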

### 3. Update Repository Configuration

Edit `clusters/k3s-dgx/flux-system/gotk-sync.yaml` to match your Gitea URL:

```yaml
url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```
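
For context, the `url` field sits inside the `GitRepository` object that `flux bootstrap` generates. A sketch of the surrounding object, using FluxCD's default names and intervals (verify against your generated file):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  secretRef:
    name: flux-system
  url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```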

### 4. Deploy the Stack

Commit and push the changes:

```bash
git add .
git commit -m "Initial GitOps setup for KServe on k3s"
git push origin main
```

FluxCD will automatically sync the changes to your cluster.

## Components

### GPU Support

- NVIDIA GPU Operator (v23.9.1)
- NVIDIA Device Plugin
- DCGM Exporter for monitoring
- GPU Node Feature Discovery

### KServe

- KServe Core (v0.12.0)
- GPU-enabled Serving Runtime
- Istio Gateway for networking
- Model Storage (PVC)

### Example Model

- Huihui-granite-4.1-30b-abliterated (Hugging Face)
- GPU-accelerated inference
- REST API endpoint
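
A GPU-enabled serving runtime for Hugging Face models might look roughly like the sketch below. The resource name, image tag, and args are assumptions modeled on KServe's stock `huggingfaceserver` runtime; compare with the runtime actually installed from `clusters/k3s-dgx/kserve/`:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver-gpu   # hypothetical name
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:v0.12.0
      args:
        # {{.Name}} is templated by KServe with the InferenceService name
        - --model_name={{.Name}}
      resources:
        limits:
          nvidia.com/gpu: "1"
```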

## Usage

### Deploy a New Model

1. Create a new InferenceService manifest in `clusters/k3s-dgx/apps/`:

   ```yaml
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   ```

2. Commit and push the changes.
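
If the `apps/` directory is assembled through a `kustomization.yaml` (an assumption; your repo may list resources differently), remember to register the new manifest there as well:

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
  - your-model.yaml   # the new InferenceService from step 1
```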

### Test the Model

```bash
# Get the service URL
kubectl get inferenceservice huihui-granite -n kserve

# Test inference
curl -X POST http://your-service-url/v1/models/huihui-granite:predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "text", "shape": [1], "datatype": "BYTES", "data": ["Hello world"]}]}'
```

## Monitoring

Check FluxCD status:

```bash
flux get all --all-namespaces
```

Check GPU status:

```bash
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```

Check KServe services:

```bash
kubectl get inferenceservices -n kserve
```

## Troubleshooting

### GPU Not Available

```bash
kubectl describe node | grep -A 5 nvidia.com/gpu
```

### KServe Pods Not Starting

```bash
kubectl logs -n kserve deployment/kserve-controller-manager
kubectl get pods -n kserve
```

### FluxCD Sync Issues

```bash
flux reconcile kustomization flux-system --with-source
flux logs
```

## Customization

### GPU Resources

Edit `clusters/k3s-dgx/apps/huihui-granite-inference.yaml` to adjust GPU allocation.

### Storage

Modify `clusters/k3s-dgx/kserve/model-storage-pvc.yaml` for different storage requirements.
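
For reference, a minimal PVC of the shape this file likely contains. The 100Gi size is an assumption; `local-path` is the default StorageClass that ships with k3s:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage   # hypothetical name; match model-storage-pvc.yaml
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # k3s default provisioner
  resources:
    requests:
      storage: 100Gi   # assumed size; adjust for your models
```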

### Networking

Update `clusters/k3s-dgx/kserve/istio-gateway.yaml` for custom ingress configuration.
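
As a starting point, a minimal Istio `Gateway` with a custom host might look like this sketch; the selector matches Istio's stock ingress gateway, and the host is a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: kserve-gateway   # hypothetical name; match istio-gateway.yaml
  namespace: kserve
spec:
  selector:
    istio: ingressgateway   # selects the default istio-ingressgateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.models.example.com"   # placeholder host
```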