
Edge GitOps - KServe on k3s with GPU

GitOps setup for deploying ML models using KServe on a k3s cluster with GPU support (DGX Spark).

Prerequisites

  • k3s cluster with GPU support
  • kubectl configured to access the cluster
  • Gitea instance for GitOps repository
  • FluxCD CLI installed

Architecture

edge-gitops/
├── clusters/
│   └── k3s-dgx/
│       ├── flux-system/          # FluxCD installation
│       ├── gpu-support/          # NVIDIA GPU Operator
│       ├── kserve/               # KServe installation
│       └── apps/                 # ML model deployments
├── apps/                        # Reusable app manifests
└── infrastructure/              # Base infrastructure

Setup Instructions

1. Bootstrap FluxCD

flux bootstrap git \
  --url=ssh://git@gitea.example.com/edge-gitops/edge-gitops.git \
  --branch=main \
  --path=clusters/k3s-dgx \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller

2. Configure Gitea SSH Key

Generate SSH key for FluxCD:

ssh-keygen -t ed25519 -N "" -f flux-gitea-key

Add the public key (flux-gitea-key.pub) to your Gitea repository as a deploy key. Enable write access on the key: the bootstrap process pushes the FluxCD manifests back to the repository.

3. Update Repository Configuration

Edit clusters/k3s-dgx/flux-system/gotk-sync.yaml to match your Gitea URL:

url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
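For orientation, this is roughly what the GitRepository object in gotk-sync.yaml looks like after editing. flux bootstrap generates it for you; the interval and secret name below are the usual defaults, so keep whatever your bootstrap wrote:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  secretRef:
    name: flux-system   # SSH key secret created during bootstrap
  url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```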

4. Deploy the Stack

Commit and push the changes:

git add .
git commit -m "Initial GitOps setup for KServe on k3s"
git push origin main

FluxCD will automatically sync the changes to your cluster.
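The sync is driven by the Flux Kustomization that points at the cluster path. A sketch of that object as flux bootstrap typically generates it (interval and prune setting assumed):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./clusters/k3s-dgx   # everything under this path is applied
  prune: true                # delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```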

Components

GPU Support

  • NVIDIA GPU Operator (v23.9.1)
  • NVIDIA Device Plugin
  • DCGM Exporter for monitoring
  • GPU Node Feature Discovery
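A minimal sketch of how the GPU Operator could be declared under clusters/k3s-dgx/gpu-support/, assuming it is installed via a Flux HelmRelease from NVIDIA's official chart repository. The namespace and values are assumptions; check your Flux version for the exact HelmRelease API version:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nvidia
  namespace: gpu-operator
spec:
  interval: 1h
  url: https://helm.ngc.nvidia.com/nvidia
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 30m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1
      sourceRef:
        kind: HelmRepository
        name: nvidia
  values:
    driver:
      enabled: false   # DGX images typically ship the NVIDIA driver already
```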

KServe

  • KServe Core (v0.12.0)
  • GPU-enabled Serving Runtime
  • Istio Gateway for networking
  • Model Storage (PVC)
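
A sketch of the model storage claim referenced above (clusters/k3s-dgx/kserve/model-storage-pvc.yaml). The size and storage class are assumptions; k3s ships a local-path provisioner by default:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 200Gi   # size this for your model weights
```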

Example Model

  • Huihui-granite-4.1-30b-abliterated (Hugging Face)
  • GPU-accelerated inference
  • REST API endpoint

Usage

Deploy a New Model

  1. Create a new InferenceService in clusters/k3s-dgx/apps/:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
      resources:
        limits:
          nvidia.com/gpu: "1"

  2. Commit and push the changes.
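
If the apps/ directory is assembled by a kustomization.yaml, the new manifest also needs an entry there before Flux will apply it. A sketch, with file names assumed (huihui-granite-inference.yaml matches the file mentioned under Customization below):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
  - your-model.yaml   # the InferenceService added in step 1
```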

Test the Model

# Get the service URL
kubectl get inferenceservice huihui-granite -n kserve

# Test inference
curl -X POST http://your-service-url/v1/models/huihui-granite:predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "text", "shape": [1], "datatype": "BYTES", "data": ["Hello world"]}]}'

Monitoring

Check FluxCD status:

flux get all --all-namespaces

Check GPU status:

kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

Check KServe services:

kubectl get inferenceservices -n kserve

Troubleshooting

GPU Not Available

kubectl describe node | grep -A 5 nvidia.com/gpu
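
To confirm the scheduler can actually place GPU workloads, a throwaway smoke-test pod can help. The image tag is an assumption; any CUDA base image that includes nvidia-smi will do:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

Apply it with kubectl, then check kubectl logs gpu-smoke-test: if the device plugin is healthy, the log shows the GPU table; if the pod stays Pending, the node is not advertising nvidia.com/gpu.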

KServe Pods Not Starting

kubectl logs -n kserve deployment/kserve-controller-manager
kubectl get pods -n kserve

FluxCD Sync Issues

flux reconcile kustomization flux-system --with-source
flux logs

Customization

GPU Resources

Edit clusters/k3s-dgx/apps/huihui-granite-inference.yaml to adjust GPU allocation.

Storage

Modify clusters/k3s-dgx/kserve/model-storage-pvc.yaml for different storage requirements.

Networking

Update clusters/k3s-dgx/kserve/istio-gateway.yaml for custom ingress configuration.
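
A sketch of what that Gateway might look like; the gateway name, selector, and host pattern are assumptions to adapt to your Istio installation:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: kserve-gateway
  namespace: kserve
spec:
  selector:
    istio: ingressgateway   # matches the default Istio ingress deployment
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.models.example.com"
```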
