# Edge GitOps - KServe on k3s with GPU

GitOps setup for deploying ML models using KServe on a k3s cluster with GPU support (DGX Spark).

## Prerequisites

- k3s cluster with GPU support
- kubectl configured to access the cluster
- Gitea instance hosting the GitOps repository
- FluxCD CLI installed

## Architecture

```
edge-gitops/
├── clusters/
│   └── k3s-dgx/
│       ├── flux-system/   # FluxCD installation
│       ├── gpu-support/   # NVIDIA GPU Operator
│       ├── kserve/        # KServe installation
│       └── apps/          # ML model deployments
├── apps/                  # Reusable app manifests
└── infrastructure/        # Base infrastructure
```
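
The split between `clusters/` and the reusable `apps/` tree is typically wired together with Kustomize. A minimal sketch, assuming a hypothetical `apps/huihui-granite/` directory holding the reusable manifests (point the path at whatever directories your repo actually contains):

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ../../../ walks from clusters/k3s-dgx/apps/ back to the repo root
  - ../../../apps/huihui-granite
```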

## Setup Instructions

### 1. Bootstrap FluxCD

```bash
flux bootstrap git \
  --url=ssh://git@gitea.example.com/edge-gitops/edge-gitops.git \
  --branch=main \
  --path=clusters/k3s-dgx \
  --components=source-controller,kustomize-controller,helm-controller,notification-controller
```

### 2. Configure Gitea SSH Key

Generate an SSH key for FluxCD:

```bash
ssh-keygen -t ed25519 -N "" -f flux-gitea-key
```

Add the public key (`flux-gitea-key.pub`) to your Gitea repository as a deploy key.
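
`flux bootstrap` normally creates the corresponding cluster-side credentials itself. If you ever need to (re)create them manually (for example after rotating the key), the secret FluxCD reads takes roughly this shape; the placeholder values are yours to fill in:

```yaml
# Sketch of the Git credentials secret used by the source-controller.
apiVersion: v1
kind: Secret
metadata:
  name: flux-system
  namespace: flux-system
stringData:
  identity: |
    # contents of flux-gitea-key (the private key)
  identity.pub: |
    # contents of flux-gitea-key.pub
  known_hosts: |
    # output of: ssh-keyscan gitea.example.com
```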

### 3. Update Repository Configuration

Edit `clusters/k3s-dgx/flux-system/gotk-sync.yaml` to match your Gitea URL:

```yaml
url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```
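
For context, the `url` field sits inside the `GitRepository` object that `flux bootstrap` generates. A sketch of the surrounding object, using FluxCD's default names and intervals (verify against your generated file):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  secretRef:
    name: flux-system
  url: ssh://git@your-gitea-instance.com/edge-gitops/edge-gitops.git
```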

### 4. Deploy the Stack

Commit and push the changes:

```bash
git add .
git commit -m "Initial GitOps setup for KServe on k3s"
git push origin main
```

FluxCD will automatically sync the changes to your cluster.

## Components

### GPU Support

- NVIDIA GPU Operator (v23.9.1)
- NVIDIA Device Plugin
- DCGM Exporter for monitoring
- GPU Node Feature Discovery

### KServe

- KServe Core (v0.12.0)
- GPU-enabled Serving Runtime
- Istio Gateway for networking
- Model Storage (PVC)

### Example Model

- Huihui-granite-4.1-30b-abliterated (Hugging Face)
- GPU-accelerated inference
- REST API endpoint
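
A GPU-enabled serving runtime for Hugging Face models might look roughly like the sketch below. The resource name, image tag, and args are assumptions modeled on KServe's stock `huggingfaceserver` runtime; compare with the runtime actually installed from `clusters/k3s-dgx/kserve/`:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver-gpu   # hypothetical name
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:v0.12.0
      args:
        # {{.Name}} is templated by KServe with the InferenceService name
        - --model_name={{.Name}}
      resources:
        limits:
          nvidia.com/gpu: "1"
```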

## Usage

### Deploy a New Model

1. Create a new InferenceService manifest in `clusters/k3s-dgx/apps/`:

   ```yaml
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   ```

2. Commit and push the changes.
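
If the `apps/` directory is assembled through a `kustomization.yaml` (an assumption; your repo may list resources differently), remember to register the new manifest there as well:

```yaml
# clusters/k3s-dgx/apps/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
  - your-model.yaml   # the new InferenceService from step 1
```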

### Test the Model

```bash
# Get the service URL
kubectl get inferenceservice huihui-granite -n kserve

# Test inference
curl -X POST http://your-service-url/v1/models/huihui-granite:predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "text", "shape": [1], "datatype": "BYTES", "data": ["Hello world"]}]}'
```

## Monitoring

Check FluxCD status:

```bash
flux get all --all-namespaces
```

Check GPU status:

```bash
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```

Check KServe services:

```bash
kubectl get inferenceservices -n kserve
```

## Troubleshooting

### GPU Not Available

```bash
kubectl describe node | grep -A 5 nvidia.com/gpu
```

### KServe Pods Not Starting

```bash
kubectl logs -n kserve deployment/kserve-controller-manager
kubectl get pods -n kserve
```

### FluxCD Sync Issues

```bash
flux reconcile kustomization flux-system --with-source
flux logs
```

## Customization

### GPU Resources

Edit `clusters/k3s-dgx/apps/huihui-granite-inference.yaml` to adjust GPU allocation.

### Storage

Modify `clusters/k3s-dgx/kserve/model-storage-pvc.yaml` for different storage requirements.
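
For reference, a minimal PVC of the shape this file likely contains. The 100Gi size is an assumption; `local-path` is the default StorageClass that ships with k3s:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage   # hypothetical name; match model-storage-pvc.yaml
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # k3s default provisioner
  resources:
    requests:
      storage: 100Gi   # assumed size; adjust for your models
```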

### Networking

Update `clusters/k3s-dgx/kserve/istio-gateway.yaml` for custom ingress configuration.
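
As a starting point, a minimal Istio `Gateway` with a custom host might look like this sketch; the selector matches Istio's stock ingress gateway, and the host is a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: kserve-gateway   # hypothetical name; match istio-gateway.yaml
  namespace: kserve
spec:
  selector:
    istio: ingressgateway   # selects the default istio-ingressgateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.models.example.com"   # placeholder host
```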