# Edge GitOps - Complete Setup Guide
## Quick Start

1. Configure the setup:

   ```bash
   ./configure.sh
   ```

2. Bootstrap FluxCD:

   ```bash
   make bootstrap
   ```

3. Monitor the deployment:

   ```bash
   make status
   ```
## Directory Structure

```text
edge-gitops/
├── bootstrap.sh          # FluxCD bootstrap script
├── configure.sh          # Configuration wizard
├── Makefile              # Convenient commands
├── README.md             # Main documentation
├── .gitignore            # Git ignore rules
├── .env                  # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/                        # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/                # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/                # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/                     # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/                       # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                 # Reusable application manifests
└── infrastructure/       # Base infrastructure components
```
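The cluster entry point ties these directories together. A minimal sketch of what `clusters/k3s-dgx/kustomization.yaml` might contain, assuming each subdirectory carries its own kustomization (the exact resource list is illustrative; `flux-system` is typically reconciled by its own Flux Kustomization rather than listed here):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gpu-support
  - kserve
  - apps
```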
## Component Details

### FluxCD

- Version: Latest stable
- Components: source-controller, kustomize-controller, helm-controller, notification-controller
- Sync Interval: 1 minute for the GitRepository, 10 minutes for the Kustomization
- Repository: Gitea (configurable)
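Those sync intervals usually live in `gotk-sync.yaml`. A sketch of what that file typically contains, assuming a Gitea remote (the URL and branch are placeholders, not the actual repository address):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitea.example.com/ops/edge-gitops.git  # placeholder Gitea URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```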
### NVIDIA GPU Operator

- Version: v23.9.1
- Driver: 535.129.03
- Components:
  - NVIDIA Driver
  - Device Plugin
  - DCGM Exporter
  - MIG Manager
  - Node Feature Discovery
### KServe

- Version: v0.12.0
- Components:
  - KServe Controller
  - Custom Resource Definitions
  - GPU Serving Runtime
  - Istio Integration
  - Model Storage (50Gi PVC)
### Example Model

- Name: Huihui-granite-4.1-30b-abliterated
- Source: Hugging Face
- Resources:
  - CPU: 2-4 cores
  - Memory: 8-16Gi
  - GPU: 1 NVIDIA GPU
## Common Tasks

### Add a New Model

1. Create a new InferenceService:

   ```bash
   cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
   apiVersion: serving.kserve.io/v1beta1
   kind: InferenceService
   metadata:
     name: your-model
     namespace: kserve
   spec:
     predictor:
       model:
         modelFormat:
           name: huggingface
         storageUri: "hf://your-org/your-model"
         resources:
           limits:
             nvidia.com/gpu: "1"
   EOF
   ```

2. Update the kustomization:

   ```bash
   echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
   ```

3. Commit and push:

   ```bash
   git add clusters/k3s-dgx/apps/
   git commit -m "Add new model deployment"
   git push
   ```
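The kustomization edit in step 2 can be rehearsed in a scratch directory before touching the real repo. A sketch, with an illustrative kustomization matching the `apps/` directory above:

```shell
# Work in a scratch directory so the real repo is untouched.
APPS_DIR=$(mktemp -d)

# Illustrative kustomization, mirroring clusters/k3s-dgx/apps/.
cat > "$APPS_DIR/kustomization.yaml" << 'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - huihui-granite-inference.yaml
EOF

# Append the new model manifest, matching the existing list indentation.
echo "  - your-model.yaml" >> "$APPS_DIR/kustomization.yaml"

cat "$APPS_DIR/kustomization.yaml"
```

The appended entry must use the same indentation as the existing `resources` items, or the YAML list will not parse.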
### Update GPU Resources

Edit the resource limits in your InferenceService:

```yaml
resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"
```

Note that `nvidia.com/gpu` requests and limits must match: Kubernetes does not allow overcommitting extended resources such as GPUs.
### Monitor Model Performance

```bash
# Get the model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi
```
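Raw `nvidia-smi` output is easier to track over time in CSV form (`--query-gpu=... --format=csv`). A small sketch that summarizes one line per GPU; it uses a canned sample so it runs without a GPU, and in practice you would pipe from the `kubectl exec` command above instead of the `echo`:

```shell
# Canned sample of: nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
SAMPLE='index, utilization.gpu [%], memory.used [MiB]
0, 87 %, 40960 MiB
1, 12 %, 2048 MiB'

# Skip the header row and print a compact per-GPU summary.
echo "$SAMPLE" | awk -F', ' 'NR > 1 { printf "GPU %s: util %s, mem %s\n", $1, $2, $3 }'
```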
## Troubleshooting

### FluxCD Not Syncing

```bash
# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source
```
### GPU Not Available

```bash
# Check GPU allocatable on nodes
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# Check GPU Operator pods
kubectl get pods -n gpu-operator

# View GPU Operator logs
kubectl logs -n gpu-operator deployment/gpu-operator
```
### KServe Issues

```bash
# Check KServe pods
kubectl get pods -n kserve

# Check the KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe the InferenceService
kubectl describe inferenceservice your-model -n kserve
```
## Security Considerations

- Secrets Management:
  - Never commit secrets to git
  - Use Kubernetes Secrets for sensitive data
  - Consider using Sealed Secrets or the External Secrets Operator
- Network Policies:
  - Review and restrict network access
  - Use Istio for service mesh security
- RBAC:
  - Review FluxCD service account permissions
  - Apply the principle of least privilege
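For example, Hugging Face credentials can live in a Kubernetes Secret created out-of-band instead of being committed. A sketch (the secret name, key, and namespace are assumptions for illustration, not names this repository defines):

```yaml
# Apply from a local, uncommitted file (or manage via Sealed Secrets):
#   kubectl apply -f hf-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-credentials
  namespace: kserve
type: Opaque
stringData:
  HF_TOKEN: <paste-token-here>   # never commit a real token value
```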
## Performance Optimization

### GPU Optimization

- Use appropriate GPU resource requests/limits
- Monitor GPU utilization with the DCGM Exporter
- Consider MIG (Multi-Instance GPU) for better isolation

### Storage Optimization

- Use fast storage for the model cache
- Consider the ReadWriteMany access mode for multi-pod access
- Implement model caching strategies

### Network Optimization

- Use Istio for efficient load balancing
- Configure appropriate timeouts for large models
- Consider gRPC for internal communication
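If you manage an Istio route directly (outside what KServe generates for you), a request timeout can be raised like this. The hostname and destination service are hypothetical placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: your-model-timeout
  namespace: kserve
spec:
  hosts:
    - your-model.example.com               # hypothetical external host
  http:
    - timeout: 300s                        # large models can take minutes per request
      route:
        - destination:
            host: your-model-predictor.kserve.svc.cluster.local  # hypothetical service name
```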
## Scaling

### Horizontal Scaling

Set replica bounds on the InferenceService predictor (KServe uses `minReplicas`/`maxReplicas` rather than a plain `replicas` field):

```yaml
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 3
```

### Vertical Scaling

Update the resource limits:

```yaml
resources:
  limits:
    nvidia.com/gpu: "2"
```
## Monitoring

### Metrics Collection

- DCGM Exporter for GPU metrics
- Prometheus for cluster metrics
- KServe metrics for inference performance

### Logging

- Structured logging for all components
- Centralized logging with Loki or ELK
- Log retention policies

### Alerting

- GPU utilization alerts
- Model health alerts
- Resource exhaustion alerts
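A GPU utilization alert can be expressed as a Prometheus rule over the DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL` metric. A sketch assuming the Prometheus Operator is installed; the rule name, namespace, and threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUHighUtilization
          expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} above 90% utilization for 10 minutes"
```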
## Backup and Recovery

### GitOps Backup

- All configuration is in git
- Easy rollback with `git revert`
- Branch-based testing

### Data Backup

- Model storage backup
- Configuration backup
- Disaster recovery plan
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## Support

For issues and questions:

- Check the troubleshooting section above
- Review the component documentation
- Check the FluxCD, KServe, and GPU Operator docs
- Open an issue in the repository