
Edge GitOps - Complete Setup Guide

Quick Start

  1. Configure the setup:

    ./configure.sh
    
  2. Bootstrap FluxCD:

    make bootstrap
    
  3. Monitor the deployment:

    make status
    

Directory Structure

edge-gitops/
├── bootstrap.sh              # FluxCD bootstrap script
├── configure.sh             # Configuration wizard
├── Makefile                 # Convenient commands
├── README.md                # Main documentation
├── .gitignore              # Git ignore rules
├── .env                    # Environment variables (not committed)
│
├── clusters/
│   └── k3s-dgx/            # Cluster-specific configuration
│       ├── kustomization.yaml
│       │
│       ├── flux-system/    # FluxCD installation
│       │   ├── kustomization.yaml
│       │   ├── gotk-components.yaml
│       │   └── gotk-sync.yaml
│       │
│       ├── gpu-support/    # NVIDIA GPU Operator
│       │   ├── kustomization.yaml
│       │   ├── gpu-operator-namespace.yaml
│       │   ├── gpu-operator-helmrelease.yaml
│       │   └── gpu-node-labels.yaml
│       │
│       ├── kserve/         # KServe installation
│       │   ├── kustomization.yaml
│       │   ├── kserve-namespace.yaml
│       │   ├── kserve-crds.yaml
│       │   ├── kserve-controller.yaml
│       │   ├── istio-gateway.yaml
│       │   ├── gpu-serving-runtime.yaml
│       │   ├── model-storage-pvc.yaml
│       │   └── storage-config.yaml
│       │
│       └── apps/           # ML model deployments
│           ├── kustomization.yaml
│           └── huihui-granite-inference.yaml
│
├── apps/                   # Reusable application manifests
└── infrastructure/        # Base infrastructure components

Component Details

FluxCD

  • Version: Latest stable
  • Components: source-controller, kustomize-controller, helm-controller, notification-controller
  • Sync Interval: 1 minute for GitRepository, 10 minutes for Kustomization
  • Repository: Gitea (configurable)
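
The sync intervals above map to the `interval` fields on the Flux sources. A minimal sketch, assuming Flux v2 API versions; the resource names and repository URL are illustrative, not taken from this repo:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m          # how often Flux polls Gitea for new commits
  url: https://gitea.example.com/org/edge-gitops.git   # illustrative URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m         # how often Flux re-applies the manifests
  path: ./clusters/k3s-dgx
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```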

NVIDIA GPU Operator

  • Version: v23.9.1
  • Driver: 535.129.03
  • Components:
    • NVIDIA Driver
    • Device Plugin
    • DCGM Exporter
    • MIG Manager
    • Node Feature Discovery
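
The pinned versions above would typically be set in `gpu-operator-helmrelease.yaml`. A sketch of the relevant fields, assuming a Flux HelmRelease; the HelmRepository name and values layout are illustrative:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu-operator
  namespace: gpu-operator
spec:
  interval: 10m
  chart:
    spec:
      chart: gpu-operator
      version: v23.9.1           # pinned operator version (see above)
      sourceRef:
        kind: HelmRepository
        name: nvidia             # illustrative repository name
  values:
    driver:
      version: "535.129.03"      # pinned driver version (see above)
```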

KServe

  • Version: v0.12.0
  • Components:
    • KServe Controller
    • Custom Resource Definitions
    • GPU Serving Runtime
    • Istio Integration
    • Model Storage (50Gi PVC)
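
The 50Gi model storage corresponds to `model-storage-pvc.yaml`. A minimal sketch; the claim name and access mode are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage        # illustrative; see model-storage-pvc.yaml
  namespace: kserve
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi          # size stated above
```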

Example Model

  • Name: Huihui-granite-4.1-30b-abliterated
  • Source: Hugging Face
  • Resources:
    • CPU: 2-4 cores
    • Memory: 8-16Gi
    • GPU: 1 NVIDIA GPU

Common Tasks

Add a New Model

  1. Create a new InferenceService:

    cat > clusters/k3s-dgx/apps/your-model.yaml << EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: your-model
      namespace: kserve
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          storageUri: "hf://your-org/your-model"
          resources:
            limits:
              nvidia.com/gpu: "1"
    EOF
    
  2. Update the kustomization:

    echo "  - your-model.yaml" >> clusters/k3s-dgx/apps/kustomization.yaml
    
  3. Commit and push:

    git add clusters/k3s-dgx/apps/
    git commit -m "Add new model deployment"
    git push
    

Update GPU Resources

Edit the resource limits in your InferenceService:

resources:
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "2"
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "2"

Monitor Model Performance

# Get model endpoint
kubectl get inferenceservice your-model -n kserve

# View logs
kubectl logs -n kserve -l serving.kserve.io/inferenceservice=your-model -f

# Check GPU usage
kubectl exec -n kserve <pod-name> -- nvidia-smi

Troubleshooting

FluxCD Not Syncing

# Check FluxCD status
flux check

# View logs
flux logs

# Force sync
flux reconcile kustomization flux-system --with-source

GPU Not Available

# Check GPU nodes
kubectl get nodes -o jsonpath="{.items[*].status.allocatable['nvidia\.com/gpu']}"

# Check GPU operator
kubectl get pods -n gpu-operator

# View GPU operator logs
kubectl logs -n gpu-operator deployment/gpu-operator

KServe Issues

# Check KServe pods
kubectl get pods -n kserve

# Check KServe controller
kubectl logs -n kserve deployment/kserve-controller-manager

# Describe InferenceService
kubectl describe inferenceservice your-model -n kserve

Security Considerations

  1. Secrets Management:

    • Never commit secrets to git
    • Use Kubernetes secrets for sensitive data
    • Consider using Sealed Secrets or External Secrets Operator
  2. Network Policies:

    • Review and restrict network access
    • Use Istio for service mesh security
  3. RBAC:

    • Review FluxCD service account permissions
    • Implement principle of least privilege
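
As an example of the first point, a Hugging Face token for pulling gated models belongs in a Kubernetes Secret applied out-of-band, never in git. A sketch with an illustrative name and key:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token             # illustrative name
  namespace: kserve
type: Opaque
stringData:
  HF_TOKEN: "<paste token here; never commit this file>"
```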

Performance Optimization

GPU Optimization

  • Use appropriate GPU resource requests/limits
  • Monitor GPU utilization with DCGM Exporter
  • Consider MIG (Multi-Instance GPU) for better isolation

Storage Optimization

  • Use fast storage for model cache
  • Consider using ReadWriteMany for multi-pod access
  • Implement model caching strategies

Network Optimization

  • Use Istio for efficient load balancing
  • Configure appropriate timeouts for large models
  • Consider using gRPC for internal communication
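
For the timeout point, KServe's component spec exposes a per-component `timeout` in seconds; the value below is an illustrative assumption, not a recommendation from this repo:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: your-model
  namespace: kserve
spec:
  predictor:
    timeout: 600            # seconds to wait per prediction; raise for large models
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://your-org/your-model"
```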

Scaling

Horizontal Scaling

# Add to the InferenceService predictor spec
# (KServe uses minReplicas/maxReplicas, not a plain replicas field)
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3

Vertical Scaling

# Update resource limits
resources:
  limits:
    nvidia.com/gpu: "2"

Monitoring

Metrics Collection

  • DCGM Exporter for GPU metrics
  • Prometheus for cluster metrics
  • KServe metrics for inference performance

Logging

  • Structured logging for all components
  • Centralized logging with Loki/ELK
  • Log retention policies

Alerting

  • GPU utilization alerts
  • Model health alerts
  • Resource exhaustion alerts
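
With DCGM Exporter scraped by Prometheus, a GPU utilization alert can be sketched as a PrometheusRule. The threshold, rule name, and namespace are illustrative assumptions; `DCGM_FI_DEV_GPU_UTIL` is a standard DCGM Exporter metric:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts           # illustrative name
  namespace: gpu-operator
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUHighUtilization
          expr: avg(DCGM_FI_DEV_GPU_UTIL) by (gpu) > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} above 90% utilization for 15m"
```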

Backup and Recovery

GitOps Backup

  • All configuration is in git
  • Easy rollback with git revert
  • Branch-based testing
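
The rollback story can be seen in miniature in a throwaway repo (paths and file contents below are purely illustrative, not from this repo):

```shell
# Demonstrate rollback-by-revert; in the real cluster, FluxCD would
# reconcile the reverted manifests on its next sync interval.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "replicas: 1" > values.yaml
git add values.yaml
git commit -qm "Initial deployment"
echo "replicas: 3" > values.yaml
git commit -qam "Scale up (bad change)"
git revert --no-edit HEAD >/dev/null
cat values.yaml    # back to: replicas: 1
```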

Data Backup

  • Model storage backup
  • Configuration backup
  • Disaster recovery plan

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

Support

For issues and questions:

  • Check the troubleshooting section
  • Review component documentation
  • Check FluxCD, KServe, and GPU Operator docs
  • Open an issue in the repository