
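projecthami/hami

The Hami Scheduler & NVIDIA Device Plugin image bundles the Hami scheduler with the NVIDIA device plugin. It is designed for Kubernetes clusters and provides efficient management and scheduling of GPU resources.

Main use: it gives GPU-accelerated workloads (such as AI training, machine learning inference, and high-performance computing) a stable resource supply and efficient scheduling in Kubernetes cluster environments.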
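The device plugin exposes GPUs to Kubernetes under the `nvidia.com/gpu` resource name, and GPU nodes need `nvidia-container-runtime`.

```bash
# Replace [registry-address] with the actual registry address.
docker pull [registry-address]/hami-scheduler-nvidia-device-plugin:latest
```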
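Running the containers directly with `docker run` is suitable for quick functional checks. For production, deploy through Kubernetes instead.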
```bash
# Start the NVIDIA Device Plugin (needs the host devices and runtime libraries mounted).
# Add one --device flag per GPU on the host (nvidia1, nvidia2, ...).
# NVIDIA_VISIBLE_DEVICES=all exposes every GPU; use a list such as "0,1" to restrict it.
# NVIDIA_DRIVER_CAPABILITIES=compute,utility enables the compute and utility capabilities.
docker run -d \
  --name nvidia-device-plugin \
  --restart=always \
  --cap-add=SYS_ADMIN \
  --device=/dev/nvidiactl:/dev/nvidiactl \
  --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
  --device=/dev/nvidia0:/dev/nvidia0 \
  -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
  -v /usr/local/nvidia/lib64:/usr/local/nvidia/lib64 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  [registry-address]/hami-scheduler-nvidia-device-plugin:latest nvidia-device-plugin

# Start the Hami Scheduler (needs access to the Kubernetes API server).
# The kubeconfig is mounted from the host.
# HAMI_LOG_LEVEL: debug/info/warn/error.
# HAMI_SCHEDULING_POLICY: gpu-aware (default) or balanced.
docker run -d \
  --name hami-scheduler \
  --restart=always \
  -v /etc/kubernetes/admin.conf:/etc/kubernetes/admin.conf \
  -e KUBECONFIG=/etc/kubernetes/admin.conf \
  -e HAMI_LOG_LEVEL=info \
  -e HAMI_SCHEDULING_POLICY=gpu-aware \
  [registry-address]/hami-scheduler-nvidia-device-plugin:latest hami-scheduler
```
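
The device plugin command maps a single GPU (`/dev/nvidia0`). On hosts with several GPUs, the per-device flags can be generated instead of typed by hand. The following is a minimal sketch: the helper name `gpu_device_flags` is hypothetical, and it assumes the device nodes are numbered consecutively starting at `/dev/nvidia0`.

```shell
# Hypothetical helper: print one --device flag per GPU index 0..N-1.
# Assumes device nodes are numbered consecutively: /dev/nvidia0, /dev/nvidia1, ...
gpu_device_flags() {
  gdf_count="$1"
  gdf_i=0
  while [ "$gdf_i" -lt "$gdf_count" ]; do
    # "--" stops printf from treating the format string as an option.
    printf -- '--device=/dev/nvidia%d:/dev/nvidia%d\n' "$gdf_i" "$gdf_i"
    gdf_i=$((gdf_i + 1))
  done
}
```

For a two-GPU host, `$(gpu_device_flags 2)` placed unquoted in the `docker run` command expands to the flags for `nvidia0` and `nvidia1`.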
In production, deploy the device plugin as a DaemonSet and the scheduler as a Deployment:

1. NVIDIA Device Plugin DaemonSet configuration (nvidia-device-plugin-daemonset.yaml)
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin
          # Replace [registry-address] with the actual registry address.
          image: "[registry-address]/hami-scheduler-nvidia-device-plugin:latest"
          command: ["nvidia-device-plugin"]
          resources:
            limits:
              cpu: 50m
              memory: 100Mi
            requests:
              cpu: 50m
              memory: 100Mi
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: nvidia-lib
              mountPath: /usr/local/nvidia/lib64
            - name: nvidia-bin
              mountPath: /usr/local/nvidia/bin
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: nvidia-lib
          hostPath:
            path: /usr/local/nvidia/lib64
        - name: nvidia-bin
          hostPath:
            path: /usr/local/nvidia/bin
```
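
Once the DaemonSet is running, each GPU node should advertise the `nvidia.com/gpu` resource. One way to check this is sketched below. The function name and the `KUBECTL` override (useful for printing the command instead of running it) are hypothetical conveniences, not part of the image.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: list every node with its allocatable nvidia.com/gpu count.
# The dot inside the resource name is escaped so it is not read as a path separator.
list_gpu_capacity() {
  $KUBECTL get nodes \
    '-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Nodes without a working device plugin typically show `<none>` in the GPU column.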
2. Hami Scheduler Deployment configuration (hami-scheduler-deployment.yaml)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      name: hami-scheduler
  template:
    metadata:
      labels:
        name: hami-scheduler
    spec:
      # A ServiceAccount with scheduling permissions must be created beforehand.
      serviceAccountName: hami-scheduler
      containers:
        - name: hami-scheduler
          # Replace [registry-address] with the actual registry address.
          image: "[registry-address]/hami-scheduler-nvidia-device-plugin:latest"
          command: ["hami-scheduler"]
          resources:
            limits:
              cpu: 100m
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: HAMI_LOG_LEVEL
              value: "info"
            - name: HAMI_SCHEDULING_POLICY
              value: "gpu-aware"
            # GPU memory utilization threshold (%); exceeding it triggers rescheduling.
            - name: HAMI_GPU_MEM_THRESHOLD
              value: "80"
```
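
The Deployment references a `hami-scheduler` ServiceAccount that has to exist before the pod can start. One possible way to create it is sketched below. Binding the built-in `system:kube-scheduler` ClusterRole is an assumption made for illustration; the exact permissions Hami Scheduler requires are not specified in this document and may differ. The function name and the `KUBECTL` override are hypothetical.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to print the commands
# instead of running them against a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: create the ServiceAccount named in the Deployment
# and bind it to a ClusterRole.
create_scheduler_rbac() {
  # Name and namespace match serviceAccountName in the Deployment manifest.
  $KUBECTL create serviceaccount hami-scheduler -n kube-system || return 1
  # Assumption: the built-in system:kube-scheduler role is sufficient.
  $KUBECTL create clusterrolebinding hami-scheduler \
    --clusterrole=system:kube-scheduler \
    --serviceaccount=kube-system:hami-scheduler
}
```

Running `KUBECTL=echo` before calling `create_scheduler_rbac` shows the two commands without touching the cluster, which is a cheap way to review them first.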
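NVIDIA Device Plugin environment variables:

| Parameter | Description | Default |
|---|---|---|
| `NVIDIA_VISIBLE_DEVICES` | GPU devices to expose. Accepted values: `all` (all devices), a list of device indices (e.g. `"0,1"`), or `void` (no devices) | `all` |
| `NVIDIA_DRIVER_CAPABILITIES` | Driver capabilities to enable. Accepted values: `compute`, `utility`, `video` | `compute,utility` |
| `NVIDIA_MIG_MONITOR_DEVICES` | MIG device monitoring path (used only in MIG mode) | `/dev/nvidia-mig-monitor` |

Hami Scheduler environment variables:

| Parameter | Description | Default |
|---|---|---|
| `HAMI_LOG_LEVEL` | Log level: `debug`, `info`, `warn`, or `error` | `info` |
| `HAMI_SCHEDULING_POLICY` | Scheduling policy: `gpu-aware` (GPU first), `balanced` (balanced resource usage), or `priority` (priority-based scheduling) | `gpu-aware` |
| `HAMI_GPU_MEM_THRESHOLD` | GPU memory utilization threshold (%) that triggers rescheduling; range 0-100 | `80` |
| `HAMI_NODE_AFFINITY_WEIGHT` | Node affinity weight (affects scheduling priority); a higher value means higher priority | `50` |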
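
A malformed value is easy to miss until a pod is already running, so the documented value forms can be checked before deployment. The sketch below validates only the forms listed in the tables above. The function names are hypothetical and are not part of either component.

```shell
# Hypothetical pre-deployment checks for values documented in the tables above.

# Accepts "all", "void", or a comma-separated list of GPU indices such as "0,1".
valid_visible_devices() {
  case "$1" in
    all|void) return 0 ;;
  esac
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}

# Accepts an integer from 0 to 100 (HAMI_GPU_MEM_THRESHOLD is a percentage).
valid_mem_threshold() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}$' || return 1
  [ "$1" -le 100 ]
}

# Accepts the log levels listed for HAMI_LOG_LEVEL.
valid_log_level() {
  case "$1" in
    debug|info|warn|error) return 0 ;;
    *) return 1 ;;
  esac
}
```

Each function returns a zero exit status for an accepted value, so it can guard a deployment step, for example `valid_mem_threshold "$HAMI_GPU_MEM_THRESHOLD" || exit 1`.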
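- Runtime: GPU nodes must use `nvidia-container-runtime` (replacing the default runc).
- Permissions: the device plugin runs in privileged mode (`privileged: true`) so that it can access the host GPU devices. Hami Scheduler needs scheduling-related Kubernetes API permissions (for example, read and write access to nodes and pods).
- MIG: in MIG mode, set the monitoring path through `NVIDIA_MIG_MONITOR_DEVICES`.
- Logs: run `kubectl logs <hami-scheduler-pod> -n kube-system` to view the Hami scheduler logs, and `kubectl logs <nvidia-device-plugin-pod> -n kube-system` to view the device plugin logs.

| Image version | Supported Kubernetes version | Supported NVIDIA driver version | Supported CUDA version |
|---|---|---|---|
| latest | 1.16+ | ≥ 418.81.07 | ≥ 10.1 |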
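
The `kubectl logs` commands in the notes need a pod name, which has to be looked up first. As a convenience, pods can also be selected by label. This sketch assumes the `name=hami-scheduler` and `name=nvidia-device-plugin` labels from the manifests in this document. The function name and the `KUBECTL` override are hypothetical.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: print recent logs for a component, selecting pods by the
# "name" label used in the manifests rather than by pod name.
#   $1: label value (hami-scheduler or nvidia-device-plugin)
#   $2: number of trailing lines per pod (defaults to 100)
component_logs() {
  $KUBECTL logs -n kube-system -l "name=$1" --tail="${2:-100}"
}
```

For example, `component_logs hami-scheduler 50` prints the last 50 log lines of each matching scheduler pod.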