openeuler/kserve-controller

The KServe controller Docker image built on openEuler.
Maintained by: openEuler CloudNative SIG.
Where to get help: openEuler CloudNative SIG, openEuler.
Current KServe Docker images are built on openEuler. This repository is free to use and exempt from per-user rate limits.
KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning (ML) models. It aims to solve production model serving use cases by providing high-level abstraction interfaces for TensorFlow, XGBoost, Scikit-learn, PyTorch, and Hugging Face Transformer/LLM models using standardized data plane protocols.
It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability. KServe is being used across various organizations.
For more details, visit the KServe website.
The tag of each KServe Docker image encodes the complete software stack version, as follows:
| Tag | Currently | Architectures |
|---|---|---|
| 0.15.2-oe2403lts | KServe controller 0.15.2 on openEuler 24.03-LTS | amd64 |
KServe Quickstart Environments are for experimentation use only. For production installation, see our Administrator's Guide.
Before you can get started with a KServe Quickstart deployment, you must install kind, the Kubernetes CLI (kubectl), and Helm.
You can use kind (Kubernetes in Docker) to run a local Kubernetes cluster with Docker container nodes.
The Kubernetes CLI (kubectl), allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.
The Helm package manager for Kubernetes helps you define, install and upgrade software built for Kubernetes.
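For a Helm-based installation, the KServe project publishes OCI charts; a minimal sketch follows (the chart registry path and the `--version` pin are assumptions based on the image tag above — check the Administrator's Guide for the authoritative steps):

```shell
# Install the KServe CRDs first, then the controller, from the project's
# OCI Helm charts (registry path and version are assumptions; verify
# against the KServe documentation before use).
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.15.2
helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.15.2
```

Installing the CRD chart before the controller chart ensures the custom resource types exist when the controller starts.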
After installing kind, create a kind cluster with:

```shell
kind create cluster
```
Then run:
```shell
kubectl config get-contexts
```
This lists the contexts you have; one of them should be kind-kind. Then run:
```shell
kubectl config use-context kind-kind
```
to use this context.
You can then get started with a local deployment of KServe by running the KServe quick installation script on kind:
```shell
curl -s "[***]" | bash
```
In this example, we demonstrate how to deploy a Llama3 model from Hugging Face for a text generation task by deploying an InferenceService with the Hugging Face serving runtime.
The KServe Hugging Face runtime uses vLLM by default to serve LLMs, giving faster time-to-first-token (TTFT) and higher token-generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels. If a model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the HF_TOKEN environment variable.
Create a secret with the Hugging Face token.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  HF_TOKEN: <token>
```
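Assuming the manifest above is saved to a file (the filename `hf-secret.yaml` here is just an example), it can be applied with:

```shell
# Create the Secret in the current namespace from the saved manifest
kubectl apply -f hf-secret.yaml
```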
Then create the inference service.
```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
              optional: false
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
```
Check InferenceService status.
```shell
kubectl get inferenceservices huggingface-llama3
```
Expected output:
```shell
NAME                 URL     READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
huggingface-llama3   [***]   True           100                              huggingface-llama3-predictor-default-47q2g   7d23h
```
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
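With the default Istio ingress gateway used by KServe's serverless mode, INGRESS_HOST and INGRESS_PORT can typically be determined as follows (the `istio-system/istio-ingressgateway` service name reflects the standard KServe setup and may differ in your cluster):

```shell
# Read the external IP of the Istio ingress gateway's LoadBalancer
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Read the port named "http2" exposed by the same service
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```

On a local kind cluster without a LoadBalancer, port-forwarding the gateway service is a common alternative.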
```shell
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```
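To make the hostname extraction above concrete, here is a self-contained illustration of the same `cut` invocation (the URL is a made-up example; a real `status.url` comes from your InferenceService):

```shell
# Hypothetical status.url value; a real cluster returns its own domain
url="http://huggingface-llama3.default.example.com"
# Splitting on "/", field 1 is "http:", field 2 is empty, field 3 is the host
host=$(echo "$url" | cut -d "/" -f 3)
echo "$host"   # → huggingface-llama3.default.example.com
```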
KServe Hugging Face vLLM runtime supports the OpenAI /v1/completions and /v1/chat/completions endpoints for inference.
Sample OpenAI Completions request:
```shell
curl -v [***]{INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  -d '{"model": "llama3", "prompt": "Write a poem about colors", "stream":false, "max_tokens": 30}'
```
Expected output:
```shell
{
  "id": "cmpl-625a9240f25e463487a9b6c53cbed080",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": " and how they make you feel\nColors, oh colors, so vibrant and bright\nA world of emotions, a kaleidoscope in sight\nRed"
    }
  ],
  "created": ***,
  "model": "llama3",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 6,
    "total_tokens": 36
  }
}
```
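The runtime also serves the /v1/chat/completions endpoint mentioned above. A sample chat request might look like the following sketch (the request body follows the OpenAI chat API; the `http://` scheme is an assumption for a plain-HTTP ingress):

```shell
# Chat-style request to the same InferenceService; the messages array
# replaces the raw prompt used by the completions endpoint.
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Write a poem about colors"}], "stream": false, "max_tokens": 30}'
```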