
The opea/llm-textgen microservice is designed for large language model (LLM) inference. It accepts an input consisting of a query string and the associated reranked documents, constructs a prompt from the query and documents, runs inference with a large language model, and returns the inference result as output.
As a prerequisite, users must already be running an LLM text-generation service (such as TGI or vLLM) and must set that service's endpoint as an environment variable. The microservice uses this endpoint to create an LLM object and communicates with the LLM service to perform language model operations.
Overall, this microservice offers a streamlined way to integrate large language model inference into applications: users only need to start a TGI/vLLM service and configure the necessary environment variables to seamlessly process queries and documents and generate intelligent, context-aware responses.
| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | vLLM-IPEX-XPU | OVMS | Optimum-Habana | SGLANG-CPU |
|---|---|---|---|---|---|---|---|
| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| [meta-llama/Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
| [meta-llama/Llama-2-70b-chat-hf] | ✓ | - | ✓ | - | - | ✓ | ✓ |
| [meta-llama/Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
| [meta-llama/Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | - | - | ✓ | ✓ |
| [Phi-3] | ✗ | Limit 4K | Limit 4K | ✓ | Limit 4K | ✓ | - |
| [Phi-4] | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | - |
| [deepseek-ai/DeepSeek-R1-Distill-Llama-8B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
| [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓ | - | ✓ | - | - | ✓ | - |
| [mistralai/Mistral-Large-Instruct-2411] | ✗ | - | ✓ | - | - | ✓ | - |
| [meta-llama/Llama-4-Scout-17B-16E-Instruct] | - | - | - | - | - | - | ✓ |
| [meta-llama/Llama-4-Maverick-17B-128E-Instruct] | - | - | - | - | - | - | ✓ |
| [Qwen3-8B/14B/32B] | - | - | - | ✓ | - | - | - |
Note: For details on the models supported by vLLM-IPEX-XPU, see supported-models.
| Model | Minimum number of Gaudi cards |
|---|---|
| Intel/neural-chat-7b-v3-3 | 1 |
| meta-llama/Llama-2-7b-chat-hf | 1 |
| meta-llama/Llama-2-70b-chat-hf | 2 |
| meta-llama/Meta-Llama-3-8B-Instruct | 1 |
| meta-llama/Meta-Llama-3-70B-Instruct | 2 |
| Phi-3 | - |
| Phi-4 | - |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 2 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 4 |
| mistralai/Mistral-Small-24B-Instruct-2501 | 1 |
| mistralai/Mistral-Large-Instruct-2411 | 4 |
Note: Detailed hardware requirements will be provided soon.
This microservice supports integration with the backend LLM services listed above. This document covers usage with TGI, vLLM, and Ollama; for other backends, refer to the corresponding documentation.
First, clone the GenAIComps repository and set its root directory:

```bash
git clone [***]
export OPEA_GENAICOMPS_ROOT=$(pwd)/GenAIComps
```
Set HF_TOKEN and LLM_MODEL as environment variables. For vLLM, first build the Docker image by following the vLLM build guide; TGI and Ollama do not require an image build.
Build the llm-textgen microservice Docker image:

```bash
cd ${OPEA_GENAICOMPS_ROOT}
docker build \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  -t opea/llm-textgen:latest \
  -f comps/llms/src/text-generation/Dockerfile .
```
The service can be started either via the CLI or with Docker Compose. The compose_text-generation.yaml file automatically starts both the endpoint container and the microservice container.
Before starting the service, set the following environment variables:
```bash
export LLM_ENDPOINT_PORT=8008
export TEXTGEN_PORT=9000
export host_ip=${host_ip}
export HF_TOKEN=${HF_TOKEN}
export LLM_ENDPOINT="[***]{host_ip}:${LLM_ENDPOINT_PORT}"
export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
```
Start the backend LLM service (TGI, vLLM, or Ollama), for example as sketched below.
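For illustration, a TGI backend can typically be brought up along these lines. This is a minimal sketch using the upstream TGI image; the image tag, CPU-only flags, and volume path are assumptions, not prescribed by this guide:

```bash
# Hypothetical TGI launch (CPU); adjust the image tag and flags for your hardware.
# TGI serves on port 80 inside the container, mapped here to ${LLM_ENDPOINT_PORT}.
docker run -d --name tgi-server \
  -p ${LLM_ENDPOINT_PORT}:80 \
  --shm-size 1g \
  -e HF_TOKEN=${HF_TOKEN} \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id ${LLM_MODEL_ID}
```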
Start the TextGen microservice:
```bash
export LLM_COMPONENT_NAME="OpeaTextGenService"
docker run \
  --name="llm-textgen-server" \
  -p $TEXTGEN_PORT:9000 \
  --ipc=host \
  -e http_proxy=$http_proxy \
  -e https_proxy=$https_proxy \
  -e no_proxy=${no_proxy} \
  -e LLM_ENDPOINT=$LLM_ENDPOINT \
  -e HF_TOKEN=$HF_TOKEN \
  -e LLM_MODEL_ID=$LLM_MODEL_ID \
  -e LLM_COMPONENT_NAME=$LLM_COMPONENT_NAME \
  opea/llm-textgen:latest
```
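To confirm the microservice started cleanly, the container logs can be tailed with the standard Docker CLI:

```bash
# Follow the llm-textgen container logs; Ctrl+C stops following
docker logs -f llm-textgen-server
```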
Set service_name to match the backend service:
```bash
export service_name="textgen-service-tgi"
# export service_name="textgen-service-tgi-gaudi"
# export service_name="textgen-service-vllm"
# export service_name="textgen-service-vllm-gaudi"
# export service_name="textgen-service-ollama"
cd ../../deployment/docker_compose/
docker compose -f compose_text-generation.yaml up ${service_name} -d
```
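To verify which containers Compose brought up, the standard `docker compose ps` subcommand works against the same file:

```bash
# List the containers started from compose_text-generation.yaml and their status
docker compose -f compose_text-generation.yaml ps
```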
Check the service health:

```bash
curl [***]{host_ip}:${TEXTGEN_PORT}/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```
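If only a pass/fail signal is needed, checking the HTTP status code is enough. This sketch assumes the service is reachable over plain HTTP on the host; a 200 indicates it is up:

```bash
# Print only the HTTP status code of the health endpoint
curl -s -o /dev/null -w "%{http_code}\n" \
  http://${host_ip}:${TEXTGEN_PORT}/v1/health_check
```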
Model parameters such as max_tokens and stream can be set as needed. The stream parameter determines the format of the data returned by the API: stream=false returns a text string, while stream=true returns a text stream.
```bash
# Streaming mode
curl [***]{host_ip}:${TEXTGEN_PORT}/v1/chat/completions \
  -X POST \
  -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": "What is Deep Learning?", "max_tokens": 17}' \
  -H 'Content-Type: application/json'

curl [***]{host_ip}:${TEXTGEN_PORT}/v1/chat/completions \
  -X POST \
  -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 17}' \
  -H 'Content-Type: application/json'

# Non-streaming mode
curl [***]{host_ip}:${TEXTGEN_PORT}/v1/chat/completions \
  -X POST \
  -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": "What is Deep Learning?", "max_tokens": 17, "stream": false}' \
  -H 'Content-Type: application/json'
```
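Since the endpoint follows the OpenAI chat-completions response shape, the generated text of a non-streaming reply can be extracted with `jq`. This is a sketch assuming `jq` is installed and the service is reachable over plain HTTP:

```bash
# Extract just the assistant's message content from a non-streaming response
curl -s http://${host_ip}:${TEXTGEN_PORT}/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 64, "stream": false}' \
  | jq -r '.choices[0].message.content'
```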