
# mitakad/vllm

**Note:** For vLLM v0.15.0, if you experience degraded performance, check the `vllm serve` logs: if they mention `...TRITON_MLA attention backend...`, that backend can degrade performance with larger inputs. To disable MLA, set `VLLM_MLA_DISABLE=1`, for example: `VLLM_MLA_DISABLE=1 vllm serve ...`
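As a minimal, illustrative sketch of the workaround (reusing the model and flags from the speculative-decoding example further down; none of the flags are required for the workaround itself):

```bash
# VLLM_MLA_DISABLE=1 is read from the environment at startup and
# turns off the MLA attention path for this server process.
VLLM_MLA_DISABLE=1 vllm serve RedHatAI/Qwen3-8B-quantized.w4a16 \
  --port 8000 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 2048
```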
Built on a Jetson AGX Orin (SM 87, i.e. compute capability 8.7).
vLLM docs: [***]
Distributed LLM inference with vLLM, as described in the multi-node case here: [***] (a rough two-node sketch follows below)
How to use: [***]
Tested with 2x Jetson AGX Orin dev kits.
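For orientation, here is a rough two-node sketch following vLLM's Ray-based multi-node setup. The IP address is a placeholder, and the parallelism flag is one plausible choice for two single-GPU nodes; defer to the linked docs for the exact procedure:

```bash
# On the head Jetson (placeholder IP 192.168.1.10): start a Ray head node.
ray start --head --port=6379

# On the second Jetson: join the Ray cluster.
ray start --address=192.168.1.10:6379

# On the head node: launch the server with the model split across both
# devices (pipeline parallelism places part of the layers on each node).
vllm serve RedHatAI/Qwen3-8B-quantized.w4a16 \
  --port 8000 \
  --pipeline-parallel-size 2
```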
Also try EAGLE-3 speculative decoding for faster inference (as explained here):
```bash
vllm serve RedHatAI/Qwen3-8B-quantized.w4a16 \
  --port 8000 \
  --gpu-memory-utilization 0.6 \
  --kv-cache-memory-bytes 5G \
  --max-model-len 2048 \
  --speculative-config '{
    "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
    "num_speculative_tokens": 3,
    "method": "eagle3"
  }'
```
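Once the server is up, it exposes vLLM's OpenAI-compatible API on the configured port. A quick smoke test (the prompt and token budget here are arbitrary):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Qwen3-8B-quantized.w4a16",
        "prompt": "The Jetson AGX Orin is",
        "max_tokens": 64
      }'
```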



