
mekayelanik/vllm-cpu
Easy, fast, and cheap LLM serving for everyone
Docker Images for CPU-Only Inference
Documentation | Docker Hub | GitHub | Supported Models
Your support encourages me to keep creating/supporting my open-source projects. If you found value in this project, you can buy me a coffee to keep me up all the sleepless nights.
Pre-built Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets.
Features: OpenAI-Compatible API, CPU Optimizations (AVX512, VNNI, BF16, AMX), Multi-Architecture, Health Checks, PUID/PGID.
Requirements: 4+ GB RAM (16+ recommended), 4+ CPU cores, Docker 20.10+, 2+ GB shm.
Choose the appropriate variant based on your CPU's instruction set support:
| Variant | Tag | Optimizations | Target CPUs | Architectures |
|---|---|---|---|---|
| noavx512 | noavx512-<version> | Baseline | All CPUs | x86_64, ARM64 |
| avx512 | avx512-<version> | AVX512 | Intel Skylake-X+ | x86_64 |
| avx512vnni | avx512vnni-<version> | AVX512 + VNNI | Intel Cascade Lake+ | x86_64 |
| avx512bf16 | avx512bf16-<version> | AVX512 + VNNI + BF16 | Intel Cooper Lake+ | x86_64 |
| amxbf16 | amxbf16-<version> | AVX512 + VNNI + BF16 + AMX | Intel Sapphire Rapids+ | x86_64 |
Each variant has two tag formats:
<variant>-<version> (e.g., avx512bf16-0.12.0)<variant>-latest (e.g., avx512bf16-latest)bash# Check available instruction sets lscpu | grep -E "avx512|vnni|amx" # Or use flags grep -o 'avx512[a-z_]*\|amx[a-z_]*' /proc/cpuinfo | sort -u
bash# Pull the image docker pull mekayelanik/vllm-cpu:noavx512-latest # Run with a small model docker run -d \ --name vllm-cpu \ --restart unless-stopped \ --cap-add SYS_NICE \ --security-opt seccomp=unconfined \ --shm-size 4g \ -p 8000:8000 \ -v vllm-data:/data \ -e VLLM_MODEL=Qwen/Qwen3-0.6B \ -e VLLM_SERVER_HOST=0.0.0.0 \ -e VLLM_SERVER_PORT=8000 \ mekayelanik/vllm-cpu:noavx512-latest \ --max-model-len 8192 # Check logs docker logs -f vllm-cpu # Test the API curl http://localhost:8000/v1/models
| Endpoint | Description |
|---|---|
/health | Health check endpoint |
/v1/models | List available models |
/v1/completions | Text completions API |
/v1/chat/completions | Chat completions API |
/v1/embeddings | Embeddings API |
This is the simplest and most portable configuration, suitable for most deployments.
yaml# docker-compose.yml - Standard Bridge Network # Save this file and run: docker compose up -d services: vllm-cpu: image: mekayelanik/vllm-cpu:noavx512-latest container_name: vllm-cpu hostname: vllm-cpu domainname: local restart: unless-stopped # Required capabilities for CPU optimization cap_add: - SYS_NICE security_opt: - seccomp=unconfined # Shared memory for model loading shm_size: 4g # Port mapping ports: - "8000:8000" # Persistent storage volumes: - vllm-data:/data - vllm-cache:/data/cache # Optional: Mount local HuggingFace cache # - ${HOME}/.cache/huggingface:/data/models # Environment configuration environment: # User/Group mapping (optional) - PUID=1000 - PGID=1000 - TZ=UTC # Model configuration - VLLM_MODEL=Qwen/Qwen3-0.6B # - HF_TOKEN=your_huggingface_token # For gated models # Server configuration - VLLM_SERVER_HOST=0.0.0.0 - VLLM_SERVER_PORT=8000 # - VLLM_API_KEY=your_api_key # Optional API key # Logging - VLLM_LOGGING_LEVEL=INFO - VLLM_CONFIGURE_LOGGING=1 # Memory optimization - MALLOC_TRIM_THRESHOLD_=*** # CPU-specific optimization (adjust based on your system) # KV cache size in GB - increase for more concurrent requests # Default is 4GB. Formula: (RAM - Model Size - 8GB) / 2 - VLLM_CPU_KVCACHE_SPACE=8 # vLLM arguments (passed to entrypoint) command: - "--max-model-len" - "8192" - "--dtype" - "auto" # Health check is built into the Docker image (30s interval, 120s start period) # Resource limits (optional) deploy: resources: limits: memory: 16G reservations: memory: 4G # Named volumes for persistence volumes: vllm-data: driver: local vllm-cache: driver: local
Deploy:
bash# Start the service docker compose up -d # View logs docker compose logs -f vllm-cpu # Check status docker compose ps # Stop the service docker compose down
MACVLAN gives the container its own IP address on your local network, making it appear as a separate physical device. This is useful for:
Prerequisites:
eth0, ens18, enp0s3)yaml# docker-compose-macvlan.yml - MACVLAN Network Configuration # Save this file and run: docker compose -f docker-compose-macvlan.yml up -d services: vllm-cpu: image: mekayelanik/vllm-cpu:avx512bf16-latest container_name: vllm-cpu hostname: vllm-cpu domainname: local restart: unless-stopped # Required capabilities cap_add: - SYS_NICE security_opt: - seccomp=unconfined # Shared memory shm_size: 4g # MAC address (generate unique one for your network) mac_address: "02:42:c0:a8:01:64" # MACVLAN network with static IP networks: vllm-macvlan: ipv4_address: 192.168.1.100 # Persistent storage volumes: - vllm-data:/data - vllm-cache:/data/cache # Mount local models directory - /path/to/models:/data/models:ro # Environment configuration environment: # User/Group mapping - PUID=1000 - PGID=1000 - TZ=UTC # Model configuration - VLLM_MODEL=Qwen/Qwen3-0.6B - HF_HOME=/data/models # - HF_TOKEN=your_huggingface_token # Server configuration (bind to all interfaces) - VLLM_SERVER_HOST=0.0.0.0 - VLLM_SERVER_PORT=8000 # Logging - VLLM_LOGGING_LEVEL=INFO - VLLM_CONFIGURE_LOGGING=1 # Performance tuning - MALLOC_TRIM_THRESHOLD_=*** - OMP_NUM_THREADS=8 - MKL_NUM_THREADS=8 # CPU-specific optimization # KV cache size in GB - adjust based on your RAM - VLLM_CPU_KVCACHE_SPACE=12 # vLLM arguments command: - "--max-model-len" - "***" - "--dtype" - "bfloat16" - "--cpu-offload-gb" - "0" # Health check is built into the Docker image (30s interval, 120s start period) # Resource limits deploy: resources: limits: memory: 32G reservations: memory: 8G # MACVLAN Network Definition networks: vllm-macvlan: driver: macvlan driver_opts: # Change to your network interface name parent: eth0 ipam: driver: default config: # Adjust to match your network configuration - subnet: 192.168.1.0/24 gateway: 192.168.1.1 # IP range for containers (optional, limits assignable IPs) ip_range: 192.168.1.100/30 # Named volumes volumes: vllm-data: driver: local vllm-cache: driver: local
Deploy with MACVLAN:
bash# Find your network interface: ip link show # Create and start docker compose -f docker-compose-macvlan.yml up -d # Access via dedicated IP curl [***]
Note: With MACVLAN, the host cannot directly reach the container. Create a macvlan shim interface on the host if needed.
This configuration is optimized for high-load, high-concurrency production environments with maximum throughput. Designed for enterprise deployments on high-end Intel Xeon or AMD EPYC servers.
Target Hardware:
Key Optimization: VLLM_CPU_KVCACHE_SPACE
The most important setting for CPU inference is KV cache size. The default 4GB is too small for production. This config sets it to 40GB for high concurrency.
Use Cases:
yaml# docker-compose-high-performance.yml - High Load Production Configuration # Optimized for maximum throughput and concurrency # Run: docker compose -f docker-compose-high-performance.yml up -d services: vllm-cpu-prod: image: mekayelanik/vllm-cpu:amxbf16-latest container_name: vllm-cpu-prod hostname: vllm-prod domainname: local restart: always # Use host network for maximum network performance network_mode: host # Extended capabilities for performance optimization cap_add: - SYS_NICE - IPC_LOCK security_opt: - seccomp=unconfined # Large shared memory for concurrent request handling shm_size: 16g # CPU pinning for NUMA optimization (adjust to your topology) # cpuset: "0-31" # Persistent storage with performance optimizations volumes: # Use fast NVMe storage for models - /mnt/nvme/vllm-data:/data - /mnt/nvme/vllm-cache:/data/cache - /mnt/nvme/models:/data/models:ro # tmpfs for temporary files - type: tmpfs target: /tmp tmpfs: size: 4G mode: 1777 # Production environment configuration environment: # Run as root for maximum performance (or set specific user) # - PUID=0 # - PGID=0 - TZ=UTC # Model configuration - use a powerful model - VLLM_MODEL=Qwen/Qwen3-8B - HF_HOME=/data/models - HF_HUB_OFFLINE=1 # - HF_TOKEN=your_token # Server configuration - VLLM_SERVER_HOST=0.0.0.0 - VLLM_SERVER_PORT=8000 - VLLM_API_KEY=${VLLM_API_KEY:-} # ============================================================ # CPU-SPECIFIC VLLM OPTIMIZATION (CRITICAL FOR PERFORMANCE) # ============================================================ # KV Cache size in GB - CRITICAL: Default is only 4GB! # Set this based on your available RAM and model size # Formula: (Total RAM - Model Size - 8GB headroom) / 2 # Example: 128GB RAM, 16GB model = (128-16-8)/2 = 52GB - VLLM_CPU_KVCACHE_SPACE=40 # OpenMP thread binding for CPU affinity # "auto" = automatic binding, or specify cores like "0-31" # Improves cache locality and reduces context switching - VLLM_CPU_OMP_THREADS_BIND=0-31 # Reserve CPU cores for framework overhead (scheduler, HTTP server) # These cores won't be used for inference, reducing contention - VLLM_CPU_NUM_OF_RESERVED_CPU=2 # ============================================================ # Extended timeouts for high load - VLLM_HTTP_TIMEOUT_KEEP_ALIVE=30 - VLLM_ENGINE_ITERATION_TIMEOUT_S=300 - VLLM_RPC_TIMEOUT=30000 # Logging - reduce verbosity for performance - VLLM_CONFIGURE_LOGGING=1 - VLLM_LOGGING_LEVEL=WARNING - VLLM_LOG_STATS_INTERVAL=60 # Memory optimization - aggressive settings - MALLOC_TRIM_THRESHOLD_=65536 - MALLOC_MMAP_THRESHOLD_=*** - MALLOC_MMAP_MAX_=65536 # Threading - set to (physical cores - reserved cores) # If you have 32 cores and reserve 2, set this to 30 - OMP_NUM_THREADS=30 - MKL_NUM_THREADS=30 - OMP_PROC_BIND=close - OMP_PLACES=cores # NUMA optimization (GNU OpenMP) - GOMP_CPU_AFFINITY=0-29 # Use tcmalloc for better memory performance - LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 # Disable usage stats for privacy/performance - VLLM_NO_USAGE_STATS=1 - VLLM_DO_NOT_TRACK=1 # High-performance vLLM arguments command: # Context length - "--max-model-len" - "32768" # Data type for Intel AMX - "--dtype" - "bfloat16" # Scheduling for high concurrency - "--max-num-seqs" - "256" - "--max-num-batched-tokens" - "32768" # Disable request logging for performance - "--disable-log-requests" # Enable chunked prefill for better latency - "--enable-chunked-prefill" - "--max-chunked-prefill-len" - "8192" # Speculative decoding (if supported by model) # - "--speculative-model" # - "path/to/draft/model" # - "--num-speculative-tokens" # - "5" # Health check is built into the Docker image (30s interval, 120s start period) # No resource limits - use all available resources # deploy: # resources: # limits: # memory: 120G # reservations: # memory: 64G # Logging configuration logging: driver: json-file options: max-size: "100m" max-file: "5" compress: "true" # Ulimits for high concurrency ulimits: memlock: soft: -1 hard: -1 nofile: soft: *** hard: *** nproc: soft: 65535 hard: 65535
Deploy:
bash# Verify CPU supports AMX: lscpu | grep -i amx # Set kernel params: sudo sysctl -w vm.swappiness=1 docker compose -f docker-compose-high-performance.yml up -d docker stats vllm-cpu-prod
| Variable | Default | Description |
|---|---|---|
PUID | (unset) | User ID for running vLLM |
PGID | (unset) | Group ID for running vLLM |
TZ | UTC | Timezone |
DATA_DIR | /data | Base data directory |
| Variable | Default | Description |
|---|---|---|
VLLM_MODEL | (required) | HuggingFace model ID or local path |
VLLM_TOKENIZER | (empty) | Custom tokenizer (if different from model) |
HF_TOKEN | (empty) | HuggingFace access token for gated models |
HF_HOME | /data/models | HuggingFace cache directory |
HF_HUB_OFFLINE | (empty) | Set to 1 for offline mode |
| Variable | Default | Description |
|---|---|---|
VLLM_SERVER_HOST | 0.0.0.0 | Server bind address |
VLLM_SERVER_PORT | 8000 | Server port |
VLLM_API_KEY | (empty) | API key for authentication |
VLLM_HTTP_TIMEOUT_KEEP_ALIVE | 5 | HTTP keep-alive timeout (seconds) |
| Variable | Default | Description |
|---|---|---|
VLLM_CONFIGURE_LOGGING | 1 | Enable logging configuration |
VLLM_LOGGING_LEVEL | INFO | Log level (DEBUG, INFO, WARNING, ERROR) |
| Variable | Default | Description |
|---|---|---|
MALLOC_TRIM_THRESHOLD_ | *** | Memory trim threshold |
OMP_NUM_THREADS | (auto) | OpenMP thread count |
MKL_NUM_THREADS | (auto) | MKL thread count |
These environment variables are specific to vLLM CPU inference and can significantly impact performance. See vLLM CPU FAQ for details.
| Variable | Default | Description |
|---|---|---|
VLLM_CPU_KVCACHE_SPACE | 4 | KV cache size in GB. Default 4GB is often too small! Set based on: (RAM - Model Size - 8GB) / 2 |
VLLM_CPU_OMP_THREADS_BIND | (unset) | OpenMP thread binding. Use auto or specific cores like 0-31 for better cache locality |
VLLM_CPU_NUM_OF_RESERVED_CPU | 0 | Reserve CPU cores for framework overhead (HTTP server, scheduler). Recommended: 2 for high-load servers |
KV Cache Sizing Guide:
| System RAM | Model Size | Recommended VLLM_CPU_KVCACHE_SPACE |
|---|---|---|
| 32 GB | 7B (~14GB) | 5 |
| 64 GB | 7B (~14GB) | 20 |
| 64 GB | 14B (~28GB) | 14 |
| 128 GB | 7B (~14GB) | 50 |
| 128 GB | 70B (~140GB) | Not enough RAM |
| 256 GB | 70B (~140GB) | 50 |
| Container Path | Purpose |
|---|---|
/data | Base data directory |
/data/models | HuggingFace cache |
/data/cache/vllm | vLLM cache |
| Argument | Description |
|---|---|
--max-model-len | Maximum context length |
--dtype | Data type (auto, float16, bfloat16, float32) |
--disable-log-requests | Don't log individual requests |
--trust-remote-code | Allow remote code execution |
| Issue | Solution |
|---|---|
| Container won't start | Check logs: docker logs vllm-cpu |
| Out of memory | Reduce --max-model-len or increase shm_size |
| Model download fails | Set HF_TOKEN for gated models |
| Slow inference | Verify correct CPU variant for your hardware |
bash# Docker Compose docker compose pull && docker compose up -d # Docker CLI docker pull mekayelanik/vllm-cpu:noavx512-latest docker stop vllm-cpu && docker rm vllm-cpu # Recreate with new image # Clean old images docker image prune -f
Your support encourages me to keep creating/supporting my open-source projects. If you found value in this project, you can buy me a coffee to keep me up all the sleepless nights.
Back to Top





manifest unknown 错误
TLS 证书验证失败
DNS 解析超时
410 错误:版本过低
402 错误:流量耗尽
身份认证失败错误
429 限流错误
凭证保存错误
来自真实用户的反馈,见证轩辕镜像的优质服务