Docker container for llama-cpp-python, a Python binding for llama.cpp.
GitHub - 3x3cut0r/llama-cpp-python
DockerHub - 3x3cut0r/llama-cpp-python
IMPORTANT: you need to add the SYS_RESOURCE capability to enable MLOCK support:
```shell
# for docker run:
docker run -d --cap-add SYS_RESOURCE ...
```

```yaml
# for docker compose:
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
```
Example 1 - run a model from Huggingface:
This is the recommended way to use this container!
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
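The first start can take a while because the model file is downloaded from Huggingface. A minimal way to follow progress and verify the server, assuming the default port mapping from the example above:

```shell
# follow the container logs until the server reports it is listening
docker logs -f llama-cpp-python

# then check that the API answers (the model alias should appear in the list)
curl http://localhost:8000/v1/models
```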
Example 2 - run your own model from a local file:
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -e MODEL_REPO="local" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
Example 3 - run with arguments (most environment variables will be ignored):
Arguments will be executed like this:

```shell
/venv/bin/python3 -B -m llama_cpp.server --host 0.0.0.0 <your arguments>
```
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest \
    --model /model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    --n_ctx 1024 \
    ...
```
Example 4 - show help:
```shell
docker run --rm \
    --name llama-cpp-python \
    3x3cut0r/llama-cpp-python:latest \
    --help
```
```yaml
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
    environment:
      MODEL_DOWNLOAD: "True"
      MODEL_REPO: "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
      MODEL: "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
      MODEL_ALIAS: "mistral-7b-instruct"
      CHAT_FORMAT: "mistral"
    ports:
      - 8000:8000/tcp
```
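Saved as docker-compose.yml, the stack can then be started the usual way; a short sketch, assuming Docker Compose v2:

```shell
docker compose up -d
docker compose logs -f llama-cpp-python
```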
Environment variables:

- TZ - Specifies the server timezone - default: UTC
- MODEL_DOWNLOAD - If True, downloads the MODEL file from the Huggingface MODEL_REPO - default: True
- MODEL_REPO - The huggingface repo name. Set to local if MODEL was mounted locally - default: TheBloke/Llama-2-7B-Chat-GGUF
- MODEL - MANDATORY: The model filename - default: llama-2-7b-chat.Q4_K_M.gguf
- MODEL_ALIAS - The alias of the model to use for generating completions - default: llama-2-7b-chat
- SEED - Random seed. -1 for random - default: 4294967295
- N_CTX - The context size - default: 2048
- N_BATCH - The batch size to use per eval - default: 512
- N_GPU_LAYERS - The number of layers to put on the GPU. The rest will be on the CPU - default: 0
- MAIN_GPU - Main GPU to use - default: 0
- TENSOR_SPLIT - Split layers across multiple GPUs in proportion.
- ROPE_FREQ_BASE - RoPE base frequency - default: 0.0
- ROPE_FREQ_SCALE - RoPE frequency scaling factor - default: 0.0
- MUL_MAT_Q - If True, use experimental mul_mat_q kernels - default: True
- LOGITS_ALL - Whether to return logits - default: True
- VOCAB_ONLY - Whether to only return the vocabulary - default: False
- USE_MMAP - Use mmap - default: True
- USE_MLOCK - Use mlock - default: True
- EMBEDDING - Whether to use embeddings - default: True
- N_THREADS - The number of threads to use - default: 4
- LAST_N_TOKENS_SIZE - Last n tokens to keep for repeat penalty calculation - default: 64
- LORA_BASE - Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
- LORA_PATH - Path to a LoRA file to apply to the model.
- NUMA - Enable NUMA support - default: False
- CHAT_FORMAT - Chat format to use - default: llama-2
- CACHE - Use a cache to reduce processing times for evaluated prompts - default: False
- CACHE_TYPE - The type of cache to use. Only used if CACHE is True - default: ram
- CACHE_SIZE - The size of the cache in bytes. Only used if CACHE is True - default: 2147483648
- VERBOSE - Whether to print debug information - default: True
- HOST - Listen address - default: 0.0.0.0
- PORT - Listen port - default: 8000
- INTERRUPT_REQUESTS - Whether to interrupt requests when a new request is received - default: True
- HF_TOKEN - Huggingface Token for private repos - default: None

Volumes:

- /model - model directory -> map your llama (*.gguf) models here

Ports:

- 8000/tcp - API Port

Visit abetlen's documentation or see [***] for more information.
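To illustrate how these variables combine, here is a sketch of a run that offloads layers to the GPU and enables the prompt cache; the concrete values (32 layers, RAM cache) are assumptions for illustration, not recommendations:

```shell
# assumed values for illustration: 32 GPU layers, RAM prompt cache
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="TheBloke/Llama-2-7B-Chat-GGUF" \
    -e MODEL="llama-2-7b-chat.Q4_K_M.gguf" \
    -e N_GPU_LAYERS="32" \
    -e CACHE="True" \
    -e CACHE_TYPE="ram" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```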
/v1/engines/copilot-codex/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```
/v1/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```
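The same body can be sent with curl; a sketch assuming the server is published on localhost:8000 as in the examples above:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
        "stop": ["\n", "###"]
    }'
```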
/v1/embeddings - POST - Create Embedding

```json
{
    "input": "The food was delicious and the waiter..."
}
```
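For example, with curl (again assuming localhost:8000):

```shell
curl http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "The food was delicious and the waiter..."}'
```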
/v1/chat/completions - POST - Create Chat Completion

```json
{
    "messages": [
        {
            "content": "You are a helpful assistant.",
            "role": "system"
        },
        {
            "content": "What is the capital of France?",
            "role": "user"
        }
    ]
}
```
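As a usage sketch, the chat request above sent with curl (assuming localhost:8000):

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": "What is the capital of France?", "role": "user"}
        ]
    }'
```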
/v1/models - GET - Get Models
response:

```json
{
    "object": "list",
    "data": [
        {
            "id": "llama-2-7b-chat",
            "object": "model",
            "owned_by": "me",
            "permissions": []
        }
    ]
}
```
![License: GPL v3]([***]) - This project is licensed under the GNU General Public License - see the gpl-3.0 file for details.