
Docker container for llama-cpp-python - a Python binding for llama.cpp.
GitHub - 3x3cut0r/llama-cpp-python
DockerHub - 3x3cut0r/llama-cpp-python
IMPORTANT: you need to add the SYS_RESOURCE capability to enable MLOCK support.
```
# for docker run:
docker run -d --cap-add SYS_RESOURCE ...

# for docker compose:
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
```
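To verify that the capability was actually granted to a running container (container name as used in the examples below), you can inspect it:

```shell
# SYS_RESOURCE should appear in the list of added capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' llama-cpp-python
```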
Example 1 - run a model from Hugging Face:
This is the recommended way to use this container!
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
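Once the model has downloaded and the server is listening, a quick sanity check - assuming the port mapping above and that you are on the Docker host - is to list the served model:

```shell
# should return a model list containing the MODEL_ALIAS ("mistral-7b-instruct")
curl http://localhost:8000/v1/models
```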
Example 2 - run your own model from a local file:
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -e MODEL_REPO="local" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
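If the server exits right after start, a common cause is a wrong host path in the -v mapping; one way to check that the model file is actually visible inside the container is:

```shell
# the mounted *.gguf file should show up under /model
docker exec llama-cpp-python ls -lh /model
```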
Example 3 - run with arguments (most environment variables will be ignored):
Arguments are passed to the server like this:
```shell
/venv/bin/python3 -B -m llama_cpp.server --host 0.0.0.0 <your arguments>
```
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest \
    --model /model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    --n_ctx 1024 \
    ...
```
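To confirm which arguments actually reached the server, you can print the arguments the container was started with:

```shell
# prints the arguments appended after the image name when the container was created
docker inspect --format '{{.Args}}' llama-cpp-python
```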
Example 4 - show help:
```shell
docker run --rm \
    --name llama-cpp-python \
    3x3cut0r/llama-cpp-python:latest \
    --help
```
The same setup as Example 1 as a docker-compose.yml:

```yaml
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
    environment:
      MODEL_DOWNLOAD: "True"
      MODEL_REPO: "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
      MODEL: "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
      MODEL_ALIAS: "mistral-7b-instruct"
      CHAT_FORMAT: "mistral"
    ports:
      - 8000:8000/tcp
```
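Assuming the file above is saved as docker-compose.yml in the current directory, bring the service up and follow the first-run model download in the logs:

```shell
docker compose up -d
docker compose logs -f llama-cpp-python
```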
Environment Variables:

- TZ - specifies the server timezone - default: UTC
- MODEL_DOWNLOAD - if True, downloads the MODEL file from the Hugging Face MODEL_REPO - default: True
- MODEL_REPO - the Hugging Face repo name; set to local if MODEL is mounted locally - default: TheBloke/Llama-2-7B-Chat-GGUF
- MODEL - MANDATORY: the model filename - default: llama-2-7b-chat.Q4_K_M.gguf
- MODEL_ALIAS - the alias of the model to use for generating completions - default: llama-2-7b-chat
- SEED - random seed, -1 for random - default: 4294967295
- N_CTX - the context size - default: 2048
- N_BATCH - the batch size to use per eval - default: 512
- N_GPU_LAYERS - the number of layers to put on the GPU; the rest stay on the CPU - default: 0
- MAIN_GPU - main GPU to use - default: 0
- TENSOR_SPLIT - split layers across multiple GPUs in proportion
- ROPE_FREQ_BASE - RoPE base frequency - default: 0.0
- ROPE_FREQ_SCALE - RoPE frequency scaling factor - default: 0.0
- MUL_MAT_Q - if True, use experimental mul_mat_q kernels - default: True
- LOGITS_ALL - whether to return logits - default: True
- VOCAB_ONLY - whether to only return the vocabulary - default: False
- USE_MMAP - use mmap - default: True
- USE_MLOCK - use mlock - default: True
- EMBEDDING - whether to use embeddings - default: True
- N_THREADS - the number of threads to use - default: 4
- LAST_N_TOKENS_SIZE - last n tokens to keep for repeat penalty calculation - default: 64
- LORA_BASE - optional path to a base model; useful if using a quantized base model and you want to apply LoRA to an f16 model
- LORA_PATH - path to a LoRA file to apply to the model
- NUMA - enable NUMA support - default: False
- CHAT_FORMAT - chat format to use - default: llama-2
- CACHE - use a cache to reduce processing times for evaluated prompts - default: False
- CACHE_TYPE - the type of cache to use; only used if CACHE is True - default: ram
- CACHE_SIZE - the size of the cache in bytes; only used if CACHE is True - default: 2147483648
- VERBOSE - whether to print debug information - default: True
- HOST - listen address - default: 0.0.0.0
- PORT - listen port - default: 8000
- INTERRUPT_REQUESTS - whether to interrupt requests when a new request is received - default: True
- HF_TOKEN - Hugging Face token for private repos (see the sketch after this section) - default: None

Volumes:

- /model - model directory -> map your llama (*.gguf) models here

Ports:

- 8000/tcp - API port

Visit abetlen's documentation or see [***] for more information.
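As a hedged sketch of how these variables combine for a gated or private repository - the token, repository and file names below are placeholders, not tested values:

```shell
# placeholders: replace HF_TOKEN, MODEL_REPO, MODEL and MODEL_ALIAS with your own values
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e HF_TOKEN="hf_xxxxxxxxxxxxxxxx" \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="your-org/your-private-gguf-repo" \
    -e MODEL="your-model.Q4_K_M.gguf" \
    -e MODEL_ALIAS="your-model" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```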
API endpoints:

/v1/engines/copilot-codex/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```

/v1/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```

/v1/embeddings - POST - Create Embedding

```json
{
    "input": "The food was delicious and the waiter..."
}
```

/v1/chat/completions - POST - Create Chat Completion

```json
{
    "messages": [
        { "content": "You are a helpful assistant.", "role": "system" },
        { "content": "What is the capital of France?", "role": "user" }
    ]
}
```

/v1/models - GET - Get Models

response:

```json
{
    "object": "list",
    "data": [
        {
            "id": "llama-2-7b-chat",
            "object": "model",
            "owned_by": "me",
            "permissions": []
        }
    ]
}
```
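A minimal request against the chat endpoint, assuming the container from Example 1 is running with port 8000 published on localhost; the body is the same as the /v1/chat/completions example above:

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            { "content": "You are a helpful assistant.", "role": "system" },
            { "content": "What is the capital of France?", "role": "user" }
          ]
        }'
```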
License: GPL v3 - This project is licensed under the GNU General Public License - see the gpl-3.0 license file for details.