
Docker container for llama-cpp-python - a Python binding for llama.cpp.
GitHub - 3x3cut0r/llama-cpp-python
DockerHub - 3x3cut0r/llama-cpp-python
IMPORTANT: you need to add the SYS_RESOURCE capability to enable MLOCK support.
```
# for docker run:
docker run -d --cap-add SYS_RESOURCE ...

# for docker compose:
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
```
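To verify that the capability was actually granted to a running container (container name as used in the examples below), you can inspect it:

```shell
# SYS_RESOURCE should appear in the list of added capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' llama-cpp-python
```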
Example 1 - run a model from Hugging Face:
This is the recommended way to use this container!
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
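Once the model has downloaded and the server is listening, a quick sanity check - assuming the port mapping above and that you are on the Docker host - is to list the served model:

```shell
# should return a model list containing the MODEL_ALIAS ("mistral-7b-instruct")
curl http://localhost:8000/v1/models
```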
Example 2 - run your own model from a local file:
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -e MODEL_REPO="local" \
    -e MODEL="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
    -e MODEL_ALIAS="mistral-7b-instruct" \
    -e CHAT_FORMAT="mistral" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```
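If the server exits right after start, a common cause is a wrong host path in the -v mapping; one way to check that the model file is actually visible inside the container is:

```shell
# the mounted *.gguf file should show up under /model
docker exec llama-cpp-python ls -lh /model
```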
Example 3 - run with arguments (most environment variables will be ignored):
Arguments are passed to the server like this:
```shell
/venv/bin/python3 -B -m llama_cpp.server --host 0.0.0.0 <your arguments>
```
```shell
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e MODEL_DOWNLOAD="False" \
    -v /path/to/your/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf:/model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest \
    --model /model/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    --n_ctx 1024 \
    ...
```
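To confirm which arguments actually reached the server, you can print the arguments the container was started with:

```shell
# prints the arguments appended after the image name when the container was created
docker inspect --format '{{.Args}}' llama-cpp-python
```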
Example 4 - show help:
```shell
docker run --rm \
    --name llama-cpp-python \
    3x3cut0r/llama-cpp-python:latest \
    --help
```
The same setup as Example 1 as a docker-compose.yml:

```yaml
version: '3.9'

services:
  llama-cpp-python:
    image: 3x3cut0r/llama-cpp-python:latest
    container_name: llama-cpp-python
    cap_add:
      - SYS_RESOURCE
    environment:
      MODEL_DOWNLOAD: "True"
      MODEL_REPO: "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
      MODEL: "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
      MODEL_ALIAS: "mistral-7b-instruct"
      CHAT_FORMAT: "mistral"
    ports:
      - 8000:8000/tcp
```
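Assuming the file above is saved as docker-compose.yml in the current directory, bring the service up and follow the first-run model download in the logs:

```shell
docker compose up -d
docker compose logs -f llama-cpp-python
```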
Environment Variables:

- TZ - specifies the server timezone - default: UTC
- MODEL_DOWNLOAD - if True, downloads the MODEL file from the Hugging Face MODEL_REPO - default: True
- MODEL_REPO - the Hugging Face repo name; set to local if MODEL is mounted locally - default: TheBloke/Llama-2-7B-Chat-GGUF
- MODEL - MANDATORY: the model filename - default: llama-2-7b-chat.Q4_K_M.gguf
- MODEL_ALIAS - the alias of the model to use for generating completions - default: llama-2-7b-chat
- SEED - random seed, -1 for random - default: 4294967295
- N_CTX - the context size - default: 2048
- N_BATCH - the batch size to use per eval - default: 512
- N_GPU_LAYERS - the number of layers to put on the GPU; the rest stay on the CPU - default: 0
- MAIN_GPU - main GPU to use - default: 0
- TENSOR_SPLIT - split layers across multiple GPUs in proportion
- ROPE_FREQ_BASE - RoPE base frequency - default: 0.0
- ROPE_FREQ_SCALE - RoPE frequency scaling factor - default: 0.0
- MUL_MAT_Q - if True, use experimental mul_mat_q kernels - default: True
- LOGITS_ALL - whether to return logits - default: True
- VOCAB_ONLY - whether to only return the vocabulary - default: False
- USE_MMAP - use mmap - default: True
- USE_MLOCK - use mlock - default: True
- EMBEDDING - whether to use embeddings - default: True
- N_THREADS - the number of threads to use - default: 4
- LAST_N_TOKENS_SIZE - last n tokens to keep for repeat penalty calculation - default: 64
- LORA_BASE - optional path to a base model; useful if using a quantized base model and you want to apply LoRA to an f16 model
- LORA_PATH - path to a LoRA file to apply to the model
- NUMA - enable NUMA support - default: False
- CHAT_FORMAT - chat format to use - default: llama-2
- CACHE - use a cache to reduce processing times for evaluated prompts - default: False
- CACHE_TYPE - the type of cache to use; only used if CACHE is True - default: ram
- CACHE_SIZE - the size of the cache in bytes; only used if CACHE is True - default: 2147483648
- VERBOSE - whether to print debug information - default: True
- HOST - listen address - default: 0.0.0.0
- PORT - listen port - default: 8000
- INTERRUPT_REQUESTS - whether to interrupt requests when a new request is received - default: True
- HF_TOKEN - Hugging Face token for private repos (see the sketch after this section) - default: None

Volumes:

- /model - model directory -> map your llama (*.gguf) models here

Ports:

- 8000/tcp - API port

Visit abetlen's documentation or see [***] for more information.
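As a hedged sketch of how these variables combine for a gated or private repository - the token, repository and file names below are placeholders, not tested values:

```shell
# placeholders: replace HF_TOKEN, MODEL_REPO, MODEL and MODEL_ALIAS with your own values
docker run -d \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -e HF_TOKEN="hf_xxxxxxxxxxxxxxxx" \
    -e MODEL_DOWNLOAD="True" \
    -e MODEL_REPO="your-org/your-private-gguf-repo" \
    -e MODEL="your-model.Q4_K_M.gguf" \
    -e MODEL_ALIAS="your-model" \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest
```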
API endpoints:

/v1/engines/copilot-codex/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```

/v1/completions - POST - Create Completion

```json
{
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"]
}
```

/v1/embeddings - POST - Create Embedding

```json
{
    "input": "The food was delicious and the waiter..."
}
```

/v1/chat/completions - POST - Create Chat Completion

```json
{
    "messages": [
        { "content": "You are a helpful assistant.", "role": "system" },
        { "content": "What is the capital of France?", "role": "user" }
    ]
}
```

/v1/models - GET - Get Models

response:

```json
{
    "object": "list",
    "data": [
        {
            "id": "llama-2-7b-chat",
            "object": "model",
            "owned_by": "me",
            "permissions": []
        }
    ]
}
```
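A minimal request against the chat endpoint, assuming the container from Example 1 is running with port 8000 published on localhost; the body is the same as the /v1/chat/completions example above:

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            { "content": "You are a helpful assistant.", "role": "system" },
            { "content": "What is the capital of France?", "role": "user" }
          ]
        }'
```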
License: GPL v3 - This project is licensed under the GNU General Public License - see the gpl-3.0 license file for details.