专属
文档
插件
助手
邀请
顶部
快速返回页面顶部
收起
收起工具栏
轩辕镜像 官方专业版
轩辕镜像
专业版
轩辕镜像 官方专业版
轩辕镜像
专业版
首页个人中心搜索镜像

交易
充值流量我的订单

文档

工具

功能
提交工单页面收录

帮助
轩辕镜像免费版

其他
关于我们网站地图
热门搜索:
gemma4

ai/gemma4

Docker AI 官方镜像

Gemma 4是Google DeepMind开发的多模态开源AI模型系列,支持文本、图像、音频处理,优化了推理、编码能力和长上下文处理,提供E2B、E4B、26B A4B、31B四种尺寸,适用于从移动设备到服务器的多样化部署场景。

35 次收藏下载次数: 0状态:社区镜像维护者:Docker AI 官方镜像仓库类型:镜像最近更新:2 个月前
让 AI 帮你使用轩辕镜像? · 展开查看说明

如果你使用 DeepSeek、元宝 AI、Copilot、ChatGPT 等 AI 工具,协助配置轩辕镜像、编写 docker pull 命令、修改 Docker Compose 镜像地址、配置镜像加速、排查镜像拉取失败、分析报错日志等问题,请先让 AI 阅读并遵守轩辕镜像的规则文档。

只需在 AI 对话中先发送下面这句话即可:

请先完整阅读并严格遵守以下文档中的全部规则与要求:

https://xuanyuan.cloud/agents.md

在未充分阅读并理解该文档前,不要生成任何命令、配置、修改建议、故障排查方案或技术回答。后续所有输出都必须严格以该文档中的规范为最高优先级执行。

查看 agents.md 用法指南与完整示范。国内用户首推 元宝 AI、DeepSeek 的深度思考模式,不推荐豆包 AI;Cursor 等编辑器可在对话 @ 该链接,或加入 User Rules。 若 AI 无法访问外链,可 打开说明文档 复制全文粘贴。文档会随站点更新,复制内容可能过期,建议定期检查。

轩辕镜像,不浪费每一次拉取。点击查看
DockerHub 官方简介
轩辕镜像中文简介
标签列表
镜像标签列表与下载命令
轩辕镜像,不浪费每一次拉取。点击查看

Hugging Face | https://github.com/google-gemma | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

  • Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

  • Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).

  • Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.

  • Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.

  • Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.

  • Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.

  • Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).

Dense Models

PropertyE2BE4B31B Dense
Total Parameters2.3B effective (5.1B with embeddings)4.5B effective (8B with embeddings)30.7B
Layers354260
Sliding Window512 tokens512 tokens1024 tokens
Context Length128K tokens128K tokens256K tokens
Vocabulary Size262K262K262K
Supported ModalitiesText, Image, AudioText, Image, AudioText, Image
Vision Encoder Parameters~150M~150M~550M
Audio Encoder Parameters~300M~300MNo Audio

The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.

Mixture-of-Experts (MoE) Model

Property26B A4B MoE
Total Parameters25.2B
Active Parameters3.8B
Layers30
Sliding Window1024 tokens
Context Length256K tokens
Vocabulary Size262K
Expert Count8 active / 128 total and 1 shared
Supported ModalitiesText, Image
Vision Encoder Parameters~550M

The "A" in 26B A4B stands for "active parameters" in contrast to the total number of parameters the model contains. By only activating a 4B subset of parameters during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model.

Benchmark Results

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models.

Gemma 4 31BGemma 4 26B A4BGemma 4 E4BGemma 4 E2BGemma 3 27B (no think)
MMLU Pro85.2%82.6%69.4%60.0%67.6%
AIME 2026 no tools89.2%88.3%42.5%37.5%20.8%
LiveCodeBench v680.0%77.1%52.0%44.0%29.1%
Codeforces ELO21501718940633110
GPQA Diamond84.3%82.3%58.6%43.4%42.4%
Tau2 (average over 3)76.9%68.2%42.2%24.5%16.2%
HLE no tools19.5%8.7%---
HLE with search26.5%17.2%---
BigBench Extra Hard74.4%64.8%33.1%21.9%19.3%
MMMLU88.4%86.3%76.6%67.4%70.7%
Vision
MMMU Pro76.9%73.8%52.6%44.2%49.7%
OmniDocBench 1.5 (average edit distance, lower is better)0.1310.1490.1810.2900.365
MATH-Vision85.6%82.4%59.5%52.4%46.0%
MedXPertQA MM61.3%58.1%28.7%23.5%-
Audio
CoVoST--35.5433.47-
FLEURS (lower is better)--0.080.09-
Long Context
MRCR v2 8 needle 128k (average)66.4%44.1%25.4%19.1%13.5%

Core Capabilities

Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:

  • Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
  • Long Context – Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
  • Image Understanding – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
  • Video Understanding – Analyze video by processing sequences of frames.
  • Interleaved Multimodal Input – Freely mix text and images in any order within a single prompt.
  • Function Calling – Native support for structured tool use, enabling agentic workflows.
  • Coding – Code generation, completion, and correction.
  • Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
  • Audio (E2B and E4B only) – Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

Best Practices

For the best performance, use these configurations and best practices:

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

  • temperature=1.0
  • top_p=0.95
  • top_k=64

2. Thinking Mode Configuration

Compared to Gemma 3, the models use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:

  • Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
  • Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
    <|channel>thoughtn[Internal reasoning]<channel|>
  • Disabled Thinking Behavior: For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
    <|channel>thoughtn<channel|>[Final answer]

[!Note]
Note that many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.

3. Multi-Turn Conversations

  • No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

4. Modality order

  • For optimal performance with multimodal inputs, place image and/or audio content before the text in your prompt.

5. Variable Image Resolution

Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

  • The supported token budgets are: 70, 140, 280, 560, and 1120.
    • Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
    • Use higher budgets for tasks like OCR, document parsing, or reading small text.

6. Audio

Use the following prompt structures for audio processing:

  • Audio Speech Recognition (ASR)
text
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:  
* Only output the transcription, with no newlines.  
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.  
  • Automatic Speech Translation (AST)
text
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.  
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.  

7. Audio and Video Length

All models support image inputs and can process videos as frames whereas the E2B and E4B models also support audio inputs. Audio supports a maximum length of 30 seconds. Video supports a maximum of 60 seconds assuming the images are processed at one frame per second.

Model Data

Data used for model training and how the data was processed.

Training Dataset

Our pre-training dataset is a large-scale, diverse collection of data encompassing a wide range of domains and modalities, which includes web documents, code, images, audio, with a cutoff date of January 2025. Here are the key components:

  • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
  • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
  • Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
  • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

  • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
  • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
  • Additional methods: Filtering based on content quality and safety in line with our policies.

Ethics and Safety

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.

Evaluation Approach

Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google’s AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:

  • Content related to child sexual abuse material and exploitation
  • Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
  • Sexually explicit content
  • Hate speech (e.g., dehumanizing members of protected groups)
  • Harassment (e.g., encouraging violence against people)

Evaluation Results

For all areas of safety testing, we saw major improvements in all categories of content safety relative to previous Gemma models. Overall, Gemma 4 models significantly outperform Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance.

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators ***ed as part of model training and development.

  • Content Creation and Communication
    • Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
    • Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
    • Audio Processing and Interaction: The smaller models (E2B and E4B) can analyze and interpret audio inputs, enabling voice-driven interactions and transcriptions.
  • Research and Education
    • Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data
    • The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity
    • Models perform well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance
    • Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy
    • Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  • Common Sense
    • Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

**Ethical *ations and Risks

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully ***ed the following:

  • Bias and Fairness
    • VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. Gemma 4 models underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
  • Misinformation and Misuse
    • VLMs can be misused to generate text that is false, misleading, or harmful.
    • Guidelines are provided for responsible use with the model, see the Responsible Generative AI Toolkit.
  • Transparency and Accountability
    • This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
    • A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

  • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
  • Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
  • Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
  • Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.

Benefits

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

更多相关 Docker 镜像与资源

以下是 ai/gemma4 相关的常用 Docker 镜像,适用于 不同场景 等不同场景:

  • ai/gpt-oss Docker 镜像说明(Docker 官方 GPT-OSS 系列模型镜像,适合开源大模型推理与实验部署)
  • ai/gemma3 Docker 镜像说明(Docker 官方 Gemma 3 模型推理镜像,适合轻量开源语言模型服务)
  • dengcao/vllm-openai Docker 镜像说明(社区维护的 vLLM OpenAI 兼容推理镜像,适合自建大模型 HTTP API)
  • vllm/vllm-openai Docker 镜像说明

镜像拉取方式

您可以使用以下命令拉取该镜像。请将 <标签> 替换为具体的标签版本。如需查看所有可用标签版本,请访问 标签列表页面。

轩辕镜像加速拉取命令点我查看更多 gemma4 镜像标签

docker model pull docker.xuanyuan.run/ai/gemma4:<标签>

使用方法:

  • 登录认证方式
  • 免认证方式

DockerHub 原生拉取命令

docker model pull ai/gemma4:<标签>

更多 gemma4 镜像推荐

ai/qwen3 logo

ai/qwen3

Docker AI 官方镜像
Qwen3是通义千问系列最新推出的大语言模型,作为新一代人工智能技术成果,它专为顶尖级代码编写、数学计算、逻辑推理及语言处理任务打造,凭借先进的算法架构与深度优化的模型性能,能够高效应对多领域复杂场景下的专业需求,为用户提供精准、智能的解决方案,展现出在技术研发与应用落地中的卓越潜力。
152 次收藏50万+ 次下载
7 个月前更新
ai/gemma3 logo

ai/gemma3

Docker AI 官方镜像
Google最新推出的AI模型Gemma,虽体型小巧却性能强劲,尤其在聊天交互与内容生成领域表现突出;其轻量化设计不仅便于高效部署,还能在保证响应速度的同时,持续输出高质量内容,为用户带来便捷且智能的使用体验,是一款兼顾小巧体型与强大功能的新一代AI模型。
60 次收藏50万+ 次下载
8 个月前更新
ai/gpt-oss logo

ai/gpt-oss

Docker AI 官方镜像
OpenAI的开源权重模型系列,设计用于强大的推理、代理任务和多用途开发场景,提供多种量化版本以适应不同资源需求。
43 次收藏10万+ 次下载
7 个月前更新
ai/gemma3n logo

ai/gemma3n

Docker AI 官方镜像
为低资源设备提供高效多模态AI,支持文本、图像、音频和视频处理。
10 次收藏5万+ 次下载
11 个月前更新
ai/smollm2 logo

ai/smollm2

Docker AI 官方镜像
为速度、边缘设备和本地开发构建的小型语言模型
33 次收藏10万+ 次下载
9 个月前更新
ai/llama3.2 logo

ai/llama3.2

Docker AI 官方镜像
LLaMA 3的此次稳定更新显著提升了模型性能,在编码开发、日常聊天及各类问答任务中均展现出可靠表现,能够高效处理多样化的用户需求,无论是辅助编写代码、进行流畅对话还是解答专业问题,都具备稳定且出色的任务执行能力,为用户提供了更为实用的人工智能工具支持。
27 次收藏10万+ 次下载
1 年前更新

查看更多 gemma4 相关镜像

轩辕镜像配置手册

按平台快速找到配置文档

Docker

登录仓库拉取

登录认证 · 私有仓库

专属域名拉取

免登录 · 高速拉取

Linux

Docker 镜像配置

Windows / Mac

Docker Desktop 配置

MacOS OrbStack

OrbStack 容器

Docker Compose

Compose 项目配置

NAS

群晖

Synology 配置

飞牛

fnOS 镜像配置

绿联

绿联 NAS

威联通

QNAP 配置

极空间

极空间 NAS

企业仓库

其他仓库

ghcr · Quay · nvcr

Harbor 镜像源

Proxy Repository 对接

Portainer 镜像源

Registries 配置

Nexus 镜像源

Docker Proxy 缓存

开发工具

Dev Containers

VS Code 开发容器

Podman

Podman 配置指南

Singularity / Apptainer

HPC 科学计算容器

Kubernetes

K8s Containerd

Kubernetes · Containerd

K3s

轻量级集群

面板 / 网络

爱快路由

iKuai 镜像加速

宝塔面板

一键配置镜像源

AI

用 AI 使用轩辕镜像

agents.md · AI 对话 · 提示词

一键安装

一键安装 Docker

Linux Docker 一键安装

需要其他帮助?请查看我们的 常见问题 Docker 镜像访问常见问题解答 或 提交工单

镜像拉取常见问题

功能

免费版与专业版区别

功能对比 · 版本选择

支持的镜像仓库

Docker Hub · GCR · GHCR

新手拉取配置

登录 · 专属域名 · 配置

docker search 限制

专属域名 · Hub 搜索

不支持 push

仅支持 pull · 不支持

拉取速度原因

带宽 · 缓存 · 冷热镜像

错误码

402 与流量用尽

402 · 流量包 · 充值

401 认证失败

401 · docker login

manifest unknown

标签错误 · 镜像不存在

410 Gone 排查

410 · Docker 升级

429 限流

免费版 · 请求频率

其他报错

DNS 超时

DNS 解析 · 网络超时

TLS 证书失败

no matching manifest(架构)

账号

失败是否计费

manifest · blob · 计费

申请开发票(企业 / 个人)

企业 · 个人 · 工单

修改登录密码

网站 · 仓库 · 重置

注销账户

工单 · 数据 · 注销

原理

mirrors 不生效

daemon.json · 重启

去掉域名前缀

docker tag · 重命名

指定架构拉取

ARM64 · AMD64 · 多架构

latest 与「最新」

digest · 版本号 · 标签

查看全部问题→

用户好评

来自真实用户的反馈,见证轩辕镜像的优质服务

用户头像

oldzhang

运维工程师

Linux服务器

5

"Docker访问体验非常流畅,大镜像也能快速完成下载。"

轩辕镜像
Docker AI 官方镜像
...
ai/gemma4
教程轩辕镜像功能与使用教程
价格查看流量套餐与价格
热门查看热门 Docker 镜像推荐
博客Docker 镜像公告与技术博客
官方公众号:源码跳动|官方技术交流群:831623681
官方公众号:源码跳动|官方技术交流群:|问题咨询请:提交工单
商务合作:点击复制邮箱
©2024-2026 源码跳动
商务合作:点击复制邮箱Copyright © 2024-2026 杭州源码跳动科技有限公司. All rights reserved.