
Real-time dictation, zero-shot voice cloning, and cinematic video dubbing, all on your desktop.
Open-source, no API keys, fully local. 646 languages.
https://github.com/debpalash/OmniVoice-Studio/stargazers https://github.com/debpalash/OmniVoice-Studio/releases/latest
https://github.com/debpalash/OmniVoice-Studio/issues
Download the latest installers: https://github.com/debpalash/OmniVoice-Studio/releases/latest
macOS: the first launch needs a one-time approval. Right-click → Open (or System Settings → Privacy & Security → "Open Anyway" on macOS 15). No Terminal needed.
> [!WARNING]
> OmniVoice Studio is in active beta. Things may break between releases. For the latest features and fixes, clone the repo and run from source rather than using pre-built installers. Bug reports and PRs are very welcome at https://github.com/debpalash/OmniVoice-Studio/issues or in the community chat.
| Feature | Description |
|---|---|
| 🎙️ Voice Cloning | 3-second clip → mirror any voice. |
| 🎨 Voice Design | Gender, age, accent, pitch, speed, … |
| 🎬 Video Dubbing | *** URL or file → transcribe → … |
| ⌨️ Dictation Widget | |
| 🔊 Vocal Isolation | Demucs-powered. Splits speech … |
| 👥 Speaker Diarization | Pyannote + WhisperX. |
| 📦 Batch Queue | Drop 50 videos, walk away. |
| 🤖 MCP Server | Use OmniVoice from Claude, … |
| 🛡️ AI Watermark | AudioSeal (Meta). Invisible, … |
| 🔐 100% Local | No keys, no cloud, no accounts. |
| ⚡ GPU Auto-Detect | CUDA · MPS · ROCm · CPU. |
| 🧩 Extensible | Subclass … |
Per-OS install guides — pick yours and follow it end-to-end:
Stuck? Run the built-in self-check first: Settings → About → "Run self-check" in the app, or `uv run python backend/main.py --diagnose` from a checkout (`--deep` also test-loads the active engine). Then see `docs/install/troubleshooting.md` for the top 10 install errors. The in-app error UI deep-links to those entries when something breaks at runtime, and Settings → About → "Save diagnostic bundle" packages scrubbed logs plus the self-check report for bug reports.

For Hugging Face token setup, see `docs/setup/huggingface-token.md`. For diarization-specific gating, see `docs/features/diarization.md`.
ElevenLabs charges $5–$330/mo and processes your audio on their servers. OmniVoice Studio runs on your hardware, with no usage limits.
| | ElevenLabs | OmniVoice Studio |
|---|---|---|
| Pricing | $5–$330/mo, per-character billing | Free & open-source (AGPL-3.0) · Commercial license for proprietary use |
| Voice Cloning | ✅ 3s clip | ✅ 3s clip, zero-shot |
| Voice Design | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| Languages | 32 | 646 |
| Video Dubbing | ✅ Cloud-only | ✅ Fully local |
| Data Privacy | Audio sent to cloud | Nothing leaves your machine |
| API Keys | Required | Not needed |
| GPU Support | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| Desktop App | ❌ | ✅ macOS · Windows · Linux |
| Customizable | ❌ Closed | ✅ Fork it, extend it, ship it |
OmniVoice Studio gives you professional-grade AI tools without the subscription or the cloud.
| | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM (GPU) | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| Disk | 10 GB free (models + cache) | 20 GB+ SSD |
| Python | 3.10+ (managed by uv) | 3.11–3.12 |
| GPU | Optional — CPU works | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |
> [!TIP]
> On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription, so no config is needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).
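
The offload rule in the tip can be pictured as a simple threshold check. This is only a sketch: the 8 GB threshold comes from the tip, while the function name, the `None` convention for "no GPU", and the `"cpu"`/`"gpu"` labels are hypothetical and not taken from the OmniVoice codebase.

```python
from typing import Optional

# Threshold quoted in the tip above; everything else here is illustrative.
OFFLOAD_VRAM_GB = 8.0


def tts_device_during_transcription(vram_gb: Optional[float]) -> str:
    """Return where the TTS model would sit while transcription is running.

    vram_gb is the GPU memory in GB, or None when no GPU is present
    (a hypothetical convention for this sketch).
    """
    if vram_gb is None:
        return "cpu"  # no GPU at all: the whole pipeline runs on CPU
    # At or below the threshold, TTS is moved to CPU to free VRAM for ASR.
    return "cpu" if vram_gb <= OFFLOAD_VRAM_GB else "gpu"


if __name__ == "__main__":
    for vram in (None, 6.0, 8.0, 12.0):
        print(vram, "->", tts_device_during_transcription(vram))
```

The real implementation may use different signals (free rather than total VRAM, for example); the sketch only shows the shape of the decision.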
OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.
| Engine | Languages | Clone | Instruct | Linux | macOS ARM | Windows | License |
|---|---|---|---|---|---|---|---|
| OmniVoice (default) | 600+ | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Built-in |
| CosyVoice 3 | 9 + 18 dialects | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …) | Multi | Varies | Varies | ❌ | ✅ Native | ❌ | Varies |
| VoxCPM2 | 30 | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MOSS-TTS-Nano | 20 | ✅ | ❌ | ✅ CUDA/CPU | ✅ CPU | ✅ CUDA/CPU | Apache-2.0 |
| KittenTTS | English | ❌ | ❌ | ✅ CPU | ✅ CPU | ✅ CPU | MIT |
CUDA = GPU-accelerated · MPS = Apple Silicon Metal · CPU = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only.
OmniVoice ships a multi-engine ASR (speech-to-text) backend that powers dictation, video dubbing, and subtitle generation, all fully local. WhisperX is the cross-platform default; the rest are opt-in and auto-detected. Switch in Settings → ASR Engine or via the `OMNIVOICE_ASR_BACKEND` env var.
| Engine | OMNIVOICE_ASR_BACKEND | Languages | Best for |
|---|---|---|---|
| WhisperX (default) | whisperx | ~100 | Dubbing & subtitles: word-level timing via wav2vec2 forced alignment |
| Faster-Whisper | faster-whisper | ~100 | Fast transcription on Linux / macOS / Windows (CTranslate2) |
| MLX Whisper | mlx-whisper | ~100 | Native Apple Silicon speed (Apple MLX / Metal) |
| PyTorch Whisper | pytorch-whisper | ~100 | CUDA / CPU fallback via 🤗 Transformers |
| Parakeet TDT | nemo-parakeet | English + 25 EU | SOTA English accuracy, auto language detection (NVIDIA NeMo, GPU only) |
| Moonshine | moonshine | English | Edge / low-latency, ONNX |
| FunASR | funasr | 50+ | All-in-one multilingual — built-in VAD + inline speaker diarization (SenseVoice) |
Whisper-family engines cover ~100 languages; FunASR / SenseVoice adds an all-in-one multilingual path with built-in voice-activity detection and inline speaker diarization. Every engine runs on-device: no API keys, no cloud.
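
As an illustration of the env-var switch, the sketch below resolves `OMNIVOICE_ASR_BACKEND` against the backend ids listed in the table and falls back to `whisperx` when the variable is unset. The function name and the choice to raise on an unknown value are assumptions made for this example; how the app itself treats an unrecognized value is not documented here.

```python
import os
from typing import Mapping, Optional

# Backend ids copied from the table above.
ASR_BACKENDS = frozenset({
    "whisperx", "faster-whisper", "mlx-whisper", "pytorch-whisper",
    "nemo-parakeet", "moonshine", "funasr",
})
DEFAULT_ASR_BACKEND = "whisperx"  # documented cross-platform default


def resolve_asr_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an ASR backend id from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get("OMNIVOICE_ASR_BACKEND", "").strip().lower()
    if not value:
        return DEFAULT_ASR_BACKEND
    if value not in ASR_BACKENDS:
        # Assumption: fail loudly instead of silently falling back.
        raise ValueError(f"unknown ASR backend: {value!r}")
    return value


if __name__ == "__main__":
    print(resolve_asr_backend({"OMNIVOICE_ASR_BACKEND": "funasr"}))
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.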
    ┌─────────────────────────────────────────────────┐
    │                Frontend (React)                 │
    │  DubTab · VoicePreview · BatchQueue · Gallery   │
    ├─────────────────────────────────────────────────┤
    │                Backend (FastAPI)                │
    │    97 API endpoints · SSE streaming · SQLite    │
    ├──────────┬──────────┬──────────┬────────────────┤
    │ WhisperX │  Demucs  │OmniVoice │    Pyannote    │
    │   ASR    │  Source  │   TTS    │  Diarization   │
    │          │   Sep.   │          │                │
    └──────────┴──────────┴──────────┴────────────────┘
          CUDA / MPS / ROCm / CPU (auto-detected)
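
The "auto-detected" device row can be approximated with standard PyTorch checks. This is a minimal sketch of one common detection order, not the project's actual logic; the function name and the returned labels are hypothetical. ROCm builds of PyTorch report through the CUDA API, so the sketch distinguishes them via `torch.version.hip`.

```python
def pick_device() -> str:
    """Return "cuda", "rocm", "mps", or "cpu" based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption for this sketch: no PyTorch means CPU-only

    if torch.cuda.is_available():
        # ROCm builds expose AMD GPUs through torch.cuda and set torch.version.hip.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"

    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon Metal backend

    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```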
| Category | Features |
|---|---|
| Dubbing | Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS |
| Voice | Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags |
| Audio | Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export |
| Multi-Lang | Multi-language batch picker, batch dubbing queue with sequential GPU execution |
| Diarization | Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment |
| Infra | Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading |
| AI Provenance | AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API |
| UX | Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system |
| Real-time Events | WebSocket event bus — instant sidebar refresh on data mutations, exponential backoff reconnect |
| State Management | Zustand store migration — uiSlice, pillSlice, dubSlice, generateSlice, prefsSlice, glossarySlice |
| Desktop | Cross-platform Tauri installers (macOS DMG, Windows MSI, Linux deb/AppImage), auto-update infrastructure |
| Windows Hardening | Cross-platform log paths, Triton workaround, HF symlink bypass, 300s health check timeout |
| Dictation | Global system-wide hotkey (⌘+⇧+Space), frameless floating widget, streaming ASR via WebSocket, auto-paste |
| Batch Pipeline | Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking |
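
The "exponential backoff reconnect" mentioned in the Real-time Events row usually means the wait between reconnect attempts grows geometrically up to a cap. The sketch below shows that schedule only; the base delay, factor, cap, and function name are illustrative values, not the ones OmniVoice Studio uses, and jitter is left out for clarity.

```python
from typing import List


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> List[float]:
    """Delay in seconds before each reconnect attempt, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


if __name__ == "__main__":
    # With the defaults: 0.5, 1.0, 2.0, 4.0, 8.0, 16.0
    print(backoff_delays())
```

A client would sleep for each delay in turn and reset to the first delay after a successful reconnect.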
| Channel | What happens there |
|---|---|
| #showcase | Members share their dubs, clones, and voice designs |
| #help | Setup issues, GPU troubleshooting, model questions |
| #feature-requests | Vote on what gets built next |
| #dev | Architecture discussions, PR reviews, engine integrations |
| #announcements | Release notes, breaking changes, early access |
**→ Join the community chat.** We respond to setup questions within hours, not days.
We welcome contributions of all kinds: bug fixes, new TTS engines, UI improvements, docs, and translations.
To add a TTS engine, subclass `TTSBackend` in `backend/services/tts_backend.py` and add it to the `_REGISTRY` dictionary at the bottom. Six engines are built in: OmniVoice, CosyVoice, MLX-Audio (14+ sub-engines), VoxCPM2, MOSS-TTS-Nano, and KittenTTS. See the TTS Engines section for details.
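
The subclass-and-register pattern described above might look roughly like the following. Everything here is a stand-in: the real `TTSBackend` interface in `backend/services/tts_backend.py` is not shown in this README, so the `synthesize` method, the `name` attribute, the `SilenceBackend` class, and the `get_backend` helper are all hypothetical and serve only to illustrate the registry idea.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class TTSBackend(ABC):
    """Stand-in base class; the project's real interface may differ."""

    name: str

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio bytes for `text` (hypothetical signature)."""


class SilenceBackend(TTSBackend):
    """Toy engine that emits two zero bytes per character."""

    name = "silence"

    def synthesize(self, text: str) -> bytes:
        return b"\x00\x00" * len(text)


# Registry keyed by engine id, mirroring the `_REGISTRY` dictionary the text mentions.
_REGISTRY: Dict[str, Type[TTSBackend]] = {
    SilenceBackend.name: SilenceBackend,
}


def get_backend(name: str) -> TTSBackend:
    """Instantiate a registered engine by id (hypothetical helper)."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown TTS backend: {name!r}") from None
    return cls()


if __name__ == "__main__":
    print(len(get_backend("silence").synthesize("hello")))
```

Keeping the registry as a plain dictionary means adding an engine is a one-line change once the subclass exists.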
OmniVoice Studio is free and open-source software under the GNU Affero General Public License v3.0 (https://www.gnu.org/licenses/agpl-3.0.html).
Free for any use — including commercial and internal business use. Run it, sell the audio you produce with it, dub your own or clients' videos, roll it out across your team — all free, no license needed. As a network copyleft license, AGPL adds one obligation: if you modify OmniVoice Studio and offer that modified version to others over a network, you must make the complete corresponding source of your modified version available to them under the same AGPL-3.0 terms.
A commercial license is available for organizations that want to embed OmniVoice Studio in a closed-source or proprietary product or service without the AGPL-3.0 copyleft obligations. Pricing tiers coming soon. Inquiries: *******.
The bundled `omnivoice/` TTS model by Han Zhu remains Apache-2.0 upstream. See LICENSE for the full, binding terms.
OmniVoice Studio is built on the shoulders of exceptional open-source work:
| Project | Role |
|---|---|
| https://github.com/k2-fsa/OmniVoice | Zero-shot diffusion TTS engine — the core voice synthesis model |
| https://github.com/m-bain/whisperX | Word-level speech recognition and alignment |
| https://github.com/facebookresearch/demucs | Music source separation for vocal isolation |
| https://github.com/pyannote/pyannote-audio | Speaker diarization — who said what |
| https://github.com/OpenNMT/CTranslate2 | Optimized Transformer inference on CPU and GPU |
| https://github.com/facebookresearch/audioseal | Invisible neural audio watermarking for AI provenance |
| https://tauri.app | Native desktop app framework |
If you read this far, you're our kind of person.
Star https://github.com/debpalash/OmniVoice-Studio so others can find it too.
**💬 Join the community chat** to share what you build.
</picture>