Model Catalog and LiteLLM Aliases¶
Last updated: 2026-05-31
Active Aliases¶
| LiteLLM alias | Backend | Port | Purpose |
|---|---|---|---|
local/qwen-coder |
llama.cpp | 8010 |
coding, engineering automation, technical reasoning |
local/qwen-vision-fast |
llama.cpp | 8011 |
image understanding, screenshots, diagrams, OCR-style inspection |
local/embed-engineering |
llama.cpp embeddings | 8012 |
document embeddings for RAG |
Clients should use these LiteLLM aliases instead of calling raw backend URLs.
Models on Disk¶
Coder model:
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00002-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00003-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00004-of-00004.gguf
Embedding model:
/srv/localai/models/qwen3-embedding-0.6b-q8_0/Qwen3-Embedding-0.6B-Q8_0.gguf
Vision model:
/srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/Qwen3VL-8B-Instruct-Q4_K_M.gguf
/srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/mmproj-Qwen3VL-8B-Instruct-F16.gguf
The previous Qwen3.5-9B-Q4_K_M vision model remains on disk as a rollback option, but Qwen3-VL-8B-Instruct-Q4_K_M is the active default because it followed RapidDraft's final-answer prompt contract more reliably.
llama.cpp Runtime Settings¶
local/qwen-coder¶
Service:
localai-qwen-coder.service
Key runtime settings:
--alias local/qwen-coder
--host 127.0.0.1
--port 8010
--ctx-size 32768
--parallel 1
--gpu-layers auto
--flash-attn auto
--jinja
--metrics
--no-ui
local/qwen-coder uses a larger single-slot context for local agent apps such as
Hermes Agent, whose default tool prompt can exceed 20k model tokens before the
user message.
local/qwen-vision-fast¶
Service:
localai-qwen-vision.service
Key runtime settings:
--alias local/qwen-vision-fast
--model /srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/Qwen3VL-8B-Instruct-Q4_K_M.gguf
--mmproj /srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/mmproj-Qwen3VL-8B-Instruct-F16.gguf
--media-path /srv/localai/documents
--host 127.0.0.1
--port 8011
--ctx-size 8192
--parallel 1
--gpu-layers auto
--flash-attn auto
--jinja
--metrics
--no-ui
local/embed-engineering¶
Service:
localai-embed.service
Key runtime settings:
--alias local/embed-engineering
--host 127.0.0.1
--port 8012
--ctx-size 8192
--parallel 2
--gpu-layers auto
--embedding
--metrics
--no-ui
The embedding model returns 1024-dimensional embeddings. The RAG database schema uses vector(1024).
LiteLLM Mapping¶
LiteLLM maps aliases to loopback backend APIs:
| LiteLLM alias | LiteLLM backend model string | Backend API base |
|---|---|---|
local/qwen-coder |
openai/local/qwen-coder |
http://127.0.0.1:8010/v1 |
local/qwen-vision-fast |
openai/local/qwen-vision-fast |
http://127.0.0.1:8011/v1 |
local/embed-engineering |
openai/local/embed-engineering |
http://127.0.0.1:8012/v1 |
LiteLLM requires bearer authentication. Do not document the key value.
Recommended Use¶
Use local/qwen-coder for:
- coding agents
- repository work
- shell/script generation
- engineering automation
- technical reasoning over retrieved documents
Use local/qwen-vision-fast for:
- screenshots
- UI inspection
- diagrams
- drawing exports
- OCR-style image questions
- multimodal RAG follow-up work
Use local/embed-engineering only for embeddings:
- document ingestion
- query embeddings
- semantic and hybrid search
Deferred Model Lanes¶
These aliases are documented for future comparison but are not part of the active boot-enabled baseline:
| Future alias | Candidate role |
|---|---|
local/qwen-chat-fast |
fast text-only Qwen chat/reasoning |
local/qwen-reasoning |
stronger Qwen reasoning/multimodal route |
local/gemma-engineering |
manufacturing/design comparison route |
local/mistral-design |
long-document and visual reasoning comparison route |
local/gpt-oss-reasoning |
staged workflow and reasoning-heavy route |
Do not boot-enable every candidate model. Memory use, KV cache, and concurrent users are the operational limits, not only model file size.