Skip to content

Model Catalog and LiteLLM Aliases

Last updated: 2026-05-31

Active Aliases

LiteLLM alias Backend Port Purpose
local/qwen-coder llama.cpp 8010 coding, engineering automation, technical reasoning
local/qwen-vision-fast llama.cpp 8011 image understanding, screenshots, diagrams, OCR-style inspection
local/embed-engineering llama.cpp embeddings 8012 document embeddings for RAG

Clients should use these LiteLLM aliases instead of calling raw backend URLs.

Models on Disk

Coder model:

/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00002-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00003-of-00004.gguf
/srv/localai/models/qwen3-coder-next-q4_k_m/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00004-of-00004.gguf

Embedding model:

/srv/localai/models/qwen3-embedding-0.6b-q8_0/Qwen3-Embedding-0.6B-Q8_0.gguf

Vision model:

/srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/Qwen3VL-8B-Instruct-Q4_K_M.gguf
/srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/mmproj-Qwen3VL-8B-Instruct-F16.gguf

The previous Qwen3.5-9B-Q4_K_M vision model remains on disk as a rollback option, but Qwen3-VL-8B-Instruct-Q4_K_M is the active default because it followed RapidDraft's final-answer prompt contract more reliably.

llama.cpp Runtime Settings

local/qwen-coder

Service:

localai-qwen-coder.service

Key runtime settings:

--alias local/qwen-coder
--host 127.0.0.1
--port 8010
--ctx-size 32768
--parallel 1
--gpu-layers auto
--flash-attn auto
--jinja
--metrics
--no-ui

local/qwen-coder uses a larger single-slot context for local agent apps such as Hermes Agent, whose default tool prompt can exceed 20k model tokens before the user message.

local/qwen-vision-fast

Service:

localai-qwen-vision.service

Key runtime settings:

--alias local/qwen-vision-fast
--model /srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/Qwen3VL-8B-Instruct-Q4_K_M.gguf
--mmproj /srv/localai/models/qwen3-vl-8b-instruct-q4_k_m/mmproj-Qwen3VL-8B-Instruct-F16.gguf
--media-path /srv/localai/documents
--host 127.0.0.1
--port 8011
--ctx-size 8192
--parallel 1
--gpu-layers auto
--flash-attn auto
--jinja
--metrics
--no-ui

local/embed-engineering

Service:

localai-embed.service

Key runtime settings:

--alias local/embed-engineering
--host 127.0.0.1
--port 8012
--ctx-size 8192
--parallel 2
--gpu-layers auto
--embedding
--metrics
--no-ui

The embedding model returns 1024-dimensional embeddings. The RAG database schema uses vector(1024).

LiteLLM Mapping

LiteLLM maps aliases to loopback backend APIs:

LiteLLM alias LiteLLM backend model string Backend API base
local/qwen-coder openai/local/qwen-coder http://127.0.0.1:8010/v1
local/qwen-vision-fast openai/local/qwen-vision-fast http://127.0.0.1:8011/v1
local/embed-engineering openai/local/embed-engineering http://127.0.0.1:8012/v1

LiteLLM requires bearer authentication. Do not document the key value.

Use local/qwen-coder for:

  • coding agents
  • repository work
  • shell/script generation
  • engineering automation
  • technical reasoning over retrieved documents

Use local/qwen-vision-fast for:

  • screenshots
  • UI inspection
  • diagrams
  • drawing exports
  • OCR-style image questions
  • multimodal RAG follow-up work

Use local/embed-engineering only for embeddings:

  • document ingestion
  • query embeddings
  • semantic and hybrid search

Deferred Model Lanes

These aliases are documented for future comparison but are not part of the active boot-enabled baseline:

Future alias Candidate role
local/qwen-chat-fast fast text-only Qwen chat/reasoning
local/qwen-reasoning stronger Qwen reasoning/multimodal route
local/gemma-engineering manufacturing/design comparison route
local/mistral-design long-document and visual reasoning comparison route
local/gpt-oss-reasoning staged workflow and reasoning-heavy route

Do not boot-enable every candidate model. Memory use, KV cache, and concurrent users are the operational limits, not only model file size.