Local AI Model Selection¶

Last updated: 2026-05-27

Context¶

Target machine:

Fedora server deployment
llama.cpp with Vulkan first
LiteLLM as the only client-facing API
Tailscale/private access first
boot-time systemd services, no desktop login required
assumed future GPU-addressable/shared memory target: 96 GB

The current deployment choice is to run two model services first, then keep additional models documented for later comparison.

First Deploy¶

These are the two models to deploy first.

Priority	LiteLLM alias	Model	Source	Primary use	Modality	Deployment note
1	`local/qwen-coder`	`Qwen/Qwen3-Coder-Next-GGUF`	User-selected	coding agents, repo work, automation scripts, technical reasoning	text/code	Best first coding/engineering automation route. Keep context conservative at first even though the model supports long-context workflows.
2	`local/qwen-vision-fast`	`Qwen/Qwen3.5-9B`	Selected + user-confirmed	vision, diagrams, screenshots, OCR-style inspection, general multimodal reasoning	text/image/video	Best first multimodal route. Lightweight enough to keep always available while covering visual manufacturing/design tasks.

These two provide the best first coverage:

Qwen3-Coder-Next handles code, automation, scripts, CLI/IDE workflows, and technical implementation.
Qwen3.5-9B handles drawings, screenshots, diagrams, image/video understanding, and fast general reasoning.

Candidate Model Table¶

Deploy status	LiteLLM alias	Model	Source	Role	Modality	Notes
first deploy	`local/qwen-coder`	`Qwen/Qwen3-Coder-Next-GGUF`	user-selected	coding and engineering automation	text/code	Strongest match for coding agents, IDE use, repo-scale work, scaffolding, scripts, and tool-oriented workflows.
first deploy	`local/qwen-vision-fast`	`Qwen/Qwen3.5-9B`	selected + user-selected	fast multimodal reasoning	text/image/video	Main first vision route for drawings, diagrams, screenshots, OCR-like checks, and general multimodal reasoning.
deferred	`local/qwen-chat-fast`	`Qwen/Qwen3-30B-A3B-GGUF`	user-selected	fast text-only chat/reasoning	text	Practical MoE text route. Useful later if a stronger text-only general chat model is needed, but not one of the first two because vision is required.
deferred	`local/qwen-reasoning`	`Qwen/Qwen3.5-35B-A3B`	previously selected	stronger Qwen reasoning/multimodal route	text/image/video	Candidate premium Qwen route after first-deploy benchmarking and BIOS/shared-memory retest.
deferred	`local/gemma-engineering`	`google/gemma-4-26B-A4B-it`	previously selected	engineering and vision comparison	text/image	Useful comparison model for manufacturing/design reasoning, diagrams, and OCR-style work.
deferred	`local/mistral-design`	`mistralai/Mistral-Small-3.1-24B-Instruct-2503`	previously selected	design, long documents, function-calling comparison	text/image	Good comparison route for technical documents, visual reasoning, and function-calling behavior.
deferred	`local/gpt-oss-reasoning`	`openai/gpt-oss-20b` or GGUF conversion/repo	added by user request	high-reasoning and staged workflow reasoning	text	Strong candidate for reasoning-heavy, agentic, tool-use, and graduation/staged workflow tasks. Keep as a later comparison route, not first deploy.
local RAG support	`local/embed-engineering`	`Qwen/Qwen3-Embedding-0.6B-GGUF`	support model	engineering document retrieval	embeddings	First RAG embedding route. Uses 1024-dimensional vectors, which keeps pgvector HNSW indexing simple.

gpt-oss-20b Note¶

gpt-oss-20b should be documented as a later reasoning route.

Reasons to keep it in the candidate set:

It is designed for powerful reasoning, agentic tasks, and developer workflows.
It is text-only, so it does not replace the Qwen3.5 vision route.
It is useful for structured multi-step work, tool-use planning, coding-adjacent workflows, and graduation/staged workflow reasoning.
It may be a good comparison against Qwen3-Coder-Next for planning-heavy automation tasks.

Why it is not in the first deploy:

The first deploy needs one coding route and one vision route.
gpt-oss-20b adds another reasoning/text service, but not visual understanding.
It should be benchmarked after the two-model baseline is stable.

Suggested future alias:

local/gpt-oss-reasoning

Operating Guidance¶

Even with an assumed 96 GB GPU-addressable/shared memory target, do not boot-enable every candidate model.

First boot-enabled model services:

localai-qwen-coder.service
localai-qwen-vision.service
localai-litellm.service

Recommended first aliases:

local/qwen-coder
local/qwen-vision-fast

Keep these conservative at first:

context windows
parallel slots
max output tokens
number of resident large models

Model weights may fit in memory, but long-context KV cache and concurrent users are the real operational limit.

Manufacturing And Design Knowledge¶

The model should not be treated as the source of truth for manufacturing and design.

The better architecture is:

LiteLLM
  -> Qwen3-Coder-Next for engineering automation and code
  -> Qwen3.5-9B for vision and multimodal reasoning
  -> Qwen3-Embedding-0.6B for local engineering document RAG

The RAG deployment is split into:

local-first RAG with local Postgres + pgvector, file-drop ingestion, local chat UI, and optional secured API access for the Railway app
later Railway RAG with Railway Postgres + pgvector and local Fedora inference

See RAG Deployment Plan.

RAG should eventually index:

machine manuals
CAD/CAM notes
materials datasheets
process instructions
QA checklists
standards and tolerances
supplier specs
previous project reports
internal manufacturing lessons learned

Source Links¶

Qwen3-Coder-Next-GGUF: https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF
Qwen3.5-9B: https://huggingface.co/Qwen/Qwen3.5-9B
Qwen3-30B-A3B-GGUF: https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF
Qwen3-Embedding-0.6B-GGUF: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF
Qwen3.5-35B-A3B: https://huggingface.co/Qwen/Qwen3.5-35B-A3B
Gemma 4 26B-A4B: https://huggingface.co/google/gemma-4-26B-A4B-it
Mistral Small 3.1: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503
gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b
OpenAI gpt-oss model card: https://openai.com/index/gpt-oss-model-card/