Local AI Model Selection¶
Last updated: 2026-05-27
Context¶
Target machine:
- Fedora server deployment
llama.cppwith Vulkan first- LiteLLM as the only client-facing API
- Tailscale/private access first
- boot-time
systemdservices, no desktop login required - assumed future GPU-addressable/shared memory target:
96 GB
The current deployment choice is to run two model services first, then keep additional models documented for later comparison.
First Deploy¶
These are the two models to deploy first.
| Priority | LiteLLM alias | Model | Source | Primary use | Modality | Deployment note |
|---|---|---|---|---|---|---|
| 1 | local/qwen-coder |
Qwen/Qwen3-Coder-Next-GGUF |
User-selected | coding agents, repo work, automation scripts, technical reasoning | text/code | Best first coding/engineering automation route. Keep context conservative at first even though the model supports long-context workflows. |
| 2 | local/qwen-vision-fast |
Qwen/Qwen3.5-9B |
Selected + user-confirmed | vision, diagrams, screenshots, OCR-style inspection, general multimodal reasoning | text/image/video | Best first multimodal route. Lightweight enough to keep always available while covering visual manufacturing/design tasks. |
These two provide the best first coverage:
Qwen3-Coder-Nexthandles code, automation, scripts, CLI/IDE workflows, and technical implementation.Qwen3.5-9Bhandles drawings, screenshots, diagrams, image/video understanding, and fast general reasoning.
Candidate Model Table¶
| Deploy status | LiteLLM alias | Model | Source | Role | Modality | Notes |
|---|---|---|---|---|---|---|
| first deploy | local/qwen-coder |
Qwen/Qwen3-Coder-Next-GGUF |
user-selected | coding and engineering automation | text/code | Strongest match for coding agents, IDE use, repo-scale work, scaffolding, scripts, and tool-oriented workflows. |
| first deploy | local/qwen-vision-fast |
Qwen/Qwen3.5-9B |
selected + user-selected | fast multimodal reasoning | text/image/video | Main first vision route for drawings, diagrams, screenshots, OCR-like checks, and general multimodal reasoning. |
| deferred | local/qwen-chat-fast |
Qwen/Qwen3-30B-A3B-GGUF |
user-selected | fast text-only chat/reasoning | text | Practical MoE text route. Useful later if a stronger text-only general chat model is needed, but not one of the first two because vision is required. |
| deferred | local/qwen-reasoning |
Qwen/Qwen3.5-35B-A3B |
previously selected | stronger Qwen reasoning/multimodal route | text/image/video | Candidate premium Qwen route after first-deploy benchmarking and BIOS/shared-memory retest. |
| deferred | local/gemma-engineering |
google/gemma-4-26B-A4B-it |
previously selected | engineering and vision comparison | text/image | Useful comparison model for manufacturing/design reasoning, diagrams, and OCR-style work. |
| deferred | local/mistral-design |
mistralai/Mistral-Small-3.1-24B-Instruct-2503 |
previously selected | design, long documents, function-calling comparison | text/image | Good comparison route for technical documents, visual reasoning, and function-calling behavior. |
| deferred | local/gpt-oss-reasoning |
openai/gpt-oss-20b or GGUF conversion/repo |
added by user request | high-reasoning and staged workflow reasoning | text | Strong candidate for reasoning-heavy, agentic, tool-use, and graduation/staged workflow tasks. Keep as a later comparison route, not first deploy. |
| local RAG support | local/embed-engineering |
Qwen/Qwen3-Embedding-0.6B-GGUF |
support model | engineering document retrieval | embeddings | First RAG embedding route. Uses 1024-dimensional vectors, which keeps pgvector HNSW indexing simple. |
gpt-oss-20b Note¶
gpt-oss-20b should be documented as a later reasoning route.
Reasons to keep it in the candidate set:
- It is designed for powerful reasoning, agentic tasks, and developer workflows.
- It is text-only, so it does not replace the Qwen3.5 vision route.
- It is useful for structured multi-step work, tool-use planning, coding-adjacent workflows, and graduation/staged workflow reasoning.
- It may be a good comparison against Qwen3-Coder-Next for planning-heavy automation tasks.
Why it is not in the first deploy:
- The first deploy needs one coding route and one vision route.
gpt-oss-20badds another reasoning/text service, but not visual understanding.- It should be benchmarked after the two-model baseline is stable.
Suggested future alias:
local/gpt-oss-reasoning
Operating Guidance¶
Even with an assumed 96 GB GPU-addressable/shared memory target, do not boot-enable every candidate model.
First boot-enabled model services:
localai-qwen-coder.service
localai-qwen-vision.service
localai-litellm.service
Recommended first aliases:
local/qwen-coder
local/qwen-vision-fast
Keep these conservative at first:
- context windows
- parallel slots
- max output tokens
- number of resident large models
Model weights may fit in memory, but long-context KV cache and concurrent users are the real operational limit.
Manufacturing And Design Knowledge¶
The model should not be treated as the source of truth for manufacturing and design.
The better architecture is:
LiteLLM
-> Qwen3-Coder-Next for engineering automation and code
-> Qwen3.5-9B for vision and multimodal reasoning
-> Qwen3-Embedding-0.6B for local engineering document RAG
The RAG deployment is split into:
- local-first RAG with local Postgres + pgvector, file-drop ingestion, local chat UI, and optional secured API access for the Railway app
- later Railway RAG with Railway Postgres + pgvector and local Fedora inference
See RAG Deployment Plan.
RAG should eventually index:
- machine manuals
- CAD/CAM notes
- materials datasheets
- process instructions
- QA checklists
- standards and tolerances
- supplier specs
- previous project reports
- internal manufacturing lessons learned
Source Links¶
Qwen3-Coder-Next-GGUF: https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUFQwen3.5-9B: https://huggingface.co/Qwen/Qwen3.5-9BQwen3-30B-A3B-GGUF: https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUFQwen3-Embedding-0.6B-GGUF: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUFQwen3.5-35B-A3B: https://huggingface.co/Qwen/Qwen3.5-35B-A3BGemma 4 26B-A4B: https://huggingface.co/google/gemma-4-26B-A4B-itMistral Small 3.1: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b- OpenAI gpt-oss model card: https://openai.com/index/gpt-oss-model-card/