Skip to content

Local AI Model Selection

Last updated: 2026-05-27

Context

Target machine:

  • Fedora server deployment
  • llama.cpp with Vulkan first
  • LiteLLM as the only client-facing API
  • Tailscale/private access first
  • boot-time systemd services, no desktop login required
  • assumed future GPU-addressable/shared memory target: 96 GB

The current deployment choice is to run two model services first, then keep additional models documented for later comparison.

First Deploy

These are the two models to deploy first.

Priority LiteLLM alias Model Source Primary use Modality Deployment note
1 local/qwen-coder Qwen/Qwen3-Coder-Next-GGUF User-selected coding agents, repo work, automation scripts, technical reasoning text/code Best first coding/engineering automation route. Keep context conservative at first even though the model supports long-context workflows.
2 local/qwen-vision-fast Qwen/Qwen3.5-9B Selected + user-confirmed vision, diagrams, screenshots, OCR-style inspection, general multimodal reasoning text/image/video Best first multimodal route. Lightweight enough to keep always available while covering visual manufacturing/design tasks.

These two provide the best first coverage:

  • Qwen3-Coder-Next handles code, automation, scripts, CLI/IDE workflows, and technical implementation.
  • Qwen3.5-9B handles drawings, screenshots, diagrams, image/video understanding, and fast general reasoning.

Candidate Model Table

Deploy status LiteLLM alias Model Source Role Modality Notes
first deploy local/qwen-coder Qwen/Qwen3-Coder-Next-GGUF user-selected coding and engineering automation text/code Strongest match for coding agents, IDE use, repo-scale work, scaffolding, scripts, and tool-oriented workflows.
first deploy local/qwen-vision-fast Qwen/Qwen3.5-9B selected + user-selected fast multimodal reasoning text/image/video Main first vision route for drawings, diagrams, screenshots, OCR-like checks, and general multimodal reasoning.
deferred local/qwen-chat-fast Qwen/Qwen3-30B-A3B-GGUF user-selected fast text-only chat/reasoning text Practical MoE text route. Useful later if a stronger text-only general chat model is needed, but not one of the first two because vision is required.
deferred local/qwen-reasoning Qwen/Qwen3.5-35B-A3B previously selected stronger Qwen reasoning/multimodal route text/image/video Candidate premium Qwen route after first-deploy benchmarking and BIOS/shared-memory retest.
deferred local/gemma-engineering google/gemma-4-26B-A4B-it previously selected engineering and vision comparison text/image Useful comparison model for manufacturing/design reasoning, diagrams, and OCR-style work.
deferred local/mistral-design mistralai/Mistral-Small-3.1-24B-Instruct-2503 previously selected design, long documents, function-calling comparison text/image Good comparison route for technical documents, visual reasoning, and function-calling behavior.
deferred local/gpt-oss-reasoning openai/gpt-oss-20b or GGUF conversion/repo added by user request high-reasoning and staged workflow reasoning text Strong candidate for reasoning-heavy, agentic, tool-use, and graduation/staged workflow tasks. Keep as a later comparison route, not first deploy.
local RAG support local/embed-engineering Qwen/Qwen3-Embedding-0.6B-GGUF support model engineering document retrieval embeddings First RAG embedding route. Uses 1024-dimensional vectors, which keeps pgvector HNSW indexing simple.

gpt-oss-20b Note

gpt-oss-20b should be documented as a later reasoning route.

Reasons to keep it in the candidate set:

  • It is designed for powerful reasoning, agentic tasks, and developer workflows.
  • It is text-only, so it does not replace the Qwen3.5 vision route.
  • It is useful for structured multi-step work, tool-use planning, coding-adjacent workflows, and graduation/staged workflow reasoning.
  • It may be a good comparison against Qwen3-Coder-Next for planning-heavy automation tasks.

Why it is not in the first deploy:

  • The first deploy needs one coding route and one vision route.
  • gpt-oss-20b adds another reasoning/text service, but not visual understanding.
  • It should be benchmarked after the two-model baseline is stable.

Suggested future alias:

local/gpt-oss-reasoning

Operating Guidance

Even with an assumed 96 GB GPU-addressable/shared memory target, do not boot-enable every candidate model.

First boot-enabled model services:

localai-qwen-coder.service
localai-qwen-vision.service
localai-litellm.service

Recommended first aliases:

local/qwen-coder
local/qwen-vision-fast

Keep these conservative at first:

  • context windows
  • parallel slots
  • max output tokens
  • number of resident large models

Model weights may fit in memory, but long-context KV cache and concurrent users are the real operational limit.

Manufacturing And Design Knowledge

The model should not be treated as the source of truth for manufacturing and design.

The better architecture is:

LiteLLM
  -> Qwen3-Coder-Next for engineering automation and code
  -> Qwen3.5-9B for vision and multimodal reasoning
  -> Qwen3-Embedding-0.6B for local engineering document RAG

The RAG deployment is split into:

  1. local-first RAG with local Postgres + pgvector, file-drop ingestion, local chat UI, and optional secured API access for the Railway app
  2. later Railway RAG with Railway Postgres + pgvector and local Fedora inference

See RAG Deployment Plan.

RAG should eventually index:

  • machine manuals
  • CAD/CAM notes
  • materials datasheets
  • process instructions
  • QA checklists
  • standards and tolerances
  • supplier specs
  • previous project reports
  • internal manufacturing lessons learned
  • Qwen3-Coder-Next-GGUF: https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF
  • Qwen3.5-9B: https://huggingface.co/Qwen/Qwen3.5-9B
  • Qwen3-30B-A3B-GGUF: https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF
  • Qwen3-Embedding-0.6B-GGUF: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF
  • Qwen3.5-35B-A3B: https://huggingface.co/Qwen/Qwen3.5-35B-A3B
  • Gemma 4 26B-A4B: https://huggingface.co/google/gemma-4-26B-A4B-it
  • Mistral Small 3.1: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503
  • gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b
  • OpenAI gpt-oss model card: https://openai.com/index/gpt-oss-model-card/