Overview and Architecture¶

Last updated: 2026-05-28

Purpose¶

The local AI server runs private model inference and retrieval-augmented generation on local-server-adeel. It is intended for engineering, coding, manufacturing, design-document, and visual inspection workflows.

The first deployed mode is private and local-first:

inference runs on the Fedora server
raw model ports stay on loopback
LiteLLM provides the OpenAI-compatible API
RAG uses local PostgreSQL + pgvector
RAG API/chat UI stays on loopback until explicit external protection is added
Tailscale is the intended private network path for the LiteLLM API

High-Level Architecture¶

Tailscale client or local process
  -> LiteLLM API
       -> local/qwen-coder llama.cpp backend
       -> local/qwen-vision-fast llama.cpp backend
       -> local/embed-engineering llama.cpp backend

Local browser or localhost process
  -> RAG API/chat UI
       -> PostgreSQL + pgvector
       -> LiteLLM API
            -> local/qwen-coder
            -> local/qwen-vision-fast
            -> local/embed-engineering

The document ingestion path is:

/srv/localai/documents/inbox
  -> localai-rag-ingest.service
  -> extract text or OCR
  -> chunk text
  -> embed chunks through local/embed-engineering
  -> store documents and chunks in PostgreSQL + pgvector
  -> move source file to archive or failed

Security Boundary¶

The network boundary is intentionally simple:

127.0.0.1:8010: raw llama.cpp coder backend, localhost-only
127.0.0.1:8011: raw llama.cpp vision backend, localhost-only
127.0.0.1:8012: raw llama.cpp embedding backend, localhost-only
127.0.0.1:4100: RAG API/chat UI, local upstream for the Cloudflare Knowledge endpoint
0.0.0.0:4000: LiteLLM API, private Tailscale-facing API
PostgreSQL listens on localhost only

Do not expose raw llama.cpp ports directly. External clients should use LiteLLM, and any future external RAG access should go through an authenticated HTTPS proxy.

Current Implementation¶

The implemented stack is system-level, not desktop-session based. Services are managed by systemd and run as the localai user.

Implemented components:

llama.cpp built under /srv/localai/llama.cpp with Vulkan support
Qwen coder model route on 127.0.0.1:8010
Qwen vision model route on 127.0.0.1:8011
Qwen embedding model route on 127.0.0.1:8012
LiteLLM proxy on 0.0.0.0:4000
RAG API/chat UI on 127.0.0.1:4100, exposed for backend callers through https://knowledge.rapiddraft.ai
local PostgreSQL + pgvector database named localai_rag
file-drop ingestion through localai-rag-ingest.timer

Data Flow for Chat¶

For direct model calls:

client
  -> LiteLLM /v1/chat/completions
  -> selected llama.cpp backend
  -> response

For RAG chat:

question
  -> RAG API /chat
  -> embed question with local/embed-engineering
  -> retrieve matching chunks from pgvector/full-text indexes
  -> build cited prompt
  -> call LiteLLM using local/qwen-coder by default
  -> return answer and source chunks

For vision chat:

question and image
  -> RAG API /vision/chat-with-image
  -> LiteLLM
  -> local/qwen-vision-fast
  -> answer

Operational Principles¶

Keep the raw model ports loopback-only.
Use LiteLLM aliases instead of backend URLs in client code.
Keep model services conservative until memory and throughput are benchmarked after BIOS/shared-memory changes.
Treat local documents as the source of truth for manufacturing and design knowledge.
Do not put secret values in documentation, shell history, tickets, or wiki pages.
Verify boot behavior after kernel, driver, BIOS, model, LiteLLM, or RAG changes.