Skip to content

Overview and Architecture

Last updated: 2026-05-28

Purpose

The local AI server runs private model inference and retrieval-augmented generation on local-server-adeel. It is intended for engineering, coding, manufacturing, design-document, and visual inspection workflows.

The first deployed mode is private and local-first:

  • inference runs on the Fedora server
  • raw model ports stay on loopback
  • LiteLLM provides the OpenAI-compatible API
  • RAG uses local PostgreSQL + pgvector
  • RAG API/chat UI stays on loopback until explicit external protection is added
  • Tailscale is the intended private network path for the LiteLLM API

High-Level Architecture

Tailscale client or local process
  -> LiteLLM API
       -> local/qwen-coder llama.cpp backend
       -> local/qwen-vision-fast llama.cpp backend
       -> local/embed-engineering llama.cpp backend

Local browser or localhost process
  -> RAG API/chat UI
       -> PostgreSQL + pgvector
       -> LiteLLM API
            -> local/qwen-coder
            -> local/qwen-vision-fast
            -> local/embed-engineering

The document ingestion path is:

/srv/localai/documents/inbox
  -> localai-rag-ingest.service
  -> extract text or OCR
  -> chunk text
  -> embed chunks through local/embed-engineering
  -> store documents and chunks in PostgreSQL + pgvector
  -> move source file to archive or failed

Security Boundary

The network boundary is intentionally simple:

  • 127.0.0.1:8010: raw llama.cpp coder backend, localhost-only
  • 127.0.0.1:8011: raw llama.cpp vision backend, localhost-only
  • 127.0.0.1:8012: raw llama.cpp embedding backend, localhost-only
  • 127.0.0.1:4100: RAG API/chat UI, local upstream for the Cloudflare Knowledge endpoint
  • 0.0.0.0:4000: LiteLLM API, private Tailscale-facing API
  • PostgreSQL listens on localhost only

Do not expose raw llama.cpp ports directly. External clients should use LiteLLM, and any future external RAG access should go through an authenticated HTTPS proxy.

Current Implementation

The implemented stack is system-level, not desktop-session based. Services are managed by systemd and run as the localai user.

Implemented components:

  • llama.cpp built under /srv/localai/llama.cpp with Vulkan support
  • Qwen coder model route on 127.0.0.1:8010
  • Qwen vision model route on 127.0.0.1:8011
  • Qwen embedding model route on 127.0.0.1:8012
  • LiteLLM proxy on 0.0.0.0:4000
  • RAG API/chat UI on 127.0.0.1:4100, exposed for backend callers through https://knowledge.rapiddraft.ai
  • local PostgreSQL + pgvector database named localai_rag
  • file-drop ingestion through localai-rag-ingest.timer

Data Flow for Chat

For direct model calls:

client
  -> LiteLLM /v1/chat/completions
  -> selected llama.cpp backend
  -> response

For RAG chat:

question
  -> RAG API /chat
  -> embed question with local/embed-engineering
  -> retrieve matching chunks from pgvector/full-text indexes
  -> build cited prompt
  -> call LiteLLM using local/qwen-coder by default
  -> return answer and source chunks

For vision chat:

question and image
  -> RAG API /vision/chat-with-image
  -> LiteLLM
  -> local/qwen-vision-fast
  -> answer

Operational Principles

  • Keep the raw model ports loopback-only.
  • Use LiteLLM aliases instead of backend URLs in client code.
  • Keep model services conservative until memory and throughput are benchmarked after BIOS/shared-memory changes.
  • Treat local documents as the source of truth for manufacturing and design knowledge.
  • Do not put secret values in documentation, shell history, tickets, or wiki pages.
  • Verify boot behavior after kernel, driver, BIOS, model, LiteLLM, or RAG changes.