Overview and Architecture¶
Last updated: 2026-05-28
Purpose¶
The local AI server runs private model inference and retrieval-augmented generation on local-server-adeel. It is intended for engineering, coding, manufacturing, design-document, and visual inspection workflows.
The first deployed mode is private and local-first:
- inference runs on the Fedora server
- raw model ports stay on loopback
- LiteLLM provides the OpenAI-compatible API
- RAG uses local PostgreSQL + pgvector
- RAG API/chat UI stays on loopback until explicit external protection is added
- Tailscale is the intended private network path for the LiteLLM API
High-Level Architecture¶
Tailscale client or local process
-> LiteLLM API
-> local/qwen-coder llama.cpp backend
-> local/qwen-vision-fast llama.cpp backend
-> local/embed-engineering llama.cpp backend
Local browser or localhost process
-> RAG API/chat UI
-> PostgreSQL + pgvector
-> LiteLLM API
-> local/qwen-coder
-> local/qwen-vision-fast
-> local/embed-engineering
The document ingestion path is:
/srv/localai/documents/inbox
-> localai-rag-ingest.service
-> extract text or OCR
-> chunk text
-> embed chunks through local/embed-engineering
-> store documents and chunks in PostgreSQL + pgvector
-> move source file to archive or failed
Security Boundary¶
The network boundary is intentionally simple:
127.0.0.1:8010: raw llama.cpp coder backend, localhost-only127.0.0.1:8011: raw llama.cpp vision backend, localhost-only127.0.0.1:8012: raw llama.cpp embedding backend, localhost-only127.0.0.1:4100: RAG API/chat UI, local upstream for the Cloudflare Knowledge endpoint0.0.0.0:4000: LiteLLM API, private Tailscale-facing API- PostgreSQL listens on localhost only
Do not expose raw llama.cpp ports directly. External clients should use LiteLLM, and any future external RAG access should go through an authenticated HTTPS proxy.
Current Implementation¶
The implemented stack is system-level, not desktop-session based. Services are managed by systemd and run as the localai user.
Implemented components:
llama.cppbuilt under/srv/localai/llama.cppwith Vulkan support- Qwen coder model route on
127.0.0.1:8010 - Qwen vision model route on
127.0.0.1:8011 - Qwen embedding model route on
127.0.0.1:8012 - LiteLLM proxy on
0.0.0.0:4000 - RAG API/chat UI on
127.0.0.1:4100, exposed for backend callers throughhttps://knowledge.rapiddraft.ai - local PostgreSQL + pgvector database named
localai_rag - file-drop ingestion through
localai-rag-ingest.timer
Data Flow for Chat¶
For direct model calls:
client
-> LiteLLM /v1/chat/completions
-> selected llama.cpp backend
-> response
For RAG chat:
question
-> RAG API /chat
-> embed question with local/embed-engineering
-> retrieve matching chunks from pgvector/full-text indexes
-> build cited prompt
-> call LiteLLM using local/qwen-coder by default
-> return answer and source chunks
For vision chat:
question and image
-> RAG API /vision/chat-with-image
-> LiteLLM
-> local/qwen-vision-fast
-> answer
Operational Principles¶
- Keep the raw model ports loopback-only.
- Use LiteLLM aliases instead of backend URLs in client code.
- Keep model services conservative until memory and throughput are benchmarked after BIOS/shared-memory changes.
- Treat local documents as the source of truth for manufacturing and design knowledge.
- Do not put secret values in documentation, shell history, tickets, or wiki pages.
- Verify boot behavior after kernel, driver, BIOS, model, LiteLLM, or RAG changes.