RAG Deployment Plan

Last updated: 2026-05-28

Goal

Split document search and chat into two deployment modes:

Full local RAG deployment:
  local vector database
  local document ingestion
  local inference
  local chat interface
  optional secure API for the Railway web app
Later Railway RAG deployment:
  Railway Postgres with pgvector
  Railway-hosted upload and retrieval workflow
  local inference on the Fedora server
  Railway web app calls the local inference API

This lets the demo prove document ingestion and question answering locally first, then move the database and upload workflow into Railway later.

Recommended First Architecture

Use local Postgres + pgvector as the first vector database.

Reason:

It matches the later Railway Postgres + pgvector target.
The schema and retrieval queries can move with minimal changes.
It avoids introducing a separate vector database API just for the local phase.

First local stack:

/srv/localai/documents/inbox
  -> local ingestion worker
  -> local Postgres + pgvector
  -> local RAG API
  -> local chat UI
  -> local LiteLLM
       -> local/qwen-coder
       -> local/qwen-vision-fast
       -> local/embed-engineering

External demo access:

Railway web app
  -> secure HTTPS RAG API on Fedora
       -> local Postgres + pgvector
       -> local LiteLLM

The Railway app should not connect directly to local Postgres.

Local Document Workflow

The first local workflow is file-drop based.

Directory layout:

/srv/localai/documents/
  inbox/       # user drops new files here
  processing/  # ingestion worker temporary state
  archive/     # successfully ingested source files
  failed/      # files that failed parsing or embedding

Supported first file types:

PDF
TXT
Markdown
DOCX
CSV
images/screenshots/drawing exports where OCR or vision-captioning is useful

Ingestion behavior:

User places files in /srv/localai/documents/inbox.
A systemd timer or path-triggered service detects new files.
The ingestion worker computes a file checksum.
Duplicate files are skipped.
Text is extracted.
OCR or vision-captioning is added for scanned/image-heavy documents when available.
Text is chunked.
Chunks are embedded through local/embed-engineering.
Chunks and vectors are stored in local Postgres + pgvector.
The original file moves to archive/ or failed/.
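The checksum, duplicate-skip, and archive/failed steps above could be sketched roughly as follows. This is an illustration under assumptions, not the actual ingest.py: the names `file_checksum` and `route_file`, the choice of SHA-256, the in-memory `seen` set (standing in for a lookup against the stored checksums), and the `process` callable are all hypothetical. The plan does not say what happens to a duplicate file itself, so the sketch leaves it in place, and it skips the processing/ staging directory for brevity.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def route_file(path: Path, root: Path, seen: set, process) -> str:
    """Ingest one inbox file and return 'skipped', 'indexed', or 'failed'.

    `seen` stands in for the checksums already stored in the database, and
    `process` is a hypothetical callable that extracts, chunks, and embeds.
    """
    checksum = file_checksum(path)
    if checksum in seen:
        return "skipped"  # duplicate content: nothing is re-embedded
    try:
        process(path)
    except Exception:
        target, outcome = root / "failed", "failed"
    else:
        seen.add(checksum)
        target, outcome = root / "archive", "indexed"
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target / path.name))
    return outcome
```

Hashing file content rather than the file name means a renamed copy of an already indexed document is still detected as a duplicate.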

The ingestion worker must update progress continuously so the UI can show how many documents and chunks have been embedded.

Document states:

queued
processing
extracting
chunking
embedding
indexed
failed

Progress fields:

total documents
queued documents
processing documents
indexed documents
failed documents
current file
current stage
chunk count
embedded chunk count
failed chunk count
percent complete
last error

Suggested services:

postgresql.service
localai-embed.service
localai-rag-api.service
localai-rag-ingest.timer
localai-rag-ingest.service

Optional:

localai-rag-ingest.path

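Start with the timer. Path-triggered ingestion can be noisy, for example when it fires while a file is still being copied into inbox/, so enable localai-rag-ingest.path only if that turns out not to be a problem.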

Local Chat Interface

The local chat interface should be served by the local RAG API service.

Suggested endpoint:

http://127.0.0.1:4100

If Tailscale-local access is wanted:

http://100.x.y.z:4100

The current live choice is different: keep the RAG API bound to 127.0.0.1 and require a bearer token on protected routes. Use a browser on Fedora itself or an SSH tunnel from a trusted client instead of direct Tailscale exposure for now.
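A bearer-token check of the kind described here could look like the sketch below. It uses only the standard library; the function name `is_authorized` and the way the expected token is passed in are assumptions, and the real service may rely on framework middleware instead.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True when the header carries the expected bearer token."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest runs in constant time, so response timing does not
    # reveal how many leading characters of a guess were correct.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

A protected route would call this with the incoming `Authorization` header and return HTTP 401 when it yields False, while GET /health skips the check.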

The interface should provide:

document list
ingestion status
ingestion progress meter
current file/stage indicator
chat box
model selector
search mode selector
advanced search filters
source/citation display
retrieved chunk preview for demo transparency
optional image upload for vision questions

The UI does not need to be large. A small FastAPI app with server-rendered HTML or a static frontend is enough for the first demo.

Implemented local status:

Service env/config lives under /etc/localai.
Source files are deployed under /srv/localai/rag.
Document drop folders live under /srv/localai/documents.
The first service binds the RAG API/UI to 127.0.0.1:4100.
localai-rag-ingest.timer runs every two minutes and can also be triggered through POST /documents/ingest.
A smoke document has been indexed and retrieved through hybrid and phrase search.
The UI includes an image upload path backed by POST /vision/chat-with-image; a generated OCR smoke image returned VISION OK.
Bearer auth is enabled for protected RAG routes; GET /health remains public and reports auth_enabled.
Recursive nested-folder ingestion is live, preserving relative folder structure into archive/ and failed/.
A first real corpus pass indexed 9 TextCAD standards PDFs and 21 CVAT job 75 images; DS ISO 1101.pdf is the only known failed standards file from that batch.

Local RAG API

The local RAG API owns retrieval and orchestration.

Suggested bind:

127.0.0.1:4100

Expose it externally only through Caddy or Cloudflare Tunnel.

Minimum endpoints:

GET  /health
GET  /documents
GET  /ingestion/status
GET  /ingestion/events
POST /documents/ingest
POST /search
POST /search/similar
POST /chat
POST /vision/chat

Suggested request flow for /chat:

question
  -> embed question with local/embed-engineering
  -> pgvector top-k search
  -> build prompt with retrieved chunks
  -> call LiteLLM local/qwen-coder or local/qwen-vision-fast
  -> return answer + citations + retrieved chunks
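The "build prompt with retrieved chunks" step in the flow above might be sketched like this. The function name `build_prompt`, the prompt wording, the character budget, and the chunk dict keys (`title`, `page_number`, `content`, chosen to mirror the schema columns) are assumptions for illustration only.

```python
def build_prompt(question: str, chunks: list, max_chars: int = 6000) -> tuple:
    """Return (prompt, citations) for a question and ranked chunk dicts."""
    context_parts = []
    citations = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = (
            f"[{number}] {chunk['title']} (page {chunk['page_number']})\n"
            f"{chunk['content']}"
        )
        if used + len(block) > max_chars:
            break  # stay inside a rough context budget
        context_parts.append(block)
        citations.append(
            {"ref": number, "title": chunk["title"], "page": chunk["page_number"]}
        )
        used += len(block)
    context = "\n\n".join(context_parts)
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return prompt, citations
```

Numbering the sources in the prompt lets the API map each `[n]` in the model's answer back to a citation entry and the matching retrieved chunk.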

Suggested request flow for /search:

query
  -> choose search mode
  -> run semantic, keyword, exact, boolean, regex, metadata, or hybrid search
  -> return matching chunks + document metadata + ranking details

The Railway app can call /search for document search and /chat for full RAG answers. RapidDraft Agent should treat this service as a backend Knowledge tool, not as a separate user-facing chat product.

Sources

/Users/adeelyj/code/local ai server setup/local-ai-stack-repo/LOCALAI_SERVER_PLAN.md
/Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/app.py
/Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/ingest.py

Database Schema Shape

Use UUIDs and store source references for citations.

Initial schema:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_documents (
  id uuid PRIMARY KEY,
  title text NOT NULL,
  source_path text NOT NULL,
  source_uri text,
  mime_type text,
  checksum text UNIQUE,
  status text NOT NULL DEFAULT 'uploaded',
  ingest_stage text,
  chunk_count integer NOT NULL DEFAULT 0,
  embedded_chunk_count integer NOT NULL DEFAULT 0,
  failed_chunk_count integer NOT NULL DEFAULT 0,
  ingest_progress numeric NOT NULL DEFAULT 0,
  ingest_started_at timestamptz,
  ingest_finished_at timestamptz,
  last_error text,
  metadata jsonb NOT NULL DEFAULT '{}',
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE rag_chunks (
  id uuid PRIMARY KEY,
  document_id uuid NOT NULL REFERENCES rag_documents(id) ON DELETE CASCADE,
  chunk_index integer NOT NULL,  -- position of the chunk within its document
  page_number integer,
  section_title text,
  content text NOT NULL,
  content_tsv tsvector,  -- full-text vector for keyword search, e.g. to_tsvector('english', content)
  token_count integer,
  metadata jsonb NOT NULL DEFAULT '{}',
  embedding vector(1024),  -- must match the embedding model's output dimension
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX rag_chunks_document_idx
ON rag_chunks (document_id, chunk_index);

CREATE INDEX rag_chunks_embedding_hnsw
ON rag_chunks
USING hnsw (embedding vector_cosine_ops);

CREATE INDEX rag_chunks_content_tsv_gin
ON rag_chunks
USING gin (content_tsv);

CREATE INDEX rag_chunks_metadata_gin
ON rag_chunks
USING gin (metadata);

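The first implementation uses Qwen3-Embedding-0.6B with vector(1024). pgvector can build HNSW indexes on the vector type only up to 2,000 dimensions, so a 1024-dimension embedding indexes cleanly. Qwen3-Embedding-4B emits up to 2560 dimensions and should be treated as a later option unless the schema changes to a compatible indexed type such as halfvec, which supports HNSW indexes up to 4,000 dimensions.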
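Because the column is declared as vector(1024), an embedding of any other length is rejected at insert time. A small guard in the ingestion worker can catch that earlier with a clearer error. The helper name `to_pgvector_literal` is hypothetical; the bracketed, comma-separated string it returns is pgvector's text input format.

```python
import math


def to_pgvector_literal(embedding: list, expected_dim: int = 1024) -> str:
    """Format an embedding as pgvector text input, e.g. '[0.5,-1.0]'."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    if not all(math.isfinite(value) for value in embedding):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(float(value)) for value in embedding) + "]"
```

The resulting string can be bound as a query parameter and cast with `::vector`, as the retrieval queries in this plan do.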

Ingestion Progress UI

The local chat UI should show ingestion progress live.

Summary display:

Documents: 37 total
Indexed:   31
Processing: 2
Queued:     3
Failed:     1

Current job display:

Current: supplier_manual_v2.pdf
Stage: embedding
Progress: 184 / 240 chunks, 76%

The RAG API should expose:

GET /ingestion/status for polling
GET /ingestion/events for Server-Sent Events if live updates are enabled

The first implementation can poll /ingestion/status; SSE can be added later if the UI needs smoother live updates.
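One possible shape for the polling payload is sketched below, using the progress fields listed earlier. The class name `IngestionStatus`, the method `to_payload`, and the exact field names are assumptions. Rounding down reproduces the 76% shown for 184 of 240 chunks in the example display; the plan does not state whether the real service rounds or truncates.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class IngestionStatus:
    """Hypothetical shape for the GET /ingestion/status payload."""
    total_documents: int = 0
    queued_documents: int = 0
    processing_documents: int = 0
    indexed_documents: int = 0
    failed_documents: int = 0
    current_file: Optional[str] = None
    current_stage: Optional[str] = None
    chunk_count: int = 0
    embedded_chunk_count: int = 0
    failed_chunk_count: int = 0
    last_error: Optional[str] = None

    @property
    def percent_complete(self) -> int:
        """Embedded share of the current file's chunks, rounded down."""
        if self.chunk_count <= 0:
            return 0
        return 100 * self.embedded_chunk_count // self.chunk_count

    def to_payload(self) -> dict:
        """Return a JSON-serializable dict including the derived percentage."""
        payload = asdict(self)
        payload["percent_complete"] = self.percent_complete
        return payload


status = IngestionStatus(
    total_documents=37, indexed_documents=31, processing_documents=2,
    queued_documents=3, failed_documents=1,
    current_file="supplier_manual_v2.pdf", current_stage="embedding",
    chunk_count=240, embedded_chunk_count=184,
)
```

The same payload could later be emitted as the data field of each Server-Sent Event, so polling and SSE clients would share one format.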

Search Strategy

Use hybrid retrieval by default, but expose multiple explicit search modes.

Default mode:

vector similarity with pgvector
PostgreSQL full-text search
reciprocal rank fusion to combine semantic and keyword candidates
metadata filters by document, folder, type, or project
neighbor expansion around high-ranking chunks
source citations in the answer

Additional search modes:

Mode	Implementation	Best for
Semantic	pgvector cosine search	conceptual questions and similar meaning
Keyword	PostgreSQL `tsvector` / `tsquery`	normal word search
Exact phrase	quoted phrase search with indexed text fallback	copied specification language
Boolean	`websearch_to_tsquery` or explicit `to_tsquery`	`AND`, `OR`, `NOT` workflows
Regex	PostgreSQL regex operators `~` / `~*`	part numbers, serials, standards, dimensions
Metadata	`jsonb` and column filters	project, document type, revision, file, tag, date
Hybrid	semantic + keyword + fusion	default search mode
Faceted	grouped counts over metadata/search hits	browsing larger document sets
Similar chunk	vector search from an existing chunk	find passages like this result
Neighbor expansion	add previous/next chunks after retrieval	preserve technical context around a hit
OCR text	same search modes over extracted OCR text	scanned drawings and image-heavy PDFs
Vision-caption	search generated image descriptions	diagrams/photos where OCR is weak

Reason:

Manufacturing and design documents often include exact terms that semantic search can miss:

part numbers
material names
thread sizes
tolerances
standards
machine codes
revision IDs

First implementation target:

Hybrid search as the default.
Exact phrase search.
Boolean keyword search.
Regex search.
Metadata filters.
Neighbor expansion.
Similar chunk search.
OCR/caption search if image extraction is enabled.
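Neighbor expansion, one of the targets above, can be sketched as a pure function over (document_id, chunk_index) pairs. The name `expand_neighbors`, the default window of one chunk on each side, and the assumption that chunk_index starts at 0 are illustrative choices, not confirmed details of the implementation.

```python
def expand_neighbors(hits: list, window: int = 1) -> list:
    """Add previous/next chunk positions around each hit, in first-seen order.

    Each hit is a (document_id, chunk_index) tuple. The result has no
    duplicates, so overlapping windows are fetched only once.
    """
    expanded = []
    seen = set()
    for document_id, chunk_index in hits:
        for index in range(chunk_index - window, chunk_index + window + 1):
            key = (document_id, index)
            if index < 0 or key in seen:
                continue  # skip positions before the first chunk and repeats
            seen.add(key)
            expanded.append(key)
    return expanded
```

The expanded pairs can then be fetched in one query against rag_chunks, which the (document_id, chunk_index) index makes cheap. Positions past the end of a document simply return no row.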

Example vector query:

SELECT
  c.id,
  c.document_id,
  d.title,
  c.page_number,
  c.content,
  1 - (c.embedding <=> $1::vector) AS similarity
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
ORDER BY c.embedding <=> $1::vector
LIMIT 8;
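The `<=>` operator in this query is pgvector's cosine distance, which is why `1 - distance` is reported as similarity. The same arithmetic in plain Python can help spot-check a returned value outside the database. The function name is illustrative, and the sketch assumes non-zero vectors of equal length.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two vectors 45 degrees apart: the similarity is cos(45 degrees).
similarity = 1.0 - cosine_distance([1.0, 0.0], [1.0, 1.0])
```

Ordering by the distance ascending, as the query does, is equivalent to ordering by similarity descending.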

Example keyword query:

SELECT
  c.id,
  c.document_id,
  d.title,
  c.page_number,
  c.content,
  ts_rank_cd(c.content_tsv, websearch_to_tsquery('english', $1)) AS rank
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
WHERE c.content_tsv @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 40;

Example regex patterns useful in manufacturing/design search:

M6x1(\.0)?
ISO\s?2768
6061[- ]T6
[A-Z]{2,5}-\d{3,6}
Rev(?:ision)?\s?[A-Z0-9]+
\d+(\.\d+)?\s?(mm|in|Nm|MPa)
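These patterns can be tried against sample text before they are wired into the regex search mode. The sketch below uses Python's `re` module; the pattern names and the sample sentence are made up for illustration. The patterns stay within syntax that Python and PostgreSQL regular expressions are both expected to accept, but behavior should still be confirmed in PostgreSQL itself.

```python
import re

# Pattern names are illustrative labels, not part of the plan.
PATTERNS = {
    "thread": r"M6x1(\.0)?",
    "standard": r"ISO\s?2768",
    "alloy": r"6061[- ]T6",
    "part_number": r"[A-Z]{2,5}-\d{3,6}",
    "revision": r"Rev(?:ision)?\s?[A-Z0-9]+",
    "dimension": r"\d+(\.\d+)?\s?(mm|in|Nm|MPa)",
}


def find_matches(text: str) -> dict:
    """Return the matched substrings for every pattern that hits."""
    results = {}
    for name, pattern in PATTERNS.items():
        matches = [match.group(0) for match in re.finditer(pattern, text)]
        if matches:
            results[name] = matches
    return results


sample = "Bracket AB-1234 Rev B, 6061-T6, M6x1.0 holes, 12.5 mm per ISO 2768."
found = find_matches(sample)
```

In PostgreSQL, `~` applies a pattern case-sensitively and `~*` case-insensitively, so `rev b` would only match the revision pattern through `~*`.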

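Hybrid ranking should start with reciprocal rank fusion (RRF), where vector_rank and keyword_rank are a chunk's 1-based positions in the vector and keyword result lists and 60 is the commonly used smoothing constant: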

score = 1 / (60 + vector_rank) + 1 / (60 + keyword_rank)

This is simple, robust, and explainable for a company demo.
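A minimal sketch of this fusion step follows, assuming each input list holds unique chunk ids ordered best first. The function name is hypothetical. A chunk that appears in only one list contributes nothing for the other term, which is one common way to handle missing ranks.

```python
def reciprocal_rank_fusion(vector_ids: list, keyword_ids: list, k: int = 60) -> list:
    """Fuse two ranked id lists into (chunk_id, score) pairs, best first.

    Ranks are 1-based. An id missing from one list gets no score from it.
    """
    scores = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"])
```

Here "b" ranks first because it appears in both lists, which is the behavior that makes the ranking easy to explain: agreement between the semantic and keyword searches beats a single top position.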

API Access For Railway App

For the company demo, expose the local RAG API, not local Postgres.

Recommended:

Railway app
  -> https://ai.example.com/rag/search
  -> https://ai.example.com/rag/chat

The local Fedora server should expose only:

RAG API, if the Railway app needs document search/chat
LiteLLM API, only if the Railway app needs direct model calls

Protect access with:

HTTPS
bearer API key
CORS restricted to the Railway app domain
optional Cloudflare Access or Cloudflare Tunnel policy
no public access to raw llama-server ports
no public access to Postgres

Tailscale-only access is enough for local devices, but a Railway-hosted app usually cannot reach a private Tailscale IP unless the Railway service is also joined to the tailnet. For a company demo, a secured HTTPS endpoint is the simpler integration path.
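From the Railway side, a call to the secured endpoint might be built as in the sketch below. It constructs the request without sending it. The JSON field names `query` and `mode`, the function name, and the use of Python for the Railway app are assumptions; ai.example.com is the placeholder host used above.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, api_key: str, query: str, mode: str = "hybrid"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the RAG API's /search route."""
    body = json.dumps({"query": query, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_search_request("https://ai.example.com/rag", "example-key", "ISO 2768")
```

Passing the request to `urllib.request.urlopen` with a timeout would perform the call. The API key belongs in a Railway environment variable rather than in source code.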

Later Railway Vector Database Mode

Second architecture:

Railway web app
  -> Railway object storage / bucket
  -> Railway Postgres + pgvector
  -> secure HTTPS LiteLLM API on Fedora
       -> local/qwen-coder
       -> local/qwen-vision-fast
       -> local/embed-engineering

In this mode:

Users upload documents through the Railway web app.
Original files go to Railway object storage or another S3-compatible bucket.
A Railway worker parses and chunks files.
The Railway worker calls local/embed-engineering on Fedora for embeddings.
Railway Postgres stores chunks and vectors.
The Railway chat endpoint retrieves chunks from Railway Postgres.
The Railway chat endpoint calls LiteLLM on Fedora for generation.

The local RAG API becomes optional in this mode. It can remain useful for local-only demos and offline testing.

Migration Path

To move from local pgvector to Railway pgvector:

Keep the same schema where possible.
Keep the same embedding model and vector dimension.
Export local rag_documents and rag_chunks, or re-ingest from /srv/localai/documents/archive.
Validate search quality on Railway with the same benchmark questions.
Switch the Railway app from local RAG API retrieval to Railway DB retrieval.
Keep Fedora as the inference endpoint.

Re-ingestion is often safer than raw database migration if parsers, OCR, or chunking changed during development.
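The benchmark validation step could compare, for each question, the top results from the local database with those from Railway. The sketch below computes a simple overlap score; the function name and the default k of 8 (borrowed from the vector query's LIMIT) are assumptions. After a re-ingest the chunk UUIDs will differ, so a stable key such as document checksum plus chunk_index is a safer identifier to compare.

```python
def overlap_at_k(local_ids: list, railway_ids: list, k: int = 8) -> float:
    """Fraction of the local top-k results also present in the Railway top-k."""
    local_top = local_ids[:k]
    if not local_top:
        return 1.0  # no local results, so nothing can be missing
    railway_top = set(railway_ids[:k])
    shared = sum(1 for chunk_id in local_top if chunk_id in railway_top)
    return shared / len(local_top)


score = overlap_at_k(["a", "b", "c", "d"], ["b", "a", "x", "d"], k=4)
```

A score well below 1.0 for many questions would suggest that parsing, chunking, or the embedding model differs between the two environments.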

First Implementation Order

Finish local inference baseline:
  local/qwen-coder
  local/qwen-vision-fast
  LiteLLM
Add local embedding service:
  local/embed-engineering
Install local Postgres + pgvector.
Create local RAG schema.
Create /srv/localai/documents directories.
Build ingestion worker.
Build local RAG API and chat UI.
Ingest a small demo document set.
Validate local /search and /chat.
Expose the RAG API securely for Railway if needed.
Add Railway pgvector mode later.

Success Criteria

Local mode is successful when:

files placed in /srv/localai/documents/inbox are ingested automatically or by command
chunks and embeddings appear in local Postgres
local chat UI can answer questions with citations
local search returns relevant chunks
services start after reboot without desktop login
Railway app can call the secured local RAG API when exposure is enabled

Railway mode is successful when:

Railway app handles uploads
Railway pgvector stores chunks and embeddings
Railway app retrieves relevant chunks
Railway app calls local inference securely
answers include citations
local Fedora server exposes inference but not raw backend ports