RAG Deployment Plan¶
Last updated: 2026-05-28
Goal¶
Split document search and chat into two deployment modes:
- Full local RAG deployment:
  - local vector database
  - local document ingestion
  - local inference
  - local chat interface
  - optional secure API for the Railway web app
- Later Railway RAG deployment:
  - Railway Postgres with pgvector
  - Railway-hosted upload and retrieval workflow
  - local inference on the Fedora server
  - Railway web app calls the local inference API
This lets the demo prove document ingestion and question answering locally first, then move the database and upload workflow into Railway later.
Recommended First Architecture¶
Use local Postgres + pgvector as the first vector database.
Reason:
- It matches the later Railway Postgres + pgvector target.
- The schema and retrieval queries can move with minimal changes.
- It avoids introducing a separate vector database API just for the local phase.
First local stack:
/srv/localai/documents/inbox
-> local ingestion worker
-> local Postgres + pgvector
-> local RAG API
-> local chat UI
-> local LiteLLM
-> local/qwen-coder
-> local/qwen-vision-fast
-> local/embed-engineering
External demo access:
Railway web app
-> secure HTTPS RAG API on Fedora
-> local Postgres + pgvector
-> local LiteLLM
The Railway app should not connect directly to local Postgres.
Local Document Workflow¶
The first local workflow is file-drop based.
Directory layout:
/srv/localai/documents/
inbox/ # user drops new files here
processing/ # ingestion worker temporary state
archive/ # successfully ingested source files
failed/ # files that failed parsing or embedding
Supported first file types:
- PDF
- TXT
- Markdown
- DOCX
- CSV
- images/screenshots/drawing exports where OCR or vision-captioning is useful
Ingestion behavior:
- User places files in /srv/localai/documents/inbox.
- A systemd timer or path-triggered service detects new files.
- The ingestion worker computes a file checksum.
- Duplicate files are skipped.
- Text is extracted.
- OCR or vision-captioning is added for scanned/image-heavy documents when available.
- Text is chunked.
- Chunks are embedded through local/embed-engineering.
- Chunks and vectors are stored in local Postgres + pgvector.
- The original file moves to archive/ or failed/.
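The checksum, duplicate-skip, and archive/failed steps above can be sketched as follows. This is a minimal illustration, not the deployed worker: SHA-256, the in-memory `seen` set, and the names `file_checksum`, `route_file`, and `ingest` are assumptions. In the plan, checksums would be checked against the UNIQUE checksum column of rag_documents instead of a Python set.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def route_file(path: Path, seen: set, archive: Path, failed: Path, ingest) -> str:
    """Ingest one file, then move it to archive/ or failed/.

    `ingest` is a hypothetical callable standing in for extract/chunk/embed.
    Returns "skipped", "indexed", or "failed".
    """
    checksum = file_checksum(path)
    if checksum in seen:
        return "skipped"  # duplicate content; the file is left where it is
    try:
        ingest(path)
    except Exception:
        failed.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(failed / path.name))
        return "failed"
    seen.add(checksum)
    archive.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(archive / path.name))
    return "indexed"
```

Hashing file content rather than file names means a renamed copy of an already indexed document is still detected as a duplicate.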
The ingestion worker must update progress continuously so the UI can show how many documents and chunks have been embedded.
Document states:
- queued
- processing
- extracting
- chunking
- embedding
- indexed
- failed
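One way to keep these states consistent between the worker and the UI is a small enum with an explicit transition check. The sketch below is an assumption about ordering: the plan lists the states but does not define which transitions are legal, so the linear happy path and the rule that any non-terminal state may move to failed are illustrative only.

```python
from enum import Enum


class DocumentState(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    EXTRACTING = "extracting"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    INDEXED = "indexed"
    FAILED = "failed"


# Assumed happy-path order; not specified by the plan itself.
_HAPPY_PATH = [
    DocumentState.QUEUED,
    DocumentState.PROCESSING,
    DocumentState.EXTRACTING,
    DocumentState.CHUNKING,
    DocumentState.EMBEDDING,
    DocumentState.INDEXED,
]


def can_transition(current: DocumentState, target: DocumentState) -> bool:
    """Allow one step forward on the happy path, or a move to FAILED."""
    if current in (DocumentState.INDEXED, DocumentState.FAILED):
        return False  # terminal states
    if target is DocumentState.FAILED:
        return True
    return _HAPPY_PATH.index(target) == _HAPPY_PATH.index(current) + 1
```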
Progress fields:
- total documents
- queued documents
- processing documents
- indexed documents
- failed documents
- current file
- current stage
- chunk count
- embedded chunk count
- failed chunk count
- percent complete
- last error
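The progress fields map naturally onto a small record that the API can serialize for the UI. The following is a sketch under assumptions: the class name `IngestionProgress` is hypothetical, and percent complete is computed here from the current file's chunk counts with integer truncation, which the plan does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IngestionProgress:
    total_documents: int = 0
    queued_documents: int = 0
    processing_documents: int = 0
    indexed_documents: int = 0
    failed_documents: int = 0
    current_file: Optional[str] = None
    current_stage: Optional[str] = None
    chunk_count: int = 0
    embedded_chunk_count: int = 0
    failed_chunk_count: int = 0
    last_error: Optional[str] = None

    @property
    def percent_complete(self) -> int:
        """Embedded share of the current file's chunks, truncated to a whole percent."""
        if self.chunk_count == 0:
            return 0
        return 100 * self.embedded_chunk_count // self.chunk_count
```

With 184 of 240 chunks embedded, `percent_complete` evaluates to 76.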
Suggested services:
postgresql.service
localai-embed.service
localai-rag-api.service
localai-rag-ingest.timer
localai-rag-ingest.service
Optional:
localai-rag-ingest.path
Use a timer first if path-triggered ingestion is noisy.
Local Chat Interface¶
The local chat interface should be served by the local RAG API service.
Suggested endpoint:
http://127.0.0.1:4100
If Tailscale-local access is wanted:
http://100.x.y.z:4100
The current live choice is different: keep the RAG API bound to 127.0.0.1 and require a bearer token on protected routes. Use a browser on Fedora itself or an SSH tunnel from a trusted client instead of direct Tailscale exposure for now.
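A bearer-token check for protected routes can be small. The sketch below is illustrative and framework-independent; the function name, the `PUBLIC_PATHS` set, and the way the expected token is passed in are assumptions, not the deployed code. It uses a constant-time comparison so that response timing does not leak how much of a token matched.

```python
import hmac
from typing import Optional

# Assumed set of routes that skip authentication.
PUBLIC_PATHS = {"/health"}


def is_authorized(path: str, authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True if the request may proceed."""
    if path in PUBLIC_PATHS:
        return True
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids early-exit string comparison.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```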
The interface should provide:
- document list
- ingestion status
- ingestion progress meter
- current file/stage indicator
- chat box
- model selector
- search mode selector
- advanced search filters
- source/citation display
- retrieved chunk preview for demo transparency
- optional image upload for vision questions
The UI does not need to be large. A small FastAPI app with server-rendered HTML or a static frontend is enough for the first demo.
Implemented local status:
- Service env/config lives under /etc/localai.
- Source files are deployed under /srv/localai/rag.
- Document drop folders live under /srv/localai/documents.
- The first service binds the RAG API/UI to 127.0.0.1:4100.
- localai-rag-ingest.timer runs every two minutes and can also be triggered through POST /documents/ingest.
- A smoke document has been indexed and retrieved through hybrid and phrase search.
- The UI includes an image upload path backed by POST /vision/chat-with-image; a generated OCR smoke image returned VISION OK.
- Bearer auth is enabled for protected RAG routes; GET /health remains public and reports auth_enabled.
- Recursive nested-folder ingestion is live, preserving relative folder structure into archive/ and failed/.
- A first real corpus pass indexed 9 TextCAD standards PDFs and 21 CVAT job 75 images; DS ISO 1101.pdf is the only known failed standards file from that batch.
Local RAG API¶
The local RAG API owns retrieval and orchestration.
Suggested bind:
127.0.0.1:4100
Expose it externally only through Caddy or Cloudflare Tunnel.
Minimum endpoints:
GET /health
GET /documents
GET /ingestion/status
GET /ingestion/events
POST /documents/ingest
POST /search
POST /search/similar
POST /chat
POST /vision/chat
Suggested request flow for /chat:
question
-> embed question with local/embed-engineering
-> pgvector top-k search
-> build prompt with retrieved chunks
-> call LiteLLM local/qwen-coder or local/qwen-vision-fast
-> return answer + citations + retrieved chunks
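The "build prompt with retrieved chunks" step can be sketched as below. The prompt wording and the function name `build_prompt` are assumptions; the dictionary keys mirror the title, page_number, and content columns used elsewhere in this plan. Numbering the chunks lets the model cite sources as [n], which the API can map back to documents.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict]) -> str:
    """Number each retrieved chunk so the answer can cite it as [n]."""
    lines = ["Answer using only the sources below. Cite sources as [n].", ""]
    for number, chunk in enumerate(chunks, start=1):
        page = chunk.get("page_number")
        if page is not None:
            location = f"{chunk['title']}, p. {page}"
        else:
            location = chunk["title"]
        lines.append(f"[{number}] ({location})")
        lines.append(chunk["content"].strip())
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```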
Suggested request flow for /search:
query
-> choose search mode
-> run semantic, keyword, exact, boolean, regex, metadata, or hybrid search
-> return matching chunks + document metadata + ranking details
The Railway app can call /search for document search and /chat for full RAG answers.
RapidDraft Agent should treat this service as a backend Knowledge tool, not as a separate user-facing chat product.
Sources¶
- /Users/adeelyj/code/local ai server setup/local-ai-stack-repo/LOCALAI_SERVER_PLAN.md
- /Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/app.py
- /Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/ingest.py
Database Schema Shape¶
Use UUIDs and store source references for citations.
Initial schema:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE rag_documents (
id uuid PRIMARY KEY,
title text NOT NULL,
source_path text NOT NULL,
source_uri text,
mime_type text,
checksum text UNIQUE,
status text NOT NULL DEFAULT 'uploaded',
ingest_stage text,
chunk_count integer NOT NULL DEFAULT 0,
embedded_chunk_count integer NOT NULL DEFAULT 0,
failed_chunk_count integer NOT NULL DEFAULT 0,
ingest_progress numeric NOT NULL DEFAULT 0,
ingest_started_at timestamptz,
ingest_finished_at timestamptz,
last_error text,
metadata jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE rag_chunks (
id uuid PRIMARY KEY,
document_id uuid NOT NULL REFERENCES rag_documents(id) ON DELETE CASCADE,
chunk_index integer NOT NULL,
page_number integer,
section_title text,
content text NOT NULL,
content_tsv tsvector,
token_count integer,
metadata jsonb NOT NULL DEFAULT '{}',
embedding vector(1024),
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX rag_chunks_document_idx
ON rag_chunks (document_id, chunk_index);
CREATE INDEX rag_chunks_embedding_hnsw
ON rag_chunks
USING hnsw (embedding vector_cosine_ops);
CREATE INDEX rag_chunks_content_tsv_gin
ON rag_chunks
USING gin (content_tsv);
CREATE INDEX rag_chunks_metadata_gin
ON rag_chunks
USING gin (metadata);
The first implementation uses Qwen3-Embedding-0.6B and vector(1024) so pgvector HNSW indexing works cleanly. pgvector HNSW indexes on the vector type are limited to 2,000 dimensions, so Qwen3-Embedding-4B, which emits up to 2560 dimensions, should be treated as a later option unless the schema changes to a compatible indexed type such as halfvec.
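The dimension constraint can be captured in a small helper that picks a column type an HNSW index can cover. The limits used here (2,000 dimensions for vector, 4,000 for halfvec) are the ones documented by pgvector at the time of writing and should be checked against the installed version; the helper name is hypothetical.

```python
# Maximum dimensions an HNSW index supports per pgvector column type.
# Assumed from pgvector documentation; verify for the installed release.
HNSW_MAX_DIMS = {"vector": 2000, "halfvec": 4000}


def indexable_column_type(dimensions: int) -> str:
    """Return a column type such as 'vector(1024)' that HNSW can index."""
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    for type_name in ("vector", "halfvec"):
        if dimensions <= HNSW_MAX_DIMS[type_name]:
            return f"{type_name}({dimensions})"
    raise ValueError(f"{dimensions} dimensions exceed HNSW limits for vector and halfvec")
```

For the two models mentioned above, this returns `vector(1024)` for the 0.6B model and `halfvec(2560)` for the full 4B output size.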
Ingestion Progress UI¶
The local chat UI should show ingestion progress live.
Summary display:
Documents: 37 total
Indexed: 31
Processing: 2
Queued: 3
Failed: 1
Current job display:
Current: supplier_manual_v2.pdf
Stage: embedding
Progress: 184 / 240 chunks, 76%
The RAG API should expose:
- GET /ingestion/status for polling
- GET /ingestion/events for Server-Sent Events if live updates are enabled
The first implementation can poll /ingestion/status; SSE can be added if the UI needs smoother live updates.
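If SSE is added later, each update is a short text frame. The sketch below formats one frame in the Server-Sent Events wire format; the event name `progress` and the payload keys are assumptions chosen for illustration.

```python
import json


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    # json.dumps escapes newlines inside strings, so the data stays on one line.
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"
```

A streaming response would yield these frames with the `text/event-stream` content type; a polling client can receive the same payload as plain JSON from /ingestion/status.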
Search Strategy¶
Use hybrid retrieval by default, but expose multiple explicit search modes.
Default mode:
- vector similarity with pgvector
- PostgreSQL full-text search
- reciprocal rank fusion to combine semantic and keyword candidates
- metadata filters by document, folder, type, or project
- neighbor expansion around high-ranking chunks
- source citations in the answer
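Neighbor expansion from the list above can be sketched as a pure function over chunk indexes within one document. This assumes chunk_index is 0-based and contiguous per document, which the plan does not state; the function name and the `window` parameter are illustrative.

```python
from typing import Iterable, List


def expand_neighbors(hit_indexes: Iterable[int], chunk_count: int, window: int = 1) -> List[int]:
    """Return hit indexes plus `window` chunks on each side, clipped to the document."""
    expanded = set()
    for index in hit_indexes:
        for neighbor in range(index - window, index + window + 1):
            if 0 <= neighbor < chunk_count:
                expanded.add(neighbor)
    return sorted(expanded)
```

The returned indexes can then be fetched with a query on (document_id, chunk_index), which the rag_chunks_document_idx index covers.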
Additional search modes:
| Mode | Implementation | Best for |
|---|---|---|
| Semantic | pgvector cosine search | conceptual questions and similar meaning |
| Keyword | PostgreSQL tsvector / tsquery | normal word search |
| Exact phrase | quoted phrase search with indexed text fallback | copied specification language |
| Boolean | websearch_to_tsquery or explicit to_tsquery | AND, OR, NOT workflows |
| Regex | PostgreSQL regex operators ~ / ~* | part numbers, serials, standards, dimensions |
| Metadata | jsonb and column filters | project, document type, revision, file, tag, date |
| Hybrid | semantic + keyword + fusion | default search mode |
| Faceted | grouped counts over metadata/search hits | browsing larger document sets |
| Similar chunk | vector search from an existing chunk | find passages like this result |
| Neighbor expansion | add previous/next chunks after retrieval | preserve technical context around a hit |
| OCR text | same search modes over extracted OCR text | scanned drawings and image-heavy PDFs |
| Vision-caption | search generated image descriptions | diagrams/photos where OCR is weak |
Reason:
Manufacturing and design documents often include exact terms that semantic search can miss:
- part numbers
- material names
- thread sizes
- tolerances
- standards
- machine codes
- revision IDs
First implementation target:
- Hybrid search as the default.
- Exact phrase search.
- Boolean keyword search.
- Regex search.
- Metadata filters.
- Neighbor expansion.
- Similar chunk search.
- OCR/caption search if image extraction is enabled.
Example vector query:
SELECT
c.id,
c.document_id,
d.title,
c.page_number,
c.content,
1 - (c.embedding <=> $1::vector) AS similarity
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
ORDER BY c.embedding <=> $1::vector
LIMIT 8;
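The $1 parameter in the vector query is cast from text, so the client can pass the embedding in pgvector's bracketed text form. The helper below is a sketch; its name and the validation rules are assumptions, and a database driver with native pgvector support could replace it.

```python
import math
from typing import Sequence


def to_pgvector_literal(embedding: Sequence[float], dimensions: int = 1024) -> str:
    """Format an embedding as pgvector text input, for example '[0.5,-1.0,2.25]'."""
    if len(embedding) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(embedding)}")
    if not all(math.isfinite(value) for value in embedding):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(float(value)) for value in embedding) + "]"
```

Checking the length before the query fails fast when a different embedding model is configured by mistake, instead of surfacing as a database error.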
Example keyword query:
SELECT
c.id,
c.document_id,
d.title,
c.page_number,
c.content,
ts_rank_cd(c.content_tsv, websearch_to_tsquery('english', $1)) AS rank
FROM rag_chunks c
JOIN rag_documents d ON d.id = c.document_id
WHERE c.content_tsv @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 40;
Example regex patterns useful in manufacturing/design search:
M6x1(\.0)?
ISO\s?2768
6061[- ]T6
[A-Z]{2,5}-\d{3,6}
Rev(?:ision)?\s?[A-Z0-9]+
\d+(\.\d+)?\s?(mm|in|Nm|MPa)
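These patterns can be sanity-checked outside the database before they are wired into regex search. The sketch below runs them with Python's re module, which is an assumption of convenience: PostgreSQL and Python regex flavors differ in places, so patterns should still be verified with the ~ operator in Postgres. The labels and sample text are invented for illustration.

```python
import re
from typing import List

PATTERNS = {
    "thread": r"M6x1(\.0)?",
    "standard": r"ISO\s?2768",
    "alloy": r"6061[- ]T6",
    "part_number": r"[A-Z]{2,5}-\d{3,6}",
    "revision": r"Rev(?:ision)?\s?[A-Z0-9]+",
    "dimension": r"\d+(\.\d+)?\s?(mm|in|Nm|MPa)",
}


def matching_labels(text: str) -> List[str]:
    """Return the labels of all patterns that match somewhere in the text."""
    return [label for label, pattern in PATTERNS.items() if re.search(pattern, text)]
```

These searches are case-sensitive, like the PostgreSQL ~ operator; ~* corresponds to passing `re.IGNORECASE`.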
Hybrid ranking should use reciprocal rank fusion first:
score = 1 / (60 + vector_rank) + 1 / (60 + keyword_rank)
This is simple, robust, and explainable for a company demo.
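The fusion formula can be sketched in a few lines. In this illustration a chunk that appears in only one ranking simply receives no contribution from the other, which is a common convention but an assumption here; the function name is hypothetical and ranks start at 1.

```python
from collections import defaultdict
from typing import List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of chunk IDs; each list adds 1 / (k + rank) per chunk."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For a vector ranking of a, b, c and a keyword ranking of b, d, a, the fused order is b, a, d, c: chunks found by both searches rise above chunks found by only one.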
API Access For Railway App¶
For the company demo, expose the local RAG API, not local Postgres.
Recommended:
Railway app
-> https://ai.example.com/rag/search
-> https://ai.example.com/rag/chat
The local Fedora server should expose only:
- RAG API, if the Railway app needs document search/chat
- LiteLLM API, only if the Railway app needs direct model calls
Protect access with:
- HTTPS
- bearer API key
- CORS restricted to the Railway app domain
- optional Cloudflare Access or Cloudflare Tunnel policy
- no public access to raw llama-server ports
- no public access to Postgres
Tailscale-only access is enough for local devices, but a Railway-hosted app cannot usually call a private Tailscale IP unless the Railway service is also joined to the tailnet. For a company demo, a secured HTTPS endpoint is the simpler integration path.
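From the Railway side, a call to the secured endpoint is an HTTPS POST with a bearer token. The sketch below only builds the request and does not send it; the JSON field names (query, mode, limit) are assumptions because the plan does not define the request body, and ai.example.com is the placeholder host used above.

```python
import json
import urllib.request


def build_search_request(base_url: str, token: str, query: str,
                         mode: str = "hybrid", limit: int = 8) -> urllib.request.Request:
    """Build a POST request for the RAG /search endpoint (body fields are assumed)."""
    body = json.dumps({"query": query, "mode": mode, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

The token should come from a Railway environment variable, not from source code, and the request would be sent with `urllib.request.urlopen(request, timeout=...)`.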
Later Railway Vector Database Mode¶
Second architecture:
Railway web app
-> Railway object storage / bucket
-> Railway Postgres + pgvector
-> secure HTTPS LiteLLM API on Fedora
-> local/qwen-coder
-> local/qwen-vision-fast
-> local/embed-engineering
In this mode:
- Users upload documents through the Railway web app.
- Original files go to Railway object storage or another S3-compatible bucket.
- Railway worker parses and chunks files.
- Railway worker calls Fedora local/embed-engineering for embeddings.
- Railway Postgres stores chunks and vectors.
- Railway chat endpoint retrieves chunks from Railway Postgres.
- Railway chat endpoint calls Fedora LiteLLM for generation.
The local RAG API becomes optional in this mode. It can remain useful for local-only demos and offline testing.
Migration Path¶
To move from local pgvector to Railway pgvector:
- Keep the same schema where possible.
- Keep the same embedding model and vector dimension.
- Export local rag_documents and rag_chunks, or re-ingest from /srv/localai/documents/archive.
- Validate search quality on Railway with the same benchmark questions.
- Switch the Railway app from local RAG API retrieval to Railway DB retrieval.
- Keep Fedora as the inference endpoint.
Re-ingestion is often safer than raw database migration if parsers, OCR, or chunking changed during development.
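Validating search quality with the same benchmark questions can start with a simple overlap measure between local and Railway results. This is a sketch under assumptions: the metric and function name are illustrative, and because re-ingestion generates new UUIDs, the compared keys should be stable identifiers such as a document checksum combined with chunk_index.

```python
from typing import List


def overlap_at_k(local_keys: List[str], railway_keys: List[str], k: int = 8) -> float:
    """Fraction of the local top-k results that also appear in the Railway top-k."""
    local_top = set(local_keys[:k])
    railway_top = set(railway_keys[:k])
    if not local_top:
        return 1.0 if not railway_top else 0.0
    return len(local_top & railway_top) / len(local_top)
```

A low overlap for a benchmark question does not prove that Railway results are worse, but it flags the question for manual review.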
First Implementation Order¶
- Finish local inference baseline:
  - local/qwen-coder
  - local/qwen-vision-fast
  - LiteLLM
- Add local embedding service:
  - local/embed-engineering
- Install local Postgres + pgvector.
- Create local RAG schema.
- Create /srv/localai/documents directories.
- Build ingestion worker.
- Build local RAG API and chat UI.
- Ingest a small demo document set.
- Validate local /search and /chat.
- Expose the RAG API securely for Railway if needed.
- Add Railway pgvector mode later.
Success Criteria¶
Local mode is successful when:
- files placed in /srv/localai/documents/inbox are ingested automatically or by command
- chunks and embeddings appear in local Postgres
- local chat UI can answer questions with citations
- local search returns relevant chunks
- services start after reboot without desktop login
- Railway app can call the secured local RAG API when exposure is enabled
Railway mode is successful when:
- Railway app handles uploads
- Railway pgvector stores chunks and embeddings
- Railway app retrieves relevant chunks
- Railway app calls local inference securely
- answers include citations
- local Fedora server exposes inference but not raw backend ports