API Usage Guide¶
Last updated: 2026-06-02
API Surfaces¶
There are two API surfaces:
| API | URL | Auth | Intended access |
|---|---|---|---|
| LiteLLM | http://127.0.0.1:4000/v1 locally, https://localai.rapiddraft.ai/v1 for backend-to-backend access | bearer key | main model API |
| RAG API | http://127.0.0.1:4100 locally, https://knowledge.rapiddraft.ai for backend-to-backend access | bearer key on protected routes | Knowledge/search/cited answer API |
LiteLLM is OpenAI-compatible. Use it for backend integrations.
The RAG API is a FastAPI app for document search, RAG chat, and the local chat UI. Protected routes require a bearer key on the live server. Remote backend callers should use the Cloudflare endpoint (https://knowledge.rapiddraft.ai); browser clients should not call that endpoint directly.
Secret Handling¶
Do not paste real keys into wiki pages.
For local shell tests on the server, load the bearer key without printing it:
sudo bash -c '
set -a
source /etc/localai/localai.env
set +a
TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'
For a remote client, set the key in that client's environment through the normal secret-management path, then call the server's Tailscale address:
export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='http://<server-tailnet-ip>:4000/v1'
Never commit or document the real key value.
RAG key handling follows the same rule. Load it from the local environment or the saved server-side key file, but never paste the real value into docs:
export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='http://127.0.0.1:4100'
For Railway or another backend caller:
export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='https://localai.rapiddraft.ai/v1'
export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='https://knowledge.rapiddraft.ai'
These variables must stay backend-only.
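A backend process can read these variables once at startup and fail fast if a key is missing. The following is a minimal sketch, not part of the stack: `Settings` and `load_settings` are hypothetical names, and the localhost defaults are taken from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class Settings:
    litellm_api_key: str
    litellm_base_url: str
    rag_api_key: str
    rag_base_url: str

    def __repr__(self) -> str:
        # Keep key values out of logs and tracebacks.
        return (
            f"Settings(litellm_base_url={self.litellm_base_url!r}, "
            f"rag_base_url={self.rag_base_url!r}, keys=<redacted>)"
        )


def load_settings(env=None) -> Settings:
    """Read the four variables; raise if either key is missing or empty."""
    env = os.environ if env is None else env
    required = ("LITELLM_API_KEY", "LOCALAI_RAG_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return Settings(
        litellm_api_key=env["LITELLM_API_KEY"],
        litellm_base_url=env.get(
            "LITELLM_BASE_URL", "http://127.0.0.1:4000/v1"
        ).rstrip("/"),
        rag_api_key=env["LOCALAI_RAG_API_KEY"],
        rag_base_url=env.get(
            "LOCALAI_RAG_BASE_URL", "http://127.0.0.1:4100"
        ).rstrip("/"),
    )
```

Overriding `__repr__` is a small guard against a key ending up in a log line when the settings object is printed.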
LiteLLM Health¶
Unauthenticated LiteLLM health checks return 401. That is expected.
Authenticated health check:
sudo bash -c '
set -a
source /etc/localai/localai.env
set +a
TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'
Expected summary:
healthy_count: 3
unhealthy_count: 0
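A monitoring script could compare the parsed `/health` body against the summary above. This is a sketch under the assumption that `healthy_count` and `unhealthy_count` are top-level keys of the JSON response; `summarize_health` is a hypothetical helper name.

```python
def summarize_health(payload: dict, expected_healthy: int = 3) -> tuple[bool, str]:
    """Return (ok, summary) for a parsed LiteLLM /health response body."""
    healthy = int(payload.get("healthy_count", 0))
    unhealthy = int(payload.get("unhealthy_count", 0))
    # Healthy only when the count matches the documented value
    # and nothing is reported as unhealthy.
    ok = healthy == expected_healthy and unhealthy == 0
    return ok, f"healthy_count={healthy} unhealthy_count={unhealthy}"
```

If the number of configured models changes, pass the new value as `expected_healthy` rather than editing the function.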
LiteLLM Chat Completion¶
Call local/qwen-coder:
sudo bash -c '
set -a
source /etc/localai/localai.env
set +a
TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
curl -fsS http://127.0.0.1:4000/v1/chat/completions \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '"'"'{
"model": "local/qwen-coder",
"messages": [
{"role": "user", "content": "Reply with a one sentence status check."}
],
"max_tokens": 80
}'"'"'
'
Remote client shape:
curl -fsS "${LITELLM_BASE_URL}/chat/completions" \
-H "Authorization: Bearer ${LITELLM_API_KEY}" \
-H 'Content-Type: application/json' \
-d '{
"model": "local/qwen-coder",
"messages": [
{"role": "user", "content": "Reply with a one sentence status check."}
],
"max_tokens": 80
}'
Cloudflare-backed remote shape:
curl -fsS "https://localai.rapiddraft.ai/v1/chat/completions" \
-H "Authorization: Bearer ${LITELLM_API_KEY}" \
-H 'Content-Type: application/json' \
-d '{
"model": "local/qwen-coder",
"messages": [
{"role": "user", "content": "Reply with a one sentence status check."}
],
"max_tokens": 80
}'
LiteLLM Embeddings¶
sudo bash -c '
set -a
source /etc/localai/localai.env
set +a
TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
curl -fsS http://127.0.0.1:4000/v1/embeddings \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '"'"'{
"model": "local/embed-engineering",
"input": "fixture clamp torque specification"
}'"'"'
'
Expected embedding dimensionality:
1024
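The dimensionality can be checked on the client side before vectors are stored. The sketch below assumes the response follows the standard OpenAI embeddings shape (a `data` list whose items carry an `embedding` array); `check_embedding_dims` is a hypothetical helper name.

```python
EXPECTED_DIMS = 1024  # documented dimensionality for local/embed-engineering


def check_embedding_dims(response: dict, expected: int = EXPECTED_DIMS) -> list[int]:
    """Return the length of each embedding; raise if any differs from expected."""
    dims = [len(item["embedding"]) for item in response["data"]]
    if not dims:
        raise ValueError("response contains no embeddings")
    wrong = [d for d in dims if d != expected]
    if wrong:
        raise ValueError(f"expected {expected} dimensions, got {wrong}")
    return dims
```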
RAG API Endpoints¶
Public endpoints:
GET /
GET /chat
GET /health
GET /static/*
Protected endpoints when LOCALAI_RAG_API_KEY is set:
GET /documents
GET /ingestion/status
GET /ingestion/events
POST /documents/ingest
POST /search
POST /search/similar
POST /chat
POST /vision/chat
POST /vision/chat-with-image
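Because `/chat` is public for GET but protected for POST, a client has to key on the method as well as the path. The route lists above can be encoded as a small lookup; this is an illustrative sketch with a hypothetical `requires_bearer` helper, and it only applies when `LOCALAI_RAG_API_KEY` is set on the server.

```python
PUBLIC_ROUTES = {("GET", "/"), ("GET", "/chat"), ("GET", "/health")}

PROTECTED_ROUTES = {
    ("GET", "/documents"),
    ("GET", "/ingestion/status"),
    ("GET", "/ingestion/events"),
    ("POST", "/documents/ingest"),
    ("POST", "/search"),
    ("POST", "/search/similar"),
    ("POST", "/chat"),
    ("POST", "/vision/chat"),
    ("POST", "/vision/chat-with-image"),
}


def requires_bearer(method: str, path: str) -> bool:
    """Return True if the documented route needs an Authorization header."""
    key = (method.upper(), path)
    if key in PROTECTED_ROUTES:
        return True
    if key in PUBLIC_ROUTES or (key[0] == "GET" and path.startswith("/static/")):
        return False
    # Routes not listed in this guide are rejected rather than guessed at.
    raise ValueError(f"undocumented route: {key[0]} {path}")
```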
RAG Health¶
curl -fsS http://127.0.0.1:4100/health
Expected response:
{"ok": true, "auth_enabled": true}
If auth_enabled is false, the bearer gate is disabled for protected routes.
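A deployment check might treat a disabled bearer gate as a failure, not only a down service. A minimal sketch, assuming the parsed response has exactly the two fields shown above; `rag_health_problems` is a hypothetical name.

```python
def rag_health_problems(payload: dict) -> list[str]:
    """Return a list of problems found in a parsed RAG /health response."""
    problems = []
    if payload.get("ok") is not True:
        problems.append("service did not report ok: true")
    if payload.get("auth_enabled") is not True:
        problems.append("bearer gate is disabled for protected routes")
    return problems
```

An empty list means the response matches the expected one.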
RAG Chat¶
Ask a question using local documents:
curl -fsS http://127.0.0.1:4100/chat \
-H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
-H 'Content-Type: application/json' \
-d '{
"question": "What do the local documents say about fixture clamp torque?",
"model": "local/qwen-coder",
"search_mode": "hybrid",
"use_documents": true,
"limit": 8
}'
Response shape:
{
"answer": "text with bracketed citations",
"sources": [
{
"id": "chunk uuid",
"document_id": "document uuid",
"title": "source filename",
"chunk_index": 0,
"page_number": null,
"source_kind": "text",
"content": "retrieved chunk text",
"metadata": {},
"score": 0.0
}
]
}
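A caller will usually want to show the retrieved sources next to the answer. The sketch below turns the `sources` array into display lines using only the fields in the shape above; `format_sources` is a hypothetical helper, and its numbering is plain list position, which is not guaranteed to match the bracketed citations in `answer`.

```python
def format_sources(response: dict) -> list[str]:
    """Build one display line per retrieved source chunk."""
    lines = []
    for position, src in enumerate(response.get("sources", []), start=1):
        page = src.get("page_number")
        # page_number may be null, so fall back to the chunk index.
        where = f"page {page}" if page is not None else f"chunk {src['chunk_index']}"
        lines.append(f"{position}. {src['title']} ({where}, score {src['score']:.3f})")
    return lines
```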
Ask without document retrieval:
curl -fsS http://127.0.0.1:4100/chat \
-H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
-H 'Content-Type: application/json' \
-d '{
"question": "Give a short local model status reply.",
"use_documents": false
}'
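The same two calls can be made from Python with only the standard library. This is a sketch rather than an official client: `build_chat_request` and `ask` are hypothetical names, the 120-second timeout is an arbitrary choice, and the optional fields mirror the curl examples above.

```python
import json
import os
import urllib.request


def build_chat_request(question, *, base_url, api_key, use_documents=True,
                       model=None, search_mode=None, limit=None):
    """Build (but do not send) a POST /chat request for the RAG API."""
    body = {"question": question, "use_documents": use_documents}
    # Only include optional fields the caller actually set.
    for name, value in (("model", model), ("search_mode", search_mode), ("limit", limit)):
        if value is not None:
            body[name] = value
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(question, **options):
    """Send the request and return the parsed JSON response."""
    request = build_chat_request(
        question,
        base_url=os.environ.get("LOCALAI_RAG_BASE_URL", "http://127.0.0.1:4100"),
        api_key=os.environ["LOCALAI_RAG_API_KEY"],
        **options,
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

For example, `ask("Give a short local model status reply.", use_documents=False)` corresponds to the second curl call. Keeping request construction separate from sending makes the payload easy to inspect without a running server.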
Vision Chat¶
Text-only vision route through the RAG API:
curl -fsS http://127.0.0.1:4100/vision/chat \
-H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
-H 'Content-Type: application/json' \
-d '{
"question": "Reply with a short status check.",
"use_documents": false
}'
Image upload:
curl -fsS http://127.0.0.1:4100/vision/chat-with-image \
-H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
-F 'question=Describe the image and identify any visible text.' \
-F 'image=@/path/to/image.png'
The vision route uses local/qwen-vision-fast.
RAG Frontend¶
To test on the Fedora machine itself:
- Open http://127.0.0.1:4100 in a browser.
- Paste the saved RAG key into the RAG API key field.
- Click Use Key.
- Use the normal chat box for questions.
Do not paste the key into the chat message field. The page stores the token locally in the browser and sends it with subsequent protected API calls.
Python Client Example¶
Use the OpenAI Python client against LiteLLM:
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["LITELLM_API_KEY"],
base_url=os.environ.get("LITELLM_BASE_URL", "http://127.0.0.1:4000/v1"),
)
response = client.chat.completions.create(
model="local/qwen-coder",
messages=[{"role": "user", "content": "Return one short status sentence."}],
max_tokens=80,
)
print(response.choices[0].message.content)
For a remote Tailscale client, set LITELLM_BASE_URL to the server's tailnet URL on port 4000, including the /v1 suffix (for example http://<server-tailnet-ip>:4000/v1).
Sources¶
- /Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/app.py
- /Users/adeelyj/code/local ai server setup/LOCAL_RAG_API_HELPERS.md
- /Users/adeelyj/code/local ai server setup/local-ai-stack-repo/HANDOFF.md