API Usage Guide

Last updated: 2026-06-02

API Surfaces

There are two API surfaces:

| API | URL | Auth | Intended access |
| --- | --- | --- | --- |
| LiteLLM | `http://127.0.0.1:4000/v1` locally, `https://localai.rapiddraft.ai/v1` for backend-to-backend access | bearer key | main model API |
| RAG API | `http://127.0.0.1:4100` locally, `https://knowledge.rapiddraft.ai` for backend-to-backend access | bearer key on protected routes | knowledge/search/cited answer API |

LiteLLM is OpenAI-compatible. Use it for backend integrations.

The RAG API is a FastAPI app for document search, RAG chat, and the local chat UI. On the live server, protected routes require a bearer key. Remote backend callers should use the Cloudflare endpoint (https://knowledge.rapiddraft.ai); browser clients should not call that endpoint directly.

Secret Handling

Do not paste real keys into wiki pages.

For local shell tests on the server, load the bearer key without printing it:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'

For a remote client, set the key in that client's environment through the normal secret-management path, then call the server's Tailscale address:

export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='http://<server-tailnet-ip>:4000/v1'

Never commit or document the real key value.

RAG key handling follows the same rule. Load the key from the local environment or the saved server-side key file, and never paste the real value into docs:

export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='http://127.0.0.1:4100'

For Railway or another backend caller:

export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='https://localai.rapiddraft.ai/v1'
export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='https://knowledge.rapiddraft.ai'

These variables must stay backend-only.
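A backend service can read these variables once at startup and fail fast when one is missing. The following is a minimal sketch, not part of the stack: `ApiSettings`, `REQUIRED_VARS`, and `load_settings` are hypothetical names, and only the four variable names come from this page.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional

# Variable names are taken from this page; everything else is illustrative.
REQUIRED_VARS = (
    "LITELLM_API_KEY",
    "LITELLM_BASE_URL",
    "LOCALAI_RAG_API_KEY",
    "LOCALAI_RAG_BASE_URL",
)


@dataclass(frozen=True)
class ApiSettings:
    # repr=False keeps key values out of logs if the object is printed.
    litellm_api_key: str = field(repr=False)
    litellm_base_url: str
    rag_api_key: str = field(repr=False)
    rag_base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> ApiSettings:
    """Read the four variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report variable names only; never echo key values.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return ApiSettings(
        litellm_api_key=env["LITELLM_API_KEY"],
        litellm_base_url=env["LITELLM_BASE_URL"].rstrip("/"),
        rag_api_key=env["LOCALAI_RAG_API_KEY"],
        rag_base_url=env["LOCALAI_RAG_BASE_URL"].rstrip("/"),
    )
```

Calling `load_settings()` with no argument reads from the process environment. Passing a plain dict is convenient in tests.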

LiteLLM Health

Unauthenticated LiteLLM health checks return 401. That is expected.

Authenticated health check:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'

Expected summary:

healthy_count: 3
unhealthy_count: 0
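The same check can be scripted for monitoring. This is a minimal sketch using only the Python standard library. `fetch_health` and `is_fully_healthy` are hypothetical helper names, and the code assumes the health body is JSON carrying the `healthy_count` and `unhealthy_count` fields shown above. Note that the health route sits at the server root, not under `/v1`.

```python
import json
import urllib.request


def fetch_health(server_url: str, token: str, timeout: float = 30.0) -> dict:
    """GET <server_url>/health with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        server_url.rstrip("/") + "/health",
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises urllib.error.HTTPError on 401 and other error statuses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def is_fully_healthy(health: dict, expected_models: int = 3) -> bool:
    """True only when every expected model is healthy and none is unhealthy."""
    return (
        health.get("healthy_count") == expected_models
        and health.get("unhealthy_count") == 0
    )
```

A caller would pass `http://127.0.0.1:4000` as `server_url` and feed the result to `is_fully_healthy`. The default of 3 mirrors the expected summary above and needs updating if the model list changes.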

LiteLLM Chat Completion

Call local/qwen-coder:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS http://127.0.0.1:4000/v1/chat/completions \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '"'"'{
      "model": "local/qwen-coder",
      "messages": [
        {"role": "user", "content": "Reply with a one sentence status check."}
      ],
      "max_tokens": 80
    }'"'"'
'

Remote client shape:

curl -fsS "${LITELLM_BASE_URL}/chat/completions" \
  -H "Authorization: Bearer ${LITELLM_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local/qwen-coder",
    "messages": [
      {"role": "user", "content": "Reply with a one sentence status check."}
    ],
    "max_tokens": 80
  }'

Cloudflare-backed remote shape:

curl -fsS "https://localai.rapiddraft.ai/v1/chat/completions" \
  -H "Authorization: Bearer ${LITELLM_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local/qwen-coder",
    "messages": [
      {"role": "user", "content": "Reply with a one sentence status check."}
    ],
    "max_tokens": 80
  }'

LiteLLM Embeddings

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS http://127.0.0.1:4000/v1/embeddings \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '"'"'{
      "model": "local/embed-engineering",
      "input": "fixture clamp torque specification"
    }'"'"'
'

Expected embedding dimensionality:
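The dimensionality can also be measured from a live response. The sketch below assumes the response follows the OpenAI embeddings shape (a `data` list whose items carry an `embedding` array), which LiteLLM's OpenAI compatibility suggests but this page does not show. `request_embedding` and `embedding_dimension` are hypothetical helper names.

```python
import json
import urllib.request


def request_embedding(base_url: str, api_key: str, text: str,
                      model: str = "local/embed-engineering",
                      timeout: float = 60.0) -> dict:
    """POST to <base_url>/embeddings; base_url is expected to end in /v1."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=json.dumps({"model": model, "input": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def embedding_dimension(response: dict) -> int:
    """Length of the first embedding vector (assumed OpenAI response shape)."""
    return len(response["data"][0]["embedding"])
```

Running `embedding_dimension(request_embedding(base_url, key, "fixture clamp torque specification"))` against the server reports the vector length for the configured model.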

RAG API Endpoints

Public endpoints:

GET  /
GET  /chat
GET  /health
GET  /static/*

Endpoints that require a bearer key when LOCALAI_RAG_API_KEY is set:

GET  /documents
GET  /ingestion/status
GET  /ingestion/events
POST /documents/ingest
POST /search
POST /search/similar
POST /chat
POST /vision/chat
POST /vision/chat-with-image
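A client wrapper can encode the two lists above so that the bearer header is attached only where it is needed. This is an illustrative sketch: `needs_bearer` is a hypothetical helper, and the route sets simply transcribe the lists on this page. Note that `GET /chat` (the UI page) is public while `POST /chat` is protected.

```python
# Route lists transcribed from this page; the helper itself is hypothetical.
PUBLIC_ROUTES = {
    ("GET", "/"),
    ("GET", "/chat"),
    ("GET", "/health"),
}

PROTECTED_ROUTES = {
    ("GET", "/documents"),
    ("GET", "/ingestion/status"),
    ("GET", "/ingestion/events"),
    ("POST", "/documents/ingest"),
    ("POST", "/search"),
    ("POST", "/search/similar"),
    ("POST", "/chat"),
    ("POST", "/vision/chat"),
    ("POST", "/vision/chat-with-image"),
}


def needs_bearer(method: str, path: str) -> bool:
    """Return True if the route is listed as protected, False if public."""
    key = (method.upper(), path)
    if key in PUBLIC_ROUTES:
        return False
    # GET /static/* covers every path under the static prefix.
    if key[0] == "GET" and path.startswith("/static/"):
        return False
    if key in PROTECTED_ROUTES:
        return True
    raise ValueError(f"route not documented: {key[0]} {path}")
```

Raising on unknown routes is a deliberate choice in this sketch: a silent default in either direction could leak a key to a public route or send an unauthenticated request to a protected one.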

RAG Health

curl -fsS http://127.0.0.1:4100/health

Expected response:

{"ok": true, "auth_enabled": true}

If auth_enabled is false, the bearer check is off and protected routes accept unauthenticated requests.
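A deployment check can treat a disabled bearer gate as a failure instead of relying on someone reading the output. A minimal sketch, assuming the health body has exactly the two fields shown above; `rag_health_problems` is a hypothetical name.

```python
def rag_health_problems(health: dict) -> list:
    """Return human-readable problems found in a decoded /health body."""
    problems = []
    if health.get("ok") is not True:
        problems.append("service did not report ok: true")
    if health.get("auth_enabled") is not True:
        problems.append("auth_enabled is not true; protected routes are open")
    return problems
```

An empty list means the response matches the expected one. A non-empty list can be logged or used to fail a deploy step.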

RAG Chat

Ask a question using local documents:

curl -fsS http://127.0.0.1:4100/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What do the local documents say about fixture clamp torque?",
    "model": "local/qwen-coder",
    "search_mode": "hybrid",
    "use_documents": true,
    "limit": 8
  }'

Response shape:

{
  "answer": "text with bracketed citations",
  "sources": [
    {
      "id": "chunk uuid",
      "document_id": "document uuid",
      "title": "source filename",
      "chunk_index": 0,
      "page_number": null,
      "source_kind": "text",
      "content": "retrieved chunk text",
      "metadata": {},
      "score": 0.0
    }
  ]
}
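On the client side, the response can be reduced to the fields needed for display. The sketch below assumes the shape shown above. `Source`, `parse_sources`, and `format_source` are hypothetical names, and the output format is only one possible choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Source:
    title: str
    chunk_index: int
    page_number: Optional[int]  # null in the response when no page is known
    score: float


def parse_sources(payload: dict) -> List[Source]:
    """Pick the display fields out of each entry in payload["sources"]."""
    return [
        Source(
            title=item["title"],
            chunk_index=item["chunk_index"],
            page_number=item.get("page_number"),
            score=float(item["score"]),
        )
        for item in payload.get("sources", [])
    ]


def format_source(source: Source) -> str:
    """Render one source as a single line, omitting the page when unknown."""
    page = f", p. {source.page_number}" if source.page_number is not None else ""
    return f"{source.title} (chunk {source.chunk_index}{page}, score {source.score:.3f})"
```

For a source titled `torque.pdf` with chunk index 2, page 5, and score 0.8123, `format_source` returns `torque.pdf (chunk 2, p. 5, score 0.812)`.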

Ask without document retrieval:

curl -fsS http://127.0.0.1:4100/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Give a short local model status reply.",
    "use_documents": false
  }'
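Both request variants can be built by one helper in a backend caller. This is a sketch using the standard library. `build_chat_payload` and `post_chat` are hypothetical names, and the field names and the defaults of `hybrid` and 8 are copied from the curl examples above, not from the server's schema.

```python
import json
import urllib.request
from typing import Optional


def build_chat_payload(question: str, use_documents: bool = True,
                       model: Optional[str] = None,
                       search_mode: str = "hybrid", limit: int = 8) -> dict:
    """Build the JSON body for POST /chat, mirroring the curl examples."""
    payload = {"question": question, "use_documents": use_documents}
    if model is not None:
        payload["model"] = model
    if use_documents:
        # Retrieval settings are only sent when retrieval is requested.
        payload["search_mode"] = search_mode
        payload["limit"] = limit
    return payload


def post_chat(base_url: str, api_key: str, payload: dict,
              timeout: float = 120.0) -> dict:
    """Send the payload to <base_url>/chat and decode the JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `post_chat(base_url, key, build_chat_payload("...", use_documents=False))` corresponds to the second curl example.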

Vision Chat

Text-only vision route through the RAG API:

curl -fsS http://127.0.0.1:4100/vision/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Reply with a short status check.",
    "use_documents": false
  }'

Image upload:

curl -fsS http://127.0.0.1:4100/vision/chat-with-image \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -F 'question=Describe the image and identify any visible text.' \
  -F 'image=@/path/to/image.png'

The vision route uses local/qwen-vision-fast.
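The image upload is an ordinary multipart/form-data POST, which can be reproduced without curl. The sketch below builds the body by hand with the standard library. `build_multipart` and `ask_about_image` are hypothetical names, the `question` and `image` field names come from the curl example, and treating the response as JSON is an assumption.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(question: str, image_name: str, image_bytes: bytes,
                    image_type: str = "image/png") -> Tuple[bytes, str]:
    """Return (body, Content-Type header value) for the two form fields."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="question"',
        b"",
        question.encode("utf-8"),
        delimiter,
        f'Content-Disposition: form-data; name="image"; filename="{image_name}"'.encode("utf-8"),
        f"Content-Type: {image_type}".encode("ascii"),
        b"",
        image_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def ask_about_image(base_url: str, api_key: str, question: str,
                    image_path: str, timeout: float = 180.0) -> dict:
    """Upload one image to /vision/chat-with-image (response assumed JSON)."""
    with open(image_path, "rb") as handle:
        image_bytes = handle.read()
    body, content_type = build_multipart(
        question, os.path.basename(image_path), image_bytes
    )
    request = urllib.request.Request(
        base_url.rstrip("/") + "/vision/chat-with-image",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The sketch does not escape quotes in file names and always labels the upload as `image/png` unless told otherwise. A production caller would more likely use an HTTP library with built-in multipart support.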

RAG Frontend

To test on the Fedora machine itself:

1. Open http://127.0.0.1:4100 in a browser.
2. Paste the saved RAG key into the RAG API key field.
3. Click Use Key.
4. Use the normal chat box for questions.

Do not paste the key into the chat message field. The page stores the token locally in the browser and uses it for subsequent protected API calls.

Python Client Example

Use the OpenAI Python client against LiteLLM:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LITELLM_API_KEY"],
    base_url=os.environ.get("LITELLM_BASE_URL", "http://127.0.0.1:4000/v1"),
)

response = client.chat.completions.create(
    model="local/qwen-coder",
    messages=[{"role": "user", "content": "Return one short status sentence."}],
    max_tokens=80,
)

print(response.choices[0].message.content)

For a remote Tailscale client, set LITELLM_BASE_URL to the server's tailnet URL on port 4000, including the /v1 suffix (for example http://<server-tailnet-ip>:4000/v1).

Sources

/Users/adeelyj/code/local ai server setup/local-ai-stack-repo/rag/app/app.py
/Users/adeelyj/code/local ai server setup/LOCAL_RAG_API_HELPERS.md
/Users/adeelyj/code/local ai server setup/local-ai-stack-repo/HANDOFF.md