API Usage Guide

Last updated: 2026-06-02

API Surfaces

There are two API surfaces:

| API | URL | Auth | Intended access |
|---|---|---|---|
| LiteLLM | http://127.0.0.1:4000/v1 locally; https://localai.rapiddraft.ai/v1 for backend-to-backend access | Bearer key | Main model API |
| RAG API | http://127.0.0.1:4100 locally; https://knowledge.rapiddraft.ai for backend-to-backend access | Bearer key on protected routes | Knowledge, search, and cited-answer API |

LiteLLM is OpenAI-compatible. Use it for backend integrations.

The RAG API is a FastAPI app for document search, RAG chat, and the local chat UI. On the live server, protected routes require a bearer key. Remote backend callers should use the Cloudflare endpoint (https://knowledge.rapiddraft.ai); browser clients should not call that endpoint directly.

Secret Handling

Do not paste real keys into wiki pages.

For local shell tests on the server, load the bearer key without printing it:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'

For a remote client, set the key in that client's environment through your normal secret-management path, then call the server's Tailscale address:

export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='http://<server-tailnet-ip>:4000/v1'

Never commit or document the real key value.

RAG key handling follows the same rule. Load it from the local environment or the saved server-side key file, but never paste the real value into docs:

export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='http://127.0.0.1:4100'

For Railway or another backend caller:

export LITELLM_API_KEY='<your-private-key>'
export LITELLM_BASE_URL='https://localai.rapiddraft.ai/v1'
export LOCALAI_RAG_API_KEY='<your-private-rag-key>'
export LOCALAI_RAG_BASE_URL='https://knowledge.rapiddraft.ai'

These variables must stay backend-only.
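
As an illustration of keeping these values backend-only, the following is a minimal sketch of a settings loader that reads the four variables above and keeps the keys out of `repr()` output and error messages. The names `ApiSettings` and `load_settings` are hypothetical, and the local URLs are used as fallbacks only as an assumption about what a sensible default would be.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApiSettings:
    """Backend-only connection settings; key fields are hidden from repr()."""
    litellm_base_url: str
    rag_base_url: str
    litellm_api_key: str = field(repr=False)
    rag_api_key: str = field(repr=False)


def load_settings(env=None) -> ApiSettings:
    """Read the four variables, failing early if a key is missing."""
    env = os.environ if env is None else env
    required = ("LITELLM_API_KEY", "LOCALAI_RAG_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report variable names only, never their values.
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return ApiSettings(
        # Assumption: fall back to the local URLs when no base URL is set.
        litellm_base_url=env.get("LITELLM_BASE_URL", "http://127.0.0.1:4000/v1").rstrip("/"),
        rag_base_url=env.get("LOCALAI_RAG_BASE_URL", "http://127.0.0.1:4100").rstrip("/"),
        litellm_api_key=env["LITELLM_API_KEY"],
        rag_api_key=env["LOCALAI_RAG_API_KEY"],
    )
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dict is convenient in tests.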

LiteLLM Health

Unauthenticated LiteLLM health checks return 401. That is expected.

Authenticated health check:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS -H "Authorization: Bearer ${TOKEN}" http://127.0.0.1:4000/health
'

Expected summary:

healthy_count: 3
unhealthy_count: 0
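
The two checks above (a 401 without a key, and the expected counters with a key) could be automated roughly as follows. This is a sketch, not part of the stack: the function names are hypothetical, and it assumes `healthy_count` and `unhealthy_count` appear as top-level integers in the JSON body.

```python
def interpret_health_status(status_code: int, sent_key: bool) -> str:
    """Map an HTTP status from /health to a short verdict."""
    if status_code == 200:
        return "ok"
    if status_code == 401 and not sent_key:
        # Expected: the service is up and rejecting anonymous callers.
        return "up, auth required"
    if status_code == 401:
        return "key rejected"
    return "unexpected status"


def litellm_is_healthy(payload: dict, expected_healthy: int = 3) -> bool:
    """Compare the summary counters with the expected values."""
    # Assumption: the counters are top-level integers in the JSON body.
    return (
        payload.get("healthy_count") == expected_healthy
        and payload.get("unhealthy_count") == 0
    )
```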

LiteLLM Chat Completion

Call local/qwen-coder:

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS http://127.0.0.1:4000/v1/chat/completions \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '"'"'{
      "model": "local/qwen-coder",
      "messages": [
        {"role": "user", "content": "Reply with a one sentence status check."}
      ],
      "max_tokens": 80
    }'"'"'
'

Remote client shape:

curl -fsS "${LITELLM_BASE_URL}/chat/completions" \
  -H "Authorization: Bearer ${LITELLM_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local/qwen-coder",
    "messages": [
      {"role": "user", "content": "Reply with a one sentence status check."}
    ],
    "max_tokens": 80
  }'

Cloudflare-backed remote shape:

curl -fsS "https://localai.rapiddraft.ai/v1/chat/completions" \
  -H "Authorization: Bearer ${LITELLM_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local/qwen-coder",
    "messages": [
      {"role": "user", "content": "Reply with a one sentence status check."}
    ],
    "max_tokens": 80
  }'

LiteLLM Embeddings

sudo bash -c '
  set -a
  source /etc/localai/localai.env
  set +a
  TOKEN="${LITELLM_API_KEY:-${LITELLM_MASTER_KEY:-}}"
  curl -fsS http://127.0.0.1:4000/v1/embeddings \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '"'"'{
      "model": "local/embed-engineering",
      "input": "fixture clamp torque specification"
    }'"'"'
'

Expected embedding dimensionality:

1024
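
A caller may want to verify that dimensionality before writing vectors to a store. The sketch below assumes the OpenAI-style embeddings response shape (a `data` list whose items carry an `embedding` list), which is what an OpenAI-compatible endpoint is expected to return; the helper name is hypothetical.

```python
EXPECTED_DIM = 1024  # dimensionality stated above for local/embed-engineering


def check_embedding_dimensions(response_body: dict, expected: int = EXPECTED_DIM) -> int:
    """Return the number of vectors, raising if any has the wrong length."""
    # Assumption: OpenAI-style body, {"data": [{"embedding": [...]}, ...]}.
    vectors = [item["embedding"] for item in response_body.get("data", [])]
    if not vectors:
        raise ValueError("response contains no embeddings")
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"embedding {index} has {len(vector)} dimensions, expected {expected}"
            )
    return len(vectors)
```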

RAG API Endpoints

Public endpoints:

GET  /
GET  /chat
GET  /health
GET  /static/*

Protected endpoints when LOCALAI_RAG_API_KEY is set:

GET  /documents
GET  /ingestion/status
GET  /ingestion/events
POST /documents/ingest
POST /search
POST /search/similar
POST /chat
POST /vision/chat
POST /vision/chat-with-image
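
Note that `/chat` appears in both lists: `GET /chat` is public, while `POST /chat` is protected. A client-side lookup that mirrors the two lists might look like the sketch below. The function name is hypothetical, and the table is only a copy of the lists above, not something read from the server.

```python
PUBLIC_ROUTES = {("GET", "/"), ("GET", "/chat"), ("GET", "/health")}

PROTECTED_ROUTES = {
    ("GET", "/documents"),
    ("GET", "/ingestion/status"),
    ("GET", "/ingestion/events"),
    ("POST", "/documents/ingest"),
    ("POST", "/search"),
    ("POST", "/search/similar"),
    ("POST", "/chat"),
    ("POST", "/vision/chat"),
    ("POST", "/vision/chat-with-image"),
}


def needs_bearer_key(method: str, path: str) -> bool:
    """Return True if the route is listed as protected, False if public."""
    method = method.upper()
    if method == "GET" and path.startswith("/static/"):
        return False  # static assets are public
    if (method, path) in PUBLIC_ROUTES:
        return False
    if (method, path) in PROTECTED_ROUTES:
        return True
    raise ValueError(f"route not listed in this guide: {method} {path}")
```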

RAG Health

curl -fsS http://127.0.0.1:4100/health

Expected response:

{"ok": true, "auth_enabled": true}

If auth_enabled is false, the bearer gate is disabled for protected routes.
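
A deployment check could treat a disabled bearer gate as a failure rather than a detail. The following sketch inspects the two fields shown in the expected response; the function name and the wording of the messages are hypothetical.

```python
def rag_health_problems(payload: dict) -> list:
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    if payload.get("ok") is not True:
        problems.append("service did not report ok")
    if payload.get("auth_enabled") is not True:
        problems.append("bearer gate is disabled for protected routes")
    return problems
```

An empty list corresponds to the expected response shown above.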

RAG Chat

Ask a question using local documents:

curl -fsS http://127.0.0.1:4100/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What do the local documents say about fixture clamp torque?",
    "model": "local/qwen-coder",
    "search_mode": "hybrid",
    "use_documents": true,
    "limit": 8
  }'

Response shape:

{
  "answer": "text with bracketed citations",
  "sources": [
    {
      "id": "chunk uuid",
      "document_id": "document uuid",
      "title": "source filename",
      "chunk_index": 0,
      "page_number": null,
      "source_kind": "text",
      "content": "retrieved chunk text",
      "metadata": {},
      "score": 0.0
    }
  ]
}
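
For callers that prefer typed objects over raw dictionaries, the response shape above can be mapped onto dataclasses. This is a sketch under the assumption that the field names and types match the shape shown; the class and function names are hypothetical, and unknown extra keys are ignored.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Source:
    id: str
    document_id: str
    title: str
    chunk_index: int
    page_number: Optional[int]  # null in the example response
    source_kind: str
    content: str
    metadata: dict
    score: float


@dataclass
class ChatAnswer:
    answer: str
    sources: List[Source]


def parse_chat_response(body: dict) -> ChatAnswer:
    """Build a ChatAnswer from a decoded JSON body, ignoring unknown keys."""
    names = [f.name for f in fields(Source)]
    sources = [
        Source(**{name: raw.get(name) for name in names})
        for raw in (body.get("sources") or [])
    ]
    return ChatAnswer(answer=body["answer"], sources=sources)
```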

Ask without document retrieval:

curl -fsS http://127.0.0.1:4100/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Give a short local model status reply.",
    "use_documents": false
  }'
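
The two request bodies above differ only in which fields they include. A small builder can make that explicit, using only the standard library. This is a sketch: the function names are hypothetical, and treating `model`, `search_mode`, and `limit` as optional is an inference from the second example, which omits them.

```python
import json
import urllib.request


def build_rag_chat_payload(question, use_documents=True, model=None,
                           search_mode=None, limit=None) -> dict:
    """Build a /chat body, leaving out any optional field that is None."""
    payload = {"question": question, "use_documents": use_documents}
    for key, value in (("model", model), ("search_mode", search_mode), ("limit", limit)):
        if value is not None:
            payload[key] = value
    return payload


def build_rag_request(base_url: str, api_key: str, path: str, payload: dict):
    """Prepare (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` sends it; keeping construction separate from sending makes the request easy to inspect in tests.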

Vision Chat

Text-only vision route through the RAG API:

curl -fsS http://127.0.0.1:4100/vision/chat \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Reply with a short status check.",
    "use_documents": false
  }'

Image upload:

curl -fsS http://127.0.0.1:4100/vision/chat-with-image \
  -H "Authorization: Bearer ${LOCALAI_RAG_API_KEY}" \
  -F 'question=Describe the image and identify any visible text.' \
  -F 'image=@/path/to/image.png'

The vision route uses local/qwen-vision-fast.
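
The image upload call above sends a multipart form with a `question` field and an `image` file part. For clients without a multipart-capable HTTP library, the body can be assembled with the standard library alone. This is a minimal sketch with a hypothetical function name; it assumes the filename contains no double quotes or line breaks.

```python
import mimetypes
import uuid


def encode_image_form(question: str, filename: str, image_bytes: bytes):
    """Return (body, content_type) for a question + image multipart form."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="question"\r\n\r\n'
        f"{question}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    # The Content-Type header must carry the same boundary used in the body.
    return head + image_bytes + tail, f"multipart/form-data; boundary={boundary}"
```

The returned content type goes into the request's `Content-Type` header alongside the same `Authorization: Bearer` header used by the other protected routes.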

RAG Frontend

To test on the Fedora machine itself:

  1. Open http://127.0.0.1:4100 in a browser.
  2. Paste the saved RAG key into the RAG API key field.
  3. Click Use Key.
  4. Use the normal chat box for questions.

Do not paste the key into the chat message field. The page stores the token locally in the browser and sends it with subsequent protected API calls.

Python Client Example

Use the OpenAI Python client against LiteLLM:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LITELLM_API_KEY"],
    base_url=os.environ.get("LITELLM_BASE_URL", "http://127.0.0.1:4000/v1"),
)

response = client.chat.completions.create(
    model="local/qwen-coder",
    messages=[{"role": "user", "content": "Return one short status sentence."}],
    max_tokens=80,
)

print(response.choices[0].message.content)

For a remote Tailscale client, set LITELLM_BASE_URL to the server's tailnet URL on port 4000, including the /v1 suffix (for example, http://<server-tailnet-ip>:4000/v1).
