Local RAG Guide: CAE Knowledge Base¶
Source files: Architechture & Research/Infrastructure Research/Knowledge Base & RAG/Local RAG Guide.md
Last synthesized: March 2026
Overview¶
This guide describes how to set up a local Retrieval-Augmented Generation (RAG) system over the CAE toolbox. The system runs entirely on-premise (no cloud), indexes the collection's ~14,200 text-searchable technical documents, and uses a local LLM to answer engineering questions.
Goal: Enable RapidDraft and Autonomous CAE to reference past solutions, standards, and automation scripts without cloud API calls or data exposure.
Your Collection at a Glance¶
| Category | File Count | Notes |
|---|---|---|
| PDFs | 4,191 | Handbooks, standards, tutorials — highest value |
| PowerPoint (PPT/PPTX) | 708 | Training presentations, procedure docs |
| HTML/HTM | 5,166 | NX/ANSYS training web pages, documentation |
| Scripts (.py, .tcl, .sh, .inp) | 1,618 | Automation scripts, solver inputs, CAM macros |
| Text/CSV/TXT | 1,565 | Notes, data, logs, parameters |
| CAE Binary (.prt, .fem, .h3d, .sim) | 6,243 | Not text-searchable — skip |
| Images (.gif, .png, .jpg) | 17,045 | Not searchable — skip |
| Other (.js, .css, .otf, fonts) | ~18,000 | Web assets from training — skip |
| Total text-searchable | ~14,200 | This is what gets indexed |
Architecture¶
┌─────────────────────────────────────────────────────────┐
│                  Your Windows Machine                   │
│                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │ 113_CAE_     │    │ File Watcher │    │ ChromaDB  │  │
│  │ Toolbox      │───>│ Service      │───>│ (vectors) │  │
│  │ (OneDrive)   │    │ (watchdog)   │    │ (local)   │  │
│  └──────────────┘    └──────────────┘    └─────┬─────┘  │
│                                                │        │
│  ┌──────────────┐    ┌──────────────┐          │        │
│  │ LM Studio    │<───│ Chat UI      │<─────────┘        │
│  │ (Qwen 32B)   │    │ (Gradio)     │                   │
│  │ port 1234    │    │ port 7860    │                   │
│  └──────────────┘    └──────────────┘                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
How It Works¶
- File Watcher runs as background service, monitors CAE Toolbox folder
- New/changed files get parsed, chunked, embedded, and stored in ChromaDB
- Already-indexed files tracked by hash — no re-processing
- Chat UI takes your question, retrieves relevant chunks from ChromaDB
- LM Studio (Qwen 32B) generates answer from question + retrieved context
Setup (Windows — Step by Step)¶
Prerequisites¶
- Python 3.11+ (you have this)
- LM Studio running with Qwen 32B or similar (you have this; listens on port 1234)
- ~4GB RAM for embeddings + vector DB (you have 128GB, so no issue)
Step 1: Create Project Directory¶
Open PowerShell:
mkdir C:\Users\adeel\cae_rag_service
cd C:\Users\adeel\cae_rag_service
python -m venv venv
.\venv\Scripts\Activate.ps1
Step 2: Install Dependencies¶
pip install chromadb sentence-transformers watchdog pymupdf python-pptx
pip install beautifulsoup4 gradio openai chardet tiktoken
pip install python-docx openpyxl lxml
Why each package:
- chromadb — Local vector database, persistent storage, no server needed
- sentence-transformers — Embedding model (runs locally, no API calls)
- watchdog — Filesystem monitoring (detects new files)
- pymupdf (fitz) — Fast PDF parser; extracts embedded text layers (image-only scanned PDFs still need OCR; see Limitations)
- python-pptx — PowerPoint text extraction
- beautifulsoup4 — HTML parsing for training pages
- python-docx — Word document extraction
- chardet — Encoding detection for old text files
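Optionally, capture these in a requirements.txt (an unpinned sketch; this is what the checklist's pip install -r requirements.txt at the end of the guide refers to):
# requirements.txt (mirrors the pip installs above)
chromadb
sentence-transformers
watchdog
pymupdf
python-pptx
beautifulsoup4
gradio
openai
chardet
tiktoken
python-docx
openpyxl
lxml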
Step 3: Configuration¶
Create config.py:
"""Configuration for CAE Toolbox RAG Service."""
import os
# === PATHS ===
TOOLBOX_ROOT = r"C:\Users\adeel\OneDrive\100_Knowledge\113_CAE_Toolbox"
CHROMA_DB_PATH = r"C:\Users\adeel\cae_rag_service\chroma_db"
INDEX_STATE_PATH = r"C:\Users\adeel\cae_rag_service\index_state.json"
LOG_PATH = r"C:\Users\adeel\cae_rag_service\logs"
# === EMBEDDING MODEL ===
# Runs locally; ~400MB download first time
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"
# === LM STUDIO ===
LM_STUDIO_URL = "http://localhost:1234/v1"
LM_STUDIO_MODEL = "qwen2.5-coder-32b"
# === CHUNKING ===
CHUNK_SIZE = 800 # tokens per chunk
CHUNK_OVERLAP = 100 # overlap between chunks
MAX_FILE_SIZE_MB = 100 # skip larger files
# === RETRIEVAL ===
TOP_K = 8 # chunks to retrieve per query
SIMILARITY_THRESHOLD = 0.25 # minimum relevance (0-1)
# === FILE TYPES TO INDEX ===
TEXT_EXTENSIONS = {
# Documents
'.pdf', '.doc', '.docx', '.ppt', '.pptx', '.rtf',
'.xls', '.xlsx',
# Web content
'.html', '.htm',
# Plain text
'.txt', '.md', '.csv', '.tsv', '.log', '.out',
# Code & scripts
'.py', '.tcl', '.sh', '.bat', '.m', '.cfg',
# Solver inputs
'.inp', '.dat', '.k', '.key', '.bdf', '.nas', '.fem',
# Other
'.xml', '.json', '.tex',
}
# Skip these
SKIP_EXTENSIONS = {
'.gif', '.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff', '.svg',
'.avi', '.mp4', '.wmv', '.mov',
'.js', '.css', '.otf', '.woff', '.eot', '.ttf',
'.pyc', '.pyo', '.class',
'.zip', '.rar', '.7z', '.gz', '.tar',
'.exe', '.dll', '.so', '.msi',
'.h3d', '.op2', '.sim', '.prt', '.stp', '.step', '.iges',
'.stl', '.d3plot', '.binout', '.rst', '.rth', '.db', '.cdb',
'.catpart', '.catproduct', '.sldprt', '.sldasm',
'.x_t', '.x_b', '.jt', '.3dm', '.dwg', '.dxf',
'.stat', '.mvw', '.hm', '.rad',
}
SKIP_FOLDERS = {
'__pycache__', '.git', 'node_modules', '.ipynb_checkpoints',
}
os.makedirs(CHROMA_DB_PATH, exist_ok=True)
os.makedirs(LOG_PATH, exist_ok=True)
Step 4: File Parser Module¶
Create parsers.py:
"""Document parsers for different file types."""
import os, chardet, logging
logger = logging.getLogger("cae_rag")
def parse_pdf(filepath: str) -> str:
"""Extract text from PDF using PyMuPDF."""
import fitz
try:
doc = fitz.open(filepath)
texts = [page.get_text() for page in doc]
doc.close()
return "\n".join(texts)
except Exception as e:
logger.warning(f"PDF parse failed: {filepath} — {e}")
return ""
def parse_pptx(filepath: str) -> str:
"""Extract text from PowerPoint."""
from pptx import Presentation
try:
prs = Presentation(filepath)
texts = []
for slide_num, slide in enumerate(prs.slides, 1):
slide_text = []
for shape in slide.shapes:
if shape.has_text_frame:
for para in shape.text_frame.paragraphs:
text = para.text.strip()
if text:
slide_text.append(text)
if slide_text:
texts.append(f"[Slide {slide_num}]\n" + "\n".join(slide_text))
return "\n\n".join(texts)
except Exception as e:
logger.warning(f"PPTX parse failed: {filepath} — {e}")
return ""
def parse_docx(filepath: str) -> str:
"""Extract text from Word documents."""
from docx import Document
try:
doc = Document(filepath)
return "\n".join(para.text for para in doc.paragraphs if para.text.strip())
except Exception as e:
logger.warning(f"DOCX parse failed: {filepath} — {e}")
return ""
def parse_html(filepath: str) -> str:
"""Extract text from HTML files."""
from bs4 import BeautifulSoup
try:
        with open(filepath, 'rb') as f:
            raw = f.read()
encoding = chardet.detect(raw[:10000])['encoding'] or 'utf-8'
html = raw.decode(encoding, errors='replace')
soup = BeautifulSoup(html, 'lxml')
for tag in soup(['script', 'style', 'nav', 'footer']):
tag.decompose()
text = soup.get_text(separator='\n', strip=True)
return text
except Exception as e:
logger.warning(f"HTML parse failed: {filepath} — {e}")
return ""
def parse_text(filepath: str) -> str:
"""Read plain text files with encoding detection."""
try:
        with open(filepath, 'rb') as f:
            raw = f.read()
if not raw:
return ""
encoding = chardet.detect(raw[:10000])['encoding'] or 'utf-8'
return raw.decode(encoding, errors='replace')
except Exception as e:
logger.warning(f"Text parse failed: {filepath} — {e}")
return ""
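Before indexing 14K files, it is worth spot-checking one parser from a Python shell (the sample path below is a placeholder; substitute any PDF from the toolbox):
# smoke test (sketch): eyeball parser output before bulk indexing
from parsers import parse_pdf

text = parse_pdf(r"C:\path\to\sample.pdf")  # placeholder path
print(f"{len(text)} chars extracted")
print(text[:500])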
Step 5: RAG Service¶
Create rag_service.py:
"""Local RAG service: index documents + retrieve + generate answers."""
import os, chromadb, hashlib, json, logging
from sentence_transformers import SentenceTransformer
from pathlib import Path
from config import *
from parsers import parse_pdf, parse_pptx, parse_docx, parse_html, parse_text
import tiktoken
logging.basicConfig(
    level=logging.INFO,
    handlers=[
        logging.StreamHandler(),
        # log to LOG_PATH so the "Monitor Logs" maintenance step has files to read
        logging.FileHandler(os.path.join(LOG_PATH, "rag_service.log")),
    ],
)
logger = logging.getLogger("cae_rag")
# Initialize ChromaDB and embedding model
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
# Use cosine distance so relevance = 1 - distance in retrieve() is meaningful
collection = client.get_or_create_collection(
    name="cae_toolbox",
    metadata={"hnsw:space": "cosine"}
)
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
encoder = tiktoken.get_encoding("cl100k_base")
def get_file_hash(filepath: str) -> str:
"""Compute SHA256 hash of file."""
sha256 = hashlib.sha256()
with open(filepath, 'rb') as f:
sha256.update(f.read())
return sha256.hexdigest()
def index_document(filepath: str, text: str):
"""Chunk, embed, and store document."""
if not text.strip():
logger.info(f"Skipping empty document: {filepath}")
return
# Chunk by token count
tokens = encoder.encode(text)
chunks = []
for i in range(0, len(tokens), CHUNK_SIZE - CHUNK_OVERLAP):
chunk_tokens = tokens[i:i + CHUNK_SIZE]
chunk_text = encoder.decode(chunk_tokens)
chunks.append(chunk_text)
    # Embed all chunks in one batch, then upsert so a re-indexed file
    # overwrites its old chunks instead of raising duplicate-ID errors
    try:
        embeddings = embedding_model.encode(chunks).tolist()
        collection.upsert(
            ids=[f"{filepath}_{idx}" for idx in range(len(chunks))],
            embeddings=embeddings,
            metadatas=[{"source": filepath, "chunk": idx} for idx in range(len(chunks))],
            documents=chunks,
        )
    except Exception as e:
        logger.error(f"Failed to embed/store {filepath}: {e}")
        return
    logger.info(f"Indexed {len(chunks)} chunks from {filepath}")
def parse_file(filepath: str) -> str:
"""Parse file based on extension."""
ext = Path(filepath).suffix.lower()
if ext == '.pdf':
return parse_pdf(filepath)
elif ext in ['.pptx', '.ppt']:
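        # python-pptx reads only .pptx; legacy .ppt files fail inside parse_pptx and log a warning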
return parse_pptx(filepath)
elif ext in ['.docx', '.doc']:
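        # python-docx reads only .docx; legacy .doc files fail inside parse_docx and log a warning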
return parse_docx(filepath)
elif ext in ['.html', '.htm']:
return parse_html(filepath)
else:
return parse_text(filepath)
def scan_and_index():
"""Scan toolbox, index new files."""
indexed = 0
for root, dirs, files in os.walk(TOOLBOX_ROOT):
# Skip unwanted folders
dirs[:] = [d for d in dirs if d not in SKIP_FOLDERS]
for filename in files:
ext = Path(filename).suffix.lower()
if ext not in TEXT_EXTENSIONS or ext in SKIP_EXTENSIONS:
continue
filepath = os.path.join(root, filename)
size_mb = os.path.getsize(filepath) / (1024 * 1024)
if size_mb > MAX_FILE_SIZE_MB:
logger.warning(f"Skipping large file: {filepath} ({size_mb:.1f} MB)")
continue
try:
text = parse_file(filepath)
if text.strip():
index_document(filepath, text)
indexed += 1
except Exception as e:
logger.error(f"Error processing {filepath}: {e}")
logger.info(f"Total documents indexed: {indexed}")
def retrieve(query: str, top_k: int = TOP_K) -> list:
    """Retrieve relevant chunks for a query."""
    query_embedding = embedding_model.encode(query).tolist()
    # Chroma's `where` clause filters on metadata fields, not distances,
    # so the similarity threshold is applied to the returned distances below
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    documents = []
    if results and results['documents']:
        for doc, metadata, distance in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            relevance = 1 - distance  # valid because the collection uses cosine space
            if relevance < SIMILARITY_THRESHOLD:
                continue
            documents.append({
                'text': doc,
                'source': metadata['source'],
                'relevance': relevance
            })
    return documents
def generate_answer(query: str) -> str:
"""Retrieve context + generate answer using LM Studio."""
import requests
# Retrieve context
context_docs = retrieve(query)
context = "\n\n".join([f"Source: {d['source']}\n{d['text']}" for d in context_docs])
# Generate answer
prompt = f"""You are a CAE engineering expert. Answer the question using the provided context.
Question: {query}
Context:
{context}
Answer:"""
try:
response = requests.post(
f"{LM_STUDIO_URL}/chat/completions",
json={
"model": LM_STUDIO_MODEL,
"messages": [{"role": "user", "content": prompt}],
"temperature": 0.7,
"max_tokens": 500
},
timeout=60
)
response.raise_for_status()
return response.json()['choices'][0]['message']['content']
except Exception as e:
logger.error(f"LM Studio error: {e}")
return f"Error generating answer: {e}"
if __name__ == "__main__":
logger.info("Starting CAE RAG indexing...")
scan_and_index()
logger.info("Indexing complete.")
Step 6: Chat UI (Gradio)¶
Create chat_ui.py:
"""Gradio-based chat interface."""
import gradio as gr
from rag_service import generate_answer, retrieve
def chat(message, history):
"""Chat interface."""
answer = generate_answer(message)
return answer
def show_sources(query):
"""Show retrieved sources for a query."""
docs = retrieve(query, top_k=5)
sources = "\n\n".join([
f"**{d['source']}** (relevance: {d['relevance']:.2f})\n{d['text'][:200]}..."
for d in docs
])
return sources
# Build interface
with gr.Blocks(title="CAE Knowledge Base") as app:
gr.Markdown("# CAE Knowledge Base RAG")
with gr.Tabs():
with gr.TabItem("Chat"):
chatbot = gr.ChatInterface(chat)
with gr.TabItem("Search Sources"):
query_input = gr.Textbox(label="Query", placeholder="Search CAE Toolbox...")
sources_output = gr.Markdown()
search_btn = gr.Button("Search")
search_btn.click(show_sources, inputs=[query_input], outputs=[sources_output])
if __name__ == "__main__":
app.launch(server_name="127.0.0.1", server_port=7860)
Step 7: Run the Service¶
In PowerShell:
.\venv\Scripts\Activate.ps1
# Index documents (run once)
python rag_service.py
# Start chat UI
python chat_ui.py
# Open browser: http://127.0.0.1:7860
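If the chat UI errors out, first confirm LM Studio's server is actually reachable; it exposes an OpenAI-compatible API, so listing the loaded models is a one-line check:
# sanity check: list the model(s) LM Studio has loaded on port 1234
import requests
print(requests.get("http://localhost:1234/v1/models", timeout=5).json())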
Query Examples¶
Example 1: DFM Rule Lookup¶
Query: "What is the minimum internal corner radius for CNC milling?" Expected context: Protolabs guide, internal standards, supplier docs Answer: System retrieves relevant sections and synthesizes answer
Example 2: Script Search¶
Query: "How do I automate NX CAM programming?" Expected context: NX macro examples, TCL scripts, CAM playbooks Answer: System suggests relevant automation scripts from toolbox
Example 3: Solver Parameter Question¶
Query: "What mesh size should I use for NASTRAN stress analysis?" Expected context: Analysis best practices, tutorials, solver documentation Answer: System provides guidance with examples from past analyses
Performance Characteristics¶
| Metric | Value | Notes |
|---|---|---|
| Indexing time | ~2-3 hours | First run for 14K documents |
| Vector DB size | ~2-3 GB | Depends on chunking strategy |
| Query latency | <1 second | Vector search only; excludes LLM time |
| LLM generation time | 10-30 seconds | Qwen 32B on GPU; varies by answer length |
| Total Q&A latency | 15-40 seconds | Retrieval + generation combined |
Optimization Tips¶
1. Incremental Indexing¶
Once the initial index is built, only re-index files that changed. A runnable version, reusing get_file_hash, parse_file, and index_document from rag_service.py:
def incremental_index():
    """Re-index only files whose content hash changed since the last run."""
    index_state = {}
    if os.path.exists(INDEX_STATE_PATH):
        with open(INDEX_STATE_PATH) as f:
            index_state = json.load(f)
    for root, dirs, files in os.walk(TOOLBOX_ROOT):
        dirs[:] = [d for d in dirs if d not in SKIP_FOLDERS]
        for filename in files:
            if Path(filename).suffix.lower() not in TEXT_EXTENSIONS:
                continue
            filepath = os.path.join(root, filename)
            current_hash = get_file_hash(filepath)
            if index_state.get(filepath) != current_hash:
                index_document(filepath, parse_file(filepath))
                index_state[filepath] = current_hash
    with open(INDEX_STATE_PATH, 'w') as f:
        json.dump(index_state, f)
2. Chunk Size Tuning¶
- Smaller chunks (400 tokens): Better precision, more retrieval overhead
- Larger chunks (1000 tokens): Faster retrieval, more noise in context
- Sweet spot: 800 tokens with 100-token overlap
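To see how these choices scale, a quick estimate that matches the range(0, n, CHUNK_SIZE - CHUNK_OVERLAP) loop in index_document:
import math

def n_chunks(n_tokens: int, size: int = 800, overlap: int = 100) -> int:
    # one chunk per loop iteration, stepping by size - overlap tokens
    return math.ceil(n_tokens / (size - overlap))

print(n_chunks(10_000))            # 15 chunks at the default 800/100
print(n_chunks(10_000, size=400))  # 34 chunks at 400/100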
3. Top-K Parameter¶
- Top-K=5: Fast, lower noise
- Top-K=8: Default, good balance
- Top-K=15: More context, slower generation, more hallucinations possible
Maintenance¶
Update Toolbox¶
Add new files to the CAE Toolbox folder and re-run scan_and_index() periodically, or keep the watcher sketch below running for automatic updates.
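The architecture diagram shows a File Watcher Service and Step 2 installs watchdog, but the steps above never wire it up. A minimal sketch (the file name watcher.py is my choice; it reuses parse_file and index_document from Step 5):
# watcher.py (sketch): auto-index new or edited files in the toolbox
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from config import TOOLBOX_ROOT, TEXT_EXTENSIONS
from rag_service import parse_file, index_document

class ToolboxHandler(FileSystemEventHandler):
    def on_created(self, event):
        path = str(event.src_path)
        if not event.is_directory and Path(path).suffix.lower() in TEXT_EXTENSIONS:
            index_document(path, parse_file(path))
    on_modified = on_created  # treat edited files like new ones

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(ToolboxHandler(), TOOLBOX_ROOT, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()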
Reset Index¶
To rebuild from scratch, stop the service, delete the chroma_db folder (and index_state.json if present), then re-run python rag_service.py.
Monitor Logs¶
Check logs/ directory for parsing errors and indexing issues.
Limitations & Future Work¶
| Limitation | Workaround |
|---|---|
| No image search | Use OCR to extract text from screenshots |
| Table extraction | Tables become flattened text (suboptimal) |
| Multi-document reasoning | Retrieve multiple docs; let LLM synthesize |
| No real-time updates | Re-index manually when toolbox changes |
| Hallucinations in LLM | Use relevance threshold; ask for sources |
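For the image-search gap, one option is Tesseract via pytesseract. Note this is an assumption, not part of the setup above: neither package is installed in Step 2, and Tesseract itself must be installed separately.
# OCR sketch: assumes `pip install pytesseract pillow` plus a local
# Tesseract install; the screenshot path is a placeholder
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open(r"C:\path\to\screenshot.png"))
print(text[:500])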
Next Steps¶
- Index your toolbox: Run scan_and_index() (2-3 hours first time)
- Test retrieval: Query for simple topics to validate setup
- Tune parameters: Adjust CHUNK_SIZE and TOP_K based on results
- Integrate with RapidDraft: Link RAG to DFM findings for justification
- Monitor performance: Log query times and answer quality
Quick Checklist¶
- Python 3.11+ installed
- Dependencies installed: pip install -r requirements.txt
- config.py configured with correct paths
- LM Studio running (port 1234) with Qwen 32B
- rag_service.py executed to build initial index
- chat_ui.py running (port 7860)
- Tested sample queries
- Documented CAE Toolbox location and structure