Purpose & Vision
AI Orchestrator Studio (AOS) is a full-stack, self-hosted AI platform that lets organisations build, deploy, and manage autonomous AI agents capable of performing real infrastructure tasks — not just answering questions.
Traditional AI chatbots are limited to text conversations. AOS agents go further: they SSH into servers, execute SQL queries, scan documents with OCR, call REST APIs, manage Docker containers, and delegate work to other agents. Each agent is a configurable unit with its own system prompt, skill set, LLM provider, and credential bindings.
AOS is designed for enterprise IT teams, MSPs, and DevOps organisations that need AI to automate complex, multi-step operational workflows — not just generate text. The platform covers 9 enterprise domains: Infrastructure, Database Administration, Security & Compliance, Networking, DevOps, Cloud / OpenStack, Project Management, Customer Experience, and Microsoft 365 / Copilot — with 95+ built-in skills across 37 categories.
Architecture Deep-Dive
AOS follows a three-tier architecture designed for production deployment, horizontal scaling, and clear separation of concerns. Every tier can run on a separate server.
Enterprise Orchestrator → UniversalAgentExecutor → ModelRouter → LLMClient (10+ providers)
Celery Worker → Tesseract → easyocr → LLM Vision → ChromaDB Vector Store
Tier 1 — Application Server
| Component | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + MUI 5 + TypeScript | Agent Builder UI, Chat Studio, Dashboard, Skill Manager, Document Manager, System Config |
| Reverse Proxy | nginx 1.20+ | TLS termination, static file serving, API routing to backend port 8000 |
| Backend API | FastAPI + uvicorn | All REST endpoints — agents, skills, credentials, auth, RAG, chat, scheduling |
| Enterprise Orchestrator | enterprise_orchestrator.py | Deterministic 3-path routing: ANALYTICS → DOCUMENTS → GENERAL. Rule-based keyword matching, no LLM in router. Redis-cached, traced, observable. |
| Agent Executor | UniversalAgentExecutor | Receives user messages, assembles context (system prompt + internal skills + RAG chunks + planning), routes to LLM via ModelRouter, parses tool calls with 4-tier fallback, executes skills, manages delegation |
| Model Router | ModelRouter | Per-agent multi-LLM routing: task classification → task_routing map → primary_connection → fallback_chain[] → system_default. Supports 10+ LLM providers. |
| LLM Client | LLMClient | Provider auto-detection from URL, endpoint probing, auth handling (Bearer/API_KEY/Basic/Custom/None). For vLLM/Custom, tools payload is skipped (ReAct fallback). |
| RAG Engine | ChromaDB + Cross-Encoder | 6-stage pipeline: semantic chunking → ingest-time embedding → hybrid BM25+vector with RRF → metadata filtering → cross-encoder re-ranking → semantic cache. |
Tier 2 — Database & Cache Layer
| Component | Technology | Purpose |
|---|---|---|
| Primary Database | PostgreSQL 16 | Agents, skills, credentials (AES-256 encrypted), users, documents, document_agent_links, audit trails, scheduler jobs. PostgreSQL-only (no SQLite). |
| Session & RAG Cache | Redis | Chat memory, rate limiting, API Gateway counters, Celery task broker, semantic query cache (cosine similarity >0.95, 24h TTL), Enterprise Orchestrator routing cache |
| Vector Store | ChromaDB | Document embeddings for RAG retrieval. Persistent agent-scoped collections (agent_{id}_documents). |
Tier 3 — Async Workers & LLM
| Component | Technology | Purpose |
|---|---|---|
| OCR Worker | Celery + Tesseract 5 + easyocr + OpenCV | 3-strategy cascade: Tesseract → easyocr → LLM Vision. 8-step image preprocessing. Auto-promotes results into RAG vector store. |
| LLM Vision Fallback | Any multimodal LLM | When Tesseract/easyocr confidence is below threshold, pages are sent to LLM Vision for re-extraction |
| LLM Server | vLLM / Ollama / Azure OpenAI / OpenAI / Anthropic / Cohere / HuggingFace / LlamaCpp / TextGen WebUI / Custom | Any OpenAI-compatible endpoint. ModelRouter selects per-agent per-task. Automatic fallback chains. |
Agent Execution Flow
When a user sends a message, it passes through the Enterprise Orchestrator for deterministic 3-path routing (no LLM in the router), then into the UniversalAgentExecutor for the full agent pipeline.
Enterprise Orchestrator — 3-Path Deterministic Routing
ANALYTICS KPI, metric, dashboard, chart → Metrics API
DOCUMENTS document, search, find file, OCR → RAG Search
GENERAL everything else → UniversalAgentExecutor
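The three routing rules above can be sketched as a plain keyword check. This is a minimal illustration rather than the code in enterprise_orchestrator.py: the `Route` enum, `route_message`, the keyword tuples, and the priority order (ANALYTICS, then DOCUMENTS, then GENERAL) are assumptions taken from the summary above.

```python
from enum import Enum


class Route(Enum):
    ANALYTICS = "analytics"
    DOCUMENTS = "documents"
    GENERAL = "general"


# Keywords copied from the routing summary; the real rule set is assumed to be larger.
ANALYTICS_KEYWORDS = ("kpi", "metric", "dashboard", "chart")
DOCUMENT_KEYWORDS = ("document", "search", "find file", "ocr")


def route_message(message: str) -> Route:
    """Pick a route by checking keyword groups in priority order (no LLM involved)."""
    text = message.lower()
    if any(keyword in text for keyword in ANALYTICS_KEYWORDS):
        return Route.ANALYTICS
    if any(keyword in text for keyword in DOCUMENT_KEYWORDS):
        return Route.DOCUMENTS
    return Route.GENERAL
```

Plain substring matching is deliberately naive here ("ocr" also matches inside "mediocre"); a production router would more likely match on word boundaries.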
UniversalAgentExecutor — Full Pipeline
- 1. Agent Loading: Load agent config from PostgreSQL (name, system_prompt, skills, parameters, LLM routing config).
- 2. Skill Separation: Loaded skills are split into two groups:
  • Internal skills (type=internal) → prompt-injected as context enrichment, NOT exposed as callable tools.
  • Callable skills (type=ssh, http, python, sql, ansible, etc.) → converted to OpenAI function-calling tool definitions.
- 3. Credential Discovery (3 tiers):
  • Tier 1: Agent config (agent.extra_config.credentials).
  • Tier 2: RBAC bindings (credential_bindings table).
  • Tier 3: Auto-discovery (scan skills → match by credential type/name).
- 4. RAG Injection: If the agent has linked documents, the user query is vector-searched against DocumentChunks in ChromaDB → top-K results are injected into the system prompt as ### Reference Context ###. This is where .skill package reference files become searchable.
- 5. System Prompt Assembly: Final prompt is built in order: [1] Agent's base system_prompt + [2] Internal skill prompts + [3] RAG reference chunks + [4] Deep Planning instructions (if enabled) + [5] ReAct instructions (if provider doesn't support native function calling).
- 6. ModelRouter Selection: ModelRouter.select(agent_config, task_type, message) checks:
  • Task routing map (e.g., code_gen → connection_id_for_codellama)
  • Primary connection (primary_connection_id)
  • System default (system_settings.default_llm_connection_id)
- 7. LLM Inference: LLMClient auto-detects the provider from the URL, probes candidate endpoints, and sends the prompt. For vLLM/Custom providers, the tools payload is skipped (ReAct text fallback handles tool calling).
- 8. 4-Tier Tool Call Parsing (max 10 rounds):
  • Tier 1: native tool_calls[] format → execute directly.
  • Tier 2: parse ACTION / ACTION_INPUT blocks from text (ReAct).
  • Tier 3: false-completion detection + forceful re-prompt.
  • Tier 4: keyword scoring against skill descriptions → auto-invoke best match.
- 9. Skill Execution + Delegation: Matched skill handler runs (SSH, SQL, HTTP, Docker, Ansible, etc.). If the skill is agent_delegation, it spawns a sub-agent with recursion depth guards (max 3 levels). Results feed back as OBSERVATION.
- 10. Final Answer: When the LLM produces no more tool calls (FINAL ANSWER), the synthesised response is returned to the user and stored in chat history (Redis).
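Skill separation and system prompt assembly (steps 2 and 5) can be sketched as follows. This is an illustration under assumptions, not the UniversalAgentExecutor code: the `Skill` dataclass, `split_skills`, `assemble_system_prompt`, and the blank-line separator are hypothetical; only the section order and the ### Reference Context ### marker come from the text above.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Skill:
    name: str
    type: str
    system_prompt: str = ""


def split_skills(skills: Sequence[Skill]) -> Tuple[List[Skill], List[Skill]]:
    """Separate prompt-injected internal skills from callable tool skills."""
    internal = [skill for skill in skills if skill.type == "internal"]
    callable_skills = [skill for skill in skills if skill.type != "internal"]
    return internal, callable_skills


def assemble_system_prompt(
    base_prompt: str,
    internal_skills: Sequence[Skill],
    rag_chunks: Sequence[str],
    planning_instructions: str = "",
    react_instructions: str = "",
) -> str:
    """Join the prompt sections in the documented order, skipping empty ones."""
    sections = [base_prompt]
    sections.extend(skill.system_prompt for skill in internal_skills)
    if rag_chunks:
        sections.append("### Reference Context ###\n" + "\n\n".join(rag_chunks))
    sections.append(planning_instructions)
    sections.append(react_instructions)
    return "\n\n".join(section for section in sections if section)
```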
Multi-LLM Routing Engine
AOS doesn't lock you to a single LLM. The ModelRouter (model_router.py) implements
per-agent, task-aware model selection with automatic fallback chains across 10+ LLM providers.
Routing Decision Flow
Task classification (keyword scoring) → ModelRouter.select()
ModelRouter → ① task_routing[task]? → ② primary_connection? → ③ system_default
LLMClient → Auto-detect provider from URL → Probe endpoints → Cache working endpoint
Task Classification
The ModelRouter classifies each user message into a task type using keyword scoring. Each task type can be routed to a different LLM connection.
| Task Type | Trigger Keywords | Best Model For |
|---|---|---|
| reasoning | "think", "analyse", "why", "explain" | GPT-4, Claude 3 Opus |
| code_gen | "write code", "debug", "function", "script" | Codellama, GPT-4, DeepSeek |
| rag_answer | "search", "find in documents", "what does the doc say" | GPT-4o, Qwen |
| summarize | "summarize", "TLDR", "brief" | Claude 3 Haiku, Llama 3 |
| classify | "classify", "categorize", "label" | Fast local models |
| extract | "extract", "parse", "pull out" | GPT-4o-mini, Llama 3 |
| translate | "translate to", "in Arabic", "en français" | GPT-4, Llama 3 |
| chat | General conversation | Any model (default) |
| tool_call | "run", "execute", "SSH into" | Models with function calling |
| planning | "plan", "steps to", "how to" | GPT-4, Claude 3 Opus |
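Keyword scoring over the table above might look like the sketch below. The scoring rule (one point per matching keyword, highest score wins, `chat` when nothing matches) and the names `TASK_KEYWORDS` and `classify_task` are assumptions; the source does not specify how ties or weights are handled.

```python
# Trigger keywords copied from the task table, lower-cased for matching.
TASK_KEYWORDS = {
    "reasoning": ("think", "analyse", "why", "explain"),
    "code_gen": ("write code", "debug", "function", "script"),
    "rag_answer": ("search", "find in documents", "what does the doc say"),
    "summarize": ("summarize", "tldr", "brief"),
    "classify": ("classify", "categorize", "label"),
    "extract": ("extract", "parse", "pull out"),
    "translate": ("translate to", "in arabic", "en français"),
    "tool_call": ("run", "execute", "ssh into"),
    "planning": ("plan", "steps to", "how to"),
}


def classify_task(message: str) -> str:
    """Return the task type with the most keyword hits, or 'chat' if none match."""
    text = message.lower()
    scores = {
        task: sum(1 for keyword in keywords if keyword in text)
        for task, keywords in TASK_KEYWORDS.items()
    }
    best_task, best_score = max(scores.items(), key=lambda item: item[1])
    return best_task if best_score > 0 else "chat"
```

On a tie, `max` keeps the first task in dictionary order, which is an arbitrary choice in this sketch.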
Per-Agent Routing Configuration
Each agent’s llm_routing config supports:
| Field | Type | Description |
|---|---|---|
| primary_connection_id | UUID | Default LLM connection for this agent |
| fallback_chain | UUID[] | Ordered list of fallback connections if primary fails |
| task_routing | Map<task, UUID> | Task-specific LLM overrides (e.g., {"code_gen": "id-for-codellama"}) |
| cost_aware | bool | Enable cost-optimised routing (prefer cheaper models for simple tasks) |
| max_fallback_attempts | int | Max attempts before giving up (default 3) |
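The resolution order described for the ModelRouter (task_routing → primary_connection → fallback_chain → system_default) can be expressed as an ordered candidate list. The function name `candidate_connections` and the de-duplication step are assumptions for illustration; connection IDs are shown as plain strings rather than UUIDs.

```python
from typing import List, Mapping


def candidate_connections(
    llm_routing: Mapping, task_type: str, system_default: str
) -> List[str]:
    """Build the ordered list of connection IDs to try for one request."""
    candidates = []
    task_specific = llm_routing.get("task_routing", {}).get(task_type)
    if task_specific:
        candidates.append(task_specific)
    primary = llm_routing.get("primary_connection_id")
    if primary:
        candidates.append(primary)
    candidates.extend(llm_routing.get("fallback_chain", []))
    candidates.append(system_default)
    # Drop repeats while preserving first-seen order.
    return list(dict.fromkeys(candidates))
```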
Supported LLM Providers
The LLMClient (llm_client.py) auto-detects the provider from the connection URL and
adapts its behaviour accordingly:
| Provider | Detection Pattern | Tool Calling | Auth |
|---|---|---|---|
| OpenAI | api.openai.com | Native tool_calls | Bearer token |
| Azure OpenAI | *.openai.azure.com | Native tool_calls | API key |
| Anthropic | api.anthropic.com | Native tool_calls | API key |
| Cohere | api.cohere.ai | Native | Bearer token |
| vLLM | /v1/completions | ⚠️ Skipped → ReAct text fallback | Bearer / None |
| Ollama | :11434 | ⚠️ Skipped → ReAct text fallback | None |
| TextGen WebUI | :5001 | ⚠️ Skipped → ReAct text fallback | None |
| LlamaCpp | :8080 | ⚠️ Skipped → ReAct text fallback | None |
| HuggingFace | api-inference | ⚠️ Skipped → ReAct text fallback | Bearer token |
| Custom | Any other URL | ⚠️ Skipped → ReAct text fallback | Configurable |
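For providers marked as skipped above, the tools payload is omitted, ReAct instructions are injected into the system prompt (ACTION / ACTION_INPUT format), and Tier 2 parsing handles tool execution. This means any LLM works with any skill, with no provider lock-in.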
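Provider auto-detection from the connection URL could be sketched with the detection patterns in the table above. The function names, the returned identifiers, and the order in which patterns are checked are assumptions; llm_client.py may combine these signals differently and also probes endpoints, which is not shown here.

```python
from urllib.parse import urlparse

# Providers listed with native tool calling in the table above.
NATIVE_TOOL_PROVIDERS = frozenset({"openai", "azure_openai", "anthropic", "cohere"})

# Port-based patterns from the table.
PORT_PROVIDERS = {11434: "ollama", 5001: "textgen_webui", 8080: "llamacpp"}


def detect_provider(url: str) -> str:
    """Guess the provider from host, port, and path patterns."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "api.openai.com":
        return "openai"
    if host.endswith(".openai.azure.com"):
        return "azure_openai"
    if host == "api.anthropic.com":
        return "anthropic"
    if host == "api.cohere.ai":
        return "cohere"
    if "api-inference" in host:
        return "huggingface"
    if parsed.port in PORT_PROVIDERS:
        return PORT_PROVIDERS[parsed.port]
    if "/v1/completions" in parsed.path:
        return "vllm"
    return "custom"


def supports_native_tools(provider: str) -> bool:
    """True when the tools payload should be sent instead of ReAct instructions."""
    return provider in NATIVE_TOOL_PROVIDERS
```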
Fallback Chain Behaviour
The call_with_fallback() method in ModelRouter automatically retries with the next connection
in the chain when the current one fails. This ensures agents stay operational even when individual
LLM endpoints go down.
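A fallback loop of this kind can be sketched as below. It is not the actual call_with_fallback() signature: the `call` parameter, the `AllConnectionsFailed` exception, and the reading of max_fallback_attempts as the total number of connections tried are all assumptions.

```python
from typing import Callable, Sequence


class AllConnectionsFailed(RuntimeError):
    """Raised when every attempted connection failed."""


def call_with_fallback(
    connection_ids: Sequence[str],
    call: Callable[[str], str],
    max_fallback_attempts: int = 3,
) -> str:
    """Try each connection in order and return the first successful response."""
    errors = []
    for connection_id in list(connection_ids)[:max_fallback_attempts]:
        try:
            return call(connection_id)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{connection_id}: {exc}")
    raise AllConnectionsFailed("; ".join(errors))
```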
Deep Agent Planning Loop
AOS implements a Deep Agent execution model that forces LLMs to plan before they act. When enabled, the system injects a planning prompt into the agent’s system message, requiring the LLM to produce a numbered execution plan with dependencies and risk checks before invoking any tools.
Planning prompt injected (up to N steps) → LLM Generates Plan → Step-by-Step Execution
Tool Result → Validate Observation → Next Step / Fallback → Final Synthesis
4-Tier Tool Execution
The execution loop uses a 4-tier fallback strategy to guarantee tool execution works with any LLM — whether it supports native function calling or not:
| Tier | Strategy | When |
|---|---|---|
| Tier 1 | Native tool_calls | LLM returns structured tool_calls in OpenAI format → execute directly |
| Tier 2 | ReAct Parsing | No native tool_calls → parse ACTION / ACTION_INPUT blocks from text |
| Tier 3 | False-Completion Detection | LLM hallucinated completion or described steps instead of doing them → forceful re-prompt |
| Tier 4 | Intent Auto-Dispatch | If the model still refuses structured tool output, infer the most likely tool + parameters from intent and execute as last resort |
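Tier 2 parsing can be illustrated with a small regular-expression parser. The exact text layout is an assumption: this sketch expects an `ACTION: <skill_name>` line followed by a single-line `ACTION_INPUT:` value, preferably a JSON object. `parse_react_action` is a hypothetical name.

```python
import json
import re
from typing import Optional, Tuple

# Assumed layout: "ACTION: name" on one line, "ACTION_INPUT: {...}" on the next.
ACTION_RE = re.compile(
    r"^ACTION:[ \t]*(\S+)[ \t]*\n+ACTION_INPUT:[ \t]*(.+)$", re.MULTILINE
)


def parse_react_action(text: str) -> Optional[Tuple[str, dict]]:
    """Extract (skill_name, arguments) from ReAct-style text, or None if absent."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    name, raw_input = match.group(1), match.group(2).strip()
    try:
        arguments = json.loads(raw_input)
    except json.JSONDecodeError:
        arguments = raw_input
    if not isinstance(arguments, dict):
        # Wrap free-text or scalar input so callers always receive a dict.
        arguments = {"input": arguments}
    return name, arguments
```

Returning None lets the caller fall through to Tier 3 (false-completion detection) when the model produced no ACTION block.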
False-Completion Detection
A critical challenge with LLMs is hallucinated action — the model claims “done!” without calling any tool. AOS detects this with a dual-layer heuristic:
• Future-tense laziness: Phrases like “I would…”, “here’s how…”, “you should…” (20+ patterns)
• Past-tense hallucination: Phrases like “has been created”, “successfully configured”, “done!” (50+ patterns)
When detected, the system switches to ReAct mode and re-prompts: "STOP. You did NOT execute anything. DO IT NOW."
If the model still fails to emit structured ACTION blocks, Tier 4 intent auto-dispatch attempts safe tool recovery.
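The dual-layer heuristic can be sketched with two small pattern lists. Only the handful of phrases quoted above are included; the real lists are described as 20+ and 50+ patterns. The function name, the return labels, and the rule that a response is only suspicious when no tool call was made are assumptions.

```python
import re
from typing import Optional

# Sample patterns only, taken from the phrases quoted in the text.
FUTURE_TENSE_PATTERNS = (r"\bi would\b", r"\bhere['’]s how\b", r"\byou should\b")
PAST_TENSE_PATTERNS = (r"\bhas been created\b", r"\bsuccessfully configured\b", r"\bdone!")


def detect_false_completion(response_text: str, tool_calls_made: int) -> Optional[str]:
    """Return a label when the model talks about work without having called a tool."""
    if tool_calls_made > 0:
        return None
    text = response_text.lower()
    if any(re.search(pattern, text) for pattern in PAST_TENSE_PATTERNS):
        return "past_tense_hallucination"
    if any(re.search(pattern, text) for pattern in FUTURE_TENSE_PATTERNS):
        return "future_tense_laziness"
    return None
```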
Configuration
| Env Variable | Default | Description |
|---|---|---|
| DEEP_AGENT_PLANNING_ENABLED | true | Enable planning-before-action mode |
| DEEP_AGENT_MAX_PLAN_STEPS | 6 | Maximum plan steps before execution begins |
| DEEP_AGENT_PLAN_STYLE | structured | structured (dependencies + risk checks) or concise |
Each agent can override these settings in its config JSON (deep_planning_enabled, deep_planning_max_steps, deep_planning_style). The global settings serve as defaults for all agents.
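One way to combine the global environment defaults with per-agent fields is shown below. The `PlanningConfig` dataclass, `resolve_planning_config`, and the set of strings accepted as "true" are assumptions; the variable and field names come from the table and text above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PlanningConfig:
    enabled: bool
    max_steps: int
    style: str


def resolve_planning_config(
    agent_config: Mapping, env: Optional[Mapping[str, str]] = None
) -> PlanningConfig:
    """Use per-agent values where present, otherwise the global env defaults."""
    env = os.environ if env is None else env
    default_enabled = env.get("DEEP_AGENT_PLANNING_ENABLED", "true").strip().lower() in {
        "1", "true", "yes", "on",
    }
    default_steps = int(env.get("DEEP_AGENT_MAX_PLAN_STEPS", "6"))
    default_style = env.get("DEEP_AGENT_PLAN_STYLE", "structured")
    return PlanningConfig(
        enabled=bool(agent_config.get("deep_planning_enabled", default_enabled)),
        max_steps=int(agent_config.get("deep_planning_max_steps", default_steps)),
        style=str(agent_config.get("deep_planning_style", default_style)),
    )
```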
.skill Package System
AOS agents can be extended with .skill packages — portable ZIP archives that bundle a skill definition with reference knowledge files. When imported, the skill is registered and all reference documents are automatically ingested into the RAG vector store.
.skill File Format
A .skill file is a renamed ZIP archive containing a SKILL.md file and a references/ directory of knowledge files.
SKILL.md Structure
The YAML frontmatter (name + description) is parsed as skill metadata.
The markdown body becomes the skill’s system_prompt.
Files in references/ are automatically chunked, embedded, and stored in the agent’s
ChromaDB collection as RAG documents.
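A package with this layout can be built and read back using only the standard library. This is a sketch under assumptions: `build_skill_package` and `parse_skill_md` are hypothetical helpers, the frontmatter is assumed to be flat `key: value` pairs between `---` delimiters, and the AOS importer may accept more than is shown here.

```python
import io
import zipfile
from typing import Dict, Tuple


def build_skill_package(
    name: str, description: str, prompt_body: str, references: Dict[str, str]
) -> bytes:
    """Return the bytes of a ZIP archive laid out as SKILL.md + references/."""
    skill_md = f"---\nname: {name}\ndescription: {description}\n---\n\n{prompt_body}\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("SKILL.md", skill_md)
        for filename, content in references.items():
            archive.writestr(f"references/{filename}", content)
    return buffer.getvalue()


def parse_skill_md(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md into (frontmatter metadata, markdown body)."""
    _, frontmatter, body = text.split("---\n", 2)
    metadata = {}
    for line in frontmatter.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            metadata[key.strip()] = value.strip()
    return metadata, body.strip()
```

Writing the returned bytes to a file such as `my-agent.skill` yields an archive with the structure described above.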
Skill Types
| Type | Behaviour | Use Case |
|---|---|---|
| internal | Prompt-injected as context. NOT presented as a callable tool. | Knowledge enrichment, persona definition, domain expertise |
| ssh | Executes SSH commands on bound server | Server administration, log checks, service management |
| http | Calls REST/GraphQL APIs | External integrations, webhooks, data fetching |
| python | Runs Python scripts in sandbox | Data processing, calculations, custom logic |
| sql | Executes SQL queries against bound database | Database administration, reporting, health checks |
| ansible | Runs Ansible playbooks | Infrastructure automation, configuration management |
| shell | Executes shell commands locally | Local automation, file operations |
| graphql | Executes GraphQL queries | Modern API integrations |
| scrapling | Web scraping with Scrapling | Data extraction from websites |
| huggingface | Calls HuggingFace inference API | ML model inference, NLP tasks |
Import API
When a skill's type is internal, its system_prompt is appended to the agent’s prompt as context enrichment: the LLM absorbs the knowledge but cannot "call" the skill as a tool. All other types (ssh, http, sql, etc.) are converted to OpenAI function-calling tool definitions and can be invoked by the LLM during execution.
Pre-Built Template Agents
AOS ships with 4 ready-to-import .skill packages in the repository root. Each contains a full skill definition plus reference knowledge files that are auto-ingested as RAG documents.
| Package | Domain | Knowledge Base |
|---|---|---|
| presales-agent.skill | Enterprise presales: RFP responses, proposals, competitive analysis, objection handling (LAER method), ROI/TCO | rfp-templates.md, objection-handling.md, roi-models.md |
| finance-agent.skill | Financial analysis: budgeting, forecasting, compliance reporting, revenue recognition, audit preparation | financial-models.md, compliance-frameworks.md |
| infra-agent.skill | Infrastructure ops: server management, monitoring, incident response, capacity planning, patching, backup/recovery | runbooks/, architecture-guides/ |
| legal-agent.skill | Legal & compliance: contract review, NDA analysis, regulatory compliance, risk assessment, policy drafting | contract-templates.md, regulatory-guides.md |
To create your own package, write a SKILL.md, add a references/ folder, ZIP it with a .skill extension, and import via POST /api/skills/import-package or the Chat Studio UI.
OpenStack / HCS Skills
AOS ships with 15 built-in OpenStack/HCS skills covering all core services. Agents can manage compute, networking, storage, identity, orchestration, and object storage through natural-language instructions.
| Skill | Service | Capability |
|---|---|---|
| openstack_list_servers | Nova | List compute instances with status, IPs, flavors |
| openstack_server_action | Nova | Start, stop, reboot, suspend, resume instances |
| openstack_create_server | Nova | Provision new instances with flavor, image, network |
| openstack_list_networks | Neutron | List networks, subnets, and routers |
| openstack_manage_network | Neutron | Create/delete networks, subnets, security groups |
| openstack_list_volumes | Cinder | List block storage volumes with status and attachments |
| openstack_manage_volume | Cinder | Create, delete, attach, detach, extend volumes |
| openstack_list_images | Glance | List available OS images |
| openstack_keystone_auth | Keystone | Authenticate and manage tokens, projects, users |
| openstack_list_flavors | Nova | List instance types / flavors |
| openstack_heat_stack | Heat | Create, list, delete orchestration stacks |
| openstack_list_projects | Keystone | List tenants / projects |
| openstack_quota_usage | Nova | Check compute and storage quota usage |
| openstack_server_console | Nova | Get VNC console URL for instances |
| openstack_object_storage | Swift | List, upload, download from object storage |
To connect, create an api_key credential with your OpenStack auth_url, project_name, username, and password in the extra fields. Bind it to the agent and all OpenStack skills will auto-inject authentication.
6-Stage RAG Pipeline
AOS implements a production-grade, 6-stage RAG pipeline that goes far beyond basic chunking
and keyword search. Each stage is independently configurable, wrapped in try/except fallbacks, and
wired to the RAGOptimizer policy engine for dynamic tuning.
① Semantic Chunking → ② Ingest-Time Embedding → Query → ③ Hybrid BM25+Vector → ④ Metadata Filter → ⑤ Cross-Encoder → ⑥ Semantic Cache
Stage ① — Semantic Chunking
Instead of fixed character-count splitting, AOS uses the ChunkingPolicy from the RAG Optimizer
to split text at paragraph boundaries, heading patterns (Markdown, uppercase, CHAPTER/SECTION),
and sentence endings. Small fragments are auto-merged with neighbors. Overlap is applied for
cross-boundary context. Falls back to fixed 4000-char chunking if semantic splitting fails.
Each chunk is stored with enriched metadata: doc_id, doc_type,
source_file, upload_date, page_number, chunk_index,
and chunking_method — enabling downstream metadata filtering.
Stage ② — Ingest-Time Embedding
Immediately after chunking, the DocumentProcessor calls VectorStoreService.embed_document()
to embed all chunks using sentence-transformers/all-MiniLM-L6-v2 (384-dim) and store them
in ChromaDB with enriched metadata. This means documents are queryable the instant processing completes
— no separate "embed" step needed. Non-fatal: if embedding fails, the document is still text-searchable.
Stage ③ — Hybrid Search with Reciprocal Rank Fusion
At query time, AOS runs two parallel searches: keyword (BM25-style scoring with structural detection) and vector (cosine similarity via ChromaDB). Results are merged using Reciprocal Rank Fusion (RRF):
score = bm25_weight × 1/(60 + rank_keyword) + vector_weight × 1/(60 + rank_vector)
Default weights from IndexingPolicy: bm25_weight = 0.3, vector_weight = 0.7. The constant 60 is from the original RRF paper. Chunks that appear in both lists get boosted; unique hits from either source are preserved.
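The weighted fusion formula translates directly into a few lines of Python. `rrf_fuse` is a hypothetical helper name, and ranks are assumed to start at 1; the weights and the constant 60 are the defaults stated above.

```python
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    keyword_ranked: Sequence[str],
    vector_ranked: Sequence[str],
    bm25_weight: float = 0.3,
    vector_weight: float = 0.7,
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Merge two ranked lists of chunk IDs using weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = {}
    for rank, chunk_id in enumerate(keyword_ranked, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + bm25_weight / (k + rank)
    for rank, chunk_id in enumerate(vector_ranked, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + vector_weight / (k + rank)
    # Highest fused score first; chunks found by both searches accumulate two terms.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For keyword results `["a", "b"]` and vector results `["b", "c"]`, chunk `b` scores 0.3/62 + 0.7/61 and ranks first because it appears in both lists.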
Stage ④ — Metadata Pre-Filtering
The retrieve_for_agent() method now accepts optional filter parameters:
doc_type (e.g. "pdf", "docx"), source_file (substring match),
upload_date_from / upload_date_to (ISO date range).
Filters are applied at both the ChromaDB where-clause level (for vector search) and as
post-filters (for keyword results). All filters are fully optional — no filter = search everything.
Stage ⑤ — Cross-Encoder Re-Ranking
After hybrid search, the top candidates are re-ranked using cross-encoder/ms-marco-MiniLM-L-6-v2
— a dedicated re-ranking model that scores each (query, chunk) pair locally.
This is the PRIMARY re-ranker (fast, no API cost). If the cross-encoder is unavailable,
the system falls back to the original LLM API re-ranking (sends chunks to the LLM for 0.0–1.0 scoring).
| Re-Ranker | Model | Speed | Cost |
|---|---|---|---|
| Primary | cross-encoder/ms-marco-MiniLM-L-6-v2 | ~5ms per chunk | Zero (local inference) |
| Fallback | Configured LLM (vLLM / Ollama / Azure) | ~200ms per batch | LLM API token cost |
Combined scoring: 30% keyword score + 70% cross-encoder score. The final top-K results are returned to the agent's context window.
Stage ⑥ — Semantic Query Cache
A new RAGSemanticCache service uses Redis to cache RAG results.
On each query, the cache embeds the query and checks for any cached entry with
cosine similarity > 0.95. Cache hit → return cached results instantly.
Cache miss → run the full pipeline → store results with a 24-hour TTL.
| Setting | Default | Description |
|---|---|---|
| RAG_CACHE_TTL_SECONDS | 86400 (24h) | Time-to-live for cached query results |
| RAG_CACHE_SIMILARITY_THRESHOLD | 0.95 | Minimum cosine similarity for a cache hit |
Cached entries are cleared via RAGSemanticCache.invalidate(agent_id) when documents change.
The cache is per-agent scoped, so updating one agent's documents won't affect another agent's cache.
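The lookup logic of a semantic cache can be sketched in memory, without Redis. `InMemorySemanticCache` is a stand-in invented for illustration, not the RAGSemanticCache service; it takes query embeddings as plain lists of floats and uses the documented defaults (threshold 0.95, TTL 86400 s).

```python
import math
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Per-agent cache keyed by query embedding similarity (illustrative only)."""

    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.time) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, List[Tuple[List[float], list, float]]] = {}

    def get(self, agent_id: str, query_embedding: Sequence[float]) -> Optional[list]:
        now = self._clock()
        # Drop expired entries, then look for a sufficiently similar cached query.
        live = [entry for entry in self._entries.get(agent_id, []) if entry[2] > now]
        self._entries[agent_id] = live
        for embedding, results, _expires_at in live:
            if cosine_similarity(query_embedding, embedding) > self.threshold:
                return results
        return None

    def put(self, agent_id: str, query_embedding: Sequence[float], results: list) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._entries.setdefault(agent_id, []).append(
            (list(query_embedding), results, expires_at)
        )

    def invalidate(self, agent_id: str) -> None:
        self._entries.pop(agent_id, None)
```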
OCR Worker
The Celery-based OCR worker applies an 8-step image preprocessing pipeline (deskew, denoise, threshold, contrast, DPI scaling, border removal, rotation correction, binarisation) before running Tesseract 5. If the confidence score falls below the threshold, the page escalates to easyocr and then, if confidence is still low, to an LLM Vision model for re-extraction. Successfully extracted text is auto-promoted into the 6-stage RAG pipeline, with no manual step required.
Enhanced RAG Pipeline — Self-Healing
The 6-stage pipeline above describes what RAG does. The v2 enhancements below describe how the pipeline survives real production conditions: air-gapped servers, RHEL hosts with an outdated sqlite3, transient Ollama hiccups, connection-pool storms during bulk reindex, and stale stats reporting.
Query → ④ Hybrid + RRF → ⑤ Filter + X-Encoder → ⑥ Semantic Cache → Grounded Answer
Embedding Provider Chain (3 tiers + retry)
VectorStoreService._embed_via_api() resolves the embedding endpoint in this order, then commits to it
for the rest of the process via a sticky _api_endpoint_available flag — so a transient
failure can never silently downgrade to an offline local model.
| # | Source | Resolution | Default |
|---|---|---|---|
| 1 | Explicit | EMBEDDING_BASE_URL + EMBEDDING_API_KEY + EMBEDDING_MODEL | — |
| 2 | Auto-detect | Probe http://127.0.0.1:11434/api/tags at process start | qwen3-embedding:0.6b |
| 3 | Legacy | Re-use the chat LLM_BASE_URL for /v1/embeddings | — |
Retry Policy
| Setting | Default | Description |
|---|---|---|
| EMBEDDING_API_MAX_RETRIES | 4 | Attempts per batch on timeouts, connection resets, 5xx, 429 |
| EMBEDDING_API_TIMEOUT | 180 s | Per-request timeout (was 60 s; bumped for slow Ollama under load) |
| Backoff | 2,4,8,15 s | Exponential, capped at 15 s. 4xx (bad config) fails fast. |
| Logging | WARNING | Every retry is visible in journalctl as Embedding API attempt N/4 failed … |
Previously, a transient API failure could silently downgrade the process to the local sentence-transformers/all-MiniLM-L6-v2 model, which then immediately failed every remaining doc with "HF offline mode set and model not cached". The sticky API flag + retry policy makes that impossible.
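The retry policy can be sketched as a capped exponential backoff loop. `RetryableEmbeddingError`, `backoff_delay`, and `embed_with_retry` are hypothetical names, and the delay formula min(2^attempt, 15) is an assumption chosen because it reproduces the documented 2, 4, 8, 15 s schedule. Non-retryable errors (such as 4xx) are simply not caught, so they fail fast.

```python
import time
from typing import Callable, List


class RetryableEmbeddingError(Exception):
    """Stands in for timeouts, connection resets, 5xx, and 429 responses."""


def backoff_delay(attempt: int, cap: float = 15.0) -> float:
    """Exponential delay in seconds for a 1-based attempt number, capped."""
    return min(2.0 ** attempt, cap)


def embed_with_retry(
    embed_batch: Callable[[], List[List[float]]],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Call embed_batch, retrying retryable failures up to max_retries attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return embed_batch()
        except RetryableEmbeddingError:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```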
Vector Store Backend Fallback
VectorStoreService.__init__() wraps Chroma initialisation in except Exception — not just
ImportError — so RHEL9 hosts whose sqlite3 < 3.35.0 trigger Chroma's
RuntimeError still get a working RAG via the Postgres-backed _SQLiteVecBackend.
| Backend | Storage | When used |
|---|---|---|
_ChromaBackend | On-disk Chroma persistent client (HNSW, cosine) | Default — when chromadb imports cleanly |
_SQLiteVecBackend | DocumentChunk.embedding_vector JSON column in PostgreSQL | Fallback — Chroma unavailable, missing, or sqlite too old |
Tunable Chunk Sizing
Defaults were lowered from 4000/600 to 1500/300 (chunk size / overlap) for better recall on small/medium corpora. Override per deployment via the RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP environment variables.
Bulk Operations API
| Endpoint | Verb | Purpose |
|---|---|---|
| /api/documents/embed-pending | POST | Embed every doc whose has_embeddings = false for the given agent. Resolves agent_id as UUID OR display name. |
| /api/documents/reindex-all | POST | Re-chunk + re-embed every linked doc. Uses current RAG_CHUNK_SIZE / RAG_CHUNK_OVERLAP. Returns chunk_size_used + chunk_overlap_used. |
| /api/documents/vector-store/stats | GET | Reports the effective backend + embedding model + endpoint + ollama_auto_detected flag. No more stale config. |
Sandbox & On-Spot Skills
Operators can author and attach a Python skill to any agent in a single API call — without restarting the backend, without editing files, and without giving the LLM unrestricted code execution. Inline Python skills run inside a hardened executor by default.
Create & Attach in One Call
Sandbox Guarantees
| Layer | Limit | Mechanism |
|---|---|---|
| Imports | Allow-list only | json, re, math, datetime, itertools, collections, statistics, hashlib, base64. Everything else blocked at __import__. |
| Network | None | Socket module unavailable; no requests, no urllib, no httpx. |
| Filesystem | None | open() overridden; no read/write to host paths. |
| CPU / wall-clock | Configurable | Per-skill timeout, default 5 s, hard kill on overrun. |
| Memory | Soft cap | RLIMIT_AS where supported (Linux). |
| Output | Captured | stdout / stderr collected, returned to executor for logging. |
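The import allow-list layer can be illustrated by replacing `__import__` in the builtins handed to `exec`. This is a teaching sketch only and is not a real security boundary: it omits the timeout, memory cap, and output capture from the table, and `run_inline_skill`, the `inputs`/`result` convention, and the list of exposed builtins are all assumptions.

```python
import builtins

# Allow-list copied from the Imports row of the table above.
ALLOWED_MODULES = frozenset({
    "json", "re", "math", "datetime", "itertools",
    "collections", "statistics", "hashlib", "base64",
})

# Small set of harmless builtins; open() is deliberately absent.
SAFE_BUILTIN_NAMES = (
    "abs", "bool", "dict", "enumerate", "float", "int", "len", "list",
    "max", "min", "range", "round", "set", "sorted", "str", "sum", "tuple", "zip",
)


def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Allow imports only when the top-level module is on the allow-list."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"import of '{name}' is blocked by the allow-list")
    return builtins.__import__(name, globals, locals, fromlist, level)


def run_inline_skill(source: str, inputs: dict):
    """Execute skill source with restricted builtins and return its 'result' variable."""
    safe_builtins = {name: getattr(builtins, name) for name in SAFE_BUILTIN_NAMES}
    safe_builtins["__import__"] = guarded_import
    scope = {"__builtins__": safe_builtins, "inputs": inputs}
    exec(source, scope)
    return scope.get("result")
```

A hardened executor would typically add process isolation on top of this, since in-process restrictions of Python builtins are known to be bypassable.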
Skill Types
| Type | Sandbox? | Use case |
|---|---|---|
| python_inline | Default ON | Pure-Python transforms, parsing, calculations |
| http_call | n/a | HTTP-only; no code execution path |
| shell_exec | Off | Power-user only; credentials + RBAC required |
| ssh_command | Off | Targets pre-bound SSH credentials only |
| sql_query | Read-mode flag | Optional readonly: true blocks DDL/DML |
Recommended workflow: prototype with python_inline + sandbox, promote to a .skill package once stable, ship to other environments via the import endpoint.
Agent Academy
The Agent Academy turns a fresh universal agent into a domain expert in four steps —
using only the building blocks already in AOS (.skill packages, the 6-stage RAG pipeline,
sandbox skills, and the bulk reindex API).
Tracks
| Track | .skill pack | What the agent learns |
|---|---|---|
| Presales | presales-agent.skill | RFP responses, LAER objection handling, ROI / TCO modelling, demo prep |
| Finance | finance-agent.skill | Budgeting, forecasting, compliance reporting, revenue recognition |
| Infrastructure | infra-agent.skill | Runbooks, incident response, capacity planning, patching, backup/recovery |
| Legal | legal-agent.skill | Contract review, NDA analysis, regulatory compliance, policy drafting |
| Integration | BYO — drop in your .process / .bw docs | Tibco-style integration mapping, BW activity reference, end-point catalog |
The Four Steps
- 1 · Enrol: Create a universal agent (or pick one). Optionally bind credentials it will need at graduation.
- 2 · Ingest References: Import a .skill pack; its references/ folder is auto-uploaded as documents and linked. RAG kicks in via /api/documents/embed-pending. Verify with /vector-store/stats: total_vectors should grow.
- 3 · Self-Quiz: An on-spot sandbox skill (academy_quiz, python_inline) generates 10 questions from chunked references and asks the agent to answer them. Wrong answers are logged with the missing chunk IDs, feeding the next iteration.
- 4 · Graduate: The agent is flagged academy_status = graduated in metadata. Channel connectors (Teams, Slack, REST) can now publish it. Re-enrolment re-runs steps 2–3 if references change.
Bootstrap Script (sketch)
OCR / RAG Architecture — Dual-Path Design
AOS processes every uploaded document through two independent paths that converge into a single RAG vector store. Understanding both paths is essential for production tuning and troubleshooting.
├──→ Path A · App-Server (sync) → PyPDF2 / PyMuPDF → ChromaDB
└──→ Path B · Celery Worker (async) → Tesseract → easyocr → LLM Vision → ChromaDB
Database Schema
| Database | Technology | Tables / Collections | Role |
|---|---|---|---|
| Relational | PostgreSQL (mandatory, no SQLite) | documents, document_chunks, document_agent_links, agents, skills, credentials | All metadata, status, agent links, file paths, credentials (AES-256). Single PostgreSQL database for everything. |
| Vector | ChromaDB | agent_{id}_documents (per-agent) | Embedded chunks for semantic search (384-dim via all-MiniLM-L6-v2) |
Path A — App-Server (Synchronous)
Runs inside the FastAPI process on POST /api/documents/upload.
Best for native-text PDFs, DOCX, TXT, Markdown — any format
that already contains extractable text. Completes in seconds.
- 1 · File receipt & metadata: Save to uploads/, insert a row into the documents table (status = processing).
- 2 · Text extraction: PyPDF2 first, fallback to PyMuPDF (fitz). If extracted text < MIN_USEFUL_TEXT (50 chars), OCR is triggered inline via the Tesseract → easyocr → LLM Vision cascade.
- 3 · Semantic chunking: Split into overlapping chunks (default 1 000 tokens, 200 overlap) using a sentence-boundary-aware splitter.
- 4 · Embedding & storage: Each chunk → all-MiniLM-L6-v2 (384-dim, L2-normalised) → upserted into the agent-scoped ChromaDB collection.
- 5 · Status update: Set document status to completed. RAG is immediately available.
Path B — Celery Worker (Asynchronous)
Triggered via POST /api/documents/batch-scan or the admin “Scan NFS” button.
Designed for scanned PDFs, images (TIFF/PNG/JPEG), and bulk directory ingestion.
Runs on a separate worker node (or the same host) via Redis-brokered Celery.
| Setting | Default | Description |
|---|---|---|
OCR_STRATEGY | cascade | Try Tesseract → easyocr → LLM Vision in order |
OCR_DPI | 300 | Resolution for PDF-to-image conversion |
OCR_CONFIDENCE_THRESHOLD | 0.60 | Below this, page escalates to next strategy |
MIN_USEFUL_TEXT | 50 | Chars required before text-extraction is considered valid |
- 1 · Directory scan: Walk the NFS/local path, discover PDF/image files, create documents rows (status = queued).
- 2 · Celery task dispatch: One Celery task per file. Redis broker distributes across available workers.
- 3 · PDF → image: pdf2image (poppler) converts each page to a PIL image at the configured DPI.
- 4 · 8-step preprocessing: Deskew → denoise → threshold → contrast → DPI scale → border removal → rotation correction → binarisation.
- 5 · OCR cascade: Tesseract 5 first. If confidence < threshold → easyocr. If still low → LLM Vision (vLLM multimodal endpoint).
- 6 · Auto-promote to RAG: Extracted text is chunked and embedded exactly like Path A (same chunker, same model) → upserted into ChromaDB.
- 7 · Status callback: Celery result backend updates document status to completed or failed with error details.
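The confidence-based escalation in the OCR cascade can be sketched with pluggable engine callables. The `OcrEngine` signature, `run_ocr_cascade`, and the choice to return the best low-confidence result when every engine falls short are assumptions; real engines (Tesseract, easyocr, an LLM Vision endpoint) would be wrapped to match this shape.

```python
from typing import Callable, Sequence, Tuple

# Each engine takes page image bytes and returns (text, confidence in 0..1).
OcrEngine = Callable[[bytes], Tuple[str, float]]


def run_ocr_cascade(
    page_image: bytes,
    engines: Sequence[Tuple[str, OcrEngine]],
    threshold: float = 0.60,
) -> Tuple[str, str, float]:
    """Try engines in order; stop at the first result meeting the threshold."""
    best = ("", "", 0.0)
    for name, engine in engines:
        text, confidence = engine(page_image)
        if confidence >= threshold:
            return name, text, confidence
        if confidence > best[2]:
            best = (name, text, confidence)
    # No engine met the threshold: hand back the most confident attempt.
    return best
```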
Path A vs Path B — Comparison
| Dimension | Path A · App-Server | Path B · Celery Worker |
|---|---|---|
| Trigger | POST /api/documents/upload | POST /api/documents/batch-scan |
| Execution | Synchronous (blocking) | Asynchronous (Celery task) |
| Best for | Native-text PDFs, DOCX, TXT | Scanned PDFs, images, bulk dirs |
| OCR engines | Tesseract → easyocr → LLM Vision (inline) | Tesseract → easyocr → LLM Vision (worker) |
| Preprocessing | None (text already extractable) | 8-step image pipeline |
| Scalability | Single-process (FastAPI) | Horizontal (add Celery workers) |
| Latency | 1–5 seconds | 10–120 seconds per document |
| Embedding model | all-MiniLM-L6-v2 (384-dim, L2-normalised) | Same as Path A |
| Vector store | ChromaDB, agent-scoped collection | Same as Path A |
Batch-Scan Flow
Document → Agent Linking
Documents are linked to agents via the document_agent_links join table.
A single document can be linked to multiple agents — each agent gets its own
copy of the chunks in its scoped ChromaDB collection.
| Method | Endpoint / Action | Description |
|---|---|---|
| Upload with agent | POST /api/documents/upload?agent_id=X | Link at upload time |
| Batch-scan with agent | POST /api/documents/batch-scan body | Link all scanned docs to specified agent |
| Manual link | POST /api/documents/links | Link existing document to agent after the fact |
| Auto-link | Agent config: auto_link_uploads = true | Automatically link all new uploads |
Query-Time RAG Flow
When a user sends a message to an agent with RAG enabled, the query goes through the 6-stage pipeline to retrieve the most relevant context from that agent’s document collection.
- 1 · Semantic cache check Hash the query → check Redis semantic cache. If hit (cosine > 0.95), return cached answer immediately.
- 2 · Hybrid search Run both dense (embedding similarity) and sparse (BM25 keyword) search on the agent’s ChromaDB collection.
- 3 · RRF fusion Reciprocal Rank Fusion merges both result lists into a single ranked list (k=60).
- 4 · Cross-encoder re-ranking cross-encoder/ms-marco-MiniLM-L-6-v2 scores each candidate for precise relevance. Top-N (default 5) survive.
- 5 · LLM generation Surviving chunks are injected into the system prompt as context. The LLM generates a grounded answer with source citations.
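The fusion step (stage 3) can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion with k=60, assuming each search returns an ordered list of chunk IDs. The function name rrf_fuse and the sample IDs are hypothetical and are not taken from the AOS codebase.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the dense and sparse searches.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = rrf_fuse([dense, sparse])
```

A chunk that appears in both lists accumulates two contributions, so it tends to outrank a chunk that only one retriever found, which is the behaviour the hybrid search relies on.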
Embedding Pipeline
| Stage | Component | Details |
|---|---|---|
| Chunking | Sentence-boundary splitter | 1 000 tokens, 200 overlap, respects sentence boundaries |
| Embedding model | all-MiniLM-L6-v2 | 384-dim output, L2-normalised before storage in ChromaDB |
| Vector store | ChromaDB | Persistent on-disk, agent-scoped collections |
| Metadata | Per-chunk | source_file, page_number, chunk_index, doc_type, upload_date |
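The chunking stage in the table above can be sketched as a greedy splitter that only breaks on sentence boundaries. This is an illustrative sketch, not the AOS implementation: it approximates tokens by whitespace-separated words, and the function name chunk_sentences and its regex are assumptions.

```python
import re


def chunk_sentences(text, max_tokens=1000, overlap=200):
    """Greedy sentence-boundary chunking; tokens approximated by words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences forward, up to `overlap` tokens.
            kept, kept_count = [], 0
            for prev in reversed(current):
                prev_n = len(prev.split())
                if kept_count + prev_n > overlap:
                    break
                kept.insert(0, prev)
                kept_count += prev_n
            current, count = kept, kept_count
        # A single sentence longer than max_tokens is kept whole, not split.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Small limits so the overlap is visible on a short sample text.
sample = " ".join(f"Sentence number {i} ends here." for i in range(10))
chunks = chunk_sentences(sample, max_tokens=20, overlap=10)
```

With the small limits used here, each chunk holds four sentences and repeats the last two sentences of the previous chunk, mirroring the 1 000 / 200 ratio at a smaller scale.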
Security Model
| Layer | Mechanism | Details |
|---|---|---|
| Authentication | JWT + OAuth2 | Bearer tokens with configurable expiry. Login via email/password. Token refresh supported. |
| AD / LDAP | ldap3 + LDAPS | Enterprise Active Directory authentication. Service-account search + user bind, or direct UPN bind. Auto-provision local users on first login. |
| Authorisation | RBAC (4 roles) | Admin — full access. Operator — manage agents/skills. Viewer — read-only. Developer — API + code execution. |
| Credential Vault | Fernet (AES-128-CBC + HMAC-SHA256) | All credentials encrypted at rest with a server-side key. Decrypted only at skill execution time, never exposed to frontend. |
| Transport | TLS 1.2+ | nginx handles TLS termination. Self-signed or CA-issued certificates supported. |
| Audit Trail | Full logging | Every agent execution, skill call, login, and config change is logged with timestamp, user, and result. |
| Data Governance | PII + Regulatory | PII masking, data classification, retention policies with ISO 27001, NIST CSF, GDPR, CCPA, HIPAA, PCI DSS, SOC 2 regulatory references. |
Data Governance & Regulatory References
AOS includes a built-in Data Governance engine that enforces enterprise policies for data classification, PII detection, retention rules, and access controls. Each policy is enriched with regulatory references mapping to international standards.
| Standard | Full Name | Scope |
|---|---|---|
| ISO 27001 | Information Security Management | Data classification, access control, risk management |
| ISO 27701 | Privacy Information Management | PII processing, privacy controls, data subject rights |
| NIST CSF | Cybersecurity Framework | Identify, Protect, Detect, Respond, Recover |
| NIST 800-53 | Security & Privacy Controls | Federal information system controls (US) |
| NIST AI RMF | AI Risk Management Framework | AI system trustworthiness, bias, transparency |
| GDPR | General Data Protection Regulation | EU personal data processing, consent, erasure |
| CCPA | California Consumer Privacy Act | Consumer data rights, opt-out, disclosure |
| HIPAA | Health Insurance Portability and Accountability Act | Protected Health Information (PHI) safeguards |
| PCI DSS | Payment Card Industry Data Security Standard | Cardholder data protection, encryption, access |
| SOC 2 | System and Organization Controls 2 | Security, availability, processing integrity, confidentiality, privacy |
GET /api/data-governance/references returns the full regulatory reference catalogue.
Each policy response now includes a references_resolved array with standard names, full titles, and descriptions.
Semantic Query Cache Architecture
The RAGSemanticCache in rag_cache.py is a standalone service class with
four public methods. It can be integrated at any point in the retrieval pipeline.
Query → embed → scan Redis (cosine ≥ 0.95?) → HIT: return cached result
MISS → run full RAG pipeline → store in Redis → return results
| Method | Signature | Description |
|---|---|---|
| get() | async get(query, agent_id?) → Dict \| None | Embed query → scan Redis for cosine ≥ 0.95 → return cached result or None |
| put() | async put(query, chunks, summary?, agent_id?) | Store query vector + results in Redis with TTL |
| invalidate() | async invalidate(agent_id?) → int | Clear cache entries, optionally scoped to one agent |
| get_stats() | get_stats() → Dict | Return total entries, TTL, threshold, Redis status |
Cache keys follow the pattern rag_cache:{agent_id}:{sha256(query)[:16]}. Each entry stores the
query vector, up to 50 serialised chunks (text capped at 4000 chars each), an optional summary, and a timestamp.
The Redis index set tracks all active keys for efficient similarity scanning.
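The key format and the similarity check described above can be sketched in plain Python. This is a simplified in-memory illustration under stated assumptions: the helper names cache_key, cosine and lookup are hypothetical, entries are plain dicts rather than Redis values, and the real RAGSemanticCache may differ.

```python
import hashlib
import math

THRESHOLD = 0.95  # similarity threshold stated in the text


def cache_key(agent_id, query):
    """Build a key of the form rag_cache:{agent_id}:{sha256(query)[:16]}."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return f"rag_cache:{agent_id}:{digest}"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(query_vec, entries, threshold=THRESHOLD):
    """Return the most similar cached result at or above the threshold."""
    best, best_score = None, threshold
    for entry in entries:
        score = cosine(query_vec, entry["vector"])
        if score >= best_score:
            best, best_score = entry["result"], score
    return best


# Hypothetical cached entry with a two-dimensional vector for readability.
entries = [{"vector": [1.0, 0.0], "result": "cached answer"}]
key = cache_key("agent-1", "How do I reset a password?")
hit = lookup([1.0, 0.05], entries)   # near-duplicate query
miss = lookup([0.0, 1.0], entries)   # unrelated query
```

Because the key is a hash of the exact query text, it only identifies an entry; matching paraphrased queries depends on the vector comparison, which is why each entry stores the query vector.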
Deployment Options
Quick Start (Single Server)
Production (Multi-Server)
| Server | Role | Services |
|---|---|---|
| App Server | Frontend + API | nginx + React build + FastAPI (port 8000) |
| DB Server | Data layer | PostgreSQL 16 (port 5432) + Redis (port 6379) |
| OCR Worker(s) | Document processing | Celery worker + Tesseract 5 + OpenCV |
| LLM Server | Model inference | vLLM or Ollama (GPU recommended) |
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| OS | RHEL 8 / Ubuntu 20.04 | RHEL 9 / Ubuntu 22.04 |
| CPU | 4 cores | 8+ cores |
| RAM | 8 GB | 16+ GB (32 GB with LLM) |
| Storage | 50 GB | 200+ GB (for documents) |
| Python | 3.9 | 3.11+ |
| Node.js | 18 | 20+ |
Infrastructure Admin Panel
The Infrastructure tab on the Admin Panel gives operators a single pane of glass
to view and edit the backend’s .env configuration and to verify that every
external service is reachable — without SSH access.
Server Configuration
Fields are grouped into four colour-coded sections:
| Section | Colour | Fields |
|---|---|---|
| App Server | Blue | API_HOST, API_PORT, AUTH_SECRET_KEY, LOG_LEVEL, CORS_ORIGINS |
| Database | Green | POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DATABASE, POSTGRES_USER, POSTGRES_PASSWORD, VECTOR_DB_URL, VECTOR_DB_COLLECTION |
| Worker / AI | Amber | REDIS_HOST, REDIS_PORT, REDIS_PASSWORD, LLM_BASE_URL, LLM_DEFAULT_MODEL, LLM_API_KEY |
| Storage | Purple | NFS_BASE_PATH |
Password fields are always displayed as •••••••• and are only written
back to .env if the user actually changes them.
An amber dot appears next to any field that has been modified but not yet saved.
On save the backend writes the new values into the .env file and clears the
settings cache (get_settings.cache_clear()) so the next request picks up changes immediately.
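The rule that masked password fields are not written back can be sketched as a small merge step. This is a hedged illustration only: the function name merge_env_updates and the dict-based representation of .env are assumptions, and the actual backend logic is not shown in this document.

```python
MASK = "••••••••"  # placeholder the UI shows for password fields


def merge_env_updates(current, submitted, mask=MASK):
    """Return new settings, ignoring fields still showing the mask."""
    merged = dict(current)
    for key, value in submitted.items():
        if value == mask:
            continue  # untouched in the UI, so keep the stored secret
        merged[key] = value
    return merged


# Hypothetical values: the password is left masked, the port is edited.
current = {"POSTGRES_PASSWORD": "s3cret", "API_PORT": "8000"}
submitted = {"POSTGRES_PASSWORD": MASK, "API_PORT": "8080"}
merged = merge_env_updates(current, submitted)
```

Returning a new dict rather than mutating the stored settings keeps the old values available until the write to .env succeeds.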
Service Health Check
The Service Health panel tests real connectivity to five infrastructure components:
| Service | How it’s tested |
|---|---|
| PostgreSQL | TCP connect + SELECT 1 via SQLAlchemy session |
| Redis | TCP connect + PING command |
| ChromaDB | TCP connect to the parsed URL host:port |
| Celery Workers | Scan Redis for celery* / _kombu* keys |
| vLLM Server | HTTP GET /models (falls back to TCP if HTTP fails) |
Each service shows a green, amber, or grey status dot plus a detail string so operators can diagnose connectivity issues at a glance.
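The plain TCP probe used for ChromaDB, and as the fallback for the vLLM server, can be sketched with the standard library. This is a minimal sketch under assumptions: the function name tcp_reachable, the two-second timeout, and the wording of the detail string are hypothetical.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return (ok, detail) after attempting a plain TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"TCP connect to {host}:{port} succeeded"
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures.
        return False, f"TCP connect to {host}:{port} failed: {exc}"
```

A call such as tcp_reachable("127.0.0.1", 6379) returns a boolean for the status dot and a string for the detail column. A TCP probe only shows that a port accepts connections, which is why the table pairs it with a protocol-level check (SELECT 1, PING) where one is available.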
API Reference (Key Endpoints)
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/auth/login | Authenticate and receive JWT token |
| GET | /api/agents | List all agents |
| POST | /api/agents | Create a new agent |
| POST | /api/chat/{agent_id} | Send a message to an agent |
| GET | /api/skills | List all registered skills |
| POST | /api/skills | Register a custom skill |
| POST | /api/skills/import-package | Import a .skill ZIP package (registers skill + ingests reference docs as RAG) |
| GET | /api/credentials | List credentials (metadata only) |
| POST | /api/documents/upload | Upload documents for RAG |
| POST | /api/documents/batch-scan | Scan NFS/local directory, queue Celery OCR tasks for all discovered files |
| POST | /api/documents/register-path | Register a watch-path for automatic document discovery |
| POST | /api/documents/{id}/process | Re-process a single document (re-extract, re-chunk, re-embed) |
| POST | /api/documents/links | Link/unlink documents to agents (populates document_agent_links) |
| GET | /api/rag/debug/{agent_id} | RAG debug info: collection stats, chunk count, embedding dimensions, sample chunks |
| POST | /api/auth/login/ad | Authenticate via Active Directory / LDAP |
| GET | /api/data-governance/policies | List data governance policies with regulatory references |
| GET | /api/data-governance/references | Full regulatory reference catalogue (ISO, NIST, GDPR…) |
| POST | /api/documents/search | Hybrid RAG search with RRF fusion + cross-encoder re-ranking. Accepts doc_type, source_file, upload_date_from/to filters. |
| GET | /api/rag/cache/stats | Semantic query cache statistics (entries, TTL, threshold, Redis status) |
| DELETE | /api/rag/cache | Invalidate semantic query cache (optionally scoped by agent_id) |
| GET | /api/scheduler/jobs | List scheduled agent jobs |
| GET | /api/audit | Retrieve audit trail entries |
| GET | /api/admin/config | Get infrastructure config (grouped, passwords masked) |
| PUT | /api/admin/config | Update .env fields (skips masked passwords) |
| GET | /api/admin/health | Test connectivity to PostgreSQL, Redis, ChromaDB, Celery, vLLM |
| POST | /api/admin/restart-hint | Clear settings cache, signal restart recommended |
Visit http://your-server:8000/docs for the interactive Swagger/OpenAPI documentation with all endpoints, request schemas, and response models.
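As a rough illustration of how a client might call two of the endpoints listed above, the sketch below builds the HTTP requests with the standard library without sending them. The JSON field names (email, password, message) are assumptions, not confirmed by this document. The real request schemas are those published at /docs, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://your-server:8000"  # placeholder host from the text


def build_login_request(email, password, base_url=BASE_URL):
    """Build a POST /api/auth/login request (assumed JSON body)."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_chat_request(agent_id, message, token, base_url=BASE_URL):
    """Build a POST /api/chat/{agent_id} request with a Bearer token."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat/{agent_id}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical agent ID and token; pass the request to urlopen() to send it.
chat_req = build_chat_request("42", "Check disk usage on db01", "example-token")
```

Passing either request to urllib.request.urlopen() would send it. The token for the chat call would come from the login response, whose exact shape should also be checked against the OpenAPI schema.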