AOS Documentation — Architecture, Purpose & API Reference

Overview

Purpose & Vision

AI Orchestrator Studio (AOS) is a full-stack, self-hosted AI platform that lets organisations build, deploy, and manage autonomous AI agents capable of performing real infrastructure tasks — not just answering questions.

Traditional AI chatbots are limited to text conversations. AOS agents go further: they SSH into servers, execute SQL queries, scan documents with OCR, call REST APIs, manage Docker containers, and delegate work to other agents. Each agent is a configurable unit with its own system prompt, skill set, LLM provider, and credential bindings.

Core Principle: Your data never leaves your infrastructure. AOS runs entirely on-premise. LLM endpoints, databases, documents, and credentials are all under your control. Zero cloud dependency.

AOS is designed for enterprise IT teams, MSPs, and DevOps organisations that need AI to automate complex, multi-step operational workflows — not just generate text. The platform covers 9 enterprise domains: Infrastructure, Database Administration, Security & Compliance, Networking, DevOps, Cloud / OpenStack, Project Management, Customer Experience, and Microsoft 365 / Copilot — with 95+ built-in skills across 37 categories.

System Design

Architecture Deep-Dive

AOS follows a three-tier architecture designed for production deployment, horizontal scaling, and clear separation of concerns. Every tier can run on a separate server.

React 18 Frontend → nginx Reverse Proxy → FastAPI Backend → PostgreSQL + Redis
│
Enterprise Orchestrator → UniversalAgentExecutor → ModelRouter → LLMClient (10+ providers)
│
Celery Worker → Tesseract → easyocr → LLM Vision → ChromaDB Vector Store

Tier 1 — Application Server

Component	Technology	Purpose
Frontend	`React 18` + `MUI 5` + `TypeScript`	Agent Builder UI, Chat Studio, Dashboard, Skill Manager, Document Manager, System Config
Reverse Proxy	`nginx 1.20+`	TLS termination, static file serving, API routing to backend port 8000
Backend API	`FastAPI` + `uvicorn`	All REST endpoints — agents, skills, credentials, auth, RAG, chat, scheduling
Enterprise Orchestrator	`enterprise_orchestrator.py`	Deterministic 3-path routing: ANALYTICS → DOCUMENTS → GENERAL. Rule-based keyword matching, no LLM in router. Redis-cached, traced, observable.
Agent Executor	`UniversalAgentExecutor`	Receives user messages, assembles context (system prompt + internal skills + RAG chunks + planning), routes to LLM via ModelRouter, parses tool calls with 4-tier fallback, executes skills, manages delegation
Model Router	`ModelRouter`	Per-agent multi-LLM routing: task classification → task_routing map → primary_connection → fallback_chain[] → system_default. Supports 10+ LLM providers.
LLM Client	`LLMClient`	Provider auto-detection from URL, endpoint probing, auth handling (Bearer/API_KEY/Basic/Custom/None). For vLLM/Custom, tools payload is skipped (ReAct fallback).
RAG Engine	`ChromaDB` + `Cross-Encoder`	6-stage pipeline: semantic chunking → ingest-time embedding → hybrid BM25+vector with RRF → metadata filtering → cross-encoder re-ranking → semantic cache.

Tier 2 — Database & Cache Layer

Component	Technology	Purpose
Primary Database	`PostgreSQL 16`	Agents, skills, credentials (AES-256 encrypted), users, documents, document_agent_links, audit trails, scheduler jobs. PostgreSQL-only (no SQLite).
Session & RAG Cache	`Redis`	Chat memory, rate limiting, API Gateway counters, Celery task broker, semantic query cache (cosine similarity >0.95, 24h TTL), Enterprise Orchestrator routing cache
Vector Store	`ChromaDB`	Document embeddings for RAG retrieval. Persistent agent-scoped collections (`agent_{id}_documents`).

Tier 3 — Async Workers & LLM

Component	Technology	Purpose
OCR Worker	`Celery` + `Tesseract 5` + `easyocr` + `OpenCV`	3-strategy cascade: Tesseract → easyocr → LLM Vision. 8-step image preprocessing. Auto-promotes results into RAG vector store.
LLM Vision Fallback	`Any multimodal LLM`	When Tesseract/easyocr confidence is below threshold, pages are sent to LLM Vision for re-extraction
LLM Server	`vLLM` / `Ollama` / `Azure OpenAI` / `OpenAI` / `Anthropic` / `Cohere` / `HuggingFace` / `LlamaCpp` / `TextGen WebUI` / `Custom`	Any OpenAI-compatible endpoint. ModelRouter selects per-agent per-task. Automatic fallback chains.

Runtime

Agent Execution Flow

When a user sends a message, it passes through the Enterprise Orchestrator for deterministic 3-path routing (no LLM in the router), then into the UniversalAgentExecutor for the full agent pipeline.

Enterprise Orchestrator — 3-Path Deterministic Routing

User Message → Enterprise Orchestrator

ANALYTICS: KPI, metric, dashboard, chart → Metrics API
DOCUMENTS: document, search, find file, OCR → RAG Search
GENERAL: everything else → UniversalAgentExecutor

Key design: The Enterprise Orchestrator uses rule-based keyword matching — no LLM call in the routing decision. This makes routing deterministic, fast, and observable. Results are cached in Redis.
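
The routing rule above can be sketched in a few lines. This is a minimal illustration, not the code in enterprise_orchestrator.py: the keyword lists are taken from the table above, while the function name `route`, the first-match-wins order, and the word-prefix matching are assumptions. Redis caching and tracing are omitted.

```python
import re

# Keywords copied from the routing table; the matching strategy is an assumption.
ROUTES = [
    ("ANALYTICS", ("kpi", "metric", "dashboard", "chart")),
    ("DOCUMENTS", ("document", "search", "find file", "ocr")),
]


def route(message: str) -> str:
    """Return the first path whose keyword starts a word in the message."""
    text = message.lower()
    for path, keywords in ROUTES:
        for keyword in keywords:
            # \b keeps "ocr" from matching inside words such as "mediocre".
            if re.search(rf"\b{re.escape(keyword)}", text):
                return path
    return "GENERAL"  # everything else goes to the UniversalAgentExecutor
```

Because the decision is a pure function of the message text, the same input always yields the same path, which is what makes the result safe to cache.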

UniversalAgentExecutor — Full Pipeline

1. Agent Loading: Load agent config from PostgreSQL (name, system_prompt, skills, parameters, LLM routing config).
2. Skill Separation: Loaded skills are split into two groups:
• Internal skills (type=internal) → prompt-injected as context enrichment, NOT exposed as callable tools.
• Callable skills (type=ssh, http, python, sql, ansible, etc.) → converted to OpenAI function-calling tool definitions.
3. Credential Discovery (3 tiers):
• Tier 1: Agent config (agent.extra_config.credentials).
• Tier 2: RBAC bindings (credential_bindings table).
• Tier 3: Auto-discovery (scan skills → match by credential type/name).
4. RAG Injection: If the agent has linked documents, the user query is vector-searched against DocumentChunks in ChromaDB and the top-K results are injected into the system prompt as ### Reference Context ###. This is where .skill package reference files become searchable.
5. System Prompt Assembly: The final prompt is built in order: [1] Agent's base system_prompt + [2] Internal skill prompts + [3] RAG reference chunks + [4] Deep Planning instructions (if enabled) + [5] ReAct instructions (if the provider doesn't support native function calling).
6. ModelRouter Selection: ModelRouter.select(agent_config, task_type, message) checks, in order:
• Task routing map (e.g., code_gen → connection_id_for_codellama)
• Primary connection (primary_connection_id)
• System default (system_settings.default_llm_connection_id)
7. LLM Inference: LLMClient auto-detects the provider from the URL, probes candidate endpoints, and sends the prompt. For vLLM/Custom providers the tools payload is skipped (the ReAct text fallback handles tool calling).
8. 4-Tier Tool Call Parsing (max 10 rounds):
• Tier 1: native tool_calls[] format → execute directly.
• Tier 2: parse ACTION / ACTION_INPUT blocks from text (ReAct).
• Tier 3: false-completion detection + forceful re-prompt.
• Tier 4: keyword scoring against skill descriptions → auto-invoke best match.
9. Skill Execution + Delegation: The matched skill handler runs (SSH, SQL, HTTP, Docker, Ansible, etc.). If the skill is agent_delegation, it spawns a sub-agent with recursion depth guards (max 3 levels). Results feed back as OBSERVATION.
10. Final Answer: When the LLM produces no more tool calls (FINAL ANSWER), the synthesised response is returned to the user and stored in chat history (Redis).

LLM Management

Multi-LLM Routing Engine

AOS doesn't lock you to a single LLM. The ModelRouter (model_router.py) implements per-agent, task-aware model selection with automatic fallback chains across 10+ LLM providers.

Routing Decision Flow

User Message → Task Classification (keyword scoring) → ModelRouter.select()

ModelRouter → ① task_routing[task]? → ② primary_connection? → ③ system_default

LLMClient → Auto-detect provider from URL → Probe endpoints → Cache working endpoint

Task Classification

The ModelRouter classifies each user message into a task type using keyword scoring. Each task type can be routed to a different LLM connection.

Task Type	Trigger Keywords	Best Model For
`reasoning`	"think", "analyse", "why", "explain"	GPT-4, Claude 3 Opus
`code_gen`	"write code", "debug", "function", "script"	Codellama, GPT-4, DeepSeek
`rag_answer`	"search", "find in documents", "what does the doc say"	GPT-4o, Qwen
`summarize`	"summarize", "TLDR", "brief"	Claude 3 Haiku, Llama 3
`classify`	"classify", "categorize", "label"	Fast local models
`extract`	"extract", "parse", "pull out"	GPT-4o-mini, Llama 3
`translate`	"translate to", "in Arabic", "en français"	GPT-4, Llama 3
`chat`	General conversation	Any model (default)
`tool_call`	"run", "execute", "SSH into"	Models with function calling
`planning`	"plan", "steps to", "how to"	GPT-4, Claude 3 Opus
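
As a rough sketch of keyword scoring, the classifier below counts how many trigger phrases from the table occur in the message and picks the task type with the highest count, defaulting to `chat`. The function name `classify_task`, the plain substring matching, and the tie-breaking by table order are assumptions; only a subset of the task types is included to keep the example short.

```python
# Trigger phrases copied from the table above (subset); scoring is an assumption.
TASK_KEYWORDS = {
    "reasoning": ("think", "analyse", "why", "explain"),
    "code_gen": ("write code", "debug", "function", "script"),
    "summarize": ("summarize", "tldr", "brief"),
    "tool_call": ("run", "execute", "ssh into"),
    "planning": ("plan", "steps to", "how to"),
}


def classify_task(message: str) -> str:
    """Return the task type with the most keyword hits, or 'chat' if none hit."""
    text = message.lower()
    scores = {
        task: sum(keyword in text for keyword in keywords)
        for task, keywords in TASK_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the earliest entry
    return best if scores[best] > 0 else "chat"
```

Substring matching is deliberately naive here; a real classifier would likely need word boundaries so that short triggers such as "run" do not fire inside unrelated words.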

Per-Agent Routing Configuration

Each agent’s llm_routing config supports:

Field	Type	Description
`primary_connection_id`	`UUID`	Default LLM connection for this agent
`fallback_chain`	`UUID[]`	Ordered list of fallback connections if primary fails
`task_routing`	`Map<task, UUID>`	Task-specific LLM overrides (e.g., `{"code_gen": "id-for-codellama"}`)
`cost_aware`	`bool`	Enable cost-optimised routing (prefer cheaper models for simple tasks)
`max_fallback_attempts`	`int`	Max attempts before giving up (default 3)
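
The resolution order described in the routing decision flow (task_routing, then primary connection, then system default) can be illustrated as follows. The function `select_connection` and the sample connection IDs are hypothetical; `cost_aware` and the fallback chain are left out because the document does not specify how they influence the initial choice.

```python
def select_connection(llm_routing: dict, task_type: str, system_default: str) -> str:
    """Pick a connection ID: task override, then primary, then system default."""
    task_map = llm_routing.get("task_routing") or {}
    if task_type in task_map:
        return task_map[task_type]
    primary = llm_routing.get("primary_connection_id")
    if primary:
        return primary
    return system_default


# Hypothetical agent config using the field names from the table above.
example_routing = {
    "primary_connection_id": "conn-gpt4",
    "fallback_chain": ["conn-llama3"],
    "task_routing": {"code_gen": "conn-codellama"},
}
```

With this config, a `code_gen` task resolves to `conn-codellama`, any other task resolves to `conn-gpt4`, and an agent with an empty config resolves to the system default.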

Supported LLM Providers

The LLMClient (llm_client.py) auto-detects the provider from the connection URL and adapts its behaviour accordingly:

Provider	Detection Pattern	Tool Calling	Auth
`OpenAI`	`api.openai.com`	Native `tool_calls`	Bearer token
`Azure OpenAI`	`*.openai.azure.com`	Native `tool_calls`	API key
`Anthropic`	`api.anthropic.com`	Native `tool_calls`	API key
`Cohere`	`api.cohere.ai`	Native	Bearer token
`vLLM`	`/v1/completions`	⚠️ Skipped → ReAct text fallback	Bearer / None
`Ollama`	`:11434`	⚠️ Skipped → ReAct text fallback	None
`TextGen WebUI`	`:5001`	⚠️ Skipped → ReAct text fallback	None
`LlamaCpp`	`:8080`	⚠️ Skipped → ReAct text fallback	None
`HuggingFace`	`api-inference`	⚠️ Skipped → ReAct text fallback	Bearer token
`Custom`	Any other URL	⚠️ Skipped → ReAct text fallback	Configurable

Critical behaviour: For providers that don’t support OpenAI function-calling format (vLLM, Ollama, TextGen WebUI, LlamaCpp, HuggingFace, Custom), the tools payload is skipped entirely. Instead, ReAct instructions (ACTION / ACTION_INPUT format) are injected into the agent’s system prompt, and Tier 2 parsing handles tool execution. This means any LLM works with any skill, with no provider lock-in.
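
To make the ReAct text fallback concrete, here is a minimal parser sketch. The exact block layout AOS expects is not given in this document, so the format `ACTION: <skill>` followed by `ACTION_INPUT: <JSON object>` is an assumption, as is the function name `parse_react`.

```python
import json
import re

# Assumed layout: "ACTION: <skill_name>" then "ACTION_INPUT: <JSON object>".
_ACTION_RE = re.compile(r"ACTION:\s*(?P<name>[\w.-]+)\s*ACTION_INPUT:\s*")
_DECODER = json.JSONDecoder()


def parse_react(text: str) -> list[dict]:
    """Extract tool calls from free-form model output."""
    calls = []
    for match in _ACTION_RE.finditer(text):
        try:
            # raw_decode reads one JSON value and ignores any trailing prose.
            args, _ = _DECODER.raw_decode(text, match.end())
        except json.JSONDecodeError:
            continue  # malformed input: leave it to the later tiers
        if isinstance(args, dict):
            calls.append({"name": match.group("name"), "arguments": args})
    return calls
```

An empty result means no usable ACTION block was found, which is the condition under which the later tiers (false-completion detection and intent auto-dispatch) would take over.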

Fallback Chain Behaviour

primary_connection → timeout / error / rate limit → fallback_chain[0] → fallback_chain[1] → system_default

The call_with_fallback() method in ModelRouter automatically retries with the next connection in the chain when the current one fails. This ensures agents stay operational even when individual LLM endpoints go down.
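
The fallback behaviour can be sketched as a loop over an ordered list of connection IDs. This is an illustration only and not the actual call_with_fallback() implementation: the name `try_connections`, the `send` callback, and the exception type are hypothetical, and `max_attempts` mirrors the documented max_fallback_attempts default of 3.

```python
from typing import Callable, Sequence


class AllConnectionsFailed(RuntimeError):
    """Raised when every attempted connection has failed."""


def try_connections(
    connections: Sequence[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
) -> tuple[str, str]:
    """Try each connection in order; return (connection_id, response)."""
    errors = []
    for conn_id in list(connections)[:max_attempts]:
        try:
            return conn_id, send(conn_id)
        except Exception as exc:  # timeout, provider error, rate limit, ...
            errors.append(f"{conn_id}: {exc}")
    raise AllConnectionsFailed("; ".join(errors))
```

The caller would pass the chain as `[primary, *fallback_chain, system_default]`, so the order in the diagram above is preserved.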

Autonomous Execution

Deep Agent Planning Loop

AOS implements a Deep Agent execution model that forces LLMs to plan before they act. When enabled, the system injects a planning prompt into the agent’s system message, requiring the LLM to produce a numbered execution plan with dependencies and risk checks before invoking any tools.

User Message → Planning Prompt (up to N steps) → LLM Generates Plan → Step-by-Step Execution

Tool Result → Validate Observation → Next Step / Fallback → Final Synthesis

4-Tier Tool Execution

The execution loop uses a 4-tier fallback strategy to guarantee tool execution works with any LLM — whether it supports native function calling or not:

Tier	Strategy	When
Tier 1	`Native tool_calls`	LLM returns structured `tool_calls` in OpenAI format → execute directly
Tier 2	`ReAct Parsing`	No native tool_calls → parse `ACTION / ACTION_INPUT` blocks from text
Tier 3	`False-Completion Detection`	LLM hallucinated completion or described steps instead of doing them → forceful re-prompt
Tier 4	`Intent Auto-Dispatch`	If the model still refuses structured tool output, infer the most likely tool + parameters from intent and execute as last resort

False-Completion Detection

A critical challenge with LLMs is hallucinated action — the model claims “done!” without calling any tool. AOS detects this with a dual-layer heuristic:

• Future-tense laziness: Phrases like “I would…”, “here’s how…”, “you should…” (20+ patterns)
• Past-tense hallucination: Phrases like “has been created”, “successfully configured”, “done!” (50+ patterns)
When detected, the system switches to ReAct mode and re-prompts: "STOP. You did NOT execute anything. DO IT NOW." If the model still fails to emit structured ACTION blocks, Tier 4 intent auto-dispatch attempts safe tool recovery.
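
A minimal sketch of the dual-layer heuristic follows. It uses only the six example phrases quoted above, not the full 20+ and 50+ pattern sets, and the function name and the rule "only flag replies that made no tool calls" are assumptions about how the check is gated.

```python
import re

# Example phrases from the text above; the real pattern lists are much longer.
FUTURE_PATTERNS = [r"\bI would\b", r"\bhere['’]?s how\b", r"\byou should\b"]
PAST_PATTERNS = [r"\bhas been created\b", r"\bsuccessfully configured\b", r"\bdone!"]

_FALSE_COMPLETION = re.compile(
    "|".join(FUTURE_PATTERNS + PAST_PATTERNS), re.IGNORECASE
)


def is_false_completion(reply: str, tool_calls_made: int) -> bool:
    """Flag replies that claim or describe work although no tool was called."""
    if tool_calls_made > 0:
        return False  # a real tool ran, so completion claims may be genuine
    return bool(_FALSE_COMPLETION.search(reply))
```

When this returns True, the executor would issue the forceful re-prompt described above instead of returning the reply to the user.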

Configuration

Env Variable	Default	Description
`DEEP_AGENT_PLANNING_ENABLED`	`true`	Enable planning-before-action mode
`DEEP_AGENT_MAX_PLAN_STEPS`	`6`	Maximum plan steps before execution begins
`DEEP_AGENT_PLAN_STYLE`	`structured`	`structured` (dependencies + risk checks) or `concise`

Per-agent override: Each agent can override these settings via its config JSON (deep_planning_enabled, deep_planning_max_steps, deep_planning_style). The global settings serve as defaults for all agents.
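
One way to picture the override rule is a merge in which agent-level keys win over the global environment defaults. The function `resolve_planning` and the shape of the returned dict are hypothetical; the environment variable names, defaults, and agent config keys are the ones listed above.

```python
import os
from typing import Mapping


def resolve_planning(agent_config: Mapping, env: Mapping = os.environ) -> dict:
    """Merge global env defaults with per-agent overrides (agent values win)."""
    defaults = {
        "enabled": env.get("DEEP_AGENT_PLANNING_ENABLED", "true").lower() == "true",
        "max_steps": int(env.get("DEEP_AGENT_MAX_PLAN_STEPS", "6")),
        "style": env.get("DEEP_AGENT_PLAN_STYLE", "structured"),
    }
    overrides = {
        "enabled": agent_config.get("deep_planning_enabled"),
        "max_steps": agent_config.get("deep_planning_max_steps"),
        "style": agent_config.get("deep_planning_style"),
    }
    # A missing (None) override falls back to the global default.
    return {
        key: overrides[key] if overrides[key] is not None else default
        for key, default in defaults.items()
    }
```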

Agent Knowledge

.skill Package System

AOS agents can be extended with .skill packages — portable ZIP archives that bundle a skill definition with reference knowledge files. When imported, the skill is registered and all reference documents are automatically ingested into the RAG vector store.

.skill File Format

A .skill file is a renamed ZIP archive with this structure:

presales-agent.skill (ZIP)
└── presales-agent/
    ├── SKILL.md              # Skill definition (YAML frontmatter + markdown body)
    ├── references/           # Knowledge files (auto-ingested as RAG documents)
    │   ├── rfp-templates.md
    │   ├── objection-handling.md
    │   └── roi-models.md
    └── scripts/              # Optional automation scripts

SKILL.md Structure

---
name: presales-agent
description: "Enterprise presales: RFP, proposals, competitive analysis, ROI/TCO"
---
# System Prompt (becomes the skill's system_prompt)
You are an enterprise presales specialist...

## Core Competencies
- Solution Architecture mapping
- RFP/RFI response generation
- Competitive analysis frameworks
- ROI/TCO calculations
...

The YAML frontmatter (name + description) is parsed as skill metadata. The markdown body becomes the skill’s system_prompt. Files in references/ are automatically chunked, embedded, and stored in the agent’s ChromaDB collection as RAG documents.
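
The sketch below shows how such a package could be inspected with the standard library alone. It is not the AOS importer: `read_skill_package` is a hypothetical helper, and the frontmatter is parsed with a naive `key: value` split rather than a YAML library, which is enough for the two flat fields shown above but not for general YAML.

```python
import io
import zipfile


def read_skill_package(data: bytes) -> dict:
    """Read SKILL.md metadata and list reference files from .skill bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        skill_md = next(n for n in names if n == "SKILL.md" or n.endswith("/SKILL.md"))
        text = zf.read(skill_md).decode("utf-8")
        references = sorted(
            n for n in names if "/references/" in n and not n.endswith("/")
        )

    # The frontmatter sits between the first two '---' markers.
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        meta[key.strip()] = value.strip().strip('"')

    return {
        "name": meta.get("name"),
        "description": meta.get("description"),
        "system_prompt": body.strip(),
        "references": references,
    }
```

Splitting on the first colon only matters here because the sample description itself contains a colon.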

Skill Types

Type	Behaviour	Use Case
`internal`	Prompt-injected as context. NOT presented as a callable tool.	Knowledge enrichment, persona definition, domain expertise
`ssh`	Executes SSH commands on bound server	Server administration, log checks, service management
`http`	Calls REST/GraphQL APIs	External integrations, webhooks, data fetching
`python`	Runs Python scripts in sandbox	Data processing, calculations, custom logic
`sql`	Executes SQL queries against bound database	Database administration, reporting, health checks
`ansible`	Runs Ansible playbooks	Infrastructure automation, configuration management
`shell`	Executes shell commands locally	Local automation, file operations
`graphql`	Executes GraphQL queries	Modern API integrations
`scrapling`	Web scraping with Scrapling	Data extraction from websites
`huggingface`	Calls HuggingFace inference API	ML model inference, NLP tasks

Import API

# Upload via API
POST /api/skills/import-package
Content-Type: multipart/form-data
file: presales-agent.skill

# Or via Chat Studio UI:
# Agents → Import Skill → select .skill file

Internal vs Callable: When a skill type is internal, its system_prompt is appended to the agent’s prompt as context enrichment — the LLM absorbs the knowledge but cannot "call" the skill as a tool. All other types (ssh, http, sql, etc.) are converted to OpenAI function-calling tool definitions and can be invoked by the LLM during execution.

Pre-Built Template Agents

AOS ships with 4 ready-to-import .skill packages in the repository root. Each contains a full skill definition plus reference knowledge files that are auto-ingested as RAG documents.

Package	Domain	Knowledge Base
`presales-agent.skill`	Enterprise presales: RFP responses, proposals, competitive analysis, objection handling (LAER method), ROI/TCO	`rfp-templates.md`, `objection-handling.md`, `roi-models.md`
`finance-agent.skill`	Financial analysis: budgeting, forecasting, compliance reporting, revenue recognition, audit preparation	`financial-models.md`, `compliance-frameworks.md`
`infra-agent.skill`	Infrastructure ops: server management, monitoring, incident response, capacity planning, patching, backup/recovery	`runbooks/`, `architecture-guides/`
`legal-agent.skill`	Legal & compliance: contract review, NDA analysis, regulatory compliance, risk assessment, policy drafting	`contract-templates.md`, `regulatory-guides.md`

Create your own: Use any of these as a template. Create a folder with SKILL.md + references/, ZIP it with a .skill extension, and import via POST /api/skills/import-package or the Chat Studio UI.

Cloud Skills

OpenStack / HCS Skills

AOS ships with 15 built-in OpenStack/HCS skills covering all core services. Agents can manage compute, networking, storage, identity, orchestration, and object storage through natural-language instructions.

Skill	Service	Capability
`openstack_list_servers`	Nova	List compute instances with status, IPs, flavors
`openstack_server_action`	Nova	Start, stop, reboot, suspend, resume instances
`openstack_create_server`	Nova	Provision new instances with flavor, image, network
`openstack_list_networks`	Neutron	List networks, subnets, and routers
`openstack_manage_network`	Neutron	Create/delete networks, subnets, security groups
`openstack_list_volumes`	Cinder	List block storage volumes with status and attachments
`openstack_manage_volume`	Cinder	Create, delete, attach, detach, extend volumes
`openstack_list_images`	Glance	List available OS images
`openstack_keystone_auth`	Keystone	Authenticate and manage tokens, projects, users
`openstack_list_flavors`	Nova	List instance types / flavors
`openstack_heat_stack`	Heat	Create, list, delete orchestration stacks
`openstack_list_projects`	Keystone	List tenants / projects
`openstack_quota_usage`	Nova	Check compute and storage quota usage
`openstack_server_console`	Nova	Get VNC console URL for instances
`openstack_object_storage`	Swift	List, upload, download from object storage

Credential Binding: Create an api_key credential with your OpenStack auth_url, project_name, username, and password in the extra fields. Bind it to the agent and all OpenStack skills will auto-inject authentication.

Document Intelligence

6-Stage RAG Pipeline

AOS implements a production-grade, 6-stage RAG pipeline that goes far beyond basic chunking and keyword search. Each stage is independently configurable, wrapped in try/except fallbacks, and wired to the RAGOptimizer policy engine for dynamic tuning.

Upload / NFS / OCR → ① Semantic Chunk → ② Embed + Store → ChromaDB

Query → ③ Hybrid BM25+Vector → ④ Metadata Filter → ⑤ Cross-Encoder → ⑥ Semantic Cache

Stage ① — Semantic Chunking

Instead of fixed character-count splitting, AOS uses the ChunkingPolicy from the RAG Optimizer to split text at paragraph boundaries, heading patterns (Markdown, uppercase, CHAPTER/SECTION), and sentence endings. Small fragments are auto-merged with neighbors. Overlap is applied for cross-boundary context. Falls back to fixed 4000-char chunking if semantic splitting fails.

Each chunk is stored with enriched metadata: doc_id, doc_type, source_file, upload_date, page_number, chunk_index, and chunking_method — enabling downstream metadata filtering.
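
As a simplified picture of boundary-aware chunking with overlap, the sketch below splits on blank lines only and packs paragraphs into chunks up to a size limit, carrying the tail of the previous chunk forward. It is not the ChunkingPolicy implementation: heading and sentence detection, small-fragment merging rules, and metadata are omitted, and `chunk_paragraphs` is a hypothetical name.

```python
def chunk_paragraphs(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Pack paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole in this sketch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= chunk_size:
            current = candidate
            continue
        chunks.append(current)
        # Carry the last `overlap` characters forward for cross-boundary context.
        tail = current[-overlap:] if overlap > 0 else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks
```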

Stage ② — Ingest-Time Embedding

Immediately after chunking, the DocumentProcessor calls VectorStoreService.embed_document() to embed all chunks using sentence-transformers/all-MiniLM-L6-v2 (384-dim) and store them in ChromaDB with enriched metadata. This means documents are queryable the instant processing completes — no separate "embed" step needed. Non-fatal: if embedding fails, the document is still text-searchable.

Stage ③ — Hybrid Search with Reciprocal Rank Fusion

At query time, AOS runs two parallel searches: keyword (BM25-style scoring with structural detection) and vector (cosine similarity via ChromaDB). Results are merged using Reciprocal Rank Fusion (RRF):

RRF Formula: score = bm25_weight × 1/(60 + rank_keyword) + vector_weight × 1/(60 + rank_vector)
Default weights from IndexingPolicy: bm25_weight = 0.3, vector_weight = 0.7.
The constant 60 is from the original RRF paper. Chunks that appear in both lists get boosted; unique hits from either source are preserved.
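
The formula translates directly into code. In this sketch the inputs are two ranked lists of chunk IDs, ranks are assumed to be 1-based, and a chunk missing from one list simply contributes nothing for that list; the function name `rrf_fuse` is hypothetical.

```python
def rrf_fuse(
    keyword_ranked: list[str],
    vector_ranked: list[str],
    bm25_weight: float = 0.3,
    vector_weight: float = 0.7,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Weighted Reciprocal Rank Fusion over two ranked lists of chunk IDs."""
    scores: dict[str, float] = {}
    for rank, chunk_id in enumerate(keyword_ranked, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + bm25_weight / (k + rank)
    for rank, chunk_id in enumerate(vector_ranked, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + vector_weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = rrf_fuse(["a", "b"], ["b", "c"])
```

In the example call, chunk "b" appears in both lists and therefore ranks first, followed by "c" (vector-only) and "a" (keyword-only), because the vector weight is larger.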

Stage ④ — Metadata Pre-Filtering

The retrieve_for_agent() method now accepts optional filter parameters: doc_type (e.g. "pdf", "docx"), source_file (substring match), upload_date_from / upload_date_to (ISO date range). Filters are applied at both the ChromaDB where-clause level (for vector search) and as post-filters (for keyword results). All filters are fully optional — no filter = search everything.

Stage ⑤ — Cross-Encoder Re-Ranking

After hybrid search, the top candidates are re-ranked using cross-encoder/ms-marco-MiniLM-L-6-v2 — a dedicated re-ranking model that scores each (query, chunk) pair locally. This is the PRIMARY re-ranker (fast, no API cost). If the cross-encoder is unavailable, the system falls back to the original LLM API re-ranking (sends chunks to the LLM for 0.0–1.0 scoring).

Re-Ranker	Model	Speed	Cost
Primary	`cross-encoder/ms-marco-MiniLM-L-6-v2`	~5ms per chunk	Zero (local inference)
Fallback	Configured LLM (vLLM / Ollama / Azure)	~200ms per batch	LLM API token cost

Combined scoring: 30% keyword score + 70% cross-encoder score. The final top-K results are returned to the agent's context window.

Stage ⑥ — Semantic Query Cache

A new RAGSemanticCache service uses Redis to cache RAG results. On each query, the cache embeds the query and checks for any cached entry with cosine similarity > 0.95. Cache hit → return cached results instantly. Cache miss → run the full pipeline → store results with a 24-hour TTL.

Setting	Default	Description
`RAG_CACHE_TTL_SECONDS`	`86400` (24h)	Time-to-live for cached query results
`RAG_CACHE_SIMILARITY_THRESHOLD`	`0.95`	Minimum cosine similarity for a cache hit

Cache invalidation: Call RAGSemanticCache.invalidate(agent_id) when documents change. The cache is per-agent scoped — updating one agent's documents won't affect another agent's cache.
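
The cache logic can be illustrated with an in-memory stand-in. This is not the RAGSemanticCache service: it stores entries in a dict instead of Redis, takes pre-computed query vectors instead of calling an embedding model, and the class name and `clock` parameter are assumptions made so the TTL can be exercised in a test.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCacheSketch:
    def __init__(self, threshold=0.95, ttl_seconds=86400, clock=time.time):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # agent_id -> list of (vector, results, expires_at)

    def get(self, agent_id, query_vector):
        now = self.clock()
        live = [e for e in self._entries.get(agent_id, []) if e[2] > now]
        self._entries[agent_id] = live  # drop expired entries lazily
        for vector, results, _ in live:
            if cosine(vector, query_vector) > self.threshold:
                return results
        return None

    def put(self, agent_id, query_vector, results):
        entry = (list(query_vector), results, self.clock() + self.ttl)
        self._entries.setdefault(agent_id, []).append(entry)

    def invalidate(self, agent_id):
        self._entries.pop(agent_id, None)
```

Keying the store by agent ID is what gives the per-agent scoping described above: invalidating one agent leaves every other agent's entries untouched.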

OCR Worker

The Celery-based OCR worker applies an 8-step image preprocessing pipeline (deskew, denoise, threshold, contrast, DPI scaling, border removal, rotation correction, binarisation) before running Tesseract 5. If the confidence score falls below the threshold, the page escalates to easyocr and, if confidence is still too low, to an LLM Vision model for re-extraction. Successfully extracted text is auto-promoted into the 6-stage RAG pipeline with no manual step required.

Document Intelligence · v2

Enhanced RAG Pipeline — Self-Healing

The 6-stage pipeline above describes what RAG does. The v2 enhancements below describe how the pipeline survives real production conditions: air-gapped servers, RHEL hosts with an outdated sqlite3, transient Ollama hiccups, connection-pool storms during bulk reindex, and stale stats reporting.

Upload / NFS / OCR → ① Chunk 1500/300 → ② Embed Chain → ③ Vector Store Fallback

Query → ④ Hybrid + RRF → ⑤ Filter + X-Encoder → ⑥ Semantic Cache → Grounded Answer

Embedding Provider Chain (3 tiers + retry)

VectorStoreService._embed_via_api() resolves the embedding endpoint in this order, then commits to it for the rest of the process via a sticky _api_endpoint_available flag — so a transient failure can never silently downgrade to an offline local model.

#	Source	Resolution	Default
1	Explicit	`EMBEDDING_BASE_URL` + `EMBEDDING_API_KEY` + `EMBEDDING_MODEL`	—
2	Auto-detect	Probe `http://127.0.0.1:11434/api/tags` at process start	`qwen3-embedding:0.6b`
3	Legacy	Re-use the chat `LLM_BASE_URL` for `/v1/embeddings`	—

Retry Policy

Setting	Default	Description
`EMBEDDING_API_MAX_RETRIES`	`4`	Attempts per batch on timeouts, connection resets, 5xx, 429
`EMBEDDING_API_TIMEOUT`	`180` s	Per-request timeout (was 60 s — bumped for slow Ollama under load)
Backoff	`2,4,8,15 s`	Exponential, capped at 15 s. 4xx (bad config) fails fast.
Logging	`WARNING`	Every retry is visible in `journalctl` as Embedding API attempt N/4 failed …

Why this matters: in earlier builds a single Ollama timeout during a 47-document Tibco reindex caused the dispatcher to silently fall through to sentence-transformers/all-MiniLM-L6-v2, which then immediately failed every remaining doc with "HF offline mode set and model not cached". The sticky API flag + retry policy makes that impossible.
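
The retry table can be read as the loop below. It is a sketch under stated assumptions: `RetryableError` stands in for timeouts, connection resets, 5xx and 429 responses; any other exception (for example a 4xx caused by bad config) propagates immediately; and "4 attempts" is interpreted as sleeping only between attempts, so the 15 s step would only be reached with a higher retry count.

```python
import time


class RetryableError(Exception):
    """Stands in for timeouts, connection resets, 5xx and 429 responses."""


def backoff_delays(max_retries=4, base=2.0, cap=15.0):
    """Exponential delays capped at `cap` seconds: 2, 4, 8, 15 by default."""
    return [min(base * 2 ** i, cap) for i in range(max_retries)]


def call_with_retries(fn, max_retries=4, sleep=time.sleep):
    """Call fn(), retrying on RetryableError with capped exponential backoff."""
    delays = backoff_delays(max_retries)
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of attempts: surface the error, never downgrade
            sleep(delays[attempt - 1])
```

Re-raising after the last attempt, rather than switching to another embedding model, mirrors the sticky-endpoint behaviour described above.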

Vector Store Backend Fallback

VectorStoreService.__init__() wraps Chroma initialisation in except Exception — not just ImportError — so RHEL9 hosts whose sqlite3 < 3.35.0 trigger Chroma's RuntimeError still get a working RAG via the Postgres-backed _SQLiteVecBackend.

Backend	Storage	When used
`_ChromaBackend`	On-disk Chroma persistent client (HNSW, cosine)	Default — when `chromadb` imports cleanly
`_SQLiteVecBackend`	`DocumentChunk.embedding_vector` JSON column in PostgreSQL	Fallback — Chroma unavailable, missing, or sqlite too old

Tunable Chunk Sizing

Defaults were lowered from 4000/600 to 1500/300 for better recall on small/medium corpora. Override per-deployment via env:

# /opt/aos/.env
RAG_CHUNK_SIZE=1500
RAG_CHUNK_OVERLAP=300
EMBEDDING_BASE_URL=http://127.0.0.1:11434
EMBEDDING_MODEL=qwen3-embedding:0.6b
EMBEDDING_API_MAX_RETRIES=4
EMBEDDING_API_TIMEOUT=180
# DB pool sized for bulk reindex (4 workers * 60 conns = 240 cap)
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=40
DB_POOL_TIMEOUT=60

Bulk Operations API

Endpoint	Verb	Purpose
`/api/documents/embed-pending`	`POST`	Embed every doc whose `has_embeddings = false` for the given agent. Resolves `agent_id` as UUID OR display name.
`/api/documents/reindex-all`	`POST`	Re-chunk + re-embed every linked doc. Uses current `RAG_CHUNK_SIZE` / `RAG_CHUNK_OVERLAP`. Returns `chunk_size_used` + `chunk_overlap_used`.
`/api/documents/vector-store/stats`	`GET`	Reports the effective backend + embedding model + endpoint + `ollama_auto_detected` flag. No more stale config.

# Example: re-index everything for an agent identified by name
curl -s -X POST "http://127.0.0.1:8000/api/documents/reindex-all?agent_id=Integration-Tibco-Agent" \
  -H "Authorization: Bearer $TOKEN"

Skills · Live Authoring

Sandbox & On-Spot Skills

Operators can author and attach a Python skill to any agent in a single API call — without restarting the backend, without editing files, and without giving the LLM unrestricted code execution. Inline Python skills run inside a hardened executor by default.

Create & Attach in One Call

# Forces sandbox=True for python_inline. Idempotent: reuses existing skill if same name.
curl -X POST "http://127.0.0.1:8000/api/universal-agents/<agent_id>/skills:create_and_attach" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "summarise_ticket",
    "type": "python_inline",
    "description": "Summarise a Jira ticket payload",
    "code": "def run(payload):\n    return {\"summary\": payload[\"description\"][:280]}\n",
    "sandbox": true
  }'

Sandbox Guarantees

Layer	Limit	Mechanism
Imports	Allow-list only	`json`, `re`, `math`, `datetime`, `itertools`, `collections`, `statistics`, `hashlib`, `base64`. Everything else blocked at `__import__`.
Network	None	Socket module unavailable; no `requests`, no `urllib`, no `httpx`.
Filesystem	None	`open()` overridden; no read/write to host paths.
CPU / wall-clock	Configurable	Per-skill timeout, default 5 s, hard kill on overrun.
Memory	Soft cap	RLIMIT_AS where supported (Linux).
Output	Captured	stdout / stderr collected, returned to executor for logging.
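
To illustrate the import layer only, the sketch below runs inline code with a replaced `__import__` that enforces the allow-list from the table. It is not the AOS executor and is not a security boundary on its own: timeouts, memory limits, filesystem and network blocking are all omitted, and the names `run_inline` and `_guarded_import` as well as the reduced builtins set are assumptions.

```python
import builtins

# Allow-list copied from the table above.
ALLOWED_MODULES = {
    "json", "re", "math", "datetime", "itertools",
    "collections", "statistics", "hashlib", "base64",
}


def _guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Replacement for __import__ that rejects modules outside the allow-list."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"import of {name!r} is blocked in the sandbox")
    return builtins.__import__(name, globals, locals, fromlist, level)


def run_inline(code: str, payload):
    """Execute skill code that defines run(payload) with restricted builtins."""
    safe_builtins = {
        "__import__": _guarded_import,
        "len": len, "str": str, "int": int, "float": float,
        "dict": dict, "list": list, "range": range, "sum": sum,
    }
    env = {"__builtins__": safe_builtins}  # open(), eval(), etc. are absent
    exec(code, env)
    return env["run"](payload)
```

Restricted builtins by themselves are known to be bypassable in CPython, which is why a hardened executor would pair a layer like this with process isolation and the resource limits listed in the table.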

Skill Types

Type	Sandbox?	Use case
`python_inline`	Default ON	Pure-Python transforms, parsing, calculations
`http_call`	n/a	HTTP-only — no code execution path
`shell_exec`	Off	Power-user only — credentials + RBAC required
`ssh_command`	Off	Targets pre-bound SSH credentials only
`sql_query`	Read-mode flag	Optional `readonly: true` blocks DDL/DML

Recommended workflow: prototype fast with python_inline + sandbox, promote to a .skill package once stable, ship to other environments via the import endpoint.

Onboarding · Learning

Agent Academy

The Agent Academy turns a fresh universal agent into a domain expert in four steps — using only the building blocks already in AOS (.skill packages, the 6-stage RAG pipeline, sandbox skills, and the bulk reindex API).

① Enrol → ② Ingest References → ③ Self-Quiz → ④ Graduate

Tracks

Track	.skill pack	What the agent learns
Presales	`presales-agent.skill`	RFP responses, LAER objection handling, ROI / TCO modelling, demo prep
Finance	`finance-agent.skill`	Budgeting, forecasting, compliance reporting, revenue recognition
Infrastructure	`infra-agent.skill`	Runbooks, incident response, capacity planning, patching, backup/recovery
Legal	`legal-agent.skill`	Contract review, NDA analysis, regulatory compliance, policy drafting
Integration	BYO — drop in your `.process` / `.bw` docs	Tibco-style integration mapping, BW activity reference, end-point catalog

The Four Steps

1 · Enrol: Create a universal agent (or pick one). Optionally bind credentials it will need at graduation.
2 · Ingest References: Import a .skill pack; its references/ folder is auto-uploaded as documents and linked. Trigger embedding via POST /api/documents/embed-pending, then verify with GET /api/documents/vector-store/stats: total_vectors should grow.
3 · Self-Quiz: An on-spot sandbox skill (academy_quiz, python_inline) generates 10 questions from chunked references and asks the agent to answer them. Wrong answers are logged with the missing chunk IDs, which feed the next iteration.
4 · Graduate: The agent is flagged academy_status = graduated in metadata. Channel connectors (Teams, Slack, REST) can now publish it. Re-enrolment re-runs steps 2–3 if references change.

Bootstrap Script (sketch)

# 1. import .skill pack (creates skills + uploads references)
curl -X POST ".../api/universal-agents/$AGENT/import-skill-pack" \
  -F file=@infra-agent.skill -H "Authorization: Bearer $TOKEN"

# 2. embed pending references via local Ollama (auto-detected)
curl -X POST ".../api/documents/embed-pending?agent_id=$AGENT" \
  -H "Authorization: Bearer $TOKEN"

# 3. attach an on-spot sandbox quiz skill
curl -X POST ".../api/universal-agents/$AGENT/skills:create_and_attach" \
  -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" \
  -d @academy_quiz.json

# 4. mark graduated once the quiz passes (JSON body needs an explicit Content-Type)
curl -X PATCH ".../api/universal-agents/$AGENT" \
  -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" \
  -d '{"metadata":{"academy_status":"graduated"}}'

Tip: the same loop works for continual learning. Schedule a nightly job that re-ingests changed reference files, re-runs the quiz, and re-graduates the agent — all over the public API, no UI required.

Document Intelligence

OCR / RAG Architecture — Dual-Path Design

AOS processes every uploaded document through two independent paths that converge into a single RAG vector store. Understanding both paths is essential for production tuning and troubleshooting.

User Upload / API
├──→ Path A · App-Server (sync) → PyPDF2 / PyMuPDF → ChromaDB
└──→ Path B · Celery Worker (async) → Tesseract → easyocr → LLM Vision → ChromaDB

Database Schema

Database	Technology	Tables / Collections	Role
Relational	`PostgreSQL` (mandatory, no SQLite)	`documents`, `document_chunks`, `document_agent_links`, `agents`, `skills`, `credentials`	All metadata, status, agent links, file paths, credentials (AES-256). Single PostgreSQL database for everything.
Vector	`ChromaDB`	`agent_{id}_documents` (per-agent)	Embedded chunks for semantic search (384-dim via all-MiniLM-L6-v2)

Path A — App-Server (Synchronous)

Runs inside the FastAPI process on POST /api/documents/upload. Best for native-text PDFs, DOCX, TXT, Markdown — any format that already contains extractable text. Completes in seconds.

1 · File receipt & metadata Save to uploads/, insert row into documents table (status = processing).
2 · Text extraction PyPDF2 first, fallback to PyMuPDF (fitz). If extracted text < MIN_USEFUL_TEXT (50 chars), OCR is triggered inline via Tesseract → easyocr → LLM Vision cascade.
3 · Semantic chunking Split into overlapping chunks (default 1 000 tokens, 200 overlap) using sentence-boundary-aware splitter.
4 · Embedding & storage Each chunk → all-MiniLM-L6-v2 (384-dim, L2-normalised) → upserted into agent-scoped ChromaDB collection.
5 · Status update Set document status to completed. RAG is immediately available.
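Step 3 above can be illustrated with a small sentence-aware chunker. This is a minimal sketch and not the AOS splitter: it approximates tokens by whitespace-separated words, splits sentences with a simple regular expression, and lets a single over-long sentence form its own oversized chunk. The function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows ., ! or ? (a rough sentence boundary)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 200) -> list[str]:
    """Group whole sentences into chunks, repeating trailing sentences as overlap."""
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    size = 0                 # word count of `current`
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            carried: list[str] = []
            carried_size = 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_size + m > overlap:
                    break
                carried.insert(0, prev)
                carried_size += m
            current, size = carried, carried_size
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because overlap is taken in whole sentences, the repeated portion can be shorter than the configured overlap but never cuts a sentence in half.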

Path B — Celery Worker (Asynchronous)

Triggered via POST /api/documents/batch-scan or the admin “Scan NFS” button. Designed for scanned PDFs, images (TIFF/PNG/JPEG), and bulk directory ingestion. Runs on a separate worker node (or the same host) via Redis-brokered Celery.

Setting	Default	Description
`OCR_STRATEGY`	`cascade`	Try Tesseract → easyocr → LLM Vision in order
`OCR_DPI`	`300`	Resolution for PDF-to-image conversion
`OCR_CONFIDENCE_THRESHOLD`	`0.60`	Below this, page escalates to next strategy
`MIN_USEFUL_TEXT`	`50`	Chars required before text-extraction is considered valid

1 · Directory scan Walk the NFS/local path, discover PDF/image files, create documents rows (status = queued).
2 · Celery task dispatch One Celery task per file. Redis broker distributes across available workers.
3 · PDF → image pdf2image (poppler) converts each page to a PIL image at configured DPI.
4 · 8-step preprocessing Deskew → denoise → threshold → contrast → DPI scale → border removal → rotation correction → binarisation.
5 · OCR cascade Tesseract 5 first. If confidence < threshold → easyocr. If still low → LLM Vision (vLLM multimodal endpoint).
6 · Auto-promote to RAG Extracted text is chunked and embedded exactly like Path A (same chunker, same model) → upserted into ChromaDB.
7 · Status callback Celery result backend updates document status to completed or failed with error details.
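The escalation rule in step 5 can be sketched as a loop over engines ordered from cheapest to most expensive. The engines here are plain callables returning a text and a confidence value; real Tesseract, easyocr, or LLM Vision calls would sit behind them. Returning the best result seen when no engine reaches the threshold is an assumption of this sketch, not documented AOS behaviour.

```python
from typing import Callable

# An engine takes page image bytes and returns (text, confidence in [0, 1]).
OcrEngine = Callable[[bytes], tuple[str, float]]


def run_cascade(
    page: bytes,
    engines: list[tuple[str, OcrEngine]],
    threshold: float = 0.60,
) -> tuple[str, str, float]:
    """Try engines in order and stop at the first one meeting the threshold.

    Returns (engine_name, text, confidence). If every engine stays below the
    threshold, the highest-confidence result is returned (sketch assumption).
    """
    best = None
    for name, engine in engines:
        text, confidence = engine(page)
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= threshold:
            return name, text, confidence
    if best is None:
        raise ValueError("no OCR engines configured")
    return best
```

With the default `OCR_CONFIDENCE_THRESHOLD` of 0.60, a page that Tesseract reads at 0.41 confidence would move on to the next engine, and later engines are never called once one succeeds.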

Path A vs Path B — Comparison

Dimension	Path A · App-Server	Path B · Celery Worker
Trigger	`POST /api/documents/upload`	`POST /api/documents/batch-scan`
Execution	Synchronous (blocking)	Asynchronous (Celery task)
Best for	Native-text PDFs, DOCX, TXT	Scanned PDFs, images, bulk dirs
OCR engines	Tesseract → easyocr → LLM Vision (inline)	Tesseract → easyocr → LLM Vision (worker)
Preprocessing	None (text already extractable)	8-step image pipeline
Scalability	Single-process (FastAPI)	Horizontal (add Celery workers)
Latency	1–5 seconds	10–120 seconds per document
Embedding model	`all-MiniLM-L6-v2` (384-dim, L2-normalised)	Same as Path A
Vector store	`ChromaDB`, agent-scoped collection	Same as Path A

Batch-Scan Flow

Admin UI / API → /api/documents/batch-scan → Scan NFS path → Celery tasks → OCR cascade → ChromaDB

Document → Agent Linking

Documents are linked to agents via the document_agent_links join table. A single document can be linked to multiple agents — each agent gets its own copy of the chunks in its scoped ChromaDB collection.

Method	Endpoint / Action	Description
Upload with agent	`POST /api/documents/upload?agent_id=X`	Link at upload time
Batch-scan with agent	`POST /api/documents/batch-scan` body	Link all scanned docs to specified agent
Manual link	`POST /api/documents/links`	Link existing document to agent after the fact
Auto-link	Agent config: `auto_link_uploads = true`	Automatically link all new uploads
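Since each linked agent holds its own copy of a document's chunks, one document fans out to one collection per link. The sketch below derives the target collections from link rows, using the `agent_{id}_documents` naming from the Database Schema table. The tuple layout of the link rows is an assumption made for illustration.

```python
def collection_name(agent_id) -> str:
    """Build the per-agent ChromaDB collection name."""
    return f"agent_{agent_id}_documents"


def target_collections(document_id, links) -> list[str]:
    """Return the collections that should hold this document's chunks.

    `links` is an iterable of (document_id, agent_id) pairs, standing in for
    rows of the document_agent_links table (assumed shape).
    """
    return sorted({collection_name(a) for d, a in links if d == document_id})
```

For example, a document linked to agents 3 and 7 would be written to `agent_3_documents` and `agent_7_documents`.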

Query-Time RAG Flow

When a user sends a message to an agent with RAG enabled, the query goes through a 5-stage pipeline to retrieve the most relevant context from that agent’s document collection.

1 · Semantic cache check Embed the query → check the Redis semantic cache. On a hit (cosine ≥ 0.95), return the cached answer immediately.
2 · Hybrid search Run both dense (embedding similarity) and sparse (BM25 keyword) search on the agent’s ChromaDB collection.
3 · RRF fusion Reciprocal Rank Fusion merges both result lists into a single ranked list (k=60).
4 · Cross-encoder re-ranking cross-encoder/ms-marco-MiniLM-L-6-v2 scores each candidate for precise relevance. Top-N (default 5) survive.
5 · LLM generation Surviving chunks are injected into the system prompt as context. The LLM generates a grounded answer with source citations.
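The fusion in stage 3 follows the standard Reciprocal Rank Fusion formula: each chunk scores the sum of 1 / (k + rank) over every result list it appears in, with ranks starting at 1. The sketch below is a generic implementation of that formula with k=60. The function name and the use of chunk IDs as strings are illustrative assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked ID lists with Reciprocal Rank Fusion.

    Each list is ordered best-first. Returns (id, score) pairs sorted by
    descending score, with ties broken by ID for a stable order.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c1", "c2", "c3"]   # embedding-similarity order
sparse = ["c3", "c1", "c4"]  # BM25 keyword order
fused = rrf_fuse([dense, sparse])
```

Chunks found by both searches accumulate two terms and rise above chunks found by only one, which is why `c1` and `c3` lead the fused list in this example.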

Embedding Pipeline

Stage	Component	Details
Chunking	Sentence-boundary splitter	1 000 tokens, 200 overlap, respects sentence boundaries
Embedding model	`all-MiniLM-L6-v2`	384-dim output, L2-normalised before storage in ChromaDB
Vector store	`ChromaDB`	Persistent on-disk, agent-scoped collections
Metadata	Per-chunk	`source_file`, `page_number`, `chunk_index`, `doc_type`, `upload_date`
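L2 normalisation rescales a vector to unit length and leaves its dimensionality unchanged, so a 384-dim embedding remains 384-dim. The following is a minimal sketch in pure Python, with a hypothetical helper name:

```python
import math


def l2_normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


embedding = [0.5] * 384          # stand-in for a model output
unit = l2_normalise(embedding)   # same length, Euclidean norm of 1
```

With unit-length vectors, cosine similarity reduces to a plain dot product, which is a common reason to normalise before storing embeddings.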

Key insight: Both Path A and Path B produce identical chunk/embedding output — the only difference is how the text is obtained (native extraction vs OCR). Once text is extracted, the downstream pipeline (chunking → embedding → ChromaDB) is shared. This means RAG query results are uniform regardless of whether the source document was a native PDF or a scanned image.

Enterprise

Security Model

Layer	Mechanism	Details
Authentication	`JWT + OAuth2`	Bearer tokens with configurable expiry. Login via email/password. Token refresh supported.
AD / LDAP	`ldap3 + LDAPS`	Enterprise Active Directory authentication. Service-account search + user bind, or direct UPN bind. Auto-provision local users on first login.
Authorisation	`RBAC (4 roles)`	Admin — full access. Operator — manage agents/skills. Viewer — read-only. Developer — API + code execution.
Credential Vault	`AES-256 Fernet`	All credentials encrypted at rest with a server-side key. Decrypted only at skill execution time, never exposed to frontend.
Transport	`TLS 1.2+`	nginx handles TLS termination. Self-signed or CA-issued certificates supported.
Audit Trail	`Full logging`	Every agent execution, skill call, login, and config change is logged with timestamp, user, and result.
Data Governance	`PII + Regulatory`	PII masking, data classification, retention policies with ISO 27001, NIST CSF, GDPR, CCPA, HIPAA, PCI DSS, SOC 2 regulatory references.

Compliance

Data Governance & Regulatory References

AOS includes a built-in Data Governance engine that enforces enterprise policies for data classification, PII detection, retention rules, and access controls. Each policy is enriched with regulatory references mapping to international standards.

Standard	Full Name	Scope
`ISO 27001`	Information Security Management	Data classification, access control, risk management
`ISO 27701`	Privacy Information Management	PII processing, privacy controls, data subject rights
`NIST CSF`	Cybersecurity Framework	Identify, Protect, Detect, Respond, Recover
`NIST 800-53`	Security & Privacy Controls	Federal information system controls (US)
`NIST AI RMF`	AI Risk Management Framework	AI system trustworthiness, bias, transparency
`GDPR`	General Data Protection Regulation	EU personal data processing, consent, erasure
`CCPA`	California Consumer Privacy Act	Consumer data rights, opt-out, disclosure
`HIPAA`	Health Insurance Portability and Accountability Act	Protected Health Information (PHI) safeguards
`PCI DSS`	Payment Card Industry Data Security Standard	Cardholder data protection, encryption, access
`SOC 2`	System and Organization Controls 2	Security, availability, processing integrity, confidentiality, privacy

API: GET /api/data-governance/references returns the full regulatory reference catalogue. Each policy response includes a references_resolved array with standard names, full titles, and descriptions.

Performance

Semantic Query Cache Architecture

The RAGSemanticCache in rag_cache.py is a standalone service class with four public methods. It can be integrated at any point in the retrieval pipeline.

User Query → Embed Query → Redis Lookup
(cosine ≥ 0.95?) → HIT: Return Cached

MISS → Run Full RAG Pipeline → Store in Redis → Return Results

Method	Signature	Description
`get()`	`async get(query, agent_id?) → Dict \| None`	Embed query → scan Redis for cosine ≥ 0.95 → return cached result or None
`put()`	`async put(query, chunks, summary?, agent_id?)`	Store query vector + results in Redis with TTL
`invalidate()`	`async invalidate(agent_id?) → int`	Clear cache entries, optionally scoped to one agent
`get_stats()`	`get_stats() → Dict`	Return total entries, TTL, threshold, Redis status

Key format: rag_cache:{agent_id}:{sha256(query)[:16]}. Each entry stores the query vector, up to 50 serialised chunks (text capped at 4000 chars each), optional summary, and timestamp. The Redis index set tracks all active keys for efficient similarity scanning.
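The key format and the similarity test can be sketched without Redis. The code below builds a key in the documented `rag_cache:{agent_id}:{sha256(query)[:16]}` shape and scans stored query vectors for the best match at or above the 0.95 threshold. The in-memory dict stands in for the Redis index, and the function names are hypothetical rather than the `RAGSemanticCache` internals.

```python
import hashlib
import math
from typing import Optional


def cache_key(agent_id: str, query: str) -> str:
    """Build a key from the agent ID and the first 16 hex chars of SHA-256(query)."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return f"rag_cache:{agent_id}:{digest}"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def find_hit(
    query_vec: list[float],
    entries: dict[str, list[float]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the key of the most similar stored query, or None on a miss."""
    best_key, best_score = None, threshold
    for key, stored_vec in entries.items():
        score = cosine(query_vec, stored_vec)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key
```

The hash only names the entry. Matching is done on the vectors, which is what lets a reworded query reuse a cached answer.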

Setup

Deployment Options

Quick Start (Single Server)

# Clone and install
git clone https://github.com/muhammedali275/AI-Orchestrator-Studio
cd AI-Orchestrator-Studio
./install.sh

# Start all services
./start-all.sh

# Default login
#   Email:    admin@orchestrator.local
#   Password: AOS@Admin2026!

Production (Multi-Server)

Server	Role	Services
App Server	Frontend + API	`nginx` + `React build` + `FastAPI (port 8000)`
DB Server	Data layer	`PostgreSQL 16 (port 5432)` + `Redis (port 6379)`
OCR Worker(s)	Document processing	`Celery worker` + `Tesseract 5` + `OpenCV`
LLM Server	Model inference	`vLLM` or `Ollama` (GPU recommended)

System Requirements

Component	Minimum	Recommended
OS	RHEL 8 / Ubuntu 20.04	RHEL 9 / Ubuntu 22.04
CPU	4 cores	8+ cores
RAM	8 GB	16+ GB (32 GB with LLM)
Storage	50 GB	200+ GB (for documents)
Python	3.9	3.11+
Node.js	18	20+

First-time setup: After installation, log in with the default admin credentials and immediately change the password via User Management. Configure your LLM endpoint in System Config → LLM Settings. Then create your first agent in the Agent Builder.

Administration

Infrastructure Admin Panel

The Infrastructure tab on the Admin Panel gives operators a single pane of glass for viewing and editing the backend’s .env configuration and for verifying that every external service is reachable, all without SSH access.

Server Configuration

Fields are grouped into four colour-coded sections:

Section	Colour	Fields
App Server	Blue	`API_HOST`, `API_PORT`, `AUTH_SECRET_KEY`, `LOG_LEVEL`, `CORS_ORIGINS`
Database	Green	`POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DATABASE`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `VECTOR_DB_URL`, `VECTOR_DB_COLLECTION`
Worker / AI	Amber	`REDIS_HOST`, `REDIS_PORT`, `REDIS_PASSWORD`, `LLM_BASE_URL`, `LLM_DEFAULT_MODEL`, `LLM_API_KEY`
Storage	Purple	`NFS_BASE_PATH`

Password fields are always displayed as •••••••• and are only written back to .env if the user actually changes them. An amber dot appears next to any field that has been modified but not yet saved. On save the backend writes the new values into the .env file and clears the settings cache (get_settings.cache_clear()) so the next request picks up changes immediately.
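The save rule for masked passwords can be sketched as a merge that ignores any submitted value still equal to the mask. This is an illustration of the described behaviour under the assumption that the mask string is exactly eight bullet characters. The function name is hypothetical, and writing the .env file and clearing the settings cache are left out.

```python
MASK = "••••••••"  # placeholder shown for password fields (assumed to be 8 bullets)


def merge_env_updates(current: dict[str, str], submitted: dict[str, str]) -> dict[str, str]:
    """Return the new config, keeping stored secrets when the mask is echoed back."""
    merged = dict(current)
    for key, value in submitted.items():
        if value == MASK:
            continue  # field was displayed masked and never edited
        merged[key] = value
    return merged
```

A field submitted with the mask keeps its stored value, while a field the user actually retyped is overwritten.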

Service Health Check

The Service Health panel tests real connectivity to five infrastructure components:

Service	How it’s tested
PostgreSQL	TCP connect + `SELECT 1` via SQLAlchemy session
Redis	TCP connect + `PING` command
ChromaDB	TCP connect to the parsed URL host:port
Celery Workers	Scan Redis for `celery` / `_kombu` keys
vLLM Server	HTTP `GET /models` (falls back to TCP if HTTP fails)

Each service shows a green, amber, or grey status dot plus a detail string so operators can diagnose connectivity issues at a glance.
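The TCP-level checks in the table amount to parsing a host and port and attempting a connection with a short timeout. The sketch below uses only the Python standard library. The helper names, the default-port parameter, and the 2-second timeout are assumptions, not values taken from AOS.

```python
import socket
from urllib.parse import urlparse


def parse_host_port(url: str, default_port: int) -> tuple[str, int]:
    """Extract (host, port) from a URL or a bare host[:port] string."""
    parsed = urlparse(url if "://" in url else f"//{url}")
    return parsed.hostname or "localhost", parsed.port or default_port


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connect only shows that something is listening. Protocol-level probes such as `SELECT 1` or `PING` are still needed to confirm the service itself is healthy.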

Developer

API Reference (Key Endpoints)

Method	Endpoint	Description
`POST`	`/api/auth/login`	Authenticate and receive JWT token
`GET`	`/api/agents`	List all agents
`POST`	`/api/agents`	Create a new agent
`POST`	`/api/chat/{agent_id}`	Send a message to an agent
`GET`	`/api/skills`	List all registered skills
`POST`	`/api/skills`	Register a custom skill
`POST`	`/api/skills/import-package`	Import a .skill ZIP package (registers skill + ingests reference docs as RAG)
`GET`	`/api/credentials`	List credentials (metadata only)
`POST`	`/api/documents/upload`	Upload documents for RAG
`POST`	`/api/documents/batch-scan`	Scan NFS/local directory, queue Celery OCR tasks for all discovered files
`POST`	`/api/documents/register-path`	Register a watch-path for automatic document discovery
`POST`	`/api/documents/{id}/process`	Re-process a single document (re-extract, re-chunk, re-embed)
`POST`	`/api/documents/links`	Link/unlink documents to agents (populates `document_agent_links`)
`GET`	`/api/rag/debug/{agent_id}`	RAG debug info: collection stats, chunk count, embedding dimensions, sample chunks
`POST`	`/api/auth/login/ad`	Authenticate via Active Directory / LDAP
`GET`	`/api/data-governance/policies`	List data governance policies with regulatory references
`GET`	`/api/data-governance/references`	Full regulatory reference catalogue (ISO, NIST, GDPR…)
`POST`	`/api/documents/search`	Hybrid RAG search with RRF fusion + cross-encoder re-ranking. Accepts `doc_type`, `source_file`, `upload_date_from/to` filters.
`GET`	`/api/rag/cache/stats`	Semantic query cache statistics (entries, TTL, threshold, Redis status)
`DELETE`	`/api/rag/cache`	Invalidate semantic query cache (optionally scoped by `agent_id`)
`GET`	`/api/scheduler/jobs`	List scheduled agent jobs
`GET`	`/api/audit`	Retrieve audit trail entries
`GET`	`/api/admin/config`	Get infrastructure config (grouped, passwords masked)
`PUT`	`/api/admin/config`	Update `.env` fields (skips masked passwords)
`GET`	`/api/admin/health`	Test connectivity to PostgreSQL, Redis, ChromaDB, Celery, vLLM
`POST`	`/api/admin/restart-hint`	Clear settings cache, signal restart recommended

Full API docs: Once the backend is running, visit http://your-server:8000/docs for the interactive Swagger/OpenAPI documentation with all endpoints, request schemas, and response models.
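As an illustration of calling one of these endpoints from Python, the sketch below builds (but does not send) a request for `POST /api/documents/search` with a bearer token. The filter names `doc_type`, `source_file`, `upload_date_from`, and `upload_date_to` come from the table above. The `query` body field and the helper name are assumptions, so check the Swagger schema for the real request model.

```python
import json
import urllib.request


def build_search_request(base_url: str, token: str, query: str, **filters) -> urllib.request.Request:
    """Build a JSON POST request for the hybrid search endpoint.

    Filters whose value is None are omitted from the body.
    """
    payload = {"query": query}  # "query" is an assumed field name
    payload.update({k: v for k, v in filters.items() if v is not None})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/documents/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect or log first.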