Cortex (LLM gateway)

The internal hop every LLM, embedding, and image call goes through. Where model routing, fallback, and observability live.

Cortex is the in-cluster gateway every LLM, embedding, image, and OCR call goes through. It exists for one reason: the platform must be able to swap models, fall back on failure, and apply governance without rebuilding workflows. The Cortex service also serves the customer-facing Chat application; this page covers the gateway internals.

Cortex is an internal service. Workflows never call it directly — they reference a capability ("an LLM that supports tool use"), and Cortex resolves that capability to whichever installed integration matches.

What it does

Concern	How Cortex handles it
Capability resolution	A workflow asks for a capability, such as an LLM. Cortex picks the provider in precedence order: per-call override, then org policy, then an automatic match against installed candidates. Providers are never hardcoded in the workflow.
Credential handling	Integration credentials live in the Secrets vault. Cortex fetches the right credential for the resolved provider, scoped to the workspace and the request.
Streaming	All providers are exposed through a single SSE-based streaming interface, so the agentic runtime doesn't care whether the model is OpenAI, Anthropic, vLLM, or Bedrock.
Fallback	If the primary provider returns an error whose class matches the org's fallback policy, Cortex retries the call against the next matching candidate.
Observability	Token counts, latency, cost, model id, and integration source are emitted to the analytics pipeline for billing and dashboards.
Governance	The DLP guardrails layer can pre- and post-filter content (PII detection, hallucination scoring) without the workflow having to wire it in.
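The resolution order and fallback rule in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not Cortex's implementation: `Candidate`, `ProviderError`, `resolve`, and `call_with_fallback` are hypothetical names, org policy is modelled as an ordered allowlist of integration ids, and the error class strings (`"timeout"`, `"rate_limited"`) are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """An installed integration exposing the requested capability (hypothetical)."""
    integration_id: str


class ProviderError(Exception):
    """A failed provider call, tagged with an error class (hypothetical)."""

    def __init__(self, error_class: str):
        super().__init__(error_class)
        self.error_class = error_class


def resolve(installed, per_call_override=None, org_policy=None):
    """Order candidates: per-call override, then org policy, then automatic match."""
    by_id = {c.integration_id: c for c in installed}
    if per_call_override is not None:
        # A pinned model wins outright, but only if it is actually installed.
        return [by_id[per_call_override]] if per_call_override in by_id else []
    if org_policy:
        # Assumption: org policy is an ordered allowlist; uninstalled ids are skipped.
        return [by_id[i] for i in org_policy if i in by_id]
    # Automatic match: every installed candidate, in registry order.
    return list(installed)


def call_with_fallback(candidates, invoke, fallback_on):
    """Try candidates in order, moving on only for error classes in fallback_on."""
    last_error = None
    for candidate in candidates:
        try:
            return invoke(candidate)
        except ProviderError as err:
            if err.error_class not in fallback_on:
                raise  # not covered by the fallback policy: surface it unchanged
            last_error = err
    if last_error is not None:
        raise last_error  # every candidate failed with a fallback-eligible error
    raise LookupError("no installed integration matches the capability")


installed = [Candidate("vllm-local"), Candidate("openai")]
order = resolve(installed, org_policy=["vllm-local", "openai"])


def invoke(candidate):
    # Simulated provider call: the self-hosted model times out.
    if candidate.integration_id == "vllm-local":
        raise ProviderError("timeout")
    return f"answered by {candidate.integration_id}"


print(call_with_fallback(order, invoke, fallback_on={"timeout", "rate_limited"}))
# answered by openai
```

In this sketch an error whose class is not listed in the fallback policy is re-raised immediately instead of being tried against the next candidate, so a policy that lists no error classes disables fallback entirely.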

Provider catalogue

Cortex doesn't ship its own models. It exposes whatever providers are installed in the integration registry:

Self-hosted — Ollama, vLLM, Azure AI Foundry deployments in your own subscription.
Cloud — OpenAI, Anthropic, Mistral, AWS Bedrock, Azure OpenAI.
Custom — anything authored with the Integrations SDK that exposes an LLM capability.

Each model declares its own capability surface — context window, tool support, temperature range, structured output, vision, etc. Workflows can require those capabilities, and Cortex routes only to models that match.
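A minimal sketch of capability-based routing, assuming a model's declared surface can be represented as a flat record. `ModelCapabilities`, `Requirements`, `satisfies`, and `route` are hypothetical names, the fields mirror the examples listed above, and the model ids are placeholders rather than real deployments.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelCapabilities:
    """Capability surface a model declares (hypothetical field names)."""
    model_id: str
    context_window: int                       # in tokens
    tool_use: bool
    structured_output: bool
    vision: bool
    temperature_range: Tuple[float, float]    # inclusive (low, high)


@dataclass(frozen=True)
class Requirements:
    """What a workflow requires of the model it is routed to (hypothetical)."""
    min_context_window: int = 0
    tool_use: bool = False
    structured_output: bool = False
    vision: bool = False
    temperature: Optional[float] = None


def satisfies(model: ModelCapabilities, req: Requirements) -> bool:
    """True if the model's declared surface covers every requirement."""
    if model.context_window < req.min_context_window:
        return False
    # A required feature must be declared; features nobody asked for are ignored.
    for feature in ("tool_use", "structured_output", "vision"):
        if getattr(req, feature) and not getattr(model, feature):
            return False
    if req.temperature is not None:
        low, high = model.temperature_range
        if not low <= req.temperature <= high:
            return False
    return True


def route(models, req):
    """Return the ids of models eligible for this call, in input order."""
    return [m.model_id for m in models if satisfies(m, req)]


models = [
    ModelCapabilities("small-local", 8_192, False, False, False, (0.0, 1.0)),
    ModelCapabilities("large-cloud", 200_000, True, True, True, (0.0, 2.0)),
]
print(route(models, Requirements(tool_use=True)))              # ['large-cloud']
print(route(models, Requirements(min_context_window=4_000)))   # ['small-local', 'large-cloud']
print(route(models, Requirements(temperature=1.5)))            # ['large-cloud']
```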

Self-hosted vs. cloud routing

Cortex doesn't care whether the model runs in your GPU namespace or at a third-party vendor. The same call site is used:

Workflow → Cortex → resolve capability → pick provider:
  ├── Self-hosted vLLM in scrydon-inference namespace
  ├── Ollama service in scrydon-inference namespace
  ├── Azure AI Foundry deployment in your subscription
  ├── OpenAI / Anthropic / Bedrock (external, opt-in)
  └── Custom authored integration
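The tree above works because every branch sits behind the same interface. A minimal sketch, assuming a Python `Protocol` as the common streaming surface: `LLMProvider`, `SelfHostedAdapter`, `CloudAdapter`, and `complete` are hypothetical names, and the adapters yield canned tokens instead of talking to a real backend.

```python
from typing import Iterator, Protocol


class LLMProvider(Protocol):
    """Uniform streaming surface every provider adapter exposes (hypothetical)."""
    name: str

    def stream(self, prompt: str) -> Iterator[str]:
        ...


class SelfHostedAdapter:
    """Stand-in for an in-cluster backend such as vLLM or Ollama."""
    name = "self-hosted"

    def stream(self, prompt: str) -> Iterator[str]:
        # A real adapter would read the backend's streaming response here.
        yield from ("local:", " ", prompt)


class CloudAdapter:
    """Stand-in for an external vendor API."""
    name = "cloud"

    def stream(self, prompt: str) -> Iterator[str]:
        yield from ("cloud:", " ", prompt)


def complete(provider: LLMProvider, prompt: str) -> str:
    """The single call site: identical whichever provider was resolved."""
    return "".join(provider.stream(prompt))


for p in (SelfHostedAdapter(), CloudAdapter()):
    print(complete(p, "hi"))
# local: hi
# cloud: hi
```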

The choice is governed by org policy. Common patterns:

"All AI on-cluster" — only self-hosted providers are installed. No outbound calls.
"Self-hosted with cloud burst" — the default model is self-hosted; the org allows fallback to a cloud provider for tool-heavy or long-context calls.
"Cloud first" — the org runs primarily on a managed cloud model (typical Azure-tenant deployments).
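The three patterns can be read as different orderings of the same installed providers. This sketch is illustrative only: `Location`, `Provider`, `candidate_order`, and the pattern labels are hypothetical, and placing self-hosted providers after cloud ones under "Cloud first" is an assumption, since the patterns above do not say what happens to them.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    SELF_HOSTED = "self-hosted"   # e.g. vLLM or Ollama in the inference namespace
    CLOUD = "cloud"               # external vendor, outbound call


@dataclass(frozen=True)
class Provider:
    name: str
    location: Location


def candidate_order(providers, pattern, burst_eligible=False):
    """Order providers for one call under a policy pattern (hypothetical labels).

    burst_eligible marks a tool-heavy or long-context call.
    """
    local = [p for p in providers if p.location is Location.SELF_HOSTED]
    cloud = [p for p in providers if p.location is Location.CLOUD]
    if pattern == "on-cluster":
        return local                      # no outbound calls at all
    if pattern == "cloud-burst":
        # Self-hosted by default; cloud is appended only for eligible calls.
        return local + cloud if burst_eligible else local
    if pattern == "cloud-first":
        return cloud + local              # assumption: self-hosted kept as a tail
    raise ValueError(f"unknown pattern: {pattern}")


providers = [Provider("vllm", Location.SELF_HOSTED), Provider("openai", Location.CLOUD)]
print([p.name for p in candidate_order(providers, "cloud-burst")])        # ['vllm']
print([p.name for p in candidate_order(providers, "cloud-burst", True)])  # ['vllm', 'openai']
```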

Where you configure this

Integrations — what's installed and which capabilities each provider exposes. Settings → Platform → Integrations.
Capability policy — per-capability defaults and allowlists. Settings → Platform → Integrations → [Vendor] → Capabilities.
Per-workflow override — a specific Agent block can pin a model. The block UI shows the resolved provider so you can see what would run.

See the Vendors capability resolution section for the full ordering.

Related

Integrations — the registry Cortex resolves against.
Vendors — the catalogue of providers Scrydon ships with.
Architecture → Agentic — what calls Cortex.
Security → Secrets management — where provider credentials live.
