Cortex (LLM gateway)
The internal hop every LLM, embedding, image, and OCR call goes through. Where model routing, fallback, and observability live.
Cortex is the in-cluster gateway every LLM, embedding, image, and OCR call goes through. It exists for one reason: the platform must be able to swap models, fall back on failure, and apply governance without rebuilding workflows. The Cortex service also serves the customer-facing Chat application; this page covers the gateway internals.
Cortex is an internal service. Workflows never call it directly — they reference a capability ("an LLM that supports tool use"), and Cortex resolves that capability to whichever installed integration matches.
What it does
| Concern | How Cortex handles it |
|---|---|
| Capability resolution | A workflow asks for a capability such as an LLM. Cortex picks the provider in a fixed order: per-call override → org policy → automatic match against installed candidates. The provider is never hardcoded in the workflow. |
| Credential handling | Integration credentials live in the Secrets vault. Cortex fetches the right credential for the resolved provider, scoped to the workspace and the request. |
| Streaming | All providers are exposed through a single SSE-based streaming interface, so the agentic runtime doesn't care whether the model is OpenAI, Anthropic, vLLM, or Bedrock. |
| Fallback | If the primary provider returns an error class that matches the org's fallback policy, Cortex retries against the next match. |
| Observability | Token counts, latency, cost, model id, and integration source are emitted to the analytics pipeline for billing and dashboards. |
| Governance | The DLP guardrails layer can pre- and post-filter content (PII detection, hallucination scoring) without the workflow having to wire it in. |
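
The resolution order in the table (per-call override, then org policy, then automatic match) can be sketched as below. This is a minimal illustration, not Cortex's actual code: `Provider`, `OrgPolicy`, `resolve`, and the provider names are hypothetical, and it assumes the allowlist also applies to per-call overrides, which the page does not state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Provider:
    """Hypothetical record for an installed integration."""
    name: str
    capabilities: frozenset


@dataclass
class OrgPolicy:
    """Hypothetical org policy: a default provider per capability plus an allowlist."""
    defaults: dict = field(default_factory=dict)  # capability -> provider name
    allowlist: Optional[set] = None               # None means "no restriction"


def resolve(capability, installed, policy, override=None):
    """Pick a provider: per-call override, then policy default, then first match."""
    # Eligible candidates expose the capability and pass the allowlist (assumed rule).
    candidates = [
        p for p in installed
        if capability in p.capabilities
        and (policy.allowlist is None or p.name in policy.allowlist)
    ]
    by_name = {p.name: p for p in candidates}

    # 1. A per-call override wins, but only if it names an eligible candidate.
    if override is not None:
        if override not in by_name:
            raise LookupError(f"override {override!r} is not an eligible provider")
        return by_name[override]

    # 2. Otherwise use the org policy default for this capability, if eligible.
    default = policy.defaults.get(capability)
    if default in by_name:
        return by_name[default]

    # 3. Otherwise fall back to the first eligible installed candidate.
    if candidates:
        return candidates[0]
    raise LookupError(f"no installed provider exposes {capability!r}")
```

With two installed providers that both expose `"llm"`, a policy default selects one of them, and passing `override=` on a single call selects the other without changing the policy.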
Provider catalogue
Cortex doesn't ship its own models. It exposes whatever providers are installed in the integration registry:
- Self-hosted — Ollama, vLLM, Azure AI Foundry deployments in your own subscription.
- Cloud — OpenAI, Anthropic, Mistral, AWS Bedrock, Azure OpenAI.
- Custom — anything authored with the Integrations SDK that exposes an LLM capability.
Each model declares its own capability surface — context window, tool support, temperature range, structured output, vision, etc. Workflows can require those capabilities, and Cortex routes only to models that match.
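
As a rough sketch of how a declared capability surface could be matched against a workflow's requirements: the field names (`context_window`, `tool_support`, and so on) mirror the list above, but the `ModelSurface` and `Requirement` types and the matching function are assumptions made for illustration, not the registry's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSurface:
    """Hypothetical capability surface declared by one model."""
    model_id: str
    context_window: int  # maximum tokens the model accepts
    tool_support: bool = False
    structured_output: bool = False
    vision: bool = False


@dataclass(frozen=True)
class Requirement:
    """Hypothetical requirement a workflow attaches to a call."""
    min_context_window: int = 0
    tool_support: bool = False
    structured_output: bool = False
    vision: bool = False


def satisfies(model, req):
    """True when the model offers every capability the requirement asks for."""
    return (
        model.context_window >= req.min_context_window
        and (model.tool_support or not req.tool_support)
        and (model.structured_output or not req.structured_output)
        and (model.vision or not req.vision)
    )


def eligible_models(models, req):
    """Keep only the models whose declared surface matches the requirement."""
    return [m for m in models if satisfies(m, req)]
```

A requirement with every field left at its default matches any model, so a workflow only constrains routing for the capabilities it explicitly asks for.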
Self-hosted vs. cloud routing
Cortex doesn't care whether the model runs in your GPU namespace or at a third-party vendor. The same call site is used:
Workflow → Cortex → resolve capability → pick provider:
├── Self-hosted vLLM in scrydon-inference namespace
├── Ollama service in scrydon-inference namespace
├── Azure AI Foundry deployment in your subscription
├── OpenAI / Anthropic / Bedrock (external, opt-in)
└── Custom authored integration
The choice is governed by org policy. Common patterns:
- "All AI on-cluster" — only self-hosted providers are installed. No outbound calls.
- "Self-hosted with cloud burst" — the default model is self-hosted; the org allows fallback to a cloud provider for tool-heavy or long-context calls.
- "Cloud first" — the org runs primarily on a managed cloud model (typical Azure-tenant deployments).
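
The "self-hosted with cloud burst" pattern relies on fallback: when the primary provider fails with an error class the org policy covers, the next match is tried. The sketch below illustrates that loop under stated assumptions. `ProviderError`, `call_with_fallback`, and the error class names are hypothetical, and providers are modelled as plain callables rather than real integrations.

```python
class ProviderError(Exception):
    """Hypothetical provider failure tagged with an error class."""

    def __init__(self, error_class, message=""):
        super().__init__(message or error_class)
        self.error_class = error_class


def call_with_fallback(providers, request, fallback_on):
    """Try providers in order; move to the next only for covered error classes.

    providers   -- ordered list of (name, callable) pairs, primary first
    fallback_on -- set of error classes the org policy allows fallback for
    """
    last_error = None
    for name, call in providers:
        try:
            return name, call(request)
        except ProviderError as err:
            if err.error_class not in fallback_on:
                raise  # the policy does not cover this error class
            last_error = err
    if last_error is None:
        raise LookupError("no providers configured")
    raise last_error  # every provider failed with a covered error class


def self_hosted(request):
    # Stand-in for a self-hosted model that is currently timing out.
    raise ProviderError("timeout")


def cloud(request):
    # Stand-in for a cloud provider that answers normally.
    return "echo:" + request


used, reply = call_with_fallback(
    [("self-hosted", self_hosted), ("cloud", cloud)],
    "hello",
    fallback_on={"timeout", "rate_limited"},
)
```

An "all AI on-cluster" org would pass only self-hosted entries in `providers`, so a failure there surfaces as an error instead of producing an outbound call.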
Where you configure this
- Integrations — what's installed and which capabilities each provider exposes. Settings → Platform → Integrations.
- Capability policy — per-capability defaults and allowlists. Settings → Platform → Integrations → [Vendor] → Capabilities.
- Per-workflow override — a specific Agent block can pin a model. The block UI shows the resolved provider so you can see what would run.
See the Vendors capability resolution section for the full ordering.
Related
- Integrations — the registry Cortex resolves against.
- Vendors — the catalogue of providers Scrydon ships with.
- Architecture → Agentic — what calls Cortex.
- Security → Secrets management — where provider credentials live.