Data Sources

Author declarative poll data sources and ship them inside a Scrydon Pack — no runtime code required

This artifact ships inside a Pack. For the shared lifecycle — install, pack build, upload — see Packs & Authoring SDK.

A Declarative Data Source is a poll-mode table source encoded entirely as JSON-serializable configuration — an HTTP request spec, an itemsPath selector, a field-mapping table, and a column list. You author it once with defineDataSource; the platform's generic poll runtime drives it on every tick without executing any customer-supplied code.

Data sources shipped in a Pack are pure JSON — the archive carries the request spec, mapping, and column definitions, never code. The generic poll runtime (fetch → select → map → validate) lives in the platform and is invoked from the manifest on each tick.

When to use this SDK

Use the Data Source Authoring SDK when:

You want to pull rows from a public REST API on a schedule and make them available as a typed table inside Scrydon.
You need the source definition to live in a Pack so re-uploading the pack also refreshes the source (same idempotency contract as ontologies and process flows).
You are building a demo or starter Pack and want the data tables to come pre-wired with no manual configuration.

Do not use this SDK if:

Your ingest logic is conditional, stateful, or requires code (pagination cursors, OAuth token refresh, custom payload signing). For those, author a code-shipped source in the platform monorepo — it uses the same defineDataSource call but with a produce() function.
The destination is not a table. Non-tabular ingestion paths are out of scope for this SDK today.

Key constraint: packs carry pure JSON — never functions. The bundler strips any produce() functions during serialization. A packable source must therefore be fully declarative — every field in the source definition must survive JSON.parse(JSON.stringify(...)) round-trip intact.
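One way to check the round-trip constraint before building a pack is to compare a definition with its serialized copy. The sketch below is a minimal illustration; `survivesJsonRoundTrip` and `deepEqual` are hypothetical helpers and are not part of @scrydon/sdk-authoring.

```typescript
// Hypothetical helper, not part of the SDK.
// Returns true when a value is unchanged by JSON.parse(JSON.stringify(...)).
function survivesJsonRoundTrip(value: unknown): boolean {
  const json: string | undefined = JSON.stringify(value);
  if (json === undefined) return false; // e.g. a bare function or undefined
  return deepEqual(value, JSON.parse(json));
}

// Structural comparison for plain JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a as object);
  const bKeys = Object.keys(b as object);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}

const declarative = {
  mode: "poll",
  request: { url: "https://api.adsb.lol/v2/mil", method: "GET" },
};
const withCode = { mode: "poll", produce: () => [] };

console.log(survivesJsonRoundTrip(declarative)); // true
console.log(survivesJsonRoundTrip(withCode));    // false: the produce field is dropped
```

Values such as `NaN`, `undefined` properties, and `Date` objects also fail this check, because JSON serialization changes them.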

Install

bun add -d @scrydon/sdk-authoring zod

npm install --save-dev @scrydon/sdk-authoring zod

import { defineDataSource } from '@scrydon/sdk-authoring/integrations'

Anatomy of a declarative data source

A declarative data source is made up of five blocks:

Block	Field(s)	Purpose
`request`	`url`, `method`, `headers`, `query`, `authRef`	Describes the HTTP call the runtime makes on each tick. `url` must be `https://`. Credentials are referenced by `authRef` (a credential connection id) — never inline.
`response.itemsPath`	`itemsPath`	A dot/bracket path (e.g. `$.ac`, `data.items`) from the JSON response envelope to the array of row objects. Leading `$.` is optional.
`mapping`	`Record<columnName, FieldMapping>`	Maps each output column from a source field, optionally applying one of the four bounded transforms.
`filter`	`requireNonNull`	Drops candidate rows before mapping if any of the listed source field paths is `null` or `undefined`.
`table.columns`	`TableColumnDef[]`	Declares the output schema. Column names and types drive the row validator — no separate Zod `table.schema` is needed for declarative sources.
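To make the select and filter steps concrete, here is a rough sketch of how an itemsPath could be resolved and how requireNonNull could drop rows. The platform's actual selector is not shown in this guide, so `resolvePath` and `applyRequireNonNull` are assumed names and the path grammar handled here (dots plus numeric brackets) is a simplification.

```typescript
// Illustrative only: not the platform's selector or filter.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Resolve a dot/bracket path such as "$.ac", "data.items" or "ac[0].hex".
function resolvePath(root: Json | undefined, path: string): Json | undefined {
  const tokens = path
    .replace(/^\$\.?/, "")        // the leading "$." is optional
    .replace(/\[(\d+)\]/g, ".$1") // "items[0]" becomes "items.0"
    .split(".")
    .filter((t) => t.length > 0);
  let current: Json | undefined = root;
  for (const token of tokens) {
    if (current === null || typeof current !== "object") return undefined;
    current = Array.isArray(current) ? current[Number(token)] : current[token];
  }
  return current;
}

// Keep only rows where every listed source path is neither null nor undefined.
function applyRequireNonNull(rows: Json[], requireNonNull: string[]): Json[] {
  return rows.filter((row) =>
    requireNonNull.every((p) => {
      const v = resolvePath(row, p);
      return v !== null && v !== undefined;
    }),
  );
}

const envelope: Json = {
  now: 1700000000000,
  ac: [
    { hex: "ae1234", lat: 52.1, lon: 4.3 },
    { hex: "ae5678", lat: null, lon: 4.9 }, // dropped: lat is null
    { lat: 50.0, lon: 3.0 },                // dropped: hex is missing
  ],
};

const items = resolvePath(envelope, "$.ac");
const candidates = Array.isArray(items)
  ? applyRequireNonNull(items, ["hex", "lat", "lon"])
  : [];
console.log(candidates.length); // 1
```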

The mapping DSL

Each mapping entry either copies a path directly or applies a named transform. The transform set is bounded and auditable — adding a new transform requires a reviewed code change to the platform, not pack data. Arbitrary expressions, eval, and sandboxed code are intentionally not supported.

Transform	What it does	Example `args`
`trim_to_null`	Trims leading/trailing whitespace; coerces empty string or non-string to `null`. `"RCH123 "` → `"RCH123"`, `" "` → `null`, `undefined` → `null`.	(none required)
`number_or_null`	Passes finite numbers unchanged; coerces strings, `NaN`, `Infinity`, and `undefined` to `null`.	(none required)
`value_map`	Maps literal string keys via a `map` dictionary; `passthrough: "number"` passes numeric values unchanged; anything else falls back to `default`.	`{ map: { ground: 0 }, passthrough: "number", default: null }`
`iso_from_epoch_offset`	Combines an envelope-level base timestamp (`basePath` resolves against the response root) with a per-row signed offset, and returns an ISO 8601 string. Useful for APIs that report an absolute `now` clock plus per-aircraft `seen` seconds-ago values.	`{ basePath: "$.now", baseUnit: "ms", offsetUnit: "s", direction: "subtract" }`
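The first three transforms are small enough to sketch as plain functions. The versions below are re-implementations written from the table above, not the platform's code; the function names and the `ValueMapArgs` type are assumptions. The table does not say whether `passthrough: "number"` also passes non-finite numbers, so this sketch passes any number.

```typescript
// Illustrative re-implementations; not the platform's transform code.
function trimToNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed === "" ? null : trimmed;
}

function numberOrNull(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

type Scalar = string | number | boolean | null;

interface ValueMapArgs {
  map: Record<string, Scalar>;
  passthrough?: "number";
  default: Scalar;
}

function valueMap(value: unknown, args: ValueMapArgs): Scalar {
  // Literal string keys are looked up in the dictionary first.
  if (typeof value === "string" && Object.prototype.hasOwnProperty.call(args.map, value)) {
    return args.map[value] ?? null;
  }
  // Assumption: any number passes through, finite or not.
  if (args.passthrough === "number" && typeof value === "number") return value;
  return args.default;
}

const altitudeArgs: ValueMapArgs = { map: { ground: 0 }, passthrough: "number", default: null };

console.log(trimToNull("RCH123 "));             // "RCH123"
console.log(numberOrNull("431"));               // null
console.log(valueMap("ground", altitudeArgs));  // 0
console.log(valueMap(35000, altitudeArgs));     // 35000
console.log(valueMap(undefined, altitudeArgs)); // null
```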

The `iso_from_epoch_offset` hard case

Some REST APIs — the ADS-B family is the canonical example — do not return per-row absolute timestamps. Instead, the envelope carries a single now field (epoch milliseconds) and each row carries a seen field (seconds ago). The iso_from_epoch_offset transform bridges the two:

seenAt = new Date(envelope.now - row.seen * 1000).toISOString()

Configure it as:

seenAt: {
  path: "seen",                              // per-row offset field
  transform: "iso_from_epoch_offset",
  args: {
    basePath: "$.now",                       // envelope field — resolved against the response root
    baseUnit: "ms",                          // $.now is epoch milliseconds
    offsetUnit: "s",                         // row.seen is seconds
    direction: "subtract",                   // now − seen → absolute time
  },
},

If row.seen is missing or non-numeric, the runtime defaults the offset to 0 (base time exactly), matching the code-source convention seenSecondsAgo = typeof raw.seen === "number" ? raw.seen : 0.
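The arithmetic can be sketched as a standalone function. This is an illustration under assumptions: `isoFromEpochOffset` and `EpochOffsetArgs` are hypothetical names, the base value is taken as already resolved from `basePath`, and the `"add"` direction is assumed to exist as the counterpart of `"subtract"` (only `"subtract"` appears in this guide).

```typescript
// Illustrative sketch; not the platform's implementation.
interface EpochOffsetArgs {
  baseUnit: "ms" | "s";
  offsetUnit: "ms" | "s";
  direction: "add" | "subtract"; // "add" is an assumption
}

function isoFromEpochOffset(base: number, offset: unknown, args: EpochOffsetArgs): string {
  const toMs = (v: number, unit: "ms" | "s"): number => (unit === "s" ? v * 1000 : v);
  // A missing or non-numeric offset falls back to 0, i.e. the base time exactly.
  const offsetValue = typeof offset === "number" ? offset : 0;
  const baseMs = toMs(base, args.baseUnit);
  const offsetMs = toMs(offsetValue, args.offsetUnit);
  const epochMs = args.direction === "subtract" ? baseMs - offsetMs : baseMs + offsetMs;
  return new Date(epochMs).toISOString();
}

const adsbArgs: EpochOffsetArgs = { baseUnit: "ms", offsetUnit: "s", direction: "subtract" };

// envelope.now = 1700000000000 ms, row.seen = 12 s
console.log(isoFromEpochOffset(1700000000000, 12, adsbArgs));        // 2023-11-14T22:13:08.000Z
console.log(isoFromEpochOffset(1700000000000, undefined, adsbArgs)); // 2023-11-14T22:13:20.000Z
```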

A complete example

The following is the real adsb-lol-military-declarative source shipped in @scrydon/sdk-authoring. It pulls military aircraft positions from https://api.adsb.lol/v2/mil and maps them to a typed aircraft_position table. The golden parity test in the monorepo asserts this produces byte-for-byte identical rows to the equivalent code source.

import { defineDataSource } from '@scrydon/sdk-authoring/integrations'

export const adsbLolMilitaryDeclarative = defineDataSource({
  kind: "table",
  id: "adsb-lol-military-declarative",
  vendor: "adsb-lol",
  displayName: "ADS-B Lol — Military Aircraft (declarative)",
  scope: "global",
  table: {
    name: "aircraft_position",
    primaryKey: ["icao24", "seenAt"],
    timestampColumn: "seenAt",
    columns: [
      { name: "icao24",           dataType: "string",  isPrimaryKey: true },
      { name: "callsign",         dataType: "string",  nullable: true },
      { name: "registration",     dataType: "string",  nullable: true },
      { name: "aircraftType",     dataType: "string",  nullable: true },
      { name: "category",         dataType: "string",  nullable: true },
      { name: "latitude",         dataType: "decimal" },
      { name: "longitude",        dataType: "decimal" },
      { name: "altitudeFeet",     dataType: "int",     nullable: true },
      { name: "groundSpeedKnots", dataType: "double",  nullable: true },
      { name: "heading",          dataType: "double",  nullable: true },
      { name: "seenAt",           dataType: "timestamp", isPrimaryKey: true },
    ],
  },
  ingest: {
    mode: "poll",
    intervalSec: 60,
    minIntervalSec: 30,
    request: {
      url: "https://api.adsb.lol/v2/mil",
      method: "GET",
      headers: { accept: "application/json" },
    },
    response: { itemsPath: "$.ac" },
    filter: {
      // Drop rows missing hex (ICAO), lat, or lon — mirrors the code source guard.
      requireNonNull: ["hex", "lat", "lon"],
    },
    mapping: {
      icao24:           { path: "hex" },
      // "RCH123 " → "RCH123", "   " → null. Mirrors raw.flight?.trim() || null
      callsign:         { path: "flight", transform: "trim_to_null" },
      // undefined → null. Mirrors raw.r ?? null
      registration:     { path: "r" },
      // undefined → null. Mirrors raw.t ?? null
      aircraftType:     { path: "t" },
      // undefined → null. Mirrors raw.category ?? null
      category:         { path: "category" },
      latitude:         { path: "lat" },
      longitude:        { path: "lon" },
      // "ground" → 0, number → passthrough, else → null. Mirrors parseAltitude()
      altitudeFeet: {
        path: "alt_baro",
        transform: "value_map",
        args: { map: { ground: 0 }, passthrough: "number", default: null },
      },
      // finite number → itself, string/NaN/undefined → null
      groundSpeedKnots: { path: "gs",    transform: "number_or_null" },
      heading:          { path: "track", transform: "number_or_null" },
      // new Date(envelope.now - row.seen * 1000).toISOString()
      seenAt: {
        path: "seen",
        transform: "iso_from_epoch_offset",
        args: {
          basePath: "$.now",
          baseUnit: "ms",
          offsetUnit: "s",
          direction: "subtract",
        },
      },
    },
  },
})

defineDataSource returns the definition you pass in without adding behavior: it validates the manifest against the DataSourceManifestSchema Zod schema, and a row validator is derived from table.columns. The emitted definition is pure data, with no produce() function and no closures, and it serializes as-is into data-source-<slug>/manifest.json.

Write modes

Every tick produces a batch of rows; ingest.writeMode controls how that batch lands in the table. It is install-only — it provisions the StarRocks key model at table-create time and cannot be changed afterwards (changing it requires re-installing the pack with an updated manifest). The default is upsert.

`writeMode`	Dedup key	Rows kept per entity	Use it for
`upsert` (default)	`table.primaryKey` (identity)	1 — latest only	"Current state" feeds. Each poll overwrites the entity's row in place; no history.
`changed-only`	`primaryKey` + a synthesized content hash	N — one per distinct state	Slowly-changing data where you want history but not a row per poll. Identical consecutive polls collapse onto the same key; a row is written only when a value actually changes.
`append`	none	every poll	Raw event streams where every observation matters, including exact repeats.
`replace`	`primaryKey`	latest full snapshot	Small reference tables re-published wholesale each tick (truncate-then-load).

changed-only vs upsert. Both upsert under the hood, but they dedup on different keys. upsert keeps one always-current row per entity (no history). changed-only keeps one row per distinct state the entity passed through, skipping polls where nothing changed. For continuously-changing telemetry (e.g. an aircraft's live position, where latitude/longitude change every poll) changed-only behaves almost identically to append — the content hash differs every tick. There, the growth lever is retention, not the write mode.

ingest: {
  mode: "poll",
  intervalSec: 60,
  writeMode: "changed-only", // omit for the "upsert" default
  // ...request, response, mapping
}
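The difference between the four modes comes down to the key each row is stored under. The sketch below models that in memory, assuming the content hash covers every field of the row and using a JSON string in place of a real hash. `applyBatches` is a hypothetical name; the real key model is provisioned in the warehouse, not in application code.

```typescript
// Illustrative in-memory model of the dedup keys; not how the warehouse is implemented.
type Row = Record<string, string | number | boolean | null>;
type WriteMode = "upsert" | "changed-only" | "append" | "replace";

function applyBatches(mode: WriteMode, primaryKey: string[], batches: Row[][]): Row[] {
  let table = new Map<string, Row>();
  let seq = 0;
  for (const batch of batches) {
    if (mode === "replace") table = new Map<string, Row>(); // truncate-then-load
    for (const row of batch) {
      const identity = primaryKey.map((k) => String(row[k])).join("|");
      let key: string;
      if (mode === "append") {
        key = String(seq++); // no dedup key: every observation is kept
      } else if (mode === "changed-only") {
        key = `${identity}#${JSON.stringify(row)}`; // identity plus content
      } else {
        key = identity; // upsert and replace
      }
      table.set(key, row);
    }
  }
  return Array.from(table.values());
}

// Three polls of one entity; the second poll repeats the first.
const polls: Row[][] = [
  [{ id: "a", status: "ok" }],
  [{ id: "a", status: "ok" }],
  [{ id: "a", status: "down" }],
];

console.log(applyBatches("upsert", ["id"], polls).length);       // 1
console.log(applyBatches("changed-only", ["id"], polls).length); // 2
console.log(applyBatches("append", ["id"], polls).length);       // 3
console.log(applyBatches("replace", ["id"], polls).length);      // 1
```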

Upgrading a source's columns

Shipping a new pack version that adds columns to table.columns just works: on the first sync after the upgrade, the platform compares your declared columns against the existing table and adds the missing ones in place — with the declared type, classification, and description, no re-install and no data loss. Orgs that upgraded before syncing recover automatically on their next scheduled sync.

Two changes are not applied in place, because the warehouse cannot alter them on a live table:

Changing a column's type. The existing column keeps its type; declare a new column instead.
Adding or changing a primaryKey column. The key model is fixed at create time (same rule as writeMode). The sync log records the skipped key column; pick a new table.name (which provisions a fresh table) if the identity truly has to change.

Columns you remove from the declaration are left in place in the warehouse — they simply stop receiving values.
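The upgrade rules above amount to a diff between the existing table and the new declaration. The following sketch classifies each column accordingly; `planColumnUpgrade` and `UpgradePlan` are hypothetical names used for illustration, and the platform's real comparison may differ in detail.

```typescript
// Illustrative classification of an upgrade; not the platform's sync code.
interface ColumnDef {
  name: string;
  dataType: string;
  isPrimaryKey?: boolean;
}

interface UpgradePlan {
  add: string[];               // new non-key columns, added in place
  skippedTypeChange: string[]; // existing columns whose declared type differs
  skippedKey: string[];        // added or changed primary-key columns
  leftInPlace: string[];       // removed from the declaration, kept in the warehouse
}

function planColumnUpgrade(existing: ColumnDef[], declared: ColumnDef[]): UpgradePlan {
  const byName = new Map(existing.map((c) => [c.name, c] as const));
  const plan: UpgradePlan = { add: [], skippedTypeChange: [], skippedKey: [], leftInPlace: [] };
  for (const col of declared) {
    const current = byName.get(col.name);
    if (!current) {
      (col.isPrimaryKey ? plan.skippedKey : plan.add).push(col.name);
      continue;
    }
    if (Boolean(current.isPrimaryKey) !== Boolean(col.isPrimaryKey)) plan.skippedKey.push(col.name);
    if (current.dataType !== col.dataType) plan.skippedTypeChange.push(col.name);
  }
  const declaredNames = new Set(declared.map((c) => c.name));
  plan.leftInPlace = existing.filter((c) => !declaredNames.has(c.name)).map((c) => c.name);
  return plan;
}

const upgradePlan = planColumnUpgrade(
  [
    { name: "icao24", dataType: "string", isPrimaryKey: true },
    { name: "latitude", dataType: "decimal" },
    { name: "heading", dataType: "double" },
  ],
  [
    { name: "icao24", dataType: "string", isPrimaryKey: true },
    { name: "latitude", dataType: "double" },                    // type change: skipped
    { name: "squawk", dataType: "string" },                      // new column: added
    { name: "seenAt", dataType: "timestamp", isPrimaryKey: true }, // new key column: skipped
  ],
);
console.log(upgradePlan);
```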

After an upgrade you can refresh everything at once: the Analytics → Data Sources view has a Resync all toolbar action that re-runs every installed source's sync (a few at a time), alongside the per-row Sync now. Statuses, last-sync times, and row counts update in the table as each source finishes.

Static reference data (inline rows)

Not every table comes from an HTTP feed. Reference data — base registries, type catalogues, fleet rosters, curated synthetic telemetry — can ship inside the manifest as inline rows. Replace the request/response/mapping trio with a rows array; everything else (scheduler, write modes, retention, the install flow) is identical, because the same poll driver runs the source and simply re-emits the rows on every tick. That makes the backing table idempotent and self-healing: drop a row in the warehouse and the next tick restores it. No network egress, no credentials, no SSRF surface.

import { defineDataSource } from "@scrydon/sdk-authoring/integrations";

export default defineDataSource({
  kind: "table",
  id: "nato-bases",
  vendor: "scrydon",
  displayName: "NATO Air Bases (Eastern Flank)",
  scope: "org",
  table: {
    name: "nato_base",
    primaryKey: ["baseId"],
    timestampColumn: "updatedAt",
    columns: [
      { name: "baseId", dataType: "string", isPrimaryKey: true },
      { name: "name", dataType: "string" },
      { name: "country", dataType: "string" },
      { name: "latitude", dataType: "decimal" },
      { name: "longitude", dataType: "decimal" },
      { name: "updatedAt", dataType: "timestamp" },
    ],
  },
  ingest: {
    mode: "poll",
    intervalSec: 3600,          // reference data — an hourly re-assert is plenty
    rows: [
      { baseId: "EYSA", name: "Šiauliai Air Base", country: "Lithuania", latitude: 55.894, longitude: 23.395 },
      { baseId: "EPPW", name: "Powidz Air Base", country: "Poland", latitude: 52.379, longitude: 17.854 },
      // ... up to 10,000 rows
    ],
    stampColumn: "updatedAt",   // stamped with the tick time on every emit
    writeMode: "upsert",
  },
});

The static flavor is recognized by the presence of rows (the HTTP flavor by request); a manifest cannot mix the two. Field notes:

Field	Notes
`rows`	1–10,000 inline rows. Values are scalars only (`string`, `number`, `boolean`, `null`) so the manifest stays fully serializable.
`stampColumn`	Optional. Stamped with the tick time (ISO 8601) on every emitted row — lets reference rows satisfy `table.timestampColumn` without baking literal timestamps into the manifest.
`writeMode`	Same four modes as the HTTP flavor. `upsert` (default) re-asserts the reference set in place; `replace` is the truncate-then-load alternative for wholesale snapshots.

Computed rows are still static rows. Because the manifest is built by defineDataSource() at pack build time, the rows array can be generated by ordinary TypeScript — derive a nearestBase column from a bases table, simulate a telemetry snapshot, downsample a CSV. What ships in the bundle is the frozen result; the platform never executes pack code at tick time.
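As a sketch of build-time generation, the snippet below derives a nearestBase value for a few waypoints and produces plain scalar rows that could be passed as `rows`. The waypoint data and the `distanceKm` helper are made up for illustration; only the two base coordinates come from the example above.

```typescript
// Build-time sketch: this runs when the pack is built, never at tick time.
type Scalar = string | number | boolean | null;
interface Point {
  latitude: number;
  longitude: number;
}

const bases = [
  { baseId: "EYSA", latitude: 55.894, longitude: 23.395 },
  { baseId: "EPPW", latitude: 52.379, longitude: 17.854 },
];

// Hypothetical input data.
const waypoints = [
  { waypointId: "wp-1", latitude: 55.0, longitude: 24.0 },
  { waypointId: "wp-2", latitude: 52.0, longitude: 18.5 },
];

// Great-circle distance in kilometres (haversine formula).
function distanceKm(a: Point, b: Point): number {
  const rad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = rad(b.latitude - a.latitude);
  const dLon = rad(b.longitude - a.longitude);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.latitude)) * Math.cos(rad(b.latitude)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// The frozen result: scalar-only rows with a derived nearestBase column.
const computedRows: Record<string, Scalar>[] = waypoints.map((wp) => {
  const nearest = bases.reduce((best, base) =>
    distanceKm(wp, base) < distanceKm(wp, best) ? base : best,
  );
  return { ...wp, nearestBase: nearest.baseId };
});

console.log(computedRows.map((r) => `${r.waypointId} -> ${r.nearestBase}`));
```

Because the result contains only strings and numbers, it serializes into the manifest unchanged.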

Retention

For time-series tables that grow continuously, declare table.ttlSec alongside table.timestampColumn. The platform partitions the StarRocks table by day on the timestamp column and keeps only the most recent ceil(ttlSec / 86400) daily partitions — older partitions are dropped automatically.

table: {
  name: "aircraft_position",
  primaryKey: ["icao24", "seenAt"],
  timestampColumn: "seenAt",   // the column partitions are cut on
  ttlSec: 604800,              // keep ~7 days of partitions
  columns: [ /* ... */ ],
}
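The partition count follows directly from the ceil(ttlSec / 86400) rule. A small worked example, with `retainedDailyPartitions` as a hypothetical helper name:

```typescript
// Worked example of the retention arithmetic; the helper name is illustrative.
const SECONDS_PER_DAY = 86400;

function retainedDailyPartitions(ttlSec: number): number {
  return Math.ceil(ttlSec / SECONDS_PER_DAY);
}

console.log(retainedDailyPartitions(604800)); // 7: exactly seven days
console.log(retainedDailyPartitions(90000));  // 2: 25 hours rounds up to two daily partitions
console.log(retainedDailyPartitions(3600));   // 1: anything under a day still keeps one partition
```

Because retention works on whole daily partitions, a ttlSec that is not a multiple of 86400 keeps slightly more data than the nominal window.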

ttlSec is the authoring default. An organization admin can override the effective retention per source from Settings → Platform → Data Sources (open a source's detail panel and edit "Retention") — the override is applied live, with no table rebuild. The detail panel also shows the source's write mode (read-only) and its recent sync history, and links straight to the backing table.

The detail panel's Danger zone also has Remove data source: it drops the backing table and all of its rows, deletes the source, and — when the originating pack contributed nothing else — removes the pack too. If the pack ships other content, the source is removed but the pack is kept (uninstall it from Settings → Packs). Removal is irreversible; re-installing the pack brings the source back.

Retention bounds table growth for any write mode, and it is the right tool when changed-only can't help (continuously-changing telemetry). It is independent of the write mode: you can combine append or changed-only with ttlSec to keep history but cap it to a rolling window.

Pack layout

A Pack that ships a data source adds one data-source-<slug>/ subdir per source, each containing a single manifest.json. Multiple data sources may coexist in the same Pack — data-source is a repeatable content kind, like workflow.

<pack>.scrydon-pack.tar.gz
├── pack.json                               # PackBundleManifestSchema
└── data-source-adsb-lol-military/
    └── manifest.json                       # DataSourceManifestSchema (pure JSON, no code)

The top-level pack.json lists each data source as a contents[] entry and includes "data-source" in installOrder. In the pack source, declare both in the defineScrydonPack call:

import { defineScrydonPack } from '@scrydon/sdk-authoring/packs'
import { adsbLolMilitaryDeclarative } from './data-source-adsb-lol-military'

export default defineScrydonPack({
  manifestVersion: 1,
  package: {
    id: 'adsb-lol-military',
    name: 'ADS-B Lol Military Aircraft',
    version: '1.0.0',
  },
  contents: [
    {
      kind: 'data-source',
      path: 'data-source-adsb-lol-military',
      version: '1.0.0',
      required: true,
    },
  ],
  installOrder: ['data-source'],
  metadata: { isSystemPack: false, isDemoPack: false, tags: ['adsb', 'aircraft'] },
})

Build, inspect, upload

Author each data source with defineDataSource. Compose the Pack with defineScrydonPack and add a contents[] entry per source with kind: "data-source".

bunx @scrydon/sdk-authoring pack build src/pack.ts --outDir dist
# → dist/<package.id>-<package.version>.scrydon-pack.tar.gz

bunx @scrydon/sdk-authoring pack inspect dist/adsb-lol-military-1.0.0.scrydon-pack.tar.gz

The inspector lists every subdir, validates each manifest against DataSourceManifestSchema, and surfaces any mapping or column errors before upload.

Data source packs upload to Settings → Packs in the platform app. Uploading admits the pack to the org's pack catalog. No data_source row is created yet: data sources defer to Stage 2 (the per-environment install) so the workspace user can pick which environment they materialize into.

Programmatic equivalent:

curl -X POST "$AGENTIC_URL/api/packs/import?organizationId=$ORG_ID" \
  -H "Cookie: $SESSION_COOKIE" \
  -F "file=@dist/adsb-lol-military-1.0.0.scrydon-pack.tar.gz"

The route returns the catalog entry id; dataSources.installedIds is empty at this stage by design.

In apps/analytics → Marketplace, while in a workspace + environment, pick the data source from a catalogued pack and click Install in this environment. The platform materializes the data_source row scoped to your env and starts polling on the configured intervalSec cadence — typically within about a minute. Operate the source from Analytics → Data Sources.

Data sources that need a credential

Some APIs require an account or API key. For those, use the request.authRef field — a named reference to a credential connection configured in your org's settings. The manifest never carries an inline secret; authRef is a pointer, not a value. (The DataSourceManifestSchema rejects any attempt to embed an Authorization header or API key directly in the manifest.)

ingest: {
  mode: "poll",
  intervalSec: 300,
  request: {
    url: "https://api.example.com/v1/records",
    method: "GET",
    authRef: "example-api-key",   // named reference — NOT an inline token
  },
  // ...
}

How the connection flows:

You (or your org admin) connect the account in org settings, creating an enabled credential connection named "example-api-key".
On each tick, the platform resolves that connection server-side and attaches the credential to the outbound request — your manifest never sees the secret.
If no enabled connection exists yet, the tick returns HTTP 412 data_source_connection_required — a clear signal to connect the account first. No data is fetched, and no error is silently swallowed.
Once a connection is enabled, subsequent ticks resolve it automatically and run.

No authRef? A source with no authRef (public API, no authentication needed) ticks immediately — no credential setup required.

The value you set for authRef is the connection name your org admin creates in settings. Coordinate the name between the pack author and the org admin — if the names don't match, ticks return 412 until the connection is created with the expected name.
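The connection flow can be summarized as a small decision function. This is a sketch of the behavior described above, not the platform's resolver: `resolveTick`, `Connection`, and the shape of `TickOutcome` are assumptions, and only the 412 status and the data_source_connection_required code come from this guide.

```typescript
// Illustrative model of the per-tick credential check.
interface Connection {
  name: string;
  enabled: boolean;
}

type TickOutcome =
  | { ok: true; authenticated: boolean }
  | { ok: false; httpStatus: 412; code: "data_source_connection_required" };

function resolveTick(authRef: string | undefined, connections: Connection[]): TickOutcome {
  // No authRef: a public source that ticks without any credential.
  if (authRef === undefined) return { ok: true, authenticated: false };
  // The name must match exactly and the connection must be enabled.
  const match = connections.find((c) => c.name === authRef && c.enabled);
  return match
    ? { ok: true, authenticated: true }
    : { ok: false, httpStatus: 412, code: "data_source_connection_required" };
}

console.log(resolveTick(undefined, []));
console.log(resolveTick("example-api-key", []));
console.log(resolveTick("example-api-key", [{ name: "example-api-key", enabled: true }]));
```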

Security

Egress guard: the generic poll runtime enforces an SSRF egress guard before every fetch. Request URLs must use https:// — plain http:// is rejected. Requests to loopback addresses (127.x.x.x, ::1), link-local addresses (169.254.x.x), private ranges (10.x, 172.16–31.x, 192.168.x), and IPv6 ULA prefixes (fd00::/8) are blocked at the URL-shape level before a connection is opened.

Note: the v1 guard is host/scheme-shape based. A hostname that DNS-resolves to a private IP at fetch time is not blocked by the v1 guard — resolved-IP pinning is planned for a future release.
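A URL-shape check of this kind can be sketched with the standard URL parser. The function below mirrors the rules listed above and is not the platform's guard; `isEgressAllowed` is a hypothetical name. Like the v1 guard, it lets ordinary hostnames through without resolving them.

```typescript
// Illustrative URL-shape check; not the platform's egress guard.
function isEgressAllowed(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== "https:") return false;

  // The parser keeps brackets around IPv6 literals; strip them before matching.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (host === "::1") return false;               // IPv6 loopback
  if (/^fd[0-9a-f]{2}:/.test(host)) return false; // IPv6 ULA fd00::/8

  const v4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!v4) return true; // a hostname: shape-based check only, no DNS resolution

  const a = Number(v4[1]);
  const b = Number(v4[2]);
  if (a === 127 || a === 10) return false;           // loopback, 10.x
  if (a === 169 && b === 254) return false;          // link-local
  if (a === 172 && b >= 16 && b <= 31) return false; // 172.16 to 172.31
  if (a === 192 && b === 168) return false;          // 192.168.x
  return true;
}

console.log(isEgressAllowed("https://api.adsb.lol/v2/mil"));    // true
console.log(isEgressAllowed("http://api.adsb.lol/v2/mil"));     // false: plain http
console.log(isEgressAllowed("https://169.254.169.254/latest")); // false: link-local
```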

Credentials by reference only: the request.authRef field accepts a credential connection id (a pointer to a Tier-1 connection stored in the platform's secret store). Never embed an API key, Bearer token, or Authorization header value directly in the manifest. The DataSourceManifestSchema uses .strict() on the request block and rejects any field beyond url, method, headers, query, and authRef — stray fields like apiKey or secret cause a validation error at build time.

Resource limits enforced per-tick:

Limit	Default
Response body	64 MB
Row count	10,000 rows
Fetch timeout	15 seconds