
Data Sources

Author declarative poll data sources and ship them inside a Scrydon Pack — no runtime code required

This artifact ships inside a Pack. For the shared lifecycle — install, pack build, upload — see Packs & Authoring SDK.

A Declarative Data Source is a poll-mode table source encoded entirely as JSON-serializable configuration — an HTTP request spec, an itemsPath selector, a field-mapping table, and a column list. You author it once with defineDataSource; the platform's generic poll runtime drives it on every tick without executing any customer-supplied code.

Data sources shipped in a Pack are pure JSON — the archive carries the request spec, mapping, and column definitions, never code. The generic poll runtime (fetch → select → map → validate) lives in the platform and is invoked from the manifest on each tick.

When to use this SDK

Use the Data Source Authoring SDK when:

  • You want to pull rows from a public REST API on a schedule and make them available as a typed table inside Scrydon.
  • You need the source definition to live in a Pack so re-uploading the pack also refreshes the source (same idempotency contract as ontologies and process flows).
  • You are building a demo or starter Pack and want the data tables to come pre-wired with no manual configuration.

Do not use this SDK if:

  • Your ingest logic is conditional, stateful, or requires code (pagination cursors, OAuth token refresh, custom payload signing). For those, author a code-shipped source in the platform monorepo — it uses the same defineDataSource call but with a produce() function.
  • The destination is not a table. Non-tabular ingestion paths are out of scope for this SDK today.

Key constraint: packs carry pure JSON — never functions. The bundler strips any produce() functions during serialization. A packable source must therefore be fully declarative — every field in the source definition must survive JSON.parse(JSON.stringify(...)) round-trip intact.
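The round-trip rule can be checked locally before building a Pack. The sketch below is a hypothetical helper, not part of @scrydon/sdk-authoring: survivesJsonRoundTrip and deepEqual are names invented for illustration, and the actual bundler may perform a different check.

```typescript
// Hypothetical helper (not an SDK export): true when a value is unchanged
// by JSON.parse(JSON.stringify(value)).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false; // differing primitives, NaN, functions, undefined
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  const left = a as Record<string, unknown>;
  const right = b as Record<string, unknown>;
  return aKeys.every((key) => key in right && deepEqual(left[key], right[key]));
}

function survivesJsonRoundTrip(value: unknown): boolean {
  let copy: unknown;
  try {
    copy = JSON.parse(JSON.stringify(value));
  } catch {
    return false; // circular references, BigInt, or a bare undefined
  }
  return deepEqual(value, copy);
}

// A declarative ingest block is plain data and passes.
const declarativeIngest = { mode: "poll", intervalSec: 60, request: { url: "https://api.adsb.lol/v2/mil" } };
// A produce() function is dropped by JSON.stringify, so the copy differs.
const codeShippedIngest = { mode: "poll", produce: () => [] };

console.log(survivesJsonRoundTrip(declarativeIngest)); // true
console.log(survivesJsonRoundTrip(codeShippedIngest)); // false
```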

Install

bun add -d @scrydon/sdk-authoring zod
npm install --save-dev @scrydon/sdk-authoring zod
import { defineDataSource } from '@scrydon/sdk-authoring/integrations'

Anatomy of a declarative data source

A declarative data source is made up of five blocks:

| Block | Field(s) | Purpose |
| --- | --- | --- |
| request | url, method, headers, query, authRef | Describes the HTTP call the runtime makes on each tick. url must be https://. Credentials are referenced by authRef (a credential connection id), never inline. |
| response.itemsPath | itemsPath | A dot/bracket path (e.g. $.ac, data.items) from the JSON response envelope to the array of row objects. Leading $. is optional. |
| mapping | Record<columnName, FieldMapping> | Maps each output column from a source field, optionally applying one of the four bounded transforms. |
| filter | requireNonNull | Drops candidate rows before mapping if any of the listed source field paths is null or undefined. |
| table.columns | TableColumnDef[] | Declares the output schema. Column names and types drive the row validator; no separate Zod table.schema is needed for declarative sources. |
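To make the select and filter steps concrete, here is a minimal sketch of how an itemsPath and a requireNonNull list could be applied to a response envelope. It is an illustration under assumptions, not the platform's runtime: parsePath, resolvePath, selectItems, and passesFilter are hypothetical names, and the real path grammar may accept more than the simple dot/bracket forms handled here.

```typescript
// Splits "$.data.items[0]" or "data.items" into ["data", "items", "0"].
function parsePath(path: string): string[] {
  const withoutRoot = path.replace(/^\$\.?/, ""); // leading "$." is optional
  return withoutRoot.split(/[.\[\]]/).filter((segment) => segment.length > 0);
}

// Walks the path segment by segment; returns undefined when a step is missing.
function resolvePath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const segment of parsePath(path)) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

// Assumption: a non-array at itemsPath yields no rows rather than an error.
function selectItems(envelope: unknown, itemsPath: string): unknown[] {
  const items = resolvePath(envelope, itemsPath);
  return Array.isArray(items) ? items : [];
}

// A row passes when every listed path is neither null nor undefined.
function passesFilter(row: unknown, requireNonNull: string[]): boolean {
  return requireNonNull.every((path) => resolvePath(row, path) != null);
}

const sampleEnvelope = {
  now: 1700000000000,
  ac: [
    { hex: "ae1234", lat: 51.5, lon: -0.1 },
    { hex: "ae5678", lat: null, lon: 4.9 }, // dropped: lat is null
  ],
};

const candidateRows = selectItems(sampleEnvelope, "$.ac")
  .filter((row) => passesFilter(row, ["hex", "lat", "lon"]));
console.log(candidateRows.length); // 1
```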

The mapping DSL

Each mapping entry either copies a path directly or applies a named transform. The transform set is bounded and auditable — adding a new transform requires a reviewed code change to the platform, not pack data. Arbitrary expressions, eval, and sandboxed code are intentionally not supported.

| Transform | What it does | Example args |
| --- | --- | --- |
| trim_to_null | Trims leading/trailing whitespace; coerces an empty string or a non-string to null. "RCH123 " → "RCH123", "   " → null, undefined → null. | (none required) |
| number_or_null | Passes finite numbers unchanged; coerces strings, NaN, Infinity, and undefined to null. | (none required) |
| value_map | Maps literal string keys via a map dictionary; passthrough: "number" passes numeric values unchanged; anything else falls back to default. | { map: { ground: 0 }, passthrough: "number", default: null } |
| iso_from_epoch_offset | Combines an envelope-level base timestamp (basePath resolves against the response root) with a per-row signed offset, and returns an ISO 8601 string. Useful for APIs that report an absolute now clock plus per-aircraft seen seconds-ago values. | { basePath: "$.now", baseUnit: "ms", offsetUnit: "s", direction: "subtract" } |
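The first three transforms are small enough to sketch directly from their descriptions. The functions below are illustrative re-implementations under stated assumptions, not the platform's code: the names trimToNull, numberOrNull, and valueMap are hypothetical, and the treatment of edge cases the table does not mention (for example NaN under passthrough: "number") is a guess.

```typescript
type ValueMapArgs = {
  map: Record<string, unknown>;
  passthrough?: "number";
  default: unknown;
};

// trim_to_null: non-strings and whitespace-only strings become null.
function trimToNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed === "" ? null : trimmed;
}

// number_or_null: only finite numbers pass; strings are not parsed.
function numberOrNull(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

// value_map: literal string keys first, then the optional numeric passthrough,
// then the default.
function valueMap(value: unknown, args: ValueMapArgs): unknown {
  if (typeof value === "string" && Object.prototype.hasOwnProperty.call(args.map, value)) {
    return args.map[value];
  }
  if (args.passthrough === "number" && typeof value === "number") return value;
  return args.default;
}

const altitudeArgs: ValueMapArgs = { map: { ground: 0 }, passthrough: "number", default: null };

console.log(trimToNull("RCH123 "));             // "RCH123"
console.log(trimToNull("   "));                 // null
console.log(numberOrNull("450"));               // null
console.log(valueMap("ground", altitudeArgs));  // 0
console.log(valueMap(35000, altitudeArgs));     // 35000
console.log(valueMap(undefined, altitudeArgs)); // null
```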

The iso_from_epoch_offset hard case

Some REST APIs — the ADS-B family is the canonical example — do not return per-row absolute timestamps. Instead, the envelope carries a single now field (epoch milliseconds) and each row carries a seen field (seconds ago). The iso_from_epoch_offset transform bridges the two:

seenAt = new Date(envelope.now - row.seen * 1000).toISOString()

Configure it as:

seenAt: {
  path: "seen",                              // per-row offset field
  transform: "iso_from_epoch_offset",
  args: {
    basePath: "$.now",                       // envelope field — resolved against the response root
    baseUnit: "ms",                          // $.now is epoch milliseconds
    offsetUnit: "s",                         // row.seen is seconds
    direction: "subtract",                   // now − seen → absolute time
  },
},

If row.seen is missing or non-numeric, the runtime defaults the offset to 0 (base time exactly), matching the code-source convention seenSecondsAgo = typeof raw.seen === "number" ? raw.seen : 0.
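The arithmetic, including the default-to-zero rule, can be sketched as a standalone function. This is an illustration only: isoFromEpochOffset and its helper are hypothetical names, the "add" direction is assumed to exist as the counterpart of "subtract", basePath handling is reduced to a single top-level field, and returning null for a missing base is a guess at behavior the text does not specify.

```typescript
type EpochUnit = "ms" | "s";

type IsoFromEpochOffsetArgs = {
  basePath: string;               // e.g. "$.now"; only top-level fields are handled here
  baseUnit: EpochUnit;
  offsetUnit: EpochUnit;
  direction: "add" | "subtract";  // "add" is an assumption
};

function toMillis(value: number, unit: EpochUnit): number {
  return unit === "s" ? value * 1000 : value;
}

function isoFromEpochOffset(
  envelope: Record<string, unknown>,
  rawOffset: unknown,
  args: IsoFromEpochOffsetArgs,
): string | null {
  const base = envelope[args.basePath.replace(/^\$\.?/, "")];
  if (typeof base !== "number" || !Number.isFinite(base)) return null; // assumption
  // Missing or non-numeric offsets default to 0, i.e. the base time exactly.
  const offset = typeof rawOffset === "number" && Number.isFinite(rawOffset) ? rawOffset : 0;
  const baseMs = toMillis(base, args.baseUnit);
  const offsetMs = toMillis(offset, args.offsetUnit);
  const resultMs = args.direction === "subtract" ? baseMs - offsetMs : baseMs + offsetMs;
  return new Date(resultMs).toISOString();
}

const seenAtArgs: IsoFromEpochOffsetArgs = {
  basePath: "$.now",
  baseUnit: "ms",
  offsetUnit: "s",
  direction: "subtract",
};

const envelopeNow = { now: 1700000000000 };
console.log(isoFromEpochOffset(envelopeNow, 2.5, seenAtArgs));       // base minus 2.5 seconds
console.log(isoFromEpochOffset(envelopeNow, undefined, seenAtArgs)); // base time exactly
```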

A complete example

The following is the real adsb-lol-military-declarative source shipped in @scrydon/sdk-authoring. It pulls military aircraft positions from https://api.adsb.lol/v2/mil and maps them to a typed aircraft_position table. The golden parity test in the monorepo asserts this produces byte-for-byte identical rows to the equivalent code source.

import { defineDataSource } from '@scrydon/sdk-authoring/integrations'

export const adsbLolMilitaryDeclarative = defineDataSource({
  kind: "table",
  id: "adsb-lol-military-declarative",
  vendor: "adsb-lol",
  displayName: "ADS-B Lol — Military Aircraft (declarative)",
  scope: "global",
  table: {
    name: "aircraft_position",
    primaryKey: ["icao24", "seenAt"],
    timestampColumn: "seenAt",
    columns: [
      { name: "icao24",           dataType: "string",  isPrimaryKey: true },
      { name: "callsign",         dataType: "string",  nullable: true },
      { name: "registration",     dataType: "string",  nullable: true },
      { name: "aircraftType",     dataType: "string",  nullable: true },
      { name: "category",         dataType: "string",  nullable: true },
      { name: "latitude",         dataType: "decimal" },
      { name: "longitude",        dataType: "decimal" },
      { name: "altitudeFeet",     dataType: "int",     nullable: true },
      { name: "groundSpeedKnots", dataType: "double",  nullable: true },
      { name: "heading",          dataType: "double",  nullable: true },
      { name: "seenAt",           dataType: "timestamp", isPrimaryKey: true },
    ],
  },
  ingest: {
    mode: "poll",
    intervalSec: 60,
    minIntervalSec: 30,
    request: {
      url: "https://api.adsb.lol/v2/mil",
      method: "GET",
      headers: { accept: "application/json" },
    },
    response: { itemsPath: "$.ac" },
    filter: {
      // Drop rows missing hex (ICAO), lat, or lon — mirrors the code source guard.
      requireNonNull: ["hex", "lat", "lon"],
    },
    mapping: {
      icao24:           { path: "hex" },
      // "RCH123 " → "RCH123", "   " → null. Mirrors raw.flight?.trim() || null
      callsign:         { path: "flight", transform: "trim_to_null" },
      // undefined → null. Mirrors raw.r ?? null
      registration:     { path: "r" },
      // undefined → null. Mirrors raw.t ?? null
      aircraftType:     { path: "t" },
      // undefined → null. Mirrors raw.category ?? null
      category:         { path: "category" },
      latitude:         { path: "lat" },
      longitude:        { path: "lon" },
      // "ground" → 0, number → passthrough, else → null. Mirrors parseAltitude()
      altitudeFeet: {
        path: "alt_baro",
        transform: "value_map",
        args: { map: { ground: 0 }, passthrough: "number", default: null },
      },
      // finite number → itself, string/NaN/undefined → null
      groundSpeedKnots: { path: "gs",    transform: "number_or_null" },
      heading:          { path: "track", transform: "number_or_null" },
      // new Date(envelope.now - row.seen * 1000).toISOString()
      seenAt: {
        path: "seen",
        transform: "iso_from_epoch_offset",
        args: {
          basePath: "$.now",
          baseUnit: "ms",
          offsetUnit: "s",
          direction: "subtract",
        },
      },
    },
  },
})

defineDataSource acts as an identity function at runtime: it returns the definition unchanged after validating the manifest against the DataSourceManifestSchema Zod schema and deriving a row validator from table.columns. The emitted definition is pure data: no produce() function, no closures, serializable as-is into data-source-<slug>/manifest.json.
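To illustrate what deriving a row validator from table.columns can look like, here is a minimal sketch. It is not the SDK's validator: ColumnDef, matchesType, and validateRow are hypothetical names, and the runtime representation assumed for each dataType (numbers for decimal and double, ISO strings for timestamp) is a guess.

```typescript
type DataType = "string" | "int" | "double" | "decimal" | "timestamp";

type ColumnDef = {
  name: string;
  dataType: DataType;
  nullable?: boolean;
  isPrimaryKey?: boolean;
};

// Assumed value shapes per data type; the real validator may differ.
function matchesType(value: unknown, dataType: DataType): boolean {
  switch (dataType) {
    case "string":
      return typeof value === "string";
    case "int":
      return typeof value === "number" && Number.isInteger(value);
    case "double":
    case "decimal":
      return typeof value === "number" && Number.isFinite(value);
    case "timestamp":
      return typeof value === "string" && !Number.isNaN(Date.parse(value));
  }
}

// Returns one message per offending column; an empty array means the row is valid.
function validateRow(row: Record<string, unknown>, columns: ColumnDef[]): string[] {
  const errors: string[] = [];
  for (const column of columns) {
    const value = row[column.name];
    if (value === null || value === undefined) {
      if (!column.nullable) errors.push(`${column.name}: value is required`);
    } else if (!matchesType(value, column.dataType)) {
      errors.push(`${column.name}: expected ${column.dataType}`);
    }
  }
  return errors;
}

const sampleColumns: ColumnDef[] = [
  { name: "icao24", dataType: "string", isPrimaryKey: true },
  { name: "altitudeFeet", dataType: "int", nullable: true },
  { name: "seenAt", dataType: "timestamp", isPrimaryKey: true },
];

console.log(validateRow(
  { icao24: "ae1234", altitudeFeet: null, seenAt: "2023-11-14T22:13:20.000Z" },
  sampleColumns,
)); // []
```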

Write modes

Every tick produces a batch of rows; ingest.writeMode controls how that batch lands in the table. It is install-only: it provisions the StarRocks key model at table-create time and cannot be changed in place. To change it, re-install the pack with an updated manifest. The default is upsert.

| writeMode | Dedup key | Rows kept per entity | Use it for |
| --- | --- | --- | --- |
| upsert (default) | table.primaryKey (identity) | 1 (latest only) | "Current state" feeds. Each poll overwrites the entity's row in place; no history. |
| changed-only | primaryKey + a synthesized content hash | N (one per distinct state) | Slowly-changing data where you want history but not a row per poll. Identical consecutive polls collapse onto the same key; a row is written only when a value actually changes. |
| append | none | every poll | Raw event streams where every observation matters, including exact repeats. |
| replace | primaryKey | latest full snapshot | Small reference tables re-published wholesale each tick (truncate-then-load). |

changed-only vs upsert. Both upsert under the hood, but they dedup on different keys. upsert keeps one always-current row per entity (no history). changed-only keeps one row per distinct state the entity passed through, skipping polls where nothing changed. For continuously-changing telemetry (e.g. an aircraft's live position, where latitude/longitude change every poll) changed-only behaves almost identically to append — the content hash differs every tick. There, the growth lever is retention, not the write mode.
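The difference in dedup keys can be seen in a small in-memory simulation. This is a sketch for intuition, not how the StarRocks tables are implemented: dedupKey, contentFingerprint, and countRowsAfterPolls are hypothetical names, the fingerprint is a stand-in for the synthesized content hash, and replace is omitted.

```typescript
type Row = Record<string, string | number | null>;
type WriteMode = "upsert" | "changed-only" | "append";

// Stand-in for the content hash: stable JSON over sorted column names.
function contentFingerprint(row: Row): string {
  return JSON.stringify(Object.keys(row).sort().map((key) => [key, row[key]]));
}

function dedupKey(row: Row, primaryKey: string[], mode: WriteMode, pollIndex: number): string {
  const identity = primaryKey.map((column) => String(row[column])).join("|");
  if (mode === "upsert") return identity;                                    // one row per entity
  if (mode === "changed-only") return `${identity}#${contentFingerprint(row)}`; // one per state
  return `${identity}@${pollIndex}`;                                         // append: no dedup
}

// Writes one row per poll and reports how many rows the table ends up holding.
function countRowsAfterPolls(polls: Row[], primaryKey: string[], mode: WriteMode): number {
  const table = new Map<string, Row>();
  polls.forEach((row, pollIndex) => table.set(dedupKey(row, primaryKey, mode, pollIndex), row));
  return table.size;
}

// One entity observed over four polls; its status changes once.
const ticketPolls: Row[] = [
  { id: "T-1", status: "open" },
  { id: "T-1", status: "open" },
  { id: "T-1", status: "closed" },
  { id: "T-1", status: "closed" },
];

console.log(countRowsAfterPolls(ticketPolls, ["id"], "upsert"));       // 1
console.log(countRowsAfterPolls(ticketPolls, ["id"], "changed-only")); // 2
console.log(countRowsAfterPolls(ticketPolls, ["id"], "append"));       // 4
```

If every poll changed a value, as with live latitude/longitude, the changed-only count would equal the append count, which is the telemetry case described above.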

ingest: {
  mode: "poll",
  intervalSec: 60,
  writeMode: "changed-only", // omit for the "upsert" default
  // ...request, response, mapping
}

Retention

For time-series tables that grow continuously, declare table.ttlSec alongside table.timestampColumn. The platform partitions the StarRocks table by day on the timestamp column and keeps only the most recent ceil(ttlSec / 86400) daily partitions — older partitions are dropped automatically.

table: {
  name: "aircraft_position",
  primaryKey: ["icao24", "seenAt"],
  timestampColumn: "seenAt",   // the column partitions are cut on
  ttlSec: 604800,              // keep ~7 days of partitions
  columns: [ /* ... */ ],
}
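The partition count follows directly from the ceil(ttlSec / 86400) rule. The helper below is a hypothetical illustration of that arithmetic (retainedDailyPartitions is not an SDK function), and rejecting non-positive values is an assumption.

```typescript
const SECONDS_PER_DAY = 86_400;

// Number of daily partitions kept for a given ttlSec, per ceil(ttlSec / 86400).
function retainedDailyPartitions(ttlSec: number): number {
  if (!Number.isFinite(ttlSec) || ttlSec <= 0) {
    throw new RangeError("ttlSec must be a positive number of seconds");
  }
  return Math.ceil(ttlSec / SECONDS_PER_DAY);
}

console.log(retainedDailyPartitions(604_800)); // 7  (exactly seven days)
console.log(retainedDailyPartitions(90_000));  // 2  (25 hours rounds up to two days)
console.log(retainedDailyPartitions(3_600));   // 1  (one hour still keeps a full daily partition)
```

Because retention works on whole daily partitions, a ttlSec that is not a multiple of 86400 keeps slightly more than the requested window.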

ttlSec is the authoring default. An organization admin can override the effective retention per source from Settings → Platform → Data Sources (open a source's detail panel and edit "Retention") — the override is applied live, with no table rebuild. The detail panel also shows the source's write mode (read-only) and its recent sync history, and links straight to the backing table.

The detail panel's Danger zone also has Remove data source: it drops the backing table and all of its rows, deletes the source, and — when the originating pack contributed nothing else — removes the pack too. If the pack ships other content, the source is removed but the pack is kept (uninstall it from Settings → Packs). Removal is irreversible; re-installing the pack brings the source back.

Retention bounds table growth for any write mode, and it is the right tool when changed-only can't help (continuously-changing telemetry). It is independent of the write mode: you can combine append or changed-only with ttlSec to keep history but cap it to a rolling window.

Bundle layout

A Pack that ships a data source adds one data-source-<slug>/ subdir per source, each containing a single manifest.json. Multiple data sources may coexist in the same Pack — data-source is a repeatable content kind, like workflow.

<bundle>.scrydon-pack.tar.gz
├── pack.json                               # PackBundleManifestSchema
└── data-source-adsb-lol-military/
    └── manifest.json                       # DataSourceManifestSchema (pure JSON, no code)

The top-level pack.json lists the data source as a contents[] entry and includes "data-source" in installOrder:

import { defineScrydonPack } from '@scrydon/sdk-authoring/packs'
import { adsbLolMilitaryDeclarative } from './data-source-adsb-lol-military'

export default defineScrydonPack({
  manifestVersion: 1,
  package: {
    id: 'adsb-lol-military',
    name: 'ADS-B Lol Military Aircraft',
    version: '1.0.0',
  },
  contents: [
    {
      kind: 'data-source',
      path: 'data-source-adsb-lol-military',
      version: '1.0.0',
      required: true,
    },
  ],
  installOrder: ['data-source'],
  metadata: { isSystemPack: false, isDemoPack: false, tags: ['adsb', 'aircraft'] },
})

Build, inspect, upload

Author each data source with defineDataSource. Compose the Pack with defineScrydonPack and add a contents[] entry per source with kind: "data-source".

bunx @scrydon/sdk-authoring pack build src/pack.ts --outDir dist
# → dist/<package.id>-<package.version>.scrydon-pack.tar.gz
bunx @scrydon/sdk-authoring pack inspect dist/adsb-lol-military-1.0.0.scrydon-pack.tar.gz

The inspector lists every subdir, validates each manifest against DataSourceManifestSchema, and surfaces any mapping or column errors before upload.

Data source packs upload to Settings → Packs in the platform app. Uploading admits the pack to the org's pack catalog. No data_source row is created yet: data sources are deferred to Stage 2, the per-environment install, so that the workspace user can pick which environment they materialize into.

Programmatic equivalent:

curl -X POST "$AGENTIC_URL/api/packs/import?organizationId=$ORG_ID" \
  -H "Cookie: $SESSION_COOKIE" \
  -F "file=@dist/adsb-lol-military-1.0.0.scrydon-pack.tar.gz"

The route returns the catalog entry id; dataSources.installedIds is empty at this stage by design.

In apps/analytics → Marketplace, while in a workspace + environment, pick the data source from a catalogued pack and click Install in this environment. The platform materializes the data_source row scoped to your env and starts polling on the configured intervalSec cadence — typically within about a minute. Operate the source from Analytics → Data Sources.

Data sources that need a credential

Some APIs require an account or API key. For those, use the request.authRef field — a named reference to a credential connection configured in your org's settings. The manifest never carries an inline secret; authRef is a pointer, not a value. (The DataSourceManifestSchema rejects any attempt to embed an Authorization header or API key directly in the manifest.)

ingest: {
  mode: "poll",
  intervalSec: 300,
  request: {
    url: "https://api.example.com/v1/records",
    method: "GET",
    authRef: "example-api-key",   // named reference — NOT an inline token
  },
  // ...
}

How the connection flows:

  1. You (or your org admin) connect the account in org settings, creating an enabled credential connection named "example-api-key".
  2. On each tick, the platform resolves that connection server-side and attaches the credential to the outbound request — your manifest never sees the secret.
  3. If no enabled connection exists yet, the tick returns HTTP 412 data_source_connection_required — a clear signal to connect the account first. No data is fetched, and no error is silently swallowed.
  4. Once a connection is enabled, subsequent ticks resolve it automatically and run.

No authRef? A source with no authRef (public API, no authentication needed) ticks immediately — no credential setup required.

The value you set for authRef is the connection name your org admin creates in settings. Coordinate the name between the pack author and the org admin — if the names don't match, ticks return 412 until the connection is created with the expected name.
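The gating behavior in the steps above can be sketched as a pure function. This is a simplified model, not the platform's resolver: Connection, TickGate, and gateTick are hypothetical names, matching by name follows the description above, and attaching the credential as a Bearer authorization header is purely an assumption made to keep the example concrete.

```typescript
type Connection = { name: string; enabled: boolean; secret: string };

type TickGate =
  | { ok: true; headers: Record<string, string> }
  | { ok: false; status: 412; error: "data_source_connection_required" };

function gateTick(authRef: string | undefined, connections: Connection[]): TickGate {
  // No authRef: a public API, so the tick runs with no credential.
  if (authRef === undefined) return { ok: true, headers: {} };

  // The connection must exist under the expected name and be enabled.
  const connection = connections.find((c) => c.name === authRef && c.enabled);
  if (!connection) {
    return { ok: false, status: 412, error: "data_source_connection_required" };
  }

  // Assumption: the credential is attached as a Bearer token.
  return { ok: true, headers: { authorization: `Bearer ${connection.secret}` } };
}

const orgConnections: Connection[] = [
  { name: "example-api-key", enabled: true, secret: "s3cr3t" },
];

console.log(gateTick(undefined, orgConnections).ok);         // true
console.log(gateTick("example-api-key", orgConnections).ok); // true
console.log(gateTick("example-apikey", orgConnections).ok);  // false (name mismatch)
```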

Security

Egress guard: the generic poll runtime enforces an SSRF egress guard before every fetch. Request URLs must use https:// — plain http:// is rejected. Requests to loopback addresses (127.x.x.x, ::1), link-local addresses (169.254.x.x), private ranges (10.x, 172.16–31.x, 192.168.x), and IPv6 ULA prefixes (fd00::/8) are blocked at the URL-shape level before a connection is opened.

Note: the v1 guard is host/scheme-shape based. A hostname that DNS-resolves to a private IP at fetch time is not blocked by the v1 guard — resolved-IP pinning is planned for a future release.
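A URL-shape check of this kind can be sketched in a few lines. The function below is an illustrative approximation, not the platform's guard: isEgressAllowed is a hypothetical name, it covers only the ranges listed above, and like the v1 guard it does not resolve DNS.

```typescript
// Returns true when the URL passes a scheme/host-shape check for the listed ranges.
function isEgressAllowed(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "https:") return false;

  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();

  if (host === "::1") return false;                  // IPv6 loopback
  if (/^fd[0-9a-f]{2}:/.test(host)) return false;    // fd00::/8

  const ipv4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (ipv4) {
    const first = Number(ipv4[1]);
    const second = Number(ipv4[2]);
    if (first === 127 || first === 10) return false;                  // loopback, 10.x
    if (first === 169 && second === 254) return false;                // link-local
    if (first === 172 && second >= 16 && second <= 31) return false;  // 172.16-31.x
    if (first === 192 && second === 168) return false;                // 192.168.x
  }
  return true;
}

console.log(isEgressAllowed("https://api.adsb.lol/v2/mil"));     // true
console.log(isEgressAllowed("http://api.adsb.lol/v2/mil"));      // false
console.log(isEgressAllowed("https://169.254.169.254/latest"));  // false
console.log(isEgressAllowed("https://[::1]/"));                  // false
```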

Credentials by reference only: the request.authRef field accepts a credential connection id (a pointer to a Tier-1 connection stored in the platform's secret store). Never embed an API key, Bearer token, or Authorization header value directly in the manifest. The DataSourceManifestSchema uses .strict() on the request block and rejects any field beyond url, method, headers, query, and authRef — stray fields like apiKey or secret cause a validation error at build time.

Resource limits enforced per-tick:

| Limit | Default |
| --- | --- |
| Response body | 8 MB |
| Row count | 10,000 rows |
| Fetch timeout | 15 seconds |
