Schema inference
How Scrydon picks column types from a file on first data arrival — sampling, type rules, overrides.
When you upload a file, Scrydon infers a schema by sampling the data and classifying each column's type. This page documents the rules so you can predict what'll happen with your data — and override the result when the heuristic is wrong.
Sampling
The inference runs on a sample of the file:
- CSV / JSONL — first N rows (default 10,000). Configurable per upload.
- JSON — the whole document if it's an array; otherwise the top-level keys.
Files larger than the sample are still ingested in full once the schema is committed — the sample is used only to decide the schema.
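As a rough illustration of the CSV case, the sketch below reads the header plus at most N data rows and stops there. The function name `sample_rows` and its `sample_size` parameter are hypothetical, not part of Scrydon's API; only the default of 10,000 comes from this page.

```python
import csv
import io
from itertools import islice


def sample_rows(text_stream, sample_size=10_000):
    """Return (header, rows) with at most `sample_size` data rows.

    Hypothetical helper: the rest of the stream is left unread, which
    mirrors the idea that the sample only decides the schema.
    """
    reader = csv.reader(text_stream)
    header = next(reader, [])                 # first line is the header
    rows = list(islice(reader, sample_size))  # stop after N data rows
    return header, rows


data = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
header, rows = sample_rows(data, sample_size=2)
```

Here `rows` holds only the first two data rows; the third row would still be ingested later, once the schema is committed.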
Type rules
For each column the inference walks a priority list and picks the most-specific type that every sampled value matches:
| Priority | Type | Match rule |
|---|---|---|
| 1 | BOOLEAN | All values are true/false/yes/no/0/1 (case-insensitive) |
| 2 | BIGINT | All values match ^-?\d+$ and fit in 64 bits |
| 3 | DOUBLE | All values are valid IEEE-754 doubles |
| 4 | DATE | All values match a recognised date format (YYYY-MM-DD, DD/MM/YYYY, …) |
| 5 | DATETIME | All values match a recognised datetime / ISO 8601 format |
| 6 | JSON | All values parse as JSON objects or arrays |
| 7 | STRING | Fallback — anything else, including columns that mix types |
Empty / null / NA values are ignored when inferring the type — they don't force a column to STRING. They do mark the column as nullable.
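The priority walk can be sketched in a few lines of Python. This is a minimal illustration of the table as written, not Scrydon's implementation: every function name is hypothetical, Python's `float()` and `datetime.fromisoformat()` stand in for the real double and datetime checks, only the two date formats named in the table are recognised, and the STRING result for an all-empty column is an assumption.

```python
import json
import re
from datetime import datetime

NULL_TOKENS = {"", "null", "na"}         # ignored during inference (case-insensitivity assumed)
BOOL_TOKENS = {"true", "false", "yes", "no", "0", "1"}
INT_RE = re.compile(r"-?\d+", re.ASCII)
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # only the two formats the table names


def is_bool(v):
    return v.lower() in BOOL_TOKENS


def is_bigint(v):
    if INT_RE.fullmatch(v) is None:
        return False
    try:
        return -2**63 <= int(v) < 2**63  # must fit in a signed 64-bit integer
    except ValueError:                   # extremely long digit strings
        return False


def is_double(v):
    try:
        float(v)                         # stand-in for a real IEEE-754 check
        return True
    except ValueError:
        return False


def is_date(v):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(v, fmt)
            return True
        except ValueError:
            pass
    return False


def is_datetime(v):
    try:
        datetime.fromisoformat(v)
        return True
    except ValueError:
        return False


def is_json(v):
    try:
        return isinstance(json.loads(v), (dict, list))
    except ValueError:
        return False


# Same order as the priority table; STRING is the fallback.
RULES = [("BOOLEAN", is_bool), ("BIGINT", is_bigint), ("DOUBLE", is_double),
         ("DATE", is_date), ("DATETIME", is_datetime), ("JSON", is_json)]


def infer_type(values):
    """Return the first type whose rule every non-null sampled value matches."""
    present = [v.strip() for v in values if v.strip().lower() not in NULL_TOKENS]
    if not present:
        return "STRING"                  # assumption: nothing to infer from
    for name, matches in RULES:
        if all(matches(v) for v in present):
            return name
    return "STRING"
```

With this sketch, `infer_type(["1", "2", ""])` yields BIGINT because the empty value is skipped, while `infer_type(["1", "2", "x"])` falls through to STRING because of the single non-numeric value.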
Nullability
A column is nullable if any sampled value is empty, null, NA, or NaN. Otherwise it's marked non-null. You can flip this manually in the schema confirmation step.
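The nullability rule reduces to a single check over the sample. The sketch below assumes the markers are compared case-insensitively after trimming whitespace, which this page does not state; `is_nullable` is a hypothetical name.

```python
NULL_MARKERS = {"", "null", "na", "nan"}  # lower-cased forms of the markers listed above


def is_nullable(values):
    """True if any sampled value is missing or one of the null markers."""
    return any(
        v is None or str(v).strip().lower() in NULL_MARKERS
        for v in values
    )
```

For example, `is_nullable(["1", "", "3"])` is true, while `is_nullable(["1", "2"])` is false.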
Column-name handling
Headers are preserved losslessly. The physical column name is sanitised to satisfy SQL identifier rules; the original header is stored as a display name. See Column names for the full sanitisation table.
Overrides
After inference runs you get a schema confirmation step. From there you can:
- Change a column's type (e.g. force a STRING to BIGINT if the sample happened to have one rogue value).
- Toggle a column's nullability.
- Mark a column as the primary key for upsert support.
- Set a column's classification — public / internal / confidential / restricted. See Classification & masking.
- Rename a column's display name without changing the physical name.
Overrides apply to subsequent re-uploads of the same file structure.
Re-uploading a file
Re-uploading the same file structure to an existing table is the standard pattern. The platform translates display names → physical names automatically, so you keep the original CSV headers in your source file and never have to rename.
If the new file adds columns, they're appended as nullable. If it removes columns, the existing columns remain. If it renames a column, the renamed column is treated as new — the platform won't guess that "OldName" became "NewName".
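The three rules above amount to an append-only merge keyed on column name. The sketch below is one way to picture it; the `Column` class and `merge_schema` function are hypothetical and say nothing about how the platform stores schemas.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    type: str
    nullable: bool


def merge_schema(existing, incoming):
    """Append unseen columns as nullable; never drop or rename existing ones."""
    known = {col.name for col in existing}
    merged = list(existing)              # removed columns simply stay in place
    for col in incoming:
        if col.name not in known:        # a renamed column looks like a new one
            merged.append(Column(col.name, col.type, nullable=True))
            known.add(col.name)
    return merged


existing = [Column("id", "BIGINT", False), Column("OldName", "STRING", True)]
incoming = [Column("id", "BIGINT", False), Column("NewName", "STRING", False)]
merged = merge_schema(existing, incoming)
```

After the merge, the table has `id`, `OldName`, and `NewName`: the old column is kept, and the renamed one arrives as a new nullable column.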
When inference is wrong
The most common failure modes:
| Symptom | What to check |
|---|---|
| All numeric column inferred as STRING | A single non-numeric value in the sample. Increase the sample size or clean the data. |
| Date column inferred as STRING | The date format isn't in the recognised list. Override the type to DATE and provide a format string. |
| Mixed-currency column inferred as STRING | Decimals carry a currency prefix. Either strip the prefix in the source or accept the STRING type. |
| Column you wanted as BOOLEAN came in as BIGINT | The values are 0 and 1, so BIGINT won the priority. Override to BOOLEAN in the confirmation step. |
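For the currency case, stripping the prefix in the source can be a small pre-processing step before upload. The sketch below assumes the prefix is either one of a few common symbols or a three-letter uppercase code; both the pattern and the `strip_currency` name are illustrative, so adjust them to your data.

```python
import re

# Assumed prefixes: a common currency symbol or a three-letter code such as EUR.
CURRENCY_PREFIX = re.compile(r"^\s*(?:[$€£¥]|[A-Z]{3})\s*")


def strip_currency(value):
    """Remove a leading currency marker, leaving the numeric text untouched."""
    return CURRENCY_PREFIX.sub("", value, count=1)
```

`strip_currency("$12.50")` gives `"12.50"` and `strip_currency("EUR 3")` gives `"3"`. In a column that really mixes currencies, this discards the currency itself, so consider keeping it in a separate column.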
Related
- Managed tables — lifecycle and write modes.
- Column names — header sanitisation.
- Classification & masking — per-column governance.