Column names — how Scrydon handles dotted and non-SQL-safe headers
Lossless preservation of source-provided column names like `fius.fiu_id` via a physical/display split, and how the approach compares to BigQuery, Snowflake, Databricks, Fabric, AWS Glue, Foundry, and Atlan.
CSVs in the wild rarely come with SQL-safe column names. Headers carry dots (fius.fiu_id), spaces (Customer Email), unicode (café), or just start with a digit (2024_revenue). What a data platform does with those names says a lot about how it expects to be used.
This page describes Scrydon Analytics's approach (Foundry/Atlan-style physical/display split) and how it compares to other platforms.
What Scrydon does
When you upload a file with a dotted header like fius.fiu_id, Scrydon ingests it lossless:
| Layer | Value |
|---|---|
| StarRocks physical column | fius_fiu_id |
Catalog metadata column_name | fius_fiu_id |
Catalog metadata display_name | fius.fiu_id |
| UI rendering (preview, schema view, ingest dialog) | fius.fiu_id |
| SQL queries you write against the table | fius_fiu_id (or `fius_fiu_id` if you prefer to quote) |
The raw header is preserved as display name in the catalog, never lost. The physical name is sanitized to satisfy StarRocks identifier rules so SQL stays simple — no mandatory backtick quoting in every query, no surprise breakage in BI tools that auto-build identifiers.
Sanitization rules
Applied once, at schema inference time:
- Any character outside
[A-Za-z0-9_]becomes_ - A leading digit gets a
_prefix - An empty result falls back to
column - Result is truncated to 128 characters (StarRocks identifier limit)
- Within a single table, sanitized names that collide get suffixed
_2,_3, … in source order
display_name is recorded only when sanitization changed the source header, or when the source header was already SQL-safe but had to be suffixed for collision avoidance.
What you write in SQL
SELECT fius_fiu_id, fius_country, fius_annual_sar_volume
FROM compliance_authorities
WHERE fius_egmont_member = true;The query interface, BI integrations, and downstream tools all operate on fius_fiu_id. The dot only appears in the UI label. This is the same convention Foundry and Atlan have used for years.
What happens during write/ingest
CSV uploads send rows keyed by the original headers (fius.fiu_id). The write path translates display_name → column_name automatically before validation and Stream Load, so you can keep re-uploading files with the dotted headers — no need to rename columns in the source file.
How other platforms handle this
| Platform | Behavior with fius.fiu_id | Where dots end up |
|---|---|---|
| Scrydon Analytics | Physical: fius_fiu_id. Display name: fius.fiu_id. UI shows the original; SQL uses sanitized. | First-class catalog metadata. |
| Google BigQuery | Auto-load silently sanitizes . → _ and discards the original. Column-level description exists but no display_name slot. | Lost. |
| Snowflake | Identifiers allow dots only when double-quoted ("fius.fiu_id"). Most teams sanitize on load because every query then needs quoting. | Optional, awkward. |
| Databricks (Delta + Unity Catalog) | Without column mapping: Parquet-strict, no dots allowed. With delta.columnMapping.mode='name': arbitrary names, but SQL must backtick (`fius.fiu_id`). Auto Loader / COPY INTO sanitize by default. | Optional via opt-in feature; uncommon in practice. |
| Microsoft Fabric / Synapse | Lakehouse is Delta — same as Databricks. Synapse SQL allows dots if quoted; Power BI uses bracket references ([fius.fiu_id]). Most teams sanitize. | Optional, rarely kept raw. |
| AWS Glue / Athena | Glue catalog stores any name; Athena reads with backticks. Crawlers default to sanitizing problematic characters. | Optional. |
| Palantir Foundry | Splits physical name (sanitized, SQL-safe) and display name (original). UI shows display; Spark/SQL uses physical. | First-class metadata. |
| Atlan / Collibra | Same pattern as Foundry — physical name plus business-name / display-name. Lineage and queries use physical, UI uses display. | First-class metadata. |
| Datadog / OpenTelemetry | Dots are semantic hierarchy (http.status_code). Stored as flat columns with literal dots in their custom column-store; or nested as structs in warehouses. | First-class but as a schema convention. |
The industry essentially clusters into three camps:
- Sanitize-and-forget (BigQuery, default Databricks/Snowflake/Fabric ingest): cheap, lossy, ergonomic in SQL — but the original is gone forever.
- Quote-and-allow (Databricks column mapping, Snowflake quoted identifiers): lossless on disk — but every query, BI tool, and connector needs quoting logic and the surface area for ergonomic friction is large.
- Physical + display split (Foundry, Atlan, Scrydon): sanitized name in SQL, original kept as catalog metadata, UI/BI labels render the original. This is the "right" answer for catalog-style platforms — we picked it for the same reasons Foundry and Atlan did.
Why physical/display split
Scrydon Analytics is a catalog, not a raw warehouse. The original header is information. Throwing it away (BigQuery model) makes uploads succeed but loses provenance. Forcing every query to quote it (Snowflake / Databricks column-mapping model) makes uploads succeed but pushes friction onto every consumer downstream.
Concretely:
- Re-uploading the same file works. Drop the CSV again, the write path resolves dotted headers via
display_name → column_nameautomatically. You don't have to rename columns in the source. - BI tools and embedded SQL stay simple. A Tableau worksheet, an analyst notebook, or a workflow's
SELECT fius_fiu_iddoesn't need to know about the dot. - The catalog is auditable. When a column was renamed at ingest, that fact is queryable (
display_name IS NOT NULL) — useful for ops triage when someone wonders why their query againstfius.fiu_idreturns "column does not exist." - Migrating in or out is symmetric. Exporting back to CSV uses the display name; consumers get the file back in the shape they sent it.
What gets sanitized in practice
| Source header | Physical (column_name) | Stored display_name |
|---|---|---|
customer_id | customer_id | null |
fius.fiu_id | fius_fiu_id | fius.fiu_id |
Customer Email | Customer_Email | Customer Email |
2024_revenue | _2024_revenue | 2024_revenue |
café | caf_ | café |
| `` (empty) | column | `` (empty string) |
Two distinct headers that sanitize to the same string get suffixed in source order — fius.fiu_id becomes fius_fiu_id, and a later fius_fiu_id in the same file becomes fius_fiu_id_2. The display_name for both round-trips the original verbatim, so the UI never confuses them.
Limits
- StarRocks identifiers are ASCII-only and capped at 128 characters. Headers longer than 128 characters get truncated after sanitization; the full original is still preserved in
display_name. - A few characters (control codes, NUL) are stripped along with the rest. If your CSV uses something exotic that you specifically want to preserve, surface it via
descriptionrather than expecting it indisplay_name. - Schema-first tables (where you explicitly declared columns up-front) bypass sanitization — declared columns are required to be SQL-safe at table creation time. Sanitization only kicks in for data-first ingest paths.