Scrydon
Analytics

Column names — how Scrydon handles dotted and non-SQL-safe headers

Lossless preservation of source-provided column names like `fius.fiu_id` via a physical/display split, and how the approach compares to BigQuery, Snowflake, Databricks, Fabric, AWS Glue, Foundry, and Atlan.

CSVs in the wild rarely come with SQL-safe column names. Headers carry dots (fius.fiu_id), spaces (Customer Email), unicode (café), or just start with a digit (2024_revenue). What a data platform does with those names says a lot about how it expects to be used.

This page describes Scrydon Analytics's approach (Foundry/Atlan-style physical/display split) and how it compares to other platforms.

What Scrydon does

When you upload a file with a dotted header like fius.fiu_id, Scrydon ingests it lossless:

LayerValue
StarRocks physical columnfius_fiu_id
Catalog metadata column_namefius_fiu_id
Catalog metadata display_namefius.fiu_id
UI rendering (preview, schema view, ingest dialog)fius.fiu_id
SQL queries you write against the tablefius_fiu_id (or `fius_fiu_id` if you prefer to quote)

The raw header is preserved as display name in the catalog, never lost. The physical name is sanitized to satisfy StarRocks identifier rules so SQL stays simple — no mandatory backtick quoting in every query, no surprise breakage in BI tools that auto-build identifiers.

Sanitization rules

Applied once, at schema inference time:

  • Any character outside [A-Za-z0-9_] becomes _
  • A leading digit gets a _ prefix
  • An empty result falls back to column
  • Result is truncated to 128 characters (StarRocks identifier limit)
  • Within a single table, sanitized names that collide get suffixed _2, _3, … in source order

display_name is recorded only when sanitization changed the source header, or when the source header was already SQL-safe but had to be suffixed for collision avoidance.

What you write in SQL

SELECT fius_fiu_id, fius_country, fius_annual_sar_volume
FROM compliance_authorities
WHERE fius_egmont_member = true;

The query interface, BI integrations, and downstream tools all operate on fius_fiu_id. The dot only appears in the UI label. This is the same convention Foundry and Atlan have used for years.

What happens during write/ingest

CSV uploads send rows keyed by the original headers (fius.fiu_id). The write path translates display_name → column_name automatically before validation and Stream Load, so you can keep re-uploading files with the dotted headers — no need to rename columns in the source file.

How other platforms handle this

PlatformBehavior with fius.fiu_idWhere dots end up
Scrydon AnalyticsPhysical: fius_fiu_id. Display name: fius.fiu_id. UI shows the original; SQL uses sanitized.First-class catalog metadata.
Google BigQueryAuto-load silently sanitizes ._ and discards the original. Column-level description exists but no display_name slot.Lost.
SnowflakeIdentifiers allow dots only when double-quoted ("fius.fiu_id"). Most teams sanitize on load because every query then needs quoting.Optional, awkward.
Databricks (Delta + Unity Catalog)Without column mapping: Parquet-strict, no dots allowed. With delta.columnMapping.mode='name': arbitrary names, but SQL must backtick (`fius.fiu_id`). Auto Loader / COPY INTO sanitize by default.Optional via opt-in feature; uncommon in practice.
Microsoft Fabric / SynapseLakehouse is Delta — same as Databricks. Synapse SQL allows dots if quoted; Power BI uses bracket references ([fius.fiu_id]). Most teams sanitize.Optional, rarely kept raw.
AWS Glue / AthenaGlue catalog stores any name; Athena reads with backticks. Crawlers default to sanitizing problematic characters.Optional.
Palantir FoundrySplits physical name (sanitized, SQL-safe) and display name (original). UI shows display; Spark/SQL uses physical.First-class metadata.
Atlan / CollibraSame pattern as Foundry — physical name plus business-name / display-name. Lineage and queries use physical, UI uses display.First-class metadata.
Datadog / OpenTelemetryDots are semantic hierarchy (http.status_code). Stored as flat columns with literal dots in their custom column-store; or nested as structs in warehouses.First-class but as a schema convention.

The industry essentially clusters into three camps:

  • Sanitize-and-forget (BigQuery, default Databricks/Snowflake/Fabric ingest): cheap, lossy, ergonomic in SQL — but the original is gone forever.
  • Quote-and-allow (Databricks column mapping, Snowflake quoted identifiers): lossless on disk — but every query, BI tool, and connector needs quoting logic and the surface area for ergonomic friction is large.
  • Physical + display split (Foundry, Atlan, Scrydon): sanitized name in SQL, original kept as catalog metadata, UI/BI labels render the original. This is the "right" answer for catalog-style platforms — we picked it for the same reasons Foundry and Atlan did.

Why physical/display split

Scrydon Analytics is a catalog, not a raw warehouse. The original header is information. Throwing it away (BigQuery model) makes uploads succeed but loses provenance. Forcing every query to quote it (Snowflake / Databricks column-mapping model) makes uploads succeed but pushes friction onto every consumer downstream.

Concretely:

  1. Re-uploading the same file works. Drop the CSV again, the write path resolves dotted headers via display_name → column_name automatically. You don't have to rename columns in the source.
  2. BI tools and embedded SQL stay simple. A Tableau worksheet, an analyst notebook, or a workflow's SELECT fius_fiu_id doesn't need to know about the dot.
  3. The catalog is auditable. When a column was renamed at ingest, that fact is queryable (display_name IS NOT NULL) — useful for ops triage when someone wonders why their query against fius.fiu_id returns "column does not exist."
  4. Migrating in or out is symmetric. Exporting back to CSV uses the display name; consumers get the file back in the shape they sent it.

What gets sanitized in practice

Source headerPhysical (column_name)Stored display_name
customer_idcustomer_idnull
fius.fiu_idfius_fiu_idfius.fiu_id
Customer EmailCustomer_EmailCustomer Email
2024_revenue_2024_revenue2024_revenue
cafécaf_café
`` (empty)column`` (empty string)

Two distinct headers that sanitize to the same string get suffixed in source order — fius.fiu_id becomes fius_fiu_id, and a later fius_fiu_id in the same file becomes fius_fiu_id_2. The display_name for both round-trips the original verbatim, so the UI never confuses them.

Limits

  • StarRocks identifiers are ASCII-only and capped at 128 characters. Headers longer than 128 characters get truncated after sanitization; the full original is still preserved in display_name.
  • A few characters (control codes, NUL) are stripped along with the rest. If your CSV uses something exotic that you specifically want to preserve, surface it via description rather than expecting it in display_name.
  • Schema-first tables (where you explicitly declared columns up-front) bypass sanitization — declared columns are required to be SQL-safe at table creation time. Sanitization only kicks in for data-first ingest paths.
On this page

On this page