Dataset scanning in Rime Editor

Dataset scanning is the editor workflow that makes Rime feel different from editing YAML alone. Every table-producing node can be inspected as a concrete dataset: shape, columns, samples, profiles, and the script or expression that produced it.

Rime Editor table preview focused on sampled patient rows and column profile cards.

The selected node panel should make the output legible before you read any code:

| Surface | Why it matters |
| --- | --- |
| Shape tuple | The fastest signal that a node changed row or column count |
| Cache/run state | Explains whether you are seeing fresh output or a cached artifact |
| Column profiles | Shows type, nulls, cardinality, and distribution hints |
| Row sample | Lets you inspect real values without opening a notebook |
| Source/config | Keeps the SQL, script, expression, or YAML beside the data it produced |

The point is not to replace analysis. The point is to make obvious pipeline mistakes visible immediately: empty cohorts, exploded joins, unexpected nulls, accidental wide pivots, or feature columns with nonsense ranges.

Source nodes should show file path, inferred shape, sampled rows, and column profiles. For CSVs, check whether numeric-looking columns were inferred as numeric types and whether empty strings became null.
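The CSV checks can also be reproduced outside the editor. The following is a minimal sketch in plain Python, not Rime's own inference logic: `infer_csv_columns` and the sample data are hypothetical, and "numeric" here simply means every non-empty cell parses with `float()`.

```python
import csv
import io


def _is_number(value):
    """Return True if the cell parses as a float (note: 'nan' and 'inf' also parse)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_csv_columns(text):
    """Per column: count empty cells and test whether the remaining cells are numeric."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    report = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        # Short rows yield None; blank cells yield "". Both count as empty here.
        non_empty = [v for v in values if v is not None and v.strip() != ""]
        report[name] = {
            "empty": len(values) - len(non_empty),
            "numeric": bool(non_empty) and all(_is_number(v) for v in non_empty),
        }
    return report


# Made-up sample: one blank age, one non-numeric age, one blank lab value.
sample = "patient_id,age,crp_mean\nP001,54,3.2\nP002,,1.1\nP003,unknown,\n"
report = infer_csv_columns(sample)
for column, info in report.items():
    print(column, info)
```

In this sample, `age` looks numeric at a glance but contains the string `unknown`, so it would not be safe to treat as a number. That is the kind of mismatch the column profile should make visible.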

Filters should make row-count loss obvious. A filter that keeps zero rows may be logically valid, but it usually deserves a second look before downstream stats run.

- id: repeat_visitors
  kind: filter
  inputs: [risk_index]
  expr: "[n_visits] >= 1"
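To make the row-count check concrete, here is a small sketch in plain Python that applies the same predicate to made-up rows and reports how many were kept. `filter_row_loss` is a hypothetical helper, and treating a null `n_visits` as "does not pass" is an assumption, not documented Rime behavior.

```python
def filter_row_loss(rows, predicate):
    """Apply a predicate and summarize how the row count changed."""
    kept = [row for row in rows if predicate(row)]
    before, after = len(rows), len(kept)
    return {
        "before": before,
        "after": after,
        "dropped": before - after,
        "empty": after == 0,                         # filter kept nothing
        "unchanged": before > 0 and after == before,  # filter removed nothing
    }


# Made-up rows standing in for the risk_index input.
risk_index = [
    {"patient_id": "P001", "n_visits": 3},
    {"patient_id": "P002", "n_visits": 1},
    {"patient_id": "P003", "n_visits": 0},
    {"patient_id": "P004", "n_visits": None},
]

# Mirrors "[n_visits] >= 1"; nulls are assumed to fail the comparison.
summary = filter_row_loss(
    risk_index,
    lambda row: row["n_visits"] is not None and row["n_visits"] >= 1,
)
print(summary)
```

The `unchanged` flag covers the opposite failure mode from an empty result: a filter that drops no rows may not be expressing the condition its name suggests.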

Derived columns should be easy to find in the preview. Inspect the new column’s profile before trusting downstream results.

- id: lab_load
  kind: derive
  inputs: [patient_lab_wide]
  as: lab_load
  expr: "[crp_mean] * [ldl_max] / 1000.0"
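As a rough illustration of what to look for in the new column's profile, the sketch below recomputes the same formula in plain Python and summarizes the result. The helper names and sample rows are hypothetical, and propagating nulls (a null input gives a null `lab_load`) is an assumption about how the expression behaves.

```python
def derive_lab_load(rows):
    """Add lab_load to each row; a null input yields a null result (assumed semantics)."""
    out = []
    for row in rows:
        crp, ldl = row.get("crp_mean"), row.get("ldl_max")
        value = None if crp is None or ldl is None else crp * ldl / 1000.0
        out.append({**row, "lab_load": value})
    return out


def summarize(rows, column):
    """Minimal profile of one column: row count, null count, and value range."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "rows": len(rows),
        "nulls": len(rows) - len(values),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }


# Made-up rows standing in for patient_lab_wide.
patient_lab_wide = [
    {"patient_id": "P001", "crp_mean": 2.0, "ldl_max": 150.0},
    {"patient_id": "P002", "crp_mean": None, "ldl_max": 110.0},
    {"patient_id": "P003", "crp_mean": 400.0, "ldl_max": 120.0},  # suspicious input
]

derived = derive_lab_load(patient_lab_wide)
summary = summarize(derived, "lab_load")
print(summary)
```

A maximum far above the rest of the range, or a null count that differs from what the inputs suggest, is a reason to go back to the expression before anything downstream consumes the column.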

Joins deserve row-count attention. If an inner join unexpectedly shrinks the table, inspect key coverage. If a left join expands rows, check for duplicate keys on the right side (one-to-many or many-to-many matches).
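Both join checks can be estimated from the key columns alone, before the join runs. The sketch below is plain Python with a hypothetical `join_key_report` helper and made-up keys. It ignores null keys, which is an assumption about join semantics.

```python
from collections import Counter


def join_key_report(left_keys, right_keys):
    """Predict inner/left join row counts and flag keys that multiply rows."""
    left = Counter(k for k in left_keys if k is not None)
    right = Counter(k for k in right_keys if k is not None)
    left_total = sum(left.values())

    matched_left_rows = sum(n for k, n in left.items() if k in right)
    # Each left row with key k produces right[k] output rows.
    inner_rows = sum(n * right[k] for k, n in left.items() if k in right)
    left_join_rows = inner_rows + (left_total - matched_left_rows)

    return {
        "left_rows": left_total,
        "left_coverage": matched_left_rows / left_total if left_total else 0.0,
        "inner_rows": inner_rows,
        "left_join_rows": left_join_rows,
        # Keys duplicated on the right expand a left join.
        "expanding_keys": sorted(k for k in left if right.get(k, 0) > 1),
        # Keys duplicated on both sides are many-to-many.
        "many_to_many_keys": sorted(
            k for k, n in left.items() if n > 1 and right.get(k, 0) > 1
        ),
    }


left_keys = ["P001", "P002", "P003", "P003"]
right_keys = ["P001", "P001", "P003", "P003", "P009"]
report = join_key_report(left_keys, right_keys)
print(report)
```

Low `left_coverage` explains an inner join that shrinks. A `left_join_rows` value above `left_rows` explains a left join that grows, and `expanding_keys` names the keys responsible.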

Aggregate and pivot nodes usually change shape dramatically. The output column names are part of the review: aliases like [mean_risk_index] should read like report-ready metrics.
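The shape change and the alias can be seen in a small plain-Python sketch of a grouped mean. This is not how Rime implements aggregation: `aggregate_mean`, the `site` grouping column, and the rows are all hypothetical, and skipping nulls when averaging is an assumption.

```python
from collections import defaultdict


def aggregate_mean(rows, group_by, column, alias):
    """Group rows by one column and emit one row per group with the mean under `alias`."""
    groups = defaultdict(list)
    for row in rows:
        bucket = groups[row[group_by]]  # touch the group so all-null groups still appear
        if row[column] is not None:
            bucket.append(row[column])
    return [
        {group_by: key, alias: (sum(vals) / len(vals)) if vals else None}
        for key, vals in sorted(groups.items())
    ]


# Made-up input: 4 rows, 2 columns.
rows = [
    {"site": "north", "risk_index": 0.25},
    {"site": "north", "risk_index": 0.75},
    {"site": "south", "risk_index": 1.0},
    {"site": "south", "risk_index": None},
]

result = aggregate_mean(rows, "site", "risk_index", alias="mean_risk_index")
shape_before = (len(rows), len(rows[0]))
shape_after = (len(result), len(result[0]))
print(shape_before, "->", shape_after)
print(result)
```

Comparing the two shape tuples and reading the output column name are the two checks the scan surface is meant to make quick.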

Focused table sample showing patient identifiers, demographics, and lab columns.

Column profiles are useful because they compress a lot of context:

  • null count can reveal failed joins or missing source values
  • cardinality distinguishes IDs from low-cardinality groups
  • numeric distributions make outliers and impossible values visible
  • type hints show whether a field is usable in expression nodes and statistical nodes

When a node uses the expression language, profiles are often the quickest way to decide whether the formula made sense.
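For readers who want to see what such a profile contains, here is a minimal sketch in plain Python. It is illustrative only: `profile_column` and its type labels are hypothetical and are not Rime's profile format.

```python
def profile_column(values):
    """Compute a small profile: count, nulls, cardinality, a type hint, and numeric range."""
    non_null = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it from the numeric check.
    is_numeric = [
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ]

    if not non_null:
        type_hint = "empty"
    elif all(is_numeric):
        type_hint = "numeric"
    elif all(isinstance(v, str) for v in non_null):
        type_hint = "string"
    else:
        type_hint = "mixed"

    profile = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "type": type_hint,
    }
    if type_hint == "numeric":
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
    return profile


# Made-up columns: an age column with an impossible value, and a low-cardinality group.
age_profile = profile_column([54, 61, None, 47, 203])
sex_profile = profile_column(["F", "M", "F", None])
print(age_profile)
print(sex_profile)
```

Here the `max` of the age column exposes an impossible value, and the low cardinality of the second column marks it as a grouping field rather than an identifier.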

A good Rime workflow is:

  1. Use core nodes for visible transformations.
  2. Scan each output as it changes.
  3. Drop to SQL/Python/R/JavaScript only when the core-node expression would be harder to review than code.
  4. Return to the scan surface after the script node runs.

The editor is not trying to hide code. It is trying to keep code and data side by side.