Nodes
Nodes are functions, not jobs
A node in Rime is a function over dataframes.
You write what the function computes — the body. Rime’s runtime owns everything else: reading the inputs, materializing them into a native value (pandas DataFrame, R tibble, DuckDB table, JS object array), running your code, capturing the return value, content-addressing it, caching it, and handing it to the next node.
You never write read_csv() at the top of a node. You never write to_parquet() at the bottom. The runtime does both. Your function signature is the contract.
How that differs from Airflow / Prefect / Dagster
The workflow orchestrators in the ETL world (Airflow, Prefect, Dagster) treat each step as a task: a Python function that reads from somewhere, transforms, and writes somewhere else. The function body is responsible for its own I/O. Tasks coordinate by writing artifacts that downstream tasks happen to read: the side effect is the contract.
Rime inverts this. Side effects are the runtime’s job; functions just compute.
| In Airflow / Prefect / Dagster | In Rime |
|---|---|
| Task reads from S3, writes to S3 | Function takes a dataframe, returns a dataframe |
| Each task owns its own I/O | Runtime owns I/O |
| Coordination via storage paths | Coordination via typed dataframe ports |
| Reproducibility requires hand-rolled idempotency | Caching is automatic (content-addressed) |
| Multi-language = hand-wiring separate task runtimes | Multi-language = kind: r in YAML; dataframes cross through Rime artifacts |
| You write the boilerplate | The runtime owns the boilerplate |
This is the same intuition behind dbt’s “you write the SELECT, we handle materialization” — extended past SQL into Python, R, and JavaScript.
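The contrast in the table can be sketched in plain Python. This is an illustration only, not Rime's or any orchestrator's API: `cohort_task` is a hypothetical task-style function, and the list-of-dicts rows stand in for a dataframe so the sketch runs on the standard library alone.

```python
import csv


def cohort_task(in_path, out_path):
    """Task style: the body owns its I/O, and the output path is the contract."""
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    adults = [r for r in rows if int(r["age"]) >= 18]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(adults)


def run(patients):
    """Function style: no paths, no serializer; the signature is the contract."""
    return [r for r in patients if r["age"] >= 18]
```

The second function carries no knowledge of where its rows came from or where they go, which is what lets a runtime take over reading, writing, and caching.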
The smallest possible example
```python
def run(patients):
    # `patients` arrives as a pandas DataFrame.
    # You did not open a file. You did not pick a serializer.
    return patients[patients["age"] >= 18]
```

```yaml
- id: cohort
  kind: python
  source: scripts/cohort.py
  in:
    patients: raw_patients  # upstream node ID
```

That’s the whole node. The runtime:

- Reads the upstream `raw_patients` output from disk (or cache),
- Decodes it as a pandas DataFrame,
- Calls `run(patients=<the dataframe>)`,
- Captures the returned dataframe,
- Hashes the `(source code + inputs)` pair into a content address,
- Writes the result to `outputs/cohort/default.parquet`,
- Makes it available to any downstream node that references `cohort`.
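The steps above can be mimicked with a toy runtime. This is a minimal sketch under stated assumptions, not Rime's implementation: `content_address` and `execute_node` are hypothetical names, rows are plain lists of dicts rather than pandas DataFrames, and an in-memory dict stands in for the Parquet files on disk.

```python
import hashlib
import json

# The node's source, as the runtime would load it from scripts/cohort.py.
COHORT_SOURCE = '''
def run(patients):
    return [row for row in patients if row["age"] >= 18]
'''


def content_address(source, inputs):
    """Hash the (source code + inputs) pair into a stable key."""
    payload = json.dumps({"source": source, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def execute_node(source, inputs, cache):
    """Run the node's `run` function unless an identical run is cached."""
    key = content_address(source, inputs)
    if key in cache:
        return cache[key], True          # cache hit: the body does not run
    namespace = {}
    exec(source, namespace)              # load the node's code
    result = namespace["run"](**inputs)  # bind inputs to named slots
    cache[key] = result
    return result, False


cache = {}
raw_patients = [{"id": 1, "age": 17}, {"id": 2, "age": 42}]
first, first_hit = execute_node(COHORT_SOURCE, {"patients": raw_patients}, cache)
second, second_hit = execute_node(COHORT_SOURCE, {"patients": raw_patients}, cache)
```

The second call returns the cached rows without executing the function body, because neither the source text nor the input changed.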
Switch kind: python to kind: r and write the same function in R — same protocol, same caching, no glue code between them.
Built-in node kinds
Most pipelines don’t even need to write a custom function for common shapes. Rime ships 14 built-in kinds that cover the things you’d otherwise re-write every project:
Source
| Kind | What it does |
|---|---|
| `source` | Read a CSV / JSON / NDJSON / Parquet file into a tabular value |

```yaml
- id: patients
  kind: source
  path: data/patients.csv
```

Single-input transforms

| Kind | What it does |
|---|---|
| `filter` | Keep rows matching a boolean expression |
| `derive` | Add a computed column |
| `select` | Keep specific columns |
| `sort` | Order rows by one or more expressions |
| `aggregate` | Group + reduce, with named metrics |
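As a rough analogue of what the five single-input kinds compute, here is a plain-Python sketch over a list of row dicts. The column names and sample values are invented for illustration, and this says nothing about how Rime evaluates these nodes internally.

```python
from collections import defaultdict

rows = [
    {"site": "A", "age": 34, "crp": 2.0},
    {"site": "A", "age": 61, "crp": 8.0},
    {"site": "B", "age": 17, "crp": 1.0},
    {"site": "B", "age": 45, "crp": 5.0},
]

# filter: keep rows matching a boolean expression
adults = [r for r in rows if r["age"] >= 18]

# derive: add a computed column
derived = [{**r, "crp_high": r["crp"] > 4.0} for r in adults]

# select: keep specific columns
selected = [{"site": r["site"], "crp": r["crp"]} for r in derived]

# sort: order rows by one or more expressions (site ascending, crp descending)
ordered = sorted(selected, key=lambda r: (r["site"], -r["crp"]))

# aggregate: group + reduce, with named metrics
groups = defaultdict(list)
for r in ordered:
    groups[r["site"]].append(r["crp"])
summary = [
    {"site": site, "n": len(values), "crp_mean": sum(values) / len(values)}
    for site, values in sorted(groups.items())
]
```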
These nodes share Rime’s expression language. The useful pattern is to keep data-shaping logic visible as small formulas instead of hiding every operation inside a script node:
```yaml
- id: risk_index
  kind: derive
  inputs: [patient_lab_wide]
  as: risk_index
  expr: "coalesce([crp_mean], 0) * 2.0 + coalesce([ldl_max], 0) * 0.05"
```

Multi-input combinators

| Kind | What it does |
|---|---|
| `join` | Two-input inner / left join on column keys |
| `concat` | Stack tables row-wise with a label column |
| `pivot` | Wide-format aggregation |
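The semantics of `join` and `concat` can be illustrated in plain Python. This is a sketch under assumptions: the helper signatures, the `label` column name, and the single-key, unique-on-the-right restriction are simplifications chosen for the example, not Rime's actual node options.

```python
patients = [{"id": 1, "site": "A"}, {"id": 2, "site": "B"}, {"id": 3, "site": "A"}]
labs = [{"id": 1, "ldl": 130}, {"id": 3, "ldl": 95}]


def join(left, right, key, how="inner"):
    """Two-input join on one key column; assumes `key` is unique in `right`."""
    index = {row[key]: row for row in right}
    right_cols = [c for c in right[0] if c != key] if right else []
    out = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            out.append({**row, **match})
        elif how == "left":
            # Unmatched left rows survive, with nulls in the right-hand columns.
            out.append({**row, **{c: None for c in right_cols}})
    return out


def concat(tables, label_column="label"):
    """Stack tables row-wise, tagging each row with the table it came from."""
    return [
        {**row, label_column: label}
        for label, table in tables.items()
        for row in table
    ]


inner = join(patients, labs, key="id")
left = join(patients, labs, key="id", how="left")
stacked = concat({"2023": [{"x": 1}], "2024": [{"x": 2}]})
```

An inner join drops patient 2, who has no lab row; a left join keeps that patient with `ldl` set to `None`.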
Statistical terminals
These return a small JSON-shaped result (test statistic, p-value, etc.) rather than a table. Reports render them as stat-style key-value output cells.

| Kind | What it does |
|---|---|
| `t_test` | Welch / equal-variance two-sample t-test |
| `anova` | One-way ANOVA across N groups |
| `mann_whitney_u` | Non-parametric two-sample test |
| `chi_square` | Categorical independence test |
| `correlation` | Pearson / Spearman correlation between two columns |
| `linear_regression` | Single-feature OLS, optional train/test split |
Statistical nodes also emit assumption warnings. Those warnings show up in reports and the editor review surfaces because they are often as important as the p-value: low expected cell counts for chi-square, small or skewed groups for t-tests/ANOVA, Pearson/Spearman disagreement for correlation, and high-residual observations for linear regression.
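To show what a "JSON-shaped result plus warnings" might look like, here is a sketch of the Welch variant of the two-sample t-test. The result keys, the `min_n` threshold, and the warning text are assumptions made for this example and are not Rime's actual output schema. The p-value is left out because it needs the t-distribution CDF, which the Python standard library does not provide.

```python
from statistics import mean, variance


def welch_t(a, b, min_n=5):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    warnings = []
    if min(na, nb) < min_n:
        warnings.append(f"small group: n={min(na, nb)}")
    return {"statistic": t, "df": df, "warnings": warnings}


result = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
small = welch_t([1, 2, 3], [4, 5, 6, 7, 8])
```

Returning the warnings next to the statistic keeps the assumption check attached to the number it qualifies.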
Composition
| Kind | What it does |
|---|---|
| `subgraph` | Embed an external `.dag.yaml` file with named bindings + outputs |
Subgraphs are opaque from the outside; their bindings: map outer node refs to inner slot names, and their outputs: map exposed names to inner refs.
The language node — the escape hatch
Anything you can’t express with the built-ins is a language node. The functional contract is the same: you write a function, declare its inputs as named slots, and return a dataframe (or a dict of named dataframes):
```yaml
- id: features
  kind: python
  source: scripts/features.py
  in:
    cohort: upstream_node        # dataframe slot
    threshold: params.threshold  # scalar slot
```

Native values per language: pandas DataFrame (Python), data.frame/tibble-style table (R), row arrays (JS), temp table (SQL). See Polyglot runtime for the per-language details.
Metadata (optional, all kinds)
```yaml
metadata:
  label: "Friendly node label"   # used in reports and visualizations
  group: "feature_engineering"   # logical grouping
  visual_stats: ["row_count"]    # engine emits these on each run
  cache: false                   # boolean or { policy: ttl, seconds: N }
```

What this buys you

- Move scripts between languages without rewriting glue. Switch `kind: python` to `kind: r`; the function signature stays the same.
- No serialization decisions in user code. Arrow IPC and Parquet are runtime concerns, not yours.
- Caching is automatic. Change a script, and only that node and its downstream nodes re-run. Changing an input has the same effect.
- Reproducibility is a side effect of the model, not extra work. The cache key is `hash(source + inputs)`; same key = same result, every time.
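The "only it and its downstream re-run" behavior follows if each node's key depends on the keys of its inputs. The sketch below assumes exactly that chaining. `node_keys`, the graph, and the source strings are hypothetical, and a real content-addressed runtime would hash actual output data rather than upstream keys.

```python
import hashlib


def node_keys(sources, upstream):
    """Cache key per node: hash of its source plus the keys of its inputs."""
    keys = {}

    def key_of(node):
        if node not in keys:
            h = hashlib.sha256(sources[node].encode("utf-8"))
            for dep in upstream.get(node, []):
                h.update(key_of(dep).encode("utf-8"))
            keys[node] = h.hexdigest()
        return keys[node]

    for node in sources:
        key_of(node)
    return keys


# node -> list of upstream nodes it reads from
upstream = {
    "cohort": ["raw_patients"],
    "features": ["cohort"],
    "sites": ["raw_patients"],
}
sources = {
    "raw_patients": "data/patients.csv",
    "cohort": "age >= 18",
    "features": "score = crp * 2",
    "sites": "group by site",
}

before = node_keys(sources, upstream)
after = node_keys({**sources, "cohort": "age >= 21"}, upstream)
stale = sorted(n for n in sources if before[n] != after[n])
# stale == ["cohort", "features"]
```

Editing `cohort` changes its own key and the key of `features`, which reads from it, while `raw_patients` and `sites` keep their keys and stay cached.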
Per-kind field reference lives under Node Reference in the sidebar.