
The first patient world model for rare disease, grounded in the genome and a biomedical knowledge graph, trained on a national single-payer system.

A Propose → Simulate → Verify architecture. 42,265 real SUS trajectories. 20M parameters. Five minutes on one H100. Level-1 and Level-2 capability empirically validated; Level 3 by design.
Authors · Dimas Timmers · Alexandre Melo Kawassaki · João Bosco Oliveira
Date · May 2026
Abstract, in plain language
Rare diseases affect roughly 300 million people, and the question that matters in the clinic is rarely "what is it?" but "what happens next, and what changes if I intervene?". That is the question world models are built to answer. GEMEO is, to our knowledge, the first patient world model built for rare disease, and the first trained on real records of a national single-payer system (Brazil's SUS). It has a three-pillar architecture: Propose (grounds candidate first-onset events in a biomedical knowledge graph and the patient's genome), Simulate (a recurrence-aware Causal Diffusion Forcing transformer that predicts genuinely novel events instead of echoing the past), and Verify (an agentic panel that adjudicates every prediction with a traceable evidence path). On a new public, autocorrelation-immune benchmark, RareBench-BR Trajectory, GEMEO reaches 53.7% Top-1 new-onset prediction versus a 38.2% frequency baseline, and beats count-based methods on every long-context task (will-change AUROC 0.906, transition within 12 months 0.827, treatment discontinuation 0.838). Its genomic pillar, validated on real ClinVar variants, scores variant pathogenicity at AUROC 0.93 (AlphaMissense, missense), 0.82 (Evo 2, zero-shot), and 0.73 (AlphaGenome, splice). The flagship has 20M parameters and trains in five minutes on one H100. We validate Level-1 (state-conditioned) and Level-2 (action-conditioned) capability on the clinical world-model rubric (NeurIPS 2025); the architecture is designed for the counterfactual rollout of Level 3. Architecture, weights, and benchmark are released openly.

A patient world model is a generative model of patient dynamics: a learned simulator that can roll out a trajectory under chosen actions and therefore reason about counterfactuals. For rare disease, where the median diagnostic odyssey exceeds six years and the high-value clinical question is "if I change treatment, what changes in ten years?", a world model is the right object, yet until now no one had built one.

World models have transformed video (Sora), interactive environments (Genie), and embodied agents (Dreamer V3), built on the backbone family that culminates in Diffusion Forcing. In medicine, general clinical world models such as EHRWorld and CLARITY have begun to appear, but none targets rare disease, grounds itself in a biomedical knowledge graph, or conditions on the genome, which for rare disease is the root cause.

We introduce GEMEO: the first patient world model for rare disease.

The three pillars

GEMEO is a Propose → Simulate → Verify pipeline, instantiated against any electronic health record expressible in the Medical Event Data Standard (MEDS), the tuple stream (subject_id, time, code, value). The flagship, gemeo-sus, is trained on Brazil's SUS.

Figure 1 · The GEMEO architecture. Genome and knowledge-graph grounding feed the Propose pillar (A); the recurrence-aware Causal Diffusion Forcing world model simulates trajectories (B); the agentic panel Verifies and re-ranks every prediction with a traceable evidence path (C); all instantiated over a pluggable MEDS substrate.

Pillar A · Propose: genome- and graph-grounded onset

Given a patient's manifested state, Pillar A proposes the clinical events they have not yet had: the first-onset candidates that a repetition baseline cannot produce. It draws on two grounded sources.

Knowledge graph. We run Random-Walk-with-Restart (RWR) over a heterogeneous biomedical graph derived from PrimeKG, seeded on the patient's manifested diseases, phenotypes, and variant-bearing genes. The stationary distribution ranks unseen genes, phenotypes, and diseases by network proximity; already-manifested nodes are excluded, and each candidate ships its shortest evidence path back to a seed.
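
As a minimal sketch of the RWR ranking step, assuming a small undirected graph given as a dense adjacency matrix: the function names, the restart probability of 0.3, and the toy path graph are illustrative assumptions, not values or code from the GEMEO release. Evidence-path extraction is not covered here.

```python
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return RWR proximity scores for every node of an undirected graph.

    adj: (n, n) symmetric, non-negative adjacency matrix.
    seeds: indices of the patient's manifested nodes.
    restart: restart probability (assumed value, not taken from the paper).
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # isolated nodes: avoid division by zero
    transition = adj / col_sums          # column-stochastic transition matrix
    r = np.zeros(n)
    r[list(seeds)] = 1.0 / len(seeds)    # uniform restart vector over the seeds
    p = r.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * r
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p

def rank_candidates(scores, seeds):
    """Rank nodes by descending proximity, excluding already-manifested seeds."""
    seed_set = set(seeds)
    return [int(i) for i in np.argsort(-scores) if int(i) not in seed_set]

# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
toy_adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
toy_scores = random_walk_with_restart(toy_adj, seeds=[0])
toy_ranking = rank_candidates(toy_scores, seeds=[0])
```

On the toy graph, proximity decays with distance from the seed, so the unseen nodes are ranked nearest first.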

Genome. The patient's variants are scored by an ensemble of genomic foundation models, each on its domain: AlphaMissense for missense, Evo 2 (zero-shot likelihood delta) for coding and noncoding, and AlphaGenome for splice and regulatory effects. Variant pathogenicity reweights the RWR restart vector, so the genome steers the proposal toward the patient's actual molecular lesion.
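
One way to picture the reweighting, as a sketch only: the text says that variant pathogenicity reweights the restart vector but does not give the formula, so the additive scheme below (`base_weight + score`), the function name, and the example seed names and scores are all assumptions made for illustration.

```python
def pathogenicity_weighted_restart(seeds, gene_scores, base_weight=1.0):
    """Build a normalised restart vector over the seed nodes.

    seeds: node names (diseases, phenotypes, variant-bearing genes).
    gene_scores: dict gene -> pathogenicity in [0, 1] for the patient's variants.
    Assumed scheme: every seed starts at base_weight and a scored gene
    gains its pathogenicity score on top, before normalisation.
    """
    raw = {s: base_weight + gene_scores.get(s, 0.0) for s in seeds}
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}

# Hypothetical patient: one likely-pathogenic gene, one likely-benign gene,
# and one phenotype seed that has no variant score.
example_restart = pathogenicity_weighted_restart(
    seeds=["FBN1", "TTN", "phenotype:arachnodactyly"],
    gene_scores={"FBN1": 0.95, "TTN": 0.05},
)
```

Under this assumed scheme the walk restarts more often at the gene carrying the damaging variant, which is the steering effect described above.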

Pillar B · Simulate: a recurrence-aware Causal Diffusion Forcing world model

The dynamics core is a Causal Diffusion Forcing transformer: each token receives an independent noise level σ ∼ 𝒰(0,1), unifying autoregressive prediction, full-sequence diffusion, and variable-horizon rollout in one model. The backbone is a 20M-parameter SwiGLU + RMSNorm + RoPE transformer with gated cross-attention to a PrimeKG ego-subgraph (tanh(α) gate, α initialised at zero, after Flamingo and Genie).
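
The two mechanisms named here can be sketched in a few lines of NumPy. The per-token noise level σ ∼ 𝒰(0,1) and the tanh(α) gate initialised at zero come from the text; the linear interpolation toward Gaussian noise and the function names are assumptions for illustration, and the actual noise schedule may differ.

```python
import numpy as np

def add_per_token_noise(tokens, rng):
    """Noise each token embedding with its own level sigma ~ U(0, 1).

    tokens: (seq_len, dim) array of clean embeddings.
    Returns the noised embeddings and the per-token noise levels.
    The linear mix toward Gaussian noise is an assumed schedule.
    """
    sigma = rng.uniform(0.0, 1.0, size=(tokens.shape[0], 1))
    noise = rng.standard_normal(tokens.shape)
    noisy = (1.0 - sigma) * tokens + sigma * noise
    return noisy, sigma[:, 0]

def gated_residual(hidden, cross_attn_out, alpha):
    """Gated cross-attention residual: hidden + tanh(alpha) * cross_attn_out."""
    return hidden + np.tanh(alpha) * cross_attn_out

rng = np.random.default_rng(0)
clean = np.ones((5, 4))
noisy, sigma = add_per_token_noise(clean, rng)
# With alpha = 0 the gate is closed and the knowledge-graph branch is ignored.
unchanged = gated_residual(clean, 10.0 * np.ones((5, 4)), alpha=0.0)
```

Because every position draws its own σ, the same network sees clean-past/noisy-future patterns (autoregressive prediction) and uniformly noised sequences (full-sequence diffusion) during training. Initialising α at zero means the model starts as a plain transformer and learns how much graph context to admit.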

The critical design choice is the training objective. In SUS orphan-drug data, only 17.8% of events are first occurrences; the other 82.2% are repeats (a patient on a monthly orphan drug receives the same dispensing code every month). A naive next-event loss therefore rewards copying. Following the recurrence-aware principle of RAVEN, we weight each token's cross-entropy by w = max(λ^count, w_min), with λ = 0.25, so that first occurrences carry full weight and repeats decay toward zero (floored at w_min). This is the single decisive lever.
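
The weighting rule can be sketched directly from the formula w = max(λ^count, w_min). Three things below are assumptions rather than statements from the text: `count` is taken to be the number of earlier occurrences of the same code in the patient's history (so a first occurrence has count 0), `w_min = 0.01` is a placeholder because the floor is not given, and the loss is normalised by the sum of weights.

```python
import math

def recurrence_weights(codes, lam=0.25, w_min=0.01):
    """Per-token loss weights w = max(lam ** count, w_min).

    count is assumed to be how many times the code has already appeared
    earlier in this patient's sequence; w_min is an assumed floor value.
    """
    seen = {}
    weights = []
    for code in codes:
        count = seen.get(code, 0)
        weights.append(max(lam ** count, w_min))
        seen[code] = count + 1
    return weights

def weighted_cross_entropy(true_token_log_probs, weights):
    """Weighted mean negative log-likelihood (normalisation is an assumption)."""
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, true_token_log_probs)) / total

# A monthly dispensing code "A" repeated around a single new event "B".
example_weights = recurrence_weights(["A", "A", "B", "A", "A", "A"])
# → [1.0, 0.25, 1.0, 0.0625, 0.015625, 0.01]
```

The fifth repeat of "A" would get 0.25^4 ≈ 0.0039 and is clipped to the floor, while the first "B" keeps full weight even though it arrives late in the sequence.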

Pillar C · Verify: an agentic, evidence-grounded panel

Candidates from Pillar A, scored by Pillar B, are adjudicated by a case-adaptive multi-agent panel in the lineage of CAMP, DAVP, ClinicalAgents, LA-MARRVEL, and DeepRare. Specialist agents (genetics, phenotype, network/RWR, genomic, world-model) each cast a three-valued vote (KEEP / REFUSE / NEUTRAL) derived from real graph edges, RWR proximity, variant pathogenicity, or model hazard, never free text. Votes are aggregated by weighted log-odds into a calibrated decision; every verdict carries its evidence path.
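
A weighted log-odds aggregation of three-valued votes might look like the sketch below. The vote encoding (+1 / −1 / 0), the per-agent weights, the zero prior, and the function name are assumptions for illustration; the text specifies only that votes are three-valued and combined by weighted log-odds.

```python
import math

# Assumed encoding: KEEP pushes the log-odds up, REFUSE down, NEUTRAL abstains.
VOTE_SIGN = {"KEEP": 1.0, "REFUSE": -1.0, "NEUTRAL": 0.0}

def aggregate_votes(votes, weights, prior_log_odds=0.0):
    """Combine agent votes into a probability of keeping the candidate.

    votes: dict agent -> "KEEP" | "REFUSE" | "NEUTRAL".
    weights: dict agent -> non-negative log-odds weight (hypothetical values).
    """
    log_odds = prior_log_odds + sum(
        weights[agent] * VOTE_SIGN[vote] for agent, vote in votes.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

example_probability = aggregate_votes(
    votes={"genetics": "KEEP", "phenotype": "REFUSE", "network": "NEUTRAL"},
    weights={"genetics": 2.0, "phenotype": 1.0, "network": 1.5},
)
```

In this example the genetics agent outweighs the phenotype agent, giving net log-odds of +1 and a probability of about 0.73. A panel of all-NEUTRAL votes leaves the prior untouched.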

Where GEMEO stands on the world-model rubric

We position GEMEO precisely on the clinical world-model capability rubric of Qazi et al. (NeurIPS 2025), which grades world models from Level 1 (state-conditioned temporal prediction) to Level 4 (planning/control).

| Level | Capability | Evidence | Status |
|---|---|---|---|
| 1 | State-conditioned prediction | Novel-event and long-context prediction conditioned on patient state (§4.1, §4.2, §4.3) | Validated |
| 2 | Action-conditioned rollout | A treatment-vocabulary variant predicts a patient's future significantly better when conditioned on the action they actually received (§4.4) | Validated |
| 3 | Counterfactual rollout for decision support | Diffusion Forcing supports variable-horizon, action-conditioned generation by construction; reaching Level 3 empirically requires an interventional cohort (§6) | By design |
| 4 | Planning / control | Out of scope for this paper | Future work |

The data substrate

The cohort comprises 42,265 patients with ≥4 events and ≥2 distinct codes, drawn from CNS-hash-linked SUS subsystems (high-complexity outpatient care, APAC-SIA; hospitalisations, SIH; mortality, SIM) and exported to MEDS v0.4.1 with canonical code namespaces (ICD10//, SIGTAP//, APAC//, ORPHA, MEDS_*). De-identification: ages are bucketed, residence is kept at state (UF) granularity, the CNS is hashed, and k-anonymity is ≥ 5. The train/val/test split is 70/15/15 at patient level.
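
The inclusion rule and the patient-level split can be sketched over MEDS-style tuples as follows. The thresholds (≥4 events, ≥2 distinct codes, 70/15/15) come from the text; the function names, the random seed, and the example codes are hypothetical, and the released pipeline may differ.

```python
import random
from collections import defaultdict

def build_cohort(events, min_events=4, min_codes=2):
    """Group (subject_id, time, code, value) tuples and apply the inclusion rule."""
    by_subject = defaultdict(list)
    for subject_id, time, code, value in events:
        by_subject[subject_id].append((time, code, value))
    return {
        sid: sorted(evs, key=lambda e: e[0])   # order each trajectory by time
        for sid, evs in by_subject.items()
        if len(evs) >= min_events and len({code for _, code, _ in evs}) >= min_codes
    }

def split_patients(subject_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split at patient level so no subject appears in more than one partition."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Hypothetical events: p1 qualifies, p2 has a single distinct code, p3 is too short.
example_events = (
    [("p1", 1, "ICD10//Q87.4", None)] + [("p1", t, "APAC//DRUG_A", None) for t in (2, 3, 4)]
    + [("p2", t, "APAC//DRUG_B", None) for t in (1, 2, 3, 4)]
    + [("p3", 1, "ICD10//Q87.4", None), ("p3", 2, "APAC//DRUG_A", None), ("p3", 3, "APAC//DRUG_A", None)]
)
example_cohort = build_cohort(example_events)
```

Splitting on subject identifiers rather than on events keeps every trajectory wholly inside one partition, which is what prevents leakage between train and test.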

42,265 · Patients with ≥4 events and ≥2 distinct codes
2.4M · MEDS events across SIH-RD, APAC-SIA, and SIM
20M · Parameters; trains in ~5 min on one H100
k ≥ 5 · k-anonymity floor on every released artifact

Result 1, the world model predicts genuinely new events

The decisive metric is new-onset prediction: scoring only positions where the true token is a first occurrence (repeats excluded). Recurrence-aware GEMEO attains Top-1 of 53.7% (95% CI 51.4–56.1) versus a 38.2% frequency baseline (+15.5 pp, non-overlapping CI). An ablation without the recurrence-aware objective falls below the baseline, confirming that the objective, not autocorrelation, drives the result.

| Model | New-onset Top-1 | vs frequency (38.2%) |
|---|---|---|
| Frequency baseline | 38.2% | reference |
| GEMEO (flat loss, ablation) | 14.6% [13.5, 15.8] | below baseline |
| GEMEO (recurrence-aware) | 53.7% [51.4, 56.1] | +15.5 pp |
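
The new-onset metric can be made concrete with a short sketch: only positions whose true code has not appeared earlier in the same trajectory are scored. The function name and the `predict` callable are hypothetical stand-ins for any model, and skipping position 0 (which has no history) is an assumption.

```python
def new_onset_top1(trajectories, predict):
    """Top-1 accuracy restricted to first-occurrence positions.

    trajectories: list of code sequences, one per patient.
    predict: callable(prefix) -> predicted next code (stand-in for a model).
    """
    hits = total = 0
    for codes in trajectories:
        seen = set()
        for t, code in enumerate(codes):
            if t > 0 and code not in seen:   # first occurrence with some history
                total += 1
                hits += int(predict(codes[:t]) == code)
            seen.add(code)
    return hits / total if total else float("nan")

def repeat_last(prefix):
    """Copying baseline: predict that the previous code repeats."""
    return prefix[-1]

toy = [["A", "A", "B", "B", "C"]]
copy_score = new_onset_top1(toy, repeat_last)   # scored positions: "B" and "C"
```

A copying predictor can never be right on a scored position, because the code it copies has by definition already been seen. That is why this metric is insensitive to autocorrelation.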

An ablation grid (n = 3,405 new-onset positions, frequency baseline 37.5%) isolates what drives novelty prediction: removing the recurrence-aware loss collapses new-onset to 8–15%, below the frequency baseline (the model reverts to copying repeats); adding it lifts performance to 55–59%, a swing exceeding 40 percentage points. Positional features are roughly neutral on novelty.

| Recurrence-aware | Positional features | New-onset Top-1 |
|---|---|---|
| yes | | 55.1% [53.5, 56.7] |
| yes | | 58.8% [57.1, 60.4] |
| no | | 14.6% [13.5, 15.8] |
| no | | 7.6% [6.8, 8.5] |

Result 2, the world model wins every long-context task

On RareBench-BR Trajectory, under EHRSHOT-style frozen-representation linear probes with mandatory count-based baselines on the same candidate space, GEMEO's learned representation leads on every novelty and long-context task. The standout is treatment discontinuation (predicting >6-month dropout, a clinically critical outcome in rare disease), where GEMEO beats the count-based probe by +0.142 AUROC.

| Task | n | GEMEO | Strong baseline | Margin | p |
|---|---|---|---|---|---|
| New-onset (Top-1) | 1,730 | 53.7% | 38.2% (frequency) | +15.5 pp | <0.001 |
| Will-change (AUROC) | 6,405 | 0.906 | 0.889 (count-based) | +0.017 | 0.003 |
| Transition within 12 mo (AUROC) | 6,405 | 0.827 | 0.790 (count-based) | +0.037 | <0.001 |
| Treatment discontinuation (AUROC) | 6,591 | 0.838 | 0.696 (count-based) | +0.142 | <0.001 |
| Next-procedure at transition (R@1) | 5,718 | 15.5% | 63.5% (bigram) | −48.1 pp | <0.001 |

The split is exactly what the 2026 EHR literature predicts: for single-step Markov transitions, immediate history dominates and a count-based bigram is near-optimal; world-model advantage emerges on novelty and long-context outcomes, which is precisely where GEMEO wins.
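
For readers unfamiliar with the count-based bigram that wins the single-step task, a minimal sketch follows. It predicts the most frequent successor of the previous code in the training data; the function names and the `fallback` parameter are hypothetical, and the benchmark's own baseline implementation may differ.

```python
from collections import Counter, defaultdict

def fit_bigram(trajectories):
    """Count how often each code is immediately followed by each other code."""
    counts = defaultdict(Counter)
    for codes in trajectories:
        for prev, nxt in zip(codes, codes[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_predict(counts, prefix, fallback=None):
    """Predict the most frequent successor of the last code in the prefix."""
    successors = counts.get(prefix[-1])
    if not successors:
        return fallback                      # last code never seen in training
    return successors.most_common(1)[0][0]

toy_counts = fit_bigram([["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]])
toy_prediction = bigram_predict(toy_counts, ["A", "B"])   # → "C"
```

The model conditions on one previous code only, which is why it is strong when the next step depends on immediate history and weak when the answer depends on events far back in the trajectory.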

Result 3, reasoning from the genome

Rare disease is largely monogenic, so GEMEO grounds the proposer in the genome via the genomic ensemble, validated on real ClinVar variants with reference-verified coordinates. Each model is evaluated on its domain.

| Model | Domain | Variant set | AUROC |
|---|---|---|---|
| AlphaMissense | missense | 451 variants, 20 rare-disease genes | 0.928 |
| Evo 2 7b (zero-shot) | coding / global | same 451 variants | 0.816 |
| AlphaGenome | splice / regulatory | 40 ClinVar splice variants | 0.734 |
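
The AUROC values above can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. A minimal pairwise sketch of that definition follows; the function name and the toy labels and scores are illustrative and are not taken from the evaluation data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for pathogenic, 0 for benign.
    scores: model pathogenicity scores; ties count as half a win.
    Quadratic in the number of variants, which is fine for a few hundred.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_auroc = auroc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])   # → 0.75
```

In the toy case three of the four pathogenic/benign pairs are ordered correctly, giving 0.75. A score of 0.5 corresponds to random ranking.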

Wiring the variant scorer into Pillar A closes the causal chain variant → gene → disease → trajectory: a real pathogenic variant recovers the correct rare disease at Top-1 = 7/7 across the cohort's causal genes (for example, a pathogenic FBN1 variant → Marfan syndrome, with a traceable gene → disease evidence path), while a benign control is correctly not flagged. The full genome-conditioned world model, in which per-patient variant embeddings enter Pillar B at the input, requires per-patient whole-exome data; the architecture and scorers are built and validated to receive it.

Result 4, the model uses the action (Level 2)

A world model must condition on actions, not merely on state. We test this directly on a treatment-vocabulary variant: every common 10-digit orphan-drug dispensing code becomes a distinct action token (65 actions). For 12,380 held-out patients who initiate a treatment at position k, we compare the model's likelihood of the patient's observed future under a prefix that contains the actual treatment token versus one in which it is masked.

Knowing the action improves the log-likelihood of the observed future by Δ logP = +2.28 (95% CI 2.19 to 2.36; the interval excludes zero), and the predicted future distribution shifts measurably when the action is removed (action-sensitivity KL = 0.10). The rollout is therefore genuinely action-conditioned: Level-2 capability is validated empirically against observational ground truth.
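
The two quantities reported here can be sketched as follows, assuming per-patient log-likelihoods of the observed future have already been computed with and without the action token. The percentile bootstrap, the direction of the KL divergence, the function names, and the toy numbers are assumptions; the text does not state how the interval or the KL was computed.

```python
import math
import random

def mean_delta_with_ci(logp_with_action, logp_masked, n_boot=2000, seed=0):
    """Mean per-patient gain in log-likelihood, with a percentile bootstrap 95% CI."""
    deltas = [a - b for a, b in zip(logp_with_action, logp_masked)]
    n = len(deltas)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_boot))
    lo = means[int(0.025 * n_boot)]
    hi = means[int(0.975 * n_boot) - 1]
    return sum(deltas) / n, (lo, hi)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two predicted next-event distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy log-likelihoods for four hypothetical patients (not real results).
toy_mean, toy_ci = mean_delta_with_ci(
    logp_with_action=[-1.0, -2.0, -1.5, -0.5],
    logp_masked=[-3.0, -4.5, -3.5, -3.0],
)
toy_kl = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

A positive mean with an interval that excludes zero indicates that the prefix containing the action explains the observed future better than the masked prefix. A non-zero KL shows that the predicted distribution itself moves when the action is hidden.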

RareBench-BR Trajectory, an autocorrelation-immune benchmark

Patient-trajectory prediction is dominated by event autocorrelation: a model that copies the patient's last code scores near-perfectly on naive next-event tasks, the pitfall that recurrence-aware modelling was introduced to address. No public rare-disease trajectory benchmark existed. We release RareBench-BR Trajectory v2: 44,051 CNS-linked Brazilian SUS rare-disease trajectories, five tasks, balanced/stratified splits, a geographic-external test, and mandatory count-based baselines (frequency, bigram, repeat-last). The benchmark is built so that a repeat-last oracle scores only 12.4% (not the ~99% it would score on a naive next-event track), which quantifies its immunity to autocorrelation.

Headline numbers
A world model that predicts new clinical events, beats every long-context baseline, and reasons from the genome.
53.7% · New-onset Top-1; +15.5 pp over the frequency baseline; CI excludes zero
+0.142 · Treatment-discontinuation AUROC margin over the count-based probe
0.93 · AlphaMissense AUROC on real ClinVar variants (rare-disease genes)
+2.28 · Δ logP; the model genuinely uses the action (Level 2)

Toward Level 3, a concrete validation programme

Level 1 and Level 2 are empirically validated; the remaining frontier is Level 3, counterfactual rollout for decision support. Reaching it requires two further experiments, at increasing cost:

  1. Synthetic counterfactual (Level 3 with known ground truth). Train GEMEO on trajectories generated by a structural causal model of a well-characterised disease (e.g., FBN1 → aortic dilation, with a known intervention) and test whether the model recovers the simulated treatment effect.
  2. RCT replication (Level 3, clinical ground truth). Estimate, by counterfactual rollout over a matched SUS cohort, the effect size of a published rare-disease trial (e.g., nusinersen in spinal muscular atrophy, eculizumab in atypical haemolytic-uraemic syndrome) and compare to the trial's hazard ratio.

Together with a genome-conditioned world model trained on a sequenced cohort, these define the path to decision-support-grade counterfactual capability.

Honest scope and limitations

(i) The flagship is trained on structured SUS events only; the genome-conditioned world model and interventional validation await a sequenced, multimodal substrate (All of Us, UK Biobank). (ii) For single-step Markov transitions, count-based baselines remain competitive; the world model's advantage is long-context. (iii) The genomic pillar is validated on hundreds of ClinVar variants per model; population-scale validation is future work. (iv) SUS mortality coding is coarse, bounding the survival head (C-index 0.70). (v) Counterfactual sign-agreement with a clinician panel is not yet a powered study.

Open release, the recipe, kept honest

Apache-2.0: architecture, reference implementation, conformance suite, and reproducers (github.com/rarasAI/gemeo and huggingface.co/Raras-AI/gemeo-arch). CC-BY-NC 4.0: model weights (huggingface.co/Raras-AI/gemeo-sus) and the rarebench-br-trajectory benchmark. Held back: the proprietary DATASUS extraction pipeline. Every result has a committed JSON and a single-GPU reproducer; a preflight conformance suite verifies, on every release, that every public number traces to a committed result file.

Data & ethics

Sources

The data used in this study comes exclusively from DATASUS’s open portal (SIH-RD, APAC-Medicamentos, SIM), released by Brazil’s Ministry of Health for transparency and research purposes.

Legal basis

Processing was performed under Art. 7, IV (studies by a research body) and Art. 11, II, items “c” and “f” (sensitive health data for public-health studies and research) of Brazil’s General Data Protection Law (LGPD, Federal Law 13.709/2018).

Pseudonymisation and k-anonymity

No personally identifiable data (names, taxpayer IDs, plaintext health-card numbers, or addresses) was accessed or processed. The AP_CNSPCN identifier is a hash of the National Health Card generated upstream by the public-health system itself; the longitudinal linkage of 42,265 patients is performed on this pre-existing pseudonym, with no re-identification. Ages are bucketed, residence is kept at state (UF) granularity, and a k-anonymity floor of ≥ 5 applies to every released artifact.
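
A k-anonymity floor of this kind can be checked with a few lines of code. In the sketch below, the choice of quasi-identifiers (age bucket and state), the suppression of small groups as the enforcement mechanism, and the function names are assumptions for illustration; the text does not describe how the floor is enforced.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def suppress_small_groups(records, quasi_identifiers, k=5):
    """Drop records whose quasi-identifier group has fewer than k members."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [r for r in records if groups[tuple(r[q] for q in quasi_identifiers)] >= k]

# Hypothetical release table: five records in one group, two in another.
toy_records = (
    [{"age_bucket": "30-39", "uf": "SP"}] * 5
    + [{"age_bucket": "40-49", "uf": "RJ"}] * 2
)
toy_released = suppress_small_groups(toy_records, ["age_bucket", "uf"], k=5)
```

Before suppression the toy table is only 2-anonymous; after dropping the two-record group, every remaining combination of age bucket and state is shared by at least five records.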

Non-reidentifiability

The model neither stores nor reconstructs individual patient trajectories; its predictions operate over aggregate embeddings and do not permit reverse inference to individuals.

Not a diagnosis

GEMEO is a research tool. Its predictions do not constitute medical diagnosis, prescription, or a substitute for clinical evaluation by a licensed professional.

Ethics review

Studies based exclusively on Brazil’s open and anonymised DATASUS records are exempt from CEP/CONEP ethics review pursuant to National Health Council Resolution 510/2016, Art. 1, sole paragraph, V.

Compliance

The authors declare compliance with Brazil’s LGPD, the Access to Information Law (Federal Decree 7.724/2012), and the DATASUS open-data usage policy.

GEMEO · Raras Health, Rare Disease Research · São Paulo, Brazil · 2026