A patient world model is a generative model of patient dynamics: a learned simulator that can roll out a trajectory under chosen actions and can therefore reason about counterfactuals. For rare disease, where the median diagnostic odyssey exceeds six years and the high-value clinical question is "if I change treatment, what changes in ten years?", a world model is the right object, and until now no one had built one.
World models have transformed video (Sora), interactive environments (Genie), and embodied agents (Dreamer V3), built on the backbone family that culminates in Diffusion Forcing. In medicine, general clinical world models have begun to appear (EHRWorld, CLARITY), but none targets rare disease, grounds itself in a biomedical knowledge graph, or conditions on the genome, the very signal that, for rare disease, is the root cause.
We introduce GEMEO: the first patient world model for rare disease.
The three pillars
GEMEO is a Propose → Simulate → Verify pipeline, instantiated against any electronic health record expressible in the Medical Event Data Standard (MEDS), that is, as a stream of (subject_id, time, code, value) tuples. The flagship, gemeo-sus, is trained on data from Brazil's Unified Health System (SUS).
Pillar A · Propose: genome- and graph-grounded onset
Given a patient's manifested state, Pillar A proposes clinical events the patient has not yet had: first-onset candidates that a repetition baseline cannot produce. It draws on two grounded sources.
Knowledge graph. We run Random-Walk-with-Restart (RWR) over a heterogeneous biomedical graph derived from PrimeKG, seeded on the patient's manifested diseases, phenotypes, and variant-bearing genes. The stationary distribution ranks unseen genes, phenotypes, and diseases by network proximity; already-manifested nodes are excluded, and each candidate ships with its shortest evidence path back to a seed.
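The ranking step can be illustrated with a minimal sketch, assuming an undirected, unweighted toy graph and a uniform restart vector over the seeds. The function names, the restart probability of 0.3, and the toy edges are illustrative assumptions, not GEMEO's implementation, and the evidence-path extraction is omitted.

```python
from collections import defaultdict

def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * r + (1 - restart) * W p until convergence."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = sorted(neighbours)
    # Restart vector r: uniform mass on the seed nodes (assumption).
    r = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(r)
    for _ in range(max_iter):
        nxt = {n: restart * r[n] for n in nodes}
        for u in nodes:
            # Each node spreads its non-restart mass evenly over its neighbours.
            share = (1.0 - restart) * p[u] / len(neighbours[u])
            for v in neighbours[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

def propose(edges, seeds, top_k=3):
    """Rank unseen nodes by proximity; already-manifested seeds are excluded."""
    scores = random_walk_with_restart(edges, seeds)
    candidates = [(n, s) for n, s in scores.items() if n not in seeds]
    return sorted(candidates, key=lambda item: -item[1])[:top_k]

# Toy graph (hypothetical edges, for illustration only).
edges = [
    ("FBN1", "Marfan syndrome"),
    ("Marfan syndrome", "aortic dilation"),
    ("Marfan syndrome", "ectopia lentis"),
    ("aortic dilation", "aortic dissection"),
    ("ectopia lentis", "myopia"),
]
seeds = {"FBN1", "ectopia lentis"}
ranked = propose(edges, seeds)
```

In this toy graph the top-ranked candidate is the node adjacent to both seeds, which is the behaviour the proximity ranking is meant to capture.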
Genome. The patient's variants are scored by an ensemble of genomic foundation models, each on its domain: AlphaMissense for missense, Evo 2 (zero-shot likelihood delta) for coding and noncoding, and AlphaGenome for splice and regulatory effects. Variant pathogenicity reweights the RWR restart vector, so the genome steers the proposal toward the patient's actual molecular lesion.
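The text does not specify how pathogenicity enters the restart vector, so the following is only one plausible sketch: each variant-bearing gene seed is weighted by the maximum pathogenicity score among its variants, non-gene seeds keep a default weight, and the result is normalised. The function name, the max-pooling rule, the default weight, and the scores are all assumptions.

```python
def restart_vector(seeds, variant_scores, default_weight=1.0):
    """Build a normalised restart vector from seed nodes.

    seeds: iterable of seed node names (diseases, phenotypes, genes).
    variant_scores: gene -> list of pathogenicity scores in [0, 1] (hypothetical).
    """
    weights = {}
    for node in seeds:
        scores = variant_scores.get(node)
        # Gene seeds take their most pathogenic variant; other seeds keep the default.
        weights[node] = max(scores) if scores else default_weight
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("restart vector needs positive total mass")
    return {node: w / total for node, w in weights.items()}

# Hypothetical scores: one likely-pathogenic FBN1 variant, one benign TTN variant.
variant_scores = {"FBN1": [0.97, 0.10], "TTN": [0.05]}
r = restart_vector(["FBN1", "TTN", "ectopia lentis"], variant_scores)
```

Under this weighting, a gene carrying only benign variants contributes almost no restart mass, so the walk concentrates around the gene with the likely-pathogenic variant.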
Pillar B · Simulate: a recurrence-aware Causal Diffusion Forcing world model
The dynamics core is a Causal Diffusion Forcing transformer: each token receives an independent noise level σ ∼ 𝒰(0,1), unifying autoregressive prediction, full-sequence diffusion, and variable-horizon rollout in one model. The backbone is a 20M-parameter SwiGLU + RMSNorm + RoPE transformer with gated cross-attention to a PrimeKG ego-subgraph (tanh(α) gate, α initialised at zero, after Flamingo and Genie).
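As a rough illustration of the two mechanisms named above, the sketch below samples an independent noise level per token and applies a tanh(α)-gated residual. The linear interpolation schedule, the function names, and the toy embeddings are assumptions; the source does not state the noise schedule or the layer internals.

```python
import math
import random

def sample_noise_levels(seq_len, rng):
    """One independent noise level per token, sigma ~ U(0, 1)."""
    return [rng.random() for _ in range(seq_len)]

def corrupt(embeddings, sigmas, rng):
    """Noise each token embedding at its own level (assumed linear schedule)."""
    return [[(1.0 - s) * x + s * rng.gauss(0.0, 1.0) for x in token]
            for token, s in zip(embeddings, sigmas)]

def autoregressive_levels(seq_len, n_context):
    """Special case: clean history (sigma = 0), fully noised future (sigma = 1)."""
    return [0.0] * n_context + [1.0] * (seq_len - n_context)

def gated_residual(hidden, cross_attn_out, alpha):
    """Add graph cross-attention through a tanh(alpha) gate; alpha = 0 is a no-op."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

rng = random.Random(0)
embeddings = [[0.5, -1.0], [1.5, 0.2], [0.0, 0.3]]  # toy token embeddings
sigmas = sample_noise_levels(len(embeddings), rng)
noisy = corrupt(embeddings, sigmas, rng)
```

Setting every noise level to the same value corresponds to full-sequence diffusion, while `autoregressive_levels` corresponds to next-step prediction; initialising α at zero means the graph pathway contributes nothing at the start of training and is learned in gradually.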
The critical design choice is the training objective. In SUS orphan-drug data, only 17.8% of events are first occurrences; the other 82.2% are repeats (a patient on a monthly orphan drug receives the same dispensing code every month). A naive next-event loss therefore rewards copying. Following the recurrence-aware principle of RAVEN, we weight each token's cross-entropy by w = max(λ^count, w_min) with λ = 0.25, where count is the number of prior occurrences of that code in the patient's history. First occurrences carry full weight (λ^0 = 1), and repeats decay geometrically toward the floor w_min. This is the single decisive lever.
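The weighting rule can be sketched in a few lines, assuming that count is the number of prior occurrences of the same code in the patient's sequence and taking w_min = 0.01 as a placeholder (the source does not give its value). The function names are illustrative.

```python
from collections import Counter

def recurrence_weights(codes, lam=0.25, w_min=0.01):
    """w = max(lam ** count, w_min), with count = prior occurrences of the code."""
    seen = Counter()
    weights = []
    for code in codes:
        weights.append(max(lam ** seen[code], w_min))
        seen[code] += 1
    return weights

def weighted_cross_entropy(token_nll, weights):
    """Weight-normalised mean of per-token negative log-likelihoods."""
    return sum(w * nll for w, nll in zip(weights, token_nll)) / sum(weights)

codes = ["A", "A", "A", "B", "A", "A"]
weights = recurrence_weights(codes)
# weights == [1.0, 0.25, 0.0625, 1.0, 0.015625, 0.01]
```

With λ = 0.25 the fifth repeat of a code would already fall below the assumed floor, so a monthly dispensing code contributes almost nothing to the loss after the first few occurrences.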
Pillar C · Verify: an agentic, evidence-grounded panel
Candidates from Pillar A, scored by Pillar B, are adjudicated by a case-adaptive multi-agent panel in the lineage of CAMP, DAVP, ClinicalAgents, LA-MARRVEL, and DeepRare. Specialist agents (genetics, phenotype, network/RWR, genomic, world-model) each cast a three-valued vote (KEEP / REFUSE / NEUTRAL) derived from real graph edges, RWR proximity, variant pathogenicity, or model hazard, never free text. Votes are aggregated by weighted log-odds into a calibrated decision; every verdict carries its evidence path.
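A weighted log-odds aggregation of three-valued votes might look like the sketch below. The agent weights, the zero prior, and the 0.5 threshold are hypothetical, and the calibration step mentioned above is not shown.

```python
import math

VOTE_SIGN = {"KEEP": 1.0, "REFUSE": -1.0, "NEUTRAL": 0.0}

def aggregate(votes, weights, prior_log_odds=0.0, threshold=0.5):
    """Sum signed, weighted votes in log-odds space and map to a probability."""
    log_odds = prior_log_odds + sum(
        weights[agent] * VOTE_SIGN[vote] for agent, vote in votes.items()
    )
    prob = 1.0 / (1.0 + math.exp(-log_odds))
    decision = "KEEP" if prob >= threshold else "REFUSE"
    return log_odds, prob, decision

# Hypothetical per-agent weights and votes for one candidate.
weights = {"genetics": 1.2, "phenotype": 0.8, "network": 0.5,
           "genomic": 1.0, "world_model": 0.6}
votes = {"genetics": "KEEP", "phenotype": "KEEP", "network": "NEUTRAL",
         "genomic": "KEEP", "world_model": "REFUSE"}
log_odds, prob, decision = aggregate(votes, weights)
```

A NEUTRAL vote leaves the log-odds unchanged, so an agent with no evidence for a given case neither supports nor blocks the candidate.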
Where GEMEO stands on the world-model rubric
We position GEMEO precisely on the clinical world-model capability rubric of Qazi et al. (NeurIPS 2025), which grades world models from Level 1 (state-conditioned temporal prediction) to Level 4 (planning/control).
The data substrate
The cohort comprises 42,265 patients with ≥4 events and ≥2 distinct codes, drawn from CNS-hash-linked SUS subsystems (high-complexity outpatient care, APAC-SIA; hospitalisations, SIH; mortality, SIM) and exported to MEDS v0.4.1 with canonical code namespaces (ICD10//, SIGTAP//, APAC//, ORPHA, MEDS_*). De-identification: ages are bucketed, residence is kept at state (UF) granularity, the CNS is hashed, and k-anonymity ≥ 5 is enforced. The split is patient-level, 70/15/15 train/val/test.
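The inclusion rule and a patient-level split can be sketched over MEDS-style tuples as follows. The hash-based split assignment is an assumption (the source states only the 70/15/15 ratio), and the subject IDs and codes are placeholders.

```python
import hashlib
from collections import defaultdict

def eligible_subjects(events, min_events=4, min_codes=2):
    """Keep subjects with >= min_events events and >= min_codes distinct codes."""
    codes = defaultdict(list)
    for subject_id, _time, code, _value in events:
        codes[subject_id].append(code)
    return {s for s, cs in codes.items()
            if len(cs) >= min_events and len(set(cs)) >= min_codes}

def assign_split(subject_id, train=0.70, val=0.15):
    """Deterministic patient-level split: all events of a subject share one split."""
    digest = hashlib.sha256(str(subject_id).encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 16 ** 8  # deterministic value in [0, 1)
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"

# (subject_id, time, code, value) tuples with placeholder codes.
events = (
    [("p1", t, c, None) for t, c in enumerate(["ICD10//A", "APAC//B", "APAC//B", "APAC//B"])]
    + [("p2", t, "APAC//B", None) for t in range(4)]        # only one distinct code
    + [("p3", t, c, None) for t, c in enumerate(["ICD10//A", "APAC//B"])]  # too few events
)
cohort = eligible_subjects(events)
```

Splitting on the subject identifier rather than on events keeps a patient's whole trajectory in a single partition, which prevents leakage between train and test.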
Result 1, the world model predicts genuinely new events
The decisive metric is new-onset prediction: scoring only positions where the true token is a first occurrence (repeats excluded). Recurrence-aware GEMEO attains a Top-1 of 53.7% (95% CI 51.4–56.1) versus 38.2% for a frequency baseline (+15.5 pp; the baseline lies well outside the interval). An ablation without the recurrence-aware objective falls below the baseline, confirming that the objective, not autocorrelation, drives the result.
| Model | New-onset Top-1 | vs frequency (38.2%) |
|---|---|---|
| Frequency baseline | 38.2% | reference |
| GEMEO (flat loss, ablation) | 14.6% [13.5, 15.8] | below baseline |
| GEMEO (recurrence-aware) | 53.7% [51.4, 56.1] | +15.5 pp |
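The new-onset metric can be made concrete with a short sketch, assuming that the first event of each sequence is skipped because it has no history to condition on. The function name and the toy sequences are illustrative.

```python
def new_onset_top1(sequences, predict):
    """Top-1 accuracy restricted to positions whose true code is a first occurrence."""
    hits = total = 0
    for seq in sequences:
        seen = set()
        for i, code in enumerate(seq):
            if i > 0 and code not in seen:
                total += 1
                hits += int(predict(seq[:i]) == code)
            seen.add(code)
    return hits / total if total else float("nan")

sequences = [["A", "A", "B", "A", "C"], ["A", "B", "B"]]

def always_b(history):
    """Stand-in for a frequency-style predictor."""
    return "B"

def repeat_last(history):
    """Copies the most recent code."""
    return history[-1]

score_const = new_onset_top1(sequences, always_b)      # 2 of 3 new-onset positions
score_repeat = new_onset_top1(sequences, repeat_last)  # 0.0 by construction
```

A copying predictor scores exactly zero here, because a first occurrence can never equal a code already present in the history; this is why the metric isolates genuine novelty.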
An ablation grid (n = 3,405 new-onset positions, frequency baseline 37.5%) isolates what drives novelty prediction: removing the recurrence-aware loss collapses new-onset to 8–15%, below the frequency baseline (the model reverts to copying repeats); adding it lifts performance to 55–59%, a swing exceeding 40 percentage points. Positional features are roughly neutral on novelty.
| Recurrence-aware | Positional features | New-onset Top-1 |
|---|---|---|
| ✓ | ✓ | 55.1% [53.5, 56.7] |
| ✓ | ✗ | 58.8% [57.1, 60.4] |
| ✗ | ✓ | 14.6% [13.5, 15.8] |
| ✗ | ✗ | 7.6% [6.8, 8.5] |
Result 2, the world model wins every long-context task
On RareBench-BR Trajectory, under EHRSHOT-style frozen-representation linear probes with mandatory count-based baselines on the same candidate space, GEMEO's learned representation leads on every novelty and long-context task. The standout is treatment discontinuation (predicting >6-month dropout, a clinically critical outcome in rare disease), where GEMEO beats the count-based probe by +0.142 AUROC.
| Task | n | GEMEO | Strong baseline | Margin | p |
|---|---|---|---|---|---|
| New-onset (Top-1) | 1,730 | 53.7% | 38.2% (frequency) | +15.5 pp | <0.001 |
| Will-change (AUROC) | 6,405 | 0.906 | 0.889 (count-based) | +0.017 | 0.003 |
| Transition within 12 mo (AUROC) | 6,405 | 0.827 | 0.790 (count-based) | +0.037 | <0.001 |
| Treatment discontinuation (AUROC) | 6,591 | 0.838 | 0.696 (count-based) | +0.142 | <0.001 |
| Next-procedure at transition (R@1) | 5,718 | 15.5% | 63.5% (bigram) | −48.1 pp | <0.001 |
The split is exactly what the 2026 EHR literature predicts: for single-step Markov transitions, immediate history dominates and a count-based bigram is near-optimal; world-model advantage emerges on novelty and long-context outcomes, which is precisely where GEMEO wins.
Result 3, reasoning from the genome
Rare disease is largely monogenic, so GEMEO grounds the proposer in the genome via the genomic ensemble, validated on real ClinVar variants with reference-verified coordinates. Each model is evaluated on its domain.
| Model | Domain | Variant set | AUROC |
|---|---|---|---|
| AlphaMissense | missense | 451 variants, 20 rare-disease genes | 0.928 |
| Evo 2 7b (zero-shot) | coding / global | same 451 variants | 0.816 |
| AlphaGenome | splice / regulatory | 40 ClinVar splice variants | 0.734 |
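For reference, AUROC is the probability that a randomly chosen pathogenic variant is scored above a randomly chosen benign one, with ties counted as one half. A minimal pairwise sketch follows; the scores and labels are made up for illustration.

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC; labels are 1 (pathogenic) or 0 (benign)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical pathogenicity scores for four variants.
value = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 3 of 4 pairs ordered correctly
```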
Wiring the variant scorer into Pillar A closes the causal chain variant → gene → disease → trajectory: a real pathogenic variant recovers the correct rare disease at Top-1 = 7/7 across the cohort's causal genes (for example, a pathogenic FBN1 variant → Marfan syndrome, with a traceable gene → disease evidence path), while a benign control is correctly not flagged. The full genome-conditioned world model (per-patient variant embeddings entering Pillar B at the input) requires per-patient whole-exome data; the architecture and scorers are built and validated to receive it.
Result 4, the model uses the action (Level 2)
A world model must condition on actions, not merely on state. We test this directly on a treatment-vocabulary variant in which every common 10-digit orphan-drug dispensing code becomes a distinct action token (65 actions). For 12,380 held-out patients who initiate a treatment at position k, we compare the model's likelihood of the patient's observed future under a prefix that contains the actual treatment token versus one in which that token is masked.
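The comparison can be sketched as a per-patient log-likelihood difference. Here `toy_log_likelihood` is a hypothetical stand-in for the trained model, and the mask token, code names, and summary statistics are assumptions made for illustration.

```python
import math

def action_effect(patients, log_likelihood, mask_token="<MASK>"):
    """Mean gain in log-likelihood of the observed future when the action is visible."""
    deltas = []
    for prefix, k, future in patients:
        masked = prefix[:k] + [mask_token] + prefix[k + 1:]
        deltas.append(log_likelihood(future, prefix) - log_likelihood(future, masked))
    mean_delta = sum(deltas) / len(deltas)
    frac_positive = sum(d > 0 for d in deltas) / len(deltas)
    return mean_delta, frac_positive

def toy_log_likelihood(future, prefix):
    """Hypothetical scorer: a code is more likely if it already appears in the prefix."""
    return sum(math.log(0.8 if code in prefix else 0.2) for code in future)

# One toy patient: treatment token at position k = 1, then two refills.
patients = [(["ICD10//A", "ACTION//X"], 1, ["ACTION//X", "ACTION//X"])]
mean_delta, frac_positive = action_effect(patients, toy_log_likelihood)
```

A model that ignores the action token would produce a difference near zero; a positive mean difference indicates that the treatment token carries information about the observed future.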
RareBench-BR Trajectory, an autocorrelation-immune benchmark
Patient-trajectory prediction is dominated by event autocorrelation: a model that copies the patient's last code scores near-perfectly on naive next-event tasks, the documented pitfall that recurrence-aware modelling is designed to avoid. No public rare-disease trajectory benchmark existed. We release RareBench-BR Trajectory v2: 44,051 CNS-linked Brazilian SUS rare-disease trajectories, five tasks, balanced/stratified splits, a geographic-external test, and mandatory count-based baselines (frequency, bigram, repeat-last). The benchmark is built so that a repeat-last baseline scores only 12.4% (not the ~99% it would score on a naive next-event track), the quantitative proof of autocorrelation-immunity.
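The three count-based baselines are simple enough to sketch directly. The versions below are minimal illustrations under assumed tie-breaking and fallback rules, not the benchmark's reference code.

```python
from collections import Counter, defaultdict

def fit_frequency(train):
    """Always predict the most frequent code in the training sequences."""
    counts = Counter(code for seq in train for code in seq)
    top = counts.most_common(1)[0][0]
    return lambda history: top

def fit_bigram(train):
    """Predict the most frequent successor of the last code; fall back to frequency."""
    successors = defaultdict(Counter)
    for seq in train:
        for prev, nxt in zip(seq, seq[1:]):
            successors[prev][nxt] += 1
    fallback = fit_frequency(train)
    def predict(history):
        if history and history[-1] in successors:
            return successors[history[-1]].most_common(1)[0][0]
        return fallback(history)
    return predict

def repeat_last(history):
    """Copy the most recent code."""
    return history[-1] if history else None

def naive_next_event_accuracy(sequences, predict):
    """Score every position after the first, repeats included."""
    pairs = [(seq[:i], seq[i]) for seq in sequences for i in range(1, len(seq))]
    return sum(predict(h) == y for h, y in pairs) / len(pairs)

train = [["A", "A", "A", "B"], ["A", "A", "C"]]
acc_repeat = naive_next_event_accuracy(train, repeat_last)  # 3 of 5 positions
```

Even on this tiny example, copying the last code is right on most positions of a naive next-event track, which illustrates why the benchmark reports these baselines alongside every model.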
Toward Level 3, a concrete validation programme
Level 1 and Level 2 are empirically validated; the remaining frontier is Level 3, counterfactual rollout for decision support. Reaching it requires two further experiments, at increasing cost:
- Synthetic counterfactual (Level 3 with known ground truth). Train GEMEO on trajectories generated by a structural causal model of a well-characterised disease (e.g., FBN1 → aortic dilation, with a known intervention) and test whether the model recovers the simulated treatment effect.
- RCT replication (Level 3, clinical ground truth). Estimate, by counterfactual rollout over a matched SUS cohort, the effect size of a published rare-disease trial (e.g., nusinersen in spinal muscular atrophy, eculizumab in atypical haemolytic-uraemic syndrome) and compare to the trial's hazard ratio.
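The synthetic experiment depends on a data generator whose treatment effect is known exactly. The sketch below is a deliberately small structural causal model with one confounder; all coefficients, variable names, and the stratified reference estimator are hypothetical, and in the actual programme GEMEO's counterfactual rollout would take the place of the estimator.

```python
import random

def simulate(n, true_effect=-0.5, seed=0):
    """severity -> treatment, severity -> growth, treatment -> growth (known effect)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        treated = rng.random() < (0.8 if severe else 0.2)  # confounded assignment
        growth = 1.0 + 1.0 * severe + true_effect * treated + rng.gauss(0.0, 0.3)
        rows.append((severe, treated, growth))
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_effect(rows):
    """Treated minus untreated mean outcome, ignoring the confounder."""
    return (mean([g for _, t, g in rows if t])
            - mean([g for _, t, g in rows if not t]))

def adjusted_effect(rows):
    """Stratify on severity, then average stratum effects by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += len(stratum) / len(rows) * naive_effect(stratum)
    return total

rows = simulate(20000)
naive = naive_effect(rows)        # biased upward by confounding
adjusted = adjusted_effect(rows)  # close to the simulated effect of -0.5
```

In this toy setting the naive contrast even has the wrong sign, because sicker patients are treated more often; a model passes the test only if its rollouts recover the simulated effect rather than the confounded association.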
Together with a genome-conditioned world model trained on a sequenced cohort, these define the path to decision-support-grade counterfactual capability.
Honest scope and limitations
(i) The flagship is trained on structured SUS events only; the genome-conditioned world model and interventional validation await a sequenced, multimodal substrate (All of Us, UK Biobank). (ii) For single-step Markov transitions, count-based baselines remain competitive; the world model's advantage is long-context. (iii) The genomic pillar is validated on hundreds of ClinVar variants per model; population-scale validation is future work. (iv) SUS mortality coding is coarse, bounding the survival head (C-index 0.70). (v) Counterfactual sign-agreement with a clinician panel is not yet a powered study.
Open release, the recipe, kept honest
Apache-2.0: architecture, reference implementation, conformance suite, and reproducers (github.com/rarasAI/gemeo and huggingface.co/Raras-AI/gemeo-arch). CC-BY-NC 4.0: model weights (huggingface.co/Raras-AI/gemeo-sus) and the rarebench-br-trajectory benchmark. Held back: the proprietary DATASUS extraction pipeline. Every result has a committed JSON and a single-GPU reproducer; a preflight conformance suite verifies, on every release, that every public number traces to a committed result file.
Sources
The data used in this study comes exclusively from DATASUS’s open portal (SIH-RD, APAC-Medicamentos, SIM), released by Brazil’s Ministry of Health for transparency and research purposes.
Legal basis
Processing was performed under Art. 7, IV (studies by a research body) and Art. 11, II, items “c” and “f” (sensitive health data for public-health studies and research) of Brazil’s General Data Protection Law (LGPD, Federal Law 13.709/2018).
Pseudonymisation and k-anonymity
No personally identifiable data (names, taxpayer IDs, plaintext health-card numbers, or addresses) was accessed or processed. The AP_CNSPCN identifier is a hash of the National Health Card generated upstream by the public-health system itself; the longitudinal linkage of 42,265 patients is performed on this pre-existing pseudonym, with no re-identification. Ages are bucketed, residence is kept at state (UF) granularity, and a k-anonymity floor of ≥ 5 applies to every released artifact.
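A k-anonymity floor can be checked mechanically: every combination of quasi-identifiers must be shared by at least k records. The sketch below assumes age bucket and state (UF) as the quasi-identifiers and uses suppression of small groups as the remedy; both choices, and the field names, are illustrative assumptions.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest quasi-identifier group (0 for an empty release)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

def suppress_small_groups(records, quasi_identifiers, k=5):
    """Drop records whose quasi-identifier group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]

qi = ["age_bucket", "uf"]
records = ([{"age_bucket": "30-39", "uf": "SP"}] * 5
           + [{"age_bucket": "80+", "uf": "AC"}] * 2)
released = suppress_small_groups(records, qi, k=5)
```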
Non-reidentifiability
The model neither stores nor reconstructs individual patient trajectories; its predictions operate over aggregate embeddings and do not permit reverse inference to individuals.
Not a diagnosis
GEMEO is a research tool. Its predictions do not constitute medical diagnosis, prescription, or a substitute for clinical evaluation by a licensed professional.
Ethics review
Studies based exclusively on Brazil’s open and anonymised DATASUS records are exempt from CEP/CONEP ethics review pursuant to National Health Council Resolution 510/2016, Art. 1, sole paragraph, V.
Compliance
The authors declare compliance with Brazil’s LGPD, the Access to Information Law (Federal Decree 7.724/2012), and the DATASUS open-data usage policy.