14 real examples. In milliseconds, patient speech becomes an HPO term.
A rare-disease diagnostic odyssey in Brazil starts with words. The child “não vinga.” The baby “tem água na cabeça.” The patient “esparra” at night. These words are real. But biomedical research speaks a different language, the standardised vocabulary of the Human Phenotype Ontology, with 17,000 terms in medical Latin. Between these two languages lies a gap that costs years of diagnostic delay.
The gap between speech and science
Picture a community health worker in rural Ceará writing a home-visit note about a two-year-old:
For that sentence to become traceable science, it has to resolve to three specific HPO terms: HP:0001250 (Seizure), HP:0001252 (Muscular hypotonia), and HP:0000256 (Macrocephaly). Each of those codes opens a continent of medical literature, gene associations, clinical trials, reference centres. But the health worker doesn't speak HPO. The mother doesn't speak HPO. And AI models trained on English medical text, including the best ones available today, don't either.
The result: the most valuable clinical observation stays trapped in the very sentence that expressed it.
What HPO is, in one figure
The Human Phenotype Ontology is, in essence, the common dictionary of genomic medicine: roughly 17,000 terms organised in a hierarchical tree that runs from broad concepts (“abnormality of the nervous system”) down to highly specific descriptions (“low-frequency postural hand tremor”). When a geneticist describes a patient in HPO, they are speaking a language any researcher in the world can understand, and that diagnostic search algorithms can process.
HPO is maintained by the Jackson Laboratory and continuously updated by the international community. In 2024 it received an official Portuguese translation through the Babelon HPO-PT project. But formal translation is not real speech. “Hydrocephalus” is in Babelon. “Água na cabeça” is not.
How Araras learned to listen
We started from BioLORD-2023, the best biomedical encoder published in 2024, and ran multi-phase contrastive fine-tuning across four kinds of data:
Training also automatically mines ~44,000 hard negatives (HPO terms that look similar to the input but are wrong) and uses a Multiple Negatives Ranking loss with cosine similarity. The result is a 110M-parameter encoder that projects any sentence into a 768-dimensional space where semantically equivalent symptoms, in any register, formal or colloquial, in PT or EN, sit close together.
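To make the loss concrete, here is a minimal pure-Python sketch of a Multiple Negatives Ranking objective with cosine similarity. It is not the Araras training code: the function names, the toy 2-dimensional vectors, and the `scale` value of 20 are all assumptions chosen for illustration. Each anchor phrase has to pick out its own positive HPO term from a pool made of every positive in the batch plus any mined hard negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must select positive i.

    The candidate pool is every positive in the batch followed by the
    hard negatives, so the other anchors' positives act as negatives too.
    `scale` multiplies the similarities before the softmax; 20.0 is an
    assumed value, not one taken from the article.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(anchors)


# Toy vectors, invented for illustration only.
toy_anchors = [[1.0, 0.1], [0.0, 1.0]]    # embeddings of two input phrases
toy_positives = [[1.0, 0.0], [0.1, 1.0]]  # embeddings of their correct terms
toy_hard_negatives = [[0.7, 0.7]]         # a similar-looking but wrong term

aligned_loss = mnr_loss(toy_anchors, toy_positives, toy_hard_negatives)
swapped_loss = mnr_loss(toy_anchors, toy_positives[::-1], toy_hard_negatives)
```

With the pairs correctly aligned the loss is small. Swapping the positives, so that each phrase is paired with the wrong term, makes it much larger, and that gap is the signal contrastive fine-tuning pushes against.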
What changed, in numbers
We evaluated Araras on five scenarios, comparing against BioLORD-2023 (the previous state-of-the-art). The most dramatic jump appears exactly where the field always failed: colloquial Portuguese.
| Scenario | BioLORD-2023 | Araras | Δ |
|---|---|---|---|
| RareBench (EN, n=13,763), Top1 | 95.01% | 97.81% | +2.80 |
| BR-PT formal (Babelon, n=7,142), Top1 | 17.22% | 62.00% | +44.78 |
| BR-PT colloquial (n=24, eval), Top1 | 4.17% | 79.17% | +75.00 |
| BR-PT colloquial, Top5 | 4.17% | 100.00% | +95.83 |
| Brazilian clinical narratives (n=22, 5 cases), Acc@1 | — | 95.5% | — |
The point is not just that Araras is state-of-the-art: it is that colloquial Portuguese went from near-total failure (4.17% Top-1, one correct answer in 24) to nearly solved on the same 24-phrase evaluation set (79.17% Top-1, 100% Top-5). On real Brazilian clinical narratives (22 HPO terms across 5 cases) it gets the right HPO term on the first guess 95.5% of the time.
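For readers who want to check the arithmetic, the sketch below shows how Top-k accuracy of this kind is typically computed. The helper name and the toy rankings are hypothetical, and the hit counts (1 and 19 of 24, 21 of 22) are inferred from the reported percentages and sample sizes, not published separately.

```python
def top_k_accuracy(ranked_predictions, gold_terms, k):
    """Fraction of queries whose gold HPO ID appears in the first k predictions."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_terms)
        if gold in ranked[:k]
    )
    return hits / len(gold_terms)


# Hypothetical rankings for three phrases (not real model output).
toy_rankings = [
    ["HP:0000238", "HP:0000256"],  # gold term ranked first
    ["HP:0001252", "HP:0001250"],  # gold term ranked second
    ["HP:0000256", "HP:0001252"],  # gold term missing from the list
]
toy_gold = ["HP:0000238", "HP:0001250", "HP:0000238"]

toy_top1 = top_k_accuracy(toy_rankings, toy_gold, k=1)
toy_top2 = top_k_accuracy(toy_rankings, toy_gold, k=2)

# Hit counts implied by the reported percentages and sample sizes.
baseline_colloquial_top1 = round(100 * 1 / 24, 2)
araras_colloquial_top1 = round(100 * 19 / 24, 2)
araras_narratives_acc1 = round(100 * 21 / 22, 1)
```

At n=24, each phrase is worth about 4.17 percentage points, which is worth keeping in mind when reading the colloquial rows.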
What it’s for, in practice
Five concrete uses already shipping or in development inside the Raras ecosystem:
- Patient digital health card. When a family records “tem água na cabeça” in the Raras app, the backend automatically converts it to HP:0000238, and from there the system can suggest differential diagnoses, patient communities with the same condition, and relevant reference centres.
- Community health worker (ACS) notes. More than 280,000 ACS write home-visit notes every month in Brazil’s public-health system, in regional colloquial Portuguese. Today those notes are dead information. With Araras they become structured data, without the ACS having to change how they write.
- Rare-disease diagnostic pipelines. Araras is the first stage (encoder) in pipelines like our own RarasNet Swarm, feeding retrievers over the HPO graph and Bayesian rankers like LIRICAL or textual ones like PubCaseFinder.
- Multilingual semantic search. Brazilian researchers can now search English literature using colloquial Portuguese terms, and still recover the right paper.
- Annotation of Portuguese-language biomedical literature. For the growing ecosystem of clinical papers published in Portuguese (especially public-health master’s and PhD theses), Araras serves as an automatic phenotype annotator.
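All five uses rest on the same basic step: embed the phrase, then rank HPO terms by cosine similarity. The sketch below illustrates that step with hand-made 3-dimensional toy vectors standing in for real embeddings. The function names and the vectors are assumptions for illustration; in practice the vectors would come from the encoder and have 768 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_terms(query_vec, term_vecs, k=5):
    """Return the k (hpo_id, similarity) pairs closest to the query, best first."""
    scored = [(term_id, cosine(query_vec, vec)) for term_id, vec in term_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration; they are NOT real model embeddings.
TOY_TERM_VECS = {
    "HP:0000238": [0.9, 0.1, 0.0],  # Hydrocephalus
    "HP:0000256": [0.6, 0.4, 0.0],  # Macrocephaly
    "HP:0001250": [0.0, 0.1, 0.9],  # Seizure
}

# Stand-in for the embedding of "tem água na cabeça".
toy_query = [0.8, 0.2, 0.1]

ranking = nearest_terms(toy_query, TOY_TERM_VECS, k=2)
for term_id, score in ranking:
    print(term_id, round(score, 3))
```

Returning a ranked shortlist, not a single code, keeps the runner-up terms visible to whoever reviews the suggestion.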
Open source, from day zero
Araras HPO Brasil is available now at huggingface.co/Raras-AI/araras-hpo-brasil under the Apache 2.0 license. The model inherits the license from BioLORD-2023 and can be used in commercial and academic production under the terms of that license. Weights, training code, and benchmarks are public.
Three extensions are in the works:
- Full multilingual version: colloquial Spanish (Mexico, Argentina, Colombia) and colloquial French (Maghreb, Quebec), starting with countries whose public-health systems mirror Brazil’s SUS.
- Instruct version with returned explanation (“I think this is hydrocephalus, based on the description of intracranial pressure”).
- Federated training across the 45 reference centres of RARAS-BRDN, refining the model on real clinical narratives without moving any patient data.
Sources
Training data comes from public sources, namely the HPO ontology (Jackson Laboratory, CC-BY 4.0) and its official Babelon HPO-PT translations (CC-BY 4.0), plus colloquial terms collected by internal curation. None of it contains real patient data.
Pseudonymisation
No personally identifiable data was used in any phase of training. The Brazilian clinical examples used in validation (n=22 HPO terms, 5 real cases) were provided by medical collaborators with explicit consent and fully de-identified before use.
Not a diagnosis
Araras is a term encoder, not a diagnostic system. It answers the question “which HPO term is closest to this phrase?”, not “what disease does this patient have?”. Its output should always be reviewed by a licensed health professional before any clinical decision.
Known biases
The model inherits biases from BioLORD-2023 (trained predominantly on Anglophone literature) and from HPO itself (catalogued predominantly in European populations). Brazilian colloquial coverage was deliberately weighted toward under-represented regions (Northeast, North), but gaps remain. Reports of uncovered terms are welcome via issues on the repository.
License and attribution
Apache 2.0, inherited from BioLORD-2023. Recommended citation available on the model card.