
Araras HPO Brasil

Bilingual encoder · 110M params · Apache 2.0

The AI that understands how Brazilian patients actually speak, and translates it into science.

An open-source model that connects colloquial Brazilian phrases such as "água na cabeça", "pereba" and "esparro" to the Human Phenotype Ontology (HPO), the universal vocabulary of clinical phenotypes used in rare-disease research worldwide.

Authors · Raras-AI Team

Date · May 2026

What it does

14 real examples. In milliseconds, patient speech becomes an HPO term.

"pereba" → Eczematoid dermatitis · HP:0000964
"água na cabeça" → Hydrocephalus · HP:0000238
"esparro" → Seizure · HP:0001250
"corcunda" → Kyphosis · HP:0002808
"não vinga" → Failure to thrive · HP:0001508
"dança dos olhos" → Nystagmus · HP:0000639
"fígado inchado" → Hepatomegaly · HP:0002240
"molinho" → Muscular hypotonia · HP:0001252
"amarelão" → Jaundice · HP:0000952
"cabeção" → Macrocephaly · HP:0000256
"chiado no peito" → Wheezing · HP:0030828
"apagão" → Syncope · HP:0001279
"tremedeira" → Tremor · HP:0001337
"falta de ar" → Dyspnea · HP:0002094

The demo is interactive: try your own colloquial terms, including regional Brazilian slang. A sample result reads: Hydrocephalus · HP:0000238 · cosine similarity 0.94. Open the live demo on Hugging Face: huggingface.co/spaces/Raras-AI/araras-hpo-brasil-demo

A rare-disease diagnostic odyssey in Brazil starts with words. The child “não vinga.” The baby “tem água na cabeça.” The patient “esparra” at night. These words are real. But biomedical research speaks a different language: the standardised vocabulary of the Human Phenotype Ontology, with roughly 17,000 terms of formal clinical terminology. Between these two languages lies a gap that costs years of diagnostic delay.

The gap between speech and science

Picture a community health worker in rural Ceará writing a home-visit note about a two-year-old:

“Doctor, he has these esparros at night, gets all molinho afterwards, and his cabeção is growing too big.”

For that sentence to become traceable science, it has to resolve to three specific HPO terms: HP:0001250 (Seizure), HP:0001252 (Muscular hypotonia), and HP:0000256 (Macrocephaly). Each of those codes opens a continent of knowledge: medical literature, gene associations, clinical trials and reference centres. But the health worker doesn't speak HPO. The mother doesn't speak HPO. And AI models trained on English medical text, including the best ones available today, don't either.

The result: the most valuable clinical observation stays trapped in the very sentence that expressed it.

What HPO is, in one figure

The Human Phenotype Ontology is, in essence, the common dictionary of genomic medicine: roughly 17,000 terms organised in a hierarchy that runs from broad concepts (“abnormality of the nervous system”) down to highly specific descriptions (“low-frequency postural hand tremor”). When a geneticist describes a patient in HPO, they are speaking a language that any researcher in the world can understand and that diagnostic search algorithms can process.

Figure · Hierarchical view of HPO terms. The ontology is organised as a hierarchy (strictly a directed acyclic graph, since a term can have more than one parent). Each term is a child of broader terms and a parent of more specific ones. “Macrocephaly” sits under “Abnormality of the head”; “Tonic-clonic seizure” under “Seizure.” This structure lets diagnostic systems exploit kinship and hierarchical depth: if a patient has term X, its ancestors Y and Z apply as well.
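The ancestor rule in the figure can be sketched in a few lines. This is a minimal illustration, not the project's code: the `PARENTS` table is a hand-written, simplified slice of the hierarchy (labels only, intermediate levels omitted), and the function names are hypothetical.

```python
# Simplified toy slice of the hierarchy: each term maps to its direct parents.
# Intermediate levels are omitted; a real ontology file would supply these links.
PARENTS = {
    "Tonic-clonic seizure": ["Seizure"],
    "Seizure": ["Abnormality of the nervous system"],
    "Macrocephaly": ["Abnormality of the head"],
    "Abnormality of the nervous system": ["Phenotypic abnormality"],
    "Abnormality of the head": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expand(observed, parents=PARENTS):
    """Return the observed terms together with all of their ancestors."""
    expanded = set(observed)
    for term in observed:
        expanded |= ancestors(term, parents)
    return expanded


print(sorted(expand({"Tonic-clonic seizure"})))
```

Because the traversal keeps a `seen` set, it also works when a term has several parents, which is what distinguishes a graph from a strict tree.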

HPO is maintained by the Jackson Laboratory and continuously updated by the international community. In 2024 it received an official Portuguese translation through the Babelon HPO-PT project. But formal translation is not real speech. “Hydrocephalus” is in Babelon. “Água na cabeça” is not.

How Araras learned to listen

We started from BioLORD-2023, one of the strongest published biomedical sentence encoders, and ran multi-phase contrastive fine-tuning across four kinds of data:

35,000 · Canonical HPO pairs (name, synonyms, definitions)
23,885 · Official Portuguese translations (Babelon HPO-PT)
800+ · Colloquial Brazilian pairs, hand-curated, covering the NE, SE, S, CO and N regions
46,000 · IS_A relations from the HPO hierarchy, used for hierarchical regularisation

Training also automatically mines about 44,000 hard negatives (HPO terms that look similar to the input but are wrong) and uses a Multiple Negatives Ranking loss with cosine similarity. The result is a 110M-parameter encoder that projects any sentence into a 768-dimensional space where semantically equivalent symptoms sit close together, whatever the register (formal or colloquial) and whatever the language (PT or EN).
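For readers unfamiliar with the loss, here is a minimal pure-Python sketch of a Multiple Negatives Ranking objective over cosine similarities. It illustrates the general technique only: the function names, the toy 2-d vectors and the `scale` value of 20 are assumptions, not details taken from the Araras training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy in which anchor i must rank positives[i] first.

    Every anchor is scored against all positives in the batch plus any
    hard negatives; the other anchors' positives act as in-batch negatives.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy batch: two colloquial phrases and their matching HPO terms (made-up vectors).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
near_miss = [[0.9, 0.1]]  # a hard negative close to the first anchor

print(mnr_loss(anchors, positives))
print(mnr_loss(anchors, positives, hard_negatives=near_miss))
```

The second value is larger than the first: a hard negative that sits close to an anchor raises the loss, which is what pushes look-alike but wrong terms apart during training.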

What changed, in numbers

We evaluated Araras on five scenarios, comparing against BioLORD-2023 (the previous state-of-the-art). The most dramatic jump appears exactly where the field always failed: colloquial Portuguese.

Scenario	Metric	BioLORD-2023	Araras	Δ (pp)
RareBench (EN, n=13,763)	Top-1	95.01%	97.81%	+2.80
BR-PT formal (Babelon, n=7,142)	Top-1	17.22%	62.00%	+44.78
BR-PT colloquial (eval, n=24)	Top-1	4.17%	79.17%	+75.00
BR-PT colloquial (eval, n=24)	Top-5	4.17%	100.00%	+95.83
Brazilian clinical narratives (n=22 terms, 5 cases)	Acc@1	n/a	95.5%	n/a

The point is not just that Araras outperforms the previous state of the art: on this small evaluation set, colloquial Portuguese went from almost entirely missed (4.17% Top-1, or 1 of 24 phrases) to mostly solved (79.17% Top-1, or 19 of 24, and 100% Top-5). On real Brazilian clinical narratives it gets the right HPO term on the first guess 95.5% of the time (21 of 22 terms).
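The Top-1 and Top-5 figures are top-k accuracies. A minimal sketch of that metric follows; the function name and the two illustrative rankings are made up for the example and are not model output.

```python
def top_k_accuracy(ranked_predictions, gold_ids, k):
    """Share of queries whose gold HPO ID appears among the first k predictions."""
    if len(ranked_predictions) != len(gold_ids):
        raise ValueError("need exactly one gold ID per query")
    hits = sum(
        gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_ids)
    )
    return hits / len(gold_ids)


# Illustrative rankings, best guess first (invented for this example).
ranked = [
    ["HP:0000238", "HP:0000256"],  # gold term ranked first
    ["HP:0001252", "HP:0001250"],  # gold term ranked second
]
gold = ["HP:0000238", "HP:0001250"]

print(top_k_accuracy(ranked, gold, k=1))  # 0.5
print(top_k_accuracy(ranked, gold, k=2))  # 1.0

# With n=24, each phrase is worth 100/24 points, which matches the table:
print(round(100 * 1 / 24, 2), round(100 * 19 / 24, 2))  # 4.17 79.17
```

On a 24-phrase set a single additional hit or miss moves the score by about 4.17 points, which is worth keeping in mind when reading the colloquial rows.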

What it’s for, in practice

Five concrete uses already shipping or in development inside the Raras ecosystem:

Patient digital health card. When a family records “tem água na cabeça” in the Raras app, the backend automatically converts it to HP:0000238, and from there the system can suggest differential diagnoses, patient communities with the same condition, and relevant reference centres.
Community health worker (ACS) notes. More than 280,000 ACS write home-visit notes every month in Brazil’s public-health system, in regional colloquial Portuguese. Today those notes are dead information. With Araras they become structured data, without the ACS having to change how they write.
Rare-disease diagnostic pipelines. Araras is the first stage (encoder) in pipelines like our own RarasNet Swarm, feeding retrievers over the HPO graph and Bayesian rankers like LIRICAL or textual ones like PubCaseFinder.
Multilingual semantic search. Brazilian researchers can now search English literature using colloquial Portuguese terms, and still recover the right paper.
Annotation of Portuguese-language biomedical literature. For the growing ecosystem of clinical papers published in Portuguese (especially public-health master’s and PhD theses), Araras serves as an automatic phenotype annotator.
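All five uses rest on the same step: embed the phrase, then pick the nearest HPO term by cosine similarity. The sketch below shows that lookup with toy 3-d vectors standing in for the real 768-dimensional embeddings; the vectors, the `TERM_VECS` table and the function names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_terms(query_vec, term_vecs, k=5):
    """Return the k HPO IDs most similar to the query, best first, with scores."""
    scored = [(cosine(query_vec, vec), hpo_id) for hpo_id, vec in term_vecs.items()]
    scored.sort(reverse=True)
    return [(hpo_id, score) for score, hpo_id in scored[:k]]


# Made-up 3-d vectors; in practice these would be precomputed term embeddings.
TERM_VECS = {
    "HP:0000238": [0.9, 0.1, 0.0],  # Hydrocephalus
    "HP:0001250": [0.0, 1.0, 0.1],  # Seizure
    "HP:0000256": [0.7, 0.0, 0.7],  # Macrocephaly
}

query = [1.0, 0.2, 0.1]  # stand-in for the embedding of a patient phrase

for hpo_id, score in nearest_terms(query, TERM_VECS, k=3):
    print(hpo_id, round(score, 2))
```

Returning the score alongside the ID lets a downstream system set its own threshold for sending low-confidence matches to human review.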

Open source, from day zero

Araras HPO Brasil is available now at huggingface.co/Raras-AI/araras-hpo-brasil under the Apache 2.0 license. The model inherits the license from BioLORD-2023 and can be used in commercial and academic settings, subject to the license's attribution and notice terms. Weights, training code, and benchmarks are public.
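A minimal loading sketch, assuming the weights are published in sentence-transformers format (the model card is the authority on this). Only `SentenceTransformer` and its `encode` method come from that library; `load_encoder`, `embed`, `dot` and `demo` are hypothetical helper names.

```python
MODEL_ID = "Raras-AI/araras-hpo-brasil"  # repository named in the text above


def load_encoder(model_id=MODEL_ID):
    """Load the model; assumes a sentence-transformers compatible repository."""
    # Imported lazily so the helpers below work without the dependency installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id)


def embed(phrases, encoder):
    """Encode phrases into unit-length vectors, one per phrase."""
    return encoder.encode(list(phrases), normalize_embeddings=True)


def dot(u, v):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(float(a) * float(b) for a, b in zip(u, v))


def demo():
    """Not called automatically: needs the package and a network connection."""
    encoder = load_encoder()
    patient, formal = embed(["água na cabeça", "Hydrocephalus"], encoder)
    print(dot(patient, formal))
```

Calling `demo()` downloads the weights on first use. Passing the encoder in as an argument keeps `embed` easy to test with a stand-in object.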

Three extensions are in the works:

Full multilingual version: colloquial Spanish (Mexico, Argentina, Colombia) and colloquial French (Maghreb, Quebec), starting with countries whose public-health systems mirror Brazil’s SUS.
Instruct version that returns an explanation with each prediction (“I think this is hydrocephalus, based on the description of intracranial pressure”).
Federated training across the 45 reference centres of RARAS-BRDN, refining the model on real clinical narratives without moving any patient data.

Data & ethics

Sources

Training data comes exclusively from public sources: the HPO ontology (Jackson Laboratory, CC-BY 4.0), its official Babelon HPO-PT translations (CC-BY 4.0), and colloquial terms collected by internal curation without any real patient data.

Pseudonymisation

No personally identifiable data was used in any phase of training. The Brazilian clinical examples used in validation (n=22 HPOs, 5 real cases) were provided by medical collaborators with explicit consent and fully de-identified before use.

Not a diagnosis

Araras is a term encoder, not a diagnostic system. It answers the question “which HPO term is closest to this phrase?”, not “what disease does this patient have?”. Its output should always be reviewed by a licensed health professional before any clinical decision.

Known biases

The model inherits biases from BioLORD-2023 (trained predominantly on Anglophone literature) and from HPO itself (catalogued predominantly in European populations). Brazilian colloquial coverage was deliberately biased toward under-represented regions (Northeast, North), but gaps remain. Reports of uncovered terms are welcome via issues on the repository.

License and attribution

Apache 2.0, inherited from BioLORD-2023. Recommended citation available on the model card.
