Medplexity explorer

Explore the performance of LLMs on medical benchmarks.

Datasets

Config

Dataset of consumer health questions released by Google for the Med-PaLM paper. The HealthSearchQA dataset consists of 3,173 commonly searched consumer health questions, curated using seed medical conditions and their associated symptoms to reflect real-world consumer concerns in the healthcare domain. Paper: Singhal, K., Azizi, S., Tu, T. et al., "Large Language Models Encode Clinical Knowledge" (2022). https://arxiv.org/abs/2212.13138

Model

See GitHub for more information on how to run your own evaluations.

Note: we use only a small sample (~50 examples) from the larger datasets. LLM predictions aren't deterministic, so you may see different results each time you run the model. These examples are meant to help build an intuition for how models answer, not to support conclusions about overall performance.
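As a rough illustration of why a sample of about 50 examples is too small for firm conclusions, here is a minimal Python sketch. It is not the explorer's actual code: the helper names `sample_subset` and `accuracy_margin`, the seed, the placeholder question IDs, and the 0.8 accuracy figure are all assumptions made for the example. The margin uses the normal approximation to a binomial proportion, which only applies to benchmarks scored as right or wrong.

```python
import math
import random


def sample_subset(examples, k=50, seed=0):
    """Draw up to k examples without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(examples, min(k, len(examples)))


def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


# Hypothetical stand-in for a larger benchmark: placeholder question IDs.
questions = [f"question-{i}" for i in range(3173)]

subset = sample_subset(questions, k=50, seed=0)
margin = accuracy_margin(0.8, len(subset))  # 0.8 is an assumed, illustrative accuracy

print(f"{len(subset)} examples, margin of error about +/- {margin:.2f}")
```

Under these assumptions, an accuracy of 0.8 measured on 50 examples carries a margin of roughly 0.11 in either direction, before accounting for run-to-run variation in the model's answers.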
