This comparative study investigates how three large language models (LLMs), GPT-4o-Mini, Gemini 1.5 Flash, and Llama 3.1, perform at extracting structured clinical evidence from health technology assessment (HTA) reports. Using a shared prompt and JSON schema, the study measured similarity scores, output validity, and level of detail across three evidence types: clinical trials, real-world evidence, and indirect treatment comparisons.
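The poster does not specify which similarity metric was used to compare model outputs against a reference extraction. Purely as an illustration, a simple character-level similarity score of the kind such comparisons often start from can be computed with Python's standard-library `difflib` (the example strings are hypothetical, not taken from the study):

```python
from difflib import SequenceMatcher

def similarity(reference: str, extracted: str) -> float:
    """Character-level similarity ratio between 0.0 and 1.0.

    SequenceMatcher computes 2*M/T, where M is the number of
    matching characters and T the total length of both strings.
    """
    return SequenceMatcher(None, reference, extracted).ratio()

# Hypothetical endpoint strings from a reference and a model output
score = similarity("overall survival", "overall survival (OS)")
print(f"{score:.2f}")  # a high score, since one string contains the other
```

In practice, evidence-extraction evaluations often move beyond raw string similarity to field-level or semantic comparisons, but a ratio like this is a common first-pass signal.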
Findings showed varying degrees of JSON formatting success, with GPT-4o-Mini producing the most consistently valid output and Gemini 1.5 Flash offering the most detailed extractions. However, inconsistencies in endpoint interpretation, output accuracy, and data completeness suggest caution when deploying LLMs in HTA workflows. Presented at ISPOR 2025, this work highlights both the potential and the current limitations of AI-assisted evidence extraction.
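The study's actual schema and validity criteria are not published in this summary. As a minimal sketch of the kind of validity check described (did the model return parseable JSON containing the expected fields?), the following uses only Python's standard library; the required keys are hypothetical placeholders, not the study's schema:

```python
import json

# Hypothetical top-level fields an extraction record must contain
REQUIRED_KEYS = {"study_design", "population", "endpoints", "results"}

def is_valid_output(raw_output: str) -> bool:
    """Return True if the model output parses as a JSON object
    and contains every required top-level key."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

valid = '{"study_design": "RCT", "population": "adults", "endpoints": ["OS"], "results": {}}'
truncated = '{"study_design": "RCT"'  # malformed JSON, a common LLM failure mode

print(is_valid_output(valid))      # True
print(is_valid_output(truncated))  # False
```

Production pipelines would typically validate against a full JSON Schema (e.g. with the `jsonschema` library) rather than a key check, but the pass/fail idea is the same.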
Please click here to download the poster.
Don’t forget to subscribe to our newsletter below for the latest company updates, recently published HTA guidance, industry insights, and more!