When Tools Help and Hurt Scientific Simulator Agents: A PHREEQC-MCQ-200 Case Study

This project equips state-of-the-art LLMs with PHREEQC, a widely used open-source geochemistry simulator for aqueous speciation, chemical reactions, and reactive transport, as an external tool. It then tests whether simulator access improves scientific question answering on a new 200-question multiple-choice benchmark (PHREEQC-MCQ-200). Because raw PHREEQC output is very long, the agent reads it through a table-of-contents (TOC) representation that cuts token usage roughly in half while preserving accuracy. Tool access lifts top-model accuracy by about 40 points over direct answering, but the gain is not purely additive: with the tool, agents also miss some questions they answered correctly without it.
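The "not purely additive" observation can be made concrete by splitting the accuracy change into questions gained and questions lost. The following is a minimal sketch, not the project's evaluation code: the function name `decompose_tool_effect`, the per-question boolean format, and the five-question toy data are all assumptions made for illustration.

```python
from typing import Dict


def decompose_tool_effect(
    direct: Dict[str, bool], with_tool: Dict[str, bool]
) -> Dict[str, float]:
    """Split the accuracy change between two runs into gained and lost questions.

    Both arguments map a question id to whether the answer was correct.
    Hypothetical helper for illustration only.
    """
    # Compare only the questions answered under both conditions.
    shared = sorted(set(direct) & set(with_tool))
    if not shared:
        raise ValueError("the two runs share no question ids")

    # Gained: wrong without the tool, right with it.
    gained = [q for q in shared if not direct[q] and with_tool[q]]
    # Lost: right without the tool, wrong with it.
    lost = [q for q in shared if direct[q] and not with_tool[q]]

    n = len(shared)
    return {
        "n": n,
        "direct_acc": sum(direct[q] for q in shared) / n,
        "tool_acc": sum(with_tool[q] for q in shared) / n,
        "gained": len(gained),
        "lost": len(lost),
        # The net change in accuracy points equals gains minus losses.
        "net_points": 100.0 * (len(gained) - len(lost)) / n,
    }


# Toy data (invented, not benchmark results).
direct = {"q1": True, "q2": False, "q3": False, "q4": True, "q5": False}
with_tool = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": True}

result = decompose_tool_effect(direct, with_tool)
print(result)
```

Reporting `gained` and `lost` separately shows how a large net improvement can coexist with regressions on individual questions, which a single accuracy number hides.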


Poster — work submitted to NeurIPS


