Nishant Balepur
Ph.D. Student in Computer Science at University of Maryland, College Park

Email:
nbalepur[at]umd[dot]edu
Hi! My name is Nishant and I’m a second-year Ph.D. student at the University of Maryland, where I am fortunate to be advised by Professors Jordan Boyd-Graber and Rachel Rudinger. I am graciously supported by the NSF GRFP and a Cohere For AI Research Grant.
I semi-jokingly say that I work on bullying (evaluating flaws) and babysitting (alignment) in LLMs. I’m currently excited about three research questions:
- How can we better synthesize factual sources? [topic mining (ACL’23), expository text (EMNLP’23), fact transfer (EMNLP’23), debatable queries (NAACL’25)]
- How can we teach models to help users? [flashcards (EMNLP’24), mnemonics (EMNLP’24), personalized dpo]
- How can we build evaluations to expose model/dataset flaws? [process of elimination (ACL’24), mcqa artifacts (ACL’24), benchmark cheating (ACL’24), mcqa plausibility (EMNLP’24), reverse qa (NAACL’25), mcqa is flawed]
I’m generally interested in research that is useful (helps users) and fun (with entertaining outputs to look at). If you’re interested in similar problems, don’t hesitate to reach out!
And if you’ve seen another “Balepur, N” during your literature search, you may be looking for my sister 😛
📝 Selected Publications
🥳 Research Highlights
Mar 24, 2025 | Humbled to be invited for talks at Imperial College London on building Helpful QA systems (slides) and Google Translate’s Reading Group on improving MCQA evals (slides) |
Feb 28, 2025 | Two exciting life updates! I plan to join: 1) Ai2 (Semantic Scholar) as a summer intern working on helpful scientific QA systems; and 2) NYU as a visiting researcher (AY 25-26) with Eunsol Choi to live with my amazing partner in NYC :) |
Feb 20, 2025 | Tired of LLM evaluations with multiple choice questions? Our new position paper discusses these flaws and how insights from education can make evaluations more meaningful |
Jan 22, 2025 | Two papers at NAACL 2025 (main)! One work shows LLMs are surprisingly weak at generating accurate questions, while my Adobe internship project builds a multi-agent summarization model for debatable queries |
Jan 20, 2025 | New preprint on improving personalization in DPO! Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas |
😔 Negative Results
Feb 13, 2025 | One paper got bad reviews in December ARR |
Dec 19, 2024 | Didn’t get intern/fellow offers after interviewing at Meta, Cohere, and Anthropic |
Jun 15, 2024 | KAR³L is on its fourth resubmission 🫡 |
Apr 15, 2024 | One paper not committed to ACL 2024 |
Feb 15, 2024 | Two papers not committed to NAACL 2024 |
Feb 10, 2024 | Banned on r/ACT for trying to advertise our KAR³L user study 😭 |
Oct 6, 2023 | One paper rejected from EMNLP 2023 |
Mar 20, 2023 | My first ever review score of 1 received on an ARR submission