Publications | Nishant Balepur

2025

Under Review

A Survey on LLM Jailbreaking Attacks and Defenses

Sander Schulhoff*, Nishant Balepur*, Arjun Akkiraju, and 4 more authors

2025

Abs TL;DR

GenAI systems are designed to safely address users’ needs specified in input prompts, but these same prompts can be carefully manipulated to elicit harmful behaviors from the target system, a phenomenon commonly referred to as prompt hacking or jailbreaking. Such attacks pose security and reliability concerns, as malicious end users and third parties can circumvent model safeguards and obtain responses that spread misinformation, include dangerous information, or steal end-user data. The nature of these attacks is ever-growing, with many being proposed in academia each year, and even more “in the wild” on the Internet. In this paper, we introduce a comprehensive taxonomy that categorizes emerging prompt attacks and defense mechanisms. Our goal is to provide a structured foundation for future work and highlight the urgency of addressing GenAI security challenges.

We review how researchers and users in-the-wild attack LLMs and defend against them, and discuss future work in the context of GenAI advancements
ACL 2025
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber

arXiv preprint arXiv:2502.14127, 2025

Best Paper Award and Oral (1.5%) at MASC-SLL 2025

Abs TL;DR Bib PDF

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA’s format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing—where LLMs construct and explain answers—better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA—robustness, biases, and unfaithful explanations—showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.

We review and critique current multiple-choice evaluation practices, and borrow insights from education research to propose solutions
@article{balepur2025these, title = {Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above}, author = {Balepur, Nishant and Rudinger, Rachel and Boyd-Graber, Jordan Lee}, year = {2025}, journal = {arXiv preprint arXiv:2502.14127}, tldr = {We review and critique current multiple-choice evaluation practices, and borrow insights from education research to propose solutions}, thoughts = {I think this is my best argued paper, especially since we had to survey results from existing papers and come up with our own arguments. I think finishing this made me a much better writer!}, }
ACL 2025
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, and 3 more authors

arXiv preprint arXiv:2501.11549, 2025

Abs TL;DR Bib PDF

LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI)—abductively inferring personas of users who prefer chosen or rejected outputs—and Persona Tailoring (PT)—training models to tailor responses to personas from PI. We find: 1) LLMs infer personas accurately explaining why different users may prefer both chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.

We propose a data augmentation strategy of abductive persona inference to improve personalization in direct preference optimization
@article{balepur2025whose, title = {Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas}, author = {Balepur, Nishant and Padmakumar, Vishakh and Yang, Fumeng and Feng, Shi and Rudinger, Rachel and Boyd-Graber, Jordan Lee}, journal = {arXiv preprint arXiv:2501.11549}, year = {2025}, tldr = {We propose a data augmentation strategy of abductive persona inference to improve personalization in direct preference optimization}, }
NAACL 2025
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can‘t Answer?

Nishant Balepur, Feng Gu, Abhilasha Ravichander, and 3 more authors

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Apr 2025

Oral

Abs TL;DR Bib PDF

Question answering (QA)—producing correct answers for input questions—is popular, but we test a reverse question answering (RQA) task: given an input answer, generate a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and assessing reasoning consistency. 16 LLMs run QA and RQA with trivia questions/answers, showing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types yielding RQA errors, we suggest improvements for LLM RQA reasoning.

We find a surprising LLM weakness in reverse question answering: given an answer, can an LLM generate any valid question with that answer?
@inproceedings{balepur-etal-2025-reverse, title = {Reverse Question Answering: Can an {LLM} Write a Question so Hard (or Bad) that it Can`t Answer?}, author = {Balepur, Nishant and Gu, Feng and Ravichander, Abhilasha and Feng, Shi and Boyd-Graber, Jordan Lee and Rudinger, Rachel}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-short.5/}, pages = {44--64}, isbn = {979-8-89176-190-2}, tldr = {We find a surprising LLM weakness in reverse question answering: given an answer, can an LLM generate any valid question with that answer?}, thoughts = {I think this paper has the weirdest / most surprising results. That is, I didn't think that so many LLMs wouldn't be able to generate questions for numbers when playing around with ChatGPT}, }
NAACL 2025
MoDS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections

Nishant Balepur, Alexa Siu, Nedim Lipka, and 4 more authors

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025

Oral

Abs TL;DR Bib PDF

Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (Is law school worth it?). We introduce Debatable QFS (DQFS), a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must comprehensively cover all sources and balance perspectives, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) use the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document’s content. To overcome this, we design MODS, a multi-LLM framework mirroring human panel discussions. MODS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MODS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MODS’s summaries to be readable and more balanced.

We propose a new task of answering debatable queries (Are EVs good?) from documents and use multi-agent summarization to reach SOTA
@inproceedings{balepur-etal-2025-mods, title = {{M}o{DS}: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections}, author = {Balepur, Nishant and Siu, Alexa and Lipka, Nedim and Dernoncourt, Franck and Sun, Tong and Boyd-Graber, Jordan Lee and Mathur, Puneet}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.20/}, pages = {465--491}, isbn = {979-8-89176-189-6}, tldr = {We propose a new task of answering debatable queries (Are EVs good?) from documents and use multi-agent summarization to reach SOTA} }

2024

EMNLP 2024
A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

Nishant Balepur, Matthew Shu, Alexander Hoyle, and 4 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Abs TL;DR Bib PDF

Keyword mnemonics are memorable explanations that link new terms to simpler keywords.Prior work generates mnemonics for students, but they do not train models using mnemonics students prefer and aid learning.We build SMART, a mnemonic generator trained on feedback from real students learning new terms.To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics.We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings.First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful.Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal.SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.

We generate mnemonics by aligning an LLM using preferences from 47 GRE test-takers on the mnemonics they like and which mnemonics aid learning
@inproceedings{balepur-etal-2024-smart, title = {A {SMART} Mnemonic Sounds like {``}Glue Tonic{''}: Mixing {LLM}s with Student Feedback to Make Mnemonic Learning Stick}, author = {Balepur, Nishant and Shu, Matthew and Hoyle, Alexander and Robey, Alison and Feng, Shi and Goldfarb-Tarrant, Seraphina and Boyd-Graber, Jordan Lee}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.emnlp-main.786}, tldr = {We generate mnemonics by aligning an LLM using preferences from 47 GRE test-takers on the mnemonics they like and which mnemonics aid learning}, thoughts = {This was the first paper in my Ph.D. I was super proud of. It felt pretty ambitious on my end---working on a new task, collecting our own data with real students, some new modeling decisions, and bothering experts to help us evaluate---as in my undergrad, I worked on a lot of safe / SOTA type papers on offline benchmarks. This paper also got me in the habit of writing self-referential titles! (thanks jbg)}, pages = {14202--14225}, }
EMNLP 2024
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students

Matthew Shu*, Nishant Balepur*, Shi Feng, and 1 more author

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Abs TL;DR Bib PDF

Flashcard schedulers rely on 1) *student models* to predict the flashcards a student knows; and 2) *teaching policies* to pick which cards to show next via these predictions.Prior student models, however, just use study data like the student’s past responses, ignoring the text on cards. We propose **content-aware scheduling**, the first schedulers exploiting flashcard content.To give the first evidence that such schedulers enhance student learning, we build KARL, a simple but effective content-aware student model employing deep knowledge tracing (DKT), retrieval, and BERT to predict student recall.We train KARL by collecting a new dataset of 123,143 study logs on diverse trivia questions.KARL bests existing student models in AUC and calibration error.To ensure our improved predictions lead to better student learning, we create a novel delta-based teaching policy to deploy KARL online.Based on 32 study paths from 27 users, KARL improves learning efficiency over SOTA, showing KARL’s strength and encouraging researchers to look beyond historical study data to fully capture student abilities.

We design the first flashcard scheduler that uses LLMs and the text on the flashcards, and use this model to help 500+ students learn
@inproceedings{shu-etal-2024-karl, title = {{KARL}: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students}, author = {Shu, Matthew and Balepur, Nishant and Feng, Shi and Boyd-Graber, Jordan Lee}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2024}, address = {Miami, Florida, USA}, is_first_author = {true}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.emnlp-main.784}, thoughts = {This was the hardest paper to publish during my Ph.D. so far---took around 5 submissions? Matthew (the undergrad leading this project) did a lot of great work getting results for this project, but I really struggled to publish it, since education was not (and still is not) my area of expertise. But I think I learned how to better respond to criticism and not care much what reviewers think ¯\_(ツ)_/¯. I think if I was a better writer at the time I could have made this paper more impactful, but still super proud of it!}, tldr = {We design the first flashcard scheduler that uses LLMs and the text on the flashcards, and use this model to help 500+ students learn}, pages = {14161--14178} }
EMNLP 2024
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Shramay Palta, Nishant Balepur, Peter Rankel, and 3 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Abs TL;DR Bib PDF

Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.

We show that the gold answer in commonsense multiple-choice datasets is not always the one perceived to be the most plausible
@inproceedings{palta-etal-2024-plausibly, title = {Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning}, author = {Palta, Shramay and Balepur, Nishant and Rankel, Peter and Wiegreffe, Sarah and Carpuat, Marine and Rudinger, Rachel}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.findings-emnlp.198/}, doi = {10.18653/v1/2024.findings-emnlp.198}, pages = {3451--3473}, tldr = {We show that the gold answer in commonsense multiple-choice datasets is not always the one perceived to be the most plausible} }
Preprint
The Prompt Report: A Systematic Survey of Prompting Techniques

Sander Schulhoff, Michael Ilie, Nishant Balepur, and 28 more authors

Nov 2024

Huggingface #1 Paper of the Day

Abs TL;DR Bib PDF

Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area’s nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.

We survey current techniques and practices when prompting generative AI systems like ChatGPT
@misc{schulhoff2024prompt, title = {The Prompt Report: A Systematic Survey of Prompting Techniques}, author = {Schulhoff, Sander and Ilie, Michael and Balepur, Nishant and Kahadze, Konstantine and Liu, Amanda and Si, Chenglei and Li, Yinheng and Gupta, Aayush and Han, HyoJung and Schulhoff, Sevien and Dulepet, Pranav Sandeep and Vidyadhara, Saurav and Ki, Dayeon and Agrawal, Sweta and Pham, Chau and Kroiz, Gerson and Li, Feileen and Tao, Hudson and Srivastava, Ashay and Costa, Hevander Da and Gupta, Saloni and Rogers, Megan L. and Goncearenco, Inna and Sarli, Giuseppe and Galynker, Igor and Peskoff, Denis and Carpuat, Marine and White, Jules and Anadkat, Shyamal and Hoyle, Alexander and Resnik, Philip}, year = {2024}, journal = {arXiv preprint arXiv:2406.06608}, tldr = {We survey current techniques and practices when prompting generative AI systems like ChatGPT} }
ACL 2024
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

Best Paper Award (4%) and Oral (7%) at MASC-SLL 2024

Abs TL;DR Bib PDF

Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. We hope to motivate the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets, and further efforts to explain LLM decision-making.

We find that LLMs don’t need the question in multiple-choice question answering to do better than random chance, and explore how
@inproceedings{balepur-etal-2024-artifacts, title = {Artifacts or Abduction: How Do {LLM}s Answer Multiple-Choice Questions Without the Question?}, author = {Balepur, Nishant and Ravichander, Abhilasha and Rudinger, Rachel}, editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek}, booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = aug, year = {2024}, address = {Bangkok, Thailand}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.acl-long.555}, pages = {10308--10330}, thoughts = {I like this paper a lot because I think the idea is super simple to understand and had really interesting results; I always wondered if I was good enough to cheat on MCQA exams, but turns out LLMs definitely are. This was also my first external collaboration!}, tldr = {We find that LLMs don't need the question in multiple-choice question answering to do better than random chance, and explore how} }
ACL 2024
It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning

Nishant Balepur, Shramay Palta, and Rachel Rudinger

In Findings of the Association for Computational Linguistics ACL 2024, Aug 2024

Abs TL;DR Bib PDF

Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.

We find a surprising weakness in LLMs: eliminating incorrect options in multiple-choice question answering
@inproceedings{balepur-etal-2024-easy, title = {It{'}s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning}, author = {Balepur, Nishant and Palta, Shramay and Rudinger, Rachel}, editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek}, booktitle = {Findings of the Association for Computational Linguistics ACL 2024}, month = aug, year = {2024}, address = {Bangkok, Thailand and virtual meeting}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.findings-acl.604}, pages = {10143--10166}, thoughts = {This was the first paper I published during my Ph.D. and also my first (successful) short paper. I think short papers are under-valued; I think it's just as hard (or maybe even harder) to write a convincing argument in 4 pages, rather than inflating it to 8 pages :)}, tldr = {We find a surprising weakness in LLMs: eliminating incorrect options in multiple-choice question answering} }
ACL 2024
Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?

Nishant Balepur, and Rachel Rudinger

In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), Aug 2024

Abs TL;DR Bib PDF

Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards solely due to their ability to exploit choices-only shortcuts.

We study if the ability of LLMs to answer multiple-choice questions without the question is allowing models to cheat on benchmark leaderboards
@inproceedings{balepur-rudinger-2024-large, title = {Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?}, author = {Balepur, Nishant and Rudinger, Rachel}, editor = {Li, Sha and Li, Manling and Zhang, Michael JQ and Choi, Eunsol and Geva, Mor and Hase, Peter and Ji, Heng}, booktitle = {Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)}, month = aug, year = {2024}, address = {Bangkok, Thailand}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.knowllm-1.2}, pages = {15--26}, thoughts = {This was the first project I spent a few months on that led to totally negative results, so I'm grateful Rachel still encouraged me to try and publish it in a workshop. An unpopular take, but I think we more negative results papers rather than basically p-hacking our papers that work}, tldr = {We study if the ability of LLMs to answer multiple-choice questions without the question is allowing models to cheat on benchmark leaderboards} }

2023

EMNLP 2023
Expository Text Generation: Imitate, Retrieve, Paraphrase

Nishant Balepur, Jie Huang, and Kevin Chen-Chuan Chang

In The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, Aug 2023

Abs TL;DR Bib PDF Video

Expository documents are vital resources for conveying complex information to readers. Despite their usefulness, writing expository documents by hand is a time-consuming and labor-intensive process that requires knowledge of the domain of interest, careful content planning, and the ability to synthesize information from multiple sources. To ease these burdens, we introduce the task of expository text generation, which seeks to automatically generate an accurate and informative expository document from a knowledge source. We solve our task by developing IRP, an iterative framework that overcomes the limitations of language models and separately tackles the steps of content planning, fact selection, and rephrasing. Through experiments on three diverse datasets, we demonstrate that IRP produces high-quality expository documents that accurately inform readers.

We design an iterative planning, retrieval, and generation system to produce factual expository texts
@inproceedings{balepur2023expository, title = {Expository Text Generation: Imitate, Retrieve, Paraphrase}, author = {Balepur, Nishant and Huang, Jie and Chang, Kevin Chen-Chuan}, booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023}, year = {2023}, video = {https://youtu.be/6elNaka-JKM}, thoughts = {This was my first solo-led paper (but still with some guidance from Jie and Kevin), and working on this mostly independently made me a much better researcher}, tldr = {We design an iterative planning, retrieval, and generation system to produce factual expository texts} }
EMNLP 2023
Text Fact Transfer

Nishant Balepur, Jie Huang, and Kevin Chen-Chuan Chang

In The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, Aug 2023

Abs TL;DR Bib PDF Video

Text style transfer is a prominent task that aims to control the style of text without inherently changing its factual content. To cover more text modification applications, such as adapting past news for current events and repurposing educational materials, we propose the task of text fact transfer, which seeks to transfer the factual content of a source text between topics without modifying its style. We find that existing language models struggle with text fact transfer, due to their inability to preserve the specificity and phrasing of the source text, and tendency to hallucinate errors. To address these issues, we design ModQGA, a framework that minimally modifies a source text with a novel combination of end-to-end question generation and specificity-aware question answering. Through experiments on four existing datasets adapted for text fact transfer, we show that ModQGA can accurately transfer factual content without sacrificing the style of the source text.

We design a model to tackle the new task of text fact transfer, a complement to style transfer that seeks to alter facts without changing style
@inproceedings{balepur2023fact, title = {Text Fact Transfer}, author = {Balepur, Nishant and Huang, Jie and Chang, Kevin Chen-Chuan}, booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023}, year = {2023}, video = {https://youtu.be/U01fVWUbIQw}, tldr = {We design a model to tackle the new task of text fact transfer, a complement to style transfer that seeks to alter facts without changing style} }
ACL 2023
DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance

Nishant Balepur, Shivam Agarwal, Karthik Venkat Ramanan, and 3 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Aug 2023

Abs TL;DR Bib PDF Video

Dynamic topic models (DTMs) analyze text streams to capture the evolution of topics. Despite their popularity, existing DTMs are either fully supervised, requiring expensive human annotations, or fully unsupervised, producing topic evolutions that often do not cater to a user’s needs. Further, the topic evolutions produced by DTMs tend to contain generic terms that are not indicative of their designated time steps. To address these issues, we propose the task of discriminative dynamic topic discovery. This task aims to discover topic evolutions from temporal corpora that distinctly align with a set of user-provided category names and uniquely capture topics at each time step. We solve this task by developing DynaMiTE, a framework that ensembles semantic similarity, category indicative, and time indicative scores to produce informative topic evolutions. Through experiments on three diverse datasets, including the use of a newly-designed human evaluation experiment, we demonstrate that DynaMiTE is a practical and efficient framework for helping users discover high-quality topic evolutions suited to their interests.

We design a model to perform dynamic topic modeling while using guidance from user-provided topics of interest
@inproceedings{balepur2023dynamite, title = {DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance}, author = {Balepur, Nishant and Agarwal, Shivam and Ramanan, Karthik Venkat and Yoon, Susik and Yang, Diyi and Han, Jiawei}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2023}, pages = {194--217}, year = {2023}, url = {https://aclanthology.org/2023.findings-acl.14}, video = {https://youtu.be/KAyd-QqYO6Y}, tldr = {We design a model to perform dynamic topic modeling while using guidance from user-provided topics of interest}, thoughts = {This was the first paper I ever wrote, so I obviously look back and cringe. I'm no longer interested in the topic and barely remember what I did, but do remember enjoying the writing process a lot} }