Publications
2025
- [Preprint] Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber. 2025.
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA’s format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing—where LLMs construct and explain answers—better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA—robustness, biases, and unfaithful explanations—showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
We review and critique current multiple-choice evaluation practices, and use insights from education research to propose solutions (an illustrative IRT sketch follows this entry)
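For readers unfamiliar with Item Response Theory (IRT), one of the education-inspired fixes mentioned above, the toy snippet below shows the standard 3-parameter-logistic item model. It is an illustrative aside under generic parameter names (`ability`, `difficulty`, `discrimination`, `guessing`), not code from the paper.

```python
import math

def irt_3pl(ability: float, difficulty: float,
            discrimination: float = 1.0, guessing: float = 0.25) -> float:
    """3-parameter-logistic IRT: probability that a test-taker with the
    given ability answers an item correctly. `guessing` is the floor from
    random guessing (0.25 for a 4-option MCQ)."""
    return guessing + (1.0 - guessing) / (
        1.0 + math.exp(-discrimination * (ability - difficulty)))

# Toy example: the same test-taker (ability = 0.5) on an easy vs. a hard item.
print(round(irt_3pl(ability=0.5, difficulty=-1.0), 3))  # high probability -> easy item
print(round(irt_3pl(ability=0.5, difficulty=2.0), 3))   # near the 0.25 guessing floor -> hard item
```

Fitting such difficulty parameters to LLM (or human) response data is one way to identify and keep the harder MCQs the abstract alludes to.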
- [Preprint] Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas. Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, and 3 more authors. arXiv preprint arXiv:2501.11549, 2025.
LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI)—abductively inferring personas of users who prefer chosen or rejected outputs—and Persona Tailoring (PT)—training models to tailor responses to personas from PI. We find: 1) LLMs infer personas accurately explaining why different users may prefer both chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
We propose a data augmentation strategy of abductive persona inference to improve personalization in direct preference optimization (a toy sketch of the idea follows this entry)
@article{balepur2025whosf, title = {Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas}, author = {Balepur, Nishant and Padmakumar, Vishakh and Yang, Fumeng and Feng, Shi and Rudinger, Rachel and Boyd-Graber, Jordan Lee}, journal = {arXiv preprint arXiv:2501.11549}, year = {2025}, tldr = {We propose a data augmentation strategy of abductive persona inference to improve personalization in direct preference optimization}, }
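As a rough illustration of the two-step pipeline above, the sketch below augments a preference pair with inferred personas before preference tuning. The `llm()` helper and the prompt wording are assumptions for illustration, not the paper's released implementation.

```python
def llm(prompt: str) -> str:
    """Hypothetical call to any instruction-tuned LLM client."""
    raise NotImplementedError("plug in an LLM client here")

def infer_persona(prompt: str, response: str) -> str:
    """Persona Inference (PI): abductively guess what kind of user
    would prefer this response to the prompt."""
    return llm(
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Describe a user persona (needs, interests) that would prefer this response."
    )

def augment_pair(example: dict) -> list[dict]:
    """Persona Tailoring (PT) data: condition each response on the persona
    inferred for it, for both the chosen and the rejected output."""
    augmented = []
    for key in ("chosen", "rejected"):
        persona = infer_persona(example["prompt"], example[key])
        augmented.append({"prompt": f"Persona: {persona}\n{example['prompt']}",
                          "target": example[key]})
    return augmented
```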
- [NAACL 2025] MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections. Nishant Balepur, Alexa Siu, Nedim Lipka, and 4 more authors. arXiv preprint arXiv:2502.00322, 2025.
Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (Is law school worth it?). We introduce Debatable QFS (DQFS), a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must comprehensively cover all sources and balance perspectives, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) use the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document’s content. To overcome this, we design MODS, a multi-LLM framework mirroring human panel discussions. MODS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MODS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MODS’s summaries to be readable and more balanced.
We propose a new task of answering debatable queries (Are EVs good?) from documents and use multi-agent summarization to reach SOTA; a simplified sketch of the moderator loop follows this entry
@article{balepur2025mods,
  title   = {MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections},
  author  = {Balepur, Nishant and Siu, Alexa and Lipka, Nedim and Dernoncourt, Franck and Sun, Tong and Boyd-Graber, Jordan and Mathur, Puneet},
  journal = {arXiv preprint arXiv:2502.00322},
  year    = {2025},
  tldr    = {We propose a new task of answering debatable queries (Are EVs good?) from documents and use multi-agent summarization to reach SOTA}
}
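A heavily simplified sketch of the Moderator/Speaker loop described above; `llm()`, `retrieve()`, and the prompt strings are placeholders chosen for illustration rather than the authors' implementation.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("any LLM client")

def retrieve(document: str, query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("any passage retriever over one document")

def mods_summary(query: str, documents: list[str], topics: list[str]) -> str:
    """Moderator picks a speaker (document) per topic, tailors a sub-query
    to it, and tracks the speakers' answers in an outline that guides the
    final balanced summary."""
    outline = []
    for topic in topics:
        # Moderator: choose which document should speak on this topic.
        speaker_id = int(llm(f"Query: {query}\nTopic: {topic}\n"
                             f"Pick a document index in 0..{len(documents) - 1}:"))
        tailored_query = llm(f"Rewrite '{query}' so it targets the topic '{topic}'.")
        # Speaker: answer only from contexts retrieved in its own document.
        contexts = retrieve(documents[speaker_id], tailored_query)
        answer = llm(f"Contexts: {contexts}\nAnswer: {tailored_query}")
        outline.append((topic, speaker_id, answer))
    return llm(f"Write a balanced summary answering '{query}' from this outline: {outline}")
```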
2024
- [NAACL 2025] Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer? Nishant Balepur, Feng Gu, Abhilasha Ravichander, and 3 more authors. arXiv preprint arXiv:2410.15512, 2024.
Question answering (QA)—producing correct answers for input questions—is popular, but we test a reverse question answering (RQA) task: given an input answer, generate a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and assessing reasoning consistency. 16 LLMs run QA and RQA with trivia questions/answers, showing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types yielding RQA errors, we suggest improvements for LLM RQA reasoning.
We find a surprising LLM weakness in reverse question answering: given an answer, can an LLM generate any valid question with that answer? (A toy round-trip check is sketched after this entry.)
@article{balepur2024reverse,
  title   = {Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?},
  author  = {Balepur, Nishant and Gu, Feng and Ravichander, Abhilasha and Feng, Shi and Boyd-Graber, Jordan and Rudinger, Rachel},
  journal = {arXiv preprint arXiv:2410.15512},
  year    = {2024},
  tldr    = {We find a surprising LLM weakness in reverse question answering: given an answer, can an LLM generate any valid question with that answer?}
}
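The joint QA/RQA setup amounts to a round-trip consistency check. Below is a minimal sketch of that loop, assuming a generic `llm()` helper and naive exact-match scoring (both assumptions, not the paper's evaluation code).

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("any LLM client")

def round_trip_consistent(answer: str) -> bool:
    """RQA then QA: ask the model to write a question whose answer is
    `answer`, then ask it to answer its own question, and compare."""
    question = llm(f"Write a trivia question whose answer is: {answer}")
    predicted = llm(f"Answer this trivia question concisely: {question}")
    return predicted.strip().lower() == answer.strip().lower()

# Note: a model can fail RQA (write an invalid question) yet still answer
# its own question correctly in QA, one of the paper's findings.
```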
- [EMNLP 2024] A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick. Nishant Balepur, Matthew Shu, Alexander Hoyle, and 4 more authors. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024.
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but they do not train models using mnemonics students prefer and aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
We generate mnemonics by aligning an LLM using preferences from 47 GRE test-takers on the mnemonics they like and which mnemonics aid learning (a toy sketch of combining the two preference signals follows this entry)
@inproceedings{balepur-etal-2024-smart,
  title     = {A {SMART} Mnemonic Sounds like {``}Glue Tonic{''}: Mixing {LLM}s with Student Feedback to Make Mnemonic Learning Stick},
  author    = {Balepur, Nishant and Shu, Matthew and Hoyle, Alexander and Robey, Alison and Feng, Shi and Goldfarb-Tarrant, Seraphina and Boyd-Graber, Jordan Lee},
  editor    = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  month     = nov,
  year      = {2024},
  address   = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.emnlp-main.786},
  tldr      = {We generate mnemonics by aligning an LLM using preferences from 47 GRE test-takers on the mnemonics they like and which mnemonics aid learning},
  pages     = {14202--14225},
}
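Purely to illustrate the idea of merging two preference signals before Direct Preference Optimization, here is a toy sketch; the linear weighting and field names are invented for illustration and stand in for, rather than reproduce, the paper's Bayesian model.

```python
def combine_preferences(expressed: float, observed: float,
                        w_expressed: float = 0.5) -> float:
    """Toy effectiveness score in [0, 1] mixing what students *say* helps
    (ratings) with what *actually* helps (measured learning)."""
    return w_expressed * expressed + (1.0 - w_expressed) * observed

def to_dpo_pair(term: str, mnemonic_a: str, score_a: float,
                mnemonic_b: str, score_b: float) -> dict:
    """Turn two scored mnemonics for the same term into a DPO-style pair."""
    chosen, rejected = ((mnemonic_a, mnemonic_b) if score_a >= score_b
                        else (mnemonic_b, mnemonic_a))
    return {"prompt": f"Write a keyword mnemonic for: {term}",
            "chosen": chosen, "rejected": rejected}
```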
- [EMNLP 2024] KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students. Matthew Shu*, Nishant Balepur*, Shi Feng, and 1 more author. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024.
Flashcard schedulers rely on 1) *student models* to predict the flashcards a student knows; and 2) *teaching policies* to pick which cards to show next via these predictions. Prior student models, however, just use study data like the student’s past responses, ignoring the text on cards. We propose **content-aware scheduling**, the first schedulers exploiting flashcard content. To give the first evidence that such schedulers enhance student learning, we build KARL, a simple but effective content-aware student model employing deep knowledge tracing (DKT), retrieval, and BERT to predict student recall. We train KARL by collecting a new dataset of 123,143 study logs on diverse trivia questions. KARL bests existing student models in AUC and calibration error. To ensure our improved predictions lead to better student learning, we create a novel delta-based teaching policy to deploy KARL online. Based on 32 study paths from 27 users, KARL improves learning efficiency over SOTA, showing KARL’s strength and encouraging researchers to look beyond historical study data to fully capture student abilities.
We design the first flashcard scheduler that uses LLMs and the text on the flashcards, and use this model to help 500+ students learn; a toy content-aware recall predictor is sketched after this entry
@inproceedings{shu-etal-2024-karl,
  title           = {{KARL}: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students},
  author          = {Shu, Matthew and Balepur, Nishant and Feng, Shi and Boyd-Graber, Jordan Lee},
  editor          = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  booktitle       = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  month           = nov,
  year            = {2024},
  address         = {Miami, Florida, USA},
  is_first_author = {true},
  publisher       = {Association for Computational Linguistics},
  url             = {https://aclanthology.org/2024.emnlp-main.784},
  tldr            = {We design the first flashcard scheduler that uses LLMs and the text on the flashcards, and use this model to help 500+ students learn},
  pages           = {14161--14178}
}
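To give a flavor of what "content-aware" means here, the toy predictor below mixes a student's past accuracy with the text similarity between a new card and cards they already answered; `embed()` and the mixing weights are illustrative placeholders, not KARL's DKT/BERT architecture.

```python
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError("any sentence-embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def predict_recall(card_text: str, history: list[tuple[str, bool]]) -> float:
    """P(student recalls this card): past accuracy, nudged upward when the
    card is similar to cards the student previously answered correctly."""
    if not history:
        return 0.5
    accuracy = sum(correct for _, correct in history) / len(history)
    sims = [cosine(embed(card_text), embed(text)) for text, correct in history if correct]
    similarity_bonus = 0.2 * max(sims, default=0.0)
    return min(1.0, 0.8 * accuracy + similarity_bonus)
```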
- [EMNLP 2024] Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning. Shramay Palta, Nishant Balepur, Peter Rankel, and 3 more authors. Nov 2024.
Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.
We show that the gold answer in commonsense multiple-choice datasets is not always the one perceived to be the most plausible
@misc{palta2024plausiblyproblematicquestionsmultiplechoice,
  title         = {Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning},
  author        = {Palta, Shramay and Balepur, Nishant and Rankel, Peter and Wiegreffe, Sarah and Carpuat, Marine and Rudinger, Rachel},
  year          = {2024},
  eprint        = {2410.10854},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  tldr          = {We show that the gold answer in commonsense multiple-choice datasets is not always the one perceived to be the most plausible}
}
- [Preprint] The Prompt Report: A Systematic Survey of Prompting Techniques. Sander Schulhoff, Michael Ilie, Nishant Balepur, and 28 more authors. Nov 2024.
Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area’s nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.
We survey current techniques and practices when prompting generative AI systems like ChatGPT
@misc{schulhoff2024prompt,
  title   = {The Prompt Report: A Systematic Survey of Prompting Techniques},
  author  = {Schulhoff, Sander and Ilie, Michael and Balepur, Nishant and Kahadze, Konstantine and Liu, Amanda and Si, Chenglei and Li, Yinheng and Gupta, Aayush and Han, HyoJung and Schulhoff, Sevien and Dulepet, Pranav Sandeep and Vidyadhara, Saurav and Ki, Dayeon and Agrawal, Sweta and Pham, Chau and Kroiz, Gerson and Li, Feileen and Tao, Hudson and Srivastava, Ashay and Costa, Hevander Da and Gupta, Saloni and Rogers, Megan L. and Goncearenco, Inna and Sarli, Giuseppe and Galynker, Igor and Peskoff, Denis and Carpuat, Marine and White, Jules and Anadkat, Shyamal and Hoyle, Alexander and Resnik, Philip},
  year    = {2024},
  journal = {arXiv preprint arXiv:2406.06608},
  tldr    = {We survey current techniques and practices when prompting generative AI systems like ChatGPT}
}
- [ACL 2024] Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024. Best Paper Award (4%) and Oral (7%) at MASC-SSL 2024.
Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. We hope to motivate the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets, and further efforts to explain LLM decision-making.
We find that LLMs don’t need the question in multiple-choice question answering to do better than random chance, and explore how (a minimal choices-only probe is sketched after this entry)
@inproceedings{balepur-etal-2024-artifacts,
  title     = {Artifacts or Abduction: How Do {LLM}s Answer Multiple-Choice Questions Without the Question?},
  author    = {Balepur, Nishant and Ravichander, Abhilasha and Rudinger, Rachel},
  editor    = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = aug,
  year      = {2024},
  address   = {Bangkok, Thailand},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.acl-long.555},
  pages     = {10308--10330},
  tldr      = {We find that LLMs don't need the question in multiple-choice question answering to do better than random chance, and explore how}
}
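The choices-only probe is easy to reproduce in spirit: withhold the question and ask the model to pick from the options alone. A minimal sketch, where the `llm()` helper and prompt wording are assumptions rather than the paper's exact setup:

```python
import string

def llm(prompt: str) -> str:
    raise NotImplementedError("any LLM client")

def choices_only_accuracy(dataset: list[dict]) -> float:
    """Each item has 'choices' (list of strings) and 'label' (gold index).
    The question text is deliberately withheld from the prompt."""
    correct = 0
    for item in dataset:
        options = "\n".join(f"{string.ascii_uppercase[i]}. {choice}"
                            for i, choice in enumerate(item["choices"]))
        reply = llm(f"Choose the best answer:\n{options}\nAnswer with one letter.")
        if reply.strip()[:1].upper() == string.ascii_uppercase[item["label"]]:
            correct += 1
    return correct / len(dataset)

# Compare against a majority-class baseline; the paper reports the
# choices-only prompt beating that baseline in 11/12 dataset-model pairs.
```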
- [ACL 2024] It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning. Nishant Balepur, Shramay Palta, and Rachel Rudinger. In Findings of the Association for Computational Linguistics ACL 2024, Aug 2024.
Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.
We find a surprising weakness in LLMs: eliminating incorrect options in multiple-choice question answering (a toy process-of-elimination prompt is sketched after this entry)
@inproceedings{balepur-etal-2024-easy,
  title     = {It{'}s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning},
  author    = {Balepur, Nishant and Palta, Shramay and Rudinger, Rachel},
  editor    = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
  month     = aug,
  year      = {2024},
  address   = {Bangkok, Thailand and virtual meeting},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.findings-acl.604},
  pages     = {10143--10166},
  tldr      = {We find a surprising weakness in LLMs: eliminating incorrect options in multiple-choice question answering}
}
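Illustratively, a process-of-elimination prompt flips the usual instruction: reason about why each option is wrong, then name whatever remains. A hedged sketch (the exact prompt and answer parsing differ from the paper's):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("any LLM client")

def poe_with_cot(question: str, choices: list[str]) -> str:
    """Ask the model to eliminate incorrect options step by step,
    then return whichever single option it leaves standing."""
    options = "\n".join(f"({i}) {choice}" for i, choice in enumerate(choices))
    prompt = (f"Question: {question}\n{options}\n"
              "Think step by step about why each option is INCORRECT, "
              "eliminating them one at a time. Finish with the line "
              "'Remaining: (i)' for the option you could not eliminate.")
    reply = llm(prompt)
    remaining = reply.rsplit("Remaining:", 1)[-1]
    index = int("".join(ch for ch in remaining if ch.isdigit()) or 0)
    return choices[index] if index < len(choices) else choices[0]
```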
- [ACL 2024] Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? Nishant Balepur and Rachel Rudinger. In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), Aug 2024.
Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards solely due to their ability to exploit choices-only shortcuts.
We study if the ability of LLMs to answer multiple-choice questions without the question is allowing models to cheat on benchmark leaderboards
@inproceedings{balepur-rudinger-2024-large,
  title     = {Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?},
  author    = {Balepur, Nishant and Rudinger, Rachel},
  editor    = {Li, Sha and Li, Manling and Zhang, Michael JQ and Choi, Eunsol and Geva, Mor and Hase, Peter and Ji, Heng},
  booktitle = {Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)},
  month     = aug,
  year      = {2024},
  address   = {Bangkok, Thailand},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.knowllm-1.2},
  pages     = {15--26},
  tldr      = {We study if the ability of LLMs to answer multiple-choice questions without the question is allowing models to cheat on benchmark leaderboards}
}
2023
- [EMNLP 2023] Expository Text Generation: Imitate, Retrieve, Paraphrase. Nishant Balepur, Jie Huang, and Kevin Chen-Chuan Chang. In The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, 2023.
Expository documents are vital resources for conveying complex information to readers. Despite their usefulness, writing expository documents by hand is a time-consuming and labor-intensive process that requires knowledge of the domain of interest, careful content planning, and the ability to synthesize information from multiple sources. To ease these burdens, we introduce the task of expository text generation, which seeks to automatically generate an accurate and informative expository document from a knowledge source. We solve our task by developing IRP, an iterative framework that overcomes the limitations of language models and separately tackles the steps of content planning, fact selection, and rephrasing. Through experiments on three diverse datasets, we demonstrate that IRP produces high-quality expository documents that accurately inform readers.
We design an iterative planning, retrieval, and generation system to produce factual expository texts (a compressed sketch of the loop follows this entry)
@inproceedings{balepur2023expository,
  title     = {Expository Text Generation: Imitate, Retrieve, Paraphrase},
  author    = {Balepur, Nishant and Huang, Jie and Chang, Kevin Chen-Chuan},
  booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023},
  year      = {2023},
  video     = {https://youtu.be/6elNaka-JKM},
  tldr      = {We design an iterative planning, retrieval, and generation system to produce factual expository texts}
}
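A compressed, hypothetical rendering of the Imitate-Retrieve-Paraphrase idea described above; `style_model()`, `retrieve()`, and `paraphrase()` stand in for the separately trained components and are not the released implementation.

```python
def style_model(prefix: str) -> str:
    """Imitate: predict the next sentence the document *should* contain
    (a stylistic content plan, possibly with placeholder facts)."""
    raise NotImplementedError

def retrieve(plan: str, corpus: list[str]) -> str:
    """Retrieve: select the source sentence whose facts match the plan."""
    raise NotImplementedError

def paraphrase(plan: str, evidence: str) -> str:
    """Paraphrase: rewrite the retrieved facts in the planned style."""
    raise NotImplementedError

def irp_generate(corpus: list[str], num_sentences: int = 5) -> str:
    """Iteratively plan, ground, and rewrite one sentence at a time."""
    document = ""
    for _ in range(num_sentences):
        plan = style_model(document)                   # what to say next
        evidence = retrieve(plan, corpus)              # grounded facts for it
        document += " " + paraphrase(plan, evidence)   # faithful rewrite
    return document.strip()
```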
- [EMNLP 2023] Text Fact Transfer. Nishant Balepur, Jie Huang, and Kevin Chen-Chuan Chang. In The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, 2023.
Text style transfer is a prominent task that aims to control the style of text without inherently changing its factual content. To cover more text modification applications, such as adapting past news for current events and repurposing educational materials, we propose the task of text fact transfer, which seeks to transfer the factual content of a source text between topics without modifying its style. We find that existing language models struggle with text fact transfer, due to their inability to preserve the specificity and phrasing of the source text, and tendency to hallucinate errors. To address these issues, we design ModQGA, a framework that minimally modifies a source text with a novel combination of end-to-end question generation and specificity-aware question answering. Through experiments on four existing datasets adapted for text fact transfer, we show that ModQGA can accurately transfer factual content without sacrificing the style of the source text.
We design a model to tackle the new task of text fact transfer, a complement to style transfer that seeks to alter facts without changing style
@inproceedings{balepur2023fact,
  title     = {Text Fact Transfer},
  author    = {Balepur, Nishant and Huang, Jie and Chang, Kevin Chen-Chuan},
  booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023},
  year      = {2023},
  video     = {https://youtu.be/U01fVWUbIQw},
  tldr      = {We design a model to tackle the new task of text fact transfer, a complement to style transfer that seeks to alter facts without changing style}
}
- [ACL 2023] DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance. Nishant Balepur, Shivam Agarwal, Karthik Venkat Ramanan, and 3 more authors. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.
Dynamic topic models (DTMs) analyze text streams to capture the evolution of topics. Despite their popularity, existing DTMs are either fully supervised, requiring expensive human annotations, or fully unsupervised, producing topic evolutions that often do not cater to a user’s needs. Further, the topic evolutions produced by DTMs tend to contain generic terms that are not indicative of their designated time steps. To address these issues, we propose the task of discriminative dynamic topic discovery. This task aims to discover topic evolutions from temporal corpora that distinctly align with a set of user-provided category names and uniquely capture topics at each time step. We solve this task by developing DynaMiTE, a framework that ensembles semantic similarity, category indicative, and time indicative scores to produce informative topic evolutions. Through experiments on three diverse datasets, including the use of a newly-designed human evaluation experiment, we demonstrate that DynaMiTE is a practical and efficient framework for helping users discover high-quality topic evolutions suited to their interests.
We design a model to perform dynamic topic modeling while using guidance from user-provided topics of interest (a toy version of the term-scoring ensemble follows this entry)
@inproceedings{balepur2023dynamite,
  title     = {DynaMiTE: Discovering Explosive Topic Evolutions with User Guidance},
  author    = {Balepur, Nishant and Agarwal, Shivam and Ramanan, Karthik Venkat and Yoon, Susik and Yang, Diyi and Han, Jiawei},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
  pages     = {194--217},
  year      = {2023},
  url       = {https://aclanthology.org/2023.findings-acl.14},
  video     = {https://youtu.be/KAyd-QqYO6Y},
  tldr      = {We design a model to perform dynamic topic modeling while using guidance from user-provided topics of interest}
}
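As a toy illustration of the ensembled term scoring (semantic similarity, category indicativeness, time indicativeness), consider the sketch below; the three component functions are placeholders for the paper's actual estimators, and the multiplicative combination is an assumption for illustration.

```python
def semantic_similarity(term: str, category: str) -> float:
    raise NotImplementedError("e.g., embedding similarity to the user-provided category name")

def category_indicativeness(term: str, category: str, time_step: int) -> float:
    raise NotImplementedError("how discriminative the term is for this category")

def time_indicativeness(term: str, time_step: int) -> float:
    raise NotImplementedError("how specific the term is to this time step")

def rank_terms(terms: list[str], category: str, time_step: int, top_k: int = 10) -> list[str]:
    """Score each candidate term for one user category at one time step by
    combining the three signals, then keep the highest-scoring terms."""
    scored = {
        term: (semantic_similarity(term, category)
               * category_indicativeness(term, category, time_step)
               * time_indicativeness(term, time_step))
        for term in terms
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```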