My research focuses on training and using large language models (LLMs) to answer the following language-related questions:
(1) Trustworthy LLMs: How to build models that generate factual and attributable content (Cao and Wang, EMNLP 2021; Liu et al., EMNLP 2024)? And how to calibrate their confidence based on what they know and what they don't know (Liu et al., ICLR 2024)? I share my thoughts on the factuality of GenAI in this article and on AI safety and its usage in this webinar.
(2) Reasoning: How to train models with improved reasoning skills using self-verification and step-wise rewards (Zhang et al., ACL findings 2024; Khalifa et al., EMNLP findings 2023)?
(3) Evaluating LLMs: How to evaluate models' performance on challenging and in-the-wild tasks beyond traditional benchmarks built on multiple-choice questions or short reference answers (Jabbour et al., arXiv 2025; Zhang et al., arXiv 2025; Bayat et al., arXiv 2025)? How to evaluate specific properties of LLMs, such as long-context understanding (Zou et al., arXiv 2025)?
(4) Narrative understanding: How are human values reflected in storytelling, and how does that influence the target audience (Zhang et al., NAACL 2024; Wu et al., EMNLP 2023)? And can LLMs discern the values underlying human perspectives in narratives (Lee et al., arXiv 2025)?
For core natural language processing (NLP) problems, I have been building summarization systems for long documents (Huang et al., NAACL 2021) and multi-source inputs (Peper et al., NAACL 2024), and developing controllable generation techniques (Liu et al., ACL 2023).
I am also interested in building AI applications with domain impact, including developing argument mining models (Hua and Wang, ACL findings 2022) to support the building of writing assistants (Nair et al., EMNLP 2024), and using information extraction and sentiment analysis models to understand how the media informs and persuades the public by selecting and packaging information (Fan et al., EMNLP 2019).