GovReport Dataset for Long Document Summarization with Question-Summary Hierarchy Annotations (GovReport-QS) (First Release, 2022)

Peer Reviews for Argument Mining with Structure Annotations (AMPERE++) (First Release, 2022)

News Articles with Story-level Alignment (BIGNEWS and BIGNEWSALIGN) (First Release, 2022)

Open-ended Question Type Prediction and Question Generation Dataset (First Release, 2021)

GovReport Dataset for Long Document Summarization (GovReport) (First Release, 2021)

  • 19.5k U.S. government reports with expert-written long-form abstractive summaries.
  •  DATA
  • This corpus is distributed together with:
    Efficient Attentions for Long Document Summarization
    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang
    Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

NLG with Planning for Content and Style (First Release, 2019)

News Media Bias (BASIL) (Second Release, 2021)

  • News articles labeled with lexical bias and informational bias at the phrase level and the sentence level.
  •  DATA
  • This corpus is distributed together with:
    In Plain Sight: Media Bias through the Lens of Factual Reporting
    Lisa Fan, Marshall White, Eva Sharma, Ruisi Su, Prafulla Kumar Choubey, Ruihong Huang, and Lu Wang
    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), short paper, 2019.

Reddit CMV Argument Corpus (a larger collection) and Arguments from News Media (First Release, 2019)

  • CMV arguments paired with relevant arguments collected from mainstream media of different ideological leanings.
  •  DATA
  • This corpus is distributed together with:
    Argument Generation with Retrieval, Planning, and Realization
    Xinyu Hua, Zhe Hu, and Lu Wang
    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

BigPatent Dataset for Abstractive Summarization (BigPatent) (First Release, 2019)

Peer Reviews for Argument Mining (AMPERE) (First Release, 2019)

  • Peer reviews collected from machine learning conferences, annotated with the argument types EVALUATION, REQUEST, FACT, REFERENCE, or QUOTE.
  •  DATA
  • This corpus is distributed together with:
    Argument Mining for Understanding Peer Reviews
    Xinyu Hua, Mitko Nikolov, Nikhil Badugu, and Lu Wang
    Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short paper, 2019.

Reddit CMV Argument Corpus (First Release, 2018)

Microblog Conversation Recommendation Corpus (First Release, 2018)

IDebate Argument Type Corpus (First Release, 2017)

Movie Review and Online Argument Corpus (First Release, 2016)

Socially-Informed Timeline Generation Corpus (First Release, 2015)

  • New York Times, CNN, and BBC news articles and user comments on four major events that happened in 2014.
    New York Times news articles and user comments in 2013.
  •  DATA (.zip)
     README (.txt)
  • This corpus is distributed together with:
    Socially-Informed Timeline Generation for Complex Events
    Lu Wang, Claire Cardie, and Galen Marchetti
    Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2015.

Wikipedia Disputed Discussion Corpus (First Release, 2016)