The DAI Labor provides several datasets for research purposes

Author: Andreas Lommatzsch

At our lab, we actively work on research projects in the following domains:

  • Language models and chatbots
  • Semantic knowledge processing and knowledge graphs
  • Recommender systems

In this post, we list the datasets we have used in DAI-Lab’s publications. All datasets are available on request by sending an email to corpora(at) or to the author of this post.

Delicious dataset

This dataset contains all public bookmarks of about 950,000 users, retrieved from Delicious between December 2007 and April 2008. The retrieval process yielded about 132 million bookmarks with 420 million tag assignments, posted between September 2003 and December 2007. No spam filtering was performed. Usernames have been anonymized to protect data privacy. The final dataset is around 7 GB in size (compressed).
The full corpus is described and analyzed in:

Analyzing social bookmarking systems: A cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26-30. ECAI 2008, (July 2008).

The Slashdot Zoo

This dataset represents the social network of the technology news website Slashdot. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements, respectively.
An analysis of the dataset was presented at WWW 2009:

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jérôme Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009.
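
A signed network of this kind, with friend and foe links, can be held in memory as an edge list with a +1 or -1 sign. The following is a minimal sketch, not the dataset's actual file format: the tuple layout, the direction of the links, the function name endorsement_counts, and every user name except CmdrTaco are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical edge list: (source_user, target_user, sign).
# +1 marks a "friend" link, -1 marks a "foe" link.
# All user names except CmdrTaco are invented for this example.
edges = [
    ("alice", "CmdrTaco", +1),
    ("bob", "CmdrTaco", -1),
    ("alice", "bob", +1),
    ("carol", "bob", -1),
]


def endorsement_counts(edges):
    """Count incoming friend and foe links for every target user."""
    counts = defaultdict(lambda: {"friend": 0, "foe": 0})
    for _source, target, sign in edges:
        key = "friend" if sign > 0 else "foe"
        counts[target][key] += 1
    return dict(counts)


counts = endorsement_counts(edges)
for user, tally in sorted(counts.items()):
    print(user, tally)
```

Keeping the sign on the edge, instead of storing two separate graphs, lets the same traversal code handle positive and negative endorsements.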

Corpus for Internet News Sentiment Analysis

Details of the dataset and the annotation scheme and process are described in the technical report:

Bütow, F., Lommatzsch, A., Ploch, D.: Creation of a German Corpus for Internet News Sentiment Analysis. Project report, Berlin Institute of Technology, AOT (2016)

GerOM: Dataset with sentiment-annotated quotations in German

The dataset consists of sentiment-annotated quotations. It can be used exclusively for academic, non-commercial research.

Details of the dataset and the annotation scheme are described in:

Ploch, D., 2015. Intelligent News Aggregator for German with Sentiment Analysis, in: Hopfgartner, F. (Ed.), Smart Information Systems, Advances in Computer Vision and Pattern Recognition. Springer International Publishing, pp. 5-46.

Twitter Sentiment Dataset

The dataset contains tweets that were annotated with sentiment labels by three Mechanical Turk workers.
It comprises 12,597 tweets in four languages: English, German, French, and Portuguese.
The labels are positive, neutral, negative, and n/a.
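
The post does not say how the workers' labels are combined into a single label per tweet. One common option is a majority vote, sketched below on the assumption that each tweet carries exactly three labels, one per worker. The function name majority_label and the list-of-strings input are hypothetical and do not reflect the dataset's actual layout.

```python
from collections import Counter

# The four labels named in the dataset description.
LABELS = {"positive", "neutral", "negative", "n/a"}


def majority_label(annotations):
    """Return the label given by at least two of three annotators.

    Returns None when all three annotators disagree.
    """
    if len(annotations) != 3:
        raise ValueError("expected exactly three annotations")
    if not set(annotations) <= LABELS:
        raise ValueError("unknown label in annotations")
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= 2 else None


print(majority_label(["positive", "positive", "neutral"]))
print(majority_label(["positive", "neutral", "negative"]))
```

Returning None for a three-way split leaves the decision about such tweets (discard, re-annotate, or treat as n/a) to the caller.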
