Announcing the 10th Intl. Workshop on News Recommendation and Analytics

Author: Andreas Lommatzsch

In our continuously changing world, news plays a major role. In recent years, we have observed a rapidly changing news ecosystem that leads to new challenges. The International Workshop on News Recommendation and Analytics (INRA) serves to exchange ideas and discuss recent trends, technological advancements, and open problems concerning news. The workshop provides a forum to discuss recent research, technical and interdisciplinary aspects, and current trends related to news. Topics of interest include information access systems for news, advances in natural language processing, multi-modality, mis- and disinformation, trust and user experiences, and personalization.

We welcome contributions in the form of scientific articles, demonstrations, and ideas. We strive to bring together researchers, practitioners, and decision-makers to address crucial challenges. The 10th edition of the INRA workshop will be co-located with SIGIR in Madrid, Spain, in July 2022.
More details can be found on the INRA webpage:

INRA Workshop, colocated with SIGIR 2022, Madrid, Spain

The DAI-Labor provides several datasets for research purposes

Author: Andreas Lommatzsch

At our lab, we actively work on research projects in the domains of

  • Language-models and Chatbots
  • Semantic knowledge processing and knowledge graphs
  • Recommender Systems

In this post, we list the datasets we have used in DAI-Lab’s publications. All datasets are publicly available and can be requested by sending an email to corpora(at)dai-labor.de.

Delicious dataset

This dataset contains all public bookmarks of about 950,000 users retrieved from http://delicious.com between December 2007 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments that were posted between September 2003 and December 2007. No spam filtering was done! Usernames have been anonymized to protect data privacy. The final dataset is around 7GB in size (compressed).
The full corpus is described and analyzed in:

Analyzing social bookmarking systems: A del.icio.us cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26-30. ECAI 2008, (July 2008).

The Slashdot Zoo

This dataset represents the social network of the technology news web site http://slashdot.org. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements, respectively.
An analysis of the dataset was presented at WWW 2009:

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jerome Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009.

Corpus for Internet News Sentiment Analysis

Details of the dataset and the annotation scheme and process are described in the technical report:

Bütow, F., Lommatzsch, A., Ploch, D.: Creation of a German Corpus for Internet News Sentiment Analysis. Project report, Berlin Institute of Technology, AOT (2016)

GerOM: Dataset with sentiment-annotated quotations in German

The dataset consists of sentiment-annotated quotations. It can be used exclusively for academic, non-commercial research.

Details of the dataset and the annotation scheme are described in:

Ploch, D., 2015. Intelligent News Aggregator for German with Sentiment Analysis, in: Hopfgartner, F. (Ed.), Smart Information Systems, Advances in Computer Vision and Pattern Recognition. Springer International Publishing, pp. 5-46.

Twitter Sentiment Dataset

The dataset contains tweets that have been human-annotated with sentiment labels by 3 Mechanical Turk workers.
There are 12,597 tweets in 4 languages: English, German, French, and Portuguese.
The annotated labels are positive, neutral, negative, and n/a.

Our new Research Project "SPURT" (Language Models for Generating Polls and the Identification of Topics) has started

Author: Andreas Lommatzsch

Learning about the interests and opinions of news portal readers is interesting for both readers and authors. Online surveys and polls have proven useful for engaging readers and for learning about their opinions. The challenges in defining good questions (interesting for most users) are the broad spectrum of topics (requiring matching questions) and finding an adequate question formulation.

The joint project "What-The-Question" of the DAI-Lab at the TU Berlin and Opinary investigates which questions attract the most users and how readers of online news portals interact with online polls. A special focus lies on applying language models to understand the context and linguistic aspects of the questions. In addition, we research methods for automatically creating context-aware questions.

The project is supported by the German Federal Ministry for Economic Affairs and Climate Action within the funding program "Central Innovation Programme for small and medium-sized enterprises (SMEs)".


Announcing the 9th INRA workshop, held in conjunction with RecSys 2021.

Author: Andreas Lommatzsch

ACM RecSys is the leading conference in the research area of recommender systems. In 2021, the RecSys conference will be held as a hybrid event. The physical part will take place in the Amsterdam Conference Center (the former Amsterdam Stock Exchange Building). The physical sessions are planned for the mornings; the digital sessions will mostly take place in the afternoon, during the "Golden Hours", to ensure active participation.

Online news is an important domain for recommender systems. The INRA workshop provides a forum for discussing both the technical and the societal aspects of news recommendation and personalization techniques. The 9th International Workshop on News Recommendation and Analytics (INRA 2021), co-located with RecSys, is organized in cooperation between the Norwegian University of Science and Technology (Norway), the Technische Universität Berlin (Germany), and the Penn State University (USA). The workshop submission deadline this year is 6 August 2021. In addition to the workshop, a UMUAI special issue is planned; the authors of the best INRA workshop papers will be invited to submit extended versions to this journal.

More details can be found on the INRA workshop webpage.

INRA 2021 Workshop webpage


Announcing MediaEval – News Images 2020

Author: Andreas Lommatzsch

Images play an important role in online news articles and news consumption patterns. The influence of images on the perceived relevance of news items, as well as the factors that make images interesting in news, is not yet well researched. This motivated us to set up the "News Images" task in the MediaEval benchmark.

The "News Images" task aims to gain additional insight into the role of images in news articles. The task supplies a large set of articles (including text body and headlines) and the accompanying images. Participants are required to predict which image was used to accompany each article and to predict frequently clicked articles on the basis of the accompanying images.

For the challenge, training data describing the news items and the interest in them (number of views, number of clicks, CTR) is provided. The performance of the developed algorithms is benchmarked on the data of a subsequent month. The reassignment of images to news items is measured by accuracy (the percentage of correctly assigned images); the performance of the relevance prediction is measured by Precision@N (the top N entries should be predicted correctly).
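
The two measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not the official evaluation script of the task: the function names, the dictionary-based input format, and the reading of "relevant" as "truly among the most clicked articles" are assumptions made here.

```python
def assignment_accuracy(predicted, actual):
    """Share of articles whose predicted image equals the image actually used.

    Both arguments map article IDs to image IDs (assumed input format).
    """
    if not actual:
        raise ValueError("actual assignments must not be empty")
    correct = sum(
        1 for article, image in actual.items()
        if predicted.get(article) == image
    )
    return correct / len(actual)


def precision_at_n(ranked_articles, relevant_articles, n):
    """Share of the top-n predicted articles that are truly relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_articles[:n]
    hits = sum(1 for article in top_n if article in relevant_articles)
    return hits / n


# Hypothetical toy data: two of four images are assigned correctly.
actual = {"a1": "img1", "a2": "img2", "a3": "img3", "a4": "img4"}
predicted = {"a1": "img1", "a2": "img3", "a3": "img2", "a4": "img4"}
print(assignment_accuracy(predicted, actual))  # 0.5

# One of the two top-ranked articles is among the truly popular ones.
print(precision_at_n(["a4", "a1", "a2", "a3"], {"a1", "a3"}, n=2))  # 0.5
```

Dividing by n (rather than by the number of returned items) penalizes systems that return fewer than N predictions, which is one common convention.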

Registration is now open for participants. The MediaEval conference will be held in December 2020. Due to COVID-19, MediaEval will take place fully online this year. Thus, participants can save the time and money usually needed for traveling, leaving more time for developing new algorithms, building interesting presentations, and preparing good discussions. More details about the MediaEval benchmark can be found on the official web page.

Bobbi supports Berlin’s Administration in Answering Questions related to COVID-19

Author: Andreas Lommatzsch

Chatbots are a popular technique for building scalable and easy-to-use solutions for answering user questions. Bobbi is the chatbot of the City of Berlin. The bot is designed to answer questions related to the services provided by the city administration.

With the spread of the SARS-CoV-2 virus and the related COVID-19 disease, questions related to the virus have become a major issue. Citizens frequently demand virus-related information and ask about the most recent regulations.
The development of a component optimized for answering COVID-19 related questions raised several challenges:

  • The topic COVID-19 is so recent that no large training data collections exist.
  • The virus affects a lot of different domains. Thus, several different departments and ministries provide information that is relevant for answering user questions.
  • The information related to the virus is continuously changing. Thus, the answers must be frequently updated to ensure that all answers are based on the most recent state of information.
  • Government agencies provide diverse data. Answers to questions may consist of only one word; other answers are very long and consist of more than 15 sentences.
  • The answers provided by the chatbot must be correct. The risk of giving the wrong answer must be minimized.

Our chatbot framework provides a component that fulfills these requirements. A web crawler collects FAQ data from a list of defined sources, such as Berlin’s COVID-19 website, the relevant Berlin Senate Administrations, and the RKI (Robert Koch Institute). The sources are re-crawled several times a day to ensure that the information is always up to date.

When a citizen asks a question in the Bobbi chat, the chatbot first checks whether the question is related to COVID-19. If the question is related and the question is very similar to a question available in the set of FAQs, the bot directly provides the answer.

If the question is related but does not exactly match a question from the set of crawled FAQs, the bot shows the user a list of the closest matches. This ensures that even if the user question contains synonyms or a negation, the bot provides a correct question-answer pair. The matching uses a German language model and a collection of domain-specific synonyms to ensure good answer quality without the need for extensive training data.
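
The two-stage behavior described above (direct answer for a very close match, a list of suggestions otherwise) can be illustrated with a small Python sketch. It is not Bobbi's implementation: a simple token-overlap (Jaccard) similarity stands in for the German language model, the relatedness check is folded into a minimum-similarity threshold, and the synonym table, threshold values, and function names are hypothetical.

```python
import re

# Hypothetical synonym table; the real system uses domain-specific synonyms.
SYNONYMS = {"coronavirus": "covid-19", "corona": "covid-19", "mask": "face-covering"}


def normalize(text):
    """Lowercase, tokenize, and map synonyms to a canonical token."""
    tokens = re.findall(r"[\w-]+", text.lower())
    return {SYNONYMS.get(token, token) for token in tokens}


def jaccard(a, b):
    """Token-set overlap; a stand-in for a language-model similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def answer(question, faqs, direct_threshold=0.8, suggest_threshold=0.2,
           max_suggestions=3):
    """Return ("answer", [text]), ("suggestions", [questions]) or ("no_match", []).

    `faqs` maps FAQ questions to answers. Both thresholds are assumptions.
    """
    query = normalize(question)
    scored = sorted(
        ((jaccard(query, normalize(faq)), faq) for faq in faqs),
        reverse=True,
    )
    if not scored or scored[0][0] < suggest_threshold:
        return ("no_match", [])
    best_score, best_question = scored[0]
    if best_score >= direct_threshold:
        return ("answer", [faqs[best_question]])
    suggestions = [faq for score, faq in scored[:max_suggestions]
                   if score >= suggest_threshold]
    return ("suggestions", suggestions)


# Placeholder FAQ entries; the answer texts are intentionally not real content.
faqs = {
    "Do I have to wear a mask in shops?": "ANSWER_MASKS",
    "Where can I get tested for covid-19?": "ANSWER_TESTING",
}
print(answer("Do I have to wear a mask in shops?", faqs))
print(answer("Where do I get a corona test?", faqs))
```

Showing suggestions instead of guessing keeps the risk of a wrong answer low, since the user confirms which crawled question matches their intent.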

In addition to the FAQ matching, the bot also searches for relevant administrative services to ensure that the user has access to comprehensive information. The question answering is available in nine different languages.

The chatbot’s usage statistics emphasize the high demand for COVID-19 information. In May 2020, Bobbi conducted about nine times as many dialogs as in January 2020. About 80% of the dialogs consist only of questions related to COVID-19.
Due to the high acceptance of this functionality, we will extend this feature so that questions from other topics and domains can also be answered in a similar style.
You can find the chatbot Bobbi on Berlin’s Official Services Web Portal.

ACM Conference on Recommender Systems 2019 and International Workshop on News Recommendation and Analytics

Author: Benjamin Kille

The ACM International Conference on Recommender Systems was held in Copenhagen from 16th to 20th September 2019. The 13th edition of RecSys featured three days of conference talks followed by two days of workshops and tutorials. The program included two keynote speeches. First, Mireille Hildebrandt explored how the EU’s GDPR affects recommender systems. Second, Eszter Hargittai discussed recommender systems from the perspective of social research. The conference emphasized the interdisciplinary character of recommender systems with the keynotes, a variety of contributions, and multiple tutorials and workshops. For instance, the program featured tutorials on multi-stakeholder considerations and fairness along with a workshop on multi-stakeholder environments. This trend signifies that the recommender systems research community increasingly attracts experts from different disciplines.

In collaboration with partners from NTNU, we organized the seventh edition of the International Workshop on News Recommendation and Analytics. The workshop seeks to present cutting-edge research as well as practical insights from the intersection of news and recommender systems. We received sixteen submissions, ten of which we could accept given the three-hour time slot. The accepted contributions split evenly into five long and five short papers. We gave each presenter ten minutes for a short paper and eighteen minutes for a long paper. We were very happy to welcome the University of Amsterdam’s Natali Helberger as a keynote speaker. Her talk aligned perfectly with the conference theme. She emphasized the intricate and subtle ways in which recommender systems affect societies.

After the keynote, attendees followed these talks:

  • Public Service Media, Diversity and Algorithmic Recommendation: Tensions between Editorial Principles and Algorithms in European PSM Organizations [Jannick Kirk Sørensen]
  • Semi-supervised sentiment analysis for under-resourced languages with a sentiment lexicon [Peng Liu, Cristina Marco and Jon Atle Gulla]
  • On the Importance of News Content Representation in Hybrid Neural Session-based Recommender Systems [Gabriel De Souza P. Moreira, Dietmar Jannach and Adilson Marques Da Cunha]
  • Defining a Meaningful Baseline for News Recommender Systems [Benjamin Kille and Andreas Lommatzsch]
  • On-the-Fly News Recommendation Using Sequential Patterns [Mozhgan Karimi, Boris Cule and Bart Goethals]
  • Giveme5W1H: A Universal System for Extracting Main Events from News Articles [Felix Hamborg, Corinna Breitinger, and Bela Gipp]
  • Recommendation systems for news articles at the BBC [Maria Panteli, Alessandro Piscopo, Adam Harland, Jonathan Tutcher and Felix Mercer Moss]
  • Trend-responsive user segmentation enabling traceable publishing insights. A case study of a real-world large-scale news recommendation system [Joanna Misztal-Radecka, Dominik Rusiecki, Michał Żmuda and Artur Bujak]
  • Leveraging Emotion Features in News Recommendations [Nastaran Babanejad, Ameeta Agrawal, Heidar Davoudi, Aijun An and Manos Papagelis]

The collection of talks featured a balanced mixture of research and insights into practice. Unfortunately, Janu Verma could not attend to present his work on “Enriched Network Embeddings for News Recommendation.”

In addition, the authors had the chance to put up posters to aid the discussions during the break. For everyone who could not attend the workshop, we have included some visual impressions.

Presenting our Chatbot Research at the LWDA Conference 2019 in Berlin

Author: Andreas Lommatzsch

The 2019 LWDA conference was held in Berlin from September 30th to October 2nd, 2019. This year’s venues were the Smart Data Forum (next to the TU Berlin) and the Berlin School of Library and Information Science (next to the main building of the Humboldt-University Berlin). The conference is organized by the German Computer Science Society (GI). The core topics of the conference are Knowledge Discovery and Machine Learning, Databases, and Information Retrieval.

Among the many interesting presentations, I would like to highlight the keynote "Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs" by Stefan Dietze. The talk underlined the importance of datasets and of aggregating datasets for research. Challenges include ambiguity and missing metadata for the available datasets. Converting crawled data into knowledge graphs by applying semantic and ML methods (e.g. NER, NED, Sentiment Detection) provides the basis for new research fields, especially in the social sciences. I liked the talk because we have made similar observations in our research projects (e.g. [1] and [2]). The datasets we created are provided on our dataset web page.

CC IRML presented current research in the domain of chatbot systems at the conference. Our contribution "An Information Retrieval-based Approach for Building Intuitive Chatbots for Large Knowledge Bases" reports on our experience running the virtual assistant "Bobbi". Bobbi is a chatbot providing information related to services and locations of the Berlin Administration. The paper discusses how to build chatbots without training data (cold-start problem) and explains how to efficiently handle the wide variety of observed user intentions. The research uses data that we collected in the live system deployed on the official website of the city of Berlin (service.berlin.de). We presented the results in a 30-minute talk. In addition, we participated in the poster session to talk more directly with attendees.

Presenting our Multimedia-based Recommender Approaches at the 19th I4CS Conference

Author: Andreas Lommatzsch

The International Conference on Innovative Internet Community Systems (I4CS) was held at the CongressPark Wolfsburg from June 24 to 26, 2019.

This year, the conference focus was on Digital Innovations for the Public and Mobility Services. This focus was especially visible on the second day of the conference. The day started with a talk by the Mayor of Wolfsburg explaining the digitalization strategy of the city. Subsequently, Volkswagen AG presented the Wolfsburg.Digital program. The initiative supports innovative solutions for improving the quality of life by improving the digital infrastructure, managing traffic efficiently, creating the infrastructure for e-mobility, and promoting zero-carbon building. The presentations and the discussion gave interesting insights into the current state of development and the plans for the next years. The poster session gave ample space for discussing research in detail.

Overall, the conference offered exciting insights into current research projects and new ideas for further research. I presented our framework for computing multimedia-based recommendations. The framework and the publicly available real-world news dataset are used in the MediaEval benchmark, enabling researchers to evaluate new multimedia-based recommender algorithms. In addition to the conference presentation, I also presented the system in the poster session. This gave us time for detailed discussions and new cooperation ideas.

The highlight of the social program of the conference was a guided tour through the Volkswagen factory. The tour gave insights into the different vehicle production areas and showed how different cars are produced. In an open Golf train, the tour showed all steps of the production process.

In 2020, the 20th edition of the I4CS conference will be held in Bhubaneswar, India.

Multi-Media Analysis for Recommender Systems

The World Wide Web initially comprised a collection of texts in the form of HTML documents. Over the years, organizations and, increasingly, users have added a variety of multimedia. Today, popular web portals attract viewers not only with captivating stories but also with audio, images, and videos.

Users continue to struggle with the vast amount of information available at their fingertips. Finding relevant information has become a greater challenge.

Organizations operating popular portals have introduced systems supporting users in their quest to find interesting content. These recommender systems take the collection of items, process it automatically, and derive a small set of suggestions. Research has established tools to process texts automatically. Dealing with multimedia remains more difficult.

Content-agnostic methods, such as Collaborative Filtering, rely on strong user profiles. News publishers cannot provide such profiles as readers tend to visit their portals anonymously. Consequently, news publishers tend to combine non-personalized and content-based methods. Content-based filtering takes features describing the item and establishes similarities among them to find content that matches users’ preferences. Hitherto, multimedia content has been largely ignored due to technical difficulties.
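
To make the idea of content-based filtering concrete, here is a minimal Python sketch. It assumes a plain bag-of-words representation and cosine similarity, and it uses the article a reader is currently viewing as a stand-in for the user's preferences (a common workaround for anonymous readers). The function names and the toy articles are hypothetical; production systems use richer features.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent an article as a bag of lowercase words (assumed features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(read_article, candidates, k=2):
    """Rank candidate articles by similarity to the article being read."""
    profile = vectorize(read_article)
    ranked = sorted(
        candidates,
        key=lambda article_id: cosine(profile, vectorize(candidates[article_id])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy catalogue of headlines.
candidates = {
    "c1": "city council votes on new bike lanes",
    "c2": "football club wins cup final",
    "c3": "cup final tickets sold out",
}
print(recommend("football cup final preview", candidates))  # ['c2', 'c3']
```

Image-derived features could be appended to the same vectors, which is the kind of extension multimedia-based recommenders aim for.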

This has motivated us to set up the “MediaEval – Multimedia for Recommender System” benchmark. The benchmark asks participants to predict the most popular news items based on image features. Participants obtain a data set spanning six weeks. They have to predict the items which will collect the most views in the following weeks. We have computed a set of image annotations to simplify getting started. Statistics and preliminary observations are described in the Task overview paper.

If you have promising ideas about how to extract useful features from images to predict news articles’ popularity, check out the challenge details here.

We have randomly selected three images of the categories sports, local, and politics to give you an impression of the data.