Announcing MediaEval NewsImages 2022

Author: Andreas Lommatzsch

The MediaEval NewsImages challenge has received growing interest in recent years. This motivates us to continue the NewsImages challenge in 2022. NewsImages provides a dataset and a forum for researching the relation between texts and images in news. Images in news should attract the users' attention, guide the perception of users, and highlight specific aspects of news texts. The specific challenge for news publishers is that there are often no photos available for breaking news. So, images used in news articles may come from a broad spectrum of sources, including social media, screenshots from TV, stock photos, and recent photos of a relevant person or location.

NewsImages 2022 runs as a lab under the umbrella of MediaEval 2022. NewsImages provides a dataset tailored for learning and evaluating strategies for reassigning news texts and images. In 2022, the dataset consists of three parts: (1) news crawled from an RSS feed focusing on politics, (2) news crawled from Twitter, and (3) a news dataset crawled from a news portal in German.

The NewsImages 2022 workshop will be held in Bergen, Norway in January 2023.
More details about the MediaEval benchmark can be found on the official web page. A detailed description of the methods developed in 2021 can be found in the conference proceedings from 2021.

The MediaEval 2022 website.

Presenting the Chatbot Bobbi at the “Hoffest” at the Red Town Hall in Berlin

Author: Andreas Lommatzsch

The Hoffest (Courtyard Festival) at the Red Town Hall in Berlin is a traditional festival held before the summer vacation. This year, Franziska Giffey, the Governing Mayor of Berlin, welcomed representatives from politics, business, science, diplomacy, culture, media and sports as well as citizens who have earned merit in recent months.

The DAI-Labor presented the robot Bobbi at the stand of the ITDZ Berlin. Bobbi is a cooperation project with the ITDZ Berlin. In this project we develop and evaluate a chatbot optimized for answering questions related to the services offered by the public administration.
For the Hoffest, Bobbi had been trained in the security domain. Based on the new knowledge, Bobbi invited visitors of the Hoffest to play a quiz.
The quiz and the presentation of our robot Bobbi were a great success. Visitors enjoyed talking with Bobbi and learning interesting facts from the security domain. The following photos give a visual impression of the event.

Announcing the 10th Intl. Workshop on News Recommendation and Analytics

Author: Andreas Lommatzsch

In our continuously changing world, news plays a major role. In recent years we have observed a rapidly changing news ecosystem leading to new challenges. The International Workshop on News Recommendation and Analytics (INRA) serves to exchange ideas and discuss recent trends, technological advancements, and open problems concerning news. The workshop provides a forum to discuss recent research, technical and interdisciplinary aspects as well as current trends related to news. Topics of interest include information access systems for news, advances in natural language processing, multi-modality, mis- and disinformation, trust and user experiences, and personalization.

We welcome contributions in the form of scientific articles, demonstrations, and ideas. We strive to bring together researchers, practitioners, and decision-makers to address crucial challenges. The 10th edition of the INRA workshop will be co-located with SIGIR in Madrid, Spain in July 2022.
More details can be found on the INRA webpage:

INRA Workshop, co-located with SIGIR 2022, Madrid, Spain

The DAI Labor provides several datasets for research purposes

Author: Andreas Lommatzsch

At our lab we actively work on research projects in the domains of

  • Language-models and Chatbots
  • Semantic knowledge processing and knowledge graphs
  • Recommender Systems

In this post, we list the datasets we have been using in DAI-Lab’s publications. All datasets are available on request by sending an email to corpora(at)dai-labor.de or to the author of this post.

Delicious dataset

This dataset contains all public bookmarks of about 950,000 users retrieved from http://delicious.com between December 2007 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments that were posted between September 2003 and December 2007. No spam filtering was done! Usernames have been anonymized to protect data privacy. The final dataset is around 7GB in size (compressed).
The full corpus is described and analyzed in:

Analyzing social bookmarking systems: A del.icio.us cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26-30. ECAI 2008, (July 2008).

The Slashdot Zoo

This dataset represents the social network of the technology news web site http://slashdot.org. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements.
An analysis of the dataset was presented at WWW 2009:

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jerome Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009.
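A signed network like this one can be handled as a plain edge list. The following is a minimal sketch, assuming a hypothetical representation in which each relationship is a tuple (source user, target user, sign) with +1 for friend and -1 for foe; the actual file format of the dataset may differ. It also assumes that connectivity ignores both the direction and the sign of a relationship. The function names are illustrative and not taken from the paper.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

# Assumed edge format: (source user, target user, +1 for friend / -1 for foe).
Edge = Tuple[str, str, int]


def component_of(edges: Iterable[Edge], seed: str) -> Set[str]:
    """Return all users reachable from `seed`, ignoring direction and sign."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, _sign in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen = {seed}
    queue = deque([seed])
    while queue:  # breadth-first search over the undirected view of the graph
        user = queue.popleft()
        for other in neighbors[user]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def sign_counts(edges: Iterable[Edge], users: Set[str]) -> Dict[str, int]:
    """Count friend and foe relationships whose endpoints both lie in `users`."""
    counts = {"friend": 0, "foe": 0}
    for src, dst, sign in edges:
        if src in users and dst in users:
            counts["friend" if sign > 0 else "foe"] += 1
    return counts


# Toy data for illustration only; these are not records from the dataset.
toy_edges = [("CmdrTaco", "alice", 1), ("alice", "bob", -1), ("carol", "dave", 1)]
core = component_of(toy_edges, "CmdrTaco")
print(sorted(core), sign_counts(toy_edges, core))
```

On the toy data, the users carol and dave are dropped because they are not connected to the seed user, which mirrors the restriction of the dataset to a single connected component.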

Corpus for Internet News Sentiment Analysis

Details of the dataset and the annotation scheme and process are described in the technical report:

Bütow, F., Lommatzsch, A., Ploch, D.: Creation of a German Corpus for Internet News Sentiment Analysis. Project report, Berlin Institute of Technology, AOT (2016)

GerOM: Dataset with sentiment-annotated quotations in German

The dataset consists of sentiment-annotated quotations. It can be used exclusively for academic, non-commercial research.

Details of the dataset and the annotation scheme are described in:

Ploch, D., 2015. Intelligent News Aggregator for German with Sentiment Analysis, in: Hopfgartner, F. (Ed.), Smart Information Systems, Advances in Computer Vision and Pattern Recognition. Springer International Publishing, pp. 5-46.

Twitter Sentiment Dataset

The dataset contains tweets that have been human-annotated with sentiment labels by three Mechanical Turk workers.
There are 12,597 tweets in four languages: English, German, French and Portuguese.
The labels annotated are positive, neutral, negative and n/a.

Our new Research Project "SPURT" (Language Models for Generating Polls and the Identification of Topics) has started

Author: Andreas Lommatzsch

Learning about the interests and opinions of readers of news portals is interesting for both readers and authors. Online surveys and polls have proven useful to engage readers and to learn about the readers' opinions. The challenge in defining good questions (interesting for most users) lies in the broad spectrum of topics (requiring matching questions) and in finding an adequate question formulation.

Researching which questions attract most users and how readers of online news portals interact with online polls is the objective of the joint project "What-The-Question" between the DAI-Lab at the TU Berlin and Opinary. A special focus lies on applying language models for understanding the context and linguistic aspects of the questions. In addition, methods for the automatic creation of context-aware questions are researched.

The project is supported by the German Federal Ministry for Economic Affairs and Climate Action within the funding program "Central Innovation Programme for small and medium-sized enterprises (SMEs)".


Announcing the 9th INRA workshop, held in conjunction with RecSys 2021.

Author: Andreas Lommatzsch

ACM RecSys is the leading conference in the research area of recommender systems. In 2021, the RecSys conference will be held as a hybrid event. The physical part will take place in the Amsterdam Conference Center (the former Amsterdam Stock Exchange Building). The physical sessions are planned for the mornings; the digital sessions will mostly take place in the afternoon during the "Golden Hours" in order to ensure active participation.

Online news is an important domain for recommender systems. The INRA workshop provides a forum for discussing both technical and societal aspects of current news recommendation and personalization techniques. The 9th International Workshop on News Recommendation and Analytics (INRA2021), co-located with RecSys, is organized jointly by the Norwegian University of Science and Technology (Norway), the Technische Universität Berlin (Germany), and the Penn State University (USA). The workshop submission deadline this year is 6 August 2021. In addition to the workshop, a UMUAI special issue is planned; the authors of the best papers of the INRA workshop will be invited to submit an extended version to this journal.

More details can be found on the INRA workshop webpage.

INRA 2021 Workshop webpage


Announcing MediaEval – News Images 2020

Author: Andreas Lommatzsch

Images play an important role in online news articles and news consumption patterns. The influence of images on the perceived relevance of news items as well as the factors that make images interesting in news are not yet well researched. This has been the motivation for us to set up the "News Images" task in the MediaEval benchmark.

The "News Images" task aims to gain additional insight into the role of images in news articles. The task supplies a large set of articles (including text body and headlines) and the accompanying images. The task requires participants to predict which image was used to accompany each article and also to predict frequently clicked articles on the basis of the accompanying images.

In the challenge, training data describing the news and the interest in the news items (number of views, number of clicks, CTR) is provided. The performance of the developed algorithms is benchmarked based on the data of a subsequent month. The reassignment of images to news items is measured based on the accuracy (percentage of correctly assigned images); the performance of the relevance prediction is measured based on the Precision@N (the top N entries should be predicted correctly).
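The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming that predictions are given as a mapping from article IDs to image IDs and that the relevance prediction is a ranked list of article IDs; the function names and data layout are hypothetical and do not reproduce the official evaluation script.

```python
from typing import Dict, List, Set


def assignment_accuracy(predicted: Dict[str, str], truth: Dict[str, str]) -> float:
    """Share of articles whose predicted image equals the image actually used."""
    if not truth:
        return 0.0
    correct = sum(
        1 for article, image in truth.items() if predicted.get(article) == image
    )
    return correct / len(truth)


def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n ranked items that belong to the relevant set."""
    if n <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


# Toy example with made-up IDs.
truth = {"a1": "img1", "a2": "img2", "a3": "img3"}
predicted = {"a1": "img1", "a2": "img3", "a3": "img3"}
print(assignment_accuracy(predicted, truth))  # two of three articles are correct

ranked_articles = ["a1", "a2", "a3", "a4"]
frequently_clicked = {"a1", "a3", "a9"}
print(precision_at_n(ranked_articles, frequently_clicked, n=2))
```

In this sketch, articles without a prediction count as wrong, and Precision@N divides by N even if fewer than N items were ranked; both are design choices made here for illustration.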

Registration is now open for participants. The MediaEval conference will be held in December 2020. Due to COVID-19, MediaEval will be held fully online this year. Thus, participants can save the time and money usually needed for traveling, leaving more time for developing new algorithms, building interesting presentations, and preparing good discussions. More details about the MediaEval benchmark can be found on the official web page.

Bobbi supports Berlin’s Administration in Answering Questions related to COVID-19

Author: Andreas Lommatzsch

Chatbots are a popular technique for building a scalable and easy-to-use solution answering user questions. Bobbi is the chatbot of the City of Berlin. The bot is designed for answering questions related to the services provided by the city administration.

With the spread of the SARS-CoV-2 virus and the related COVID-19 disease, questions related to the virus have become a major issue. Citizens frequently demand virus-related information and ask about the most recent regulations.
The development of a component optimized for answering COVID-19 related questions raised several challenges:

  • The topic COVID-19 is so recent that no large training data collections exist.
  • The virus affects a lot of different domains. Thus, several different departments and ministries provide information that is relevant for answering user questions.
  • The information related to the virus is continuously changing. Thus, the answers must be frequently updated to ensure that all answers are based on the most recent state of information.
  • Government agencies provide diverse data. Answers to questions may consist of only one word; other answers are very long and consist of more than 15 sentences.
  • The answers provided by the chatbot must be correct. The risk of giving the wrong answer must be minimized.

Our chatbot framework provides a component that fulfills these requirements. A web crawler collects FAQ data from a list of defined sources, such as Berlin’s COVID-19 website, the relevant Senate Administration of Berlin, and the Robert Koch Institute (RKI). The information is re-crawled several times a day, ensuring that it is always up to date.

When a citizen asks a question in the Bobbi chat, the chatbot first checks whether the question is related to COVID-19. If the question is related and the question is very similar to a question available in the set of FAQs, the bot directly provides the answer.

If the question is related but does not exactly match a question from the set of crawled FAQs, the bot shows the user a list of the closest matches. This ensures that even if the user question contains synonyms or a negation, the bot provides a correct question-answer pair. The matching uses a German language model and a collection of domain-specific synonyms to ensure good answer quality without the need for extensive training data.
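The decision flow described above (check relatedness, answer directly on a very close match, otherwise offer the closest FAQ questions) can be sketched as follows. This is a simplified illustration, not Bobbi's implementation: the keyword list, the synonym table, the token-overlap similarity, and both thresholds are assumptions made for this example. The production system uses a German language model, and this sketch does not handle negation.

```python
import re
from typing import Dict, List, Set, Tuple

# All values below are hypothetical and chosen for illustration only.
DIRECT_THRESHOLD = 0.8    # at or above: answer directly
SUGGEST_THRESHOLD = 0.3   # at or above: offer the FAQ question as a suggestion
SYNONYMS = {"coronavirus": "covid", "corona": "covid", "facemask": "mask"}
COVID_KEYWORDS = {"covid", "mask", "quarantine", "vaccination"}


def normalize(text: str) -> Set[str]:
    """Lowercase, tokenize, and map synonyms to one canonical token."""
    tokens = re.findall(r"\w+", text.lower())
    return {SYNONYMS.get(token, token) for token in tokens}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the normalized token sets (stand-in for a language model)."""
    tokens_a, tokens_b = normalize(a), normalize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def respond(question: str, faqs: Dict[str, str], top_k: int = 3) -> Tuple[str, List[str]]:
    """Return ("answer", [text]), ("suggestions", [questions]) or ("unrelated", [])."""
    if not normalize(question) & COVID_KEYWORDS:
        return ("unrelated", [])
    scored = sorted(((similarity(question, q), q) for q in faqs), reverse=True)
    if scored and scored[0][0] >= DIRECT_THRESHOLD:
        return ("answer", [faqs[scored[0][1]]])
    close = [q for score, q in scored[:top_k] if score >= SUGGEST_THRESHOLD]
    return ("suggestions", close)


# Made-up FAQ entries; the real FAQs are crawled from official sources.
faqs = {
    "Do I have to wear a mask in shops?": "Please check the current regulations.",
    "Where can I get tested for covid?": "Test centers are listed on the city website.",
}
print(respond("Where can I get a corona test?", faqs))
```

Showing suggestions instead of guessing an answer keeps the risk of a wrong answer low: the user picks the matching FAQ question, and the bot only ever returns an answer text that was crawled from an official source.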

In addition to the FAQ matching, the bot also searches for relevant administrative services to ensure that the user has access to comprehensive information. The question answering is available in nine different languages.

The chatbot’s usage statistics emphasize the high demand for COVID-19 information. In May 2020, Bobbi conducted about nine times more dialogs than in January 2020. About 80% of the dialogs consist only of questions related to COVID-19.
Due to the high acceptance of this functionality, we will extend this feature so that questions from other topics and domains can also be answered in a similar style.
You can find the chatbot Bobbi on Berlin’s Official Services Web Portal.

ACM Conference on Recommender Systems 2019 and International Workshop on News Recommendation and Analytics

Author: Benjamin Kille

The ACM International Conference on Recommender Systems was held in Copenhagen from 16th to 20th September 2019. The 13th edition of RecSys featured three days of conference talks followed by two days of workshops and tutorials. The program included two keynote speeches. First, Mireille Hildebrandt explored how the EU’s GDPR affects recommender systems. Second, Eszter Hargittai discussed recommender systems from the perspective of social research. The conference emphasized the interdisciplinary character of recommender systems with the keynotes, a variety of contributions, and multiple tutorials and workshops. For instance, the program featured tutorials on multi-stakeholder considerations and fairness along with a workshop on multi-stakeholder environments. This trend signifies that the recommender systems research community increasingly attracts experts from different disciplines.

In collaboration with partners from NTNU, we organized the seventh edition of the International Workshop on News Recommendation and Analytics. The workshop seeks to present cutting-edge research as well as practical insights from the intersection of news and recommender systems. We received sixteen submissions, ten of which we could accept given the three-hour time slot. The accepted contributions split evenly into five long and five short papers. We allotted each presenter ten minutes for a short paper and eighteen minutes for a long paper. We were very happy to welcome the University of Amsterdam’s Natali Helberger as a keynote speaker. Her talk aligned perfectly with the conference theme. She emphasized the intricate and subtle ways in which recommender systems affect societies.

Subsequent to the keynote, attendees followed along with these talks:

  • Public Service Media, Diversity and Algorithmic Recommendation: Tensions between Editorial Principles and Algorithms in European PSM Organizations [Jannick Kirk Sørensen]
  • Semi-supervised sentiment analysis for under-resourced languages with a sentiment lexicon [Peng Liu, Cristina Marco and Jon Atle Gulla]
  • On the Importance of News Content Representation in Hybrid Neural Session-based Recommender Systems [Gabriel De Souza P. Moreira, Dietmar Jannach and Adilson Marques Da Cunha]
  • Defining a Meaningful Baseline for News Recommender Systems [Benjamin Kille and Andreas Lommatzsch]
  • On-the-Fly News Recommendation Using Sequential Patterns [Mozhgan Karimi, Boris Cule and Bart Goethals]
  • Giveme5W1H: A Universal System for Extracting Main Events from News Articles [Felix Hamborg, Corinna Breitinger, and Bela Gipp]
  • Recommendation systems for news articles at the BBC [Maria Panteli, Alessandro Piscopo, Adam Harland, Jonathan Tutcher and Felix Mercer Moss]
  • Trend-responsive user segmentation enabling traceable publishing insights. A case study of a real-world large-scale news recommendation system [Joanna Misztal-Radecka, Dominik Rusiecki, Michał Żmuda and Artur Bujak]
  • Leveraging Emotion Features in News Recommendations [Nastaran Babanejad, Ameeta Agrawal, Heidar Davoudi, Aijun An and Manos Papagelis]

The collection of talks featured a balanced mixture of research and insights into practice. Unfortunately, Janu Verma could not attend to present his work on “Enriched Network Embeddings for News Recommendation.”

In addition, the authors had the chance to put up posters to aid the discussions during the break. For everyone who could not attend the workshop, we have included some visual impressions.

Presenting our Chatbot Research at the LWDA Conference 2019 in Berlin

Author: Andreas Lommatzsch

The 2019 LWDA conference was held in Berlin from September 30th to October 2nd, 2019. This year’s venues were the Smart Data Forum (next to the TU Berlin) and the Berlin School of Library and Information Science (next to the main building of the Humboldt-University Berlin). The conference is organized by the German Computer Science Society (GI). The core topics of the conference are Knowledge Discovery and Machine Learning, Databases, and Information Retrieval.

Among the many interesting presentations, I would like to highlight the keynote "Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs" by Stefan Dietze. The talk underlined the importance of datasets and the aggregation of datasets for research. Challenges are ambiguity and missing metadata for the available datasets. Converting crawled data into knowledge graphs by applying semantic and ML methods (e.g. NER, NED, Sentiment Detection) provides the basis for new research fields, especially related to social science. I liked the talk because we made similar observations in our research projects (e.g. [1] and [2]). The datasets we created are provided on our dataset web page.

CC IRML presented current research in the domain of chatbot systems at the conference. Our contribution "An Information Retrieval-based Approach for Building Intuitive Chatbots for Large Knowledge Bases" reports on our experiences running the Virtual Assistant "Bobbi". Bobbi is a chatbot providing information related to services and locations of the Berlin Administration. The paper discusses how to build chatbots without training data (cold-start problem) and explains how to efficiently handle the wide variety of observed user intentions. The research uses data which we have collected in the live system deployed on the official website of the city of Berlin (service.berlin.de). We presented the results in a 30-minute talk. In addition, we participated in the poster session to discuss our work more directly with attendees.