NewsImages at MediaEval 2022 in Bergen, Norway

Author: Andreas Lommatzsch

The MediaEval 2022 conference was held in Bergen, Norway. It was co-located with the 29th International Conference on Multimedia Modeling to foster an exchange of ideas between the communities.
The NewsImages lab, which researches the use of multimedia in the news domain, has been part of MediaEval since 2018.

This year, NewsImages provided three news datasets from three different domains. (1) The RSS part provided news from different news portals (aggregated by the GDELT project). (2) The TW part provided tweets related to news stories. (3) The RT part provided news in German related to the war in Ukraine. The task in NewsImages consists of re-matching news texts and images. Details about the analyzed setting, the datasets, and the evaluation metrics are explained in the lab overview paper.

In this year’s challenge, five teams from Europe and Asia participated. The developed solutions provided very good results; the best team reached a Recall@5 of 0.6. The discussion of the results gave interesting insights into the relation between the news texts and the images used. The best results were obtained by annotating images using CLIP. For correlating text and image, considering only a short text (e.g., the headline) was most successful. This underlines that image and headline are often based on the same key element of the news message. The best results were reported for tweets, indicating that there is a strong overlap between the (relatively) short text and the accompanying image. The translation step typically required for the texts from the RT dataset reduced the recall by about 15%. The teams in the challenge mainly focused on optimizing the similarity models for connecting texts and images as well as on methods for bridging the semantic gap between news texts and images by enriching detected concepts with semantic information.
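To make the Recall@5 figure more concrete, here is a minimal sketch of how text-image matching can be scored. It assumes that text and image embeddings have already been computed by some model (for example CLIP; the embedding step is not shown), ranks the candidate images by cosine similarity, and counts how often the true image appears among the top K. The function names and the toy vectors are illustrative assumptions, not the official evaluation code of the lab.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(text_vec: Sequence[float],
                image_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return image ids sorted from most to least similar to the text vector."""
    return sorted(image_vecs,
                  key=lambda img: cosine(text_vec, image_vecs[img]),
                  reverse=True)


def recall_at_k(rankings: Dict[str, List[str]],
                truth: Dict[str, str],
                k: int = 5) -> float:
    """Fraction of articles whose true image is among the top-k ranked images."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = sum(1 for article, image in truth.items()
               if image in rankings.get(article, [])[:k])
    return hits / len(truth)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings (hypothetical values, for illustration only).
    images = {"i1": [1.0, 0.0], "i2": [0.0, 1.0], "i3": [1.0, 1.0]}
    headlines = {"a": [1.0, 0.1], "b": [0.1, 1.0]}
    truth = {"a": "i1", "b": "i2"}
    rankings = {art: rank_images(vec, images) for art, vec in headlines.items()}
    print(recall_at_k(rankings, truth, k=1))
```

A Recall@5 of 0.6 then means that for 60% of the news texts the originally used image was among the five highest-ranked candidates.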

Details about the implemented methods can be found in the MediaEval proceedings.

Impressions from the LWDA Conference 2022 in Hildesheim

Author: Andreas Lommatzsch

The LWDA conference (an acronym for “Lernen, Wissen, Daten, Analysen” [“Learning, Knowledge, Data, Analytics”]) is a conference organized by the German Computer Science Society. The conference covers recent research in areas such as databases, information systems, knowledge discovery, machine learning, data mining, and knowledge management.

In 2022, the conference was held at the University of Hildesheim, on the Marienburg castle campus.

The presentations and poster sessions gave a valuable overview of current research projects and results, mostly from German universities. The conference provides a good platform for discussing future cooperation and for planning joint activities.

At the conference, I presented a system for generating clarification questions in a chatbot tailored to the needs of Berlin’s administration. The system combines language models, semantic annotations, and clustering techniques to generate counter-questions for ambiguous user inputs. This enables a chatbot to guide users efficiently to the desired information. The details can be found in our

The following photos give an impression of the conference.

Impressions from the 10th INRA workshop at SIGIR 2022

Author: Andreas Lommatzsch

The 10th edition of the International Workshop on News Recommendation and Analytics (INRA 2022) was held on July 15, 2022, as a hybrid SIGIR workshop in Madrid, Spain.
The INRA workshop provides a platform for recent research and discussions in the domain of news and recommender algorithms.
Over the last years, the relevance of news has increased, which has resulted in great interest in analyzing the production, personalization, presentation, and consumption of news.
Due to popular internet portals and social media, news is ubiquitous.
Personalization and context-aware adaptation have also resulted in new challenges such as extended user profiling, filter bubbles, and echo chambers as well as fake news.

These challenges are reflected in this year’s workshop program consisting of research presentations, invited talks and a panel discussion.
The talks covered a broad spectrum of news-related challenges. I would like to highlight three presentations here:
1. Lucas Möller presented his work researching different scoring functions for evaluating neural content-based filtering. The paper shows that optimized non-linear combinations of user-based and item-based criteria improve the recommender performance.
2. An invited presentation explained the relevance of images for the perception of news articles. The talk gave an overview of the results of MediaEval NewsImages 2021 and presented the special challenges in the NewsImages task 2022.
3. Srivas Prasad presented the challenges of news article publishing and filtering from the industry perspective. He discussed the problem of detecting relevant information in social media and delivering reliable, valuable information in real time.

Looking back, the workshop day was full of interesting presentations and good discussions.
The workshop proceedings will be published soon.
Extended papers can also be submitted to the Special Issue on News Personalization and Analytics of User Modeling and User-Adapted Interaction (UMUAI), The Journal of Personalization Research.

Due to the positive feedback we received for the workshop, we plan to continue the workshop series in 2023.

Announcing MediaEval NewsImages 2022

Author: Andreas Lommatzsch

The MediaEval NewsImages challenge has received growing interest in recent years. This is our motivation for continuing the NewsImages challenge in 2022. NewsImages provides a dataset and a forum for researching the relation between texts and images in news. Images in news should attract the users’ attention, guide the perception of users, and highlight specific aspects of news texts. The specific challenge for news publishers is that there are often no photos available for breaking news. So, images used in news articles may come from a large spectrum of sources, including social media, screenshots from TV, stock photos, and recent photos of a relevant person or location.

NewsImages 2022 runs as a lab under the umbrella of MediaEval 2022. NewsImages provides a dataset tailored for learning and evaluating strategies for reassigning news texts and images. In 2022, the dataset consists of three parts: (1) news crawled from an RSS feed focusing on politics, (2) news crawled from Twitter, and (3) a news dataset crawled from a news portal in German.

The NewsImages 2022 workshop will be held in Bergen, Norway in January 2023.
More details about the MediaEval benchmark can be found on the official web page. A detailed description of the methods developed in 2021 can be found in the conference proceedings from 2021.

The MediaEval 2022 website.

Presenting the Chatbot Bobbi at the “Hoffest” at the Red Town Hall in Berlin

Author: Andreas Lommatzsch

The Hoffest (Courtyard Festival) at the Red Town Hall in Berlin is a traditional festival held before the summer vacation. This year, Franziska Giffey, the Governing Mayor of Berlin, welcomed representatives from politics, business, science, diplomacy, culture, media, and sports as well as citizens who have earned merits in the last months.

The DAI-Labor presented the robot Bobbi at the stand of the ITDZ Berlin. Bobbi is a cooperation project with the ITDZ Berlin. In this project we develop and evaluate a chatbot optimized for answering questions related to the services offered by the public administration.
For the Hoffest, Bobbi was trained in the security domain. Based on the new knowledge, Bobbi invited visitors of the Hoffest to play a quiz.
The quiz and the presentation of our robot Bobbi were a great success. Visitors liked to talk with Bobbi and to learn interesting facts from the security domain. The following photos give a visual impression of the event.

Announcing the 10th Intl. Workshop on News Recommendation and Analytics

Author: Andreas Lommatzsch

In our continuously changing world, news plays a major role. In recent years, we have observed a rapidly changing news ecosystem leading to new challenges. The International Workshop on News Recommendation and Analytics (INRA) serves to exchange ideas and discuss recent trends, technological advancements, and open problems concerning news. The workshop provides a forum to discuss recent research, technical and interdisciplinary aspects as well as current trends related to news. Topics of interest include information access systems for news, advances in natural language processing, multi-modality, mis- and disinformation, trust and user experiences, and personalization.

We welcome contributions in the form of scientific articles, demonstrations, and ideas. We strive to bring together researchers, practitioners, and decision-makers to address crucial challenges. The 10th edition of the INRA workshop will be co-located with SIGIR in Madrid, Spain in July 2022.
More details can be found on the INRA webpage:

INRA Workshop, colocated with SIGIR 2022, Madrid, Spain

The DAI-Labor provides several datasets for research purposes

Author: Andreas Lommatzsch

At our lab we actively work on research projects in the domain of

  • Language-models and Chatbots
  • Semantic knowledge processing and knowledge graphs
  • Recommender Systems

In this post, we list the datasets we have been using in the DAI-Lab’s publications. All datasets are publicly available; to obtain them, send an email to corpora(at) or to the author of this post.

Delicious dataset

This dataset contains all public bookmarks of about 950,000 users retrieved from between December 2007 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments that were posted between September 2003 and December 2007. No spam filtering was done! Usernames have been anonymized to protect data privacy. The final dataset is around 7GB in size (compressed).
The full corpus is described and analyzed in:

Analyzing social bookmarking systems: A cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26-30. ECAI 2008, (July 2008).

The Slashdot Zoo

This dataset represents the social network of the technology news web site Slashdot. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements.
An analysis of the dataset was presented at WWW 2009:

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jerome Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009.
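As an illustration of the data structure described above, the following sketch stores signed relationships as (source, target, sign) triples and extracts the connected component that contains a given seed user. It is not the extraction code used for the dataset: the edge format, the function names, and the choice to ignore edge direction and sign when testing connectivity are assumptions made for this example, and all user names except CmdrTaco are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

# (source user, target user, sign): +1 marks a friend, -1 marks a foe.
Edge = Tuple[str, str, int]


def component_of(edges: Iterable[Edge], seed: str) -> Set[str]:
    """Users reachable from seed, treating every edge as undirected and unsigned."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, _sign in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen = {seed}
    queue = deque([seed])
    while queue:  # breadth-first search
        user = queue.popleft()
        for other in neighbours[user]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def restrict(edges: Iterable[Edge], users: Set[str]) -> List[Edge]:
    """Keep only the edges whose two endpoints both belong to users."""
    return [e for e in edges if e[0] in users and e[1] in users]


if __name__ == "__main__":
    edges = [
        ("CmdrTaco", "alice", 1),   # friend
        ("alice", "bob", -1),       # foe
        ("carol", "dave", 1),       # not connected to CmdrTaco
    ]
    core = component_of(edges, "CmdrTaco")
    print(sorted(core))
    print(restrict(edges, core))
```

Keeping the sign on each edge allows positive and negative endorsements to be analyzed separately after the component has been selected.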

Corpus for Internet News Sentiment Analysis

Details of the dataset and the annotation scheme and process are described in the technical report:

Bütow, F., Lommatzsch, A., Ploch, D.: Creation of a German Corpus for Internet News Sentiment Analysis. Project report, Berlin Institute of Technology, AOT (2016)

GerOM: Dataset with sentiment-annotated quotations in German

The dataset consists of sentiment-annotated quotations. It can be used exclusively for academic, non-commercial research.

Details of the dataset and the annotation scheme are described in:

Ploch, D., 2015. Intelligent News Aggregator for German with Sentiment Analysis, in: Hopfgartner, F. (Ed.), Smart Information Systems, Advances in Computer Vision and Pattern Recognition. Springer International Publishing, pp. 5-46.

Twitter Sentiment Dataset

The dataset contains tweets that have been human-annotated with sentiment labels by 3 Mechanical Turk workers.
There are 12,597 tweets in 4 languages: English, German, French, and Portuguese.
The labels annotated are positive, neutral, negative and n/a.

Our new Research Project "SPURT" (Language Models for Generating Polls and the Identification of Topics) has started

Author: Andreas Lommatzsch

Learning about the interests and opinions of readers of news portals is interesting for both readers and authors. Online surveys and polls have proven useful for engaging readers and for learning about the readers’ opinions. The challenge in defining good questions (interesting for most users) lies in the broad spectrum of topics (requiring matching questions) and in finding an adequate question formulation.

Researching which questions attract most users and how readers of online news portals interact with online polls is the objective of the joint project "What-The-Question" between the DAI-Lab at the TU Berlin and Opinary. A special focus lies on applying language models for understanding the context and linguistic aspects of the questions. In addition, methods for the automatic creation of context-aware questions are researched.

The project is supported by the German Federal Ministry for Economic Affairs and Climate Action within the funding program "Central Innovation Programme for small and medium-sized enterprises (SMEs)".


Announcing the 9th INRA workshop, held in conjunction with RecSys 2021.

Author: Andreas Lommatzsch

ACM RecSys is the leading conference in the research area of recommender systems. In 2021, the RecSys conference will be held as a hybrid event. The physical part will take place in the Amsterdam Conference Center (the former Amsterdam Stock Exchange Building). The physical sessions are planned for the mornings; the digital sessions will mostly take place in the afternoon during the "Golden Hours" in order to ensure active participation.

Online news is an important domain for recommender systems. The INRA workshop provides a forum for discussing both current technical and societal aspects of news recommendation and personalization techniques. The 9th International Workshop on News Recommendation and Analytics (INRA 2021), co-located with RecSys, is organized in cooperation between the Norwegian University of Science and Technology (Norway), the Technische Universität Berlin (Germany), and the Penn State University (USA). The workshop submission deadline this year is 6 August 2021. In addition to the workshop, a UMUAI special issue is planned; the authors of the best papers of the INRA workshop will be invited to submit an extended version to this journal.

More details can be found on the INRA workshop webpage.

INRA 2021 Workshop webpage


Announcing MediaEval – News Images 2020

Author: Andreas Lommatzsch

Images play an important role in online news articles and news consumption patterns. The influence of images on the perceived relevance of news items as well as the factors that make images interesting in news are not yet well researched. This has been our motivation for setting up the "News Images" task in the MediaEval Benchmark.

The "News Images" task aims to gain additional insight into the role of images in news articles. The task supplies a large set of articles (including text body and headlines) and the accompanying images. The task requires participants to predict which image was used to accompany each article and also to predict frequently clicked articles on the basis of the accompanying images.

In the challenge, training data describing the news and the interest in the news items (number of views, number of clicks, CTR) is provided. The performance of the developed algorithms is benchmarked based on the data of a subsequent month. The reassignment of images to news items is measured based on the accuracy (percentage of correctly assigned images); the performance of the relevance prediction is measured based on the Precision@N (the top N entries should be predicted correctly).
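The two measures mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the official evaluation script: accuracy is taken as the share of news items that were assigned their original image, and Precision@N is interpreted here as the share of the N predicted items that are also among the N actually most-clicked items. The function names and identifiers are hypothetical.

```python
from typing import Dict, List


def assignment_accuracy(predicted: Dict[str, str], truth: Dict[str, str]) -> float:
    """Share of news items whose predicted image equals the original image."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    correct = sum(1 for item, image in truth.items()
                  if predicted.get(item) == image)
    return correct / len(truth)


def precision_at_n(predicted: List[str], actual: List[str], n: int) -> float:
    """Share of the top-n predicted items that are also in the actual top n.

    Both lists are ordered from most to least clicked (assumed interpretation).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    actual_top = set(actual[:n])
    hits = sum(1 for item in predicted[:n] if item in actual_top)
    return hits / n


if __name__ == "__main__":
    truth = {"n1": "i1", "n2": "i2", "n3": "i3"}
    predicted = {"n1": "i1", "n2": "i3", "n3": "i3"}
    print(assignment_accuracy(predicted, truth))

    predicted_ranking = ["a", "b", "c", "d"]
    actual_ranking = ["b", "a", "x", "y"]
    print(precision_at_n(predicted_ranking, actual_ranking, n=2))
```

With this interpretation, the order inside the top N does not matter; a stricter, position-sensitive variant would be an equally plausible reading of the task description.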

Registration is now open for participants. The MediaEval conference will be held in December 2020. Due to COVID-19, MediaEval will be held fully online this year. Participants thus save the time and money usually needed for traveling, which leaves more time for developing new algorithms, building interesting presentations, and preparing good discussions. More details about the MediaEval benchmark can be found on the official web page.