Bachelor Thesis: On the Impact of Keyword Extraction in Word Mover's Distance-based Similarity Comparison

This thesis builds on the existing affinity system by Eichinger et al., which compares users for similarity on the basis of their texting data. At its core, their proposed similarity comparison uses the Word Mover's Distance (WMD). Each user extracts a set of words that best describe their way of texting and converts them into real-valued vectors by means of word embeddings. The WMD then calculates the distance between the two users' sets of vectors. As the WMD scales supercubically in the number of words used for similarity comparison, it is pivotal to reduce the number of words to a user's most descriptive words. We use keyword extraction to determine these most descriptive words. We analyze the impact of three distinct keyword extraction methods (tf-idf, RAKE, and YAKE!) and evaluate their performance on a binary classification task on two data sets. The first is a Twitter data set, in which we classify the political affiliation of US politicians as either Republican or Democratic. The second is a newsgroup data set, in which we distinguish posts in the category atheism from posts in the category space. We show that RAKE is by far the worst of the three keyword extraction methods. tf-idf and YAKE! are on a par with respect to median performance, but YAKE! is more robust than tf-idf. We conclude that YAKE! is to be preferred in scenarios in which a corpus of documents is not given, and tf-idf otherwise.
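The pipeline described in the abstract (extract keywords, embed them, compare the resulting vector sets) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis or from the affinity system: the function names `tfidf_keywords` and `relaxed_wmd`, the toy documents, and the two-dimensional `toy_embedding` are all hypothetical. To stay dependency-free, the sketch does not compute the exact WMD. It computes the relaxed WMD, a cheaper lower bound in which every word moves all of its mass to its nearest counterpart, with uniform weights over the keywords.

```python
import math
from collections import Counter


def tfidf_keywords(docs, k):
    """Return the k highest-scoring tf-idf terms of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:k])
    return keywords


def relaxed_wmd(vecs_a, vecs_b):
    """Relaxed WMD (a lower bound on the WMD) with uniform word weights."""
    def one_sided(src, dst):
        # Each source vector sends all of its mass to the nearest target vector.
        return sum(min(math.dist(u, v) for v in dst) for u in src) / len(src)

    return max(one_sided(vecs_a, vecs_b), one_sided(vecs_b, vecs_a))


# Hypothetical toy data: one tokenized "document" per user.
docs = [
    "tax tax senate vote the".split(),
    "rocket rocket orbit launch the".split(),
    "senate orbit the".split(),
]

# Hypothetical 2-D embedding; a real system would use pretrained word vectors.
toy_embedding = {
    "tax": (1.0, 0.0), "vote": (0.9, 0.1),
    "rocket": (0.0, 1.0), "launch": (0.1, 0.9),
}

keywords = tfidf_keywords(docs, k=2)
user_a = [toy_embedding[w] for w in keywords[0]]
user_b = [toy_embedding[w] for w in keywords[1]]
print(keywords[0], keywords[1], relaxed_wmd(user_a, user_b))
```

Note that tf-idf needs the whole corpus to compute document frequencies, which is why the abstract recommends a corpus-free method such as YAKE! when no corpus is available. Capping the keyword count `k` is what keeps the distance computation small.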

Supervisor: Tobias Eichinger

Type: Bachelor Thesis

Duration: 4 months

TU Berlin - Service-centric Networking - TEL 19
Ernst-Reuter-Platz 7
10587 Berlin, Germany
Phone: +49 30 8353 58811
Fax: +49 30 8353 58409