TU Berlin

Service-centric Networking

Lawrenz, L. (2020). On the Impact of Keyword Extraction in Word Mover's Distance-based Similarity Comparison. Bachelor Thesis, Technische Universität Berlin.


Bachelor Thesis: On the Impact of Keyword Extraction in Word Mover's Distance-based Similarity Comparison

Title: On the Impact of Keyword Extraction in Word Mover's Distance-based Similarity Comparison

Description:

This thesis builds on the existing affinity system by Eichinger et al., which compares users for similarity on the basis of their texting data. At its core, their proposed similarity comparison uses the Word Mover’s Distance (WMD). Each user extracts a set of words that best describe their way of texting and converts them into real-valued vectors by means of word embeddings. The WMD then calculates the distance between the two users’ sets of vectors. As the WMD scales supercubically in the number of words used for similarity comparison, it is pivotal to reduce the number of words to a user’s most descriptive words. We use keyword extraction to determine these most descriptive words. We analyze the impact of three distinct keyword extraction methods: tf-idf, RAKE, and YAKE!. We evaluate their performance on a binary classification task on two data sets: first, a Twitter data set in which we classify US politicians by political affiliation as either Republican or Democratic; second, a newsgroup data set in which we compare posts in the categories atheism and space. We show that RAKE is by far the worst keyword extraction method. tf-idf and YAKE! are on a par with respect to median performance. We show that YAKE! is more robust than tf-idf. We conclude that YAKE! is to be preferred in scenarios in which a corpus of documents is not given, and tf-idf otherwise.

Supervisor: Tobias Eichinger

Type: Bachelor Thesis

Duration: 4 months
