Master Thesis: A Machine Learning-based Mechanism for Feature Extraction for CDSA
Digitalization and the expansion of the web have caused a surge in the number of products and services offered online. This, in turn, has caused an exponential increase in the availability of related online reviews and recommendations, which lets companies and organizations collect them quickly and in a paperless manner, and in turn use them for analytical purposes sooner than before.
The vast amount of this online material, however, together with the velocity at which new material is produced and the variety of sources, domains, and even language styles, poses a challenge for the specific Natural Language Processing (NLP) task of sentiment analysis, which aims to automatically classify a review as having positive or negative polarity. Sentiment analysis and NLP tasks in general are considered to be among the most complex tasks because of this inherent heterogeneity. This heterogeneity is the primary challenge addressed by cross-domain sentiment analysis (CDSA), which deals with sentiment analysis on two or more different domains representing different data distributions. In a typical CDSA setting involving two domains, a model trained on a source domain, which has many labeled samples, is used to classify samples from a target domain, for which only a few or no labeled samples are available. CDSA is therefore exposed to the heterogeneity challenge even more than traditional in-domain sentiment analysis is.
The different data distributions and the different semantics behind the same words in different domains stress the importance of effective feature extraction in CDSA. It is also generally known that, for an algorithm or model to produce high-quality output, high-quality input has to be used or produced, following the motto "Garbage in, garbage out". In response, this work proposes a machine learning framework involving a novel deep learning model that combines proven machine learning and deep learning concepts, adapted for use in our model, to extract meaningful features for CDSA. Specifically, the proposed model incorporates a Siamese LSTM architecture, adapted from its original use in similarity classification, and stacked bidirectional LSTM autoencoders for feature extraction. It uses active learning (AL) to select a small number of samples from the target domain to be included in the semi-supervised training process, in order to produce features that are adapted to the data distribution of the target domain and, in turn, lead to better classification results on the remaining target domain samples.
Experimental results indicate that using active learning in a non-traditional way for sample selection with our current model settings leads to better accuracies than the traditional usage, and can lead to better accuracies than a strong baseline that uses the same model with randomly selected samples. In its current state, our proposed model enables fair to good polarity classification of target domain samples, achieving an accuracy of up to 74% based on domain-invariant features. This result does not outperform the current state-of-the-art work in this field, but shows potential for further improvement through more extensive experimentation with different settings.
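For readers unfamiliar with active learning, the sketch below illustrates the sample selection step in its traditional form, uncertainty sampling: the target domain samples about which the current model is least certain are picked for labeling. The abstract does not state which query strategy the thesis uses, nor what its non-traditional variant looks like, so this is only an assumed illustration. The function names, the entropy criterion, and the probability values are all hypothetical.

```python
import math
from typing import List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a binary prediction with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_for_labeling(pos_probs: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` most uncertain samples.

    Assumption: uncertainty is measured by prediction entropy, which is
    the classic uncertainty-sampling criterion, not necessarily the
    criterion used in the thesis.
    """
    ranked = sorted(
        range(len(pos_probs)),
        key=lambda i: binary_entropy(pos_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical P(positive) for five unlabeled target-domain reviews.
target_probs = [0.97, 0.52, 0.10, 0.45, 0.80]
chosen = select_for_labeling(target_probs, budget=2)
print(chosen)  # [1, 3]: the two predictions closest to 0.5
```

In a setting like the one described above, the selected samples would be labeled and added to the training data, and the remaining target domain samples would be used for evaluation.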
Supervisor: Katerina Katsarou
Type: Master Thesis
Duration: 6 months