Abstract:
|
The beginning of the new millennium was marked by the rapid development
of social networks, cloud-based internet technologies and applications of artificial
intelligence tools on the web. Extremely rapid growth in the number of articles on
the Internet (blogs, e-commerce websites, forums, discussion groups, short-messaging
systems, social networks and news portals) has increased the need for methods of
rapid, comprehensive and accurate text analysis. Accordingly, the remarkable
development of language technologies has enabled their application to
document classification,
document clustering, information retrieval, word sense disambiguation, text
extraction, machine translation, computer speech recognition, natural language
generation, sentiment analysis, etc. In computational linguistics, several
names are in use for the area concerned with processing emotions in text: sentiment
classification, opinion mining, sentiment analysis and sentiment extraction. By
its nature and the methods it uses, sentiment analysis belongs to the branch
of computational linguistics that deals with text classification. The analysis
of emotions generally involves three kinds of text classification:
• identification of subjectivity (opinion classification or subjectivity
identification) used to divide texts into those that carry emotional content and
those that only have factual content
• sentiment classification (polarity identification) of texts that carry emotional
content into those with positive and those with negative emotional content
• determining the strength or intensity of emotional polarity (strength of
orientation).
In terms of the level at which sentiment is analysed, there are three
approaches: analysis at the document level, at the sentence level and at the
attribute level. Standard text classification usually relies on machine
learning methods or rule-based techniques. Sentiment analysis, as a specific type
of document classification, also uses these methods.
This doctoral thesis, whose main task is the analysis of emotions in text,
presents research on the sentiment classification of texts in the Serbian
language using multinomial logistic regression, a probabilistic machine learning
method also known as the maximum entropy method. The aim of this research is to create the
first comprehensive, flexible, modular system for sentiment analysis of Serbian-language
texts, with the help of digital resources such as semantic networks,
specialized lexicons and domain ontologies. This research is divided into two
phases. The first phase concerns the development of methods and tools for
detecting the sentiment polarity of the literal meaning of a text. In this part of the work,
a new method of reducing the feature vector space for sentiment classification is
proposed, implemented and evaluated. The proposed reduction method is
applied within the maximum entropy classification model and relies on the
lexical-semantic network WordNet and a specialized sentiment lexicon. The
proposed method consists of two successive processes. The first process
expands the feature vector space with the inflectional forms of features. The
study shows that stemming, a standard method of reducing the feature vector
space in text classification, can lead to incomplete or incorrect sentiment-polarity
labelling of features in sentiment analysis, and that introducing inflectional feature
forms avoids this problem. The study also shows that the feature vector space,
enlarged by the introduction of inflectional forms, can be successfully reduced by
the second proposed process: semantic mapping of all predictors with the same
sentiment polarity into a small number of semantic classes. In this way, the feature
vector space becomes smaller than the initial one while retaining semantic precision.
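The two successive processes described above (expansion by inflectional forms, then semantic mapping into sentiment classes) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the toy lexicon, the listed inflectional forms, the class labels `POS_CLASS` and `NEG_CLASS` and the function names are assumptions made for illustration and are not taken from SAFOS, WordNet or the sentiment lexicon used in the thesis.

```python
from collections import Counter

# Hypothetical toy sentiment lexicon: lemma -> semantic (polarity) class.
# Illustrative entries only; not the resources used in the thesis.
SENTIMENT_LEXICON = {"dobar": "POS_CLASS", "lep": "POS_CLASS", "loš": "NEG_CLASS"}

# Hypothetical inflectional forms per lemma (process 1: expansion).
INFLECTIONS = {
    "dobar": ["dobar", "dobra", "dobro", "dobri"],
    "lep": ["lep", "lepa", "lepo"],
    "loš": ["loš", "loša", "loše"],
}

def build_form_to_class(lexicon, inflections):
    """Expand each lemma to its inflected forms and map each form to the lemma's class."""
    mapping = {}
    for lemma, cls in lexicon.items():
        for form in inflections.get(lemma, [lemma]):
            mapping[form] = cls
    return mapping

def extract_features(tokens, form_to_class):
    """Process 2: replace sentiment-bearing tokens with their class; keep other tokens."""
    return Counter(form_to_class.get(t.lower(), t.lower()) for t in tokens)

form_to_class = build_form_to_class(SENTIMENT_LEXICON, INFLECTIONS)
features = extract_features("Film je dobar i glumica je lepa".split(), form_to_class)
print(features)
```

In this sketch the two surface forms "dobar" and "lepa" collapse into the single predictor `POS_CLASS`, so the number of distinct features shrinks without stemming the words themselves.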
The second phase of the dissertation describes the design and
implementation of formal ontologies of rhetorical figures in the Serbian language:
the domain ontology and the task ontology. The use of the task ontology for generating
features that represent figurative speech is presented. The aim of the
second phase is to recognize figurative speech in order to improve the
existing set of predictors generated in the first phase of the research. The
results of this phase show that some classes of figures of speech can be recognized
automatically.
In the course of work on this dissertation, the software tool SAFOS
(Sentiment Analysis Framework for Serbian), an integrated system for
sentiment classification of text in the Serbian language, was developed,
implemented and statistically evaluated. The results of the research within the scope
of this thesis are presented in the papers (Mladenović & Mitrović, 2013; Mladenović &
Mitrović, 2014; Mladenović, Mitrović & Krstev, 2014; Mladenović, Mitrović, Krstev
& Vitas, 2015; Mladenović, Mitrović & Krstev, 2016).
The dissertation consists of seven chapters with the following structure.
Chapter 1 introduces and defines methods, resources and concepts used in the first
phase of research: text classification, sentiment classification, machine learning,
supervised machine learning, probabilistic supervised machine learning, and
language models. At the end of the introductory chapter, the tasks and objectives of
the research are defined. Chapter 2 presents a mathematical model of text
classification and sentiment classification methods. A mathematical
model of probabilistic classification and its application in
regression models are presented. At the end of the chapter it is
shown that the maximum entropy method, as one
of the regression models, has been successfully applied to natural language
processing tasks. Chapter 3 presents the lexical resources of the Serbian language
and the methods and tools for processing them. Chapter 4 gives a
comprehensive survey of the currently available types and methods of
sentiment classification. It reviews current work and research in sentiment
classification of texts. It also presents a comparative overview of research in
sentiment classification of texts using the maximum entropy method. Chapter 5
discusses the contribution of this thesis to methods of feature space reduction for
maximum entropy classification. First, a feature space reduction method is
analysed. A new feature space reduction method which improves sentiment
classification is proposed. A mathematical model incorporating the proposed method is
defined. The training and test sets and the lexical-semantic resources used in
the proposed method are introduced. Chapter 5 also describes the building and
evaluation of a system for sentiment classification, SAFOS, which applies and
evaluates the proposed method of feature vector space reduction. The
parameters and the functions of SAFOS are defined. The measures used to evaluate
the system are also discussed: precision, recall, F1-measure and accuracy. A
description of the method for assessing the statistical significance of the system's results is
given, and the implementation of the statistical test in SAFOS is discussed.
The chapter provides an overview of the presented experiments, results and
evaluation of the system. Chapter 6 deals with methods of recognizing figurative
speech which can improve sentiment classification. The notion of domain ontology
is introduced, together with the role of rhetorical figures and the domain ontology of
rhetorical figures. The importance of figurative speech in sentiment classification is
explored. The construction and structure of the first
domain ontology of rhetorical figures in the Serbian language, RetFig.owl, are described,
as are the construction and structure of the corresponding task
ontology, which contains rules for identifying some classes of rhetorical figures.
At the end of this chapter, an overview of the performed experiments,
results and evaluation of the SAFOS system plugin that improved the recognition of
figurative speech is given.
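The evaluation measures named in the description of Chapter 5 (precision, recall, F1-measure and accuracy) have standard definitions. A minimal Python sketch for a single target class follows. The function name `evaluate`, the label strings and the sample data are assumptions made for illustration; this is not the SAFOS implementation.

```python
def evaluate(gold, predicted, positive="pos"):
    """Return precision, recall, F1 and accuracy with `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    correct = sum(g == p for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical gold labels and classifier output for five texts.
gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
scores = evaluate(gold, pred)
print(scores)
```

Precision and recall are computed per class, so a two-class sentiment system would typically report them once with the positive class as target and once with the negative class.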
The final chapter of this study deals with the achievements, problems and
shortcomings of the SAFOS system. The conclusion of this thesis points to the
great technological, social, educational and scientific importance of sentiment
analysis and the recognition of figurative speech, and outlines directions for the further
development of the SAFOS system. |