Classifying short texts
The task is to build a classifier that assigns a short text (press note) to
one of the following five themes: business, entertainment, politics, sport,
tech. Working with the provided dataset of texts, develop a data
representation, a feature extraction method, and a final classifier that
correctly labels new text samples.
The work should be performed in two stages, which can be repeated iteratively:
(1) working out a data representation and a set of features to be used in the
classification, and writing a program that converts the set of texts to a set
of feature vectors, and (2) building and optimizing the classifier.
Popular schemes for representing variable-length texts for the purpose
of automatic classification are BoW (Bag-of-Words), TF-IDF (Term
Frequency-Inverse Document Frequency), and Word2Vec, optionally combined
with an N-gram model (e.g. bigrams).
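To make the first two schemes concrete, the following is a minimal sketch in
plain Python, not a required implementation. It assumes whitespace
tokenization and the textbook weighting tf * log(N / df); the function names
and the three example texts are hypothetical. Library implementations, such as
TfidfVectorizer in scikit-learn, use a smoothed variant, so their values will
differ.

```python
import math
from collections import Counter


def tokenize(text, ngram=1):
    """Lowercase, split on whitespace, and add word n-grams up to `ngram`."""
    words = text.lower().split()
    tokens = list(words)
    for n in range(2, ngram + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens


def bag_of_words(texts, ngram=1):
    """Return (vocabulary, count vectors) for a list of non-empty texts."""
    counts = [Counter(tokenize(t, ngram)) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by tf * log(N / df) (unsmoothed textbook variant)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: in how many texts each term occurs at least once.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    weighted = []
    for v in vectors:
        total = sum(v)
        weighted.append(
            [(v[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


# Hypothetical example texts, not taken from the provided dataset.
texts = [
    "shares rise as profits grow",
    "team wins cup final",
    "profits fall as shares drop",
]
vocab, bow = bag_of_words(texts, ngram=2)  # unigrams plus bigrams
weights = tf_idf(bow)
print(len(vocab), "features per text")
```

A term that occurs in only one text ("team") receives a higher weight than a
term shared by several texts ("shares"), which is the point of the IDF factor.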
Any machine learning algorithm can be used in the construction of the
classifier. It is worth doing a few experiments, starting with some
statistical analysis, and then performing an initial machine learning
experiment using the simplest data representation scheme and the Naive Bayes
classifier. The results of this experiment can then serve as the reference
point for judging whether more advanced techniques bring an improvement.
These can include other text representation methods as well as other machine
learning algorithms, such as decision trees, nearest neighbors, SVM, neural
networks, any ensemble learning approaches, etc.
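The initial experiment described above could look roughly like the sketch
below, which assumes scikit-learn (listed under Useful literature) and
combines a Bag-of-Words representation with the multinomial Naive Bayes
classifier. The ten texts are a hypothetical stand-in for the provided
dataset, so the output only shows the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with the texts and labels of the real dataset.
samples = [
    ("business", "shares and profits rise on the stock market"),
    ("business", "bank reports strong quarterly profits"),
    ("entertainment", "film wins award at the festival"),
    ("entertainment", "singer releases new album and film soundtrack"),
    ("politics", "minister defends election policy in parliament"),
    ("politics", "party leader wins election vote"),
    ("sport", "team wins the cup final match"),
    ("sport", "striker scores twice in league match"),
    ("tech", "new software update for mobile phones"),
    ("tech", "chip maker unveils faster mobile processor"),
]
train_labels = [label for label, _ in samples]
train_texts = [text for _, text in samples]

# Bag-of-Words counts followed by multinomial Naive Bayes.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(train_texts, train_labels)

predicted = baseline.predict(["profits rise at the bank"])
print(predicted[0])
```

Keeping the vectorizer and the classifier in one pipeline means the same
object can later be cross-validated, tuned, and saved as a whole.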
The results of each experiment should be evaluated using appropriate error
measures, including (but not limited to) accuracy computed both on the
training set and by cross-validation. Comparing the two values is the
simplest way to detect overfitting.
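One possible way to obtain both numbers is sketched below, assuming
scikit-learn and a hypothetical ten-text stand-in for the dataset. With only
two texts per class the example uses 2 folds; on the real dataset a larger
number of folds (e.g. 5 or 10) would be the usual choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with the texts and labels of the real dataset.
samples = [
    ("business", "shares and profits rise on the stock market"),
    ("business", "bank reports strong quarterly profits"),
    ("entertainment", "film wins award at the festival"),
    ("entertainment", "singer releases new album and film soundtrack"),
    ("politics", "minister defends election policy in parliament"),
    ("politics", "party leader wins election vote"),
    ("sport", "team wins the cup final match"),
    ("sport", "striker scores twice in league match"),
    ("tech", "new software update for mobile phones"),
    ("tech", "chip maker unveils faster mobile processor"),
]
labels = [label for label, _ in samples]
texts = [text for _, text in samples]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Cross-validation accuracy: each fold is predicted by a model that has not
# seen it, because the whole pipeline is refitted on the remaining folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

# Training accuracy: the model is scored on the data it was fitted on.
train_accuracy = model.fit(texts, labels).score(texts, labels)

print(f"training accuracy:         {train_accuracy:.2f}")
print(f"cross-validation accuracy: {cv_scores.mean():.2f} "
      f"+/- {cv_scores.std():.2f}")
```

A training accuracy far above the cross-validation accuracy suggests that the
model has memorized the training texts rather than learned the themes.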
Optimization can focus on selecting the best machine learning algorithm,
tuning its parameters, or both, and can also include ensemble learning
approaches, i.e. building hybrid classifiers. It is also possible to go back
to the previous step, the representation, and attempt to modify it to
achieve better classification.
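Parameter tuning and a change of representation can be explored in a single
search, as in the sketch below. It assumes scikit-learn and a hypothetical toy
dataset, and the grid (unigrams versus unigrams plus bigrams, three values of
the SVM parameter C) is only an illustration, not a recommended setting.
Ensemble approaches are not covered by this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with the texts and labels of the real dataset.
samples = [
    ("business", "shares and profits rise on the stock market"),
    ("business", "bank reports strong quarterly profits"),
    ("entertainment", "film wins award at the festival"),
    ("entertainment", "singer releases new album and film soundtrack"),
    ("politics", "minister defends election policy in parliament"),
    ("politics", "party leader wins election vote"),
    ("sport", "team wins the cup final match"),
    ("sport", "striker scores twice in league match"),
    ("tech", "new software update for mobile phones"),
    ("tech", "chip maker unveils faster mobile processor"),
]
labels = [label for label, _ in samples]
texts = [text for _, text in samples]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

# Illustrative grid: one representation parameter, one classifier parameter.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print(f"best cross-validation accuracy: {search.best_score_:.2f}")
```

The score reported by the search is a cross-validation accuracy, so it can be
compared directly with the cross-validation result of the initial experiment.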
Deliverables
Please present your work in the form of a report describing all important
steps and the results obtained. Additionally, please prepare a development
package that makes it possible to reproduce the operation of your
classifier.
The report should have the following general structure:
- NO title page, list of contents, figures, etc., only a compact header: project title, class name, author, date
- data representation, preliminary analysis (if any), preprocessing, etc.
- initial classification experiment: details, results (training set, cross-validation)
- optimization attempts, for each experiment:
  * premises, execution details, results, conclusions
- final results summary and conclusions
- list of resources/literature
The main criteria for assessing the report are: brevity, clarity and
readability of the description, as well as precision and completeness.
The subsequent steps of the project should be justified briefly.
Report penalties:
- report unnecessarily long, separate title or table-of-contents pages
- results not clearly, precisely and completely provided for each experiment
- raw results from the program or screenshots pasted instead of a summary
- too much data with no clearly given summary
- excessive precision with no attempt at proper rounding
- no clear summary of the results
The development package should correspond to the best classifier model found
during the project, and:
- be runnable on Linux,
- contain a Readme.txt file describing how to run it, including
required software packages, their versions and how to install them,
- contain a pre-trained model that can classify any given set of test
texts and, when their correct classes are available, compute the most
important statistical quality measures of the classification, like those
presented in the report,
- ADDITIONALLY make it possible to regenerate the classifier from the
original training data (set of texts) and to repeat the classification of a
given test set,
- SHOULD NOT contain the original training data set.
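The core of such a package could be organized as in the sketch below: one
function to regenerate the model, one to classify and (optionally) evaluate,
and a saved model file in between. It assumes scikit-learn and joblib; the
function names, the file name model.joblib, and the toy texts are
hypothetical, and the pipeline shown is a placeholder for the best model found
during the project.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train(texts, labels):
    """Regenerate the classifier from training texts and their labels."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    return model.fit(texts, labels)


def evaluate(model, texts, true_labels=None):
    """Classify texts; add quality measures when true labels are given."""
    predicted = model.predict(texts)
    report = {"predicted": list(predicted)}
    if true_labels is not None:
        report["accuracy"] = accuracy_score(true_labels, predicted)
        report["confusion_matrix"] = confusion_matrix(
            true_labels, predicted, labels=model.classes_
        )
    return report


# Hypothetical toy data standing in for the original training set.
samples = [
    ("business", "shares and profits rise on the stock market"),
    ("business", "bank reports strong quarterly profits"),
    ("entertainment", "film wins award at the festival"),
    ("entertainment", "singer releases new album and film soundtrack"),
    ("politics", "minister defends election policy in parliament"),
    ("politics", "party leader wins election vote"),
    ("sport", "team wins the cup final match"),
    ("sport", "striker scores twice in league match"),
    ("tech", "new software update for mobile phones"),
    ("tech", "chip maker unveils faster mobile processor"),
]
labels = [label for label, _ in samples]
texts = [text for _, text in samples]

# Save the fitted pipeline and load it back, as the package would do
# between the training run and a later classification run.
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "model.joblib")  # hypothetical name
    joblib.dump(train(texts, labels), model_path)
    reloaded = joblib.load(model_path)

test_texts = ["bank profits rise", "team wins league match"]
test_labels = ["business", "sport"]
report = evaluate(reloaded, test_texts, test_labels)
print(f"accuracy: {report['accuracy']:.2f}")
```

A model saved this way generally has to be loaded with the same library
versions that produced it, which is one reason for listing exact package
versions in Readme.txt.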
Useful literature
Natural Language Processing with Python (Steven Bird, Ewan Klein, and Edward Loper)
https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://scikit-learn.org/stable/index.html
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/
https://monkeylearn.com/text-classification-support-vector-machines-svm/
https://medium.com/analytics-vidhya
https://www.youtube.com/watch?v=Zt83JnjD8zg
https://www.youtube.com/watch?v=xvqsFTUsOmc
https://youtu.be/0kPRaYSgblM