Classifying short texts

The task is to build a classifier that assigns a short text (a press note) to one of the following five thematic groups: business, entertainment, politics, sport, tech. Working with the provided dataset of texts, develop a representation, a feature extraction method, and a final classifier that correctly labels new text samples.

The work should be performed in two stages, which can be repeated iteratively: (1) working out a data representation and a set of features to be used in the classification, and writing a program to convert the set of texts to a set of feature vectors, and (2) building and optimizing the classifier.

Popular schemes for representing variable-length texts for the purpose of automatic classification include BoW (Bag-of-Words), TF-IDF (Term Frequency-Inverse Document Frequency), and Word2Vec, optionally combined with an N-gram model (e.g. bigrams).
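
As an illustration of these schemes, the following minimal sketch builds a Bag-of-Words and a TF-IDF representation (the latter with bigrams) using scikit-learn. The three sample sentences are invented for illustration and are not taken from the project dataset. Word2Vec is left out because it needs an additional library.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for illustration; NOT the project dataset.
texts = [
    "shares rise as bank profits grow",
    "team wins the cup final",
    "new phone chip boosts speed",
]

# Bag-of-Words: one column per vocabulary word, values are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)

# TF-IDF over unigrams and bigrams: ngram_range=(1, 2) adds word pairs
# such as "cup final" as extra features next to the single words.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

# Both results are sparse matrices with one row per text.
print("BoW shape:   ", X_bow.shape)
print("TF-IDF shape:", X_tfidf.shape)
print("Some features:", sorted(tfidf.vocabulary_)[:5])
```

Adding bigrams enlarges the feature space considerably, so whether they help is worth checking experimentally rather than assuming.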

Any machine learning algorithm can be used to construct the classifier. It is worth doing a few experiments, starting with some statistical analysis, and then performing an initial machine learning experiment using the simplest data representation scheme and the Naive Bayes classifier. The results of this experiment can then serve as a reference for judging whether more advanced techniques actually improve the classification. These can include other text representation methods as well as other machine learning algorithms, such as decision trees, nearest neighbors, SVM, neural networks, any ensemble learning approaches, etc.
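
A possible shape for the initial experiment is sketched below: Bag-of-Words counts fed into a multinomial Naive Bayes classifier, chained in a scikit-learn pipeline. The ten training sentences and the test sentence are made up for this example; in the project they would be replaced by the provided dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data invented for illustration; NOT the project dataset.
train_texts = [
    "bank profits rise as shares climb",
    "oil prices push company profits higher",
    "new film tops the box office",
    "singer wins award for best album",
    "minister defends new election law",
    "parliament votes on the budget bill",
    "striker scores twice in cup final",
    "champion wins the tennis match",
    "new phone uses faster chip",
    "software update fixes security flaw",
]
train_labels = [
    "business", "business",
    "entertainment", "entertainment",
    "politics", "politics",
    "sport", "sport",
    "tech", "tech",
]

# Baseline: simplest representation (word counts) + Naive Bayes.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(train_texts, train_labels)

# Classify a new, unseen text.
predicted = baseline.predict(["bank shares and company profits"])
print(predicted[0])
```

Keeping the vectorizer and the classifier in one pipeline means the same preprocessing is applied automatically at prediction time.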

The results of each experiment should be evaluated using appropriate error measures, including (but not limited to) accuracy computed both on the training set and by cross-validation. Comparing these two values is the simplest way to detect overfitting.
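
The comparison of training accuracy with cross-validated accuracy could be computed as in the sketch below. The toy data (three invented sentences per class) and the choice of 3 folds are assumptions made only to keep the example small; a real experiment would use the full dataset and typically more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; NOT the project dataset.
data = {
    "business": ["bank profits rise as shares climb",
                 "oil prices push company profits higher",
                 "retail sales lift company shares"],
    "entertainment": ["new film tops the box office",
                      "singer wins award for best album",
                      "actor joins cast of new film"],
    "politics": ["minister defends new election law",
                 "parliament votes on the budget bill",
                 "party leader wins election vote"],
    "sport": ["striker scores twice in cup final",
              "champion wins the tennis match",
              "coach praises team after cup match"],
    "tech": ["new phone uses faster chip",
             "software update fixes security flaw",
             "chip maker unveils new software"],
}
texts = [t for docs in data.values() for t in docs]
labels = [label for label, docs in data.items() for _ in docs]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Accuracy on the training set: the model has seen these texts.
train_acc = model.fit(texts, labels).score(texts, labels)

# Cross-validated accuracy: each fold is scored on texts held out from training.
cv_scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")

print(f"training accuracy:         {train_acc:.2f}")
print(f"cross-validation accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
```

A training accuracy much higher than the cross-validation accuracy suggests overfitting. Note that the values are printed rounded to two decimals, in line with the report requirements.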

Optimization can focus on selecting the best machine learning algorithm, on tuning its parameters, or on both, as well as on trying ensemble learning approaches by building hybrid classifiers. It is also possible to go back to the previous step (representation) and attempt to modify it to achieve better classification.
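
One way to tune the representation and the classifier together is a cross-validated grid search, sketched below with a TF-IDF + linear SVM pipeline. The classifier, the parameter grid, and the toy data are all illustrative assumptions, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration; NOT the project dataset.
data = {
    "business": ["bank profits rise as shares climb",
                 "oil prices push company profits higher",
                 "retail sales lift company shares"],
    "entertainment": ["new film tops the box office",
                      "singer wins award for best album",
                      "actor joins cast of new film"],
    "politics": ["minister defends new election law",
                 "parliament votes on the budget bill",
                 "party leader wins election vote"],
    "sport": ["striker scores twice in cup final",
              "champion wins the tennis match",
              "coach praises team after cup match"],
    "tech": ["new phone uses faster chip",
             "software update fixes security flaw",
             "chip maker unveils new software"],
}
texts = [t for docs in data.values() for t in docs]
labels = [label for label, docs in data.items() for _ in docs]

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as "<step>__<parameter>".
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. uni+bigrams
    "linearsvc__C": [0.1, 1, 10],                      # regularization strength
}

# Every combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("best parameters: ", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```

Because the vectorizer is part of the pipeline, the search covers the representation step as well as the classifier, which matches the option of going back and modifying the representation.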

Deliverables

Please present your work in the form of a report describing all important steps and the results obtained. Additionally, please prepare a development package that allows the operation of your classifier to be reproduced.

The report should have the following general structure:

NO title page, table of contents, list of figures, etc., only a compact header: project title, class name, author, date
data representation, preliminary analysis (if any), preprocessing, etc.
initial classification experiment: details, results (training set, cross-validation)
optimization attempts, for each experiment:
- premises, execution details, results, conclusions
final results summary and conclusions
complete list of resources/literature

The main criteria for assessing the report are: brevity, clarity and readability of the description, as well as precision and completeness. The subsequent steps of the project should be justified briefly.

Report penalties:

report unnecessarily long, separate title or table of contents pages
results not clearly, precisely and completely provided for each experiment
raw results from the program or screenshots pasted instead of a summary
too much data with no clearly given summary
too much precision with no attempt at proper rounding
no clear summary of the results
important resources used in the project not listed in the report !!!

The development package should correspond to the best classifier model found during the project, and:

be runnable on Linux,
contain a minimal Readme.txt file describing how to run it, including the required software packages, their versions, and how to install them,
the script/instructions in Readme.txt should demonstrate how to run the pre-trained classifier to classify samples, not how to run the full training from data,
contain a pre-trained model that can classify any given set of test texts and, when their correct classes are available, compute the most important statistical quality measures of the classification, like those presented in the report,
ADDITIONALLY allow the classifier to be regenerated from the original training data (set of texts) and the classification of a given test set to be repeated,
SHOULD NOT contain the original training data set,
the requirements.txt file should list the required Python packages with no version numbers, unless a specific version number is required.
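
The requirement for a pre-trained model could be met, for example, by saving the fitted pipeline with joblib and loading it in the classification script. The sketch below assumes a scikit-learn pipeline; the file name model.joblib, the temporary directory, and all texts are hypothetical and serve only to make the example runnable.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data invented for illustration; NOT the project dataset.
train_texts = [
    "bank profits rise as shares climb",
    "oil prices push company profits higher",
    "new film tops the box office",
    "singer wins award for best album",
    "minister defends new election law",
    "parliament votes on the budget bill",
    "striker scores twice in cup final",
    "champion wins the tennis match",
    "new phone uses faster chip",
    "software update fixes security flaw",
]
train_labels = ["business"] * 2 + ["entertainment"] * 2 + ["politics"] * 2 \
    + ["sport"] * 2 + ["tech"] * 2

# Training step: fit once and store the whole pipeline in one file.
model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(train_texts, train_labels)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # hypothetical file name
joblib.dump(model, model_path)

# Classification step: load the stored model, no training data needed.
loaded = joblib.load(model_path)
test_texts = ["company shares climb", "team wins cup match"]   # invented test samples
test_labels = ["business", "sport"]                            # known only for evaluation
predictions = loaded.predict(test_texts)

# When the correct classes are available, report quality measures.
print(f"accuracy: {accuracy_score(test_labels, predictions):.2f}")
print(classification_report(test_labels, predictions, zero_division=0))
```

A model stored this way is generally only safe to load with the same scikit-learn version that created it, which is one case where pinning a version number in requirements.txt is justified.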

Useful literature

Natural Language Processing with Python (Steven Bird, Ewan Klein, and Edward Loper)

https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

https://scikit-learn.org/stable/index.html

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/

https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/

https://monkeylearn.com/text-classification-support-vector-machines-svm/

https://medium.com/analytics-vidhya

https://www.youtube.com/watch?v=Zt83JnjD8zg

https://www.youtube.com/watch?v=xvqsFTUsOmc

https://youtu.be/0kPRaYSgblM