Clustering image files

The task is to cluster the images of a provided dataset using selected algorithms. The 1000 images belong to 10 classes and are stored class after class, and the goal is to reconstruct the classes as closely as possible. Because each consecutive group of 100 image numbers corresponds to one class, the image numbers can be used to compute the correctness of the results.
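The mapping from image number to reference class can be sketched as below. This is only an illustration: the file names, the assumption that the number is the last run of digits in the name, and the `first_number` parameter (whether numbering starts at 0 or 1) are hypothetical and must be checked against the actual dataset.

```python
import re
from pathlib import Path


def image_number(filename):
    """Return the last run of digits in the file name as an integer."""
    digits = re.findall(r"\d+", Path(filename).stem)
    if not digits:
        raise ValueError(f"no image number in {filename!r}")
    return int(digits[-1])


def true_class(number, per_class=100, first_number=0):
    """Map an image number to its class index (0..9 for 1000 images)."""
    return (number - first_number) // per_class


# Hypothetical file names, assuming numbering starts at 0.
names = ["0.jpg", "99.jpg", "100.jpg", "999.jpg"]
labels = [true_class(image_number(n)) for n in names]
print(labels)  # [0, 0, 1, 9]
```

If the numbering turns out to start at 1, passing `first_number=1` keeps images 1 to 100 in class 0.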

The clustering algorithm can be one of those covered in class (k-means, EM (Expectation Maximization), hierarchical clustering, DBSCAN) or another algorithm found in the literature or available in a machine learning library. The number of clusters can be fixed in advance to simplify the first approach, but experiments with automatic determination of the number of clusters should also be performed.
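Both modes, a fixed number of clusters and an automatically chosen one, can be sketched with scikit-learn as follows. The synthetic blobs stand in for real image feature vectors, and selecting k by the highest silhouette score is just one possible criterion, assumed here for illustration; the helper names `cluster_fixed_k` and `choose_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for image feature vectors: 4 well-separated groups.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5,
                  random_state=0)


def cluster_fixed_k(X, k):
    """Run k-means with a predetermined number of clusters."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)


def choose_k(X, candidates=range(2, 9)):
    """Pick the candidate k with the highest silhouette score."""
    scores = {k: silhouette_score(X, cluster_fixed_k(X, k))
              for k in candidates}
    return max(scores, key=scores.get), scores


best_k, scores = choose_k(X)
labels = cluster_fixed_k(X, best_k)
print("chosen number of clusters:", best_k)
```

Algorithms such as DBSCAN determine the number of clusters themselves, so they need a different kind of tuning (neighbourhood radius and minimum samples) rather than a search over k.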

The most important part of the project is the initial processing of the images, which are of different sizes. An autoencoder can be used as an initial feature extraction technique. It can be combined with image filters, such as greyscale conversion or an edge detection filter (for example Sobel or Canny), which simplify the images in the hope of extracting more useful features. One very useful approach to feature extraction from images is the Histogram of Oriented Gradients (HOG) method. Any of these approaches can additionally be augmented with dimension reduction techniques, such as PCA.
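One possible preprocessing pipeline (greyscale conversion, resizing to a common size, HOG, then PCA) might look like the sketch below, using scikit-image and scikit-learn. The target size of 64x64, the HOG parameters, the number of PCA components and the random test images are all arbitrary assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA


def hog_features(image, size=(64, 64)):
    """Greyscale, resize to a common size, and compute a HOG descriptor."""
    if image.ndim == 3:
        image = rgb2gray(image[..., :3])  # drop an alpha channel if present
    image = resize(image, size, anti_aliasing=True)
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))


# Random arrays of different sizes stand in for the real images.
rng = np.random.default_rng(0)
shapes = [(80, 120, 3), (100, 100, 3), (64, 96)] * 4
images = [rng.random(shape) for shape in shapes]

# Every image yields a descriptor of the same length after resizing.
features = np.array([hog_features(img) for img in images])

# Reduce the descriptors to a handful of principal components.
reduced = PCA(n_components=5).fit_transform(features)
print(features.shape, reduced.shape)
```

Resizing before HOG is what makes the descriptors comparable across images of different sizes; the reduced matrix can then be passed to any clustering algorithm.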

Students are encouraged to draw on scientific papers, tutorials, and textbooks, provided they are properly acknowledged and cited in the report. Any programming environment can be used. The final report should ideally compare at least two different approaches, for example the first successful clustering experiment and the best one. Please evaluate the results against the original 10 classes, reporting at least precision and recall.
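Since cluster identifiers are arbitrary, precision and recall can only be computed after clusters are matched to the original classes. The sketch below assumes a simple majority-vote matching (each cluster is assigned the class most frequent among its members); this is one convention among several, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_classes(true_labels, cluster_labels):
    """Assign each cluster the true class that is most frequent inside it."""
    members = defaultdict(Counter)
    for t, c in zip(true_labels, cluster_labels):
        members[c][t] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def precision_recall(true_labels, cluster_labels):
    """Return {class: (precision, recall)} under majority-vote matching."""
    mapping = map_clusters_to_classes(true_labels, cluster_labels)
    predicted = [mapping[c] for c in cluster_labels]
    pairs = list(zip(true_labels, predicted))
    scores = {}
    for cls in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (precision, recall)
    return scores


# Tiny example: two classes, two clusters with arbitrary ids 7 and 3.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [7, 7, 3, 3, 3, 3]
scores = precision_recall(true_labels, cluster_labels)
for cls, (p, r) in scores.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f}")
```

With majority voting, several clusters may map to the same class and some classes may receive no cluster at all, in which case their precision and recall are reported as 0. A one-to-one matching is an alternative if that behaviour is not wanted.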

Deliverables

As before, the outcome of the project should be submitted in two parts: a report and a development package.

The report should include:

Report penalties:

The development package should:

Please write your scripts in a non-interactive way. The input files to cluster should be assumed to be in the Cluster_img subdirectory of the current directory (or, optionally, in a directory specified as a command-line argument to the clustering program).
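A non-interactive entry point that follows this convention could be sketched as below. The argument name `image_dir`, the set of accepted file suffixes and the return codes are assumptions made for the example; only the default directory `Cluster_img` comes from the requirement above.

```python
import argparse
import sys
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image formats


def parse_args(argv=None):
    """Parse the optional directory argument; default is ./Cluster_img."""
    parser = argparse.ArgumentParser(
        description="Cluster the images found in a directory.")
    parser.add_argument("image_dir", nargs="?", type=Path,
                        default=Path("Cluster_img"),
                        help="directory with input images")
    return parser.parse_args(argv)


def list_images(image_dir):
    """Return the image files in the directory, sorted by name."""
    return sorted(p for p in Path(image_dir).iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES)


def main(argv=None):
    """Run without prompting the user; return a process exit code."""
    args = parse_args(argv)
    if not args.image_dir.is_dir():
        print(f"error: {args.image_dir} is not a directory", file=sys.stderr)
        return 1
    images = list_images(args.image_dir)
    print(f"found {len(images)} images in {args.image_dir}")
    # Feature extraction, clustering and result output would go here.
    return 0


# Demonstration with an explicit argument list instead of sys.argv.
print(parse_args([]).image_dir)        # Cluster_img
print(parse_args(["data"]).image_dir)  # data
```

In an actual script, the demonstration lines would be replaced by the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`, so that the program reads its directory from the command line and never waits for keyboard input.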

Useful literature

https://scikit-image.org/

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering.html

https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html

https://docs.opencv.org/master/d9/df8/tutorial_root.html

https://www.learnopencv.com/histogram-of-oriented-gradients/

https://opencv-python-tutroals.readthedocs.io/en/latest/index.html

https://www.analyticsvidhya.com/blog/2019/08/3-techniques-extract-features-from-image-data-machine-learning-python/

https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/

https://towardsdatascience.com/image-clustering-using-k-means-4a78478d2b83

https://towardsdatascience.com/image-clustering-using-transfer-learning-df5862779571

https://towardsdatascience.com/introduction-to-image-segmentation-with-k-means-clustering-83fd0a9e2fc3

https://franky07724-57962.medium.com/using-keras-pre-trained-models-for-feature-extraction-in-image-clustering-a142c6cdf5b1

http://www.adeveloperdiary.com/data-science/computer-vision/how-to-implement-sobel-edge-detection-using-python-from-scratch/

https://machinelearningmastery.com/use-pre-trained-vgg-model-classify-objects-photographs/

https://en.wikipedia.org/wiki/Otsu%27s_method

https://kapernikov.com/tutorial-image-classification-with-scikit-learn/