Sportistician

The delightful thing about Data Science is the novelty of it all. Classical statistical methods are time-tested, but not very sexy. For me, the thrill of discovering something new lies in sharing it with others. As such, I thought it useful to write up a discovery from a recent project. The Aarhus University Signal Processing group, in collaboration with the University of Southern Denmark, has provided a dataset of images of unique plants belonging to 12 different species. I was tasked with building a Convolutional Neural Network model that would classify the plant seedlings into their respective 12 categories:

I did some research, and the literature suggested that VGG19 does well on plant classification problems. Using the Adam optimizer and categorical cross-entropy loss, the resulting model reached 60% accuracy on the training set but only 45% accuracy on the test data, even after applying Gaussian blurring to regularize the provided plant images. This was clearly not good enough.
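For readers curious about what the Gaussian blurring step does to an image, here is a minimal pure-Python sketch. It is not the code from this project: the function names, the 3x3 kernel size, and the sigma value are all assumptions chosen for illustration, and a real pipeline would normally call an image library (for example OpenCV's `cv2.GaussianBlur`) rather than loop over pixels by hand.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    # Normalize so the weights sum to 1 and overall brightness is preserved.
    return [[v / total for v in row] for row in kernel]


def gaussian_blur(image, size=3, sigma=1.0):
    """Blur a 2D grayscale image (list of rows), replicating edge pixels."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            acc = 0.0
            for kr in range(size):
                for kc in range(size):
                    # Clamp coordinates so border pixels reuse the nearest edge value.
                    rr = min(max(r + kr - half, 0), height - 1)
                    cc = min(max(c + kc - half, 0), width - 1)
                    acc += image[rr][cc] * kernel[kr][kc]
            row.append(acc)
        blurred.append(row)
    return blurred


# Toy 5x5 "image" with a single bright pixel (made up, not from the dataset).
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 1.0
smoothed = gaussian_blur(toy, size=3, sigma=1.0)
```

The bright pixel is spread over its neighbours, which is the effect that softens fine-grained noise (soil texture, gravel) before the image reaches the network.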

Looking for a quick and inexpensive improvement, I turned to an earlier project: a CNN for handwriting classification on the MNIST dataset that had proven extremely discerning. My thought process was that plants, like handwritten characters, originate from a root/starting point and "grow" from there. Each plant is as unique as the digits 0 to 9, with minor variations due to environmental factors, much like the handwriting style of different individuals. The resulting CNN achieved 90% accuracy on the training data and 70% accuracy on the test set after 30 epochs, with the apparent overfitting beginning to occur around epoch 15.
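The point where overfitting sets in can be located programmatically from the per-epoch validation losses. The sketch below shows the usual early-stopping logic in plain Python. The function name, the `patience` value, and the loss curve are hypothetical and are not taken from this project's training run. Frameworks such as Keras offer the same idea through the `EarlyStopping` callback.

```python
def best_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement (0 if empty)."""
    best_loss = float("inf")
    best = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stalled; stop training here
    return best


# Hypothetical validation losses: improving at first, then drifting upward
# as the model starts to memorize the training set.
example_losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.70, 0.75, 0.80]
stop_at = best_epoch(example_losses, patience=3)
```

With a rule like this, a 30-epoch run that starts overfitting around epoch 15 would be cut short a few epochs later, and the weights from the best epoch would be the ones kept.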

Digging into the test data a bit, it was evident that the model has a challenging time discerning between immature (unsprouted) Black-grass and Loose Silky-bent grass, which is clear from the professional illustrations below (not from the dataset we're analyzing).

Looking at the images themselves, it became clear that the two grasses are nearly indistinguishable prior to flowering, which helps explain why the model has a tough time telling them apart, just as a human would:

In a similar vein, it's problematic how often the Loose Silky-bent grass is confused with Common wheat (at roughly a 50-50 rate; 17 to 17, to be exact). This suggests that further analysis would be impractical: it's better to let the proverbial "tares grow up among the wheat", just as Jesus preached.
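The class confusions described above are the kind of thing a confusion tally surfaces quickly. Here is a small standard-library sketch that counts (true, predicted) pairs and ranks the most frequent mistakes. The helper names and the toy labels are invented for illustration and do not reproduce the counts from this project.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused(true_labels, predicted_labels, top=3):
    """Return the `top` most frequent misclassifications as ((true, pred), n)."""
    counts = confusion_counts(true_labels, predicted_labels)
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:top]


# Toy labels (hypothetical, not the project's predictions).
y_true = ["Black-grass"] * 4 + ["Loose Silky-bent"] * 4 + ["Common wheat"] * 2
y_pred = [
    "Black-grass", "Loose Silky-bent", "Loose Silky-bent", "Black-grass",
    "Loose Silky-bent", "Black-grass", "Loose Silky-bent", "Common wheat",
    "Common wheat", "Common wheat",
]

worst_pairs = most_confused(y_true, y_pred, top=3)
```

Reading the ranked pairs in both directions (A mistaken for B, and B mistaken for A) shows whether a confusion is symmetric, as with two look-alike grasses, or one-sided.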

Tuesday, September 3, 2024

Computer Vision