Tuesday, September 3, 2024
Computer Vision
The delightful thing about Data Science is the novelty of it all. Classical statistical methods are time-tested, but not very sexy. For me, the thrill of discovering something new lies in sharing it with others. As such, I thought it useful to write up a discovery from a recent project.
The Aarhus University Signal Processing group, in collaboration with the University of Southern Denmark, has provided a dataset containing images of unique plants belonging to 12 different species. I was tasked with building a Convolutional Neural Network (CNN) model to classify the plant seedlings into their respective 12 categories.
I did some research, and the literature suggested that VGG19 does well on plant classification. Using the Adam optimizer and categorical cross-entropy loss, the resulting model reached 60% accuracy on the training set but only 45% accuracy on the test data, even after applying Gaussian blurring to regularize the provided plant images. This was clearly not good enough.
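To put those numbers in context, here is a minimal sketch of categorical cross-entropy for a 12-class problem, written in plain Python rather than taken from my project code. The function name `categorical_cross_entropy` and the example vectors are illustrative assumptions. It shows what the loss looks like for a model that guesses uniformly across all 12 species, which is the baseline any real model has to beat.

```python
import math

NUM_CLASSES = 12  # the 12 seedling species in the dataset


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_true: list of one-hot label vectors
    y_pred: list of predicted probability vectors (each sums to 1)
    eps guards against log(0) for predictions of exactly zero.
    """
    total = 0.0
    for truth, probs in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(truth, probs))
    return total / len(y_true)


# Hypothetical example: one image whose true class is index 3,
# scored by a model that spreads probability evenly over all classes.
one_hot = [1.0 if i == 3 else 0.0 for i in range(NUM_CLASSES)]
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES

baseline_loss = categorical_cross_entropy([one_hot], [uniform])
chance_accuracy = 1.0 / NUM_CLASSES

print(round(baseline_loss, 3))    # ln(12), about 2.485
print(round(chance_accuracy, 3))  # about 0.083
```

Random guessing over 12 balanced classes would land near 8% accuracy, so 45% on the test data is well above chance, just not good enough to be useful.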
Looking for a quick and inexpensive improvement, I turned to a CNN from an earlier project on the MNIST handwriting classification problem, one that had proven extremely discerning. My thought process was that plants, like handwritten characters, originate from a root/starting point and "grow" from there. Each plant is as unique as the digits 0 to 9, with minor variations due to environmental factors, much like the handwriting style of different individuals.
The resulting CNN achieved 90% accuracy on the training data and 70% accuracy on the test set after 30 epochs, with apparent overfitting beginning around epoch 15.
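One common way to act on that kind of overfitting is early stopping: keep the epoch with the best validation accuracy and stop once it fails to improve for a set number of epochs. The sketch below is a plain-Python illustration, not what I ran in the project. The helper `best_epoch`, the `patience` values, and the accuracy curve are all hypothetical, chosen only to show the mechanics.

```python
def best_epoch(val_acc, patience=3):
    """Return the 1-based epoch with the best validation accuracy,
    stopping once `patience` consecutive epochs show no improvement."""
    best = float("-inf")
    best_idx = 0
    waited = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            # New best: remember it and reset the patience counter.
            best, best_idx, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx


# Hypothetical validation-accuracy curve (NOT my actual results).
curve = [0.30, 0.45, 0.55, 0.62, 0.66, 0.70, 0.69, 0.68, 0.69, 0.71]

stop_short = best_epoch(curve, patience=3)  # gives up after epoch 9
stop_long = best_epoch(curve, patience=5)   # waits long enough to see epoch 10
print(stop_short, stop_long)
```

The patience setting is a trade-off: too small and training halts on a temporary plateau, too large and the extra epochs mostly memorize the training set. Keras users get the same behaviour from the `EarlyStopping` callback with its `monitor`, `patience`, and `restore_best_weights` arguments.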
Digging into the test data, it was evident that the model has a hard time distinguishing immature (unsprouted) Black-grass from Loose Silky-bent grass, which is clear from the professional illustrations below (not from the dataset we're analyzing).
Looking more closely at the images themselves, it became clear that the grasses were nearly indistinguishable prior to flowering, which helps explain why the model has a tough time telling them apart, just as a human would.
In a similar vein, it's problematic how often Loose Silky-bent grass is confused with common wheat (at roughly a 50-50 rate; well, 17-17 technically). This suggests that further analysis would be impractical: it's better to let the proverbial "tares grow up among the wheat", just as Jesus preached.
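For anyone wanting to find these confused pairs in their own predictions, here is a minimal sketch that tallies (true, predicted) pairs and ranks the off-diagonal ones. The helper names and the label lists are hypothetical. The 17/17 split is built in only to mirror one possible reading of the count above (17 Loose Silky-bent images predicted correctly, 17 predicted as Common wheat), and the Black-grass rows are pure filler.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def most_confused(y_true, y_pred, top=3):
    """Return the most frequent misclassified (true, predicted) pairs."""
    counts = confusion_counts(y_true, y_pred)
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda kv: -kv[1])[:top]


# Hypothetical labels, for illustration only.
y_true = ["Loose Silky-bent"] * 34 + ["Black-grass"] * 10
y_pred = (
    ["Loose Silky-bent"] * 17   # correct
    + ["Common wheat"] * 17     # confused with wheat
    + ["Black-grass"] * 6       # correct
    + ["Loose Silky-bent"] * 4  # confused with the other grass
)

worst_pairs = most_confused(y_true, y_pred)
for (truth, predicted), n in worst_pairs:
    print(f"{truth} -> {predicted}: {n}")
```

Ranking the off-diagonal pairs this way makes it obvious where the model's errors concentrate, which is usually more informative than a single overall accuracy figure.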
Wednesday, August 14, 2024
Top 10 Posts
In updating my resume I had occasion to compile and organize some of my favorite sports analytics posts, both here and on external sites. Here's the top 10, in chronological order:
Revisiting the Homecourt Advantage in College Basketball, the Sports Collection*, January 2012
Valuation of an NFL Pro-Bowl Tight End, Sportistician, June 2015
NFL Tight Ends and the Combine, Football Outsiders, March 2016
Tight End Prospecting, Rotoviz*, March 2017
2017 Tight End class, Rotoviz*, April 2017
*Indicates introduction/abstract available at no cost