When life gives you lemons, create a dataset!

Maciej Adamiak
SoftwareMill Tech Blog
Jun 10, 2020

Lemon quality control dataset—the making of

I really enjoy the experience of preparing a machine learning model, especially when working on computer vision assignments. Carefully exploring the problem, testing different approaches and finally getting satisfying results is what makes me love my profession. Of all the tasks that make up a machine learning pipeline, my favorite, and the one that gives me the greatest satisfaction, is definitely preparing the initial dataset. Fortunately, that is how I spent the last days of May.

SoftwareMill, in cooperation with Amplus, a Polish producer and distributor of fresh fruits and vegetables, decided to tackle the important issue of food quality. We started with a series of experiments on using computer vision for fruit quality checks. I’m sure you will be able to read more about that topic in the upcoming months. For now, let’s focus on how we prepared our demo lemon dataset. This was quite a spontaneous action that took a week to accomplish.

Lemon tree

At SoftwareMill we all work remotely. This means that the shipment of lemons was sent directly to my home in Łódź. I knew this would be a lot of fun when the delivery man started unpacking around 100 kg of lemons in my hall. You should have seen the look on his face. It was priceless! I collected nine packages, each containing lemons sorted by skilled quality controllers into different categories. The quality of a lemon depends on features such as color, shape, texture, lack of greening, blemishes, scars, diseases and mold. Each package contained either healthy lemons or lemons with only one of the flaws listed above. For the sake of simplicity we decided that defective items should be homogeneous, i.e. contain only a single flaw. This is going to help us in the future to precisely identify the defect against the background of healthy lemon tissue.

After carefully checking the packages and verifying that the classification was correct, we started designing our DIY photo station. We knew that time was short: to avoid the risk of lemons rotting before we could photograph them in the relevant category, we had to have our setup ready as soon as possible. As a consequence, we decided to use the simplest components possible. We ended up with a hexa base rotate kit powered by an Arduino UNO. We used a mobile phone camera exposed over HTTP by the IP Webcam application and an LCD monitor as a background. The features of the photo platform are simple. You can control its rotation by sending the relevant commands via a serial port from your PC to the Arduino. We coded a simple routine that takes twelve photos of a single lemon from different angles and sends them to the cloud.
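To give a feel for such a capture routine, here is a minimal sketch, not our actual code. It assumes a made-up `ROT <angle>` serial command (the real protocol of the rotation kit is not described here) and treats the camera as any function that returns image bytes. The names `rotation_angles`, `rotate_command`, `fetch_snapshot` and `capture_lemon` are hypothetical, and the upload to the cloud is left out.

```python
from urllib.request import urlopen

SHOTS_PER_LEMON = 12  # twelve photos per lemon, as described above


def rotation_angles(shots=SHOTS_PER_LEMON):
    """Evenly spaced platform angles in degrees, starting at 0."""
    step = 360 / shots
    return [round(i * step, 2) for i in range(shots)]


def rotate_command(angle):
    """Encode a hypothetical 'ROT <angle>' command for the Arduino."""
    return f"ROT {angle:g}\n".encode("ascii")


def fetch_snapshot(url):
    """Download a single frame over HTTP, e.g. from a phone camera app."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def capture_lemon(port, grab, shots=SHOTS_PER_LEMON):
    """Rotate the platform step by step and grab one image per angle.

    port: any object with a write(bytes) method, e.g. an open serial port
    grab: zero-argument callable returning the image as bytes
    """
    images = []
    for angle in rotation_angles(shots):
        port.write(rotate_command(angle))
        # A real setup would wait here until the platform stops moving.
        images.append((angle, grab()))
    return images
```

With pyserial installed, `port` could be a `serial.Serial` instance and `grab` could be `lambda: fetch_snapshot(url)`, where `url` points at the phone's snapshot endpoint. Both the device name and the URL depend on your setup.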

Lemon song

That’s when the real thing started. We processed each lemon and took multiple shots from different angles and in various lighting conditions. The set was divided in such a way that we could take samples in the morning, afternoon and late evening. A single fruit generated around 100 images.

Once we had acquired a decent sample from each category, we proceeded with annotating our raw images. The task was to identify the regions of the image that contain fruit. For this purpose we used the VIA annotation tool.

We used this data to train a PSPNet (Inception-V3 backbone with ImageNet pretrained weights). PSPNet is a neural network architecture commonly used for semantic segmentation, the process of classifying pixels into categories. We wanted to classify the regions of the image that contain a lemon. This way we were able to extract the background information.
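Once a segmentation model has labeled every pixel as lemon or not lemon, removing the background comes down to applying that mask. The sketch below illustrates the idea on plain nested lists. It is not our pipeline, the function name `blacken_background` is made up, and in practice you would do this with an array library.

```python
def blacken_background(image, mask):
    """Return a copy of the image with every non-lemon pixel set to black.

    image: rows of (r, g, b) tuples
    mask:  rows of 0/1 values with the same shape, 1 marking lemon pixels
    """
    same_rows = len(image) == len(mask)
    if not same_rows or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    black = (0, 0, 0)
    return [
        [pixel if keep else black for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
```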

After background extraction and some manual verification we ended up with around 10 000 samples. This set was divided into three subsets: training, validation and test. Each subset contains fruit pictures (1056 px × 1056 px) on a black background. The proportions of categories and origins were preserved, and each lemon occurs in only one subset.
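Keeping every lemon in exactly one subset matters because the many photos of one fruit are nearly identical, and letting them leak between training and test would inflate the results. A split like this can be sketched as follows. The 70/15/15 ratios, the tuple layout and the name `split_by_lemon` are assumptions made for illustration, and the sketch balances categories only, not origins.

```python
import random
from collections import defaultdict


def split_by_lemon(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole lemons to train/val/test, category by category.

    samples: list of (image_path, lemon_id, category) tuples
    ratios:  assumed train/val proportions; the rest goes to test
    """
    lemons_by_category = defaultdict(set)
    for _, lemon_id, category in samples:
        lemons_by_category[category].add(lemon_id)

    rng = random.Random(seed)
    subset_of = {}
    for category in sorted(lemons_by_category):
        lemon_ids = sorted(lemons_by_category[category])
        rng.shuffle(lemon_ids)
        n_train = round(len(lemon_ids) * ratios[0])
        n_val = round(len(lemon_ids) * ratios[1])
        for i, lemon_id in enumerate(lemon_ids):
            if i < n_train:
                subset_of[lemon_id] = "train"
            elif i < n_train + n_val:
                subset_of[lemon_id] = "val"
            else:
                subset_of[lemon_id] = "test"

    # All images of a lemon follow that lemon into its subset.
    split = {"train": [], "val": [], "test": []}
    for path, lemon_id, _ in samples:
        split[subset_of[lemon_id]].append(path)
    return split
```

Splitting lemon IDs rather than individual images is the design choice that guarantees no fruit shows up in two subsets.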

That’s how we prepared our demo dataset. We cannot wait to make good use of this data. Stay tuned for more news on our fruit datasets.

Of course we did not forget about post-processing.

Epilogue

Machine learning pipelines are very complex. I often catch myself thinking otherwise. All those awesome frameworks like TensorFlow, Keras, Scikit-learn, SciPy, OpenCV and many more have accustomed us to fast prototyping, reliability and a supportive community ❤. Having all these powerful tools within reach is very convenient and reassuring. Moreover, the ML open source community is flourishing and we can expect more interesting solutions to emerge in the near future. If you are observing machine learning from a newcomer’s perspective, it’s easy to fall into the trap of thinking that the work of engineers and scientists is only about creating a good model with ultrahigh accuracy. Take a look at all the introductory ML tutorials. You grab a dataset of numbers / flowers / cats / clothes, train, test and celebrate an accomplished task. Easy, right? Not so fast! Training your neural network is a pleasant experience only when you are working on a well-prepared and curated dataset. Frequently, the steps related to data preparation, pre- and post-processing are the most time- and resource-consuming parts of the whole machine learning pipeline. So kudos to all data engineers :)

Software engineer with a passion for functional programming, data engineering, machine learning and research.