Cucumber Leaf Disease Detection using GLCM Features with Random Forest Algorithm

Abstract: Agriculture plays a vital role in India's economy, and the health of crops is critical for maximizing yield. Cucumber, a key salad ingredient known for its health benefits, is susceptible to various diseases such as water mold, bacterial wilt, angular leaf spot, anthracnose, and powdery mildew. These diseases not only affect the quality of cucumbers but also significantly reduce their yield. Early detection is crucial for successful cultivation, but traditional manual identification by farmers or diagnosticians is time-consuming and prone to misidentification. To address these challenges, we explore advanced artificial intelligence techniques. We implement and compare various learning models, including ResNet, AlexNet, and VGG-16, for disease classification in cucumbers. However, these methods often struggle with noise, irrelevant features, and the generation of pertinent characteristics. To overcome these limitations, we propose an approach that combines GLCM (Gray Level Co-occurrence Matrix) feature extraction with a Random Forest classifier, with the aim of improving the accuracy and efficiency of disease detection. Our dataset comprises four distinct categories: Healthy, Anthracnose, Aphids, and CYSDV. It is sourced from diverse platforms, including online repositories such as Kaggle and direct collection from cucumber farms. The initial phase of our methodology reduces noise by converting images into the LAB color space and isolates specific regions using the k-means clustering algorithm. Subsequently, we extract texture features from the diseased leaf images using the GLCM algorithm, and classification is performed using the Random Forest model.
Comparative analysis shows that the proposed Random Forest model outperforms LGBM (Light Gradient Boosting Machine) and QSVM (Quadratic Support Vector Machine) in predicting disease presence in cucumber plants, with an accuracy of 98.62%, precision of 98.77%, recall of 98.48%, and F1 score of 98.62%.


Introduction
Agriculture is fundamental for sustaining human life on Earth, serving as a crucial tool for providing food and economic growth. As the world population is projected to reach 9.7 billion by 2050, efficient and effective farming practices are becoming increasingly essential [1]. However, the agricultural sector faces significant challenges, primarily due to plant diseases caused by fungi, viruses, and bacteria [2]. The accurate identification and diagnosis of plant infections are critical, as incorrect diagnoses can lead to decreased resistance in plants and reduced crop yields.
Traditionally, disease identification in plants has relied on physical examination, but this method often lacks accuracy and fails to provide the basis for effective treatment. In response to these challenges, there is a growing trend towards adopting artificial intelligence techniques, such as machine learning [3] and deep learning [4], to improve disease detection and reduce productivity losses. These technologies are particularly useful in extracting textural features from diseased leaves, aiding in their recognition and treatment. Cucumber is an economically significant crop: India exported 123,846 metric tons of cucumbers in the last financial year [5]. Cucumbers, known for their high water content, are popular for hydration after workouts and in hot weather. They are also beneficial for health, providing vitamin K, which aids in blood clotting, and offering anti-fungal benefits that may protect against heart diseases, cancer, and diabetes. However, cucumber crops are susceptible to various diseases caused by the overuse of chemicals, bacteria, viruses, and fungi.
Early and accurate detection of these diseases is critical for reducing crop losses and bolstering the agricultural economy. Traditional manual detection methods are time-consuming and often inaccurate, leading to substantial economic losses. Thus, there is a need for more advanced, AI-based methods for early disease detection to ensure effective treatment and to improve the quality and quantity of the yield. Current research in computer vision technology for disease detection in agriculture shows promise, but these methods come with their own advantages and disadvantages and exhibit varying levels of accuracy in cucumber production.
Int. Res. J. Multidiscip. Technovation, 6(1) (2024) 40-50
To address these challenges, this research proposes a robust and efficient machine learning algorithm for the segmentation and classification of diseased cucumber leaves. Our dataset, gathered from a cucumber field in Regalapalli village, Proddatur, and from Kaggle, aims to capture real-life instances of disease for more accurate analysis. After noise removal, our model achieves 98.62% accuracy using the Random Forest algorithm.
The primary contributions of this research paper can be summarized as follows:

Innovative Machine Learning Application for Agriculture: The research introduces a robust machine learning algorithm that significantly enhances the accuracy of disease detection in cucumber plants. This model, utilizing the Random Forest algorithm, achieves an accuracy of 98.62%, a notable advancement over existing agricultural disease identification methods.

Comprehensive and Realistic Data Collection for Improved Diagnosis: The paper presents a unique dataset, meticulously gathered from both a specific cucumber field in Regalapalli village, Proddatur, and online repositories. This diverse dataset ensures a more realistic and effective analysis, paving the way for improved diagnosis and management of plant diseases in agricultural practices.

This paper is structured as follows: Section 2 discusses various classification algorithms and their performance; Section 3 describes the methodology; Section 4 presents the results; and Section 5 concludes the study.

Literature Review
The diagnosis of plant diseases is a key area of focus in agricultural research. Numerous scientists and agronomists are dedicated to developing autonomous systems for identifying plant diseases without human intervention. In the context of salad vegetables, cucumbers are particularly important. They are valued for their health benefits and their ability to prevent dehydration, especially in summer. However, cucumbers are susceptible to diseases such as powdery mildew, downy mildew, angular leaf spot, anthracnose, and blight, which can adversely affect both yield and quality.
Mohammadali Khan [6] utilized the deep entropy-ELM method for early detection of diseased cucumber leaves. A study by Chunshan Wang et al. [7] employed DeepLab V3 and U-Net for cucumber leaf disease severity classification, achieving an accuracy of 93.27%.
To ensure optimal yield and quality, early detection of diseased leaves is crucial for farmers. Abdul Rehman and colleagues [8] developed a model that converts images to LAB color space for noise removal and uses k-means clustering to extract regions of interest (ROI). They fused Haralick features and local tri-directional patterns for classification using the Quadratic Support Vector Machine algorithm, achieving an accuracy of 96.80%. Kaiyu Li and Lingxian Zhang [9] proposed the DeepLab V3+ model to estimate the severity of leaf diseases by calculating the ratio of lesion pixels to leaf pixels, with an accuracy of 95.78%. While AlexNet and other image processing techniques have been used for plant disease identification, they often face challenges such as large parameter counts and high complexity. This issue was addressed by Shanwen Zhang and Subing Zhang [10], who implemented a global pooling dilated convolutional neural network, simplifying the model and successfully classifying six cucumber leaf diseases.
Chao Man, Zhaohui Jiang, Jingyao Zhang, Yuan Rao, and Shaowen Li [11] introduced a deep convolutional neural network model to identify diseases such as downy mildew, anthracnose, and powdery mildew in cucumber leaves, achieving an average accuracy of 96.11%. P. Karthika and S. Veni [12] emphasized the importance of regular monitoring of environmental conditions for healthy cucumber growth. They implemented an automated system using a Multiclass Support Vector Machine and the GLCM texture extraction algorithm, detecting diseases such as leaf miner, bacterial wilt, and leaf spot.
M. Yogeshwari and G. Thailambal [13] proposed a novel DCNN algorithm for plant leaf disease detection, preprocessing images with a 2D Adaptive Anisotropic Diffusion Filter and using GLCM for texture feature extraction. They achieved a classification accuracy of 97.43%.
Finally, Sandeep Kumar, KMVV Prasad, and A. Srelekha [14] developed an algorithm based on GLCM to extract texture features, using SVM for classification.

This method accurately identified leaf images affected by diseases such as anthracnose, bacterial blight, and Cercospora leaf spot. Table 1 summarizes the performance of different classification algorithms, including our proposed algorithm, listing their accuracy and the classes they predict.

Proposed Work
Dataset 1: In the first phase of this work, the dataset is gathered from Kaggle; the RGB images are preprocessed and converted to LAB color space for better extraction of the region of interest (ROI) from the leaf images, and the ROI is extracted through k-means clustering. In the second phase, GLCM features [13] are extracted from the segmented images. Finally, the Random Forest algorithm is used to classify the images, with an accuracy of 98.62%. Figure 1 shows the block diagram of the proposed methodology.

Image Acquisition
The dataset is gathered from Kaggle. The motivation behind gathering this dataset is to develop a model that can identify disease under heavy noise. Sample images of the Anthracnose, Healthy, CYSDV, and Aphids classes [15] are shown in Figure 2.

Preprocessing
During preprocessing, each RGB image is converted into LAB color space. In an RGB image, red, green, and blue components are added together to produce different colors. The LAB color space instead uses three axes: L for lightness, and a and b for the green-red and blue-yellow color-opponent dimensions. LAB is designed to approximate human vision more closely, as shown in Figure 3.
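As an illustration of the conversion step, the following is a minimal per-pixel sketch in plain Python. It assumes 8-bit sRGB input and the D65 reference white, which the paper does not specify; the function name `srgb_to_lab` is hypothetical, and in practice an image library would normally perform this conversion on whole images.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (assumes D65 white point)."""

    def linearize(c):
        # Undo the sRGB gamma curve to obtain linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white used for normalization.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        # Piecewise cube-root function from the CIE L*a*b* definition.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16
    a_star = 500 * (fx - fy)       # negative = green, positive = red
    b_star = 200 * (fy - fz)       # negative = blue, positive = yellow
    return lightness, a_star, b_star


white = srgb_to_lab(255, 255, 255)
black = srgb_to_lab(0, 0, 0)
green = srgb_to_lab(0, 255, 0)
```

Separating lightness (L) from the two color axes is what makes LAB convenient for the segmentation step: healthy green tissue has a strongly negative a value, so lesions tend to stand apart from it along the color axes regardless of illumination.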

Segmentation-K-Means Clustering
Because the images are taken in real-world conditions, they contain noise that makes segmentation difficult and challenging. The k-means clustering algorithm is used to partition the dataset into clusters [16]. The n observations are divided into k groups, and each observation belongs to exactly one group. K-means clustering minimizes the squared Euclidean distances between the points and their cluster center [17].
In the proposed method we choose k = 2; the segmented images are shown in Figure 4. Our aim is to minimize the squared Euclidean distances between the points and their cluster center.
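The clustering step can be sketched as follows. This is a minimal plain-Python version of Lloyd's k-means with k = 2, assuming each pixel is represented by its (a, b) color coordinates; that representation, the farthest-point initialization, and the sample values are illustrative assumptions rather than details given in the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        # Next initial center: the point farthest from all chosen centers.
        centers.append(tuple(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centers))))

    def assign(current):
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda j: sq_dist(p, current[j]))
                for p in points]

    for _ in range(iters):
        labels = assign(centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the mean of the assigned points.
                new_centers.append(tuple(
                    sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, assign(centers)


# Made-up (a, b) values: three greenish pixels and three lesion-like pixels.
pixels = [(-30.0, 20.0), (-28.0, 22.0), (-32.0, 18.0),
          (10.0, 40.0), (12.0, 42.0), (11.0, 41.0)]
centers, labels = kmeans(pixels, k=2)
```

With k = 2 the labels split the pixels into two groups, one of which can be kept as the region of interest.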

Feature Extraction
The texture features of the input images are extracted using the Gray Level Co-occurrence Matrix (GLCM), which provides crucial information for identifying the region of interest, specifically the diseased spots on the leaves. From the segmented images, a total of 13 features are extracted: homogeneity, variance, contrast, energy, correlation, dissimilarity, standard deviation, mean, maximum, minimum, median, range, and entropy. Each of these features helps to pinpoint and characterize the disease-affected areas in the images.
Range (Equation 13): Range is the difference between the maximum and minimum values:
$$\mathrm{Range} = \max_{i,j}\{x(i,j)\} - \min_{i,j}\{x(i,j)\}$$
where $x(i,j)$ denotes the value at position $(i,j)$.
Together, these features help to differentiate between healthy and diseased regions in plant leaves.
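To make the GLCM idea concrete, here is a small plain-Python sketch that builds a normalized, symmetric co-occurrence matrix for one pixel offset and derives a few of the listed texture features from it. The paper does not state its gray-level quantization, offsets, or exact feature definitions, so the 4-level toy image, the horizontal offset, and the formulas (standard Haralick-style definitions) are assumptions.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric GLCM for pixel pairs separated by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions (symmetric matrix)
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """A few texture features computed from a normalized GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    asm = sum(v * v for _, _, v in cells)
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "asm": asm,
        # Conventions differ: some texts call ASM itself "energy".
        "energy": math.sqrt(asm),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }


# Toy image with gray levels already quantized to 0..3.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(image, levels=4)
feats = glcm_features(p)
```

Statistics such as mean, median, minimum, maximum, and range are simple summaries and are omitted from the sketch.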

Classification
Random Forest Algorithm: The Random Forest (RF) algorithm is a versatile machine learning technique used for both classification and regression tasks. It constructs a multitude of decision trees during the training phase and integrates their outputs for the final prediction [18]. This approach reduces overfitting, often leading to improved accuracy. The algorithm is well suited to both categorical and continuous data types. In a Random Forest, a separate validation split is not strictly required: each decision tree is trained on a bootstrap sample that leaves out roughly one third of the training data, and these out-of-bag samples can be used to estimate the error.
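The remark about each tree missing part of the data refers to bootstrap sampling: drawing n samples with replacement from n items leaves some items out. The short simulation below, a sketch using only the standard library with a hypothetical helper name, estimates that left-out (out-of-bag) fraction and compares it with the theoretical value (1 - 1/n)^n, which approaches 1/e, about 36.8%, for large n.

```python
import random


def oob_fraction(n_samples, random_state=42):
    """Fraction of items never drawn in one bootstrap sample of size n_samples."""
    rng = random.Random(random_state)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples


n = 10_000
fraction = oob_fraction(n)        # simulated out-of-bag fraction
theory = (1 - 1 / n) ** n         # probability that a given item is never drawn
```

Because every tree has its own out-of-bag items, predictions on those items give an error estimate without setting data aside, although a held-out test set remains the more conservative check.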

Implementation Steps:
i) Data Pre-processing: This involves preparing the data by cleaning, normalizing, and possibly transforming it to ensure it is suitable for input into the model.
ii) Fitting the Random Forest Algorithm to the Training Set: In this step, the Random Forest model is trained using the preprocessed data. The algorithm learns to make predictions or decisions based on the features of the data.
iii) Prediction: After training, the model is used to make predictions on new, unseen data.
iv) Creation of a Confusion Matrix: A confusion matrix is generated to evaluate the performance of the model by comparing the actual versus predicted values.
v) Outputs: The final step involves interpreting the outputs of the model, which could be classifications in a classification task or predicted values in a regression task.
Figure 5 shows the workflow of the Random Forest algorithm:
Step 1: Pick N data points at random from the training set.
Step 2: Build a decision tree for each such subset.
Step 3: Each decision tree produces a result.
Step 4: The final result is obtained by averaging the outputs or by majority voting.
Step 5: Based on the vote, the model predicts the result.
In summary, the input images are converted to LAB color space for noise removal, and k-means clustering is used to extract the ROI. Feature extraction with GLCM then yields a total of 13 useful features from the segmented images. Finally, the Random Forest algorithm is used to classify the images.
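The workflow steps can be sketched in a deliberately simplified form. The code below is not the paper's implementation: it uses depth-one trees (decision stumps) instead of full decision trees, a made-up two-feature toy dataset, and hypothetical function names. The parameter names `n_estimators` and `random_state` only mirror the hyperparameters mentioned in this paper.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, feature_ids):
    """Best single (feature, threshold) split by weighted Gini impurity."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        # No valid split (e.g. one class only): always predict the majority.
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, n_estimators=50, random_state=42):
    rng = random.Random(random_state)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))                      # features tried per tree
    forest = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # Step 1: bootstrap sample
        feats = rng.sample(range(d), k)             # random feature subset
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], feats))  # Step 2
    return forest


def predict(forest, row):
    # Steps 3 to 5: every tree votes, and the majority label wins.
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Made-up feature vectors (for example, two texture features per image).
X = [[0.10, 1.0], [0.20, 1.2], [0.15, 0.9],
     [0.90, 3.0], [0.80, 3.1], [0.85, 2.9]]
y = ["healthy"] * 3 + ["diseased"] * 3
forest = fit_forest(X, y, n_estimators=50, random_state=42)
```

A production model would grow deeper trees and would typically rely on an established library rather than hand-written code; the sketch only shows how bootstrap sampling, per-tree training, and voting fit together.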

Experiments
In the Random Forest algorithm, hyperparameters are set to improve model performance and to make the model run faster. Here we use two hyperparameters: n_estimators=50 and random_state=42. n_estimators sets the number of decision trees built by the algorithm, and random_state fixes the seed that controls the randomness of the model, so that results are reproducible.

Confusion Matrix: The confusion matrix summarizes the predictions of our machine learning model by comparing actual and predicted classes. It is shown in Figure 6.
Heat map: A heat map is a two-dimensional graphical representation of data in which values are encoded as colors. The heat map of the co-occurrence matrix values is shown in Figure 7.

Receiver Operating Characteristic Curve: The ROC curve in Figure 8 shows the overall performance of the classification model across different threshold values. It plots the true positive rate against the false positive rate.
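As a sketch of how such a curve is obtained, the code below sweeps a decision threshold over predicted scores and records the false positive rate and true positive rate at each step. It assumes a binary setting (for a multi-class problem, one class versus the rest); the labels and scores are made up for illustration and are not results from this paper.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up example: 1 = diseased, 0 = healthy, with classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
```

An area close to 1 means the classifier ranks positive samples above negative ones at almost every threshold.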

Accuracy: The accuracy of the model is computed with the standard call metrics.accuracy_score(test_labels, test_prediction). It represents the proportion of correct results (both true positives and true negatives) among the total number of cases examined. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the formula is:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{(Equation 14)}$$
Precision: Precision is the proportion of predicted positives that are actually positive:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad \text{(Equation 15)}$$
Recall: Recall, or sensitivity, indicates the ability of the model to find all relevant cases within a dataset. It is the proportion of actual positives that were correctly identified:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad \text{(Equation 16)}$$
F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is particularly useful when the class distribution is imbalanced:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad \text{(Equation 17)}$$
The performance metrics are shown in Figure 9, and the images predicted by the proposed algorithm are shown in Figure 10.
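The metrics can also be computed directly from a confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, and it macro-averages the per-class values for the multi-class case; the averaging choice and the example matrix are illustrative assumptions, not the paper's reported numbers.

```python
def metrics_from_confusion(cm):
    """Accuracy plus per-class and macro-averaged precision, recall, and F1.

    cm[i][j] = number of samples of actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(map(sum, cm))
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # actual i, predicted other
        fp = sum(cm[r][i] for r in range(n)) - tp   # predicted i, actual other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    macro = tuple(sum(m[k] for m in per_class) / n for k in range(3))
    return accuracy, per_class, macro


# Made-up two-class confusion matrix for illustration only.
cm = [[8, 2],
      [1, 9]]
accuracy, per_class, macro = metrics_from_confusion(cm)
```

For this matrix the accuracy is 17/20 = 0.85, and the first class has precision 8/9 and recall 8/10.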

Dataset 2:
The dataset is gathered from a cucumber field located in Regalapalli village, Proddatur, Kadapa district. The images were captured with an OPPO mobile phone. The main aim of gathering our own dataset is to develop a robust model that is capable of identifying disease under heavy noise. In total, 167 images were gathered, covering Angular leaf spot, Anthracnose, Healthy, Powdery mildew, and Downy mildew. Figure 11 shows sample images [20].
All images are then converted into LAB color space for noise removal, and the ROI is extracted through k-means clustering. GLCM features are extracted from the segmented images, and classification is performed using three models: LGBM, Random Forest, and Quadratic SVM. Among the three models, the proposed Random Forest model gives the best results. Figures 12, 13, and 14 present the performance of the three models on Dataset 2 as confusion matrices, ROC curves, and a comparative analysis. Finally, the cucumber images predicted by the Random Forest algorithm are shown in Figure 15.

Conclusion
The proposed methodology is efficient and effective for the segmentation and classification of cucumber leaf images into four classes: Healthy, Anthracnose, Aphids, and CYSDV. In this research, the dataset is gathered from an online repository (Kaggle). RGB images are converted to LAB color space for noise removal, and the region of interest is extracted through k-means clustering. GLCM features are then extracted from the segmented images, and classification is performed using the Random Forest algorithm. This algorithm classifies the four cucumber classes with the highest accuracy, 98.62%. On our own dataset, the model achieves an accuracy of 78.74%; we expect the accuracy to improve as more images are collected. In future work, we aim to design an embedded system for plant disease identification and to consider a larger dataset with more classes.

Figure 1. Block diagram of the proposed methodology.

Figure 4. Segmented images of each class.

Table 1. Performance of different classification algorithms for cucumber leaf disease detection.
Proposed algorithm | Classes: Anthracnose, Aphids, CYSDV, Healthy | Accuracy: 98.62%