Classification of Large-Scale Datasets of Landsat-8 Satellite Image Based on LIBLINEAR Library

ABSTRACT: One of the most common machine learning approaches is linear classification. Multithreading, as used in the LIBLINEAR library, has become an important research topic for processing large datasets. Working in a multithreaded environment accelerates the training of large datasets and increases classification accuracy. This study examines multiple linear support vector machine (SVM) implementations for binary classification of Landsat-8 satellite image datasets to determine the optimal model. Four datasets were created for different scenes in Iraq, each containing millions of pixels to be classified. This article selects, from the solver types currently available in the LIBLINEAR library, the linear SVM algorithm best suited to such large datasets. Each dataset, containing about 4.5 million samples, was used to test performance. Seven linear SVM techniques were statistically examined to identify the most effective SVM implementation. Accuracy, F-score, and Kappa are used as model assessment measures to evaluate and rank the models' performance. Based on the results, the LIBLINEAR library with type 4 (LIBLINEAR4) was the best classifier for satellite image classification in remote sensing when applied to large datasets. The accuracy, Kappa, and F-score of the LIBLINEAR4 classifier are 92.89%, 73.53%, and 77.77% with dataset1; 96.25%, 59.97%, and 61.88% with dataset2; 96.65%, 68.33%, and 70.03% with dataset3; and 94.72%, 53.67%, and 56.25% with dataset4, respectively.


INTRODUCTION
In recent years, the capacity to collect data has grown continuously. This growth brings further difficulties, such as processing and storing large amounts of data. Artificial intelligence algorithms make it feasible to interpret data at a scale that humans alone are unable to manage [1,2]. The support vector machine (SVM) is a machine learning approach used in data mining that identifies an optimal separating boundary, one that correctly divides two or more classes while maximizing the margin between them [3]. In recent years, SVM models have been presented as among the most effective solutions to classification problems in mathematical programming. Since the traditional soft-margin formulation was first implemented, SVM algorithms have been applied to many different types of problems, including face recognition, machine vision, and bioinformatics [4,5].
LIBLINEAR can classify data that can be linearly separated by a hyperplane while maintaining the maximum distance from the support vectors. This can be done with a variety of SVM types, such as L2-regularized L2-loss support vector classification in dual form and in primal form [6]; the dual and primal variants solve the same optimization problem in different forms. Developing solutions to large-scale classification problems is critical in a variety of applications, including text classification. Linear classification, which has evolved over the years, is now considered one of the most effective learning techniques for large sparse data with a large number of instances and features [7,8]. LIBLINEAR is efficient for training on large-scale problems. For instance, it takes only seconds to train a text classification model on the Reuters Corpus Volume 1 (rcv1), which contains more than 600,000 samples. The same job would take a standard SVM solver such as LIBSVM many hours. In addition, LIBLINEAR is on par with or even faster than other state-of-the-art linear classifiers such as Pegasos [9,10].
Keywords: Linear Classification, SVM, LIBLINEAR, Large Dataset, Satellite Image.
Kumar and Sree [6] statistically examined SVM implementations resulting from 17 different combinations of SVM classifier libraries, SVM types, and kernel techniques to find the most effective SVM implementation technique. Chiang et al. [11] proposed a new framework for parallelizing the dual coordinate descent algorithm for large-scale linear classification. Lee et al. [12] studied parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. They compared and evaluated various implementations, ranging from simple to complex, and their results show that substantial acceleration may be achieved under optimal conditions. Cervantes et al. [13] noted that the standard SVM is unsuitable for classifying big datasets because its training complexity depends heavily on the size of the training set, and proposed a novel SVM classification technique for large datasets.
The objectives of the current study are as follows:
1. Apply a multithreaded library to choose the optimal model for classifying large datasets.
2. Calculate the training execution time for millions of records.
3. Create new datasets for various desertification-affected areas of Iraq to support future studies of these areas.

FIGURE 1. Diagram of Proposed Methodology
The raw Landsat 8 satellite images were calibrated: using the ENVI software, the images were converted from digital numbers to reflectance values. An image acquired from the Landsat 8 satellite has dimensions of around 7881 × 7881 pixels, which means that approximately 62 million pixels must be processed. Samples were therefore selected from each of the chosen scenes in Iraq. Class labels were created using the ArcGIS program, supervised learning methods, and direct observation of the studied phenomenon. Pixels that were part of the phenomenon were given a value of +1, while pixels that were not part of the phenomenon were given a value of -1. Finally, all pixel values were exported to a text file using the Extract Multi Values to Points tool in ArcGIS.

MATERIALS
The four datasets used in this study were created from Landsat 8 satellite images using ArcGIS 10.6. Specific areas of Iraq were selected and their Landsat 8 images downloaded; these images have also been used in other studies that detect certain phenomena with machine learning and remote sensing methods. Table 1 below contains the details of the datasets used in this study. Each dataset consists of 15 columns, and each column was generated by computing the normalized difference (ND) between one Landsat-8 band and another. An SVM is trained using the 15 NDs generated from six Landsat-8 bands (B2, B3, B4, B5, B6, and B7), comprising three visible bands (B2 to B4), one near-infrared band (B5), and two SWIR bands (B6 and B7). Four places in Iraq with sand dunes were chosen to assess the threat of desertification, which is one of the most significant aspects of climate change. Satellite images and remote sensing must be used to monitor and track the development of dunes over the next several years. The data were collected from four separate regions of Iraq. Sand dunes may be found at Baiji, Najaf, Samawah, and Amarah, among other locations.
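The feature construction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' ArcGIS workflow: the ordering (a - b) / (a + b) for each band pair, the column names, the zero-sum guard, and the reflectance values of the example pixel are all assumptions made for the sketch. It shows that six bands yield exactly 15 pairwise normalized differences.

```python
from itertools import combinations

# The six Landsat-8 bands named in the text.
BANDS = ("B2", "B3", "B4", "B5", "B6", "B7")


def normalized_differences(pixel):
    """Return one ND feature per unordered band pair for a single pixel.

    `pixel` maps band name -> reflectance. The (a - b) / (a + b) ordering
    and the "ND_a_b" names are assumptions for illustration.
    """
    features = {}
    for a, b in combinations(BANDS, 2):
        total = pixel[a] + pixel[b]
        # Assumed convention: a pixel whose two bands sum to zero gets ND = 0.0.
        features[f"ND_{a}_{b}"] = (pixel[a] - pixel[b]) / total if total != 0 else 0.0
    return features


# Hypothetical reflectance values for one pixel.
example_pixel = {"B2": 0.10, "B3": 0.12, "B4": 0.15, "B5": 0.30, "B6": 0.25, "B7": 0.20}
nd = normalized_differences(example_pixel)
print(len(nd))  # number of feature columns: C(6, 2) = 15
print(sorted(nd)[:3])
```

Each ND lies in [-1, 1] for non-negative reflectances, so the 15 columns share a common scale before they are passed to a linear classifier.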

METHODS
Support vector machines (SVMs) are a non-parametric supervised statistical learning approach, so no assumptions are made about the distribution of the underlying data. At the beginning of the process, a collection of labeled data instances is presented to the procedure. The SVM training approach attempts to find a hyperplane that divides the dataset into a certain number of predefined classes in a manner that is consistent with the training examples [14]. The optimal separating hyperplane is the decision boundary that minimizes the number of incorrect classifications made during training. Learning is an iterative process that involves identifying a classifier with an adequate decision boundary to separate the training patterns (in a potentially high-dimensional space) and then separating simulation data using the same parameters [15]. Models are constructed using LIBLINEAR for each of the four datasets processed during training. LIBLINEAR currently supports (1) L1-regularized L2-loss SVC, (2) L1-regularized L2-loss SVC, (3) L1-loss SVC, (4) L2-loss SVC, (5) L2-regularized LR, (6) L1-regularized LR, and (7) L2-regularized L2-loss SVR [17][18][19].

LIBLINEAR LIBRARY
LIBLINEAR is a library for large-scale linear classification that supports logistic regression and linear SVM. Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, l$, with $x_i \in R^n$ and $y_i \in \{-1, +1\}$, the instances fall into two groups. With the response vector $y \in R^l$ such that $y_i \in \{-1, +1\}$, a linear classifier generates a weight vector $w$ as the model. The decision function is $\mathrm{sgn}(w^T x)$. Seven solver types are used in this study [12,20].
For the multi-class formulation, the labels are assumed to be the integers $1, \dots, k$. A framework based on classifiers of the form $H_M(x) = \arg\max_{r = 1, \dots, k} M_r \cdot x$ was employed, where $M$ is a $k \times n$ matrix and $M_r$ is the $r$-th row of $M$. The inner product of the $r$-th row of $M$ and the instance $x$ is referred to as the confidence, or similarity score, for class $r$. By this definition, the predicted label is the index of the row that has the highest similarity score with $x$. Linear binary classifiers are a special case: an instance $x$ is labelled 1 if $w \cdot x > 0$ and 2 otherwise. Such a classifier can be built from a $2 \times n$ matrix with $M_1 = w$ and $M_2 = -w$ [23].
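The relationship between the binary decision rule and the matrix form with M1 = w and M2 = -w can be checked with a small Python sketch. It is a plain re-statement of the two rules, not LIBLINEAR's implementation; the weight vector and test instances are hypothetical, and exact ties (w·x = 0) are deliberately left out of the comparison.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def binary_predict(w, x):
    """Label 1 if w.x > 0, otherwise label 2 (the binary rule in the text)."""
    return 1 if dot(w, x) > 0 else 2


def multiclass_predict(M, x):
    """1-based index of the row of M with the highest score M_r . x."""
    scores = [dot(row, x) for row in M]
    return scores.index(max(scores)) + 1


# Hypothetical weight vector; M stacks w and -w as the two rows.
w = [0.5, -1.0, 0.25]
M = [w, [-v for v in w]]

for x in ([1.0, 0.2, 0.4], [0.0, 1.0, 0.0], [-2.0, 0.5, 1.0]):
    # Away from ties, both rules return the same label.
    print(x, binary_predict(w, x), multiclass_predict(M, x))
```

Because the second row's score is the negation of the first, the row with the larger score is row 1 exactly when w·x is positive, which is why the two predictions agree.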

ACCURACY ASSESSMENTS METRICS
Models are constructed using the R programming language, with 70% of the data used for training and 30% for testing to evaluate the performance of the models. The first step is to determine the accuracy of all algorithms using R software (caret package). Accuracy is determined by dividing the number of accurately predicted pixels by the total number of pixels [25][26][27], as shown in Eq. (7):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
In Eqs. (7) and (8), "TP," "TN," "FP," and "FN" denote true positives, true negatives, false positives, and false negatives, respectively [26]. The F-score is a measure of accuracy that combines the recall r and the precision p [27,28]. Precision is the ratio of true positives to all predicted positives, TP / (TP + FP). Recall is the ratio of true positives to all actual positives, TP / (TP + FN). The formula for the F-score is given in Eq. (8) [28,29]:
$$F = \frac{2\,p\,r}{p + r} \tag{8}$$
The Kappa statistic assesses how well a dataset's predicted classifications agree with its actual classifications while also accounting for the possibility that the two match by chance. Like the plain success rate, however, it does not take misclassification costs into account. Better models have a Kappa value closer to 1 [29,30]. The Kappa metric may be calculated using Eq. (9), where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance:
$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{9}$$
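As a worked illustration of the three measures, the following Python sketch computes accuracy, F-score, and Kappa from the four confusion-matrix counts of a binary classifier. It is not the caret code used in the study; the function name and the example counts are hypothetical, the F-score is assumed to be the balanced (F1) form, and the sketch does not guard against empty classes.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, F-score, and Cohen's Kappa from binary confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total

    precision = tp / (tp + fp)  # true positives over all predicted positives
    recall = tp / (tp + fn)     # true positives over all actual positives
    f_score = 2 * precision * recall / (precision + recall)

    # Agreement expected by chance: product of predicted and actual
    # marginal proportions, summed over both classes.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((tn + fn) / total) * ((tn + fp) / total)
    p_e = p_pos + p_neg
    kappa = (accuracy - p_e) / (1 - p_e)

    return {"accuracy": accuracy, "f_score": f_score, "kappa": kappa}


# Hypothetical counts for 100 test pixels.
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data, such as a scene where the positive class covers few pixels, accuracy can stay high while Kappa and the F-score drop, which is why the three measures are reported together.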

RESULTS AND DISCUSSION
The results are presented in graphical and tabular form to facilitate comparison between the classifiers on millions of rows of data. This part describes the models created during the experiments, the measurements obtained, and the conclusions drawn from analyzing the experimental data. The parameter settings for all datasets are (type = 1 to 7, cost = 1 and 100, bias = 1, verbose = TRUE). Tables 2, 3, 4, and 5 show the accuracy, Kappa, and F-score of the models constructed and evaluated with dataset1, dataset2, dataset3, and dataset4 using R software (caret package). Furthermore, the training time for each technique was computed for each dataset. According to Tables 2, 3, and 5, the accuracy of the fourth LIBLINEAR type was the best. These results indicate that this classifier (LIBLINEAR4) is suitable for datasets with millions of records. This type is the support vector classification formulation proposed by Crammer and Singer. Figure 3 shows the overall accuracy, Kappa, and F-score for each classifier on the four datasets. Figure 4 shows the training time for dataset1. A significant drawback of the majority of studies that use remote sensing data, which is large and requires advanced processing and analysis methods, is that they classify it with the SVM algorithm in the LIBSVM library, which is considered ineffective with large data. In this article, we therefore present results showing that the LIBLINEAR library is better in terms of efficiency and speed when working with remote sensing data.
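The experimental grid and the training-time measurement can be outlined as follows. This is a sketch in Python rather than the R code used in the study: `time_training` and `dummy_train` are hypothetical names, the trainer is a stand-in that does no learning, and it is an assumption that every type was run with both cost values, giving 7 × 2 = 14 configurations per dataset.

```python
import time
from itertools import product

TYPES = range(1, 8)   # solver types 1 to 7, as in the parameter settings
COSTS = (1, 100)      # the two cost values reported
BIAS = 1


def time_training(train_fn, X, y, solver_type, cost, bias=BIAS):
    """Run one training call and return (model, elapsed seconds)."""
    start = time.perf_counter()
    model = train_fn(X, y, solver_type=solver_type, cost=cost, bias=bias)
    return model, time.perf_counter() - start


def dummy_train(X, y, solver_type, cost, bias):
    """Hypothetical stand-in for a real trainer; only records its settings."""
    return {"type": solver_type, "cost": cost, "bias": bias, "n_samples": len(X)}


# Tiny placeholder data; the study uses about 4.5 million samples per dataset.
X = [[0.1, 0.2], [0.3, -0.1]]
y = [1, -1]

results = []
for solver_type, cost in product(TYPES, COSTS):
    model, seconds = time_training(dummy_train, X, y, solver_type, cost)
    results.append((solver_type, cost, seconds))

print(len(results))  # configurations evaluated per dataset
```

Replacing `dummy_train` with a call into an actual linear SVM trainer would yield per-configuration timings of the kind reported for each dataset.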

CONCLUSIONS
LIBLINEAR is an open-source library suitable for linear classification of large datasets. Investigations and analysis indicate that the solvers in LIBLINEAR have high theoretical and practical performance. This article applied many linear SVM implementations to large datasets to select the best SVM classifier. The results indicated that the LIBLINEAR4 classifier is a fast and effective classification approach for big datasets. The experiments indicate that the accuracy achieved with the LIBLINEAR4 type is comparable to that of other SVM implementations such as e1071, while the training time is greatly reduced. In addition, the LIBLINEAR method is scalable, meaning that it may be applied to huge datasets while still maintaining a high level of classification accuracy.

Figure 1
Figure 1 illustrates the stages followed in this study as a flow diagram. The diagram includes the steps of data preparation, classification, accuracy assessment, and the results required to solve the research problem. The approach consists of three major stages: preparing data for training, constructing classification models using seven models of the LIBLINEAR library, and evaluating the classification accuracy of the models using accuracy measures (overall accuracy and F-score). The raw images from the Landsat-8 satellite were obtained from the United States Geological Survey (USGS) (https://earthexplorer.usgs.gov) and downloaded in the GeoTIFF file format. Satellite images of different scenes from various places in Iraq were do