Image mining technique using Hadoop MapReduce over a distributed multi-node computer cluster

it easier and faster. Using a multi-node Hadoop cluster, MapReduce processing increases efficiency, resulting in time savings.

JobTracker/MapReduce, and TaskTracker/MapReduce, while the slave nodes run the DataNode/HDFS and TaskTracker/MapReduce daemons for data storage. When a client requests anything, the request is handled by the NameNode. For data stored in HDFS, the NameNode is responsible for keeping track of all storage-related metadata [10]. A Hadoop cluster is a computational cluster made up of more than one machine linked together using the Hadoop framework. It is mostly used to store and analyze massive data such as images, text files, and videos [1]. The Hadoop architecture may therefore be configured as either a single-node cluster or a multi-node cluster.

Single-node Hadoop cluster
In a single-node cluster, all daemons can be installed and configured on one virtual machine. This strategy is typically employed during the investigation and testing phase, or in contexts with little data; Hadoop might not be the best way to store data in such cases, because with too little data most of Hadoop's main benefits are not immediately apparent [10].

Multi-node Hadoop cluster
Setting up Hadoop as a multi-node cluster on several simulated machines requires virtualization; in other words, each data node runs on its own virtual machine. Enterprises employ this type of infrastructure for big data analysis. For real-world use, petabytes of data must be distributed among hundreds of machines so that they can be processed quickly [10].
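As a sketch, a minimal multi-node configuration points every node at the master's NameNode. The hostname `hadoop-master` and the replication factor below are illustrative assumptions, not the settings actually used in this work:

```xml
<!-- core-site.xml (all nodes): HDFS entry point; "hadoop-master" is a placeholder hostname -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml (all nodes): replicate each block to both slave nodes (illustrative) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```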

MapReduce Framework
MapReduce is a Java-based software framework that lets programmers run the same computation on multiple machines at the same time, processing data in a faster, more reliable, fault-tolerant, and more efficient manner. Apache Hadoop is widely used in science because it works well with the MapReduce programming model and is free to use [11,12]. With the help of the Google File System (GFS), the information is broken into smaller pieces and sent to many computers. Using MapReduce, a parallel programming API, calculations are distributed to the data (hence the name "Map") and combined at the end ("Reduce") [13]. These two functions run concurrently, and data is stored as "key, value" pairs. The Map function starts by reading a value from the input file; this value is passed to the function, yielding intermediate output values. These intermediate results are likewise stored on the cluster nodes as key-value pairs, and the records for any one key may pass through many nodes. The output of the Map function is sorted and then sent to the Reduce function [14]. The MapReduce algorithm works well for mining petabyte-sized datasets that cannot be stored on a single machine [14].
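The map, shuffle/sort, and reduce phases described above can be illustrated with a minimal in-memory sketch in plain Java (no Hadoop dependency; the word-count example and all method names here are ours, not Hadoop's API):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map phase: split each input chunk into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle/sort phase: group intermediate pairs by key, as Hadoop does
    // between the Map and Reduce tasks.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: combine all values that share a key into one result.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("big data big images", "data mining");
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String c : chunks) intermediate.addAll(map(c)); // one map task per chunk
        System.out.println(reduce(shuffle(intermediate)));
        // prints {big=2, data=2, images=1, mining=1}
    }
}
```

On a real cluster the map tasks run in parallel on the nodes holding each chunk, and the framework performs the shuffle over the network; the data flow, however, is exactly this.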
The rest of this study is organized as follows: Section 2 discusses previous research, Section 3 presents the proposed work, and Section 4 discusses the experiment's findings and implications.

Related Work
Numerous studies have looked at image mining, and earlier image mining approaches are summarized in this section. Image mining is a fundamental method for extracting information from visual data; it can be viewed as an image-processing application of data mining [15].
Reference [16] treats color as a differentiating feature and uses Block Truncation Coding (BTC) and Color Moments (CM) to extract features from an image dataset. The image collection is then divided into groups using the K-Means clustering algorithm.
Handwritten signature feature extraction and recognition were introduced by Biswas et al. (2010), who developed a way to extract characteristics from handwritten signature images; the calculated characteristics are then used for verification by means of a clustering approach. That research developed an image clustering method based on a k-nearest-neighbors technique, capable of handling clusters of various sizes and shapes, and experiments demonstrated that the approach can distinguish forgeries from genuine images [17]. Reference [18] noted that clustering is one of the most widely used image mining methods and may be applied in a variety of fields, including image segmentation and bioinformatics. Owing to its ease of implementation, simplicity, efficiency, and empirical success, K-means has become the most common and simplest clustering method. However, real-world applications generate enormous amounts of data, so properly managing this data in an essential mining operation is a substantial and demanding problem. As a message-passing programming model, MPI (Message Passing Interface) offers high performance, scalability, and portability; MKmeans, an MPI-based parallel K-means clustering technique, was inspired by this. The approach uses the clustering algorithm successfully in a parallel context, and experimentation reveals that MKmeans is robust and portable and operates with minimal time overhead on huge datasets.
In the field of remote sensing (RS) images, deforestation, climate change, ecosystems, and land surface temperature are some of the main research areas in which features need to be classified or clustered to provide a research basis. Clustering with the K-Means algorithm is a fundamental technique in the analysis of real-time RS images, but processing a huge number of RS images is impossible for single PCs because of their limited hardware resources and the long processing times involved; parallel and distributed computing approaches are unquestionably the best option. Unlike traditional methods, the approach in [19] was parallelized using Hadoop, an open-source system that uses the MapReduce programming model to store and analyze big datasets ranging in size from gigabytes to petabytes. Their research showed that the outcomes are acceptable and may inspire new techniques for solving similar issues in remote sensing applications.
In the Big Data era there are vast numbers of images, which makes satellite image retrieval hard, and high processing speed is now essential for some applications, such as responding quickly to disaster warnings. The authors of [20] demonstrate an efficient method for detecting satellite images using K-Means clustering on the Hadoop system: they design an effective K-Means algorithm based on the MapReduce programming model and the Hadoop distributed file system, realizing the two main MapReduce operations, Map and Reduce, in an efficient implementation. The results show fast detection speed and good scale-up while maintaining accuracy in both training and testing. To analyze a vast number of fingerprint images that could not otherwise be processed owing to a lack of physical memory, the authors of [14] turned to the MapReduce programming technique. Biometric features are preprocessed and extracted from images in an image data store before being stored in a database; the algorithm preprocesses multiple fingerprint images simultaneously to extract features (ridges and bifurcations). The results reveal a significant reduction in the time it takes to generate a feature vector for each processed image.
Our work combines the feature extraction methods explained above with image clustering to form an image mining system; in addition, the proposed work is applied on a multi-node Hadoop framework to obtain fast results.

Proposed Work
In this paper, we describe how image mining can be done in parallel on Hadoop using the MapReduce architecture described above. We use a multi-node Hadoop cluster as the framework on which to apply the algorithm. Fig. 1 shows how we use the Hadoop multi-node cluster to mine images. The images in the dataset are first converted into a file stored on HDFS. In the second step, color and texture features are extracted from the images by applying Map and Reduce functions to this file: the Hadoop MapReduce job divides the input file into chunks of 64 MB each, and the Map tasks on different nodes (PCs) then work in parallel to gather the pixels of each image from these chunks. After sorting, the outputs are sent to the Reduce tasks, which compute the color and texture features shown in the equations below and write the feature vector to an HDFS file.
Image mining is the process of finding features, patterns, and knowledge in large, domain-specific collections of images. Color, texture, and shape features are available [4]. Feature extraction is the core of diagnosis, classification, clustering, recognition, and detection [21].

Feature extraction

Color feature
The color feature is one of the most important features of an image and is defined with respect to a particular color space or model [22]. This paper uses color moments (CM), one of the most straightforward yet powerful methods, to extract the color feature. The mean, standard deviation, and skewness are the common moments, and they are computed as follows [21,22].

1- Mean
It is the average color value of the image and may be determined using equation (1) below [21]:

E_i = (1/m) Σ_{j=1}^{m} Q_{ji}    (1)

2- Standard deviation
It is a measurement of the variance of a set of values. The SD is calculated as the square root of the distribution variance, in the format described by equation (2) [21]:

σ_i = [ (1/m) Σ_{j=1}^{m} (Q_{ji} − E_i)² ]^(1/2)    (2)

3- Skewness
Skewness is a measurement of the asymmetry of the distribution [2], as shown in equation (3):

S_i = [ (1/m) Σ_{j=1}^{m} (Q_{ji} − E_i)³ ]^(1/3)    (3)

where Q_ji is the value of the pixel at location j in color channel i and m is the number of pixels in the image. The mean (E), standard deviation (SD), and skewness (S) are used to build the color feature vector: FColor = (E, SD, S).
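A minimal sketch of the three color moments for one color channel, assuming the channel's pixel values are passed as a flat array (class and method names are illustrative):

```java
public class ColorMoments {
    // Mean E: average value of the m pixels of one color channel (Eq. 1).
    static double mean(double[] q) {
        double s = 0;
        for (double v : q) s += v;
        return s / q.length;
    }

    // Standard deviation SD: square root of the channel variance (Eq. 2).
    static double std(double[] q) {
        double e = mean(q), s = 0;
        for (double v : q) s += (v - e) * (v - e);
        return Math.sqrt(s / q.length);
    }

    // Skewness S: cube root of the mean cubed deviation (Eq. 3).
    static double skewness(double[] q) {
        double e = mean(q), s = 0;
        for (double v : q) s += Math.pow(v - e, 3);
        return Math.cbrt(s / q.length);
    }

    public static void main(String[] args) {
        double[] channel = {10, 20, 20, 30}; // toy pixel values of one channel
        // Symmetric toy data: E = 20, SD = sqrt(50), S = 0.
        System.out.printf("E=%.2f SD=%.2f S=%.2f%n",
                mean(channel), std(channel), skewness(channel));
    }
}
```

Running the three methods once per channel (R, G, B) gives the nine color-moment values used in the feature vector.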

Texture feature
Texture feature extraction is a method for calculating and defining the attributes of an image that quantitatively represent its texture [23]. The Gabor filter, one of the best-known feature descriptors, invented by Gabor in 1946 [23], is used in this approach to extract the texture feature. The Gabor filter consists of a Gaussian kernel function modulated by a complex sinusoidal plane wave [24], as illustrated in (4); it works well in both the frequency and spatial domains [25] and at many different scales:

f_mn(k, l) = g_u(k, l) · s(k, l)    (4)

where the Gaussian component g_u(k, l) and the sinusoidal component s(k, l) are given in (5) and (6), respectively:

g_u(k, l) = exp( −(k² + l²) / (2σ_m²) )    (5)
s(k, l) = exp( j2πU_m (k cos θ_n + l sin θ_n) )    (6)

Here k and l are the spatial variables, while m and n are the scale and orientation indices, respectively. This experiment used four scales (1, 2, 3, 4) and eight orientations θ; consequently, the orientations range from 0° to 157.5° with an orientation bandwidth of B = 22.5°. U_h in equation (7) identifies the filter in the bank with the greatest spatial frequency, and the parameter σ_m is given in (8). Applying the Gabor filter f_mn(k, l) to an image I(k, l) of size R × Q yields the discrete Gabor wavelet transform of the image (9):

G_mn(k, l) = Σ_s Σ_t I(k − s, l − t) f*_mn(s, t)    (9)

After applying the Gabor filters to the image array at the various scales and orientations, the energy content is computed as in (10):

E(m, n) = Σ_k Σ_l |G_mn(k, l)|²    (10)

To cluster images or regions of similar texture, the mean μ_mn shown in (11) and the standard deviation σ_mn given in (12) represent the region's homogeneous texture feature:

μ_mn = E(m, n) / (R · Q)    (11)
σ_mn = [ Σ_k Σ_l (|G_mn(k, l)| − μ_mn)² / (R · Q) ]^(1/2)    (12)

FTexture = (μ00, σ00, μ01, σ01, …, μmn, σmn)

Parallel K-Means Algorithm with MapReduce on Hadoop
In the Hadoop multi-node MapReduce framework, the extracted features FColor and FTexture are clustered using a parallel K-Means algorithm written with the MapReduce programming model. This method works well and takes less time than both the normal sequential method and the Hadoop single-node method. One MapReduce job is assigned to each iteration of the parallelized K-Means method: the Map task computes the distance between each feature vector and each cluster center, and the Reduce task then updates the cluster centers [13]. The records of the dataset's features are saved in rows in order to initiate the Map job; as a result, every Map job has a record to process, and MapReduce on the Hadoop system completes the operation automatically [13].
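The per-iteration Map and Reduce tasks described above can be sketched as a single-machine simulation in plain Java (no Hadoop classes; the method names and toy data are ours):

```java
import java.util.*;

public class KMeansMapReduce {
    // "Map" task: assign each feature vector to its nearest centre
    // (conceptually, emit key = cluster index, value = the vector).
    static Map<Integer, List<double[]>> mapAssign(List<double[]> points, double[][] centres) {
        Map<Integer, List<double[]>> assigned = new HashMap<>();
        for (double[] p : points) {
            int best = 0; double bestD = Double.MAX_VALUE;
            for (int c = 0; c < centres.length; c++) {
                double d = 0; // squared Euclidean distance to centre c
                for (int i = 0; i < p.length; i++)
                    d += (p[i] - centres[c][i]) * (p[i] - centres[c][i]);
                if (d < bestD) { bestD = d; best = c; }
            }
            assigned.computeIfAbsent(best, k -> new ArrayList<>()).add(p);
        }
        return assigned;
    }

    // "Reduce" task: recompute each centre as the mean of its assigned vectors.
    static double[][] reduceUpdate(Map<Integer, List<double[]>> assigned, double[][] centres) {
        double[][] updated = new double[centres.length][];
        for (int c = 0; c < centres.length; c++) {
            List<double[]> pts = assigned.getOrDefault(c, List.of());
            if (pts.isEmpty()) { updated[c] = centres[c].clone(); continue; }
            double[] m = new double[centres[c].length];
            for (double[] p : pts)
                for (int i = 0; i < m.length; i++) m[i] += p[i];
            for (int i = 0; i < m.length; i++) m[i] /= pts.size();
            updated[c] = m;
        }
        return updated;
    }

    public static void main(String[] args) {
        List<double[]> features = List.of(new double[]{1, 1}, new double[]{1.2, 0.8},
                new double[]{8, 8}, new double[]{8.2, 7.9});
        double[][] centres = {{0, 0}, {10, 10}};
        for (int iter = 0; iter < 5; iter++) // one MapReduce job per iteration
            centres = reduceUpdate(mapAssign(features, centres), centres);
        System.out.printf("c0=(%.1f, %.1f)%n", centres[0][0], centres[0][1]);
    }
}
```

On the cluster, `mapAssign` runs in parallel on the nodes holding each HDFS chunk of the feature file, and the shuffle delivers all vectors with the same cluster key to one reducer.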

Experimental Work and Results
We used K-Means clustering on the multi-node Hadoop cluster to set up a method for mining images.
- The experimental programs are written in Java using the Eclipse IDE and executed in three ways: sequentially, on single-node Hadoop, and on multi-node Hadoop.
- "Apache Hadoop 3.3" serves as the cluster-distributed computing framework for the single-node Hadoop setup.
- For the practical implementation of the multi-node setup, we used one master and two slaves on two personal computers running the "Ubuntu Desktop 20.04.1 LTS" operating system, with "Apache Hadoop 2.7.1" installed on these nodes.
- One PC acts as the master and the other acts as the two slave nodes; "VMware Workstation 16 Pro" is used to create the virtual machines.
- The component packages of the programming framework used for the experiments provide access to MapReduce and HDFS.
- The dataset contains images whose pixels are written to a file stored on HDFS. The database has three classes of 256×256-pixel images: rose, berry, and dog images.
- The large image file is 968 MB (700 images), the medium file is 408 MB (300 images), and the small file is 132 MB (100 images).
- In the first part of the evaluation process, the color moment and Gabor measurements for each of the three color components (Red, Green, and Blue) are computed.
- The feature vector of each color layer has 67 elements: the first three are the mean, standard deviation, and skewness of the color feature, followed by thirty-two means (μmn) and thirty-two standard deviations (σmn) for the texture feature.
- The average feature vector for the whole image is computed from the feature vectors of the color layers.
- The second part applies the "Parallel K-Means Algorithm" based on the MapReduce model on the Hadoop system to the file produced by the MapReduce feature extraction program on HDFS.
- The experiment and its results allow a comparison of execution time between the multi-node Hadoop system, the single-node Hadoop system, and the conventional code.
- Table 1 demonstrates that feature extraction and the parallel K-Means approach on multi-node Hadoop are quicker than the conventional clustering algorithm and single-node Hadoop. The findings are quite satisfactory [23].

FIGURE 1. Image mining in a multi-node Hadoop cluster

Algorithm 3.1
Input: an RGB image dataset
Output: clusters of images
Method:
Step 1: begin
Step 2: put the file that contains the pixels of all color images on HDFS
Step 3: divide the file into chunks of size 64 MB
Step 4: distribute these chunks among the DataNodes
Step 5: in each DataNode, the MapReduce work is:
Step 6: for each image do:
Step 6.1: begin
Step 6.2: for each color component (R, G, B) in the image do:
Step 6.2.1: begin
Step 6.2.2: calculate the mean, standard deviation, and skewness
Step 6.2.3: apply the Gabor filter to the color channel
Step 6.2.4: construct the feature vector of that color
Step 6.2.5: compute the average feature vector of the three colors
Step 6.2.6: end
Step 6.3: write the feature vector (FColor, FTexture) to an output file on HDFS
Step 6.4: end
Step 7: apply K-Means clustering to this HDFS feature file, also in MapReduce
Step 8: write the output file to HDFS
Step 9: end