Utilizing a Synthetic Tabular Data Method to Improve Heart Attack Prediction Accuracy

The accuracy of these methods is directly proportional to the size of the database they are trained on. In other words, an inadequate training database is a drawback that limits their performance.

scalable generation of data in place of real data gathered from actual natural events. As a consequence, we employ CTGAN to generate an expanded training database from the results of patients' laboratory tests. The resulting database is fed to the RNN and LSTM to enhance their accuracy.
The study's primary contributions are as follows: 1) creating an automated process to predict the likelihood of cardiac arrest in patients with heart disease; 2) improving RNN and LSTM accuracy by employing the CTGAN method; 3) implementing the CTGAN method to expand the training data set of heart patients; 4) evaluating the effectiveness of the LSTM and RNN methods using several performance measures.
The remainder of the paper is structured as follows. The relevant literature is covered in Section 2. The primary approaches and procedures used in this paper are summarized in Section 3. The proposed system is described in detail in Section 4. Section 5 presents our observations and results, while Section 6 provides a summary of this study.

RELATED WORK
This section reviews the literature in fields similar to ours, since part of our work involves detecting the presence of heart problems and safeguarding patient information. The authors of [11] improved the internal forget-gate input to present a unique paradigm. The irregular time interval is smoothed to produce the time parameter vector, which is then used as the input to the forget gate to overcome the prediction barrier. The experiment's findings, obtained using medical information from a hospital's HIS, support the model's efficacy. The dynamic prediction model proposed in that paper outperforms the conventional LSTM model in classification by a large margin, with the upgraded model achieving an accuracy of 89% [11].
To detect probable heart disease, the study in [12] uses convolutional neural networks (CNNs) with a recent dataset obtained from the UCI repository. This dataset includes certain heart test parameters as well as usual human activities. The results show that the suggested model works better than the existing techniques cited in that study, with an overall accuracy of 97% [12].
In [13], researchers created a non-intrusive robot. Using the most common ECG, a system based on deep learning networks can perform basic categorization of ECG data, such as whether it represents a normal or abnormal (arrhythmic) recording. The arrhythmia database is available from MIT-BIH. They compared the performance of LSTM and 1D-CNN across a range of deep learning architectures. The results could be further improved by adopting a more complex deep learning architecture, at the cost of additional computation. The researchers also discussed how deep learning methods such as 1D-CNN can be used to classify arrhythmias. The most significant feature of the proposed method is that it requires no noise-filtering or feature-engineering mechanisms. The reported results demonstrate that its performance in classifying an ECG as normal or arrhythmic, with an accuracy of 99%, is superior to other published work [13].
The study in [14] demonstrates the potential for further investigation into generating synthetic EEG data using deep learning techniques such as TGAN and CTGAN. The EEG data generated by CTGAN exhibits higher similarity than TGAN, as shown through visualization and similarity scores. Unlike related research, the researchers attempted to use the synthesized dataset as input data for several machine learning algorithms. However, the study has a limitation: the machine learning models do not perform better with the synthetic data as input than they do with the real data. Using data from www.kaggle.com, the accuracy for all methods employed in that study ranged from 49.1% to 49.8% [14].
The main goal of the study in [15] was to classify heart disease using various models and a real-world dataset. To forecast the presence of the condition, a dataset of people with heart disease was processed with the k-modes clustering algorithm. The diastolic and systolic blood pressure values were grouped into bins of 10 intervals, while the age attribute was converted to years and divided into bins of 5 years. To account for differences in the progression and features of heart disease between men and women, the data were further divided by gender. For both the male and female datasets, the elbow-curve approach was used to estimate the appropriate number of clusters. The reported figures showed an accuracy of 87.23% for the MLP model. These results show that k-modes clustering can reliably predict cardiac disease, and the method may aid in developing focused diagnostic and therapeutic approaches for the disease. The study used the 70,000-record Kaggle cardiovascular disease dataset, and all algorithms were implemented in Google Colab. All algorithms achieved accuracy above 86%, with decision trees having the lowest accuracy (86.37%) and the multilayer perceptron the highest (as already indicated). Despite the positive outcomes, there are a few limitations to take into account. First, the study may not generalize to other demographics or patient groups, since only one dataset was employed. Additionally, the study considered only a small number of clinical and demographic attributes and neglected other potential risk factors for heart disease, such as genetic predispositions or lifestyle characteristics. Furthermore, the model's performance on a held-out test dataset was not assessed, which could have provided insight into how well the model generalizes to new, unseen data. Finally, the interpretability of the findings and their capacity to explain the resulting clusters were not evaluated. Further study is encouraged to overcome these issues and to understand the potential of k-modes clustering for heart disease prediction in light of these restrictions [15].

MATERIALS AND METHODS
The primary methods and materials used in this work are summarized in this section.

FEATURE STANDARDIZATION
Before the data is fed to the algorithms, feature scaling is performed, which entails transforming values using one of two main techniques: normalization or standardization. During normalization, the input values are rescaled so that they fall between 0 and 1. Standardization rescales the values to have a standard deviation of 1 and a mean of 0. This work employed the standardization technique [16].

LONG SHORT-TERM MEMORY (LSTM)
LSTM is a special case of RNNs, designed to deal with the problem of exploding or vanishing gradients. Its fundamental design consists of a memory cell and several gates; these memory cells and gates were first added to each neuron of the network in 1997 [6]. RNNs, which underlie systems such as Google voice search and Apple's Siri, are well suited to processing sequential data: thanks to their internal memory, they can recall previous inputs, which makes them ideal for machine learning problems involving sequences. RNNs work on the principle that they can predict an output by maintaining and reusing the output of a previous step. An LSTM cell operates in three parts. The first part decides whether the information from the previous timestamp should be remembered or forgotten. The second part attempts to learn new information from the current input. The third part passes the updated information from the current timestamp to the next timestamp. These three constituents are referred to as gates: the forget gate, the input gate, and the output gate, respectively. As in a traditional RNN, H(t-1) denotes the hidden state of the previous timestamp and H(t) the hidden state of the current timestamp; the cell state of the LSTM is represented by C(t-1) and C(t) for the previous and current timestamps, respectively [17]. LSTM uses two types of activation functions: the sigmoid, whose range is (0, 1) and which is used in the forget gate because it tends to discard unimportant information, and tanh, whose range is (-1, 1) and which is used to retain important information.
The algorithm is implemented in the following stages: Step 1: the forget gate determines what should be forgotten based on information from the previous time step; Step 2: new information is selected via the input gate and tanh to update the cell state; Step 3: the information from the two gates above is used to update the cell state; Step 4: the output gate and the squashing (tanh) operation produce the useful output.
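For reference, these steps correspond to the standard LSTM gate equations (using the notation above, where x(t) is the input at timestamp t, W and b are learned weights and biases, sigmoid(·) and tanh(·) are the activation functions, and * denotes element-wise multiplication):

f(t) = sigmoid(W_f · [H(t-1), x(t)] + b_f)        (forget gate)
i(t) = sigmoid(W_i · [H(t-1), x(t)] + b_i)        (input gate)
C~(t) = tanh(W_c · [H(t-1), x(t)] + b_c)          (candidate cell state)
C(t) = f(t) * C(t-1) + i(t) * C~(t)               (cell state update)
o(t) = sigmoid(W_o · [H(t-1), x(t)] + b_o)        (output gate)
H(t) = o(t) * tanh(C(t))                          (hidden state / output)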

RECURRENT NEURAL NETWORK (RNN)
Recurrent neural networks (RNNs), on which systems such as Apple's Siri and Google voice search are built, are among the most capable algorithms for analyzing sequential data. Due to their internal memory, they can recall previous inputs, which makes them ideal for machine learning problems involving sequential data. An RNN works (as shown in Fig. 1) on the principle that it can predict a layer's output by preserving that layer's output and feeding it back into the input [18]. To create a single layer of a recurrent neural network, the nodes from various layers of the neural network are condensed [19]. RNNs use the sigmoid or tanh function for hidden layers; the tanh function generally performs better. Only the identity activation function is linear; all other activation functions are non-linear.

FIGURE 1. - Structure of Recurrent Neural Networks [20]
The algorithm functions in the following stages: Step 1: Initialize. Before implementing the basic RNN cell, we first determine the dimensions of the parameters U, V, W, b, and c; Step 2: Forward pass; Step 3: Compute loss; Step 4: Backward pass; Step 5: Update weights; Step 6: Repeat steps 2-5.
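As an illustrative sketch only (not the implementation used in this work), the forward pass of such a basic RNN cell can be written in Python as follows; the parameter names U, W, V, b, and c follow the step description above:

import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    # Forward pass of a basic RNN over a sequence of input vectors x_seq.
    h = np.zeros(W.shape[0])                 # initial hidden state
    outputs = []
    for x_t in x_seq:                        # Step 2: forward pass over timesteps
        h = np.tanh(U @ x_t + W @ h + b)     # hidden state update with tanh activation
        y_t = V @ h + c                      # output projection at this timestep
        outputs.append(y_t)
    return np.array(outputs), h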

DATA AUGMENTATION
Data augmentation is a set of techniques used to provide additional data to a machine learning model, either as slightly modified copies of existing data or as newly produced synthetic data generated from existing data. It smooths the machine learning model and reduces overfitting. Data augmentation approaches can be applied to a variety of data types, including tabular data, images, audio, video, text, and others [21] [22].

GENERATIVE ADVERSARIAL NETWORK (GAN)
To better understand generative adversarial networks, consider the two parts of the name: the first word, generative, denotes a network that continuously generates new data, while the second, adversarial, denotes two networks that compete with one another. The GAN was the first deep learning model of this kind to be used [23]. It consists of two neural networks that compete with one another: the discriminator, which is trained to distinguish between real and fake data, and the generator, which creates fictitious data. During training, the generator adjusts itself to trick the discriminator, learning to produce fake data that is indistinguishable from the real data [24].

PERFORMANCE METRICS
Four evaluation metrics are employed in this study: accuracy, precision, sensitivity, and F1-score. The accuracy measure counts how often the classifier is correct over the entire dataset; it is expressed as the proportion of correctly classified instances to the total number of instances. Precision is the ratio of correctly labeled positive instances to all cases that were classified as positive, whether correctly or mistakenly; in other words, precision measures how often an instance predicted as positive is actually positive. Sensitivity measures how well the system identifies positive cases; it is the ratio of correctly classified positive cases to the total number of actual positive cases, including those incorrectly classified as negative. The F1-score combines sensitivity and precision and is used to assess a classifier's overall performance. The four metrics are computed with the following equations [25][26][27]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
Precision = TP / (TP + FP)    (2)
Sensitivity = TP / (TP + FN)    (3)
F1-score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)    (4)

Here, True Positive (TP) denotes positive cases that are correctly labeled as positive, False Positive (FP) denotes negative cases that are incorrectly labeled as positive, True Negative (TN) denotes negative cases that are correctly labeled as negative, and False Negative (FN) denotes positive cases that are incorrectly classified as negative.
To evaluate the method used to augment the dataset in this paper, the equations shown below were used. A synthetic numerical value x_s is considered to match a real value x_r when

|x_s − x_r| ≤ 0.01 × |x_r|    (5)

and the overall score is computed as

New Row Synthesis score = 1 − (number of matched synthetic rows / total number of synthetic rows)    (6)

where x_r represents the values in the real data and x_s represents the values in the synthetic data.
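As a concrete illustration (not the code of this study), the four metrics can be computed directly with scikit-learn; y_true and y_pred below are placeholder label arrays:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder labels: 1 = heart-attack risk present, 0 = absent
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Sensitivity (recall):", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))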

PROPOSED SYSTEM
This study demonstrates the creation of a secure and efficient healthcare system by leveraging existing knowledge of cardiac conditions. By utilizing a collection of clinical markers, the system aims to assist in the diagnosis and classification of cardiac patients. Improving performance accuracy is crucial, especially when compared to previous similar studies. In this study, we compare the performance of the long short-term memory (LSTM) and recurrent neural network (RNN) algorithms.
Since these algorithms perform better with larger data sets, we first augmented the patient data set (training data only) using the CTGAN method. The results obtained for the newly generated data, measured using the New Row Synthesis metric, were promising. Subsequently, we compared the performance of the LSTM and RNN algorithms in predicting the likelihood of cardiac patients experiencing a heart attack.
The system is composed of two phases, as illustrated in Fig. 2: data augmentation of the original dataset and diagnosis of the patient's condition. Below is a detailed explanation of these phases.

PRE-PROCESSING PHASE
Because our system uses a publicly available dataset, the data values in this collection were likely entered manually, obtained from many sources, and subsequently published by various government entities. This means that these values must be preprocessed before classification. Although there are several ways to perform pre-processing, we use the feature scaling method in our proposed solution. It is a technique for spreading independently occurring features in the data uniformly throughout a predetermined range, keeping a broad range of magnitudes or values under control. Without feature scaling, a machine learning algorithm would treat values without regard to their units, so large values would dominate small ones regardless of their actual importance. As part of the standardization procedure, each feature value is rescaled using Equation (7), resulting in a distribution with a mean of 0 and a variance of 1.

V_new = (V − μ) / σ    (7)

where V is a value for a feature from the dataset, V_new is the scaled value of that feature, μ is the average of the feature values, and σ is the standard deviation of the feature values.
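As a small illustrative sketch (not the authors' exact code), Equation (7) is the standard z-score scaling provided by scikit-learn's StandardScaler; the file and column names below are placeholders:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")                # hypothetical file name for the heart dataset
features = df.drop(columns=["target"])       # "target" is a placeholder label column

scaler = StandardScaler()                    # implements V_new = (V - mean) / std, as in Eq. (7)
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)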

SPLITTING DATA PHASE
After the necessary preprocessing, the data is divided into two sets: the training set, whose main purpose is to allow the machine learning algorithms to produce accurate results, and the testing set, which is used to evaluate the system's performance. The training set makes up 70% of the data, and the testing set makes up the remaining 30%.
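A minimal sketch of this 70/30 split with scikit-learn (the file and column names are placeholders, not the exact names used in this study):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                # hypothetical file name
X = df.drop(columns=["target"])              # placeholder label column
y = df["target"]

# 70% of the data for training, 30% for testing, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)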

DATA AUGMENTATION PHASE
Data augmentation, a collection of techniques, can be utilized to artificially increase the amount of data by creating new data points from existing data. Deep learning models are employed either to add new data points or to make minor adjustments to the existing data.
Data augmentation enhances the performance and output of machine learning models by introducing more diverse instances to the training datasets.Machine learning models perform better and achieve higher accuracy when trained on extensive and diverse datasets.However, collecting and labeling such data can be time-consuming and costly.Implementing data augmentation strategies can help businesses reduce these operational expenses by modifying existing datasets.
In this system, the CTGAN technique is used. CTGAN utilizes a generative adversarial network (GAN) to model the distribution of tabular data and to sample representative rows from that distribution. To handle non-Gaussian and multimodal column distributions, CTGAN employs mode-specific normalization. The GAN consists of two components: the generator, which produces synthetic data, and the discriminator, which learns to differentiate between real data and the instances created by the generator.
After applying data augmentation with the CTGAN method to a dataset comprising 13 features and 300 rows, we obtain a new dataset with 13 features and 5000 rows. Several metrics, such as Memory Requirements, Machine Learning Efficacy, Statistical Similarity, Distance to Closest Record, CategoricalCAP, and NumericalLR (NLR), can be used to evaluate the performance. In our case, we measured the performance of CTGAN using the New Row Synthesis metric, for which the optimal ratio of 1.0 was obtained. This metric determines whether each row in the generated data is new or duplicates an existing row in the real data. Consequently, the resulting GAN can produce a synthetic dataset that is entirely fabricated but maintains the same structure as the real data.
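The following is a hedged sketch of how such an augmentation step might look with the open-source ctgan package (the API can differ between versions, and the file and column names are placeholders rather than the authors' exact configuration):

import pandas as pd
from ctgan import CTGAN

train_df = pd.read_csv("heart_train.csv")    # hypothetical: the 70% training split only
# placeholder list of categorical columns in the heart dataset
discrete_columns = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]

model = CTGAN(epochs=300)                    # conditional GAN for tabular data
model.fit(train_df, discrete_columns)        # learn the distribution of the real rows

synthetic_df = model.sample(5000)            # generate 5000 synthetic patient rows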

EVALUATION DATA AUGMENTATION PHASE
This metric aims to identify rows that are identical between the real and synthetic datasets. For a match to occur, every individual value in the real row must match the corresponding value in the synthetic row. The specific matching requirements depend on the data type. Boolean/categorical data: the value in the real data and the value in the synthetic data must be identical. Numerical or date-time data: the synthetic value is considered a match if it lies within a certain percentage of the real value; by default, this percentage parameter is set to 0.01 (1%).
The next step is to calculate the percentage of rows in the synthetic data that match a row in the real data. The score is the complement of this percentage, so that 1 represents the best score (every synthetic row is new) and 0 represents the poorest score (every synthetic row has a match).
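As an illustrative sketch of this matching rule (assuming purely numerical columns and the default 1% tolerance; this is not the library's exact implementation):

import numpy as np
import pandas as pd

def new_row_synthesis(real: pd.DataFrame, synthetic: pd.DataFrame, tol: float = 0.01) -> float:
    # Returns 1.0 when every synthetic row is new and 0.0 when every row matches a real row.
    real_values = real.to_numpy(dtype=float)
    matched = 0
    for row in synthetic.to_numpy(dtype=float):
        # a synthetic row matches if some real row agrees on every column within the tolerance
        close = np.isclose(row, real_values, rtol=tol, atol=0.0)
        if np.any(close.all(axis=1)):
            matched += 1
    return 1.0 - matched / len(synthetic)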

PREDICTION PHASE
Because the proposed system can directly affect human life and the diagnosis of health conditions, we analyze two well-known algorithms (LSTM and RNN) to identify which is superior according to the assessment metrics.
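A minimal sketch of how the two classifiers might be built and compared with Keras (the layer sizes, single-timestep reshaping of the 13 tabular features, and training settings are illustrative assumptions, not the configuration reported in this study):

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(cell: str, n_features: int = 13) -> tf.keras.Model:
    # Builds a small recurrent classifier; cell is either "lstm" or "rnn".
    recurrent = layers.LSTM(64) if cell == "lstm" else layers.SimpleRNN(64)
    model = models.Sequential([
        layers.Input(shape=(1, n_features)),    # each tabular row treated as a length-1 sequence
        recurrent,
        layers.Dense(1, activation="sigmoid"),  # probability of a heart attack
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# X_train etc. are assumed to come from the earlier 70/30 split,
# reshaped with X_train.values.reshape(-1, 1, 13) before fitting.
for name in ("lstm", "rnn"):
    clf = build_model(name)
    # clf.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test))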

EXPERIMENTAL RESULTS AND DISCUSSION
The proposed work consists of two phases: data augmentation for the medical dataset and estimation of the likelihood of cardiac patients experiencing a cardiac arrest. This section examines and evaluates the outcomes of these phases. The system is implemented in Python, and the dataset used in these stages is publicly available; it contains various features such as age, gender, anemia, diabetes, high blood pressure, and other factors that can significantly affect a person's risk of developing heart disease.
The first phase addresses the main issue of the small size of the dataset used in our work. To address this, we increased the size of the dataset using the CTGAN method. The performance of this method was evaluated using the New Row Synthesis measure, a well-known performance metric, and we achieved a ratio of 1.0, the highest possible value for this scale.
The second phase involves two steps to accomplish our objectives. In the first step, we applied the LSTM and RNN algorithms to predict the likelihood of heart patients having a heart attack, using the heart-patient dataset before augmentation, as shown in the following summary. (The dataset is available online at https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci.)

Table 3 shows a comparison of our findings with those of several relevant studies. The table has five columns: the reference number, the year of publication, the dataset used, the technique applied, and the accuracy attained.

CONCLUSION
In this paper, the CTGAN method was used to augment the patient dataset, because the LSTM and RNN algorithms used in our work need a large dataset to work efficiently and with higher accuracy. After applying these algorithms to predict the likelihood of cardiac arrest in cardiac patients, we obtained accuracies of 99% and 100% for LSTM and RNN, respectively. In future research, other methods could be used to augment the dataset and compared to select the best one in terms of accuracy.

FIGURE 2. - The Architecture of the Proposed System

FIGURE 7. - The performance comparison between LSTM before and after data augmentation

FIGURE 8. - The performance comparison between RNN before and after data augmentation

Table 1. - The performance results of RNN and LSTM prediction techniques

In the second step, we applied the same algorithms after data augmentation of the same dataset, as shown in the following summary:

FIGURE 3. - The accuracy of RNN
FIGURE 4. - The accuracy of LSTM

B: Performance of RNN and LSTM algorithms after data augmentation: