Sentiment Analysis Using Naive Bayes Algorithm with Feature Selection Particle Swarm Optimization (PSO) and Genetic Algorithm

ABSTRACT


INTRODUCTION
Information extraction is the technique of gathering information from an unstructured collection of texts. The target information must be defined as structured information to be extracted [1]. Information about social media users' opinions towards certain entities can also be obtained from such data. To make this opinion data useful, it can be processed using sentiment analysis [2]. It is also important to pay attention to sentence structure, the use of informal language, emoticons, and image services. The main task in this analysis is to classify the polarity of a text at the document, sentence, or feature level, that is, to decide whether the opinion expressed in the document, sentence, or entity feature is positive or negative.
In their research, Rifat et al. compared the Support Vector Machine and K-Nearest Neighbor algorithms for sentiment analysis. The multinomial Naive Bayes algorithm offered the best performance compared to other traditional machine learning algorithms [3]. Classification is one of the data mining methods used to predict the value of a set of data. Classification methods are used to describe data or to forecast future data trends. The biggest challenge in classification research in data mining is the class imbalance problem, and one of the proposed solutions is feature selection [4]. Feature selection is one factor that affects classification accuracy. Optimizing feature selection reduces the processing required for a large number of features. A relatively small subset of features is essential to improve classification accuracy quickly and effectively [5]. In that research, using three correctly chosen selection features made the classifier run well, more effectively and more efficiently, by minimizing the amount of data analyzed [6].
Sean and Dwi compared the K-Nearest Neighbor and SVM methods for sentiment analysis, and the Support Vector Machine (SVM) algorithm offered the best performance [7]. Meanwhile, Farkhund Iqbal et al. proposed a hybrid framework for sentiment analysis using genetic-algorithm-based feature reduction. With this hybrid approach, they reduced the feature set size by 42% without sacrificing accuracy. Compared with Principal Component Analysis (PCA) and the more widespread feature reduction technique based on latent semantic analysis (LSA), their feature reduction technique increased precision by up to 15% over PCA and increased accuracy by up to 0.2% over LSA [8].
Based on the studies above, the authors are motivated to employ Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) for feature selection in order to improve accuracy. The feature selection in this study is expected to increase the accuracy of Naïve Bayes.

RESEARCH METHODS
2.1 DATA COLLECTION
This study uses qualitative research because the data obtained are words. These words are obtained by crawling the Twitter Search API. They are then processed to identify the sentiment they contain, by extracting information to classify the polarity of the text towards feature entities as either positive or negative. This study focuses on tweets related to the keyword vaccine, because vaccination is a hot topic at the moment. The study uses three kinds of data, namely tweet data, stop words, and essential words. In addition, literature studies and references to national and international journals are needed to gain additional knowledge of the theoretical foundations, analytical concepts, and methods of data classification.

RESEARCH
In this study, the process is carried out with text mining. Text mining is a text analysis process carried out automatically by a computer, the objective of which is to obtain quality information from the text summarized in a document. The main process in this method is to find words that represent the content of the document, so that the relationship between documents can be analyzed further using statistical methods such as cluster analysis, classification, and association [9].
The steps taken in text extraction are as follows:
1. Word preprocessing. The actions taken at this stage include converting text to lowercase and tokenizing.
2. Feature selection. The actions taken at this stage are eliminating stop words and stemming [10].
The data collected are tweets obtained by crawling Twitter users' posts with the search word vaccine.
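The preprocessing steps listed above (lowercasing, tokenizing, stop-word removal, stemming) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the stop-word list, the suffix list, and the function names `tokenize`, `stem`, and `preprocess` are all assumptions made for the example, and the toy suffix stripper stands in for a real stemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration; the study's own list is not given.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "for", "i", "my"}

# Hypothetical suffixes for a toy stemmer; a real pipeline would use a proper stemmer.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Case folding -> tokenizing -> stop-word removal -> stemming."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in kept]


print(preprocess("The vaccine is WORKING and I booked my second dose!"))
```

Keeping each stage in its own function makes it easy to swap the toy stemmer or stop-word list for language-specific resources without touching the rest of the pipeline.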

SENTIMENT ANALYSIS
Sentiment analysis, or opinion mining, is a field of study that analyzes evaluations, opinions, judgments, attitudes, and emotions towards entities such as products, organizations, services, individuals, events, problems, and subjects [11]. This analysis is used to obtain specific information from an existing dataset. Sentiment analysis focuses on processing opinions that contain polarity, that is, opinions with a positive or negative sentiment value. Sentiment class labeling was performed with lexicon-based functions. The sentiment value of each sentence is then calculated, so that the number of negative and positive values in a sentence can be determined from the meaning of each word. Next, a comparison is conducted to determine whether the sentence has a positive or negative sentiment class.
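As a rough illustration of lexicon-based labeling, the sketch below counts positive and negative lexicon hits in a tokenized sentence and compares the two totals. The lexicon contents, the function name `label_sentence`, and the tie-breaking rule are assumptions made for the example; the scoring formula actually used in the study is not reproduced in the text.

```python
# Hypothetical mini-lexicon for illustration; the study's lexicon is not given.
POSITIVE_WORDS = {"safe", "good", "effective", "grateful"}
NEGATIVE_WORDS = {"afraid", "bad", "pain", "fake"}


def label_sentence(tokens):
    """Count lexicon hits and compare the positive and negative totals."""
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    # Assumption: ties (including sentences with no lexicon hits) are labeled
    # negative, only to keep this example binary.
    label = "positive" if positive > negative else "negative"
    return positive, negative, label


print(label_sentence(["vaccine", "safe", "effective", "pain"]))
```

In practice the treatment of ties and of sentences without any lexicon words is a design decision that affects the class balance of the labeled data.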

NAIVE BAYES ALGORITHM
Naive Bayes is one of the algorithms included in classification techniques. The British scientist Thomas Bayes proposed probability and statistical methods that predict future possibilities based on previous experience. The theorem is combined with the "naive" assumption that the attributes are independent of one another. In other words, Naive Bayes assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature. The stages of the Naïve Bayes method, which follow the equations of Bayes' theorem, are:
a. Counting the amount of data
b. Finding the probability value (P)
c. Finding the mean (µ)
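The stages listed above (counting the data, finding the class probability, finding the mean) can be sketched as a small Gaussian Naive Bayes classifier. This is an illustrative sketch under assumptions: the variance step and the Gaussian likelihood are taken from the standard Gaussian formulation, since the list in the text stops at the mean, and the names `fit` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Return, per class, the prior, the feature means, and the feature variances."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    total = len(rows)                                       # stage: count the data
    model = {}
    for label, members in grouped.items():
        n = len(members)
        prior = n / total                                   # stage: probability P(class)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]           # stage: mean per feature
        # Assumed extra stage: variance, with a small floor to avoid division by zero.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (prior, means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log posterior, treating features as independent."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: two word-count features per tweet (values invented for illustration).
rows = [[3, 0], [2, 1], [0, 3], [1, 2]]
labels = ["positive", "positive", "negative", "negative"]
model = fit(rows, labels)
print(predict(model, [3, 1]))
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.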

PARTICLE SWARM OPTIMIZATION
Particle Swarm Optimization (PSO) is a population-based branch of evolutionary computation in which a population of particles searches for the optimal solution. PSO is based on the behavior of a flock of birds or a school of fish. If a flock has no leader to guide it in the search for food, its members wander around looking for food locations. The algorithm is therefore based on the social behavior of animals: individual actions plus the influence of other individuals create social behavior. A population is generated randomly, with values ranging from the smallest to the largest value. Each particle represents a position, that is, a candidate solution of the problem in question. Each particle seeks the optimal solution by adjusting itself towards the best particle. Each particle keeps its own best value, and the swarm keeps the best value found globally. The particles search a wide area to explore various paths in the search space. Every solution is represented by a particle position. Its performance is evaluated in each iteration by entering the solution into the fitness function. A particle is treated as a point in an N-dimensional space, and its position in the search space is denoted by X. The following equations describe position and velocity:

$$X_i(t) = \big(x_{i1}(t), x_{i2}(t), \ldots, x_{iN}(t)\big)$$
$$V_i(t) = \big(v_{i1}(t), v_{i2}(t), \ldots, v_{iN}(t)\big)$$

where X is the position of the particle, V is the velocity of the particle, i is the particle index, and t is the iteration in the N-dimensional space. Furthermore, the status of each particle is improved with the following calculation model:

$$V_i(t+1) = w\,V_i(t) + c_1 r_1 \big(P_i - X_i(t)\big) + c_2 r_2 \big(G - X_i(t)\big) \qquad (2.20)$$
$$X_i(t+1) = X_i(t) + V_i(t+1) \qquad (2.21)$$

Here $P_i = (p_{i1}, p_{i2}, \ldots, p_{iN})$ represents the local best position of the i-th particle, while $G = (g_1, g_2, \ldots, g_N)$ represents the global best position of the whole swarm. $c_1$ and $c_2$ are positive constants usually called learning factors, $r_1$ and $r_2$ are positive random numbers between 0 and 1, and $w$ is the inertia parameter. Equation (2.20) is used to obtain the new particle velocity. It is based on the previous velocity, the distance between the current position and the local best position, and the distance between the current position and the global best position. The particle then flies to its new position according to Equation (2.21). The PSO workflow can be seen in Figure 2.
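The velocity and position updates described above can be sketched as a small feature-selection routine. This is a minimal sketch under stated assumptions, not the implementation used in the study: the parameter names `w`, `c1`, `c2` and their default values, the 0.5 threshold that turns a position into a feature mask, the clamping to [0, 1], and the stand-in `toy_fitness` are all hypothetical. In a real run the fitness would be the cross-validated accuracy of the classifier on the selected features.

```python
import random


def pso_select(fitness, n_features, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Continuous PSO over [0, 1]^n; a feature is kept when its coordinate exceeds 0.5."""
    rng = random.Random(seed)

    def decode(position):
        return [x > 0.5 for x in position]

    positions = [[rng.random() for _ in range(n_features)] for _ in range(n_particles)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]
    local_pos = [p[:] for p in positions]                 # best position of each particle
    local_fit = [fitness(decode(p)) for p in positions]
    g = max(range(n_particles), key=local_fit.__getitem__)
    global_pos, global_fit = local_pos[g][:], local_fit[g]  # best position of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to local best + pull to global best.
                velocities[i][d] = (w * velocities[i][d]
                                    + c1 * r1 * (local_pos[i][d] - positions[i][d])
                                    + c2 * r2 * (global_pos[d] - positions[i][d]))
                # Position update, clamped so the particle stays in the unit hypercube.
                positions[i][d] = min(1.0, max(0.0, positions[i][d] + velocities[i][d]))
            fit = fitness(decode(positions[i]))
            if fit > local_fit[i]:
                local_fit[i], local_pos[i] = fit, positions[i][:]
                if fit > global_fit:
                    global_fit, global_pos = fit, positions[i][:]
    return decode(global_pos), global_fit


def toy_fitness(mask):
    """Hypothetical stand-in: rewards features 0 and 2, penalizes the others."""
    return sum(mask[i] for i in (0, 2)) - 0.5 * sum(mask[i] for i in (1, 3, 4))


mask, score = pso_select(toy_fitness, n_features=5)
print(mask, score)
```

The default of 10 particles mirrors the population size reported as best in the testing section; the other defaults are common textbook choices rather than values taken from the study.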

GENETIC ALGORITHM
Genetic Algorithm (GA) belongs to the group of evolutionary algorithms. It was first introduced by Holland in 1975 and is a commonly used search method inspired by population genetics. The algorithm also follows Charles Darwin's theory of evolution, in which strong individuals survive in the population. The essential elements of natural genetics are natural selection, crossover, and mutation.
Natural selection is an attempt to retain the best individuals by reproducing them, so that the best are not lost in the next iteration. The crossover operator is used to create new individuals, and each new individual requires two parents. The most commonly used parent selection technique is the roulette wheel. The mutation operator is used to replace the worst individuals with new individuals. The number of individuals replaced depends on the mutation rate parameter.
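The three operators described above can be sketched as a small feature-selection loop. This is an illustrative sketch, not the study's implementation: the function name `ga_select`, the default parameter values, the one-point crossover, and the stand-in `toy_fitness` are assumptions. Mutation is implemented as the text describes it, by replacing the worst individuals with new random ones, rather than as the bit-flip mutation found in many textbook variants.

```python
import random


def ga_select(fitness, n_features, pop_size=10, n_gen=30,
              mutation_rate=0.2, seed=0):
    """Evolve boolean feature masks; requires n_features >= 2 for one-point crossover."""
    rng = random.Random(seed)

    def new_individual():
        return [rng.random() < 0.5 for _ in range(n_features)]

    population = [new_individual() for _ in range(pop_size)]
    n_replace = min(int(mutation_rate * pop_size), pop_size - 1)

    for _ in range(n_gen):
        ranked = sorted(population, key=fitness, reverse=True)
        scores = [fitness(ind) for ind in ranked]
        low = min(scores)
        # Shift scores so every roulette-wheel weight is strictly positive.
        weights = [s - low + 1e-6 for s in scores]

        children = [ranked[0][:]]                     # selection: keep the best individual
        while len(children) < pop_size:
            mother, father = rng.choices(ranked, weights=weights, k=2)  # roulette wheel
            cut = rng.randrange(1, n_features)        # one-point crossover
            children.append(mother[:cut] + father[cut:])

        # Mutation as described in the text: the worst individuals are replaced.
        children.sort(key=fitness, reverse=True)
        for k in range(1, n_replace + 1):
            children[-k] = new_individual()
        population = children

    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(mask):
    """Hypothetical stand-in: rewards features 0 and 2, penalizes the others."""
    return sum(mask[i] for i in (0, 2)) - 0.5 * sum(mask[i] for i in (1, 3, 4))


mask, score = ga_select(toy_fitness, n_features=5)
print(mask, score)
```

Because the best individual is copied unchanged into each new generation and is never among those replaced, the best fitness in the population cannot decrease from one generation to the next.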

DATA SET
In this study, the authors used qualitative research with 1000 respondents. The text is then passed to sentiment analysis to determine whether it is positive or negative with respect to the vaccine keyword. The data has been cleaned to prevent missing values.

TESTING PROCESS
The test follows a predetermined scenario, namely the Naive Bayes classification method with Particle Swarm Optimization and Genetic Algorithm feature selection. The test scenario was carried out three times, each using cross-validation. The first test used the Naive Bayes algorithm without feature selection. The second test used the Naive Bayes algorithm with Particle Swarm Optimization feature selection. The last test used the Naive Bayes algorithm with Genetic Algorithm feature selection. The results of the three tests are compared, and the algorithm with the highest accuracy is used.
Before conducting the test, it is necessary to determine the population size that gives the highest accuracy. Population size is the number of items in the population from the sample taken. The best population size is then applied in the tests. The accuracy results can be seen in the population size accuracy table (Table 1). Based on this test, the highest accuracy was found at population size 10, with an accuracy of 77.50% and an AUC of 0.720. This population size is then used for feature selection when testing the Naïve Bayes algorithm with the predetermined feature selection methods. Confusion matrix values and ROC curves can be seen in the image below.
Based on the above test, the highest accuracy was found in the k-fold 10 experiment, with an accuracy of 77.50% and an AUC of 0.720.
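The cross-validation used in each test scenario can be sketched as follows. This is a minimal sketch under assumptions: the helper names `k_fold_indices`, `cross_val_accuracy`, and `majority_baseline` are hypothetical, the folds are contiguous and unshuffled for simplicity (real experiments usually shuffle or stratify), and the majority-class predictor is only a placeholder for the Naive Bayes classifier.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(train_and_predict, rows, labels, k):
    """Train on k-1 folds, predict the held-out fold, and pool the correct predictions."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        test_rows = [rows[i] for i in test_idx]
        predictions = train_and_predict(train_rows, train_labels, test_rows)
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(rows)


def majority_baseline(train_rows, train_labels, test_rows):
    """Hypothetical placeholder classifier: always predict the most common training label."""
    top = max(set(train_labels), key=train_labels.count)
    return [top] * len(test_rows)


# Toy data invented for illustration: 7 positive and 3 negative examples.
rows = [[i] for i in range(10)]
labels = ["positive"] * 7 + ["negative"] * 3
print(cross_val_accuracy(majority_baseline, rows, labels, k=5))
```

Passing the classifier in as a function keeps the evaluation loop identical across the three scenarios, so only the feature-selection step differs between runs.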

NAIVE BAYES ALGORITHM TESTING PROCESS WITH GENETIC ALGORITHM SELECTION FEATURE (GA)
In this test, experiments were carried out using the Naive Bayes algorithm with Genetic Algorithm (GA) feature selection. The test results can be seen in the table below. Based on this test, the highest accuracy was found in the k-fold 6 experiment, with an accuracy of 71.78% and an AUC of 0.725. Based on all the tests carried out, the following comparison chart shows the results for Naive Bayes, Naive Bayes combined with PSO, and Naive Bayes combined with GA.

CONCLUSION
Based on the testing and analysis, it is concluded that using feature selection with the Naïve Bayes algorithm for Twitter sentiment analysis helps improve the performance and accuracy of the tests carried out. The Naïve Bayes algorithm model alone produces a highest accuracy of 60.26%. For comparison, the Naïve Bayes model combined with Particle Swarm Optimization (PSO) feature selection shows a highest accuracy of 77.50%, while the Naïve Bayes model combined with Genetic Algorithm (GA) feature selection shows a highest accuracy of 71.78%. The combination of Naive Bayes and PSO therefore gives the better result, with an increase in accuracy of 17.24 percentage points over Naive Bayes alone.
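The comparison in the conclusion can be checked with a few lines of arithmetic. The sketch below uses only the highest accuracies reported in the text; the variable names are illustrative.

```python
# Highest accuracies reported in the text, in percent.
accuracy = {
    "Naive Bayes": 60.26,
    "Naive Bayes + PSO": 77.50,
    "Naive Bayes + GA": 71.78,
}

baseline = accuracy["Naive Bayes"]
# Gain of each model over plain Naive Bayes, in percentage points.
gains = {name: round(value - baseline, 2) for name, value in accuracy.items()}
best = max(accuracy, key=accuracy.get)

print(best, gains)
```

The gains are differences in percentage points, not relative improvements, which matches how the 17.24 figure is derived.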

ISSN: 2721-3056
International Journal of Advances in Data and Information Systems, Vol. 2, No. 2, October 2021: 95-104
PSO searches for solutions through a population composed of several particles.

[1] Vikas, BO & Mungara, J. "Enhanced Extraction and Summarization Techniques with User Review Data for Product Recommendations to Customers." International Journal of Scientific Research in Science, Engineering and Technology, vol. 2, pp. 25-30, 2016.

Table 1. Population Size Accuracy

Table 2. Accuracy of Naïve Bayes Algorithm
Based on the above test, the highest accuracy was found in the k-fold 7 experiment, with an accuracy of 60.26% and an AUC of 0.519.

3.2.2 PROCESS OF TESTING NAIVE BAYES ALGORITHM WITH FEATURE SELECTION PARTICLE SWARM OPTIMISATION (PSO)
In this test, experiments were carried out using the Naive Bayes algorithm with Particle Swarm Optimization (PSO) feature selection. The test results can be seen in the table below:

Table 3. Accuracy of Naïve Bayes Algorithm with PSO
Sentiment Analysis Using Naive Bayes Algorithm with Feature Selection Particle Swarm … (Abi Rafdi)

Table 4. Accuracy of Naïve Bayes Algorithm with GA

Table 5. Graph of Test Comparison