page:
7 pages/≈1925 words
Sources:
3
Style:
APA
Subject:
IT & Computer Science
Type:
Research Paper
Language:
English (U.S.)
Document:
MS Word
Date:
Total cost:
$ 45.36
Topic:

Mutual Information for Feature Selection

Research Paper Instructions:

Hi, we are coming to the final version of this paper. I understand you did not have enough time for the last assignment, but I appreciate that you still gave me some high-quality work at the end. I can make the time limit more flexible, but you need to let me know when you can finish everything I need before the deadline I set so that I can plan accordingly. For this final paper, I want at least 2,000 more words (approximately 5,000 in total). The things to add concern the algorithms. As you mentioned in the paper: "Data scientists use algorithms that maximize only relevant information while minimizing unnecessary or redundant information." This is abbreviated as 'mRMR', which is one of the algorithms for MI-based feature selection (MIFS). Try to pick 2 or 3 algorithms of this kind and expand on them in detail: what is the mathematical definition, and what are the pros and cons compared to other algorithms?

I attach some articles that you might want to look at.

https://www.sciencedirect.com/science/article/pii/S0957417415004674

http://home.penglab.com/papersall/docpdf/2005_TPAMI_FeaSel.pdf

https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b

Research Paper Sample Content Preview:

Mutual Information for Feature Selection
Name
Institutional Affiliation
Mutual Information for Feature Selection
Introduction
While model selection is critical for learning signals from data, providing the right variables is just as important. In machine learning, model building requires constructing relevant features or variables through feature engineering, and the resulting dataset can then be used as statistical input to train a model. Although these models are often assumed to be sophisticated and smart algorithms, they are easily fooled by unnecessary clutter and dependencies in the data. Data scientists therefore make the signals easier to identify by performing feature selection, a necessary step in data pre-processing (Huijskens, 2017). According to Zhou, Wang, and Zhu (2022), feature selection is a fundamental pre-processing step in machine learning because it retains only the crucial features and eliminates redundant or irrelevant features from the primary dataset. Battiti (1994) likewise treats this pre-processing stage as a critical step in which the required number of appropriate features is selected from raw data, affecting both the complexity of the learning phase and the achievable generalization performance. In using mutual information (MI) to select features for supervised neural net learning, Battiti (1994) notes that although the information in the input vector must be sufficient to determine the output class, excess input burdens the training process and leads to neural networks with more connection weights than the problem at hand requires. From an application-oriented perspective, excessive features also lengthen pre-processing and recognition times, even when learning and recognition performance is satisfactory (Battiti, 1994). Data scientists use algorithms that maximize only relevant information while minimizing unnecessary or redundant information.
One technique adopted by machine learning experts and data scientists is mutual information feature selection. In this approach, a filter-based feature selection criterion is used to assess how relevant a subset of features is for predicting the target variable, as well as how redundant it is with respect to the other variables. Nevertheless, Beraha et al. (2019) note that existing algorithms are often heuristic and provide no guarantee that they will resolve a proposed problem. This limitation motivated the authors to propose a novel view of the theoretical results, showing that conditional mutual information arises naturally when bounding the ideal regression or classification errors achievable by various features or subsets of features. One thing to do before selecting is to remove words that appear only infrequently in a single category, because they are destined to have high mutual information with that category and low mutual information with the others. Low word frequency has a strong influence on mutual information: if a word is not frequent but appears mainly in a certain category, it will receive a high mutual information score, which introduces noise into the screening. To avoid this situation, the words can first be sorted by frequency so that words appearing in only one category with low frequency are eliminated, and the remaining words are then sorted by their mutual information. This paper describes mutual information for feature selection, the kind of feature selection suitable for mutual information, and measures for relevance and redundancy with their relative strengths and drawbacks.
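A minimal sketch of this pre-filtering step, assuming a bag-of-words text dataset and scikit-learn's MI estimator, is shown below; the names texts, labels, and min_count, as well as the thresholds, are illustrative placeholders rather than anything prescribed by the cited sources.

```python
# Sketch: drop rare words that appear in only one category, then rank the
# remaining words by mutual information with the class label.
# `texts`, `labels`, and `min_count` are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def rank_words_by_mi(texts, labels, min_count=5):
    labels = np.asarray(labels)
    vec = CountVectorizer()
    X = vec.fit_transform(texts)                      # document-term counts
    words = np.asarray(vec.get_feature_names_out())

    counts = np.asarray(X.sum(axis=0)).ravel()        # total frequency per word
    present = X > 0                                   # sparse boolean presence
    n_cats = np.zeros(X.shape[1], dtype=int)          # categories each word appears in
    for c in np.unique(labels):
        in_cat = np.asarray(present[labels == c].sum(axis=0)).ravel() > 0
        n_cats += in_cat.astype(int)

    # keep words that are frequent enough or that occur in more than one category
    keep = np.flatnonzero((counts >= min_count) | (n_cats > 1))

    mi = mutual_info_classif(X[:, keep], labels, discrete_features=True)
    order = np.argsort(mi)[::-1]                      # highest MI first
    return list(zip(words[keep][order], mi[order]))
```

Calling rank_words_by_mi(texts, labels) would return (word, MI score) pairs sorted from most to least informative, with the rare single-category words already removed before the MI ranking is computed.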
Background
In machine learning, it is highly unlikely that all variables in a dataset will be useful for building real-life models, and adding redundant or unnecessary variables ultimately reduces a model's capacity to generalize. Redundant variables also reduce the accuracy of the classifiers used in such models and may increase model complexity. The main purpose of feature selection is therefore to find the best set of features that allows data scientists to build useful models of the phenomenon being studied. Feature selection is implemented under either of two learning settings: supervised or unsupervised. In supervised learning, labelled data is used to identify the relevant features and increase the efficiency of supervised models such as regression or classification, while unsupervised techniques use unlabelled data (Gupta, 2020). These approaches are broadly classified into four methods: filter, wrapper, embedded, and hybrid. In filter methods, models select features based on the intrinsic properties of the measured features using univariate statistics rather than cross-validation performance. Filter methods are faster and require less computational power than wrapper methods. Wrapper techniques, on the other hand, require the models to search the space of possible feature subsets, assessing quality by training and evaluating a classifier on each subset. Embedded methods combine the benefits of the filter and wrapper methods by accounting for feature interactions while maintaining reasonable computational demands; they are iterative in the sense that each iteration of the model-training process extracts the features that contribute most to that round of training. Mutual information (MI) is an example of a filter method used to select categorical input variables. Various statistical measures are used to select the most relevant variables depending on the types of the input and output variables.
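As a rough illustration of the filter approach, and of MI as a univariate scoring function, the snippet below applies scikit-learn's SelectKBest to a synthetic dataset; the dataset and the choice of k = 5 are arbitrary examples, not values drawn from the cited studies.

```python
# Sketch of a univariate filter method: score each feature against the target
# with mutual information and keep the k best. The data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("per-feature MI scores:", selector.scores_.round(3))
print("kept feature indices:", selector.get_support(indices=True))
```

Wrapper and embedded methods would instead train and score a model for each candidate subset, which is what makes them more expensive than this kind of univariate filtering.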
MI is a measure of statistical dependence with two main properties: it can measure any kind of relationship between random variables, including non-linear relationships, and it is invariant under transformations of the feature space that are differentiable and invertible, such as rotations, translations, or any transformation that preserves the order of the original elements of the feature vectors. The pioneering work of Battiti defined the feature selection problem as the task of selecting the k most relevant variables from an original set of m variables, i.e., k < m. Battiti proposed greedily selecting a single feature at a time in order to avoid the combinatorial explosion of evaluating every subset of the original feature set. His formulation rests on four assumptions: features can only be classified as either redundant or relevant; a heuristic functional is employed to select features and controls the trade-off between redundancy and relevancy; a greedy search strategy is applied in feature selection; and the selected feature subsets are assumed optimal (Vergara & Esteves, 2014). In MI-based feature selection, many problems are discrete or quantized by nature. Vergara and Esteves (2014) take F to be the feature set and C the output vector representing the classes of the real process; F is assumed to be the realization of random sampling from an unknown distribution, fi is the ith variable of F, fij is the jth sample of vector fi, ci is the ith component of C, and cij is the jth sample of vector ci. In their convention, upper-case letters denote random sets of variables, while lower-case letters denote individual variables. MI aims to build measurable relationships between features and targets, and the approach is favored for its speed and neutrality, since its results apply across different models. MI works in the same manner as information gain (IG) in decision tree classifiers, measuring the drop in entropy conditioned on the target value (Witten, Frank & Hall, 2011). This concept may be summarised using the equation below:
MI(feature; target) = Entropy(feature) - Entropy(feature | target)    (1)
Based on the above equation, the MI score is always non-negative, and the higher the value, the closer the association between the feature and the target; such features should be included in the dataset when training models. Conversely, if the score is very small, say 0.01, or zero, the association between the feature and the target is weak and the variable should not be included in training. MI is typically computed between two variables by measuring the reduction in uncertainty in one variable given a known value of the other (Witten, Frank & Hall, 2011). The MI between two random variables X and Y may be written as:
I(X; Y) = H(X) - H(X|Y)    (2)
where:
I(X; Y) is the mutual information between X and Y,
H(X) is the entropy of X, and
H(X|Y) is the conditional entropy of X given Y.
Since MI measures the mutual dependence between two random variables, the measure is symmetric, i.e.:
I(X; Y) = I(Y; X)    (3)
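The short sketch below computes equation (2) directly from a small joint probability table with NumPy and confirms the symmetry in equation (3); the table itself is invented purely for illustration.

```python
# Sketch: compute I(X; Y) = H(X) - H(X|Y) from a joint probability table
# and verify the symmetry in equation (3). The table below is illustrative.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1)                       # marginal P(X)
    p_y = p_xy.sum(axis=0)                       # marginal P(Y)
    # H(X|Y) = sum over y of P(y) * H(X | Y = y)
    h_x_given_y = sum(
        p_y[j] * entropy(p_xy[:, j] / p_y[j])
        for j in range(len(p_y)) if p_y[j] > 0
    )
    return entropy(p_x) - h_x_given_y            # equation (2)

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])                  # joint P(X, Y), sums to 1

print(round(mutual_information(p_xy), 6))        # I(X; Y)
print(round(mutual_information(p_xy.T), 6))      # I(Y; X), equal by equation (3)
```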
Feature selection has historically been used in machine learning and data mining to improve the efficiency of models, either by selecting the smallest feature subset that achieves a certain generalization error or by finding the best subset of K features yielding the minimum generalization error (Vergara & Esteves, 2014). Vergara and Esteves (2014) define a feature as an "attribute" or "variable" representing a property of a process or system that has been measured or constructed from the primary input variables. Additional roles of feature selection include improving generalization performance relative to models built from the whole set of features and providing more robust generalization and faster responses on unseen data. Feature selection can also yield a simpler and better understanding of the process that generates the data. In general, feature selection may be viewed as a pre-processing step carried out in conjunction with models for regression or classification. The wrapper methods described above employ induction learning to evaluate feature subsets, and their performance is usually measured by classification rates on a testing set, whereas embedded algorithms exploit knowledge about the specific structure of the class of functions used by the model. The filter method is often robust against overfitting but may fail to select the best feature subset for a given regression or classification task. A number of measures are used to evaluate single or multiple features in these algorithms, including inference correlation, inconsistency, distance measures, and fractal dimension, among others.
A fitting formalism for quantifying MI is provided by Shannon's information theory: if the probabilities of the different classes are P(c), c = 1, ..., Nc, then the initial uncertainty in the output class is measured by the entropy:
H(C) = -\sum_{c=1}^{N_c} P(c) \log P(c)    (4)
The average uncertainty that remains after the feature vector f (with Nf components) is known is the conditional entropy:
H(C|F) = -\sum_{f} P(f) \sum_{c=1}^{N_c} P(c|f) \log P(c|f)    (5)
where P(c|f) denotes the conditional probability of class c given the input vector f. When the feature vector comprises continuous variables, the sums and probabilities are replaced by integrals and the corresponding probability densities. For instance, in one dimension:
H(F) = -\int p(f) \log p(f)\, df    (6)
In equation (6), the entropy of a continuous system depends on the coordinates; under a linear transformation f -> f' = αf, the entropy becomes:
H(F') = -\int p(f') \log p(f')\, df' = -\int p(f) \log \frac{p(f)}{\alpha}\, df = H(F) + \log \alpha    (7)
In general, the conditional entropy in the above derivations is always less than or equal to the initial entropy, with equality only if the features and the output class are independent, i.e., the joint probability is the product of the individual probabilities, P(c, f) = P(c)P(f). The amount by which the uncertainty is reduced is the mutual information I(C; F) between the variables c and f:
I(C; F) = H(C) - H(C|F)    (8)
The quantity in (8) is symmetric in C and F, and a simple algebraic manipulation reduces it to the expression shown in (9).
I(C; F) = \sum_{c} \sum_{f} P(c, f) \log \frac{P(c, f)}{P(c)\, P(f)}    (9)
Based on the above equations, MI is the amount by which information about the feature vector decreases the uncertainty about the class. Considering the combined events (c, f), the joint entropy H(C, F) is less than or equal to the sum of the individual uncertainties H(C) and H(F), which can be expressed as below, an identity related to equation (2) above (Battiti, 1994):
H(C, F) = H(C) + H(F) - I(C; F)    (10)
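To make the discrete case concrete, the sketch below estimates H(C), H(F), the joint entropy H(C, F), and I(C; F) from an invented joint distribution P(c, f) and checks that the identity in equation (10) agrees with equation (8); the numbers are illustrative only.

```python
# Sketch: verify H(C, F) = H(C) + H(F) - I(C; F) (equation 10) for a small
# discrete joint distribution P(c, f). The numbers are illustrative only.
import numpy as np

def H(p):                                        # Shannon entropy, equation (4)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_cf = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.30, 0.05],
                 [0.05, 0.05, 0.20]])            # joint P(c, f), sums to 1

p_c = p_cf.sum(axis=1)                           # marginal P(c)
p_f = p_cf.sum(axis=0)                           # marginal P(f)

h_c, h_f, h_cf = H(p_c), H(p_f), H(p_cf.ravel())
i_cf = h_c + h_f - h_cf                          # rearranged equation (10)

# I(C; F) computed as H(C) - H(C|F), equation (8), for comparison
h_c_given_f = sum(p_f[j] * H(p_cf[:, j] / p_f[j]) for j in range(len(p_f)))
print(round(i_cf, 6), round(h_c - h_c_given_f, 6))   # the two values agree
```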
Filter methods
Input variables can be either numerical or categorical, as can output variables. Chi-squared and MI are used when both the input and output variables are categorical, whereas ANOVA and Kendall's rank correlation are best suited to numerical input variables with a categorical output. Pearson's and Spearman's correlations are used when both the input and output variables are numerical. Fig 1 below shows the hierarchy of feature selection methods aimed at reducing the number of input variables to those believed to be significant to a model in predicting the target variable.
Fig 1 Hierarchy of feature selection methods (Brownlee, 2020)
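As a rough sketch of how these pairings translate into code, the snippet below applies SciPy's correlation and chi-squared functions to placeholder data; which statistic is appropriate depends only on whether the input and output variables are numerical or categorical.

```python
# Sketch: filter-style statistics for different input/output variable types,
# following the pairings summarized in Fig 1. The arrays are placeholder data.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau, chi2_contingency

rng = np.random.default_rng(0)

# Numerical input, numerical output -> Pearson's or Spearman's correlation.
x_num = rng.normal(size=200)
y_num = 2 * x_num + rng.normal(size=200)
r_pearson, _ = pearsonr(x_num, y_num)
r_spearman, _ = spearmanr(x_num, y_num)

# Numerical input, categorical output -> Kendall's rank correlation (or ANOVA).
y_cat = (y_num > 0).astype(int)
tau, _ = kendalltau(x_num, y_cat)

# Categorical input, categorical output -> Chi-squared (or mutual information).
x_cat = rng.integers(0, 3, size=200)
table = np.zeros((3, 2))
for xi, yi in zip(x_cat, y_cat):
    table[xi, yi] += 1                           # build the contingency table
chi2, p_value, _, _ = chi2_contingency(table)

print(f"Pearson={r_pearson:.2f}  Spearman={r_spearman:.2f}  "
      f"Kendall={tau:.2f}  Chi2={chi2:.2f} (p={p_value:.3f})")
```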
Measures for relevancy and redundancy
Feature selection reduces the number of variables for developing and training models when the raw data has numerous features, and existing feature selection methods help find only the relevant features to enhance performance. Yu and Liu (2004) have shown that relevance alone is insufficient for selecting features from large, high-dimensional datasets and instead propose performing redundancy analysis as well. Using this approach, the authors developed a correlation-based method for measuring relevance and redundancy by decoupling the relevance and redundancy analyses and measuring their effectiveness and efficiency compar...
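One family of algorithms built on exactly this relevance-redundancy trade-off is mRMR (minimum redundancy, maximum relevance), mentioned in the instructions above. The sketch below is a generic mRMR-style greedy search using scikit-learn's MI estimators; the function name, scoring details, and the assumption of a numeric feature matrix X with class labels y are illustrative simplifications, not the specific correlation-based procedure of Yu and Liu (2004).

```python
# Sketch of a generic mRMR-style greedy search: at each step, add the feature
# with the highest relevance (MI with the target) minus average redundancy
# (mean MI with the already selected features). Simplified illustration only.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected, remaining = [], list(range(n_features))

    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # average MI between candidate j and the features already chosen
                redundancy = np.mean(mutual_info_regression(
                    X[:, selected], X[:, j], random_state=random_state))
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy        # mRMR difference criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

With a numeric feature matrix X and class labels y, mrmr_select(X, y, k=10) would return the indices of ten features chosen greedily by the difference between relevance and average redundancy, which is the behaviour described informally in the introduction.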