High-Dimensional Data Feature Selection Based on Correlation Analysis and Genetic Algorithm (Translated Foreign-Literature Material)


High dimensional data feature selection based on correlation analysis and genetic algorithm

Ren Jiangtao, Huang Huanyu, Sun Hao, Yin Jian

(Department of Computer Science, Sun Yat-sen University, Guangzhou 510275, China) (issrjt@mail.sysu.edu.cn)

Abstract: Feature selection is one of the important issues in the fields of pattern recognition and data mining. For high-dimensional data objects, feature selection can, on the one hand, improve the accuracy and efficiency of classification and, on the other hand, find informative feature subsets. To address this problem, a feature selection method that integrates the filter model and the wrapper model is proposed. First, features are screened based on the correlation between each feature and the class label, and only features strongly correlated with the class label are retained. Then a genetic algorithm performs a random search over the filtered, compact feature subset, using the classification error rate of a perceptron model as the evaluation index. Experimental results show that the algorithm can effectively find feature subsets with good linear separability, thus achieving dimensionality reduction and improving classification accuracy.

Keywords: feature selection; correlation; genetic algorithm

0 Introduction

Feature selection is one of the important data processing methods in the fields of pattern recognition and data mining. As research in these fields deepens, the research objects become more and more complex, and their feature dimensions become higher and higher. The feature space of many high-dimensional data objects contains numerous redundant or even noisy features. On the one hand, these features may reduce the accuracy of classification or clustering; on the other hand, they greatly increase the time and space complexity of learning and training. Therefore, when classifying or clustering high-dimensional data, it is usually necessary to use feature selection algorithms to find feature subspaces with good separability, thereby achieving dimensionality reduction and reducing the time and space complexity of machine learning [1, 2, 8].

Depending on whether they rely on a machine learning algorithm, feature selection algorithms can be divided into two major categories: wrapper-type algorithms and filter-type algorithms. Filter-type feature selection algorithms are independent of any machine learning algorithm and are characterized by low computational cost and high efficiency, but only moderate dimensionality reduction. Wrapper-type feature selection algorithms rely on one or more machine learning algorithms; they have high computational cost and low efficiency, but good dimensionality-reduction effect [1, 2].

From the optimization point of view, the feature selection problem is actually a combinatorial optimization problem, usually solved by traversal search, random search, or heuristic search. Genetic algorithms, which belong to the random-search family, are also widely applied to combinatorial optimization problems. In recent years, as research on feature selection methods has deepened, feature selection based on genetic algorithms has also received much research attention and many applications [4]. At present, feature selection methods based on genetic algorithms usually evaluate feature subsets with a classifier, assigning each individual's evaluation index and fitness according to the classification accuracy.

However, the original feature set contains many features that are irrelevant or only weakly related to the classification. If a genetic algorithm is applied directly to the original feature set, it may converge to a local minimum with poor classification performance (i.e., a feature subset with poor classification performance), and the search efficiency will also be reduced. Therefore, this study combines the filter model and the wrapper model of feature selection and proposes a two-stage feature selection method based on correlation analysis and a genetic algorithm. First, feature relevance is evaluated and screened based on information gain; then a genetic algorithm performs a random search on the filtered, compact feature subset, using the classification error rate of a perceptron model as the evaluation index. In addition, the genetic algorithm does not adopt the traditional direct binary encoding scheme but an interval-based binary encoding scheme, which on the one hand shortens the encoding and improves time and space efficiency, and on the other hand allows flexible control over the number of selected features.

1 Feature filtering based on correlation analysis

Feature filtering based on correlation analysis is one of the effective methods for feature selection and dimensionality reduction. Its main idea is to measure, one by one, the correlation between each individual feature and the class label under a specific definition of correlation, that is, each feature's own classification ability. The features are then sorted in descending order of classification ability, and a subset of features with high classification ability is selected. In this way, features weakly related or even irrelevant to classification are eliminated to a certain extent, and dimensionality reduction is achieved. In this study, the information gain [6] widely used in some decision tree algorithms serves as the measure of correlation between a feature and the class label (i.e., the classification ability); the features with the strongest classification ability are then selected in descending order of this measure, yielding a compact feature subset. Information gain is defined as follows. Let X be a random variable; the information entropy of X is defined as


H(X) = -\sum_i P(x_i) \log_2 P(x_i)    (1)

After observing a random variable Y, the information entropy of X becomes

H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)    (2)

where P(x_i) denotes the prior probability of X and P(x_i|y_j) the posterior probability of X after Y is observed. Introducing the information of Y gives H(X|Y) ≤ H(X); that is, the uncertainty of X decreases or stays unchanged. If Y is uncorrelated with X, then H(X|Y) = H(X); if Y is correlated with X, then H(X|Y) < H(X), and the larger the difference H(X) - H(X|Y), the stronger the correlation between Y and X. Therefore, as in equation (3), the information gain IG(X|Y) is defined as the difference between H(X) and H(X|Y); it reflects the degree of correlation between Y and X, and the larger IG(X|Y) is, the stronger that correlation.

IG(X|Y) = H(X) - H(X|Y)    (3)

In addition, the information gain can be normalized using equation (4):

SU(X, Y) = 2 \cdot IG(X|Y) / (H(X) + H(Y))    (4)
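To make equations (1)-(4) concrete, here is a small illustrative sketch (the toy data and function names are mine, not from the paper) that computes entropy, conditional entropy, information gain, and the normalized measure SU:

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_i P(x_i) log2 P(x_i), equation (1)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y), equation (2): entropy of X within each group of Y, weighted by P(y_j)."""
    n = len(x)
    h = 0.0
    for y_val in set(y):
        group = [xi for xi, yi in zip(x, y) if yi == y_val]
        h += (len(group) / n) * entropy(group)
    return h

def info_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y), equation (3)."""
    return entropy(x) - conditional_entropy(x, y)

def symmetrical_uncertainty(x, y):
    """SU = 2 * IG(X|Y) / (H(X) + H(Y)), equation (4)."""
    return 2.0 * info_gain(x, y) / (entropy(x) + entropy(y))

# Toy example: the feature perfectly predicts the label, so IG and SU are maximal.
labels  = [0, 0, 1, 1]
feature = ['a', 'a', 'b', 'b']
print(info_gain(labels, feature))               # 1.0
print(symmetrical_uncertainty(labels, feature)) # 1.0
```

Because SU divides by the sum of the two entropies, it stays in [0, 1] regardless of how many values each variable takes, which is what makes features comparable in Algorithm 1.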

On the basis of the correlation measure defined above, features can be ranked and screened by the correlation between each feature f_i and the class label L, and the most strongly correlated features are selected to form the compact feature subset. The procedure is given in Algorithm 1.

Algorithm 1 FeatureSFiltering(D, F, m)

Input: dataset D with class labels, original feature set F, number of retained features m

Output: the compact feature subset

Steps:

1) Compute the information gain SU_i between each feature f_i and the class label according to equations (1)-(4);

2) Sort the features in descending order of SU_i;

3) Select the top m features with the highest SU_i values to form the compact feature set.
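Algorithm 1 can be sketched as follows (an illustrative Python sketch; the function names are assumptions, and feature values are assumed to be discretized):

```python
import math
from collections import Counter

def entropy(values):
    """Equation (1): information entropy of a discrete variable."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def su(feature, labels):
    """Symmetrical uncertainty between one feature column and the class labels, eq. (4)."""
    n = len(labels)
    h_cond = 0.0                                  # H(labels | feature), eq. (2)
    for v in set(feature):
        group = [l for f, l in zip(feature, labels) if f == v]
        h_cond += (len(group) / n) * entropy(group)
    ig = entropy(labels) - h_cond                 # eq. (3)
    return 2.0 * ig / (entropy(labels) + entropy(feature))

def feature_filtering(data, labels, m):
    """Algorithm 1: rank all features by SU with the class label, keep the top m.
    `data` is a list of samples, each a list of (discretized) feature values."""
    n_features = len(data[0])
    scores = []
    for i in range(n_features):
        column = [row[i] for row in data]
        scores.append((su(column, labels), i))
    scores.sort(reverse=True)                     # descending by SU
    return [i for _, i in scores[:m]]             # indices of the m retained features

# Feature 0 tracks the label exactly; feature 1 is noise.
data = [[0, 1], [0, 0], [1, 1], [1, 0]]
labels = [0, 0, 1, 1]
print(feature_filtering(data, labels, 1))  # [0]
```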

2 Feature selection based on a genetic algorithm

After Algorithm 1 above has compacted the original feature set by eliminating irrelevant features, a wrapper-type feature selection method based on a genetic algorithm can be applied. The algorithm is described below in terms of its encoding scheme, fitness function, and overall flow.

2.1 Encoding scheme

The key to the encoding problem is representing the solution space of all possible subsets of the given feature set. The common approach is direct binary encoding: each binary bit corresponds to one feature in the set; a 1 means the corresponding feature is in the selected subset and a 0 means it is not. When the feature dimension is relatively low, this representation yields short binary strings and high computational efficiency. But when the feature dimension is very high, it instead produces long strings and lowers efficiency. For example, the gene-expression dataset Colon Tumor has 2 000 dimensions, so direct binary encoding needs a 2 000-bit string. Moreover, direct binary representation makes it difficult to constrain the number of selected features. This study therefore adopts an interval-based binary encoding scheme: a binary number of length l represents the index of a selected feature within the original feature set. If the number of features to select is fn, the whole binary string has length fn × l, which is short when fn is much smaller than the feature dimension. For example, with 2 000 features, each feature index needs an 11-bit binary string (l = 11); if each search selects a combination of 6 features (fn = 6), the whole encoded string has length fn × l = 6 × 11 = 66, far shorter than the 2 000 bits of direct binary encoding, improving space and time efficiency. At the same time, this scheme guarantees that the number of features selected each time can be specified, giving flexible control over the subset size.
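The interval-based encoding can be sketched as follows (illustrative only; the paper does not specify how bit patterns beyond the feature count are handled, so this sketch wraps them modulo the number of features):

```python
import math
import random

def index_bits(n_features):
    """Bits needed to encode one feature index in [0, n_features)."""
    return math.ceil(math.log2(n_features))

def random_chromosome(n_features, fn, rng=random):
    """fn fields of l bits each; each field is the binary index of one selected feature."""
    l = index_bits(n_features)
    return [rng.randint(0, 1) for _ in range(fn * l)]

def decode(chromosome, n_features, fn):
    """Map each l-bit field back to a feature index (wrapped modulo n_features,
    an assumption, so that every bit pattern decodes to a valid index)."""
    l = index_bits(n_features)
    indices = []
    for k in range(fn):
        field = chromosome[k * l:(k + 1) * l]
        indices.append(int("".join(map(str, field)), 2) % n_features)
    return indices

# 2000 features -> l = 11 bits per index; fn = 6 -> 66-bit chromosome,
# far shorter than the 2000-bit direct binary encoding.
print(index_bits(2000))                        # 11
chrom = random_chromosome(2000, 6, random.Random(0))
print(len(chrom))                              # 66
print(decode(chrom, 2000, 6))                  # six indices, each in [0, 2000)
```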

2.2 Definition of fitness

In most wrapper-type feature selection methods based on genetic algorithms, some classifier model is used to evaluate the selected feature set, and the resulting classification accuracy or error rate serves as the fitness function. In this study, to search for feature subspaces with good linear separability, the perceptron model is used as the classifier and the classification error rate as the fitness. The evaluation procedure Evaluation is given in Algorithm 2.

Algorithm 2 Evaluation(D, FS)

Input: dataset D with class labels, feature subset FS

Output: the evaluation value of the feature subset

Steps:

1) According to the feature subset FS, extract from dataset D a reduced-dimension dataset DF;

2) Classify DF with the perceptron algorithm and compute the classification error rate err;

3) Output err as the evaluation index of the feature subset, i.e., the fitness; the algorithm ends.
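Algorithm 2 can be sketched with a classic perceptron (a minimal illustration; the learning rate, epoch count, and the +1/-1 label convention are assumptions not fixed by the paper):

```python
def evaluate(data, labels, feature_subset, epochs=50, lr=0.1):
    """Algorithm 2 sketch: project the data onto `feature_subset`, train a
    perceptron on the projection, and return the training error rate as the
    fitness (lower is better). Labels are assumed to be +1 / -1."""
    projected = [[row[i] for i in feature_subset] for row in data]
    w = [0.0] * len(feature_subset)
    b = 0.0
    for _ in range(epochs):                       # classic perceptron updates
        for x, y in zip(projected, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    errors = 0
    for x, y in zip(projected, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
        errors += pred != y
    return errors / len(labels)

# Linearly separable toy set on feature 0; feature 1 is irrelevant.
data = [[-2, 5], [-1, 3], [1, 4], [2, 6]]
labels = [-1, -1, 1, 1]
print(evaluate(data, labels, [0]))  # 0.0 once the perceptron converges
```

A zero error rate signals that the chosen subspace is linearly separable, which is exactly the property the GA search is rewarding.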

3 Algorithm flow

Based on the standard genetic algorithm framework, a new feature selection method based on correlation analysis and a genetic algorithm (Feature Selection based on Relevance Analysis and GA, FSRAGA) is obtained; it is described below.

Algorithm 3 FSRAGA(D, F, m, fn, MaxI)

Input: dataset D with class labels, original feature set F, number of retained features m, number of selected features fn, maximum number of iterations MaxI

Output: the optimized feature subset

Steps:

1) Call FeatureSFiltering(D, F, m) to filter the features, forming the compact feature subset Fc;

2) According to the encoding scheme given above and the number of selected features fn, randomly generate a group of initial individuals to form the initial population;

3) According to the encoding scheme, convert each individual's binary representation into feature indices within the compact feature subset Fc, and select those features to form the feature subset FS;

4) According to the fitness evaluation method, call Evaluation(D, FS) to compute each individual's fitness;

5) If the maximum number of iterations MaxI has been reached, output the current best feature subset; otherwise continue with the following steps;

6) Perform the selection operation according to fitness;

7) Perform the crossover operation;

8) Perform the mutation operation;

9) Return to step 3).
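The loop in Algorithm 3 can be sketched as follows (a minimal illustration: the population size, tournament selection, one-point crossover, and mutation rate are my assumptions, since the paper invokes the standard GA framework without listing these parameters; the filtering and perceptron steps are abstracted behind the `fitness` callable):

```python
import math
import random

def fsraga(n_features, fn, max_iter, fitness, rng=None):
    """Sketch of Algorithm 3 after the filtering step: interval-encoded
    chromosomes, tournament selection, one-point crossover, bit-flip mutation.
    `fitness(indices)` should return the classification error rate of the
    feature subset (lower is better)."""
    rng = rng or random.Random()
    l = math.ceil(math.log2(n_features))          # bits per feature index
    size = fn * l                                 # chromosome length

    def decode(chrom):
        return [int("".join(map(str, chrom[k*l:(k+1)*l])), 2) % n_features
                for k in range(fn)]

    pop = [[rng.randint(0, 1) for _ in range(size)] for _ in range(20)]
    best, best_err = None, float("inf")
    for _ in range(max_iter):
        errs = [fitness(decode(c)) for c in pop]  # step 4: evaluate each individual
        for c, e in zip(pop, errs):
            if e < best_err:
                best, best_err = decode(c), e
        nxt = []
        while len(nxt) < len(pop):
            # step 6: tournament selection of two parents
            a, b = (min(rng.sample(range(len(pop)), 2), key=lambda i: errs[i])
                    for _ in range(2))
            cut = rng.randrange(1, size)          # step 7: one-point crossover
            child = pop[a][:cut] + pop[b][cut:]
            for i in range(size):                 # step 8: bit-flip mutation
                if rng.random() < 0.01:
                    child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return best, best_err

# Toy fitness: error is the fraction of selected indices outside the "good" set {0, 1, 2}.
err = lambda idx: sum(i not in {0, 1, 2} for i in idx) / len(idx)
subset, e = fsraga(n_features=32, fn=3, max_iter=30, fitness=err,
                   rng=random.Random(1))
print(e <= err([10, 20, 30]))  # the best found is no worse than a bad subset: True
```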

4 Experimental study

To evaluate the effectiveness of the above algorithm, the gene-expression dataset Prostate Cancer was used for testing. The Prostate Cancer dataset contains 102 samples, each with expression data for 12 600 genes; 52 samples were diagnosed with prostate cancer, and the other 50 are normal-tissue samples without prostate cancer.

Figure 1  Iterative results of the FSRAGA algorithm on the Prostate Cancer dataset

Figure 2  Scatter plots of the Prostate Cancer dataset in the feature subspaces selected by FSRAGA

In the experiments, the feature set was first filtered: the 512 features most strongly correlated with the class were selected from the original 12 600, and the genetic algorithm was then applied to these 512 features. Figure 1(a) shows the iterative results of the genetic algorithm when FSRAGA performs feature selection on the Prostate Cancer dataset; the horizontal axis is the number of GA iterations and the vertical axis is the best result of each generation (i.e., the lowest perceptron classification error rate). To verify the effect of correlation-based feature filtering, Figure 1(b) shows the result of applying the genetic algorithm directly to the unfiltered original 12 600-dimensional feature set; both experiments select 2 features. Figure 1(a) shows that during the GA iterations the perceptron error rate on the 2-dimensional Prostate Cancer data keeps decreasing, converging to a low error rate of 3.92% by the 10th iteration. Figure 1(b) shows that when feature selection is performed on the unfiltered original feature set, the efficiency and effectiveness of the genetic algorithm deteriorate: it converges only after nearly 20 iterations, and the error rate at convergence is as high as 11.77%. Figure 2 shows scatter plots of the Prostate Cancer dataset in the optimized 2-dimensional (Figure 2(a)) and 3-dimensional (Figure 2(b)) feature subspaces selected by FSRAGA, with the two classes of samples marked by "$" and "'" respectively. The figures show that the sample set has good linear separability in the corresponding feature subspaces, and the linear separability in the 3-dimensional subspace is better than that in the 2-dimensional one.
