我是一条鱼 · April 15, 2021

Could you briefly explain in what situations the other two options would be the right choice? Thanks.

NO.PZ2015120204000049

Question:

Steele then performs exploratory data analysis. To assist in feature selection, she wants to create a visualization that shows the most informative words in the dataset based on their term frequency (TF) values.

After creating and analyzing the visualization, Steele is concerned that some tokens are likely to be noise features for ML model training; therefore, she wants to remove them.

To address her concern in her exploratory data analysis, Steele should focus on those tokens that have:

Options:

A. low chi-square statistics.

B. low mutual information (MI) values.

C. very low and very high term frequency (TF) values.

Explanation:

C is correct. Frequency measures can be used for vocabulary pruning to remove noise features by filtering the tokens with very high and low TF values across all the texts. Noise features are both the most frequent and most sparse (or rare) tokens in the dataset. On one end, noise features can be stop words that are typically present frequently in all the texts across the dataset. On the other end, noise features can be sparse terms that are present in only a few text files. Text classification involves dividing text documents into assigned classes. The frequent tokens strain the ML model to choose a decision boundary among the texts as the terms are present across all the texts (an example of underfitting). The rare tokens mislead the ML model into classifying texts containing the rare terms into a specific class (an example of overfitting). Thus, identifying and removing noise features are critical steps for text classification applications.
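The pruning idea in the explanation can be sketched in a few lines of Python. This is only an illustration: the function name `prune_vocabulary`, the toy documents, and the `low`/`high` cutoffs are assumptions made for the example, not values from the question. TF is computed here as a token's share of all token occurrences in the dataset.

```python
from collections import Counter

def prune_vocabulary(docs, low, high):
    """Return (tf, kept): TF per token and the tokens with low <= TF <= high.

    TF here is the token's count divided by the total number of tokens
    across all documents. The cutoffs are illustrative assumptions.
    """
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    tf = {tok: n / total for tok, n in counts.items()}
    kept = {tok for tok, value in tf.items() if low <= value <= high}
    return tf, kept

# Toy dataset (made up): "the" is a stop word, "zyxwv" is a rare token.
docs = [
    "the market rallied as the earnings beat the forecast",
    "the market fell as the earnings missed the forecast",
    "the zyxwv filing delayed the earnings call",
]
tf, kept = prune_vocabulary(docs, low=0.05, high=0.20)

for token in sorted(tf, key=tf.get, reverse=True):
    status = "keep" if token in kept else "drop"
    print(f"{token:10s} TF={tf[token]:.2f} {status}")
```

With these cutoffs, the very frequent token "the" and the single-occurrence tokens fall outside the band and are dropped, while mid-frequency tokens such as "earnings" survive. In practice the cutoffs would be chosen by inspecting the TF distribution of the actual dataset.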

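1 answer

星星_品职助教 · April 15, 2021

Hello,

At present there are almost no questions that specifically test options A and B.

For option A, it is enough to know:

A high chi-square statistic means the word appears more frequently in that class (the word and the class are not independent of each other), so the word is a feature that should be selected.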
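As a rough illustration of the chi-square point, the statistic for one word and one class can be computed from a 2x2 table of document counts. The function name `chi_square` and all counts below are hypothetical, chosen only to contrast a class-associated word with an independent one.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/class contingency table.

    n11: documents in the class that contain the word
    n10: documents outside the class that contain the word
    n01: documents in the class that do not contain the word
    n00: documents outside the class that do not contain the word
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: the word appears mostly inside the class.
informative = chi_square(n11=40, n10=5, n01=10, n00=45)

# Hypothetical counts: the word is spread evenly, independent of the class.
independent = chi_square(n11=25, n10=25, n01=25, n00=25)

print(f"class-associated word: {informative:.2f}")
print(f"independent word:      {independent:.2f}")
```

The first word gets a large statistic and would be a candidate feature; the second scores zero because its presence says nothing about the class.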

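For option B, it is enough to know:

The higher the MI, the more the word contributes to that class. An MI of 0 means the word appears with the same frequency across all texts, i.e., it makes no particular contribution to any class.

Nothing else needs to be learned.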
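The MI statement can be checked numerically with a small sketch. It assumes MI is measured between "word present / absent" and "document in class / not in class" using base-2 logarithms; the function name `mutual_information` and the counts are hypothetical.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between word presence and class membership.

    Arguments are document counts: first index is word present (1) or
    absent (0), second index is in the class (1) or outside it (0).
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # word present, in class
        (n10, n11 + n10, n10 + n00),  # word present, outside class
        (n01, n01 + n00, n11 + n01),  # word absent, in class
        (n00, n01 + n00, n10 + n00),  # word absent, outside class
    ]
    mi = 0.0
    for joint, word_total, class_total in cells:
        if joint:  # empty cells contribute nothing
            mi += (joint / n) * math.log2(n * joint / (word_total * class_total))
    return mi

# Word equally frequent inside and outside the class: MI is 0.
evenly_spread = mutual_information(25, 25, 25, 25)

# Word appears only in the class and in every document of it: MI is 1 bit.
class_specific = mutual_information(50, 0, 0, 50)

print(evenly_spread, class_specific)
```

A word spread evenly across classes scores 0, while a word confined to a single class reaches the maximum for a balanced two-class split.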

Related questions

NO.PZ2015120204000049 Don't A and B also indicate that the token carries no distinguishing information? Why aren't they chosen?

2021-06-01 11:44 · 1 answer

I still don't understand. Aren't chi-square and MI also part of the data exploration step? Why can't they be chosen?

2020-10-26 21:23 · 1 answer

Aren't very low TF values part of feature selection, just like the other two? Why is this the one to choose?

2020-10-04 12:03 · 1 answer

What characterizes the chi-square and MI values of noise words? In other words, can these two values be used to judge whether a word is noise? Thanks.

2019-12-21 13:16 · 1 answer