
锦鲤本鲤 · January 21, 2022

I don't quite understand this question. Could you explain it in Chinese?


NO.PZ202108310100000205

Question:

Based on Exhibit 2, Achler should exclude from further analysis words in:

Options:

A. only Group 1

B. only Group 2

C. both Group 1 and Group 2

Explanation:

C is correct.

Achler should remove the words in Group 1 as well as the words in Group 2. Term frequency values range between 0 and 1. Group 1 consists of the highest frequency values (e.g., “the” = 0.04935), and Group 2 consists of the lowest frequency values (e.g., “naval” = 1.0123e-05).
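To make the 0-to-1 range concrete, here is a minimal sketch of how a term frequency could be computed, assuming it is defined as a token's count divided by the total number of tokens in the corpus. The function name `term_frequencies` and the toy sentence are invented for illustration and are not taken from Exhibit 2.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return each token's share of all tokens (count / total tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


# Toy corpus, invented for illustration only (not the data behind Exhibit 2).
tokens = "the market rose and the bond market fell".split()
tf = term_frequencies(tokens)

# Print tokens from most to least frequent.
for token, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{token:>8}  {value:.3f}")
```

Under this definition every value is a share of the whole corpus, so each one lies between 0 and 1 and together they sum to 1.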

Frequency analysis on the processed text data helps in filtering unnecessary tokens (or features) by quantifying how important tokens are in a sentence and in the corpus as a whole.

The most frequent tokens (Group 1) make it hard for the machine-learning model to find a decision boundary among the texts, because these terms are present across all the texts; this leads to model underfitting.

The least frequent tokens (Group 2) mislead the machine-learning model into classifying texts containing the rare terms into a specific class, which leads to model overfitting. Identifying and removing noise features is critical for text classification applications.
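The filtering step described above can be sketched as dropping every token whose term frequency falls outside a chosen band. The helper `filter_by_term_frequency`, the cutoffs `low` and `high`, and the toy token list are all hypothetical; the question does not state which thresholds Achler would use.

```python
from collections import Counter


def filter_by_term_frequency(tokens, low, high):
    """Keep only tokens whose term frequency lies in [low, high].

    `low` and `high` are illustrative cutoffs chosen by the analyst.
    """
    counts = Counter(tokens)
    total = len(tokens)
    keep = {tok for tok, cnt in counts.items() if low <= cnt / total <= high}
    return [tok for tok in tokens if tok in keep]


# Toy corpus, invented for illustration: one very common word,
# two moderately common words, and one rare word.
tokens = ["the"] * 50 + ["profit"] * 5 + ["loss"] * 4 + ["naval"]

kept = filter_by_term_frequency(tokens, low=0.05, high=0.5)
print(sorted(set(kept)))
```

Here "the" (Group 1 style) exceeds the upper cutoff and "naval" (Group 2 style) falls below the lower one, so only the mid-frequency tokens remain as features.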

A is incorrect because words in both Group 1 and Group 2 should be removed.

The words with high term frequency values are mostly stop words, which are present in most sentences. Stop words carry no semantic meaning for the purposes of text analysis and ML training, so they do not contribute to differentiating sentiment.

B is incorrect because words in both Group 1 and Group 2 should be removed.

Terms with low term frequency values are mostly rare terms that appear only once or twice in the data. They do not contribute to differentiating sentiment.

As the title says, I didn't understand the answer. Why should both groups be removed?
1 answer

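星星_品职助教 · January 21, 2022

Hello,

Words with very high frequency appear everywhere in the text, so they have no discriminating power. The examples in Group 1 show this as well: words such as “the”, “to”, and “and” do not help in analyzing the meaning of a sentence.

Words with very low frequency appear only occasionally in the text. The words in Group 2 all have frequencies on the order of 10^-5. Because they occur so rarely, they do not help in analyzing the meaning of a sentence either.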
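To get a feel for the scale of these two frequencies, the rough calculation below may help. It assumes, purely for illustration, that “naval” appears exactly once in the corpus and that term frequency is count divided by total tokens. The actual corpus size is not given in the question, so the numbers it prints are only an order-of-magnitude guide.

```python
# Frequencies quoted in the explanation.
tf_naval = 1.0123e-05
tf_the = 0.04935

# Assumption for illustration: "naval" occurs exactly once in the corpus.
assumed_naval_count = 1

# Under that assumption, the corpus size implied by tf = count / total.
implied_total_tokens = assumed_naval_count / tf_naval

# Occurrences of "the" implied by the same corpus size.
implied_the_count = tf_the * implied_total_tokens

print(f"implied corpus size: about {implied_total_tokens:,.0f} tokens")
print(f"implied count of 'the': about {implied_the_count:,.0f}")
```

Under this assumption the corpus holds roughly 99,000 tokens, with “the” occurring thousands of times against a single “naval”, which is why neither extreme helps separate one text from another.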
