zhoou · April 7, 2022

NO.PZ2021083101000014

Question:

As an additional part of the text exploration step, Achler conducts a term frequency analysis to identify outliers. Achler summarizes the analysis in Exhibit 2.

Based on Exhibit 2, Achler should exclude from further analysis words in:

Options:

A. only Group 1

B. only Group 2

C. both Group 1 and Group 2

Explanation:

C is correct.

Achler should remove the words in Group 1 as well as the words in Group 2. Term frequency values range between 0 and 1. Group 1 consists of the highest frequency values (e.g., “the” = 0.04935), and Group 2 consists of the lowest frequency values (e.g., “naval” = 1.0123e-05).
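To make the 0-to-1 range concrete, here is a minimal Python sketch. It assumes term frequency is a token's count divided by the total number of tokens in the corpus. The function name term_frequencies and the toy texts are invented for illustration and are not the data behind Exhibit 2.

```python
from collections import Counter


def term_frequencies(texts):
    """Corpus-level term frequency: token count / total number of tokens."""
    # Naive whitespace tokenization, good enough for a toy example.
    tokens = [tok for text in texts for tok in text.lower().split()]
    total = len(tokens)
    counts = Counter(tokens)
    return {tok: n / total for tok, n in counts.items()}


# Toy corpus, made up for this sketch (an assumption, not exam data).
texts = [
    "the market fell and the bond rallied",
    "the naval contract lifted the stock",
    "the bond market and the stock market rose",
]

tf = term_frequencies(texts)

# Print tokens from most to least frequent.
for tok, freq in sorted(tf.items(), key=lambda item: item[1], reverse=True):
    print(f"{tok:10s} {freq:.4f}")
```

Because the values of all distinct tokens sum to 1, a single token rarely gets anywhere near 1. In a corpus with a large vocabulary, a value such as 0.04935 can therefore still be among the highest.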

Frequency analysis on the processed text data helps in filtering unnecessary tokens (or features) by quantifying how important tokens are in a sentence and in the corpus as a whole.

The most frequent tokens (Group 1) make it hard for the machine-learning model to find a decision boundary among the texts, because these terms are present across all the texts. This leads to model underfitting.

The least frequent tokens (Group 2) mislead the machine-learning model into classifying texts containing the rare terms into a specific class, which leads to model overfitting. Identifying and removing noise features is critical for text classification applications.
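Removing both tails could be sketched in Python as below. The cutoffs low and high are hypothetical values picked for this example, since the explanation gives no thresholds; in practice they are a judgment call. Only the values for "the" and "naval" come from the explanation, and the other entries are made up.

```python
def filter_by_term_frequency(tf, low=0.001, high=0.03):
    """Split tokens into kept (low < tf < high) and removed (everything else).

    low and high are hypothetical cutoffs, not values from the source.
    """
    kept = {tok: f for tok, f in tf.items() if low < f < high}
    removed = {tok: f for tok, f in tf.items() if tok not in kept}
    return kept, removed


tf = {
    "the": 0.04935,       # Group 1 example from the explanation
    "naval": 1.0123e-05,  # Group 2 example from the explanation
    "profit": 0.0042,     # hypothetical mid-frequency token
    "loss": 0.0037,       # hypothetical mid-frequency token
}

kept, removed = filter_by_term_frequency(tf)
print("kept:", sorted(kept))
print("removed:", sorted(removed))
```

With these assumed cutoffs, "the" falls out at the high end and "naval" at the low end, while the two mid-frequency tokens remain as features.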

A is incorrect because words in both Group 1 and Group 2 should be removed.

The words with high term frequency value are mostly stop words, present in most sentences. Stop words do not carry a semantic meaning for the purpose of text analyses and ML training, so they do not contribute to differentiating sentiment.

B is incorrect because words in both Group 1 and Group 2 should be removed.

Terms with low term frequency value are mostly rare terms, ones appearing only once or twice in the data. They do not contribute to differentiating sentiment.

Topic: Unstructured Data Exploration

Is Group 1 removed only because these words are all stop words? If they were other words, how would we decide whether they need to be removed?

1 answer

星星_品职助教 · April 7, 2022

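Hello,

The point being tested here is the removal of stop words.

For other, ordinary words, follow the Group 2 logic: if a word appears extremely rarely, it should be removed.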

Related questions

NO.PZ2021083101000014 What frequency counts as intermediate?

2024-08-21 23:36 · 1 answer

NO.PZ2021083101000014 Does this question mean that we need to remove all the stop words in Group 1 and also "naval", the lowest-frequency word in Group 2?

2023-03-15 17:47 · 1 answer

NO.PZ2021083101000014 Doesn't frequency range from 0 to 1?! The Group 1 words have frequencies of around 0.0x, which isn't high, is it?

2022-07-26 10:42 · 1 answer

NO.PZ2021083101000014 Is Group 2 removed because its frequency is too low?

2022-02-04 17:23 · 1 answer