
沐沐的方盒 · August 4, 2023

Step 1 only says to cleanse the data; it doesn't say to remove all of the data. So why is the statement wrong?

NO.PZ2015120204000045

The question is as follows:

Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:

Step 1 Cleanse the raw text data.

Step 2 Split the cleansed data into a collection of words for them to be normalized.

Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.

With respect to Step 1, Steele tells Schultz: “I believe I should remove all html tags, punctuations, numbers, and extra white spaces from the data before normalizing them.”

Is Steele’s statement regarding Step 1 of the preprocessing of raw text data correct?

Options:

A. Yes

B. No, because her suggested treatment of punctuation is incorrect.

C. No, because her suggested treatment of extra white spaces is incorrect.

Explanation:

B is correct. Although most punctuations are not necessary for text analysis and should be removed, some punctuations (e.g., percentage signs, currency symbols, and question marks) may be useful for ML model training. Such punctuations should be substituted with annotations (e.g., /percentSign/, /dollarSign/, and /questionMark/) to preserve their grammatical meaning in the text. Such annotations preserve the semantic meaning of important characters in the text for further text processing and analysis stages.
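For anyone who wants to see what this looks like in practice, here is a minimal Python sketch of a Step 1 cleanse that substitutes annotations instead of deleting everything. The function name `cleanse`, the `ANNOTATIONS` mapping, and the regular expressions are my own assumptions for illustration; only the annotation names (/percentSign/, /dollarSign/, /questionMark/) come from the explanation above. It is not the curriculum's reference implementation.

```python
import re

# Hypothetical mapping: which symbols count as "meaningful" is an assumption
# based on the three examples named in the explanation.
ANNOTATIONS = {
    "%": " /percentSign/ ",
    "$": " /dollarSign/ ",
    "?": " /questionMark/ ",
}


def cleanse(raw: str) -> str:
    """Cleanse raw text, keeping meaningful punctuation as annotations."""
    # 1. Remove html tags (a simple pattern; not a full HTML parser).
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2. Remove punctuation EXCEPT the symbols we intend to annotate.
    text = re.sub(r"[^\w\s%$?]|_", " ", text)
    # 3. Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # 4. Substitute the remaining meaningful symbols with annotations.
    for symbol, annotation in ANNOTATIONS.items():
        text = text.replace(symbol, annotation)
    # 5. Collapse extra white spaces.
    return " ".join(text.split())


print(cleanse("<p>Revenue rose 5.2%, to $10 million. Why?</p>"))
# Revenue rose /percentSign/ to /dollarSign/ million Why /questionMark/
```

The ordinary punctuation is stripped before the annotations are inserted so that the slashes inside the annotations are not removed along with it.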

As stated in the title.

1 answer
Accepted answer

星星_品职助教 · August 5, 2023

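Hello,

The question stem says the proposed treatment is to remove all html tags, punctuations, and so on.

Of these, the treatment of punctuations that option B points to is incorrect: some punctuation may carry real meaning, so you cannot strip all of it out wholesale without first analyzing it.

Meaningful punctuations need to be replaced with a substitute rather than deleted.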
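To make the point concrete, here is a small Python sketch showing what blanket removal loses. Both helper functions and the sample sentences are hypothetical, written only for illustration; the /questionMark/ annotation is the one named in the official explanation.

```python
import re


def strip_all_punctuation(text: str) -> str:
    """Blanket removal: drop every character that is not a letter, digit, or whitespace."""
    return " ".join(re.sub(r"[^\w\s]|_", " ", text).split())


def annotate_question_marks(text: str) -> str:
    """Drop ordinary punctuation but replace '?' with an annotation."""
    kept = re.sub(r"[^\w\s?]|_", " ", text)
    return " ".join(kept.replace("?", " /questionMark/ ").split())


statement = "Rates will rise."
question = "Rates will rise?"

# Blanket removal makes the two sentences indistinguishable.
print(strip_all_punctuation(statement))  # Rates will rise
print(strip_all_punctuation(question))   # Rates will rise

# With an annotation, the question is still recognizable as a question.
print(annotate_question_marks(statement))  # Rates will rise
print(annotate_question_marks(question))   # Rates will rise /questionMark/
```

After blanket removal a model can no longer tell the statement from the question, which is why Steele's "remove all punctuations" is marked wrong.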