锦鲤本鲤 · January 20, 2022

Is B the answer because the question is about unstructured data?

* For the full question context, see the case vignette.

NO.PZ202108310100000202

Question:

Based on the source of the data, as part of the data cleansing and wrangling process, Achler most likely needs to remove:

Options:

A. html tags and perform scaling

B. numbers and perform lemmatization

C. white spaces and perform winsorization

Explanation:

B is correct.

Achler uses a web spidering program that extracts unstructured raw content from social media webpages. Raw text data are a sequence of characters and contain other non-useful elements including html tags, punctuation, and white spaces (including tabs, line breaks, and new lines).

Removing numbers is one of the basic operations in the text cleansing/preparation process for unstructured data. When numbers (or digits) are present in the text, they should be removed or substituted with the annotation “/number/.”
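
The cleansing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the vignette's actual program: the function name `cleanse_text` and the regular expressions are hypothetical choices, and punctuation removal is left out to keep the example short.

```python
import re

def cleanse_text(raw: str) -> str:
    """Illustrative text cleansing: html tags, numbers, white space."""
    # 1. Remove html tags (anything between '<' and '>').
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2. Substitute numbers (including forms like 12.5 or 1,000)
    #    with the annotation /number/.
    text = re.sub(r"\d+(?:[.,]\d+)*", "/number/", text)
    # 3. Collapse tabs, line breaks, and repeated spaces into one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text

sample = "<p>Sales rose  12.5%\n\tin <b>2021</b></p>"
print(cleanse_text(sample))
```

All three steps operate on characters in a string, which is why they belong to text cleansing rather than to the numeric operations named in options A and C.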

Lemmatization, which takes place during the text wrangling/preprocessing process for unstructured data, is the process of converting inflected forms of a word into its morphological root (known as the lemma).

Lemmatization reduces the repetition of words occurring in various forms while maintaining the semantic structure of the text data, thereby aiding in training less complex ML models.
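
A toy sketch may help show why lemmatization shrinks the vocabulary. The lookup table `LEMMAS` below is a hypothetical, hand-built dictionary used purely for illustration; real lemmatizers rely on much larger dictionaries and part-of-speech information.

```python
import re
from typing import List

# Hypothetical lookup table: inflected form -> lemma.
LEMMAS = {
    "analysts": "analyst",
    "analyzed": "analyze",
    "analyzing": "analyze",
    "stocks": "stock",
    "returns": "return",
}

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(text: str) -> List[str]:
    """Map each token to its lemma; unknown tokens are kept unchanged."""
    return [LEMMAS.get(token, token) for token in tokenize(text)]

sentence = "Analysts analyzed the stocks, analyzing returns"
print(len(set(tokenize(sentence))))   # distinct tokens before lemmatization
print(len(set(lemmatize(sentence))))  # distinct tokens after lemmatization
```

In this sentence "analyzed" and "analyzing" collapse to the single lemma "analyze", so the model sees fewer distinct words while the meaning is preserved.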

A is incorrect because although html tag removal is part of text cleansing/preparation for unstructured data, scaling is a data wrangling/preprocessing process applied to structured data.

Scaling adjusts the range of a feature by shifting and changing the scale of data; it is performed on numeric variables, not on text data.
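
To make the contrast with text data concrete, here is a minimal sketch of two common scaling methods, min-max normalization and standardization. The function names are hypothetical, and the code assumes the feature has at least two distinct numeric values.

```python
from statistics import mean, stdev
from typing import List

def normalize(values: List[float]) -> List[float]:
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values: List[float]) -> List[float]:
    """Standardization: center on the mean, divide by the sample std dev."""
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]

feature = [10, 20, 30, 50]
print(normalize(feature))
print(standardize(feature))
```

Both functions need arithmetic on the inputs (minimum, maximum, mean, standard deviation), which is why scaling applies to numeric variables and has no meaning for raw text.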

C is incorrect because although raw text contains white spaces (including tabs, line breaks, and new lines) that need to be removed as part of the data cleansing/preparation process for unstructured data, winsorization is a data wrangling/preprocessing task performed on values of data points, not on text data.

Winsorization is used for structured numerical data and replaces extreme values and outliers with the maximum (for large-value outliers) and minimum (for small-value outliers) values of data points that are not outliers.
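
The replacement rule can be sketched as follows. This is an illustrative version that assumes the `k` smallest and `k` largest observations are the ones treated as outliers; the function name and the parameter `k` are hypothetical, and in practice the cutoffs are often chosen by percentile.

```python
from typing import List

def winsorize(values: List[float], k: int = 1) -> List[float]:
    """Replace the k lowest and k highest values with the nearest kept value."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one non-outlier value")
    ordered = sorted(values)
    lower = ordered[k]        # minimum of the values that are not outliers
    upper = ordered[-k - 1]   # maximum of the values that are not outliers
    # Clamp every observation into [lower, upper], keeping the original order.
    return [min(max(x, lower), upper) for x in values]

data = [1, 50, 52, 55, 58, 300]
print(winsorize(data, k=1))
```

Here the low outlier 1 is raised to 50 and the high outlier 300 is lowered to 58. The operation compares and replaces numeric values, so it cannot be applied to text.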

How can I tell that the question is about unstructured data?
1 answer

星星_品职助教 · January 20, 2022

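Hello,

Several places in the vignette indicate that the data here are unstructured (text). Looking at the second half of each of the three options, only lemmatization is a text wrangling process applied to unstructured data, so B is the only possible answer.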
