NO.PZ2021083101000012
Question:
Achler and Rivera discuss remaining text wrangling tasks—specifically, which tokens to include in the document term matrix (DTM). Achler divides unique tokens into three groups; a sample of each group is shown in Exhibit 1.
Based on Exhibit 1, which token group has most likely undergone the text preparation and wrangling process?
Options:
A. Token Group 1
B. Token Group 2
C. Token Group 3
Explanation:
A is correct.
Data preparation and wrangling involve cleansing and organizing raw data into a consolidated format.
Token Group 1 includes n-grams (“not_increas_market,” “sale_decreas”) and words that have been converted from their inflected forms into their base form (“increas,” “decreas”). The currency symbol has also been replaced with a “currencysign” token.
N-gram tokens are helpful for keeping negations intact in the text, which is vital for sentiment prediction. The process of converting inflected forms of a word into its base word is called stemming and helps decrease data sparseness, thereby aiding in training less complex ML models.
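As a rough illustration of the steps described above (lowercasing, currency replacement, stemming, and n-gram construction), here is a minimal Python sketch. The helper names `toy_stem`, `preprocess`, and `ngrams`, the currency pattern, and the sample sentences are all assumptions made for illustration; `toy_stem` is a deliberately crude suffix stripper, not the Porter algorithm or whatever stemmer produced Exhibit 1.

```python
import re

# Assumed pattern: a few currency symbols and lowercase ISO codes (illustrative only).
CURRENCY_PATTERN = re.compile(r"[$€£¥]|\b(?:usd|eur|gbp|jpy)\b")


def toy_stem(token):
    """Hypothetical, crude stemmer: strip one common suffix if enough letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, replace currency markers, tokenize on letters, then stem."""
    text = text.lower()
    text = CURRENCY_PATTERN.sub(" currencysign ", text)
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are dropped
    return [toy_stem(t) for t in tokens]


def ngrams(tokens, n):
    """Join each run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigram_tokens = ngrams(preprocess("Sales decreased"), 2)
trigram_tokens = ngrams(preprocess("not increasing market"), 3)
currency_tokens = preprocess("EUR 5")
```

With these made-up inputs, the bigram keeps "sale" and "decreas" together, and the trigram keeps the negation "not" attached to the words it modifies, which is the property the explanation highlights for sentiment prediction.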
B is incorrect because Token Group 2 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems).
Stemming (along with lemmatization) decreases data sparseness by aggregating many sparsely occurring words into relatively less sparse stems or lemmas, thereby aiding in training less complex ML models.
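The sparseness point can be made concrete with a small count. The sketch below is only an illustration under assumptions: the token list is invented, and `toy_stem` is a hypothetical, crude suffix stripper rather than a real stemming algorithm.

```python
from collections import Counter


def toy_stem(token):
    """Hypothetical, crude stemmer: strip one common suffix if enough letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Invented sample tokens: each inflected form appears only once.
raw_tokens = ["increased", "increasing", "decreased", "decreasing", "sales", "sale"]

raw_counts = Counter(raw_tokens)                        # one DTM column per surface form
stem_counts = Counter(toy_stem(t) for t in raw_tokens)  # one DTM column per stem

vocab_before = len(raw_counts)
vocab_after = len(stem_counts)
```

In this toy case, six distinct surface forms collapse into three stems, so the DTM would have fewer columns and each column would carry more occurrences.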
C is incorrect because Token Group 3 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems). In addition, the “EUR” currency symbol has not been replaced with the “currencysign” token and the word “Sales” has not been lowercased.
Topic tested: Unstructured Data Wrangling (Preprocessing)
How can options A and B be distinguished?