NO.PZ2015120204000046
Question:
Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:
Step 1 Cleanse the raw text data.
Step 2 Split the cleansed data into a collection of words so that they can be normalized.
Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.
Steele’s Step 2 is best described as:
Options:
A. tokenization.
B. lemmatization.
C. standardization.
Explanation:
A is correct. Tokenization is the process of splitting a given text into separate tokens, and it takes place after cleansing the raw text data (removing HTML tags, numbers, extra white space, and so on). The tokens are then normalized to create the bag-of-words (BOW). B is incorrect: lemmatization is part of the normalization in Step 3, where each inflected word form is converted to its dictionary form (lemma). C is incorrect: standardization refers to scaling structured numerical data (for example, subtracting the mean and dividing by the standard deviation) and is not a step in text preprocessing.
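For illustration, a minimal Python sketch of Steps 1 and 2 (cleansing, then tokenization); the raw string and the regular expressions are hypothetical stand-ins, not part of the original question:

```python
import re

# Hypothetical raw document: HTML tags, numbers, punctuation, and extra
# white space are the kinds of artifacts Step 1 removes.
raw = "<p>The analysts   analyzed 3 filings, and the filings were analyzed quickly!</p>"

# Step 1 - cleanse the raw text data.
cleansed = re.sub(r"<[^>]+>", " ", raw)           # remove HTML tags
cleansed = re.sub(r"[^A-Za-z\s]", " ", cleansed)  # remove numbers and punctuation
cleansed = re.sub(r"\s+", " ", cleansed).strip()  # collapse extra white space

# Step 2 - tokenize: split the cleansed text into a collection of words.
tokens = cleansed.split(" ")
print(tokens)
# ['The', 'analysts', 'analyzed', 'filings', 'and', 'the', 'filings',
#  'were', 'analyzed', 'quickly']
```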
Follow-up question: Could you explain what each of these three concepts means, and in which step each one belongs?
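Continuing the sketch, Step 3 normalization is where lemmatization belongs: lowercasing, removing stop words, and mapping each word to its lemma before building the BOW and the distinct token set. The stop-word list and lemma dictionary below are hypothetical stand-ins for what an NLP library would supply:

```python
from collections import Counter

# Tokens produced by Step 2 above (same hypothetical sentence).
tokens = ['The', 'analysts', 'analyzed', 'filings', 'and',
          'the', 'filings', 'were', 'analyzed', 'quickly']

# Illustrative stop-word list and lemma dictionary; in practice these
# would come from an NLP library such as NLTK or spaCy.
stop_words = {"the", "and", "were"}
lemmas = {"analysts": "analyst", "analyzed": "analyze", "filings": "filing"}

# Step 3 - normalize: lowercase, drop stop words, lemmatize each word.
normalized = [lemmas.get(t, t)
              for t in (w.lower() for w in tokens)
              if t not in stop_words]

bow = Counter(normalized)     # bag-of-words with frequency counts
vocabulary = set(normalized)  # distinct set of tokens
print(bow)         # Counter({'analyze': 2, 'filing': 2, 'analyst': 1, 'quickly': 1})
print(vocabulary)  # {'analyze', 'filing', 'analyst', 'quickly'} (order may vary)
```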