What Is Clean Data

Each variable is a column
Each observation is a row
Each type of observational unit is a table

The tidy text format is a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

    text <- c(
      "Because I could not stop for Death -",
      "He kindly stopped for me -",
      "The Carriage held but just Ourselves -",
      "and Immortality"
    )
    text
    ## [1] "Because I could not stop for Death -"
    ## [2] "He kindly stopped for me -"
    ## [3] "The Carriage held but just Ourselves -"
    ## [4] "and Immortality"

    library(dplyr)
    # tibble() replaces the deprecated data_frame()
    text_df <- tibble(line = 1:4, text = text)
    text_df
    ## # A tibble: 4 x 2
    ##    line text
    ##   <int> <chr>
    ## 1     1 Because I could not stop for Death -
    ## 2     2 He kindly stopped for me -
    ## 3     3 The Carriage held but just Ourselves -
    ## 4     4 and Immortality

    library(tidytext)
    # Split the text column into one lowercase word per row
    text_df %>% unnest_tokens(word, text)
    ## # A tibble: 20 x 2
    ##     line word
    ##    <int> <chr>
    ##  1     1 because
    ##  2     1 i
    ##  3     1 could
    ##  4     1 not
    ##  5     1 stop
    ##  6     1 for
    ##  7     1 death
    ##  8     2 he
    ##  9     2 kindly
    ## 10     2 stopped
    ## 11     2 for
    ## 12     2 me
    ## 13     3 the
    ## 14     3 carriage
    ## 15     3 held
    ## 16     3 but
    ## 17     3 just
    ## 18     3 ourselves
    ## 19     4 and
    ## 20     4 immortality

For tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, a sentence, or a paragraph.
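For readers working in Python, the same one-token-per-row idea can be sketched with pandas. This is a rough illustration, not a port of tidytext: the helpers `unnest_words` and `unnest_ngrams` are hypothetical names, and the regular expression `[a-z']+` is an assumed, simplified stand-in for the lowercasing and punctuation stripping seen in the `unnest_tokens()` output.

```python
import pandas as pd

text = [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality",
]
text_df = pd.DataFrame({"line": range(1, len(text) + 1), "text": text})


def _words(series):
    """Lowercase each string and pull out runs of letters/apostrophes (assumed tokenizer)."""
    return series.str.lower().str.findall(r"[a-z']+")


def unnest_words(df, column="text", output="word"):
    """Hypothetical helper: return one row per word, keeping the other columns."""
    return (
        df.drop(columns=column)
        .assign(**{output: _words(df[column])})
        .explode(output)            # one list element per row
        .dropna(subset=[output])    # lines with no tokens yield NaN
        .reset_index(drop=True)
    )


def unnest_ngrams(df, n=2, column="text", output="ngram"):
    """Hypothetical helper: return one row per n-gram, built within each line."""
    ngrams = _words(df[column]).apply(
        lambda ws: [" ".join(ws[i:i + n]) for i in range(len(ws) - n + 1)]
    )
    return (
        df.drop(columns=column)
        .assign(**{output: ngrams})
        .explode(output)
        .dropna(subset=[output])
        .reset_index(drop=True)
    )


tidy_df = unnest_words(text_df)
bigram_df = unnest_ngrams(text_df, n=2)
print(tidy_df)
print(bigram_df)
```

Because each row carries its `line` value, the token table can still be grouped or joined back to the original lines, which is the main reason for keeping the data in this shape.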


Case Study: Sentiment Analysis

Data Prep

    import pandas as pd
    import numpy as np

    # Read in the data
    df = pd.read_csv('/Users/zero/Desktop/NLP/raw-data/Amazon_Unlocked_Mobile.csv')

    # Sample the data to speed up computation
    # Comment out this line to match with lecture
    df = df.sample(frac=0.1, random_state=10)

    df.head()

            Product Name  Brand Name  Price  Rating  Reviews  Review Votes
    394349  Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat.
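The row label 394349 in the output is a side effect of `df.sample`: sampled rows keep their original index labels, and a fixed `random_state` makes the draw repeatable. A minimal sketch of that behaviour follows, using a small invented table (`toy`, with made-up ratings and review strings) in place of the real CSV.

```python
import pandas as pd

# Toy stand-in for the reviews table; all values are invented for illustration.
toy = pd.DataFrame({
    "Rating": [5, 1, 4, 3, 5, 2, 4, 5, 1, 3],
    "Reviews": [f"review {i}" for i in range(10)],
})

# Drawing twice with the same seed selects the same rows in the same order.
sample_a = toy.sample(frac=0.3, random_state=10)
sample_b = toy.sample(frac=0.3, random_state=10)
print(sample_a.equals(sample_b))

# The sampled frame keeps the original row labels, so the index is not 0..n-1.
print(sample_a.index.tolist())

# Optional: renumber the rows if positional labels are more convenient later.
sample_reset = sample_a.reset_index(drop=True)
print(sample_reset.index.tolist())
```

Keeping the original labels is useful when sampled rows need to be traced back to the full file; resetting the index is only a convenience.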



Jixing Liu

Reading And Writing

Data Scientist

China