Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

text2 = text1.split(' ')  # Return a list of the words in text1, separating by ' '.

len(text2)

text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
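The empty string at the end of the list comes from the trailing space in text1. As a small sketch (the names by_space and by_whitespace are only illustrative and not part of the original notebook), calling split() with no argument splits on runs of whitespace and drops empty strings:

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

by_space = text1.split(' ')     # splits on every single space, keeps empty strings
by_whitespace = text1.split()   # splits on runs of whitespace, drops empty strings

print(len(by_space), repr(by_space[-1]))            # 14 ''
print(len(by_whitespace), repr(by_whitespace[-1]))  # 13 'Nations'
```

Which form is preferable depends on whether empty tokens carry meaning in your data; for plain word counts, split() is usually the safer choice.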

Find specific words in the list by adding a condition to the list comprehension:

[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']
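The conditions above can also be combined with and / or inside a single comprehension. The following is a minimal sketch; the variable names title_and_s and short_lower are made up for illustration:

```python
text2 = "Ethics are built right into the ideals and objectives of the United Nations".split()

# Capitalized words that also end in 's'
title_and_s = [w for w in text2 if w.istitle() and w.endswith('s')]
print(title_and_s)   # ['Ethics', 'Nations']

# Lowercase words of at most 3 letters
short_lower = [w for w in text2 if w.islower() and len(w) <= 3]
print(short_lower)   # ['are', 'the', 'and', 'of', 'the']
```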

Find unique words using `set()`.

text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

len(set(text4))

set(text4)

{'To', 'be', 'not', 'or', 'to'}

len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
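A set only tells you which words occur, not how often. If the frequencies matter as well, one possible approach (not used in the original notebook) is collections.Counter from the standard library:

```python
from collections import Counter

text3 = 'To be or not to be'

# Count each word after lowercasing, so 'To' and 'to' are treated as the same word
counts = Counter(w.lower() for w in text3.split(' '))

print(len(counts))                                              # 4
print(counts['to'], counts['be'], counts['or'], counts['not'])  # 2 2 1 1
```

The number of keys in the Counter equals the size of the lowercased set shown above.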

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

Finding hashtags:

[w for w in text6 if w.startswith('#')]

['#UNSG']

Finding callouts:

[w for w in text6 if w.startswith('@')]

['@']

text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

Use regular expressions to help us with more complex parsing.

For example, '@[A-Za-z0-9_]+' will return all words that:
* start with '@' and are followed by at least one:
  * capital letter ('A-Z')
  * lowercase letter ('a-z')
  * number ('0-9')
  * or underscore ('_')

import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
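One detail worth keeping in mind: re.search looks for the pattern anywhere in the word, while re.match only accepts a match at the start. The sketch below illustrates the difference; the word list words (including the made-up token 'mail@example') is a hypothetical example, and the tweet text is shortened:

```python
import re

callout = re.compile(r'@[A-Za-z0-9_]+')

text7 = '@UN @UN_Women "Ethics are built right into the ideals" #UNSG @ NY Society'

# findall scans the whole string, so no prior split() is needed
found = callout.findall(text7)
print(found)            # ['@UN', '@UN_Women']

words = ['@UN', 'mail@example', '@']
anywhere = [w for w in words if callout.search(w)]   # pattern may occur mid-word
at_start = [w for w in words if callout.match(w)]    # pattern must begin the word
print(anywhere)         # ['@UN', 'mail@example']
print(at_start)         # ['@UN']
```

For tweets split on spaces both functions give the same result on text8, but match is the stricter choice if the data may contain email-like tokens.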

Regex with pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

	text
0	Monday: The doctor's appointment is at 2:45pm.
1	Tuesday: The dentist's appointment is at 11:30...
2	Wednesday: At 7:00pm, there is a basketball game!
3	Thursday: Be back home by 11:15 pm at the latest.
4	Friday: Take the train at 08:10 am, arrive at ...

# select the 'text' column, which is a pandas Series of strings
df['text']

0       Monday: The doctor's appointment is at 2:45pm.
1    Tuesday: The dentist's appointment is at 11:30...
2    Wednesday: At 7:00pm, there is a basketball game!
3    Thursday: Be back home by 11:15 pm at the latest.
4    Friday: Take the train at 08:10 am, arrive at ...
Name: text, dtype: object

# the .str accessor exposes vectorized string methods on the Series
df['text'].str

<pandas.core.strings.StringMethods at 0x11a34eb38>

print(type(df['text']))

<class 'pandas.core.series.Series'>

# find the number of characters for each string in df['text']
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

# split each string in df['text'] into a list of whitespace-separated tokens
df['text'].str.split()

0    [Monday:, The, doctor's, appointment, is, at, ...
1    [Tuesday:, The, dentist's, appointment, is, at...
2    [Wednesday:, At, 7:00pm,, there, is, a, basket...
3    [Thursday:, Be, back, home, by, 11:15, pm, at,...
4    [Friday:, Take, the, train, at, 08:10, am,, ar...
Name: text, dtype: object

# find the number of tokens for each string in df['text']:
# str.split() turns each string into a list, then str.len() gives the list length
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64
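These per-row results are ordinary Series, so they can be stored as new columns next to the text. A minimal sketch follows, using two of the sentences above; the column names n_chars and n_tokens are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'text': ["Monday: The doctor's appointment is at 2:45pm.",
                            "Thursday: Be back home by 11:15 pm at the latest."]})

df['n_chars'] = df['text'].str.len()                 # characters per string
df['n_tokens'] = df['text'].str.split().str.len()    # whitespace-separated tokens per string

summary = df[['n_chars', 'n_tokens']].values.tolist()
print(summary)   # [[46, 7], [49, 10]]
```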

Testing for a match with str.contains

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
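Because str.contains returns a boolean Series, it can be used directly as a row filter. The sketch below assumes a shortened three-row DataFrame and uses the case parameter of str.contains for case-insensitive matching; the alternation pattern 'appointment|game' is just an example:

```python
import pandas as pd

df = pd.DataFrame({'text': ["Monday: The doctor's appointment is at 2:45pm.",
                            "Wednesday: At 7:00pm, there is a basketball game!",
                            "Thursday: Be back home by 11:15 pm at the latest."]})

# True where the text contains either word, ignoring case
mask = df['text'].str.contains(r'appointment|game', case=False)
print(mask.tolist())    # [True, True, False]

# Keep only the matching rows
matching = df.loc[mask, 'text'].tolist()
print(len(matching))    # 2
```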

Counting how many times a pattern occurs with str.count

# find how many times a digit occurs in each string
df['text'].str.count(r'[0-9]')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64
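What gets counted depends on the pattern: '[0-9]' counts individual digits, while '[0-9]+' counts runs of digits. As a plain-Python illustration of the same idea (using re.findall on a single sentence rather than pandas):

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

n_digits = len(re.findall(r'[0-9]', sentence))     # every single digit is one match
n_numbers = len(re.findall(r'[0-9]+', sentence))   # each run of digits is one match

print(n_digits, n_numbers)   # 8 4
```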

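The regex quantifier ? matches the preceding element at most once (zero or one time). Placed after another quantifier, as in +?, it instead makes that quantifier non-greedy.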

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
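To make the two roles of ? concrete, here is a short sketch with the standard re module; the sample string is invented for the example:

```python
import re

sample = "at 2:45pm and 11:30 am"

# \d?\d accepts one or two digits for the hour
optional_hour = re.findall(r'(\d?\d):(\d\d)', sample)
print(optional_hour)    # [('2', '45'), ('11', '30')]

# Without the ?, a single-digit hour such as 2:45 is missed
two_digit_hour = re.findall(r'(\d\d):(\d\d)', sample)
print(two_digit_hour)   # [('11', '30')]

# After another quantifier, ? switches from greedy to non-greedy matching
greedy = re.search(r'\d+', '12345').group()
lazy = re.search(r'\d+?', '12345').group()
print(greedy, lazy)     # 12345 1
```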

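# replace weekdays with '???'
# \w is equivalent to [A-Za-z0-9_] for ASCII text
# \b matches a word boundary
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching
df['text'].str.replace(r'\w+day\b', '???', regex=True)
df['text'].str.replace(r'[A-Za-z0-9_]+day\b', '???', regex=True)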

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
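A note on character ranges: a class such as [A-z] is easy to write by mistake for [A-Za-z], but the two are not the same. The range A-z follows character codes, so it also covers the six punctuation characters that sit between 'Z' and 'a'. A quick check with the re module:

```python
import re

# The six ASCII characters between 'Z' and 'a'
between = '[\\]^_`'

loose = [ch for ch in between if re.fullmatch(r'[A-z]', ch)]
strict = [ch for ch in between if re.fullmatch(r'[A-Za-z]', ch)]

print(loose)    # ['[', '\\', ']', '^', '_', '`']
print(strict)   # []
```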

# create new columns from first match of extracted groups
# expand=True returns a DataFrame with one column per group
df['text'].str.extract(r'(\d?\d):(\d\d)', expand=True)

	0	1
0	2	45
1	11	30
2	7	00
3	11	15
4	08	10

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

		0	1	2	3
	match
0	0	2:45pm	2	45	pm
1	0	11:30 am	11	30	am
2	0	7:00pm	7	00	pm
3	0	11:15 pm	11	15	pm
4	0	08:10 am	08	10	am
4	1	09:00am	09	00	am

# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

		time	hour	minute	period
	match
0	0	2:45pm	2	45	pm
1	0	11:30 am	11	30	am
2	0	7:00pm	7	00	pm
3	0	11:15 pm	11	15	pm
4	0	08:10 am	08	10	am
4	1	09:00am	09	00	am
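Named groups make the extracted pieces easy to reuse. As one possible follow-up, the sketch below converts each matched time to a 24-hour (hour, minute) tuple with the standard re module. The helper to_24_hour and the constant TIME_RE are hypothetical names, not part of the original notebook:

```python
import re

# Same named groups as above, without the outer 'time' group
TIME_RE = re.compile(r'(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)')

def to_24_hour(sentence):
    """Return every time found in `sentence` as an (hour, minute) tuple on a 24-hour clock."""
    times = []
    for m in TIME_RE.finditer(sentence):
        hour = int(m.group('hour')) % 12      # maps 12 to 0 before any pm offset
        if m.group('period') == 'pm':
            hour += 12
        times.append((hour, int(m.group('minute'))))
    return times

print(to_24_hour("Monday: The doctor's appointment is at 2:45pm."))          # [(14, 45)]
print(to_24_hour("Friday: Take the train at 08:10 am, arrive at 09:00am."))  # [(8, 10), (9, 0)]
```

The modulo step is there so that 12:xx am becomes hour 0 and 12:xx pm stays at hour 12.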

NLP use python part 01

Jixing Liu