Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1
76
text2 = text1.split(' ')  # Return a list of the words in text1, separating by ' '.

len(text2)
14
text2
['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
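The empty string at the end of the list comes from the trailing space in text1: split(' ') treats every single space as a separator. A minimal sketch of the difference from calling split() with no argument, which splits on runs of whitespace and discards empty strings:

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') treats each single space as a separator, so the trailing
# space produces an empty string as the last element.
with_sep = text1.split(' ')

# split() with no argument splits on runs of whitespace and ignores
# leading/trailing whitespace, so no empty strings appear.
no_sep = text1.split()

print(len(with_sep), len(no_sep))  # 14 13
print(with_sep[-1] == '')          # True
```

Which form to use depends on whether empty tokens carry meaning for the task; for word counts, split() is usually the safer default.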


Find specific words in a list by adding a condition to the list comprehension.

[w for w in text2 if len(w) > 3] # Words in text2 that are more than 3 letters long
['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']
[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']
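One caveat with istitle(): it is stricter than "starts with a capital letter", because it also requires the remaining letters of each word to be lowercase. The sketch below uses a small hypothetical sample list (words borrowed from the tweet used later in this post) to show the difference:

```python
# Hypothetical sample list for illustration only.
words = ['Ethics', 'NY', 'bit.ly', 'United', '']

# istitle() is True only for an uppercase first letter followed by
# lowercase letters, so the all-caps 'NY' is excluded.
titled = [w for w in words if w.istitle()]

# Checking just the first character keeps 'NY'; slicing with [:1]
# avoids an IndexError on the empty string.
starts_upper = [w for w in words if w[:1].isupper()]

print(titled)        # ['Ethics', 'United']
print(starts_upper)  # ['Ethics', 'NY', 'United']
```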


Find unique words using set().

text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
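The same result can be written as a set comprehension, which skips the intermediate list. A minimal sketch:

```python
text4 = 'To be or not to be'.split(' ')

# A set comprehension builds the set directly, without first
# creating a list as set([...]) does.
unique_lower = {w.lower() for w in text4}

print(len(unique_lower))     # 4
print(sorted(unique_lower))  # ['be', 'not', 'or', 'to']
```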

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6
['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']


Finding hashtags:

[w for w in text6 
     if w.startswith('#')]
['#UNSG']


Finding callouts:

[w for w in text6 if w.startswith('@')]
['@']
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')


Use regular expressions to help us with more complex parsing.

For example, '@[A-Za-z0-9_]+' will return all words that:

* start with '@' and are followed by at least one:
    * capital letter ('A-Z')
    * lowercase letter ('a-z')
    * number ('0-9')
    * or underscore ('_')

import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
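Two details of this approach are worth a quick sketch. First, re.findall can scan the whole string, so the split step is optional. Second, re.search accepts a match anywhere inside a token, not only at its start; re.fullmatch is the stricter choice. The token 'someone@example' below is a made-up example, not part of the original data.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

pattern = r'@[A-Za-z0-9_]+'

# findall scans the whole string; the lone '@' is skipped because the
# pattern requires at least one character after it.
callouts = re.findall(pattern, text7)
print(callouts)  # ['@UN', '@UN_Women']

# Hypothetical token: search finds '@example' inside it, while
# fullmatch requires the entire token to fit the pattern.
token = 'someone@example'
found_anywhere = bool(re.search(pattern, token))
whole_token = bool(re.fullmatch(pattern, token))
print(found_anywhere, whole_token)  # True False
```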

Regex with pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df
text
0 Monday: The doctor's appointment is at 2:45pm.
1 Tuesday: The dentist's appointment is at 11:30...
2 Wednesday: At 7:00pm, there is a basketball game!
3 Thursday: Be back home by 11:15 pm at the latest.
4 Friday: Take the train at 08:10 am, arrive at ...
# the 'text' column is a Series of strings
df['text']
0       Monday: The doctor's appointment is at 2:45pm.
1    Tuesday: The dentist's appointment is at 11:30...
2    Wednesday: At 7:00pm, there is a basketball game!
3    Thursday: Be back home by 11:15 pm at the latest.
4    Friday: Take the train at 08:10 am, arrive at ...
Name: text, dtype: object
# the .str accessor exposes vectorized string methods on the Series
df['text'].str
<pandas.core.strings.StringMethods at 0x11a34eb38>
print(type(df['text'])) 
<class 'pandas.core.series.Series'>
# find the number of characters for each string in df['text']
df['text'].str.len()
0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64
# split each string in df['text'] on whitespace into a list of tokens
df['text'].str.split()
0    [Monday:, The, doctor's, appointment, is, at, ...
1    [Tuesday:, The, dentist's, appointment, is, at...
2    [Wednesday:, At, 7:00pm,, there, is, a, basket...
3    [Thursday:, Be, back, home, by, 11:15, pm, at,...
4    [Friday:, Take, the, train, at, 08:10, am,, ar...
Name: text, dtype: object
# find the number of tokens for each string in df['text']:
# split each string into a list, then take the length of each list
df['text'].str.split().str.len()
0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

Finding entries that contain a pattern

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
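Note that str.contains interprets its argument as a regular expression by default, which matters as soon as the search text includes characters such as '.'. The sketch below, on a two-row sample taken from the sentences above, shows the regex=False and case=False options:

```python
import pandas as pd

s = pd.Series(["Monday: The doctor's appointment is at 2:45pm.",
               "Wednesday: At 7:00pm, there is a basketball game!"])

# By default the pattern is a regex, so '.' matches any character.
as_regex = s.str.contains('.').tolist()

# regex=False treats the pattern as literal text: only the first
# sentence actually contains a period.
as_literal = s.str.contains('.', regex=False).tolist()

# case=False makes the match case-insensitive.
any_case = s.str.contains('APPOINTMENT', case=False).tolist()

print(as_regex)    # [True, True]
print(as_literal)  # [True, False]
print(any_case)    # [True, False]
```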

Counting how many times a pattern occurs

# find how many times a digit occurs in each string
df['text'].str.count(r'[0-9]')
0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

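The regex quantifier ? makes the preceding element optional: it matches zero or one time. In \d?\d below, this allows hours written with either one or two digits.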

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
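To see what the ? contributes, the pattern can be tried with the plain re module on a couple of short strings. The sketch also contrasts this optional ? with the other role of ?, where it follows a quantifier such as + and makes it lazy (non-greedy); the sample strings are illustrative only.

```python
import re

# Optional quantifier: \d?\d accepts one-digit or two-digit hours.
one_digit = re.findall(r'(\d?\d):(\d\d)', 'at 2:45pm')
two_digit = re.findall(r'(\d?\d):(\d\d)', 'at 11:30 am')
print(one_digit)  # [('2', '45')]
print(two_digit)  # [('11', '30')]

# Without the ?, a one-digit hour no longer matches.
strict = re.findall(r'(\d\d):(\d\d)', 'at 2:45pm')
print(strict)  # []

# A different use of ?: placed after + it makes the quantifier lazy,
# so each match takes as few digits as possible.
greedy = re.findall(r'\d+', '2020')
lazy = re.findall(r'\d+?', '2020')
print(greedy)  # ['2020']
print(lazy)    # ['2', '0', '2', '0']
```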
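# replace weekdays with '???'
# \w matches any word character; for ASCII text it is equivalent to [A-Za-z0-9_]
# \b matches at a word boundary
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as literal text by default
df['text'].str.replace(r'\w+day\b', '???', regex=True)
df['text'].str.replace(r'[A-Za-z0-9_]+day\b', '???', regex=True)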
0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
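A character range such as [A-z] is easy to mistake for [A-Za-z], but the two are not the same: in ASCII, six punctuation characters sit between 'Z' and 'a', and [A-z] matches all of them. The sketch below checks this with the plain re module and also shows the effect of \b on a made-up sample string.

```python
import re

# Characters between 'A' and 'z' in ASCII that are not letters.
extras = [chr(c) for c in range(ord('A'), ord('z') + 1) if not chr(c).isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] matches every one of them; [A-Za-z] matches none.
loose = re.findall(r'[A-z]', ''.join(extras))
tight = re.findall(r'[A-Za-z]', ''.join(extras))
print(len(loose), len(tight))  # 6 0

# \b requires a word boundary after 'day', so 'holidays' is left
# alone. The sample string is hypothetical.
replaced = re.sub(r'\w+day\b', '???', "Monday: today's holidays")
print(replaced)  # ???: ???'s holidays
```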
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10
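Row 4 above shows only 08:10 even though the sentence contains two times, because str.extract keeps just the first match per string. A minimal sketch of the contrast with str.extractall on that one sentence, passing expand=True explicitly so the result is a DataFrame regardless of the pandas version's default:

```python
import pandas as pd

s = pd.Series(["Friday: Take the train at 08:10 am, arrive at 09:00am."])
pattern = r'(\d?\d):(\d\d)'

# extract: one row per input string, first match only.
first = s.str.extract(pattern, expand=True)

# extractall: one row per match, indexed by (original row, match number).
every = s.str.extractall(pattern)

print(first.values.tolist())  # [['08', '10']]
print(every.values.tolist())  # [['08', '10'], ['09', '00']]
print(every.index.names)      # [None, 'match']
```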
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
            0   1   2   3
  match
0 0      2:45pm   2  45  pm
1 0    11:30 am  11  30  am
2 0      7:00pm   7  00  pm
3 0    11:15 pm  11  15  pm
4 0    08:10 am  08  10  am
  1     09:00am  09  00  am
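The column order 0 to 3 follows how regex groups are numbered: by the position of each opening parenthesis, so the outer group (the whole time) comes first. A short sketch with the plain re module on one of the sentences illustrates this, and shows that naming the groups with (?P<name>...) lets each part be looked up by name instead of position:

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# Groups are numbered by their opening parenthesis: whole time, hour,
# minute, period. findall returns one tuple per match in that order.
rows = re.findall(r'((\d?\d):(\d\d) ?([ap]m))', sentence)
print(rows)  # [('08:10 am', '08', '10', 'am'), ('09:00am', '09', '00', 'am')]

# The same pattern with named groups; groupdict() maps names to text.
named = re.compile(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
matches = [m.groupdict() for m in named.finditer(sentence)]
print(matches[0]['time'], matches[0]['period'])  # 08:10 am am
print(matches[1]['hour'], matches[1]['minute'])  # 09 00
```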
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
         time  hour  minute  period
  match
0 0    2:45pm     2      45      pm
1 0  11:30 am    11      30      am
2 0    7:00pm     7      00      pm
3 0  11:15 pm    11      15      pm
4 0  08:10 am    08      10      am
  1   09:00am    09      00      am

© 2020 Jixing Liu. All Rights Reserved