[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']
Find unique words using set().
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower() converts each string to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
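The same idea can be wrapped in a small helper; a minimal sketch (the `unique_words` name is my own, not from the course):

```python
def unique_words(text):
    """Return the set of distinct words in `text`, ignoring case."""
    return {w.lower() for w in text.split()}

text3 = 'To be or not to be'
print(unique_words(text3))       # {'to', 'be', 'or', 'not'}
print(len(unique_words(text3)))  # 4
```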
Processing free-text
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
Regular expressions help with more complex parsing.
For example, '@[A-Za-z0-9_]+' will match all words that:
* start with '@' and are followed by at least one of the following:
* capital letter ('A-Z')
* lowercase letter ('a-z')
* number ('0-9')
* or underscore ('_')
import re # the re module provides support for regular expressions
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
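Instead of testing word by word with re.search, re.findall can pull every match out of the whole string at once; a sketch using the same tweet (hashtags follow the same pattern with '#'):

```python
import re

text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" ' \
        '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

# callouts: '@' followed by one or more word characters
print(re.findall(r'@[A-Za-z0-9_]+', text7))  # ['@UN', '@UN_Women']
# hashtags: same idea with '#' (the bare '@ ' before 'NY' does not match)
print(re.findall(r'#[A-Za-z0-9_]+', text7))  # ['#UNSG']
```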
Regex with pandas
import pandas as pd
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
"Tuesday: The dentist's appointment is at 11:30 am.",
"Wednesday: At 7:00pm, there is a basketball game!",
"Thursday: Be back home by 11:15 pm at the latest.",
"Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])
df
                                                text
0     Monday: The doctor's appointment is at 2:45pm.
1  Tuesday: The dentist's appointment is at 11:30...
2  Wednesday: At 7:00pm, there is a basketball game!
3  Thursday: Be back home by 11:15 pm at the latest.
4  Friday: Take the train at 08:10 am, arrive at ...
# find the number of characters for each string in df['text']
df['text']
0 Monday: The doctor's appointment is at 2:45pm.
1 Tuesday: The dentist's appointment is at 11:30...
2 Wednesday: At 7:00pm, there is a basketball game!
3 Thursday: Be back home by 11:15 pm at the latest.
4 Friday: Take the train at 08:10 am, arrive at ...
Name: text, dtype: object
# find the number of characters for each string in df['text']
df['text'].str
<pandas.core.strings.StringMethods at 0x11a34eb38>
print(type(df['text']))
<class 'pandas.core.series.Series'>
df['text'].str.len()
0 46
1 50
2 49
3 49
4 54
Name: text, dtype: int64
# find the number of tokens for each string in df['text']
df['text'].str.split()
0 [Monday:, The, doctor's, appointment, is, at, ...
1 [Tuesday:, The, dentist's, appointment, is, at...
2 [Wednesday:, At, 7:00pm,, there, is, a, basket...
3 [Thursday:, Be, back, home, by, 11:15, pm, at,...
4 [Friday:, Take, the, train, at, 08:10, am,, ar...
Name: text, dtype: object
# find the number of tokens for each string in df['text']
#str.split each to a list
df['text'].str.split().str.len()
0 7
1 8
2 8
3 10
4 10
Name: text, dtype: int64
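These vectorized string results can be stored directly as new columns; a minimal sketch with a two-row frame (the `chars`/`tokens` column names are my own):

```python
import pandas as pd

df = pd.DataFrame({'text': ["Monday: The doctor's appointment is at 2:45pm.",
                            "Wednesday: At 7:00pm, there is a basketball game!"]})
df['chars'] = df['text'].str.len()               # characters per entry
df['tokens'] = df['text'].str.split().str.len()  # whitespace tokens per entry
print(df[['chars', 'tokens']])
```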
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
# replace weekdays with '???'
# \w matches a word character, equivalent to [A-Za-z0-9_]
# \b matches a word boundary (the edge of a word)
df['text'].str.replace(r'\w+day\b', '???')
df['text'].str.replace(r'[A-Za-z0-9_]+day\b', '???')  # equivalent character-class spelling
0 ???: The doctor's appointment is at 2:45pm.
1 ???: The dentist's appointment is at 11:30 am.
2 ???: At 7:00pm, there is a basketball game!
3 ???: Be back home by 11:15 pm at the latest.
4 ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
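str.replace also accepts a callable as the replacement, which receives each match object; a minimal sketch with made-up data (newer pandas versions require regex=True for pattern replacement):

```python
import pandas as pd

s = pd.Series(["Monday: rain", "Friday: sun"])
# upper-case each weekday instead of masking it with '???'
print(s.str.replace(r'(\w+day)\b', lambda m: m.group(1).upper(), regex=True).tolist())
# ['MONDAY: rain', 'FRIDAY: sun']
```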
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
                0   1   2   3
  match
0 0        2:45pm   2  45  pm
1 0      11:30 am  11  30  am
2 0        7:00pm   7  00  pm
3 0      11:15 pm  11  15  pm
4 0      08:10 am  08  10  am
  1       09:00am  09  00  am
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
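The named groups become column names, so the extracted table can be post-processed directly; a minimal sketch with made-up sentences, converting hours and minutes to integers:

```python
import pandas as pd

s = pd.Series(["Be back by 11:15 pm.", "Train at 08:10 am, arrive 09:00am."])
# one row per match, with group names as columns
times = s.str.extractall(r'(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)')
times[['hour', 'minute']] = times[['hour', 'minute']].astype(int)
print(times)
```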
Fit a simple linear model:
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
#summary(fit)
Global validation of the linear-model assumptions: the gvlma() function in the gvlma package, written by Pena and Slate (2006), performs a global validation of the linear-model assumptions, and also evaluates skewness, kurtosis, and heteroscedasticity. In other words, it gives the model assumptions a single overall (pass/fail) test.
# Listing 8.8 - Global test of linear model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
##     data = states)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.
When you become a monster, you will no longer be scared of monsters.
A group of children is placed on an island cut off from the outside world. (This setup resembles a controlled variable in an experiment: the author wants to show that human nature is inherently evil, independent of the outside world, that is, unrelated to any other variable.) The way the innocent children descend from civilization into wickedness is a perfect demonstration of the spontaneous slide from good to evil, showing that evil is innate: it lives inside the human body as a built-in attribute.
In summary, three inherent conditions are required:
a common enemy; urgent basic needs; being swept along by the mainstream (the collective unconscious)
The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, though the tasks may be related.
Multi-class classification assumes that each sample is assigned to one and only one label: a fruit can be either an apple or a pear, but not both at the same time.
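The distinction shows up concretely in the shape of the target: one label per sample for multi-class versus a set of labels per sample for multi-label; a minimal sketch with made-up data:

```python
# multi-class: each sample gets exactly one class
y_multiclass = ['apple', 'pear', 'apple']

# multi-label: each sample gets a (possibly empty) set of labels
y_multilabel = [{'fruit', 'red'}, {'fruit', 'green'}, set()]

# a common encoding is one binary indicator column per label
labels = sorted({l for s in y_multilabel for l in s})
indicator = [[int(l in s) for l in labels] for s in y_multilabel]
print(labels)     # ['fruit', 'green', 'red']
print(indicator)  # [[1, 0, 1], [1, 1, 0], [0, 0, 0]]
```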