Multi-Label Classification

June 19, 2019 in machine learning

The difference between multi-class and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a separate classification task, although the tasks are related. Multi-class classification assumes that each sample is assigned one and only one label: a fruit can be either an apple or a pear, but not both at the same time.
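To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a model has already produced one raw score per class or label. The scores, the label names "red", "sour" and "round", and the 0.5 threshold are all hypothetical. Multi-class prediction normalizes the scores with softmax and picks exactly one winner. Multi-label prediction passes each score through its own sigmoid and keeps every label that clears the threshold.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (mutually exclusive classes)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Turn one raw score into an independent probability for one label."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_multiclass(scores, classes):
    """Multi-class: exactly one class wins."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best]


def predict_multilabel(scores, labels, threshold=0.5):
    """Multi-label: each label is its own yes/no decision, so several can be on."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]


# Hypothetical raw scores, as if produced by some model.
fruit = predict_multiclass([2.0, 0.5], ["apple", "pear"])
tags = predict_multilabel([1.2, -0.7, 0.3], ["red", "sour", "round"])
print(fruit, tags)
```

The multi-class call can only ever return one fruit, while the multi-label call may return none, some, or all of the labels.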

The Hello World Of Neural Network

May 17, 2019 in machine learning

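Put simply, a neural network differs from ordinary programming as follows: in ordinary programming we supply data and the rules (a function) and get results, whereas with a neural network we supply data and answers, and through iterative learning the network learns the rules, as shown in the figure below:

Take a simple example with two sets of data:

X: -1, 0, 1, 2, 3, 4
Y: -3, -1, 1, 3, 5, 7

Think of X as the data and Y as the answers. Your task is to find the functional relationship that lets us predict Y from X (assume you have never learned to solve systems of equations). The most common approach is induction: guess a rule from the first pair of values, use that rule to compute answers, evaluate how far the computed answers are from the true ones, then adjust the rule and evaluate again, until the rule fits all the data. This is the logic of a neural network. Let's look at a simple neural network example.

1. Import: load the required modules

    import tensorflow as tf
    ## /Users/zero/anaconda3/envs/tfdeeplearning/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: compiletime version 3.6 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.5
    ##   return f(*args, **kwds)
    import numpy as np
    from tensorflow import keras

2.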
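The guess, evaluate, adjust loop described above can be sketched without any framework. This is not the Keras model the post builds. It is a minimal illustration that assumes the rule is a straight line y = w * x + b and adjusts w and b by plain gradient descent on the mean squared error. The learning rate and the number of steps are arbitrary choices.

```python
# The data (X) and the answers (Y) from the example above.
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-3.0, -1.0, 1.0, 3.0, 5.0, 7.0]

# Initial guess for the rule y = w * x + b.
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed step size, small enough to converge on this data

for _ in range(2000):
    # Evaluate: how far is each computed answer from the true answer?
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # Adjust the rule a little in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

prediction = w * 10.0 + b  # use the learned rule on an unseen x
print(round(w, 3), round(b, 3), round(prediction, 3))
```

On this data the loop settles on w close to 2 and b close to -1, that is, Y = 2X - 1.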

NLP part04 word count use R

September 20, 2018 in machine learning

What Is Clean Data

Each variable is a column
Each observation is a row
Each type of observational unit is a table

Tidy text is a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

    text <- c(
      "Because I could not stop for Death -",
      "He kindly stopped for me -",
      "The Carriage held but just Ourselves -",
      "and Immortality"
    )
    text
    ## [1] "Because I could not stop for Death -"
    ## [2] "He kindly stopped for me -"
    ## [3] "The Carriage held but just Ourselves -"
    ## [4] "and Immortality"

    library(dplyr)
    text_df <- data_frame(line = 1:4, text = text)
    text_df
    ## # A tibble: 4 x 2
    ##    line text
    ##   <int> <chr>
    ## 1     1 Because I could not stop for Death -
    ## 2     2 He kindly stopped for me -
    ## 3     3 The Carriage held but just Ourselves -
    ## 4     4 and Immortality

    library(tidytext)
    text_df %>% unnest_tokens(word, text)
    ## # A tibble: 20 x 2
    ##     line word
    ##    <int> <chr>
    ##  1     1 because
    ##  2     1 i
    ##  3     1 could
    ##  4     1 not
    ##  5     1 stop
    ##  6     1 for
    ##  7     1 death
    ##  8     2 he
    ##  9     2 kindly
    ## 10     2 stopped
    ## 11     2 for
    ## 12     2 me
    ## 13     3 the
    ## 14     3 carriage
    ## 15     3 held
    ## 16     3 but
    ## 17     3 just
    ## 18     3 ourselves
    ## 19     4 and
    ## 20     4 immortality

For tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, a sentence, or a paragraph.
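For readers who work in Python, the one-token-per-row idea can be approximated with the standard library alone. This is a rough sketch, not a port of tidytext: the helper name `one_token_per_row` is made up for illustration, and the regular expression (lowercase letters and apostrophes) is a simplifying assumption that will not match tidytext's tokenizer on every input.

```python
import re

text = [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality",
]


def one_token_per_row(lines):
    """Return (line_number, word) rows: lowercase words, punctuation dropped."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        # Assumed tokenizer: runs of letters/apostrophes count as words.
        for word in re.findall(r"[a-z']+", line.lower()):
            rows.append((line_number, word))
    return rows


rows = one_token_per_row(text)
for line_number, word in rows:
    print(line_number, word)
```

Each row keeps the line number it came from, so counts per line or per word can be computed later without going back to the raw text.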

NLP part 03 sentiment analysis

September 17, 2018 in machine learning

Case Study: Sentiment Analysis

Data Prep

    import pandas as pd
    import numpy as np

    # Read in the data
    df = pd.read_csv('/Users/zero/Desktop/NLP/raw-data/Amazon_Unlocked_Mobile.csv')

    # Sample the data to speed up computation
    # Comment out this line to match with lecture
    df = df.sample(frac=0.1, random_state=10)

    df.head()

Columns: Product Name, Brand Name, Price, Rating, Reviews, Review Votes
394349  Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat.

Will Data Scientists Be Replaced by Machines?

September 10, 2018 in computers, machine learning

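Data science automation was once a hot topic. Most people were discussing so-called "automation" tools, whose makers claimed that they could automate the data science process. This gave the illusion that combining these tools with a big data architecture could solve any business problem. In real data analysis work, however, automated modeling accounts for only about 10% of the total workload; most of the time and effort goes into feature engineering and feature selection. Rather than building a complex model, we should pay more attention to questions such as defining the problem to solve, acquiring the data, exploring the data, deploying the project, and debugging and monitoring, and these often cannot be fully automated. Berry and Linoff offer an interesting analogy from photography: “The camera can relieve the photographer from having to set the shutter speed, aperture and other settings every time a picture is taken. This makes the process easier for expert photographers and makes better photography accessible to people who are not experts. But this is still automating only a small part of the process of producing a photograph.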

t-SNE

August 7, 2018 in machine learning

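Note

Many real-world datasets have low intrinsic dimensionality but are embedded in a high-dimensional space. Because humans are limited to three-dimensional visualization, these intrinsic structures are hard to see.

t-distributed stochastic neighbor embedding (t-SNE) is a very popular nonlinear dimensionality reduction method. It differs from PCA in the following way: PCA essentially tries to preserve as much of the data's variance as possible after reduction, whereas in practice we often want to preserve the original structure. In other words, the relative distances between points before the reduction should still hold afterwards.

Process

Here we use 3 dimensions to stand in for "high-dimensional" and 2 dimensions for "low-dimensional". The 3D points, once mapped, should keep their original relative distance structure in 2D space. In essence this resembles a neural network or deep learning problem: find the best mapping matrix (the best combination of w1 and w2) so that the structure of the data is preserved after mapping.

How do we measure how well the structure is preserved? We compare the distance matrix of the original data with the distance matrix after mapping. The problem then becomes one of minimizing an entropy-like quantity (in t-SNE, the Kullback-Leibler divergence between the two sets of similarities).

But how can a high-dimensional distance matrix be compared with a low-dimensional one? This is what is special about t-SNE: it turns the distance problem into a probability problem. The closer two points i and j are, the larger pij is. In the low-dimensional map, a t-distribution centered on i describes the distances from i to all other points (the high-dimensional similarities use a Gaussian). The normal distribution is not used in the map because the t-distribution is heavy-tailed, so pairs of points in the tails (points that are far apart) are penalized less strictly. In other words, the penalty for near points becoming far after mapping is greater than the penalty for far points becoming near.

The process can be pictured as a spring being deformed. When the mapping brings far points closer, the spring is compressed and produces a repulsive force; when the mapping pushes near points apart, the spring is stretched and produces an attractive force. The deformation plays the role of the entropy, that is, the gap between before and after the mapping, and the absolute value of the force per unit of deformation (force / deformation length) plays the role of the penalty coefficient. The attractive force per unit of stretch is greater than the repulsive force per unit of compression, which neatly addresses the crowding problem of the mapped points. It is as if there were some repulsion between points (to correct far-becomes-near cases), but with a coefficient smaller than that of the attraction between points (which corrects near-becomes-far cases). For the same size of error, near-becomes-far cases are therefore corrected first. The unit attraction exceeds the unit repulsion because we care more about the structure of similarity between points. In the end, each point settles into equilibrium under the sum of all forces, as shown in the figure below:

Links: tsne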
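The step from distances to probabilities can be illustrated with a small, heavily simplified sketch in plain Python. It is not a t-SNE implementation: it does no optimization, and it assumes a single fixed Gaussian bandwidth (sigma = 1) where real t-SNE tunes a bandwidth per point from a perplexity setting. The four 3D points and the two candidate 2D layouts are hypothetical. The sketch only shows that a layout which pulls neighbours apart gets a much larger Kullback-Leibler cost than one that keeps them together.

```python
import math


def squared_distances(points):
    """Pairwise squared Euclidean distances."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


def normalise(weights):
    """Scale all pairwise weights so that they sum to 1."""
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def gaussian_affinities(points, sigma=1.0):
    """High-dimensional similarities p_ij (simplified: one shared sigma)."""
    d2 = squared_distances(points)
    n = len(points)
    return normalise([[0.0 if i == j else math.exp(-d2[i][j] / (2 * sigma ** 2))
                       for j in range(n)] for i in range(n)])


def student_t_affinities(points):
    """Low-dimensional similarities q_ij from the heavy-tailed kernel 1 / (1 + d^2)."""
    d2 = squared_distances(points)
    n = len(points)
    return normalise([[0.0 if i == j else 1.0 / (1.0 + d2[i][j])
                       for j in range(n)] for i in range(n)])


def kl_divergence(p, q):
    """Sum of p_ij * log(p_ij / q_ij) over all pairs with p_ij > 0."""
    return sum(pij * math.log(pij / qij)
               for row_p, row_q in zip(p, q)
               for pij, qij in zip(row_p, row_q) if pij > 0)


# Hypothetical data: two tight pairs of points, far from each other in 3D.
points_3d = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
good_2d = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # neighbours stay neighbours
bad_2d = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]   # neighbours pulled apart

p = gaussian_affinities(points_3d)
kl_good = kl_divergence(p, student_t_affinities(good_2d))
kl_bad = kl_divergence(p, student_t_affinities(bad_2d))
print(kl_good < kl_bad)
```

Most of the cost of the bad layout comes from the pairs that were close in 3D but ended up far apart in 2D, which is the near-becomes-far case the spring analogy describes.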

Jixing Liu
