What do you do when you want to use results from the literature to anchor your own analysis? We’ll go through a practical scenario: scraping an HTML table from a Nature Genetics article into R and wrangling the data into a useful format.
01. Scraping an HTML table from a webpage
# load packages
library("rvest")
library("knitr")
library("tidyverse")
# the web page to scrape
url <- "https://www.nature.com/articles/ng.2802/tables/2"
#==== 🔥 find where the table lives on this webpage (as an XPath) ====
table_path <- '//*[@id="content"]/div/div/figure/div[1]/div/div[1]/table'
# read the page, select the table node, and parse it into a data frame
nature_genetics_table2 <- url %>%
  read_html() %>%
  html_nodes(xpath = table_path) %>%
  html_table(fill = TRUE) %>%
  .[[1]]
# the first few rows of the table
kable(nature_genetics_table2[1:4, ])
| SNPa | Chr. | Positionb | Closest genec | Major/minor alleles | MAFd | OR (95% CI)e | Meta P value | OR (95% CI)e | Meta P value | OR (95% CI)e | Meta P value | I2 (%), P valuef |
| Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes |
| rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 7.7 × 10−15 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 |
| rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 1.7 × 10−26 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 |
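A long absolute XPath like the one above is tied to the exact page layout, so it may stop matching if the page is restructured. A minimal sketch of a looser alternative, selecting every table on the page and picking one by position, is shown here on an inline HTML string so that it runs without network access. The HTML snippet and the names `toy_page`, `toy_tables`, and `toy_table` are made up for illustration; only the cell values are borrowed from the rows shown above.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: one small, regular table.
toy_page <- read_html('
  <html><body>
    <div id="content">
      <table>
        <tr><th>SNP</th><th>Chr</th><th>MAF</th></tr>
        <tr><td>rs6656401</td><td>1</td><td>0.197</td></tr>
        <tr><td>rs6733839</td><td>2</td><td>0.409</td></tr>
      </table>
    </div>
  </body></html>
')

# Select every <table> node, parse each one, and keep the first.
toy_tables <- html_table(html_nodes(toy_page, xpath = "//table"))
toy_table <- toy_tables[[1]]

print(toy_table)
```

On a real page with several tables, the index passed to `[[ ]]` would have to be checked by eye, which is the trade-off against a precise XPath.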
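02. Making messy data useful
Cleaning up the rows
Some rows are section labels: all the elements of these rows contain the exact same text. Find their row numbers:
# a row is a label row if it has only one unique value across all columns
v <- which(apply(nature_genetics_table2, 1, function(x) length(unique(unlist(x)))) == 1)
v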
## [1] 2 12 18
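To see why this test works, here is a minimal sketch of the same idea on a made-up data frame (`toy`, its values, and the `Group A`/`Group B` labels are illustrative, not taken from the article). `apply(..., 1, ...)` visits each row, and a row whose cells collapse to a single unique value is a section label rather than data.

```r
# Hypothetical table: rows 1 and 4 are section labels repeated in every column.
toy <- data.frame(
  SNP  = c("Group A", "rs1", "rs2",  "Group B", "rs3"),
  Chr  = c("Group A", "1",   "2",    "Group B", "6"),
  Gene = c("Group A", "CR1", "BIN1", "Group B", "CD2AP"),
  stringsAsFactors = FALSE
)

# Count the distinct values in each row; label rows have exactly one.
n_unique <- apply(toy, 1, function(x) length(unique(x)))
label_rows <- unname(which(n_unique == 1))

print(label_rows)
```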
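Split the table at the label rows
# start a new group at each label row; group "0" holds the rows before the first label
nature_genetics_table2_list <- split(nature_genetics_table2, cumsum(1:nrow(nature_genetics_table2) %in% v))
# skip group "0", store each label in a Description column, then drop the label row
nature_genetics_table2_list <- lapply(nature_genetics_table2_list[2:4], function(y) {
  y$Description <- unique(as.character(y[1, ]))
  y[-1, ]
})
# rbind the three tables
nature_genetics_table2_clean <- do.call("rbind", nature_genetics_table2_list)
kable(nature_genetics_table2_clean[1:3, ])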
| 1.3 | rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 7.7 × 10−15 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 | Known GWAS-defined associated genes |
| 1.4 | rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 1.7 × 10−26 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 | Known GWAS-defined associated genes |
| 1.5 | rs10948363 | 6 | 47487762 | CD2AP | A/G | 0.266 | 1.10 (1.07–1.14) | 3.1 × 10−8 | 1.09 (1.04–1.15) | 4.1 × 10−4 | 1.10 (1.07–1.13) | 5.2 × 10−11 | 0, 9 × 10−1 | Known GWAS-defined associated genes |
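The `cumsum()` step is compact, so a minimal sketch on made-up data may help (`toy`, `label_rows`, `section`, and the `Group A`/`Group B` labels are hypothetical). `seq_len(nrow(toy)) %in% label_rows` is `TRUE` only at the label rows, and its running sum goes up by one at each of them, which gives every row the index of the section it belongs to.

```r
# Hypothetical table: rows 1 and 4 are section labels repeated in every column.
toy <- data.frame(
  SNP = c("Group A", "rs1", "rs2", "Group B", "rs3"),
  Chr = c("Group A", "1",   "2",   "Group B", "6"),
  stringsAsFactors = FALSE
)
label_rows <- c(1, 4)

# Running count of label rows seen so far = section index of each row.
section <- cumsum(seq_len(nrow(toy)) %in% label_rows)

# Split by section, copy each label into a Description column, drop the label row.
pieces <- lapply(split(toy, section), function(y) {
  y$Description <- unique(unlist(y[1, ]))
  y[-1, ]
})
toy_clean <- do.call("rbind", pieces)

print(section)
print(toy_clean)
```

In this toy table the first label sits in row 1, so every row belongs to a labelled section. In the article's table the first label is in row 2, so `split()` also returns a leading group `0`, which is why the code above keeps only elements `2:4` of the list.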
03. Fixing column names
colnames(nature_genetics_table2_clean) <- c("SNP", "Chr", "Position", "Closest gene", "Major/minor alleles", "MAF", "Stage1_OR", "Stage1_MetaP", "Stage2_OR","Stage2_MetaP", "Overall_OR", "Overall_MetaP", "I2_Percent/P","Description")
colnames(nature_genetics_table2_clean)
## [1] "SNP" "Chr" "Position"
## [4] "Closest gene" "Major/minor alleles" "MAF"
## [7] "Stage1_OR" "Stage1_MetaP" "Stage2_OR"
## [10] "Stage2_MetaP" "Overall_OR" "Overall_MetaP"
## [13] "I2_Percent/P" "Description"
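Names such as `Closest gene` and `I2_Percent/P` are not syntactic R names, so later code has to refer to them with backticks or `[[ ]]`. A minimal sketch on a made-up one-row data frame (`toy` is hypothetical, and the `I2_Percent/P` value is a placeholder):

```r
# Hypothetical one-row table reusing two of the column names set above.
toy <- data.frame(a = "rs6656401", b = "CR1", c = "0", stringsAsFactors = FALSE)
colnames(toy) <- c("SNP", "Closest gene", "I2_Percent/P")

# Non-syntactic names need backticks with $ ...
gene <- toy$`Closest gene`
# ... or a quoted string with [[ ]].
i2 <- toy[["I2_Percent/P"]]

# make.names() shows what a syntactic version of each name would look like.
syntactic <- make.names(colnames(toy))

print(gene)
print(syntactic)
```

Whether to keep the readable names or switch to syntactic ones is a matter of taste; the readable ones look better in a printed table, the syntactic ones are easier to type.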
04. Making a character variable into a numeric variable
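# rewrite e.g. "7.7 × 10−15" as "7.7e-15", then convert to numeric
# (the minus sign in the scraped text is the Unicode character U+2212, not an ASCII hyphen)
nature_genetics_table2_clean$Stage1_MetaP <-
  str_replace(nature_genetics_table2_clean$Stage1_MetaP, " × 10−", "e-") %>% as.numeric()
kable(nature_genetics_table2_clean[1:3, ])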
| 1.3 | rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 0 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 | Known GWAS-defined associated genes |
| 1.4 | rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 0 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 | Known GWAS-defined associated genes |
| 1.5 | rs10948363 | 6 | 47487762 | CD2AP | A/G | 0.266 | 1.10 (1.07–1.14) | 0 | 1.09 (1.04–1.15) | 4.1 × 10−4 | 1.10 (1.07–1.13) | 5.2 × 10−11 | 0, 9 × 10−1 | Known GWAS-defined associated genes |
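The zeros in the `Stage1_MetaP` column above are most likely a display effect rather than lost data: `kable()` rounds numeric columns to a limited number of digits, and values as small as 7.7e-15 round to 0. A minimal sketch, using base R instead of `str_replace()`, that applies the same replacement to a small vector and then formats the result in scientific notation so the magnitudes stay visible. The names `p_text`, `to_numeric_p`, `times`, and `minus` are hypothetical; the three values are the Stage 1 P values from the rows shown earlier.

```r
# The scraped text uses the multiplication sign (U+00D7) and the Unicode minus (U+2212).
times <- "\u00d7"
minus <- "\u2212"
p_text <- paste0(c("7.7", "1.7", "3.1"), " ", times, " 10", minus, c("15", "26", "8"))

# Replace " x 10-" (in its Unicode form) with "e-" and convert to numeric.
to_numeric_p <- function(x) {
  as.numeric(sub(paste0(" ", times, " 10", minus), "e-", x, fixed = TRUE))
}
p <- to_numeric_p(p_text)

# Rounding to 7 decimal places turns every value into 0 ...
print(round(p, 7))
# ... while scientific formatting keeps the magnitude.
print(format(p, scientific = TRUE))
```

The same helper could be applied to `Stage2_MetaP` and `Overall_MetaP`, which the code above leaves as character columns.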
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4
## [5] readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1
## [9] tidyverse_1.2.1 knitr_1.20 rvest_0.3.2 xml2_1.2.0
##
## loaded via a namespace (and not attached):
## [1] xfun_0.1 reshape2_1.4.3 haven_1.1.1
## [4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
## [7] yaml_2.1.18 rlang_0.2.0.9000 pillar_1.2.1
## [10] foreign_0.8-69 glue_1.2.0 selectr_0.3-2
## [13] modelr_0.1.1 readxl_1.0.0 bindrcpp_0.2
## [16] bindr_0.1.1 plyr_1.8.4 munsell_0.4.3
## [19] blogdown_0.5 gtable_0.2.0 cellranger_1.1.0
## [22] psych_1.7.8 evaluate_0.10.1 parallel_3.4.3
## [25] curl_3.1 highr_0.6 broom_0.4.3
## [28] Rcpp_0.12.16 backports_1.1.2 scales_0.5.0.9000
## [31] jsonlite_1.5 mnormt_1.5-5 hms_0.4.2
## [34] digest_0.6.15 stringi_1.1.7 bookdown_0.7
## [37] grid_3.4.3 rprojroot_1.3-2 cli_1.0.0
## [40] tools_3.4.3 magrittr_1.5 lazyeval_0.2.1
## [43] crayon_1.3.4 pkgconfig_2.0.1 lubridate_1.7.3
## [46] assertthat_0.2.0 rmarkdown_1.9 httr_1.3.1
## [49] R6_2.2.2 nlme_3.1-131.1 compiler_3.4.3