What do you do when you want to use results from the literature to anchor your own analysis? We'll go through a practical scenario: scraping an HTML table from a Nature Genetics article into R and wrangling the data into a useful format.

01. Scraping an HTML table from a webpage

#load packages
library("rvest")
library("knitr")
library(tidyverse)
#scraping web page
url <- "https://www.nature.com/articles/ng.2802/tables/2"

#====🔥find where the table lives on this webpage====
# XPath copied from the browser inspector; it may need updating if the page layout changes
table_path <- '//*[@id="content"]/div/div/figure/div[1]/div/div[1]/table'
#get the table
nature_genetics_table2 <- url %>%
  read_html() %>%
  html_nodes(xpath = table_path) %>%
  html_table(fill = TRUE) %>% .[[1]]
#the first few rows of the table
kable(nature_genetics_table2[1:4,])
| SNPa | Chr. | Positionb | Closest genec | Major/minor alleles | MAFd | Stage 1 | Stage 1 | Stage 2 | Stage 2 | Overall | Overall | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNPa | Chr. | Positionb | Closest genec | Major/minor alleles | MAFd | OR (95% CI)e | Meta P value | OR (95% CI)e | Meta P value | OR (95% CI)e | Meta P value | I2 (%), P valuef |
| Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes | Known GWAS-defined associated genes |
| rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 7.7 × 10−15 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 |
| rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 1.7 × 10−26 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 |
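
Why does one printed row repeat the same phrase in every column? A plausible explanation is that the article table contains a section row whose single cell spans all columns, and html_table() copies a spanned cell into every column it covers. The sketch below reproduces this on a tiny made-up HTML table, so it runs without network access. The `toy_html` string, the id "toy" and the three columns are invented for illustration, and a recent rvest version is assumed.

```r
library(rvest)

# Hypothetical miniature of the article table: one header row, one section
# row spanning all three columns, and two data rows.
toy_html <- '
<html><body>
<table id="toy">
  <tr><th>SNP</th><th>Chr</th><th>Gene</th></tr>
  <tr><td colspan="3">Known genes</td></tr>
  <tr><td>rs6656401</td><td>1</td><td>CR1</td></tr>
  <tr><td>rs6733839</td><td>2</td><td>BIN1</td></tr>
</table>
</body></html>'

# Same pipeline as above, but parsing a string instead of a URL.
toy_table <- read_html(toy_html) %>%
  html_nodes(xpath = '//table[@id="toy"]') %>%
  html_table() %>%
  .[[1]]

# The spanning cell is repeated across the row, so the row holds a single
# distinct value.
section_cells <- unlist(toy_table[1, ])
print(toy_table)
print(unique(section_cells))
```

Selecting the table by an attribute such as its id is usually less brittle than a long positional XPath, but whether the real page offers a stable attribute has not been checked here.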

02. Making messy data useful

Cleaning up the rows

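Some rows are section headers rather than data: every cell in such a row contains exactly the same text. Find their row numbers first.

# a row with only one distinct value across all columns is a section header
v <- which(apply(nature_genetics_table2, 1, function(x) length(unique(unlist(x)))) == 1)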
v
## [1]  2 12 18
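
To see what this test does, here is a minimal base R sketch on a made-up data frame (`toy` and its values are invented for illustration). It counts the distinct values in each row and flags the rows where that count is 1.

```r
# Hypothetical stand-in for the scraped table: row 1 repeats the header,
# rows 2 and 5 are section headers, the rest are data rows.
toy <- data.frame(
  SNP  = c("SNP", "Known genes", "rs1", "rs2", "New genes", "rs3"),
  Gene = c("Gene", "Known genes", "CR1", "BIN1", "New genes", "HLA"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each row.
n_distinct_per_row <- apply(toy, 1, function(x) length(unique(x)))

# Rows with exactly one distinct value are treated as section headers.
section_rows <- which(n_distinct_per_row == 1)
print(section_rows)
```

One caveat: a genuine data row whose cells all happen to be identical would also be flagged. With thirteen differently typed columns that is unlikely, but it is worth inspecting the flagged rows before dropping them.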

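Splitting the table at the section-header rows

# cumsum() starts a new group at each section-header row; group "0" holds the repeated column header
nature_genetics_table2_list = split(nature_genetics_table2, cumsum(1:nrow(nature_genetics_table2) %in% v))
# keep the three sections, store each section title in a new column, then drop the title row
nature_genetics_table2_list = lapply(nature_genetics_table2_list[2:4], function(y) {
  y$Description = unique(as.character(y[1, ]))
  y[-1, ]
})

# rbind the three sub-tables back together
nature_genetics_table2_clean = do.call("rbind", nature_genetics_table2_list)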

kable(nature_genetics_table2_clean[1:3,])
| | SNPa | Chr. | Positionb | Closest genec | Major/minor alleles | MAFd | Stage 1 | Stage 1 | Stage 2 | Stage 2 | Overall | Overall | Overall | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3 | rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 7.7 × 10−15 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 | Known GWAS-defined associated genes |
| 1.4 | rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 1.7 × 10−26 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 | Known GWAS-defined associated genes |
| 1.5 | rs10948363 | 6 | 47487762 | CD2AP | A/G | 0.266 | 1.10 (1.07–1.14) | 3.1 × 10−8 | 1.09 (1.04–1.15) | 4.1 × 10−4 | 1.10 (1.07–1.13) | 5.2 × 10−11 | 0, 9 × 10−1 | Known GWAS-defined associated genes |
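
The split-and-recombine step can be followed more easily on a small example. The base R sketch below uses a made-up data frame (`toy`, `section_rows` and the values are invented for illustration) and the same cumsum() grouping idea. It differs from the code above in two small ways that are my own choices: it drops the leading group with `[-1]` instead of hard-coding `[2:4]`, so it does not assume exactly three sections, and it resets the row names at the end.

```r
# Hypothetical stand-in for the scraped table: row 1 repeats the header,
# rows 2 and 5 are section headers.
toy <- data.frame(
  SNP  = c("SNP", "Known genes", "rs1", "rs2", "New genes", "rs3"),
  Gene = c("Gene", "Known genes", "CR1", "BIN1", "New genes", "HLA"),
  stringsAsFactors = FALSE
)
section_rows <- c(2, 5)

# The running count increases by one at every section row, so each section
# row and the data rows below it share a group id; row 1 lands in group 0.
group_id <- cumsum(seq_len(nrow(toy)) %in% section_rows)

pieces <- split(toy, group_id)

# Drop group 0 (the repeated header), move each section title into a new
# column, then remove the title row itself.
pieces <- lapply(pieces[-1], function(piece) {
  piece$Description <- piece[1, 1]
  piece[-1, , drop = FALSE]
})

toy_clean <- do.call(rbind, pieces)
rownames(toy_clean) <- NULL  # discard the "1.3"-style names created by rbind
print(toy_clean)
```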

03. Fixing column names

colnames(nature_genetics_table2_clean) <- c(
  "SNP", "Chr", "Position", "Closest gene", "Major/minor alleles", "MAF",
  "Stage1_OR", "Stage1_MetaP", "Stage2_OR", "Stage2_MetaP",
  "Overall_OR", "Overall_MetaP", "I2_Percent/P", "Description"
)
colnames(nature_genetics_table2_clean)
##  [1] "SNP"                 "Chr"                 "Position"           
##  [4] "Closest gene"        "Major/minor alleles" "MAF"                
##  [7] "Stage1_OR"           "Stage1_MetaP"        "Stage2_OR"          
## [10] "Stage2_MetaP"        "Overall_OR"          "Overall_MetaP"      
## [13] "I2_Percent/P"        "Description"

04. Making a character variable into a numeric variable

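# " × 10−" -> "e-" (the scraped text uses the Unicode minus sign "−", not a hyphen)
# only Stage1_MetaP is converted here
nature_genetics_table2_clean$Stage1_MetaP <-
  str_replace(nature_genetics_table2_clean$Stage1_MetaP, " × 10−", "e-") %>% as.numeric()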
kable(nature_genetics_table2_clean[1:3,])
| | SNP | Chr | Position | Closest gene | Major/minor alleles | MAF | Stage1_OR | Stage1_MetaP | Stage2_OR | Stage2_MetaP | Overall_OR | Overall_MetaP | I2_Percent/P | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3 | rs6656401 | 1 | 207692049 | CR1 | G/A | 0.197 | 1.17 (1.12–1.22) | 0 | 1.21 (1.14–1.28) | 7.9 × 10−11 | 1.18 (1.14–1.22) | 5.7 × 10−24 | 0, 7.8 × 10−1 | Known GWAS-defined associated genes |
| 1.4 | rs6733839 | 2 | 127892810 | BIN1 | C/T | 0.409 | 1.21 (1.17–1.25) | 0 | 1.24 (1.18–1.29) | 3.4 × 10−19 | 1.22 (1.18–1.25) | 6.9 × 10−44 | 28, 6.1 × 10−2 | Known GWAS-defined associated genes |
| 1.5 | rs10948363 | 6 | 47487762 | CD2AP | A/G | 0.266 | 1.10 (1.07–1.14) | 0 | 1.09 (1.04–1.15) | 4.1 × 10−4 | 1.10 (1.07–1.13) | 5.2 × 10−11 | 0, 9 × 10−1 | Known GWAS-defined associated genes |
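
The 0 values shown for Stage1_MetaP are most likely a display effect rather than lost data: kable() rounds numeric columns (its digits argument is passed to round()), and P values this small round to zero at the default precision. The base R sketch below illustrates this, and also shows how the conversion could be wrapped in a small helper that could be applied to the other P value columns. The helper name `to_numeric_p` and the sample vector are invented for illustration, and sub() with `fixed = TRUE` stands in for str_replace().

```r
# Hypothetical helper: turn text such as "7.7 × 10−15" into a number.
# The pattern contains the Unicode minus sign (U+2212) used in the scraped text.
to_numeric_p <- function(x) {
  as.numeric(sub(" × 10−", "e-", x, fixed = TRUE))
}

# Sample values copied from the Stage 1 "Meta P value" column above.
p_text <- c("7.7 × 10−15", "1.7 × 10−26", "3.1 × 10−8")
p_num <- to_numeric_p(p_text)

# The numbers are intact when printed in scientific notation ...
p_display <- format(p_num, scientific = TRUE)
print(p_display)

# ... but rounding to 7 decimal places, as a table printer might, gives zeros.
p_rounded <- round(p_num, 7)
print(p_rounded)
```

To keep small P values visible in a printed table, one option is to format the column as text in scientific notation just before printing, while keeping the numeric column for analysis.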
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] forcats_0.3.0   stringr_1.3.0   dplyr_0.7.4     purrr_0.2.4    
##  [5] readr_1.1.1     tidyr_0.8.0     tibble_1.4.2    ggplot2_2.2.1  
##  [9] tidyverse_1.2.1 knitr_1.20      rvest_0.3.2     xml2_1.2.0     
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.1          reshape2_1.4.3    haven_1.1.1      
##  [4] lattice_0.20-35   colorspace_1.3-2  htmltools_0.3.6  
##  [7] yaml_2.1.18       rlang_0.2.0.9000  pillar_1.2.1     
## [10] foreign_0.8-69    glue_1.2.0        selectr_0.3-2    
## [13] modelr_0.1.1      readxl_1.0.0      bindrcpp_0.2     
## [16] bindr_0.1.1       plyr_1.8.4        munsell_0.4.3    
## [19] blogdown_0.5      gtable_0.2.0      cellranger_1.1.0 
## [22] psych_1.7.8       evaluate_0.10.1   parallel_3.4.3   
## [25] curl_3.1          highr_0.6         broom_0.4.3      
## [28] Rcpp_0.12.16      backports_1.1.2   scales_0.5.0.9000
## [31] jsonlite_1.5      mnormt_1.5-5      hms_0.4.2        
## [34] digest_0.6.15     stringi_1.1.7     bookdown_0.7     
## [37] grid_3.4.3        rprojroot_1.3-2   cli_1.0.0        
## [40] tools_3.4.3       magrittr_1.5      lazyeval_0.2.1   
## [43] crayon_1.3.4      pkgconfig_2.0.1   lubridate_1.7.3  
## [46] assertthat_0.2.0  rmarkdown_1.9     httr_1.3.1       
## [49] R6_2.2.2          nlme_3.1-131.1    compiler_3.4.3