What do you do when you want to use results from the literature to anchor your own analysis? We'll go through a practical scenario: scraping an HTML table from a Nature Genetics article into R and wrangling the data into a useful format.

01. Scraping an HTML table from a webpage

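# load packages
library(rvest)    # read_html(), html_nodes(), html_table(); also re-exports %>%
library(stringr)  # str_replace(), used in step 04

# scraping web page
url <- "https://www.nature.com/articles/ng.2802/tables/2"

#==== 🔥 find where the table lives on this webpage ====
# NOTE: the original XPath is not shown here. "//table" is a placeholder
# assumption that matches every <table> on the page; replace it with the
# exact XPath copied from your browser's inspector if the page has several.
table_path <- "//table"

# get the table
nature_genetics_table2 <- url %>%
  read_html() %>%
  html_nodes(xpath = table_path) %>%
  html_table(fill = TRUE) %>% .[[1]]
# the first few rows of the table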
SNPa Chr. Positionb Closest genec Major/minor alleles MAFd Stage 1 Stage 1 Stage 2 Stage 2 Overall Overall Overall
SNPa Chr. Positionb Closest genec Major/minor alleles MAFd OR (95% CI)e Meta P value OR (95% CI)e Meta P value OR (95% CI)e Meta P value I2 (%), P valuef
Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes
rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 7.7 × 10−15 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1
rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 1.7 × 10−26 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2
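
One way to work out which XPath to pass as table_path is to list every table on the page and look at its size before picking one. The sketch below does this on a small inline HTML snippet so it runs offline; the snippet, the toy_path value and the variable names are made up for illustration and are not the structure of the Nature Genetics page.

```r
library(rvest)

# Toy page (an assumption, not the real article): one navigation table
# and one data table wrapped in a <figure>.
page <- read_html('
<html><body>
  <table id="nav"><tr><td>Home</td></tr></table>
  <figure>
    <table class="data">
      <tr><th>SNP</th><th>Chr.</th></tr>
      <tr><td>rs6656401</td><td>1</td></tr>
      <tr><td>rs6733839</td><td>2</td></tr>
    </table>
  </figure>
</body></html>')

# Parse every table and inspect the dimensions to spot the one you want.
tables <- page %>% html_elements(xpath = "//table") %>% html_table()
lapply(tables, dim)

# A narrower XPath (hypothetical, it fits this toy page only) selects the data table.
toy_path <- "//figure//table"
snp_table <- page %>%
  html_elements(xpath = toy_path) %>%
  html_table() %>% .[[1]]
snp_table
```

On a real page the same pattern applies: start broad with "//table", check the dimensions, then tighten the XPath until only the table you need is matched.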

02. Making messy data useful

Cleaning up the rows

Some rows are section labels rather than data: every cell in such a row contains exactly the same text. Find the indices of those rows:

v=which(apply(nature_genetics_table2,1, function(x) length(unique(unlist(x))) )==1)
## [1]  2 12 18
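
To see why counting unique values per row picks out the label rows, here is a minimal sketch on a toy data frame. The toy data and the name label_rows are illustrative assumptions; only the apply/unique idea comes from the line above.

```r
# Toy table: rows 1 and 4 are section labels repeated across every column.
toy <- data.frame(
  SNP = c("Known genes", "rs6656401", "rs6733839", "New loci", "rs10948363"),
  Chr = c("Known genes", "1", "2", "New loci", "6"),
  MAF = c("Known genes", "0.197", "0.409", "New loci", "0.266"),
  stringsAsFactors = FALSE
)

# For each row, count the distinct cell values; a label row has exactly one.
n_unique <- apply(toy, 1, function(x) length(unique(x)))
label_rows <- unname(which(n_unique == 1))
label_rows
```

A genuine data row could in principle also have identical values in every column, so it is worth eyeballing the matched rows before dropping them.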

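Split the table at those rows

# cumsum() starts a new group at every label row found in v
nature_genetics_table2_list = split(nature_genetics_table2, cumsum(1:nrow(nature_genetics_table2) %in% v))
# groups 2:4 each begin with a label row; the first group holds only the repeated header row
nature_genetics_table2_list = lapply(nature_genetics_table2_list[2:4], function(y) {
  y$Description = unique(as.character(y[1, ]))  # keep the label text as a new column
  y[-1, ]                                       # then drop the label row itself
})

# rbind the three tables
nature_genetics_table2_clean = do.call("rbind", nature_genetics_table2_list)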

SNPa Chr. Positionb Closest genec Major/minor alleles MAFd Stage 1 Stage 1 Stage 2 Stage 2 Overall Overall Overall Description
1.3 rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 7.7 × 10−15 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1 Known GWAS-defined associated genes
1.4 rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 1.7 × 10−26 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2 Known GWAS-defined associated genes
1.5 rs10948363 6 47487762 CD2AP A/G 0.266 1.10 (1.07–1.14) 3.1 × 10−8 1.09 (1.04–1.15) 4.1 × 10−4 1.10 (1.07–1.13) 5.2 × 10−11 0, 9 × 10−1 Known GWAS-defined associated genes
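
The split/cumsum step is easier to follow on a small example. The sketch below uses a made-up five-row table with two label rows; the names toy, label_rows, group_id and clean are illustrative only. Unlike the real table, the toy one has no leftover header row before the first label, so no leading group needs to be skipped.

```r
toy <- data.frame(
  SNP = c("Known genes", "rs6656401", "rs6733839", "New loci", "rs10948363"),
  Chr = c("Known genes", "1", "2", "New loci", "6"),
  stringsAsFactors = FALSE
)
label_rows <- c(1, 4)  # rows whose cells all hold the same text

# TRUE at each label row; cumsum() turns that into a running group number.
group_id <- cumsum(seq_len(nrow(toy)) %in% label_rows)
group_id

# One data frame per group, each starting with its label row.
pieces <- split(toy, group_id)

# Move the label into a Description column, then drop the label row.
pieces <- lapply(pieces, function(y) {
  y$Description <- y$SNP[1]
  y[-1, ]
})

clean <- do.call(rbind, pieces)
clean
```

The row names produced by rbind combine the group name and the original row number, which is where names such as 1.3 in the output above come from.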

03. Fixing column names

colnames(nature_genetics_table2_clean) <- c("SNP", "Chr", "Position", "Closest gene", "Major/minor alleles", "MAF", "Stage1_OR", "Stage1_MetaP", "Stage2_OR", "Stage2_MetaP", "Overall_OR", "Overall_MetaP", "I2_Percent/P", "Description")
##  [1] "SNP"                 "Chr"                 "Position"           
##  [4] "Closest gene"        "Major/minor alleles" "MAF"                
##  [7] "Stage1_OR"           "Stage1_MetaP"        "Stage2_OR"          
## [10] "Stage2_MetaP"        "Overall_OR"          "Overall_MetaP"      
## [13] "I2_Percent/P"        "Description"

04. Making a character variable into a numeric variable

# " × 10-" -> "e-"
# note: the scraped text uses the Unicode minus sign "−", not an ASCII hyphen "-"
nature_genetics_table2_clean$Stage1_MetaP <-
  str_replace(nature_genetics_table2_clean$Stage1_MetaP, " × 10−", "e-") %>% as.numeric()
SNP Chr Position Closest gene Major/minor alleles MAF Stage1_OR Stage1_MetaP Stage2_OR Stage2_MetaP Overall_OR Overall_MetaP I2_Percent/P Description
1.3 rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 0 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1 Known GWAS-defined associated genes
1.4 rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 0 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2 Known GWAS-defined associated genes
1.5 rs10948363 6 47487762 CD2AP A/G 0.266 1.10 (1.07–1.14) 0 1.09 (1.04–1.15) 4.1 × 10−4 1.10 (1.07–1.13) 5.2 × 10−11 0, 9 × 10−1 Known GWAS-defined associated genes
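
The 0 shown under Stage1_MetaP is most likely display rounding rather than a true zero, since a string like "7.7e-15" converts to a small positive number. The sketch below checks this and applies the same conversion to several P-value columns at once. It uses base R's gsub() instead of str_replace(), and the helper name sci_to_numeric and the toy data frame are assumptions made for illustration, with the P values copied from the first two rows of the table.

```r
# Hypothetical helper (not from the original post): turn strings such as
# "7.7 × 10−15" into numbers. \u00d7 is the multiplication sign and
# \u2212 is the Unicode minus sign used on the web page.
sci_to_numeric <- function(x) {
  as.numeric(gsub(" \u00d7 10\u2212", "e-", x, fixed = TRUE))
}

toy <- data.frame(
  Stage1_MetaP = c("7.7 \u00d7 10\u221215", "1.7 \u00d7 10\u221226"),
  Stage2_MetaP = c("7.9 \u00d7 10\u221211", "3.4 \u00d7 10\u221219"),
  stringsAsFactors = FALSE
)

# Convert every P-value column in one pass.
p_cols <- c("Stage1_MetaP", "Stage2_MetaP")
toy[p_cols] <- lapply(toy[p_cols], sci_to_numeric)

# Printing in scientific notation confirms the values are not zero.
format(toy$Stage1_MetaP, scientific = TRUE)
```

The same pattern should extend to Stage2_MetaP and Overall_MetaP in the cleaned table, provided those columns use the same " × 10−" formatting.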
