load packages

#source("http://bioconductor.org/biocLite.R")
#biocLite("GenomicScores")
library(GenomicScores)

Retrieval of genomic scores through annotation packages

There are currently four different annotation packages that store genomic scores and can be accessed using the GenomicScores package

Annotation packages Description 1. phastCons100way.UCSC.hg19 phastCons scores derived from the alignment of the human genome (hg19) to other 99 vertebrate species. 2. phastCons100way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 99 vertebrate species. 3. phastCons7way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 6 mammal species. 4. fitCons.UCSC.hg19 fitCons scores: fitness consequences of functional annotation for the human genome (hg19).

To retrieve genomic scores for specific positions we should use the function scores()

library(phastCons100way.UCSC.hg19)
gsco <- phastCons100way.UCSC.hg19
class(gsco)
## [1] "GScores"
## attr(,"package")
## [1] "GenomicScores"
scores(gsco, GRanges(seqnames="chr22",
                     IRanges(start=50967020:50967021, width=1)))
## GRanges object with 2 ranges and 1 metadata column:
##       seqnames               ranges strand |    scores
##          <Rle>            <IRanges>  <Rle> | <numeric>
##   [1]    chr22 [50967020, 50967020]      * |         1
##   [2]    chr22 [50967021, 50967021]      * |         1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gsco
## GScores object 
## # organism: Homo sapiens (UCSC)
## # provider: UCSC
## # provider version: 09Feb2014
## # download date: Mar 17, 2017
## # loaded sequences: chr19_gl000208_random, chr22
## # maximum abs. error: 0.05
#citation(gsco) now cann't run
provider(gsco)
## [1] "UCSC"
providerVersion(gsco)
## [1] "09Feb2014"
organism(gsco)
## [1] "Homo sapiens"
seqlevelsStyle(gsco)
## [1] "UCSC"

Retrieval of genomic scores through AnnotationHub resources

Another way to retrieve genomic scores is by using the AnnotationHub, which is a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. A Bioconductor AnnotationHub web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.

gsco <- getGScores("phastCons7way.UCSC.hg38")
scores(gsco, GRanges(seqnames="chr22",IRanges(start=20967020:20967025, width=1)))

Retrieval of multiple scores per genomic position

Summarization of genomic scores

library(dplyr)
gsco <- phastCons100way.UCSC.hg19
gr <- GRanges(seqnames="chr22", IRanges(start=50967020:50967025, width=1))

#mean
scores(gsco, gr) %>% 
  .$scores %>% 
  mean()
## [1] 0.8
#median
scores(gsco, gr) %>% 
  .$scores %>% 
  median()
## [1] 1
#min
scores(gsco, gr) %>% 
  .$scores %>% 
  min()
## [1] 0
scores(gsco, gr) %>% 
  .$scores %>% 
  hist()

Annotating variants with genomic scores

A typical use case of the GenomicScores package is in the context of annotating variants with genomic scores, such as phastCons conservation scores. For this purpose, we load the VariantAnnotaiton and TxDb.Hsapiens.UCSC.hg19.knownGene packages. The former will allow us to read a VCF file and annotate it, and the latter contains the gene annotations from UCSC that will be used in this process.

Let’s load one of the sample VCF files that form part of the VariantAnnotation package.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevelsStyle(vcf)
## [1] "NCBI"    "Ensembl"
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(txdb)
## [1] "UCSC"

Because the chromosome nomenclature from the VCF file (NCBI) is different from the one with the gene annotations (UCSC) we use the seqlevelsStyle() function to force our variants having the chromosome nomenclature of the gene annotations.

seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

We annotate the location of variants using the function locateVariants() from the VariantAnnotation package.

loc <- locateVariants(vcf, txdb, AllVariants())
loc[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
##    seqnames               ranges strand | LOCATION  LOCSTART    LOCEND
##       <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
##       chr22 [50300078, 50300078]      - |   intron     10763     10763
##       chr22 [50300086, 50300086]      - |   intron     10755     10755
##       chr22 [50300101, 50300101]      - |   intron     10740     10740
##      QUERYID        TXID         CDSID      GENEID       PRECEDEID
##    <integer> <character> <IntegerList> <character> <CharacterList>
##            1       75253                     79087                
##            2       75253                     79087                
##            3       75253                     79087                
##           FOLLOWID
##    <CharacterList>
##                   
##                   
##                   
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths
table(loc$LOCATION)
## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##         11      22572        309       1368       2822       2867 
##   promoter 
##       2864

Annotate phastCons conservation scores

on the variants and store those annotations as an additional metadata column of the GRanges object. For this specific purpose we should the argument scores.only=TRUE that makes the scores() method to return the genomic scores as a numeric vector instead as a metadata column in the input ranges object.

loc$PHASTCONS <- scores(gsco, loc, scores.only=TRUE)
loc[1:3]
## GRanges object with 3 ranges and 10 metadata columns:
##    seqnames               ranges strand | LOCATION  LOCSTART    LOCEND
##       <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
##       chr22 [50300078, 50300078]      - |   intron     10763     10763
##       chr22 [50300086, 50300086]      - |   intron     10755     10755
##       chr22 [50300101, 50300101]      - |   intron     10740     10740
##      QUERYID        TXID         CDSID      GENEID       PRECEDEID
##    <integer> <character> <IntegerList> <character> <CharacterList>
##            1       75253                     79087                
##            2       75253                     79087                
##            3       75253                     79087                
##           FOLLOWID PHASTCONS
##    <CharacterList> <numeric>
##                          0.0
##                          0.1
##                          0.0
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths

Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions

x <- split(loc$PHASTCONS, loc$LOCATION)
mask <- elementNROWS(x) > 0
boxplot(x[mask], ylab="phastCons score", las=1, cex.axis=1.2, cex.lab=1.5, col="gray")
points(1:length(x[mask])+0.25, 
       sapply(x[mask], mean, na.rm=TRUE),
       pch=23, bg="red")