How To Get Genomewide Position-Specific Scores
load packages
#source("http://bioconductor.org/biocLite.R")
#biocLite("GenomicScores")
library(GenomicScores)
Retrieval of genomic scores through annotation packages
There are currently four different annotation packages that store genomic scores and can be accessed using the GenomicScores package
Annotation packages Description 1. phastCons100way.UCSC.hg19 phastCons scores derived from the alignment of the human genome (hg19) to other 99 vertebrate species. 2. phastCons100way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 99 vertebrate species. 3. phastCons7way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 6 mammal species. 4. fitCons.UCSC.hg19 fitCons scores: fitness consequences of functional annotation for the human genome (hg19).
To retrieve genomic scores for specific positions we should use the function scores()
library(phastCons100way.UCSC.hg19)
gsco <- phastCons100way.UCSC.hg19
class(gsco)
## [1] "GScores"
## attr(,"package")
## [1] "GenomicScores"
scores(gsco, GRanges(seqnames="chr22",
IRanges(start=50967020:50967021, width=1)))
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | scores
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr22 [50967020, 50967020] * | 1
## [2] chr22 [50967021, 50967021] * | 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gsco
## GScores object
## # organism: Homo sapiens (UCSC)
## # provider: UCSC
## # provider version: 09Feb2014
## # download date: Mar 17, 2017
## # loaded sequences: chr19_gl000208_random, chr22
## # maximum abs. error: 0.05
#citation(gsco) now cann't run
provider(gsco)
## [1] "UCSC"
providerVersion(gsco)
## [1] "09Feb2014"
organism(gsco)
## [1] "Homo sapiens"
seqlevelsStyle(gsco)
## [1] "UCSC"
Retrieval of genomic scores through AnnotationHub resources
Another way to retrieve genomic scores is by using the AnnotationHub, which is a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. A Bioconductor AnnotationHub web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.
gsco <- getGScores("phastCons7way.UCSC.hg38")
scores(gsco, GRanges(seqnames="chr22",IRanges(start=20967020:20967025, width=1)))
Retrieval of multiple scores per genomic position
Summarization of genomic scores
library(dplyr)
gsco <- phastCons100way.UCSC.hg19
gr <- GRanges(seqnames="chr22", IRanges(start=50967020:50967025, width=1))
#mean
scores(gsco, gr) %>%
.$scores %>%
mean()
## [1] 0.8
#median
scores(gsco, gr) %>%
.$scores %>%
median()
## [1] 1
#min
scores(gsco, gr) %>%
.$scores %>%
min()
## [1] 0
scores(gsco, gr) %>%
.$scores %>%
hist()
Annotating variants with genomic scores
A typical use case of the GenomicScores package is in the context of annotating variants with genomic scores, such as phastCons conservation scores. For this purpose, we load the VariantAnnotaiton and TxDb.Hsapiens.UCSC.hg19.knownGene packages. The former will allow us to read a VCF file and annotate it, and the latter contains the gene annotations from UCSC that will be used in this process.
Let’s load one of the sample VCF files that form part of the VariantAnnotation package.
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevelsStyle(vcf)
## [1] "NCBI" "Ensembl"
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(txdb)
## [1] "UCSC"
Because the chromosome nomenclature from the VCF file (NCBI) is different from the one with the gene annotations (UCSC) we use the seqlevelsStyle() function to force our variants having the chromosome nomenclature of the gene annotations.
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
We annotate the location of variants using the function locateVariants() from the VariantAnnotation package.
loc <- locateVariants(vcf, txdb, AllVariants())
loc[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## chr22 [50300078, 50300078] - | intron 10763 10763
## chr22 [50300086, 50300086] - | intron 10755 10755
## chr22 [50300101, 50300101] - | intron 10740 10740
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 75253 79087
## 2 75253 79087
## 3 75253 79087
## FOLLOWID
## <CharacterList>
##
##
##
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
table(loc$LOCATION)
##
## spliceSite intron fiveUTR threeUTR coding intergenic
## 11 22572 309 1368 2822 2867
## promoter
## 2864
Annotate phastCons conservation scores
on the variants and store those annotations as an additional metadata column of the GRanges object. For this specific purpose we should the argument scores.only=TRUE that makes the scores() method to return the genomic scores as a numeric vector instead as a metadata column in the input ranges object.
loc$PHASTCONS <- scores(gsco, loc, scores.only=TRUE)
loc[1:3]
## GRanges object with 3 ranges and 10 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## chr22 [50300078, 50300078] - | intron 10763 10763
## chr22 [50300086, 50300086] - | intron 10755 10755
## chr22 [50300101, 50300101] - | intron 10740 10740
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 75253 79087
## 2 75253 79087
## 3 75253 79087
## FOLLOWID PHASTCONS
## <CharacterList> <numeric>
## 0.0
## 0.1
## 0.0
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions
x <- split(loc$PHASTCONS, loc$LOCATION)
mask <- elementNROWS(x) > 0
boxplot(x[mask], ylab="phastCons score", las=1, cex.axis=1.2, cex.lab=1.5, col="gray")
points(1:length(x[mask])+0.25,
sapply(x[mask], mean, na.rm=TRUE),
pch=23, bg="red")