This tutorial provides an introduction to DNA methylation analysis
using the minfi
package in R. It covers the following key steps:
The following analysis is based on DNA methylation data obtained from the GEO Series GSE246337. The dataset includes genome-wide methylation profiles generated using the Illumina Infinium MethylationEPIC v2.0 array, derived from blood samples.
A total of 50 samples were selected for this analysis. Sample-level metadata was extracted and cleaned from the provided Series Matrix text file.
Dataset Summary
The aim of this analysis is to identify differentially methylated CpG
sites associated with Age, while adjusting for
biological covariates including Sex, Race, and estimated
cell-type proportions. This is achieved using a linear model
framework from the package limma
.
The following code installs and loads the required packages for this analysis.
# Install and load required packages
instalload <- function(cran=NULL, bioc=NULL) {
pkgs_cran <- cran[!(cran %in% installed.packages()[, "Package"])]
if (length(pkgs_cran) > 0) {
install.packages(pkgs_cran, dependencies = T)
}
if (!requireNamespace("BiocManager", quietly = T))
install.packages("BiocManager")
pkgs_bioc <- bioc[!(bioc %in% installed.packages()[, "Package"])]
if (length(pkgs_bioc) > 0) {
BiocManager::install(pkgs_bioc, ask = F)
}
pkgs <- c(cran, bioc)
invisible(lapply(pkgs, require, character.only = T))
cat("Packages are installed and loaded successfully.\n")
}
# CRAN & Bioconductor packages
cran <- c("ggplot2", "gridExtra", "RColorBrewer",
"stringr", "rstudioapi", "knitr", "qqman")
bioc <- c("minfi",
"IlluminaHumanMethylationEPICv2anno.20a1.hg38",
"IlluminaHumanMethylationEPICv2manifest",
"FlowSorted.Blood.EPIC", "limma", "Gviz", "ENmix")
instalload(cran, bioc)
## Packages are installed and loaded successfully.
The following code sets the working directory to the location of this R Markdown file.
## [1] "C:/Users/ruizpint/OneDrive - Queensland University of Technology/QUT/PipelineScripts/2025CGPHNeurogenomicsWorkshop"
## 'data.frame': 50 obs. of 9 variables:
## $ SID : chr "BoA1" "BoA2" "BoA3" "BoA4" ...
## $ Tissue : chr "blood" "blood" "blood" "blood" ...
## $ Sex : chr "M" "F" "M" "F" ...
## $ Race : chr "White" "Asian" "Asian" "White" ...
## $ Ethnicity : chr "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
## $ Age : int 31 79 60 59 22 47 78 89 24 70 ...
## $ GID : chr "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
## $ Description: chr "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
## $ Basename : chr "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
The dataset contains 50 samples. From the above, we can see it includes phenotype and technical information relevant to DNA methylation analysis.
In our analysis, we focus on Age as the outcome and for now we can consider as covariates Sex and Race. Variables as GID, Description, and Tissue are used for tracking and data merging, but are not directly included in the statistical model. SID and Basename are identifiers for each sample, ensuring we can link the metadata to the raw data files. Tissue is constant across samples, so it does not need to be included in the model.
# Load the raw methylation data from the idat files
RGSet <- read.metharray.exp(base="data/idats",
targets=targets,
extended=F,
recursive=F,
verbose=F)
str(RGSet)
## Formal class 'RGChannelSet' [package "minfi"] with 6 slots
## ..@ annotation : Named chr [1:2] "IlluminaHumanMethylationEPICv2" "20a1.hg38"
## .. ..- attr(*, "names")= chr [1:2] "array" "annotation"
## ..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. ..@ rownames : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## .. .. ..@ nrows : int 50
## .. .. ..@ elementType : chr "ANY"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## .. .. ..@ listData :List of 10
## .. .. .. ..$ SID : chr [1:50] "BoA1" "BoA2" "BoA3" "BoA4" ...
## .. .. .. ..$ Tissue : chr [1:50] "blood" "blood" "blood" "blood" ...
## .. .. .. ..$ Sex : chr [1:50] "M" "F" "M" "F" ...
## .. .. .. ..$ Race : chr [1:50] "White" "Asian" "Asian" "White" ...
## .. .. .. ..$ Ethnicity : chr [1:50] "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
## .. .. .. ..$ Age : int [1:50] 31 79 60 59 22 47 78 89 24 70 ...
## .. .. .. ..$ GID : chr [1:50] "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
## .. .. .. ..$ Description: chr [1:50] "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
## .. .. .. ..$ Basename : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## .. .. .. ..$ filenames : chr [1:50] "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
## ..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
## .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
## .. .. .. .. ..@ listData :List of 2
## .. .. .. .. .. ..$ Green: int [1:1105209, 1:50] 14462 7133 14963 11324 1184 15186 1693 11046 866 1651 ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
## .. .. .. .. .. .. .. ..$ : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## .. .. .. .. .. ..$ Red : int [1:1105209, 1:50] 3269 921 1760 13191 11202 2679 17863 10587 9740 516 ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
## .. .. .. .. .. .. .. ..$ : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## .. .. .. .. ..@ elementType : chr "ANY"
## .. .. .. .. ..@ elementMetadata: NULL
## .. .. .. .. ..@ metadata : list()
## ..@ NAMES : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
## ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## .. .. ..@ rownames : NULL
## .. .. ..@ nrows : int 1105209
## .. .. ..@ elementType : chr "ANY"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## .. .. ..@ listData : Named list()
## ..@ metadata : list()
As for now, the data is in the form of an RGChannelSet (unprocessed
data), which contains the original red and green intensity values
measured by the array. From the above, we can see that the
RGSet
object contains:
This object will be used as input for exploration, quality control and preprocessing steps.
This section is based on the control table used by Illumina to explore expected performance of the assay (Figure 1, 2).
We will generate control plots to assess the quality of the data and
identify any potential issues with the samples using the ENmix
package. The issues can be related to the sample preparation, bisulfite
conversion, hybridization, or other steps in the methylation assay
process.
## Plotting STAINING .jpg
## Plotting EXTENSION .jpg
## Plotting HYBRIDIZATION .jpg
## Plotting TARGET_REMOVAL .jpg
## Plotting BISULFITE_CONVERSION_I .jpg
## Plotting BISULFITE_CONVERSION_II .jpg
## Plotting SPECIFICITY_I .jpg
## Plotting SPECIFICITY_II .jpg
## Plotting NON-POLYMORPHIC .jpg
## Plotting NEGATIVE .jpg
## Plotting NORM_A .jpg
## Plotting NORM_C .jpg
## Plotting NORM_G .jpg
## Plotting NORM_T .jpg
## Plotting NORM_ACGT .jpg
# Create a directory for ENmix control plots
if (!dir.exists("figures/enMix")) {
dir.create("figures/enMix", recursive = TRUE)
}
# Move all .jpg control plots to the target folder
jpgFiles <- list.files(pattern = "\\.jpg$")
file.rename(from = jpgFiles,
to = file.path("figures/enMix", jpgFiles))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
From the inspection of all the controls, one sample (around sample 20) shows noticeably low red signal in the STAINING, BISULFITE CONVERSION II, SPECIFICITY I and SPECIFICITY II. This sample may require further inspection or exclusion.
When working with DNA methylation data in minfi, the data goes through several transformation steps, and each step is stored in a specific type of object (Figure 3). It is important to understand how these object types are connected.
The process begins by reading IDAT files, which are stored in an object called an RGChannelSet. This raw data is then processed using a preprocessing function, which converts it into a MethylSet.
From there, two additional functions —
ratioConvert()
and
mapToGenome()
are used to transform the
MethylSet into a GenomicRatioSet, which is ready for
downstream analysis.
## IlluminaMethylationManifest object
## Annotation
## array: IlluminaHumanMethylationEPICv2
## Number of type I probes: 128271
## Number of type II probes: 808719
## Number of control probes: 635
## Number of SNP type I probes: 24
## Number of SNP type II probes: 41
## DataFrame with 5 rows and 4 columns
## Name chr pos UCSC_RefGene_Name
## <character> <character> <integer> <character>
## cg25324105_BC11 cg25324105_BC11 chr19 37692358 ZNF781;ZNF781;ZNF781..
## cg25383568_TC11 cg25383568_TC11 chr19 38727081 ACTN4;ACTN4
## cg25455143_BC11 cg25455143_BC11 chr19 1591515
## cg25459778_BC11 cg25459778_BC11 chr16 1614581
## cg25487775_BC11 cg25487775_BC11 chr2 161237458
From the overview of the data, we can see that the Illumina EPIC v2 array includes over 900,000 probes designed to assess DNA methylation across the genome. These probes are divided into two types based on their chemistry: Type I and Type II. Each type has different probe designs, and both are used to maximize coverage and sensitivity.
In addition to methylation probes, the array also includes control probes for quality checks and SNP probes to assess sample identity and potential genetic variation.
Each probe is annotated with genomic coordinates (chromosome, position, others) and often linked to nearby genes. This information helps us understand the biological relevance of methylation changes detected during analysis.
Structure of the IlluminaMethylationManifest Object:
Number of type I probes
: 128,271
probes
that use two separate color channels (Red
and
Green
) for methylated and unmethylated signals.Number of type II probes
: 808,719
probes
that use a single color channel (Green
) for both methylated
and unmethylated signals.Number of control probes
: 635
probes
included for quality control and normalisation.Number of SNP type I probes
: 24
probes
designed to detect single nucleotide polymorphisms (SNPs) using the type
I probe structure.Number of SNP type II probes
: 41
probes
designed to detect SNPs using the type II probe structure.At this stage, our data is stored in an RGChannelSet, which contains the raw intensity signals from the red and green channels on the array.
To start analyzing DNA methylation, we need to convert this into a MethylSet, which holds the actual methylated (Meth) and unmethylated (Unmeth) signal values.
The simplest way to do this is with the
preprocessRaw()
function. This function
uses the array design to correctly match the probes and color channels,
allowing it to generate the Meth and Unmeth signals.
Keep in mind that preprocessRaw()
does
not apply any normalisation. It’s meant to be used for
initial quality control.
You will see that the original red and green signal assays are now replaced with Meth and Unmeth signals, preparing the data for the next steps in analysis.
## class: MethylSet
## dim: 936990 50
## metadata(0):
## assays(2): Meth Unmeth
## rownames(936990): cg25324105_BC11 cg25383568_TC11 ...
## ch.12.78471492F_BC21 ch.21.43742285F_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.52.1
## Manifest version: 1.0.0
# Check the structure of the MSet object
meth <- getMeth(MSet)
unmeth <- getUnmeth(MSet)
head(meth[1:5, 1:2])
## GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg25324105_BC11 1352 940
## cg25383568_TC11 7548 6119
## cg25455143_BC11 590 597
## cg25459778_BC11 247 510
## cg25487775_BC11 10935 5801
## GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg25324105_BC11 21984 20939
## cg25383568_TC11 1337 1236
## cg25455143_BC11 19072 17119
## cg25459778_BC11 13889 9875
## cg25487775_BC11 7700 9991
Here we can se a matrix where rows correspond to
CpG sites
, columns represent samples
, and
values contain fluorescence intensities
(indicate how much
DNA at each CpG site is methylated or unmethylated) from Illumina
methylation arrays. These matrices are used to compute Beta (methylation
proportions) and M (log-transformed methylation ratios) values.
# Visualize the distribution of methylated and unmethylated intensities
# Log-transform intensities, only for visualization purposes
logMeth <- log10(meth + 1)
logUnmeth <- log10(unmeth + 1)
# Plot density and Increase y axis limit
plot(density(as.vector(logMeth)), col = "darkgreen", lwd = 2,
main = "Methylated and Unmethylated Intensities",
xlab = "Log10(Intensity)", ylim = c(0, 1.5))
lines(density(as.vector(logUnmeth)), col = "red", lwd = 2)
legend("topright", legend = c("Methylated", "Unmethylated"),
col = c("darkgreen", "red"), lwd = 2)
Once we have the methylated (Meth) and unmethylated (Unmeth) signals in a MethylSet, we often convert them into values that are easier to work with in analysis Beta and M values. These are stored in a new type of object called a RatioSet.
Finally, we want to include genomic information such
as the chromosome and location of each probe using
mapToGenome()
function. This step turns
our RatioSet into a GenomicRatioSet, which is the
format we will use for the downstream analyses.
# Convert MethylSet to a GenomicRatioSet object
RatioSet <- ratioConvert(MSet,
what = "both",
keepCN = TRUE)
# Observe the change of the assays
RatioSet
## class: RatioSet
## dim: 936990 50
## metadata(0):
## assays(3): Beta M CN
## rownames(936990): cg25324105_BC11 cg25383568_TC11 ...
## ch.12.78471492F_BC21 ch.21.43742285F_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.52.1
## Manifest version: 1.0.0
## class: GenomicRatioSet
## dim: 930075 50
## metadata(0):
## assays(3): Beta M CN
## rownames(930075): cg00381604_BC11 cg21870274_BC21 ... cg27912920_TC21
## cg16855331_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: Raw (no normalization or bg correction)
## minfi version: 1.52.1
## Manifest version: 1.0.0
## GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11 0.04027505 0.05374046
## cg21870274_BC21 0.77646606 0.69081510
## cg00002271_BC21 0.36874602 0.39778924
## cg08058514_BC21 0.71405361 0.42476489
## cg00004581_TC21 0.31495854 0.33383572
## cg09261435_BC21 0.14921587 0.16735113
## GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11 -4.5746627 -4.1381554
## cg21870274_BC21 1.7964290 1.1598298
## cg00002271_BC21 -0.7755931 -0.5982642
## cg08058514_BC21 1.3202877 -0.4374872
## cg00004581_TC21 -1.1210294 -0.9967397
## cg09261435_BC21 -2.5113922 -2.3148299
## GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11 13.79888 13.67728
## cg21870274_BC21 13.92796 13.70434
## cg00002271_BC21 14.31678 14.72813
## cg08058514_BC21 11.26562 10.31741
## cg00004581_TC21 14.87987 14.54388
## cg09261435_BC21 14.40408 14.32010
The key aspects of obtaing the beta
, m
and
cn
are as follows:
Beta:
M:
CN:
The minfi
package includes a simple and
helpful quality control (QC) plot. This plot shows the
median intensity of each sample in the
methylated (M) and unmethylated (U)
channels, using a log scale.
When you make this plot, you will usually see that high-quality samples group together in one area, while low-quality (failed) samples stand out with lower intensities and are more spread out, below or above the line.
# Extract and plot the quality control information from the MethylSet
qc <- getQC(MSet)
plotQC(qc,badSampleCutoff = 10.5) + # Default cutoff
title("Quality Control")
## integer(0)
It seems that we do not have any failed samples, as all samples are above the default cutoff of 10.5. However, we can still check the detection p-values to ensure that all samples have sufficient quality.
# Calculate the detection p-values
detP <- detectionP(RGSet, type = "m+u")
# Examine mean detection p-values across all samples to identify any failed samples
barplot(colMeans(detP),
las=2,
cex.names=0.2,
ylab="Mean detection p-values")
abline(h=0.05,col="red")
When deciding if a sample is of good or bad quality, it’s best not to rely on just one metric. Instead, we can also examine the detection p-values, which give us insight into how reliable the signal is for each CpG site in each sample.
The minfi
package calculates detection
p-values by comparing the total signal (methylated + unmethylated) for a
probe to the background noise, which is estimated using
negative control probes.
By plotting the average detection p-value for each sample, we can quickly see which samples have many failed probes these samples will show higher mean p-values, suggesting lower overall quality.
# Calculate the mean detection p-values across all samples
meanDetP <- colMeans(detP)
# Identify samples with mean detection p-value > 0.05
failedSamples <- names(meanDetP[meanDetP > 0.05])
failedSamples
## [1] "GSM7866981_207700470022_R01C01"
For this particular dataset, GSM7866981_207700470022_R01C01 is the only sample with a mean detection p-value greater than 0.05, indicating that it has failed in one or more probes. We can remove this sample from the analysis.
Another helpful way to check the quality of your samples is by looking at the distribution of Beta values. Beta values range from 0 (unmethylated) to 1 (fully methylated), so for a good-quality sample, you would usually expect the values to cluster near 0 and 1.
# Inspect the overall density distribution of Beta values
phenoData <- pData(RGSetF)
densityPlot(RGSetF,
sampGroups = phenoData$Sex,
pal = brewer.pal(8, "Dark2"),
main = "Density Plot of Beta Values",
add = TRUE,
legend = TRUE,)
After removing the failed sample, we can visualize the distribution is as expected.
In DNA methylation data analysis, normalisation is a crucial preprocessing step that addresses technical variation and probe design bias. The Illumina array includes two probe types (Type I and II) with different dynamic ranges, and if not corrected, this can cause Type I probes to dominate analysis results due to their broader range. This makes within-array normalisation essential when comparing or ranking probes. Additionally, batch effects systematic differences introduced by factors as reagent concentration or lab conditions can obscure true biological signals. It is recommended to use:
preprocessFunnorm
when your samples
come from biologically distinct groups or it is
expected large anticipated differences between samples
(Fortin,
JP. et al. 2014).preprocessQuantile
when samples
are from the same tissue type or expected to have
similar global methylation profiles(Touleimat,
N. & Tost, J. 2012).These options generally perform better than older methods as
preprocessRaw
, preprocessIllumina
, or
preprocessSWAN
, which are still available for
convenience.
As we are expecting large anticipated differences between samples, we will apply the preprocessFunnorm method to our data.
# Preprocess the data using functional normalisation
set.seed(123)
MSetF <- preprocessFunnorm(RGSetF)
MSetF
## class: GenomicRatioSet
## dim: 930075 49
## metadata(0):
## assays(2): Beta CN
## rownames(930075): cg00381604_BC11 cg21870274_BC21 ... cg27912920_TC21
## cg16855331_BC21
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
# Visualise what the data looks as before and after normalisation
# based on Sex
pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
densityPlot(RGSetF,
sampGroups=phenoData$Sex,
main="Raw",
legend=FALSE)
legend("top",
legend = levels(factor(phenoData$Sex)),
text.col=pal)
densityPlot(getBeta(MSetF),
sampGroups=phenoData$Sex,
main="Normalized",
legend=FALSE)
legend("top",
legend = levels(factor(phenoData$Sex)),
text.col=pal)
Statistically, normalisation should ideally minimize technical noise without compromising biological signals. The right panel indicates a successful normalisation, aligning distributions more closely while maintaining inherent biological distinctions.
To improve the reliability of methylation analysis, it is important to remove probes that do not perform well. These poor-quality probes can introduce noise and hide meaningful biological patterns. We will be filtering out probes that fail in one or more samples typically based on high detection p-values, probes located on the X and Y chromosomes, probes with single nucleotide polymorphisms (SNPs) at the CpG site, and cross-reactive probes. This will improve data quality and reduces the burden of correcting for multiple comparisons in statistical analyses.
# Ensure probes are in the same order in the mSetSq and detP objects
detPF <- detectionP(RGSetF)
detPF <- detPF[match(featureNames(MSetF),
rownames(detPF)),]
# Remove any probes that have failed in one or more samples; this next line
# checks for each row of detP whether the number of values < 0.05 is equal
# to the number of samples (TRUE) or not (FALSE)
keep <- rowSums(detPF < 0.05) == ncol(MSetF)
table(keep)
## keep
## FALSE TRUE
## 10063 920012
## class: GenomicRatioSet
## dim: 920012 49
## metadata(0):
## assays(2): Beta CN
## rownames(920012): cg00381604_BC11 cg21870274_BC21 ... cg27912919_TC21
## cg16855331_BC21
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
Removing the sex chromosomes is a common practice in methylation analysis because these chromosomes often contain common SNPs at the CpG site (Maksimovic, J. et al. 2016 & Wang, Y. et al. 2021).
# Remove probes on the sex chromosomes
keep_chr <- !(featureNames(MSetF_Flt) %in%
getAnnotation(RGSet)$Name[getAnnotation(RGSet)$chr
%in% c("chrX", "chrY")])
MSetF_Flt_Rxy <- MSetF_Flt[keep_chr,]
MSetF_Flt_Rxy
## class: GenomicRatioSet
## dim: 896329 49
## metadata(0):
## assays(2): Beta CN
## rownames(896329): cg00381604_BC11 cg21870274_BC21 ... cg26570081_BC21
## cg26570107_TC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
##
## chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20
## 86032 44510 51401 47680 22541 31183 30160 38733 47267 17086 39784 68340 25919
## chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
## 11015 19905 52258 38473 46518 55262 48187 40919 33156
# Remove probes where common SNPs may affect the CpG methylation signal
MSetF_Flt_Rxy_Ds <- dropLociWithSnps(MSetF_Flt_Rxy, snps = c("SBE", "CpG"), maf = 0.5)
MSetF_Flt_Rxy_Ds
## class: GenomicRatioSet
## dim: 896313 49
## metadata(0):
## assays(2): Beta CN
## rownames(896313): cg00381604_BC11 cg21870274_BC21 ... cg26570081_BC21
## cg26570107_TC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
Probes that have been shown to be cross-reactive, meaning those that map to multiple locations in the genome, will also be filtered out. This list was publicly published and can be accessed through the authors’ publication Peters et al. (2024).
# Exclude cross reactive probes
xReactiveProbes <- read.csv("data/12864_2024_10027_MOESM8_ESM.csv",
stringsAsFactors=FALSE)
head(xReactiveProbes)
## keep__
## FALSE TRUE
## 29115 867198
## class: GenomicRatioSet
## dim: 867198 49
## metadata(0):
## assays(2): Beta CN
## rownames(867198): cg00002271_BC21 cg00004581_TC21 ... cg13801850_TC21
## cg08423507_BC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
## GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
## GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
## array: IlluminaHumanMethylationEPICv2
## annotation: 20a1.hg38
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
As reported by Du, P., Zhang, X., Huang, CC. et al. (2010), we can calculate the M, Beta, and CN values for the final version of the data.
# Calculate M and B values for statistical analysis of the final version of the data
m <- getM(MSetF_Flt_Rxy_Ds_Rc)
beta <- getBeta(MSetF_Flt_Rxy_Ds_Rc)
# Create directory for metrics
if (!dir.exists("rData/metrics")) {
dir.create("rData/metrics", recursive = TRUE)
}
# Visualise the distribution of beta and M values
par(mfrow=c(1,2))
densityPlot(beta, sampGroups=phenoData$Race, main="Beta values",
legend=FALSE, xlab="Beta")
legend("top", legend = levels(factor(phenoData$Race)),
text.col=brewer.pal(8,"Dark2"))
densityPlot(m, sampGroups=phenoData$Race, main="M values",
legend=FALSE, xlab="M")
legend("topleft", legend = levels(factor(phenoData$Race)),
text.col=brewer.pal(8,"Dark2"))
DNA methylation levels measured from whole blood samples are influenced not only by biological and environmental exposures, but also by the composition of different white blood cell (WBC) types (Qi, L. et al., 2022 & Terry, M. et al., 2011). Since each immune cell type has its own unique methylation profile, failing to account for variation in cell type proportions can lead to confounding in differential methylation analysis especially in epigenome-wide association studies (EWAS) involving age, sex, or disease states (Koestler, D. et al., 2016).
To address this, we estimated cell-type proportions
using reference based deconvolution methods based on the FlowSorted.Blood.EPIC
reference panel. This panel includes optimized CpGs for identifying
immune cell types in Illumina EPIC array data.
# Using ENmix to remove the suffix from the beta matrix
betaENMIX<- rm.cgsuffix(beta)
library(FlowSorted.Blood.EPIC)
IDOLOptimizedCpGsBloodv2<- IDOLOptimizedCpGs[which(IDOLOptimizedCpGs%in%rownames(betaENMIX))]
identical(rownames(IDOLOptimizedCpGs.compTable[IDOLOptimizedCpGsBloodv2,]), IDOLOptimizedCpGsBloodv2)
## [1] TRUE
# Project cell type proportions using the IDOLOptimizedCpGsBloodv2
propEPIC <- projectCellType_CP(
betaENMIX[IDOLOptimizedCpGsBloodv2, ],
IDOLOptimizedCpGs.compTable[IDOLOptimizedCpGsBloodv2,],
contrastWBC = NULL, nonnegative = TRUE,
lessThanOne = FALSE
)
## CD8T CD4T NK Bcell Mono Neu
## GSM7866964_207700470022_R08C01 0.0355 0.1628 0.0397 0.0515 0.0512 0.6700
## GSM7866965_207686150104_R02C01 0.0161 0.0212 0.0463 0.0240 0.0309 0.8653
## GSM7866966_207700460080_R08C01 0.0711 0.1319 0.1614 0.0268 0.0687 0.5360
## GSM7866967_207700470049_R01C01 0.0849 0.2160 0.0523 0.1199 0.0531 0.4932
## GSM7866968_207805820146_R07C01 0.1798 0.1424 0.0501 0.0478 0.0742 0.5206
## GSM7866969_207700470041_R01C01 0.0900 0.1397 0.0306 0.0676 0.0879 0.6084
# Add the cell type proportions to the phenoData
# Align propEPIC to phenoData
propEPIC_ <- propEPIC[rownames(phenoData), ]
# Combine using cbind()
phenoData <- cbind(phenoData, propEPIC_)
pData(RGSetF) <- phenoData
# Check the updated phenoData
str(phenoData)
## Formal class 'DFrame' [package "S4Vectors"] with 6 slots
## ..@ rownames : chr [1:49] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## ..@ nrows : int 49
## ..@ elementType : chr "ANY"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
## ..@ listData :List of 16
## .. ..$ SID : chr [1:49] "BoA1" "BoA2" "BoA3" "BoA4" ...
## .. ..$ Tissue : chr [1:49] "blood" "blood" "blood" "blood" ...
## .. ..$ Sex : chr [1:49] "M" "F" "M" "F" ...
## .. ..$ Race : chr [1:49] "White" "Asian" "Asian" "White" ...
## .. ..$ Ethnicity : chr [1:49] "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
## .. ..$ Age : int [1:49] 31 79 60 59 22 47 78 89 24 70 ...
## .. ..$ GID : chr [1:49] "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
## .. ..$ Description: chr [1:49] "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
## .. ..$ Basename : chr [1:49] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## .. ..$ filenames : chr [1:49] "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
## .. ..$ CD8T : num [1:49] 0.0355 0.0161 0.0711 0.0849 0.1798 ...
## .. ..$ CD4T : num [1:49] 0.1628 0.0212 0.1319 0.216 0.1424 ...
## .. ..$ NK : num [1:49] 0.0397 0.0463 0.1614 0.0523 0.0501 ...
## .. ..$ Bcell : num [1:49] 0.0515 0.024 0.0268 0.1199 0.0478 ...
## .. ..$ Mono : num [1:49] 0.0512 0.0309 0.0687 0.0531 0.0742 0.0879 0.0944 0.0481 0.0772 0.0766 ...
## .. ..$ Neu : num [1:49] 0.67 0.865 0.536 0.493 0.521 ...
To identify differentially methylated CpG sites associated
with Age, we applied a linear modeling
approach using the limma
package, which was
originally developed for gene expression analysis and is widely adopted
for high-throughput methylation data. This model adjusts for important
biological covariates and estimates statistical
significance using empirical Bayes moderation,
improving the detection of true biological signals, especially in modest
sample sizes.
The phenotype metadata (phenoData
) was converted into a
clean data.frame
(phenoFV
) for model
building.
## [1] "SID" "Tissue" "Sex" "Race" "Ethnicity"
## [6] "Age" "GID" "Description" "Basename" "filenames"
## [11] "CD8T" "CD4T" "NK" "Bcell" "Mono"
## [16] "Neu"
## 'data.frame': 49 obs. of 16 variables:
## $ SID : chr "BoA1" "BoA2" "BoA3" "BoA4" ...
## $ Tissue : chr "blood" "blood" "blood" "blood" ...
## $ Sex : chr "M" "F" "M" "F" ...
## $ Race : chr "White" "Asian" "Asian" "White" ...
## $ Ethnicity : chr "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
## $ Age : int 31 79 60 59 22 47 78 89 24 70 ...
## $ GID : chr "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
## $ Description: chr "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
## $ Basename : chr "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
## $ filenames : chr "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
## $ CD8T : num 0.0355 0.0161 0.0711 0.0849 0.1798 ...
## $ CD4T : num 0.1628 0.0212 0.1319 0.216 0.1424 ...
## $ NK : num 0.0397 0.0463 0.1614 0.0523 0.0501 ...
## $ Bcell : num 0.0515 0.024 0.0268 0.1199 0.0478 ...
## $ Mono : num 0.0512 0.0309 0.0687 0.0531 0.0742 0.0879 0.0944 0.0481 0.0772 0.0766 ...
## $ Neu : num 0.67 0.865 0.536 0.493 0.521 ...
Before modeling, we inspected the correlation matrix of numeric covariates using Spearman’s method to identify potential multicollinearity or confounding.
# Check Correlation matrix of numeric variables
numVars <- phenoFV[sapply(phenoFV, is.numeric)]
corMat <- cor(numVars, use = "pairwise.complete.obs", method = "spearman")
corrplot::corrplot(corMat, method = "color", type = "upper", tl.cex = 0.7,
tl.col = "black", tl.srt = 45)
# Convert Sex and Race to factor
phenoFV$Sex <- factor(phenoFV$Sex)
phenoFV$Race <- factor(phenoFV$Race)
levels(phenoFV$Sex)
## [1] "F" "M"
## [1] "Asian" "Black" "Other" "White"
From the above, we can include the following covariates in our model:
A design matrix was constructed to model methylation
Beta values as a function of Age, adjusting for the below covariates.
The analysis was performed using lmFit()
and
eBayes()
from limma
, which fits a linear model
per CpG site:
\[ \text{Methylation}_i \sim \text{Age} + \text{Sex} + \text{Race} + \text{CD4T} + \text{Mono} \]
The top differentially methylated positions (DMPs) were extracted
using topTable()
with Age as the primary
coefficient of interest (coef = 2
).
# Create a design matrix for the linear model:
# Columns:
# # [1] "SID" "Tissue" "Sex" "Race" "Ethnicity"
# [6] "Age" "GID" "Description" "Basename" "filenames"
# [11] "CD8T" "CD4T" "NK" "Bcell" "Mono"
# [16] "Neu" "CD8T.1" "CD4T.1" "NK.1" "Bcell.1"
# [21] "Mono.1" "Neu.1"
design <- model.matrix(~ Age + Sex + Race + CD4T + Mono,
data = phenoFV)
fit <- lmFit(beta, design)
fit <- eBayes(fit, trend= TRUE)
tt <- topTable(fit, coef = 2, number = Inf)
tt
A Q-Q plot of the moderated t-statistics for Age was generated to assess whether the distribution of test statistics followed the expected null distribution. Below we can see a deviation from the diagonal line at the tail supporting the presence of true differential methylation signals (Voorman, A. et al., 2011 & Gordon, K. et al., 2025.
To map significant CpGs to their genomic coordinates:
tt
) with probe annotation
from the normalized object (MSetF_Flt_Rxy_Ds_Rc
).Genome-wide and suggestive significance thresholds were annotated:
# Get Annotation
tt$IID <- rownames(tt)
ann <- getAnnotation(MSetF_Flt_Rxy_Ds_Rc)
manhattanData <- merge(tt, ann[, c("Name", "chr", "pos")],
by.x = "IID", by.y = "Name")
# Create dataframe to plot MH plot
manhattanData <- data.frame(
SNP = manhattanData$IID,
CHR = as.numeric(gsub("chr", "", manhattanData$chr)),
BP = manhattanData$pos,
P = manhattanData$P.Value
)
manhattan(manhattanData, annotatePval = 5e-08, chrlabs = c(1:22),
main = "Manhattan Plot", ylim = c(0, 20), cex.axis = 0.4,
suggestiveline = -log10(5e-05), genomewideline = -log10(5e-08))
This tutorial is intended solely for research and educational purposes and does not constitute clinical or diagnostic advice. The statistical models presented here are exploratory and may be affected by batch effects, technical artifacts, and uncorrected variables such as predicted sex and array-specific biases, which must be accounted for in formal analyses.
We developed a comprehensive and automated pipeline, publicly available on GitHub:
The pipeline is implemented in R and is structured to be compatible with local and cluster environments.