QUT Logo

Overview

This tutorial provides an introduction to DNA methylation analysis using the minfi package in R. It covers the following key steps:

  1. Data Import: Loading raw methylation data from IDAT files.
  2. Exploration: Generating control plots to assess data quality.
  3. Quality Control: Assessing the quality of the data using control plots and metrics.
  4. Normalisation: Applying functional normalisation to adjust for technical biases.
  5. Filtering: removing chromosome X and Y probes, and probes with low detection p-values, cross-reactive probes and SNP probes.

Dataset

The following analysis is based on DNA methylation data obtained from the GEO Series GSE246337. The dataset includes genome-wide methylation profiles generated using the Illumina Infinium MethylationEPIC v2.0 array, derived from blood samples.

A total of 50 samples were selected for this analysis. Sample-level metadata was extracted and cleaned from the provided Series Matrix text file.

Dataset Summary

Aim

The aim of this analysis is to identify differentially methylated CpG sites associated with Age, while adjusting for biological covariates including Sex, Race, and estimated cell-type proportions. This is achieved using a linear model framework from the package limma.

Libraries

The following code installs and loads the required packages for this analysis.

# Install and load required packages

instalload <- function(cran=NULL, bioc=NULL) {
  pkgs_cran <- cran[!(cran %in% installed.packages()[, "Package"])]
  if (length(pkgs_cran) > 0) {
    install.packages(pkgs_cran, dependencies = T)
  }

  if (!requireNamespace("BiocManager", quietly = T))
          install.packages("BiocManager")
  pkgs_bioc <- bioc[!(bioc %in% installed.packages()[, "Package"])]
  if (length(pkgs_bioc) > 0) {
    BiocManager::install(pkgs_bioc, ask = F)
  }

  pkgs <- c(cran, bioc)
  invisible(lapply(pkgs, require, character.only = T))

  cat("Packages are installed and loaded successfully.\n")
}

# CRAN & Bioconductor packages
cran <- c("ggplot2", "gridExtra", "RColorBrewer",
          "stringr", "rstudioapi", "knitr", "qqman")

bioc <- c("minfi",
          "IlluminaHumanMethylationEPICv2anno.20a1.hg38",
          "IlluminaHumanMethylationEPICv2manifest",
          "FlowSorted.Blood.EPIC", "limma", "Gviz", "ENmix")

instalload(cran, bioc)
## Packages are installed and loaded successfully.

Working Directory

The following code sets the working directory to the location of this R Markdown file.

setwd(dirname(getSourceEditorContext()$path))
print(getwd())
## [1] "C:/Users/ruizpint/OneDrive - Queensland University of Technology/QUT/PipelineScripts/2025CGPHNeurogenomicsWorkshop"

Load Data

Load sample data

# Sample data
targets <- read.csv("data/pheno.csv")
str(targets)
## 'data.frame':    50 obs. of  9 variables:
##  $ SID        : chr  "BoA1" "BoA2" "BoA3" "BoA4" ...
##  $ Tissue     : chr  "blood" "blood" "blood" "blood" ...
##  $ Sex        : chr  "M" "F" "M" "F" ...
##  $ Race       : chr  "White" "Asian" "Asian" "White" ...
##  $ Ethnicity  : chr  "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
##  $ Age        : int  31 79 60 59 22 47 78 89 24 70 ...
##  $ GID        : chr  "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
##  $ Description: chr  "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
##  $ Basename   : chr  "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...

The dataset contains 50 samples. From the above, we can see it includes phenotype and technical information relevant to DNA methylation analysis.

In our analysis, we focus on Age as the outcome and for now we can consider as covariates Sex and Race. Variables as GID, Description, and Tissue are used for tracking and data merging, but are not directly included in the statistical model. SID and Basename are identifiers for each sample, ensuring we can link the metadata to the raw data files. Tissue is constant across samples, so it does not need to be included in the model.

Load methylation data

# Load the raw methylation data from the idat files
RGSet <- read.metharray.exp(base="data/idats",
                            targets=targets,
                            extended=F,
                            recursive=F,
                            verbose=F)

str(RGSet)
## Formal class 'RGChannelSet' [package "minfi"] with 6 slots
##   ..@ annotation     : Named chr [1:2] "IlluminaHumanMethylationEPICv2" "20a1.hg38"
##   .. ..- attr(*, "names")= chr [1:2] "array" "annotation"
##   ..@ colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   .. .. ..@ nrows          : int 50
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   .. .. ..@ listData       :List of 10
##   .. .. .. ..$ SID        : chr [1:50] "BoA1" "BoA2" "BoA3" "BoA4" ...
##   .. .. .. ..$ Tissue     : chr [1:50] "blood" "blood" "blood" "blood" ...
##   .. .. .. ..$ Sex        : chr [1:50] "M" "F" "M" "F" ...
##   .. .. .. ..$ Race       : chr [1:50] "White" "Asian" "Asian" "White" ...
##   .. .. .. ..$ Ethnicity  : chr [1:50] "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
##   .. .. .. ..$ Age        : int [1:50] 31 79 60 59 22 47 78 89 24 70 ...
##   .. .. .. ..$ GID        : chr [1:50] "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
##   .. .. .. ..$ Description: chr [1:50] "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
##   .. .. .. ..$ Basename   : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   .. .. .. ..$ filenames  : chr [1:50] "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
##   ..@ assays         :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
##   .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
##   .. .. .. .. ..@ listData       :List of 2
##   .. .. .. .. .. ..$ Green: int [1:1105209, 1:50] 14462 7133 14963 11324 1184 15186 1693 11046 866 1651 ...
##   .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. .. .. ..$ : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
##   .. .. .. .. .. .. .. ..$ : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   .. .. .. .. .. ..$ Red  : int [1:1105209, 1:50] 3269 921 1760 13191 11202 2679 17863 10587 9740 516 ...
##   .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. .. .. ..$ : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
##   .. .. .. .. .. .. .. ..$ : chr [1:50] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   .. .. .. .. ..@ elementType    : chr "ANY"
##   .. .. .. .. ..@ elementMetadata: NULL
##   .. .. .. .. ..@ metadata       : list()
##   ..@ NAMES          : chr [1:1105209] "1600157" "1600179" "1600219" "1600225" ...
##   ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : NULL
##   .. .. ..@ nrows          : int 1105209
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   .. .. ..@ listData       : Named list()
##   ..@ metadata       : list()

As for now, the data is in the form of an RGChannelSet (unprocessed data), which contains the original red and green intensity values measured by the array. From the above, we can see that the RGSet object contains:

  • 50 samples (columns) with associated phenotypic information (sex, age, race, others)
  • Over 1.1 million probes (rows) representing individual CpG sites

This object will be used as input for exploration, quality control and preprocessing steps.

Exploration

This section is based on the control table used by Illumina to explore expected performance of the assay (Figure 1, 2).

We will generate control plots to assess the quality of the data and identify any potential issues with the samples using the ENmix package. The issues can be related to the sample preparation, bisulfite conversion, hybridization, or other steps in the methylation assay process.

Figure 1. Independent Controls. Source: Illumina. “Control Bead Type Table.” Illumina Support Documentation, https://support-docs.illumina.com/ARR/Inf-MSA/Content/ARR/MSA/ControlBeadTypeTable.htm. Accessed 17 July 2025. Assay
Figure 1. Independent Controls. Source: Illumina. “Control Bead Type Table.” Illumina Support Documentation, https://support-docs.illumina.com/ARR/Inf-MSA/Content/ARR/MSA/ControlBeadTypeTable.htm. Accessed 17 July 2025. Assay
Figure 2. Dependant Controls. Source: Illumina. “Control Bead Type Table.” Illumina Support Documentation, https://support-docs.illumina.com/ARR/Inf-MSA/Content/ARR/MSA/ControlBeadTypeTable.htm. Accessed 17 July 2025. Assay
Figure 2. Dependant Controls. Source: Illumina. “Control Bead Type Table.” Illumina Support Documentation, https://support-docs.illumina.com/ARR/Inf-MSA/Content/ARR/MSA/ControlBeadTypeTable.htm. Accessed 17 July 2025. Assay

Control probes plot using ENmix

# Generate ENmix control plots
plotCtrl(RGSet)
## Plotting  STAINING .jpg
## Plotting  EXTENSION .jpg
## Plotting  HYBRIDIZATION .jpg
## Plotting  TARGET_REMOVAL .jpg
## Plotting  BISULFITE_CONVERSION_I .jpg
## Plotting  BISULFITE_CONVERSION_II .jpg
## Plotting  SPECIFICITY_I .jpg
## Plotting  SPECIFICITY_II .jpg
## Plotting  NON-POLYMORPHIC .jpg
## Plotting  NEGATIVE .jpg
## Plotting  NORM_A .jpg
## Plotting  NORM_C .jpg
## Plotting  NORM_G .jpg
## Plotting  NORM_T .jpg
## Plotting  NORM_ACGT .jpg
# Create a directory for ENmix control plots
if (!dir.exists("figures/enMix")) {
  dir.create("figures/enMix", recursive = TRUE)
}

# Move all .jpg control plots to the target folder
jpgFiles <- list.files(pattern = "\\.jpg$")
file.rename(from = jpgFiles,
            to   = file.path("figures/enMix", jpgFiles))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

From the inspection of all the controls, one sample (around sample 20) shows noticeably low red signal in the STAINING, BISULFITE CONVERSION II, SPECIFICITY I and SPECIFICITY II. This sample may require further inspection or exclusion.

Minfi

When working with DNA methylation data in minfi, the data goes through several transformation steps, and each step is stored in a specific type of object (Figure 3). It is important to understand how these object types are connected.

The process begins by reading IDAT files, which are stored in an object called an RGChannelSet. This raw data is then processed using a preprocessing function, which converts it into a MethylSet.

From there, two additional functions — ratioConvert() and mapToGenome() are used to transform the MethylSet into a GenomicRatioSet, which is ready for downstream analysis.

Figure 3. Flowchart of the different *minfi class objects. Source: NBIS (National Bioinformatics Infrastructure Sweden). (n.d.). Methylation array tutorial. NBIS Workshop on Epigenomics. Retrieved July 17, 2025, from https://nbis-workshop-epigenomics.readthedocs.io/en/latest/content/tutorials/methylationArray/Array_Tutorial.html
Figure 3. Flowchart of the different *minfi class objects. Source: NBIS (National Bioinformatics Infrastructure Sweden). (n.d.). Methylation array tutorial. NBIS Workshop on Epigenomics. Retrieved July 17, 2025, from https://nbis-workshop-epigenomics.readthedocs.io/en/latest/content/tutorials/methylationArray/Array_Tutorial.html
# Get an overview of the data and annotation
getManifest(RGSet)
## IlluminaMethylationManifest object
## Annotation
##   array: IlluminaHumanMethylationEPICv2
## Number of type I probes: 128271 
## Number of type II probes: 808719 
## Number of control probes: 635 
## Number of SNP type I probes: 24 
## Number of SNP type II probes: 41
head(getAnnotation(RGSet)[, c("Name", "chr", "pos", "UCSC_RefGene_Name")][1:5, ])
## DataFrame with 5 rows and 4 columns
##                            Name         chr       pos      UCSC_RefGene_Name
##                     <character> <character> <integer>            <character>
## cg25324105_BC11 cg25324105_BC11       chr19  37692358 ZNF781;ZNF781;ZNF781..
## cg25383568_TC11 cg25383568_TC11       chr19  38727081            ACTN4;ACTN4
## cg25455143_BC11 cg25455143_BC11       chr19   1591515                       
## cg25459778_BC11 cg25459778_BC11       chr16   1614581                       
## cg25487775_BC11 cg25487775_BC11        chr2 161237458

From the overview of the data, we can see that the Illumina EPIC v2 array includes over 900,000 probes designed to assess DNA methylation across the genome. These probes are divided into two types based on their chemistry: Type I and Type II. Each type has different probe designs, and both are used to maximize coverage and sensitivity.

In addition to methylation probes, the array also includes control probes for quality checks and SNP probes to assess sample identity and potential genetic variation.

Each probe is annotated with genomic coordinates (chromosome, position, others) and often linked to nearby genes. This information helps us understand the biological relevance of methylation changes detected during analysis.

Structure of the IlluminaMethylationManifest Object:

  • Number of type I probes: 128,271 probes that use two separate color channels (Red and Green) for methylated and unmethylated signals.
  • Number of type II probes: 808,719 probes that use a single color channel (Green) for both methylated and unmethylated signals.
  • Number of control probes: 635 probes included for quality control and normalisation.
  • Number of SNP type I probes: 24 probes designed to detect single nucleotide polymorphisms (SNPs) using the type I probe structure.
  • Number of SNP type II probes: 41 probes designed to detect SNPs using the type II probe structure.

Note on the data structure

At this stage, our data is stored in an RGChannelSet, which contains the raw intensity signals from the red and green channels on the array.

To start analyzing DNA methylation, we need to convert this into a MethylSet, which holds the actual methylated (Meth) and unmethylated (Unmeth) signal values.

The simplest way to do this is with the preprocessRaw() function. This function uses the array design to correctly match the probes and color channels, allowing it to generate the Meth and Unmeth signals.

Keep in mind that preprocessRaw() does not apply any normalisation. It’s meant to be used for initial quality control.

You will see that the original red and green signal assays are now replaced with Meth and Unmeth signals, preparing the data for the next steps in analysis.

# Preprocess the raw data to create a MethylSet object
MSet <- preprocessRaw(RGSet)

MSet
## class: MethylSet 
## dim: 936990 50 
## metadata(0):
## assays(2): Meth Unmeth
## rownames(936990): cg25324105_BC11 cg25383568_TC11 ...
##   ch.12.78471492F_BC21 ch.21.43742285F_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.52.1
##   Manifest version: 1.0.0
# Check the structure of the MSet object
meth <- getMeth(MSet)
unmeth <- getUnmeth(MSet)

head(meth[1:5, 1:2])
##                 GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg25324105_BC11                           1352                            940
## cg25383568_TC11                           7548                           6119
## cg25455143_BC11                            590                            597
## cg25459778_BC11                            247                            510
## cg25487775_BC11                          10935                           5801
head(unmeth[1:5, 1:2])
##                 GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg25324105_BC11                          21984                          20939
## cg25383568_TC11                           1337                           1236
## cg25455143_BC11                          19072                          17119
## cg25459778_BC11                          13889                           9875
## cg25487775_BC11                           7700                           9991

Here we can se a matrix where rows correspond to CpG sites, columns represent samples, and values contain fluorescence intensities (indicate how much DNA at each CpG site is methylated or unmethylated) from Illumina methylation arrays. These matrices are used to compute Beta (methylation proportions) and M (log-transformed methylation ratios) values.

# Visualize the distribution of methylated and unmethylated intensities
# Log-transform intensities, only for visualization purposes
logMeth <- log10(meth + 1)
logUnmeth <- log10(unmeth + 1)

# Plot density and Increase y axis limit
plot(density(as.vector(logMeth)), col = "darkgreen", lwd = 2,
     main = "Methylated and Unmethylated Intensities",
     xlab = "Log10(Intensity)", ylim = c(0, 1.5))
lines(density(as.vector(logUnmeth)), col = "red", lwd = 2)
legend("topright", legend = c("Methylated", "Unmethylated"),
       col = c("darkgreen", "red"), lwd = 2)

Once we have the methylated (Meth) and unmethylated (Unmeth) signals in a MethylSet, we often convert them into values that are easier to work with in analysis Beta and M values. These are stored in a new type of object called a RatioSet.

Finally, we want to include genomic information such as the chromosome and location of each probe using mapToGenome() function. This step turns our RatioSet into a GenomicRatioSet, which is the format we will use for the downstream analyses.

# Convert MethylSet to a GenomicRatioSet object
RatioSet <- ratioConvert(MSet,
                         what = "both",
                         keepCN = TRUE)
# Observe the change of the assays
RatioSet
## class: RatioSet 
## dim: 936990 50 
## metadata(0):
## assays(3): Beta M CN
## rownames(936990): cg25324105_BC11 cg25383568_TC11 ...
##   ch.12.78471492F_BC21 ch.21.43742285F_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.52.1
##   Manifest version: 1.0.0
# Map the RatioSet to the genome
GSet <- mapToGenome(RatioSet)
GSet
## class: GenomicRatioSet 
## dim: 930075 50 
## metadata(0):
## assays(3): Beta M CN
## rownames(930075): cg00381604_BC11 cg21870274_BC21 ... cg27912920_TC21
##   cg16855331_BC21
## rowData names(0):
## colnames(50): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(10): SID Tissue ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.52.1
##   Manifest version: 1.0.0
# Obtain metrics to measure methylation levels
beta <- getBeta(GSet)
head(beta[, 1:2])
##                 GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11                     0.04027505                     0.05374046
## cg21870274_BC21                     0.77646606                     0.69081510
## cg00002271_BC21                     0.36874602                     0.39778924
## cg08058514_BC21                     0.71405361                     0.42476489
## cg00004581_TC21                     0.31495854                     0.33383572
## cg09261435_BC21                     0.14921587                     0.16735113
m <- getM(GSet)
head(m[, 1:2])
##                 GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11                     -4.5746627                     -4.1381554
## cg21870274_BC21                      1.7964290                      1.1598298
## cg00002271_BC21                     -0.7755931                     -0.5982642
## cg08058514_BC21                      1.3202877                     -0.4374872
## cg00004581_TC21                     -1.1210294                     -0.9967397
## cg09261435_BC21                     -2.5113922                     -2.3148299
cn <- getCN(GSet)
head(cn[, 1:2])
##                 GSM7866964_207700470022_R08C01 GSM7866965_207686150104_R02C01
## cg00381604_BC11                       13.79888                       13.67728
## cg21870274_BC21                       13.92796                       13.70434
## cg00002271_BC21                       14.31678                       14.72813
## cg08058514_BC21                       11.26562                       10.31741
## cg00004581_TC21                       14.87987                       14.54388
## cg09261435_BC21                       14.40408                       14.32010

The key aspects of obtaing the beta, m and cn are as follows:

Beta:

  • Proportion of DNA methylation
  • Range: 0 (unmethylated) to 1 (fully methylated)
  • Computed as M/(M+U+alpha)

M:

  • Log-transformed methylation ratio
  • Range: -∞ to ∞
  • Computed as log2(M/U)

CN:

  • Copy number values
  • sum of the methylation and unmethylation channel.
Figure 4. The relationship curve between Beta and M values. Source: Du, P., Zhang, X., Huang, CC. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010). https://doi.org/10.1186/1471-2105-11-587
Figure 4. The relationship curve between Beta and M values. Source: Du, P., Zhang, X., Huang, CC. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010). https://doi.org/10.1186/1471-2105-11-587

Quality Control

The minfi package includes a simple and helpful quality control (QC) plot. This plot shows the median intensity of each sample in the methylated (M) and unmethylated (U) channels, using a log scale.

When you make this plot, you will usually see that high-quality samples group together in one area, while low-quality (failed) samples stand out with lower intensities and are more spread out, below or above the line.

# Extract and plot the quality control information from the MethylSet
qc <- getQC(MSet)
plotQC(qc,badSampleCutoff = 10.5) + # Default cutoff
title("Quality Control")

## integer(0)

It seems that we do not have any failed samples, as all samples are above the default cutoff of 10.5. However, we can still check the detection p-values to ensure that all samples have sufficient quality.

# Calculate the detection p-values
detP <- detectionP(RGSet, type = "m+u")

# Examine mean detection p-values across all samples to identify any failed samples
barplot(colMeans(detP),
        las=2,
        cex.names=0.2,
        ylab="Mean detection p-values")
abline(h=0.05,col="red")

When deciding if a sample is of good or bad quality, it’s best not to rely on just one metric. Instead, we can also examine the detection p-values, which give us insight into how reliable the signal is for each CpG site in each sample.

The minfi package calculates detection p-values by comparing the total signal (methylated + unmethylated) for a probe to the background noise, which is estimated using negative control probes.

By plotting the average detection p-value for each sample, we can quickly see which samples have many failed probes these samples will show higher mean p-values, suggesting lower overall quality.

# Calculate the mean detection p-values across all samples
meanDetP <- colMeans(detP)

# Identify samples with mean detection p-value > 0.05
failedSamples <- names(meanDetP[meanDetP > 0.05])
failedSamples
## [1] "GSM7866981_207700470022_R01C01"
RGSetF <- RGSet[, !(colnames(RGSet) %in% failedSamples)]

For this particular dataset, GSM7866981_207700470022_R01C01 is the only sample with a mean detection p-value greater than 0.05, indicating that it has failed in one or more probes. We can remove this sample from the analysis.

Another helpful way to check the quality of your samples is by looking at the distribution of Beta values. Beta values range from 0 (unmethylated) to 1 (fully methylated), so for a good-quality sample, you would usually expect the values to cluster near 0 and 1.

# Inspect the overall density distribution of Beta values
phenoData <- pData(RGSetF)
densityPlot(RGSetF,
            sampGroups = phenoData$Sex,
            pal = brewer.pal(8, "Dark2"),
            main = "Density Plot of Beta Values",
            add = TRUE,
            legend = TRUE,)

After removing the failed sample, we can visualize the distribution is as expected.

Normalisation

In DNA methylation data analysis, normalisation is a crucial preprocessing step that addresses technical variation and probe design bias. The Illumina array includes two probe types (Type I and II) with different dynamic ranges, and if not corrected, this can cause Type I probes to dominate analysis results due to their broader range. This makes within-array normalisation essential when comparing or ranking probes. Additionally, batch effects systematic differences introduced by factors as reagent concentration or lab conditions can obscure true biological signals. It is recommended to use:

  • preprocessFunnorm when your samples come from biologically distinct groups or it is expected large anticipated differences between samples (Fortin, JP. et al. 2014).
  • Use preprocessQuantile when samples are from the same tissue type or expected to have similar global methylation profiles(Touleimat, N. & Tost, J. 2012).

These options generally perform better than older methods as preprocessRaw, preprocessIllumina, or preprocessSWAN, which are still available for convenience.

As we are expecting large anticipated differences between samples, we will apply the preprocessFunnorm method to our data.

# Preprocess the data using functional normalisation
set.seed(123)
MSetF <- preprocessFunnorm(RGSetF)
MSetF
## class: GenomicRatioSet 
## dim: 930075 49 
## metadata(0):
## assays(2): Beta CN
## rownames(930075): cg00381604_BC11 cg21870274_BC21 ... cg27912920_TC21
##   cg16855331_BC21
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA
# Visualise what the data looks as before and after normalisation
# based on Sex
pal <- brewer.pal(8,"Dark2")

par(mfrow=c(1,2))
densityPlot(RGSetF,
            sampGroups=phenoData$Sex,
            main="Raw",
            legend=FALSE)
legend("top",
       legend = levels(factor(phenoData$Sex)),
       text.col=pal)

densityPlot(getBeta(MSetF),
            sampGroups=phenoData$Sex,
            main="Normalized",
            legend=FALSE)
legend("top",
       legend = levels(factor(phenoData$Sex)),
       text.col=pal)

Statistically, normalisation should ideally minimize technical noise without compromising biological signals. The right panel indicates a successful normalisation, aligning distributions more closely while maintaining inherent biological distinctions.

Filtering

To improve the reliability of methylation analysis, it is important to remove probes that do not perform well. These poor-quality probes can introduce noise and hide meaningful biological patterns. We will be filtering out probes that fail in one or more samples typically based on high detection p-values, probes located on the X and Y chromosomes, probes with single nucleotide polymorphisms (SNPs) at the CpG site, and cross-reactive probes. This will improve data quality and reduces the burden of correcting for multiple comparisons in statistical analyses.

Remove probes with low detection p-values

# Ensure probes are in the same order in the mSetSq and detP objects
detPF <- detectionP(RGSetF)
detPF <- detPF[match(featureNames(MSetF),
                   rownames(detPF)),]

# Remove any probes that have failed in one or more samples; this next line
# checks for each row of detP whether the number of values < 0.05 is equal
# to the number of samples (TRUE) or not (FALSE)
keep <- rowSums(detPF < 0.05) == ncol(MSetF)
table(keep)
## keep
##  FALSE   TRUE 
##  10063 920012
# Subset the GenomicRatioSet
MSetF_Flt <- MSetF[keep,]
MSetF_Flt
## class: GenomicRatioSet 
## dim: 920012 49 
## metadata(0):
## assays(2): Beta CN
## rownames(920012): cg00381604_BC11 cg21870274_BC21 ... cg27912919_TC21
##   cg16855331_BC21
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA

Remove probes from the X and Y chromosomes

Removing the sex chromosomes is a common practice in methylation analysis because these chromosomes often contain common SNPs at the CpG site (Maksimovic, J. et al. 2016 & Wang, Y. et al. 2021).

# Remove probes on the sex chromosomes
keep_chr <- !(featureNames(MSetF_Flt) %in%
                   getAnnotation(RGSet)$Name[getAnnotation(RGSet)$chr
                                             %in% c("chrX", "chrY")])

MSetF_Flt_Rxy <- MSetF_Flt[keep_chr,]

MSetF_Flt_Rxy
## class: GenomicRatioSet 
## dim: 896329 49 
## metadata(0):
## assays(2): Beta CN
## rownames(896329): cg00381604_BC11 cg21870274_BC21 ... cg26570081_BC21
##   cg26570107_TC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA
#Verify chromosome X and Y were removed
table(getAnnotation(MSetF_Flt_Rxy)$chr)
## 
##  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20 
## 86032 44510 51401 47680 22541 31183 30160 38733 47267 17086 39784 68340 25919 
## chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9 
## 11015 19905 52258 38473 46518 55262 48187 40919 33156

Remove probes with SNPs at CpG site

# Remove probes where common SNPs may affect the CpG methylation signal
MSetF_Flt_Rxy_Ds <- dropLociWithSnps(MSetF_Flt_Rxy, snps = c("SBE", "CpG"), maf = 0.5)
MSetF_Flt_Rxy_Ds
## class: GenomicRatioSet 
## dim: 896313 49 
## metadata(0):
## assays(2): Beta CN
## rownames(896313): cg00381604_BC11 cg21870274_BC21 ... cg26570081_BC21
##   cg26570107_TC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA

Remove cross-reactive probes

Probes that have been shown to be cross-reactive, meaning those that map to multiple locations in the genome, will also be filtered out. This list was publicly published and can be accessed through the authors’ publication Peters et al. (2024).

# Exclude cross reactive probes
xReactiveProbes <- read.csv("data/12864_2024_10027_MOESM8_ESM.csv",
                            stringsAsFactors=FALSE)
head(xReactiveProbes)
keep__ <- !(featureNames(MSetF_Flt_Rxy_Ds) %in% xReactiveProbes$ProbeID)

table(keep__)
## keep__
##  FALSE   TRUE 
##  29115 867198
# Remove cross-reactive probes
MSetF_Flt_Rxy_Ds_Rc <- MSetF_Flt_Rxy_Ds[keep__,]
MSetF_Flt_Rxy_Ds_Rc
## class: GenomicRatioSet 
## dim: 867198 49 
## metadata(0):
## assays(2): Beta CN
## rownames(867198): cg00002271_BC21 cg00004581_TC21 ... cg13801850_TC21
##   cg08423507_BC11
## rowData names(0):
## colnames(49): GSM7866964_207700470022_R08C01
##   GSM7866965_207686150104_R02C01 ... GSM7867012_207700460010_R01C01
##   GSM7867013_207705770049_R06C01
## colData names(13): SID Tissue ... yMed predictedSex
## Annotation
##   array: IlluminaHumanMethylationEPICv2
##   annotation: 20a1.hg38
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA

Metrics

As reported by Du, P., Zhang, X., Huang, CC. et al. (2010), we can calculate the M, Beta, and CN values for the final version of the data.

  • M: Better statistical properties.
  • Beta: More easy to interpret.
# Calculate M and B values for statistical analysis of the final version of the data
m <- getM(MSetF_Flt_Rxy_Ds_Rc)

beta <- getBeta(MSetF_Flt_Rxy_Ds_Rc)

# Create directory for metrics
if (!dir.exists("rData/metrics")) {
  dir.create("rData/metrics", recursive = TRUE)
}
# Visualise the distribution of beta and M values
par(mfrow=c(1,2))
densityPlot(beta, sampGroups=phenoData$Race, main="Beta values",
            legend=FALSE, xlab="Beta")
legend("top", legend = levels(factor(phenoData$Race)),
       text.col=brewer.pal(8,"Dark2"))
densityPlot(m, sampGroups=phenoData$Race, main="M values",
            legend=FALSE, xlab="M")
legend("topleft", legend = levels(factor(phenoData$Race)),
       text.col=brewer.pal(8,"Dark2"))

Cell Type Estimation

DNA methylation levels measured from whole blood samples are influenced not only by biological and environmental exposures, but also by the composition of different white blood cell (WBC) types (Qi, L. et al., 2022 & Terry, M. et al., 2011). Since each immune cell type has its own unique methylation profile, failing to account for variation in cell type proportions can lead to confounding in differential methylation analysis especially in epigenome-wide association studies (EWAS) involving age, sex, or disease states (Koestler, D. et al., 2016).

To address this, we estimated cell-type proportions using reference based deconvolution methods based on the FlowSorted.Blood.EPIC reference panel. This panel includes optimized CpGs for identifying immune cell types in Illumina EPIC array data.

# Using ENmix to remove the suffix from the beta matrix
betaENMIX<- rm.cgsuffix(beta)

library(FlowSorted.Blood.EPIC)
IDOLOptimizedCpGsBloodv2<- IDOLOptimizedCpGs[which(IDOLOptimizedCpGs%in%rownames(betaENMIX))]
identical(rownames(IDOLOptimizedCpGs.compTable[IDOLOptimizedCpGsBloodv2,]), IDOLOptimizedCpGsBloodv2)
## [1] TRUE
# Project cell type proportions using the IDOLOptimizedCpGsBloodv2
propEPIC <- projectCellType_CP(
    betaENMIX[IDOLOptimizedCpGsBloodv2, ],
    IDOLOptimizedCpGs.compTable[IDOLOptimizedCpGsBloodv2,],
    contrastWBC = NULL, nonnegative = TRUE,
    lessThanOne = FALSE
)
# Display the cell type proportions
head(propEPIC)
##                                  CD8T   CD4T     NK  Bcell   Mono    Neu
## GSM7866964_207700470022_R08C01 0.0355 0.1628 0.0397 0.0515 0.0512 0.6700
## GSM7866965_207686150104_R02C01 0.0161 0.0212 0.0463 0.0240 0.0309 0.8653
## GSM7866966_207700460080_R08C01 0.0711 0.1319 0.1614 0.0268 0.0687 0.5360
## GSM7866967_207700470049_R01C01 0.0849 0.2160 0.0523 0.1199 0.0531 0.4932
## GSM7866968_207805820146_R07C01 0.1798 0.1424 0.0501 0.0478 0.0742 0.5206
## GSM7866969_207700470041_R01C01 0.0900 0.1397 0.0306 0.0676 0.0879 0.6084
# Add the cell type proportions to the phenoData
# Align propEPIC to phenoData
propEPIC_ <- propEPIC[rownames(phenoData), ]

# Combine using cbind()
phenoData <- cbind(phenoData, propEPIC_)

pData(RGSetF) <- phenoData

# Check the updated phenoData
str(phenoData)
## Formal class 'DFrame' [package "S4Vectors"] with 6 slots
##   ..@ rownames       : chr [1:49] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   ..@ nrows          : int 49
##   ..@ elementType    : chr "ANY"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
##   ..@ listData       :List of 16
##   .. ..$ SID        : chr [1:49] "BoA1" "BoA2" "BoA3" "BoA4" ...
##   .. ..$ Tissue     : chr [1:49] "blood" "blood" "blood" "blood" ...
##   .. ..$ Sex        : chr [1:49] "M" "F" "M" "F" ...
##   .. ..$ Race       : chr [1:49] "White" "Asian" "Asian" "White" ...
##   .. ..$ Ethnicity  : chr [1:49] "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
##   .. ..$ Age        : int [1:49] 31 79 60 59 22 47 78 89 24 70 ...
##   .. ..$ GID        : chr [1:49] "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
##   .. ..$ Description: chr [1:49] "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
##   .. ..$ Basename   : chr [1:49] "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##   .. ..$ filenames  : chr [1:49] "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
##   .. ..$ CD8T       : num [1:49] 0.0355 0.0161 0.0711 0.0849 0.1798 ...
##   .. ..$ CD4T       : num [1:49] 0.1628 0.0212 0.1319 0.216 0.1424 ...
##   .. ..$ NK         : num [1:49] 0.0397 0.0463 0.1614 0.0523 0.0501 ...
##   .. ..$ Bcell      : num [1:49] 0.0515 0.024 0.0268 0.1199 0.0478 ...
##   .. ..$ Mono       : num [1:49] 0.0512 0.0309 0.0687 0.0531 0.0742 0.0879 0.0944 0.0481 0.0772 0.0766 ...
##   .. ..$ Neu        : num [1:49] 0.67 0.865 0.536 0.493 0.521 ...

Limma

To identify differentially methylated CpG sites associated with Age, we applied a linear modeling approach using the limma package, which was originally developed for gene expression analysis and is widely adopted for high-throughput methylation data. This model adjusts for important biological covariates and estimates statistical significance using empirical Bayes moderation, improving the detection of true biological signals, especially in modest sample sizes.

The phenotype metadata (phenoData) was converted into a clean data.frame (phenoFV) for model building.

# Convert phenoData in dataframe
phenoFV <- as.data.frame(phenoData)
colnames(phenoFV)
##  [1] "SID"         "Tissue"      "Sex"         "Race"        "Ethnicity"  
##  [6] "Age"         "GID"         "Description" "Basename"    "filenames"  
## [11] "CD8T"        "CD4T"        "NK"          "Bcell"       "Mono"       
## [16] "Neu"
str(phenoFV)
## 'data.frame':    49 obs. of  16 variables:
##  $ SID        : chr  "BoA1" "BoA2" "BoA3" "BoA4" ...
##  $ Tissue     : chr  "blood" "blood" "blood" "blood" ...
##  $ Sex        : chr  "M" "F" "M" "F" ...
##  $ Race       : chr  "White" "Asian" "Asian" "White" ...
##  $ Ethnicity  : chr  "Non Hispanic" "Non Hispanic" "Non Hispanic" "HISPANIC" ...
##  $ Age        : int  31 79 60 59 22 47 78 89 24 70 ...
##  $ GID        : chr  "GSM7866964" "GSM7866965" "GSM7866966" "GSM7866967" ...
##  $ Description: chr  "207700470022_R08C01" "207686150104_R02C01" "207700460080_R08C01" "207700470049_R01C01" ...
##  $ Basename   : chr  "GSM7866964_207700470022_R08C01" "GSM7866965_207686150104_R02C01" "GSM7866966_207700460080_R08C01" "GSM7866967_207700470049_R01C01" ...
##  $ filenames  : chr  "data/idats/GSM7866964_207700470022_R08C01" "data/idats/GSM7866965_207686150104_R02C01" "data/idats/GSM7866966_207700460080_R08C01" "data/idats/GSM7866967_207700470049_R01C01" ...
##  $ CD8T       : num  0.0355 0.0161 0.0711 0.0849 0.1798 ...
##  $ CD4T       : num  0.1628 0.0212 0.1319 0.216 0.1424 ...
##  $ NK         : num  0.0397 0.0463 0.1614 0.0523 0.0501 ...
##  $ Bcell      : num  0.0515 0.024 0.0268 0.1199 0.0478 ...
##  $ Mono       : num  0.0512 0.0309 0.0687 0.0531 0.0742 0.0879 0.0944 0.0481 0.0772 0.0766 ...
##  $ Neu        : num  0.67 0.865 0.536 0.493 0.521 ...

Before modeling, we inspected the correlation matrix of numeric covariates using Spearman’s method to identify potential multicollinearity or confounding.

# Check Correlation matrix of numeric variables
numVars <- phenoFV[sapply(phenoFV, is.numeric)]
corMat <- cor(numVars, use = "pairwise.complete.obs", method = "spearman")
corrplot::corrplot(corMat, method = "color", type = "upper", tl.cex = 0.7,
                   tl.col = "black", tl.srt = 45)

# Convert Sex and Race to factor
phenoFV$Sex <- factor(phenoFV$Sex)
phenoFV$Race <- factor(phenoFV$Race)
levels(phenoFV$Sex)
## [1] "F" "M"
levels(phenoFV$Race)
## [1] "Asian" "Black" "Other" "White"

From the above, we can include the following covariates in our model:

  • Age (continuous)
  • Sex (converted to factor)
  • Race (converted to factor)
  • Cell type proportions: CD4T and Monocytes

Model Specification

A design matrix was constructed to model methylation Beta values as a function of Age, adjusting for the below covariates. The analysis was performed using lmFit() and eBayes() from limma, which fits a linear model per CpG site:

\[ \text{Methylation}_i \sim \text{Age} + \text{Sex} + \text{Race} + \text{CD4T} + \text{Mono} \]

The top differentially methylated positions (DMPs) were extracted using topTable() with Age as the primary coefficient of interest (coef = 2).

# Create a design matrix for the linear model:
# Columns:
# # [1] "SID"         "Tissue"      "Sex"         "Race"        "Ethnicity"
#  [6] "Age"         "GID"         "Description" "Basename"    "filenames"
# [11] "CD8T"        "CD4T"        "NK"          "Bcell"       "Mono"
# [16] "Neu"         "CD8T.1"      "CD4T.1"      "NK.1"        "Bcell.1"
# [21] "Mono.1"      "Neu.1"
design <- model.matrix(~ Age + Sex + Race + CD4T + Mono,
                        data = phenoFV)

fit <- lmFit(beta, design)
fit <- eBayes(fit, trend= TRUE)

tt <- topTable(fit, coef = 2, number = Inf)
tt

Diagnostics

A Q-Q plot of the moderated t-statistics for Age was generated to assess whether the distribution of test statistics followed the expected null distribution. Below we can see a deviation from the diagonal line at the tail supporting the presence of true differential methylation signals (Voorman, A. et al., 2011 & Gordon, K. et al., 2025.

# Q-Q plot of moderated t-statistics
qqt(fit$t[,2],
    df=fit$df.residual+fit$df.prior)
abline(0,1)

Genomic Visualization

To map significant CpGs to their genomic coordinates:

  • We merged the model output (tt) with probe annotation from the normalized object (MSetF_Flt_Rxy_Ds_Rc).
  • We then created a Manhattan plot, which visualizes the genome-wide distribution of association p-values, highlighting regions of strong differential methylation.

Genome-wide and suggestive significance thresholds were annotated:

  • Genome-wide: p < 5 × 10⁻⁸ (red line)
  • Suggestive: p < 5 × 10⁻⁵ (blue line)
# Get Annotation
tt$IID <- rownames(tt)
ann <- getAnnotation(MSetF_Flt_Rxy_Ds_Rc)
manhattanData <- merge(tt, ann[, c("Name", "chr", "pos")],
                       by.x = "IID", by.y = "Name")
# Create dataframe to plot MH plot
manhattanData <- data.frame(
  SNP = manhattanData$IID,
  CHR = as.numeric(gsub("chr", "", manhattanData$chr)),
  BP = manhattanData$pos,
  P = manhattanData$P.Value
)

manhattan(manhattanData, annotatePval = 5e-08, chrlabs = c(1:22),
          main = "Manhattan Plot", ylim = c(0, 20), cex.axis = 0.4,
          suggestiveline = -log10(5e-05), genomewideline = -log10(5e-08))

References

  1. Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Usinglme4. Journal of Statistical Software, 67(1). https://doi.org/10.18637/jss.v067.i01
  2. Du, P., Zhang, X., Huang, CC. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010). https://doi.org/10.1186/1471-2105-11-587
  3. Fortin, JP., Labbe, A., Lemire, M. et al. Functional normalisation of 450k methylation array data improves replication in large cancer studies. Genome Biol 15, 503 (2014). https://doi.org/10.1186/s13059-014-0503-2
  4. Hansen, K. D., & Fortin, J.-P. (2025). The minfi User’s Guide. In https://bioconductor.org/packages/devel/bioc/vignettes/minfi/inst/doc/minfi.html
  5. Heiss, J. (2013). Recommended Work Flow. In https://hhhh5.github.io/ewastools/articles/exemplary_ewas.html Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biol, 14(10), R115. https://doi.org/10.1186/gb-2013-14-10-r115
  6. Koestler, D.C., Jones, M.J., Usset, J. et al. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC Bioinformatics 17, 120 (2016). https://doi.org/10.1186/s12859-016-0943-7
  7. Maksimovic, J., Phipson, B., & Oshlack, A. (2017). A cross-package Bioconductor workflow for analysing methylation array data. F1000Research, 5. https://f1000research.com/articles/5-1281
  8. Marschner, I. C. (2011). glm2: Fitting Generalized Linear Models with Convergence Problems. R Journal, 3(2), 12-15. https://journal.r-project.org/archive/2011/RJ-2011-012/RJ-2011-012.pdf
  9. Pelegri , D., & Gonzalez, J. R. (2015). Chronological and gestational DNAm age estimation using different methylation-based clocks. In https://bioconductor.org/packages/release/bioc/vignettes/methylclock/inst/doc/methylclock.html
  10. Peters TJ, Meyer B, Ryan L, Achinger-Kawecka J, Song J, Campbell EM, Qu W, Nair S, Loi-Luu P, Stricker P, Lim E, Stirzaker C, Clark SJ, Pidsley R. Characterisation and reproducibility of the HumanMethylationEPIC v2.0 BeadChip for DNA methylation profiling. BMC Genomics. 2024 Mar 6;25(1):251. doi: 10.1186/s12864-024-10027-5. PMID: 38448820; PMCID: PMC10916044.
  11. Touleimat, N., & Tost, J. (2012). Complete Pipeline for Infinium® Human Methylation 450K BeadChip Data Processing Using Subset Quantile Normalisation for Accurate DNA Methylation Estimation. Epigenomics, 4(3), 325–341. https://doi.org/10.2217/epi.12.21
  12. Wang Y, Hannon E, Grant OA, Gorrie-Stone TJ, Kumari M, Mill J, Zhai X, McDonald-Maier KD, Schalkwyk LC. DNA methylation-based sex classifier to predict sex and identify sex chromosome aneuploidy. BMC Genomics. 2021 Jun 28;22(1):484. doi: 10.1186/s12864-021-07675-2. PMID: 34182928; PMCID: PMC8240370.

Manuals

  1. Hansen, D., et al. (2025). minfi: Analyze Illumina Infinium DNA methylation arrays. Bioconductor. https://www.bioconductor.org/packages/devel/bioc/manuals/minfi/man/minfi.pdf
  2. Smyth, G., et al. (2025). limma: Linear Models for Microarray and RNA-Seq Data. Bioconductor. https://bioconductor.org/packages/release/bioc/manuals/limma/man/limma.pdf
  3. Xu, Z., et al. (2025). ENmix: Quality control and analysis tools for Illumina DNA methylation array. Bioconductor. https://www.bioconductor.org/packages/devel/bioc/manuals/ENmix/man/ENmix.pdf
  4. Salas, L, et al. (2025). FlowSorted.Blood.EPIC: Sorted blood cell reference dataset for deconvolution of Illumina EPIC methylation data. Bioconductor. https://bioconductor.org/packages/devel/data/experiment/manuals/FlowSorted.Blood.EPIC/man/FlowSorted.Blood.EPIC.pdf
  5. Illumina (2025). Control Bead Type Table. Illumina Support Documentation. https://support-docs.illumina.com/ARR/Inf-MSA/Content/ARR/MSA/ControlBeadTypeTable.htm
  6. National Bioinformatics Infrastructure Sweden (NBIS) (2025). DNA methylation array analysis. https://nbis-workshop-epigenomics.readthedocs.io/en/latest/content/tutorials/methylationArray/Array_Tutorial.html

Disclaimer

This tutorial is intended solely for research and educational purposes and does not constitute clinical or diagnostic advice. The statistical models presented here are exploratory and may be affected by batch effects, technical artifacts, and uncorrected variables such as predicted sex and array-specific biases, which must be accounted for in formal analyses.

Full Methylation Analysis Pipeline

We developed a comprehensive and automated pipeline, publicly available on GitHub:

🔗 DNAm Pipeline

The pipeline is implemented in R and is structured to be compatible with local and cluster environments.