MSigDF: Molecular Signature Database (MSigDB) in a Data Frame
Stephen D. Turner
vustephen@gmail.com
Enrique M. Toledo
enriquetoledo@gmail.com
Source: vignettes/msigdf.Rmd
Abstract
This data package contains the Molecular Signature Database
(MSigDB) for both human and predicted mouse orthologs in separate
data frames (tibbles). Each data frame (msigdf.human
and msigdf.mouse) contains four columns: the collection code
(Hallmark, or c1-c8), the sub-collection code, the gene set name, and the
symbol of each gene in that set. The msigdf.urls tibble contains
links to descriptions on the Broad Institute’s website of each
gene set. Source code available on
GitHub.
Data sources
Original data from the Broad Institute’s Molecular Signature Database (MSigDB), redistributed as separate gmt data files from the MSigDB.
The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.
Procedure:
1. The current MSigDB v2025.1 gmt files were downloaded from
the Broad FTP site.
2. This was done for both the human and mouse gene sets.
3. Each collection was converted to a list in R and written to an
RData file using save().
See the script in data-raw/ for how the data frames
(tibbles) were created.
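As a rough illustration of the conversion step, the sketch below parses a gmt file into a long data frame with one row per gene set and gene. It is not the package's actual data-raw/ script: the helper name read_gmt_tidy and the toy file contents are assumptions made for this example. It relies only on the gmt layout, in which each tab-separated line holds a gene set name, a description or URL, and then the genes.

```r
# Hypothetical helper (not part of msigdf): parse a gmt file into a long
# data frame with one row per gene set / gene pair.
read_gmt_tidy <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]                  # skip blank lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) {
    genes <- f[-(1:2)]                           # drop name and description
    data.frame(geneset = rep(f[1], length(genes)),
               symbol  = genes,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, sets)
}

# Toy file standing in for a downloaded collection (made-up content).
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\thttp://example.org/SET_A\tGENE1\tGENE2\tGENE3",
             "SET_B\thttp://example.org/SET_B\tGENE2\tGENE4"),
           gmt_path)

toy <- read_gmt_tidy(gmt_path)
toy
```

Adding category_code and category_subcode columns would then be a matter of tagging each parsed file with the collection it came from.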
Example usage
There are three data frames (tibbles) in this package. The
msigdf.human data frame labels each row with its MSigDB
collection and sub-collection (like cc, bp and mf for C5). The
data are tidy: each row holds one combination of collection,
sub-collection, gene set and gene symbol. The msigdf.mouse data frame
has the same structure for mouse orthologs. The msigdf.urls
data frame links the name of each gene set to its URL on the Broad’s
website.
New C5 ontology information was added to the category subcode for easy filtering and consistency.
- HPO: Human Phenotype Ontology
- MF: GO Molecular Function ontology
- BP: GO Biological Process ontology
- CC: GO Cellular Component ontology
The data sets in this package have several million rows. The package imports the tibble package so they’re displayed nicely.
Take a look:
## # A tibble: 6 × 4
## category_code category_subcode geneset symbol
## <chr> <chr> <chr> <chr>
## 1 c1 all MT MT-ATP6
## 2 c1 all MT MT-ATP8
## 3 c1 all MT MT-CO1
## 4 c1 all MT MT-CO2
## 5 c1 all MT MT-CO3
## 6 c1 all MT MT-CYB
## # A tibble: 6 × 4
## category_code category_subcode geneset symbol
## <chr> <chr> <chr> <chr>
## 1 m1 all MT mt-Atp6
## 2 m1 all MT mt-Atp8
## 3 m1 all MT mt-Co1
## 4 m1 all MT mt-Co2
## 5 m1 all MT mt-Co3
## 6 m1 all MT mt-Cytb
msigdf.urls %>% as.data.frame() %>% head()
## category_code category_subcode geneset
## 1 c1 all MT
## 2 c1 all chr10p11
## 3 c1 all chr10p12
## 4 c1 all chr10p13
## 5 c1 all chr10p14
## 6 c1 all chr10p15
## url
## 1 http://software.broadinstitute.org/gsea/msigdb/cards/MT
## 2 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p11
## 3 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p12
## 4 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p13
## 5 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p14
## 6 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p15
Just get the entries for the KEGG non-homologous end joining pathway:
## # A tibble: 13 × 4
## category_code category_subcode geneset symbol
## <chr> <chr> <chr> <chr>
## 1 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING DCLRE1C
## 2 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING DNTT
## 3 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING FEN1
## 4 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING LIG4
## 5 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING MRE11
## 6 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING NHEJ1
## 7 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING POLL
## 8 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING POLM
## 9 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING PRKDC
## 10 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING RAD50
## 11 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING XRCC4
## 12 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING XRCC5
## 13 c2 all KEGG_NON_HOMOLOGOUS_END_JOINING XRCC6
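A result like the one above can be produced by filtering on the geneset column. The sketch below shows the idea on a small toy tibble rather than the real data, so it runs without the package installed. The name toy_human and the placeholder symbols are assumptions; with the package loaded, msigdf.human would take its place.

```r
library(dplyr)

# Toy stand-in for msigdf.human: same four columns, mostly placeholder rows.
toy_human <- tibble(
  category_code    = c("c2", "c2", "c2", "h"),
  category_subcode = "all",
  geneset = c("KEGG_NON_HOMOLOGOUS_END_JOINING",
              "KEGG_NON_HOMOLOGOUS_END_JOINING",
              "SET_X",
              "SET_Y"),
  symbol  = c("LIG4", "XRCC4", "GENE_X", "GENE_Y")
)

# Keep only the rows belonging to one gene set.
nhej <- toy_human %>%
  filter(geneset == "KEGG_NON_HOMOLOGOUS_END_JOINING")
nhej
```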
Some software, e.g., fgsea, might require gene sets
to be a named list of gene identifiers, where the name of each element
in the list is the name of the pathway. This is how the data was
originally structured, and we can return to it by collapsing the symbols
of each gene set into a list column and calling tibble::deframe()
(plyr::dlply() is an alternative). Here, let’s use only the C2 sets, and
after converting the data into this named list format, get
just the first few pathways, and in each of those, just display the
first few gene symbols.
msigdf.human %>%
  filter(category_code=="c2") %>%
  select(geneset, symbol) %>%
  group_by(geneset) %>%
  summarize(symbol=list(symbol)) %>%
  deframe() %>%
  head() %>%
  map(head)
## $ABBUD_LIF_SIGNALING_1_DN
## [1] "AHNAK" "ALCAM" "ANKRD40" "ARID1A" "BCKDHB" "C16orf89"
##
## $ABBUD_LIF_SIGNALING_1_UP
## [1] "ACAA2" "ALDOC" "ANXA8L1" "BCL3" "CEBPB" "CXCL14"
##
## $ABBUD_LIF_SIGNALING_2_DN
## [1] "CGA" "CITED2" "NALCN" "PITX2" "PTHLH" "SCN1A"
##
## $ABBUD_LIF_SIGNALING_2_UP
## [1] "ATP1B1" "COL11A1" "DAB2" "DCN" "DIO2" "EZR"
##
## $ABDELMOHSEN_ELAVL4_TARGETS
## [1] "BCL2" "CAB39" "CASP3" "CDC42" "CDH2" "DLG4"
##
## $ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## [1] "ACTA2" "ALDH1A1" "ALDH3B1" "ITGB3BP" "MPPE1" "MTMR3"
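The same tidy-to-list conversion can also be sketched in base R with split(), and reversed with stack(). This is an alternative to the pipeline above, not what the vignette itself runs, and the toy data frame is made up for the example.

```r
# Toy stand-in for the geneset/symbol columns of msigdf.human.
toy <- data.frame(
  geneset = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  symbol  = c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4"),
  stringsAsFactors = FALSE
)

# Tidy -> named list: one character vector of symbols per gene set.
pathways <- split(toy$symbol, toy$geneset)

# Named list -> tidy again; stack() returns columns `values` and `ind`.
long <- stack(pathways)
names(long) <- c("symbol", "geneset")
long$geneset <- as.character(long$geneset)
long
```

The named list in pathways has the shape that tools expecting "pathway name to gene vector" input typically take.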
Further exploration
The number of rows (gene set/gene pairs) in each collection for each organism depends on how MSigDB constructs that collection.
Human collections of gene sets: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp
## # A tibble: 16 × 3
## # Groups: category_code [16]
## category_code category_subcode n
## <chr> <chr> <int>
## 1 c1 all 43429
## 2 c2 all 581467
## 3 c3 all 818674
## 4 c4 all 98548
## 5 c5 all 1361645
## 6 c6 all 30586
## 7 c7 all 990349
## 8 c8 all 157386
## 9 h all 7322
## 10 m1 all 41375
## 11 m2 all 197902
## 12 m3 all 396685
## 13 m5 all 885544
## 14 m7 all 70547
## 15 m8 all 47967
## 16 mh all 7191
Mouse collections of gene sets: https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp
## # A tibble: 7 × 3
## # Groups: category_code [7]
## category_code category_subcode n
## <chr> <chr> <int>
## 1 m1 all 41375
## 2 m2 all 197902
## 3 m3 all 396685
## 4 m5 all 885544
## 5 m7 all 70547
## 6 m8 all 47967
## 7 mh all 7191
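Summaries like the two tables above can be produced with dplyr::count(). The sketch below is a guess at the general approach, not the vignette's exact chunk, and it runs on a made-up toy tibble. It also shows how to count distinct gene sets per collection, which differs from the row count because each gene set spans many rows.

```r
library(dplyr)

# Toy stand-in for msigdf.human (placeholder gene sets and symbols).
toy_human <- tibble(
  category_code    = c("h", "h", "h", "c2", "c2"),
  category_subcode = "all",
  geneset = c("SET_A", "SET_A", "SET_B", "SET_C", "SET_C"),
  symbol  = c("GENE1", "GENE2", "GENE1", "GENE3", "GENE4")
)

# Rows (gene set/gene pairs) per collection and sub-collection.
rows_per_collection <- toy_human %>%
  count(category_code, category_subcode)

# Distinct gene sets per collection and sub-collection.
sets_per_collection <- toy_human %>%
  group_by(category_code, category_subcode) %>%
  summarize(n_sets = n_distinct(geneset), .groups = "drop")

rows_per_collection
sets_per_collection
```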
Get the URL for the hallmark set with the fewest genes
(Notch signaling). Optionally, pipe (%>%) this to
browseURL to open it in your browser.
msigdf.human %>%
  filter(category_code=="h") %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(msigdf.urls, by="geneset") %>%
  pull(url)
## [1] "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"
## [2] "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"
Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):
msigdf.human %>%
  filter(category_code=="c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>%
  arrange(desc(n))
## # A tibble: 844 × 2
## geneset n
## <chr> <int>
## 1 KEGG_OLFACTORY_TRANSDUCTION 389
## 2 KEGG_PATHWAYS_IN_CANCER 325
## 3 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 272
## 4 KEGG_MAPK_SIGNALING_PATHWAY 267
## 5 KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION 264
## 6 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 213
## 7 KEGG_FOCAL_ADHESION 199
## 8 KEGG_CHEMOKINE_SIGNALING_PATHWAY 188
## 9 KEGG_HUNTINGTONS_DISEASE 183
## 10 KEGG_ENDOCYTOSIS 181
## # ℹ 834 more rows
Session info
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] msigdf_2025.1 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [5] dplyr_1.1.4 purrr_1.2.0 readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.3.0 ggplot2_4.0.0 tidyverse_2.0.0 knitr_1.50
## [13] BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4
## [4] stringi_1.8.7 hms_1.1.4 digest_0.6.37
## [7] magrittr_2.0.4 evaluate_1.0.5 grid_4.5.1
## [10] timechange_0.3.0 RColorBrewer_1.1-3 bookdown_0.45
## [13] fastmap_1.2.0 jsonlite_2.0.0 BiocManager_1.30.26
## [16] scales_1.4.0 textshaping_1.0.4 jquerylib_0.1.4
## [19] cli_3.6.5 rlang_1.1.6 withr_3.0.2
## [22] cachem_1.1.0 yaml_2.3.10 tools_4.5.1
## [25] tzdb_0.5.0 vctrs_0.6.5 R6_2.6.1
## [28] lifecycle_1.0.4 fs_1.6.6 htmlwidgets_1.6.4
## [31] ragg_1.5.0 pkgconfig_2.0.3 desc_1.4.3
## [34] pkgdown_2.2.0 pillar_1.11.1 bslib_0.9.0
## [37] gtable_0.3.6 glue_1.8.0 systemfonts_1.3.1
## [40] xfun_0.54 tidyselect_1.2.1 farver_2.1.2
## [43] htmltools_0.5.8.1 rmarkdown_2.30 compiler_4.5.1
## [46] S7_0.2.0