PubMatrixPython — full reference notebook¶
Complete walkthrough of every parameter and feature. Mirrors the PubMatrixR documentation.
Reference: Becker et al. (2003) BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61
Setup¶
import sys
sys.path.insert(0, '..')
import pandas as pd
import matplotlib.pyplot as plt
from pubmatrix import (
    pubmatrix,
    pubmatrix_from_file,
    plot_pubmatrix_heatmap,
    pubmatrix_heatmap,
)
NCBI API key¶
Without a key, NCBI allows 3 requests per second; with a key, 10 per second. Get one at https://account.ncbi.nlm.nih.gov/ and assign it to API_KEY:
API_KEY = "YOUR_KEY_HERE"
Leave API_KEY as None to run without a key.
API_KEY = None # replace with your key to increase rate limit
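One way to avoid hard-coding the key in a shared notebook is to read it from an environment variable. The sketch below assumes a variable named NCBI_API_KEY (a name chosen here for illustration, not something pubmatrix is known to read on its own) and uses the rate limits quoted above to estimate a lower bound on run time. Both helper functions are hypothetical.

```python
import os


def load_api_key(env_var="NCBI_API_KEY"):
    """Return the key stored in the environment, or None if unset or empty."""
    value = os.environ.get(env_var, "").strip()
    return value or None


def min_runtime_seconds(n_queries, api_key):
    """Lower bound on run time implied by NCBI's per-second request limits."""
    requests_per_second = 10 if api_key else 3
    return n_queries / requests_per_second


API_KEY = load_api_key()
# A 7 x 7 matrix needs 49 queries; print the implied minimum duration.
print(f"{min_runtime_seconds(49, API_KEY):.1f} s minimum for 49 queries")
```

Without a key, the 49-query matrix shown later in this notebook cannot finish faster than about 16 seconds, which is in line with the 19 seconds reported by its progress bar.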
pubmatrix() — core query function¶
pubmatrix(
    A,                        # list of str: column terms
    B,                        # list of str: row terms
    api_key       = None,     # NCBI API key
    database      = "pubmed", # "pubmed" or "pmc"
    daterange     = None,     # [start_year, end_year]
    outfile       = None,     # base filename for export
    export_format = None,     # None | "csv" | "ods"
    n_tries       = 2,        # retries on network failure
)
Returns a pandas.DataFrame: rows = B terms, columns = A terms, values = publication counts.
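The progress bars in this notebook (9/9 for a 3 × 3 matrix, 49/49 for 7 × 7) indicate one query per (A, B) pair. The sketch below lays out that grid, keyed by (row term, column term) to match the returned DataFrame. It assumes each pair is joined with AND; the exact query string pubmatrix sends is not shown in this notebook, and build_query_grid is a hypothetical helper.

```python
from itertools import product


def build_query_grid(A, B):
    """Map each (row term, column term) cell to an assumed search string."""
    # Rows come from B and columns from A, mirroring the DataFrame layout.
    return {(b, a): f"{a} AND {b}" for b, a in product(B, A)}


grid = build_query_grid(
    A=["WNT1", "WNT2", "CTNNB1"],
    B=["obesity", "diabetes", "cancer"],
)
print(len(grid), "queries")
print(grid[("obesity", "WNT1")])
```

The number of queries grows as len(A) × len(B), so doubling both lists quadruples the run time.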
Basic usage¶
A = ["WNT1", "WNT2", "CTNNB1"]
B = ["obesity", "diabetes", "cancer"]
result = pubmatrix(A=A, B=B, api_key=API_KEY)
result
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:04<00:00, 1.85query/s]
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 63 | 6 | 90 |
| diabetes | 118 | 18 | 267 |
| cancer | 1278 | 297 | 8363 |
Larger matrix — 7 × 7 WNT × obesity genes¶
wnt_genes = ["WNT1", "WNT2", "WNT3A", "WNT5A", "WNT7B", "CTNNB1", "DVL1"]
obesity_genes = ["LEPR", "ADIPOQ", "PPARG", "TNF", "IL6", "ADRB2", "INSR"]
result_wnt = pubmatrix(A=wnt_genes, B=obesity_genes, api_key=API_KEY)
result_wnt
Querying NCBI: 100%|█████████████████████████████████████| 49/49 [00:19<00:00, 2.55query/s]
| | WNT1 | WNT2 | WNT3A | WNT5A | WNT7B | CTNNB1 | DVL1 |
|---|---|---|---|---|---|---|---|
| LEPR | 6 | 0 | 0 | 2 | 0 | 4 | 0 |
| ADIPOQ | 2 | 0 | 0 | 6 | 0 | 9 | 0 |
| PPARG | 2 | 3 | 7 | 5 | 1 | 26 | 0 |
| TNF | 83 | 4 | 110 | 123 | 6 | 216 | 3 |
| IL6 | 75 | 7 | 87 | 143 | 9 | 151 | 3 |
| ADRB2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| INSR | 1 | 1 | 1 | 1 | 0 | 4 | 0 |
database parameter¶
"pubmed" (default) searches PubMed records (titles, abstracts, and indexing terms, including MEDLINE). "pmc" searches full-text articles in PubMed Central, so counts are typically higher.
result_pmc = pubmatrix(A=A, B=B, database="pmc", api_key=API_KEY)
result_pmc
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.58query/s]
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 2765 | 1013 | 4904 |
| diabetes | 4396 | 1610 | 7073 |
| cancer | 12695 | 5089 | 29362 |
# Side-by-side comparison
print("PubMed:")
print(result)
print("\nPMC:")
print(result_pmc)
PubMed:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363

PMC:
           WNT1  WNT2  CTNNB1
obesity    2765  1013    4904
diabetes   4396  1610    7073
cancer    12695  5089   29362
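To see how much higher the PMC counts are, the two matrices can be divided element-wise. The sketch below rebuilds both tables from the numbers printed above so that it runs without a network call; in the notebook itself, result_pmc / result would give the same ratios. The variable names are introduced here for illustration only.

```python
import pandas as pd

a_terms = ["WNT1", "WNT2", "CTNNB1"]
b_terms = ["obesity", "diabetes", "cancer"]

# Counts copied from the PubMed and PMC outputs shown in this notebook.
pubmed_counts = pd.DataFrame(
    [[63, 6, 90], [118, 18, 267], [1278, 297, 8363]],
    index=b_terms, columns=a_terms,
)
pmc_counts = pd.DataFrame(
    [[2765, 1013, 4904], [4396, 1610, 7073], [12695, 5089, 29362]],
    index=b_terms, columns=a_terms,
)

# Element-wise ratio; a zero PubMed count would yield inf or NaN here.
ratio = (pmc_counts / pubmed_counts).round(1)
print(ratio)
```

The ratio is largest where the abstract-level count is small, as expected when a term pair is mentioned in full text far more often than in titles and abstracts.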
daterange parameter¶
Filter results to a publication year range. Useful for tracking how co-occurrence changes over time.
result_2000_2010 = pubmatrix(A=A, B=B, daterange=[2000, 2010], api_key=API_KEY)
result_2011_2024 = pubmatrix(A=A, B=B, daterange=[2011, 2024], api_key=API_KEY)
print("2000–2010:")
print(result_2000_2010)
print("\n2011–2024:")
print(result_2011_2024)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.59query/s]
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.45query/s]
2000–2010:
          WNT1  WNT2  CTNNB1
obesity      1     0       5
diabetes    11     4      41
cancer     361    82    2100

2011–2024:
          WNT1  WNT2  CTNNB1
obesity     60     6      79
diabetes   103    12     201
cancer     768   169    5351
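NCBI's E-utilities esearch endpoint accepts mindate, maxdate, and datetype parameters, so a year range can be expressed as a request like the one below. This is a sketch of one plausible request: whether pubmatrix builds its queries this way is an assumption, as is the "A AND B" term string, and build_esearch_url is a hypothetical helper.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term_a, term_b, database="pubmed",
                      daterange=None, api_key=None):
    """Build a count-only esearch URL for one term pair."""
    params = {
        "db": database,
        "term": f"{term_a} AND {term_b}",  # assumed query form
        "rettype": "count",
        "retmode": "json",
    }
    if daterange is not None:
        start_year, end_year = daterange
        params["datetype"] = "pdat"        # filter on publication date
        params["mindate"] = str(start_year)
        params["maxdate"] = str(end_year)
    if api_key:
        params["api_key"] = api_key
    return ESEARCH_URL + "?" + urlencode(params)


print(build_esearch_url("WNT1", "obesity", daterange=[2000, 2010]))
```

The function only builds the URL and sends nothing, so it is safe to run offline.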
Export to CSV¶
Saves a .csv file in which each cell is an Excel HYPERLINK formula that links directly to the PubMed search for that term pair.
pubmatrix(A=A, B=B, outfile="output", export_format="csv", api_key=API_KEY)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.58query/s]
Saved CSV to output.csv
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 63 | 6 | 90 |
| diabetes | 118 | 18 | 267 |
| cancer | 1278 | 297 | 8363 |
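For readers who want to see what such a cell might look like, the sketch below builds a HYPERLINK formula for one term pair and writes it through the csv module, which takes care of quoting the embedded commas and double quotes. The URL pattern and the formula layout are assumptions made for illustration; the file written by pubmatrix may differ. hyperlink_cell is a hypothetical helper.

```python
import csv
import io
from urllib.parse import quote_plus


def hyperlink_cell(term_a, term_b, count):
    """Return a spreadsheet formula that shows the count and links to the search."""
    url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + quote_plus(f"{term_a} AND {term_b}")
    return f'=HYPERLINK("{url}", {count})'


# Write a one-cell matrix to an in-memory CSV to inspect the quoting.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["", "WNT1"])
writer.writerow(["obesity", hyperlink_cell("WNT1", "obesity", 63)])
print(buffer.getvalue())
```

Because every cell is a formula string, a file like this is meant for a spreadsheet application; for numeric analysis, use the returned DataFrame instead.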
Export to ODS¶
Same as CSV but in OpenDocument Spreadsheet format, with clickable hyperlinks in LibreOffice / OpenOffice.
pubmatrix(A=A, B=B, outfile="output", export_format="ods", api_key=API_KEY)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.56query/s]
Saved ODS to output.ods
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 63 | 6 | 90 |
| diabetes | 118 | 18 | 267 |
| cancer | 1278 | 297 | 8363 |
n_tries — retry on network failure¶
Default is 2. Increase for unstable connections.
result_retry = pubmatrix(A=A, B=B, n_tries=5, api_key=API_KEY)
result_retry
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.57query/s]
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 63 | 6 | 90 |
| diabetes | 118 | 18 | 267 |
| cancer | 1278 | 297 | 8363 |
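The general pattern behind a parameter like n_tries can be sketched as a small retry wrapper. This version treats n_tries as the total number of attempts and retries only on OSError (which covers ConnectionError and TimeoutError); both choices are assumptions, and pubmatrix's internal behaviour may differ. with_retries and make_flaky are hypothetical helpers.

```python
import time


def with_retries(fetch, n_tries=2, delay=0.0):
    """Call fetch() up to n_tries times, re-raising the last network error."""
    if n_tries < 1:
        raise ValueError("n_tries must be at least 1")
    for attempt in range(1, n_tries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == n_tries:
                raise              # out of attempts: propagate the error
            time.sleep(delay)      # optional pause before the next attempt


def make_flaky(failures, value):
    """Build a stand-in fetch function that fails a fixed number of times."""
    remaining = {"n": failures}

    def fetch():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise ConnectionError("simulated network failure")
        return value

    return fetch


# One simulated failure is absorbed by the second attempt.
print(with_retries(make_flaky(1, 63), n_tries=2))
```

Raising n_tries costs nothing on a healthy connection, since later attempts run only after a failure.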
pubmatrix_from_file() — load terms from a text file¶
File format: A terms first, then a # separator line, then B terms:
WNT1
WNT2
CTNNB1
#
obesity
diabetes
cancer
All keyword arguments are passed through to pubmatrix().
sample_terms = "WNT1\nWNT2\nCTNNB1\n#\nobesity\ndiabetes\ncancer\n"
with open("sample_terms.txt", "w") as f:
    f.write(sample_terms)
result_file = pubmatrix_from_file("sample_terms.txt", api_key=API_KEY)
result_file
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.55query/s]
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 63 | 6 | 90 |
| diabetes | 118 | 18 | 267 |
| cancer | 1278 | 297 | 8363 |
# With optional arguments
result_file_dated = pubmatrix_from_file(
    "sample_terms.txt",
    daterange=[2015, 2024],
    api_key=API_KEY,
)
result_file_dated
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00, 2.41query/s]
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 42 | 4 | 68 |
| diabetes | 89 | 11 | 179 |
| cancer | 579 | 136 | 4406 |
import os
os.remove("sample_terms.txt")
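The file format described above is simple enough to parse by hand, which can be useful for validating a terms file before spending queries on it. The sketch below assumes that any line starting with # acts as the separator and that blank lines are ignored; the parser inside pubmatrix_from_file may be stricter or more lenient. parse_terms is a hypothetical helper.

```python
def parse_terms(text):
    """Split terms-file text into (A terms, B terms) at the # separator."""
    a_terms, b_terms = [], []
    target = a_terms
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue               # skip blank lines
        if line.startswith("#"):
            target = b_terms       # everything after the separator goes to B
            continue
        target.append(line)
    return a_terms, b_terms


sample = "WNT1\nWNT2\nCTNNB1\n#\nobesity\ndiabetes\ncancer\n"
a_terms, b_terms = parse_terms(sample)
print(a_terms)
print(b_terms)
```

An empty B list after parsing usually means the separator line is missing from the file.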
Heatmap visualisation¶
Cell values show overlap percentage:
overlap = (intersection / union) × 100
union = row_total + col_total - intersection
This is a Jaccard-style normalisation — it accounts for terms that appear frequently on their own, so a pair like (CTNNB1, cancer) is not inflated just because both terms are common.
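The formula can be written out as a small function. What row_total and col_total refer to is not spelled out here; the sketch below assumes they are the row and column sums of the count matrix, and the package may define them differently (for example, as standalone counts for each term). overlap_percent is a hypothetical helper.

```python
def overlap_percent(intersection, row_total, col_total):
    """Jaccard-style overlap: intersection over union, as a percentage."""
    union = row_total + col_total - intersection
    if union <= 0:
        return 0.0                 # avoid dividing by zero for empty rows and columns
    return intersection / union * 100


# Counts from the basic 3 x 3 PubMed matrix in this notebook.
counts = {
    "obesity":  {"WNT1": 63,   "WNT2": 6,   "CTNNB1": 90},
    "diabetes": {"WNT1": 118,  "WNT2": 18,  "CTNNB1": 267},
    "cancer":   {"WNT1": 1278, "WNT2": 297, "CTNNB1": 8363},
}
row_totals = {b: sum(row.values()) for b, row in counts.items()}
col_totals = {a: sum(counts[b][a] for b in counts) for a in counts["obesity"]}

for b, row in counts.items():
    for a, n in row.items():
        print(b, a, round(overlap_percent(n, row_totals[b], col_totals[a]), 2))
```

Under this definition a cell can reach 100% only when its row and column contain no other counts, so the result depends on which other terms are in the matrix.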
pubmatrix_heatmap() — quick plot with defaults¶
pubmatrix_heatmap(result)
<Axes: title={'center': 'PubMatrix Results'}>
plot_pubmatrix_heatmap() — full control¶
plot_pubmatrix_heatmap(
    matrix,
    title         = "PubMatrix Co-occurrence Heatmap",
    cluster_rows  = True,
    cluster_cols  = True,
    show_numbers  = True,
    color_palette = None,   # list of hex colours; defaults to red gradient
    filename      = None,   # save to PNG if set
    width         = 10,
    height        = 8,
    scale_font    = True,
)
plot_pubmatrix_heatmap(
    result,
    title="WNT Genes × Disease Co-occurrence",
    cluster_rows=True,
    cluster_cols=True,
    show_numbers=True,
    width=8,
    height=5,
)
<Axes: title={'center': 'WNT Genes × Disease Co-occurrence'}>
Clustering disabled¶
plot_pubmatrix_heatmap(
    result,
    title="No clustering",
    cluster_rows=False,
    cluster_cols=False,
)
<Axes: title={'center': 'No clustering'}>
Numbers hidden¶
plot_pubmatrix_heatmap(
    result,
    title="No cell annotations",
    show_numbers=False,
)
<Axes: title={'center': 'No cell annotations'}>
Custom colour palette¶
Pass any list of hex colours — gradient is interpolated between them.
plot_pubmatrix_heatmap(
    result,
    title="Blue gradient",
    color_palette=["#deebf7", "#9ecae1", "#3182bd"],
)
<Axes: title={'center': 'Blue gradient'}>
plot_pubmatrix_heatmap(
    result,
    title="Green gradient",
    color_palette=["#e5f5e0", "#a1d99b", "#31a354"],
)
<Axes: title={'center': 'Green gradient'}>
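To make "interpolated" concrete, the sketch below linearly blends between the stops of a hex palette using only the standard library. It illustrates the idea rather than the package's implementation: matplotlib's LinearSegmentedColormap.from_list builds a colormap from a colour list in a similar way, and whether plot_pubmatrix_heatmap relies on it is an assumption. hex_to_rgb and interpolate_palette are hypothetical helpers.

```python
def hex_to_rgb(hex_colour):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    digits = hex_colour.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def interpolate_palette(colours, t):
    """Return the colour at position t in [0, 1] along evenly spaced stops."""
    if len(colours) == 1:
        return colours[0].lower()
    t = min(max(t, 0.0), 1.0)                  # clamp to the valid range
    position = t * (len(colours) - 1)
    index = min(int(position), len(colours) - 2)
    fraction = position - index
    low, high = hex_to_rgb(colours[index]), hex_to_rgb(colours[index + 1])
    blended = tuple(round(a + (b - a) * fraction) for a, b in zip(low, high))
    return "#{:02x}{:02x}{:02x}".format(*blended)


blues = ["#deebf7", "#9ecae1", "#3182bd"]
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, interpolate_palette(blues, t))
```

With three stops, the middle colour sits exactly at the midpoint of the scale, so palettes with more stops give finer control over where the gradient changes hue.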
Save to PNG¶
plot_pubmatrix_heatmap(
    result,
    title="Saved heatmap",
    filename="heatmap_full.png",
    width=8,
    height=5,
)
Saved heatmap to heatmap_full.png
<Axes: title={'center': 'Saved heatmap'}>
Working with the result DataFrame¶
The return value is a plain pandas.DataFrame, so all standard pandas operations apply.
Summary statistics¶
print("Column sums (total co-occurrences per A term):")
print(result.sum(axis=0))
print()
print("Row sums (total co-occurrences per B term):")
print(result.sum(axis=1))
Column sums (total co-occurrences per A term):
WNT1      1459
WNT2       321
CTNNB1    8720
dtype: int64

Row sums (total co-occurrences per B term):
obesity      159
diabetes     403
cancer      9938
dtype: int64
Bar charts — co-occurrences per term¶
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
col_totals = result.sum(axis=0).sort_values(ascending=False)
axes[0].bar(col_totals.index, col_totals.values, color="#de2d26")
axes[0].set_title("Co-occurrences per column term (A)")
axes[0].set_ylabel("Total publication count")
axes[0].tick_params(axis="x", rotation=45)
row_totals = result.sum(axis=1).sort_values(ascending=False)
axes[1].bar(row_totals.index, row_totals.values, color="#3182bd")
axes[1].set_title("Co-occurrences per row term (B)")
axes[1].set_ylabel("Total publication count")
axes[1].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
Temporal trend — comparing two date windows¶
# Reuse results computed above
diff = result_2011_2024 - result_2000_2010
print("Absolute change in co-occurrence counts (2011–2024 vs 2000–2010):")
diff
Absolute change in co-occurrence counts (2011–2024 vs 2000–2010):
| | WNT1 | WNT2 | CTNNB1 |
|---|---|---|---|
| obesity | 59 | 6 | 74 |
| diabetes | 92 | 8 | 160 |
| cancer | 407 | 87 | 3251 |
plot_pubmatrix_heatmap(
    diff,
    title="Change in co-occurrences: 2011–2024 vs 2000–2010",
    color_palette=["#f7f7f7", "#fc8d59", "#d73027"],
    cluster_rows=False,
    cluster_cols=False,
    width=7,
    height=4,
)
<Axes: title={'center': 'Change in co-occurrences: 2011–2024 vs 2000–2010'}>
Save results to CSV manually¶
result.to_csv("my_results.csv")
print("Saved.")
Saved.