Pipeline Inputs
The PxN pipeline needs two main components: (1) a gene expression background, and (2) a gene set table. PxN comes with a series of gene expression datasets and gene set files ready to use. It is also possible for the user to provide a custom gene set and/or a custom background dataset. Changing the background dataset requires several pre-processing steps, which are explained in the Advanced section. The sections below describe the built-in PxN datasets and outline the steps needed to incorporate a custom gene set into the pipeline.
Overview
PxN comes with two main background datasets and several gene sets. The pre-processed gene sets come from three major sources:
- MSigDB version 7 C2 collection (internally tagged as MSigDBv7)
- MSigDB version 6 C2 collection (internally tagged as MSigDBv6)
- Genedex: a manually curated database of Alzheimer's Disease facets (internally tagged as genedex)
The background datasets are:
GTex Toil
This dataset is part of the UCSC Toil RNAseq Recompute Compendium. Raw files from the TOIL_RSEM_norm_count were downloaded from the Xena browser. The internal tag for this dataset is gtextoil. It includes 19,561 genes across 7,847 samples from 54 different tissues. A subset of this dataset containing only brain and immune tissues is available with the tag gtextoil_iBrain.
Microarray Barcode
This dataset was obtained from the set of legacy input files of the original PCxN version. It is provided for reference and comparison with the previous version of the pipeline. The internal tag for this dataset is HGU133plus2. It contains 20,590 genes across 3,207 samples from 72 tissues.
Data Download
The PxN built-in data folder can be downloaded as a .tgz file from Zenodo. To make sure that it integrates into the pipeline ecosystem, the file needs to be expanded inside the input folder under pipeline.
For example, if you cloned the repo into ~/pxn, your directory structure should look like this:
- ~/pxn/pipeline
  - input/
  - scripts/
  - output/
Steps
- Download the data from Zenodo
- Move the downloaded file into the ~/pxn/pipeline/input folder
- While inside the ~/pxn/pipeline/input folder, expand the file:
cd ~/pxn/pipeline/input # Enter the input directory
tar -xvzf input.tgz # Expand the tar file
rm input.tgz # Delete the tar file after expanding
Contents
Each subfolder comes with a README.md file providing extensive documentation about its contents. In summary, the input folder contains the following subdirectories:
| Directory | Contents |
|---|---|
| augment_sets | Described in the Advanced section |
| gene_expression | Processed gene expression data |
| gene_sets | Processed gene set objects |
| std_gene_tables | Reference gene set tables |
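After expanding the archive, a quick check that the expected subdirectories are in place can catch a misplaced download early. The sketch below is illustrative only: the helper name `missing_input_dirs` is hypothetical and not part of PxN, and the directory names are taken from the table above.

```python
from pathlib import Path

# Subdirectories listed in the table above.
EXPECTED_SUBDIRS = ["augment_sets", "gene_expression", "gene_sets", "std_gene_tables"]


def missing_input_dirs(input_dir):
    """Return the expected subdirectories that are absent from input_dir.

    Hypothetical helper for illustration; not part of the PxN pipeline.
    """
    root = Path(input_dir).expanduser()
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Path assumes the repo was cloned into ~/pxn, as in the example above.
    missing = missing_input_dirs("~/pxn/pipeline/input")
    print("Missing subdirectories:", missing or "none")
```

An empty result suggests the archive was expanded in the right place; any name in the list points to a folder that did not end up under input.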
The gene_expression folder
This folder includes post-processed gene expression datasets structured in the format required for PxN. The processing steps undertaken to generate these datasets are:
- Keeping only genes expressed with at least 3 counts in at least 1 sample (these genes constitute the gene universe).
- Discarding tissues with fewer than 10 samples.
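The two filtering steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code used to build the shipped datasets: the function name `filter_background`, the nested-dictionary data layout, and the order in which the two filters are applied (genes first, then tissues) are all assumptions made for the example.

```python
from collections import Counter

MIN_COUNT = 3                 # a gene needs at least this many counts in >= 1 sample
MIN_SAMPLES_PER_TISSUE = 10   # tissues with fewer samples are discarded


def filter_background(counts, sample_tissue):
    """Apply the two filters described above (hypothetical helper).

    counts:        {gene: {sample: count}}
    sample_tissue: {sample: tissue}
    Returns the filtered counts, restricted to the gene universe and to
    samples from sufficiently large tissues.
    """
    # Step 1 (assumed to run first): the gene universe is every gene that
    # reaches MIN_COUNT in at least one sample.
    universe = {
        gene for gene, row in counts.items()
        if any(value >= MIN_COUNT for value in row.values())
    }
    # Step 2: keep only samples whose tissue has enough samples.
    tissue_sizes = Counter(sample_tissue.values())
    kept_samples = {
        sample for sample, tissue in sample_tissue.items()
        if tissue_sizes[tissue] >= MIN_SAMPLES_PER_TISSUE
    }
    return {
        gene: {s: c for s, c in counts[gene].items() if s in kept_samples}
        for gene in sorted(universe)
    }
```

With real data, the same logic would more likely be expressed on a genes-by-samples matrix; the dictionary form is used here only to keep the example free of third-party dependencies.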
List of datasets:
- gtextoil: This dataset is based on the GTex Toil dataset.
  - Gene universe size: 19561
  - Number of samples: 7847
  - Number of tissues: 54
- gtextoil_iBrain: This is a subset of the gtextoil dataset that includes only brain and immune tissues.
  - Gene universe size: 18994
  - Number of samples: 2746
  - Number of tissues: 22
- HGU133plus: This dataset is based on the HGU133plus microarray barcode dataset.
  - Gene universe size: 20590
  - Number of samples: 3207
  - Number of tissues: 72
The gene_sets folder
Each reference gene set gets processed for a particular gene expression background to ensure that only genes expressed in the background dataset are included in the gene set used for analysis. This results in unique background-gene set combinations that are labelled using the gene set and background dataset internal tags.
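As a rough illustration of this step, the sketch below restricts each gene set to a background gene universe and builds the combined label. The function names are hypothetical and not part of PxN; the label format (gene set tag, two underscores, background tag) is inferred from the names of the built-in combinations, and whether the pipeline drops sets that end up empty is not stated here, so the sketch keeps them.

```python
def restrict_to_universe(gene_sets, universe):
    """Keep, for each gene set, only genes present in the background universe.

    gene_sets: {set_name: [gene IDs]}
    universe:  iterable of gene IDs expressed in the background dataset
    Hypothetical helper for illustration only.
    """
    universe = set(universe)
    return {
        name: [gene for gene in genes if gene in universe]
        for name, genes in gene_sets.items()
    }


def combination_label(gene_set_tag, background_tag):
    """Build a label such as 'MSigDBv7__gtextoil' from the two internal tags."""
    return f"{gene_set_tag}__{background_tag}"


if __name__ == "__main__":
    sets = {"Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS": [55902, 2645, 5232]}
    # Toy universe: pretend only two of the three genes are expressed.
    print(restrict_to_universe(sets, universe=[2645, 5232]))
    print(combination_label("MSigDBv7", "gtextoil"))
```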
List of gene sets:
- MSigDBv7\__gtextoil: This gene set contains the pathways from MSigDB version 7 filtered for the gene universe of the GTex toil (gfilter) dataset.
  - Pathways: 1186
  - Pathway pairs: 702689
- MSigDBv6\__gtextoil: This gene set contains the pathways from MSigDB version 6 filtered for the gene universe of the GTex toil (gfilter) dataset.
  - Pathways: 851
  - Pathway pairs: 361628
- genedex\__gtextoil: This gene set contains the gene lists of AD facets available in Genedex (accessed Nov 2024), processed for the GTex toil (gfilter) dataset.
  - Pathways: 242
  - Pathway pairs: 16834
- MSigDBv7\__HGU133plus2: This gene set contains the pathways from MSigDB version 7 filtered for the gene universe of the HGU133plus2 barcode (gfilter) dataset.
  - Pathways: 1199
  - Pathway pairs: 718189
The std_gene_tables folder
This directory contains a series of standard tables prepared for different reference gene sets. These files are used as the source to prepare the gene set files in the gene_sets folder. There are three tables included by default in the pipeline, one for each source database. The naming convention is \[GENESETNAME\]\_pathway\_table.csv. Custom user-specific standard tables (see below) should also be placed here to ensure consistency.
Creating custom gene sets
To run the pipeline with your own gene set, you will need to generate a standard table: a tab-separated file with two columns, set_name and genes. This file should be placed inside the std_gene_tables folder, following the file naming convention of standard tables (described above). The set_name column indicates the name of the pathway or gene list, and the genes column lists the Entrez IDs of the genes in that pathway. The table has one row per gene: if a pathway has 10 genes, it occupies 10 rows, with the pathway name repeated on each. You can find several examples inside the folder scripts/example_notebooks on how to generate this table from unprocessed inputs from public databases such as MSigDB or custom gene sets like Genedex.
Regardless of the starting point, the standard table should look like this:
| set_name | genes |
|---|---|
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 55902 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 2645 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5232 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5230 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5162 |
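For readers starting from gene lists already held in memory, a table in this long format could be written along the following lines. This is a minimal sketch rather than the approach used in the example notebooks: the function name `write_standard_table` and the gene set name `mysets` are hypothetical, and the tab delimiter and header row are assumptions based on the description above.

```python
import csv
from pathlib import Path


def write_standard_table(gene_sets, gene_set_name, out_dir):
    """Write one row per (set_name, gene) pair and return the file path.

    gene_sets:     {set_name: [Entrez gene IDs]}
    gene_set_name: used in the file name [GENESETNAME]_pathway_table.csv
    out_dir:       target folder, e.g. the std_gene_tables folder
    Hypothetical helper for illustration only.
    """
    out_path = Path(out_dir) / f"{gene_set_name}_pathway_table.csv"
    with open(out_path, "w", newline="") as handle:
        # Tab-separated, as described above (assumed despite the .csv suffix).
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["set_name", "genes"])
        for set_name, genes in gene_sets.items():
            for gene in genes:
                # The pathway name is repeated on every row of that pathway.
                writer.writerow([set_name, gene])
    return out_path


if __name__ == "__main__":
    example = {
        "Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS": [55902, 2645, 5232, 5230, 5162],
    }
    # Writes mysets_pathway_table.csv into the current directory.
    print(write_standard_table(example, "mysets", "."))
```

Pointing `out_dir` at the std_gene_tables folder keeps the custom table alongside the built-in ones.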