Output
pyIPSA writes its output to several folders, one per pipeline step. The output has the following structure:
├── aggregated_junction_stats.tsv
├── aggregated_library_stats.tsv
├── J1
| ├── <sample>.J1.gz
| └── <sample>.library_stats.txt
├── J2
| └── <sample>.J2.gz
├── J3
| └── <sample>.J3.gz
├── J4
| ├── <sample>.J4.gz
| └── <sample>.junction_stats.txt
├── J6
| └── <sample>.J6.gz
├── R
| └── <sample>.R.gz
├── S1
| └── <sample>.S1.gz
├── S2
| └── <sample>.S2.gz
└── S6
  └── <sample>.S6.gz
All files are plain text; files that contain tabular data are tab-separated. Files with junctions and sites are also gzipped.
J1 - Junction and Counts
The first step of pyIPSA writes two files per sample to the J1 folder:
a junction counts file and a library statistics file. The junction counts file is a
plain-text, tab-separated file without a header. It is named <sample>.J1.gz.
Example of its content:
| junction id | offset | F1 | R1 | F2 | R2 |
|---|---|---|---|---|---|
| chr1_500_700 | 9 | 0 | 5 | 2 | 1 |
| chr1_500_700 | 12 | 0 | 0 | 0 | 2 |
| chr1_850_950 | 24 | 2 | 0 | 1 | 0 |
Each row describes a junction and the counts of all reads supporting it at a given offset. The columns are interpreted as follows:
junction id — reference sequence (chromosome), 1-based start position and 1-based end position joined by _
offset — distance from start of the read to the start of the junction
F1 — count for all read 1 from the forward strand
R1 — count for all read 1 from the reverse strand
F2 — count for all read 2 from the forward strand
R2 — count for all read 2 from the reverse strand
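The column layout above can be read with the standard library alone. The following is a minimal sketch, not pyIPSA's own reader: the names `J1Record` and `read_j1` are hypothetical, and the only assumptions are the ones stated in this section (gzipped, tab-separated, no header, six columns).

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class J1Record(NamedTuple):
    """One row of a J1 file (hypothetical helper type)."""
    junction_id: str
    offset: int
    f1: int
    r1: int
    f2: int
    r2: int


def read_j1(path: str) -> Iterator[J1Record]:
    """Yield one record per (junction, offset) row of a gzipped J1 file."""
    # The file has no header, so columns are addressed by position.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            junction_id, *counts = row
            yield J1Record(junction_id, *map(int, counts))
```

Iterating over `read_j1("J1/<sample>.J1.gz")` then gives typed access to each column, for example `record.offset` or `record.f1`.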
The second file contains various data about the RNA-seq library and the alignment, extracted from the BAM file.
It is a plain-text file named <sample>.library_stats.txt.
J2 - Junctions and Aggregated Counts
The second step writes one file of junctions with aggregated counts
per sample to the J2 folder. It is named <sample>.J2.gz and has the following content:
| junction id | total count | staggered count | entropy |
|---|---|---|---|
| chr1_500_700_+ | 10 | 2 | 1.52 |
| chr1_850_950_+ | 3 | 1 | 1.22 |
Each row describes a junction and the counts of all reads supporting it, aggregated over all offsets. The columns are interpreted as follows:
junction id — reference sequence (chromosome), 1-based start position, 1-based end position and, from this step on, strand, joined by _
total count — total count of reads supporting this junction over all offsets
staggered count — number of distinct offsets at which at least one read supports this junction
entropy — entropy of the offset distribution
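To make the three aggregated columns concrete, here is a rough sketch of how per-offset counts could be collapsed. It rests on several assumptions that this section does not confirm: the per-offset count is taken as a single number (for J1 rows, for instance, the sum F1 + R1 + F2 + R2), strand assignment is ignored, and entropy is taken to be Shannon entropy in bits. The function name `aggregate_offsets` is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_offsets(
    rows: Iterable[Tuple[str, int, int]]
) -> Dict[str, Tuple[int, int, float]]:
    """Collapse (id, offset, count) rows into (total, staggered, entropy) per id."""
    per_offset = defaultdict(lambda: defaultdict(int))
    for key, offset, count in rows:
        per_offset[key][offset] += count

    result = {}
    for key, offsets in per_offset.items():
        counts = [c for c in offsets.values() if c > 0]
        total = sum(counts)
        if total == 0:
            continue  # no supporting reads at any offset
        # Assumption: Shannon entropy (base 2) of the offset distribution.
        entropy = sum(-(c / total) * math.log2(c / total) for c in counts)
        # Staggered count = number of distinct offsets with support.
        result[key] = (total, len(counts), entropy)
    return result
```

Under these assumptions, a junction whose reads all share one offset gets an entropy of 0, and the value grows as reads spread more evenly over more offsets.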
J3 - Annotated Junctions
The third step writes one annotated junctions file per sample to the J3 folder.
It is named <sample>.J3.gz and has the following content:
| junction id | total count | staggered count | entropy | annotation status | splice site |
|---|---|---|---|---|---|
| chr1_500_700_+ | 10 | 2 | 1.52 | 3 | GTAG |
| chr1_850_950_+ | 3 | 1 | 1.22 | 0 | GGAG |
The first 4 columns are the same as in the J2 file. The additional columns are:
annotation status — possible values are:
0 - both junction ends are absent in annotation
1 - one junction end is present in annotation
2 - both junction ends are present in annotation
3 - both junction ends are present in annotation and correspond to existing intron
splice site — donor and acceptor splice site sequences. Usually they are GT and AG, but they can vary.
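As an illustration of how the two extra columns can be decoded, the sketch below maps the annotation status codes to named constants and splits the splice site column into its donor and acceptor halves. The names `AnnotationStatus`, `J3Record` and `parse_j3_line` are hypothetical, and the split assumes the four-letter form shown in the example (two donor bases followed by two acceptor bases).

```python
from enum import IntEnum
from typing import NamedTuple


class AnnotationStatus(IntEnum):
    """Annotation status codes as listed above."""
    NO_END_ANNOTATED = 0
    ONE_END_ANNOTATED = 1
    BOTH_ENDS_ANNOTATED = 2
    ANNOTATED_INTRON = 3


class J3Record(NamedTuple):
    junction_id: str
    total_count: int
    staggered_count: int
    entropy: float
    status: AnnotationStatus
    donor: str
    acceptor: str


def parse_j3_line(line: str) -> J3Record:
    """Parse one tab-separated line of a J3, J4 or J6 file."""
    junction_id, total, staggered, entropy, status, site = (
        line.rstrip("\n").split("\t")
    )
    # Assumption: the splice site column is donor (2 nt) + acceptor (2 nt).
    return J3Record(
        junction_id, int(total), int(staggered), float(entropy),
        AnnotationStatus(int(status)), site[:2], site[2:],
    )
```

For instance, `parse_j3_line("chr1_500_700_+\t10\t2\t1.52\t3\tGTAG")` yields a record whose `status` is `AnnotationStatus.ANNOTATED_INTRON`, with donor `GT` and acceptor `AG`.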
J4 - Choose Strand
The fourth step writes two files per sample:
an annotated junctions file with the correct strand chosen for each junction, and
a junction stats file. The first file is named <sample>.J4.gz and has
the same format as the file from the J3 step:
| junction id | total count | staggered count | entropy | annotation status | splice site |
|---|---|---|---|---|---|
| chr1_500_700_+ | 10 | 2 | 1.52 | 3 | GTAG |
| chr1_850_950_+ | 3 | 1 | 1.22 | 0 | GGAG |
However, some records from J3 may be absent because of the strand choice.
The second file is <sample>.junctions_stats.txt. It reports
how many read counts support junctions with GTAG or non-GTAG splice sites,
annotated and non-annotated junctions, and so on.
J6 - Filter Junctions
This step’s output is <sample>.J6.gz. It has the same format as the files from
the J3 and J4 steps:
| junction id | total count | staggered count | entropy | annotation status | splice site |
|---|---|---|---|---|---|
| chr1_500_700_+ | 10 | 2 | 1.52 | 3 | GTAG |
| chr1_850_950_+ | 3 | 1 | 1.22 | 0 | GGAG |
The purpose of this step is to filter out junctions that fail any of the following criteria:
total count — must be not less than the `total_count` value in the config file (default is `1`)
entropy — must be not less than the `entropy` value in the config file (default is `1.5`)
splice site — must be GTAG, but only if the config file’s `gtag` parameter is set to `True` (default is `False`)
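The criteria above amount to a simple predicate. The sketch below is an illustration only: the function name `passes_junction_filters` and its keyword arguments are hypothetical, although the defaults mirror the config defaults stated in this section.

```python
def passes_junction_filters(
    total_count: int,
    entropy: float,
    splice_site: str,
    *,
    min_total_count: int = 1,   # config `total_count`
    min_entropy: float = 1.5,   # config `entropy`
    gtag: bool = False,         # config `gtag`
) -> bool:
    """Return True if a junction record satisfies all filtering criteria."""
    if total_count < min_total_count:
        return False
    if entropy < min_entropy:
        return False
    # The splice site is only checked when the gtag option is enabled.
    if gtag and splice_site != "GTAG":
        return False
    return True
```

With the default thresholds, a record with total count 10, entropy 1.52 and splice site GTAG passes, while a record with entropy 1.22 does not.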
Gather Junctions
Some additional files are generated along with J1-J6:
`aggregated_library_stats.tsv` — library parameters (from J1) of all samples, presented in one table
`aggregated_junction_stats.tsv` — junction stats (from J4) of all samples, presented in one table
`J4/merged_junctions.J4.gz` — a union of all junction files from the J4 step. Contains all unique junctions found in alignments. Computed only if the `pooled` parameter in the config is set to `True`.
S1 (PS1) - Sites and Counts
This step is similar to J1, but it works with sites instead of junctions.
The output file name is <sample>.S1.gz, or
<sample>.PS1.gz if junctions were pooled. The format is:
| site id | offset | count |
|---|---|---|
| chr1_500_+ | 9 | 2 |
| chr1_700_+ | 12 | 9 |
| chr1_900_+ | 5 | 4 |
site id — reference sequence (chromosome), the site’s 1-based position and strand joined by _
offset — distance from the start of the read to the position of the site
count — total count of reads supporting the site
S2 (PS2) - Aggregate Sites
This step is similar to J2: it aggregates offsets for each site.
The output file name is <sample>.S2.gz or
<sample>.PS2.gz if junctions were pooled. The format is:
| site id | total count | staggered count | entropy |
|---|---|---|---|
| chr1_500_+ | 9 | 1 | 0.96 |
| chr1_700_+ | 12 | 1 | 1.55 |
The new columns are:
total count — total count of reads supporting this site over all offsets
staggered count — number of distinct offsets at which at least one read supports this site
entropy — entropy of the offset distribution
S6 (PS6) - Filter Sites
The output file name is <sample>.S6.gz or
<sample>.PS6.gz if junctions were pooled.
The format is the same as S2:
| site id | total count | staggered count | entropy |
|---|---|---|---|
| chr1_500_+ | 9 | 1 | 0.96 |
| chr1_700_+ | 12 | 1 | 1.55 |
The purpose of this step is to filter out sites that fail any of the following criteria:
total count — must be not less than the `total_count` value in the config file (default is `1`)
entropy — must be not less than the `entropy` value in the config file (default is `1.5`)
R (PR) - Rates
The output file name is <sample>.R.gz or <sample>.PR.gz
if junctions were pooled.
The format:
| site id | inclusion | exclusion | retention |
|---|---|---|---|
| chr1_500_+_D | 5 | 0 | 6 |
| chr1_700_+_A | 11 | 0 | 13 |
site id — reference sequence (chromosome), the site’s 1-based position, strand and type (D - donor, A - acceptor) joined by _
inclusion — number of reads supporting inclusion of this site
exclusion — number of reads supporting exclusion of this site
retention — number of reads supporting retention of this site
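Because reference sequence names can themselves contain underscores, a site id is safest to split from the right. The sketch below follows the field order seen in the example rows (position, strand, type); the names `RateSite` and `parse_rate_site_id` are hypothetical.

```python
from typing import NamedTuple


class RateSite(NamedTuple):
    chrom: str
    position: int   # 1-based
    strand: str     # "+" or "-"
    site_type: str  # "D" (donor) or "A" (acceptor)


def parse_rate_site_id(site_id: str) -> RateSite:
    """Split an R-file site id such as 'chr1_500_+_D' into its parts."""
    # rsplit keeps any underscores inside the chromosome name intact.
    chrom, position, strand, site_type = site_id.rsplit("_", 3)
    return RateSite(chrom, int(position), strand, site_type)
```

Splitting from the right means an id built on a scaffold name with underscores still resolves to the full reference name rather than its first fragment.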