Output

pyIPSA stores its output in several folders which store results of consecutive steps of the pipeline. The structure of the output is the following:

├── aggregated_junction_stats.tsv
├── aggregated_library_stats.tsv
├── J1
|   ├── <sample>.J1.gz
|   └── <sample>.library_stats.txt
├── J2
|   └── <sample>.J2.gz
├── J3
|   └── <sample>.J3.gz
├── J4
|   ├── <sample>.J4.gz
|   └── <sample>.junction_stats.txt
├── J6
|   └── <sample>.J6.gz
├── R
|   └── <sample>.R.gz
├── S1
|   └── <sample>.S1.gz
├── S2
|   └── <sample>.S2.gz
└── S6
|   └── <sample>.S6.gz

All files are plain-text and tab-separated if they present table data. Files with junctions and sites are also gzipped.

J1 - Junction and Counts

pyIPSA’s first step output is 2 files per each sample in J1 folder: junction counts file and library statistics file. Junction counts file is a plain-text, tab-separated file without a header. It is named <sample>.J1.gz. Example of its content:

junction id	offset	F1	R1	F2	R2
chr1_500_700	9	0	5	2	1
chr1_500_700	12	0	0	0	2
chr1_850_950	24	2	0	1	0

Each row describes junction and counts from all reads supporting the junction in given offset. The columns have the following interpretation:

junction id — reference sequence (chromosome), 1-based start position and 1-based end position joined by _
offset — distance from start of the read to the start of the junction
F1 — count for all read 1 from the forward strand
R1 — count for all read 1 from the reverse strand
F2 — count for all read 2 from the forward strand
R2 — count for all read 2 from the reverse strand

The second file contains various data about RNASeq library and alignment extracted from BAM. It is plain-text and named <sample>.library_stats.txt.

J2 - Junctions and Aggregated Counts

The second step’s output is one junctions with aggregated counts file for each sample in J2 folder. It is named <sample>.J2.gz and has the following content:

junction id	total count	staggered count	entropy
chr1_500_700_+	10	2	1.52
chr1_850_950_+	3	1	1.22

Each row describes junction and counts from all reads supporting the junction aggregated by all possible offsets. The columns have the following interpretation:

junction id — reference sequence (chromosome), 1-based start position, 1-based end position and now also strand joined by _
total count — total count of this junction from all reads in all offsets
staggered count — number of offsets that have read with count for this junction
entropy — entropy for offset distribution

J3 - Annotated Junctions

The third step’s output is annotated junctions file for each sample in J3 folder. It is named <sample>.J3.gz and has the following content:

junction id	total count	staggered count	entropy	annotation status	splice site
chr1_500_700_+	10	2	1.52	3	GTAG
chr1_850_950_+	3	1	1.22	0	GGAG

First 4 columns are the same as in J2 file, additional ones are:

annotation status — possible values are:
- 0 - both junction ends are absent in annotation
- 1 - one junction end is present in annotation
- 2 - both junction ends are present in annotation
- 3 - both junction ends are present in annotation and correspond to existing intron
splice site — donor and acceptor splice site sequences. Usually they are GT and AG but can vary.

J4 - Choose Strand

The fourth step’s output is 2 files for each sample: annotated junctions file with correct strand chosen for each junctions and junctions stats file. The first file is named <sample>.J4.gz and has the same format as file from J3 step:

junction id	total count	staggered count	entropy	annotation status	splice site
chr1_500_700_+	10	2	1.52	3	GTAG
chr1_850_950_+	3	1	1.22	0	GGAG

But some records from J3 will be missed due to strand choice. The second file is <sample>.junctions_stats.txt, it just reports how many read counts support junctions with GTAG or non-GTAG splice sites, annotated and non-annotated junctions and etc.

J6 - Filter Junctions

This step’s output is <sample>.J6.gz. It has the same format as files from J3 and J4 steps:

junction id	total count	staggered count	entropy	annotation status	splice site
chr1_500_700_+	10	2	1.52	3	GTAG
chr1_850_950_+	3	1	1.22	0	GGAG

The purpose of step is to filter out junctions not passing some of criteria:

total count — must be not less than total_count value in config file (default is 1)
entropy — must be not less than entropy value in config file (default is 1.5)
splice site — must be GTAG only if config file’s parameter gtag set to True (default is False)

Gather Junctions

Some additional files are generated along with J1-J6:

aggregated_library_stats.tsv — library parameters (from J1) of all samples present in one table
aggregated_junction_stats.tsv — junction stats (from J4) of all samples present in one table
J4/merged_junctions.J4.gz — a union of all junctions files from J4 step. Contains all unique junctions found in alignments. Computed only if parameter pooled in config set to True.

S1 (PS1) - Sites and Counts

This step is similar to J1, but now it works with sites, not junctions. The output file name is <sample>.S1.gz. <sample>.PS1.gz if junctions were pooled. The format is:

site id	offset	count
chr1_500_+	9	2
chr1_700_+	12	9
chr1_900_+	5	4

site id — reference sequence (chromosome), site’s 1-based position and strand joined by _
offset — distance from start of the read to the position of site
count — total count of read supporting the site

S2 (PS2) - Aggregate Sites

This step is similar to J2, it aggregates offsets for each junction. The output file name is <sample>.S2.gz or <sample>.PS2.gz if junctions were pooled. The format is:

site id	total count	staggered count	entropy
chr1_500_+	9	1	0.96
chr1_700_+	12	1	1.55

New columns are:

total count — total count of this site from all reads in all offsets
staggered count — number of offsets that have read with count for this site
entropy — entropy for offset distribution

S6 (PS6) - Filter Sites

The output file name is <sample>.S6.gz or <sample>.PS6.gz if junctions were pooled. The format is the same as S2:

site id	total count	staggered count	entropy
chr1_500_+	9	1	0.96
chr1_700_+	12	1	1.55

The purpose of this step is to filter out sites not passing some of criteria:

total count — must be not less than total_count value in config file (default is 1)
entropy — must be not less than entropy value in config file (default is 1.5)

R (PR) - Rates

The output file name is <sample>.R.gz or <sample>.PR.gz if junctions were pooled. The format:

site id	inclusion	exclusion	retention
chr1_500_+_D	5	0	6
chr1_700_+_A	11	0	13

site id — reference sequence (chromosome), site’s 1-based, and type (D - donor, A - acceptor) and strand joined by _
inclusion — number of reads supporting inclusion of this site
exclusion — number of reads supporting exclusion of this site
retention — number of reads support retention of this site