Analysis flowchart

A total of 2,485 Sequence Read Archive (SRA) entries from human small RNA-seq samples, containing on the order of a trillion bases (1.23 TB in total), were retrieved and processed using the SRA Toolkit. Adapter trimming and sequence quality filtering were done using ea-utils. The resulting filtered FASTQ files were indexed using an algorithm developed in-house. The index files were then compiled using a filter, and FASTA files for downstream analysis were generated from the compiled file.
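
The in-house indexing algorithm and the filter are not specified in this section. As an illustration only, the sketch below assumes that the index records how often each distinct read sequence occurs in a filtered FASTQ file, and that the filter is a minimum-count threshold. The function names, the `min_count` parameter, and the FASTA header format are hypothetical and are not taken from the actual pipeline.

```python
import io
from collections import Counter
from typing import Dict, Iterator, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each record, assuming strict 4-line FASTQ records."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the bases
            seq = line.strip()
            if seq:
                yield seq


def index_sequences(handle: TextIO, min_count: int = 1) -> Dict[str, int]:
    """Count identical reads; keep sequences seen at least min_count times (assumed filter)."""
    counts = Counter(read_fastq_sequences(handle))
    return {seq: n for seq, n in counts.items() if n >= min_count}


def write_fasta(index: Dict[str, int], out: TextIO) -> None:
    """Write one FASTA record per distinct sequence, most frequent first."""
    ordered = sorted(index.items(), key=lambda kv: (-kv[1], kv[0]))
    for rank, (seq, n) in enumerate(ordered, start=1):
        # Hypothetical header layout: running number plus read count.
        out.write(f">seq{rank}_x{n}\n{seq}\n")


# Tiny in-memory example: three reads, two of them identical.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n")
example_index = index_sequences(example)
```

Collapsing identical reads in this way keeps the per-sequence frequency, which the later expression and network steps depend on, while reducing the number of records passed to mapping and motif analysis.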

Reads were mapped to the Homo sapiens whole-genome reference sequence, GRCh38, using bowtie2 and/or bwa. Samtools was used to process the mapping data and generate .vcf files containing small RNA variants.
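
The exact commands, tool versions, and options used for mapping and variant calling are not given here. The sketch below shows one plausible sequence of command lines for the bowtie2 route, assembled in Python. It assumes FASTA input and a pre-built bowtie2 index, and it assumes that variant calling is done with `bcftools mpileup` and `bcftools call`, the companion tools to current Samtools releases. All file names and the function name are hypothetical.

```python
from typing import List


def build_mapping_commands(index_prefix: str, reads_fasta: str,
                           reference_fasta: str, out_prefix: str) -> List[List[str]]:
    """Return argument lists for mapping and variant calling (assumed workflow)."""
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.sorted.bam"
    pileup = f"{out_prefix}.pileup.bcf"
    vcf = f"{out_prefix}.vcf"
    return [
        # Align unpaired FASTA reads (-f) against a pre-built bowtie2 index.
        ["bowtie2", "-f", "-x", index_prefix, "-U", reads_fasta, "-S", sam],
        # Sort alignments by coordinate and index the sorted BAM.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Assumption: pileup and calling with bcftools.
        ["bcftools", "mpileup", "-f", reference_fasta, "-O", "u", "-o", pileup, bam],
        ["bcftools", "call", "-m", "-v", "-O", "v", "-o", vcf, pileup],
    ]


# Hypothetical paths; nothing is executed here, the commands are only assembled.
commands = build_mapping_commands("GRCh38_index", "compiled.fa", "GRCh38.fa", "sample")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed. Keeping the commands as data makes it straightforward to log them or to substitute bwa for the alignment step.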

Expression analysis was done by normalizing the frequency with which each sequence was encountered in each sample. Each normalized value for a sequence was plotted as a gradient-colored pixel. Clustering and heat map generation were done in R on a limited amount of data.
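
The normalization formula is not stated. A minimal sketch follows, assuming a reads-per-million style scaling in which each sequence count is divided by the total count of its sample. The function names and the `scale` parameter are hypothetical. The resulting matrix, with one row per sequence and one column per sample, is the kind of table a heat map routine would color pixel by pixel.

```python
from typing import Dict, List, Tuple


def normalize_counts(counts: Dict[str, int], scale: float = 1_000_000.0) -> Dict[str, float]:
    """Scale raw counts by the sample total (assumed reads-per-million normalization)."""
    total = sum(counts.values())
    if total == 0:
        return {seq: 0.0 for seq in counts}
    return {seq: n * scale / total for seq, n in counts.items()}


def expression_matrix(
    samples: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Build a sequence-by-sample matrix of normalized values; absent sequences get 0."""
    normalized = {name: normalize_counts(c) for name, c in samples.items()}
    sequences = sorted({seq for c in samples.values() for seq in c})
    names = sorted(samples)
    matrix = [[normalized[name].get(seq, 0.0) for name in names] for seq in sequences]
    return sequences, names, matrix


# Two toy samples with made-up counts, for illustration only.
toy_samples = {"A": {"ACGT": 3, "TTGA": 1}, "B": {"ACGT": 1}}
toy_sequences, toy_names, toy_matrix = expression_matrix(toy_samples)
```

Dividing by the sample total makes values comparable between samples of different sequencing depth, which is what allows a single color gradient to be shared across the whole heat map.
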
The HOCOMOCO database was used to search for known motifs, and DREME was used to detect novel motifs.
Network visualization was done in Gephi by combining all the results of the motif analysis with weighted values. Each weight is derived from the frequency with which a sequence occurred in a sample.
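
How the motif results were turned into a graph is not described in detail. The sketch below assumes a simple layout in which each motif is linked to every sequence it was found in, and the edge weight is that sequence's normalized frequency. It writes an edge table with `Source`, `Target`, and `Weight` columns, a CSV layout that Gephi can import. The function names and the example data are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

Edge = Tuple[str, str, float]


def motif_edges(motif_hits: Dict[str, List[str]], frequencies: Dict[str, float]) -> List[Edge]:
    """Link each motif to the sequences containing it, weighted by sequence frequency."""
    edges: List[Edge] = []
    for motif, sequences in sorted(motif_hits.items()):
        for seq in sorted(sequences):
            weight = frequencies.get(seq, 0.0)
            if weight > 0:  # skip sequences with no recorded frequency
                edges.append((motif, seq, weight))
    return edges


def write_edge_csv(edges: List[Edge], out: TextIO) -> None:
    """Write an edge list with Source, Target, and Weight columns."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Made-up motif hits and frequencies, for illustration only.
toy_hits = {"M1": ["ACGT", "TTGA"], "M2": ["ACGT"]}
toy_freq = {"ACGT": 0.75, "TTGA": 0.25}
toy_edges = motif_edges(toy_hits, toy_freq)
toy_csv = io.StringIO()
write_edge_csv(toy_edges, toy_csv)
```

With weights attached to the edges, layout and edge thickness in the visualization can reflect how abundant each motif-bearing sequence is.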

Fig. 1: Schematic diagram of the workflow, showing conversion of SRA files to FASTQ (FQ) files, which are then indexed. The indexed files are used for statistical analysis (Stat.) and for viewing the global expression profile (Exp.). The filtered file is used for viewing the global expression profile (Exp.), mapping, small RNA variants (Var.), motif analysis, and network visualization (Net.).