PD-APCn: Pattern-directed aligned pattern clustering of bio-sequences

Background

Identifying functional segments or regions in bio-sequences is a major challenge in bioinformatics. Functional segments of a bio-sequence can reveal folded structure, physicochemical functionality and mutation hotspots, leading to a better understanding of biological mechanisms, guiding the design of new drugs and yielding new knowledge about cures for genetic diseases. Despite the explosive growth of incoming data, effective, accurate and scalable methods are still lacking. Existing methods such as MEME and GLAM2 cannot capture frameshifts and rare mutations.

Description of the invention

University of Waterloo (UW) researchers have developed novel software that systematically aligns pattern clusters of bio-sequence families and thereby identifies functional regions. The software adaptively determines pattern width and mutation spots without exhaustive search and without explicit prior knowledge or clues. While it discovers new patterns with strong statistical support, it also detects rare mutated patterns with minor substitutions and frameshifts (insertions and deletions). This is important for personalized medicine, gene therapy and markers, and drug research.
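To make the distinction between substitution and frameshift variants concrete, the following minimal Python sketch labels a candidate segment relative to a reference pattern. It is an illustration only and not the PD-APCn algorithm: the function names (`hamming`, `levenshtein`, `classify_variant`) and the example sequences are hypothetical, and the one-edit threshold is an assumption chosen for simplicity.

```python
def hamming(a, b):
    """Count mismatched positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance allowing substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x from a
                curr[j - 1] + 1,          # insert y into a
                prev[j - 1] + (x != y),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def classify_variant(pattern, segment):
    """Label a segment relative to a pattern (hypothetical one-edit rule)."""
    if segment == pattern:
        return "exact"
    if len(segment) == len(pattern) and hamming(pattern, segment) == 1:
        return "substitution"
    if abs(len(segment) - len(pattern)) == 1 and levenshtein(pattern, segment) == 1:
        return "indel"
    return "other"


if __name__ == "__main__":
    reference = "CGHCH"  # made-up pattern, for illustration only
    for candidate in ["CGHCH", "CGYCH", "CGCH", "CGHACH", "AAAAA"]:
        print(candidate, classify_variant(reference, candidate))
```

A position-by-position comparison is enough to recognise a substitution, but an insertion or deletion shifts every later position, so an edit distance that allows gaps is needed to recognise it as a near match.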

Advantages

Allows variable pattern length
Capable of identifying mutations and rare mutations (Fig. 2)
Fast (400X faster than MEME), accurate, and precise in locating patterns
Does not need parameter pruning (unlike MEME and GLAM2)
No explicit prior knowledge needed
Compatible with hardware acceleration/multitasking

From the aligned pattern clusters (APCs) it discovers, the software can further disentangle the patterns within each APC to reveal subgroup characteristics in specific statistical or functional spaces, whether or not class labels are given.