Genetic Algorithms for Sequence Alignment

Ahmed I.
10 min readMar 17, 2021

--

Introduction

Multiple sequence alignment (MSA) has numerous applications, including protein analysis, comparative genomics, functional prediction of sequences, and generation of phylogenetic trees [1]. Several strategies have been developed to conduct MSA. These computational strategies generally have three approaches: exact, progressive, and iterative. The exact approach is, in essence, a generalization of pairwise sequence alignment (PSA) dynamic programming methods that can provide optimal solutions [2]–[4]. However, MSA becomes increasingly complex with a higher number, length, and diversity of sequences. Hence, exact strategies are limited to solving MSA with few sequences [5]. Progressive MSA (P-MSA) are greedy-heuristic algorithms built gradually according to a pre-set order sequence [6]. P-MSA is used in a collection of alignment tools, for instance: CLUSTAL [7], [8], MULTALIGN [9], and PCMA [10]. The most used of the latter is CLUSTAL. The first step is global PSA, and secondly, a guide tree is constructed based on a computed distance matrix. Neighbor-joining [11], UPGMA [12], or other clustering methods are used to build the initial guide tree.

Thus, pairs of sequences sharing the shortest branch length in the guide tree, being the most similar, are of utmost importance and undergo first PSA. A consensus alignment is gradually obtained after progressively performing PSA following the guide tree. This process remains more efficient than exact approaches. It offers solutions faster, but the major drawback is the possibility that P-MSA converges to a local minimum which often results in poorly accurate alignment [13]. The iterative approaches can overcome the shortcomings of P-MSA by performing realignment on sequences based on prescribed criteria. Moreover, the guide tree is reconstructed after every round of P-MSA. This methodology is utilized in tools like MUSCLE [14] and PRIME [15].

The deterministic approaches for MSA are complex procedures to solve, consequently, non-deterministic strategies have become readily more popular in the field of bioinformatics [16]. Among these approaches are evolutionary algorithms (EAs), which consist of a collection of randomized search and optimization methodologies guided by the mechanisms of natural selection, evolution, and genetics [17]. Ant colony algorithm [18], bee colony optimization [19], Tabu search [20] are types of EAs utilized for MSA and other applications. However, the most reported remains genetic algorithms (GAs) [21].

Fundamental Concepts of Genetic Algorithms

GAs are exploration algorithms based on the mechanisms of natural selection and genetics. The dataset consists of individuals grouped as a population. Phenotypes and genotypes generally represent the herein individuals. Based on the initial problem’s constraints, phenotypes are randomly generated candidate solutions that are ready to be optimized. In contrast, the genotypes give an encoded depiction of a candidate solution via chromosomes. A chromosome consists of arranged linear genes, and every gene can have many values called alleles. For instance, a chromosome can include a series of 0 and 1 (i.e., binary strings), where the value at a position is either on (i.e., an allele = 1) or off (i.e., an allele = 0). In other cases, various symbols and permutations of the alphabets are utilized to model chromosomes for more complex optimization/search problems (i.e., protein and DNA sequences). Every candidate solution at any generation has an associated “fitness” function that quantifies its adaptability to its local environment. Thus, this objective function plays the role of an adaptation criterion [22].

A conventional GA has the following components: initialization/encoding, selection, reproduction (crossover, mutation, interchanging, and reversing), replacement, and termination. Often, the initial pool of solutions is generated randomly to cover a broad range of viable solutions. However, the encoding rules must be set to initialize a population. The chromosomes can be encoding via value encoding (i.e., every chromosome is a string of values) or tree encoding (i.e., every chromosome represented by guide trees). Tree encoding is rarely used for GA, whereas value encoding is conventionally used for GAs. The encoded candidate solutions are then processed via a “fitness” function to rank the chromosomes against their adaptability degree. The “fittest” individuals are stochastically selected from the batch of solutions using a selection operator. Based on the selection operator, individuals will be chosen to undergo the reproduction phase. The convergence rate of GAs is highly dependent on the selection pressure criteria. In contrast, a higher selection pressure will lead to a faster convergence rate, though this may reduce population diversity [23].

Selection methods include random selection, rank selection, tournament selection, and more traditionally, the weighted roulette technique. Every individual of a population occupies a selected area with a probability proportional to their level of fitness. For each selection of an individual, a simple rotation of the wheel gives the chosen candidate. However, this selection operator selection is not flawless since it presents the risk of favoring an individual or a small group of individuals, which may impoverish the population’s diversity.

In the reproduction phase, “child” solutions are to be generated. A commonly used operator in this phase is the crossover operator. For starters, individuals of a population are randomly coupled in pairs representing the parents. Each pair of individuals undergoes crossover described as the following: crossover operates on two individuals’ genotype, being “parents” in this instance. It produces new individuals (usually two) called “children” whose genes are inherited from either parent. Thus, the crossover can be done by splitting each of the two chromosomes into fragments and recombining them to form new chromosomes. Other crossover strategies are available such as uniform (i.e., a crossover based on whether a gene is expressed or masked) and arithmetic crossovers (i.e., the chromosomes are subject to crossover based on a custom governing equation).

As for the mutator operator, the mutation operates on the genotype of a single individual. It corresponds, in nature, to an “error” that occurs when the chromosome is copied and reproduced. For example, in a numerical approach, with a binary string, to make an exchange between the “0” and the “1” for an allele. If exact copies are always guaranteed, then the rate of mutation is equal to zero. However, in real life, copying errors can occur in various circumstances under the influence of noise. The mutation changes the values of some genes with a low probability. It does not generally improve solutions, but it avoids an irreparable loss of diversity. The interchanging operator selects two given positions on a chromosome and interchanges their values. As for the reversing operator, values at random are chosen and swapped places with a neighboring value.

The replacement phase is a crucial step in GAs since it selects the individuals representing the next-generation among the “parent” and “child” solutions, where it is possible to conduct random replacement (i.e., candidate solution is randomly selected based on their “fitness” output among a pool containing both “parent” and “child” solutions), weak replacement (i.e., “parent” solutions with a lower “fitness” value are replaced by “child” solutions with higher “fitness” value) and direct parent replacement (“parent” solutions are replaced by “child solutions). The cycle will be looped until the termination criteria are met; for instance: the maximum number of generations, elapsed computational time, the candidate solutions converge to a local minimum, or stall generations are obtained [24].

Genetic Algorithms for Multiple Sequence Alignment

Several researchers have applied GAs to solve MSA problems, such as the SAGA tool [25], the global criterion for sequence alignment (GLOCSA) [26], the vertical decomposition with genetic algorithm (VDGA) [27], the evolving consensus sequences approach with a genetic algorithm [28] and more. These MSA-GAs strategies utilize a variety of operators but remain single optimization strategies. Instead, recent development in the niche of MSA-GAs research treats MSA as a multi-objective optimization strategy. Hence, “fitness” is interpreted for candidate solutions via multiple criteria (i.e., maximum alignment score, minimum gap length, alignment length constraints), not solely maximum alignment score, for instance, and these criteria are optimized simultaneously. The most used multi-objective MSA-GAs is parallel niche Pareto AlineaGA. This hybrid multi-objective optimization workflow converges to a set of nondominated solutions, referred to as a Pareto set of solutions where no solutions can be improved without worsening a criterion. [29].

Other hybrid pipelines of GAs have been reported utilizing quantum computing applied to MSA problems. The research group observed an efficient reduction of population size and iterations required to reach an optimal individual than the strategies above [30]. However, their strategy was not a multi-objective optimization which could have improved their workflow. On the other hand, a quantum-computing evolutionary algorithm utilizing multi-objective optimization has been reported for decision-making applications, but a similar has yet to be reported for MSA problems [31]. In retrospect, similar hybridization and parallelization approaches will continue to emerge as the complexity/size of biological datasets increases and the technological tools increase in efficiency, accessibility, and higher throughput.

References

[1] J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs.,” Nucleic Acids Res., vol. 27, no. 13, pp. 2682–2690, 1999.

[2] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Mol. Biol., vol. 48, no. 3, pp. 443– 453, 1970.

[3] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Mol. Biol., vol. 147, pp. 195–197, 1981.

[4] M. A. Liebert and A. A. Schäffer, “Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment,” vol. 2, no. 3, pp. 459–472, 1995.

[5] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, “A tool for multiple sequence alignment.,” Proc. Natl. Acad. Sci. U. S. A., vol. 86, no. 12, pp. 4412–5, 1989.

[6] P. Hogeweg and B. Hesper, “The alignment of sets of sequences and the construction of phyletic trees: An integrated method,” J. Mol. Evol., vol. 20, no. 2, pp. 175–186, 1984.

[7] D. G. Higgins and P. M. Sharp, “CLUSTAL : a package for performing multiple sequence alignment on a microcomputer,” J. Mol. Evol., vol. 73, pp. 237–244, 1988.

[8] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Res. Acids Res, vol. 22, no. 22, pp. 4673– 4680, 1994.

[9] R. Wiaderldewicz and A. Ruiz-Carrillo, “Nucleic Acids Research,” Nucleic Acids Res. vol. 15, no. 19, pp. 7831–7848, 1987.

[10] J. Pei, R. Sadreyev, and N. V. Grishin, “PCMA: Fast and accurate multiple sequence alignment based on profile consistency,” Bioinformatics, vol. 19, no. 3, pp. 427–428, 2003.

[11] N. Saitou and M. Nei, “The neighbor-joining method : a new method for sequences,” J. Mol. Biol., vol. 16, no. 4, pp. 111–120, 1987.

[12] C. Classic and H. E. Bee, “Numerical taxonomy: the principles and practice of numerical classification,” I. Gen. Microbiol, LS, vol. 30, pp. 15,1988.

[13] J. D. Thompson, F. Plewniak, and O. Poch, “BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs,” Bioinformatics, vol. 15, no. 1, pp. 87–88, 1999.

[14] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res., vol. 32, no. 5, pp. 1792–1797, 2004.

[15] S. Yamada, O. Gotoh, and H. Yamana, “Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost,” BMC Bioinformatics, vol. 7, pp. 1–17, 2006.

[16] B. a Shapiro, J. C. Wu, D. Bengali, and M. J. Potts, “The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation.,” Bioinformatics, vol. 17, no. 2, pp. 137–148, 2001.

[17] P. A. Vikhar, “Evolutionary Algorithms : A Critical Review and its Future Prospects,” 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication Evolutionary, pp. 261–265, 2016.

[18] L. Ca, M. Pbzr, and T. O. Si, “THE STUDY ON THE ELECTRIC FIELD EFFECT IN THE,” Surface Review and Letters, vol. 14, no. 4, pp. 617–621, 2007.

[19] Xiujuan Lei, Jingjing Sun, Xiaojun Xu, and Ling Guo, “Artificial bee colony algorithm for solving multiple sequence alignment,” 2010 IEEE Fifth Int. Conf. Bio-Inspired Comput. Theor. Appl., pp. 337–342, 2010.

[20] T. Riaz, Y. Wang, and K. Li, “Multiple Sequence Alignment Using Tabu Search,” 2nd AsiaPacific Bioinforma. Conf., vol. 29, pp. 1–10, 2004.

[21] K. Engineering, C. Lai, C. Wu, and C. Ho, “Using genetic algorithm to solve multiple sequence alignment problem,” International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 6, pp. 871–888, 2009.

[22] Z. NARIMANI, H. BEIGY, and H. ABOLHASSANI, “A New Genetic Algorithm for Multiple Sequence Alignment,” Int. J. Comput. Intell. Appl., vol. 11, no. 04, p. 1250023, 2012.

[23] A. J. Radenbaugh, “Applications of genetic algorithms in bioinformatics,” SJSU ScholarWorks, pp. 30–38, 2008.

[24] B. Aguié, P. Bertrand, D. Moustapha, S. Etienne, O. Souleymane, and A. B. Matthieu, “A Genetic Algorithm for Overall Designing and Planning of a Long Term Evolution Advanced Network,” American Journal of Operations Research, vol. 6, July, pp. 355–370, 2016.

[25] C. Notredame and D. G. Higgins, “SAGA: Sequence alignment by genetic algorithm,” Nucleic Acids Res., vol. 24, no. 8, pp. 1515–1524, 1996.

[26] E. D. Arenas-Díaz, H. Ochoterena, and K. Rodríguez-Vázquez, “Multiple Sequence Alignment Using a Genetic Algorithm and GLOCSA,” J. Artif. Evol. Appl., vol. 2009, pp. 1– 10, 2009.

[27] F. Naznin, R. Sarker, and D. Essam, “DGA: Decomposition with genetic algorithm for multiple sequence alignment,” Comput. Intell. Bioinforma. Comput. Biol. (CIBCB), 2010 IEEE Symp., pp. 1–8, 2010.

[28] C. Shyu and J. A. Foster, “Evolving Consensus Sequence for Multiple Sequence Alignment with a Genetic Algorithm,” Initiative for Bioinformatics and Evolutionary Studies, pp. 2313– 2324, 2003.

[29] J. A. G, M. Silva, and J. M. S, “Parallel Niche Pareto AlineaGA — an Evolutionary

Multi-objective approach on Multiple Sequence Alignment 1 Introduction,” vol. 8, no. 3, pp. 1–16, 2011.

[30] L. Abdesslem, M. Soham, B. Mohamed, and P. Group, “Multiple Sequence Alignment by Quantum Genetic Algorithm,” Quantum, 2006.

[31] J. Balicki, “Quantum-inspired Multi-objective Evolutionary Algorithms for Decision Making : Analyzing the State-Of-The-Art,” Advances in Applied and Pure Mathematics, pp. 383–389, 2011.

--

--