Pairwise sequence alignment

Introduction

Sequence alignment, which involves aligning two sequences and measuring matches and mismatches of characters, is a way to quantify the similarity between two sequences. The alignment is the procedure of finding two extended sequences with the same length from the two sequences, where an extended sequence is generated by inserting null characters ‘-‘ into the original sequences. A pair of two extended sequences with the same length is an alignment of the two sequences. An alignment score is calculated for an alignment by assigning scores for matches, mismatches, and gaps. Sequence alignment is the procedure of finding the optimal alignment with the highest alignment score.

Method

Pseudo code

  • Input: two sequences
  • Output: alignment score matrix
  1. Retrieve length of the sequence (m, n)
  2. Create an empty alignment score matrix (m x n)
  3. Calculate the alignment score at the first row
  4. Calculate the alignment score at the first column
  5. Calculate alignment score row by row
  6. Print the result

Implementation with Python code

The path graph score table can be calculated in numerous ways. For educational purpose, a simple and intuitive implementation code is is introduced here.

1. Definition of parameters and substitution score matrix

Substitution score is defined to calculate the alignment score of matches and mismatches. Here, an identity matrix is used for the substitution matrix.

# substitution matrix
submat = {('A', 'A'): 1, ('A', 'T'): 0, ('A', 'G'): 0, ('A', 'C'): 0,
          ('T', 'A'): 0, ('T', 'T'): 1, ('T', 'G'): 0, ('T', 'C'): 0,
          ('G', 'A'): 0, ('G', 'T'): 0, ('G', 'G'): 1, ('G', 'C'): 0,
          ('C', 'A'): 0, ('C', 'T'): 0, ('C', 'G'): 0, ('C', 'C'): 1
         }
gap_score = -1

2. Set input sequences & retrieve length of the sequences

# Set input sequences 
s = "ACGC"
t = "GACTAC"

# Retrieve the lengths of the sequences 
m = len(s)
n = len(t)

3. Create an empty aligment score matrix

mat = []   # alginment score matrix
for i in range(m+1):
    mat.append([0] * (n+1))

4. Fill in the first row and the first column

The first row and the first column represent extended sequences started with gaps in the first and the sequence sequences. The alignment score of the other nodes depends on these scores, so they were calculated first.

# first row
for i in range(n):
    mat[0][i+1] = mat[0][i] + gap_score

# first column
for i in range(m):
    mat[i+1][0] = mat[i][0] + gap_score

5. Fill in the alignment score matrix

for i in range(m):
    for j in range(n):
        mx = max(mat[i][j] + submat[(s[i], t[j])], 
                 mat[i][j+1] + gap_score, 
                 mat[i+1][j] + gap_score)
        mat[i+1][j+1] = mx

6. Print the results

# Print the result
print(mat)

# Print the result with better representation 
import pandas as pd
mat = pd.DataFrame(mat, index= ['-'] + list(s), columns = ['-'] + list(t))
print(mat)

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top