Activity for Phylogenetic Tree - Biomedical Data Science Laboratory

UPGMA는 종, 유전자 서열 등의 관계를 계통수로 표현하는 방법이다. 이 activity에서는 이 방법을 Python code로 전환하는 방법에 대해 생각해본다. 그리고, BioPython library의 UPGMA code를 이용하여 UPGMA, NJ tree를 그리는 과정을 실습한다.

1. UPGMA

Unweighted Pair Group Method with Arithmetic Means (UPGMA)는 서열 간 distance를 이용해 phylogenetic tree를 구성하는 방법이다.

Algorithm

서열 group 쌍의 거리(distance)를 계산한다.
가장 가까운 서열 group 쌍을 찾는다.
가장 가까운 두 서열 group을 합친다.
남은 서열 group이 2개 이상이면 (2) 부터 다시 시작한다.

(합쳐지는 대상이 2개 이상의 서열을 가질 수 있어 서열이라 하지 않고 서열 group이라 언급함)

Pseudo-code

UPGMA 알고리즘은 (1) 가까운 pair를 찾는 과정과 (2) 둘을 합쳐 하나로 만드는 과정으로 생각할 수 있다. 이 과정에서 index를 관리하는 방법이 복잡하기 때문에 전체 code를 구성하기 보다는 문제를 작게 나눠서 해결하는 것이 바람직할 것이다.

서열 쌍의 distance matrix를 계산하고, 그 값을 dmat에 입력한다.

UPGMA(dmat)
    각 서열을 지닌 group을 만든다. (index_group)
    WHILE index_group의 길이가 2 이상일 경우
        가장 가까운 두 group을 찾는다. (find_closest_pair)
        두 group을 합친다. (merge_pair)

find_closest_pair(dmat, index_group):
    모든 pair 중 가장 가까운 pair를 찾는다.

merge_pair(index_group, closest_pair):
    closest_pair를 합친다.

Python Code

UPGMA 알고리즘 자체는 간단하지만, program 자체는 약간 복잡하다. 아래 code를 JupyterNotebook에서 실행해보고, code가 어떻게 작동하는지 분석해본다.

실습을 위해 JupyterLab에서 terminal 창을 열고, 아래 명령어를 이용해 작업 디렉토리를 만들고, 해당 directory에 JupyterNotebook file (phylogenetic_tree.ipynb)을 생성하고 실습을 진행한다.

# 홈 디렉토리로 이동
cd ~

# 작업 directory 생성
mkdir PhylogeneticTree

Main code (전체 실행 code)

UPGMA 메인 코드는 아래와 같다.

from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio import AlignIO
import Bio.Phylo as Phylo
from copy import deepcopy
import numpy as np

def UPGMA(dmat):
    n = len(dmat)
    index_group = []
    for i in range(n):
        index_group.append([i])    
    while len(index_group) > 1:
        closest_pair = find_closest_pair(dmat, index_group)
        index_group = merge_pair(index_group, closest_pair)
        print(len(index_group), index_group)

이 코드에서는 가장 가까운 서열 (혹은 서열 group)을 찾는 과정은 find_closest_pair, 두 서열 group을 합치는 과정은 merge_pair 함수로 분리해서 진행한다.

가장 가까운 pair 찾기 & 두 pair 합치기

두 함수는 아래와 같이 구현할 수 있다. 그런데, 서열 group의 거리 계산 과정은 두 서열 group에 있는 원소들 간의 모든 거리를 계산해야 한다. 이 과정은 calc_dist_group이라는 함수로 분리해서 진행한다.

def find_closest_pair(dmat, index_group):
    m = len(index_group)
    closest_pair = []
    closest_dist = 1.0
    for i in range(m):
        for j in range(i):
            d = calc_dist_group(dmat, index_group[i], index_group[j])
            if d < closest_dist:
                closest_pair = [i, j]
                closest_dist = d
    return closest_pair

def merge_pair(index_group, closest_pair):
    new_index_group = []
    m = len(index_group)
    for i in range(m):
        if i not in closest_pair:
            new_index_group.append(index_group[i])
    new_index_group.append([index_group[closest_pair[0]], index_group[closest_pair[1]]])
    return new_index_group

거리 계산 함수

calc_dist_group 함수는 서열 group간 거리 계산을 위한 함수로 아래와 같이 구성할 수 있다.

def flatten_list(nested_list):
    flat_list = []
    for e in nested_list:
        if isinstance(e, list):
            flat_list.extend(flatten_list(e))
        else:
            flat_list.append(e)
    return flat_list

def calc_dist_group(dmat, group1, group2):
    distances = []
    for e1 in flatten_list(group1):
        for e2 in flatten_list(group2):
            if e1 > e2: 
                distances.append(dmat[e1][e2])
            else:
                distances.append(dmat[e2][e1])
    return np.average(distances)

UPGMA tree 예제

이제 UPGMA tree 구성을 위한 과정이 준비되었다. 예제 파일을 이용해 UPGMA tree를 만들어 보자. 먼저 아래 파일을 받고, 실습하는 JupyterNotebook이 있는 directory에 복사한다.

amylase_aln Download

예제 파일은 MSA 결과로 여러 종류의 amylase의 서열을 포함하고 있다. 이 서열들의 관계를 phylogenetic tree를 이용해 표현한다.

aln = AlignIO.read(open('amylase_aln.txt'), 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
dmat = dm.matrix

UPGMA(dmat)

BioPython을 활용한 UPGMA tree 생성

UPGMA tree 만들기

BioPython을 이용해 UPGMA tree를 생성하기 위해서는 먼저 tree를 생성하는 constructor를 만들어 주고, 이를 이용해 upgma 함수를 실행시킨다.

constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)

UPGMA tree 출력하기

UPGMA tree의 구조는 아래와 같이 출력할 수 있다.

print(upgma_tree)

UPGMA dendrogram 생성

또한, draw 명령어를 이용해 tree를 그릴 수 있다.

Phylo.draw(upgma_tree, do_show=False)

2. Neighbor-Joining

Neighbor-Joining algorithm의 구현을 위해서는 Q-matrix를 계산하는 중간 과정이 필요하다. 도전하고 싶으신 분들만 coding에 도전해보자. 여기서는 BioPython을 이용해 tree를 생성하고, 그 결과를UPGMA 방법의 결과와 비교한다.

BioPython을 활용한 UPGMA tree 생성

UPGMA와 같이 constructor를 이용해 tree를 생성한다. 다만, nj 함수를 이용한다.

nj_tree = constructor.nj(dm)
Phylo.draw(nj_tree, do_show=False)

NJ tree는 기본적으로 root가 없는 tree이다. 따라서 dendrogram에서 root에 3개의 branch가 있는 것처럼 표현되는데, 이는 root는 아니고, 가장 마지막에 합쳐진 node들이라고 할 수 있다. UPGMA와 또 다른 특징은 조상 node를 공유하는 두 branch의 길이가 서로 다르다는 점이다.

Rooted tree 만들기

참고로, unrooted tree를 rooted tree로 전환할 수도 있다. 아래는 root_at_midpoint 방법을 이용한 예시이다.

nj_tree_rooted = deepcopy(nj_tree)
nj_tree_rooted.root_at_midpoint()
Phylo.draw(nj_tree_rooted, do_show=False)

1. UPGMA

Algorithm

Pseudo-code

Python Code

Main code (전체 실행 code)

가장 가까운 pair 찾기 & 두 pair 합치기

거리 계산 함수

UPGMA tree 예제

BioPython을 활용한 UPGMA tree 생성

UPGMA tree 만들기

UPGMA tree 출력하기

UPGMA dendrogram 생성

2. Neighbor-Joining

BioPython을 활용한 UPGMA tree 생성

Rooted tree 만들기

Leave a Comment Cancel Reply