02-LiNGAM algorithm

지난 글에서 LiNGAM 알고리즘을 적용하기 위해 필요한 기본적인 가정들에 대해 알아보았다. 이제 구체적인 알고리즘을 알아보도록 하자.

LiNGAM Discovery Algorithm

Given an $m \times n (m<n)$ data matrix $\mathbf{X}$, where each column contains one sample vector $\mathbf{x}$, first subtract the mean from each row of $\mathbf{X}$, then apply an ICA algorithm to obtain a decomposition $\mathbf{X}=\mathbf{A}\mathbf{S}$ where $\mathbf{S}$ has the same size as $\mathbf{X}$ and contains in its rows the independent components. From here on, we will exclusively work with $\mathbf{W}=\mathbf{A}^{-1}$.

$\mathbf{X}$를 centering한 뒤에 ICA를 적용한다고 되어 있다. 그렇다면 $\mathbf{S}$ 행렬은 independent components로 구성되어 있을 것이다. 이 논문에서는 FastICA 알고리즘을 사용한다고 했다. 이 알고리즘은 disturbance variable의 분포에 구애받지 않아 non-Gaussian distribution의 구체적인 함수꼴을 알 필요가 없다는 점이 장점이라고 한다.
Find the one and only permutation of rows of $\mathbf{W}$ which yields a matrix $\widetilde{\mathbf{W}}$ without any zeros on the main diagonal. In practice, small estimation errors will cause all elements of $\mathbf{W}$ to be non-zero, and hence the permutation is sought which minimizes $\sum_i{1/\vert\widetilde{\mathbf{W}}_{ii}\vert}$.

diagonal element가 non-zero인 $\widetilde{\mathbf{W}}$를 찾는 딱 하나의 row permutation을 찾는다. ICA가 정확하다는 가정 하에서는 딱 한번의 row permutation만으로 diagonal elements들이 non-zero인 행렬을 얻을 수 있기 때문이라고 한다. 이 부분에 대한 수학적 증명은 Appendix에서 다루고 있다. 나중에 한번 또 구체적으로 들여다 봐야곘다.

하지만 실제로 ICA는 정확하지 않기 때문에 0이어야 할 부분이 0이 아닐 수도 있다. 그래서 자칫 잘못 permutation이 될 수가 있어 실제로 찾을 때에는 $\sum_i{1/\vert\widetilde{\mathbf{W}}_{ii}\vert}$ 을 최소화하는 방향으로 찾는다고 한다.

이걸 최소화한다는 것은 결국 $\sum_i{\vert \widetilde{\mathbf{W}}_{ii} \vert}$ 을 최대화한다는 것과 같은 말이고, 그니까 가장 non-zero일 것 같은 애로 바꿔준다는 뜻이다. 이 부분에 대한 알고리즘은 Section 5에서 다룬다고 한다.
Divide each row of $\widetilde{\mathbf{W}}$ by its corresponding diagonal element, to yield a new matrix $\widetilde{\mathbf{W}’}$ with ones on the diagonal.

이렇게 되면
\[\widetilde{\mathbf{W'}}= \begin{bmatrix} 1 & w_{12}/w_{11} & \cdots & w_{1m}/w_{11} \\ w_{21}/w_{22} & 1 & \cdots & w_{2m}/w_{22} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1}/w_{mm} & w_{m2}/w_{mm} & \cdots & 1 \end{bmatrix}\]
이 될 것이다.
Compute an estimate $\widehat{\mathbf{B}}$ of $\mathbf{B}$ using $\widehat{\mathbf{B}}=\mathbf{I}-\widetilde{\mathbf{W’}}$.

얻은 결과들을 통해 $\widehat{\mathbf{B}}$를 추정한다.
Finally, to find a causal order, find the permutation matrix $\mathbf{P}$ (applied equally to both rows and columns) of $\widehat{\mathbf{B}}$ which yields a matrix $\widetilde{\mathbf{B}}=\mathbf{P}\widehat{\mathbf{B}}\mathbf{P}^T$ which is as close as possible to strictly lower triangular. This can be measured for instance using $\sum_{i \leq j}{\widetilde{\mathbf{B}}_{ij}^2}$.

$\widehat{\mathbf{B}}$를 lower triangular matrix $\widetilde{\mathbf{B}}$로 만들어주는 permutation matrix $\mathbf{P}$를 찾는데, 이 때 식으로는 $\sum_{i \leq j}{\widetilde{\mathbf{B}}_{ij}^2}$를 활용한다. 나는 인과순서를 정하는 게 정말 중요한 부분일 거라고 생각해서 나름 기대했는데 lower triangular matrix에 최대한 가깝게 만드는 것이 곧 순서라고 보는 로직이다. 다먄 왜 여기서 row와 column에 똑같은 permutation을 해줘야 하는지 이해가 잘 가지 않는다. 이 부분도 Appendix에서 다루어지는 것 같다.

다만, ICA는 최적화하는 과정에서 local minima에 도달할 위험이 있다. 초기값에 따라 $\mathbf{W}$의 추정치가 달라질 수도 있다는 얘기다. 논문에서 직접 실험을 해본 결과, model 가정이 잘 맞을 때에는 초기값과 상관없이 안정적으로 해가 나오는 반면 model 가정이 성립하지 않을 때에는 불안정하게 값이 나온다고 되어 있다.

Permutation Algorithms for Large Dimensions

Permuting the Rows of $\mathbf{W}$

위에서 말한 Section 5가 여기다. 먼약 $\mathbf{W}$의 차원이 그렇게 크지 않다면 전역탐색으로 충분히 가능하지만, 고차원일 경우에는 이러한 방식이 시간이 상당히 오래 걸린다는 문제점이 있다. 이 섹션에서는 $\mathbf{W}$가 고차원일 때 어떠한 알고리즘으로 permutation을 진행하는지 설명하고 있다.

논문에서는 이 문제가 linear ssignment problem과 같다는 점을 설명하고, 이를 해결하기 위한 알고리즘은 수도 없이 많이 존재하다고 말하고 있다. 가장 대표적인 알고리즘으로는 헝가리안 알고리즘이 있다.

Permuting $\mathbf{B}$ to Get a Causal Order

먼저 $\widehat{\mathbf{B}}$가 아주 잘 추정되었을 때의 알고리즘은 아래와 같다.

Initialize the permutation $p$ to be an empty set.
Repeat until $\widehat{\mathbf{B}}$ contains no more elements.

a. Find a row $i$ of $\widehat{\mathbf{B}}$ containing all zeros, if not possible return False.

b. Append $i$ to the end of the list $p$.

c. Remove the $i$-th row and the $i$-th column of $\widehat{\mathbf{B}}$.
Return True and the found permutation $p$.

쉽게 예시를 들어 생각해보자.

\[\widehat{\mathbf{B}} = \left[ \begin{matrix} 0 & 1 & 0 \newline 0 & 0 & 0 \newline 2 & 3 & 0 \end{matrix} \right]\]

먼저 2-a에 따라 2번째 row가 $p$의 첫번째 원소로 들어갈 것이다.

\[p = \left[\left[0,1,0\right]\right]\]

그리고 2번째 row와 column이 지워진다.

\[\widehat{\mathbf{B}} = \left[ \begin{matrix} 0 & 0 \newline 2 & 0 \end{matrix} \right]\]

그러면 이번에는 1번째 row가 전부 0이게 된다.

\[p = \left[\left[0,1,0\right],\left[1,0,0\right]\right]\]

그리고 자연히 마지막으로 남은 row가 $p$의 마지막 원소로 들어간다.

\[p = \left[\left[0,1,0\right],\left[1,0,0\right],\left[0,0,1\right]\right]\]

즉,

\[\mathbf{P} = \left[ \begin{matrix} 0 & 1 & 0 \newline 1 & 0 & 0 \newline 0 & 0 & 1 \end{matrix} \right]\] \[\widetilde{\mathbf{B}}=\mathbf{P}\widehat{\mathbf{B}}\mathbf{P}^T=\left[ \begin{matrix} 0 & 0 & 0 \newline 1 & 0 & 0 \newline 3 & 2 & 0 \end{matrix} \right]\]

으로 strictly lower triangular matrix가 된다. 이거 맞나?

하지만 실제로 $\widehat{\mathbf{B}}$가 이렇게 예쁘게 추정될 일은 많이 없을 것이다. 그래서 대안적 방법으로 논문에서는 아래와 같은 알고리즘을 제안한다.

Set the $m(m+1)/2$ smallest (in absolute value) elements of $\widehat{\mathbf{B}}$ to zero.
Repeat

a. Test if $\widehat{\mathbf{B}}$ can be permuted to strict lower triangularity (using Algorithm B above). If the answer is yes, stop and return the permuted $\widehat{\mathbf{B}}$, that is $\widetilde{\mathbf{B}}$.

b. Additionally set the next smallest (in absolute value) element of $\widehat{\mathbf{B}}$ to zero.

논문에서는 이를 가자치기 (pruning)으로 표현하고 있다. ideal한 방법은 아니지만 꽤 괜찮다고 논문에서는 말하고 있다.

다음 장부터는 통계적인 검정방법에 대해서 소개하고 있다. Wald Test, Chi-Square Test 등에 대해서 다루고 있는데 이건 다음 글에서 다루도록 하겠다.

Reference

PREVIOUS01-LiNGAM algorithm

NEXT02-Hydejack에서 TeXt로 넘어오기까지