Parallelize mex file in Matlab

By default mex file cannot be executed in parallel in Matlab, which will be a major bottleneck for Matlab program. Fortunately, there is a way to execute loop operations in parallel -- parfor. Matlab provides a good tutorial on parfor: Getting Started with parfor.

One tricky problem with parfor is dealing with reduction assignments, which means the value of some variables are updated by each iteration, such as assignment to an array/matrix indexed by loop index.  My suggestion is to separate the loop into two parts: one part deal with non-reduction assignment with parfor, and the other part do the reduction assignment with traditional for loop.

If you cannot save all the parfor-assigned variables in memory, then you must save them into mat file and load them in the next for-loop. In the parfor-loop, command "save" cannot  be used directly because it violates the transparency(God knows what does it mean!). As an alternative, you can create a wrapper function of save and call that function instead. FOr example:

Save the following as "parsave.m":
function parsave(fname, x,y)
save(fname, 'x', 'y')

Then run it with:
parfor ii = 1:4
x = rand(10,10);
y = ones(1,3);
parsave(sprintf('output%d.mat', ii), x, y);


  1. parfor manual:
  2. Getting Started with parfor:
  3. How do I use SAVE with a PARFOR loop:

Notes of Advanced Machine Learning(I)

  1. Two different ways represent a distribution over several random variables: (1) product of conditional probabilities: \(p(x_1,x_2,x_3,x_4)=p(x_4)p(x_3|x_4)p(x_2|x_3,x_4)p(x_1|x_2,x_3,x_4)\) and (2) global energy function: \(p(x_1,x_2,x_3,x_4)=\frac{1}{Z}e^{-E(x_1,x_2,x_3,x_4)}\), where \(Z\) is the partition function.
  2. Directed graphical models use conditional probabilities, which undirected graphical models use energy functions that are a sum of several terms. Deep belief net(DBN) is a hybrid model.
    1. Probabilistic Model

    Two different ways represent a distribution over several random variables:

    • product of conditional probabilities: p(x1,x2,x3,x4)=p(x4)p(x3|x4)p(x2|x3,x4)p(x1|x2,x3,x4)

    • global energy function:


    where Zis the partition function.

    Directed graphical models use conditional probabilities(Bayesian networks), while undirected graphical models(Markov random fields, Boltzmann machines) use energy functions that are a sum of several terms. Deep belief net(DBN) is a hybrid model.

    Directed Graphs

    Directed graphs are useful for expressing causal relationships between random variables.

    • The joint distribution defined by the graph is given by the product of a conditional distribution for each node conditioned on its parents.

    • For example, the joint distribution over x1,,x7 factorizes:


    Markov Random Fields


    • Each potential function is a mapping from joint configurations of random variables in a clique to non-negative real numbers.

    • The choice of potential functions is not restricted to having specific probabilistic interpretations.

    • Potential functions are often represented as exponentials:

    p(x)=1Zcc(xc)=1Z(-cE(xc))=1Z(-E(x)) (Boltzmann distribution)

    • Computing Z is very hard, which represents a major limitation of undirected models.


    1. Singular Value Decomposition

    Singular Value Decomposition(SVD) is a factorization of a real or complex matrix. Formally, the singular value decomposition of an mn matrix M is a factorization of the form


    where U is a mm unitary matrix,  is an mn rectangular diagonal matrix with nonnegative real numbers on the diagonal, and V*(the conjugate transpose of V: (V*)ij=Vji, for real matrix, it equals the transpose) is an nn unitary matrix.

    A complex square matrix U is unitary if U*U=UU*=I.

    The diagonal entries ij of  are known as the singular values of M, which means they are the square roots of the eigenvalues of matrix MM*. The m columns of U and n columns of V are called the left-singular vectors and right-singular vectors of M, respectively.

    The SVD and the eigendecomposition are closely related:

    • The left-singular vectors of M(columns of U) are eigenvectors of MM*.

    • The right-singular vectors of M(columns of V) are eigenvectors of M*M.

    • The non-zero singular values of M(diagonal entries of ) are the square roots of the non-zero eigenvalues of both M*M and MM*.


  1. U Toronto CSC2535:

"undefined reference" error during installing OpenCV

Many people(including me) met the infamous "undefined reference" error during installing OpenCV, such as

../../lib/ undefined reference to `avpriv_vorbis_parse_extradata’
../../lib/ undefined reference to `av_des_init’
../../lib/ undefined reference to `av_rc4_crypt’
../../lib/ undefined reference to `av_aes_crypt’
../../lib/ undefined reference to `av_des_mac’
../../lib/ undefined reference to `av_tree_destroy’
../../lib/ undefined reference to `av_sha_update’
collect2: ld returned 1 exit status
make[2]: *** [bin/opencv_perf_imgproc] Error 1
make[1]: *** [modules/imgproc/CMakeFiles/opencv_perf_imgproc.dir/all] Error 2
make: *** [all] Error 2

This error was caused by lacking of or improper installation of ffmpeg. If you are sure the ffmpeg is successfully installed, and you can find "– FFMPEG: YES" after cmake opencv, then you'd better think about adding option "--enable-shared " when configure ffmpeg. After re-installing ffmpeg, then this "undefined reference" error should disappear.

PS: A very good guideline of installing OpenCV on Ubuntu: 

A Comprehensive Guide to Installing and Configuring OpenCV 2.4.2 on Ubuntu

PPS: Damn Ubuntu!!!

Build Dense Trajectory Codes in Ubuntu

Even when the OpenCV and ffmpeg have been successfully installed, you still may meet the error of "undefined reference to main" when building the Dense Trajectory code. Since the author didn't provide a solution, I share my workaround here -- just use command g++ directly:

Matlab Startup Script

You can create a script named startup.m under userpath(in Windows and Mac OSX, default path is Documents/MATLAB). You can add whatever Matlab commands you want into startup.m. For example, to add /usr/local/bin into PATH, you can add command

setenv('PATH', [getenv('PATH') ':/usr/local/bin']);

into startup.m.

BTW, every time Matlab starts, configure file $MATLABROOT/toolbox/local/matlabrc.m will be executed.

Counting Unique k-mers -- My First Go Program

Previously, I wrote a Perl program to count the number of unique k-mers. It is very convenient to implement it in Perl, because Perl supports hash-of-hashes(which could dynamically count distinct k-mers) and sorting hashes by value(which can easily list the occurrences of unique k-mers in descending order).

Unfortunately, comparing to C++, Perl program usually costs more memory and longer time. This is not a big issue when the FASTA files are not large(smaller than 400MB). However, when handling larger FASTA files, my Perl program would require more than 8GB memory, which exceeds the limits of my computer. Therefore, I decided to rewrite the program by Go, a system language that supports concurrent features and can has C/C++-comparable speed.

The syntax of Go is very different from C/C++, which takes me a whole day familiarize. Some operations that can be implemented easily in Perl can only be done in complicated way, such as calling system command, reading file line-by-line, writing file and sorting map by value. Since this small tool doesn't involve network and concurrency, I cannot compare the difficulty of writing Go program to implement such work.

Although the experience of the first Go program is not happy, I still think it is necessary to continue learning Go language, because Go program really runs very fast and saves memory(comparing to Perl script). And you don't need to worry about the tricky part of C++.

Linear Space Sequence Alignment

For explanation to linear space sequence alignment, please refer to

The algorithm and equation I used was from the textbook Algorithm Design by Jon Kleinberg and Éva Tardos(2005):

Implementation of Strassen's Algorithm for Matrix Multiplication

Strassen's algorithm is not the most efficient algorithm for matrix multiplication, but it was the first algorithm that was theoretically faster than the naive algorithm. There is very good explanation and implementation of Strassen's algorithm on Wikipedia.

However, the implementation of Strassen's algorithm cannot be used directly, because it  just sets the base case of the divide-and-conquer to be 1x1 matrix, which would consume huge time cost for iteration. If set the base case to 2x2 matrix, which means 2x2 matrix and 1x1 matrix will be multiplied by naive algorithm, then the Strassen's algorithm will be more efficient for matrices larger than 512x512.

When set base case to 2x2 matrix, then the Strassen's algorithm will surpass naive algorithm for matrices larger than 512x512.

When set base case to 2x2 matrix, then the Strassen's algorithm will surpass naive algorithm for matrices larger than 512x512.

When set base case to 6x6 matrix, then the Strassen's algorithm will surpass naive algorithm for matrices larger than 128x128.

When set base case to 6x6 matrix, then the Strassen's algorithm will surpass naive algorithm for matrices larger than 128x128.

// -- Implementation of matrix multiplication with
// Strassen's algorithm.
// Compile this file with gcc command:
// g++ -Wall -o matrix_mult 

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <ctype.h>
#include <unistd.h>
#include <iostream>
#include <fstream>
#include <cmath>
#include <cstring>

using namespace std;

// This function allocates the matrix
inline double** allocate_matrix(int n)
double** mat=new double*[n];
for(int i=0;i<n;++i)
mat[i]=new double[n];

return (mat); // returns the pointer to the vector.

// This function unallocates the matrix (frees memory)
inline void free_matrix(double **M, int n)
for (int i = 0; i < n; i++)
delete [] M[i];

delete [] M; // frees the pointer /

// function to sum two matrices
inline void sum(double **a, double **b, double **result, int tam) {

int i, j;

for (i = 0; i < tam; i++) {
for (j = 0; j < tam; j++) {
result[i][j] = a[i][j] + b[i][j];

// function to subtract two matrices
inline void subtract(double **a, double **b, double **result, int tam) {

int i, j;

for (i = 0; i < tam; i++) {
for (j = 0; j < tam; j++) {
result[i][j] = a[i][j] - b[i][j];

// naive method
void naive(double** A, double** B,double** C, int n)
for (int i=0;i<n;i++)
for (int j=0;j<n;j++)
for(int k=0;k<n;k++)
C[i][j] += A[i][k]*B[k][j];

// Strassen's method
void strassen(double **a, double **b, double **c, int tam)
// Key observation: call naive method for matrices smaller than 2 x 2
if(tam <= 4)

// other cases are treated here:
int newTam = tam/2;
double **a11, **a12, **a21, **a22;
double **b11, **b12, **b21, **b22;
double **c11, **c12, **c21, **c22;
double **p1, **p2, **p3, **p4, **p5, **p6, **p7;

// memory allocation:
a11 = allocate_matrix(newTam);
a12 = allocate_matrix(newTam);
a21 = allocate_matrix(newTam);
a22 = allocate_matrix(newTam);

b11 = allocate_matrix(newTam);
b12 = allocate_matrix(newTam);
b21 = allocate_matrix(newTam);
b22 = allocate_matrix(newTam);

c11 = allocate_matrix(newTam);
c12 = allocate_matrix(newTam);
c21 = allocate_matrix(newTam);
c22 = allocate_matrix(newTam);

p1 = allocate_matrix(newTam);
p2 = allocate_matrix(newTam);
p3 = allocate_matrix(newTam);
p4 = allocate_matrix(newTam);
p5 = allocate_matrix(newTam);
p6 = allocate_matrix(newTam);
p7 = allocate_matrix(newTam);

double **aResult = allocate_matrix(newTam);
double **bResult = allocate_matrix(newTam);

//dividing the matrices in 4 sub-matrices:
for (int i = 0; i < newTam; i++) {
for (int j = 0; j < newTam; j++) {
a11[i][j] = a[i][j];
a12[i][j] = a[i][j + newTam];
a21[i][j] = a[i + newTam][j];
a22[i][j] = a[i + newTam][j + newTam];

b11[i][j] = b[i][j];
b12[i][j] = b[i][j + newTam];
b21[i][j] = b[i + newTam][j];
b22[i][j] = b[i + newTam][j + newTam];

// Calculating p1 to p7:

sum(a11, a22, aResult, newTam); // a11 + a22
sum(b11, b22, bResult, newTam); // b11 + b22
strassen(aResult, bResult, p1, newTam); // p1 = (a11+a22) * (b11+b22)

sum(a21, a22, aResult, newTam); // a21 + a22
strassen(aResult, b11, p2, newTam); // p2 = (a21+a22) * (b11)

subtract(b12, b22, bResult, newTam); // b12 - b22
strassen(a11, bResult, p3, newTam); // p3 = (a11) * (b12 - b22)

subtract(b21, b11, bResult, newTam); // b21 - b11
strassen(a22, bResult, p4, newTam); // p4 = (a22) * (b21 - b11)

sum(a11, a12, aResult, newTam); // a11 + a12
strassen(aResult, b22, p5, newTam); // p5 = (a11+a12) * (b22)

subtract(a21, a11, aResult, newTam); // a21 - a11
sum(b11, b12, bResult, newTam); // b11 + b12
strassen(aResult, bResult, p6, newTam); // p6 = (a21-a11) * (b11+b12)

subtract(a12, a22, aResult, newTam); // a12 - a22
sum(b21, b22, bResult, newTam); // b21 + b22
strassen(aResult, bResult, p7, newTam); // p7 = (a12-a22) * (b21+b22)

// calculating c21, c21, c11 e c22:

sum(p3, p5, c12, newTam); // c12 = p3 + p5
sum(p2, p4, c21, newTam); // c21 = p2 + p4

sum(p1, p4, aResult, newTam); // p1 + p4
sum(aResult, p7, bResult, newTam); // p1 + p4 + p7
subtract(bResult, p5, c11, newTam); // c11 = p1 + p4 - p5 + p7

sum(p1, p3, aResult, newTam); // p1 + p3
sum(aResult, p6, bResult, newTam); // p1 + p3 + p6
subtract(bResult, p2, c22, newTam); // c22 = p1 + p3 - p2 + p6

// Grouping the results obtained in a single matrix:
for (int i = 0; i < newTam ; i++) {
for (int j = 0 ; j < newTam ; j++) {
c[i][j] = c11[i][j];
c[i][j + newTam] = c12[i][j];
c[i + newTam][j] = c21[i][j];
c[i + newTam][j + newTam] = c22[i][j];

// deallocating memory (free):
free_matrix(a11, newTam);
free_matrix(a12, newTam);
free_matrix(a21, newTam);
free_matrix(a22, newTam);

free_matrix(b11, newTam);
free_matrix(b12, newTam);
free_matrix(b21, newTam);
free_matrix(b22, newTam);

free_matrix(c11, newTam);
free_matrix(c12, newTam);
free_matrix(c21, newTam);
free_matrix(c22, newTam);

free_matrix(p1, newTam);
free_matrix(p2, newTam);
free_matrix(p3, newTam);
free_matrix(p4, newTam);
free_matrix(p5, newTam);
free_matrix(p6, newTam);
free_matrix(p7, newTam);
free_matrix(aResult, newTam);
free_matrix(bResult, newTam);

} // end of Strassen function

// Generate random matrices
void gen_matrix(double** M,int n)
for(int i=0;i<n;++i)
for(int j=0;j<n;++j)

// print matrix M using specied fstream
void print_matrix(fstream& fs, double** M, int n)
for(int i=0;i<n;++i)
for(int j=0;j<n;++j)
fs<<M[i][j]<<" ";

// record the generated matrix and the final product
void mat_mult_log(double** A, double** B,double** C,int n,char* file)
fstream fs;,fstream::out);

fs<<"Random Matrix A:"<<endl;
fs<<"Random Matrix B:"<<endl;
fs<<"C=A * B"<<endl;



int main(int argc, char** argv)

int mdim=2; // matrix dimension
char* output=NULL;
bool is_strassen=false;
int c;

while ((c = getopt (argc, argv, "sn:o:")) != -1)
switch (c)
case 's':
case 'n':
mdim = pow((int)2,atoi(optarg)); // 2^n dimensions
case 'o':
output = optarg; // 2^n dimensions
case '?':
if (optopt == 'n')
fprintf (stderr, "Option -%c requires an argument.\n", optopt);
else if (isprint (optopt))
fprintf (stderr, "Unknown option `-%c'.\n", optopt);
fprintf (stderr,
"Unknown option character `\x%x'.\n",
return 1;
abort ();

// create new matrices
double** A=allocate_matrix(mdim);
double** B=allocate_matrix(mdim);
double** C=allocate_matrix(mdim);

// matrices multiplication



return 0;

Basic Linux Multi-Process and Multi-Thread Programming

Today I have to make my algorithm running in parallel in order to make it faster. At first I used following way to implement multi-process:

	unsigned int proc_num = 5;
	pid_t* pids=new pid_t[proc_num];
	double incr=(double)N/(double)proc_num;
	/* Start children. */
	for (unsigned int i = 0; i < proc_num; ++i) {
		if ((pids[i] = fork()) < 0) {
	 	} else if (pids[i] == 0) {
	    		// DoWorkInChild();

	/* Wait for children to exit. */
	int status;
	pid_t pid;
	while (proc_num > 0) {
	  	pid = wait(&status);
		if(status != 0)
	  		printf("Child Process with PID %ld aborted with status 0x%x.\n", (long)pid, status);
	  	--proc_num;  // TODO(pts): Remove pid from the pids array.

Above way worked well, however, there's no way to change the "shared" variables in child processes. Because each child process has an independent copy of all variables.

In order to change the same array in parallel, I implemented multi threads.

struct thread_info {    /* Used as argument to thread_start() */
	pthread_t thread_id;        /* ID returned by pthread_create() */
	int	start;       /* Application-defined thread # */
	int     end;      /* From command-line argument */
	int*	gpdarr;	/* Array to store GPD number */
	long int G;	/* genome length */
	unsigned int L;	/* read length */
	double l1;	// GPD parameter lambda1
	double l2;	// GPD parameter lambda2
	double M;	// maximum of GPD density

void printPt(pthread_t pt) {
  unsigned char *ptc = (unsigned char*)(void*)(&pt);
  for (size_t i=0; i<sizeof(pt); i++) {
    printf("%02x", (unsigned)(ptc[i]));

void* gen_gpd_num(void* arg)
	struct thread_info *tinfo = (struct thread_info *) arg;
	// do something here


int main()
        unsigned int tnum = 20;
	double incr=(double)N/(double)tnum;
	thread_info* tinfo=new thread_info[tnum];

	int err;
	void* res;
	/* Start children. */
	for(unsigned int idx=0;idx<tnum;++idx) 

		err = pthread_create(&tinfo[idx].thread_id, NULL, &gen_gpd_num, &tinfo[idx]);

			printf("can't create thread :[%s]", strerror(err));
			pthread_join(tinfo[idx].thread_id, &res);
        delete [] tinfo;

The potential problem of using multi-threading in Linux is -- it is hard to figure out if there really are many threads are running simultaneously. 

When compiling multi-threading program using GCC, pthread library must be specified:

g++ -o foo -Wall -L/usr/lib -lpthread


  1. How to Create Threads in Linux (With a C Example Program)


Simple Matlab/Octave Commands to Process Data

To load data in following format stored in a text file "read_pos' into Matlab/Octave, use commands

f=fopen('read_pos','r'); % open file
a=textscan(f,'%s%d'); % read file
y=a{2}; % a is a cell: a{1} stores read id, a{2} stores numbers.

Then data of the second column are stored in variable y.

r12353.1 2407054.5
r12361.1 5328858.5
r12363.1 2360272
r12368.1 4726440.5
r12372.1 2224001
r12373.1 5165613.5
r12381.1 501776
r12385.1 3475398
r12394.1 3376364
r12401.1 2142875.5
r12411.1 2191090.5
r12419.1 1240590
r12420.1 4903572
r12422.1 767011.5
r12426.1 3575915.5
r12429.1 554956
r12433.1 4786335
r12435.1 3373955.5
r12442.1 5363611.5
r12452.1 1903660.5
r12454.1 2784165
r12466.1 5137479
r12470.1 592191

If there are only numbers stored in file "read_pos", following commands should be used:

f=fopen('read_pos','r'); % open file
a=textscan(f,'%s%d'); % read file
y=cell2mat{a}; % convert cell a to vector
size(y) % check the size

To plot the distribution of all the numbers, you can:


If another data set z is also loaded and you want to plot y and z in the same graph, you can:

hold on

To annotate and scale the graph, use commands;

xlabel('AMD Data k-mer Occurrances');
title('GPD Classification of AMD k-mers')
legend('Bin 1','Bin 2')
axis([-5 5 0 10000]) % scale axes

A matlab function to plot multiple datasets:

function result=plot_distr(varargin)


marker={'bo' 'r+' 'k*' 'gs' 'y^' 'md' 'cp' 'bx' 'r.' 'k>' 'gh' 'y<'};
for k=1:nVarargs
hold on
ylim([0-5*nVarargs 6*nVarargs]); % scale y-axis