While I have published research papers in several areas such as data mining, bioinformatics and algorithmics, most of my research efforts focus on fast clustering as well as its foundation and applications.

Clustering is an important data mining problem and has many applications in areas such as bioinformatics and image processing; however, clustering has poor performance on large data-sets, especially when the dissimilarity measurement (also called distance) between data objects is time consuming to compute. To speedup clustering on large data-sets, we have developed two approaches. The first approach is data summarization. Essentially our method performs a smart sampling to collect more information from the data than a naive random sampling can provide, so that the clustering result on the smart samples is closer to that of the original data-set than clustering on random samples. This result was published in [Zhou and Sander, VLDB 2003]. In the second approach, our method predicts the distance between data objects to avoid unnecessary distance computations. While many hierarchical and density based clustering algorithms in the worst case compute a quadratic number of distances, only a small portion of them are needed to construct the clustering results. Using a property on distance functions, our method is able to predict which distance is small with reasonable accuracy. The result was published in [Zhou and Sander, ICDM 2006].

A more fundamental approach to speedup clustering is to speedup similarity search because similarity search accounts for most of the runtime of clustering on large data-sets. In metric space, a typical method to speedup similarity search is to apply triangle inequalities to avoid distance computations. The challenge is that in high dimensional space and general metric space, distances between objects are dominantly long distances, rendering triangle inequality pruning ineffective. In our recent work [Zhou et al., TCBB 2010], we proposed a special type of data objects called partial pivots for which our method predicts and computes short distances associated with them, and use them to effectively perform triangle inequality pruning.

References

Zhou, J. and Sander, J. (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces, Proceeding of the 29th International Conference on Very Large Data Bases (VLDB-03), pp 452-463.

Zhou, J. and Sander, J. (2006) Speedup clustering with hierarchical ranking, Proceedings of the 6th International Conference on Data Mining (ICDM-06), pp 1205-1210.

Zhou, J. et al. (2010) Finding the nearest neighbors in biological databases using less distance computations, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(4): 669-680.