To get the submodules:
git pull
git submodule update --init

Each folder contains an implementation that we tested.
- par_tmfg: our parallel TMFG and DBHT implementation.
- hac: the hierarchical agglomerative clustering (HAC) algorithm by Yu et al.
- Aste: Aste's MATLAB TMFG+DBHT implementation, modified from DBHT and PMFG. The modifications add timers for benchmarking and substitute some subroutines for better performance. Specifically, we changed Aste's TMFG+DBHT implementation (the DBHTs.m file) to use the boost library's all-pairs shortest path and breadth-first search implementations, because this gives a significant speedup. Aste's MATLAB PMFG+DBHT implementation also uses boost's implementation.
- mpi-scalablekmeanspp: the C++ implementation of k-means++.
- g++ = 7.5.0
- make
- C++ boost library
- MATLAB
- MATLAB BGL
After boost is installed, set the BOOST_ROOT variable in par_tmfg/Makefile to the path of the boost folder.
The input to both implementations is a symmetric matrix.
The format of the file is a binary file with dataset folder.
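Since both implementations expect a symmetric matrix, it may be worth checking a matrix before writing it to disk. The following is a minimal sketch of such a check, not part of this repository; the helper name `is_symmetric` and the tolerance `tol` are assumptions made for illustration, and the matrix is represented as a plain list of rows.

```python
def is_symmetric(matrix, tol=1e-9):
    """Return True if `matrix` (a list of rows) is square and symmetric.

    Hypothetical helper for illustration; `tol` is an assumed tolerance
    for floating-point comparison.
    """
    n = len(matrix)
    # Every row must have exactly n entries for the matrix to be square.
    if any(len(row) != n for row in matrix):
        return False
    # Compare each entry above the diagonal with its mirror below it.
    for i in range(n):
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


# Small made-up similarity matrix used only as an example.
similarity = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(is_symmetric(similarity))
```

The check is quadratic in the number of data points, which should be negligible next to the clustering itself.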
The UCR data sets can be downloaded from here. The stock data can be obtained using the Yahoo Finance API. Our data was obtained in Nov. 2021.
You can also download our data here. There is a readme.md in the data repository linked above that explains how to use the datasets.
For running time tests, we use numactl. It can be installed using apt install numactl.
Run make in hac/general_hac.
PARLAY_NUM_THREADS=wk numactl -i all ./linkage dataset n output method round
- wk is the number of workers to use
- numactl -i all is optional
- dataset is the file name of the input distance matrix (in binary format)
- n is the number of data points
- output is the file name of the output file for the resulting dendrogram
- method can be "comp" or "avg" for complete linkage and average linkage, respectively
- round is the number of times to run the program
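When running many configurations, the command line described above can be assembled programmatically. The sketch below is one possible way to do that and is not part of this repository; the function name `build_linkage_command` and its keyword arguments are hypothetical, and only the argument order and the `PARLAY_NUM_THREADS` variable come from the usage line above.

```python
import os


def build_linkage_command(workers, dataset, n, output, method, rounds=1, use_numactl=True):
    """Build (env, argv) for the ./linkage usage line described above.

    Hypothetical helper for illustration only.
    """
    # The usage notes list only these two linkage methods.
    if method not in ("comp", "avg"):
        raise ValueError('method must be "comp" or "avg"')
    # Copy the current environment and set the worker count.
    env = dict(os.environ, PARLAY_NUM_THREADS=str(workers))
    argv = ["./linkage", dataset, str(n), output, method, str(rounds)]
    # numactl is optional, so it is only prepended on request.
    if use_numactl:
        argv = ["numactl", "-i", "all"] + argv
    return env, argv


env, argv = build_linkage_command(4, "../../datasets/CBF.dat", 930,
                                  "outputs/CBF_comp_dendro", "comp")
print(" ".join(argv))
# To actually run it (from the repository root, after make):
#   import subprocess
#   subprocess.run(argv, env=env, cwd="hac/general_hac", check=True)
```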
cd hac/general_hac
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./linkage ../../datasets/CBF.dat 930 outputs/CBF_comp_dendro comp 1

Run make in par_tmfg.
PARLAY_NUM_THREADS=wk numactl -i all ./tmfg S output n D method prefix round
- wk is the number of workers to use
- numactl -i all is optional
- S is the file name of the input similarity matrix (in binary format)
- output is the file name prefix of the output files for the resulting dendrogram (-Z) and the resulting TMFG (-P). The outputs are saved in the folders "par_tmfg/outputs/Ps/" and "par_tmfg/outputs/Zs/", so these two folders should be created in advance.
- n is the number of data points
- D is the file name of the input dissimilarity matrix. If D=0, the program uses D = sqrt(2(1-s))
- method can be "exact" or "prefix"
- prefix is the prefix size to insert in each round. It is ignored when method is "exact"
- round is the number of times to run the program
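The default dissimilarity D = sqrt(2(1-s)) mentioned in the list above can be illustrated with a short sketch. This is not the code used by tmfg; it assumes the formula is applied to each entry of the similarity matrix, and the function name `similarity_to_dissimilarity` and the clamp to zero are additions made for this example.

```python
import math


def similarity_to_dissimilarity(similarity):
    """Apply D = sqrt(2 * (1 - s)) to every entry of a similarity matrix.

    Hypothetical helper; assumes entry-wise application of the formula.
    """
    # max(0.0, ...) is an added safeguard (not from the source) against tiny
    # negative values when rounding pushes s marginally above 1.
    return [[math.sqrt(max(0.0, 2.0 * (1.0 - s))) for s in row] for row in similarity]


# Made-up correlation-like similarities in [-1, 1].
similarity = [
    [1.0, 0.5, -1.0],
    [0.5, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
]
dissimilarity = similarity_to_dissimilarity(similarity)
for row in dissimilarity:
    print(row)
```

Under this formula a similarity of 1 maps to a dissimilarity of 0 and a similarity of -1 maps to 2, so more similar points end up closer together.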
cd par_tmfg
make
PARLAY_NUM_THREADS=${wk} numactl -i all ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 prefix 2 1
PARLAY_NUM_THREADS=1 ./tmfg ../datasets/CBF.dat outputs/CBF 930 0 exact 0 1

UCR_PMFG(dataset, inputdir, outputdir)
UCR_TMFG(dataset, inputdir, outputdir)
- dataset is the name of the dataset
- inputdir is the directory of the input dataset
- outputdir is the output directory
cd Aste
matlab -nojvm -nosplash -nodesktop -nodisplay -r 'UCR_PMFG("iris", "../datasets/", "outputs"); exit' -logfile outputs/iris_pmfg_timing.txt
matlab -nojvm -nosplash -nodesktop -nodisplay -r 'UCR_TMFG("iris", "../datasets/", "outputs"); exit' -logfile outputs/iris_tmfg_timing.txt

The C++ k-means++ code is in the mpi-scalablekmeanspp/ folder.
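from sklearn.cluster import SpectralClustering

# k: number of clusters; n_neighbor: neighbors used to build the affinity graph;
# worker: number of parallel jobs; X: data array of shape (n_samples, n_features).
SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                   n_neighbors=n_neighbor,
                   assign_labels='discretize',
                   random_state=1, n_jobs=worker).fit(X)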