Blog - Mustafa Baydogan

Note that this is an old post from 2012-07-19 moved to new site.

Performance improvements for TSBF

While running TSBF on the new data from the UCR database for our revisions to the paper, I realized that the current R implementation is not efficient. The overall approach is still not well implemented, since feature extraction is done separately (in C code) and the connection with R is through text files. This increases the running time of TSBF significantly, because reading the files into matrices in R takes substantial time (especially for large datasets). To shorten the time for reading the feature matrices and to handle memory more efficiently, I made the following revisions:

1) Removing matrices that are no longer used (memory management), illustrated below for some of the matrices.

rm(subtr)
rm(subtst)
gc(verbose=TRUE)
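A minimal, self-contained sketch of this cleanup step is given here. The matrix name `big` and its size are made up for illustration and are not part of the TSBF code.

```r
# Hypothetical stand-in for a large feature matrix (1e6 doubles, roughly 8 MB).
big <- matrix(runif(1e6), nrow = 1000)
size_mb <- as.numeric(object.size(big)) / 1024^2
cat(sprintf("matrix size: %.1f MB\n", size_mb))

# Drop the only reference to the matrix, then trigger a garbage collection.
rm(big)
mem <- gc()   # returns a matrix of memory usage; verbose=TRUE also prints details

# Check that the object is gone from the workspace.
still_there <- exists("big")
print(still_there)
```

R collects garbage on its own, so the explicit gc() call is not required for correctness. Calling it right after removing a large object may prompt R to hand the memory back to the operating system sooner, which matters when the next step allocates another large matrix.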

2) Reading subsequence features into a matrix using scan (improves memory usage and computation time).

Before (read.table reads into a data.frame, which is not memory-efficient when the data is all numeric; using a matrix instead improves both memory usage and reading time):

#read generated features
subtr <- read.table("RFsub_train")
subtst <- read.table("RFsub_test")

After (I added two lines of code to the C implementation so that we know the number of subsequences per time series and the number of columns of the feature matrix):

#read subsequence data information and generated features
stats <- scan("stats", n=2, quiet=TRUE) # [1] number of subsequences per time series, [2] number of features
nsub <- stats[1]*noftrain     # total number of training subsequences
nfeat <- stats[2]             # number of columns of the feature matrix
nsubtest <- stats[1]*noftest  # total number of test subsequences
subtr <- matrix(scan("RFsub_train", what=numeric(0), n=nsub*nfeat, quiet=TRUE), nsub, nfeat, byrow=TRUE)
subtst <- matrix(scan("RFsub_test", what=numeric(0), n=nsubtest*nfeat, quiet=TRUE), nsubtest, nfeat, byrow=TRUE)
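The reading pattern above can be tried without the TSBF files using the sketch below. It is only an illustration under assumptions: the toy dimensions, the temporary file locations, and the exact layout of the files written by the C code (two numbers in the stats file, one whitespace-separated subsequence per row in the feature file) are guesses, not taken from the actual implementation.

```r
# Work in a temporary directory so nothing real is overwritten.
tmp <- tempfile("tsbf_demo_")
dir.create(tmp)
stats_file <- file.path(tmp, "stats")
feat_file  <- file.path(tmp, "RFsub_train")

# Toy stand-ins: 3 series, 4 subsequences per series, 5 features.
noftrain <- 3
nsub_per_series <- 4
nfeat <- 5
feats <- matrix(round(runif(noftrain * nsub_per_series * nfeat), 4),
                ncol = nfeat)

# Mimic what the C code is assumed to write: a two-number stats file and
# a whitespace-separated feature file with one subsequence per row.
write(c(nsub_per_series, nfeat), file = stats_file)
write.table(feats, file = feat_file, row.names = FALSE, col.names = FALSE)

# Read the dimensions first, then read the features straight into a matrix.
stats <- scan(stats_file, n = 2, quiet = TRUE)
nsub  <- stats[1] * noftrain
subtr <- matrix(scan(feat_file, what = numeric(0), n = nsub * stats[2],
                     quiet = TRUE),
                nsub, stats[2], byrow = TRUE)

# Compare with read.table: same values, but stored as a plain numeric matrix.
via_table <- as.matrix(read.table(feat_file))
same <- isTRUE(all.equal(subtr, via_table, check.attributes = FALSE))
print(dim(subtr))
print(same)
```

Knowing the dimensions in advance is what makes this work: scan returns a flat numeric vector, and byrow=TRUE is needed because the file stores one row of the matrix per line.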

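Performance with and without scan on a Windows 7 system with a 2.13 GHz i5 processor (feature matrix of subsequence features for the CinC_ECG_torso dataset, matrix size: 407100 x 102). Elapsed time drops from about 1169 seconds with read.table to about 122 seconds with scan, roughly a 9.6-fold speed-up: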

system.time(subtst <- read.table("RFsub_test"))
   user  system elapsed
1141.50    6.89 1169.18

system.time(matrix(scan("RFsub_test", what=numeric(0), n=nsubtest*nfeat, quiet=TRUE), nsubtest, nfeat, byrow=TRUE))
   user  system elapsed
 116.93    2.48  121.59
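To repeat this kind of comparison on another machine, a sketch along the following lines could be used. The synthetic file and its size (20000 x 20) are assumptions chosen to keep the run short. The timings will differ from the figures above and depend on the machine, the R version and the file size.

```r
# Synthetic stand-in for a feature file (much smaller than 407100 x 102).
nrows <- 20000
ncols <- 20
f <- tempfile("bench_")
write.table(matrix(runif(nrows * ncols), nrows, ncols), file = f,
            row.names = FALSE, col.names = FALSE)

# Time both ways of reading the same file.
t_table <- system.time(a <- read.table(f))
t_scan  <- system.time(
  b <- matrix(scan(f, what = numeric(0), n = nrows * ncols, quiet = TRUE),
              nrows, ncols, byrow = TRUE))

# Elapsed seconds for each approach.
print(c(read.table = t_table[["elapsed"]], scan = t_scan[["elapsed"]]))

# Memory footprint (bytes) of the data.frame and of the matrix.
print(c(data.frame = as.numeric(object.size(a)),
        matrix     = as.numeric(object.size(b))))

# Speed-up implied by the elapsed times reported in the post.
reported_speedup <- 1169.18 / 121.59
print(round(reported_speedup, 1))
```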

Please let me know if you have any questions! The direct link to the folder for the updated files is here.

Note that this is an old post from 2012-05-24 moved to new site.

Comparing multivariate tree algorithms

This entry explains how to run the multivariate tree algorithms CRUISE, QUEST and LOTUS (two versions) from the command line using MATLAB.

For our paper, “SMT: Sparse Multivariate Tree”, we had to run CRUISE, QUEST, LOTUS(S) and LOTUS(M) for comparison purposes (based on v-fold crossvalidation on the training data). Like SMT, all three methods are multivariate trees. Details of the algorithms can be found at:

CRUISE: http://www.stat.wisc.edu/~loh/cruise.html
QUEST: http://www.stat.wisc.edu/~loh/quest.html
LOTUS(S) and LOTUS(M) : http://www.stat.wisc.edu/~loh/lotus/lotus.html
SMT: coming soon @ http://enpub.fulton.asu.edu/hdeng3/

To run CRUISE, QUEST and LOTUS on the datasets in the SMT paper, please follow the steps below (for a Windows environment):

1) Download the zipped folder that contains all required files by clicking here (it contains the “Windows executable for Intel and compatibles” from the websites above).
2) Extract the zip file wherever you want.
3) The folder contains 4 batch files (*.bat) for running CRUISE, QUEST, LOTUS(S) and LOTUS(M) from the command line. Each batch file (for example “quest_batchfile.bat”) contains three commands; the ones shown here are for CRUISE:
cd C:\mustafa\ogrencilik\ASU\PHD\sparseMultivariateTrees\batch --> change this path to the location where you extracted the zip file
cruise < cruise_batch --> run the CRUISE algorithm with the settings in the cruise_batch file
exit --> exit the command line
4) Make the path change in each of the 4 batch files.
5) If required, change the parameter settings in the file named “methodname”_batch (e.g. cruise_batch). The current settings are the default settings.
6) There is a subfolder named “data” in which you provide your training data. The data must be laid out with the class information (dependent variable) in the first column and the features (predictors) in the remaining columns.
7) In the subfolder named “data”, there is a MATLAB script file named “Main.m”. Open it in the MATLAB editor. On line 9, change the location of the batch files to the folder where you unpacked the zip file. Change other parameters, such as the number of folds and the number of replications, if required.
8) Run “Main.m” from the subfolder “data”. MATLAB may ask whether the current folder should be changed; change it if prompted.
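The data layout described in the steps above (class in the first column, predictors after it) and the idea of splitting the rows into v folds can be sketched in R as follows. This is only an illustration: the toy data, the file name `toy_train.txt`, the whitespace delimiter and the choice of v = 5 are all assumptions, and the actual fold handling is done by Main.m in MATLAB.

```r
set.seed(1)  # reproducible toy example

# Toy data: 10 observations, 3 numeric predictors, 2 classes.
n <- 10
predictors <- matrix(round(rnorm(n * 3), 3), n, 3)
class_label <- rep(c(1, 2), length.out = n)

# Class (dependent variable) in the first column, predictors after it.
train <- cbind(class_label, predictors)

# Hypothetical output file; the delimiter expected by Main.m is an assumption.
out_file <- file.path(tempdir(), "toy_train.txt")
write.table(train, file = out_file, row.names = FALSE, col.names = FALSE)

# Assign each row to one of v folds, as a v-fold crossvalidation would.
v <- 5
fold <- sample(rep(seq_len(v), length.out = n))
print(table(fold))
```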