
## LPS package is updated

Last Updated on Thursday, 03 April 2014 23:24 Monday, 23 December 2013 14:26

I recently updated the Learned Pattern Similarity (LPS) package. The implementation for computing LPS is now faster: training takes almost half the time of the earlier implementation, and testing time has decreased significantly. The new version will be uploaded to the Files section soon. I am working on a comparison of the currently available implementation with the new one. Once I am done with the experiments, this page will be updated.

Stay tuned!

## Updated TSPD codes are online

Last Updated on Sunday, 29 December 2013 01:02 Sunday, 22 December 2013 09:50

After submitting our paper, "Supervised Time Series Pattern Discovery through Local Importance" (TSPD) (supporting page), we made the code available online. The functions used in TSPD are implemented as part of my recent R package, LPS. The source code for LPS is available here.

We illustrated the performance of TSPD on classification problems. The R code for classification is provided in the Files section. This example uses the GunPoint dataset from the UCR Time Series Database. Here, I will go over the steps and explain how to run TSPD for classification.

1. Calling the package and setting the parameters: Assuming that you have installed the LPS and randomForest packages, we load them using the require function. We set the parameters as described in the paper and, as mentioned there, use the same parameter setting for all datasets. The comments (after #) describe the correspondence between the variables and the parameters of TSPD.

```
require(LPS)
require(randomForest)

# Parameters: the corresponding parameter notation in the paper is provided
nrep=10                 # number of replications in TSPD
treePerIter=50          # J=J_I=J_P number of trees
intmaxfrac=0.05         # I(max) maximum interval length as a fraction of TS length
maxshapefrac=0.25       # used to set L based on a fraction of TS length
kfrac=c(2,1,0.5,0.25)   # K (number of patterns) is set based on certain levels of number of training data (N)
ksteps=c(0,100,500,1000,10000)  # N levels for setting K (i.e. if N<100, K=kfrac[1]*N, which is K=2N)
```

2. Organization of the training and test files, setting parameters that are based on training dataset characteristics: This consists of three main tasks. First, the files are read and the class information is extracted. Second, the time series are standardized to zero mean and unit standard deviation to make the approaches comparable to the DTW results provided by the UCR Time Series Database. Third, we set the number of patterns and the maximum possible interval length based on the number of training instances and the time series length.

```
# read training data and characteristics
# (traindata and testdata are assumed to be already loaded as numeric matrices,
#  one time series per row, with the class label in the first column)
trainclass=traindata[,1]
noftrain=nrow(traindata)
traindata=t(apply(traindata[,2:ncol(traindata)], 1, function(x) (x-mean(x))/sd(x)))
nofclass=length(unique(trainclass))
lenseries=ncol(traindata)

# test data and characteristics
noftest=nrow(testdata)
testclass=testdata[,1]
testdata=t(apply(testdata[,2:ncol(testdata)], 1, function(x) (x-mean(x))/sd(x)))

nofpattern=floor(kfrac[findInterval(noftrain,ksteps)]*noftrain) # setting K based on N
intmaxL=floor(lenseries*intmaxfrac) # maximum interval length for feature generation
```
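To make the rule for K concrete, here is a minimal sketch in base R that applies the same findInterval lookup to a few training set sizes. The sizes in ntrain and the helper name pick_k are hypothetical and only serve the illustration.

```r
# Levels and fractions as set in step 1
kfrac  <- c(2, 1, 0.5, 0.25)
ksteps <- c(0, 100, 500, 1000, 10000)

# K for a given number of training series N (hypothetical helper,
# same expression as the line that sets nofpattern)
pick_k <- function(noftrain) {
  floor(kfrac[findInterval(noftrain, ksteps)] * noftrain)
}

ntrain   <- c(50, 100, 600, 2000)   # hypothetical training set sizes
k_values <- sapply(ntrain, pick_k)

print(data.frame(N = ntrain,
                 level = findInterval(ntrain, ksteps),
                 K = k_values))
```

Note that with these two vectors a training set of 10000 or more series falls into a fifth interval for which kfrac has no entry, so the expression returns NA. Very large datasets would need an extra kfrac value.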

3. Training: This consists of training RFint and RFpattern for nrep replications. The code for one replication is given below. We select a random interval length between 5 and I(max) time units and train RFint on the interval representation. We sample patterns based on the local importance from RFint and compute the best matching distances of the time series to the patterns. We then train RFpattern on this representation.

Initialize the matrix for storing predictions and the data structure for storing pattern information over replications:

```
allvotes=matrix(0,noftest, nofclass)
shapeletInfo=list(select=matrix(0,nofpattern,nrep),level=matrix(0,nofpattern,nrep))
```

Single replication of training TSPD (n is the index of the current replication, from 1 to nrep):

```
intlen=max(5,floor(runif(1)*intmaxL)+1) # select random interval length (w) between 5 and I(max) -> intmaxL
slidelen=floor(intlen/2) # slide length is half of the interval length, as described in the paper
maxInt=floor((lenseries*maxshapefrac)/(intlen))+1 # set K level

# train RFint
train=intervalFeatures(traindata,intlen,slidelen)
RFint <- randomForest(train$features,factor(trainclass),ntree=treePerIter,localImp=TRUE)
localimp=RFint$localImp

# train RFpattern
shapelet=shapeletSimilarity(traindata,localimp,train,maxInt,nshapelet=nofpattern)
RFpattern=randomForest(shapelet$similarity,factor(trainclass),ntree=treePerIter)

# store the selected patterns of replication n
shapeletInfo$select[,n]=shapelet$sel
shapeletInfo$level[,n]=shapelet$lev
```
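The random choice of the interval length can be checked in isolation. The following is a minimal sketch in base R, assuming a series length of 150 together with the fractions from step 1. It draws the interval length many times with the same expression as above and lists the derived slidelen and maxInt values for every interval length that can occur.

```r
# Assumed example values: series length 150, fractions as in step 1
lenseries    <- 150
intmaxfrac   <- 0.05
maxshapefrac <- 0.25
intmaxL <- floor(lenseries * intmaxfrac)   # maximum interval length

# Draw the interval length many times, exactly as in one replication
set.seed(1)
draws <- replicate(1000, max(5, floor(runif(1) * intmaxL) + 1))

# Derived settings for every interval length that can occur
intlen <- 5:intmaxL
settings <- data.frame(
  intlen   = intlen,
  slidelen = floor(intlen / 2),
  maxInt   = floor((lenseries * maxshapefrac) / intlen) + 1
)

print(table(draws))
print(settings)
```

Under these assumptions the draw before taking the maximum is uniform on 1 to 7, so every value up to 5 is mapped to an interval length of 5. The shortest interval is therefore selected most often.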

4. Testing: Testing requires computing the best matching distances of the test time series to the patterns and classifying them with RFpattern. The voting results are aggregated in the allvotes matrix (of dimensions noftest x nofclass). The largest vote determines the class for each time series. The code for one testing replication of TSPD is provided below.

Single replication of testing TSPD:

```
test=shapeletSimilarityTest(testdata, traindata, shapelet$importanceOrder, shapeletInfo, train, n)
prediction=predictShapelet(RFpattern, test$similarity, whichTrees=c(1,treePerIter))
```
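The aggregation step described above can be sketched in base R as follows. The vote counts are made-up numbers for four test series, two classes and three replications of 50 trees each. They stand in for whatever the prediction object holds, since its exact structure is not shown here.

```r
noftest  <- 4
nofclass <- 2
classes   <- c(1, 2)          # hypothetical class labels
testclass <- c(1, 2, 1, 2)    # hypothetical true labels

# Hypothetical vote counts per replication (rows: test series, columns: classes)
votes_per_rep <- list(
  matrix(c(40, 10,  5, 45, 30, 20, 20, 30), noftest, nofclass, byrow = TRUE),
  matrix(c(35, 15, 10, 40, 20, 30, 22, 28), noftest, nofclass, byrow = TRUE),
  matrix(c(45,  5, 15, 35, 24, 26, 30, 20), noftest, nofclass, byrow = TRUE)
)

# Accumulate the votes over replications
allvotes <- matrix(0, noftest, nofclass)
for (n in seq_along(votes_per_rep)) {
  allvotes <- allvotes + votes_per_rep[[n]]
}

# The class with the largest total vote wins
predicted  <- classes[apply(allvotes, 1, which.max)]
error_rate <- 1 - sum(testclass == predicted) / noftest

print(allvotes)
print(error_rate)
```

The third series shows why aggregation matters in this toy example: individual replications disagree on its class, and only the summed votes decide it.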

A SAMPLE RUN RESULT

A screenshot of a sample run of TSPD on the GunPoint dataset for 10 replications is provided below (Ubuntu 12.10 system with 8 GB RAM and a dual-core i7-3620M 2.7 GHz CPU):

Sample patterns found to be important by RFpattern
This output can be compared to the results from other shapelet studies. A Google search for 'Gun-point shapelet' should return some relevant links. There is a good summary of the datasets and their descriptions on the jmotif Google Code homepage. The patterns discovered match the class descriptions.

## Extending the Time Series Bag-of-Features (TSBF) for multivariate time series classification

Last Updated on Tuesday, 29 October 2013 10:24 Tuesday, 29 October 2013 01:00

During the revision of our SMTS paper, we extended TSBF to multivariate time series classification (MTSBF) for comparison purposes. The code is available here.

MTSBF performs better than SMTS on some datasets, whereas SMTS significantly outperforms MTSBF on others. This is due to the problem characteristics. When the relationships between the attributes are important in the definition of a class, SMTS generally performs better, with a representation that is quite simple conceptually and operationally.

This archive (*.zip file) stores the required R files, compiled files (*.so file for Linux-based systems) and source code (in C). If you want to run this on Windows, you need to compile the C files using the command "R CMD SHLIB /pathname" to generate a *.dll (dynamic library). You will also need to modify the file named 'multivariateTSBF_functions.r' accordingly: in its current form, this file does not check the operating system and looks for a "*.so" file.
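One possible way to make the loading step independent of the operating system is to build the file name from the platform's shared library extension. This is only a sketch: the base name below is hypothetical and must be replaced with the actual name of the compiled file in the archive.

```r
# Hypothetical base name of the compiled library; use the actual file name
lib_base <- "multivariateTSBF"

# .Platform$dynlib.ext is ".so" on Linux and ".dll" on Windows
lib_file <- paste0(lib_base, .Platform$dynlib.ext)

if (file.exists(lib_file)) {
  dyn.load(lib_file)
} else {
  message("Compiled library not found: ", lib_file)
}
```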

We also have a sample dataset (ECG dataset from http://www.cs.cmu.edu/~bobski/). MTSBF uses the same file structure as SMTS.

Let me know if you have any questions.

## My seminar on Learned Pattern Similarity at Arizona State University

Last Updated on Saturday, 12 October 2013 19:49 Saturday, 12 October 2013 19:44

On October 11th, 2013, I gave a talk on our recent work, "Learned Pattern Similarity (LPS)", in Computing, Informatics and Decision Systems Engineering (CIDSE) at Arizona State University (ASU). The announcement is here. The presentation is available in the Files section.

This version is slightly different from (and probably better than) the version presented at the INFORMS conference in Minneapolis (2013). The INFORMS presentation had to be short because of time limitations, whereas I had enough time for this seminar.

The slides are best viewed in slideshow view because they contain animations. Let me know if you have any questions.

## The presentation of Learned Pattern Similarity (LPS) is uploaded

Last Updated on Thursday, 03 April 2014 23:25 Wednesday, 09 October 2013 06:49

The presentation of Learned Pattern Similarity (LPS) at the INFORMS'13 conference in Minneapolis has been uploaded. You can find it at http://www.mustafabaydogan.com/files/viewcategory/8-presentations.html

Please let me know if you have any questions!

## LPS package is online

Last Updated on Friday, 10 October 2014 23:50 Monday, 23 September 2013 08:23

This blog entry is outdated. Please check the R package on CRAN. Here is the link to the package page. The manual provides all the necessary information about running LPS for univariate time series.

After submitting our paper, "Time series similarity based on a pattern-based representation" (supporting page), we made the R package (LPS) available online. It still requires a significant amount of work in terms of documentation, which is why I cannot submit it to CRAN in its current form. Hopefully, I will finish it soon.

We illustrated the performance of the similarity measure on classification problems. The R code for classification is provided in the Files section. This example uses the GunPoint dataset from the UCR Time Series Database. Here, I will go over the steps and explain how to run LPS for classification.

1. Calling the package and setting the parameters: The functions implemented in the R package default to the parameter settings used in the paper. The package functions are loaded with the require() function. The segment length factors to be evaluated by cross-validation are defined as an array named seglenfactor. The number of trees for learning patterns (denoted as J in the paper) is set to 150, as in the paper. The path of the files should be provided. Here, the files are in my working directory.

```
require(LPS)

# parameters (L and J)
seglenfactor=c(0.25,0.5,0.75)
noftree=150

trainfile='GunPoint_TRAIN'
testfile='GunPoint_TEST'
```

2. Organization of the training and test files, setting replication parameters: The files are read and the class information is extracted. The time series are then standardized to zero mean and unit standard deviation to make the approaches comparable to the DTW results provided by the UCR Time Series Database.

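```
# read train data
# (assumption: the files are whitespace-delimited with the class label in the first column)
traindata_labeled=as.matrix(read.table(trainfile))
class_train=traindata_labeled[,1]
noftrain_labeled=nrow(traindata_labeled)

# standardize (if needed)
traindata_labeled=t(apply(traindata_labeled[,2:ncol(traindata_labeled)], 1, function(x) (x-mean(x))/sd(x)))

# read test data (same assumption as above)
testdata=as.matrix(read.table(testfile))
nof_test=nrow(testdata)
class_test=testdata[,1]

# standardize (if needed)
testdata=t(apply(testdata[,2:ncol(testdata)], 1, function(x) (x-mean(x))/sd(x)))
```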
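As a quick sanity check of the standardization step, the sketch below applies the same expression to a small random matrix and verifies that every row ends up with zero mean and unit standard deviation. The data are random numbers generated only for this illustration.

```r
# Toy data: 3 series of length 10, with a class label in the first column
set.seed(1)
raw <- cbind(c(1, 2, 1), matrix(rnorm(30, mean = 5, sd = 3), nrow = 3))

# Same row-wise standardization as above (label column dropped)
z <- t(apply(raw[, 2:ncol(raw)], 1, function(x) (x - mean(x)) / sd(x)))

row_means <- rowMeans(z)
row_sds   <- apply(z, 1, sd)

print(round(row_means, 10))
print(round(row_sds, 10))
```

A constant series has a standard deviation of zero, so this expression would produce NaN values for it. Such series need to be handled separately if they occur.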

3. Training: This consists of tuning the parameters and learning the patterns with the tree-based ensemble. The tunelearnPattern() function is implemented for tuning. After tuning, the best segment length factor and depth level are used for learning patterns with learnPattern(). The arguments of learnPattern() are almost the same as those of tunelearnPattern(), which are described below:

tunelearnPattern <- function(x, y, unlabeledx=NULL, nfolds=5, segmentlevels=c(0.25,0.5,0.75), mindepth=4, maxdepth=8, depthstep=2, ntreeTry=25, diff.target=TRUE, diff.segment=TRUE, ...)

- x: the training data
- y: the labels of the training data
- unlabeledx: LPS may benefit from unlabeled data. This argument is created for future purposes.
- nfolds: number of folds for cross-validation (default setting=5)
- segmentlevels: segment levels to be evaluated (default setting c(0.25,0.5,0.75))
- (mindepth, maxdepth, depthstep): determine the depth levels to be evaluated (default setting evaluates 4,6,8)
- ntreeTry: number of trees to be used by pattern learning for each fold (default setting=25)
- diff.target: true if the target can be a difference series, false otherwise (default setting=true)
- diff.segment: true if the predictor segment can be a difference series, false otherwise (default setting=true)

Tuning and pattern learning scripts are given below:
```
tune=tunelearnPattern(traindata_labeled, class_train, segmentlevels=seglenfactor, mindepth=4, maxdepth=8, depthstep=2, ntreeTry=25, target.diff=T,segment.diff=T)

# learn patterns
ensemble=learnPattern(traindata_labeled, segment_length_factor=tune$best.seg, target.diff=T, segment.diff=T, ntree=noftree, maxdepth=tune$best.depth, replace=FALSE)
```
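To give an idea of what the tuning call has to evaluate, the sketch below enumerates the candidate settings implied by the default arguments and builds a fold assignment for cross-validation. How tunelearnPattern() organizes the search internally is not shown here, so the fold assignment and the assumed 50 training series are illustrative only.

```r
# Default tuning arguments
segmentlevels <- c(0.25, 0.5, 0.75)
mindepth  <- 4
maxdepth  <- 8
depthstep <- 2
nfolds    <- 5

# Depth levels and the full grid of candidate settings
depthlevels <- seq(mindepth, maxdepth, by = depthstep)
grid <- expand.grid(segment = segmentlevels, depth = depthlevels)

# Hypothetical fold assignment for 50 training series
set.seed(1)
noftrain <- 50
fold <- sample(rep(seq_len(nfolds), length.out = noftrain))

print(grid)
print(table(fold))
```

With the defaults this gives nine candidate combinations of segment level and depth, each of which would be evaluated on five folds in this sketch.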

The description of the "replace" parameter is not clear. If "replace" is set to true, the patterns are learned using the bagging idea from random forests, in which case each tree selects a subset of the time series. Our approach, on the other hand, requires the maximum possible number of training instances to find better representations, so this parameter is set to false. We kept the parameter to allow control over the number of training instances, which can significantly reduce the training time.
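The effect of sampling with replacement on the number of distinct training series can be illustrated in base R. This sketch only demonstrates the general bagging idea for an assumed training set of 50 series. It does not reproduce how LPS draws its samples internally.

```r
set.seed(1)
noftrain <- 50   # assumed number of training series

# Bagging: sample with replacement, so some series are drawn repeatedly
with_replacement <- sample(noftrain, noftrain, replace = TRUE)

# No replacement: every series is used exactly once
without_replacement <- sample(noftrain, noftrain, replace = FALSE)

n_distinct_bagged <- length(unique(with_replacement))
n_distinct_full   <- length(unique(without_replacement))

print(c(bagged = n_distinct_bagged, full = n_distinct_full))
```

For large training sets a bootstrap sample contains about 63% of the distinct instances on average, which is why sampling with replacement leaves each tree with fewer distinct series to learn from.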

4. Testing: At this stage, time series are represented by learned patterns from each tree and the similarity is aggregated over the trees.

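```
sim=matrix(0,nof_test,noftrain_labeled)
noftree=ensemble$forest$ntree

# accumulate the values returned by computeSimilarity over the trees
for(t in 1:noftree){
    representations=representTS(ensemble, traindata_labeled, testdata, which.tree=t, max_depth=tune$best.depth)
    sim=sim+computeSimilarity(representations$test, representations$train)
}

# 1-NN classification: which.min treats sim as a dissimilarity,
# so each test series gets the class of the training series with the smallest value
id=apply(sim,1,which.min)
predicted=class_train[id]
error_rate=1-sum(class_test==predicted)/nof_test
```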
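The classification rule at the end of this step is a nearest neighbor lookup on the aggregated matrix. Here is a minimal sketch with made-up per-tree values for three test series and three training series. The numbers and labels are hypothetical, and smaller values are taken to mean more similar, as implied by the use of which.min.

```r
class_train <- c(1, 1, 2)   # hypothetical training labels
class_test  <- c(1, 2, 2)   # hypothetical test labels

# Hypothetical per-tree values (rows: test series, columns: training series)
per_tree <- list(
  matrix(c(1, 4, 6,  5, 6, 2,  3, 2, 5), 3, 3, byrow = TRUE),
  matrix(c(2, 3, 7,  6, 4, 1,  4, 1, 6), 3, 3, byrow = TRUE)
)

# Aggregate over the trees
sim <- Reduce(`+`, per_tree)

# Nearest training series for each test series
id         <- apply(sim, 1, which.min)
predicted  <- class_train[id]
error_rate <- 1 - sum(class_test == predicted) / length(class_test)

print(sim)
print(predicted)
print(error_rate)
```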

A SAMPLE RUN RESULT

A screenshot of a sample run of LPS on the GunPoint dataset for 10 replications is provided below (Ubuntu 12.10 system with 8 GB RAM and a dual-core i7-3620M 2.7 GHz CPU):
