Computational Neurolinguistics ToolBox. Created by Hiroyuki AKAMA (Tokyo Institute of Technology)
I. Preparation:
MATLAB R2012a or later is required. This toolbox allows you to replicate the algorithm that Mitchell et al. introduced in their Science paper (2008), which predicts the neural activation of words that are "unknown" in the sense of lacking an fMRI record, but that become tractable through corpus evidence on co-occurrence or associative strength.
In addition, you need to prepare some regularization tools.
'regtools' by Per Christian Hansen must be downloaded from the MathWorks File Exchange and installed as a directory named 'regu' in this package.
http://jp.mathworks.com/matlabcentral/fileexchange/52-regtools
The manual and more details can be found at
http://www2.compute.dtu.dk/~pcha/Regutools/
A better alternative is the 'rbf' toolbox.
However, it has become obsolete and is no longer available as such, so you may want to reconstruct it from the information below.
cf. http://read.pudn.com/downloads115/sourcecode/math/483393/rbf/Utilities/ReadMe__.htm
cf. http://www.codeforge.com/read/227797/globalRidge.m__html
cf. http://www.inf.ed.ac.uk/teaching/courses/rl/manual.ps
You must place the following MATLAB m-files in the 'rbf' directory.
colSum.m can now be found at
http://read.pudn.com/downloads26/sourcecode/others/83803/rbf/Utilities/colSum.m__.htm
http://www.codeforge.com/read/227797/colSum.m__html
diagProduct.m can now be found at
http://read.pudn.com/downloads26/sourcecode/others/83803/rbf/Utilities/diagProduct.m__.htm
http://www.codeforge.com/read/227797/diagProduct.m__html
globalRidge.m can now be found at
http://read.pudn.com/downloads46/sourcecode/math/154448/352414rbf1/rbf/RidgeRegression/globalRidge.m__.htm
http://www.codeforge.com/read/227797/globalRidge.m__html
getNextArg.m can now be found at
http://read.pudn.com/downloads115/sourcecode/math/483393/rbf/Utilities/getNextArg.m__.htm
http://www.codeforge.com/read/227797/getNextArg.m__html
traceProduct.m can now be found at
http://www.codeforge.com/read/227797/traceProduct.m__html
II. Usage
To run this toolbox on the datasets of the Science paper, one must first prepare
'runByVoxByNouns-P1.mat' (assigned to the first argument of the function 'compNeuroLing()') by applying 'getParticipantData_Science.m' to 'data-science-P1.mat', for example, which is downloadable from
http://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html
Note that 'getParticipantData_Science.m' does not define a function, so one must copy and paste its contents into the MATLAB command window to create an fMRI dataset for our toolbox.
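For orientation only, a minimal sketch of what that preprocessing step produces might look as follows. The variable names 'data', 'info', and 'meta' follow the CMU dataset; the field names ('word', 'epoch') and the 6-presentation layout are assumptions to verify against your copy and against 'getParticipantData_Science.m' itself:

```matlab
% Hypothetical sketch: build a runs x voxels x nouns array from the
% CMU Science-2008 dataset (variable/field names are assumptions).
load('data-science-P1.mat');             % provides data, info, meta
words = unique({info.word});             % the 60 stimulus nouns
nVox  = numel(data{1});
runByVoxByNouns = zeros(6, nVox, numel(words));
for t = 1:numel(data)
    w = find(strcmp(words, info(t).word));
    r = info(t).epoch;                   % presentation (run) index, 1..6
    runByVoxByNouns(r, :, w) = data{t};
end
save('runByVoxByNouns-P1.mat', 'runByVoxByNouns');
```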
Example of use:
[accuracy, cvresultlist, twoLeftOutNumlist] = compNeuroLing('runByVoxByNouns-TokyoTech-P?.mat', '*.csv', 500, 1770, 2, 3, 2);
Arguments:
1st argument: a MATLAB mat file for a 3-dimensional array with x: runs (repeated presentations), y: voxels, and z: words (fMRI nouns),
2nd argument: a csv file for a matrix representing an association strength (co-occurrence probability, etc.) between the words (fMRI nouns) and the semantic features (basic verbs for Mitchell et al),
3rd argument: number of selected features (voxels),
4th argument: the number of repetitions for the leave-two-out cross-validation. If you set the 7th argument to 2, making exhaustive pairings of the two left-out nouns without duplication, enter number_of_fMRI_nouns C 2, i.e. nchoosek(number_of_fMRI_nouns, 2), here.
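For the 60 nouns of the Science study, the 4th argument would then be:

```matlab
% All unordered pairs of two left-out nouns, without duplication:
repetitionTime = nchoosek(60, 2)   % = 1770
```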
5th argument: method of feature (voxel) selection,
1: F-statistic of ANOVA (conventional; Statistics Toolbox is not required; equivalent to the function mldivide, refer to
http://www.mathworks.co.jp/jp/help/matlab/ref/mldivide.html?lang=en),
2: stability score (as in Mitchell et al.'s Science study; Statistics Toolbox is not required); slightly better than the former.
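As a rough sketch (not the toolbox's own code), the stability score of Mitchell et al. can be computed by correlating each voxel's noun-response profile between every pair of presentations and averaging the correlations; variable names here are illustrative:

```matlab
% 'runByVoxByNouns' is runs x voxels x nouns, as prepared above.
[nRuns, nVox, nNouns] = size(runByVoxByNouns);
stability = zeros(nVox, 1);
pairs = nchoosek(1:nRuns, 2);            % all pairs of presentations
for v = 1:nVox
    prof = squeeze(runByVoxByNouns(:, v, :));   % nRuns x nNouns
    c = 0;
    for p = 1:size(pairs, 1)
        cc = corrcoef(prof(pairs(p,1), :), prof(pairs(p,2), :));
        c = c + cc(1, 2);
    end
    stability(v) = c / size(pairs, 1);   % mean inter-presentation correlation
end
[~, order] = sort(stability, 'descend');
selectedVoxels = order(1:500);           % keep the 500 most stable voxels
```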
6th argument: method of regression,
0: ordinary least squares using the function regress() (OLS, classical; Statistics Toolbox is required),
1: ordinary least squares using the function mldivide() (OLS, classical; Statistics Toolbox is not required); sometimes falls into underdetermination (extremely unbalanced weights),
2: ridge regression (using the function ridge() of the Statistics Toolbox and the function gcv() of the Regularization Tools (http://www.mathworks.co.jp/matlabcentral/fileexchange/52-regtools) for optimizing the lambda value based on generalized cross-validation).
I will abolish this option in the next version, since I have found that the lambda optimization is not as accurate as expected.
3: ridge regression (using the function globalRidge() from 'Matlab Routines for Subset Selection and Ridge Regression in Linear Neural Networks' (RBF, an acronym of radial basis function) by Mark J. L. Orr, Centre for Cognitive Science, Edinburgh University, UK,
http://www.inf.ed.ac.uk/teaching/courses/rl/manual.ps; Statistics Toolbox is not required).
The merit of this package is that we can choose the model selection criterion: unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV), or Bayesian information criterion (BIC). We selected GCV as the default setting.
This option is also slightly faster than the Regularization Tools (option 2), and the optimization of lambda by the rbf package seems much more accurate. The 'rbf' package must be put in the base directory of our toolbox as './rbf'.
With this option, there is no need for the function ridge() of the Statistics Toolbox, since the betas are computed directly as
inv(X' * X + lambda*eye(m)) * X' * y;
after getting the best lambda value through
lambda = globalRidge(H, y, 0.1, 'GCV');
This coding has been implemented in the toolbox as of 2013/08/01.
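A minimal self-contained illustration of the two lines above, with a fixed lambda standing in for the value returned by globalRidge() (all values here are illustrative):

```matlab
X = randn(60, 25);  y = randn(60, 1);   % toy design matrix and targets
lambda = 0.1;                           % illustrative ridge penalty
m = size(X, 2);
% Same solution as inv(X'*X + lambda*eye(m))*X'*y, but using the
% backslash operator, which is numerically safer than inv():
beta = (X' * X + lambda * eye(m)) \ (X' * y);
```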
7th argument: sampling method,
1: the pairs of two left-out nouns are drawn by random sampling repeated 'repetitionTime' times,
2: an exhaustive list of all pairs of two left-out nouns, taken from 1:number_of_fMRI_nouns, is used,
that is, the value returned by nchoosek(1:number_of_fMRI_nouns, 2); this requires 'repetitionTime' to be set to
nchoosek(number_of_fMRI_nouns, 2) (1770 in the case of the Science study).
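For example, with five nouns the exhaustive list of option 2 is:

```matlab
pairs = nchoosek(1:5, 2)   % 10 x 2 matrix; row k gives the two left-out nouns
```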
% The reason why we need regularization: a basic tutorial (tentative).
H1 = rand(60,5);
H2 = rand(60,5);
H3 = rand(60,5);
H4 = rand(60,5);
H = [H1 H2 H3 H4 0.1*(H1 + 2*H2 + 7*H3)];
% Some columns are linear weighted sums of other ones: linearly dependent!
y = randn(60,1);
w_bad1 = inv(H' * H) * H' * y
% w_bad1 is bad because the coefficients are extremely unbalanced (their variance is too big).
w_bad2 = H \ y
% w_bad2 shows rank deficiency.
% This is why the function mldivide (\) didn't work: sometimes some betas become all 0.
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 3.502933e-019.
w_bad1 =
    0.2068    0.8766   -1.0696    0.9188    0.5150
   -0.1810    1.2237   -0.9894    0.8878   -0.0038
   -0.3661    2.0162   -3.3460    2.2191    0.5785
    0.4607   -0.3295    0.2543    0.1093    0.6535
   -0.2391   -2.6885    5.0039   -5.4653   -3.2007
Warning: Rank deficient, rank = 20, tol = 6.7192e-014.
w_bad2 =
    0.6958    0.4356   -0.4872    0.4288    0.2116
   -0.2399    0.7201   -0.1327    0.1336   -0.6998
   -0.0451   -0.3263   -0.2814   -0.3987   -1.5792
    0.4607   -0.3295    0.2543    0.1093    0.6535
         0         0         0         0         0
Now I have found a good package for computing a ridge regression with an optimized lambda value as a penalty for avoiding overfitting.
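Continuing the tutorial above, even a small fixed ridge penalty (the value 0.1 is illustrative; in the toolbox it is optimized) removes both warnings and tames the coefficient variance:

```matlab
lambda = 0.1;
w_good = (H' * H + lambda * eye(size(H, 2))) \ (H' * y);
% w_good: no singularity or rank warning, and no wildly unbalanced weights.
```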
III. Results

Accuracy by participant:

      Titech_old (OLS)   Titech_ridge_gcv   Science   Levy and Bullinaria
P1    0.82               0.8068             0.83      0.83
P2    0.78               0.7232             0.76      0.85
P3    0.78               0.7559             0.78      0.76
P4    0.78               0.7921             0.72      0.78
P5    0.75               0.8124             0.78      0.82
P6    0.64               0.7017             0.85      0.73
P7    0.71               0.765              0.73      0.78
P8    0.6                0.6593             0.68      0.72
P9    0.65               0.7226             0.82      0.68

If you discard the subject numbering, a comparison would be facile: sort the values in each report without taking into account the identity of subjects. We see that the regularization guarantees the robustness of modeling.
Using the Computational Neurolinguistics ToolBox with the options of
- stability score as the feature selection,
- ridge regression (implemented by myself) with global ridge optimization based on GCV, using the rbf package by Mark J. L. Orr,
- 1770 combinations of two left-out nouns,
for P9, that is, running
[accuracy, cvresultlist, twoLeftOutNumlist] = compNeuroLing('runByVoxByNouns-P9.mat', 'nounsByverbs.csv', 500, 1770, 2, 3, 2);
we could get an accuracy of 72.26%. A considerable improvement has been achieved, since P9 had brought only 65% under the conditions of using the stability score and OLS (ordinary least squares, presumably with some rank deficit).