Thursday, March 10, 2011

Learn how to start SVM

Reference: A Practical Guide to Support Vector Classification by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin

Proposed procedures for beginners:
1. Convert data into the LibSVM format
2. Scale the data
3. Try RBF kernel first
4. Use cross-validation to find the best parameters C and gamma
5. Use the best parameters C and gamma to train on the whole training set.
6. Test the model on the testing set.
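
Step 4 amounts to a grid search over (C, gamma). Here is a minimal sketch of the candidate grid, assuming the exponentially growing sequences suggested in the guide (C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3). The helper name parameter_grid is made up for this illustration.

```python
from itertools import product

def parameter_grid(c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    """Return every (C, gamma) pair to evaluate by cross-validation.

    The exponent ranges are assumptions taken from the guide's suggestion;
    a finer grid around the best pair can be tried afterwards.
    """
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exps, g_exps)]

grid = parameter_grid()
print(len(grid), "candidate pairs; first:", grid[0], "last:", grid[-1])
```

Each pair would then be handed to svm-train through its -c and -g options, with -v n for n-fold cross-validation, and the pair with the best cross-validation score kept for step 5.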

1. The format of the training and testing data files is:

<label> <index1>:<value1> <index2>:<value2> ...
.
.

Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label. For regression, <label> is the target value, which can be any real number. For example, the first line of our training data is:

0.985749058346 1:24451.96 2:198.0345 3:0.00155077697416 4:2.40490922357e-06
<index>:<value> gives a feature (attribute) value. <index> is an integer starting from 1 and <value> is a real number. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, fill the first column with any numbers.
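
The format can be written and read with a few lines of Python. This is only a sketch of the layout described above, not the code of checkdata.py or any other LIBSVM tool. The function names format_instance and parse_instance are made up for the example.

```python
def format_instance(label, features):
    """Build one LIBSVM-format line from a label and a list of feature values.

    Indices start at 1 and are written in ascending order.
    """
    pairs = " ".join(f"{i}:{v!r}" for i, v in enumerate(features, start=1))
    return f"{label!r} {pairs}"

def parse_instance(line):
    """Split one LIBSVM-format line into (label, {index: value})."""
    fields = line.split()
    label = float(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features

line = "0.985749058346 1:24451.96 2:198.0345 3:0.00155077697416 4:2.40490922357e-06"
label, features = parse_instance(line)
print(label, features)
print(format_instance(label, [features[i] for i in sorted(features)]))
```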

2. Check the data format using the command:
checkdata.py libSVMformat.data
    No error was reported.

3. Split the data (3026 instances in total) into a training set (2726) and a testing set (300) by random selection. Command:
subset.py -s 1 libSVMformat.data 300 test.data train.data
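
For illustration, a random split of this kind can be sketched in Python as below. This is not the source of subset.py, only a minimal stand-in for the idea; the function name random_split and the placeholder instances are assumptions.

```python
import random

def random_split(lines, subset_size, seed=None):
    """Randomly pick subset_size lines; return (subset, rest), keeping file order."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(lines)), subset_size))
    subset = [line for i, line in enumerate(lines) if i in chosen]
    rest = [line for i, line in enumerate(lines) if i not in chosen]
    return subset, rest

# Placeholder instances standing in for the 3026 lines of libSVMformat.data.
instances = [f"instance {i}" for i in range(3026)]
test_set, train_set = random_split(instances, 300, seed=0)
print(len(test_set), len(train_set))
```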

4. Scale the data
> svm-scale -l 0 -u 1 -s range train.data > train.scale
> svm-scale -r range test.data > test.scale
Each feature of the training data is scaled to [0,1]. The scaling factors are stored in the file range and then reused to scale the test data.
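
The idea behind the two commands can be sketched as follows: the minimum and maximum of each feature are taken from the training data only, and the same factors are applied to the test data. This is a simplified illustration, not the svm-scale implementation. The function names, the toy rows, and the handling of constant features are assumptions of the sketch.

```python
def fit_ranges(rows):
    """Record (min, max) of each feature column of the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(rows, ranges, lower=0.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the stored ranges."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(lower)  # constant feature: a choice made for this sketch
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled

train = [[24451.96, 198.0345], [2792.54, 500.0]]   # toy training rows
test = [[18342.92, 163.1712]]                      # toy test row
ranges = fit_ranges(train)
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)
print(train_scaled)
print(test_scaled)
```

Because the factors come from the training set, scaled test values can fall outside [0,1], as the second feature of the toy test row does here.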

5. First try: train a model with a linear kernel
svm-train -s 3 -p 0.1 -t 0 train.scale
This solves SVM regression (epsilon-SVR) with the linear kernel u'v and epsilon=0.1 in the loss function.

6. Test the model
svm-predict test.scale train.scale.model test.predict
results:
Mean squared error = 3.29278 (regression)
Squared correlation coefficient = 0.0463435 (regression)

7. Second try: use the RBF kernel to train and test
>svm-train -s 3 -p 0.1 -t 2 train.scale
>svm-predict test.scale train.scale.model test.predict
results:
Mean squared error = 3.28346 (regression)
Squared correlation coefficient = 0.0498395 (regression)
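
To make the two reported figures concrete, here is a small sketch of how they can be computed from target and predicted values. It assumes the squared correlation coefficient is the squared Pearson correlation between predictions and targets. The function names and the toy numbers are invented for the example.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between predictions and targets."""
    n = len(targets)
    return sum((p - t) ** 2 for t, p in zip(targets, predictions)) / n

def squared_correlation(targets, predictions):
    """Squared Pearson correlation coefficient between predictions and targets."""
    n = len(targets)
    st, sp = sum(targets), sum(predictions)
    stt = sum(t * t for t in targets)
    spp = sum(p * p for p in predictions)
    spt = sum(t * p for t, p in zip(targets, predictions))
    return (n * spt - sp * st) ** 2 / ((n * spp - sp * sp) * (n * stt - st * st))

targets = [1.0, 2.0, 3.0, 4.0]        # toy values, not our data
predictions = [1.5, 2.5, 3.5, 4.5]    # constant offset of 0.5
mse = mean_squared_error(targets, predictions)
r2 = squared_correlation(targets, predictions)
print(mse, r2)
```

The toy case shows that the two measures answer different questions: a constant offset gives a nonzero mean squared error even though the predictions are perfectly correlated with the targets. A squared correlation close to zero, as in both tries above, means the predictions track the targets only weakly.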

Wednesday, March 9, 2011

construct project directory

-project directory
     -data
          -Celegans_Jupter_C12_120min_grad.ms2
          -Celegans_Jupter_C12_120min_grad.nort.out
          -libSVMformat.data
     -article
          -FT-ICR, SVM, LibSVM tutorial...
     -script
          -PSM2libSVM.py (written by Victor)

The file libSVMformat.data is the output of the following command:
./PSM2libSVM.py Celegans_Jupter_C12_120min_grad.nort.out Celegans_Jupter_C12_120min_grad.ms2 > libSVMformat.data

0.985749058346 1:24451.96 2:198.0345 3:0.00155077697416 4:2.40490922357e-06
0.108725510543 1:18342.92 2:163.1712 3:0.00166732956911 4:2.77998789204e-06
1.35301110367 1:2792.54 2:500.0 3:0.00143809945378 4:2.06813003897e-06
2.00113352112 1:6217.64 2:500.0 3:0.00151054777749 4:2.28175458809e-06
0.530969709975 1:8401.57 2:451.8174 3:0.00176742084871 4:3.12377645645e-06

Above are the first five lines of the output file. The five columns on each line are, respectively: ppm, total ion current, ion injection time, 1/(observed mass to charge), and 1/(observed mass to charge)^2.

So far, the features (columns 2-5) have been extracted for each PSM.

The next step is to design an SVR that predicts the relative mass deviation (ppm) from the features in columns 2-5.

 

Sunday, March 6, 2011

LIBSVM -- A Library for Support Vector Machines

SVMs (Support Vector Machines) are commonly used for data classification and are considered easier to use than neural networks. For classification, the data is separated into training and testing sets. Each instance in the training set contains one target value (i.e. the class label) and several attributes (i.e. the features or observed variables). The goal of an SVM is to produce a model from the training set that predicts the target values of the testing set given only the attributes of the test data.


1. Download and install gnuplot  http://sourceforge.net/projects/gnuplot/files/
   This is required by the parameter selection tool grid.py in LIBSVM.

2. Download LIBSVM
LIBSVM is simple, easy-to-use, and efficient software for SVM classification and regression. It solves C-SVM classification, nu-SVM classification, one-class SVM, epsilon-SVM regression, and nu-SVM regression. It also provides an automatic model selection tool for C-SVM classification.

3. Materials
Spectral data: A set of fragmentation spectra from a C. elegans set run on an FT-ICR, whose monoisotopic masses have been determined by Hardklör/Bullseye
PSM identifications: the fragmentation data has been searched and postprocessed using crux/percolator.

4. What to do
• Extract a set of fragmentation spectra which have PSMs with a q-value less than 1%.
• Determine the relative mass deviation (<observed mass> - <calculated mass>)/<calculated mass> for each of the PSMs in the set, and investigate the relationship to 1/<observed_mass_to_charge>.
• For each PSM, extract the features: (1) Total ion current of the MS/MS scan, (2) Ion injection time of the MS/MS scan, (3) 1/<observed_mass_to_charge>, (4) 1/<observed_mass_to_charge>^2, (5) the relative mass deviation
• Design an SVR that predicts (5) from (1)-(4)
• Use cross validation to determine the performance of the system
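
As an illustration of features (1)-(4), the sketch below builds the feature vector for one PSM. It is not the PSM2libSVM.py script. The function name psm_features and the input values are hypothetical.

```python
def psm_features(total_ion_current, injection_time, observed_mz):
    """Return features (1)-(4) for one PSM, in the order listed above."""
    inverse_mz = 1.0 / observed_mz
    return [total_ion_current, injection_time, inverse_mz, inverse_mz ** 2]

# Hypothetical scan values, chosen only to show the shape of the output.
features = psm_features(total_ion_current=24451.96,
                        injection_time=198.0345,
                        observed_mz=644.84)
print(features)
```

Feature (5), the relative mass deviation, is the regression target and goes in the label column rather than in the feature vector.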

Saturday, March 5, 2011

Space charge effects

The mass measurement accuracy of an FT-ICR MS system is reduced by frequency shifts due to space-charge effects.

Space charge is treated as a continuum of charge distributed over a region of space. When charge has been emitted from some region of a solid surface, the cloud of charge carriers can form a space charge if the carriers are sufficiently spread out. Because of Coulomb repulsion between ions, the cyclotron frequency shifts, and the measured mass spectrum is no longer accurate. The frequency shift can be minimized by using a small ion population, a low ion density, and a short, high-amplitude excitation waveform.

To obtain the highest mass measurement accuracy, internal or external calibration strategies are used to reduce space-charge effects. Our project is to compensate for these effects using support vector regression (SVR).

Friday, March 4, 2011

FT-ICR mass spectrometer(1)

FT-ICR mass spectrometers are known for their high mass accuracy, on the order of +/- 2 parts per million. FT-MS keeps the ions confined in a strong magnetic field, where they circle with frequencies that are inversely proportional to their mass-to-charge ratio and induce an alternating current on the metal detection plates. The frequency spectrum of the ions is converted into a mass spectrum by Fourier transformation. Since frequency can be measured very accurately, FT-MS has very high resolution.
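
The inverse relation between frequency and mass-to-charge ratio can be sketched with the textbook formula for the unperturbed cyclotron frequency, f = qB/(2*pi*m). The 7 T field and the m/z values below are assumptions for illustration only, not parameters of our instrument.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms

def cyclotron_frequency(mz, field_tesla):
    """Unperturbed cyclotron frequency in Hz for an ion of given m/z (Da per charge)."""
    return ELEMENTARY_CHARGE * field_tesla / (2 * math.pi * mz * DALTON)

f_1000 = cyclotron_frequency(1000.0, 7.0)   # assumed 7 T magnet
f_2000 = cyclotron_frequency(2000.0, 7.0)
print(f_1000, f_2000, f_1000 / f_2000)
```

Doubling m/z halves the frequency. Space-charge effects, which this formula ignores, shift the observed frequency away from this ideal value.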
[Figure: FT-ICR]
1. important figures of MS
1.1 resolution
      m/z value divided by the peak width at half height
[Figure: MS spectra]

1.2 mass deviations
One important performance figure is an instrument's mass accuracy, typically given as the mass deviation Δm/z
e.g. Δm/z = <observed m/z> - <exact m/z> = +/- 3 (3 Da/charge)

High resolution instruments often report relative mass errors: ppm (parts per million)
ppm = (observed - exact)/exact * 1000000

In general, a |ppm| value smaller than 5 is quite good.
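
A short worked example of the two measures, using made-up m/z values:

```python
def mass_deviation(observed, exact):
    """Absolute deviation, in the same unit as the inputs (Da/charge for m/z)."""
    return observed - exact

def ppm_error(observed, exact):
    """Relative deviation in parts per million."""
    return (observed - exact) / exact * 1_000_000

observed_mz, exact_mz = 1000.003, 1000.000   # hypothetical values
delta = mass_deviation(observed_mz, exact_mz)
ppm = ppm_error(observed_mz, exact_mz)
print(delta, ppm)
```

A deviation of 0.003 at m/z 1000 is about 3 ppm. The same absolute deviation at m/z 500 would be about 6 ppm, which is why the relative measure is preferred when comparing across the mass range.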

1.3 monoisotopic mass and average mass
For a given compound, the monoisotopic mass is the mass of the isotopic peak whose elemental composition consists of the most abundant isotope of each element. The monoisotopic mass can be calculated using the atomic masses of the isotopes. The average mass is the average of the isotopic masses weighted by the isotopic abundances. The average mass can be calculated using the atomic weights of the elements.
The exact mass/charge of an ion is calculated from the monoisotopic mass, not the average mass, of each element in its elemental composition. An electron mass (or several) must be added or taken away to get the exact ion mass, which is then divided by the charge if the ion is doubly, triply, etc. charged.
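
The difference can be sketched for a small molecule. Glycine (C2H5NO2) is chosen here only as an example, and the sketch assumes positive ions formed by adding one hydrogen atom and removing one electron per charge. The masses are rounded reference values, and the function names are invented for the illustration.

```python
# Mass of the most abundant isotope of each element (Da), rounded.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
# Abundance-weighted atomic weights (Da), rounded.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}
ELECTRON = 0.00054858  # electron mass (Da)

def compound_mass(composition, table):
    """Sum the per-element masses for a composition such as {"C": 2, "H": 5}."""
    return sum(table[element] * count for element, count in composition.items())

def exact_mz(composition, charge):
    """Exact m/z of a positive ion, assuming protonation as the charge carrier."""
    neutral = compound_mass(composition, MONOISOTOPIC)
    ion = neutral + charge * (MONOISOTOPIC["H"] - ELECTRON)
    return ion / charge

glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
mono = compound_mass(glycine, MONOISOTOPIC)
avg = compound_mass(glycine, AVERAGE)
mz = exact_mz(glycine, charge=1)
print(mono, avg, mz)
```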