ALGORITHMS and PARAMETERS

Scaling

Scaling the data before applying ML algorithms is very important. The main advantage of scaling is to prevent attributes in larger numeric ranges from dominating those in smaller numeric ranges, and to avoid numerical difficulties during the calculation.

We perform linear scaling, i.e. each attribute is scaled to the range [-1, 1].
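For illustration, a minimal sketch of such linear scaling in Python (scikit-learn's MinMaxScaler is used here as one possible implementation; the toy values are not from the tool):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Toy training data: two attributes with very different numeric ranges.
    X_train = np.array([[1.0, 10000.0],
                        [2.0, 20000.0],
                        [3.0, 50000.0]])

    # Scale every attribute linearly into [-1, 1] using the training data's min/max.
    scaler = MinMaxScaler(feature_range=(-1, 1))
    X_train_scaled = scaler.fit_transform(X_train)

    # The same scaling factors must be applied to any test data.
    X_test_scaled = scaler.transform(np.array([[2.5, 30000.0]]))
    print(X_train_scaled)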

Tune Param

Proper choice of C and gamma is critical to the SVM's performance. The user is advised to select Tune Param.

On selecting Tune Param, the tool performs cross-validation together with a grid search over C and gamma. You can think of speed as the step length between two consecutive candidate values of C and gamma.
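A minimal sketch of what Tune Param corresponds to, expressed with scikit-learn's GridSearchCV (the exact grid, step length and dataset used by the tool are not specified here; these values are only illustrative):

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Candidate values for C and gamma; a larger step length between consecutive
    # exponents means fewer candidates and a faster (coarser) search.
    param_grid = {
        "C": [2 ** k for k in range(-5, 14, 2)],      # 0.03125 ... 8192
        "gamma": [2 ** k for k in range(-13, 4, 2)],  # ~0.000122 ... 8
    }

    # 5-fold cross-validation over the whole grid; the best (C, gamma) pair wins.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)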

CLASSIFICATION

Parameters

values

Definition

Tips

SVM_Type

0 - C-SVC

C-SVC, Nu-SVC and One-class SVM perform binary and multi-class classification on a dataset. C-SVC and Nu-SVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations.

The One-class SVM algorithm learns a decision function for novelty detection: classifying new data as similar or different to the training set.

1 - Nu-SVC

2 - One-class SVM

Kernel Type

0 - Linear

linear: u'*v

Radial Basis Function is a general-purpose kernel, used when there is no prior knowledge about the data, because:

1. The linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, gamma).

2. The number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel.

There are some situations where the RBF kernel is not suitable. In particular, when the number of features is very large, one may just use the linear kernel.

1 - Polynomial

polynomial: (gamma*u'*v + coef0)^degree

2 - RBF

radial basis function: exp(-gamma*|u-v|^2)

This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case when the relation between class labels and attributes is nonlinear.

3 - Sigmoid

sigmoid: tanh(gamma*u'*v + coef0)
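The four kernel formulas above, written out directly in NumPy (u and v are feature vectors; gamma, coef0 and degree are the parameters defined below):

    import numpy as np

    def linear_kernel(u, v):
        return np.dot(u, v)                               # u'*v

    def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
        return (gamma * np.dot(u, v) + coef0) ** degree   # (gamma*u'*v + coef0)^degree

    def rbf_kernel(u, v, gamma=1.0):
        return np.exp(-gamma * np.sum((u - v) ** 2))      # exp(-gamma*|u-v|^2)

    def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
        return np.tanh(gamma * np.dot(u, v) + coef0)      # tanh(gamma*u'*v + coef0)

    u = np.array([1.0, 2.0])
    v = np.array([0.5, -1.0])
    print(linear_kernel(u, v), rbf_kernel(u, v, gamma=0.5), sigmoid_kernel(u, v))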

Gamma

[0.000122,8]

gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

Degree

Degree of the polynomial kernel function. Ignored by all other kernels.

Coef0

Independent term in kernel function. It is only significant in 'polynomial' and 'sigmoid'.

Cost (C)

[0.031250,8192]

The parameter C trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. As C increases, the tendency to misclassify training data decreases (which may lead to overfitting).

C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
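A small sketch of the effect of C, using scikit-learn's SVC (one LIBSVM wrapper) on a deliberately noisy toy dataset; the numbers are illustrative, not from the tool:

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # flip_y injects label noise into the toy data.
    X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)

    for C in (0.1, 1.0, 100.0):
        model = SVC(kernel="rbf", C=C, gamma="scale")
        score = cross_val_score(model, X, y, cv=5).mean()
        # Large C fits the noisy training labels harder and often generalizes worse.
        print(f"C={C:>6}: CV accuracy = {score:.3f}")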

NU

(0,1]

A hyperparameter for nu-SVC, one-class SVM and nu-SVR, similar in role to C. Nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time).

Example: if we want the training error to be at most 1%, set nu to 0.01; the number of support vectors will then be at least 1% of the total records.

Nu approximates both the fraction of training errors and the fraction of support vectors.
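A minimal sketch of nu in nu-SVC, using scikit-learn's NuSVC (which wraps the same LIBSVM formulation; nu = 0.05 is illustrative):

    from sklearn.svm import NuSVC
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=400, n_features=5, random_state=0)

    # nu bounds the fraction of margin errors from above and the fraction of
    # support vectors from below.
    model = NuSVC(nu=0.05, kernel="rbf", gamma="scale").fit(X, y)

    n_sv = model.support_vectors_.shape[0]
    print(f"support vectors: {n_sv} of {len(X)} ({n_sv / len(X):.1%})")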

Cachesize

For C-SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems.

If you have enough RAM available, it is recommended to set the cache size to a higher value than the default of 200 MB, such as 500 MB or 1000 MB.

Termination Criterion

Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.

Shrinking

The shrinking heuristic is there to save training time. It sometimes helps and sometimes does not; this is a matter of runtime rather than convergence. If the number of iterations is large, shrinking can shorten the training time.

Probability_Estimates

Whether to enable probability estimates.

nr_weight

nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].
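In the LIBSVM C API these correspond to the nr_weight, weight_label and weight fields of svm_parameter; with a Python wrapper such as scikit-learn the same idea is expressed as a class_weight mapping (a sketch, not necessarily this tool's exact interface):

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification

    # Imbalanced two-class toy problem: roughly 90% of samples in class 0.
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # The penalty for class 1 is scaled by 5, i.e. C_1 = 5 * C while C_0 = C.
    model = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 5.0}).fit(X, y)
    print(model.class_weight_)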

REGRESSION

Parameters

values

Definition

Tips

SVM_Type

3 - Epsilon-SVR

The Nu parameter in Nu-SVR can be used to control the number of support vectors in the resulting model. In ϵ-SVR, by contrast, you have no control over how many data vectors from the dataset become support vectors; it could be a few, it could be many. Nonetheless, you have total control over how much error you allow your model to have, and anything beyond the specified ϵ is penalized in proportion to C, the regularization parameter. (See the sketch after this list.)

4 - Nu-SVR
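A minimal sketch contrasting the two formulations with scikit-learn's SVR and NuSVR (both wrap LIBSVM; data and parameter values are illustrative):

    from sklearn.svm import SVR, NuSVR
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

    # Epsilon-SVR: you fix the error tube (epsilon); the number of support vectors falls out.
    eps_model = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(X, y)

    # Nu-SVR: you fix nu, which bounds the fraction of support vectors; epsilon falls out.
    nu_model = NuSVR(kernel="rbf", C=1.0, nu=0.3).fit(X, y)

    print("epsilon-SVR support vectors:", len(eps_model.support_))
    print("nu-SVR support vectors:     ", len(nu_model.support_))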

Kernel Type

0 - Linear

linear: u'*v

Radial Basis Function is a general-purpose kernel, used when there is no prior knowledge about the data, because:

1. The linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, gamma).

2. The number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel.

There are some situations where the RBF kernel is not suitable. In particular, when the number of features is very large, one may just use the linear kernel.

1 - Polynomial

polynomial: (gamma*u'*v + coef0)^degree

2 - RBF

radial basis function: exp(-gamma*|u-v|^2)

This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case when the relation between class labels and attributes is nonlinear.

3 - Sigmoid

sigmoid: tanh(gamma*u'*v + coef0)

Gamma

[0.000122,8]

gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

Degree

Degree of the polynomial kernel function. Ignored by all other kernels.

Coef0

Independent term in kernel function. It is only significant in 'polynomial' and 'sigmoid'.

Cost (C)

[0.031250,8192]

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.

NU

(0,1]

A hyperparameter for nu-SVC, one-class SVM and nu-SVR, similar in role to C. Nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time).

Example: if we want the training error to be at most 1%, set nu to 0.01; the number of support vectors will then be at least 1% of the total records.

Nu approximates both the fraction of training errors and the fraction of support vectors.

Epsilon_SVR (P)

Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
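A small worked example of the epsilon-insensitive loss that defines this tube (plain NumPy; the numbers are illustrative):

    import numpy as np

    def epsilon_insensitive_loss(y_true, y_pred, epsilon):
        # Zero penalty inside the tube |y - f(x)| <= epsilon, linear outside it.
        return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.05, 2.4, 2.0])
    print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))
    # -> [0.  0.3 0.9]: the first prediction lies inside the tube and costs nothing.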

Cachesize

For C-SVC, Epsilon-SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems.

If you have enough RAM available, it is recommended to set the cache size to a higher value than the default of 200 MB, such as 500 MB or 1000 MB.

Termination Criterion

Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.

Shrinking

The shrinking heuristic is there to save training time. It sometimes helps and sometimes does not; this is a matter of runtime rather than convergence. If the number of iterations is large, shrinking can shorten the training time.

Probability_Estimates

Whether to enable probability estimates.

LINEAR REGRESSION

Parameters

values

Definition

Solver

11 - L2-regularized L2-loss SVR primal

We have 3 linear regression solvers, obtained by combining loss functions and solution methods. All three are L2-regularized; the loss can be the L2-loss (squared epsilon-insensitive) or the L1-loss (epsilon-insensitive) for SVR, solved in either the primal or the dual. The default value for type is 11 (see the sketch after this list).

12 - L2-regularized L2-loss SVR dual

13 - L2-regularized L1-loss SVR dual
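These solver numbers follow LIBLINEAR's -s option. Expressed with scikit-learn's LinearSVR (which is backed by the same solvers), the choice is made through the loss and dual arguments; this mapping is approximate and the data is illustrative:

    from sklearn.svm import LinearSVR
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

    # Roughly: type 11 = squared epsilon-insensitive (L2) loss, primal;
    #          type 12 = the same loss, dual;
    #          type 13 = epsilon-insensitive (L1) loss, dual.
    primal_l2 = LinearSVR(loss="squared_epsilon_insensitive", dual=False, C=1.0, epsilon=0.1)
    dual_l1 = LinearSVR(loss="epsilon_insensitive", dual=True, C=1.0, epsilon=0.1)

    print(primal_l2.fit(X, y).score(X, y))
    print(dual_l1.fit(X, y).score(X, y))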

Cost (C)

The parameter C trades off errors on training examples against simplicity of the decision function. A low C keeps the function smooth, while a high C aims at fitting the training examples closely. As C increases, the training error decreases (which may lead to overfitting).

Epsilon_SVR (P)

Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.

Termination Criterion

Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.

Folds

V-fold cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially, one subset is tested using the model trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data which are correctly predicted.
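A minimal sketch of v-fold cross-validation (here v = 5) with scikit-learn's cross_val_score; the estimator and dataset are only illustrative:

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVR
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

    # The training set is split into 5 folds; each fold is predicted once by a
    # model trained on the other 4, and the per-fold scores are averaged.
    scores = cross_val_score(LinearSVR(C=1.0, epsilon=0.1), X, y, cv=5)
    print(scores, scores.mean())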

LINEAR CLASSIFICATION

Parameters

values

Definition

Solver

0 - L2-regularized logistic regression primal

1 - L2-regularized L2-loss SVC dual

2 - L2-regularized L2-loss SVC primal

3 - L2-regularized L1-loss SVC dual

We have 8 linear classification solvers, obtained by combining several types of loss functions and regularization schemes. The regularization can be L1 or L2, and the losses can be the L2-loss (squared hinge) or L1-loss (hinge) for SVM, or the logistic loss for logistic regression. The default value for type is 0 (see the sketch after this list).

4 - Support Vector Classification by Crammer and Singer

5 - L1-regularized L2-loss SVC

6 - L1-regularized logistic regression

7 - L2-regularized logistic regression dual
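These solver numbers follow LIBLINEAR's -s option. A rough mapping onto scikit-learn's LIBLINEAR-backed estimators (not the tool's exact interface; the dataset is illustrative):

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Type 0: L2-regularized logistic regression (primal).
    logreg_l2 = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")

    # Types 1/2: L2-regularized L2-loss (squared hinge) SVC, dual or primal.
    svc_l2loss = LinearSVC(penalty="l2", loss="squared_hinge", dual=True, C=1.0)

    # Type 3: L2-regularized L1-loss (hinge) SVC, dual.
    svc_l1loss = LinearSVC(penalty="l2", loss="hinge", dual=True, C=1.0)

    # Type 6: L1-regularized logistic regression.
    logreg_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

    for name, model in [("0", logreg_l2), ("1", svc_l2loss), ("3", svc_l1loss), ("6", logreg_l1)]:
        print(name, model.fit(X, y).score(X, y))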

Cost (C)

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

Bias

Consider: w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + … + w_bias * x_bias = 0. Here the x are the feature values and the w are the trained “weights”. The additional feature x_bias is a constant whose value is equal to the bias.
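A small sketch of how the bias enters as an extra constant feature (plain NumPy; the feature values and weights are illustrative):

    import numpy as np

    bias = 1.0
    X = np.array([[0.5, 2.0],
                  [1.5, -1.0]])

    # Append the constant column x_bias = bias to every row, so the decision
    # function w_1*x_1 + w_2*x_2 + w_bias*x_bias becomes a single dot product.
    X_with_bias = np.hstack([X, np.full((X.shape[0], 1), bias)])

    w = np.array([0.3, -0.2, 0.1])     # trained weights, including w_bias
    print(X_with_bias @ w)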

Termination Criterion

Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.

Folds

V-fold cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data which are correctly classified.

nr_weight

nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].

Weight (wi)

Set the parameter C of class i to weight*C.

Weight_Label

These weights are used to change the penalty for specific labels (classes). If the weight for a label is not changed, it is set to 1.0.

K-MEANS

Parameters

values

Definition

Kernel Type

0. LINEAR

linear: u'*v

1. POLYNOMIAL

polynomial: (gamma*u'*v + coef0)^degree

2. RBF

radial basis function: exp(-gamma*|u-v|^2)

This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case when the relation between class labels and attributes is nonlinear.

3. SIGMOID

sigmoid: tanh(gamma*u'*v + coef0)

Gamma

gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

Coef0

Independent term in kernel function. It is only significant in 'polynomial' and 'sigmoid'.

Degree

Degree of the polynomial kernel function. Ignored by all other kernels.

Dimension (Number of Attributes)

Number of input attributes / columns in the training data set

Number of Centers

Number of clusters

Stopping Criteria

Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.

Number of Rows

Total number of records / rows in the training data
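A minimal sketch of k-means with the parameters above, using scikit-learn's KMeans (which implements the plain, linear case only, so the kernel choice is not shown; the data is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data: Number of Rows = 300, Dimension (Number of Attributes) = 2.
    X, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

    # Number of Centers = 3; tol is the stopping criterion on center movement.
    km = KMeans(n_clusters=3, tol=1e-4, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)
    print(km.labels_[:10])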