Algorithms and their HyperParameters

ALGORITHMS and PARAMETERS
Scaling	Scaling the data before applying ML algorithms is very important. Its main advantages are that attributes in larger numeric ranges do not dominate those in smaller numeric ranges, and that numerical difficulties during the calculation are avoided. We perform linear scaling of each attribute to the range [-1, 1].
Tune Param	A proper choice of C and gamma is critical to the SVM's performance, so the user is advised to select Tune Param. Selecting Tune Param runs a grid search over C and gamma, scoring each pair by cross-validation. Speed can be thought of as the step length between two consecutive values of C and gamma.
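The scaling and grid-search rows above can be illustrated with a minimal sketch. It assumes that the candidate values of C and gamma are powers of two and that the step length applies to the exponent; the documented ranges [0.031250, 8192] and [0.000122, 8] are consistent with 2^-5..2^13 and 2^-13..2^3, but the tool's actual grid may differ. The function names are hypothetical and not part of the tool.

```python
def fit_ranges(rows):
    """Return the (min, max) of every column in the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each column from its training range onto [lower, upper]."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Assumption: a constant column is mapped to the midpoint.
                out.append((lower + upper) / 2.0)
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled


def power_of_two_grid(first_exp, last_exp, step):
    """Candidate values 2**first_exp, 2**(first_exp + step), ... up to 2**last_exp."""
    values = []
    exp = first_exp
    while exp <= last_exp:
        values.append(2.0 ** exp)
        exp += step
    return values


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
ranges = fit_ranges(train)            # learned on the training data only
scaled = scale_rows(train, ranges)    # the same ranges are reused for test data

c_grid = power_of_two_grid(-5, 13, 2)      # assumed step length of 2
gamma_grid = power_of_two_grid(-13, 3, 2)
# Every (C, gamma) pair would be scored by cross-validation.
candidates = [(c, g) for c in c_grid for g in gamma_grid]
print(len(candidates), "candidate (C, gamma) pairs")
```

A larger step length gives a coarser but faster search; a common practice is to run a coarse grid first and then a finer grid around the best pair.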
CLASSIFICATION
Parameters	values	Definition	Tips
SVM_Type	0 - C-SVC	C-SVC and Nu-SVC perform binary and multi-class classification on a dataset. They are similar methods, but accept slightly different sets of parameters and have different mathematical formulations. One-class SVM learns a decision function for novelty detection: classifying new data as similar or different to the training set.
	1 - Nu-SVC
	2 - One-class SVM
Kernel Type	0 - Linear	linear: u'*v	The Radial Basis Function (RBF) is a general-purpose kernel and a reasonable first choice when there is no prior knowledge about the data, for two reasons. 1. The linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, gamma). 2. The number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel. There are some situations where the RBF kernel is not suitable: in particular, when the number of features is very large, one may just use the linear kernel.
	1 - Polynomial	polynomial: (gamma*u'*v + coef0)^degree
	2 - RBF	radial basis function: exp(-gamma*|u-v|^2). This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case where the relation between class labels and attributes is nonlinear.
	3 - Sigmoid	sigmoid: tanh(gamma*u'*v + coef0)
Gamma	[0.000122,8]	gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
Degree		Degree of the polynomial kernel function. Ignored by all other kernels.
Coef0		Independent term in the kernel function. It is only significant for the 'polynomial' and 'sigmoid' kernels.
Cost (C)	[0.031250,8192]	The parameter C trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. As C increases, misclassification on the training data decreases, which may lead to overfitting.	C is 1 by default, which is a reasonable choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
NU	(0,1]	A hyperparameter for Nu-SVC, One-class SVM and Nu-SVR that plays a role similar to C. nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time). Example: with nu = 0.01, at most 1% of the training examples are errors and at least 1% of the total records become support vectors.	Nu approximates the fraction of training errors and support vectors.
Cachesize		For C-SVC, SVR, Nu-SVC and Nu-SVR, the size of the kernel cache has a strong impact on run times for larger problems.	If you have enough RAM available, it is recommended to set the cache size to a higher value than the default of 200 MB, such as 500 MB or 1000 MB.
Termination Criterion		Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.
Shrinking		The shrinking heuristics are there to save training time. They sometimes help and sometimes do not; it is a matter of run time rather than convergence.	We found that if the number of iterations is large, then shrinking can shorten the training time.
Probability_Estimates		Whether to enable probability estimates.
nr_weight		nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].
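The four kernel formulas in the Kernel Type rows above can be written out directly. This is a minimal pure-Python sketch for illustration only; the function names are hypothetical, and a real SVM library evaluates kernels internally in optimized code.

```python
import math


def dot(u, v):
    """Inner product u'*v of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # linear: u'*v
    return dot(u, v)


def polynomial_kernel(u, v, gamma, coef0, degree):
    # polynomial: (gamma*u'*v + coef0)^degree
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma):
    # radial basis function: exp(-gamma*|u-v|^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma, coef0):
    # sigmoid: tanh(gamma*u'*v + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=2))
# A larger gamma makes the RBF value fall off faster with distance,
# so each training example influences a smaller neighbourhood.
print(rbf_kernel(u, v, gamma=0.01), rbf_kernel(u, v, gamma=1.0))
```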
REGRESSION
Parameters	values	Definition	Tips
SVM_Type	3 - Epsilon-SVR	The nu parameter in Nu-SVR can be used to control the number of support vectors in the resulting model. In Epsilon-SVR, by contrast, you have no control over how many data vectors from the dataset become support vectors: it could be a few, it could be many. You do, however, control how much error the model is allowed: deviations within epsilon are not penalized, and anything beyond the specified epsilon is penalized in proportion to C, which is the regularization parameter.
	4 - Nu-SVR
Kernel Type	0 - Linear	linear: u'*v	The Radial Basis Function (RBF) is a general-purpose kernel and a reasonable first choice when there is no prior knowledge about the data, for two reasons. 1. The linear kernel is a special case of RBF, since the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, gamma). 2. The number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel. There are some situations where the RBF kernel is not suitable: in particular, when the number of features is very large, one may just use the linear kernel.
	1 - Polynomial	polynomial: (gamma*u'*v + coef0)^degree
	2 - RBF	radial basis function: exp(-gamma*|u-v|^2). This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case where the relation between target values and attributes is nonlinear.
	3 - Sigmoid	sigmoid: tanh(gamma*u'*v + coef0)
Gamma	[0.000122,8]	gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
Degree		Degree of the polynomial kernel function. Ignored by all other kernels.
Coef0		Independent term in the kernel function. It is only significant for the 'polynomial' and 'sigmoid' kernels.
Cost (C)	[0.031250,8192]	Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared L2 penalty.	C is 1 by default, which is a reasonable choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
NU	(0,1]	A hyperparameter for Nu-SVC, One-class SVM and Nu-SVR that plays a role similar to C. nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors (the number of support vectors determines the run time). Example: with nu = 0.01, at most 1% of the training examples are errors and at least 1% of the total records become support vectors.	Nu approximates the fraction of training errors and support vectors.
Epsilon_SVR (P)		Epsilon in the Epsilon-SVR model. It specifies the epsilon-tube: points predicted within a distance epsilon of the actual value incur no penalty in the training loss function.
Cachesize		For C-SVC, Epsilon-SVR, Nu-SVC and Nu-SVR, the size of the kernel cache has a strong impact on run times for larger problems.	If you have enough RAM available, it is recommended to set the cache size to a higher value than the default of 200 MB, such as 500 MB or 1000 MB.
Termination Criterion		Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.
Shrinking		The shrinking heuristics are there to save training time. They sometimes help and sometimes do not; it is a matter of run time rather than convergence.	We found that if the number of iterations is large, then shrinking can shorten the training time.
Probability_Estimates		Whether to enable probability estimates.
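The epsilon-tube described in the Epsilon_SVR (P) row can be made concrete with a small sketch of the epsilon-insensitive loss. It covers only the data-fit part of the training objective (the regularization term on the model weights is left out), and the function names are hypothetical.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon-tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


def data_fit_penalty(actuals, predictions, epsilon, cost):
    """Sum of the per-point losses, scaled by the cost parameter C."""
    total = sum(
        epsilon_insensitive_loss(a, p, epsilon)
        for a, p in zip(actuals, predictions)
    )
    return cost * total


actuals = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 3.0]
# Only the second point lies outside a tube of half-width 0.1,
# so it is the only one that contributes to the penalty.
print(data_fit_penalty(actuals, predictions, epsilon=0.1, cost=1.0))
```

A wider tube (larger epsilon) lets more points go unpenalized and typically leaves fewer support vectors, while a larger C makes each excursion outside the tube more expensive.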
LINEAR REGRESSION
Parameters	values	Definition
Solver	11 - L2-regularized L2-loss SVR primal	There are 3 linear regression solvers, formed by combining loss functions with a primal or dual formulation. The regularization is L2 in all three; the loss is either the L2-loss (squared epsilon-insensitive loss) or the L1-loss (epsilon-insensitive loss). The default value for type is 11.
	12 - L2-regularized L2-loss SVR dual
	13 - L2-regularized L1-loss SVR dual
Cost (C)		The parameter C trades off the fit to the training data against the simplicity of the model. A low C gives a smoother, more strongly regularized model, while a high C penalizes training errors outside the epsilon-tube more heavily. As C increases, the error on the training data decreases, which may lead to overfitting.
Epsilon_SVR (P)		Epsilon in the Epsilon-SVR model. It specifies the epsilon-tube: points predicted within a distance epsilon of the actual value incur no penalty in the training loss function.
Termination Criterion		Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.
Folds		V-fold for cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Each subset in turn is tested using the model trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once. For classification, the cross-validation accuracy is the percentage of data that are correctly classified; for regression, the same procedure yields an error measure, such as the mean squared error, instead.
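The v-fold procedure in the Folds row can be sketched as follows. This is an illustration only: it splits row indices in order without shuffling (tools usually shuffle first), and when the number of rows is not divisible by v the first folds receive one extra row. The function name is hypothetical.

```python
def v_fold_splits(n_rows, v):
    """Return v (train_indices, test_indices) pairs covering every row once."""
    if not 1 < v <= n_rows:
        raise ValueError("v must be between 2 and the number of rows")
    indices = list(range(n_rows))
    base, extra = divmod(n_rows, v)
    splits = []
    start = 0
    for fold in range(v):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


for train, test in v_fold_splits(10, 3):
    # A model would be trained on `train` and evaluated on `test` here.
    print("train:", train, "test:", test)
```

Because every row appears in exactly one test subset, collecting the predictions from all v rounds gives one prediction per training instance, from which the cross-validation score is computed.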
LINEAR CLASSIFICATION
Parameters	values	Definition
Solver	0 - L2-regularized logistic regression primal	There are 8 linear classification solvers, formed by combining several types of loss functions and regularization schemes. The regularization can be L1 or L2, and the loss can be the L2-loss for SVM (squared hinge loss), the L1-loss for SVM (hinge loss), or the logistic loss for logistic regression. The default value for type is 0.
	1 - L2-regularized L2-loss SVC dual
	2 - L2-regularized L2-loss SVC primal
	3 - L2-regularized L1-loss SVC dual
	4 - Support Vector Classification by Crammer and Singer
	5 - L1-regularized L2-loss SVC
	6 - L1-regularized logistic regression
	7 - L2-regularized logistic regression dual
Cost (C)		Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. For the L2-regularized solvers the penalty is a squared L2 penalty.
Bias		Consider w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + … + w_bias * x_bias = 0, where the x are the feature values and the w are the trained "weights". The additional feature x_bias is a constant whose value is equal to the bias.
Termination Criterion		Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.
Folds		V-fold for Cross Validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially one subset is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once so the cross-validation accuracy is the percentage of data which are correctly classified.
nr_weight		nr_weight is the number of elements in the arrays weight_label and weight. Each weight[i] corresponds to weight_label[i], meaning that the penalty of class weight_label[i] is scaled by a factor of weight[i].
Weight (wi)		Sets the parameter C of class i to weight*C, for C-SVC.
Weight_Label		These weights are used to change the penalty for specific labels (classes). If the weight for a label is not changed, it is set to 1.0.
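The Bias and class-weight rows above can be illustrated with a short sketch. It shows how a constant bias feature extends each instance and how weight_label and weight translate into a per-class cost. The helper names are hypothetical and only mirror the definitions in the table, not the tool's internal code.

```python
def per_class_cost(cost, classes, weight_label, weight):
    """Effective C for each class: weight[i] * C for weight_label[i], else C."""
    if len(weight_label) != len(weight):
        raise ValueError("weight_label and weight must have nr_weight elements each")
    factors = dict(zip(weight_label, weight))
    # Labels without an explicit weight keep the default factor of 1.0.
    return {label: cost * factors.get(label, 1.0) for label in classes}


def append_bias(features, bias):
    """Add the constant feature x_bias, whose value equals the bias."""
    return list(features) + [bias]


def decision_value(weights, features, bias):
    """w_1*x_1 + ... + w_n*x_n + w_bias*x_bias for one instance."""
    extended = append_bias(features, bias)
    return sum(w * x for w, x in zip(weights, extended))


# Penalize mistakes on class -1 five times more heavily than on class +1.
print(per_class_cost(2.0, classes=[1, -1], weight_label=[-1], weight=[5.0]))
# The last weight is w_bias; the sign of the result gives the predicted side.
print(decision_value([1.0, -1.0, 0.5], [2.0, 1.0], bias=1.0))
```

Raising the weight of a rare class is a common way to compensate for unbalanced training data.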
K-MEANS
Parameters	values	Definition
Kernel Type	0. LINEAR	linear: u'*v
	1. POLYNOMIAL	polynomial: (gamma*u'*v + coef0)^degree
	2. RBF	radial basis function: exp(-gamma*|u-v|^2). This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case where the structure of the data is nonlinear in the attributes.
	3. SIGMOID	sigmoid: tanh(gamma*u'*v + coef0)
Gamma		gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
Coef0		Independent term in the kernel function. It is only significant for the 'polynomial' and 'sigmoid' kernels.
Degree		Degree of the polynomial kernel function. Ignored by all other kernels.
Dimension (Number of Attributes)		Number of input attributes / columns in the training data set
Number of Centers		Number of clusters
Stopping Criteria		Tolerance for stopping criterion. The stopping tolerance affects the number of iterations used when optimizing the model.
Number of Rows		Total number of records / rows in the training data
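Since the K-MEANS table lists a kernel type, the clustering presumably measures distances in the kernel's feature space. The sketch below assumes the standard kernel k-means formulation, where the squared distance from a point to a cluster center is computed from kernel values alone; whether the tool uses exactly this formulation is an assumption, and the function names are hypothetical.

```python
def linear_kernel(u, v):
    """linear: u'*v"""
    return sum(a * b for a, b in zip(u, v))


def kernel_distance_to_center(x, members, kernel):
    """Squared feature-space distance from x to the mean of a cluster.

    Expands |phi(x) - mean|^2 into three kernel sums:
    K(x, x) - 2 * mean_j K(x, m_j) + mean_ij K(m_i, m_j).
    """
    n = len(members)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, m) for m in members) / n
    center_term = sum(kernel(a, b) for a in members for b in members) / (n * n)
    return self_term - 2.0 * cross_term + center_term


def nearest_cluster(x, clusters, kernel):
    """Index of the cluster whose center is closest to x."""
    distances = [kernel_distance_to_center(x, c, kernel) for c in clusters]
    return distances.index(min(distances))


clusters = [
    [[0.0, 0.0], [2.0, 0.0]],     # center at (1, 0)
    [[10.0, 0.0], [12.0, 0.0]],   # center at (11, 0)
]
point = [3.0, 0.0]
print(nearest_cluster(point, clusters, linear_kernel))
```

With the linear kernel this reduces to the ordinary squared Euclidean distance to the centroid, which makes it easy to check the formula by hand; swapping in another kernel changes only the `kernel` argument.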
