
Some notes on parameters/hyperparameters of a few ML models

13 Mar 2025 » ml, parameters

Linear Regression

  • L1/L2 penalty
  • Fit intercept
  • Solver
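
A minimal sklearn sketch of these knobs. One assumption: plain LinearRegression has no penalty term, so the L1/L2 penalties are shown via Lasso/Ridge (data is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Plain least squares: only fit_intercept to tune.
lr = LinearRegression(fit_intercept=True).fit(X, y)

# L2 penalty (Ridge), with an explicit solver choice.
ridge = Ridge(alpha=1.0, fit_intercept=True, solver="auto").fit(X, y)

# L1 penalty (Lasso); strength controlled by alpha.
lasso = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)
```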

Logistic Regression

  • L1/L2 penalty
  • Class Weight
  • Solver
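
A small sketch with sklearn's LogisticRegression (synthetic data; note that the L1 penalty needs a compatible solver such as liblinear or saga):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# L2 penalty with lbfgs; balanced class weights help with skewed classes.
clf_l2 = LogisticRegression(penalty="l2", class_weight="balanced",
                            solver="lbfgs", max_iter=1000).fit(X, y)

# L1 penalty requires liblinear or saga.
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
```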

Naive Bayes

  • Alpha
  • Binarize
  • Fit Prior
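
These three names match sklearn's BernoulliNB (alpha also appears in MultinomialNB); a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# alpha: additive (Laplace) smoothing; binarize: threshold for turning
# features into booleans; fit_prior: learn class priors vs. use uniform priors.
nb = BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True).fit(X, y)
```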

Decision Tree

  • Criterion
  • Min Sample Split
  • Max Depth
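
A minimal sklearn sketch of these three parameters:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    min_samples_split=10,   # don't split nodes with fewer than 10 samples
    max_depth=5,            # cap tree depth to limit overfitting
).fit(X, y)
```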

Random Forest

  • n_estimators: number of trees (default: 100). (A usage sketch with these parameters follows at the end of this section.)
  • The three parameters below limit how far each tree can grow (to prevent overfitting):
    • max_depth: max depth of each tree (to prevent overfitting).
    • min_samples_split: minimum samples required to split a node.
    • min_samples_leaf: minimum samples required to be a leaf node.
  • max_features: number of features considered when looking for the best split (sqrt is the usual choice for classification; log2 is also common).
    • sqrt is just a rule of thumb and should be OK, but it is worth checking values up to 30-40% of the total number of features.
    • Higher values -> more complex trees, which can overfit.
  • bootstrap: whether to bootstrap sampling (default: True).
  • oob_score: out-of-bag scoring for validation (True/False).
  • criterion: splitting criterion (“gini” or “entropy” for classification, “mse” or “mae” for regression).
    • These are different ways of measuring the quality of a split, i.e. how pure/homogeneous the resulting nodes are.
  • warm_start: like GB, RF also supports reusing the results of a previous fit (to add more trees).
  • Side note:
    • Bagging is a general concept where the individual models can be something other than trees.
    • Also, in plain bagging all features are considered when splitting a node; in RF only a random subset is, which usually improves performance.
    • RF, GB, SVM are non-parametric models (i.e., complexity grows as the number of training samples increases).
      • RF handles multi-class classification better than SVM, since RF does it out-of-the-box.
    • In theory, RF should work with missing and categorical data. However, the sklearn implementation doesn’t handle this. To prepare data for RF, you need to make sure that:
      • there are no missing values in your data
      • categorical data has been converted into numerical form.
    • Usually, XGB is better than RF, maybe giving a ~2% improvement, but it comes at the cost of MUCH longer (maybe 50x) training time, because boosting is sequential by nature and cannot train its trees in parallel the way RF can.
    • Note that one drawback of GBT is that in Spark 2.3.0, GBT does not yet support multi-class classification, while RF does.
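
Pulling the parameters above together, a sketch with sklearn's RandomForestClassifier (synthetic data; oob_score only makes sense with bootstrap=True):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",    # rule of thumb for classification
    bootstrap=True,
    oob_score=True,         # out-of-bag estimate; requires bootstrap=True
    criterion="gini",
    warm_start=False,       # set True to add trees on top of a previous fit
    n_jobs=-1,              # trees are independent, so training parallelizes well
    random_state=42,
).fit(X, y)

print(rf.oob_score_)        # quick validation score without a holdout set
```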

Gradient Boosting (GB) (GBM, GBT)

  • n_estimators: number of boosting stages (default: 100).
    • Usually choose a high value, but too high can overfit.
  • learning_rate: shrinks the contribution of each tree (lower values need more trees).
    • But lower values help prevent overfitting.
  • Same as RF: max_depth, min_samples_split, min_samples_leaf.
  • subsample: fraction of samples used per boosting iteration (default: 1.0, i.e. use all samples).
    • Values slightly below 1, such as 0.8, make the model more robust by reducing variance.
  • loss: loss function (“deviance” for classification, “ls” for regression).
  • alpha: the quantile used by the huber/quantile loss functions (quantile regression).
  • warm_start: whether to reuse the results of previous training (True/False).
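
A sketch with sklearn's GradientBoostingClassifier using the parameters above (loss is left at its default, which is the log-loss/“deviance” for classification):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=300,       # many stages, paired with a low learning rate
    learning_rate=0.05,     # lower -> needs more trees, but overfits less
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    subsample=0.8,          # <1.0 gives stochastic gradient boosting (less variance)
    warm_start=False,
    random_state=42,
).fit(X, y)
```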

XGB (aka the GBM killer, or regularized GB)

  • Key: XGB allows for tree pruning; the amount of pruning is controlled by the gamma parameter (the minimum loss reduction required to make a split).
  • 3 key parameters to tune XGB: the learning rate (eta), the regularization parameters (lambda/alpha), and the amount of pruning (gamma).
  • XGBoost is currently not available on pyspark 2.4.4.
  • Both xgboost and GBM follow the principle of gradient boosting; the difference is in the modeling details. Specifically, xgboost uses a more regularized model formulation to control over-fitting, which gives it better performance.
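
A sketch with the xgboost package's sklearn-style API (assumes xgboost is installed; the keyword names below are xgboost's aliases for eta, gamma, lambda and alpha):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,     # eta: step-size shrinkage
    gamma=1.0,              # min loss reduction to split a node, i.e. the pruning knob
    reg_lambda=1.0,         # L2 regularization on leaf weights
    reg_alpha=0.0,          # L1 regularization on leaf weights
    max_depth=4,
    subsample=0.8,
).fit(X, y)
```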

Principal Component Analysis (PCA)

  • N Components
  • Iterated Power
  • SVD Solver
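
In sklearn terms these are PCA's n_components, iterated_power, and svd_solver (iterated_power only matters for the randomized solver):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=300, n_features=20, random_state=0)

pca = PCA(
    n_components=5,             # number of components to keep
    svd_solver="randomized",    # "auto", "full", "arpack" or "randomized"
    iterated_power=7,           # power-iteration count for the randomized solver
    random_state=0,
)
X_reduced = pca.fit_transform(X)
```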

K-Nearest Neighbors

  • N Neighbors
  • Algorithm (kd-tree, brute)
  • Weights
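
A minimal KNeighborsClassifier sketch:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

knn = KNeighborsClassifier(
    n_neighbors=5,
    algorithm="kd_tree",    # or "ball_tree", "brute", "auto"
    weights="distance",     # closer neighbors get more weight than distant ones
).fit(X, y)
```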

K-means

  • N Clusters
  • Max iter
  • Init
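
A minimal KMeans sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(
    n_clusters=4,
    max_iter=300,
    init="k-means++",   # smarter seeding than "random"
    n_init=10,
    random_state=0,
).fit(X)
```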

Deep NNs

  • Hidden Layer Sizes
  • Activation
  • Dropout
  • Solver
  • Alpha
  • Learning Rate
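
Most of these map onto sklearn's MLPClassifier; dropout is the exception, since sklearn's MLP does not support it (you would need Keras/PyTorch for that):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),   # two hidden layers
    activation="relu",
    solver="adam",
    alpha=1e-4,                     # L2 penalty (note: not a dropout rate)
    learning_rate="adaptive",
    max_iter=500,
    random_state=0,
).fit(X, y)
```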
