
Some notes on parameters/hyperparameters of a few ML models

13 Mar 2025 » ml, parameters

Linear Regression

  • L1/L2 penalty
  • Fit intercept
  • Solver
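
A minimal sklearn sketch of these knobs. One assumption: plain LinearRegression has no penalty term, so the L1/L2 penalties are shown via Lasso/Ridge (data is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Plain least squares: only fit_intercept to tune.
lr = LinearRegression(fit_intercept=True).fit(X, y)

# L2 penalty (Ridge), with an explicit solver choice.
ridge = Ridge(alpha=1.0, fit_intercept=True, solver="auto").fit(X, y)

# L1 penalty (Lasso); strength controlled by alpha.
lasso = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)
```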

Logistic Regression

  • L1/L2 penalty
  • Class Weight
  • Solver
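
A small sketch with sklearn's LogisticRegression (synthetic data; note that the L1 penalty needs a compatible solver such as liblinear or saga):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# L2 penalty with lbfgs; balanced class weights help with skewed classes.
clf_l2 = LogisticRegression(penalty="l2", class_weight="balanced",
                            solver="lbfgs", max_iter=1000).fit(X, y)

# L1 penalty requires liblinear or saga.
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
```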

Naive Bayes

  • Alpha
  • Binarize
  • Fit Prior
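
These three names match sklearn's BernoulliNB (alpha also appears in MultinomialNB); a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# alpha: additive (Laplace) smoothing; binarize: threshold for turning
# features into booleans; fit_prior: learn class priors vs. use uniform priors.
nb = BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True).fit(X, y)
```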

Decision Tree

  • Criterion
  • Min Sample Split
  • Max Depth
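
A minimal sklearn sketch of these three parameters:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    min_samples_split=10,   # don't split nodes with fewer than 10 samples
    max_depth=5,            # cap tree depth to limit overfitting
).fit(X, y)
```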

Random Forest

  • n_estimators: number of trees (default: 100). (A usage sketch with these parameters follows at the end of this section.)
  • The three parameters below limit how far each tree can grow (to prevent overfitting):
    • max_depth: max depth of each tree (to prevent overfitting).
    • min_samples_split: minimum samples required to split a node.
    • min_samples_leaf: minimum samples required to be a leaf node.
  • max_features: number of features considered when looking for the best split (sqrt is the usual choice for classification; log2 is also common).
    • sqrt is just a rule of thumb and should be OK, but it is worth checking values up to 30-40% of the total number of features.
    • Higher values -> more complex trees, which can overfit.
  • bootstrap: whether to bootstrap sampling (default: True).
  • oob_score: out-of-bag scoring for validation (True/False).
  • criterion: splitting criterion (“gini” or “entropy” for classification, “mse” or “mae” for regression).
    • These are different ways of measuring the quality of a split, i.e. how pure/homogeneous the resulting nodes are.
  • warm_start: like GB, RF also supports reusing the results of a previous fit (to add more trees).
  • Side note:
    • Bagging is a general concept where the individual models can be something other than trees.
    • Also, in plain bagging all features are considered when splitting a node; in RF only a random subset is, which usually improves performance.
    • RF, GB, SVM are non-parametric models (i.e., complexity grows as the number of training samples increases).
      • RF handles multi-class classification better than SVM, since RF does it out-of-the-box.
    • In theory, RF should work with missing and categorical data. However, the sklearn implementation doesn’t handle this. To prepare data for RF, you need to make sure that:
      • there are no missing values in your data
      • categorical data has been converted into numerical form.
    • Usually, XGB is better than RF, maybe giving a ~2% improvement, but it comes at the cost of MUCH longer (maybe 50x) training time, because boosting is sequential by nature and cannot train its trees in parallel the way RF can.
    • Note that one drawback of GBT is that in Spark 2.3.0, GBT does not yet support multi-class classification, while RF does.
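
Pulling the parameters above together, a sketch with sklearn's RandomForestClassifier (synthetic data; oob_score only makes sense with bootstrap=True):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",    # rule of thumb for classification
    bootstrap=True,
    oob_score=True,         # out-of-bag estimate; requires bootstrap=True
    criterion="gini",
    warm_start=False,       # set True to add trees on top of a previous fit
    n_jobs=-1,              # trees are independent, so training parallelizes well
    random_state=42,
).fit(X, y)

print(rf.oob_score_)        # quick validation score without a holdout set
```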

Gradient Boosting (GB) (GBM, GBT)

  • n_estimators: number of boosting stages (default: 100).
    • Usually choose a high value, but too high can overfit.
  • learning_rate: shrinks the contribution of each tree (lower values need more trees).
    • But lower values help prevent overfitting.
  • Same as RF: max_depth, min_samples_split, min_samples_leaf.
  • subsample: fraction of samples used per boosting iteration (default: 1.0, i.e. use all samples).
    • Values slightly below 1, such as 0.8, make the model more robust by reducing variance.
  • loss: loss function (“deviance” for classification, “ls” for regression).
  • alpha: the quantile used by the huber/quantile loss functions (quantile regression).
  • warm_start: whether to reuse the results of previous training (True/False).
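
A sketch with sklearn's GradientBoostingClassifier using the parameters above (loss is left at its default, which is the log-loss/“deviance” for classification):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=300,       # many stages, paired with a low learning rate
    learning_rate=0.05,     # lower -> needs more trees, but overfits less
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    subsample=0.8,          # <1.0 gives stochastic gradient boosting (less variance)
    warm_start=False,
    random_state=42,
).fit(X, y)
```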

XGB (aka the GBM killer, or regularized GB)

  • Key: XGB allows for tree pruning; the amount of pruning is controlled by the gamma parameter (the minimum loss reduction required to make a split).
  • 3 key parameters to tune XGB: the learning rate (eta), the regularization parameters (lambda/alpha), and the amount of pruning (gamma).
  • XGBoost is currently not available on pyspark 2.4.4.
  • Both xgboost and GBM follow the principle of gradient boosting; the difference is in the modeling details. Specifically, xgboost uses a more regularized model formulation to control over-fitting, which gives it better performance.
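
A sketch with the xgboost package's sklearn-style API (assumes xgboost is installed; the keyword names below are xgboost's aliases for eta, gamma, lambda and alpha):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,     # eta: step-size shrinkage
    gamma=1.0,              # min loss reduction to split a node, i.e. the pruning knob
    reg_lambda=1.0,         # L2 regularization on leaf weights
    reg_alpha=0.0,          # L1 regularization on leaf weights
    max_depth=4,
    subsample=0.8,
).fit(X, y)
```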

Principal Component Analysis (PCA)

  • N Components
  • Iterated Power
  • SVD Solver
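
In sklearn terms these are PCA's n_components, iterated_power, and svd_solver (iterated_power only matters for the randomized solver):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=300, n_features=20, random_state=0)

pca = PCA(
    n_components=5,             # number of components to keep
    svd_solver="randomized",    # "auto", "full", "arpack" or "randomized"
    iterated_power=7,           # power-iteration count for the randomized solver
    random_state=0,
)
X_reduced = pca.fit_transform(X)
```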

K-Nearest Neighbors

  • N Neighbors
  • Algorithm (kd-tree, brute)
  • Weights
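
A minimal KNeighborsClassifier sketch:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

knn = KNeighborsClassifier(
    n_neighbors=5,
    algorithm="kd_tree",    # or "ball_tree", "brute", "auto"
    weights="distance",     # closer neighbors get more weight than distant ones
).fit(X, y)
```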

K-means

  • N Clusters
  • Max iter
  • Init
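
A minimal KMeans sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(
    n_clusters=4,
    max_iter=300,
    init="k-means++",   # smarter seeding than "random"
    n_init=10,
    random_state=0,
).fit(X)
```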

Deep NNs

  • Hidden Layer Sizes
  • Activation
  • Dropout
  • Solver
  • Alpha
  • Learning Rate
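
Most of these map onto sklearn's MLPClassifier; dropout is the exception, since sklearn's MLP does not support it (you would need Keras/PyTorch for that):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),   # two hidden layers
    activation="relu",
    solver="adam",
    alpha=1e-4,                     # L2 penalty (note: not a dropout rate)
    learning_rate="adaptive",
    max_iter=500,
    random_state=0,
).fit(X, y)
```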
