Source code reading of LightGBM
source link: http://www.donghao.org/2021/03/31/source-code-reading-of-lightgbm/
I finally got a few hours to look into the code of LightGBM.
I used to have some questions about LightGBM, and now, fortunately, I can answer some of them myself. Some of these answers may be wrong, but that is still better than having no answer at all.
Q: Will LightGBM construct a couple of trees as one model?
A: No. It will only construct one tree as a model for a dataset.
Q: How does LightGBM choose the feature that has the highest gain in entropy?
A: It simply iterates over all features (with a loop, in code) and tries to find the best split for each of them. After that, it picks the feature and the split with the highest gain.
"src/treelearner/serial_tree_learner.cpp"

    Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
      ...
      int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);

      for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
        // some initial works before finding best split
        if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
          // find best threshold for every feature
          FindBestSplits(tree_ptr);
        }
        ...

"src/treelearner/serial_tree_learner.cpp"

    void SerialTreeLearner::FindBestSplits(const Tree* tree) {
      std::vector<int8_t> is_feature_used(num_features_, 0);
      #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512)
      for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
        if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue;
        if (parent_leaf_histogram_array_ != nullptr
            && !parent_leaf_histogram_array_[feature_index].is_splittable()) {
          smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
          continue;
        }
        is_feature_used[feature_index] = 1;
      }
      ...
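
The idea in the answer above (try every feature, then keep the split with the highest gain) can be sketched in a few lines. This is only an illustration, not LightGBM's code. LightGBM works on binned histograms, while this sketch scans sorted raw values. The gain used here is the textbook gradient-boosting one, G_L^2/H_L + G_R^2/H_R - G^2/H, without any regularization. The SplitInfo struct is a simplified stand-in, and BestSplitForFeature and FindBestSplit are names made up for this example. Hessians are assumed to be strictly positive.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified, hypothetical record of a candidate split.
struct SplitInfo {
  int feature = -1;        // -1 means "no useful split found"
  double threshold = 0.0;
  double gain = 0.0;
};

// Scan every threshold of one feature and return the one with the highest gain.
// Assumes all hessians are strictly positive.
SplitInfo BestSplitForFeature(int feature, const std::vector<double>& values,
                              const std::vector<double>& grad,
                              const std::vector<double>& hess) {
  const std::size_t n = values.size();
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

  const double G = std::accumulate(grad.begin(), grad.end(), 0.0);
  const double H = std::accumulate(hess.begin(), hess.end(), 0.0);

  SplitInfo best;
  best.feature = feature;
  double gl = 0.0, hl = 0.0;  // running sums for the left child
  for (std::size_t i = 0; i + 1 < n; ++i) {
    gl += grad[order[i]];
    hl += hess[order[i]];
    // No threshold can separate two equal values.
    if (values[order[i]] == values[order[i + 1]]) continue;
    const double gr = G - gl, hr = H - hl;
    const double gain = gl * gl / hl + gr * gr / hr - G * G / H;
    if (gain > best.gain) {
      best.gain = gain;
      best.threshold = 0.5 * (values[order[i]] + values[order[i + 1]]);
    }
  }
  return best;
}

// Loop over all features and keep the feature/threshold with the highest gain.
SplitInfo FindBestSplit(const std::vector<std::vector<double>>& features,
                        const std::vector<double>& grad,
                        const std::vector<double>& hess) {
  SplitInfo best;
  for (int f = 0; f < static_cast<int>(features.size()); ++f) {
    const SplitInfo s = BestSplitForFeature(f, features[f], grad, hess);
    if (s.gain > best.gain) best = s;
  }
  return best;
}
```

Here features[f][i] is the value of feature f for row i. A constant feature never produces a positive gain, so it can never win the comparison in FindBestSplit.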
Q: In the model file of LightGBM, it shows “num_leaves=63” in every iteration. Shouldn’t the depth and the number of leaves of the model change from iteration to iteration?
A: I can’t answer this yet. I still need to look into the code to see why…
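
This does not settle the question. One thing the Train excerpt above does show is that the split loop runs at most num_leaves - 1 times, and since every split turns one leaf into two, a tree can never end up with more than num_leaves leaves. Below is a toy sketch of that counting argument. Leaf, GrowTree, the "split the largest leaf in half" rule and the min_rows stopping rule are all made up for illustration and are not LightGBM's logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a leaf: it only tracks how many rows fall into it.
struct Leaf {
  int num_rows;
};

// Grow a tree by splitting one leaf per loop iteration and return the leaves.
// Assumption of this sketch: a leaf can be split only if it holds at least
// 2 * min_rows rows, and the largest leaf is always the one chosen.
std::vector<Leaf> GrowTree(int total_rows, int num_leaves, int min_rows) {
  std::vector<Leaf> leaves;
  leaves.push_back(Leaf{total_rows});  // the root is the first leaf
  for (int split = 0; split < num_leaves - 1; ++split) {
    // Pick the largest leaf as a stand-in for "the leaf with the best gain".
    std::size_t best = 0;
    for (std::size_t i = 1; i < leaves.size(); ++i) {
      if (leaves[i].num_rows > leaves[best].num_rows) best = i;
    }
    if (leaves[best].num_rows < 2 * min_rows) break;  // nothing left to split
    const int left = leaves[best].num_rows / 2;
    const int right = leaves[best].num_rows - left;
    leaves[best].num_rows = left;   // the split leaf becomes the left child
    leaves.push_back(Leaf{right});  // the right child is one new leaf
  }
  return leaves;
}
```

With plenty of data, for example GrowTree(1000, 63, 1), the loop runs to its limit and the tree ends with exactly 63 leaves. With very little data, for example GrowTree(10, 63, 1), it stops early with 10 leaves.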