
Source code reading of LightGBM

 3 years ago
source link: http://www.donghao.org/2021/03/31/source-code-reading-of-lightgbm/


I finally found a few hours to look into the source code of LightGBM.

I used to have some questions about LightGBM, and now, fortunately, I can answer some of them myself. Even if some of the answers turn out to be wrong, that is still better than no answer at all 🙂

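Q: Will LightGBM construct several trees as one model?

A: Not within a single call. As far as I can tell, each call to SerialTreeLearner::Train builds exactly one tree for the given gradients and hessians. The boosting loop calls it once per iteration, so the saved model ends up as a sequence of such trees.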

Q: How would LightGBM choose the feature that has the highest gain in entropy?

A: It simply iterates over all features (with a loop, in code) and tries to find the best split for each of them. After that, it picks the feature and the split with the highest gain.

"src/treelearner/serial_tree_learner.cpp"
Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
...
  int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);

  for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
    // some initial works before finding best split
    if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
      // find best threshold for every feature
      FindBestSplits(tree_ptr);
    }
...
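
The loop above runs at most num_leaves - 1 times because a tree with num_leaves leaves needs exactly that many splits. To make the shape of this leaf-wise growth easier to see, here is a minimal, self-contained sketch. It is not LightGBM's code: Leaf, BestSplit, FindBestSplitInLeaf and GrowLeafWise are hypothetical names, the toy gain is plain squared-error reduction on one pre-sorted feature, and stopping when no split has positive gain is an assumption of the sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy leaf: target values of the rows in the leaf,
// already sorted by a single feature. Not a LightGBM type.
using Leaf = std::vector<double>;

struct BestSplit {
  std::size_t pos;  // rows [0, pos) go left, rows [pos, n) go right
  double gain;      // reduction in squared error; 0 when no split helps
};

// Scan every cut position inside one leaf and keep the highest gain.
BestSplit FindBestSplitInLeaf(const Leaf& leaf) {
  BestSplit best{0, 0.0};
  double total = 0.0;
  for (double v : leaf) total += v;
  const double n = static_cast<double>(leaf.size());
  double left = 0.0;
  for (std::size_t pos = 1; pos < leaf.size(); ++pos) {
    left += leaf[pos - 1];
    const double right = total - left;
    const double nl = static_cast<double>(pos);
    const double gain =
        left * left / nl + right * right / (n - nl) - total * total / n;
    if (gain > best.gain) {
      best.pos = pos;
      best.gain = gain;
    }
  }
  return best;
}

// Leaf-wise growth: at most num_leaves - 1 splits, each one applied
// to the leaf whose best split has the highest gain.
std::vector<Leaf> GrowLeafWise(const Leaf& root, int num_leaves) {
  std::vector<Leaf> leaves{root};
  for (int split = 0; split < num_leaves - 1; ++split) {
    std::size_t best_leaf = 0;
    BestSplit best{0, 0.0};
    for (std::size_t i = 0; i < leaves.size(); ++i) {
      const BestSplit candidate = FindBestSplitInLeaf(leaves[i]);
      if (candidate.gain > best.gain) {
        best = candidate;
        best_leaf = i;
      }
    }
    // Assumption of this sketch: stop early when nothing improves.
    if (best.gain <= 0.0) break;
    const Leaf parent = leaves[best_leaf];
    const auto cut = parent.begin() + static_cast<std::ptrdiff_t>(best.pos);
    leaves[best_leaf] = Leaf(parent.begin(), cut);  // left child reuses the slot
    leaves.push_back(Leaf(cut, parent.end()));      // right child is appended
  }
  return leaves;
}
```

With the values 1, 1, 5, 5, 9, 9 as the root leaf, GrowLeafWise(root, 10) stops at three leaves in this sketch, since no further split has positive gain, while GrowLeafWise(root, 2) stops after the single allowed split.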
"src/treelearner/serial_tree_learner.cpp"
void SerialTreeLearner::FindBestSplits(const Tree* tree) {
  std::vector<int8_t> is_feature_used(num_features_, 0);
  #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512)
  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
    if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue;
    if (parent_leaf_histogram_array_ != nullptr
        && !parent_leaf_histogram_array_[feature_index].is_splittable()) {
      smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
      continue;
    }
    is_feature_used[feature_index] = 1;
  }
...
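
The excerpt above only marks which features take part in the search; the scan itself is elided. As a rough illustration of "try every feature, keep the best split", here is a small sketch that works on per-feature histograms of gradient and hessian sums. Everything in it is an assumption for illustration: BinStats, SplitCandidate, LeafScore and FindBestSplit are hypothetical names, and the gain is the usual second-order form G^2 / (H + lambda) rather than an entropy measure. The real code has more options and constraints than this.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical histogram bin: sums over the rows that fall into the bin.
struct BinStats {
  double sum_gradient;
  double sum_hessian;
};

// Hypothetical result type, not LightGBM's SplitInfo.
struct SplitCandidate {
  int feature = -1;        // chosen feature index, -1 if no split helps
  int threshold_bin = -1;  // bins <= threshold_bin go to the left child
  double gain = 0.0;       // improvement over leaving the leaf unsplit
};

// Score of a leaf with gradient sum g, hessian sum h and L2 term lambda.
// Assumes h + lambda > 0.
inline double LeafScore(double g, double h, double lambda) {
  return g * g / (h + lambda);
}

// Loop over every feature, scan its bins left to right, and keep the
// (feature, threshold) pair with the highest gain seen so far.
SplitCandidate FindBestSplit(
    const std::vector<std::vector<BinStats>>& histograms, double lambda) {
  SplitCandidate best;
  for (std::size_t f = 0; f < histograms.size(); ++f) {
    const std::vector<BinStats>& hist = histograms[f];
    double total_g = 0.0;
    double total_h = 0.0;
    for (const BinStats& bin : hist) {
      total_g += bin.sum_gradient;
      total_h += bin.sum_hessian;
    }
    const double parent = LeafScore(total_g, total_h, lambda);
    double left_g = 0.0;
    double left_h = 0.0;
    // The last bin is skipped as a threshold: it would leave the right side empty.
    for (std::size_t t = 0; t + 1 < hist.size(); ++t) {
      left_g += hist[t].sum_gradient;
      left_h += hist[t].sum_hessian;
      const double gain = LeafScore(left_g, left_h, lambda) +
                          LeafScore(total_g - left_g, total_h - left_h, lambda) -
                          parent;
      if (gain > best.gain) {
        best.feature = static_cast<int>(f);
        best.threshold_bin = static_cast<int>(t);
        best.gain = gain;
      }
    }
  }
  return best;
}
```

Calling FindBestSplit with one histogram per feature returns the winning feature index and bin threshold, or feature = -1 when no split has positive gain. Keeping the running left-side sums is what makes the scan linear in the number of bins per feature.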

Q: The model file of LightGBM shows “num_leaves=63” for every iteration. Shouldn’t the depth and the number of leaves differ from one iteration's tree to the next?

A: Can’t answer yet. Still need to look into the code to see why…
