Picking the Proper MLA for Linear Regression


Decision Tree Succeeds Where Ordinary Least Squares Failed

[Figure: Decision Tree Regression]

In my last post, I talked about selecting specific features and using them in an ordinary least squares formula to help predict the scalar coupling constant for molecular interaction properties. I left off with a relatively poor predicting algorithm and decided that this was due to unimportant features, as determined by all of my EDA calculations. Below is my previous picture as well as the output visual using only the important features.

[Figure: OLS predictions using features with weight 150 or above]

[Figure: OLS predictions using features with weight 1500 or above]

Can you tell the difference? They use two different feature sets: the first contains all features with a weight of 150 or above, while the second contains only features with weights of 1500 or above. That's when I realized that this was just a bad model to work with.

Given that my earlier feature importance was unreliable, I decided to work on a model that takes multiple features into account and keeps only the important ones, in the spirit of Principal Component Analysis. That led me to the decision tree regression algorithm. When most people talk about decision trees, they mean classification, but since I am attempting to predict a continuous variable, it has to be a regression.
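
To make that concrete, here is a minimal sketch of how a fitted tree ranks features on its own, using scikit-learn's feature_importances_ attribute on toy data (the toy arrays are my own illustration, not columns from the molecule dataset):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: three features, but only the first one drives the target
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 10 * X[:, 0] + 0.1 * rng.rand(100)

toy_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Importances sum to 1; the informative column should dominate
print(toy_tree.feature_importances_)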

First, I loaded up the subsample CSV I created from the massive database using pandas. Then I used the following code to check all the categorical columns and drop them. Remember that these categories have little to no importance and are too sparse to be effective.

import pandas as pd

df = pd.read_csv('molecule_subsample.csv')

# Inspect each categorical (object) column before dropping it
for col in df.select_dtypes(include=[object]):
    print(df[col].value_counts(dropna=False), "\n\n")

df = df.drop(columns=['molecule_name', 'atom_atom1_structure', 'type', 'type_scc', 'atom'])

Now I have a proper dataframe with the features I want. I also decided to save this as a CSV file so I can import it more quickly in the future:

df.to_csv('subsample_nocat.csv', index=False)
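
Reading it back later is then a one-liner (assuming the file sits in the working directory):

df = pd.read_csv('subsample_nocat.csv')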

Then I created the feature and target variables and made the train test split. I chose a small test size because even this subsample contains almost half a million rows.

from sklearn.model_selection import train_test_split

feature = df.drop(columns=['scalar_coupling_constant'])
target = df[['scalar_coupling_constant']]

feature_train, feature_test, target_train, target_test = train_test_split(feature, target, test_size=0.1)

total feature training features:  419233
total feature testing features:  46582
total target training features:  419233
total target testing features:  46582

Afterward, it's relatively simple. I loaded up my decision tree regressor and filled it with the criteria I wanted. Knowing I have a relatively large dataset, I went for a rather large max depth.

from sklearn import tree

DTR = tree.DecisionTreeRegressor(max_depth=75, min_samples_split=3, min_samples_leaf=5, random_state=1)
DR = DTR.fit(feature_train, target_train)

DR:
DecisionTreeRegressor(criterion='mse', max_depth=75, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=5,
                      min_samples_split=3, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
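
Before trying to draw a tree this size, a cheaper sanity check is to ask the fitted model how big it actually got (a small sketch; get_depth() and get_n_leaves() are built-in accessors on recent scikit-learn tree estimators):

print("actual depth:", DR.get_depth())    # may come in under the max_depth=75 cap
print("leaf count:", DR.get_n_leaves())   # a huge leaf count explains why plotting chokes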

I tried to visualize the tree, but my computer kept crashing, so I decided to go with an alternate method and use cross-validation for my decision tree. I essentially redid the code above, but with cross-validation in mind:

from sklearn.model_selection import cross_validate

drcv = DR.fit(feature_train, target_train)
drcv_scores = cross_validate(drcv, feature_train, target_train, cv=10)

drcv_scores:
{'fit_time': array([20.96234488, 20.751436  , 20.6993022 , 20.62980795, 20.80624795,
        20.72991371, 20.73874903, 20.65243793, 20.55556297, 20.36065102]),
 'score_time': array([0.03302193, 0.0274229 , 0.02751803, 0.02114892, 0.02561307,
        0.02700615, 0.02410102, 0.02259707, 0.02510405, 0.02420998]),
 'test_score': array([0.99999431, 0.99998765, 0.99999402, 0.99999096, 0.99999444,
        0.99999466, 0.99998819, 0.99999362, 0.99999481, 0.99998841])}

print("regression score: {}".format(drcv.score(feature_train, target_train)))
regression score: 0.9999964614281138
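
The per-fold arrays that cross_validate returns are easy to summarize; a quick sketch using the drcv_scores dictionary above:

# test_score is already a NumPy array, so mean/std come for free
print("mean test R^2: {:.6f} (+/- {:.6f})".format(
    drcv_scores['test_score'].mean(), drcv_scores['test_score'].std()))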

Look at that result: every fold comes back with an R² above 0.9999. This can only mean one thing: the model is hugely overfit. Either way, I was willing to see this through. I know I can adjust for the overfitting at a later point by lowering my max_depth and changing the tree's min_samples_split and min_samples_leaf, as well as by implementing the random forest ensemble method.
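
To illustrate what that adjustment might look like, here is a rough sketch that sweeps a few shallower depths and compares cross-validated scores (the depth values are arbitrary placeholders, not tuned choices):

from sklearn import tree
from sklearn.model_selection import cross_validate

# Shallower trees trade training fit for generalization
for depth in [5, 10, 25, 50]:
    candidate = tree.DecisionTreeRegressor(max_depth=depth, min_samples_split=3,
                                           min_samples_leaf=5, random_state=1)
    scores = cross_validate(candidate, feature_train, target_train, cv=3, n_jobs=-1)
    print(depth, scores['test_score'].mean())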

I then reran the cross-validation, but with a slight change:

cv = cross_validate(DR, feature_train, target_train, n_jobs=-1, return_train_score=True)

I wanted to see my training scores as well:

{'fit_time': array([19.88309979, 19.68618298, 19.56496   ]),
 'score_time': array([0.06965423, 0.08991718, 0.07562518]),
 'test_score': array([0.99999126, 0.99998605, 0.99999297]),
 'train_score': array([0.99999497, 0.99999486, 0.99999842])}

That shows barely any difference between my training and testing scores, which should not happen even in a case of overfitting. I thought this could be due to some leaky code, or that I had made some mistake when naming my variables. So I decided to check a few more results:

DTR.score(feature_test, target_test)
0.9999870784300612

DTR.score(feature_train, target_train)
0.9999964614281138

That shows two different numbers, although they are very close. So while I did not make any mistake with my variable names, it seems that because of how I set up my tree, everything is being overfit. Then I decided on one last check: plotting predictions against actual values. Since the two are in different data formats, I had to convert one of them to a NumPy array first and then plot:

import matplotlib.pyplot as plt

predict = DTR.predict(feature_test)
type(predict)
numpy.ndarray

tt_np = target_test.to_numpy()
type(tt_np)
numpy.ndarray

plt.rcParams["figure.figsize"] = (8, 8)
fig, ax = plt.subplots()
ax.scatter(tt_np, predict)  # actual values on x, predictions on y, matching the labels
ax.set(title="Predict vs Actual")
ax.set(xlabel="Actual", ylabel="Predict");

[Figure: Predict vs Actual scatter plot]

Look at how beautiful it is!

Wow, I did not expect this. While my prediction missed a few points, it looks like it got almost everything right. This led me to the conclusion that this machine learning model is also flawed, but I will reach a better insight after I try random forest.
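
For anyone who wants to peek ahead, a random forest version would be nearly a drop-in replacement; a rough sketch (the hyperparameters here are placeholders, not the settings I will actually use):

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=100, max_depth=25,
                            min_samples_leaf=5, random_state=1, n_jobs=-1)
RFR.fit(feature_train, target_train.values.ravel())  # ravel() avoids the column-vector warning
print(RFR.score(feature_test, target_test))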

