Picking the Proper MLA for Linear Regression


Decision Tree Succeeds Where Ordinary Least Squares Failed

[Figure: Decision Tree Regression]

In my last post, I talked about selecting specific features and using them in an ordinary least squares formula to help predict the scalar coupling constant for molecular interaction properties. I left off with a relatively poor predicting algorithm and decided that this was due to unimportant features, as determined by all of my EDA calculations. Below is my previous picture as well as the output visual using only the important features.

[Figure: OLS predictions using features with weight 150 or above]

[Figure: OLS predictions using features with weight 1500 or above]

Can you tell the difference? They use two different feature sets: the first contains all features with a weight of 150 or above, while the second contains only features with weights of 1500 or above. That's when I realized that this was just a bad model to work with.

Given that my earlier feature importance was unreliable, I decided to work on a model that takes multiple features into account and keeps only the important ones, in the spirit of Principal Component Analysis. That led me to the decision tree regression algorithm. When most people talk about decision trees, they mean classification, but since I am attempting to predict a continuous variable, it has to be a regression.
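
To make that concrete, here is a minimal sketch of how a fitted tree ranks features on its own, using scikit-learn's feature_importances_ attribute on toy data (the toy arrays are my own illustration, not columns from the molecule dataset):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: three features, but only the first one drives the target
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 10 * X[:, 0] + 0.1 * rng.rand(100)

toy_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Importances sum to 1; the informative column should dominate
print(toy_tree.feature_importances_)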

First, I loaded up the subsample CSV I created from the massive database using pandas. Then I used the following code to check all the categorical columns and drop them. Remember that these categories have little to no importance and are too sparse to be effective.

import pandas as pd

df = pd.read_csv('molecule_subsample.csv')

# Inspect each categorical (object) column before dropping it
for col in df.select_dtypes(include=[object]):
    print(df[col].value_counts(dropna=False), "\n\n")

df = df.drop(columns=['molecule_name', 'atom_atom1_structure', 'type', 'type_scc', 'atom'])

Now I have a proper dataframe with the features I want. I also decided to save this as a CSV file so I can import it more quickly in the future:

df.to_csv('subsample_nocat.csv', index=False)
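
Reading it back later is then a one-liner (assuming the file sits in the working directory):

df = pd.read_csv('subsample_nocat.csv')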

Then I created the feature and target variables and made the train test split. I chose a small test size because even this subsample contains almost half a million rows.

from sklearn.model_selection import train_test_split

feature = df.drop(columns=['scalar_coupling_constant'])
target = df[['scalar_coupling_constant']]

feature_train, feature_test, target_train, target_test = train_test_split(feature, target, test_size=0.1)

total feature training features:  419233
total feature testing features:  46582
total target training features:  419233
total target testing features:  46582

Afterward, it's relatively simple. I loaded up my decision tree regressor and filled it with the criteria I wanted. Knowing I have a relatively large dataset, I went for a rather large max depth.

from sklearn import tree

DTR = tree.DecisionTreeRegressor(max_depth=75, min_samples_split=3, min_samples_leaf=5, random_state=1)
DR = DTR.fit(feature_train, target_train)

DR:
DecisionTreeRegressor(criterion='mse', max_depth=75, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=5,
                      min_samples_split=3, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
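
Before trying to draw a tree this size, a cheaper sanity check is to ask the fitted model how big it actually got (a small sketch; get_depth() and get_n_leaves() are built-in accessors on recent scikit-learn tree estimators):

print("actual depth:", DR.get_depth())    # may come in under the max_depth=75 cap
print("leaf count:", DR.get_n_leaves())   # a huge leaf count explains why plotting chokes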

I tried to visualize the tree, but my computer kept crashing, so I decided to go with an alternate method and use cross-validation for my decision tree. I essentially redid the code above, but with cross-validation in mind:

from sklearn.model_selection import cross_validate

drcv = DR.fit(feature_train, target_train)
drcv_scores = cross_validate(drcv, feature_train, target_train, cv=10)

drcv_scores:
{'fit_time': array([20.96234488, 20.751436  , 20.6993022 , 20.62980795, 20.80624795,
        20.72991371, 20.73874903, 20.65243793, 20.55556297, 20.36065102]),
 'score_time': array([0.03302193, 0.0274229 , 0.02751803, 0.02114892, 0.02561307,
        0.02700615, 0.02410102, 0.02259707, 0.02510405, 0.02420998]),
 'test_score': array([0.99999431, 0.99998765, 0.99999402, 0.99999096, 0.99999444,
        0.99999466, 0.99998819, 0.99999362, 0.99999481, 0.99998841])}

print("regression score: {}".format(drcv.score(feature_train, target_train)))
regression score: 0.9999964614281138
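
The per-fold arrays that cross_validate returns are easy to summarize; a quick sketch using the drcv_scores dictionary above:

# test_score is already a NumPy array, so mean/std come for free
print("mean test R^2: {:.6f} (+/- {:.6f})".format(
    drcv_scores['test_score'].mean(), drcv_scores['test_score'].std()))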

Look at that result: every fold comes back with an R² above 0.9999. This can only mean one thing: the model is hugely overfit. Either way, I was willing to see this through. I know I can adjust for the overfitting at a later point by lowering my max_depth and changing the tree's min_samples_split and min_samples_leaf, as well as by implementing the random forest ensemble method.
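
To illustrate what that adjustment might look like, here is a rough sketch that sweeps a few shallower depths and compares cross-validated scores (the depth values are arbitrary placeholders, not tuned choices):

from sklearn import tree
from sklearn.model_selection import cross_validate

# Shallower trees trade training fit for generalization
for depth in [5, 10, 25, 50]:
    candidate = tree.DecisionTreeRegressor(max_depth=depth, min_samples_split=3,
                                           min_samples_leaf=5, random_state=1)
    scores = cross_validate(candidate, feature_train, target_train, cv=3, n_jobs=-1)
    print(depth, scores['test_score'].mean())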

I then reran the cross-validation, but with a slight change:

cv = cross_validate(DR, feature_train, target_train, n_jobs=-1, return_train_score=True)

I wanted to see my training scores as well:

{'fit_time': array([19.88309979, 19.68618298, 19.56496   ]),
 'score_time': array([0.06965423, 0.08991718, 0.07562518]),
 'test_score': array([0.99999126, 0.99998605, 0.99999297]),
 'train_score': array([0.99999497, 0.99999486, 0.99999842])}

That shows barely any difference between my training and testing scores, which should not happen even in a case of overfitting. I thought this could be due to some leaky code, or that I had made some mistake when naming my variables. So I decided to check a few more results:

DTR.score(feature_test, target_test)
0.9999870784300612

DTR.score(feature_train, target_train)
0.9999964614281138

That shows two different numbers, although they are very close. So while I did not make any mistake with my variable names, it seems that because of how I set up my tree, everything is being overfit. Then I decided on one last check: plotting predictions against actual values. Since the two are in different data formats, I had to convert one of them to a NumPy array first and then plot:

import matplotlib.pyplot as plt

predict = DTR.predict(feature_test)
type(predict)
numpy.ndarray

tt_np = target_test.to_numpy()
type(tt_np)
numpy.ndarray

plt.rcParams["figure.figsize"] = (8, 8)
fig, ax = plt.subplots()
ax.scatter(tt_np, predict)  # actual values on x, predictions on y, matching the labels
ax.set(title="Predict vs Actual")
ax.set(xlabel="Actual", ylabel="Predict");

[Figure: Predict vs Actual scatter plot]

Look at how beautiful it is!

Wow, I did not expect this. While my prediction missed a few points, it looks like it got almost everything right. This led me to the conclusion that this machine learning model is also flawed, but I will reach a better insight after I try random forest.
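
For anyone who wants to peek ahead, a random forest version would be nearly a drop-in replacement; a rough sketch (the hyperparameters here are placeholders, not the settings I will actually use):

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=100, max_depth=25,
                            min_samples_leaf=5, random_state=1, n_jobs=-1)
RFR.fit(feature_train, target_train.values.ravel())  # ravel() avoids the column-vector warning
print(RFR.score(feature_test, target_test))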

