# Choose one of the cleaned datasets at

Choose one of the cleaned datasets at https://www.kaggle.com/annavictoria/ml-friendly-public-datasets. Split it into training and test data. Implement from scratch any ML algorithm that you learned in class and apply it to this dataset. You can use R or Python to implement it.

For the implementation, you may use any classes, modules, and functions in R / Python libraries such as NumPy for math and linear algebra operations, but you may not use ML classes or functions directly.
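As an illustration of what "from scratch with NumPy" can look like, here is a minimal sketch of a train/test split that uses only NumPy. The function name `train_test_split_np`, its parameters, and the toy arrays are hypothetical and chosen for this example; they do not come from any of the Kaggle datasets.

```python
import numpy as np

def train_test_split_np(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows once, then cut them into train and test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)       # fixed seed keeps the split reproducible
    order = rng.permutation(len(X))         # random ordering of the row indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 rows, 2 features, binary labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
X_train, X_test, y_train, y_test = train_test_split_np(X_demo, y_demo, test_fraction=0.3)
print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the file is sorted by label or by date; for time-ordered data a chronological cut may be more appropriate.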

Apply another algorithm that you learned in class to the same dataset. For this one, you are free to implement it from scratch or to use ML classes and functions directly from ML packages.

Which of the two algorithms fares better? Use as many evaluation metrics as possible to discuss the performance of the algorithms.
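For a binary classification task, several common metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes labels encoded as 0/1 with 1 as the positive class; the function name `binary_metrics` and the demo label vectors are made up for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute common metrics from the confusion-matrix counts (1 = positive)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))      # predicted positive, actually positive
    tn = int(np.sum(~y_true & ~y_pred))    # predicted negative, actually negative
    fp = int(np.sum(~y_true & y_pred))     # predicted positive, actually negative
    fn = int(np.sum(y_true & ~y_pred))     # predicted negative, actually positive

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent
        return num / den if den else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # also called recall
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical predictions from two models on the same test labels
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
pred_a = [1, 0, 1, 0, 1, 0, 0, 1]
pred_b = [1, 1, 1, 1, 1, 0, 0, 1]
metrics_a = binary_metrics(y_true, pred_a)
metrics_b = binary_metrics(y_true, pred_b)
for name in ("accuracy", "precision", "sensitivity", "specificity", "f1"):
    print(f"{name:12s} A={metrics_a[name]:.3f}  B={metrics_b[name]:.3f}")
```

Two models can tie on accuracy and still differ on sensitivity and specificity, which is one reason to report more than a single number.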

1.b. [5 points]

Derive an equation for accuracy in terms of specificity and sensitivity. The equation can include quantities such as the number of True Positives or the number of False Positives, etc., in addition to accuracy, specificity, and sensitivity. Give an intuitive interpretation of the equation.

2.a. [15 points]

Refer to online tutorials on regularization such as

https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2

and

https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

Apply the techniques from the above tutorials to the student dataset at https://archive.ics.uci.edu/ml/datasets/student+performance. Does regularization help improve the accuracy of predicting the final Math grade of the students?
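The comparison this question asks for follows a simple pattern: fit the model at several regularization strengths, then compare training and test error. The sketch below shows that pattern with closed-form ridge regression in NumPy rather than the scikit-learn estimators used in the tutorials. It runs on synthetic data, not the student dataset, and the names `fit_ridge`, `rmse`, and `lam` are assumptions made for this example.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                  # centering removes the intercept
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def rmse(X, y, w, b):
    """Root mean squared error of the linear prediction X @ w + b."""
    return float(np.sqrt(np.mean((X @ w + b - y) ** 2)))

# Synthetic stand-in data: 60 rows, 20 features, only 3 of them informative
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=60)
X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

results = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w, b = fit_ridge(X_train, y_train, lam)
    row = (lam, rmse(X_train, y_train, w, b), rmse(X_test, y_test, w, b),
           float(np.linalg.norm(w)))
    results.append(row)
    print(f"lambda={row[0]:6.1f}  train RMSE={row[1]:.3f}  "
          f"test RMSE={row[2]:.3f}  ||w||={row[3]:.3f}")
```

With `lam=0.0` this reduces to ordinary least squares, which serves as the unregularized baseline. Whether the test error improves for the student data depends on the features and the split, so the same loop would need to be run on the real dataset to answer the question.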

2.b. [5 points]

Consider a toy dataset with just two features, x1 and x2. The data is linearly separable, so we use linear classification, where w1 is the weight (coefficient) for x1, w2 is the weight for x2, and b is the bias term. Suppose we do not include w1 or w2 in the regularization term and choose the hyperparameters so that b becomes 0. What is one unique characteristic that you can state with certainty about the classifier? What about when we include only w1 in the regularization term, and then only w2? (Hint: Use pen and paper if necessary)

Speaking of hyperparameters in such cases, should the regularization hyperparameter λ be larger or smaller to regularize the weights / bias to 0? Explain. (Hint: Think about the hyperparameter C in SVM)

2.c. [5 points]

For regularization, we added the regularizer to the loss function. Would it make sense to multiply by the term or subtract it instead? Explain.
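For reference, the additive form described here can be written as follows. The notation is an assumption for illustration: L is the unregularized loss, λ ≥ 0 is the regularization strength, and the squared L2 penalty is one common choice of regularizer, not necessarily the one used in class.

```latex
J(w_1, w_2, b) = L(w_1, w_2, b) + \lambda \left( w_1^2 + w_2^2 \right), \qquad \lambda \ge 0
```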

3.a. [15 points]

Manually generate the decision tree for the following toy “hiking” dataset using the algorithm discussed in the class. Show the information gain computation at each stage.

3.b. [10 points] Then generate the decision tree programmatically in R or Python, first using the Gini index and then using entropy and information gain. Submit the code and the resulting decision trees. Are the three decision trees generated for this question identical? Explain why or why not.

Dataset for “Go Hiking” Decision Tree that you will generate

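| Windy? | Air Quality Good? | Hot? | Go Hiking? |
|--------|-------------------|------|------------|
| No     | No                | No   | No         |
| Yes    | No                | Yes  | Yes        |
| Yes    | Yes               | No   | Yes        |
| Yes    | Yes               | Yes  | No         |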
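The split criteria named in 3.a and 3.b can be computed with a few lines of plain Python. The sketch below uses the standard definitions of entropy, Gini impurity, and impurity reduction (information gain when the impurity is entropy); the algorithm discussed in class may differ in its details. The rows are transcribed from the table above, and the helper names `entropy`, `gini`, and `impurity_reduction` are chosen for this example.

```python
from collections import Counter
from math import log2

# Rows of the "Go Hiking" table: (Windy?, Air Quality Good?, Hot?, Go Hiking?)
ROWS = [
    ("No", "No", "No", "No"),
    ("Yes", "No", "Yes", "Yes"),
    ("Yes", "Yes", "No", "Yes"),
    ("Yes", "Yes", "Yes", "No"),
]
FEATURES = ["Windy?", "Air Quality Good?", "Hot?"]

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(rows, feature_index, impurity):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        # Group the target labels by the value of the chosen feature
        groups.setdefault(r[feature_index], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return impurity(labels) - weighted

for i, name in enumerate(FEATURES):
    ig = impurity_reduction(ROWS, i, entropy)
    gd = impurity_reduction(ROWS, i, gini)
    print(f"{name:18s} information gain={ig:.4f}  Gini decrease={gd:.4f}")
```

The same functions can be reapplied to the subset of rows that reaches each child node to compute the gains at the next stage of the tree.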