# Tree Approximation of scalar function

Machine Learning: Homework Assignment 8
E4525 Spring 2020,
IEOR, Columbia University
Due: May 4th, 2020
1. Tree Approximation of scalar function
We would like to approximate the function
$$f(x) = x \tag{1}$$
by a regression tree. We assume $x$ is distributed uniformly, $p(x) = 1$ for $x \in [0, 1]$.
(a) Suppose we wish to approximate $f(x)$ by a tree $T_1(x)$ with a maximum depth of one. What is the optimal tree $T_1(x)$? How many parameters do you need to define $T_1(x)$, and what is the squared error of this tree approximation?
$$E = \mathbb{E}\left[(f(x) - T_1(x))^2\right] \tag{2}$$
(b) Suppose now we wish to approximate $f(x)$ by a tree $T_2(x)$ with a maximum depth of two, obtained by recursively splitting the leaves of $T_1(x)$. What is the optimal tree $T_2(x)$? How many parameters do you need to define $T_2(x)$, and what is the squared error of this tree approximation?
(c) Suppose now we wish to approximate $f(x)$ by a tree $T_d(x)$ with a maximum depth of $d$, obtained by recursively splitting as we did before. How many parameters do you need in total to define $T_d(x)$, and what is the squared error of this tree approximation? You do not need to write the tree explicitly.
(d) Consider now that we train a regression tree using 1,024 data samples $\{x, y\}$, where $y = f(x) = x$ and $x$ is sampled from a uniform distribution on $[0, 1]$. What is the maximum tree depth at which you can expect to see improved results? And what is the minimum mean squared error you should expect?
We should not expect any improvements once the number of parameters $P$ of the tree is larger than the number of data samples $N$, so the crossover happens at $P = N$.
The largest depth at which we see improvements will thus satisfy the equation
$$N = 3\left(2^{d_{\max}} - 1\right). \tag{3}$$
Solving for $d_{\max}$ we find
$$d_{\max} = \log_2\left(\frac{N}{3} + 1\right) \approx 8.4 \tag{4}$$
So we have more parameters than data points for the first time at $d = 9$, and that is the last depth at which we can expect any improvements.
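The arithmetic behind this depth estimate can be checked with a few lines of Python. This is a minimal sketch that takes the parameter count $3(2^d - 1)$ from equation (3) at face value; the function names are illustrative and not part of the assignment.

```python
import math


def n_parameters(depth):
    """Parameter count used in equation (3): 3 * (2**depth - 1)."""
    return 3 * (2 ** depth - 1)


def crossover_depth(n_samples):
    """Real-valued depth at which the parameter count equals n_samples."""
    return math.log2(n_samples / 3 + 1)


def first_depth_exceeding(n_samples):
    """Smallest integer depth whose parameter count exceeds n_samples."""
    depth = 1
    while n_parameters(depth) <= n_samples:
        depth += 1
    return depth


N = 1024
d_max = crossover_depth(N)          # real-valued solution of equation (3)
d_first = first_depth_exceeding(N)  # first integer depth with P > N
print(f"d_max = {d_max:.2f}, first depth with P > N: {d_first}")
```

Comparing `n_parameters(8)` and `n_parameters(9)` against `N` shows on which side of the crossover each integer depth falls.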
To compute the expected mean squared error we substitute $d_{\max}$ into the formula for the error $E_d$:
$$E_{d_{\max}} = \frac{1}{3} \, \frac{1}{2^{2(d_{\max}+1)}} \approx 7.1 \times 10^{-7} \tag{5}$$
If we train a `sklearn` regression tree on $N$ points generated according to our assumptions, that is indeed what we find.
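The closed-form value in equation (5) can also be cross-checked without training anything. The sketch below assumes that the depth-$d$ tree splits $[0, 1]$ into $2^d$ equal-width leaves, each predicting the midpoint of its interval (an assumption about the optimal tree, not a statement from the text), and integrates the squared error numerically with a midpoint rule. The helper names are illustrative.

```python
import math


def closed_form_error(depth):
    """Error formula from equation (5): (1/3) * 2**(-2 * (depth + 1))."""
    return (1.0 / 3.0) * 2.0 ** (-2.0 * (depth + 1))


def equal_width_tree_error(depth, points_per_leaf=1000):
    """Midpoint-rule estimate of E[(x - T_d(x))^2] for x uniform on [0, 1].

    Assumption: T_d has 2**depth equal-width leaves, each predicting the
    midpoint of its interval (the mean of f(x) = x on that leaf).
    """
    n_leaves = 2 ** depth
    width = 1.0 / n_leaves
    step = width / points_per_leaf
    total = 0.0
    for leaf in range(n_leaves):
        left = leaf * width
        prediction = left + width / 2.0
        for k in range(points_per_leaf):
            x = left + (k + 0.5) * step
            total += (x - prediction) ** 2 * step
    return total


for depth in range(1, 5):
    print(depth, equal_width_tree_error(depth), closed_form_error(depth))

# Equation (5) evaluated at the real-valued crossover depth for N = 1024.
error_at_d_max = closed_form_error(math.log2(1024 / 3 + 1))
print(f"{error_at_d_max:.2e}")
```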
Figure 1: Regression trees trained on f(x) = x for different tree depths.
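An experiment along these lines can be sketched as follows. The sketch assumes scikit-learn's `DecisionTreeRegressor` as the tree implementation and measures the error against the true function on a dense grid. The text does not say how the error was measured, so both choices, along with the helper names and the random seed, are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def theoretical_mse(depth):
    """Closed-form error from equation (5) for a tree of the given depth."""
    return 1.0 / (3.0 * 2.0 ** (2 * (depth + 1)))


def empirical_mse(depth, n_samples=1024, seed=0):
    """Fit a depth-limited tree to y = x and score it on a dense grid."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(0.0, 1.0, size=(n_samples, 1))
    y_train = x_train.ravel()  # y = f(x) = x, no noise

    tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
    tree.fit(x_train, y_train)

    # Dense grid as a stand-in for the expectation over uniform x.
    x_grid = np.linspace(0.0, 1.0, 100_001).reshape(-1, 1)
    residual = tree.predict(x_grid) - x_grid.ravel()
    return float(np.mean(residual ** 2))


if __name__ == "__main__":
    for depth in range(1, 13):
        print(depth, empirical_mse(depth), theoretical_mse(depth))
```

Under these assumptions the measured error should stay close to the closed-form value at small depths, where every leaf still holds many samples; how it behaves near and beyond the crossover depth is what the experiment is meant to reveal.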
2. Tree Approximation of a binary classification Boundary
In Figure 2 the shaded gray area corresponds to the positive class $y = 1$ of a binary classifier, while the unshaded area corresponds to the negative class $y = 0$. Write a decision tree with minimum depth that generates this classification boundary.
Figure 2: Decision boundary for Problem 2, drawn in the $(x_1, x_2)$ plane (figure not reproduced; the axis ticks run from 0 to 4 and from 0 to 5). Shaded area corresponds to true class $y = 1$.
3. Boosted Tree Approximation
We have $D$ input features ($x \in \mathbb{R}^D$), and we want to use a boosted regression tree to approximate a function that we know can be written as
$$f(x) = \sum_{d,d'=1}^{D} x_d A_{d,d'} x_{d'} + \sum_{d=1}^{D} g(x_d) \tag{6}$$
where A is a DD unknown matrix and g is an unknown convex function
of a single argument.
What would be the minimum depth of the trees you should use to train
on this data set?
4. Backpropagation in Dense Neural Networks
We have a dense neural network with two hidden layers defined by the following graph:
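Figure 3: Network architecture for Problem 4 (diagram not reproduced). Its nodes are the inputs $x_1, x_2$, the first hidden layer units $a^1_1, a^1_2, a^1_3$, the second hidden layer units $a^2_1, a^2_2$, and a single output unit.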
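where
- The two hidden layers have ReLU activation.
- The connection between the input and the first hidden layer is given by
$$W_0 = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \tag{7}$$
and
$$b_0 = \begin{pmatrix} 0 \\ -2 \\ 1 \end{pmatrix} \tag{8}$$
- The connection between the first and second hidden layers is given by
$$W_1 = \begin{pmatrix} 1 & 1 & 2 \\ 0 & 2 & 2 \end{pmatrix} \tag{9}$$
and
$$b_1 = \begin{pmatrix} 2 \\ -2 \end{pmatrix} \tag{10}$$
- The connection between the second hidden layer and the output unit is given by
$$W_2 = \begin{pmatrix} -1 & 2 \end{pmatrix} \tag{11}$$
and
$$b_2 = \begin{pmatrix} 6 \end{pmatrix} \tag{12}$$
- The last layer has linear activation, so that $\hat{y}$ equals the pre-activation of the output unit.
- The network performance is assessed using the squared loss
$$L(y, \hat{y}(x)) = (\hat{y}(x) - y)^2 \tag{13}$$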
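Written compactly, and assuming each layer acts on column vectors as $W a + b$ (the shapes of $W_0$, $W_1$ and $W_2$ are consistent with this reading), the bullets above describe the following forward pass. Here $a^1$ and $a^2$ denote the vectors of first and second hidden layer activations; this vector notation is introduced for illustration only.

```latex
\begin{aligned}
a^1 &= \mathrm{ReLU}(W_0\, x + b_0), & a^1 &\in \mathbb{R}^3, \\
a^2 &= \mathrm{ReLU}(W_1\, a^1 + b_1), & a^2 &\in \mathbb{R}^2, \\
\hat{y}(x) &= W_2\, a^2 + b_2, & \hat{y} &\in \mathbb{R}.
\end{aligned}
```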
Given one input sample $(x_1, x_2) = (1, -1)$ and an output label $y = 0$, compute:
(a) The first hidden layer activations $(a^1_1, a^1_2, a^1_3)$.
(b) The second hidden layer activations $(a^2_1, a^2_2)$.
(c) The output $\hat{y}(x)$.
(d) Backpropagate the errors through the network layers.
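The four steps can be traced numerically with a short NumPy sketch. It assumes the column-vector convention $W a + b$ for every layer and takes the ReLU derivative to be 0 at exactly 0 (a convention; the text does not specify one). Names such as `forward` and `backward` are illustrative.

```python
import numpy as np

# Weights and biases copied from equations (7) to (12).
W0 = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]])
b0 = np.array([0.0, -2.0, 1.0])
W1 = np.array([[1.0, 1.0, 2.0], [0.0, 2.0, 2.0]])
b1 = np.array([2.0, -2.0])
W2 = np.array([[-1.0, 2.0]])
b2 = np.array([6.0])


def relu(z):
    return np.maximum(z, 0.0)


def relu_grad(z):
    # Assumption: the derivative is taken to be 0 at z == 0.
    return (z > 0).astype(float)


def forward(x):
    """Return pre-activations and activations of every layer."""
    z1 = W0 @ x + b0
    a1 = relu(z1)
    z2 = W1 @ a1 + b1
    a2 = relu(z2)
    y_hat = W2 @ a2 + b2  # linear output unit
    return z1, a1, z2, a2, y_hat


def backward(x, y):
    """Backpropagate the squared loss (y_hat - y)**2 for one sample."""
    z1, a1, z2, a2, y_hat = forward(x)
    delta_out = 2.0 * (y_hat - y)                # dL/d(output pre-activation)
    delta2 = (W2.T @ delta_out) * relu_grad(z2)  # error at second hidden layer
    delta1 = (W1.T @ delta2) * relu_grad(z1)     # error at first hidden layer
    grads = {
        "W2": np.outer(delta_out, a2), "b2": delta_out,
        "W1": np.outer(delta2, a1), "b1": delta2,
        "W0": np.outer(delta1, x), "b0": delta1,
    }
    return y_hat, grads


x = np.array([1.0, -1.0])
y = 0.0
y_hat, grads = backward(x, y)
for name in ("W2", "b2", "W1", "b1", "W0", "b0"):
    print(name, grads[name])
```

Each entry of `grads` has the same shape as the parameter it belongs to, which gives a quick consistency check on a hand calculation.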