# Machine Learning: Homework Assignment 8

E4525 Spring 2020, IEOR, Columbia University

Due: May 4th, 2020

1. Tree Approximation of a Scalar Function

We would like to approximate the function

$$f(x) = x \tag{1}$$

by a regression tree. We assume $x$ is distributed uniformly, $p(x) = 1$ for $x \in [0, 1]$.

(a) Suppose we wish to approximate $f(x)$ by a tree $T_1(x)$ with a maximum depth of one. What is the optimal tree $T_1(x)$? How many parameters do you need to define $T_1(x)$, and what is the squared error of this tree approximation,

$$E = \mathbb{E}\left[\left(f(x) - T_1(x)\right)^2\right]? \tag{2}$$

(b) Suppose now we wish to approximate $f(x)$ by a tree $T_2(x)$ with a maximum depth of two, obtained by recursively splitting the leaves of $T_1(x)$. What is the optimal tree $T_2(x)$? How many parameters do you need to define $T_2(x)$, and what is the squared error of this tree approximation?

(c) Suppose now we wish to approximate $f(x)$ by a tree $T_d(x)$ with a maximum depth of $d$, splitting recursively as before. How many parameters do you need in total to define $T_d(x)$, and what is the squared error of this tree approximation? You do not need to write out the tree explicitly.

(d) Consider now that we train a regression tree using 1,024 data samples $\{x, y\}$, where $y = f(x) = x$ and $x$ is sampled from a uniform distribution on $[0, 1]$. What is the maximum tree depth at which you can expect to see improved results, and what is the minimum mean squared error you should expect?

We should not expect any improvement once the number of parameters in the tree exceeds the number of data samples, $P = N$. The largest depth at which we see improvements, therefore, satisfies the equation

$$N = 3\left(2^{d_{max}} - 1\right). \tag{3}$$


Solving for $d_{max}$ we find

$$d_{max} = \log_2\left(\frac{N}{3} + 1\right) \approx 8.4 \tag{4}$$

So we have more parameters than data points for the first time at $d = 9$, and that is the last depth at which we can expect any improvement.

To compute the expected mean squared error we substitute $d_{max}$ into the formula for the error of the depth-$d$ tree from part (c), $E_d = \frac{1}{3}\,2^{-2(d+1)}$:

$$E_{d_{max}} = \frac{1}{3}\,\frac{1}{2^{2(d_{max}+1)}} \approx 7.1 \times 10^{-7} \tag{5}$$

If we train an `sklearn` regression tree on $N$ points generated according to our assumptions, that is indeed what we find.

Figure 1: Regression trees trained on f(x) = x for different tree depths.
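As an empirical check on this analysis, the following is a minimal sketch of that experiment, assuming `sklearn`'s `DecisionTreeRegressor`. The exact error values depend on the random sample, but the test MSE should stop improving around depth 9, at roughly the level predicted by equation (5).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
N = 1024

# Training data: x ~ Uniform[0, 1], y = f(x) = x.
x = rng.uniform(0.0, 1.0, size=(N, 1))
y = x.ravel()

# Depth implied by the parameter-count argument, N = 3(2^d_max - 1).
d_max = np.log2(N / 3 + 1)
print(f"d_max = {d_max:.2f}")  # ~8.4, so depth 9 is the last useful depth

# Evaluate against the true function on a dense grid to approximate
# the expected squared error of each fitted tree.
x_test = np.linspace(0.0, 1.0, 100_000).reshape(-1, 1)
for depth in range(1, 13):
    tree = DecisionTreeRegressor(max_depth=depth).fit(x, y)
    mse = mean_squared_error(x_test.ravel(), tree.predict(x_test))
    print(f"depth {depth:2d}: MSE = {mse:.2e}")
```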

2. Tree Approximation of a Binary Classification Boundary

In Figure 2 the shaded gray area corresponds to the positive class $y = 1$ of a binary classifier, while the unshaded area corresponds to the negative class $y = 0$. Write a decision tree of minimum depth that generates this classification boundary.

Figure 2: Decision boundary for Problem 2, on axes $x_1$ (0 to 4) and $x_2$ (0 to 5). Shaded area corresponds to true class y = 1.

3. Boosted Tree Approximation

We have $D$ input features ($x \in \mathbb{R}^D$), and we want to use a boosted regression tree to approximate a function that we know can be written as

$$f(x) = \left( \sum_{d,d'=1}^{D} x_d A_{d,d'} x_{d'} \right)^2 + \sum_{d=1}^{D} g(x_d) \tag{6}$$

where $A$ is an unknown $D \times D$ matrix and $g$ is an unknown convex function of a single argument.

What would be the minimum depth of the trees you should use to train on this data set?
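One way to build intuition for this question is to generate synthetic data from equation (6) and compare boosted trees of different depths. The sketch below is an illustration only: the random matrix `A` and the choice $g(t) = t^2$ are assumptions standing in for the unknown quantities, and `GradientBoostingRegressor` stands in for any boosted-tree learner.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
D, N = 5, 5000

# Hypothetical stand-ins for the unknown A and convex g.
A = rng.normal(size=(D, D))
g = np.square

X = rng.uniform(-1.0, 1.0, size=(N, D))
quad = np.einsum("nd,de,ne->n", X, A, X)   # sum_{d,d'} x_d A_{d,d'} x_{d'}
y = quad**2 + g(X).sum(axis=1)             # equation (6)

X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]

# Compare boosted trees with different maximum depths.
for depth in (1, 2, 3, 4, 5):
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=500,
                                      learning_rate=0.1, random_state=0)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"max_depth {depth}: test MSE = {mse:.3f}")
```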

4. Backpropagation in Dense Neural Networks

We have a dense neural network with two hidden layers, defined by the following graph: inputs $(x_1, x_2)$, first hidden layer units $(a_{11}, a_{12}, a_{13})$, and second hidden layer units $(a_{21}, a_{22})$.

Figure 3: Network architecture for Problem 4.

where

- The two hidden layers have ReLU activations.
- The connection between the input and the first hidden layer is given by

$$W_0 = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \end{pmatrix} \tag{7}$$

and

$$b_0 = \begin{pmatrix} 0 \\ 2 \\ 1 \end{pmatrix} \tag{8}$$

- The connection between the first and second hidden layers is given by

$$W_1 = \begin{pmatrix} 1 & 1 & 2 \\ 0 & 2 & 2 \end{pmatrix} \tag{9}$$

and

$$b_1 = \begin{pmatrix} 2 \\ 2 \end{pmatrix} \tag{10}$$

- The connection between the second hidden layer and the output unit is given by

$$W_2 = \begin{pmatrix} 1 & 2 \end{pmatrix} \tag{11}$$

and

$$b_2 = \begin{pmatrix} 6 \end{pmatrix} \tag{12}$$

- The last layer has a linear activation, so that $\hat{y}(z) = z$.

The network performance is assessed using the squared loss

$$L(y, \hat{y}(x)) = (\hat{y}(x) - y)^2 \tag{13}$$

Given one input sample $(x_1, x_2) = (1, -1)$ and an output label $y = 0$, compute:

(a) The first hidden layer activations $(a_{11}, a_{12}, a_{13})$.

(b) The second hidden layer activations $(a_{21}, a_{22})$.

(c) The output $\hat{y}(x)$.

(d) Backpropagate the errors through the network layers.

(e) Compute the gradients $\frac{\partial L}{\partial W_2}$ and $\frac{\partial L}{\partial b_2}$ for the learnable parameters $W_2$, $b_2$ of the output layer.

(f) Compute the gradient with respect to the learnable parameters $W_1$, $b_1$ of the second hidden layer.

(g) Compute the gradient with respect to the learnable parameters $W_0$, $b_0$ of the first hidden layer. A numeric sketch of the full forward and backward pass follows this list.
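The following is a minimal numpy sketch of the forward and backward pass, assuming the weight, bias, and input values exactly as given in equations (7)-(12). It is intended as a check on hand computations, not as the official solution.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
relu_grad = lambda z: (z > 0.0).astype(float)

# Parameters as given in equations (7)-(12).
W0 = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]); b0 = np.array([0.0, 2.0, 1.0])
W1 = np.array([[1.0, 1.0, 2.0], [0.0, 2.0, 2.0]]);   b1 = np.array([2.0, 2.0])
W2 = np.array([[1.0, 2.0]]);                         b2 = np.array([6.0])

x = np.array([1.0, -1.0])  # input sample (x1, x2)
y = 0.0                    # target label

# Forward pass.
z1 = W0 @ x + b0; a1 = relu(z1)      # first hidden layer (a11, a12, a13)
z2 = W1 @ a1 + b1; a2 = relu(z2)     # second hidden layer (a21, a22)
y_hat = (W2 @ a2 + b2)[0]            # linear output unit
loss = (y_hat - y) ** 2

# Backward pass for the squared loss L = (y_hat - y)^2.
dy = 2.0 * (y_hat - y)                          # dL/dy_hat
dW2 = dy * a2[None, :]; db2 = np.array([dy])
d2 = (W2.T @ np.array([dy])) * relu_grad(z2)    # error at second hidden layer
dW1 = np.outer(d2, a1); db1 = d2
d1 = (W1.T @ d2) * relu_grad(z1)                # error at first hidden layer
dW0 = np.outer(d1, x); db0 = d1

print("a1 =", a1, " a2 =", a2, " y_hat =", y_hat, " loss =", loss)
print("dW2 =", dW2, " db2 =", db2)
print("dW1 =", dW1, " db1 =", db1)
print("dW0 =", dW0, " db0 =", db0)
```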
