Question 1

When a target variable is categorical, the CART algorithm produces a __________ tree to predict the class memberships of new cases.

A) classification
B) regression
C) minimum
D) pruned

Accepted Answer

The CART (Classification and Regression Trees) algorithm produces a classification tree when the target variable is categorical, aiming to predict the class memberships of new cases.

Question 2

Which tree is the least complex and contains the smallest validation error?

A) best-pruned tree
B) full-grown tree
C) minimum error tree
D) categorical tree

Accepted Answer

The minimum error tree is the least complex tree with the smallest validation error. The validation set is data held out to evaluate each candidate tree, and the error rate on it is the percentage of validation cases the tree misclassifies. The minimum error tree is therefore the candidate that classifies the validation data most accurately.

Question 3

Based on the following sorted 20 values for age, what are the possible split points?
{20, 22, 24, 26, 28, 31, 32, 34, 35, 40, 42, 43, 45, 47, 49, 51, 52, 53, 55, 57}

A) {20, 21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

B) {21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56, 57}

C) {0, 21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

D) {21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

Accepted Answer

Possible split points are the midpoints between each pair of adjacent values in the sorted list, so 20 values produce 19 split points. The original endpoints (20 and 57) and any value outside the data range (such as 0) are not split points, which rules out options A, B, and C. The correct set is option D: {21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}.
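The midpoint rule above can be checked with a few lines of Python. This is a minimal sketch, not any particular package's implementation; the helper name `split_points` is made up for illustration, and it assumes duplicate values should be collapsed before averaging.

```python
def split_points(values):
    """Return the midpoints between adjacent distinct values, in ascending order."""
    # Sort and drop duplicates so that no midpoint is repeated.
    ordered = sorted(set(values))
    # Pair each value with its successor and average the pair.
    return [(low + high) / 2 for low, high in zip(ordered, ordered[1:])]


ages = [20, 22, 24, 26, 28, 31, 32, 34, 35, 40,
        42, 43, 45, 47, 49, 51, 52, 53, 55, 57]
age_splits = split_points(ages)
print(age_splits)
print(len(age_splits))  # 19 midpoints for 20 distinct values
```

The same helper applies unchanged to the income values used later in the deck.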

Question 4

Based on the following sorted 20 values for age, what are the possible split points?
{20, 22, 24, 26, 28, 31, 32, 34, 35, 40, 42, 43, 45, 47, 49, 51, 52, 53, 55, 57}

A) {20, 21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

B) {21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56, 57}

C) {0, 21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

D) {21, 23, 25, 27, 29.5, 31.5, 33, 34.5, 37.5, 41, 42.5, 44, 46, 48, 50, 51.5, 52.5, 54, 56}

Accepted Answer

Each possible split point is the average of two adjacent values in the sorted list, so 20 values yield 19 split points. Option D matches this definition exactly: it lists every midpoint from 21 through 56 and omits the original endpoints 20 and 57. Options A and B each add an endpoint, and option C adds 0, which lies outside the data range.

Question 5

Based on the following values for income, what are the possible split points?
{12665, 15432, 28763, 34876, 41967, 52997}

A) {14048.50, 22097.50, 31819.50, 38421.50, 47482, 52997}

B) {12665, 14048.50, 22097.50, 31819.50, 38421.50, 47482}

C) {14048.50, 22097.50, 31819.50, 38421.50, 47482}

D) {14048, 22097, 31819, 38421, 47482}

Accepted Answer

Split points are calculated as the average of each pair of adjacent values in the sorted list. The given list is already sorted, so the split points are (12665 + 15432)/2 = 14048.5, (15432 + 28763)/2 = 22097.5, (28763 + 34876)/2 = 31819.5, (34876 + 41967)/2 = 38421.5, and (41967 + 52997)/2 = 47482, which is option C.

Question 6

Based on the following values for income, what are the possible split points?
{12665, 15432, 28763, 34876, 41967, 52997}

A) {14048.50, 22097.50, 31819.50, 38421.50, 47482, 52997}

B) {12665, 14048.50, 22097.50, 31819.50, 38421.50, 47482}

C) {14048.50, 22097.50, 31819.50, 38421.50, 47482}

D) {14048, 22097, 31819, 38421, 47482}

Accepted Answer

Split points are calculated by taking the average of each pair of adjacent values in the sorted list. The correct split points are {14048.50, 22097.50, 31819.50, 38421.50, 47482}.

Question 7

If 73% of the cases belong to Class 0 and 27% belong to Class 1, what is the Gini index?

A) 0.39
B) 0
C) 0.54
D) 0.15

Accepted Answer

The Gini index is calculated as 1 - (p1^2 + p2^2), where p1 and p2 are the probabilities of the classes. Here, it is 1 - (0.73^2 + 0.27^2) = 0.39.

Question 8

If 80% of the cases belong to Class 0 and 20% belong to Class 1, what is the Gini index?

A) 0.32
B) 0
C) 0.40
D) 0.16

Accepted Answer

The Gini index is calculated as 1 - sum(p_i^2) for each class i, where p_i is the proportion of items labeled with class i. For two classes with proportions 0.8 and 0.2, the Gini index is 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 1 - 0.68 = 0.32.
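The formula used in Questions 7 and 8 can be expressed as a short function. This is a sketch for checking the arithmetic; the name `gini_index` and the sum-to-one check are illustrative choices, not part of the deck.

```python
def gini_index(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    # Guard against proportions that do not describe a full partition of the cases.
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return 1.0 - sum(p ** 2 for p in proportions)


print(round(gini_index([0.73, 0.27]), 2))  # Question 7
print(round(gini_index([0.80, 0.20]), 2))  # Question 8
```

For two classes the index ranges from 0 (all cases in one class) to 0.5 (an even split).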

Question 9

In reviewing the split of data, Maggie notes among the 13 cases, 2 belong to Class 1 and the remaining to Class 0. What is the Gini index for the cases and is it pure or impure?

A) 0.00 default because it is under 0.5 and pure.
B) 0.26 is closer to 0 implying relative purity.
C) 0.24 is at the halfway point is not considered pure.
D) 0.74 is over the 0.5 level and is impure.

Accepted Answer

The Gini index is calculated as 1 - sum(p_i^2) where p_i is the proportion of class i. For Class 1, p_1 = 2/13, and for Class 0, p_0 = 11/13. Thus, Gini index = 1 - ((2/13)^2 + (11/13)^2) = 0.26. This indicates relative purity but not absolute, as it is greater than 0 but less than 0.5.

Question 10

In reviewing the split of data, Maggie notes among the 15 cases, 2 belong to Class 1 and the remaining to Class 0. What is the Gini index for the cases and is it pure or impure?

A) 0.00 default because it is under 0.5 and pure.
B) 0.23 is closer to 0 implying relative purity.
C) 0.27 is at the halfway point is not considered pure.
D) 0.77 is over the 0.5 level and is impure.

Accepted Answer

The Gini index is calculated as 1 - sum(p_i^2) where p_i is the proportion of class i. For 2 classes with 2 in Class 1 and 13 in Class 0, the Gini index is 1 - ((2/15)^2 + (13/15)^2) = 0.23. This value is closer to 0, indicating relative purity, but not absolute purity.

Question 11

Viewing the results in the following scatterplot, for the 11 cases in the left subset (Age < 40), two belong to Class 1 and nine belong to Class 0. In the right subset (Age $\ge$ 40), three belong to Class 1 and one belongs to Class 0. What is the Gini index for each of the two subsets?

A) (Age < 40) = 0.3636; (Age $\ge$ 40) = 0.50
B) (Age < 40) = 0.20; (Age $\ge$ 40) = 0.25
C) (Age < 40) = 0.298; (Age $\ge$ 40) = 0.375
D) (Age < 40) = 0.375; (Age $\ge$ 40) = 0.298

Accepted Answer

The answer of Viewing the results in the following scatterplot,...
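The class counts stated in Question 11 are enough to work through the per-subset Gini index and the size-weighted index of the split. The following is a sketch under that reading of the question; the function names `gini_from_counts` and `weighted_gini` are hypothetical helpers, not functions from any library.

```python
def gini_from_counts(counts):
    """Gini impurity of one subset, computed from its raw class counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("subset is empty")
    return 1.0 - sum((n / total) ** 2 for n in counts)


def weighted_gini(subsets):
    """Overall Gini index of a split: each subset's Gini weighted by its share of cases."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * gini_from_counts(counts) for counts in subsets)


left = [9, 2]   # Age < 40: nine Class 0 cases, two Class 1 cases
right = [1, 3]  # Age >= 40: one Class 0 case, three Class 1 cases

print(round(gini_from_counts(left), 3))
print(round(gini_from_counts(right), 3))
print(round(weighted_gini([left, right]), 4))
```

Weighting by subset size is what lets one candidate split be compared against another: the split with the lowest weighted index leaves the purest partitions.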

Question 12

Robin wanted to know if the age partition chosen for her data was the best fit for her 30-case partition (90% Class 1, 10% Class 0). She computed the Gini impurity index, with results of (Age < 32) = 0.2034 and (Age $\ge$ 32) = 0.2786. What is the weighted combination, and what did the partition at Age 32 produce?

A) Robin was able to reduce the Gini index from 0.2786 to 0.2507, confirming the best split for age.
B) Robin was able to reduce the Gini index from 0.2786 to 0.20, confirming the best split for age.
C) Robin was able to reduce the Gini index from 0.2786 to 0.2109, confirming the best split for age.
D) Robin realized with the 0.2507 weighted average, the age split was not the best split for the age range.

Accepted Answer

The answer of Robin wanted to know if the age...

Question 13

Robin wanted to know if the age partition chosen for her data was the best fit for her 30-case partition (90% Class 1, 10% Class 0). She computed the Gini impurity index, with results of (Age < 32) = 0.2034 and (Age $\ge$ 32) = 0.2786. What is the weighted combination, and what did the partition at Age 32 produce?

A) Robin was able to reduce the Gini index from 0.2786 to 0.2507, confirming the best split for age.
B) Robin was able to reduce the Gini index from 0.2786 to 0.20, confirming the best split for age.
C) Robin was able to reduce the Gini index from 0.2786 to 0.2109, confirming the best split for age.
D) Robin realized with the 0.2507 weighted average, the age split was not the best split for the age range.

Accepted Answer

The answer of Robin wanted to know if the age...

Question 14

A split at the $32,000 Income point creates a top and bottom partition. Compute the overall (weighted) Gini index given an Income Split of $32,000.

A) MSE_split (Income = $36,000) = 0.2667
B) MSE_split (Income = $36,000) = 0.0000
C) MSE_split (Income = $36,000) = 0.4959
D) MSE_split (Income = $36,000) = 0.3637

Accepted Answer

The answer of A split at the $32,000 Income point...

Question 15

Which description best fits the following tree structure for loan debt balance with a single age predictor?

A) The split points presented represent the MSE calculated points for Age = 35.
B) The MAD of the single age predictor is $42,964 and $32,980, respectively.
C) The MSE split for Age = 35 is between the two partitions of $42,964 and $32,980, respectively.
D) The average loan debt balances of the two partitions are $42,964 and $32,980, respectively, when Age = 35.

Accepted Answer

The answer of Which description best fits the following tree...

Question 16

In R, to determine the number of splits in the default classification tree, the rpart function uses what to determine when to stop growing the tree?

A) nsplit
B) complexity parameter
C) prune
D) predict

Accepted Answer

The answer of In R, to determine the number of...

Question 17

In an R complexity parameter table, the xerror column represents:

A) the cross-validation errors associated with each candidate tree.
B) the recommended measure for the full tree.
C) the maximum error point for the first node split.
D) the root node type argument point.

Accepted Answer

The answer of In a R complexity parameter table, the...

Question 18

Using the following pruning table, what does the Rel Error represent?

A) Rel error is the calculated difference after the standard deviation is removed.
B) Rel error is the cross-validation error associated with each candidate tree.
C) Rel error is the error for predictions of the data that were used to estimate the model.
D) Rel error is the parameter associated with the candidate tree and complexity level.

Accepted Answer

The answer of Using the following pruning table, what does...

Question 19

Using the following pruning table, which tree is the minimum error tree?

A) Level 3
B) Level 2
C) Level 1
D) Additional Levels needed to identify minimum tree among candidate trees.

Accepted Answer

The answer of Using the following pruning table, which tree...

Question 20

Using the following chart for age and income, determine the split points for income.

A) {33000, 36000, 41000, 47000, 51500, 57500, 63000}
B) {33000, 41000, 47000, 51500, 63000}
C) {36000, 47000, 57500, 63000}
D) {36000, 41000, 47000, 51500, 57500}

Accepted Answer

The answer of Using the following chart for age and...

Question 21

Using the following chart for age and income, determine the split points for income.

A) {33000, 36000, 41000, 47000, 52500, 58500, 63000}
B) {33000, 41000, 47000, 52500, 63000}
C) {36000, 47000, 58500, 63000}
D) {36000, 41000, 47000, 52500, 58500}

Accepted Answer

The answer of Using the following chart for age and...

Question 22

Which is not a purpose of running classification and regression trees (CART)?

A) To remove nodes that do not produce additional information
B) To simplify and reduce complexity
C) To identify the most diverse case set for the target variable
D) To reduce the chances of overfitting

Accepted Answer

The answer of Which is not a purpose of running...

Question 23

If the RMSE for the validation set is 56.91 and the RMSE for the test set is 55.39, then what range will the new data RMSE lie in?

A) 55.39-56.91 range
B) 55-56 range
C) 55.39-57 range
D) 55-57 range

Accepted Answer

The answer of If the RMSE for the validation set...

Question 24

If the RMSE for the validation set is 58.78 and the RMSE for the test set is 57.12, then what range will the new data RMSE lie in?

A) 57.12-58.78 range
B) 57-58 range
C) 57.12-59 range
D) 57-59 range

Accepted Answer

The answer of If the RMSE for the validation set...

Question 25

A regression tree was developed to predict customer spending for a hotel during football season. One of the leaf nodes consists of six cases in the training set with the following values: 312.00, 350.00, 285.00, 295.00, 423.00, 249.00. What is the predicted spending amount on a hotel for the night for a customer that falls into this leaf node?

A) 319.00
B) 320.40
C) 322.50
D) 318.80

Accepted Answer

The answer of A regression tree was developed to predict...
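A regression tree predicts the mean of the training target values that fall into a leaf, so the six values listed in Question 25 can be averaged directly. A minimal sketch of that arithmetic, with `leaf_spending` as an illustrative variable name:

```python
from statistics import mean

# Training-set target values in the leaf node, as listed in the question.
leaf_spending = [312.00, 350.00, 285.00, 295.00, 423.00, 249.00]

# The leaf's prediction is the average of its training cases.
prediction = mean(leaf_spending)
print(f"{prediction:.2f}")
```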

Question 26

A regression tree was developed to predict customer spending for a hotel during football season. One of the leaf nodes consists of six cases in the training set with the following values: 312.00, 350.00, 285.00, 295.00, 423.00, 249.00. What is the predicted spending amount on a hotel for the night for a customer that falls into this leaf node?

A) 319.00
B) 320.40
C) 322.50
D) 318.80

Accepted Answer

The answer of A regression tree was developed to predict...

Question 27

When using the CART algorithm, the Gini index is used in the classification tree; in a regression tree, however, _____ is used to measure impurity.

A) mean percentage error
B) mean squared error
C) mean absolute deviation
D) mean absolute percentage error

Accepted Answer

The answer of When using the CART algorithm, the Gini...
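To make the regression-tree impurity measure concrete, the sketch below computes the mean squared error (MSE) of a node around its own mean, plus the size-weighted MSE of a split. It reuses the six leaf values from Question 25 purely for illustration; the two-way partition of those values is an assumption invented for this example, and `node_mse` and `weighted_mse` are hypothetical helper names.

```python
def node_mse(values):
    """MSE of a node: average squared deviation from the node's mean prediction."""
    node_mean = sum(values) / len(values)
    return sum((v - node_mean) ** 2 for v in values) / len(values)


def weighted_mse(partitions):
    """Overall MSE of a split: each partition's MSE weighted by its share of cases."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * node_mse(p) for p in partitions)


values = [312.00, 350.00, 285.00, 295.00, 423.00, 249.00]
# Assumed partition for illustration only: the three lowest vs. the three highest values.
low, high = [285.00, 295.00, 249.00], [312.00, 350.00, 423.00]

parent_mse = node_mse(values)
split_mse = weighted_mse([low, high])
print(round(parent_mse, 2), round(split_mse, 2))
```

A split is worthwhile when its weighted MSE is lower than the parent node's MSE, which mirrors how a lower weighted Gini index signals a better split in a classification tree.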

Question 28

Using the following sample of a regression prune log, the minimum error tree is decision node # 19 with a standard error of 4.689492 (not shown). Using the information provided, which decision node number represents the best-pruned tree?

A) decision node #21
B) decision node #5
C) decision node #4
D) decision node #17

Accepted Answer

The answer of Using the following sample of a regression...
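The prune log for Question 28 is not reproduced here, so the sketch below uses an entirely hypothetical log to illustrate the selection rule. It assumes the common definition: the minimum error tree has the smallest validation error, and the best-pruned tree is the smallest tree whose validation error is within one standard error of that minimum. The function name `pick_trees`, the node counts, the error values, and the standard error of 2.0 are all invented for illustration.

```python
def pick_trees(candidates, std_error):
    """candidates: list of (num_decision_nodes, validation_error) tuples.

    Returns (minimum_error_tree, best_pruned_tree) as tuples from the list.
    """
    # Minimum error tree: lowest validation error (ties go to the smaller tree).
    min_tree = min(candidates, key=lambda c: (c[1], c[0]))
    # Best-pruned tree: smallest tree within one standard error of the minimum.
    threshold = min_tree[1] + std_error
    within = [c for c in candidates if c[1] <= threshold]
    best_pruned = min(within, key=lambda c: c[0])
    return min_tree, best_pruned


# Hypothetical prune log, not the table from the question.
prune_log = [(1, 40.0), (3, 31.5), (7, 30.2), (9, 29.0), (11, 29.6)]
min_tree, best_pruned = pick_trees(prune_log, std_error=2.0)
print(min_tree, best_pruned)
```

In this made-up log the 9-node tree has the lowest error, but the 7-node tree is within one standard error of it and is simpler, so it is selected as the best-pruned tree.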

Question 29

The following table reflects a partial Analytic Solver's Performance measure for a hotel cost during an NFL game night. What is the MAD implying?

A) The predicted mean absolute deviation is 0.53 of the mean absolute percentage error.
B) The predicted cost is relatively low, providing the need for full tree.
C) The predicted average cost is less than the standard error, thus impure.
D) The predicted cost on average differs from the actual cost by $52.56.

Accepted Answer

The answer of The following table reflects a partial Analytic...

Question 30

The following table reflects a partial Analytic Solver's Performance measure for a hotel cost during an NFL game night. What is the MAD implying?

A) The predicted mean absolute deviation is 0.51 of the mean absolute percentage error.
B) The predicted cost is relatively low, providing the need for full tree.
C) The predicted average cost is less than the standard error, thus impure.
D) The predicted cost on average differs from the actual cost by $50.56.

Accepted Answer

The answer of The following table reflects a partial Analytic...

Question 31

When generating a single regression tree visually, the prp function is used. Based on the following example code, what does setting type = 1 mean?
>prp(default_tree, type = 1, extra = 1, under = TRUE)

A) type = 1 argument is the number of observations that fall into each node displayed.
B) type = 1 argument places the number of cases under each decision node in the diagram.
C) type = 1 argument allows for all nodes, except leaf nodes, to be labeled in the diagram.
D) type = 1 argument allows for the predicting variable to be displayed in root node.

Accepted Answer

The answer of When generating a single regression tree visually,...

Question 32

Using the following pruning table, which tree is the best-pruned tree?

A) Level 3
B) Level 2
C) Level 1
D) Additional Levels needed to identify best-pruned tree.

Accepted Answer

The answer of Using the following pruning table, which tree...

Question 33

Which option is not one of the three common strategies used in creating ensemble models?

A) bagging
B) boosting
C) bootstrapping
D) random forest

Accepted Answer

The answer of Which option is not one of the...

Question 34

If the performance measures are based on a cutoff value of 0.5, then if we lower the cutoff value, more cases will be in the target class, resulting in different performance measurement values. What chart can be used to review the data that are independent of the cutoff value?

A) cumulative lift chart
B) decile-wise lift chart
C) ROC curve
D) All options are independent of the cutoff value.

Accepted Answer

The answer of If the performance measures are based on...

Question 35

If predictor variables are highly correlated, then repeated sampling of the training data and a random selection of features are used to construct trees. This is an example of which strategy?

A) random forest
B) bagging
C) boosting
D) banking

Accepted Answer

The answer of If predictor variables are highly correlated, then...

Question 36

In a random forest model, the user must choose how many randomly selected features each tree uses. Following the usual guideline, if there are 196 predictor variables in the data, how many features will each tree randomly select?

A) 5
B) 18
C) 196
D) 14

Accepted Answer

The answer of In a random forest model, as a...
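The deck does not spell out the guideline it has in mind. A common rule of thumb for classification forests is to sample roughly the square root of the number of predictors, and the sketch below assumes that rule; `features_per_tree` is a hypothetical helper name.

```python
import math


def features_per_tree(num_predictors):
    """Square-root rule of thumb (assumed here) for the number of random features."""
    return round(math.sqrt(num_predictors))


print(features_per_tree(196))
```

Other guidelines exist (for example, a fraction of the predictors for regression forests), so the result depends on which rule the course material adopts.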

Question 37

In a random forest model, the user must choose how many randomly selected features each tree uses. Following the usual guideline, if there are 196 predictor variables in the data, how many features will each tree randomly select?

A) 5
B) 18
C) 196
D) 14

Accepted Answer

The answer of In a random forest model, as a...

Question 38

When constructing the argument for a bagging tree strategy, the varImpPlot function displays feature importance graphically. For this we set the type argument to either 1 or 2. If type = 2, what does this command do?

A) to show the average decrease in the predictive variable mean in a percentage form
B) that R will use the average decrease in the Gini impurity index to compare the feature importance
C) to show the feature importance as the average decrease in overall accuracy
D) that R will use the average increase in the Gini impurity index to compare feature importance

Accepted Answer

The answer of When constructing the argument for a bagging...

Question 39

Ensemble tree models combine multiple single-tree models to reduce the variation in prediction error. Of the strategies, which may lead to overfitting?

A) boosting
B) random forest
C) bagging
D) banking

Accepted Answer

The answer of Ensemble tree models combine multiple single-tree models...

Question 40

In the following tree, how many leaf nodes are there?

A) six
B) seven
C) two
D) four

Accepted Answer

The answer of In the following tree, how many leaf...

Question 41

A pure subset contains leaf nodes where cases have contradicting values to the target variable, to enhance the variable case outcomes and allow for further splits.

Accepted Answer

The answer of A pure subset contains leaf nodes where...

Question 42

Decision trees produced by the CART algorithm are binary, meaning that there are two branches for each decision node.

Accepted Answer

The answer of Decision trees produced by the CART algorithm...

Question 43

The best-pruned tree is the smallest set, least complex tree, with the smallest validation error.

Accepted Answer

The answer of The best-pruned tree is the smallest set,...

Question 44

Small changes in the training set, while using the CART algorithm, will result in drastically different trees.

Accepted Answer

The answer of Small changes in the training set, while...

Question 45

A subset has the highest degree of impurity when its cases are split 50% and 50% between the classes.

Accepted Answer

The answer of A subset with the highest degree of...

Question 46

Based on the Gini index, 0.10 implies a higher degree of purity because it is closer to 0 than 0.5.

Accepted Answer

The answer of Based on the Gini index, 0.10 implies...

Question 47

In a decision tree, the recursive process of partitions continues and only terminates when the Gini index reaches 0.5.

Accepted Answer

The answer of In a decision tree, the recursive process...

Question 48

To measure impurity in a regression tree, mean square error (MSE) is used.

Accepted Answer

The answer of To measure impurity in a regression tree,...

Question 49

The overall MSE split for Age = 24 is $21,987,111.29 and for Age = 23 is $20,983,723.40. Of the two presented, Age = 24 is slightly higher and has a lower level of impurity for constructing a regression tree.

Accepted Answer

The answer of The overall MSE split for Age =...

Question 50

The overall MSE split for Age = 25 is $22,987,111.29 and for Age = 23 is $21,983,723.40. Of the two presented, Age = 25 is slightly higher and has a lower level of impurity for constructing a regression tree.

Accepted Answer

The answer of The overall MSE split for Age =...

Question 51

Before constructing a decision tree, one of the first steps is identifying possible splits of the predictor variable.

Accepted Answer

The answer of Before constructing a decision tree, one of...

Deck 10: Supervised Data Mining: Decision Trees