Deck 5: Regression Analysis
1
Describe how the ordinary least squares (OLS) method minimizes the sum of squared errors.
No Answer
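As a hedged illustration of the idea this card asks about: with a single predictor, ordinary least squares chooses the intercept and slope that make the sum of squared residuals as small as possible, and that minimum has a closed form. The sketch below uses made-up data, and the names `ols_fit` and `sse` are illustrative assumptions, not part of the deck.

```python
def ols_fit(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x, y)
best = sse(x, y, b0, b1)

# Nudging either coefficient away from the OLS solution cannot lower the SSE.
nudged_slope = sse(x, y, b0, b1 + 0.1)
nudged_intercept = sse(x, y, b0 + 0.05, b1)
```

Comparing `best` with `nudged_slope` and `nudged_intercept` shows the minimizing property directly: any other line through the same data has a larger sum of squared errors.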
2
Which of the following metrics reports, on average, the absolute percentage difference between the prediction and the actual target?
A) Mean Absolute Percentage Error
B) Mean Absolute Error
C) precipProbability
D) Root Mean Squared Error
Mean Absolute Percentage Error
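The three accuracy metrics named on this card can be computed in a few lines. This is a minimal sketch using the standard textbook definitions; the function names and the sample numbers are assumptions for illustration, and the percentage version assumes no actual value is zero.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute difference as a percentage of the actual value.

    Assumes every actual value is nonzero.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical ride prices and model predictions.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

For all three metrics, a smaller value means the predictions sit closer to the actual targets.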
3
Which of the following is true of the hold-out method of model validation?
A) This procedure cannot be used in advanced analytics techniques due to its complexity.
B) Two-thirds of the data is randomly selected and removed to build the regression model.
C) This method uses the training dataset and is validated using the single selected validation set.
D) This method requires no training time and a minimum of computer processing power.
Two-thirds of the data is randomly selected and removed to build the regression model.
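A two-thirds split of this kind can be sketched as follows. This is an illustrative sketch only: `holdout_split`, the fixed seed, and the stand-in records are assumptions, and the remaining third is treated as the single validation set.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Randomly split records into a training part and a validation part."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = records[:]      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for 30 ride records.
rides = list(range(30))
train, validation = holdout_split(rides)
```

The model would be built on `train` only and then scored once on `validation`, which is why the result depends on which records happen to land in each part.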
4
In the ridesharing case study, the variable distance refers to ________.
A) the number of miles a ride covered
B) the duration of the ride
C) the hour of day extracted from the datetime
D) how good the condition is overall
5
In the context of modeling categorical values, dummy coding is ________.
A) measuring the absolute difference between the predicted and actual values in a predictive model
B) representing the difference between the observed and predicted values of a dependent variable
C) typically dividing data into ten subsets called folds
D) creating a dichotomous value to represent a variable
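To make dummy coding concrete, here is a small sketch that turns a categorical variable into 0/1 indicator columns, leaving one category out as the reference level. The helper name `dummy_code`, the `is_` column prefix, and the choice of reference level are assumptions made for this example.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference level."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Four employment categories become three dichotomous (0/1) columns.
status = ["employed", "student", "retired", "unemployed"]
coded = dummy_code(status, reference="employed")
```

A row whose indicators are all zero belongs to the reference category, so a variable with four levels needs only three dummy columns.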
6
Overfitting happens when sample characteristics are included in the regression model that can be generalized to new data.
7
Identify a difference between the model evaluation methods of hold-out validation and N-fold cross validation.
A) Unlike the N-fold cross validation method, the hold-out validation method requires high amounts of training time and computer processing power.
B) Unlike the hold-out validation method, the N-fold cross validation method randomly selects one set of two-thirds of the data to build the regression model.
C) Unlike the N-fold cross validation method, the hold-out validation method divides data into many smaller subsets.
D) Unlike the hold-out validation method, the N-fold cross validation method is less sensitive to variation in the selection of training and validation datasets.
8
An essential practice before starting with any modeling process is to first ________.
A) review and clean the dataset
B) determine the accuracy of the dataset
C) determine the target variables
D) plot the data into graphs
9
Descriptive/explanatory modeling is used
A) when the focus is limited to a single, numeric dependent variable and a single independent variable.
B) to represent explanation and association between independent and dependent variables.
C) to determine whether two or more independent variables are good predictors of the single dependent variable.
D) to predict a new observation.
10
Mean Absolute Error
A) represents the difference between the observed and predicted value of the dependent variable.
B) is the percentage absolute difference the prediction is, on average, from the actual target.
C) measures the total difference between the predicted and actual values of the model.
D) indicates how different the residuals are from zero.
11
In the ridesharing case study data dictionary, the rideshare variable refers to the name of the ridesharing service (e.g., Lyft or Uber).
12
In feature selection, ________ starts with a regression model that includes all predictors under consideration.
A) backward elimination
B) forward selection
C) overfitting
D) stepwise selection
13
In feature selection, ________ follows forward selection by adding a variable at each stage, but also includes removing variables that no longer meet the threshold.
A) hold-out validation
B) stepwise selection
C) dummy coding
D) forward selection
14
Employment status (unemployed, employed, student, retired) is an example of a dichotomous value.
15
In the ridesharing case study, the windspeed variable refers to the
A) likelihood of rain for a specific forecast period and location.
B) wind speed at the time and location of the ride.
C) wind gust measuring the increase in wind speed at the time and location of the ride.
D) air quality at the time and location of the ride.
16
A descriptive/explanatory model uses validation dataset metrics such as Mean Absolute Percentage Error and Root Mean Squared Error.
17
Identify a true statement about the N-fold cross validation method of model validation.
A) This procedure is highly sensitive to variation in the datasets.
B) This method requires minimal computer processing power and no training time.
C) This method uses randomly selected data.
D) This procedure typically uses ten data subsets.
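For reference, the mechanics of N-fold cross validation can be sketched in a few lines: the rows are shuffled, divided into folds, and each fold takes one turn as the validation set while the remaining folds form the training set. The generator name `n_fold_indices`, the seed, and the row count are assumptions for illustration; ten folds is used as the default here.

```python
import random


def n_fold_indices(n_rows, n_folds=10, seed=0):
    """Yield (training, validation) row-index lists, one pair per fold."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    # Deal the shuffled row indices into n_folds roughly equal subsets.
    folds = [order[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical dataset of 50 rows.
splits = list(n_fold_indices(50))
```

Because every row is used for validation exactly once, the averaged result depends less on any single random split than a one-time hold-out does, at the cost of fitting the model once per fold.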
18
In the ridesharing case study, the variable source refers to the ________.
A) type of rideshare service
B) date and time of the ride
C) location of the ride pickup
D) ride unique id per observation
19
In feature selection, ________ begins by creating a separate regression model for each predictor.
A) stepwise selection
B) forward selection
C) hold-out validation
D) dummy coding
20
Which of the following is true of validation data?
A) It is used as a last check that the regression model is complete.
B) It is a portion of the data that is used to build a regression model.
C) It is the portion of the data used to assess the regression model developed from the training data.
D) It provides a final estimate of the regression model's performance after it has been trained and validated.
21
Describe the backward elimination, forward selection, and stepwise selection regression models of feature selection.
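One of the three procedures, forward selection, can be sketched as a loop that fits one model per remaining predictor, adds the predictor that helps most, and stops when the improvement falls below a threshold. This is a rough sketch under stated assumptions: the stopping rule (relative drop in the sum of squared errors, `min_gain`) stands in for whatever significance threshold a given tool uses, and `fit_sse`, `forward_selection`, and the toy data are hypothetical.

```python
def fit_sse(rows, y, columns):
    """Sum of squared errors of an OLS fit with an intercept plus `columns`."""
    X = [[1.0] + [row[c] for c in columns] for row in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def forward_selection(rows, y, candidates, min_gain=0.05):
    """Add predictors one at a time while the SSE drops by at least min_gain."""
    selected = []
    current = fit_sse(rows, y, selected)  # intercept-only baseline
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # One trial model per remaining predictor.
        scores = {c: fit_sse(rows, y, selected + [c]) for c in remaining}
        best = min(scores, key=scores.get)
        if current == 0 or (current - scores[best]) / current < min_gain:
            break
        selected.append(best)
        current = scores[best]
    return selected


# Hypothetical rides: price tracks distance closely; hour carries no signal.
rows = [{"distance": d, "hour": h}
        for d, h in zip([1, 2, 3, 4, 5, 6], [8, 12, 16, 16, 12, 8])]
price = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
selected = forward_selection(rows, price, ["distance", "hour"])
```

Backward elimination would run the same comparison in reverse, starting from the model with every predictor and dropping the weakest one at each stage, while stepwise selection would add a removal check after each forward step.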
22
When a model's high accuracy on the training dataset does not carry over to predictions on new data, the phenomenon is termed ________.
A) adjacency
B) dummy coding
C) overfitting
D) multicollinearity
23
A predictive model only uses coefficient sizes, goodness of fit, and overall model fit.
24
In the N-fold cross validation model evaluation method, it is typical to use 45 folds (data subsets).
25
Explain how the quality of a predictive model is determined using metrics.
26
Feature selection is a qualitative method used to reduce the impact of dummy coding.
27
In regression analysis, the variable being predicted is referred to as the ________.
A) independent variable
B) target variable
C) predictor
D) feature
28
________ is used to determine whether two or more independent variables are good predictors of the single target variable.
A) Explanatory modeling
B) Multiple regression
C) Predictive modeling
D) The ordinary least squares method
29
Which of the following is an example of a categorical independent variable?
A) price
B) product amount
C) marital status
D) date of birth
30
The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are measures to determine the ________.
A) quality of the statistical model
B) significance
C) overall model fit
D) coefficients
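For a least-squares regression, both criteria can be computed from the sum of squared errors. The sketch below uses the common Gaussian-error form with additive constants dropped, so values are only meaningful when comparing models fitted to the same data; the function names and the two example models are assumptions.

```python
import math


def aic(sse, n, k):
    """Akaike information criterion for a least-squares fit (constants dropped).

    sse: sum of squared errors, n: number of observations,
    k: number of estimated parameters.
    """
    return n * math.log(sse / n) + 2 * k


def bic(sse, n, k):
    """Bayesian information criterion; penalizes parameters by log(n) each."""
    return n * math.log(sse / n) + k * math.log(n)


# Hypothetical comparison: model B fits slightly better but uses more parameters.
n = 100
aic_a, bic_a = aic(250.0, n, 3), bic(250.0, n, 3)
aic_b, bic_b = aic(245.0, n, 6), bic(245.0, n, 6)
```

Lower values are preferred for both criteria. In this made-up comparison the small gain in fit does not offset the extra parameters, and BIC penalizes them more heavily than AIC does.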
31
Which of the following datasets is used as an optional dataset dedicated for model validation?
A) training data
B) validation data
C) baseline data
D) test data
32
In the ridesharing case study, the variable representing the likelihood of rain for a specific forecast period and location is ________.
A) percent_Rain
B) precipProbability
C) percent_humidity
D) rainPossibility
33
Compare and contrast simple bivariate linear regression with multiple linear regression.
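As a starting point for the comparison, the two model forms can be written side by side. The notation here is a common convention rather than anything taken from the deck: $y$ is the target, $x$ or $x_1, \dots, x_p$ are the predictors, the $\beta$ terms are coefficients, and $\varepsilon$ is the error term.

```latex
% Simple (bivariate) linear regression: one predictor
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple linear regression: p predictors
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Both forms have a single numeric target; the multiple regression differs only in having more than one predictor, with each coefficient read as the effect of its predictor while the others are held fixed.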
34
With the accuracy measures of Mean Absolute Error, Mean Absolute Percentage Error, and Root Mean Squared Error, lower values indicate a better fit.
35
Which of the following is true of a predictive model?
A) The interpretability of the x and y association is critical for this model to work.
B) An entire dataset is used to build a predictive model.
C) It is prospective: its main focus is on forecasting new data records.
D) It is retrospective: its main focus is on interpreting coefficients.
36
KDnuggets identified ________ as one of the software tools most often used for data analysis.
A) RapidMiner
B) PowerBI
C) Crystal Report
D) Tableau