Deck 15: Machine Learning: Classification, Regression and Clustering  

Question
Which of the following statements a), b) or c) is false?

A) Supervised machine learning falls into two categories: classification and regression.
B) You train machine-learning models on datasets that consist of rows and columns. Each row represents a data feature. Each column represents a sample of that feature.
C) In supervised machine learning, each sample has an associated label called a target (like "spam" or "not spam" for classifying e-mails). This is the value you're trying to predict for new data that you present to your models.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) The amount of data that's available today is already enormous and continues to grow exponentially: the data produced in the world in the last few years alone equals the amount produced up to that point since the dawn of civilization.
B) People used to say, "I'm drowning in data and I don't know what to do with it." With machine learning, we now say, "Flood me with big data so I can use machine-learning technology to extract insights and make predictions from it."
C) The big data phenomenon is occurring at a time when computing power is exploding and computer memory and secondary storage are exploding in capacity while costs dramatically decline. This enables us to think differently about solution approaches.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) "Toy" datasets generally have a small number of samples with a limited number of features. In the world of big data, datasets commonly have millions and billions of samples, or even more.
B) There's an enormous number of free and open datasets available for data science studies. Libraries like scikit-learn bundle popular datasets for you to experiment with and provide mechanisms for loading datasets from various repositories (such as openml.org).
C) Governments, businesses and other organizations worldwide offer datasets on a vast range of subjects.
D) All of the above statements are true.
Question
With regard to our code that displays 24 digit images, which of the following statements a), b) or c) is false?

A) The following call to function subplots creates a 6-by-4 inch Figure (specified by the figsize=(6, 4) keyword argument) containing 24 subplots arranged in 6 rows and 4 columns:
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))
B) Each subplot has its own Axes object.
C) Function subplots returns the Axes objects in a two-dimensional NumPy array.
D) All of the above are true.
Question
Which of the following statements is false?

A) K-means clustering works through the data attempting to divide it into that many clusters.
Question
Which of the following statements about scikit-learn and the machine-learning models you'll build with it is false?

A) It's difficult to know in advance which model(s) will perform best on your data, so you typically try many models and pick the one that performs best. Scikit-learn makes this convenient for you.
B) You'll rarely get to know the details of the complex mathematical algorithms in the scikit-learn estimators, but with experience, you'll be able to intuit the best model for each new dataset.
C) It generally takes at most a few lines of code for you to create and use each scikit-learn model.
D) The models report their performance so you can compare the results and pick the model(s) with the best performance.
Question
Which of the following statements is false?

A) Classification in supervised machine learning attempts to predict the distinct class to which a sample belongs.
B) If you have images of dogs and images of cats, you can classify each image as a "dog" or a "cat." This is a binary classification problem.
C) When classifying digit images from the Digits dataset bundled with scikit-learn, our goal is to predict which digit an image represents. Since there are 10 possible digits (the classes), this is a multi-classification problem.
D) You train a classification model using unlabeled data.
Question
Which of the following statements is false?

A) Regression models predict a continuous output, such as the predicted temperature output in a weather time-series analysis.
B) The LinearRegression estimator can perform simple linear regression.
C) The LinearRegression estimator also can perform multiple linear regression.
D) The LinearRegression estimator, by default, uses all the nonnumerical features in a dataset to make more sophisticated predictions than you can with a single-feature simple linear regression.
Question
Which of the following statements is false?

A) Scikit-learn's machine-learning algorithms require samples to be stored in a one-dimensional array of floating-point values (or one-dimensional array-like collection, such as a list).
B) To represent every sample as one row, multi-dimensional data must be flattened into a one-dimensional array.
C) If you work with a dataset containing categorical features (typically represented as strings, such as 'spam' or 'not-spam'), you have to preprocess those features into numerical values.
D) Scikit-learn's sklearn.preprocessing module provides capabilities for converting categorical data to numeric data.
Question
Which of the following statements a), b) or c) is false?

A) The simplest supervised machine-learning algorithm we use is k-means clustering.
B) In k-means clustering, each cluster's centroid is the cluster's center point.
C) You'll often run multiple clustering estimators to compare their ability to divide a dataset's samples effectively into clusters.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) Scikit-learn conveniently packages the most effective machine-learning algorithms as evaluators.
B) Each scikit-learn algorithm is encapsulated, so you don't see its intricate details, including any heavy mathematics.
C) With scikit-learn and a small amount of Python code, you can create powerful models quickly for analyzing data, extracting insights from the data and making predictions.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) We can make machines learn.
B) The "secret sauce" of machine learning is data, and lots of it.
C) With machine learning, rather than programming expertise into our applications, we program them to learn from data.
D) All of the above statements are true.
Question
Which of the following statements is false?

A) Scikit-learn supports many classification algorithms, including the simplest: k-nearest neighbors (k-NN).
B) The k-nearest neighbors algorithm attempts to predict a test sample's class by looking at the k training samples that are nearest (in distance) to the test sample.
C) Always pick an even value of k for the k-nearest neighbors algorithm.
D) In the k-nearest neighbors algorithm, the class with the most "votes" wins.
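The voting idea described in this question can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: the function name knn_predict, the sample points and the labels 'a' and 'b' are all made up for the example, and Euclidean distance is an assumption.

```python
from collections import Counter
from math import dist


def knn_predict(train_samples, train_labels, test_sample, k=3):
    """Predict a class by majority vote among the k nearest training samples."""
    # Pair each training sample's distance to the test sample with its label,
    # then sort so the nearest samples come first.
    distances = sorted(
        (dist(sample, test_sample), label)
        for sample, label in zip(train_samples, train_labels)
    )
    # Count the labels of the k closest samples; the most common label wins.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: three near (1, 1), two near (5, 5).
train_samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ['a', 'a', 'a', 'b', 'b']
prediction = knn_predict(train_samples, train_labels, (1.1, 1.0), k=3)
print(prediction)
```

With a two-class problem, the number of votes each class receives depends directly on k, which is worth keeping in mind when choosing its value.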
Question
Unsupervised machine learning uses ________ algorithms.

A) classification
B) clustering
C) regression
D) None of the above
Question
Which of the following statements is false?

A) With scikit-learn, you train each model on a subset of your data, then test each model on the rest to see how well your model works.
B) Once your models are trained, you put them to work making predictions based on data they have not seen.
C) With machine learning, your computer will take on characteristics of intelligence.
D) Although you can specify parameters to customize scikit-learn models and possibly improve their performance, if you use the models' default parameters for simplicity, you'll generally obtain mediocre results.
Question
Which of the following statements is false?

A) The two main types of machine learning are supervised machine learning, which works with unlabeled data, and unsupervised machine learning, which works with labeled data.
B) If you're developing a computer vision application to recognize dogs and cats, you'll train your model on lots of dog photos labeled "dog" and cat photos labeled "cat." If your model is effective, when you put it to work processing unlabeled photos it will recognize dogs and cats it has never seen before. The more photos you train with, the greater the chance that your model will accurately predict which new photos are dogs and which are cats.
D) In this era of big data and massive, economical computer power, you should be able to build some pretty accurate machine learning models.
Question
Which of the following are not steps in a typical machine-learning case study?

A) loading the dataset and exploring the data with pandas and visualizations
B) transforming your data (converting non-numeric data to numeric data because scikit-learn requires numeric data) and splitting the data for training and testing
C) creating, training and testing the model; tuning the model, evaluating its accuracy and making predictions on live data that the model hasn't seen before.
D) All of the above are steps in a typical machine-learning case study.
Question
Which of the following statements is false?

A) Even though k-nearest neighbors is one of the most complex classification algorithms, because of its superior prediction accuracy we use it to analyze the Digits dataset bundled with scikit-learn.
B) Classification algorithms predict the discrete classes (categories) to which samples belong.
C) Binary classification uses two classes, such as "spam" or "not spam" in an e-mail classification application. Multi-classification uses more than two classes, such as the 10 classes, 0 through 9, in the Digits dataset.
D) A classification scheme looking at movie descriptions might try to classify them as "action," "adventure," "fantasy," "romance," "history" and the like.
Question
Which of the following statements is false?

A) In machine learning, a model implements a machine-learning algorithm. In scikit-learn, models are called estimators.
B) There are two parameter types in machine learning: those the estimator calculates as it learns from the data you provide and those you specify in advance when you create the scikit-learn estimator object that represents the model.
C) The machine-learning parameters the estimator calculates as it learns from the data are called hyperparameters; in the k-nearest neighbors algorithm, k is a hyperparameter.
D) For simplicity, we use scikit-learn's default hyperparameter values. In real-world machine-learning studies, you'll want to experiment with different values of k to produce the best possible models for your studies. This process is called hyperparameter tuning.
Question
Which of the following are related to compressing a dataset's large number of features down to two for visualization purposes?

A) dimensionality reduction
B) TSNE estimator
C) PCA estimator
D) All of the above.
Question
Which of the following statements a), b) or c) is false?

A) The following code uses function cross_val_score to train and test a model:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=digits.data,
    y=digits.target, cv=kfold)
B) The keyword arguments in Part (a) are:
• estimator=knn, which specifies the estimator you'd like to validate.
• X=digits.data, which specifies the samples to use for training and testing.
• y=digits.target, which specifies the targets for the samples.
• cv=kfold, which specifies the cross-validation generator that defines how to split the samples and targets for training and testing.
C) Function cross_val_score returns a single overall accuracy score for the model.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) The KNeighborsClassifier estimator (module sklearn.neighbors) implements the k-nearest neighbors algorithm.
B) The following code creates a KNeighborsClassifier estimator object:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
C) The internal details of how a KNeighborsClassifier object implements the k-nearest neighbors algorithm are hidden in the object. You simply call its methods.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) You typically train a machine-learning model with a subset of a dataset.
B) Generally, you should train your model with the smallest amount of data that makes the model perform well.
C) It's important to set aside a portion of your data for testing, so you can evaluate a model's performance using data that the model has not yet seen. Once you're confident that the model is performing well, you can use it to make predictions using new data.
D) All of the above statements are true.
Question
Which of the following statements is false?

A) By default, train_test_split reserves 75% of the data for training and 25% for testing.
B) To specify different splits, you can set the sizes of the testing and training sets with the train_test_split function's keyword arguments test_size and train_size. Use floating-point values from 0.0 through 100.0 to specify the percentages of the data to use for each.
C) You can use integer values to set the precise numbers of samples.
D) If you specify one of the keyword arguments test_size and train_size, the other is inferred. For example, the statement
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)
specifies that 20% of the data is for testing, so train_size is inferred to be 0.80.
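As a rough illustration of what a seeded, shuffled split does, here is a plain-Python sketch. It is not scikit-learn's train_test_split: the function name split_train_test, its seed parameter (standing in for random_state) and the rounding of the test-set size are assumptions made for this example.

```python
import random


def split_train_test(samples, targets, test_size=0.25, seed=None):
    """Shuffle sample indices, then carve off a fraction for testing."""
    indices = list(range(len(samples)))
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(indices)
    # Assumption: round the test-set size to the nearest whole sample.
    n_test = round(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical data: 20 samples whose targets are 10 times the sample value.
samples = list(range(20))
targets = [s * 10 for s in samples]
X_train, X_test, y_train, y_test = split_train_test(
    samples, targets, test_size=0.20, seed=11)
print(len(X_train), len(X_test))
```

Running the call again with the same seed selects the same samples for each set, which is the reproducibility idea behind a fixed seed.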
Question
Which of the following statements a), b) or c) is false?

A) The LinearRegression estimator is in the sklearn.linear_model module.
B) By default, LinearRegression uses all the numeric features in a dataset, performing a multiple linear regression.
C) Simple linear regression uses one feature as the independent variable.
D) All of the above statements are true.
Question
Scikit-learn estimators require their training and testing data to be two-dimensional arrays (or two-dimensional array-like data, such as lists of lists or pandas DataFrames). To transform a one-dimensional array into two dimensions, we call an array's ________ method.

A) transform
B) switch
C) convert
D) reshape
Question
Consider the confusion matrix for the Digits dataset's predictions:
array([[45,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 45,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 54,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 42,  0,  1,  0,  1,  0,  0],
       [ 0,  0,  0,  0, 49,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0, 38,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, 42,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 45,  0,  0],
       [ 0,  1,  1,  2,  0,  0,  0,  0, 39,  1],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  1, 41]])
Which of the following statements is false?

A) The correct predictions are shown on the diagonal from top-left to bottom-right. This is called the principal diagonal.
B) The nonzero values that are not on the principal diagonal indicate incorrect predictions (that is, misses).
C) Each row represents one distinct class, that is, one of the digits 0-9.
D) The columns within a row specify how many of the test samples were classified incorrectly into each distinct class 0-9.
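To make the matrix in this question easier to read, the following sketch tallies its cells in plain Python. The matrix values are copied from the question; the variable names are chosen for this example, and rows are assumed to be the expected digits and columns the predicted digits.

```python
# Confusion matrix from the question: rows = expected digit, columns = predicted digit.
confusion = [
    [45, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 45, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 54, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 42, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 49, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 38, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 42, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 45, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 39, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 41],
]

# Cells on the principal diagonal count the correct predictions.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Every nonzero cell off the diagonal is a group of misses.
misses = [
    (expected, predicted, count)
    for expected, row in enumerate(confusion)
    for predicted, count in enumerate(row)
    if expected != predicted and count > 0
]

print(f'{correct} of {total} correct ({accuracy:.2%})')
for expected, predicted, count in misses:
    print(f'expected {expected}, predicted {predicted}: {count}')
```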
Question
Which of the following statements is false?

A) The k in the k-nearest neighbors algorithm is a hyperparameter of the algorithm.
B) Hyperparameters are set after using the algorithm to train your model.
C) In real-world machine learning studies, you'll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions.
D) To determine the best value for k in the kNN algorithm, try different odd values of k then compare the estimator's performance with each.
Question
Which of the following statements a), b) or c) is false?

A) In real-world machine-learning applications, it can often take minutes, hours, days or even months to train your models. Special-purpose, high-performance hardware called GPUs and TPUs can significantly reduce model training time.
B) The fit method returns the estimator object.
C) For simplicity, we generally use the default estimator settings. By default, a KNeighborsClassifier looks at the three nearest neighbors to make its predictions.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) It's difficult to know in advance which machine learning model(s) will perform best for a given dataset, especially when they hide the details of how they operate from their users.
B) Even though the KNeighborsClassifier predicts digit images with a high degree of accuracy, it's possible that other scikit-learn estimators are even more accurate.
C) Scikit-learn provides many models with which you can quickly train and test your data. This encourages you to run multiple models to determine which is the best for a particular machine learning study.
D) All of the above statements are true.
Question
Which of the following statements is false?

A) Another way to check a classification estimator's accuracy is via a confusion matrix, which shows only the incorrect predicted values (also known as the misses) for a given class.
B) To create a confusion matrix, simply call the function confusion_matrix from the sklearn.metrics module, passing the expected classes and the predicted classes as arguments, as in:
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true=expected, y_pred=predicted)
C) The y_true keyword argument in Part (b) specifies the test samples' actual classes.
D) The y_pred keyword argument in Part (b) specifies the predicted classes for the test samples.
Question
Which of the following statements a), b) or c) is false?

A) You should first break your data into a training set and a testing set to prepare to train and test a model.
B) The function train_test_split from the sklearn.model_selection module simply splits in order the dataset's samples and target values into training and testing sets. This helps ensure that the training and testing sets have similar characteristics.
C) Function train_test_split provides the keyword argument random_state for reproducibility. When you run the code in the future with the same seed value, train_test_split will select the same data for the training set and the same data for the testing set. In machine-learning studies, this helps others confirm your results by working with the same randomly selected data.
D) All of the above statements are true.
Question
Which of the following statements is false?

A) With K-fold cross-validation, you use all of your data at once for training your model.
B) K-fold cross-validation splits the dataset into k equal-size folds.
C) You then repeatedly train your model with k - 1 folds and test the model with the remaining fold.
D) Consider using k = 10 with folds numbered 1 through 10. With 10 folds, we'd do 10 successive training and testing cycles:
• First, we'd train with folds 1-9, then test with fold 10.
• Next, we'd train with folds 1-8 and 10, then test with fold 9.
• Next, we'd train with folds 1-7 and 9-10, then test with fold 8.
This training and testing cycle continues until each fold has been used to test the model.
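The fold rotation described above can be sketched with plain index lists. This is an illustration only, not scikit-learn's KFold: the function name k_fold_indices is made up, there is no shuffling, and the folds are visited first to last rather than last to first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        # The current fold is held out for testing.
        test = list(range(start, start + size))
        # Every other fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical dataset of 10 samples split into 5 folds of 2 samples each.
splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(f'train={train} test={test}')
```

Each sample index lands in exactly one test fold, so every sample is used for testing once and for training in the other cycles.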
Question
Which of the following statements a), b) or c) is false?

A) Scikit-learn has separate classes for simple linear regression and multiple linear regression.
B) To find the best fitting regression line for the data in a simple linear regression, a LinearRegression estimator iteratively adjusts the slope and intercept values to minimize the sum of the squares of the data points' distances from the line.
C) Once LinearRegression is finished performing a simple linear regression, you can use the slope and intercept in the y = mx + b calculation to make predictions. The slope is stored in the estimator's coef_ attribute (m in the equation) and the intercept is stored in the estimator's intercept_ attribute (b in the equation).
D) All of the above are true.
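For intuition about the slope m and intercept b in y = mx + b, here is a plain-Python sketch that uses the standard closed-form least-squares formulas. It makes no claim about how the LinearRegression estimator computes its result internally; the function name fit_line and the data points are invented for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    # The best-fit line passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = fit_line(xs, ys)
prediction = m * 5 + b  # predict y for a new x value of 5
print(f'm={m}, b={b}, prediction={prediction}')
```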
Question
Which of the following statements a), b) or c) is false?

A) Once we've loaded our data into a KNeighborsClassifier, we can use it with the test samples to make predictions. Calling the estimator's predict method with the test samples (X_test) as an argument returns an array containing the predicted class of each sample:
predicted = knn.predict(X=X_test)
B) If predicted and expected are arrays containing the predictions and expected target values, respectively, evaluating the following code snippets in IPython interactive mode displays the predicted and expected target values for the first 20 test samples:
predicted[:20]
expected[:20]
C) If predicted and expected are arrays containing the predictions and expected target values, respectively, the following list comprehension locates all the incorrect predictions for the entire test set-that is, the cases in which the predicted and expected values do not match: wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
D) All of the above statements are true.
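The list comprehension in this question can be tried on a few made-up values. The predicted and expected lists below are hypothetical stand-ins for the arrays a real model would produce.

```python
# Hypothetical predictions and expected targets for five test samples.
predicted = [0, 1, 2, 3, 4]
expected = [0, 1, 7, 3, 9]

# Keep only the (predicted, expected) pairs that do not match.
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
print(wrong)
```

Here two of the five pairs differ, so wrong holds two tuples, each showing the predicted value followed by the expected one.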
Question
Consider the following code and output:
In [57]: for k in range(1, 20, 2):
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     knn = KNeighborsClassifier(n_neighbors=k)
    ...:     scores = cross_val_score(estimator=knn,
    ...:         X=digits.data, y=digits.target, cv=kfold)
    ...:     print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
    ...:           f'standard deviation={scores.std():.2%}')
    ...:
k=1 ; mean accuracy=98.83%; standard deviation=0.58%
k=3 ; mean accuracy=98.78%; standard deviation=0.78%
k=5 ; mean accuracy=98.72%; standard deviation=0.75%
k=7 ; mean accuracy=98.44%; standard deviation=0.96%
k=9 ; mean accuracy=98.39%; standard deviation=0.80%
k=11; mean accuracy=98.39%; standard deviation=0.80%
k=13; mean accuracy=97.89%; standard deviation=0.89%
k=15; mean accuracy=97.89%; standard deviation=1.02%
k=17; mean accuracy=97.50%; standard deviation=1.00%
k=19; mean accuracy=97.66%; standard deviation=0.96%
Which of the following statements is false?

A) The loop creates KNeighborsClassifiers with odd k values from 1 through 19 and performs k-fold cross-validation on each.
B) The k value 7 in kNN produces the most accurate predictions for the Digits dataset.
C) The accuracy tends to decrease for higher k values.
D) Compute time grows with k, because k-NN needs to perform many more calculations to find the nearest neighbors.
Question
Which of the following statements a), b) or c) is false?

A) Each estimator has a score method that returns an indication of how well the estimator performs for the test data you pass as arguments.
B) For classification estimators, the score method returns the prediction accuracy for the test data.
C) You can perform hyperparameter tuning to try to determine the optimal value for k.
D) All of the above statements are true.
Question
Which of the following statements is false?

A) Scikit-learn provides the KFold class and the cross_val_score function (both in the module sklearn.model_selection) to help you perform the training and testing cycles.
B) The following code creates a KFold object:
from sklearn.model_selection import KFold
kfold = KFold(n_folds=10, random_state=11, shuffle=True)
C) The keyword argument random_state=11 seeds the random number generator for reproducibility.
D) The keyword argument shuffle=True causes the KFold object to randomize the data by shuffling it before splitting it into folds. This is particularly important if the samples might be ordered or grouped.
Question
Which of the following statements a), b) or c) is false?

A) The following call to the KNeighborsClassifier object's fit method loads the training set's samples (X_train) and targets (y_train) into the estimator: knn.fit(X=X_train, y=y_train)
B) After the KNeighborsClassifier's fit method loads the data into the estimator, it uses that data to perform complex calculations behind the scenes that learn from the data and train the model.
C) The KNeighborsClassifier estimator is said to be lazy because its work is performed only when you use it to make predictions.
D) All of the above statements are true.
Question
The sklearn.metrics module's classification_report function produces a table of classification metrics based on the expected and predicted values, as in:
from sklearn.metrics import classification_report
names = [str(digit) for digit in digits.target_names]
print(classification_report(expected, predicted,

A) The precision column shows the total number of correct predictions for a given digit divided by the total number of predictions for that digit. You can confirm the precision by looking at each column in the confusion matrix.
B) The recall column is the total number of correct predictions for a given digit divided by the total number of samples that should have been predicted as that digit. You can confirm the recall by looking at each row in the confusion matrix.
C) The f1-score column is the average of the precision and the recall. The support column is the number of samples with a given expected value; for example, 50 samples were labeled as 4s, and 38 samples were labeled as 5s.
D) All of the above are true.
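The column and row checks mentioned in this question can be carried out by hand on the Digits confusion matrix shown earlier in this deck. The sketch below does so in plain Python; it is not the classification_report implementation, the helper names precision, recall and support are chosen for this example, and rows are assumed to be expected digits with columns as predicted digits.

```python
# Confusion matrix from the earlier question: rows = expected, columns = predicted.
confusion = [
    [45, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 45, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 54, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 42, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 49, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 38, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 42, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 45, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 39, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 41],
]


def precision(matrix, digit):
    """Correct predictions of digit divided by all predictions of digit (its column)."""
    return matrix[digit][digit] / sum(row[digit] for row in matrix)


def recall(matrix, digit):
    """Correct predictions of digit divided by all samples expected to be digit (its row)."""
    return matrix[digit][digit] / sum(matrix[digit])


def support(matrix, digit):
    """Number of samples whose expected value is digit (the row total)."""
    return sum(matrix[digit])


for digit in range(len(confusion)):
    print(f'{digit}: precision={precision(confusion, digit):.2f}, '
          f'recall={recall(confusion, digit):.2f}, '
          f'support={support(confusion, digit)}')
```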
Question
Which of the following statements is false?

A) You load the California Housing dataset using the sklearn.datasets module's fetch_california_housing function, which returns a Bunch object.
B) The Bunch object's data and target attributes are NumPy arrays containing the 20,640 samples and their target values respectively.
C) To confirm the number of samples (rows) and features (columns), look at the data array's shape attribute, which shows that there are 20,640 rows and 8 columns, as in:
In [4]: california.data.shape
Out[4]: (20640, 8)
Similarly, you can see that the number of target values (the median house values) matches the number of samples by looking at the target array's shape, as in:
In [5]: california.target.shape
Out[5]: (20640,)
D) The Bunch's features attribute contains the names that correspond to each column in the data array.
Question
The following code tests a linear regression model using the data in X_test and checks some of the predictions throughout the dataset by displaying the predicted and expected values for every ________ element:
predicted = linear_regression.predict(X_test)
expected = y_test
for p, e in zip(predicted[::5], expected[::5]):
    print(f'predicted: {p:.2f}, expected: {e:.2f}')

A) second
B) fifth
C) pth
D) eth
Question
Which of the following statements about the k-means clustering algorithm is false?

A) Each cluster of samples is grouped around a centroid-the cluster's center point.
B) Initially, the algorithm chooses k centroids at random from the dataset's samples. Then the remaining samples are placed in the cluster whose centroid is the closest.
C) The centroids are iteratively recalculated and the samples re-assigned to clusters until, for all clusters, the distances from a given centroid to the samples in its cluster are maximized.
D) The algorithm's results are a one-dimensional array of labels indicating the cluster to which each sample belongs, and a two-dimensional array of centroids representing the center of each cluster.
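The assign-and-recalculate loop that k-means performs can be sketched in plain Python. This is a simplified illustration, not scikit-learn's KMeans: the function name k_means, its seed and max_iterations parameters, the Euclidean distance and the stopping rule (stop when assignments no longer change) are all assumptions of this example.

```python
import random
from math import dist


def k_means(samples, k, seed=None, max_iterations=100):
    """Cluster samples into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Start with k centroids chosen at random from the samples.
    centroids = [tuple(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iterations):
        # Assign every sample to the cluster whose centroid is closest.
        new_labels = [
            min(range(k), key=lambda c: dist(sample, centroids[c]))
            for sample in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Recalculate each centroid as the mean of its assigned samples.
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical samples forming two well-separated groups.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(samples, k=2, seed=11)
print(labels)
print(centroids)
```

The function returns one label per sample and one centroid per cluster, mirroring the one-dimensional label array and two-dimensional centroid array mentioned in the question.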
Question
Which of the following statements a), b) or c) is false?

A) Scikit-learn provides many metrics functions for evaluating how well estimators predict results and for comparing estimators to choose the best one(s) for your particular study.
B) Scikit-learn's metrics vary by estimator type.
C) Functions confusion_matrix and classification_report (from the module sklearn.metrics) are two of many metrics functions specifically for evaluating regression estimators.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) Unsupervised machine learning and visualization can help you get to know your data by finding patterns and relationships among unlabeled samples.
B) Using Matplotlib, Seaborn and other visualization libraries, you can plot datasets with two or three variables using 2D and 3D visualizations, respectively.
C) In the Digits dataset, every sample has 64 features (and a target value), so there is no way to visualize the dataset.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) It's difficult for humans to think about data with large numbers of dimensions. This is called the curse of dimensionality.
B) If data has closely correlated features, some could be eliminated via dimensionality reduction to improve the training performance.
C) Eliminating features with dimensionality reduction improves the accuracy of the model.
D) All of the above statements are true.
Question
Consider the following code that imports pandas and sets some options:
import pandas as pd
pd.set_option('precision', 4)
pd.set_option('max_columns', 9)
pd.set_option('display.width', None)
Which of the following statements a), b) or c) about the set_option calls is false?

A) 'precision' is the maximum number of digits to display to the right of each decimal point.
B) 'max_columns' is the maximum number of columns to display when you output the DataFrame's string representation. In IPython interactive mode, by default, pandas displays all of the columns left-to-right. The 'max_columns' setting enables pandas to show all the columns using multiple rows of output.
C) 'display.width' specifies the width in characters of your Command Prompt (Windows), Terminal (macOS/Linux) or shell (Linux). The value None tells pandas to auto-detect the display width when formatting string representations of Series and DataFrames.
D) All of the above statements are true.
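As a hedged aside, the abbreviated option names in the snippet above depend on pandas matching a short key to a single option; on some pandas releases a short name such as 'precision' may match more than one option. The sketch below therefore assumes the fully qualified 'display.' names. The tiny DataFrame is hypothetical and is there only to show the formatting.

```python
import pandas as pd

# Fully qualified option names (assumed here in place of the abbreviated ones)
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', 9)
pd.set_option('display.width', None)

# Hypothetical data, used only to illustrate the display settings
df = pd.DataFrame({'value': [3.14159265, 2.71828183]})
print(df)

# The current settings can be read back with get_option
precision = pd.get_option('display.precision')
max_columns = pd.get_option('display.max_columns')
width = pd.get_option('display.width')
```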
Question
Which of the following statements is false?

A) When creating a model, a key goal is to ensure that it is capable of making accurate predictions for data it has not yet seen. Two common problems that prevent accurate predictions are overfitting and underfitting.
B) Underfitting occurs when a model is too simple to make accurate predictions, based on its training data. An example of underfitting is using a linear model, such as simple linear regression, when in fact, the problem really requires a more sophisticated non-linear model.
C) Overfitting occurs when your model is too complex. In the most extreme case of overfitting, a model memorizes its training data.
D) When you make predictions with an overfit model, the model won't know what to do with new data that matches the training data, but the model will make excellent predictions with data it has never seen.
Question
Which of the following statements a), b) or c) is false?

A) In big data, samples can have hundreds, thousands or even millions of features.
B) To visualize a dataset with many features (that is, many dimensions), you must first reduce the data to two or three dimensions. This requires a supervised machine learning technique called dimensionality reduction.
C) When you graph the resulting data after dimensionality reduction, you might see patterns in the data that will help you choose the most appropriate machine learning algorithms to use. For example, if the visualization contains clusters of points, it might indicate that there are distinct classes of information within the dataset.
D) All of the above statements are true.
Question
In the context of the California Housing dataset, which of the following statements is false?

A) The following code creates a LinearRegression estimator and invokes its fit method to train the estimator using X_train (the samples) and y_train (the targets):
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)
B) Multiple linear regression produces separate coefficients for each feature (stored in coeff_) in the dataset and one intercept (stored in intercept_).
C) For positive coefficients, the median house value increases as the feature value increases. For negative coefficients, the median house value decreases as the feature value decreases.
D) You can use the coefficient and intercept values with the following equation to make predictions: $y = m_1x_1 + m_2x_2 + \dots + m_nx_n + b$
where
• $m_1, m_2, \dots, m_n$ are the feature coefficients
• $b$ is the intercept
• $x_1, x_2, \dots, x_n$ are the feature values (that is, the values of the independent variables)
• $y$ is the predicted value (that is, the dependent variable)
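The prediction equation in the last option can be sketched in a few lines of plain Python. The function name predict_value and all numbers below are hypothetical; they are not taken from a trained California Housing model.

```python
def predict_value(coefficients, intercept, features):
    """Return m1*x1 + m2*x2 + ... + mn*xn + b for one sample."""
    if len(coefficients) != len(features):
        raise ValueError('need exactly one coefficient per feature')
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Hypothetical coefficients, intercept and feature values
coefficients = [0.5, -2.0, 1.0]
intercept = 4.0
sample = [2.0, 1.0, 3.0]

y = predict_value(coefficients, intercept, sample)
print(y)  # 0.5*2.0 + (-2.0)*1.0 + 1.0*3.0 + 4.0
```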
Question
Which of the following statements is false?

A) k-means clustering is perhaps the simplest unsupervised machine learning algorithm.
B) The k-means clustering algorithm analyzes unlabeled samples and attempts to place them in clusters that appear to be related.
C) The k in "k-means" represents the number of clusters to impose on the data.
D) The k-means clustering algorithm organizes samples into the number of clusters you specify in advance, using distance calculations similar to the k-nearest neighbors clustering algorithm.
Question
Which of the following statements a), b) or c) is false?

A) The following code tests a model by calling the estimator's predict method with the test samples as an argument:
predicted = linear_regression.predict(X_test)
B) Assuming the array expected contains the expected values for the samples used to make predictions in Part (a)'s snippet, evaluating the following snippets displays the first five predictions and their corresponding expected values:
In [32]: predicted[:5]
Out[32]: array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])
In [33]: expected[:5]
Out[33]: array([0.762, 1.732, 1.125, 1.37 , 1.856])
C) With classification, we saw that the predictions were distinct classes that matched existing classes in the dataset. With regression, it's tough to get exact predictions, because you have continuous outputs. Every possible value of $x_1, x_2, \dots, x_n$ in the calculation $y = m_1x_1 + m_2x_2 + \dots + m_nx_n + b$ predicts a different value.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) By default, a LinearRegression estimator uses all the features in the dataset's data array to perform a multiple linear regression.
B) An error occurs if any of the features passed to a LinearRegression estimator for training are categorical rather than numeric. If a dataset contains categorical data, you must exclude the categorical features from the training process.
C) A benefit of working with scikit-learn's bundled datasets is that they're already in the correct format for machine learning using scikit-learn's models.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) Another common metric for regression models is the mean squared error, which
• calculates the difference between each expected and predicted value (this is called the error),
• squares each difference and
• calculates the average of the squared values.
B) To calculate a regression estimator's mean squared error, call function mean_squared_error (from module sklearn.metrics) with the arrays representing the expected and predicted results, as in:
In [46]: metrics.mean_squared_error(expected, predicted)
Out[46]: 0.5350149774449119
C) When comparing estimators with the mean squared error metric, the one with the value closest to 1 best fits your data.
D) All of the above statements are true.
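The three steps listed in option A can be followed by hand in plain Python. This is a minimal sketch; the function name mse_by_hand and the sample values are hypothetical and unrelated to the output shown in option B.

```python
def mse_by_hand(expected, predicted):
    """Mean squared error computed step by step."""
    # Step 1: difference between each expected and predicted value (the error)
    errors = [e - p for e, p in zip(expected, predicted)]
    # Step 2: square each difference
    squared = [err ** 2 for err in errors]
    # Step 3: average the squared values
    return sum(squared) / len(squared)

# Hypothetical values for illustration only
expected = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]

mse = mse_by_hand(expected, predicted)
print(mse)  # (0.25 + 0.0 + 1.0) / 3
```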
Question
Which of the following statements a), b) or c) is false?

A) Among the many metrics for regression estimators is the model's coefficient of determination, which is also called the R² score.
B) To calculate an estimator's R² score, use the sklearn.metrics module's r2_score function with the arrays representing the expected and predicted results, as in:
In [44]: from sklearn import metrics
In [45]: metrics.r2_score(expected, predicted)
Out[45]: 0.6008983115964333
C) R² scores range from 0.0 to 1.0 with 1.0 being the best. An R² score of 1.0 indicates that the estimator perfectly predicts the independent variable's value, given the dependent variable(s) value(s). An R² score of 0.0 indicates the model cannot make predictions with any accuracy, based on the independent variables' values.
D) All of the above statements are true.
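For intuition about what the coefficient of determination measures, here is a hedged plain-Python sketch of the usual formula, one minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the values are hypothetical; this is not scikit-learn's implementation.

```python
def r_squared(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_expected = sum(expected) / len(expected)
    # Squared differences between expected and predicted values
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    # Squared differences between expected values and their mean
    ss_tot = sum((e - mean_expected) ** 2 for e in expected)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only
expected = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]

score = r_squared(expected, predicted)
perfect = r_squared(expected, expected)
print(score, perfect)
```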
Question
Which of the following statements a), b) or c) is false?

A) Dimensionality reduction in scikit-learn typically involves two steps-training the estimator with the dataset, then using the estimator to transform the data into the specified number of dimensions.
B) The steps mentioned in Part (a) can be performed separately with the TSNE methods fit and transform, or they can be performed in one statement using the fit_transform method, as in:
In [5]: reduced_data = tsne.fit_transform(digits.data)
C) TSNE's fit_transform method takes some time to train the estimator then perform the reduction. When the method completes its task, it returns an array with the same number of rows as digits.data, but only the number of columns specified by the n_components argument when you created the estimator object. You can confirm this by checking reduced_data's shape.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) The California Housing dataset (bundled with scikit-learn) has 20,640 samples, each with eight numerical features.
B) The LinearRegression estimator performs multiple linear regression by default using all of a dataset's numeric features.
C) You should expect more meaningful results from simple linear regression than from multiple linear regression on the dataset.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) It's helpful to visualize your data by plotting the target value against each feature-in the case of the California Housing Prices dataset, to see how the median home value relates to each feature.
B) DataFrame method sample can randomly select a percentage of a DataFrame's data (specified by the keyword argument frac), as in:
sample_df = california_df.sample(frac=0.1, random_state=17)
C) The keyword argument random_state in Part (b)'s snippet enables you to seed the random number generator. Each time you use the same seed value, method sample selects a similar random subset of the DataFrame's rows.
D) All of the above statements are true.
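The effect of frac and random_state can be explored on a small stand-in DataFrame. This is only a sketch: the 100-row frame below is hypothetical and replaces california_df so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for california_df
df = pd.DataFrame({'feature': range(100), 'target': range(100, 200)})

# Draw 10% of the rows twice with the same seed
first = df.sample(frac=0.1, random_state=17)
second = df.sample(frac=0.1, random_state=17)

print(len(first))            # number of rows selected
print(first.equals(second))  # whether both draws picked the same rows
```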
Question
Which of the following statements a), b) or c) is false?

A) We can use a TSNE estimator (from the sklearn.manifold module) to perform dimensionality reduction. This estimator analyzes a dataset's features and reduces them to the specified number of dimensions.
B) The following code creates a TSNE object for reducing a dataset's features to two dimensions, as specified by the keyword argument n_components:
In [3]: from sklearn.manifold import TSNE
In [4]: tsne = TSNE(n_components=2, random_state=11)
C) When using TSNE on the Digits dataset bundled with scikit-learn, the TSNE estimator's random_state keyword argument in Part (b) ensures the reproducibility of the "render sequence" when we display the digit clusters, for example.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) The Iris dataset bundled with scikit-learn is commonly analyzed with both classification and clustering.
B) Although the Iris dataset is labeled, we can ignore those labels to demonstrate clustering. Then, we can use the labels to determine how well the k-means algorithm clusters the samples.
C) The Iris dataset is referred to as a "toy dataset" because it has only 150 samples and four features. The dataset describes 50 samples for each of three Iris flower species-Iris setosa, Iris versicolor and Iris virginica.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) Because the Iris dataset is labeled, we can look at its target array values to get a sense of how well the k-means algorithm clustered the samples for the three Iris species.
B) In the Iris dataset, the first 50 samples are Iris setosa, the next 50 are Iris versicolor, and the last 50 are Iris virginica.
C) If the KMeans estimator chose the Iris dataset clusters perfectly, then each group of 50 elements in the estimator's labels_ array should have mostly the same label.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) We train the KMeans estimator by calling the object's fit method-this performs the k-means algorithm.
B) As with the other estimators, the fit method returns the estimator object.
C) When the training completes, the KMeans object contains a labels_ array with values from 0 to n_clusters - 1 (in the Iris dataset example, 0-2), indicating the clusters to which the samples belong, and a cluster_centers_ array in which each row represents a cluster.
D) All of the above statements are true.
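A minimal sketch of the training step described in this card, using the Iris dataset bundled with scikit-learn. The n_init=10 argument is an assumption added here to pin down the number of initializations; it does not appear in the deck.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# n_init=10 is an assumption added for this sketch
kmeans = KMeans(n_clusters=3, n_init=10, random_state=11)
fitted = kmeans.fit(iris.data)

print(fitted is kmeans)                   # what fit returns
print(sorted(set(kmeans.labels_)))        # distinct cluster labels
print(kmeans.cluster_centers_.shape)      # one row per cluster
```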
Question
Which of the following statements is false?

A) Each centroid in the KMeans object's cluster_centers_ array has the same number of features as the original dataset (four in the case of the Iris dataset).
B) To plot the centroids in two dimensions, you must reduce their dimensions.
C) You can think of a centroid as the "median" sample in its cluster.
D) Each centroid should be transformed using the same PCA estimator used to reduce the other samples in that cluster.
Question
Which of the following statements a), b) or c) is false?

A) One way to learn more about your data is to see how the features relate to one another.
B) The samples in the Iris dataset each have four features.
C) We cannot graph one feature against the other three in a single graph. But we can plot pairs of features against one another in a pairplot.
D) All of the above statements are true.
Question
Which of the following statements a), b) or c) is false?

A) The PCA estimator (from the sklearn.decomposition module), like TSNE, performs dimensionality reduction. The PCA estimator uses an algorithm called principal component analysis to analyze a dataset's features and reduce them to the specified number of dimensions.
B) Like TSNE, a PCA estimator uses the keyword argument n_components to specify the number of dimensions, as in:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)
C) The following snippet trains the PCA estimator and produces the reduced data by calling the PCA estimator's fit and transform methods:
pca.fit(iris.data)
iris_pca = pca.transform(iris.data)
D) All of the above statements are true.
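The snippets in this card can be combined into one self-contained sketch. It assumes the Iris dataset is loaded with load_iris from sklearn.datasets, which the card itself does not show.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()  # assumed loading step, not shown in the card

pca = PCA(n_components=2, random_state=11)
pca.fit(iris.data)                    # train the estimator
iris_pca = pca.transform(iris.data)   # reduce to n_components dimensions

print(iris.data.shape)  # original samples and features
print(iris_pca.shape)   # same rows, reduced columns
```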
Question
Which of the following statements a), b) or c) is false?

A) We can use k-means clustering via scikit-learn's KMeans estimator (from the sklearn.cluster module) to place each sample in a dataset into a cluster. The KMeans estimator hides from you the algorithm's complex mathematical details, making it straightforward to use.
B) The following code creates a KMeans object:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)
C) The keyword argument n_clusters specifies the k-means clustering algorithm's hyperparameter k (in this case, 3), which KMeans requires to calculate the clusters and label each sample. The default value for n_clusters is 8.
D) All of the above statements are true.
1
Which of the following statements a), b) or c) is false?

A) Supervised machine learning falls into two categories-classification and regression.
B) You train machine-learning models on datasets that consist of rows and columns. Each row represents a data feature. Each column represents a sample of that feature.
C) In supervised machine learning, each sample has an associated label called a target (like "spam" or "not spam" for classifying e-mails). This is the value you're trying to predict for new data that you present to your models.
D) All of the above statements are true.
B
2
Which of the following statements a), b) or c) is false?

A) The amount of data that's available today is already enormous and continues to grow exponentially-the data produced in the world in the last few years alone equals the amount produced up to that point since the dawn of civilization.
B) People used to say "I'm drowning in data and I don't know what to do with it." With machine learning, we now say, "Flood me with big data so I can use machine-learning technology to extract insights and make predictions from it."
C) The big data phenomenon is occurring at a time when computing power is exploding and computer memory and secondary storage are exploding in capacity while costs dramatically decline. This enables us to think differently about solution approaches.
D) All of the above statements are true.
D
3
Which of the following statements a), b) or c) is false?

A) "Toy" datasets generally have a small number of samples with a limited number of features. In the world of big data, datasets commonly have millions and billions of samples, or even more.
B) There's an enormous number of free and open datasets available for data science studies. Libraries like scikit-learn bundle popular datasets for you to experiment with and provide mechanisms for loading datasets from various repositories (such as openml.org).
C) Governments, businesses and other organizations worldwide offer datasets on a vast range of subjects.
D) All of the above statements are true.
D
4
With regard to our code that displays 24 digit images, which of the following statements a), b) or c) is false?

A) The following call to function subplots creates a 6-by-4 inch Figure (specified by the figsize=(6, 4) keyword argument) containing 24 subplots arranged in 6 rows and 4 columns:
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))
B) Each subplot has its own Axes object.
C) Function subplots returns the Axes objects in a two-dimensional NumPy array.
D) All of the above are true.
5
Which of the following statements is false?

A) K-means clustering works through the data attempting to divide it into that many clusters.
6
Which of the following statements about scikit-learn and the machine-learning models you'll build with it is false?

A) It's difficult to know in advance which model(s) will perform best on your data, so you typically try many models and pick the one that performs best-scikit-learn makes this convenient for you.
B) You'll rarely get to know the details of the complex mathematical algorithms in the scikit-learn estimators, but with experience, you'll be able to intuit the best model for each new dataset.
C) It generally takes at most a few lines of code for you to create and use each scikit-learn model.
D) The models report their performance so you can compare the results and pick the model(s) with the best performance.
7
Which of the following statements is false?

A) Classification in supervised machine learning attempts to predict the distinct class to which a sample belongs.
B) If you have images of dogs and images of cats, you can classify each image as a "dog" or a "cat." This is a binary classification problem.
C) When classifying digit images from the Digits dataset bundled with scikit-learn, our goal is to predict which digit an image represents. Since there are 10 possible digits (the classes), this is a multi-classification problem.
D) You train a classification model using unlabeled data.
8
Which of the following statements is false?

A) Regression models predict a continuous output, such as the predicted temperature output in a weather time-series analysis.
B) The LinearRegression estimator can perform simple linear regression.
C) The LinearRegression estimator also can perform multiple linear regression.
D) The LinearRegression estimator, by default, uses all the nonnumerical features in a dataset to make more sophisticated predictions than you can with a single-feature simple linear regression.
9
Which of the following statements is false?

A) Scikit-learn's machine-learning algorithms require samples to be stored in a one-dimensional array of floating-point values (or one-dimensional array-like collection, such as a list).
B) To represent every sample as one row, multi-dimensional data must be flattened into a one-dimensional array.
C) If you work with a dataset containing categorical features (typically represented as strings, such as 'spam' or 'not-spam'), you have to preprocess those features into numerical values.
D) Scikit-learn's sklearn.preprocessing module provides capabilities for converting categorical data to numeric data.
10
Which of the following statements a), b) or c) is false?

A) The simplest supervised machine-learning algorithm we use is k-means clustering.
B) In k-means clustering, each cluster's centroid is the cluster's center point.
C) You'll often run multiple clustering estimators to compare their ability to divide a dataset's samples effectively into clusters.
D) All of the above statements are true.
11
Which of the following statements a), b) or c) is false?

A) Scikit-learn conveniently packages the most effective machine-learning algorithms as evaluators.
B) Each scikit-learn algorithm is encapsulated, so you don't see its intricate details, including any heavy mathematics.
C) With scikit-learn and a small amount of Python code, you can create powerful models quickly for analyzing data, extracting insights from the data and making predictions.
D) All of the above statements are true.
12
Which of the following statements a), b) or c) is false?

A) We can make machines learn.
B) The "secret sauce" of machine learning is data-and lots of it.
C) With machine learning, rather than programming expertise into our applications, we program them to learn from data.
D) All of the above statements are true.
13
Which of the following statements is false?

A) Scikit-learn supports many classification algorithms, including the simplest-k-nearest neighbors (k-NN).
B) The k-nearest neighbors algorithm attempts to predict a test sample's class by looking at the k training samples that are nearest (in distance) to the test sample.
C) Always pick an even value of k for the k-nearest neighbors algorithm.
D) In the k-nearest neighbors algorithm, the class with the most "votes" wins.
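The neighbor lookup and voting described in this card can be illustrated with a small plain-Python sketch. The function knn_predict, the two-feature samples and the labels are all hypothetical; this is a teaching sketch of the idea, not scikit-learn's KNeighborsClassifier.

```python
from collections import Counter
from math import dist

def knn_predict(train_samples, train_labels, test_sample, k):
    """Predict a class by majority vote among the k nearest training samples."""
    # Pair each training sample's distance to the test sample with its label
    distances = sorted(
        (dist(sample, test_sample), label)
        for sample, label in zip(train_samples, train_labels)
    )
    # Count the labels of the k closest samples; the most common label wins
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data with two classes
train_samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ['a', 'a', 'a', 'b', 'b']

near_a = knn_predict(train_samples, train_labels, (1.1, 1.0), k=3)
near_b = knn_predict(train_samples, train_labels, (5.1, 5.0), k=3)
print(near_a, near_b)
```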
14
Unsupervised machine learning uses ________ algorithms.

A) classification
B) clustering
C) regression
D) None of the above
15
Which of the following statements is false?

A) With scikit-learn, you train each model on a subset of your data, then test each model on the rest to see how well your model works.
B) Once your models are trained, you put them to work making predictions based on data they have not seen.
C) With machine learning, your computer will take on characteristics of intelligence.
D) Although you can specify parameters to customize scikit-learn models and possibly improve their performance, if you use the models' default parameters for simplicity, you'll generally obtain mediocre results.
16
Which of the following statements is false?

A) The two main types of machine learning are supervised machine learning, which works with unlabeled data, and unsupervised machine learning, which works with labeled data.
B) If you're developing a computer vision application to recognize dogs and cats, you'll train your model on lots of dog photos labeled "dog" and cat photos labeled "cat." If your model is effective, when you put it to work processing unlabeled photos it will recognize dogs and cats it has never seen before. The more photos you train with, the greater the chance that your model will accurately predict which new photos are dogs and which are cats.
D) In this era of big data and massive, economical computer power, you should be able to build some pretty accurate machine learning models.
17
Which of the following are not steps in a typical machine-learning case study?

A) loading the dataset and exploring the data with pandas and visualizations
B) transforming your data (converting non-numeric data to numeric data because scikit-learn requires numeric data) and splitting the data for training and testing
C) creating, training and testing the model; tuning the model, evaluating its accuracy and making predictions on live data that the model hasn't seen before.
D) All of the above are steps in a typical machine-learning case study.
18
Which of the following statements is false?

A) Even though k-nearest neighbors is one of the most complex classification algorithms, because of its superior prediction accuracy we use it to analyze the Digits dataset bundled with scikit-learn.
B) Classification algorithms predict the discrete classes (categories) to which samples belong.
C) Binary classification uses two classes, such as "spam" or "not spam" in an e-mail classification application. Multi-classification uses more than two classes, such as the 10 classes, 0 through 9, in the Digits dataset.
D) A classification scheme looking at movie descriptions might try to classify them as "action," "adventure," "fantasy," "romance," "history" and the like.
19
Which of the following statements is false?

A) In machine learning, a model implements a machine-learning algorithm. In scikit-learn, models are called estimators.
B) There are two parameter types in machine learning-those the estimator calculates as it learns from the data you provide and those you specify in advance when you create the scikit-learn estimator object that represents the model.
C) The machine-learning parameters the estimator calculates as it learns from the data are called hyperparameters-in the k-nearest neighbors algorithm, k is a hyperparameter.
D) For simplicity, we use scikit-learn's default hyperparameter values. In real-world machine-learning studies, you'll want to experiment with different values of k to produce the best possible models for your studies-this process is called hyperparameter tuning.
20
Which of the following are related to compressing a dataset's large number of features down to two for visualization purposes?

A) dimensionality reduction
B) TSNE estimator
C) PCA estimator
D) All of the above.
21
Which of the following statements a), b) or c) is false?

A) The following code uses function cross_val_score to train and test a model:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=digits.data,
    y=digits.target, cv=kfold)
B) The keyword arguments in Part (a) are:
• estimator=knn, which specifies the estimator you'd like to validate.
• X=digits.data, which specifies the samples to use for training and testing.
• y=digits.target, which specifies the targets for the samples.
• cv=kfold, which specifies the cross-validation generator that defines how to split the samples and targets for training and testing.
C) Function cross_val_score returns a single overall accuracy score for the model.
D) All of the above statements are true.
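A self-contained sketch of the call discussed in this card. The card does not show how knn and kfold are created, so the KNeighborsClassifier with default settings and the 10-fold KFold object below are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Assumed setup: default k-NN estimator and a shuffled 10-fold splitter
knn = KNeighborsClassifier()
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

scores = cross_val_score(estimator=knn, X=digits.data,
                         y=digits.target, cv=kfold)

print(scores.shape)              # shape of the returned array
print(f'{scores.mean():.2%}')    # mean of the returned scores
```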
22
Which of the following statements a), b) or c) is false?

A) The KNeighborsClassifier estimator (module sklearn.neighbors) implements the k-nearest neighbors algorithm.
B) The following code creates a KNeighborsClassifier estimator object:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
C) The internal details of how a KNeighborsClassifier object implements the k-nearest neighbors algorithm are hidden in the object. You simply call its methods.
D) All of the above statements are true.
23
Which of the following statements a), b) or c) is false?

A) You typically train a machine-learning model with a subset of a dataset.
B) Generally, you should train your model with the smallest amount of data that makes the model perform well.
C) It's important to set aside a portion of your data for testing, so you can evaluate a model's performance using data that the model has not yet seen. Once you're confident that the model is performing well, you can use it to make predictions using new data.
D) All of the above statements are true.
24
Which of the following statements is false?

A) By default, train_test_split reserves 75% of the data for training and 25% for testing.
B) To specify different splits, you can set the sizes of the testing and training sets with the train_test_split function's keyword arguments test_size and train_size. Use floating-point values from 0.0 through 100.0 to specify the percentages of the data to use for each.
C) You can use integer values to set the precise numbers of samples.
D) If you specify one of the keyword arguments test_size and train_size, the other is inferred; for example, the statement
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)
specifies that 20% of the data is for testing, so train_size is inferred to be 0.80.
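The split in the last option can be tried directly on the Digits dataset. This sketch assumes the dataset is loaded with load_digits from sklearn.datasets, a step the card does not show.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()  # assumed loading step

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11, test_size=0.20)

# Every sample ends up in exactly one of the two sets
print(X_train.shape, X_test.shape)
test_fraction = len(X_test) / len(digits.data)
print(f'{test_fraction:.2f}')
```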
25
Which of the following statements a), b) or c) is false?

A) The LinearRegression estimator is in the sklearn.linear_model module.
B) By default, LinearRegression uses all the numeric features in a dataset, performing a multiple linear regression.
C) Simple linear regression uses one feature as the independent variable.
D) All of the above statements are true.
26
Scikit-learn estimators require their training and testing data to be two-dimensional arrays (or two-dimensional array-like data, such as lists of lists or pandas DataFrames). To transform a one-dimensional array into two dimensions, we call an array's ________ method.

A) transform
B) switch
C) convert
D) reshape
27
Consider the confusion matrix for the Digits dataset's predictions: array([[45, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 45, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 54, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 42, 0, 1, 0, 1, 0, 0],
[ 0, 0, 0, 0, 49, 0, 0, 1, 0, 0],
[ 0, 0, 0, 0, 0, 38, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 42, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 45, 0, 0],
[ 0, 1, 1, 2, 0, 0, 0, 0, 39, 1],
[ 0, 0, 0, 0, 1, 0, 0, 0, 1, 41]])
Which of the following statements is false?

A) The correct predictions are shown on the diagonal from top-left to bottom-right-this is called the principal diagonal.
B) The nonzero values that are not on the principal diagonal indicate incorrect predictions (that is, misses).
C) Each row represents one distinct class-that is, one of the digits 0-9.
D) The columns within a row specify how many of the test samples were classified incorrectly into each distinct class 0-9.
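To make the matrix easier to read, here is a small pure-Python sketch that tallies the principal diagonal and the off-diagonal entries of the matrix shown above. The names confusion, total, correct and misses are chosen here for illustration.

```python
# The confusion matrix from the question, one row per expected digit 0-9
confusion = [
    [45, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 45, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 54, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 42, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 49, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 38, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 42, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 45, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 39, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 41],
]

total = sum(sum(row) for row in confusion)                      # all test samples
correct = sum(confusion[i][i] for i in range(len(confusion)))   # principal diagonal
misses = total - correct                                        # off-diagonal entries

print(f'samples: {total}, correct: {correct}, misses: {misses}')
print(f'accuracy: {correct / total:.2%}')

# Row 8 shows how the samples whose expected class is 8 were predicted
print(confusion[8])
```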
28
Which of the following statements is false?

A) The k in the k-nearest neighbors algorithm is a hyperparameter of the algorithm.
B) Hyperparameters are set after using the algorithm to train your model.
C) In real-world machine learning studies, you'll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions.
D) To determine the best value for k in the kNN algorithm, try different odd values of k then compare the estimator's performance with each.
29
Which of the following statements a), b) or c) is false?

A) In real-world machine-learning applications, it can often take minutes, hours, days or even months to train your models-special-purpose, high-performance hardware called GPUs and TPUs can significantly reduce model training time.
B) The fit method returns the estimator object.
C) For simplicity, we generally use the default estimator settings-by default, a KNeighborsClassifier looks at the three nearest neighbors to make its predictions.
D) All of the above statements are true.
30
Which of the following statements a), b) or c) is false?

A) It's difficult to know in advance which machine learning model(s) will perform best for a given dataset, especially when they hide the details of how they operate from their users.
B) Even though the KNeighborsClassifier predicts digit images with a high degree of accuracy, it's possible that other scikit-learn estimators are even more accurate.
C) Scikit-learn provides many models with which you can quickly train and test your data. This encourages you to run multiple models to determine which is the best for a particular machine learning study.
D) All of the above statements are true.
31
Which of the following statements is false?

A) Another way to check a classification estimator's accuracy is via a confusion matrix, which shows only the incorrect predicted values (also known as the misses) for a given class.
B) To create a confusion matrix, simply call the function confusion_matrix from the sklearn.metrics module, passing the expected classes and the predicted classes as arguments, as in: from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true=expected, y_pred=predicted)
C) The y_true keyword argument in Part (b) specifies the test samples' actual classes.
D) The y_pred keyword argument in Part (b) specifies the predicted classes for the test samples.
32
Which of the following statements a), b) or c) is false?

A) You should first break your data into a training set and a testing set to prepare to train and test a model.
B) The function train_test_split from the sklearn.model_selection module simply splits in order the dataset's samples and target values into training and testing sets. This helps ensure that the training and testing sets have similar characteristics.
C) Function train_test_split provides the keyword argument random_state for reproducibility. When you run the code in the future with the same seed value, train_test_split will select the same data for the training set and the same data for the testing set. In machine-learning studies, this helps others confirm your results by working with the same randomly selected data.
D) All of the above statements are true.
33
Which of the following statements is false?

A) With K-fold cross-validation, you use all of your data at once for training your model.
B) K-fold cross-validation splits the dataset into k equal-size folds.
C) You then repeatedly train your model with k - 1 folds and test the model with the remaining fold.
D) Consider using k = 10 with folds numbered 1 through 10. With 10 folds, we'd do 10 successive training and testing cycles:
- First, we'd train with folds 1-9, then test with fold 10.
- Next, we'd train with folds 1-8 and 10, then test with fold 9.
- Next, we'd train with folds 1-7 and 9-10, then test with fold 8.
This training and testing cycle continues until each fold has been used to test the model.
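The training and testing cycles can be sketched in plain Python. This is an illustrative helper, not scikit-learn's KFold: the function name kfold_indices is hypothetical, it does no shuffling, and it tests the first fold first rather than the last.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most 1
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]               # the one fold held out for testing
        train = indices[:start] + indices[stop:]  # the other k - 1 folds
        yield train, test
        start = stop


splits = list(kfold_indices(n_samples=20, k=10))
for train, test in splits[:3]:
    print(f'test fold: {test}, training samples: {len(train)}')
```

Every sample lands in exactly one test fold, so each sample is used for testing once and for training k - 1 times.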
34
Which of the following statements a), b) or c) is false?

A) Scikit-learn has separate classes for simple linear regression and multiple linear regression.
B) To find the best fitting regression line for the data in a simple linear regression, a LinearRegression estimator iteratively adjusts the slope and intercept values to minimize the sum of the squares of the data points' distances from the line.
C) Once LinearRegression is finished performing a simple linear regression, you can use the slope and intercept in the y = mx + b calculation to make predictions. The slope is stored in the estimator's coef_ attribute (m in the equation) and the intercept is stored in the estimator's intercept_ attribute (b in the equation).
D) All of the above are true.
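For intuition about the least-squares line and the y = mx + b calculation, here is a small sketch that computes the slope and intercept with the standard closed-form ordinary least squares formulas. It is not scikit-learn's implementation; the function name fit_line and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
             sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical feature values
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical targets lying exactly on y = 2x + 1
m, b = fit_line(xs, ys)
prediction = m * 5.0 + b   # y = mx + b for a new x value
print(f'slope={m:.2f}, intercept={b:.2f}, prediction for x=5: {prediction:.2f}')
```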
35
Which of the following statements a), b) or c) is false?

A) Once we've loaded our data into a KNeighborsClassifier, we can use it with the test samples to make predictions. Calling the estimator's predict method with the test samples (X_test) as an argument returns an array containing the predicted class of each sample: predicted = knn.predict(X=X_test)
B) If predicted and expected are arrays containing the predictions and expected target values, respectively, evaluating the following code snippets in IPython interactive mode displays the predicted and expected target values for the first 20 test samples: predicted[:20]
expected[:20]
C) If predicted and expected are arrays containing the predictions and expected target values, respectively, the following list comprehension locates all the incorrect predictions for the entire test set-that is, the cases in which the predicted and expected values do not match: wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
D) All of the above statements are true.
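The list comprehension in Part (c) can be tried on small stand-in lists. The values below are made up for illustration and are not output from a trained estimator.

```python
# Hypothetical predictions and expected targets for six test samples
predicted = [0, 4, 9, 3, 5, 8]
expected = [0, 4, 7, 3, 3, 8]

# Keep only the (predicted, expected) pairs that disagree
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]

accuracy = (len(expected) - len(wrong)) / len(expected)
print(wrong)
print(f'accuracy: {accuracy:.2%}')
```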
36
Consider the following code and output:
In [57]: for k in range(1, 20, 2):
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     knn = KNeighborsClassifier(n_neighbors=k)
    ...:     scores = cross_val_score(estimator=knn,
    ...:         X=digits.data, y=digits.target, cv=kfold)
    ...:     print(f'k={k:<2}; mean accuracy={scores.mean():.2%}; ' +
    ...:           f'standard deviation={scores.std():.2%}')
    ...:
k=1 ; mean accuracy=98.83%; standard deviation=0.58%
k=3 ; mean accuracy=98.78%; standard deviation=0.78%
k=5 ; mean accuracy=98.72%; standard deviation=0.75%
k=7 ; mean accuracy=98.44%; standard deviation=0.96%
k=9 ; mean accuracy=98.39%; standard deviation=0.80%
k=11; mean accuracy=98.39%; standard deviation=0.80%
k=13; mean accuracy=97.89%; standard deviation=0.89%
k=15; mean accuracy=97.89%; standard deviation=1.02%
k=17; mean accuracy=97.50%; standard deviation=1.00%
k=19; mean accuracy=97.66%; standard deviation=0.96%
Which of the following statements is false?

A) The loop creates KNeighborsClassifiers with odd k values from 1 through 19 and performs k-fold cross-validation on each.
B) The k value 7 in kNN produces the most accurate predictions for the Digits dataset.
C) The accuracy tends to decrease for higher k values.
D) Compute time grows with k, because k-NN needs to perform many more calculations to find the nearest neighbors.
37
Which of the following statements a), b) or c) is false?

A) Each estimator has a score method that returns an indication of how well the estimator performs for the test data you pass as arguments.
B) For classification estimators, the score method returns the prediction accuracy for the test data.
C) You can perform hyperparameter tuning to try to determine the optimal value for k.
D) All of the above statements are true.
38
Which of the following statements is false?

A) Scikit-learn provides the KFold class and the cross_val_score function (both in the module sklearn.model_selection) to help you perform the training and testing cycles.
B) The following code creates a KFold object: from sklearn.model_selection import KFold
kfold = KFold(n_folds=10, random_state=11, shuffle=True)
C) The keyword argument random_state=11 seeds the random number generator for reproducibility.
D) The keyword argument shuffle=True causes the KFold object to randomize the data by shuffling it before splitting it into folds. This is particularly important if the samples might be ordered or grouped.
39
Which of the following statements a), b) or c) is false?

A) The following call to the KNeighborsClassifier object's fit method loads the training set's samples (X_train) and targets (y_train) into the estimator: knn.fit(X=X_train, y=y_train)
B) After the KNeighborsClassifier's fit method loads the data into the estimator, it uses that data to perform complex calculations behind the scenes that learn from the data and train the model.
C) The KNeighborsClassifier estimator is said to be lazy because its work is performed only when you use it to make predictions.
D) All of the above statements are true.
40
The sklearn.metrics module's classification_report function produces a table of classification metrics based on the expected and predicted values, as in: from sklearn.metrics import classification_report
names = [str(digit) for digit in digits.target_names]
print(classification_report(expected, predicted,

A) The precision column shows the total number of correct predictions for a given digit divided by the total number of predictions for that digit. You can confirm the precision by looking at each column in the confusion matrix.
B) The recall column is the total number of correct predictions for a given digit divided by the total number of samples that should have been predicted as that digit. You can confirm the recall by looking at each row in the confusion matrix.
C) The f1-score column is the average of the precision and the recall. The support column is the number of samples with a given expected value-for example, 50 samples were labeled as 4s, and 38 samples were labeled as 5s.
D) All of the above are true.
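The column and row checks described in Parts (a) and (b) can be sketched by hand on a small confusion matrix. The 3-class matrix below is hypothetical, and the f1 value is computed as the harmonic mean of precision and recall, 2PR / (P + R), which is how scikit-learn defines the f1-score.

```python
# Hypothetical 3-class confusion matrix: rows are expected classes,
# columns are predicted classes
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

cls = 0  # the class to inspect
true_positives = confusion[cls][cls]
predicted_as_cls = sum(row[cls] for row in confusion)  # column total
expected_as_cls = sum(confusion[cls])                  # row total (the support)

precision = true_positives / predicted_as_cls
recall = true_positives / expected_as_cls
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean

print(f'precision={precision:.2f} recall={recall:.2f} '
      f'f1={f1:.2f} support={expected_as_cls}')
```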
41
Which of the following statements is false?

A) You load the California Housing dataset using the sklearn.datasets module's fetch_california_housing function, which returns a Bunch object.
B) The Bunch object's data and target attributes are NumPy arrays containing the 20,640 samples and their target values, respectively.
C) To confirm the number of samples (rows) and features (columns), look at the data array's shape attribute, which shows that there are 20,640 rows and 8 columns, as in: In [4]: california.data.shape
Out[4]: (20640, 8)
Similarly, you can see that the number of target values-the median house values-matches the number of samples by looking at the target array's shape, as in:
In [5]: california.target.shape
Out[5]: (20640,)
D) The Bunch's features attribute contains the names that correspond to each column in the data array.
42
The following code tests a linear regression model using the data in X_test and checks some of the predictions throughout the dataset by displaying the predicted and expected values for every ________ element: predicted = linear_regression.predict(X_test)
expected = y_test
for p, e in zip(predicted[::5], expected[::5]):
    print(f'predicted: {p:.2f}, expected: {e:.2f}')

A) second
B) fifth
C) pth
D) eth
43
Which of the following statements about the k-means clustering algorithm is false?

A) Each cluster of samples is grouped around a centroid-the cluster's center point.
B) Initially, the algorithm chooses k centroids at random from the dataset's samples. Then the remaining samples are placed in the cluster whose centroid is the closest.
C) The centroids are iteratively recalculated and the samples re-assigned to clusters until, for all clusters, the distances from a given centroid to the samples in its cluster are maximized.
D) The algorithm's results are a one-dimensional array of labels indicating the cluster to which each sample belongs, and a two-dimensional array of centroids representing the center of each cluster.
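For intuition, the assign-and-recalculate loop can be sketched in plain Python. This is a simplified illustration rather than scikit-learn's KMeans: the function name kmeans is hypothetical, the sample points are made up, and the first k points are used as the initial centroids (instead of a random choice) so the run is deterministic.

```python
import math


def kmeans(points, k, max_iters=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its cluster's points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = new_centroids
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # one cluster label per sample
print(centroids)  # one center point per cluster
```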
44
Which of the following statements a), b) or c) is false?

A) Scikit-learn provides many metrics functions for evaluating how well estimators predict results and for comparing estimators to choose the best one(s) for your particular study.
B) Scikit-learn's metrics vary by estimator type.
C) Functions confusion_matrix and classification_report (from the module sklearn.metrics) are two of many metrics functions specifically for evaluating regression estimators.
D) All of the above statements are true.
45
Which of the following statements a), b) or c) is false?

A) Unsupervised machine learning and visualization can help you get to know your data by finding patterns and relationships among unlabeled samples.
B) Using Matplotlib, Seaborn and other visualization libraries, you can plot datasets with two or three variables using 2D and 3D visualizations, respectively.
C) In the Digits dataset, every sample has 64 features (and a target value), so there is no way to visualize the dataset.
D) All of the above statements are true.
46
Which of the following statements a), b) or c) is false?

A) It's difficult for humans to think about data with large numbers of dimensions. This is called the curse of dimensionality.
B) If data has closely correlated features, some could be eliminated via dimensionality reduction to improve the training performance.
C) Eliminating features with dimensionality reduction improves the accuracy of the model.
D) All of the above statements are true.
47
Consider the following code that imports pandas and sets some options: import pandas as pd
pd.set_option('precision', 4)
pd.set_option('max_columns', 9)
pd.set_option('display.width', None)
Which of the following statements a), b) or c) about the set_option calls is false?

A) 'precision' is the maximum number of digits to display to the right of each decimal point.
B) 'max_columns' is the maximum number of columns to display when you output the DataFrame's string representation. In IPython interactive mode, by default, pandas displays all of the columns left-to-right. The 'max_columns' setting enables pandas to show all the columns using multiple rows of output.
C) 'display.width' specifies the width in characters of your Command Prompt (Windows), Terminal (macOS/Linux) or shell (Linux). The value None tells pandas to auto-detect the display width when formatting string representations of Series and DataFrames.
D) All of the above statements are true.
48
Which of the following statements is false?

A) When creating a model, a key goal is to ensure that it is capable of making accurate predictions for data it has not yet seen. Two common problems that prevent accurate predictions are overfitting and underfitting.
B) Underfitting occurs when a model is too simple to make accurate predictions, based on its training data. An example of underfitting is using a linear model, such as simple linear regression, when in fact, the problem really requires a more sophisticated non-linear model.
C) Overfitting occurs when your model is too complex. In the most extreme case of overfitting, a model memorizes its training data.
D) When you make predictions with an overfit model, the model won't know what to do with new data that matches the training data, but the model will make excellent predictions with data it has never seen.
49
Which of the following statements a), b) or c) is false?

A) In big data, samples can have hundreds, thousands or even millions of features.
B) To visualize a dataset with many features (that is, many dimensions), you must first reduce the data to two or three dimensions. This requires a supervised machine learning technique called dimensionality reduction.
C) When you graph the resulting data after dimensionality reduction, you might see patterns in the data that will help you choose the most appropriate machine learning algorithms to use. For example, if the visualization contains clusters of points, it might indicate that there are distinct classes of information within the dataset.
D) All of the above statements are true.
50
In the context of the California Housing dataset, which of the following statements is false?

A) The following code creates a LinearRegression estimator and invokes its fit method to train the estimator using X_train (the samples) and y_train (the targets): from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)
B) Multiple linear regression produces separate coefficients for each feature (stored in coef_) in the dataset and one intercept (stored in intercept_).
C) For positive coefficients, the median house value increases as the feature value increases. For negative coefficients, the median house value decreases as the feature value decreases.
D) You can use the coefficient and intercept values with the following equation to make predictions: y = m1x1 + m2x2 + … + mnxn + b
where
- m1, m2, …, mn are the feature coefficients
- b is the intercept
- x1, x2, …, xn are the feature values (that is, the values of the independent variables)
- y is the predicted value (that is, the dependent variable)
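The equation in Part (d) can be sketched as a short function. The coefficient, intercept and feature values below are made up for illustration and are not taken from a model trained on the California Housing dataset.

```python
def predict(coefficients, intercept, features):
    """Compute y = m1*x1 + m2*x2 + ... + mn*xn + b for one sample."""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept


coefficients = [0.5, -0.2, 1.0]  # hypothetical m1, m2, m3
intercept = 2.0                  # hypothetical b
features = [4.0, 10.0, 1.5]      # hypothetical x1, x2, x3

y = predict(coefficients, intercept, features)
print(f'predicted value: {y:.2f}')
```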
51
Which of the following statements is false?

A) k-means clustering is perhaps the simplest unsupervised machine learning algorithm.
B) The k-means clustering algorithm analyzes unlabeled samples and attempts to place them in clusters that appear to be related.
C) The k in "k-means" represents the number of clusters to impose on the data.
D) The k-means clustering algorithm organizes samples into the number of clusters you specify in advance, using distance calculations similar to the k-nearest neighbors clustering algorithm.
52
Which of the following statements a), b) or c) is false?

A) The following code tests a model by calling the estimator's predict method with the test samples as an argument: predicted = linear_regression.predict(X_test)
B) Assuming the array expected contains the expected values for the samples used to make predictions in Part (a)'s snippet, evaluating the following snippets displays the first five predictions and their corresponding expected values: In [32]: predicted[:5]
Out[32]: array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])
In [33]: expected[:5]
Out[33]: array([0.762, 1.732, 1.125, 1.37 , 1.856])
C) With classification, we saw that the predictions were distinct classes that matched existing classes in the dataset. With regression, it's tough to get exact predictions, because you have continuous outputs. Every possible value of x1, x2 … xn in the calculation y = m1x1 + m2x2 + … + mnxn + b
predicts a different value.
D) All of the above statements are true.
53
Which of the following statements a), b) or c) is false?

A) By default, a LinearRegression estimator uses all the features in the dataset's data array to perform a multiple linear regression.
B) An error occurs if any of the features passed to a LinearRegression estimator for training are categorical rather than numeric. If a dataset contains categorical data, you must exclude the categorical features from the training process.
C) A benefit of working with scikit-learn's bundled datasets is that they're already in the correct format for machine learning using scikit-learn's models.
D) All of the above statements are true.
54
Which of the following statements a), b) or c) is false?

A) Another common metric for regression models is the mean squared error, which
- calculates the difference between each expected and predicted value-this is called the error,
- squares each difference and
- calculates the average of the squared values.
B) To calculate a regression estimator's mean squared error, call function mean_squared_error (from module sklearn.metrics) with the arrays representing the expected and predicted results, as in: In [46]: metrics.mean_squared_error(expected, predicted)
Out[46]: 0.5350149774449119
C) When comparing estimators with the mean squared error metric, the one with the value closest to 1 best fits your data.
D) All of the above statements are true.
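The three steps listed in Part (a) can be sketched in plain Python. The expected and predicted values below are made up for illustration, and the helper name mean_squared_error_manual is hypothetical (chosen to avoid confusion with scikit-learn's function).

```python
def mean_squared_error_manual(expected, predicted):
    """Average of the squared differences between expected and predicted values."""
    errors = [e - p for e, p in zip(expected, predicted)]  # step 1: the errors
    squared = [err ** 2 for err in errors]                 # step 2: square each one
    return sum(squared) / len(squared)                     # step 3: average them


expected = [3.0, 5.0, 2.0]   # hypothetical target values
predicted = [2.5, 5.0, 4.0]  # hypothetical predictions

mse = mean_squared_error_manual(expected, predicted)
print(f'mean squared error: {mse:.4f}')
```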
55
Which of the following statements a), b) or c) is false?

A) Among the many metrics for regression estimators is the model's coefficient of determination, which is also called the R2 score.
B) To calculate an estimator's R2 score, use the sklearn.metrics module's r2_score function with the arrays representing the expected and predicted results, as in: In [44]: from sklearn import metrics
In [45]: metrics.r2_score(expected, predicted)
Out[45]: 0.6008983115964333
C) R2 scores range from 0.0 to 1.0 with 1.0 being the best. An R2 score of 1.0 indicates that the estimator perfectly predicts the independent variable's value, given the dependent variable(s) value(s). An R2 score of 0.0 indicates the model cannot make predictions with any accuracy, based on the independent variables' values.
D) All of the above statements are true.
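As a rough sketch of what the R2 score measures, the following computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data and the helper name r2_manual are made up for illustration.

```python
def r2_manual(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_expected = sum(expected) / len(expected)
    # Squared errors of the model's predictions
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    # Squared errors of always predicting the mean of the expected values
    ss_tot = sum((e - mean_expected) ** 2 for e in expected)
    return 1 - ss_res / ss_tot


expected = [1.0, 2.0, 3.0, 4.0]   # hypothetical target values
predicted = [1.1, 1.9, 3.2, 3.8]  # hypothetical predictions

r2 = r2_manual(expected, predicted)
print(f'R2 score: {r2:.4f}')
```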
56
Which of the following statements a), b) or c) is false?

A) Dimensionality reduction in scikit-learn typically involves two steps-training the estimator with the dataset, then using the estimator to transform the data into the specified number of dimensions.
B) The steps mentioned in Part (a) can be performed separately with the TSNE methods fit and transform, or they can be performed in one statement using the fit_transform method, as in: In [5]: reduced_data = tsne.fit_transform(digits.data)
C) TSNE's fit_transform method takes some time to train the estimator then perform the reduction. When the method completes its task, it returns an array with the same number of rows as digits.data, but only the number of columns specified by the n_components argument when you created the estimator object. You can confirm this by checking reduced_data's shape.
D) All of the above statements are true.
57
Which of the following statements a), b) or c) is false?

A) The California Housing dataset (bundled with scikit-learn) has 20,640 samples, each with eight numerical features.
B) The LinearRegression estimator performs multiple linear regression by default using all of a dataset's numeric features.
C) You should expect more meaningful results from simple linear regression than from multiple linear regression on the dataset.
D) All of the above statements are true.
58
Which of the following statements a), b) or c) is false?

A) It's helpful to visualize your data by plotting the target value against each feature-in the case of the California Housing Prices dataset, to see how the median home value relates to each feature.
B) DataFrame method sample can randomly select a percentage of a DataFrame's data (specified by keyword argument frac), as in: sample_df = california_df.sample(frac=0.1, random_state=17)
C) The keyword argument random_state in Part (b)'s snippet enables you to seed the random number generator. Each time you use the same seed value, method sample selects a similar random subset of the DataFrame's rows.
D) All of the above statements are true.
59
Which of the following statements a), b) or c) is false?

A) We can use a TSNE estimator (from the sklearn.manifold module) to perform dimensionality reduction. This estimator analyzes a dataset's features and reduces them to the specified number of dimensions.
B) The following code creates a TSNE object for reducing a dataset's features to two dimensions, as specified by the keyword argument n_components: In [3]: from sklearn.manifold import TSNE
In [4]: tsne = TSNE(n_components=2, random_state=11)
C) When using TSNE on the Digits dataset bundled with scikit-learn, the TSNE estimator's random_state keyword argument in Part (b) ensures the reproducibility of the "render sequence" when we display the digit clusters, for example.
D) All of the above statements are true.
60
Which of the following statements a), b) or c) is false?

A) The Iris dataset bundled with scikit-learn is commonly analyzed with both classification and clustering.
B) Although the Iris dataset is labeled, we can ignore those labels to demonstrate clustering. Then, we can use the labels to determine how well the k-means algorithm clusters the samples.
C) The Iris dataset is referred to as a "toy dataset" because it has only 150 samples and four features. The dataset describes 50 samples for each of three Iris flower species-Iris setosa, Iris versicolor and Iris virginica.
D) All of the above statements are true.
61
Which of the following statements a), b) or c) is false?

A) Because the Iris dataset is labeled, we can look at its target array values to get a sense of how well the k-means algorithm clustered the samples for the three Iris species.
B) In the Iris dataset, the first 50 samples are Iris setosa, the next 50 are Iris versicolor, and the last 50 are Iris virginica.
C) If the KMeans estimator chose the Iris dataset clusters perfectly, then each group of 50 elements in the estimator's labels_ array should have mostly the same label.
D) All of the above statements are true.
62
Which of the following statements a), b) or c) is false?

A) We train the KMeans estimator by calling the object's fit method-this performs the k-means algorithm.
B) As with the other estimators, the fit method returns the estimator object.
C) When the training completes, the KMeans object contains a labels_ array with values from 0 to n_clusters - 1 (in the Iris dataset example, 0-2), indicating the clusters to which the samples belong, and a cluster_centers_ array in which each row represents a cluster.
D) All of the above statements are true.
63
Which of the following statements is false?

A) Each centroid in the KMeans object's cluster_centers_ array has the same number of features as the original dataset (four in the case of the Iris dataset). b To plot the centroids in two-dimensions, you must reduce their dimensions.
C) You can think of a centroid as the "median" sample in its cluster.
D) Each centroid should be transformed using the same PCA estimator used to reduce the other samples in that cluster.
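A sketch of projecting the centroids into two dimensions with the same fitted PCA estimator that reduces the samples. The variable names and n_init=10 are assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Cluster in the original four-dimensional feature space.
kmeans = KMeans(n_clusters=3, random_state=11, n_init=10)
kmeans.fit(iris.data)

# Fit PCA on the samples and reduce them to two dimensions.
pca = PCA(n_components=2, random_state=11)
iris_pca = pca.fit_transform(iris.data)

# Reuse the SAME fitted estimator so centroids and samples
# land in the same two-dimensional coordinate system.
centers_2d = pca.transform(kmeans.cluster_centers_)

print(kmeans.cluster_centers_.shape)
print(centers_2d.shape)
```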
64
Which of the following statements a), b) or c) is false?

A) One way to learn more about your data is to see how the features relate to one another.
B) The samples in the Iris dataset each have four features.
C) We cannot graph one feature against the other three in a single graph. But we can plot pairs of features against one another in a pairplot.
D) All of the above statements are true.
65
Which of the following statements a), b) or c) is false?

A) The PCA estimator (from the sklearn.decomposition module), like TSNE, performs dimensionality reduction. The PCA estimator uses an algorithm called principal component analysis to analyze a dataset's features and reduce them to the specified number of dimensions.
B) Like TSNE, a PCA estimator uses the keyword argument n_components to specify the number of dimensions, as in:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)
C) The following snippet trains the PCA estimator and produces the reduced data by calling the PCA estimator's fit and transform methods:
pca.fit(iris.data)
iris_pca = pca.transform(iris.data)
D) All of the above statements are true.
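The snippets from this card can be combined into one runnable sketch. Loading the data with load_iris and printing explained_variance_ratio_ are additions made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Reduce the four Iris features to two dimensions.
pca = PCA(n_components=2, random_state=11)
pca.fit(iris.data)
iris_pca = pca.transform(iris.data)

# Same number of samples, fewer columns.
print(iris.data.shape, iris_pca.shape)

# Fraction of the dataset's variance kept by each component.
print(pca.explained_variance_ratio_)
```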
66
Which of the following statements a), b) or c) is false?

A) We can use k-means clustering via scikit-learn's KMeans estimator (from the sklearn.cluster module) to place each sample in a dataset into a cluster. The KMeans estimator hides from you the algorithm's complex mathematical details, making it straightforward to use.
B) The following code creates a KMeans object:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)
C) The keyword argument n_clusters specifies the k-means clustering algorithm's hyperparameter k (in this case, 3), which KMeans requires to calculate the clusters and label each sample. The default value for n_clusters is 8.
D) All of the above statements are true.
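The n_clusters setting can be read back from the estimator, which makes the default easy to check. A minimal sketch; default_kmeans is a name introduced here.

```python
from sklearn.cluster import KMeans

# An estimator created without arguments uses the default n_clusters.
default_kmeans = KMeans()
print(default_kmeans.n_clusters)

# An estimator configured as in the card.
kmeans = KMeans(n_clusters=3, random_state=11)
print(kmeans.get_params()["n_clusters"])
```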