Deck 2: Cloudera Certified Developer for Apache Hadoop (CCDH)

Question
For each intermediate key, each reducer task can emit:

A) As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B) As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C) As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D) One final key-value pair per value associated with the key; no restrictions on the type.
E) One final key-value pair per key; no restrictions on the type.
Answer: E
Question
For each input key-value pair, mappers can emit:

A) As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B) As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
C) One intermediate key-value pair, of a different type.
D) One intermediate key-value pair, but of the same type.
E) As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
Answer: E
Question
Identify which best defines a SequenceFile?

A) A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B) A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C) A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D) A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D
Question
Determine which best describes when the reduce method is first called in a MapReduce job?

A) Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
B) Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
C) Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D) Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
Question
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?

A) HDFS command
B) Pig LOAD command
C) Sqoop import
D) Hive LOAD DATA command
E) Ingest with Flume agents
F) Ingest with Hadoop Streaming
Question
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

A) Increase the parameter that controls minimum split size in the job configuration.
B) Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C) Set the number of mappers equal to the number of input files you want to process.
D) Write a custom FileInputFormat and override the method isSplitable to always return false.
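
For reference, the isSplitable override named in option D looks roughly like this in the old org.apache.hadoop.mapred API that this deck's other questions use (a minimal sketch; the class name is illustrative). Returning false forces one split, and therefore one map task, per file:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            // One split per file, no matter how many HDFS blocks it spans.
            return false;
        }
    }
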
Question
Which best describes how TextInputFormat processes input files and line breaks?

A) Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B) Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C) The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D) Input file splits may cross line breaks. A line that crosses file splits is ignored.
E) Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
Question
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?

A) The amount of RAM installed on the TaskTracker node.
B) The amount of free disk space on the TaskTracker node.
C) The number and speed of CPU cores on the TaskTracker node.
D) The average system load on the TaskTracker node over the past fifteen (15) minutes.
E) The location of the InputSplit to be processed in relation to the location of the node.
Question
Which process describes the lifecycle of a Mapper?

A) The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B) The TaskTracker spawns a new Mapper to process all records in a single input split.
C) The TaskTracker spawns a new Mapper to process each key-value pair.
D) The JobTracker spawns a new Mapper to process all records in a single file.
Question
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and the task that finishes first is used. This is called:

A) Combine
B) IdentityMapper
C) IdentityReducer
D) Default Partitioner
E) Speculative Execution
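
For context, MRv1 exposes this behavior through per-job configuration; a minimal sketch using the old-API JobConf (these are the MRv1 property names; both default to true):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // Enable (or disable) redundant task attempts for slow nodes.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            return conf;
        }
    }
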
Question
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

A) You will not be able to compress the intermediate data.
B) You will no longer be able to take advantage of a Combiner.
C) By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
D) There are no concerns with this approach. It is always advisable to use multiple reducers.
Question
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

A) Serialize the data file, insert it into the JobConf object, and read the data into memory in the configure method of the mapper.
B) Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C) Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D) Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
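
For context, the DistributedCache pattern the options describe looks roughly like this with the old API (a minimal sketch; the driver call, file path, and class name are illustrative). Loading in configure() happens once per task, before any map() calls:

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class LookupMapper extends MapReduceBase {
        // Driver side (illustrative path):
        //   DistributedCache.addCacheFile(new URI("/data/lookup.dat"), conf);

        @Override
        public void configure(JobConf conf) {
            try {
                // Runs once per task, so the large file is read a single time.
                Path[] cached = DistributedCache.getLocalCacheFiles(conf);
                // ... load cached[0] into an in-memory structure here ...
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
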
Question
You have the following key-value pairs as output from your Map task: (the, 1) (fox, 1) (faster, 1) (than, 1) (dog, 1) How many keys will be passed to the Reducer's reduce method?

A) Six
B) Five
C) Four
D) Two
E) One
F) Three
Question
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?

A) Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
B) Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
C) When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D) Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
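
One detail worth noting about the -libjars option: it is parsed by GenericOptionsParser, so it only takes effect when the driver runs through ToolRunner. A minimal sketch (class, JAR, and path names are illustrative):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class StatsDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // ... build and submit the job from getConf() here ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // hadoop jar stats.jar StatsDriver -libjars commons-math.jar in out
            System.exit(ToolRunner.run(new StatsDriver(), args));
        }
    }
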
Question
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

A) ResourceManager
B) NodeManager
C) ApplicationMaster
D) ApplicationMasterService
E) TaskTracker
F) JobTracker
Question
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method: output.collect(new Text("Apple"), new Text("Red")); output.collect(new Text("Banana"), new Text("Yellow")); output.collect(new Text("Apple"), new Text("Yellow")); output.collect(new Text("Cherry"), new Text("Red")); output.collect(new Text("Apple"), new Text("Green")); How many times will the Reducer's reduce method be invoked?

A) 6
B) 3
C) 1
D) 0
E) 5
Question
All keys used for intermediate output from mappers must:

A) Implement a splittable compression algorithm.
B) Be a subclass of FileInputFormat.
C) Implement WritableComparable.
D) Override isSplitable.
E) Implement a comparator for speedy sorting.
Question
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?

A) The file will be marked as corrupted if data node B fails during the creation of the file.
B) Each data node locks the local file to prohibit concurrent readers and writers of the file.
C) Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
D) The file can be accessed if at least one of the data nodes storing the file is available.
Question
What data does a Reducer's reduce method process?

A) All the data in a single input file.
B) All data produced by a single mapper.
C) All data for a given key, regardless of which mapper(s) produced it.
D) All data for a given value, regardless of which mapper(s) produced it.
Question
Given a directory of files with the following structure: line number, tab character, string. Example: 1    abialkjfjkaoasdfjksdlkjhqweroij 2    kadfjhuwqounahagtnbvaswslmnbfgy 3    kjfteiomndscxeqalkzhtopedkfsikj You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line conf.setInputFormat(____.class); ?

A) SequenceFileAsTextInputFormat
B) SequenceFileInputFormat
C) KeyValueFileInputFormat
D) BDBInputFormat
Question
In a MapReduce job with 500 map tasks, how many map task attempts will there be?

A) It depends on the number of reducers in the job.
B) Between 500 and 1000.
C) At most 500.
D) At least 500.
E) Exactly 500.
Question
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?

A) SequenceFiles
B) Avro
C) JSON
D) HTML
E) XML
F) CSV
Question
Table metadata in Hive is:

A) Stored as metadata on the NameNode.
B) Stored along with the data in HDFS.
C) Stored in the Metastore.
D) Stored in ZooKeeper.
Question
A combiner reduces:

A) The number of values across different keys in the iterator supplied to a single reduce method call.
B) The amount of intermediate data that must be transferred between the mapper and reducer.
C) The number of input files a mapper must process.
D) The number of output files a reducer must produce.
Question
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?

A) Algorithms that require applying the same mathematical function to large numbers of individual binary records.
B) Relational operations on large amounts of structured and semi-structured data.
C) Algorithms that require global, shared state.
D) Large-scale graph algorithms that require one-step link traversal.
E) Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
Question
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?

A) It returns a reference to a different Writable object each time.
B) It returns a reference to a Writable object from an object pool.
C) It returns a reference to the same Writable object each time, but populated with different data.
D) It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E) It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
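
Whatever the API guarantees, the safe pattern when buffering values in a reducer is to copy them rather than keep the iterator's references. A minimal sketch (class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;

    public class BufferValues {
        // If the framework hands back the same Writable instance on each
        // next() call, storing the reference would buffer one value many
        // times over; copying the bytes avoids that.
        static List<Text> copyAll(Iterator<Text> values) {
            List<Text> buffered = new ArrayList<Text>();
            while (values.hasNext()) {
                buffered.add(new Text(values.next())); // deep copy
            }
            return buffered;
        }
    }
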
Question
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?

A) When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value and when the reduce operation is both commutative and associative.
B) When the signature of the reduce method matches the signature of the combine method.
C) Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D) Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E) Never. Combiners and reducers must be implemented separately because they serve different purposes.
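
As a concrete illustration, an integer sum is the textbook operation that is commutative and associative with matching input and output types. A minimal old-API sketch (class name is illustrative) that could be registered via conf.setCombinerClass(SumReducer.class) as well as conf.setReducerClass(SumReducer.class):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Addition is commutative and associative, and the
            // (Text, IntWritable) input types match the output types,
            // so running this class as a combiner cannot change the sums.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
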
Question
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.

A) There is no difference in output between the two settings.
B) With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
C) With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
D) With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Question
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.

A) Health state checks (heartbeats)
B) Resource management
C) Job scheduling/monitoring
D) Job coordination between the ResourceManager and NodeManager
E) Launching tasks
F) Managing file system metadata
G) MapReduce metric reporting
H) Managing tasks
Question
Analyze each scenario below and identify which best describes the behavior of the default partitioner?

A) The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B) The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
C) The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
D) The default partitioner computes the hash of the key and takes that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
E) The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
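
For reference, Hadoop's stock HashPartitioner amounts to the following (a sketch of the old-API interface; masking with Integer.MAX_VALUE keeps the hash non-negative before the modulo):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SketchHashPartitioner<K, V> implements Partitioner<K, V> {
        public void configure(JobConf job) { }

        public int getPartition(K key, V value, int numReduceTasks) {
            // Hash the key, clear the sign bit, then mod by the reducer count.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }
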
Question
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?

A) Processor and network I/O
B) Disk I/O and network I/O
C) Processor and RAM
D) Processor and disk I/O
Question
You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?

A) Increase the block size on all current files in HDFS.
B) Increase the block size on your remaining files.
C) Decrease the block size on your remaining files.
D) Increase the amount of memory for the NameNode.
E) Increase the number of disks (or size) for the NameNode.
F) Decrease the block size on all current files in HDFS.
Question
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?

A) m × n (i.e., m multiplied by n)
B) n
C) m
D) m + n (i.e., m plus n)
E) m^n (i.e., m to the power of n)
Question
In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?

A) The values are in sorted order.
B) The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
C) The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
D) Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.
Question
Workflows expressed in Oozie can contain:

A) Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins.
B) Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
C) Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
D) Iterative repetition of MapReduce jobs until a desired answer or state is reached.
Question
Which best describes what the map method accepts and emits?

A) It accepts a single key-value pair as input and emits a single key and a list of corresponding values as output.
B) It accepts a single key-value pair as input and can emit only one key-value pair as output.
C) It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D) It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
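
To make the one-pair-in, many-pairs-out shape concrete, here is a minimal old-API sketch (class name is illustrative): each map() call receives one (byte offset, line) pair from TextInputFormat and emits zero or more (token, count) pairs.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue; // a blank line emits nothing
                word.set(token);
                output.collect(word, ONE);    // any number of pairs per input
            }
        }
    }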