You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload consisting of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue running Dataproc on preemptible VMs (with only 2 non-preemptible workers) for this workload. What should you do?
A) Increase the size of your Parquet files so that each is at least 1 GB.
B) Switch to the TFRecords format (approx. 200 MB per file) instead of Parquet files.
C) Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D) Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
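For context, the disk-oriented options (C and D) both come down to Dataproc's cluster disk configuration. Below is a minimal sketch, assuming the google-cloud-dataproc Python client, of how a cluster with SSD boot disks and a larger boot disk on the preemptible (secondary) workers might be provisioned; the project, region, machine types, worker counts, and disk sizes are illustrative placeholders, and the sketch shows the mechanics of the options rather than the answer.

```python
# Minimal sketch using the google-cloud-dataproc client. All names and sizes
# below (project, region, machine types, disk sizes) are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder project ID
region = "us-central1"      # placeholder region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "spark-shuffle-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # The 2 non-preemptible workers required by the scenario.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible (secondary) workers: SSD boot disks with an increased
        # boot disk size, as described in option D.
        "secondary_worker_config": {
            "num_instances": 8,
            "preemptibility": "PREEMPTIBLE",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready
```

The reason the options focus on disk type and size rather than only on file format is that shuffle-heavy Spark stages spill intermediate data to the workers' local disks, so local disk throughput and capacity directly affect shuffle performance.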
Correct Answer: