Consider a multi-core processor with 64 cores where first shared cache is at level L3 (L1 and L2 are private to each core). Suppose an application needs to compute the sum of all the nodes in a perfectly balanced binary tree T (A tree where every node has either two or zero children and all the leaves are at the same depth/level). Assuming that an add operation takes 10 units of time, the total sum can be computed sequentially, in time 10*n units, ignoring the time to load and traverse links in the tree (assume they are factored in the add). Here n is the number of nodes in T. One way to compute the sum in a parallel manner is to have two arrays of length 64: (a) an input array having the roots of 64 leaf subtrees (b) an output array that holds the partial sums computed for the 64 leaf subtrees for each core. Both the arrays are indexed by the id (0 to 63) of the core that is responsible for that entry. Furthermore, assume the following:
The input arrays are already filled in with the roots of the leaf subtrees and are in the L1 caches of each core.
Core 0 is responsible for handling the internal nodes that are not a part of any of the 64 subtrees. It is also responsible for computing the final sum once the partial sums are filled in.
Every time a partial sum is filled in the output array by a core, another core can fill in its partial sum only after a minimum delay of 50 units (due to cache invalidations).
In order to achieve a speedup of greater than 2 (over sequential code), what is the minimum number of nodes that should be present in the tree?
Correct Answer:
Verified
View Answer
Unlock this answer now
Get Access to more Verified Answers free of charge
Q1: Applying the send/receive programming model as outlined
Q2: Consider the following code that adds two
Q3: Why should there be stride-access for vector
Q4: Consider a system with two multiprocessors with
Q5: Consider a multi-core processor with heterogeneous cores:
Q6: Suppose we have a dual core chip
Q7: Vector architecture exploits the data-level parallelism to
Q9: Consider the following GPU that consists of
Q10: How would you rewrite the following sequential
Q11: Besides network bandwidth and bisection bandwidth, two
Unlock this Answer For Free Now!
View this answer and more for free by performing one of the following actions
Scan the QR code to install the App and get 2 free unlocks
Unlock quizzes for free by uploading documents