
Now that we have monoids for describing a generic method for combining two items, we can consider a generic method for combining many items in parallel.

Once we have this ability, we will see that we can solve the remaining problems from last homework by simply plugging the appropriate monoids into our generic operator, reduce. The interface of this operator in our framework is specified below. We can solve our second problem in a similar fashion. Note that in this case, since we know that the input sequence is nonempty, we can pass the first item of the sequence as the identity element.

What could we do if we instead wanted a solution that can deal with zero-length sequences? What identity element might make sense in that case? Observe that in order to seed the reduction we selected the provisional maximum value to be the item at the first position of the input sequence. Now let us handle the general case by seeding with the smallest possible value of type long. Like the tabulate function, reduce is a higher-order function.

Just like any other higher-order function, the work and span costs have to account for the cost of the client-supplied function, which in this case is the associative combining operator. A scan is an iterated reduction that is typically expressed in one of two forms: inclusive and exclusive. The example below represents the logical behavior of scan, but says nothing about the way scan is implemented.

Scan has applications in many parallel algorithms. To name just a few, scan has been used to implement radix sort, search for regular expressions, dynamically allocate processors, evaluate polynomials, etc. Suffice it to say, scan is important and worth knowing about because it is a key component of so many efficient parallel algorithms. In this course, we are going to study a few more applications not in this list. If we just blindly follow the specification above, we might be tempted to try the solution below. Consider that our sequential algorithm takes linear time in the size of the input array.

As such, finding a work-efficient parallel solution means finding a solution that also takes linear work in the size of the input array. The problem is that our parallel algorithm takes quadratic work: it is not even asymptotically work efficient! Even worse, the algorithm performs a lot of redundant work. Can we do better?

Yes, in fact, there exist solutions that take linear work and logarithmic span in the size of the input, assuming that the given associative operator takes constant time. It might be worth pausing for a moment to consider this fact, because the specification of scan may at first look like it would resist a solution that is both highly parallel and work efficient. The remaining operations that we are going to consider are useful for writing more succinct code and for expressing special cases where certain optimizations are possible.

All of the operations presented in this section are derived forms of tabulate, reduce, and scan. The map(f, xs) operation applies f to each item in xs, returning the array of results. It is straightforward to implement as a kind of tabulation, because we have efficient indexing at our disposal.

The array-increment operation that we defined on the first day of lecture is simple to express via map. The work and span costs of map are similar to those of tabulate, and granularity control is handled similarly as well. It is not obvious, however, that the granularity-controller object corresponding to map is instantiated properly. It turns out that, for no extra effort, the behavior that we want is indeed preserved: each distinct function that is passed to map is assigned a distinct granularity controller.

The call fill(v, n) creates an array of n items, each initialized to the value v. Although just another special case of tabulation, this function is worth having around because internally the fill operation can take advantage of special hardware optimizations, such as SIMD instructions, that increase parallelism. Just like fill, the copy operation can take advantage of special hardware optimizations that accelerate memory traffic. For the same reason, the copy operation is a good choice when a full copy is needed. We now consider a slight generalization of the copy operator: with the slice operation we can copy out a range of positions from a given array rather than the entire array.

The slice operation takes a source array and a range to copy out and returns a fresh array that contains copies of the items in the given range. In contrast to slice , the concat operation lets us "copy in" to a fresh array. The prefix sums problem is a special case of the scan problem. We have defined two solutions for two variants of the problem: one for the exclusive prefix sums and one for the inclusive case. The last data-parallel operation that we are going to consider is the operation that copies out items from a given array based on a given predicate function.

For our purposes, a predicate function is any function that takes a value of type long and returns a boolean. The particular instance of the filter problem that we are considering is a little tricky because we are working with fixed-size arrays. In particular, what requires care is the method that we use to copy the selected items out of the input array to the output array.

We need to first run a pass over the input array, applying the predicate function to the items, to determine which items are to be written to the output array. Furthermore, we need to track how many items are to be written so that we know how much space to allocate for the output array.

In the sequential solution above, it appears that there are two particular obstacles to parallelization. What are they? Under one particular assumption regarding the predicate, this sequential solution takes linear time in the size of the input, using two passes. What is the assumption? The challenge of this exercise is to solve the following problem: given two arrays of the same size, the first consisting of boolean-valued flags and the second containing values, return the array that contains, in the same relative order as in the input, the values selected by the flags.

Your solution should take linear work and logarithmic span in the size of the input. The call tabulate(g, n) returns the length-n array whose ith element is given by g(i). The call reduce(b, id, xs) is logically equal to id if xs is empty, and otherwise equal to the combination by b of all the items of xs. This cost assumes that the work and span of b are constant.

If the work and span of f are constant, then the map takes linear work and logarithmic span. The call filter(p, xs) returns the subsequence of xs that contains each xs[i] for which p(xs[i]) returns true. The quicksort algorithm for sorting an array of elements is known to be a very efficient sequential sorting algorithm. A natural question is thus whether quicksort is similarly effective as a parallel algorithm. Let us first convince ourselves, at least informally, that quicksort is actually a good parallel algorithm. But first, what do we mean by "parallel quicksort"?

While this implementation of the quicksort algorithm is not immediately parallel, it can be parallelized. Note that the recursive calls are naturally independent, so we really ought to focus on the partitioning algorithm. There is a rather simple way to do such a partition in parallel by performing three filter calls on the input array: one for picking the elements less than the pivot, one for picking the elements equal to the pivot, and another for picking the elements greater than the pivot.

This algorithm can be described as follows. Now that we have a parallel algorithm, we can check whether it is a good algorithm or not. Recall that a good parallel algorithm is one that has the following three characteristics. Let us now turn our attention to asymptotic and observed work efficiency. Recall first that quicksort can exhibit a quadratic-work worst-case behavior on certain inputs if we select the pivot deterministically.

To avoid this, we can pick a random element as a pivot by using a random-number generator, but then we need a parallel random number generator. Here, we are going to side-step this issue by assuming that the input is randomly permuted in advance. Under this assumption, we can simply pick the pivot to be the first item of the sequence.

The figure below illustrates the structure of an execution of quicksort by using a tree. Each node corresponds to a call to the quicksort function and is labeled with the key at that call. Note that the tree is a binary search tree. In quicksort, a comparison always involves a pivot and another key. Since the pivot is never sent to a recursive call, a key is selected as a pivot exactly once, and is not involved in further comparisons after it becomes a pivot.

Before a key is selected as a pivot, it may be compared to other pivots, once per pivot, so two keys are never compared more than once. We can sum up the two observations: a key is compared with all its ancestors and all its descendants in the call tree, and with no other keys. Since a pair of keys is never compared more than once, the total number of comparisons performed by quicksort can be expressed as a sum over all pairs of keys.
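This sum can be made precise with indicator random variables; the following sketch uses our own notation, with keys identified by their ranks $1, \dots, n$ in sorted order.

```latex
A_{ij} =
\begin{cases}
1 & \text{if the keys of rank } i \text{ and rank } j \text{ are compared,}\\
0 & \text{otherwise.}
\end{cases}
```

Ranks $i$ and $j$ (with $i < j$) are compared exactly when the first pivot chosen among the ranks $i, \dots, j$ is either $i$ or $j$, so

```latex
\Pr[A_{ij} = 1] = \frac{2}{j - i + 1},
\qquad
\mathbb{E}\Big[\sum_{i<j} A_{ij}\Big]
  = \sum_{i<j} \frac{2}{j - i + 1}
  \le 2 n H_n = O(n \log n),
```

where $H_n$ is the $n$th harmonic number.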

We define the following random variable. For any one call to quicksort, there are three possibilities, as illustrated in the figure above. This reasoning applies at any level of the quicksort call tree, and the pivot choice at each call is independent of the choices at the other calls. By basic arithmetic, we obtain the bound. From this bound, we can calculate a bound on the depth of the whole tree. By using the Total Expectation Theorem (also known as the law of total expectation), we can now calculate the expected span by dividing the sample space into mutually exclusive and exhaustive events as follows.

The span of quicksort is determined by the sizes of these larger subsequences. As the partition step uses filter, we have the following recurrence for the span. The figure below illustrates this. We now make this intuition more precise. For the analysis, we use the conditioning technique for computing expectations suggested by the total expectation theorem, which states that the expectation of a random variable can be computed by conditioning on a partition of the sample space: E[X] = Σ_i E[X | A_i] Pr[A_i]. For an implementation to be observably work efficient, we know that we must control granularity by switching to a fast sequential sorting algorithm when the input is small.
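Spelled out (our reconstruction, assuming the filter-based partition has logarithmic span), the recurrence reads:

```latex
S(n) = S\big(\max(n_1,\, n_2)\big) + O(\log n),
```

where $n_1$ and $n_2$ are the sizes of the two recursive subsequences. With a randomly chosen pivot, $\max(n_1, n_2) \le 3n/4$ with probability at least $1/2$, which gives an expected span of $O(\log^2 n)$.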

Of course, we have to assess observable work efficiency experimentally after specifying the implementation. The code for quicksort is shown below. Note that we use our array class sparray to store the input and output. To partition the input, we use our parallel filter function from the previous lecture to parallelize the partitioning phase.

Similarly, we use our parallel concatenation function to construct the sorted output. By using randomized-analysis techniques, it is possible to analyze the work and span of this algorithm. The techniques needed to do so are beyond the scope of this book.


The interested reader can find more details in another book. When the input is large, there should be ample parallelism to keep many processors well fed with work. Unfortunately, the code that we wrote leaves much to be desired in terms of observable work efficiency.

Consider the following benchmarking runs that we performed on our processor machine.

The first two runs show that, on a single processor, our parallel algorithm is roughly 6x slower than the sequential algorithm that we are using as baseline! In other words, our quicksort has roughly 6x observed work efficiency: we need at least six processors working on the problem to see even a small improvement compared to a good sequential baseline. The rest of the results confirm that it takes about ten processors to see a little improvement and forty processors to see approximately a 2x speedup. The speedup plot for this program clearly does not look good.

Our analysis suggests that we have a good parallel algorithm for quicksort, yet our observations suggest that, at least on our test machine, our implementation is rather slow relative to our baseline program. In particular, we noticed that our parallel quicksort started out being 6x slower than the baseline algorithm. What could be to blame? In fact, there are many implementation details that could be responsible, and identifying the true causes experimentally could take a lot of time and effort. Fortunately, our quicksort code contains a few clues that will guide us in a good direction.

It should be clear that our quicksort is copying a lot of data and, moreover, that much of the copying could be avoided. The copying operations that could be avoided, in particular, are the array copies performed by each of the three calls to filter and the one call to concat.


Each of these operations has to touch each item in the input array. Let us now consider a mostly in-place version of quicksort. This code is mostly in place because the algorithm copies out the input array in the beginning, but otherwise sorts in place on the result array. The code for this algorithm appears just below. We have good reason to believe that this code is, at least, going to be more work efficient than our original solution.

First, it avoids the allocation and copying of intermediate arrays. And, second, it performs the partitioning phase in a single pass. There is a catch, however: in order to work mostly in place, our second quicksort code sacrificed parallelism. Specifically, observe that the partitioning phase is now sequential. The span of this second quicksort is therefore linear in the size of the input, and its average parallelism is therefore logarithmic in the size of the input.

Verify that our second quicksort has linear span and that its average parallelism is logarithmic. So, we expect the second quicksort to be more work efficient but to scale poorly. To test the first hypothesis, let us run the second quicksort on a single processor. Indeed, the running time of this code is essentially the same as what we observed for our baseline program. The plot below shows one speedup curve for each of our two quicksort implementations. The in-place quicksort is always faster. However, the in-place quicksort starts slowing down noticeably at 20 cores and stops improving after 30 cores.

So, we have one solution that is not observably work efficient but scales, and another that is observably work efficient but scales poorly. The question now is whether we can find a happy middle ground. We encourage students to look for improvements to quicksort independently. For now, we are going to consider parallel mergesort. This time, we are going to focus more on achieving better speedups. As a divide-and-conquer algorithm, mergesort is a good candidate for parallelization, because the two recursive calls for sorting the two halves of the input can be independent.


The final merge operation, however, is typically performed sequentially. It turns out to be not too difficult to parallelize the merge operation and thereby obtain good work and span bounds for parallel mergesort. The resulting algorithm turns out to be a good parallel algorithm, delivering asymptotic and observed work efficiency, as well as low span. This process requires a "merge" routine which merges the contents of two specified subranges of a given array.

The merge routine assumes that the two given subarrays are in ascending order. The result is the combined contents of the items of the subranges, in ascending order. The precise signature of the merge routine appears below and its description follows. In mergesort, every pair of ranges that are merged are adjacent in memory. This observation enables us to write the following function. The function merges two ranges of the source array xs: [lo, mid) and [mid, hi). A temporary array tmp is used as scratch space by the merge operation. The function writes the result from the temporary array back into the original range of the source array: [lo, hi).

To see why sequential merging falls short, let us implement the merge function by using the one provided by STL: std::merge. This merge implementation performs linear work and span in the number of items being merged. In our code, we use this STL implementation underneath the merge interface that we described just above. Now, we can assess our parallel mergesort with a sequential merge, as implemented by the code below.

The code uses the traditional divide-and-conquer approach that we have seen several times already.


The code is asymptotically work efficient, because nothing significant has changed between this parallel code and the serial code: just erase the parallel annotations and we have a textbook sequential mergesort! But how well does our "parallel" mergesort scale to multiple processors? Unfortunately, this implementation has a large span: it is linear, owing to the sequential merge operations after each pair of parallel calls. More precisely, we can write the work and span of this implementation as follows.
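With a linear-time sequential merge, the recurrences (our reconstruction) are:

```latex
W(n) = 2\,W(n/2) + \Theta(n) = \Theta(n \log n),
\qquad
S(n) = S(n/2) + \Theta(n) = \Theta(n),
```

so the average parallelism $W(n)/S(n)$ is only $\Theta(\log n)$.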

That is terrible, because it means that the greatest speedup we can ever hope to achieve is 15x! The analysis above suggests that, with sequential merging, our parallel mergesort does not expose ample parallelism. Let us put that prediction to the test. The following experiment considers this algorithm on our processor test machine.

We are going to sort a random sequence of million items. The baseline sorting algorithm is the same sequential sorting algorithm that we used for our quicksort experiments: std::sort. The first two runs suggest that our mergesort has better observable work efficiency than our quicksort. Compare that to the 6x-slower running time for single-processor parallel quicksort!

We have a good start. But we can do better by using a parallel merge instead of a sequential one: the speedup plot in [mergesort-speedups] shows three speedup curves, one for each of three mergesort algorithms. The mergesort algorithm is the same mergesort routine that we have seen here, except that we have replaced the sequential merge step by our own parallel merge algorithm. The cilksort algorithm is the carefully optimized algorithm taken from the Cilk benchmark suite. What this plot shows is, first, that the parallel merge significantly improves performance, by at least a factor of two.

The second thing we can see is that the optimized Cilk algorithm is just a little faster than the one we presented here. It turns out that we can do better by simply changing some of the variables in our experiment. The plot shown in [better-speedups] shows the speedup plot that we get when we change two variables: the input size and the sizes of the items. In particular, we are selecting a larger number of items, namely million instead of million, in order to increase the amount of parallelism. And, we are selecting a smaller type for the items, namely 32 bits instead of 64 bits per item. The speedups in this new plot get closer to linear, topping out at approximately 20x.

Practically speaking, the mergesort algorithm is memory bound because the amount of memory used by mergesort and the amount of work performed by mergesort are both roughly linear. It is an unfortunate reality of current multicore machines that the main limiting factor for memory-bound algorithms is the amount of parallelism that can be achieved by the memory bus. The memory bus in our test machine simply lacks the parallelism needed to match the parallelism of the cores.

The effect is clear after just a little experimentation with mergesort. An important property of the sequential merge-sort algorithm is that it is stable: it can be written in such a way that it preserves the relative order of equal elements in the input. Is the parallel merge-sort algorithm that you designed stable?

If not, then can you find a way to make it stable? In just the past few years, a great deal of interest has grown for frameworks that can process very large graphs. Interest comes from a diverse collection of fields. To name a few: physicists use graph frameworks to simulate emergent properties from large networks of particles; companies such as Google mine the web for the purpose of web search; social scientists test theories regarding the origins of social trends. In response, many graph-processing frameworks have been implemented both in academia and in the industry.

Such frameworks offer to client programs a particular application programming interface. The purpose of the interface is to give the client programmer a high-level view of the basic operations of graph processing. Internally, at a lower level of abstraction, the framework provides key algorithms to perform basic functions, such as one or more functions that "drive" the traversal of a given graph.

The exact interface and the underlying algorithms vary from one graph-processing framework to another. One commonality among the frameworks is that it is crucial to harness parallelism, because interesting graphs are often huge, making it practically infeasible to perform interesting computations sequentially. We will use an adjacency-lists representation based on compressed arrays to represent directed graphs. In this representation, a graph is stored as a compact array containing the neighbors of each vertex.

The representation then consists of two arrays. The edge array contains the adjacency lists of all vertices, ordered by vertex id. The vertex array stores, for each vertex, an index that indicates the starting position of that vertex's adjacency list in the edge array.

The sentinel value "-1" is used to indicate a non-vertex id.

Acar, Arthur Chargueraud, and Mike Rainey, v1.

Preface: The goal of these notes is to introduce the reader to the following. Chapter: Fork-join parallelism. Fork-join parallelism is a fundamental model in parallel computing that has long been widely used. Parallel Fibonacci: Now we have all the tools we need to describe our first parallel code: the recursive Fibonacci function.

Incrementing an array, in parallel: Suppose that we wish to map one array to another by incrementing each element by one. The sequential elision: In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. Race conditions: A race condition is any behavior in a program that is determined by some feature of the system that cannot be controlled by the program, such as the timing of the execution of instructions.

Synchronization hardware: Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. Example 7: accessing the contents of atomic memory cells. The key operation that helps with race conditions is the compare-and-exchange operation.

Compare-and-exchange reads the contents of the target cell and, if they equal a given expected value, writes a new value and returns true; otherwise, it returns false. Software setup: You can skip this section if you are using a computer already set up by us or you have installed an image file containing our software. Check for software dependencies: Currently, the software associated with this course supports Linux only. Use a custom parallel heap allocator: At the time of writing this document, the system-default implementations of malloc and free that are provided by Linux distributions do not scale well with even moderately large amounts of concurrent allocations.

Also, the environment linker needs to be instructed where to find tcmalloc. Use hwloc: If your system has a non-uniform memory architecture (NUMA). Starting with installed binaries: At this point, you have either installed all the necessary software to work with PASL or it has been installed for you. Specific setup for the andrew machines: First set up your PATH variable to refer to the right directories.

Using cshell. Fetch the benchmarking tools (pbench): We are going to use two command-line tools to help us run experiments and analyze the data. Build the tools: The following command builds the tools, namely prun and pplot. Create aliases: We recommend creating the following aliases. Visualizer tool: When we are tuning our parallel algorithms, it can be helpful to visualize their processor utilization over time, just in case there are patterns that help to assign blame to certain regions of code. Task 1: Run the baseline Fibonacci. We are going to start our experimentation with three different instances of the same program, namely bench.

Task 2: Run the sequential elision of Fibonacci. Task 3: Run parallel Fibonacci. The output of this program is similar to the output of the previous two programs. Measuring performance with "speedup": We may ask at this point: what is the improvement that we just observed from the parallel run of our program? Example: speedup for our run of Fibonacci on 40 processors. Generate a speedup plot: Let us see what a speedup curve can tell us about our parallel Fibonacci program.

Starting to generate 1 charts. Produced file plots. Superlinear speedup: Suppose that, on our processor machine, the speedup that we observe is larger than 40x. Strong versus weak scaling: We are pretty sure that our Fibonacci program is not scaling as well as it could. Figure 4 shows how the processor utilization of the Fibonacci computation varies with input size. Chapter summary: We have seen in this lab how to build, run, and evaluate our parallel programs.

Chapter: Work efficiency In many cases, a parallel algorithm which solves a given problem performs more work than the fastest sequential algorithm that solves the same problem. Definition: asymptotic work efficiency. Definition: observed work efficiency. Observed work efficiency of parallel increment. To obtain this measure, we first run the baseline version of our parallel-increment algorithm.

Definition: good parallel algorithm. Tuning the parallel array-increment function. Observed work efficiency of tuned array increment. Determining the threshold The basic idea behind coarsening or granularity control is to revert to a fast serial algorithm when the input size falls below a certain threshold.

Chapter: Automatic granularity control. There has been significant research into determining the right threshold for a particular algorithm. Controlled statements: In PASL, a controlled statement, or cstmt, is an annotation in the program text that activates automatic granularity control for a specified region of code. Array-increment function with automatic granularity control. Granularity control with alternative sequential bodies: It is not unusual for a divide-and-conquer algorithm to switch to a different algorithm at the leaves of its recursion tree. Array-increment function with automatic granularity control and sequential body.

Controlled parallel-for loops: Let us add one more component to our granularity-control toolkit: the controlled parallel-for loop. Figure 6. Simple parallel arrays: Arrays are a fundamental data structure in sequential and parallel computing. Interface and cost model: The key components of our array data structure, sparray, are shown by the code snippet below.

The cost model guaranteed by our implementation of parallel arrays is as follows. Allocation and deallocation: Arrays can be allocated by specifying the size of the array. Automatic deallocation of arrays upon return. Create and initialize an array sequentially. What are the work and span complexities of your solution? Does your solution expose ample parallelism? How much, precisely? What speedup do you observe in practice on various input sizes?

Tabulation: A tabulation is a parallel operation that creates a new array of a given size and initializes the contents according to a given "generator function".
