Automatic and transparent optimizations of an application’s MPI communication

Thorvald Natvig and Anne C. Elster
Norwegian University of Science and Technology (NTNU), Dept. of Computer and Information Science, Sem Sælands vei 7-9, NO-7491 Trondheim, Norway
{thorvan, elster}@idi.ntnu.no

Abstract. HPC users frequently develop and run their MPI programs without optimizing communication, leading to poor performance on clusters connected with regular Gigabit Ethernet. Unfortunately, optimizing communication patterns will often decrease the clarity and ease of modification of the code, and users desire to focus on the application problem and not the tool used to solve it. In this paper, our new method for automatically optimizing any application's communication is presented. All MPI calls are intercepted by a library we inject into the application. Memory associated with MPI requests is protected using hardware-supported memory protection. The request can then continue in the background as an asynchronous operation while the application is allowed to continue as if the request were finished. Once the data is accessed by the application, a page fault will occur, and our injected library will wait for the background transfer to finish before allowing the application to continue. Performance close to that of manual optimization is observed on our test cases when run on Gigabit Ethernet clusters.

Keywords: MPI, Automatic Optimization, Memory Protection.

1 Introduction

Optimizing for parallel machines requires extensive knowledge of both general optimization techniques and specific hardware and software details of the machine to use. Any non-trivial parallelization will add some overhead for communication.
For example, when parallelizing Game of Life, an iterative 2D partial differential equation (PDE) solver, or any other kind of grid-based domain, each iteration requires the border data of its subdomain to be exchanged with the "neighbors".

The goal of our new method presented here was to achieve a solution that automatically optimizes the communication of MPI [1,2] programs. The solution should not require any user intervention at all to be enabled, and should ideally be fully usable on all MPI-based parallel architectures.

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 208-217, 2007. © Springer-Verlag Berlin Heidelberg 2007

It is important that the solution does not in any way alter the result of the MPI program, so we have to make sure any optimizations done do not alter the data flow of the program, only the communication. The only difference the user should notice should be an improved wallclock running time. A more detailed description of the implementation and more extensive benchmarking may be found in [3].

1.1 Previous Work

Much work has been done on optimizing the individual MPI functions. Some of these efforts are inspired by the ATLAS (Automatically Tuned Linear Algebra Software [4]) idea of taking a basic routine and trying thousands of small variations until you find the specific set of parameters that is perfect for the architecture you are compiling on. Faraj and Yuan [5] have presented such a method for automatically optimizing the MPI collective subroutines, and Østvold has presented numerous ways of timing collective communication [6].

Ogawa and Matsuoka [7] used compiler modifications to optimize MPI. The compiler would recognize the MPI calls in a program, do a static analysis to find out which arguments are static, and then create specialized MPI functions for that program. However, they did not turn synchronous communication into asynchronous communication.
With the introduction of interprocedural optimizations such as those available in the Intel C++ Compiler [8], such optimizations can be extended to all function calls and not just MPI calls. Unfortunately, this type of optimization requires recompiling the program, and for optimal results requires several passes of profile-guided compilation, which increases the amount of work a user has to do before running his program.

The inspiration for using the page protection mechanism and runtime analysis of program data comes in part from the work done by the DUMA [9] and Valgrind [10] projects.

A more user-involved approach to optimization has been presented by Jost [11], which builds an expert system based on profiling data to aid users in their optimizations. Their paper is based around OpenMP, but we believe the instrumentation provided by the methods presented here could be used to extend such approaches to work with MPI. Karwande et al. have presented a method for compiled communication (CC-MPI), which applies more aggressive optimizations to communications whose information is known at compile time [12].

1.2 Motivation

To see how large the potential gain for this idea was, we implemented a Red-Black 2D SOR solver [13] using three different methods of exchanging borders. The first method uses MPI_Sendrecv 4 times. It first sends 'top' and receives 'bottom', then does the same for the remaining 3 borders. This is the easiest to code, and allows easy and intuitive understanding of what happens.

The second method uses MPI_Isend and MPI_Irecv (asynchronous operations), and then waits for all 8 requests to finish. This is just a little more code, but requires the user to think carefully about which operations can be done concurrently with each other, and which operations can be done concurrently with computation.
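To illustrate the contrast between the first two methods, the border exchange might be sketched as follows. This is not the paper's actual solver code, only a minimal sketch assuming a 1D decomposition with two neighbors (`up`, `down`) and separate send and ghost-row buffers; rank setup and error handling are omitted:

```c
#include <mpi.h>

/* Method 1: blocking MPI_Sendrecv, one border pair at a time.
 * Easy to read, but the calls execute in sequence, so per-transfer
 * latency and OS overhead add up. */
void exchange_blocking(double *send_top, double *recv_bottom,
                       double *send_bottom, double *recv_top,
                       int n, int up, int down, MPI_Comm comm) {
    MPI_Sendrecv(send_top, n, MPI_DOUBLE, up, 0,
                 recv_bottom, n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_bottom, n, MPI_DOUBLE, down, 1,
                 recv_top, n, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
}

/* Method 2: post all sends and receives at once, then wait for the
 * whole group, so the long wait is paid once per group of transfers
 * instead of once per transfer. */
void exchange_nonblocking(double *send_top, double *recv_bottom,
                          double *send_bottom, double *recv_top,
                          int n, int up, int down, MPI_Comm comm) {
    MPI_Request req[4];
    MPI_Irecv(recv_bottom, n, MPI_DOUBLE, down, 0, comm, &req[0]);
    MPI_Irecv(recv_top,    n, MPI_DOUBLE, up,   1, comm, &req[1]);
    MPI_Isend(send_top,    n, MPI_DOUBLE, up,   0, comm, &req[2]);
    MPI_Isend(send_bottom, n, MPI_DOUBLE, down, 1, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```

In the paper's 2D decomposition there are four borders and hence eight requests, but the structure of the two variants is the same.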
The non-blocking version is also less intuitive to read than the first method.

The third method computes the new values for the cells along the borders first, and then exchanges these with MPI_Isend and MPI_Irecv in the background while the interior cells are updated. Coding this way caused the communication code to grow to 5 times the number of code lines used in the first method, and the code was no longer intuitive and easy to understand. Furthermore, at least with MPICH 1.2.7, independent progress (fully asynchronous background transfer) only happens if the amount of data to send is small enough to fit in the OS network buffer (typically 64 kB).

Benchmarking these methods on a Pentium 4 cluster using MPICH 1.2.7 showed that just using MPI_Isend gave a significant speedup for small problem sizes, and the full overlapping provided just marginal speedup on top of that (see Table 1 in the Results section for details).

Using blocking functions, the MPI_Sendrecv calls are executed in sequence, and this makes the application vulnerable to latency issues. The data size in each transfer is so small that most of the transfer time is not spent physically transferring data on the wire, but in OS overhead. Additionally, the receiving machine will receive an interrupt which must be handled, and the data must be copied to userspace before the transfer is done. The physical network is idle during most of this time. By using MPI_Isend and MPI_Irecv we simply make sure we only incur this very long wait once for each group of transfers instead of once per transfer.

2 Protected Asynchronous Transfers

2.1 Overview

Our goal is to show that, by using certain safeguards, it is safe to turn any synchronous request into an asynchronous one.

Our first method (paging) uses the hardware page protection mechanism to protect the memory associated with each request. While the transfer occurs in the background, the application is allowed to continue.
If the protected memory is accessed, the application is paused until the background transfer is complete. This way, no change of logic in the application is necessary; we just postpone waiting for data until it is actually needed.

The second method (chaining) improves upon paging by recognizing chains of requests without any computation in between them. The chains are recognized at run-time by our library, and when the start of such a chain is encountered, we allow the requests to proceed in the background. Once the last request is issued, we wait for all of them to complete. Hence, we use paging to verify that the requests in a chain are isolated from other requests and computation, and after that we gain the benefits our paging method provides without the overhead of updating hardware paging tables.

2.2 Process Injection

Since it was a design goal that users should not have to change their code at all when using our new method, the first challenge was how to place ourselves between the application and MPI. Our implementation therefore allows for the choice between run-time and static injection.

Run-time injection has the advantage that it can be used for precompiled programs, and may also be used for programs that call MPI through another library or are even written in another language (Python, Java, etc.). At the moment, run-time injection only works if MPI itself is a dynamic library. Our initial work recognized the specific parts of an executable that contain the MPI library. However, such recognition is impossible if the library is compiled with interprocedural optimizations, as that may make the library functions inlined in the user's code.

Static injection (compile-time) works by overriding mpi.h with our own header which wraps the calls we want to intercept.
Static injection is therefore a much cleaner solution which also works with highly optimizing compilers, but it is then no longer possible to update our injected library without recompiling.

2.3 Marking Memory

When an overridden synchronous MPI function is called and turned into an asynchronous one, we need to track the memory area in use.

For a send request, we perform a heuristic on the data size. If it is small (contained within one memory page), we simply copy the data to a new buffer and start the send in the background. The application is now free to modify the original buffer without affecting the background transfer. If the request is large, we protect the memory pages the buffer occupies so that they are read-only.

For a receive request, we always protect the application memory and start the transfer in the background to a separate buffer. Once complete, we copy from this buffer to the application memory. If we had write-only memory protection, we could deny the application read access while still allowing MPI to write the incoming data to memory, but unfortunately no current memory protection hardware implements write-only protection [14].

The result of this is that the application will not actually stop until it touches an address that is still being transferred. At that point, a page fault will occur which calls our page fault handler, which simply waits for the request to finish before unprotecting the pages and allowing the application to continue. As such, the computation flow of the application is unchanged, and we avoid any deadlock issues. If the application calls MPI_Finalize before touching the data, we also wait for the communication to finish and restore the state of the page table before allowing the application to continue.

2.4 Using Memory Paging to Control Application Access

The foundation for allowing "safe" background transfer is denying the original application access to its own memory. We do this by using the standard page protection feature found on all modern CPUs. This is normally used to implement swapping and virtual memory. When the OS runs low on memory, it will mark a few pages in virtual address space as inaccessible, and copy their contents to disk, thereby freeing up physical memory. When the application tries to access the memory, a page fault occurs, and the OS copies the data back and marks the page accessible again.

We are only interested in the marking of memory (using standard UNIX mprotect() or Win32 VirtualProtect()), so when we talk about page faults here, it is just a page fault in virtual memory. No disk I/O is performed.

As mentioned, most memory protection implementations today suffer from write-implies-read and read-implies-execute. Hence, it is impossible to get a write-only page. If you ask for write-only, what you get is a page with all permissions.

Furthermore, memory protection operates on whole memory pages (usually 4096 or 16384 bytes). You cannot protect just part of a page, so if you protect just a single byte, any other variable occupying the same memory page will be protected as well.

2.5 Overlapping Requests

When tracking memory, we need to be careful about overlapping elements and overlapping pages. Overlapping elements happen when two requests need to access the exact same memory location, while overlapping pages happen when two requests need access to the same page.

For overlapping elements, the order of send operations is irrelevant, and any number of them can be done simultaneously. However, if a receive operation occurs, any other operation already started must complete, and no other operation can start until the receive operation has completed.

When two requests share the same page, we need to pay attention to what mode was used to start the previous operation on the same page. If we were to start a send operation, and a receive operation is in progress on the same page, it means the page is already protected.
If we have determined that the actual elements don't overlap, we have to unprotect the page, copy out the data to send, and then reprotect the page. Similarly, if send operations are already en route and the page is read-only, we can't change the protection of the page to no-access, as that would just make the MPI library itself cause a page fault, so we have to wait until they are done before we change protections. Shared pages also make the page fault handler quite complicated, since it has to wait for all requests that might share a page to complete before unprotecting it and allowing the application access.

It should be noted that since our method works on the process level, the method works equally well for single-threaded and multi-threaded applications and MPI implementations.

Figure 1 shows a theoretical example of all the requests for a classic Jacobi PDE solver with an 8x8 local grid, 1 layer of shadow cells, and a theoretical page size of 128 bytes. Each row represents one page in memory, and each color is a