| Title | Scalable formal dynamic verification of MPI programs through distributed causality tracking |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Vo, Anh |
| Date | 2011-08 |
| Description | Almost all high performance computing applications are written in MPI, which will continue to be the case for at least the next several years. Given the huge and growing importance of MPI, and the size and sophistication of MPI codes, scalable and incisive MPI debugging tools are essential. Existing MPI debugging tools have, despite their strengths, many glaring deficiencies, especially when it comes to debugging in the presence of nondeterminism-related bugs, which are bugs that do not always show up during testing. These bugs usually become manifest when the systems are ported to different platforms for production runs. This dissertation focuses on the problem of developing scalable dynamic verification tools for MPI programs that can provide a coverage guarantee over the space of MPI nondeterminism. That is, the tools should be able to detect different outcomes of nondeterministic events in an MPI program and enforce all those different outcomes through repeated executions of the program with the same test harness. We propose to achieve the coverage guarantee by introducing efficient distributed causality tracking protocols that are based on the matches-before order. The matches-before order is introduced to address the shortcomings of the Lamport happens-before order [40], which is not sufficient to capture causality for MPI program executions due to the complexity of the MPI semantics. The two protocols we propose are the Lazy Lamport Clocks Protocol (LLCP) and the Lazy Vector Clocks Protocol (LVCP). LLCP provides good scalability with a small possibility of missing potential outcomes of nondeterministic events, while LVCP provides a full coverage guarantee with a scalability tradeoff. In practice, we show through our experiments that LLCP provides the same coverage as LVCP. This thesis makes the following contributions: • The MPI matches-before order, which captures the causality between MPI events in an MPI execution. • Two distributed causality tracking protocols for MPI programs that rely on the matches-before order. • A Distributed Analyzer for MPI programs (DAMPI), which implements the two aforementioned protocols to provide scalable and modular dynamic verification for MPI programs. • Scalability enhancements through algorithmic improvements for ISP, a dynamic verifier for MPI programs. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Causality tracking; Correctness checking; MPI; Verification |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © Anh Vo 2011 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 2,278,392 bytes |
| Identifier | us-etd3,36193 |
| Source | Original housed in Marriott Library Special Collections, QA3.5 2011 .V59 |
| ARK | ark:/87278/s6m04m59 |
| DOI | https://doi.org/doi:10.26053/0H-VCG0-K5G0 |
| Setname | ir_etd |
| ID | 194398 |
| OCR Text | SCALABLE FORMAL DYNAMIC VERIFICATION OF MPI PROGRAMS THROUGH DISTRIBUTED CAUSALITY TRACKING

by Anh Vo

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science. School of Computing, The University of Utah, August 2011. Copyright © Anh Vo 2011. All Rights Reserved.

The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL

The dissertation of Anh Vo has been approved by the following supervisory committee members: Ganesh Gopalakrishnan, Chair (Date Approved 04/26/2010); Robert M. Kirby, Member (04/18/2010); Bronis R. de Supinski, Member (04/18/2010); Mary Hall, Member (04/18/2010); Matthew Might, Member (04/18/2010); and by Al Davis, Chair of the School of Computing; and by Charles A. Wight, Dean of The Graduate School.

ABSTRACT

Almost all high performance computing applications are written in MPI, which will continue to be the case for at least the next several years. Given the huge and growing importance of MPI, and the size and sophistication of MPI codes, scalable and incisive MPI debugging tools are essential. Existing MPI debugging tools have, despite their strengths, many glaring deficiencies, especially when it comes to debugging in the presence of nondeterminism-related bugs, which are bugs that do not always show up during testing. These bugs usually become manifest when the systems are ported to different platforms for production runs. This dissertation focuses on the problem of developing scalable dynamic verification tools for MPI programs that can provide a coverage guarantee over the space of MPI nondeterminism. That is, the tools should be able to detect different outcomes of nondeterministic events in an MPI program and enforce all those different outcomes through repeated executions of the program with the same test harness.

We propose to achieve the coverage guarantee by introducing efficient distributed causality tracking protocols that are based on the matches-before order. The matches-before order is introduced to address the shortcomings of the Lamport happens-before order [40], which is not sufficient to capture causality for MPI program executions due to the complexity of the MPI semantics. The two protocols we propose are the Lazy Lamport Clocks Protocol (LLCP) and the Lazy Vector Clocks Protocol (LVCP). LLCP provides good scalability with a small possibility of missing potential outcomes of nondeterministic events, while LVCP provides a full coverage guarantee with a scalability tradeoff. In practice, we show through our experiments that LLCP provides the same coverage as LVCP. This thesis makes the following contributions:

• The MPI matches-before order, which captures the causality between MPI events in an MPI execution.
• Two distributed causality tracking protocols for MPI programs that rely on the matches-before order.
• A Distributed Analyzer for MPI programs (DAMPI), which implements the two aforementioned protocols to provide scalable and modular dynamic verification for MPI programs.
• Scalability enhancements through algorithmic improvements for ISP, a dynamic verifier for MPI programs.
For my parents

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair" – Douglas Adams, Mostly Harmless

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
1.1 Dynamic Verification for MPI
1.2 Thesis Statement
1.3 Contributions
1.3.1 MPI Matches-Before
1.3.2 Lazy Update Protocols
1.3.3 Contributions to ISP
1.3.4 DAMPI

2. BACKGROUND
2.1 Distributed Systems
2.2 Distributed Causality Tracking
2.2.1 Lamport Clocks
2.2.2 Vector Clocks
2.3 The Message Passing Interface
2.3.1 Synchronous Point-to-Point Communication
2.3.2 Asynchronous Point-to-Point Communication
2.3.3 Collective Communication
2.3.4 Nondeterminism in MPI
2.3.5 Nonovertaking in MPI
2.3.6 Common MPI Errors
2.3.6.1 Deadlock
2.3.6.2 Resource Leaks
2.3.6.3 Erroneous Buffer Reuse
2.3.6.4 Type Mismatches
2.3.7 The MPI Profiling Interface
2.4 Piggybacking in MPI
2.4.1 Buffer Attachment Piggybacking
2.4.2 Separate Message Piggybacking
2.4.3 Datatype Piggyback

3. MPI MATCHES-BEFORE
3.1 Our Computational Model
3.2 Issues Applying Happens-Before to MPI
3.3 Matches-Before
3.4 Discussion

4. CENTRALIZED DYNAMIC VERIFICATION FOR MPI
4.1 ISP
4.1.1 The ISP Profiler
4.1.2 The ISP Scheduler
4.1.3 The POE Algorithm
4.1.4 ISP Evaluation
4.2 ISP Scalability Issues
4.2.1 The Scalability Challenge
4.2.2 Memory Consumption
4.2.3 Improvements to POE
4.2.3.1 Transitivity of Matches-Before
4.2.3.2 Parallel-ISP
4.2.4 Discussion

5. DISTRIBUTED DYNAMIC VERIFICATION FOR MPI
5.1 Lazy Lamport Clocks
5.1.1 Algorithm Overview
5.1.2 Clock Update Rules
5.1.3 Match Detection
5.2 Lazy Vector Clocks
5.2.1 Handling Synchronous Sends
5.2.1.1 Piggyback Requirements
5.2.1.2 Algorithm Extension
5.3 DAMPI: Distributed Analyzer for MPI
5.3.1 DAMPI Framework Overview
5.3.1.1 The DAMPI Library
5.3.1.2 The Scheduler
5.3.2 Implementation Detail
5.3.2.1 Piggyback
5.3.2.2 DAMPI Driver
5.3.2.3 Error Checking Modules
5.3.2.4 The DAMPI Scheduler
5.3.3 Evaluation of LLCP and LVCP
5.3.3.1 Experiment Setup
5.3.4 DAMPI Performance Evaluation
5.3.4.1 Full Coverage
5.3.5 Search Bounding Heuristics Evaluation
5.3.5.1 Loop Iteration Abstraction
5.3.5.2 Bounded Mixing

6. RELATED WORK
6.1 Debugging and Correctness Checking
6.1.1 Debugging
6.1.2 Correctness Checking
6.2 Verification Tools
6.3 Deterministic Replay

7. CONCLUSIONS AND FUTURE WORK
7.1 Future Research Directions
7.1.1 Static Analysis Support
7.1.2 Hybrid Programming Support

REFERENCES

LIST OF FIGURES

1.1 MPI example to illustrate POE
2.1 A distributed system using Lamport clocks
2.2 A distributed system using Lamport clocks
2.3 An MPI program calculating π
2.4 Deadlock due to send receive mismatch
2.5 Head-to-head deadlock
2.6 Deadlock due to collective posting order
2.7 Deadlock due to nondeterministic receive
2.8 Resource leak due to unfreed request
2.9 Erroneous buffer reuse
2.10 Type mismatch between sending and receiving
2.11 A simple PMPI wrapper counting MPI_Send
2.12 Buffer attachment piggyback
2.13 Separate message piggyback
2.14 Separate message piggyback issue on the same communicator
2.15 Separate message piggyback issue on different communicators
2.16 Datatype piggyback
3.1 Wildcard receive with two matches
3.2 Counterintuitive matching of nonblocking receive
3.3 Nonovertaking matching of nonblocking calls
3.4 Transitivity of matches-before
4.1 ISP framework
4.2 MPI example to illustrate POE
4.3 Improvements based on transitive reduction
4.4 Improvements made by data structures changes
4.5 Improvements made by parallelization
5.1 Late messages
5.2 An example illustrating LLCP
5.3 Omission scenario with LLCP
5.4 Handling synchronous sends
5.5 DAMPI framework
5.6 DAMPI library overview
5.7 Packing and unpacking piggyback data - collective
5.8 Packing and unpacking piggyback data - point-to-point
5.9 Pseudocode for piggybacking in MPI_Send
5.10 Pseudocode for piggybacking in MPI_Isend
5.11 Pseudocode for piggybacking in MPI_Recv
5.12 Pseudocode for piggybacking in MPI_Irecv
5.13 Pseudocode for piggybacking in MPI_Wait
5.14 Wildcard receives with associated clocks
5.15 Pseudocode for MPI_Irecv
5.16 Pseudocode for MPI_Wait
5.17 Pseudocode for CompleteNow
5.18 Pseudocode for MPI_Recv
5.19 Pseudocode for ProcessIncomingMessage
5.20 Pseudocode for FindPotentialMatches
5.21 Pseudocode for CompleteNow with probe support
5.22 MPI example with MPI_ANY_TAG
5.23 Pseudocode for the DAMPI scheduler
5.24 Bandwidth impact
5.25 Latency impact
5.26 Overhead on SMG2000
5.27 Overhead on AMG2006
5.28 Overhead on ParMETIS-3.1.1
5.29 ParMETIS-3.1: DAMPI vs. ISP
5.30 Matrix multiplication: DAMPI vs. ISP
5.31 A simple program flow to demonstrate bounded mixing
5.32 Matrix multiplication with bounded mixing applied
5.33 ADLB with bounded mixing applied

LIST OF TABLES

4.1 Comparison of POE with Marmot
4.2 Number of MPI calls in ParMETIS
5.1 Statistics of MPI operations in ParMETIS-3.1
5.2 DAMPI overhead: Large benchmarks at 1K processes
ACKNOWLEDGMENTS

My research that led to this dissertation, and the dissertation itself, would not have been possible without the advice and support of my friends, my family, the faculty of the School of Computing, and our research collaborators. First and foremost, I would like to thank my advisor, Professor Ganesh Gopalakrishnan, whom I consider to be the best advisor one could ever hope to have. His enthusiasm has inspired many students, myself included, to be self-motivated and to work hard toward the goals we have set for ourselves. Thank you, Ganesh; I could not have done it without your support.

The dissertation would not have reached its current state without the suggestions of my coadvisor, Professor Robert M. Kirby, and of my committee members, Dr. Bronis de Supinski, Professor Mary Hall, and Professor Matthew Might. Thank you, all of you, for your valuable input on my research and my dissertation.

During the course of my research on DAMPI, I probably maintained an average ratio of about five bad ideas to one decent idea, and about five decent ideas to one good (i.e., publishable) idea. Without the countless brainstorming sessions with our research collaborators, Bronis de Supinski, Martin Schulz, and Greg Bronevetsky, whether at the whiteboard or through emails, the bad ones would have been considered good, and vice versa. Working with you has been an eye-opening experience. I am especially grateful to Bronis for his editing help with the papers, the proposal, and this dissertation itself.

My research began as a project to improve ISP, a dynamic code-level model checker for MPI. ISP is the work of Sarvani Vakkalanka, whom I had the pleasure of working with for two years. She was one of the smartest and hardest-working colleagues I have had. Her work on ISP inspired my research, and I am thankful for that.
I would also like to express my gratitude to many of my colleagues, especially Michael Delisi, Alan Humphrey, Subodh Sharma, and Sriram Aananthakrishnan, for their input and contributions.

With a few exceptions, going to graduate school is demanding and stressful. I would not have made it without the support of my family and my friends. My parents, my sister, and my cousin and her family have always been my biggest supporters. I appreciate that they always tactfully ask me how many years I have been in school instead of how many years I have left. I am especially indebted to my fiancée, Phuong Pham, for providing the motivation and encouragement I needed to finish the dissertation work. I am also thankful to the many good friends who have made my graduate school experience memorable. My-phuong Pham, Linh Ha, Hoa Nguyen, Huong Nguyen, Khiem Nguyen, Thao Nguyen, Huy Vo, and Trang Pham, thank you all. Five years ago, as I was pondering whether I should continue my studies, it was my friend Khoi Nguyen who encouraged me to go to graduate school. To this day I still remember the countless nights we stayed up together, each working on our own projects and papers, even though we were several time zones apart. Khoi will always have my gratitude.

CHAPTER 1
INTRODUCTION

It is undeniable that the era of parallel computing has dawned on us, whether we are ready or not. In a recent report entitled The Future of Computing Performance: Game Over or Next Level, the National Research Council states that "the rapid advances in information technology that drive many sectors of the U.S. economy could stall unless the nation aggressively pursues fundamental research and development of parallel computing" [25]. Today supercomputers are becoming faster, cheaper, and more popular.
The most recent release of the Top500 list, in November 2010, saw five supercomputers that had not previously appeared on the list make it into the top ten [16]. Future growth in computing power will have to come from parallelism, on both the hardware side and the software side. Programmers who are used to thinking about and writing sequential software now have to turn to parallel software to achieve the desired performance. Unfortunately, the transition from sequential programming to parallel programming remains a challenge due to the complexity of parallel programming. There are many different forms of parallelism in software, from multithreading to message passing. This dissertation specifically focuses on message passing software, especially software written in MPI (the Message Passing Interface [28]). Today MPI is the most widely used programming API for writing parallel programs that run on large clusters. The ubiquity of MPI can be attributed to its design goals: flexibility, performance, and portability. MPI accomplishes these goals by providing a very rich semantics that incorporates features of both asynchronous and synchronous systems. Synchronous behavior is easy to use and understand, which allows developers to achieve higher productivity, while asynchronous behavior allows for the highest performance. Both properties are necessary for a ubiquitous standard. Unfortunately, the performance and flexibility of MPI come with several debugging challenges. MPI programs, especially in the presence of nondeterminism, are notoriously hard to debug. Nondeterminism bugs are difficult to catch because repeated unit testing, the most commonly used method of testing concurrent code, usually covers only a small number of possible executions [73]. When the code enters production and is deployed in different environments, an untested (buggy) path becomes manifest and might cause the software to crash.
To highlight how difficult debugging can get with MPI, we consider the simple MPI program shown in Figure 1.1, which contains a very subtle bug. In this program, the asynchronous nondeterministic receive posted in process P1 can potentially match a message sent by either P0 or P2. Under traditional testing, one may never successfully be able to force P2's message (which triggers the ERROR) to match. While this match appears impossible because P2's send is issued after an MPI barrier, it is indeed possible because the MPI semantics allow a nonblocking call to remain pending until its corresponding wait is posted. This example illustrates the need for verification techniques more powerful than ordinary random testing on a cluster, where, due to absolute delays, P2's match may never happen (and yet it may show up when the code is ported to a different machine). Although there are many techniques and tools that help developers discover MPI nondeterminism errors, they basically fall into three categories: static methods, dynamic methods, and model checking. Static methods have the advantage of being input-independent, since they verify the program at the source code level. However, they tend to raise too many false alarms, especially for a large code base, due to the lack of runtime knowledge. Model checking methods are very powerful for small programs in terms of verification coverage, but they quickly become impractical for large software due to the infeasibility of building models for such software. Dynamic methods such as testing or dynamic verification are the most applicable methods for large MPI programs, since they produce no false alarms and also require little work from the tool users. This dissertation focuses on applying formal techniques to create efficient and scalable dynamic verification tools for MPI programs.
Figure 1.1: MPI example to illustrate POE

P0: Isend(to P1, 22); Barrier; Wait()
P1: Irecv(from:*, x); Barrier; Recv(from:*, y); Wait(); if (x == 33) ERROR
P2: Barrier; Isend(to P1, 33); Wait()

1.1 Dynamic Verification for MPI

Most realistic MPI programs are written in Fortran/C/C++ and run on clusters with thousands to hundreds of thousands of cores. These programs can have not only the common C/C++/Fortran bugs such as memory leaks or buffer overflows, but also bugs specific to MPI such as deadlocks or illegal buffer reuse. Earlier we presented a buggy example involving a nondeterministic receive, which is troublesome for developers to debug because such bugs appear intermittently and do not show up in all traces. Testing tools for MPI such as Marmot [35] and Umpire [68] are unreliable for such bugs because they only catch the bugs that appear in the testing run. In other words, they do not provide any coverage guarantee over the space of nondeterminism. The model checker MPI-SPIN [58] can provide a coverage guarantee for MPI verification. However, MPI-SPIN requires users to build models of their MPI programs manually in the SPIN modeling language and to run the model checker on those models. For realistic MPI programs containing hundreds of thousands of lines of code, this requirement is unrealistic and renders the approach impractical. While dynamic verification tools exist for other types of parallel software, such as CHESS [46] and Verisoft [26], similar tools for MPI are still nonexistent.

1.2 Thesis Statement

Scalable, modular, and usable dynamic verification of realistic MPI programs is novel and feasible.

1.3 Contributions

1.3.1 MPI Matches-Before

We investigate the Lamport happens-before order [40] between events in a distributed system and show that it is insufficient for capturing the full semantics of MPI executions. More specifically, the reason is that the happens-before order relies on knowing when an event finishes all its execution effects.
However, obtaining such information for MPI events is challenging, since an MPI event can exist in many different states from the point in time when the process invokes the MPI call to the point where the call no longer has any effect on the local state. We show that neither the point of issuing nor the point of completion suffices to order events in an MPI execution correctly, which is counterintuitive to what most tool developers tend to think. To overcome these limitations, we contribute the notion of matches-before, which focuses on the matching point of MPI events (intuitively, the matching point is the point when an operation commits to how it will finish).

1.3.2 Lazy Update Protocols

We introduce two fully distributed protocols, namely the Lazy Lamport Clocks Protocol (LLCP) and the Lazy Vector Clocks Protocol (LVCP). Both protocols rely on the matches-before order to track causality between nondeterministic events in MPI executions. While the vector clock-based protocol provides a complete coverage guarantee, it does not scale as well as the Lamport clock-based protocol. We show through our experiments that, in practice, the Lamport clock protocol provides the same coverage guarantee without sacrificing scalability.

1.3.3 Contributions to ISP

ISP is a formal dynamic verifier for MPI programs developed originally by Sarvani Vakkalanka [64-66, 71]. ISP uses a centralized version of matches-before to achieve verification coverage over the space of nondeterminism. My specific contributions to ISP are as follows: • Studying the scalability of ISP and making ISP scale to handle realistic MPI applications through various algorithmic improvements, such as reducing ISP's memory footprint through data structure improvements, increasing speedup through the use of better communication mechanisms, and parallelizing the ISP scheduler with OpenMP [66, 67, 71, 72]. • Interfacing with GEM [34] developers to make ISP a practical tool.
1.3.4 DAMPI

The lazy update algorithms provide the basis for developing scalable and portable correctness checking tools for MPI programs. We demonstrate this by implementing these algorithms in a new tool called DAMPI [69, 70], a Distributed Analyzer for MPI programs. Similarly to ISP, DAMPI's goal is to verify MPI programs for common errors such as deadlocks and resource leaks over the space of nondeterminism. In contrast with ISP, DAMPI is fully distributed and targets large-scale MPI programs that run on large clusters. The lazy update algorithms allow DAMPI to provide coverage over the space of nondeterminism without sacrificing scalability. Further, we implement several user-configurable search bounding heuristics in DAMPI, such as loop iteration abstraction, which allows the user to specify regions that DAMPI should bypass during verification, and bounded mixing, which allows the user to limit the impact a nondeterministic choice has on subsequent choices. Both heuristics aim to reduce the search space and provide the user with configurable coverage.

CHAPTER 2

BACKGROUND

This chapter gives the background knowledge about causality tracking in distributed systems in general, and in MPI in particular.

2.1 Distributed Systems

While there are several possible ways to define what distributed systems are, we adapt the definition from Coulouris [22], which defines a distributed system as a collection of networked computers that communicate with each other through message passing only. Since we mostly restrict our study to the software level, we find the concept of distributed programs more useful and applicable. A distributed program P is a collection of processes P_0, ..., P_n communicating through message passing, running a common program to achieve a common goal.
It is important to note that this definition allows a distributed program to run even on a single computer, where each process P_i runs within its own virtual address space provided by the host operating system. In the rest of this dissertation, we use the term distributed system in place of distributed program.

2.2 Distributed Causality Tracking

The ordering of events is an essential part of our daily life. Consider the following hypothetical example: Bob receives two undated letters from his father, one of which says "Mom is sick" and the other says "Mom is well." Since the letters are undated, Bob has no way to reason about the current well-being of his mother. One apparent solution is for Bob to pick up the phone and call his father to inquire about his mother's status. However, let us assume that in this hypothetical time and place, telephone communication does not exist, which would also explain why Bob's father sent him letters instead of calling. With this constraint, one possible solution is for Bob's father to write the time from his wristwatch at the end of each letter. In other words, he attaches a physical clock to each message that he sends to Bob.

This solution works fine if Bob's father is the only one communicating with Bob. It is not hard to imagine why the scheme would fail if another person, e.g., Bob's sister, also communicates with Bob. Assume that instead of receiving two letters from his father, Bob receives one from his father that says "Mom is sick" and one from his sister that says "Mom is well." If Bob's sister uses her own wristwatch to timestamp her message to Bob and her watch is not synchronized with his father's watch, the scheme still does not allow Bob to order the events properly based on the received messages.
He would not be able to figure out whether his sister received a message from his father updating the mother's status (and telling her to send a message to Bob) after his father had sent him the message, or whether she simply visited the mother before the mother became ill. In other words, the scheme does not fully capture the causal relation between the two messages.

In distributed systems, causality tracking is a major part of many problems, ranging from the simple, such as resource allocation and reservation, to the more complicated, such as checkpointing or deterministic replay. Many of these algorithms are used in safety-critical systems, where faulty knowledge of causality could have catastrophic consequences. We now look at several ways one can track causality in distributed systems and how they can help Bob figure out his Mom's current health status.

2.2.1 Lamport Clocks

In 1978, Lamport invented a very simple yet effective mechanism to capture the ordering of events in distributed systems [40]. Instead of using physical clocks, each process P_i now has a function C_i(a) that returns a number C(a) for event a in P_i; we call this number a's timestamp (or a's clock). In other words, instead of associating physical times with P_i's events, the algorithm associates logical times with them. Assuming that sending and receiving messages are observable events in the system and that local events follow program order, we describe the Lamport clocks algorithm through a set of clock maintenance rules:

R1. Each process P_i maintains a counter C_i initialized to 0.
R2. P_i increments C_i when event e occurs and associates e with the new clock. Let C_i(e) denote this value.
R3. P_i attaches (piggybacks) C_i whenever it sends a message m to P_j.
R4. When P_i receives m, it sets C_i to a value greater than or equal to its present value and greater than the clock piggybacked on m.
Figure 2.1 shows a message passing program with three processes implementing the above Lamport clocks algorithm. Each event has an associated clock value, and the direction of each arrow indicates the direction of the message (i.e., a, c, e are the sending events and b, d, f are the corresponding receiving events, respectively). The algorithm has two important properties:

P1. If event a occurs before event b in P_i, then C_i(a) < C_i(b). This follows from rule R2 above.
P2. If a is the sending event of message m and b is the corresponding receiving event of m, then C(a) < C(b). This follows from rules R3 and R4 above.

We are now ready to define the Lamport happens-before (→) relation over the set of all events in a distributed system. Let e_i^a denote the a-th event that occurs in process P_i, send(P_i, m) the event corresponding to sending message m to P_i, and recv(P_i, m) the event corresponding to the reception of m from P_i. The relation → is defined as follows: e_i^a → e_j^b iff one of the following holds:

(1) i = j ∧ b = a + 1 (program order), or
(2) i ≠ j ∧ e_i^a = send(P_j, m) ∧ e_j^b = recv(P_i, m) (message order), or
(3) ∃ e_k^c : e_i^a → e_k^c ∧ e_k^c → e_j^b (transitivity).

Using this definition of → and applying the two properties P1 and P2, we can see that any distributed system implementing the Lamport clocks algorithm satisfies the Clock Condition, which states: for any two events a and b, if a → b then C(a) < C(b). It is important to note that the converse of the Clock Condition is not always true. Consider events e and c in Figure 2.1: although C(e) < C(c), we cannot conclude that e → c. However, we can infer that c could not have happened before e.

Figure 2.1: A distributed system using Lamport clocks (events a, b, c, d, e, f carry clocks 1, 2, 3, 4, 2, 5, respectively)

While this inference is enough for several applications, some applications do require a more precise answer (i.e., whether e happens-before c, or e and c are simply concurrent events). We now look at vector clocks, a more powerful logical clock scheme that addresses this deficiency of Lamport clocks.
2.2.2 Vector Clocks

Vector clocks had been used long before they were formally defined, simultaneously and independently, by Fidge [24] and Mattern [43]. For example, version vectors, which are essentially vector clocks, were used to detect mutual inconsistency in distributed systems [49]. Vector clocks address the limitation of Lamport clocks by maintaining a vector of timestamps per process. That is, process P_i maintains a vector VC_i[0..n], where VC_i[j] represents P_i's knowledge about the current timestamp of P_j. We now describe the vector clocks algorithm:

R1. Each process P_i has a vector VC_i initialized to all zeros (∀k ∈ {0..n} : VC_i[k] = 0).
R2. P_i increments VC_i[i] when event e occurs and assigns e the new clock. Let e.VC denote this value.
R3. P_i attaches (piggybacks) VC_i whenever it sends a message m to P_j. Let m.VC denote this value.
R4. When P_i receives m, it updates its vector clock as follows: ∀k ∈ {0..n} : VC_i[k] = max(VC_i[k], m.VC[k]).

We also need a way to compare vector clocks (which is not necessary for Lamport clocks, since there we compare single integers). Two vector clocks VC_i and VC_j are compared as follows (we show only the < case; the = case is trivial and thus omitted, and the > case is symmetric to <):

VC_i < VC_j iff (∀k ∈ {0..n} : VC_i[k] ≤ VC_j[k]) ∧ (∃l ∈ {0..n} : VC_i[l] < VC_j[l])

Earlier we mentioned that the Lamport clocks algorithm cannot guarantee the converse of the Clock Condition. The vector clocks algorithm effectively addresses that deficiency: it satisfies the Strong Clock Condition, which states that for any events a and b, a → b iff a.VC < b.VC (in contrast with Lamport clocks, which only guarantee that if a → b then a.LC < b.LC). Figure 2.2 shows the same parallel program as Figure 2.1, using vector clocks instead of Lamport clocks. Now consider events e and c, which have vector timestamps of [2, 0, 0] and [1, 2, 0], respectively. Clearly, neither e → c nor c → e holds.
In this case, e and c are concurrent events.

Definition 2.1. Two events a and b are concurrent iff ¬(a → b) ∧ ¬(b → a).

While vector clocks are useful in applications that require knowledge of the events' partial order, they have one major drawback: each message has to carry a vector of n integers. As systems scale beyond thousands of processes, the impact on bandwidth becomes significant. Unfortunately, in the worst case, the size of vector clocks cannot be reduced [20]. Nonetheless, in systems where bandwidth is a large concern, one can apply several compression schemes to reduce the size of the vector clocks that are transmitted [31, 44, 60, 62]. The effectiveness of these schemes is highly dependent on the communication pattern and on the properties of the communication channels.

Figure 2.2: A distributed system using vector clocks (events a, b, c, d, e, f carry vector timestamps [1,0,0], [1,1,0], [1,2,0], [1,2,1], [2,0,0], [2,2,2], respectively)

2.3 The Message Passing Interface

The Message Passing Interface (MPI) is a message-passing library interface specification [28], designed to help programmers write high performance, scalable, and portable parallel message passing programs. Today it is the de facto API for writing programs that run on large clusters. A description of an MPI program can be found in the MPI standard [28], which states:

An MPI program consists of autonomous processes, executing their own code, in a MIMD style. The codes executed by each process need not be identical. The processes communicate via calls to MPI communication primitives. Typically, each process executes in its own address space, although shared-memory implementations of MPI are possible. This document specifies the behavior of a parallel program assuming that only MPI calls are used. The interaction of an MPI program with other possible means of communication, I/O, and process management is not specified.

A typical MPI program is written in C/C++/Fortran, compiled, and linked with an MPI implementation.
There exist many different MPI implementations [9, 11, 29], all of which follow the specifications given in the standard. An example of a typical MPI program is given in Figure 2.3. The program computes π as follows: each process computes its own chunk using numerical integration, with the number of intervals it receives from the master through MPI_Bcast, a broadcast call. The master then collects the chunks and calculates the final value of π through MPI_Reduce, a reduction call.

To provide maximum performance and portability, MPI supports a wide range of communication modes, including nonblocking communication, nondeterministic receives, and a large number of collective calls. We divide these communication calls into three groups, namely synchronous communication, asynchronous communication, and collective communication, and describe them in Sections 2.3.1, 2.3.2, and 2.3.3.

2.3.1 Synchronous Point-to-Point Communication

The most basic form of MPI point-to-point communication is synchronous communication. These calls usually implement a rendezvous protocol, in which the receiver blocks until it starts to receive data from a matching sender. In the case of a synchronous send, the sender blocks until it receives an acknowledgement from the receiver that it has started the receiving process.

Synchronous communication offers several advantages. First, it is easier to use and understand than asynchronous communication, which allows for higher productivity. Second, it can help prevent memory exhaustion by not requiring the MPI runtime to provide message buffering. However, synchronous communication usually comes with a performance penalty due to the cost of synchronization, especially for applications that communicate large messages infrequently. To address this problem, MPI offers two alternatives: buffered communication and asynchronous communication.
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Finalize();
    return 0;
}

Figure 2.3: An MPI program calculating π

Buffered communication allows the process to issue a sending request and continue processing without waiting for an acknowledgement from the receiver. MPI programmers can take advantage of buffered communication through one of these two methods:

Allocate an explicit buffer and provide it to the MPI_Bsend call through the use of MPI_Buffer_attach. Note that the user can attach only one buffer per process, though the buffer can be used for more than one message. MPI_Buffer_detach can be called later to force the delivery of all messages currently in the buffer.

Take advantage of the MPI runtime's buffer through the use of MPI_Send. The user buffer is available immediately after the call returns, since MPI has copied the data into its own buffer. However, it is generally unsafe to rely on the runtime to provide such a buffer. In fact, the MPI standard does not mandate that the implementation provide any buffering (although most implementations do in practice to improve performance).
If the runtime runs out of buffer space due to excessive pending communication, MPI_Send will block until more buffer space is available or until the data has been transmitted to the receiver's buffer (normally it blocks only until the user buffer has been completely copied into the runtime's buffer).

We now describe the syntax of MPI_Ssend and MPI_Recv, the two main synchronous point-to-point operations.

MPI_Ssend(buffer, count, dtype, dest, tag, comm)

where buffer is a pointer to the data to be sent, count is the number of elements of type dtype in buffer, dtype is the abstract type of the data, dest is the destination process for this message, tag is an integer tag associated with this message, and comm is the MPI communicator in which this event takes place (a communicator is basically a group of processes created by the program; the special MPI_COMM_WORLD is the default communicator containing all processes). Note that while MPI_Send can exhibit the same blocking behavior as MPI_Ssend, according to the MPI standard its behavior is asynchronous; that is, the call may return before a matching receive is posted.

MPI_Recv(buffer, count, dtype, source, tag, comm, status)

where buffer is a pointer to the receiving buffer, count is the number of elements of type dtype expected, dtype is the abstract type of the data, source is the process that is expected to deliver the message, tag is the integer tag associated with the expected message, comm is the MPI communicator in which this event takes place, and status is a data structure that can be used to obtain more information about the received message. The receive is not required to fill the buffer (i.e., partial receives are allowed), in which case the user can find out exactly how many elements were received by calling MPI_Get_count or MPI_Get_elements. Note that it is an MPI error to post a receive that does not have enough buffer space for an incoming message.
2.3.2 Asynchronous Point-to-Point Communication

As mentioned earlier, synchronous communication offers robustness and predictability of message delivery at the cost of program flexibility and performance. Many applications exhibit a large degree of communication-computation overlap and thus would benefit from the ability to issue communication requests, continue with local processing, and process the results of those requests when the computation phase is over, with the hope that the MPI runtime has sent or delivered the messages during the computation phase. MPI offers asynchronous point-to-point communication through calls such as MPI_Isend and MPI_Irecv. The process provides a buffer, issues the call, obtains a request handle from the runtime, and waits for the communication request to finish later using either MPI_Wait or MPI_Test (or their variants such as MPI_Waitall or MPI_Waitany). In the MPI 1.1 standard, the process cannot access the buffer while the request is still pending. This was later relaxed in MPI 2.2 and higher to read-only access for pending send requests (and no access for pending receive requests).

Similarly to MPI_Send, the MPI runtime can (and often will) buffer the messages sent by MPI_Isend as long as the runtime's buffer has enough space. In that case, the corresponding call to MPI_Wait simply indicates that the user data has been copied into the runtime's buffer and the process can now reuse the buffer associated with the MPI_Isend. Applications that require rendezvous semantics in such situations must use MPI_Issend, whose corresponding MPI_Wait blocks until the receiver has started to receive the data. We now describe the syntax of MPI_Isend, MPI_Irecv, and MPI_Wait.

MPI_Isend(buffer, count, dtype, dest, tag, comm, req_handle)

where req_handle represents the communication request handle returned by the MPI runtime. All other arguments are similar to those of MPI_Send.
MPI_Irecv(buffer, count, dtype, source, tag, comm, req_handle)

where req_handle represents the communication request handle returned by the MPI runtime. All other arguments are similar to those of MPI_Recv.

MPI_Wait(req_handle, status)

where req_handle is the communication request to be finished and status is where the user can obtain more information about the communication request after it finishes. Note that req_handle is set to MPI_REQUEST_NULL once the communication associated with the request completes. Invoking MPI_Wait on MPI_REQUEST_NULL has no effect.

2.3.3 Collective Communication

As the name suggests, collective communication refers to MPI functions that require the participation of all processes within a given communicator. It is easy to think of collective communication as a set of point-to-point operations; for example, an MPI_Bcast call can be decomposed into multiple MPI_Send calls from the root to all other processes in the communicator and a matching MPI_Recv call in each of the other processes to receive the data from the root. In practice, however, collective operations are heavily optimized by most implementations depending on the size of the messages and the network structure. For example, MPI_Bcast can use a tree-based algorithm to broadcast the message efficiently [61].

While it is intuitive for developers to think of collective operations as having synchronizing behavior, the implementation is often not required to provide such semantics. Only a few collective calls, such as MPI_Barrier, have synchronization semantics; the rest are only required to block until they have fulfilled their roles in the collective operation. For example, in an MPI_Reduce call, after a process has sent its data to the reducing root, it can proceed locally without waiting for the root to receive the messages from the other processes. However, the MPI standard does require that all processes in the communicator execute the collective.
Collective operations also have additional requirements, such as that the sending/receiving buffers must be precisely specified (i.e., no partial receives are allowed). We now describe the syntax of the MPI_Barrier call.

MPI_Barrier(comm)

where comm is the MPI communicator on which this process wants to invoke the barrier. The MPI standard requires that all processes in the communicator participate in the barrier and that they all block until every process has reached the call.

2.3.4 Nondeterminism in MPI

The MPI standard allows some MPI calls to have nondeterministic behavior, which gives programmers more flexibility and reduces coding complexity. There are several nondeterministic constructs in MPI:

Nondeterministic receives using MPI_ANY_SOURCE as the argument for the source field. As the name suggests, these receives can accept any incoming message carrying a compatible tag and coming from any sender within the same communicator. We sometimes refer to nondeterministic receives as wildcard receives.

Nondeterministic receives using MPI_ANY_TAG. In addition to MPI_ANY_SOURCE, a receive call in MPI can also choose to accept messages carrying any tag (within the same communicator and coming from a matching sender). A nondeterministic receive can use both MPI_ANY_SOURCE and MPI_ANY_TAG, in which case it can accept any incoming message from senders belonging to the same communicator. It is important to note that the communicator cannot be nondeterministic.

The MPI_Waitany call can complete any one of the request handles passed as its argument (the choice of request can be arbitrary). Similarly, MPI_Waitsome can complete any number of requests out of all the request handles passed as its arguments (i.e., if there are n request handles to complete, there are 2^n - 1 possible ways for MPI_Waitsome to finish). Note that due to their highly nondeterministic behavior, MPI_Waitany is only occasionally used and MPI_Waitsome is almost never used.
The MPI_Startall call starts all given persistent requests (communication handles that can be reused repeatedly until they are explicitly deallocated) in arbitrary order, and different orderings might lead to different execution paths. In practice, however, most MPI implementations start them in the order given by the array of request handles.

MPI_Test and its variants MPI_Testany, MPI_Testall, and MPI_Testsome report whether some pending communication requests have finished. If the pending requests have finished, the MPI runtime deallocates the requests and sets the flags. Since communication completion depends not only on the order in which the requests are issued but also on network routing, timing, and numerous other system factors, the flag returned by MPI_Test is not guaranteed to be set at the same point across multiple executions of the program with the same test harness. For example, during the first run the developer might observe that MPI_Test sets the flag to true after five invocations, yet during the next run with the same test harness it sets the flag to true only after the seventh invocation. The only thing the MPI standard guarantees is that if the process repeatedly invokes MPI_Test in a busy-wait loop, the flag eventually will be set, provided both the receiver and the sender have already started the receiving/sending calls (this is the MPI progress guarantee). Many large programs use MPI_Test in place of MPI_Wait because of its nonblocking behavior: the program can periodically check whether some pending communication requests have finished without having to block.

Nondeterministic probes (MPI_Probe or MPI_Iprobe) using MPI_ANY_SOURCE, MPI_ANY_TAG, or both. Probes allow the process to check whether there are any messages to receive without actually receiving them. Probes are extremely useful in applications where the receivers do not always know how large the incoming messages are.
If there are ready-to-receive messages, the status field returned by the probe allows the process to determine the exact size of the incoming message, so the process can allocate just enough buffer space to receive it. MPI_Probe behaves similarly to MPI_Recv in the sense that it blocks until there is a message to receive. In contrast, MPI_Iprobe behaves similarly to MPI_Test: it returns immediately and sets the flag to true if there is a message to receive. As with MPI_Test, MPI_Iprobe also has progress-guarantee semantics. It is important to note that if a program invokes a probe call with MPI_ANY_SOURCE and later issues a receive with MPI_ANY_SOURCE, there is no guarantee that the receive will receive the message probed earlier (unless there is only one possible message to receive).

2.3.5 Nonovertaking in MPI

The rich features and enormous flexibility of MPI come at the cost of increased complexity. In a program with asynchronous sends/receives interacting with synchronous calls, with some or all of them being nondeterministic, determining which sending event should match each receiving event can be a challenging task. To facilitate the matching of sends and receives, the MPI standard enforces the nonovertaking rule, which states:

Messages are nonovertaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending.

Intuitively, one can imagine the communication universe in MPI being split into multiple FIFO channels. Two processes exchanging messages using the same tag within the same communicator are effectively utilizing one of these FIFO channels.
However, the relative order of two messages from two different channels can be arbitrary. We provide a formal notion of this rule in Chapter 3.

2.3.6 Common MPI Errors

We provide several examples that illustrate the most common errors found in MPI programs. They can be classified into these categories: deadlocks, resource leaks, erroneous buffer reuse, and type mismatches. Some bugs are caused by the use of MPI nondeterministic constructs, as explained in Section 2.3.4; that is, when a bug is caused by nondeterminism, there are MPI program schedules that may never be executed under conventional testing. We present a mixture of nondeterministic and deterministic bugs through several examples.

2.3.6.1 Deadlock

Deadlock typically happens when there is a send and receive mismatch; that is, one process tries to receive a message from a process that either has no intention of sending or is not able to send the expected message. Figure 2.4 presents a simple program in which each process sends a message to P0, and P0 tries to receive from all other processes. However, due to a programming bug, P0's first receive call expects a message from P0 (itself), which does not post any send to match that receive. Therefore, the execution deadlocks.

Figure 2.5 presents an unsafe program involving two processes sending messages to each other (a head-to-head deadlock). The deadlock occurs when the size of the buffer exceeds the amount of buffering the MPI runtime provides.
The MPI standard recommends against relying on the runtime buffer to achieve the program's objective, since doing so limits program portability. The program is unsafe because the MPI standard allows it either to deadlock or to execute correctly, depending on the buffering available. This communication pattern exists in the Memory Aware Data Redistribution Engine (MADRE) [59], a collection of memory-aware parallel redistribution algorithms that addresses the problem of efficiently moving data blocks across nodes, and in many other programs.

if (rank != 0)
    MPI_Send(sendbuf, count, MPI_INT, 0, 0, MPI_COMM_WORLD);
else
    for (i = 0; i < proc_count; i++)
        MPI_Recv(recvbuf+i, count, MPI_INT, i, 0, MPI_COMM_WORLD, status+i);

Figure 2.4: Deadlock due to send-receive mismatch

if (rank == 0) {
    MPI_Isend(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD, &h);
    MPI_Wait(&h, &status);
    MPI_Irecv(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD, &h);
    MPI_Wait(&h, &status);
} else if (rank == 1) {
    MPI_Isend(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, &h);
    MPI_Wait(&h, &status);
    MPI_Irecv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, &h);
    MPI_Wait(&h, &status);
}

Figure 2.5: Head-to-head deadlock

Figure 2.6 presents an example of an unsafe program in which two MPI processes post MPI_Barrier and MPI_Bcast calls in an order that could potentially cause a deadlock. Since the MPI standard does not require implementations to provide synchronizing semantics for MPI_Bcast, it is possible (and likely in practice) that P0 does not have to wait for P1 to post the corresponding MPI_Bcast call before P0 can finish its own MPI_Bcast call, in which case the execution does not deadlock. However, if an implementation assumes synchronizing behavior for MPI_Bcast, the execution deadlocks. This example again shows that a semantic deadlock need not imply an observed deadlock. Figure 2.7 presents a program that contains a nondeterminism-related deadlock.
In this example, P1 posts two receives: one is a wildcard, while the other specifically matches a message from P2. Since either send, from P0 or from P2, is eligible to match the wildcard receive, the second receive of P1 will have no matching send if the wildcard receive matches the send from P2. During testing, a developer might observe that the program runs fine during some executions and deadlocks during others.

if (rank == 0) {
    MPI_Bcast(buffer, count, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(buffer, count, MPI_INT, 0, MPI_COMM_WORLD);
}

Figure 2.6: Deadlock due to collective posting order

if (rank == 0) {
    MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    MPI_Recv(buf, count, MPI_INT, 2, 0, MPI_COMM_WORLD, &status);
} else if (rank == 2) {
    MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
}

Figure 2.7: Deadlock due to nondeterministic receive

2.3.6.2 Resource Leaks

There are many different types of resource leaks, such as unfreed communicators, unfreed types, and unfreed requests. We provide an example of a request leak in Figure 2.8. In this program, the nonblocking request from P0 remains in the system, since the program never deletes it through a call to MPI_Cancel, nor does it wait or test for the request's completion through a call to MPI_Wait or MPI_Test. Request leaks are a serious issue for MPI programs, as an excessive number of pending requests drastically degrades the performance of the program and may crash a long-running application. In addition to the request leak, this example also contains a different kind of resource leak, a type leak: both P0 and P1 fail to free newtype through a call to MPI_Type_free.
Imagine a program where this pattern is enclosed in a loop that creates many different new MPI datatypes without freeing them; the resources associated with the types never get freed and returned to the system, which in the long run might affect the program's performance or behavior (due to out-of-memory errors).

2.3.6.3 Erroneous Buffer Reuse

The MPI standard requires that the buffer associated with a nonblocking request not be accessed by the process until the request has been waited or tested for completion.

    if (rank == 0) {
      MPI_Datatype newtype;
      MPI_Type_contiguous(count, MPI_INT, &newtype);
      MPI_Type_commit(&newtype);
      MPI_Isend(buf, 1, newtype, 1, 0, MPI_COMM_WORLD, &h);
      ...
      MPI_Finalize();
    } else if (rank == 1) {
      MPI_Datatype newtype;
      MPI_Type_contiguous(count, MPI_INT, &newtype);
      MPI_Type_commit(&newtype);
      MPI_Recv(buf, 1, newtype, 0, 0, MPI_COMM_WORLD, &status);
      MPI_Finalize();
    }

    Figure 2.8: Resource leak due to unfreed request

Since MPI 2.2, this requirement is relaxed for nonblocking send operations with respect to read access: the process may read the buffer of a nonblocking send request before the request completes (writing to the buffer is still prohibited). Violating this requirement leads to undefined behavior. The program shown in Figure 2.9 presents illegal buffer reuse on both the sender side and the receiver side.

2.3.6.4 Type Mismatches

MPI's requirements for type matching between sending and receiving are very complex because the standard supports many different methods of creating new datatypes.
    if (rank == 0) {
      MPI_Isend(buf, 1, newtype, 1, 0, MPI_COMM_WORLD, &h);
      buf = 1;  /* illegal write to buffer before send request completes */
      MPI_Wait(&h, &status);
    } else if (rank == 1) {
      MPI_Irecv(buf, 1, newtype, 0, 0, MPI_COMM_WORLD, &h);
      a = buf;  /* illegal read from buffer before receive request completes */
      MPI_Wait(&h, &status);
    }

    Figure 2.9: Erroneous buffer reuse

This flexibility limits the ability of the MPI runtime to perform strict type checking, and thus many erroneous type mismatches go uncaught during testing yet surface during production runs. Figure 2.10 shows a program that should run correctly in most cases but will produce an erroneous result when running in an environment where the two nodes have different endianness.

2.3.7 The MPI Profiling Interface

Because MPI is an API used heavily in high performance computing, where users have a strong interest in analyses such as performance measurement and data tracing, the MPI standard defines a profiling interface to facilitate such tasks. The user takes advantage of the profiling interface by providing wrappers for the MPI calls they are interested in profiling (e.g., MPI_Send). The wrapper then invokes the MPI runtime by issuing the corresponding PMPI call (e.g., PMPI_Send). Figure 2.11 shows a simple user wrapper that counts the number of times MPI_Send is invoked. The major drawback of the profiling interface defined by the standard is that at most one active wrapper can be linked with the program; the PNMPI framework [53] removes this limitation by allowing multiple MPI wrappers to be stacked on top of MPI programs.

2.4 Piggybacking in MPI

Piggybacking is the act of sending additional data (piggyback data) along with the messages originated by the main application. Many distributed protocols and algorithms rely on piggybacking support.
    if (rank == 0) {
      int data = 5;
      /* sending 4 bytes */
      MPI_Send(&data, 4, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      int data;
      /* receive one int */
      MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    Figure 2.10: Type mismatch between sending and receiving

    /* User code */
    if (rank == 0) { MPI_Send(...); }

    /* Wrapper */
    int MPI_Send(...) {
      send_counter++;
      return PMPI_Send(...);
    }

    Figure 2.11: A simple PMPI wrapper counting MPI_Send

For example, tracing libraries [39, 55], critical path analysis [18], and application-level checkpointing protocols [52] all require piggyback data to function correctly. In addition, causality tracking protocols such as the Lamport clocks and vector clocks mentioned earlier also require piggybacking. Unfortunately, the MPI standard, as of version 2.2, has no built-in piggyback mechanism, so most tools have relied on ad hoc implementations. We describe here several popular mechanisms for sending piggyback data, each with its own advantages and disadvantages.

2.4.1 Buffer Attachment Piggybacking

Buffer attachment piggybacking, also called explicit packing piggybacking [51], is one of the simplest approaches to piggybacking: the tool layer attaches the piggyback data directly to the message buffer. This scheme uses MPI_Pack at the sender side to pack the piggyback data together with the message data and MPI_Unpack at the receiver side to separate the piggyback data from the main message. The piggyback buffer can be attached at the beginning or at the end of the buffer. Figure 2.12 illustrates the concept of buffer attachment piggybacking. While this method is simple, it incurs very high overhead, especially in communication-intensive programs, due to the excessive calls to MPI_Pack and MPI_Unpack.
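The pack/unpack step can be sketched without MPI by standing in memcpy for MPI_Pack/MPI_Unpack. This is an illustrative model of the scheme under the "piggyback first" layout, not ISP's code; all names are hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical buffer-attachment model: prepend the piggyback data to the
 * message payload, as a tool layer would do with MPI_Pack before sending. */
static void pack_with_piggyback(const char *pb, size_t pb_len,
                                const char *msg, size_t msg_len,
                                char *out)
{
    memcpy(out, pb, pb_len);             /* piggyback first...  */
    memcpy(out + pb_len, msg, msg_len);  /* ...then the payload */
}

/* Receiver side: split the combined buffer apart again (MPI_Unpack's role). */
static void unpack_piggyback(const char *in, size_t pb_len,
                             char *pb, char *msg, size_t msg_len)
{
    memcpy(pb, in, pb_len);
    memcpy(msg, in + pb_len, msg_len);
}
```

A round trip through these two helpers leaves both the piggyback and the payload intact; the overhead of the real scheme comes from performing this extra copy on every message.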
    /* Wrapper - Piggyback Layer (pb_buf stores the piggyback data) */
    int MPI_Send(buf, ...) {
      PACK pb_buf and buf into newbuf
      return PMPI_Send(newbuf, ...);
    }
    int MPI_Recv(buf, ...) {
      err = PMPI_Recv(newbuf, ...);
      UNPACK newbuf into pb_buf and buf
      return err;
    }

    Figure 2.12: Buffer attachment piggyback

It is also not entirely clear how one would attach piggyback data to collective operations such as MPI_Reduce. Studies have also shown that this method of piggybacking has the highest overhead in terms of bandwidth and latency [51]. Several MPI tools currently use buffer attachment piggybacking [48].

2.4.2 Separate Message Piggybacking

As the name implies, this piggyback scheme sends the piggyback data as a separate message, either right before or right after the message originated by the main application. Figure 2.13 illustrates the concept.

    /* Wrapper - Piggyback Layer (pb_buf stores the piggyback data) */
    int MPI_Send(buf, ...) {
      PMPI_Send(pb_buf, ...);
      return PMPI_Send(buf, ...);
    }
    int MPI_Recv(buf, ...) {
      PMPI_Recv(pb_buf, ...);
      return PMPI_Recv(buf, ...);
    }

    Figure 2.13: Separate message piggyback

The piggyback layer must pay special attention to nondeterministic asynchronous wildcard receives, since the sender of the message is not known at the time the receive is issued. In such cases, the receive for the piggyback message is usually posted when the communication finishes (e.g., at MPI_Wait) [51]. However, we can show that the separate message piggybacking scheme as described does not correctly handle piggyback data in the presence of wildcard receives. Consider the example shown in Figure 2.14, where the executed code shows all MPI calls being executed when the user code is linked with a piggyback layer implementing the two-message protocol. We first consider the case where the piggyback messages are transmitted over the same communicator as the original messages.
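The reason this can go wrong follows from MPI's nonovertaking rule: messages from one sender to one receiver with the same tag and communicator match posted receives in FIFO order. A small C model of that queue discipline (our illustration with hypothetical names, not tool code) shows how a wildcard receive can consume a piggyback message intended for a later wait:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the match queue from P1 to P0 on one communicator.
 * Under the two-message protocol, P1's application sends are interleaved with
 * their piggyback messages; P0's wildcard receives match strictly in FIFO
 * order, so a wildcard receive can pick up a piggyback message. */
static const char *queue[] = {"buf_1", "pb_buf1", "buf_2", "pb_buf2"};
static int head = 0;

static const char *match_next_recv(void)
{
    return queue[head++];
}
```

The first wildcard receive correctly matches buf_1, but the second one matches pb_buf1, the piggyback of the first message, rather than buf_2.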
The starred entries (italicized in the original figure) indicate the extra messages that the piggyback layer inserts. Since they travel over the same communicator, the piggyback message of the first MPI_Isend ends up being received by the second MPI_Irecv of process P0, which is erroneous.

    User code:
      Process 0:                         Process 1:
        Irecv(buf_1, from *, h1, comm);    Isend(buf_1, to 0, h1, comm);
        Irecv(buf_2, from *, h2, comm);    Isend(buf_2, to 0, h2, comm);
        Wait(h1, s1);                      Wait(h1, s1);
        Wait(h2, s2);                      Wait(h2, s2);

    Executed code:
      Process 0:                              Process 1:
        Irecv(buf_1, from *, h1, comm);         Isend(buf_1, to 0, h1, comm);
        Irecv(buf_2, from *, h2, comm);         *Isend(pb_buf1, to 0, h1', comm);
        Wait(h1, s1);                           Isend(buf_2, to 0, h2, comm);
        *Irecv(pb_buf1, s1.SOURCE, h1', comm);  *Isend(pb_buf2, to 0, h2', comm);
        *Wait(h1', s1');                        Wait(h1, s1);  Wait(h1', s1');
        Wait(h2, s2);                           Wait(h2, s2);  Wait(h2', s2');

    Figure 2.14: Separate message piggyback issue on the same communicator

We now consider the case in which the piggyback layer transmits the piggyback messages on a different communicator. This means that for each communicator in the program, the layer must create a corresponding shadow communicator on which to send piggyback data. Consider the example shown in Figure 2.15, which differs slightly from the earlier example (process P0's MPI_Wait calls are issued in a different order with respect to its nonblocking receives). In this example, pbcomm is the piggyback communicator corresponding to comm.

    User code:
      Process 0:                          Process 1:
        Irecv(buf_1, from *, h1, comm);     Isend(buf_1, to 0, h1, comm);
        Irecv(buf_2, from *, h2, comm);     Isend(buf_2, to 0, h2, comm);
        Wait(h2, s2);  /* h2 before h1 */   Wait(h1, s1);
        Wait(h1, s1);                       Wait(h2, s2);

    Executed code:
      Process 0:                                Process 1:
        Irecv(buf_1, from *, h1, comm);           Isend(buf_1, to 0, h1, comm);
        Irecv(buf_2, from *, h2, comm);           *Isend(pb_buf1, to 0, h1', pbcomm);
        Wait(h2, s2);  /* h2 before h1 */         Isend(buf_2, to 0, h2, comm);
        *Irecv(pb_buf2, s2.SOURCE, h2', pbcomm);  *Isend(pb_buf2, to 0, h2', pbcomm);
        *Wait(h2', s2');                          Wait(h1, s1);  Wait(h1', s1');
        Wait(h1, s1);                             Wait(h2, s2);  Wait(h2', s2');
        *Irecv(pb_buf1, s1.SOURCE, h1', pbcomm);
        *Wait(h1', s1');

    Figure 2.15: Separate message piggyback issue on different communicators

It is clear from the figure that the piggyback layer ends up associating the piggyback of the second message with the first message, which is also erroneous. This is because one cannot post the piggyback receive requests immediately after the application receive requests: the sender of a message received by a nonblocking wildcard receive is not known until after the corresponding wait has completed. Even with these shortcomings, separate message piggybacking remains a useful mechanism for attaching and receiving piggyback data with collective operations. In fact, it is currently the only known method of transmitting piggyback data with collective operations (short of modifying the MPI library or breaking the collective operations up into point-to-point operations).

2.4.3 Datatype Piggyback

Another piggyback mechanism favored by many tools is datatype piggybacking [50, 54]. In this scheme, a new datatype is created with MPI_Type_struct for every send and receive operation; the new datatype combines a pointer to the main message buffer and a pointer to the current piggyback buffer. Figure 2.16 illustrates the mechanism. In order to handle partial receives correctly, the piggyback data should be placed before the message data.

    /* Wrapper - Piggyback Layer: send/receive (MPI_BOTTOM, 1, D)
       instead of (buffer, count, user_type); pb_buf stores the piggyback */
    int MPI_Send(buf, count, user_type, ...) {
      Create datatype D from pb_buf and buf
      return PMPI_Send(MPI_BOTTOM, 1, D, ...);
    }
    int MPI_Recv(buf, count, user_type, ...) {
      Create datatype D from pb_buf and buf
      return PMPI_Recv(MPI_BOTTOM, 1, D, ...);
    }

    Figure 2.16: Datatype piggyback

Datatype piggybacking offers a compromise between buffer attachment piggybacking and separate message piggybacking.
It does not suffer from the high bandwidth overhead, and it correctly handles piggybacking for nondeterministic receives. However, it has several drawbacks: (i) moderate overhead due to excessive datatype creation (a new datatype must be created for every send and receive operation), and (ii) piggybacking for collective operations is very difficult to implement.

CHAPTER 3

MPI MATCHES-BEFORE

Using happens-before to track causality is an essential part of dynamic verification for parallel programs in general and MPI programs in particular. Unfortunately, the complex semantics of MPI allows many different types of interactions between events, many of which cannot be captured sufficiently by the traditional Lamport happens-before order discussed earlier. In this chapter we discuss the issues with applying the happens-before order to MPI programs and introduce the MPI matches-before order, which addresses those shortcomings.

3.1 Our Computational Model

A message passing program consists of sequential processes P0, P1, ..., Pn communicating by exchanging messages through communication channels. The channels are assumed to be reliable and to support the following operations:

send(dest,T) - send a message with tag T to process dest. This operation has semantics similar to MPI_Send, which means it has asynchronous behavior: the call can complete before a matching receive has been posted.

ssend(dest,T) - the synchronous version of send. This call returns only when the receiver has started to receive the message. In most MPI implementations, the receiver sends an ack to the sender to indicate that it has begun the receiving process.

recv(src,T) - receive a message with tag T from process src. When src is MPI_ANY_SOURCE (denoted as *), any incoming message sent with tag T can be received (a wildcard receive).

isend(dest,T,h) - the nonblocking version of send. The request handle h allows the call to be awaited for completion later.
Similar to send, this call has asynchronous behavior; the completion of the call (by a wait) indicates only that the buffer can be safely reused.

issend(dest,T,h) - the synchronous version of isend. The completion of this call indicates that the receiver has started to receive the message.

irecv(src,T,h) - the nonblocking version of recv. The request handle h allows the call to be awaited for completion later.

wait(h) - wait for a nonblocking communication request until it completes. According to MPI semantics, relevant piggyback information for a nonblocking receive cannot be accessed until the wait call. Similarly, for a nondeterministic nonblocking receive, the source field (the identity of the sending process) can only be retrieved at the wait.

barrier - all processes must invoke their barrier calls before any process can proceed beyond the barrier.

For illustrative purposes, we abstract away the buffer associated with all send and receive events since it does not affect our algorithm. Further, we assume that all these events happen in the same communicator and that MPI_ANY_TAG is not used. We also do not consider collective operations other than MPI_Barrier. Our implementation, however, does take all of these possibilities into account.

3.2 Issues Applying Happens-Before to MPI

We briefly go over how one might apply the traditional vector clocks algorithm to the example in Figure 3.1 to conclude that the first wildcard receive in P0 can match either send from P1 or P2, and also why the Lamport clocks algorithm fails to do so. Assuming the first receive of P0 matches the send from P1 and the second receive of P0 matches the send from P2, we want to know whether the vector clocks algorithm can determine that the first receive of P0 could have received the message sent from P2.
Using the clock updating rules from the vector clocks algorithm described earlier, P0's first receive would have the vector timestamp [1,1,0] while the send from P2 would have [0,2,2]. Clearly, the send and the receive are concurrent, and thus the send is a potential match for the receive. In contrast, if we apply the Lamport clocks algorithm to this example, P0's first receive event would have a clock of 1 while the send from P2 would have a clock of 3. The algorithm cannot determine whether the two events have any causal relationship, and hence it cannot safely flag the send from P2 as a potential match for the first receive of P0.

    P0                P1                P2
    recv(*) [1,1,0]   send(0) [0,1,0]
                      send(2) [0,2,0]   recv(1) [0,2,1]
    recv(*) [2,2,2]                     send(0) [0,2,2]

    Figure 3.1: Wildcard receive with two matches

One can observe that the communication between P1 and P2 in this example has no effect on P0, yet the matching causes a clock increase that prevents P0 from determining the causality between its first wildcard receive and P2's send. Now consider the example shown in Figure 3.2. Assuming the irecv of P1, denoted r, matches the isend from P0, we apply the vector clocks algorithm to figure out whether the isend from P2, denoted s, can safely be flagged as a potential match for r. Using the vector clock updating rules and treating barrier as a synchronization event where all processes synchronize their clocks to the global maximum, the clocks for r and s would be [1,0,0] and [1,0,1], respectively. This means r →hbVC s, and thus the algorithm fails to recognize s as a potential match for r. Clearly, the notion of happening and the corresponding happens-before order are insufficient for capturing all behaviors of MPI programs. We need a new model that allows us to completely capture the ordering of all events within an MPI program execution.
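The vector clock comparisons used in this argument can be made concrete. Below is a minimal C sketch (our illustration, not the protocol implementation; names are ours) that tests both cases above:

```c
#include <assert.h>

#define NPROC 3

/* Standard vector clock order: a happens-before b when every component of a
 * is <= the corresponding component of b and the two clocks differ. */
static int vc_hb(const int a[NPROC], const int b[NPROC])
{
    int i, some_smaller = 0;
    for (i = 0; i < NPROC; i++) {
        if (a[i] > b[i]) return 0;
        if (a[i] < b[i]) some_smaller = 1;
    }
    return some_smaller;
}

/* Two events are concurrent when neither happens-before the other. */
static int vc_concurrent(const int a[NPROC], const int b[NPROC])
{
    return !vc_hb(a, b) && !vc_hb(b, a);
}
```

With the timestamps of Figure 3.1, the receive [1,1,0] and P2's send [0,2,2] come out concurrent, so the send is flagged as a potential match. With those of Figure 3.2, [1,0,0] precedes [1,0,1], which is exactly the ordering that hides the legal match.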
    P0                 P1                  P2
    Isend(to P1,22)    Irecv(from:*,x)     Barrier
    Barrier            Barrier             Isend(to P1,33)
    Wait()             Recv(from:*,y)      Wait()
                       if(x==33) ERROR
                       Wait()

    Figure 3.2: Counterintuitive matching of nonblocking receive

3.3 Matches-Before

We first consider the different possible states of an MPI operation op after a process invokes op:

issued - op attains this state immediately after the process invokes it. All MPI calls are issued in program order.

matched - we define this state in Definition 3.1.

returned - op reaches this state when the process finishes executing the code of op.

completed - op reaches this state when op no longer has any visible effects on the local program state. All blocking calls reach this state immediately after they return, while nonblocking calls reach it after their corresponding waits return.

Of these, only the issued and matched states have significant roles in our algorithms; nonetheless, we include all possible states for completeness. The matched state is central to our protocols and is described in further detail below.

Definition 3.1 An event e in an MPI execution attains the matched state if it satisfies one of these conditions:

- e is an issued sending event of message m and the destination process has started to receive m through some event e'. e is said to have matched with e'. The receiving process is considered to have started the receive when we can (semantically) determine from which of the send events it will receive the data; the timing of the completion of the receive is up to the MPI runtime and is not relevant to this discussion. e and e' in this case are considered to be in a send-receive match-set.

- e is a receive event that marks the start of reception. If e is a wildcard receive, we denote by e.src the process with which e matched.

- e is a wait(h) call whose pending receive request associated with h has been matched.
For an isend, the wait can attain the matched state upon completion while the isend itself has still not matched (i.e., it is buffered by the MPI runtime). A matched wait is the only element in its match-set (a wait match-set).

- e is a barrier and all processes have reached their associated barriers. e is said to have matched with e' if they are in the same set of barriers. All participating barriers are in the same match-set (a barrier match-set).

While it is straightforward to determine the matching point of the synchronous calls recv and barrier, the situation is more complex for nonblocking calls. The assumption that all nonblocking calls attain the matched state exactly at their corresponding wait calls is incorrect. We explained the situation with the isend call earlier. Figure 3.3 shows another situation, in which the first irecv call of process P2 can attain the matched state anywhere from its issuance to right before the recv call returns (the arrow in the figure shows the interval during which the call can attain the matched state), which could be much earlier than the completion of its corresponding wait. This is due to the nonovertaking rule of the MPI standard.

Let E be the set of events produced in an execution, where each e ∈ E is a match event as per Definition 3.1. We represent this execution as P = ⟨E, →mb⟩, where →mb is the matches-before relation over E, defined as follows. Consider two distinct events e1 and e2 in E; e1 →mb e2 if and only if one of the following conditions holds:

C1. e1 and e2 are two events from the same process, e1 is either a ssend, recv, wait, or barrier, and e2 is issued after e1.

C2. e1 is a nonblocking receive and e2 is its corresponding wait.

C3. e1 and e2 are send events from the same process i with the same tag, targeting the same process j, and e1 is issued before e2. This is the nonovertaking rule of MPI for sends; the sends can be either blocking or nonblocking.

C4.
e1 and e2 are receive events from the same process i with the same tag, either e1 is a wildcard receive or both receive from the same process j, and e1 is issued before e2. This is the nonovertaking rule of MPI for receives; the receives can be blocking or nonblocking.

C4'. e1 and e2 are receive events from the same process i in which e2 is a wildcard receive; e2 is issued after e1, e2.tag = e1.tag, and e2.src = e1.src. C4' is a special case of C4 in which the →mb relationship between e1 and e2 can only be determined after e2 attains the matched state.

C5. e1 and e2 are from two different processes and there are events e3 and e4 such that e1 →mb e3, e4 →mb e2, e3 and e4 are in the same match-set, and e3 is not a receive event (i.e., e3 is either a send, isend, or barrier). Figure 3.4 illustrates this transitivity condition. The two shaded areas in the figure show a send-receive match-set and a barrier match-set, while the dashed arrows show the matches-before relationships between events in the same processes. Condition C5 allows us to infer that e1 →mb e2 and e2 →mb e3.

    P1             P2
    isend(0,h1)    irecv(0,h2)
    barrier        barrier
    send(0)        recv(0)
    wait(h1)       wait(h2)

    Figure 3.3: Nonovertaking matching of nonblocking calls

    [Diagram: a send-receive match-set and a barrier match-set (shaded), with
     dashed arrows showing the intra-process matches-before edges among the
     events e1, e2, and e3.]

    Figure 3.4: Transitivity of matches-before

C6. There exists an event e3 such that e1 →mb e3 and e3 →mb e2 (transitive order). In Figure 3.4, conditions C5 and C6 allow us to infer that e1 →mb e3.

Corollary 3.1 If e1 and e2 are two events in the same match-set, neither e1 →mb e2 nor e2 →mb e1 holds.

Corollary 3.2 If e1 and e2 are two events in the same process and e1 is issued before e2, then e2 →mb e1 is false.

In addition to the →mb relationship between two events, we also define the →mb relationship between X and Y, where X, Y, or both are match-sets. In this case, X →mb Y if and only if one of the following conditions holds:

C7. X is an event e1, Y is a send-receive match-set, e2 is the send event in Y, and e1 →mb e2.

C8. X is an event e1, Y is either a barrier match-set or a wait match-set, and for all events e2 in Y: e1 →mb e2.

C9. X is a send-receive match-set, e1 is the receive event in X, Y is an event e2, and e1 →mb e2.

C10. X is a send-receive match-set in which the send e1 is a synchronous send, e2 is the corresponding receive in the same match-set, Y is an event e3, and e1 →mb e3 ∧ e2 →mb e3.

C11. X is a barrier match-set or a wait match-set, Y is an event e2, and there exists some event e1 in X: e1 →mb e2.

C12. X and Y are both match-sets, and there is some event e1 in X such that e1 →mb Y.

Definition 3.2 Two events e1 and e2 are considered concurrent if they are not ordered by →mb. Let e1 ↛mb e2 denote the fact that e1 is not required to match before e2; then e1 and e2 are concurrent if and only if e1 ↛mb e2 ∧ e2 ↛mb e1.

3.4 Discussion

We have introduced the notion of matches-before, which allows us to correctly capture the causality between events in an MPI execution. We have also defined the concept of a match-set, which treats the matching action between a send and a receive as a single event. This is in contrast to most protocols based on the traditional Lamport clocks and vector clocks, which consider the sending event to happen-before the receive event. In the next chapter, we introduce the Partial Order avoiding Elusive Interleavings (POE) algorithm, which uses a centralized version of the matches-before order. This centralized version allows for an easy implementation but does not scale well. We later introduce the lazy update algorithms, which use the matches-before order introduced in this chapter as the basis for scalable causality tracking for MPI.
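Before moving on, note that once the direct →mb edges (conditions C1 through C5) are recorded for a finite execution, the transitive condition C6 and the concurrency test of Definition 3.2 are mechanical. A small C sketch of that check (our illustration over a hand-encoded event graph, not the protocol code):

```c
#include <assert.h>
#include <string.h>

#define NEV 4

/* Hypothetical encoding: order[i][j] = 1 when event i ->mb event j is known
 * directly. The closure below adds the edges implied by the transitivity
 * condition C6, in Floyd-Warshall style. */
static void mb_closure(int order[NEV][NEV])
{
    int k, i, j;
    for (k = 0; k < NEV; k++)
        for (i = 0; i < NEV; i++)
            for (j = 0; j < NEV; j++)
                if (order[i][k] && order[k][j])
                    order[i][j] = 1;
}

/* Definition 3.2: two events are concurrent when they are ordered in
 * neither direction. */
static int mb_concurrent(int order[NEV][NEV], int a, int b)
{
    return !order[a][b] && !order[b][a];
}
```

For example, recording only e0 →mb e1 and e1 →mb e2 and closing the relation yields e0 →mb e2, while an unrelated event remains concurrent with all three.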
CHAPTER 4

CENTRALIZED DYNAMIC VERIFICATION FOR MPI

Since we have adopted the matches-before relation in place of the traditional Lamport happens-before, we also need new clock updating algorithms that correctly characterize the causality information between events in an MPI execution based on →mb. In this chapter we present the first approach, which uses a centralized scheduler to maintain a global view of all interactions between all MPI calls. This global view enables the scheduler to maintain the matches-before relation, determine whether a nondeterministic event can have multiple different outcomes, and enforce those outcomes through replay. This chapter only summarizes some of the key concepts of ISP. My work on ISP has mainly focused on improving ISP's scalability and usability, for which I provide the details in Section 4.2.

4.1 ISP

ISP, which stands for In-Situ Partial order, is a dynamic verifier for MPI programs driven by the POE algorithm. ISP verifies MPI programs for deadlocks, resource leaks, type mismatches, and assertion violations. ISP works by intercepting the MPI calls made by the target program and deciding when to send these MPI calls to the MPI runtime. This is accomplished by the two main components of ISP: the ISP Profiler and the ISP Scheduler. Figure 4.1 provides an overview of the ISP tool.

    [Diagram: source files are compiled with the ISP Profiler into an
     executable; each intercepted MPI_f call is reported to the Scheduler,
     which signals go-ahead before the corresponding PMPI_f is issued to the
     MPI runtime.]

    Figure 4.1: ISP framework

4.1.1 The ISP Profiler

The interception of MPI calls is accomplished by compiling the ISP Profiler together with the target program's source code. The profiler uses the MPI profiling interface (PMPI): it provides its own version of MPI_f for each corresponding MPI function f. Within each MPI_f, the profiler communicates with the scheduler, using either TCP sockets or Unix sockets, to send information about the MPI call the process wants to make.
The profiler then waits for the scheduler to decide whether to send the MPI call to the MPI library or to postpone it until later. When permission to fire f is given by the scheduler, the corresponding PMPI_f is issued to the MPI runtime. Since all MPI libraries provide a PMPI_f for every MPI function f, this approach is a portable and lightweight instrumentation mechanism for the MPI programs being verified.

4.1.2 The ISP Scheduler

The ISP scheduler carries out the verification algorithms. Since every MPI program starts executing with an MPI_Init, every process invokes the MPI_Init provided by the profiler. This initialization phase involves establishing a TCP connection with the scheduler and communicating the process rank to the scheduler. The TCP connection is used for all further communication between the process and the scheduler, and the scheduler maintains a mapping between each process rank and its corresponding TCP connection. Once the connection with the scheduler is established, the processes execute a PMPI_Init into the MPI library, return from the profiler's MPI_Init, and continue executing the program. Whenever a process wishes to execute an MPI function, it invokes the profiler's MPI_f, which communicates this information to the scheduler over the TCP connection. The profiler does not always execute the PMPI_f call into the MPI library when its MPI_f is called. For nonblocking calls like MPI_Isend and MPI_Irecv, the profiler sends the information to the scheduler, stores it in a structure within the profiler, and returns. When the process executes a fence instruction like MPI_Wait, the scheduler makes its matching decisions and sends a message to the process to execute the PMPI_Isend (or other nonblocking function) corresponding to the wait; the MPI library is not aware of the existence of the MPI_Isend until this time.
Eventually, the scheduler sends a message to the process to execute the PMPI_Wait, at which time the process returns. It must be noted that the scheduler allows a process to execute a fence MPI function only when the wait can complete and hence return; otherwise, the scheduler detects a deadlock.

4.1.3 The POE Algorithm

The ISP scheduler implements the POE (Partial Order avoiding Elusive interleavings) algorithm [65]. We first provide the intuition behind POE by considering the example in Figure 4.2, which is the same crooked barrier example from Chapter 3. The scheduler gives us absolute control over the MPI runtime and the ability to execute MPI calls at our discretion, as long as the MPI semantics are preserved. Thus, instead of immediately executing the match between the isend of P0 and the irecv of P1, we delay the irecv and execute other MPI calls first, until the process invokes an MPI call f that requires the irecv to match before it (e.g., the recv call in the example). Clearly, by delaying the irecv and executing the barrier first, we can now see both isends, from P0 and from P2, as possible matches for the irecv of P1. We now briefly describe the POE algorithm; the formal description and proof of correctness are available in [63]. POE works as follows. There are two classes of statements to be executed: (i) statements of the embedding programming language (C/C++/Fortran) that do not invoke MPI commands, and (ii) MPI function calls. The embedding statements of an MPI process are local in the sense that they have no interactions with those of another process; hence, under POE, they are executed in program order. When an MPI call f is encountered, the scheduler records it in its state; however, it does not (necessarily) issue this call into the MPI runtime.
(Note: when we say that the scheduler issues/executes MPI call f, we mean that the scheduler grants permission to the process to issue the corresponding PMPI_f call to the MPI runtime.)

    P0                 P1                  P2
    Isend(to P1,22)    Irecv(from:*,x)     Barrier
    Barrier            Barrier             Isend(to P1,33)
    Wait()             Recv(from:*,y)      Wait()
                       if(x==33) ERROR
                       Wait()

    Figure 4.2: MPI example to illustrate POE

This process continues until the scheduler arrives at a fence, where a fence is defined as an MPI operation that cannot complete after any other MPI operation following it. The list of such fences includes all MPI blocking calls, such as MPI_Wait and MPI_Barrier. When all processes reach their fences, the POE algorithm forms match-sets as described in Chapter 3. Each match-set is either a single big-step move (as in operational semantics) or a set of big-step moves; a set of big-step moves results from dynamically rewriting a wildcard receive. Each big-step move is a set of actions that are issued collectively into the MPI runtime by the POE scheduler (we enclose them in ⟨⟨..⟩⟩). In the example of Figure 4.2, the possible match-sets are listed below; note that we rewrite the wildcard into each specific source according to the matching send.

The set of big-step moves:
    { ⟨⟨ P0's isend(to P1), P1's irecv(from P0) ⟩⟩,
      ⟨⟨ P2's isend(to P1), P1's irecv(from P2) ⟩⟩ }

The single big-step move:
    ⟨⟨ Barrier, Barrier, Barrier ⟩⟩

The POE algorithm executes all big-step moves (match-sets). The execution of a match-set consists of executing all of its constituent MPI operations (firing the PMPI versions of these operations into the MPI runtime). A set of big-step moves (a set of match-sets) is executed only when no ordinary big-step moves are left; in our example, the big-step move of barriers is executed first. This priority order guarantees that a representative sequence exists for each possible interleaving [65]. Once only a set of big-step moves is left, each member of this set (a big-step move) is fired.
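The payoff of delaying the wildcard receive until the fence can be sketched in plain C. This is an illustration of the idea rather than ISP's scheduler, and all names are hypothetical: once every process is at a fence, the scheduler scans the recorded, not-yet-issued sends to enumerate one match-set per eligible sender.

```c
#include <assert.h>

#define NPROC 3

/* Hypothetical recorded state at the fence for Figure 4.2: sends[i] is 1 if
 * process i has a recorded isend targeting the receiver that has not been
 * issued into the MPI runtime yet. Because the irecv was delayed past the
 * barrier, both P0's and P2's sends are visible when match-sets are formed. */
static int potential_matches(const int sends[NPROC], int receiver,
                             int matches[NPROC])
{
    int i, n = 0;
    for (i = 0; i < NPROC; i++)
        if (i != receiver && sends[i])
            matches[n++] = i;   /* one match-set per eligible sender */
    return n;
}
```

Had the irecv been issued eagerly, only P0's send would have been recorded at that point and a single match-set would result; delaying it exposes both.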
The POE algorithm then resumes from the resulting state. In our example, each big-step move in the set

  { ⟨⟨ P0's isend(to P1), P1's irecv(from P0) ⟩⟩,
    ⟨⟨ P2's isend(to P1), P1's irecv(from P2) ⟩⟩ }

is executed, and the POE algorithm is invoked after each such big-step move. Thus, one can notice that the POE scheduler never actually issues into the MPI runtime any wildcard receive operation it encounters. It always dynamically rewrites these operations into receives with specific sources, and pursues each specific receive paired with the corresponding matching send as a match-set in a depth-first manner.

4.1.4 ISP Evaluation

We present an evaluation of ISP against Marmot [35], a popular MPI correctness checking tool. Marmot detects deadlocks using a timeout mechanism, and it also uses the MPI Profiling Interface to trap MPI calls. The timeout mechanism works by enforcing an interval that represents Marmot's estimate of the computation time between two successive MPI calls. When a process does not execute any MPI call within the timeout interval, Marmot signals a deadlock warning. For the experiment, we apply both ISP and Marmot to the Umpire test suite [68] and report the results on selected benchmarks in Table 4.1. The full set of experiments is also available [5].

Table 4.1: Comparison of POE with Marmot
  Umpire Benchmark           POE                                       Marmot
  any_src-can-deadlock7.c    Deadlock detected in 2 interleavings      Deadlock caught in 5/10 runs
  any_src-can-deadlock10.c   Deadlock detected in 1 interleaving       Deadlock caught in 7/10 runs
  basic-deadlock10.c         Deadlock detected in 1 interleaving       Deadlock caught in 10/10 runs
  basic-deadlock2.c          No deadlock detected in 2 interleavings   No deadlock caught in 20 runs
  collective-misorder.c      Deadlock detected in 1 interleaving       Deadlock caught in 10/10 runs

Table 4.1 has three columns. The first column lists the Umpire benchmark programs. The second column shows the result of running each benchmark on ISP executing the POE algorithm.
We show the number of interleavings generated by POE. The last column shows the result of running the benchmark with Marmot. Each benchmark is run multiple times on Marmot to evaluate the effectiveness of Marmot's deadlock detection scheme. Since Marmot's deadlock detection relies on the deadlock occurring during a particular run, it cannot guarantee the detection of possible deadlocks due to nondeterminism. The basic-deadlock2.c example presents a deadlock scenario in which the deadlock happens only if the verification restricts the MPI_Send calls to zero buffer space. Since POE is set to provide infinite buffering in this experiment, we do not report the deadlock here. Upon setting the available buffer space for MPI_Send to 0, the deadlock is caught.

4.2 ISP Scalability Issues

While the centralized scheduler easily maintains a complete global picture that facilitates the state space discovery process, it limits scalability. When the number of MPI calls becomes sufficiently large, the synchronous communication between the scheduler and the MPI processes becomes an obvious bottleneck. This section details our efforts to improve ISP's scalability and the lessons learned throughout the process.

4.2.1 The Scalability Challenge

We attempted to apply ISP to ParMETIS [12], a hypergraph partitioning library, to verify its routines for freedom from deadlocks as well as resource leaks. Verifying ParMETIS is a challenging task, not only because of its scale (AdaptiveRepart, one repartitioning routine provided by ParMETIS, has more than 12,000 lines of code between itself and its helper functions), but also because of the enormous number of MPI calls involved. In some of our tests, the number of MPI calls recorded by the ISP scheduler exceeds 1.3 million. This class of applications stresses both the memory usage overhead and the processing overhead of the scheduler.
Our attempt to improve ISP while working on the large code base of ParMETIS introduced several challenges at a pragmatic level. Since we did not have another MPI program debugger (and especially not one that understands the semantics of our ISP scheduler, which was itself being tweaked), we had to spend considerable effort employing low-level debugging methods based on printfs and similar techniques.

4.2.2 Memory Consumption

In order to replay the execution of the processes and correctly skip over all previous matchings of sends/receives, ISP has to store all transitions (i.e., the MPI calls) for each process. This consumes a considerable amount of memory. The problem was not very apparent when we tested ISP with the Umpire test suite [68] and the Game of Life program [65], which made fewer than a hundred MPI calls in our testing. In our first several runs, ParMETIS exceeded all available memory allocations. The problem was attributed to the storage taken by ISP's Node structure, which maintains the list of transitions for each process. In addition, each transition maintained a list of ancestors, which grew quadratically. We describe our approach to handling this problem in Section 4.2.3.1.

Forming match-sets is a central aspect of POE. One source of processing overhead in match-set formation was located to be the compiler's inability to inline the .end() call in loops such as this:

  for (iter = list.begin(); iter != list.end(); iter++) {
      ... do something ...
  }

Improvements at this level had marginal effects on ISP's performance.

4.2.3 Improvements to POE

4.2.3.1 Transitivity of Matches-Before

It became obvious that searching through hundreds of thousands of matches-before edges was having a huge effect on ISP's performance. We either needed to store fewer matches-before edges, or search through fewer of them. First, we exploit the fact that ancestor is a transitive binary relation, and store only the immediate ancestor relation.
As the name suggests, immediate ancestor is the transitive reduction of the ancestor relation, i.e., the smallest binary relation whose transitive closure is ancestor. We then realized that the POE algorithm remains correct even if it employs immediate ancestors in match-set formation. The intuitive reason lies in the fact that whenever x is an ancestor of y and y is an ancestor of z, a match-set involving y will be formed (and fired) before one involving z is formed (and fired).

The graph in Figure 4.3 shows the improvement of ISP in handling ParMETIS after switching over to the use of immediate ancestors. The testing setup we employed is similar to the ptest script provided in the ParMETIS 3.1 distribution. To be more specific, our tests involve running rotor.graph, a file that represents a graph with about 100,000 nodes and 600,000 edges, through V3_RefineKWay, followed by a partitioning routine called V3_PartKway; the graph is then repartitioned again with AdaptiveRepart. The test completes by running through the same routines again with a different set of options. All tests were carried out on a dual Opteron 2.8 GHz machine (each processor itself a dual-core), with 4 GB of memory. We also show in Table 4.2 the number of MPI calls this test setup makes (collectively across all processes) as the number of processes increases. The comparison between the original ISP and the modified ISP (dubbed ISP v2 in this study) shows a huge improvement in ISP's processing time. In fact, without the use of immediate ancestors, ISP was not able to complete the test when running with eight processes; even running one test with 4 processes took well over a day. In contrast, ISP v2 finishes the test for 4 processes in 34 minutes. With the changeover from ancestors to immediate ancestors, we also made additional data structure simplifications, whose impact is summarized in the graph of Figure 4.4 (this version of ISP was termed ISP v3).
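The immediate-ancestor optimization, i.e., keeping only the transitive reduction of the ancestor relation, can be sketched as follows. This is an illustrative sketch, not ISP's actual data structure: the function name and the dictionary representation of the ancestor relation are ours. The key property is that an ancestor a of event e is immediate exactly when no other ancestor of e has a among its own ancestors.

```python
def immediate_ancestors(ancestors):
    """Compute the transitive reduction of an ancestor relation.

    `ancestors` maps each event to the set of ALL of its ancestors
    (the full, quadratically growing relation ISP originally stored).
    An ancestor a of e is kept as immediate iff no other ancestor b
    of e has a as one of b's own ancestors.
    """
    reduced = {}
    for e, ancs in ancestors.items():
        reduced[e] = {
            a for a in ancs
            if not any(a in ancestors.get(b, set()) for b in ancs if b != a)
        }
    return reduced

# For a chain a -> b -> c -> d, the full relation stores 6 edges,
# while the reduction stores only the 3 immediate edges.
full = {'a': set(), 'b': {'a'}, 'c': {'a', 'b'}, 'd': {'a', 'b', 'c'}}
imm = immediate_ancestors(full)
```

For a chain of n events the stored edges drop from O(n^2) to O(n), which matches the quadratic blowup observed with ParMETIS.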
Even with these improvements, ISP was still taking considerable time, almost ten hours, to complete model checking ParMETIS with 32 processes. This led us to consider parallelizing the search for ancestors.

[Figure 4.3: Improvements based on transitive reduction (verification time in minutes vs. number of processes, for ISP and ISP v2)]

Table 4.2: Number of MPI calls in ParMETIS
  num. of procs   Total MPI calls
  2               15,789
  4               56,618
  8               171,912
  16              544,114
  32              1,390,260

[Figure 4.4: Improvements made by data structure changes (verification time in minutes vs. number of processes, for ISP v2 and ISP v3)]

4.2.3.2 Parallel-ISP

The discovery of where ISP spends most of its processing time led us to the idea of parallelizing ISP's search for ancestors while building the match-sets. Recall that the MPI calls made by each process of the target program are represented by transition lists. The formation of match-sets requires searching through all transition lists. Fortunately, these searches are independent of each other and can be easily parallelized. There are several ways to do so: (i) make a distributed ISP where each ISP process performs the search for one transition list, (ii) create a multithreaded ISP where each thread performs the search, or (iii) use OpenMP to parallelize the search and let the OpenMP runtime handle the thread creation. We opted for the OpenMP approach because the POE scheduler is implemented with many for loops, which are good candidates for parallelization with OpenMP. We present the performance results of Parallel-ISP vs. ISP v3 in Figure 4.5. Parallelization does not help ISP much when running with a small number of processes. However, when we verify 16 and 32 processes, the benefits of parallelization become more obvious (on average, Parallel-ISP was about 3 times faster than the serial ISP).
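The per-transition-list searches described above are independent of each other, which is what makes the OpenMP parallel-for approach work. ISP itself does this with OpenMP in C++; the following is only a Python analogue of the same pattern, with hypothetical names (`find_matches`, `parallel_match_search`) that do not come from the ISP code base. Each worker scans one transition list, and the results are collected in list order.

```python
from concurrent.futures import ThreadPoolExecutor

def find_matches(transition_list, predicate):
    """Scan one process's transition list for calls satisfying `predicate`.
    Each list is scanned independently of the others."""
    return [t for t in transition_list if predicate(t)]

def parallel_match_search(transition_lists, predicate):
    """Run the independent per-list scans in parallel, one task per list.
    This mirrors an OpenMP parallel for over the transition lists."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda tl: find_matches(tl, predicate),
                           transition_lists)
    return list(results)
```

Because the scans share no mutable state, no locking is needed; the runtime (OpenMP in ISP's case, the thread pool here) only has to distribute the loop iterations.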
4.2.4 Discussion

Although ISP has been greatly improved to handle practical MPI programs, we still notice that as the number of processes increases, the performance of ISP degrades exponentially. We investigated the system load while ISP verified ParMETIS with 32 processes and noticed that the ISP scheduler takes almost all of the CPU time while the MPI processes merely wait for responses from the scheduler. This shows that ISP fails to exploit the distributed processing power of all processes, which means it will become infeasible to verify large MPI applications beyond a few dozen processes. An early experimental version of ISP was developed in which MPI processes would be launched on different hosts and communicate with the scheduler through TCP sockets. The distributed launching mechanism effectively removes the resource constraints faced by launching all MPI processes within a single machine. However, our experiments demonstrate that the main bottleneck lies in the scheduler, and running the MPI processes in a distributed environment does little to speed up the verification. In the next chapter we present our distributed approach to addressing ISP's scalability issues.

[Figure 4.5: Improvements made by parallelization (verification time in minutes vs. number of processes, for ISP v3 and Parallel-ISP)]

CHAPTER 5

DISTRIBUTED DYNAMIC VERIFICATION FOR MPI

We have shown in the previous chapter that the centralized approach does not scale well beyond a few dozen processes. In order to maintain good scalability, the verification needs to exploit the processing power of all processes. In essence, a good algorithm has to run the verification in a distributed fashion and cannot rely on a centralized scheduler. To this end, we propose two algorithms: the Lazy Lamport Clocks Protocol (LLCP) and the Lazy Vector Clocks Protocol (LVCP). The design goals of the protocols are as follows.
• Scalable: Many MPI applications today require at least some scale in order to run certain inputs, due to memory size and other limits. Further, many bugs, including nondeterminism-related bugs, manifest only when a problem is run at large scale. Any protocol aiming to handle large-scale MPI programs must be scalable as well. LLCP is very scalable compared to LVCP, as demonstrated by our experimental results.

• Sound: We define a sound protocol to be one that does not force the match of events that cannot match. Clearly, this is a crucial goal; an unsound protocol can cause a deadlock in an otherwise deadlock-free MPI program! We argue that both LLCP and LVCP are sound.

• Complete: While it is challenging to design a causality tracking protocol that is both complete and scalable, we still want a protocol that is scalable and maintains completeness in the most common usages. In all our testing with real MPI programs, LLCP proved to be complete; that is, we did not discover any extra matches when we ran the same program under LVCP on the same test harness. If completeness in all cases is a requirement, then LVCP should be used.

5.1 Lazy Lamport Clocks

5.1.1 Algorithm Overview

LLCP maintains the matches-before relationship between events by keeping a clock in each process and associating each event with a clock value, in a way that lets us order these events according to when they attain the matched state. Since the matches-before relationship describes the ordering of events both within a process and across processes, the algorithm needs to offer the same coverage. More specifically, given a wildcard receive r from process Pi and a send s targeting Pi that did not match with r, the protocol should allow us to figure out whether r and s have any matches-before relationship between them. If, for example, the successful completion of r triggers the issuance of s, then it is never the case that s could have matched with r.
The intuitive way to do this is to have the protocol maintain the clock such that if r triggers the issuance of some event e, then the clock of r is smaller than the clock of e. Basically, this means all outgoing messages from Pi after r need to carry some clock value (as piggyback data) higher than r's. The challenge of the protocol lies in the handling of nonblocking wildcard receives. As explained earlier in the example in Figure 3.2, a nonblocking wildcard receive from a process Pi could potentially be pending (not yet having reached the matched state) until its corresponding wait is posted. However, we have also shown in Figure 3.3 that such a receive could attain the matched state earlier, due to the nonovertaking semantics (possibly before the posting of the wait). The protocol needs to know precisely the status of the receive to avoid sending the wrong piggyback data, which could lead to incorrect matching decisions.

5.1.2 Clock Update Rules

We now describe the protocol in detail through a set of clock update rules. For simplicity, we assume the programs do not contain synchronous sends; we discuss the handling of synchronous sends in Section 5.2.1.

R1. Each process Pi keeps a clock LCi, initialized to 0.

R2. When a nonblocking wildcard receive event e occurs, assign LCi to e.LC and add e to the set of pending receives: Pending ← Pending ∪ {e}.

R3. When Pi sends a message m to Pj, it attaches LCi (as piggyback data) to m (denoted m.LC).

R4. When Pi completes a receive event r (either forced by a blocking receive or at the wait of a nonblocking receive, as in Figure 3.3), it first constructs the ordered set CompleteNow as follows: CompleteNow = {e | e ∈ Pending ∧ e →mb r}. The set CompleteNow is ordered by the events' clocks, where CompleteNow[i] denotes the ith item of the set and |CompleteNow| denotes the total number of items in the set.
Intuitively, this is the set of pending nonblocking receives that have matched before r due to the MPI nonovertaking rule. Since they have all reached the matched status, we need to update their clocks as well. Note that the ordering of the events in CompleteNow is important, since all receives in CompleteNow are themselves →mb ordered by the nonovertaking semantics. We can update the clocks using the following loop:

  for i = 1 to |CompleteNow| do
      CompleteNow[i].LC ← LCi
      LCi ← LCi + 1
  end for
  Pending ← Pending ∖ CompleteNow

After this, the process associates the current clock with r: r.LC ← LCi, and advances its clock to reflect the completion of a wildcard receive: LCi ← LCi + 1. Note that the clock assignment and advancement do not apply to those nonblocking receives whose clocks were already increased by the loop above; we can check this by detecting whether the current nonblocking receive is still in the Pending set. Finally, the process compares its current clock with the piggybacked data from the received message and updates LCi to m.LC if the current clock is less than m.LC.

R5. At barrier events, all clocks are synchronized to the global maximum of the individual clocks.

5.1.3 Match Detection

Rules R2 and R4 form the lazy basis of the protocol in the sense that a nonblocking wildcard receive r gets a temporary clock when it initially occurs in the process and gets its final clock when it finishes (either by its corresponding wait or by another receive r′ for which r →mb r′).

Lemma 5.1. If e1 →mb e2 then e1.LC ≤ e2.LC.

Proof. We first consider the case when e1 and e2 are from the same process. Based on our definition of matches-before, event e2 will always occur after event e1. Since our algorithm never decreases the clock, it follows that e1.LC ≤ e2.LC. Now assume e1 and e2 are events from two different processes. Based on the definition of matches-before, there exist events e3 and e4 such that e1 →mb e3, e4 →mb
e2, e3 and e4 are in a match-set, and e3 is either an isend, send, or barrier. We recursively apply this process to (e1, e3) and (e4, e2) and construct the set S = s1, s2, ..., sn in which s1 = e1, sn = e2, and the other elements are either events or match-sets satisfying si →mb si+1. In addition, S has to satisfy the following rule: for any pair of adjacent elements (si, si+1), there does not exist any event e such that si →mb e and e →mb si+1. Note that the construction of S is possible based on our definition of →mb. Intuitively, S represents all the hops between e1 and e2 if one follows the →mb chain event by event. Now consider any pair of adjacent elements (si, si+1). Either both are events from the same process, in which case si.LC ≤ si+1.LC, or one or both of them are match-sets, in which case our piggybacking ensures that si.LC ≤ si+1.LC. Hence, the set S has the property that s1.LC ≤ s2.LC ≤ ... ≤ sn.LC, which means e1.LC ≤ e2.LC.

Lemma 5.2. Assuming r is either a blocking receive or a nonblocking receive that is not pending, if r →mb e then r.LC < e.LC.

Proof. If e is an event in the same process as r, then rules R2 and R4 ensure that r.LC < e.LC. If e is not in the same process as r, then based on the definition of →mb, there must be an event f in the same process as r such that r →mb f ∧ f →mb e, which means r.LC < f.LC and, by Lemma 5.1, f.LC ≤ e.LC. Thus, r.LC < e.LC.

We now introduce the concept of late messages, which is essential for the protocol to determine whether an incoming send can match an earlier wildcard receive. One can think of late messages as in-flight messages, in the sense that these messages have already been issued at the point when a receive reaches the matched state. Consider the MPI program shown in Figure 5.1. The first wildcard receive of P1 matched with the send from P0, while the second wildcard receive matches the send from P2.
The clock value associated with each event according to our protocol is displayed in square brackets. The shaded area represents the set of all events that are triggered by r (i.e., for all events e in that shaded area, r →mb e). The message from P2 was not triggered by the matching of the first wildcard receive in P1, despite being received within the shaded area. We call the message from P2 a late message. Upon the reception of late messages, the protocol checks whether they can be potential matches for receives that have matched earlier (in this figure, the late message from P2 is a potential match).

Figure 5.1: Late messages
  P0: send(1) [0];  barrier [1]
  P1: recv(*) [0];  recv(*) [1];  barrier [1]
  P2: send(1) [0];  barrier [1]

Definition 5.1. A message m originating from process m.src with timestamp m.LC is considered late with respect to a wildcard receive event r (which earlier did not match with m.src) iff m.LC ≤ r.LC. If message m is late with respect to r, it is denoted as late(m, r).

We are now ready to state the match detection rule:

Theorem 5.3. An incoming send s carrying message m that is received by event r′ in process Pi is a potential match to a wildcard receive event r (with a matching tag) issued before r′ if m.LC < r′.LC ∧ late(m, r).

Proof. In order to prove that s is a potential match of r, we prove that s and r are concurrent, which means: ¬(r →mb s) ∧ ¬(s →mb r). First we notice that r →mb r′, which means r cannot be a pending event (due to rule R4). In addition, we have r.LC ≥ m.LC, since m is a late message. Using the contrapositive of Lemma 5.2, we infer that ¬(r →mb s). It is also the case that ¬(s →mb r), because if s →mb r, it must be the case that s →mb r′ by the transitive order rule of matches-before. This violates Corollary 3.1, which says that two events in the same match-set are not ordered by →mb.
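The clock update rules and the late-message check above can be condensed into a small executable sketch. This is a simplified model, not DAMPI's implementation: it covers only blocking wildcard receives, so rules R2 and the Pending/CompleteNow machinery of R4 are omitted, and all names (`Process`, `recv_wildcard`, `potential_match`) are ours. Replaying the scenario of Figure 5.1 with it shows P2's message being flagged as a late message and as a potential match.

```python
class Process:
    """Minimal LLCP bookkeeping for one MPI process (blocking receives only)."""

    def __init__(self):
        self.lc = 0  # R1: per-process Lamport clock, initialized to 0

    def send(self):
        # R3: attach the current clock to the outgoing message as piggyback data
        return self.lc

    def recv_wildcard(self, piggyback):
        # Simplified R4: the completing wildcard receive takes the current
        # clock, the clock advances, and a larger piggybacked clock is adopted.
        r_lc = self.lc
        self.lc = max(self.lc + 1, piggyback)
        return r_lc

def barrier(procs):
    # R5: synchronize all clocks to the global maximum
    m = max(p.lc for p in procs)
    for p in procs:
        p.lc = m

def potential_match(m_lc, r_lc, r_prime_lc):
    """Theorem 5.3: a message m received by r' is a potential match for an
    earlier wildcard receive r if m.LC < r'.LC and m is late (m.LC <= r.LC)."""
    return m_lc < r_prime_lc and m_lc <= r_lc

# Replay Figure 5.1: P1's first recv(*) matches P0's send, the second
# matches P2's send; P2's message then tests as late w.r.t. the first.
p0, p1, p2 = Process(), Process(), Process()
m0 = p0.send()            # piggybacks clock 0
m2 = p2.send()            # piggybacks clock 0
r1 = p1.recv_wildcard(m0) # first wildcard receive completes with clock 0
r2 = p1.recv_wildcard(m2) # second wildcard receive completes with clock 1
```

Here `potential_match(m2, r1, r2)` holds: P2's message carries clock 0, which is both below the second receive's clock and not above the first receive's clock, so it is recorded as a potential match for the first wildcard receive.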
Let us now revisit the crooked barrier example introduced earlier in Figure 3.2 and show how the protocol applies (Figure 5.2 shows the same example with the clock values at the end of the execution).

Figure 5.2: An example illustrating LLCP
  P0: isend(1) [0];  barrier [0];  wait [0]
  P1: irecv(*) [0];  barrier [0];  recv(*) [1];  wait [2]
  P2: isend(1) [0];  barrier [0];  wait [0]

Using the LLCP clock update rules, the irecv of P1 has a clock of 0 when it is issued, and P1 adds this receive to Pending. The barrier calls synchronize all clocks to the global maximum, which is 0. At the recv call, P1 applies rule R4 and constructs the CompleteNow set, which consists of the irecv. Upon the completion of this step, the irecv has a clock of 0 and the recv has a clock of 1. Assuming that the isend from P0 matches the irecv, the recv call will match the isend from P2. The message from P2 carries piggyback data of 0; it is flagged as a late message with respect to the irecv and detected as a potential match (per Theorem 5.3).

It is important to note that the theorem applies only one way; that is, there might be potential matches that LLCP misses. Consider the example in Figure 5.3, where it is easy to see by manual inspection that the send from P0 is a potential match of the wildcard receive from P2 (assuming that P2's recv matches P1's send). However, LLCP fails to detect this match: at the time of receiving P0's send, the clock of P0's send is 1, which is the same as the clock of P2's recv(0), so the condition of Theorem 5.3 is not satisfied. This issue again reflects the disadvantage of Lamport clocks when there are multiple concurrent sources of nondeterminism. In general, omissions might happen when there are multiple sources of nondeterminism and processes whose clocks are out of synchronization communicate with each other.
Fortunately, this situation rarely happens in practice, because most MPI programs have well-established communication patterns that do not have cross communication between groups of processes generating relevant events before clock synchronization takes place. We will now present our extension of LLCP to vector clocks, which addresses MPI programs with subtle communication patterns for which LLCP might not recognize all matches.

Figure 5.3: Omission scenario with LLCP
  P0: recv(*) [0];  send(2) [1]
  P1: send(0) [0];  send(2) [0]
  P2: recv(*) [0];  recv(0) [1]

5.2 Lazy Vector Clocks

LLCP can be extended to use vector clocks. In the case of vector clocks, the rules remain similar while taking into account the fact that we are working with a vector of clocks (e.g., instead of incrementing a single clock, Pi now increments VCi[i]). We now prove the updated lemmas and theorems for LVCP.

Lemma 5.4. Assuming r is either a blocking receive or a nonblocking receive that is not pending in Pi: r →mb e ⇔ r.VC[i] < e.VC[i].

Proof. The proof of r →mb e ⇒ r.VC[i] < e.VC[i] is similar to the LLCP case and is omitted. We now prove the converse. Observe that in LVCP, the only process that might increment VC[i] is Pi (this is the fundamental difference between Lamport clocks and vector clocks, and it is also why the converse of this lemma does not hold for Lamport clocks). Thus, e is either an event that occurs in Pi after r completes (which is the point where VCi[i] becomes greater than r.VC[i]) or an event in another process that receives the piggybacked VCi[i] from Pi (either directly or indirectly via another process). If e is an event that occurs in Pi after r completes and r is a blocking receive, we clearly have r →mb e by the definition of →mb.
On the other hand, if r is a nonblocking receive, Pi will increase its clock (i.e., VC[i]) only in one of these scenarios:

• The corresponding wait for r is posted and r is still pending before the wait call.
• A blocking receive r′ satisfying r →mb r′ is posted and r is still pending before r′.
• A wait for a nonblocking receive r′ satisfying r →mb r′ is posted and r is still pending before r′.

Notice that in all of these scenarios, we need a blocking operation b such that r →mb b to increase the clock of Pi. If e.VC[i] > r.VC[i], it must be the case that e occurs after b. Hence, b →mb e and, by the transitive order rule, r →mb e.

Using the updated definition of late messages (Definition 5.1), where m.LC ≤ r.LC is replaced by m.VC[i] ≤ r.VC[i], we now prove the LVCP matching theorem, stated as follows:

Theorem 5.5. An incoming send s carrying message m that is received by event r′ in process Pi is a potential match to a wildcard receive event r (with a matching tag) issued before r′ if and only if m.VC[i] < r′.VC[i] ∧ late(m, r). In other words, all potential matches are recognized and there are no omissions.

Proof. The proof of the if part is similar to the LLCP proof and is omitted. We now prove the converse, which can alternatively be stated as: if ¬(r →mb s) ∧ ¬(s →mb r) then m.VC[i] < r′.VC[i] ∧ late(m, r). First we notice that due to the nonovertaking rule, r →mb r′, which gives us r.VC[i] < r′.VC[i] according to Lemma 5.4. Now applying Lemma 5.4 to ¬(r →mb s), we obtain m.VC[i] ≤ r.VC[i], which means m.VC[i] < r′.VC[i] ∧ late(m, r) (note that m.VC[i] is the same as s.VC[i], since m.VC[i] is the piggybacked value attached to the message).

5.2.1 Handling Synchronous Sends

We briefly describe our approach to handling synchronous sends in MPI. Recall that a synchronous send s returns only when the corresponding receive call r has started to receive the message. Essentially, this means that for any MPI events e, f such that s →mb e and r →mb
f, then s →mb f and r →mb e.

5.2.1.1 Piggyback Requirements

Sending and receiving piggyback data for synchronous calls is challenging and dependent on the MPI implementation. We have so far experimented with MPICH2, in which the sender of the synchronous call sends a Request-To-Send (RTS) packet to the receiver and waits for a Clear-To-Send (CTS) packet from the receiver indicating that the receiver is ready to receive. We add a special field to the CTS packet to store the piggyback data and extract it on the sender side. Other MPI implementations use similar rendezvous protocols, and the same modifications can be applied. While such modifications can potentially limit portability, our experiments show that few MPI applications use synchronous sends, and for those that do, their communication patterns do not require this scheme of piggybacking.

5.2.1.2 Algorithm Extension

We discuss the algorithm extension for the case of LLCP and omit the case of LVCP due to similarity. Consider the case where a synchronous send s matches with a receive r; the following extensions are made to the clock update rules described in Section 5.1.2:

• Before s returns, it extracts the piggybacked clock c, coming from the receiver's side, from the CTS packet. If the receive that matches s is a wildcard receive, it increments c by 1. Finally, it updates its clock to c if c is greater than its current clock. Intuitively, this rule ensures that any event e such that s →mb e will have a higher clock than r.

• When the receiver side starts receiving the message from a synchronous send (by sending a CTS packet), we take no action if the receive is a blocking receive; otherwise, if it is a nonblocking wildcard receive, we add it to the set MatchedWithSsend. If a pending receive is in this set, any incoming message with a higher clock than the receive's clock at the time it was issued will not be counted as a potential match.
This rule allows other eligible sends that are concurrent with the receive to be considered as potential matches. The example in Figure 5.4 illustrates the extensions for handling synchronous sends. The extensions allow LLCP to correctly identify the send from P0 as a potential match and to dismiss the send from P3 as a potential match.

5.3 DAMPI: Distributed Analyzer for MPI

DAMPI (Distributed Analyzer for MPI) is the first dynamic MPI verifier that offers meaningful scalability: users can verify MPI codes within the parallel environment in which they develop and optimize them. In order to provide coverage over the nondeterminism space, DAMPI implements both LLCP and LVCP as its core modules and allows users to choose either protocol, depending on their needs. Many other optional error checking modules, such as deadlock detection and resource leak detection, are also available. In addition, DAMPI offers several search bounding heuristics th
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6m04m59 |



