| Title | Testing and exposing weak graphics processing unit memory models |
| Publication Type | thesis |
| School or College | College of Engineering |
| Department | Computing |
| Author | Sorensen, Tyler Rey |
| Date | 2014-12 |
| Description | Graphics Processing Units (GPUs) are highly parallel shared memory microprocessors, and as such, they are prone to the same concurrency considerations as their traditional multicore CPU counterparts. In this thesis, we consider shared memory consistency, i.e. what values can be read when issued concurrently with writes on current GPU hardware. While memory consistency has been relatively well studied for CPUs, GPUs present substantially different concurrency systems with an explicit thread and memory hierarchy. Because documentation on GPU memory models is limited, it remains unclear what behaviors are allowed by current GPU implementations. To this end, this work focuses on testing shared memory consistency behavior on NVIDIA GPUs. We present a format for describing GPU memory consistency tests (dubbed GPU litmus tests) which includes the placement of testing threads into the GPU thread hierarchy (e.g. cooperative thread arrays, warps) and memory locations into GPU memory regions (e.g. shared, global). We then present a framework for running GPU litmus tests under system stress designed to trigger weak memory model behaviors, that is, executions that do not correspond to an interleaving of the instructions of the concurrent program. We discuss GPU specific incantations (i.e. heuristics) which we found to be crucial for observing weak memory model executions; these include bank conflicts and custom GPU memory stressing functions. We then report the results of running GPU litmus tests in this framework and show that we observe a controversial relaxed coherence behavior on older NVIDIA chips. We present several examples of published GPU applications which may exhibit unintended behavior due to the lack of fence synchronization; one such example is a spin-lock published in the popular CUDA by Example book. We then test several families of tests and compare our results to a proposed operational GPU memory model and show that the model is unsound (i.e. disallows behaviors that we observe on hardware). Our techniques are implemented in a modified version of a memory model testing tool named litmus. |
| Type | Text |
| Subject | GPU; Litmus tests; Memory consistency |
| Dissertation Institution | University of Utah |
| Dissertation Name | Master of Science |
| Language | eng |
| Rights Management | Copyright © Tyler Rey Sorensen 2014 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 745,857 bytes |
| Identifier | etd3/id/3258 |
| ARK | ark:/87278/s6rz2mbp |
| DOI | https://doi.org/10.26053/0H-WESH-SSG0 |
| Setname | ir_etd |
| ID | 196823 |
| OCR Text | TESTING AND EXPOSING WEAK GRAPHICS PROCESSING UNIT MEMORY MODELS

by Tyler Rey Sorensen

A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. School of Computing, The University of Utah, December 2014. Copyright © Tyler Rey Sorensen 2014. All Rights Reserved.

THE UNIVERSITY OF UTAH GRADUATE SCHOOL
STATEMENT OF THESIS APPROVAL

The thesis of Tyler Rey Sorensen has been approved by the following supervisory committee members: Ganesh Gopalakrishnan, Chair (date approved 5-30-2014); Zvonimir Rakamaric, Member (date approved 5-30-2014); and Mary Hall, Member (date approved 5-30-2014); and by Ross Whitaker, Chair/Dean of the School of Computing, and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

Graphics Processing Units (GPUs) are highly parallel shared memory microprocessors, and as such, they are prone to the same concurrency considerations as their traditional multicore CPU counterparts. In this thesis, we consider shared memory consistency, i.e. what values can be read when issued concurrently with writes on current GPU hardware. While memory consistency has been relatively well studied for CPUs, GPUs present substantially different concurrency systems with an explicit thread and memory hierarchy. Because documentation on GPU memory models is limited, it remains unclear what behaviors are allowed by current GPU implementations. To this end, this work focuses on testing shared memory consistency behavior on NVIDIA GPUs. We present a format for describing GPU memory consistency tests (dubbed GPU litmus tests) which includes the placement of testing threads into the GPU thread hierarchy (e.g. cooperative thread arrays, warps) and memory locations into GPU memory regions (e.g. shared, global). We then present a framework for running GPU litmus tests under system stress designed to trigger weak memory model behaviors, that is, executions that do not correspond to an interleaving of the instructions of the concurrent program. We discuss GPU specific incantations (i.e. heuristics) which we found to be crucial for observing weak memory model executions; these include bank conflicts and custom GPU memory stressing functions. We then report the results of running GPU litmus tests in this framework and show that we observe a controversial relaxed coherence behavior on older NVIDIA chips. We present several examples of published GPU applications which may exhibit unintended behavior due to the lack of fence synchronization; one such example is a spin-lock published in the popular CUDA by Example book. We then test several families of tests and compare our results to a proposed operational GPU memory model and show that the model is unsound (i.e. disallows behaviors that we observe on hardware). Our techniques are implemented in a modified version of a memory model testing tool named litmus.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Thesis Statement and Contributions
       1.1.1 Thesis Statement
       1.1.2 Contributions
   1.2 Prior Work
       1.2.1 GPU Memory Models
   1.3 Roadmap
2. BACKGROUND
   2.1 GPU Programming Model
   2.2 GPU Architecture
       2.2.1 Hardware Memory Banks
   2.3 PTX
       2.3.1 CUDA to PTX Mappings
   2.4 Memory Models and Litmus Tests
   2.5 GPU Litmus Tests
       2.5.1 GPU Configurations
3. GPU TESTING FRAMEWORK
   3.1 PTX GPU .litmus Format
   3.2 GPU Program Skeleton
   3.3 Critical Incantations
       3.3.1 General Bank Conflicts
       3.3.2 Memory Stress
   3.4 Extra Incantations
       3.4.1 Random Threads
       3.4.2 Synchronization
   3.5 Effectiveness of Incantations
       3.5.1 Inter-CTA Incantations
       3.5.2 Intra-CTA Incantations
4. NOTABLE RESULTS AND CASE STUDIES
   4.1 Notations and Considerations
   4.2 Coherence of Read-Read (CoRR)
   4.3 Fermi Memory Annotations
       4.3.1 Message Passing Through L1
       4.3.2 Mixing Memory Annotations
       4.3.3 CUDA Programming Consequences
   4.4 Volatile Operators
   4.5 Spin-Locks
       4.5.1 CUDA by Example
       4.5.2 Efficient Synchronization Primitives for GPUs
   4.6 Dynamic Work Balancing
       4.6.1 CTA Level Work Stealing Deques
       4.6.2 Synchronization Between Owner and Thief
       4.6.3 Test Distillation
       4.6.4 Test Results
5. BULK RESULTS
   5.1 Naming and Synchronization
       5.1.1 Different Kinds of Synchronization
   5.2 Test Specifications and Results
       5.2.1 Message Passing (MP)
       5.2.2 Load Delaying (LD)
       5.2.3 Store Buffering (SB)
       5.2.4 IRIW
       5.2.5 Coherence of Independent Writes (2+2W)
       5.2.6 Fences and Coherence Version 1 (R)
       5.2.7 Fences and Coherence Version 2 (S)
   5.3 High-Level Observations
   5.4 Comparison to Operational Model
       5.4.1 Comparison Results
6. CONCLUSION AND FUTURE WORK
   6.1 Additional GPU Configurations
   6.2 Herd Model
   6.3 OpenCL Compilation
   6.4 Summary

APPENDIX: PTX FROM DYNAMIC LOAD BALANCING
REFERENCES

LIST OF FIGURES

2.1 GPU thread and memory hierarchy of the GPU programming model
2.2 Vector addition GPU kernel written in CUDA
2.3 GPU hardware showing CUDA cores, SMs, and the memory hierarchy
2.4 Different types of concurrent memory accesses within a warp: a) parallel access, where threads read from different banks; b) broadcast access, where threads read from the same bank and the same address; and c) bank conflict access, where threads access the same bank but different addresses
2.5 Store buffering (SB) litmus test
2.6 All interleavings of the store buffering (SB) litmus test
2.7 Histogram of results from running the store buffering litmus test on an Intel i7 x86 processor
2.8 Litmus test example written for GPUs in PTX syntax
3.1 High-level flow of the GPU litmus tool
3.2 Example of a GPU .litmus file which specifies the store buffering (SB) test
3.3 Additional examples of scope tree declarations
3.4 Testing loop of the CPU portion of the generated program
3.5 The kernel code where GPU threads execute the tests specified in the GPU .litmus file
3.6 Code snippet of the general bank conflict incantation implementation
3.7 High-level structure of the memory stress incantation implementation
4.1 Test specification for CoRR
4.2 Test specification for MP-L1
4.3 Test specification for CoRR-L2-L1
4.4 Test specification for MP-volatile
4.5 Implementation of lock and unlock given in CUDA by Example
4.6 Code snippet from the mutex example given in CUDA by Example
4.7 Test specification for CAS-SL
4.8 Test specification for EXCH-SL
4.9 Example configuration of the concurrent deque
4.10 Implementation of push and steal for the concurrent deque
4.11 Initial state of the concurrent deque
4.12 Concurrent deque after a single task has been pushed
4.13 Test specification for DLB-MP
5.1 Test specification for MP+membar.cta+membar.gl
5.2 Test specification for MP+membar.ctas
5.3 Test specification for LD+datas
5.4 Test specification for MP
5.5 Test specification for LD
5.6 Test specification for SB
5.7 Test specification for IRIW; memory annotations (.cg) and types (.s32) are omitted in this example for readability
5.8 Test specification for 2+2W
5.9 Test specification for R
5.10 Test specification for S
5.11 High-level view of the data structures and communication in the operational GPU weak memory model
6.1 Simple scoped RMO Herd axiomatic memory model with a fence-parameterized global happens-before and PTX fences
A.1 Annotated PTX code for the steal and push methods produced from compiling the dynamic load balancing CUDA code

LIST OF TABLES

2.1 GPU terminology mappings between different vendors and frameworks
2.2 Relevant PTX data types, memory annotations, and instructions
2.3 CUDA compilation mappings to PTX
3.1 Effectiveness of incantations for inter-CTA GPU configurations
3.2 Effectiveness of incantations for intra-CTA GPU configurations
4.1 Results for CoRR tests
4.2 Results for MP-L1 tests
4.3 Results for CoRR-L2-L1 tests
4.4 Results for MP-volatile tests
4.5 Results for CAS-SL tests
4.6 Results for EXCH-SL tests
4.7 Results for DLB-MP tests
5.1 Test attributes
5.2 Results for MP tests
5.3 Results for LD tests
5.4 Results for SB tests
5.5 Results for IRIW tests
5.6 Results for 2+2W tests
5.7 Results for R tests
5.8 Results for S tests
5.9 Observed executions and allowed behaviors for operational model
6.1 Results for intra-CTA SB tests with different memory regions
6.2 Observed executions and allowed behaviors for axiomatic model

ACKNOWLEDGMENTS

This work would not have been possible without the following people, to whom I extend my sincerest gratitude. First and foremost, thanks to my professor, mentor, and role model, Professor Ganesh Gopalakrishnan, for giving me the amazing opportunity to get involved in research. The experiences I've had over the last few years working with Professor Gopalakrishnan have given me a love for learning I never thought I could have. His tireless devotion to his students will not be forgotten.

Thanks to my committee members, Professor Zvonimir Rakamaric and Professor Mary Hall. The mentoring, support, and opportunities they both have provided me have been essential in shaping my current interests and future goals. Thanks to Dr. Jade Alglave at University College London for supervising much of this work and for her detailed feedback during the writing process. In addition to being a knowledgeable and motivating mentor, she facilitated collaborations which gave this work breadth and momentum. Thanks to my UK GPU memory model collaborators, namely Daniel Poetzl (University of Oxford), Dr. Alastair Donaldson and Dr. John Wickerson (Imperial College London), Mark Batty (University of Cambridge), and Dr. Luc Maranget (Inria), for their insights and discussions that contributed to this work.

Thanks to Vinod Grover at NVIDIA; his feedback and encouragement from an industry perspective helped steer us in new and interesting directions. Thanks to Professor Suresh Venkatasubramanian, Professor Stephen Siegel, and Professor Matt Might for their encouragement and invaluable contributions to my education over the last few years. Thanks to my fellow aspiring researchers, Kathryn Rodgers, Mohammed Al-Mahfoudh, Bruce Bolick, and Leif Andersen, for sharing my struggles and for all their help, both academically and emotionally. Thanks to all the Gauss Group members for providing a stimulating environment and for pushing me to work harder than I ever have in my life to reach the standard they have set. Thanks to my parents and other family members, whose unwavering support and patience throughout my life have led me to where I am today. Lastly, and not exclusive from the aforementioned, thanks to my friends for their generous support throughout my education and their consistent reminders that life is meant to be enjoyed.

I am grateful for the funding for this work, which was provided by the following NSF awards: CCF-1346756, ACI-1148127, CCF-1241849, CCF-1255776, and CCF-7298529.

CHAPTER 1
INTRODUCTION

Much of the implementation work for this project was conducted during a three-month visit to University College London under the supervision of Dr. Jade Alglave. During that time, we met several other researchers interested in GPU memory models and began collaborating on a thorough study of the subject. This work presents one aspect of the larger study, namely running GPU litmus tests on hardware. However, it was conducted in close collaboration with the larger project and draws heavy inspiration from discussions and work with the larger group, namely Daniel Poetzl (University of Oxford), Dr. Alastair Donaldson and Dr. John Wickerson (Imperial College London), and Mark Batty (University of Cambridge).
A Graphics Processing Unit (GPU) is an accelerated co-processor (a processor used to supplement the primary processor, often for domain-specific tasks) designed with many cores and high data bandwidth [1, pp. 3-5]. These devices were originally developed for graphics acceleration, particularly in 3D games; however, the high arithmetic throughput and energy efficiency of these microprocessors had the potential to be used in other applications. In late 2006, NVIDIA released the first GPU that supported the CUDA framework [2, p. 6]. CUDA allowed programmers to develop general purpose code to execute on a GPU. Since then, the use of GPUs has grown in many aspects of modern computing. For example, these devices have now been used in a wide range of applications, including medical imaging [3], radiation modeling [4], and molecular simulations [5]. Current research is developing innovative new GPU algorithms for efficiently solving fundamental problems in computer science; e.g. Merrill et al. [6] recently published an optimized graph traversal algorithm specifically designed to run on GPUs.

The most recent results (November 2013) of the TOP500 project, which ranks and documents the 500 currently most powerful computers in terms of performance (see http://www.top500.org), state that a total of 53 of the computers on the list use accelerator or co-processor technology, including the top two. A similar list known as the Green500 (see http://www.green500.org) ranks supercomputers in terms of energy efficiency; GPU-accelerated systems dominate this list and occupy all top ten spots. Statistics from a popular online GPU research hub (www.hgpu.org) show how GPU research has increased over the years. For example, fewer than 600 papers describing applications developed for GPUs were published in 2009; in 2010 this rose to 1,000 papers, and the years 2011 through 2013 each saw over 1,200 papers. GPUs are also becoming common in the mobile market; popular tablets and smartphones, such as the iPad Air [7] and the Samsung Galaxy S series [8], now contain GPU accelerators.

GPUs are concurrent shared memory devices, and as such, they share many of the same concurrency considerations as their traditional multicore CPU counterparts, including notorious concurrency bugs. One example of a concurrency bug is a data race, in which shared memory is accessed concurrently without sufficient synchronization; data races cause undefined behavior in many instances (e.g. C++11 [9]); a minimal code sketch of this pattern follows the examples below. Another example of a concurrency bug is a deadlock, in which two processes wait on each other, causing the system to hang indefinitely. Concurrency bugs can be difficult to detect and reproduce due to the nondeterministic execution of threads. That is, a bug may appear in one run and not in another, even with the exact same input data [10]. In some cases, concurrency bugs have gone completely undetected until deployment and have caused substantial damage. Notable examples include:

• The Therac-25 radiation machine, in which a data race caused at least six patients to be given massive overdoses of radiation [11].

• The Northeast blackout of 2003, which left an estimated ten million people powerless for up to two days, and which was primarily due to a race condition in the alarm system [12].

• The 1997 Mars Pathfinder, in which a deadlock caused a total system reset during the first few days after its landing on Mars. Fortunately, the spacecraft could be patched from Earth once the problem was debugged [13].
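To make the data race pattern described above concrete, the following is a minimal CUDA sketch of our own (not an example from the thesis); the kernel name is hypothetical.

```cuda
// Minimal sketch (ours) of a data race: every thread performs an
// unsynchronized read-modify-write on the same global location, so two
// threads may read the same old value and one increment is lost. The
// final value of *counter is therefore nondeterministic.
__global__ void racy_increment(int *counter) {
    *counter = *counter + 1;
    // A race-free version would use atomicAdd(counter, 1) instead.
}
```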
A related source of nondeterminism, which can cause subtle and unintended (i.e. buggy) behaviors in concurrent programs, is the shared memory consistency model: it defines what values can be read from shared memory when reads are issued concurrently with other reads and writes [14, p. 1]. A developer may expect every concurrent execution to be equivalent to a sequential interleaving of the instructions, a property known as sequential consistency [15]. This, however, is not always the case, as many modern architectures (e.g. x86, PowerPC, and ARM [16]) weaken sequential consistency for substantial performance and efficiency gains [17]. These architectures are said to have weak memory models, and the underlying architecture is allowed to execute certain memory instructions out of the order in which they appear in the syntax of the program. We refer to executions that do not correspond to an interleaving of the instructions as weak or relaxed behaviors. To enable developers to enforce orderings not provided by the architecture, special instructions known as memory fences can be used to guarantee certain orderings and properties. If a programmer is to avoid costly and elusive concurrency bugs, he or she must understand the architecture's shared memory consistency model and the guarantees (or lack thereof) it provides.

Shared memory consistency models for traditional CPUs have been relatively well studied over the years [14, 16, 18] and continue to be a rich area of research. However, GPUs have a hierarchical concurrency model that is substantially different from that of a traditional CPU. GPU developers have explicit access to the location of threads in the GPU thread hierarchy and can design programs using this information; threads that share finer grained levels of the hierarchy enjoy accelerated interactions and additional functionality. For example, one level of the hierarchy is called a CTA (Cooperative Thread Array). A GPU program often has many CTAs, and threads residing in the same CTA have access to a fast region of memory called shared memory (we use the term shared memory in this document to refer to this specialized GPU memory region, as opposed to any region of memory that is accessible to multiple threads). Threads in different CTAs cannot access the same shared memory region and must use the slower global memory region to communicate data. Additionally, there are built-in synchronization barrier primitives and a memory fence that apply only to threads residing in the same CTA [19, p. 95]. These features are a noticeable departure from traditional CPU models, where generally only one memory space is considered and memory fences apply to all threads.

Unfortunately, GPU vendor documentation on shared memory consistency remains limited and incomplete. The CUDA 6 manual provides only three pages of documentation on the subject, which largely cover memory model basics and show one complicated example [19, pp. 92-95]. While NVIDIA does not release machine code documentation or tools, they provide a low-level intermediate language called PTX (Parallel Thread eXecution). The PTX 4.0 ISA gives only one page of shared memory consistency documentation, with no examples [20, p. 169]. Both the CUDA and PTX documentation are written in prose and lack the mathematical rigor required to reason about complicated interactions. It remains unclear to us what behavior GPU developers can safely rely on when using current NVIDIA hardware.
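To ground the scoped primitives discussed above, here is a minimal CUDA sketch of our own (not from the thesis), assuming only the standard CUDA intrinsics __syncthreads, __threadfence_block, and __threadfence; the kernel and parameter names are hypothetical.

```cuda
// Sketch (ours) of CTA-scoped vs. device-scoped synchronization.
__global__ void scoped_sync_example(volatile int *g_flag, volatile int *g_data) {
    __shared__ int s_data;  // visible only to threads in this CTA

    if (threadIdx.x == 0) {
        s_data = 42;
    }
    // Barrier: synchronizes only the threads within the same CTA.
    __syncthreads();

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *g_data = s_data;
        // Device-wide fence: orders the data write before the flag write
        // for observers in other CTAs; the CTA-scoped __threadfence_block()
        // would provide no such guarantee across CTAs.
        __threadfence();
        *g_flag = 1;
    }
}
```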
1.1 Thesis Statement and Contributions

Due to the lack of a rigorous specification of the weak memory behaviors allowed by GPUs, it remains unclear what memory relaxations current GPUs allow. This issue can be systematically approached by developing formally-based testing methods that explore the behaviors observable on GPUs. These testing methods can experimentally investigate corner cases left underspecified by the documentation, as well as rigorously test classic memory consistency properties (e.g. coherence); additionally, this approach promotes the development of abstract formal models of the architecture, thus helping designers and developers agree on what user programs may rely upon. Without this understanding between designers and developers, GPU applications may be prone to elusive bugs due to weak memory orderings. While these testing approaches have been employed successfully for CPU architectures, GPUs contain an explicit hierarchical concurrency model with subtle scoped properties unseen on CPU architectures; additionally, the throughput-oriented hardware of GPUs requires innovative new testing heuristics in order to effectively reveal weak behaviors.

1.1.1 Thesis Statement

Systematic memory model explorations are greatly aided by developing formally-based testing methods that reveal experimentally the extent to which memory orderings are relaxed. In addition to helping corroborate intentionally designed relaxations, these approaches also help expose unintended weak behaviors (bugs) and help establish the allowed weakenings for the architectural family.

1.1.2 Contributions

To better understand and test GPU memory models, this work presents a GPU hardware memory model testing framework which runs simple concurrent tests (known as litmus tests) thousands of times under complex system stress designed to trigger weak memory model behavior. The results are recorded and checked for weak memory model behaviors and for how often they occurred. We present a format for describing GPU litmus tests which accounts for the explicit placement of threads into the GPU thread hierarchy and of memory locations into GPU memory regions. The framework reads a GPU litmus test and creates executable CUDA or OpenCL code with inline PTX which will run the test and display the results.

We develop GPU-specific heuristics without which we are unable to observe many weak memory model behaviors. These heuristics include purposely placing poor memory access patterns (known as bank conflicts) on certain memory accesses in the tests and randomly placing the testing threads throughout the GPU. For example, if the GPU litmus test specifies that two testing threads are in different CTAs, the framework will randomly assign a distinct CTA ID to each testing thread for each run of the test. Our testing framework also uses the nontesting threads on the GPU to create memory stress by constantly reading from and writing to nontesting memory locations. These heuristics have a substantial impact on whether, and how many times, weak behaviors are observed.

We then report the results of running GPU litmus tests in this framework. We observe a controversial and unexpected relaxed coherence behavior, in which a read instruction is allowed to observe stale data w.r.t. an earlier read from the same address. We observe this behavior on older NVIDIA chips, but not on the newest architecture (named Maxwell). We present several examples of published GPU applications which may exhibit unintended behavior due to the lack of fence synchronization.
These examples include a spin-lock published in the popular CUDA by Example book and a dynamic GPU load balancing scheme published as a chapter in GPU Computing Gems, Jade Edition. We test many classical CPU litmus tests under different GPU configurations and show that GPUs implement weak memory models with subtle scoped properties unseen in CPU models. Finally, we compare our testing results to a proposed operational GPU memory model and show that it is unsound, i.e. it disallows behaviors that we observe on hardware. Our techniques are implemented in a modified version of the litmus tool of the DIY memory model testing tool suite (see http://diy.inria.fr/).

1.2 Prior Work

The work presented in this thesis draws heavy inspiration from the original litmus tool [21] of the DIY memory model testing tool suite (see http://diy.inria.fr/), which runs litmus tests on several different CPU architectures, including x86, PowerPC, and ARM. It takes a litmus test written in pseudo assembly code as input and creates executable C code which will execute and record the outcomes of the input litmus test. The litmus tool uses heuristics to make weak behaviors show up more frequently, which include affinity assignments and custom synchronization barriers. The work presented in this thesis modifies the litmus tool to take GPU litmus tests as input and create executable CUDA or OpenCL code with GPU-specific heuristics. TSOTool [22] is another memory model testing tool, one which exclusively targets architectures that implement the Total Store Order (TSO) memory model. The ARCHTEST tool [23] is an earlier memory model testing tool which tests only for certain behaviors and cannot easily run new tests, as the tests are hard coded in the tool.

Litmus tests are an intuitive way to understand memory consistency models and are used in official industry documentation [24]. Litmus tests have been studied formally and have been shown to describe important properties of memory systems, such as model equivalence [25]. Alglave et al. have developed a method for generating litmus tests [26] based on cycles and present large families of litmus tests in [16]. This thesis expands the traditional CPU litmus test with additional GPU-unique specifications (described in Section 3.1).

1.2.1 GPU Memory Models

The past two years have seen a noticeable push in both academia and industry to understand and document GPU memory models. We consider this work part of that effort and hope to see the same level of rigorous testing and modeling applied to GPU memory models as CPU memory models have enjoyed (for example, in [16, 18, 14]). We present here the history, as we know it, of GPU memory models in prior literature and how this work on testing contributes to them:

• In June 2010, Feng and Xiao revisited their GPU device-wide synchronization method [27] to repair it with fences [28]. They report on the high overhead cost of GPU fences, which in some cases removes the performance gain of their original barrier. They appear skeptical that GPUs exhibit weak memory behaviors, as illustrated by the following quote [28, p. 1]:

> In practice, it is infinitesimally unlikely that this will ever happen given the amount of time that is spent spinning at the barrier, e.g., none of our thousands of experimental runs ever resulted in an incorrect answer. Furthermore, no existing literature has been able to show how to trigger this type of error.
We consider our work to be a response to that quote, in that we show heuristics which trigger weak memory effects on GPUs (see Chapter 3).

• In June 2013, Hower et al. proposed an SC (Sequential Consistency) for race-free memory model for GPUs [29]. This model uses acquire/release synchronization [14, pp. 68-69]; however, to allow efficient use of the explicit GPU thread hierarchy, the acquire and release atomic operations may be annotated with a scope (i.e. level) in the GPU hierarchy, which restricts the ordering constraints to that scope. Using these atomics and program order, they construct a happens-before relation which they use to define a particular type of data race they dub a heterogeneous data race. They state that hardware satisfying this memory model must give sequentially consistent behavior for programs free of heterogeneous data races. While this model is intuitive, it is unclear if or how it can be implemented on current hardware.

• Also in June 2013, work by Hechtman and Sorin [30] showed that, in a particular model of GPU and for programs run on GPUs, weak memory consistency has a negligible impact on performance and efficiency. Because of this, the authors suggest that sequential consistency is an attractive choice for GPUs. In our work, we show that regardless of the benefits (or lack thereof) of weak memory consistency on GPUs, current GPUs do in fact implement weak memory models.

• Continuing in June 2013, Sorensen et al. [31] proposed an operational weak GPU memory model based on the limited available documentation and on communication with industry representatives. This model was implemented in a model checker and gave semantics to simple scoped GPU fences over the shared and global memory regions. More complex interactions were left unspecified. In our work (Section 5.4), we compare the behaviors allowed under this model against behaviors observed on hardware and show that this model is unsound (i.e. the model disallows behaviors that we observe on hardware).

• In January 2014, Hower et al. [32] continued their work and presented two SC-for-data-race-free GPU memory models, again using scoped acquire/release semantics. The first model, dubbed HRF-direct, is suited for traditional GPU programs and current language standards. The second model, dubbed HRF-indirect, is forward-looking to irregular GPU programs and new standards. Much like their previous work in [29], this work describes intuitive models, but it still remains unclear if or how they relate to the memory models of current GPUs.

Up to this point, we have only discussed NVIDIA-specific industry documentation. However, non-NVIDIA proprietary GPU languages and frameworks have begun to explore GPU memory models. The OpenCL 2.0 [33] GPU programming language specification, released in November 2013, has adopted a memory model similar to that of C++11 [9]. However, to enable developers to take advantage of the explicit GPU thread hierarchy, the OpenCL 2.0 specification introduces new memory scope annotations on atomic operations, which restrict ordering constraints to certain levels of the GPU thread hierarchy. Similarly, the HSA low-level intermediate language [34] provides scoped acquire/release memory operations and fences similar to the previously mentioned work by Hower et al. [32]. Our work empirically investigates current GPU hardware memory models, which must be well understood if these new specifications are to be implemented efficiently.
1.3 Roadmap

Chapter 2 presents the background required for a proper understanding of the rest of this document. This includes a primer on GPU architectures and programming models, including the relevant low-level PTX instructions. Furthermore, we discuss some prerequisites on shared memory consistency and litmus tests. We conclude the chapter by more formally discussing our notation for GPU litmus tests.

In Chapter 3, we discuss our testing framework, starting with the format of a GPU .litmus test for the PTX architecture. We then discuss critical incantations, without which we are unable to observe any weak memory model behaviors. We continue by presenting additional heuristics and reporting on their effectiveness.

Chapter 4 presents several notable results that we have gathered from running tests with the framework. We show a controversial relaxed coherence behavior observable on older NVIDIA GPUs, but not on the most recent architecture. We discuss interesting behaviors of PTX memory instruction annotations and show examples where we observed behaviors that we did not expect from reading the documentation. We then shift our focus to CUDA applications (two of them published in CUDA books) which contain interesting concurrent idioms, namely two mutex implementations and a concurrent data structure. We show that these implementations may allow unintended (i.e. buggy) executions but may be experimentally repaired with memory fences.

In Chapter 5, we present the results of running families of different tests under several GPU configurations. We show that GPUs implement weak memory models with subtle scoped properties unseen in CPU models. These families of tests provide intuition about what types of reorderings are allowed on GPUs and about which memory fences experimentally restore orderings. We compare our observations to the operational GPU model presented in [31] and show that the model is unsound (i.e. disallows behaviors that we observe on hardware).

We end with a conclusion in Chapter 6, which discusses ongoing and future work. Specifically, we discuss GPU configurations that we were unable to test in this document and the interesting results they could yield. Additionally, we show new features being added to the Herd [16] axiomatic memory model framework to reason about GPU memory models. We finish with a summary of the document.

CHAPTER 2
BACKGROUND

In this chapter, we discuss the necessary background required for this work, including an overview of the GPU programming and hardware models (in Sections 2.1 and 2.2, respectively). Section 2.3 discusses the NVIDIA low-level intermediate PTX language and the instructions we consider in this document. We provide a table of CUDA-to-PTX compilation mappings in Section 2.3.1, which enables us to reason about CUDA code using PTX test cases. Section 2.4 then contains a primer on memory consistency models and litmus tests. We more formally define the litmus test format, naming conventions, and GPU configurations we consider in Section 2.5.

Different GPU frameworks and vendors use different terminology and often overload terms that have established meanings in traditional concurrent programming (e.g. shared memory). Because this work deals largely with NVIDIA GPU hardware, we use terminology similar to that of the PTX ISA [20]. Table 2.1 shows a mapping from other GPU terminologies to the ones we use; recall that HSA is a new standard for heterogeneous computing, including GPUs [34].
2.1 GPU Programming Model

Programs that execute on GPUs are called GPU kernels and consist of many threads which are partitioned in the GPU thread hierarchy. Threads that share finer grained levels of the hierarchy have additional functionality which developers can exploit when designing their GPU kernels.

Table 2.1. GPU terminology mappings between different vendors and frameworks

| PTX | CUDA | OpenCL | HSA |
| --- | --- | --- | --- |
| thread | thread | work-item | work-item |
| warp | warp | subgroup | wavefront |
| CTA | thread-block | work-group | work-group |
| shared memory | shared memory | local memory | group memory |
| global memory | global memory | global memory | global memory |

There are four levels of the GPU thread hierarchy that are considered in this work:

• Thread: Much like a CPU thread, a GPU thread executes a sequence of instructions specified in the GPU kernel.

• Warp: For all available NVIDIA architectures, a warp consists of 32 threads. Threads in the same warp are able to quickly perform reductions and share variables via the warp vote and warp shuffle CUDA functions [19, pp. 114-118].

• CTA: A Cooperative Thread Array, or CTA, consists of a variable number of warps, which can be set at run-time. Depending on the GPU generation, a CTA can contain up to 16 or 32 warps (512 or 1024 threads). Threads in the same CTA are able to efficiently synchronize via a built-in synchronization barrier, called with the __syncthreads command in CUDA [19, pp. 95-96].

• Kernel: A kernel (or GPU program) consists of a variable number of CTAs, which may number in the millions. Distinct CTAs share the slowest memory region (global memory) and have very limited support for interacting. There is no synchronization barrier across all CTAs; however, there are a memory fence [19, p. 93] and read-modify-write atomics [19, p. 111] which are supported across distinct CTAs. It should be noted that CTAs are not guaranteed to be scheduled concurrently, and deadlocks may occur if a CTA waits for another CTA that is not scheduled [19, p. 12].

In addition to the functionality available at different levels of the GPU hierarchy, GPUs also provide different memory regions that are shared only between threads in common hierarchy levels. These memory regions are:

• Global Memory: This region of memory is shared between all threads in the GPU kernel.

• Shared Memory: This region of memory is shared only between threads in the same CTA; it is considerably faster and smaller than the global memory region.

Many GPUs additionally provide read-only memory regions (e.g. those known as constant and texture memory in CUDA). These memory regions are not considered in this work because they are uninteresting with respect to shared memory consistency, i.e. the set of values a read can return from a read-only memory region is simply the value with which the memory was initialized. The GPU thread and memory hierarchy are shown in Figure 2.1.

[Figure 2.1. GPU thread and memory hierarchy of the GPU programming model]

GPU kernels are written as a single function which all threads in the kernel execute. Threads are able to query special variables (registers in PTX) to determine the ID of the CTA to which they belong, the size of their CTA, and their thread ID within the CTA. Using this information, threads can compute a unique global ID and then access unique data to operate on. For example, a GPU kernel, written in CUDA, that adds two vectors x and y and stores the result in vector z is shown in Figure 2.2. This program assumes that the kernel has exactly as many threads as there are elements in the vectors.

```cuda
// __global__ specifies that this function starts a GPU kernel
__global__ void add_vectors(int *x, int *y, int *z) {

    int cta_id = blockIdx.x;      // special variable for CTA ID
    int cta_size = blockDim.x;    // special variable for CTA size
    int thread_id = threadIdx.x;  // special variable for thread ID

    // A unique global ID can be computed from the above values as:
    int global_id = (cta_id * cta_size) + thread_id;

    // Now each thread adds its own array index
    z[global_id] = x[global_id] + y[global_id];
}
```

Figure 2.2. Vector addition GPU kernel written in CUDA

A GPU kernel is called from a CPU function using triple chevron syntax, where the two arguments inside the chevrons are the number of CTAs and the number of threads per CTA. For example, launching the GPU kernel shown in Figure 2.2 with c CTAs and t threads per CTA would be written as: add_vectors<<<c,t>>>(x,y,z);. Finally, the CPU may not directly access GPU memory; data must be explicitly copied to and from the GPU through a built-in CUDA function named cudaMemcpy.
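For completeness, here is a hypothetical host-side driver of our own for the add_vectors kernel of Figure 2.2 (assumed to be defined in the same file); the array size, CTA count, and omission of error checking are our simplifications, not from the thesis.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20, t = 256, c = N / t;   // assume N is a multiple of t
    size_t bytes = N * sizeof(int);

    int *h_x = (int *)malloc(bytes), *h_y = (int *)malloc(bytes), *h_z = (int *)malloc(bytes);
    for (int i = 0; i < N; i++) { h_x[i] = i; h_y[i] = 2 * i; }

    int *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes); cudaMalloc(&d_z, bytes);

    // The CPU cannot touch GPU memory directly: copy the inputs over explicitly.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Triple-chevron launch: c CTAs, each with t threads.
    add_vectors<<<c, t>>>(d_x, d_y, d_z);

    // Copy the result back; on the default stream this also waits for the kernel.
    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return 0;
}
```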
2.2 GPU Architecture

The GPU hardware architecture consists of physical processing units and a cache hierarchy onto which the programming model maps. The architecture white papers published by NVIDIA provide detailed information about the different features of the hardware. In this document, we focus on the Fermi, Kepler, and Maxwell (GTX 750 Ti) architectures, whose white papers are [35], [36], and [37], respectively.

A GPU consists of several streaming multiprocessors (or SMs). Larger GPUs designed for HPC and heavy gaming may contain up to 15 SMs (e.g. the GTX Titan), while smaller GPUs may have far fewer; for example, the GTX 540m GPU has only 3 SMs. Each SM contains a number of CUDA cores with pipelined arithmetic and logic units. The Fermi architecture contains 32 CUDA cores per SM, while the Kepler architecture features 192 CUDA cores per SM. All threads in the same CTA are mapped to CUDA cores in the same SM and are executed in groups of 32 (i.e. a warp) in a model known as single instruction, multiple threads (or SIMT) [19, pp. 66-67]. In this model, all threads in the warp are given the same instruction to execute, similar to the SIMD model in Flynn's taxonomy [38]. However, the SIMT model differs from the SIMD model in that all threads have unique registers and not all threads must execute the instruction (e.g. if a conditional admits only some threads of a warp into a program region, the other threads in the warp simply do not execute until the conditional block of code ends). The Fermi architecture has a dual warp scheduler that may issue instructions from two independent warps concurrently, while the Kepler architecture features a quad warp scheduler. The maximum number of threads that can be assigned to an SM at any given time is 1536 for Fermi and 2048 for Kepler. GPUs are attached to the main CPU through the PCI bus.

GPUs contain a physical cache hierarchy onto which the memory regions of the programming model map. A GPU contains a large region of DRAM to which all SMs have access; it houses global and constant memory and is usually 1 to 6 GB in size. The entire GPU then shares an L2 cache, typically 1 to 2 MB in size, which accelerates global and constant memory accesses. Each SM contains a region for shared memory and an L1 cache for global and constant memory.
In the Fermi and Kepler architectures, shared memory and the L1 cache occupy the same physical region, and developers are free to configure it to provide more shared memory or more L1 cache. In the Maxwell architecture, the shared memory region and the L1 cache are separate. This region of memory is typically 64 KB in size. It is documented that the L2 cache is coherent (see Section 4.2 for a discussion of coherence), but multiple interacting L1 caches are not coherent; e.g. two SMs accessing global memory via their respective L1 caches are not guaranteed to have coherent interactions. GPU memory instructions can be annotated to specify which cache is targeted; these annotations are documented in Section 2.3. The GPU hardware model is shown in Figure 2.3. Notice the similarities to the programming model shown in Figure 2.1: threads map to CUDA cores, CTAs map to SMs, shared memory maps to the L1/shared memory region, and global memory maps to the L2/DRAM memory.

[Figure 2.3. GPU hardware showing CUDA cores, SMs, and the memory hierarchy]

2.2.1 Hardware Memory Banks

One aspect of the GPU architecture that is used in this work is the different ways in which the hardware handles concurrent memory accesses. The shared memory region on a GPU is organized into 32 4-byte banks on each SM [39, p. 118]. When threads in a warp issue a memory access to shared memory, three things may happen, as shown in Figure 2.4 and described below:

• Parallel Access: In a parallel access, each thread in the warp accesses a unique hardware bank, and the memory requests are serviced in parallel.

• Broadcast: In a broadcast access, only one memory load is issued and the result is broadcast to all threads. This access applies only to load operations and happens when threads load from the same address.

• Bank Conflict: In a bank conflict access, the hardware serializes the accesses, which causes a performance slowdown. This access is similar to a broadcast access except that threads access different addresses in the same bank.

[Figure 2.4. Different types of concurrent memory accesses within a warp: a) parallel access, where threads read from different banks; b) broadcast access, where threads read from the same bank and the same address; and c) bank conflict access, where threads access the same bank but different addresses]

Additionally, GPUs are sensitive to the alignment of global memory accesses. Cache lines are 128 bytes, and warps that access across multiple cache lines cause unnecessary data movement (i.e. of entire cache lines), which results in a loss of performance. Avoiding these types of poorly aligned accesses is known as memory coalescing [39, pp. 125-127].
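The following CUDA sketch of our own illustrates the three shared memory access patterns of Figure 2.4, assuming the 32-bank, 4-byte-word layout described above (bank index = (byte address / 4) mod 32); the kernel and variable names are hypothetical.

```cuda
// Sketch (ours) of the shared memory access patterns in Figure 2.4,
// for a single warp (threadIdx.x in [0, 32)).
__global__ void bank_access_patterns(int *out) {
    __shared__ int buf[32 * 32];
    int tid = threadIdx.x;

    // Fill the array so the loads below read initialized data.
    for (int i = tid; i < 32 * 32; i += 32) buf[i] = i;
    __syncthreads();

    // a) Parallel access: consecutive 4-byte words fall in distinct banks,
    //    so all 32 loads are serviced simultaneously.
    int parallel = buf[tid];

    // b) Broadcast: every thread reads the same address; one load is issued
    //    and its value is broadcast to the whole warp.
    int broadcast = buf[0];

    // c) Bank conflict: a stride of 32 words maps every thread to bank 0 at
    //    a different address, so the 32 accesses are serialized.
    int conflict = buf[tid * 32];

    out[tid] = parallel + broadcast + conflict;
}
```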
To this end, our framework tests the hardware by using inline PTX in CUDA or OpenCL code which is supported [41]. PTX syntax requires each instruction to contain a type annotation specifying the data type the instruction is targeting. For example, an unsigned 32 bit type carries an annotation of .u32. Additionally, memory instructions may be annotated to specify different caching behaviors. For example, a load instruction (ld) may be annotated to read from the L2 cache with annotation .cg. As a complete example, to load an unsigned 32 bit value from the L2 cache, the following instruction would be used: ld.cg.u32. Table 2.2 shows the types, annotations, and instructions that this work targets with a brief description interpreted from the PTX ISA [20] to the best of our understanding. 2.3.1 CUDA to PTX Mappings In Chapter 4, we discuss several case studies where we evaluate published CUDA code in our testing framework. Because our framework evaluates PTX code, CUDA instructions must first be mapped to PTX instructions. Table 2.3 shows the relevant instruction mappings from CUDA to PTX for these case studies which we have discovered by examining CUDA code and generated PTX code1. We have taken precautions to ensure that loads and stores are compiled with the L2 memory annotation. This is done because Section 4.3 shows that is not possible to restore orderings to operations that target the L1 cache (the default for the CUDA compiler) on the Fermi architecture. We are interested in experimentally examining which fences are required to restore orderings to the examples, thus instructions to which we are unable to restore order are not interesting. The L2 annotation can be set to the default with the following compiler flags: -Xptxas -dlcm=cg -Xptxas -dscm=cg. The focus of this document is to show the behaviour of these examples at the hardware level; as such, we ignore the effects of potential compiler optimizations. For the CUDA case studies we examine, we have verified by manually inspecting the PTX output that the CUDA 1We used CUDA release 5.5 V5.5.0 17 Table 2.2. Relevant PTX data types, memory annotations, and instructions PTX Data Types .u32 unsigned 32 bit integer type .s32 signed 32 bit integer type .b32 generic 32 bit type .b64 generic 64 bit type .pred predicate (contains either true or false) PTX Memory Operation Annotation .ca annotates load instructions, loads values from the L1 cache .wb annotates store instructions, stores values to L2 cache, but future architectures may use it to store to L1 cache .cg annotates both load and store instructions, accesses will target L2 cache .volatile annotates both load and store instructions, inhibits optimizations and may be used to enforce sequential consistency PTX Instructions ld{.ann}{.type} r1, [r2] loads value at address in register r2 into register r1 of data type type with annotation ann st{.ann}{.type} [r1], r2 stores value in register r2 of data type type to the address in register r1 with annotation ann membar{.scope} memory fence for scope of .cta or .gl for inter-CTA and interdevice, respectively atom{.op}{.type} r1, [r2], r3 atomically perform operator op with memory at address r2 and value in register r3 and stores the previous memory value in register r1. op may be .add to atomically add or .exch to exchange etc. setp{.comp}{.type} p1, r1, r2 sets the value of the predicate register p1 to the value of comparing registers r1 and r2 with comp where comp might be .gt (greater than), .eq (equal to) etc. 
2.3.1 CUDA to PTX Mappings

In Chapter 4, we discuss several case studies where we evaluate published CUDA code in our testing framework. Because our framework evaluates PTX code, CUDA instructions must first be mapped to PTX instructions. Table 2.3 shows the relevant instruction mappings from CUDA to PTX for these case studies, which we have discovered by examining CUDA code and the generated PTX code (we used CUDA release 5.5, V5.5.0). We have taken precautions to ensure that loads and stores are compiled with the L2 memory annotation. This is done because Section 4.3 shows that it is not possible to restore orderings to operations that target the L1 cache (the default for the CUDA compiler) on the Fermi architecture. We are interested in experimentally examining which fences are required to restore orderings to the examples; thus instructions whose orderings we are unable to restore are not interesting. The L2 annotation can be set as the default with the following compiler flags: -Xptxas -dlcm=cg -Xptxas -dscm=cg. The focus of this document is to show the behavior of these examples at the hardware level; as such, we ignore the effects of potential compiler optimizations. For the CUDA case studies we examine, we have verified by manually inspecting the PTX output that the CUDA compiler does not reorder or otherwise optimize the memory accesses (e.g. hold memory accesses in registers).

Table 2.3. CUDA compilation mappings to PTX

CUDA Instruction               PTX Instruction
atomicCAS                      atom.cas.b32
atomicExch                     atom.exch.b32
__threadfence                  membar.gl
__threadfence_block            membar.cta
store global int               st.cg.u32
load global int                ld.cg.u32
store global volatile int      st.volatile.u32
load global volatile int       ld.volatile.u32
control flow (e.g. while, if)  setp with predicate (e.g. @r1)

For PTX tests, we have again manually inspected the assembly code, using cuobjdump, to ensure that the PTX compiler does not reorder or otherwise optimize the memory accesses; future work will attempt to automate this validation. Because of this manual work, we can ignore compiler optimizations for the examples we present and be sure that we are indeed testing the hardware behavior.

2.4 Memory Models and Litmus Tests

For a given program and architecture, a memory model defines the set of values that load instructions are allowed to return. That is, it specifies all possible behaviors of shared memory interactions. Memory models may be described in an operational style in which the system is described as an abstract machine: given the current state of the system, the operational model provides all possible transitions the system can take and how the system state is updated by each transition; examples of operational models include [42, 18]. Memory models may alternatively be defined in an axiomatic style where constraints are described on sets and relations over memory actions; for examples of this type of model, see [43, 44, 16]. Our work does not propose any memory model; instead, we examine the observable effects of the memory model implemented on current GPUs. In Section 5.4, we compare our results to a proposed operational GPU memory model and show that the model is unsound (i.e. disallows behaviors that we observe on hardware). In Section 6.2, we briefly discuss future work to extend the herd axiomatic memory model tool [16] of the DIY tool suite to GPU memory models.

An intuitive way to understand memory models is through litmus tests, i.e. short concurrent programs with an assertion about the final states of registers and memory. Litmus tests are evaluated under a memory model and can be allowed (the assertion sometimes passes) or disallowed (the assertion never passes). Figure 2.5 shows a litmus test known as store buffering (abbreviated to SB) written in C-like syntax. In this test, x and y are memory locations initialized to 0. Thread 0 first stores the value 1 to location x and then reads from location y, storing the result in local register r1. Thread 1 writes to location y and then reads from x, storing the result in local register r2. The assertion asks if r1 and r2 are allowed to both equal 0 after both threads finish executing.

Many programmers are taught to reason about concurrent programs under the sequentially consistent memory model (or simply SC), first defined by Lamport in 1979 [15]. That is, a concurrent execution must correspond to some interleaving of the instructions. Figure 2.6 shows how one would reason about the SB litmus test (shown in Figure 2.5) under SC; that is, the interleavings are enumerated and executed as a sequential program.
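This style of reasoning is mechanical; as an illustration (ours, not part of any tool used in this work), the following host-side sketch enumerates exactly the interleavings of SB that respect each thread's program order and checks the assertion.

#include <algorithm>
#include <cstdio>

int main(void) {
    // instruction ids: 0 = a: x <- 1, 1 = b: r1 <- y (Thread 0);
    //                  2 = c: y <- 1, 3 = d: r2 <- x (Thread 1)
    int order[4] = {0, 1, 2, 3};
    bool witnessed = false;
    do {
        int pos[4];
        for (int i = 0; i < 4; i++) pos[order[i]] = i;
        if (pos[0] > pos[1] || pos[2] > pos[3]) continue;  // keep program order
        int x = 0, y = 0, r1 = -1, r2 = -1;
        for (int i = 0; i < 4; i++) {
            switch (order[i]) {
            case 0: x = 1;  break;
            case 1: r1 = y; break;
            case 2: y = 1;  break;
            case 3: r2 = x; break;
            }
        }
        if (r1 == 0 && r2 == 0) witnessed = true;
    } while (std::next_permutation(order, order + 4));
    std::printf("r1=0 /\\ r2=0 is %s under SC\n", witnessed ? "allowed" : "disallowed");
    return 0;
}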
There are six possible interleavings, and the assertion (r1 = 0 ∧ r2 = 0) is not satisfied for any of them; thus the SB litmus test is disallowed under the SC memory model.

Modern multiprocessors (e.g. x86, ARM) implement weak memory models, where executions may not correspond to an interleaving. Using the original litmus tool [21] to run the store buffering litmus test on an Intel i7 processor one million times yields the histogram of results shown in Figure 2.7 (the output has been slightly modified from the actual litmus output to correspond to the register syntax used throughout this section). This shows that empirically we can observe that the Intel i7 processor allows weak behaviors (executions that do not correspond to an interleaving of the instructions) in 119 out of a million iterations.

Weak architectures provide fence instructions to restore orderings. For example, considering the store buffering litmus test shown in Figure 2.5, if we place the x86 fence instruction mfence between instructions a and b and between instructions c and d and execute the test again, we do not observe any weak behaviors and the litmus test becomes disallowed.

initial state: x = 0, y = 0

Thread 0       Thread 1
a: x ← 1;      c: y ← 1;
b: r1 ← y;     d: r2 ← x;

assert: r1 = 0 ∧ r2 = 0

Figure 2.5. Store buffering (SB) litmus test

Interleaving 1   Interleaving 2   Interleaving 3
a: x ← 1;        a: x ← 1;        a: x ← 1;
b: r1 ← y;       c: y ← 1;        c: y ← 1;
c: y ← 1;        b: r1 ← y;       d: r2 ← x;
d: r2 ← x;       d: r2 ← x;       b: r1 ← y;
Final: r1 = 0 ∧ r2 = 1   Final: r1 = 1 ∧ r2 = 1   Final: r1 = 1 ∧ r2 = 1

Interleaving 4   Interleaving 5   Interleaving 6
c: y ← 1;        c: y ← 1;        c: y ← 1;
a: x ← 1;        a: x ← 1;        d: r2 ← x;
b: r1 ← y;       d: r2 ← x;       a: x ← 1;
d: r2 ← x;       b: r1 ← y;       b: r1 ← y;
Final: r1 = 1 ∧ r2 = 1   Final: r1 = 1 ∧ r2 = 1   Final: r1 = 1 ∧ r2 = 0

Figure 2.6. All interleavings of the store buffering (SB) litmus test

2.5 GPU Litmus Tests

Here we formally define our notation for the presentation of GPU litmus tests and show a concrete example of a PTX litmus test. Additionally, we present the three different GPU configurations on which we focus throughout this document.

A test specification, such as the one shown in Figure 2.8, consists of several columns, each headed by a global thread ID. Each thread scheduled on the GPU has a unique global thread ID. In practice, a global thread ID can be computed using a combination of the built-in GPU values, i.e. thread ID, CTA ID, and CTA size. However, in our examples, we use symbolic global thread IDs, such as T0 and T1, for ease of presentation. A brief description of each major part of the test specification follows.

Test SB Allowed
Histogram (4 states)
119    *> r1=0; r2=0;
499580 :> r1=1; r2=0;
500248 :> r1=0; r2=1;
53     :> r1=1; r2=1;
Ok
Witnesses
Positive: 119, Negative: 999881
Condition exists (r1=0 /\ r2=0) is validated

Figure 2.7. Histogram of results from running the store buffering litmus test on an Intel i7 x86 processor.

initial state: x = 0, y = 0

T0                    T1
st.cg.s32 [x], 1 ;    ld.cg.s32 r1, [y] ;
st.cg.s32 [y], 1 ;    ld.cg.s32 r2, [x] ;

assert: 1:r1=1 ∧ 1:r2=0

Figure 2.8. Litmus test example written for GPUs in PTX syntax

In each test, the initialization of memory will be explicitly provided at the top of the test. In the example shown in Figure 2.8, the memory locations are initialized to 0. Under each global thread ID is a program, i.e. the sequence of instructions executed by that thread.
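For reference, the global thread ID mentioned above is conventionally computed in CUDA from the built-in values as sketched below (a one-dimensional launch is assumed; the helper is our illustration).

// conventional 1-D global thread ID in CUDA: threadIdx.x is the thread ID
// within the CTA, blockIdx.x is the CTA ID, and blockDim.x is the CTA size
__device__ int global_tid(void) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}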
In GPU programming, every thread executes the same program; however, we can arrange for each thread to execute a different program by having threads branch to different parts of the program based on their global thread IDs (we discuss this more fully in Section 3.2).

Consider the example in Figure 2.8, which implements a message passing idiom and is known as MP (this test is analyzed in Section 5.2.1; it is given here for explanatory reasons only). Here, the global thread IDs are T0 and T1. We assume that each kernel is launched with a sufficient number of CTAs and threads such that each program in the test will eventually be executed on the GPU. In the example, the two store instructions will be executed by thread T0, and the two load instructions will be executed by thread T1. We deviate from concrete PTX syntax in that we allow direct stores of immediate values to memory (e.g. st.cg.s32 [x],1 as seen in T0's program in Figure 2.8), instead of moving the value first to a register (via a mov instruction) and then storing the register contents to memory.

Thread local registers are denoted by rn, where n is a non-negative integer. Locations are given by single lower-case letters, e.g. x, y, z. In Figure 2.8, there are two memory locations x and y, and thread T1 loads memory values to registers r1 and r2.

Questions about the executions of a test are given as an assertion on the final values of registers or memory locations. In Figure 2.8, the constraint is given as: assert: 1:r1=1 ∧ 1:r2=0 to ask if it is possible to observe T1's private registers r1 to be 1 and r2 to be 0 in the final state of the GPU after having executed all testing threads. Here, registers in the final constraint are denoted n:reg, where reg is a register and n is the ID of the thread to which the register belongs.

2.5.1 GPU Configurations

Regarding Figure 2.8, the test may yield different behaviors depending on whether T0 and T1 are in the same CTA or in different CTAs. Similarly with memory regions: the behaviors allowed may be different depending on which GPU memory region x and y are located in (shared or global). We refer to the placement of testing threads into the thread hierarchy and memory locations into memory regions as a GPU configuration. Although in Section 3.1 we show that our testing framework can execute most GPU configurations, in this document, we largely only consider three simple GPU configurations. We refer to them as D-warp:S-cta-Shared, D-warp:S-cta-Global, and D-cta:S-ker-Global; they are defined as follows:

• D-warp:S-cta-Shared: In this GPU configuration, all programs in the test are mapped to threads in different warps (D-warp stands for different warps), but in the same CTA (S-cta stands for same CTA). Additionally, all testing memory locations are located in the shared memory region.

• D-warp:S-cta-Global: Similar to D-warp:S-cta-Shared, in this GPU configuration, all programs in the test are mapped to threads in different warps (D-warp), but in the same CTA (S-cta). However, all testing memory locations are located in the global memory region.

• D-cta:S-ker-Global: In this GPU configuration, all programs in the test are mapped to different CTAs (D-cta) but the same kernel (S-ker). There is no shared memory region variant of this GPU configuration because threads in different CTAs have disjoint shared memory regions.

2.5.1.1 Limitations

We note that these three configurations are not a complete set of GPU configurations.
For example, in 3+ threaded tests, some threads may be in the same CTA and others may be in different CTAs. Similarly, the same test may contain both shared and global memory locations. However, because the three configurations we examine are explicitly discussed in the PTX ISA [20, p. 165], we believe these configurations serve as a good basis for our exploration. In ongoing work discussed in Section 6.1, we consider more complicated GPU configurations, which show interesting initial results.

In this document, we do not test intrawarp interactions. This is largely because of the essential role that warps play in our testing framework's incantations; supporting intrawarp testing would be very difficult to develop and maintain. For example, in the synchronization incantation described in Section 3.4.2, we note that only one thread per warp can execute the synchronization barrier; if multiple threads within a warp are executing tests with the synchronization incantation, then the high-level kernel design (shown in Section 3.2) will deadlock, and a much more complicated high-level design will be needed. We instead chose to spend our energy developing other features, such as read-modify-write atomics and conditionals, which produced interesting results as seen in Chapter 4.

Some applications contain multiple kernels which run on multiple GPUs concurrently. In this document, we do not consider multi-GPU interactions because different GPU chips may implement different memory models, e.g. in Chapter 5, it can be seen that different GPU chips experimentally allow different behaviors. One of the goals of this work is to provide empirical benchmarks to compare putative GPU memory models against. Composing different memory models deserves careful treatment, and given that we do not even have a memory model for single GPU interactions, we believe such a study is outside the scope of this document.

CHAPTER 3

GPU TESTING FRAMEWORK

In this chapter, we discuss in detail the GPU testing framework. The high-level flow of the framework is shown in Figure 3.1. First, the GPU litmus tool is given a GPU litmus test in the GPU .litmus format, which we describe in Section 3.1. This test specification is used to create CUDA or OpenCL code which can be compiled and executed on a GPU. The program will create a histogram of the results of running the test many times and check if any of the outcomes satisfied the assertion given in the GPU .litmus file. Most of this chapter is devoted to discussing the GPU program which is produced by the litmus tool and the heuristics (which we dub incantations) that we use to expose weak behaviors. Section 3.3 presents critical incantations, which are the incantations without which we are unable to observe any weak behaviors. Section 3.4 presents additional heuristics which greatly increase the number of weak behaviors we are able to observe. We end this chapter by showing the effectiveness of these heuristics in Section 3.5.

Figure 3.1. High-level flow of the GPU litmus tool

3.1 PTX GPU .litmus Format

Figure 3.2 shows a complete example of a PTX GPU .litmus test which can be parsed by the litmus tool to produce CUDA or OpenCL code. This test encodes the store buffering (SB) litmus test first discussed in Section 2.4. We proceed to discuss each section of the GPU .litmus test in detail.
We note that the style and syntax of the GPU .litmus format borrows heavily from the .litmus format of the original litmus tool [21], and for each section, we describe the differences between the GPU and original .litmus formats.

Line 1 starts the test with the name of the architecture and a test name; we dub the PTX architecture GPU_PTX, and the test here is SB. Lines 2-11 make up the register declarations and initialization. As noted in Section 2.3, PTX has typed registers, and as such, we require all registers to be declared in this section. The syntax for a declaration is: {tid}:.reg {type} {register} where tid is an integer thread identifier (e.g. 0 for T0), type is a PTX type (listed in Table 2.2), and register is a string of the form r{n} where n is an integer from zero to nine. It is also required that registers requiring a non-zero initialization be initialized here.

1  GPU_PTX SB
2  {
3  0:.reg .s32 r0;
4  0:.reg .s32 r2;
5  0:.reg .b64 r1 = x;
6  0:.reg .b64 r3 = y;
7  1:.reg .s32 r0;
8  1:.reg .s32 r2;
9  1:.reg .b64 r1 = y;
10 1:.reg .b64 r3 = x;
11 }
12
13 T0                | T1                ;
14 mov.s32 r0,1      | mov.s32 r0,1      ;
15 st.cg.s32 [r1],r0 | st.cg.s32 [r1],r0 ;
16 ld.cg.s32 r2,[r3] | ld.cg.s32 r2,[r3] ;
17
18 ScopeTree
19 (device (cta (warp T0) (warp T1)))
20
21 y: global, x: shared
22
23 exists
24 (0:r2=0 /\ 1:r2=0)

Figure 3.2. Example of a GPU .litmus file which specifies the store buffering (SB) test

In lines 5 and 6, we initialize T0's registers r1 and r3 to the memory locations x and y, respectively. We initialize T1's registers r1 and r3 similarly in lines 9 and 10. The original .litmus format also has the initialization section, but does not require register declarations, because it deals exclusively with architectures which do not have typed registers.

Lines 13-16 describe the concurrent program to be run by the test. The concurrent program consists of several sequential programs (to be run concurrently) given in vertical columns and separated with a pipe (|) character. Each sequential program starts with a thread identifier of the form T{n} where n is an integer from zero to nine. Following the thread identifier is a sequence of PTX instructions. We support all instructions listed in Table 2.2 along with several other basic binary operations (e.g. add, xor). The original .litmus format specified programs in the same manner, but did not have a parser for PTX programs.

As discussed in Section 2.5.1, a GPU litmus test must specify the location of testing threads in the GPU thread hierarchy. Recent literature has referred to these hierarchy levels as scopes [29, 32]; we adhere to that terminology here and require each GPU .litmus file to contain a scope tree declaration which specifies the testing threads' locations in the GPU thread hierarchy. Syntactically, this declaration begins with the keyword ScopeTree followed by an S-expression [45] representing a tree of depth four where each level corresponds to a level in the GPU thread hierarchy. Each list begins with an identifier for the level of the hierarchy. From top to bottom, these identifiers are device, cta, and warp. A warp list is simply a list of testing thread IDs (e.g. T0, T1). In the concrete example shown in Figure 3.2, the scope tree is declared on lines 18 and 19; threads T0 and T1 are in the same CTA but different warps. More scope tree examples are shown in Figure 3.3. The original .litmus format did not test GPUs and hence had no need for a scope tree.
//scope tree for a 2 threaded test where T0 and T1 are in different CTAs
(device (cta (warp T0)) (cta (warp T1)))

//scope tree for a 2 threaded test where T0 and T1 are in different
//warps, but the same CTA
(device (cta (warp T0) (warp T1)))

//more involved scope tree for a 3 threaded test where T0 and T1 are
//in different warps, but the same CTA, while T2 is in a different CTA
(device (cta (warp T0) (warp T1)) (cta (warp T2)))

Figure 3.3. Additional examples of scope tree declarations

A GPU .litmus test must also specify in which region of memory testing locations reside: shared or global. To this end, a memory map declaration that appears immediately following the scope tree declaration is required. The syntax is: {loc}: {region} where loc is a memory location and region is either shared or global. This specifies that location loc will be placed in memory region region. All memory locations used in the test must be placed in a memory region in this style. In the concrete example shown in Figure 3.2, the memory map is given on line 21 and specifies that location y is in the global memory region and location x is in the shared memory region. The original .litmus format did not test GPUs and hence had no need for a memory map.

The GPU .litmus test ends with an assertion about the final state of registers or memory locations. Syntactically, this begins with the keyword exists followed by an assertion in parentheses. Registers are referred to with the syntax {n}:{reg} where n is an integer thread identifier (e.g. 0 for thread T0) and reg is the name of a register declared in the initialization portion of the test. Memory locations are simply referred to by their name (e.g. x). We use the characters /\ to refer to the conjunction operator and the equality symbol (=) to refer to equality. In the concrete example shown in Figure 3.2, the assertion on lines 23 and 24 asks if it is possible for register r2 in thread T0 to equal 0 and register r2 in thread T1 to equal 0. The original .litmus format has the same syntax for specifying assertions.

3.2 GPU Program Skeleton

The GPU program produced by litmus can be split into two parts, the CPU code and the kernel (i.e. GPU) code. While the GPU litmus tool can produce either CUDA or OpenCL code, for ease of presentation, we show only CUDA code in this section. We begin our discussion with the CPU code. Figure 3.4 shows the main loop executed by the CPU in the form of high-level functions.

1 ...
2 //main CPU loop
3 for(int i = 0; i < ITERATIONS; i++) {
4   initialize_gpu<<<1,1>>>();
5   test_kernel<<<ctas,threads>>>(device_results);
6   record_global<<<1,1>>>(device_results);
7   cudaMemcpy(cpu_results, device_results, results_size, cudaMemcpyDeviceToHost);
8   record_results(cpu_results);
9 }
10 display_results();
11 ...

Figure 3.4. Testing loop of the CPU portion of the generated program

From top to bottom (and noted with the line number), we step through this loop:

• line 3: This loop runs the GPU litmus test ITERATIONS times. The ITERATIONS value can be controlled with a command line argument to the GPU litmus tool.

• line 4: The initialize_gpu kernel initializes all global memory used in the test. Recall that GPU memory cannot explicitly be accessed on the CPU and must either be initialized with a special CUDA function or in a separate kernel; we chose the latter and launch the GPU with a single thread in a single CTA to initialize global values.

• line 5: The test_kernel kernel runs the concurrent PTX test specified by the GPU .litmus file.
The final contents of registers are placed in the global memory array device_results so that they may be copied back. It is launched with a variable number of CTAs and threads, which we discuss in Section 3.4.1.

• line 6: The record_global kernel records any global memory locations needed for the GPU litmus test assertion by placing them in the global memory array device_results so that they may be copied back to the CPU. Similar to the initialization kernel, this kernel is launched with a single thread in a single CTA, as only a few locations will ever need to be moved.

• line 7: Here the final contents of registers and memory (which were copied to the global memory array device_results) are copied back to the CPU with the built-in CUDA cudaMemcpy function.

• line 8: Next, the results are recorded in a histogram and checked against the assertion in the GPU litmus test.

• line 10: After running the test ITERATIONS times, a histogram of results, with an emphasis on the results that satisfied the assertion, is displayed. An example of the output for a CPU test is seen in Figure 2.7.

Next we discuss the high-level implementation of the GPU kernel which runs the concurrent PTX program specified in the GPU .litmus file. This kernel was referred to as test_kernel in Figure 3.4. The high-level code is shown in Figure 3.5, which we discuss for the rest of this section.

1 //Inside the kernel test_kernel
2 ...
3 if (tid == T0_tid && wid == T0_wid && cid == T0_cid) {
4   //Execute T0's test
5   ...
6   //Record T0's registers
7 }
8 else if (tid == T1_tid && wid == T1_wid && cid == T1_cid) {
9   //Execute T1's test
10  ...
11  //Record T1's registers
12 }
13 ...

Figure 3.5. The kernel code where GPU threads execute the tests specified in the GPU .litmus file.

Recall that GPU threads all execute the same kernel, and in order for certain threads to execute distinct code, they must branch on a conditional related to their thread ID, warp ID, and CTA ID (tid, wid, and cid, respectively), as seen in lines 3 and 8. Once testing threads (e.g. T0, T1) are filtered into their respective conditional code, they execute the program that is specified for them in the GPU .litmus file. After the program is executed, the threads record the values of their registers into a global memory array that the CPU can copy and record. The testing thread IDs, testing warp IDs, and testing CTA IDs (e.g. T0_tid, T0_wid, and T0_cid, respectively) are determined by the GPU litmus tool and set such that the scope tree in the GPU .litmus test is satisfied. For example, if the scope tree specifies that T0 and T1 are in different CTAs, then T0_cid and T1_cid are never equal. Conversely, if T0 and T1 are in the same CTA, then T0_cid and T1_cid must always be equal.

3.3 Critical Incantations

The code presented in Section 3.2 is quite simple and, if executed as is, does not expose any weak behaviors for any GPU litmus tests we ran. Speaking candidly, we had a difficult time observing weak behaviors on GPUs; this project suffered several failed attempts over the course of two years before we found success. We were finally able to observe weak behaviors when we developed two critical incantations, called such because without at least one of these incantations present, we are unable to observe weak behaviors. We dub these two incantations general bank conflicts and memory stress.
3.3.1 General Bank Conflicts

Recall from Section 2.2.1 that concurrent memory accesses on GPU hardware are susceptible to bank conflicts due to poor memory access patterns within a warp. CUDA documentation states that when a bank conflict occurs, memory accesses are serialized. We are not concerned with the performance consequences of these memory access patterns; rather, we use them to cause stress on the memory system, which we (correctly) hypothesized could cause executions revealing weak memory behaviors. Official documentation only refers to bank conflicts as applying to the shared memory region. However, we observe that this incantation works just as well for memory locations in the global memory region; as such, we refer to this incantation as general bank conflicts.

This incantation lets all threads in the testing thread's warp execute the testing thread's program. While the extra threads in the warp execute the same instructions as the testing thread, they are provided with dummy addresses for each memory access instruction. These dummy addresses are computed to be one of the following:

• Parallel: All threads in the warp will access their own memory bank for this memory access instruction.

• Broadcast: All threads in the warp access the same memory location as the testing thread for this memory access instruction. Note that this option is only computed for read memory accesses; nontesting threads writing to testing locations would corrupt the results.

• Bank Conflict: All threads in the warp will cause a bank conflict with the testing thread on this memory access instruction.

In order to test many different GPU access patterns for a given test, the access type (i.e. parallel, broadcast, or bank conflict) for each memory access instruction is randomized for each iteration of the GPU test. The implementation of this incantation is largely in the testing kernel and happens as we filter testing threads into their testing code. Figure 3.6 shows a snippet of code implementing this general bank conflict for a specific testing thread T0 in the GPU testing kernel. We describe this code snippet in detail next.

1 ...
2 //T0's entire warp executes test
3 if (wid == T0_wid && cid == T0_cid) {
4   //Assign unique addresses that potentially cause bank conflicts
5   bc_x = compute_address(access_type_x, T0_tid, x, tid);
6   bc_y = compute_address(access_type_y, T0_tid, y, tid);
7   ...
8   //Execute T0's test with new addresses
9   ...
10  if (tid == T0_tid) {
11    //Record T0's registers
12    ...
13  }
14 }
15 ...

Figure 3.6. Code snippet of the general bank conflict incantation implementation

• Line 3: Here the testing thread is filtered only by warp ID wid and CTA ID cid; thus the entire warp of the testing thread enters this code.

• Lines 5-6: The example shows two memory locations x and y. New addresses bc_x and bc_y are computed (bc stands for bank conflict) via the compute_address function, which will return the original x and y locations for the testing thread, but different addresses for other threads in the warp (a hypothetical sketch of compute_address is given after this list). The access type arguments (i.e. access_type_x and access_type_y) indicate what type of access (i.e. bank conflict, parallel, or broadcast) will happen for each address. They are randomized per iteration.

• Line 8: All threads in the warp execute T0's test using their newly computed addresses.

• Lines 10-13: Only the testing thread (T0) records the results.
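The following sketch indicates how such an address computation might look; it is our illustration of the scheme described above, not the tool's actual code, and it assumes 32 banks of 4-byte words and the padded testing locations discussed below. The constants, types, and names are ours.

// hypothetical sketch of compute_address
#define ACCESS_PARALLEL  0
#define ACCESS_BROADCAST 1   // only used for loads
#define ACCESS_CONFLICT  2

__device__ int *compute_address(int access_type, int testing_tid,
                                int *loc, int tid) {
    if (tid == testing_tid)
        return loc;                       // the testing thread uses the real location
    // offset of this lane from the testing thread's lane, in [1, 31]
    int delta = ((tid % 32) - (testing_tid % 32)) & 31;
    switch (access_type) {
    case ACCESS_PARALLEL:  return loc + delta;       // a distinct bank per lane
    case ACCESS_BROADCAST: return loc;               // every lane reads loc itself
    default:               return loc + delta * 32;  // same bank as loc, 32 words apart
    }
}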
While this incantation is one of our key ingredients for observing weak behaviors on GPUs, it does carry some consequences. Specifically, there must now be enough contiguous memory space starting at the testing locations to allow all 32 threads in the warp to cause bank conflicts with the testing thread. Given that bank conflicting addresses are 32 words apart, this requires 32 × 32 × 4 = 4096 bytes of memory per testing location, as opposed to simply four bytes before. Given the amount of memory on current GPUs (over 1 GB for global memory and 64 KB for shared memory), this is not an issue for tests with a small number of testing locations.

3.3.2 Memory Stress

Memory systems implemented on modern multiprocessors have complicated caching protocols which implement involved eviction and write-back policies [14]. Our hypothesis is that stressing this system with relentless memory accesses will put these protocols in interesting states and, in turn, trigger weak memory model executions. For example, a memory bus may be more likely to transfer data out of order when it is under heavy stress than if it is only servicing several requests. To this end, all nontesting threads are employed to read from and write to nontesting memory locations in an incantation we dub memory stress.

We implement two functions, mem_stress_write() and mem_stress_read(), which repeatedly write to and read from nontesting memory locations, respectively. These functions implement efficient GPU access patterns by ensuring accesses contain no bank conflicts and are largely optimally aligned; additionally, we make sure that warps do not diverge. This allows memory to be written to and read from as rapidly and by as many threads as possible. As discussed in Section 3.3.1, each testing memory location now uses 4096 bytes of memory, most of which is padding to allow for bank conflicts. Here, we take advantage of that padding memory, which is targeted by the memory stressing functions. We emphasize that these stressing functions do not touch the actual testing locations, as that would interfere with the GPU litmus test. As a failsafe, the mem_stress_write() function writes chaotic values which would easily be recognizable as unwanted interference in the histogram of results.

The high-level code of how this incantation is implemented is shown in Figure 3.7. First, testing threads are filtered off to perform the PTX programs specified in the GPU .litmus file (lines 2-10). The remaining threads enter the memory stress region (lines 13-19). In our implementation, half of the warps (i.e. warps with even-numbered warp IDs) write to memory, while the other warps read from memory. We admit that there are many different ways to stress the memory system, and due to the lack of intimate documentation about the caching protocols implemented on these chips, we are unable to rigorously explain why these access patterns work as well as they do. However, we are able to observe that this technique is crucial for exposing weak behaviors, and it provides results which facilitate interesting observations and discussions about GPU memory models (e.g. see Chapter 4).

1 ...
2 //Filter off T0
3 if (wid == T0_wid && cid == T0_cid) {
4   ...
5 }
6 //Filter off T1
7 else if (wid == T1_wid && cid == T1_cid) {
8   ...
9 }
10 ...
11 //All threads not testing, stress the memory system
12 else {
13   //Even-numbered warps do the writes
14   if (wid % 2 == 0)
15     mem_stress_write();
16
17   //Odd-numbered warps do the reads
18   else
19     mem_stress_read();
20 }
21 ...

Figure 3.7. High-level structure of the memory stress incantation implementation
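The stressing functions themselves are not shown in the figure; the following is a rough sketch of their shape under the description above, where scratch stands for the padding memory, gid for a thread's global ID, and STRESS_ITERS is an illustrative bound (none of these are the tool's actual names).

// hypothetical sketch of the memory stressing functions
#define STRESS_ITERS 1024

__device__ void mem_stress_write(volatile int *scratch, int gid) {
    for (int i = 0; i < STRESS_ITERS; i++)
        scratch[gid] = i;                // coalesced, conflict-free stores
}

__device__ void mem_stress_read(volatile int *scratch, int gid) {
    int sum = 0;
    for (int i = 0; i < STRESS_ITERS; i++)
        sum += scratch[gid];             // coalesced, conflict-free loads
    if (sum == -1)
        scratch[gid] = sum;              // keeps the loads from being optimized away
}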
3.4 Extra Incantations

In this section, we present two additional incantations that we call random threads and synchronization. While these incantations are not critical (i.e. weak behaviors are observed without them), their presence dramatically increases the number of weak memory behaviors we observe. We report on the effectiveness of these incantations in Section 3.5.

3.4.1 Random Threads

To test many different physical locations of testing threads on the GPU, the launch parameters (i.e. how many threads and CTAs with which the kernel is launched) and the global IDs of testing threads are randomized for each iteration of the test. We call this incantation random threads. To implement this, global IDs (a combination of thread ID, warp ID, and CTA ID) are randomly assigned to testing threads such that the scope tree given in the specification remains valid. The memory model described in NVIDIA documentation [19, 20] (which is what we hope to test and eventually formalize) is unaware of concrete global IDs (e.g. thread ID = 1, CTA ID = 2); the model simply cares about the relationship between global IDs. That is, the model gives ordering guarantees based on whether threads are interacting within the same CTA or across different CTAs. This incantation attempts to get a good sampling of different concrete IDs over the relationships specified in the scope tree of the GPU .litmus file.

Randomizing global IDs and launch parameters can have several consequences for how the testing threads are executed on hardware. For example, multiple CTAs may be scheduled on an SM (streaming multiprocessor) if there are enough resources. By randomly selecting the number of threads per CTA (one of the limiting factors in how many CTAs are scheduled on an SM) and the CTA ID of testing threads, we allow the opportunity for testing threads to be mapped to a variety of SM assignments across the GPU. This may even allow some threads to be executed on the same SM, while others are on different SMs. Documentation states that when a bank conflict happens, memory accesses are serialized [19, p. 187]. Given that the general bank conflict incantation is a critical incantation, we believe that this serialization may facilitate weak memory model executions. For example, we hypothesize that two memory instructions may be reordered if one access is issued completely concurrently while the other must be completely serialized. By randomly assigning the thread ID of testing threads, our hope is that the testing thread is placed in a variety of places in the serialization order, thus exposing more weak behaviors.
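As an illustration, a host-side randomization for a configuration whose scope tree places T0 and T1 in the same CTA but different warps might look as follows; this is our sketch, and the bounds and names are illustrative rather than the tool's actual code.

// hypothetical per-iteration randomization for a D-warp:S-cta configuration
#include <cstdlib>
#define MAX_CTAS 64

void randomize_launch(void) {
    int cta_size  = 32 * (2 + rand() % 31);  // random CTA size, at least two warps
    int num_ctas  = 1 + rand() % MAX_CTAS;
    int num_warps = cta_size / 32;
    int t0_cid = rand() % num_ctas;
    int t1_cid = t0_cid;                     // same CTA, as the scope tree requires
    int t0_wid = rand() % num_warps;
    int t1_wid = (t0_wid + 1 + rand() % (num_warps - 1)) % num_warps;  // a different warp
    // the chosen IDs and launch parameters are then handed to the testing kernel:
    // test_kernel<<<num_ctas, cta_size>>>(t0_cid, t0_wid, t1_cid, t1_wid, ...);
}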
3.4.2 Synchronization

To allow testing threads to execute their respective tests closely in sync with one another, and hence promote interactions while memory values are actively moving through the memory system, testing threads synchronize immediately before the PTX programs specified in the GPU .litmus file are executed. This incantation is borrowed directly from the original litmus tool and is called synchronization. As a notable difference, GPUs do not guarantee forward progress for interactions at certain levels of the GPU thread hierarchy, and naive synchronization implementations are prone to deadlock. Specifically, CTAs are not guaranteed to be scheduled concurrently [19, p. 12], and threads in the same warp do not have forward progress guarantees with respect to each other [46].

To ensure that CTAs will be scheduled concurrently, we adopt the persistent thread model presented in [47], in which the number of CTAs launched is limited to be at most the number of SMs on the GPU. Because each SM can run at least one CTA, this ensures all CTAs will be run concurrently. To ensure threads within a warp do not deadlock, only a single thread per warp (i.e. the testing thread) is allowed to execute the synchronization barrier; this method was presented in [27]. Due to the warp synchronous execution model of the GPU, the other threads in the warp will not continue execution until the testing thread has been released from the barrier. Because only one instance of the GPU litmus test is executed per kernel, the barrier implementation needs only to synchronize testing threads once. This is accomplished via an atomic add instruction and a spin loop. The barrier values are reset at each iteration in the initialization kernel called in the main CPU loop.

3.5 Effectiveness of Incantations

In this section, we discuss the effectiveness of the incantations described in Sections 3.3 and 3.4. We benchmark all combinations of critical and extra incantations by running several tests which attempt to expose different reorderings, and we find the most effective incantations for different GPU configurations (see Section 2.5.1 for the configurations we test). We run each test 100,000 times on three different GPU chips across three generations of architectures; from oldest to newest, these chips are the Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell). We report the average number of weak behaviors observed per set of incantations.

3.5.1 Inter-CTA Incantations

We first consider how effective incantations are for inter-CTA GPU configurations. We benchmark three tests, chosen for the different reorderings they attempt to expose. These tests are:

• Message Passing (MP): This test is described in Section 5.2.1 and tests a handshake idiom.

• Load Delaying (LD): This test (also known as load buffering) is described in Section 5.2.2 and tests if load operations may be reordered with program order later write operations.

• Store Buffering (SB): This test is described in Section 5.2.3 and tests if store operations may be reordered with program order later read operations.

These benchmarks are provided to give a general idea of how effective incantations are and not as an exhaustive study on how to most effectively run individual tests. Therefore, we limit our benchmarking to these basic tests and do not consider tests with fences or other synchronization constructs (e.g. dependencies).

Table 3.1 shows the results of running these tests under different incantation combinations. The first column specifies the critical incantation used. Notice that if no critical incantation is present, no weak behaviors are observed despite the presence of extra incantations.

Table 3.1. Effectiveness of incantations for inter-CTA GPU configurations

Critical Incantations                    Extra Incantations      MP    LD    SB
None                                     None                    0     0     0
                                         Randomization           0     0     0
                                         Sync                    0     0     0
                                         Randomization + Sync    0     0     0
General Bank Conflicts                   None                    836   0     0
                                         Randomization           1984  0     3
                                         Sync                    0     0     0
                                         Randomization + Sync    2867  0     2
Memory Stress                            None                    234   653   760
                                         Randomization           290   313   291
                                         Sync                    6614  211   268
                                         Randomization + Sync    4878  2838  3328
Memory Stress + General Bank Conflicts   None                    73    28    6
                                         Randomization           368   92    93
                                         Sync                    202   223   35
                                         Randomization + Sync    2901  636   716
For the extra incantations, we write randomization for the random threads incantation discussed in Section 3.4.1 and sync for the synchronization incantation described in Section 3.4.2. We use the plus (+) symbol between two incantations when both are present.

We observe from the results in Table 3.1 that the number of weak behaviors observed is highly dependent on both the test and the incantations used. For example, for MP with general bank conflicts, we are unable to observe weak behaviors with only the sync extra incantation. We observe that the memory stress critical incantation with both sync and randomization seems to be the most effective set of incantations; however, the observations for LD and SB are greatly reduced if sync or randomization is used exclusively as the extra incantation.

3.5.2 Intra-CTA Incantations

We now consider how effective incantations are for intra-CTA GPU configurations. While in Section 3.5.1 we were able to use different tests to show the effectiveness of incantations, the only one of the three tests (MP, LD, SB) that we are able to observe for intra-CTA configurations is MP. This may be because, for intra-CTA configurations, our incantations are still not enough to expose weak behaviors, or because there is a stronger memory model implemented at this level. Because we are only able to observe MP, we only show results for this test. We have two variants of the MP test at this GPU configuration, which are:

• Message Passing Global (MP-Global): This is the same message passing test used in Section 3.5.1, except that under this GPU configuration, all threads are in the same CTA and target the global memory region.

• Message Passing Shared (MP-Shared): This is the same message passing test as MP-Global, but in this GPU configuration, all memory accesses target the shared memory region.

Table 3.2 shows the results of running these tests under different incantation combinations.

Table 3.2. Effectiveness of incantations for intra-CTA GPU configurations

Critical Incantations                    Extra Incantations      MP-Global  MP-Shared
None                                     None                    0          0
                                         Randomization           0          0
                                         Sync                    0          0
                                         Randomization + Sync    0          0
General Bank Conflicts                   None                    877        0
                                         Randomization           2150       2061
                                         Sync                    0          0
                                         Randomization + Sync    1989       2223
Memory Stress                            None                    7          0
                                         Randomization           7          0
                                         Sync                    2          0
                                         Randomization + Sync    0          0
Memory Stress + General Bank Conflicts   None                    0          0
                                         Randomization           336        1249
                                         Sync                    0          0
                                         Randomization + Sync    1360       1722

Similar to the inter-CTA tests, we observe that critical incantations are required for observing any weak behaviors and that the number of weak behaviors observed is highly dependent on both the test and the incantations used. In the intra-CTA tests, the general bank conflict incantation is by far the most effective; in fact, the memory stress incantation by itself produces very few, if any, weak behaviors. This is the opposite of what we observed for the inter-CTA tests, where memory stress was the most effective critical incantation. Additionally, for intra-CTA tests, the sync incantation without randomization produces no weak behaviors.

This section shows that the effectiveness of incantations depends heavily on the GPU configuration of the test. Currently, all incantations are controllable via command line arguments. Future work may analyze tests and dynamically configure incantations based on the GPU configuration in the test.

CHAPTER 4

NOTABLE RESULTS AND CASE STUDIES

In this chapter, we discuss notable testing results and case studies of CUDA applications.
We go over some initial notations and considerations in Section 4.1. The first results that we discuss are interesting with respect to general memory consistency properties (e.g. coherence) and documentation in the PTX ISA manual [20]. Specifically, Section 4.2 shows that some deployed GPUs implement controversial relaxed coherence behaviors. Section 4.3 discusses the L1 cache memory annotation on Fermi architectures and how it cannot be used reliably for any inter-CTA interactions; this has programming consequences, as it is the default memory annotation for the CUDA compiler. Section 4.4 tests the .volatile memory annotation and compares our observations with vendor documentation.

The second half of this chapter presents CUDA case studies where developers have made assumptions about the GPU memory model which may lead to erroneous behaviors. Section 4.5 discusses two GPU spin-locks which do not use fences: one from the popular CUDA by Example book [2] and the other from Stuart and Owens' paper entitled Efficient Synchronization Primitives for GPUs [48]. Both of these lock implementations assume that read-modify-write atomics provide sequentially consistent behavior; however, we show that this is not the case. We conclude by examining a GPU concurrent deque appearing in both a publication [49] and the book GPU Computing Gems: Jade Edition [50, pp. 485-499]. We show that the provided fence-less implementation could lead to the undesirable case of stale data being read from the deque.

4.1 Notations and Considerations

In the tests presented in this chapter, we use a parameterizable fence instruction that we denote membar.{scope}. This fence is then instantiated for the different membar scopes, namely .cta and .gl (the third scope, .sys, is used only a few times in this document for reasons given in Section 5.1). We say that the membar has scope None for tests with no fence. Some tests have more than one fence instruction; however, in this chapter, we only consider tests where both fences have the same scope annotation. That is, for scope .cta, all membars will have the .cta annotation. While this does not test all possible combinations of fences, this chapter is largely concerned with testing whether weak behaviors are observed and, if so, whether it is possible to experimentally disallow them. To that end, we do not enumerate all fence combinations. All testing results come from running 100,000 iterations.

Additionally, we observe far fewer weak behaviors on the GTX 750 (Maxwell) chip than on the other chips. We hypothesize several reasons for this. The GTX 750 is a substantially smaller chip than the others (having only 5 SMs); this means there are fewer physical resources to run threads that stress the memory system in the crucial memory stress incantation (see Section 3.3.2). Another reason might be that we have not fine-tuned our tool to test this chip, given that it has only been available for a few months at the time of writing. Finally, this chip may simply implement a stronger model than the others.

4.2 Coherence of Read-Read (CoRR)

Coherence is a property of memory consistency that applies only to single address behaviors. It has been defined as SC for a single address [14, p. 14]. Nearly all modern CPU memory models guarantee coherence, with the exception of Sparc RMO [51, pp. 265-267], which allows reads from the same address to be reordered. This behavior can be seen in the coherence of read-read (or CoRR) litmus test; a PTX instance of this test is shown in Figure 4.1.
initial state: x = 0

T0                    T1
st.cg.s32 [x], 1 ;    ld.cg.s32 r1, [x] ;
                      membar.{scope} ;
                      ld.cg.s32 r2, [x] ;

assert: 1:r1=1 ∧ 1:r2=0

Figure 4.1. Test specification for CoRR

In this test, T1 is able to read the updated value from memory, followed in program order by a read which returns stale data. If this behavior is allowed, we would like to investigate which memory fence (i.e. membar) placed between the loads in T1 is required to disallow it. This weak behavior (i.e. CoRR) has been controversial in CPU memory models, as it is observable on many ARM chips but confirmed as buggy behavior [16, 52]. Additionally, new language level memory models (e.g. OpenCL 2.0 [53] and C++11 [9]) disallow this behavior, and it is unclear how to efficiently implement such languages on hardware with this relaxation. We test this behavior on GPUs and show that older architectures (Fermi and Kepler) allow this behavior, but newer chips (Maxwell) experimentally do not.

Table 4.1 shows the results of running the CoRR test on three GPUs across three different architectures (Fermi, Kepler, and Maxwell). We test all three GPU configurations described in Section 2.5.1. We observe that CoRR is indeed observable on Kepler and Fermi architectures for all GPU configurations but is not observable at all on the newer Maxwell architecture. We observe that only the smallest scoped fence, membar.cta, is required to experimentally disallow this test for any of the tested GPU configurations.

Table 4.1. Results for CoRR tests

                                  Fermi         Kepler     Maxwell
GPU Configuration     Fence       Tesla C2075   GTX Titan  GTX 750
D-warp:S-cta-Shared   None        7356          8572       0
                      .cta        0             0          0
                      .gl         0             0          0
D-warp:S-cta-Global   None        3668          10047      0
                      .cta        0             0          0
                      .gl         0             0          0
D-cta:S-ker-Global    None        3246          4769       0
                      .cta        0             0          0
                      .gl         0             0          0

4.3 Fermi Memory Annotations

Recall that the .ca memory annotation loads from the L1 cache (see Table 2.2) and that separate CTAs may have separate L1 caches if they are mapped to different SMs (see Section 2.2). The PTX manual [20, p. 121] explicitly states that multiple L1 caches are incoherent:

Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread.

In this section, we test the L1 memory annotation (i.e. .ca) across CTAs to determine to what extent this annotation can be used reliably for inter-CTA interactions.

4.3.1 Message Passing Through L1

Consider the test shown in Figure 4.2. This type of test is named message passing (MP) and describes a handshake idiom. Specifically, T0 writes some data to location x followed by a flag to location y. We test if T1 is allowed to read the updated flag followed by a read in program order that returns stale data. Notice that we use the .ca memory annotation for all load operations; we dub this test MP-L1. Because our aim here is to test multiple L1 caches, we only consider the GPU configuration where T0 and T1 are in different CTAs, and thus x and y must be in the global memory region. This corresponds to the GPU configuration D-cta:S-ker-Global. We report the results of running MP-L1 in Table 4.2.

initial state: x = 0, y = 0

T0                    T1
st.cg.s32 [x], 1 ;    ld.ca.s32 r1, [y] ;
membar.{scope} ;      membar.{scope} ;
st.cg.s32 [y], 1 ;    ld.ca.s32 r2, [x] ;

assert: 1:r1=1 ∧ 1:r2=0

Figure 4.2. Test specification for MP-L1
Table 4.2. Results for MP-L1 tests

                                  Fermi         Kepler     Maxwell
GPU Configuration     Fence       Tesla C2075   GTX Titan  GTX 750
D-cta:S-ker-Global    None        11648         8129       3
                      .cta        455           3087       0
                      .gl         208           0          0
                      .sys        201           0          0

We observe that on Fermi architectures, no fence is strong enough to disallow the MP-L1 test. To emphasize this point, we include the .sys fence in these tests, which we otherwise largely exclude from this document for reasons explained in Section 5.1. We emphasize that the .sys fence is documented to be the strongest fence in the PTX documentation, as it enforces orderings across all interactions, including multidevice interactions [20, p. 169]. We observe that not even the .sys fence restores orderings to this example on the Fermi architecture; however, this behavior can be experimentally disallowed on Kepler and Maxwell with what we interpret to be the appropriately scoped fence (i.e. membar.gl). This behavior not appearing on Kepler and Maxwell is possibly because the documentation states that the L1 cache has been disabled for global memory accesses on these architectures and global memory is cached in the L2 cache regardless of the memory annotation [19, p. 194]. That is, we believe this issue to be the result of multiple L1 caches interacting; if the L1 cache is disabled for global memory accesses, then we will not see the symptoms of their interactions.

4.3.2 Mixing Memory Annotations

The previous section showed that inter-CTA interactions cannot implement a message passing (MP, or handshaking) protocol reliably (i.e. disallow stale values from being read from memory) using the .ca annotation exclusively for loads. In this section, we mix memory annotations in an attempt to restore orderings between multiple L1 caches. We hypothesize that we may be able to propagate values up from the L2 cache to the L1 cache by first reading the cache line from the L2. We get this hypothesis from the PTX ISA manual, which states that after an L2 load (i.e. .cg), "... existing cache lines that match the requested address in L1 will be evicted" [20, p. 121]. While it is not clear what guarantees (if any) are provided in this quote, it seems to suggest that a read from the L2 will somehow affect the L1 cache (e.g. by evicting values).

The most basic test we could think of to examine this behavior is a variation of CoRR (see Section 4.2) where we first read data from the L2 cache via the .cg memory annotation and then attempt to read the same data from the L1 cache via the .ca annotation. This would correspond to the memory value being propagated up the cache hierarchy (from the L2 to the L1) after it is first read from the L2. This test, which we dub CoRR-L2-L1, can be seen in Figure 4.3. The results of running this test are shown in Table 4.3. We observe that on the Fermi architecture, no fence is strong enough to guarantee that updated values will be read reliably from the L1 cache, even when they are first read from a shared cache. Similar to Section 4.3.1, to emphasize this point, we include the .sys fence in these tests, which we otherwise largely exclude from this document for reasons explained in Section 5.1.

initial state: x = 0

T0                    T1
st.cg.s32 [x], 1 ;    ld.cg.s32 r1, [x] ;
                      membar.{scope} ;
                      ld.ca.s32 r2, [x] ;

assert: 1:r1=1 ∧ 1:r2=0

Figure 4.3. Test specification for CoRR-L2-L1
Table 4.3. Results for CoRR-L2-L1 tests

                                  Fermi         Kepler     Maxwell
GPU Configuration     Fence       Tesla C2075   GTX Titan  GTX 750
D-cta:S-ker-Global    None        10247         4739       0
                      .cta        1989          0          0
                      .gl         1669          0          0
                      .sys        1706          0          0

4.3.3 CUDA Programming Consequences

Because of the two previous results, we are convinced that on Fermi architectures, the .ca memory annotation cannot be used for reliable inter-CTA communication at all (i.e. it is not possible to disallow stale values from being read from memory). Interestingly, the .ca memory annotation is the default annotation for the CUDA compiler [20, p. 121]. Therefore, any programmer who wishes to develop GPU code with inter-CTA interactions needs to explicitly specify that the L2 memory annotation (i.e. .cg) be used. This can be accomplished with the nvcc command line arguments: -Xptxas -dlcm=cg -Xptxas -dscm=cg. We show throughout Chapter 5 that we are able to reliably use fences to disallow stale values from being read when the L2 memory annotation is used.

As a further consequence, the (single) memory consistency example provided in the CUDA manual [19, p. 95] computes a reduction (i.e. sums the values of a vector) and uses a memory load to retrieve values across CTAs. Even though the example provides a fence, we have shown in this section that no fence is sufficient under the default compilation scheme (i.e. .ca memory annotations) to disallow stale values from being read. Thus this example is broken on Fermi architectures if compiled without explicitly specifying the .cg annotation, of which the CUDA guide makes no mention.

4.4 Volatile Operators

The PTX ISA provides the .volatile memory annotation with the following documentation [20, p. 136]: "st.volatile may be used with .global and .shared spaces to inhibit optimization of references to volatile memory. This may be used, for example, to enforce sequential consistency between threads accessing shared memory". It is not clear to us to which GPU configurations (i.e. inter- or intra-CTA, and which memory regions) this documentation extends sequential consistency guarantees (or whether fences are additionally required to provide sequential consistency); we see this phrasing as a potential source of confusion and test the behavior of this annotation in this section.

Figure 4.4 presents a simple MP style test using the .volatile annotation, which we dub MP-volatile. Table 4.4 shows the results of running this test on all GPU configurations discussed in Section 2.5.1.

initial state: x = 0, y = 0

T0                          T1
st.volatile.s32 [x], 1 ;    ld.volatile.s32 r0, [y] ;
membar.{scope} ;            membar.{scope} ;
st.volatile.s32 [y], 1 ;    ld.volatile.s32 r2, [x] ;

assert: 1:r0=1 ∧ 1:r2=0

Figure 4.4. Test specification for MP-volatile

Table 4.4. Results for MP-volatile tests

                                  Fermi         Kepler     Maxwell
GPU Configuration     Fence       Tesla C2075   GTX Titan  GTX 750
D-warp:S-cta-Shared   None        2007          3078       0
                      .cta        0             0          0
                      .gl         0             0          0
D-warp:S-cta-Global   None        822           3025       0
                      .cta        0             0          0
                      .gl         0             0          0
D-cta:S-ker-Global    None        699           7948       7
                      .cta        219           3120       0
                      .gl         0             0          0

We observe that without fences, the .volatile annotation does not enforce sequentially consistent behavior at any GPU configuration. However, weak behaviors can be experimentally disallowed when (what we interpret to be) the appropriate fences are included (.cta or .gl for the intra-CTA configurations and .gl for the inter-CTA configuration). While the exact intention of the documentation is unknown, we suggest a rewording to alleviate potential confusion. Tentatively, we suggest amending the original documentation as such:

st.volatile may be used with .global and .shared spaces to inhibit optimization of references to volatile memory. This may be used in conjunction with the appropriate memory fence to enforce sequentially consistent executions between threads.
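To make this amended guidance concrete, the following hedged CUDA sketch combines volatile accesses with __threadfence() (which compiles to membar.gl per Table 2.3) for an inter-CTA message passing idiom; the variables and function names are ours.

// hedged sketch: volatile accesses plus appropriately scoped fences
__device__ volatile int data = 0, flag = 0;  // compile to st/ld.volatile

__global__ void producer(void) {
    data = 42;
    __threadfence();    // order the data store before the flag store
    flag = 1;
}

__global__ void consumer(int *out) {
    while (flag == 0);  // spin until the flag is set
    __threadfence();    // order the flag load before the data load
    *out = data;
}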
4.5 Spin-Locks

In this section, we test two GPU spin-lock mutex implementations; the first is given in the book CUDA by Example [2], and the second is given by Jeff Stuart and John Owens in their paper Efficient Synchronization Primitives for GPUs [48]. We show that these implementations do not satisfy what is generally considered to be the correct specification for a mutex. Specifically, we show that for inter-CTA interactions, a critical section may read data values that are stale w.r.t. the previous critical section. We then show that the addition of memory fences experimentally provides the expected behavior. We document these behaviors in terms of short litmus tests and the results of running them in our testing framework.

4.5.1 CUDA by Example

CUDA by Example presents a mutex implementation for combining CTA-local partial sums [2, pp. 251-254]. The mutex implementation is a simple atomic compare-and-swap (i.e. CAS) spin-lock with an atomic exchange release. We reproduce a simplified version of the lock and unlock functions in Figure 4.5 for reference. Note that the original implementation had an error which we have repaired as given in the official errata for the book (see https://developer.nvidia.com/cuda-example-errata-page).

1 __device__ int mutex;
2
3 __device__ void lock( void ) {
4   while( atomicCAS( &mutex, 0, 1 ) != 0 );
5 }
6
7 __device__ void unlock( void ) {
8   atomicExch( &mutex, 0 );
9 }

Figure 4.5. Implementation of lock and unlock given in CUDA by Example

The locks are used to update a global value c with the CTA-local partial sums located in cache[0]. Only one thread per CTA executes this code. This part of the implementation is shown in Figure 4.6.

1 ...
2 //cacheIndex is equal to tid
3 if (cacheIndex == 0) {
4   lock.lock();
5   *c += cache[0];
6   lock.unlock();
7 }

Figure 4.6. Code snippet from the mutex example given in CUDA by Example

While the book does not explicitly mention memory consistency issues, the following paragraph suggests that the behavior typically expected from a lock can be obtained by only using atomic operations. For context, it is explaining why unlock must be an atomic exchange rather than simply a store [2, p. 254]:

Atomic transactions and generic global memory operations follow different paths through the GPU. Using both atomics and standard global memory operations could therefore lead to an unlock() seeming out of sync with a subsequent attempt to lock() the mutex. The behavior would still be functionally correct, but to ensure consistently intuitive behavior from the application's perspective, it's best to use the same pathway for all accesses to the mutex.

We distill this mutex implementation into a GPU litmus test named CAS spin-lock (abbreviated to CAS-SL), shown in Figure 4.7. The reader may wish to refer back to Table 2.2 for a description of some of the PTX instructions used in this test.

initial state: x = 0, y = 1

T0                            T1
st.cg.u32 [x], 1 ;            atom.cas.b32 r0, [y], 0, 1 ;
membar.{scope} ;              setp.eq.u32 r1, r0, 0 ;
atom.exch.b32 r0, [y], 0 ;    @r1 membar.{scope} ;
                              @r1 ld.cg.u32 r2, [x] ;

assert: 1:r0=0 ∧ 1:r2=0

Figure 4.7. Test specification for CAS-SL

This test describes two threads interacting via a CAS spin-lock. The y memory location is the mutex and x is the global data accessed in the critical section. The test begins in a state where T0 has the mutex (y = 1). T0 stores a value to x and then releases the mutex with an atomic exchange.
4.5.2 Efficient Synchronization Primitives for GPUs

In their paper Efficient Synchronization Primitives for GPUs, Stuart and Owens provide synchronization primitives for GPUs [48]. They include a spin-lock that is similar to the one presented in Section 4.5.1, the difference being that they use atomic exchange instead of compare-and-swap for the locking function. They go on to discuss how to optimize the mutex functions by reducing contention for a memory location using a method they refer to as a backoff strategy, which does not introduce any additional memory ordering operations (e.g. memory fences). The authors explicitly make the assumption that an atomic exchange can be used in place of a store and memory fence by stating [48, p. 3]: "Also, we use atomicExch() instead of a volatile store and threadfence() because the atomic queue has predictable behavior, threadfence() does not (i.e. it can vary greatly in execution time if other memory operations are pending)".

We were unable to locate unambiguous justifications for the above assumptions in any NVIDIA documentation (CUDA or PTX). The following paragraph from the PTX ISA may be related, but seems to be restricted to atomicity and single address interactions; it does not seem to account for memory accesses inside the critical section [20, pp. 166-167]:

  Atomic operations on shared memory locations do not guarantee atomicity with respect to normal store instructions to the same address. It is the programmer's responsibility to guarantee correctness of programs that use shared memory atomic instructions, e.g., by inserting barriers between normal stores and atomic operations to a common address, or by using atom.exch to store to locations accessed by other atomic operations.

We distill this mutex implementation into a litmus test named exchange spin-lock (abbreviated to EXCH-SL), shown in Figure 4.8, which describes two threads interacting via an atomic exchange spin-lock. The description is identical to the CAS-SL test described in Section 4.5.1, except that here atomic exchange is used for the locking mechanism instead of atomic compare-and-swap. The final constraint describes an execution where T1 successfully acquires the lock (1:r0 = 0), yet does not see the updated value in x (1:r2 = 0).

initial state: x = 0, y = 1

  T0                          | T1
  st.cg.u32 [x], 1 ;          | atom.exch.b32 r0, [y], 1 ;
  membar.{scope} ;            | setp.eq.u32 r1, r0, 0 ;
  atom.exch.b32 r0, [y], 0 ;  | @r1 membar.{scope} ;
                              | @r1 ld.cg.u32 r2, [x] ;

assert: 1:r0=0 ∧ 1:r2=0

Figure 4.8. Test specification for EXCH-SL

Table 4.6 shows the test outcomes for variants of the EXCH-SL test on three different chips. We only test GPU configuration D-cta:S-ker-Global because that is the interaction described in the paper. We observe that without fences, T1 can indeed load stale values. The .cta fence reduces the number of times we observe the weak behavior; however, the .gl fence is required to disallow the behavior based on our experimental results.

Table 4.6. Results for EXCH-SL tests

                                   Fermi         Kepler      Maxwell
  GPU Configuration       Fence    Tesla C2075   GTX Titan   GTX 750
  D-cta:S-ker-Global      None     98            1468        0
                          .cta     12            638         0
                          .gl      0             0           0

While the paper Efficient Synchronization Primitives for GPUs does not provide concrete examples using the locking mechanisms, this test distills a simple locking message passing idiom one might implement using this mutex. Traditionally, lock implementations have provided sufficient synchronization to ensure that critical sections observe the most recent values computed in previous critical sections [14, p. 64]; that is, values protected by locks should have sequentially consistent behavior (sequential consistency is described in Section 2.4). As seen in our results in Table 4.6, this is not the case without fences. Although the paper makes no claims about formal synchronization properties, given the traditional properties of locks, we feel that it may not have been intentional to allow such behaviors.
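As with CAS-SL, the results suggest that a .gl fence on each side of the critical section restores the traditionally expected lock behavior. A minimal sketch under that assumption, with the exchange-based acquire of [48] and __threadfence() again standing in for membar.gl (our modification, not the authors'):

  __device__ int mutex;

  __device__ void lock( void ) {
    while( atomicExch( &mutex, 1 ) != 0 );  // spin until we swap in 1 and read 0 back
    __threadfence();   // membar.gl before entering the critical section
  }

  __device__ void unlock( void ) {
    __threadfence();   // membar.gl before the releasing exchange
    atomicExch( &mutex, 0 );
  }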
4.6 Dynamic Work Balancing

In this section, we test certain behaviors of a concurrent deque used for dynamic load balancing on GPUs, as seen in [49] and again presented in the book GPU Computing Gems Jade Edition (Applications of GPU Computing Series) [50, pp. 485-499]. This technique is used in two applications: octree partitioning and four-in-a-row game simulation. We show that the provided implementations allow threads to read partial or stale data from the work deque in certain situations due to weak memory orderings on the hardware. We could not find any mention of weak memory model considerations in either publication documenting these concurrent deques. Specifically, the dynamic load balancing is set up in the form of work-stealing deques (one per CTA) containing abstract task types.
We show that when one thread steals a task immediately after it was pushed by another thread, the stealing thread may not observe the recently pushed task, yet the deque will be updated as if the recently pushed task had been correctly stolen. This can lead to several undesirable situations, including skipping tasks or loading partial tasks if tasks are large enough to be split into several load instructions (as is the case in both provided applications).

While the provided implementations of the octree partitioning and four-in-a-row simulator are advertised as being for architectures with compute capability sm 13, our testing framework largely targets generic address instructions, which are not supported earlier than sm 20. As such, we are unable to test on the advertised architecture. However, we believe this remains a substantial issue given that memory fences are supported on all architectures and the book GPU Computing Gems Jade Edition is used as a reference for modern GPU computing.

4.6.1 CTA Level Work Stealing Deques

Here we briefly describe the dynamic load balancing technique. Load balancing is achieved through concurrent work-stealing deques, one per CTA. The particular concurrent deque, described in [54], avoids expensive read-modify-write instructions in the common case. In this deque, there are separate global head and tail index values; tasks are added and removed by the deque owner at the tail index, leaving the head pointer on the opposite side of the deque (see Figure 4.9). The tail points to an empty cell and is decremented to find a task. If there are no tasks remaining in the deque, the CTA may try to steal a task from another CTA's deque at the head index. Because the deque owner and thieves access the deque from different ends, expensive synchronization is not needed when the deque contains more than one element. Synchronization may be required between multiple thieves accessing the same deque, but stealing is claimed to be the less common case.

Figure 4.9. Example configuration of the concurrent deque

4.6.2 Synchronization Between Owner and Thief

We now describe in detail the interaction in which an owner pushes a task and a thief immediately steals it. We first reproduce the code for the push and steal functions (adapted from [50, pp. 485-499]) in Figure 4.10. Note that head is a structure that contains an index and a counter. The counter is provided to avoid the ABA problem [55], which does not arise in our simple interaction.

  /* Only one thread ever calls this function, therefore
     no RMW (e.g. atomic add) is required. */
  __device__ void push( Task task ) {
    tasks[tail] = task;
    tail++;
  }

  /* Steal function. Head is the structure holding the index and the
     ABA counter; the type names here are ours, added for consistency. */
  __device__ Task steal( void ) {
    Head oldHead = head;

    /* Check for an empty deque. */
    if (tail <= oldHead.index)
      return EMPTY;

    Task task = tasks[oldHead.index];
    Head newHead = oldHead;
    newHead.index++;
    if (CAS(&head, oldHead, newHead))
      return task;

    /* Unable to steal because of another thief. */
    return FAILED;
  }

Figure 4.10. Implementation of push and steal for the concurrent deque

The purpose of this discussion is not to examine all possible interactions between a deque owner and a thief, but rather to examine and then test one particular interaction. This interaction starts with an empty deque (in this simplified interaction, all tasks are simply integers and locations are initialized to 0). This means that the head and tail indexes point at the same location, as seen in Figure 4.11. The deque owner then pushes a task to the deque (say the integer value 1) via the push function presented above. Now head points to the value 1 and tail has been incremented, as seen in Figure 4.12.

At this point, another CTA attempts to steal from the deque by calling the steal function. First, it checks for an empty deque. As we can see, the deque is not empty. Next, the task pointed to by head (whose value was copied into oldHead), at location 0, is loaded; this is the value 1. Finally, the thief checks whether another thief has already stolen the task by attempting to advance head with a compare-and-swap; if the CAS succeeds, the stolen task is returned.
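Although we could not test on the advertised sm 13 architecture, the following is a hedged sketch of where memory fences could be placed in the Figure 4.10 routines to address the stale-read interaction described above. The Task and Head types, the CAS helper, and the EMPTY/FAILED sentinels are carried over from the adapted code; the fence placement is our illustration, not a repair published with the applications.

  __device__ void push( Task task ) {
    tasks[tail] = task;
    __threadfence();   // publish the task data before the tail
                       // update makes it visible to thieves
    tail++;
  }

  __device__ Task steal( void ) {
    Head oldHead = head;
    if (tail <= oldHead.index)
      return EMPTY;
    __threadfence();   // order the tail read before loading the task data
    Task task = tasks[oldHead.index];
    Head newHead = oldHead;
    newHead.index++;
    if (CAS(&head, oldHead, newHead))
      return task;
    return FAILED;     // another thief got there first
  }

The fence in push orders the task store before the tail update that publishes it, and the fence in steal orders the read of tail before the read of the task data, mirroring the message passing idiom tested throughout this chapter.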



