| Title | Compiler optimizations and autotuning for stencils and geometric multigrid |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Kahlert School of Computing |
| Author | Basu, Protonu |
| Date | 2016 |
| Description | Stencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip. This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils. For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). 
GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered. The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, against commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound. |
| Type | Text |
| Publisher | University of Utah |
| Subject | autotuning; code generation; compilers; geometric multigrid; parallel programming; stencil computations |
| Dissertation Name | Doctor of Philosophy in Computer Science |
| Language | eng |
| Rights Management | © Protonu Basu |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 27,183 bytes |
| Identifier | etd3/id/4084 |
| ARK | ark:/87278/s6dj8q0g |
| DOI | https://doi.org/10.26053/0H-175S-S8G0 |
| Setname | ir_etd |
| ID | 197634 |
| OCR Text | COMPILER OPTIMIZATIONS AND AUTOTUNING FOR STENCILS AND GEOMETRIC MULTIGRID by Protonu Basu. A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science. School of Computing, The University of Utah, May 2016. Copyright © Protonu Basu 2016. All Rights Reserved.
The University of Utah Graduate School. STATEMENT OF DISSERTATION APPROVAL. The dissertation of Protonu Basu has been approved by the following supervisory committee members: Mary Hall, Chair, 7-Dec-2015 Date Approved; Samuel Williams, Member, 17-Nov-2015 Date Approved; Rajeev Balasubramonian, Member, 10-Nov-2015 Date Approved; Martin Berzins, Member, 10-Nov-2015 Date Approved; Ganesh Gopalakrishnan, Member, 10-Nov-2015 Date Approved; and by Ross Whitaker, Chair/Dean of the Department/College/School of Computing, and by David B. Kieda, Dean of The Graduate School.
ABSTRACT
Stencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip.
This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils.
For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered.
The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, against commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound.
To my parents and sister
CONTENTS
ABSTRACT ... iii
LIST OF FIGURES ... x
LIST OF TABLES ... xiv
ACKNOWLEDGEMENTS ... xv
CHAPTERS
1. INTRODUCTION ... 1
1.1 Geometric Multigrid and Stencil Computations ... 2
1.2 Classification of Stencil Computations ... 3
1.2.1 Stencil Coefficient Types ... 5
1.2.2 Radius of Stencils ... 7
1.2.3 Common Stencil Iterations ...
7
1.2.3.1 Jacobi ... 7
1.2.3.2 Gauss-Seidel and Gauss-Seidel Red-Black ... 7
1.2.4 Arithmetic Intensity of Stencils ... 8
1.2.5 Summary of Stencils Optimized ... 9
1.2.5.1 Higher-Order Methods ... 9
1.2.5.2 Grid Boundary Conditions ... 10
1.3 Approaches to Reducing Data Movement ... 10
1.4 Optimization Challenges ... 12
1.5 Domain-Specific Optimization Techniques ... 14
1.6 Domain-Specific Extensions and Autotuning ... 14
1.7 Contributions ... 16
1.8 Thesis Structure ... 17
2. CHILL ... 19
2.1 Organization of CHiLL ... 19
2.2 Integer Sets and Relations ... 20
2.3 Iteration Spaces ... 21
2.4 Relations for Loop Transformations ... 22
2.5 Dependence Graph and Legality of Transformations ... 24
2.6 Loop Transformations in CHiLL ... 24
2.6.1 Loop Permutation ... 25
2.6.2 Loop Skewing ... 25
2.6.3 Loop Tiling ... 26
2.6.4 Loop Fusion ... 26
2.7 Computing Footprint of Array References ... 30
2.8 Extending Polyhedral Technology in CHiLL ... 30
2.9 CUDA-CHiLL ... 31
2.9.1 Parallel Decomposition ... 31
2.10 Summary ... 32
3. THE miniGMG BENCHMARK ... 35
3.1 V-cycle ... 35
3.2 Domain-Decomposition and Parallelism ...
36
3.3 V-cycle and Operator Code Skeletons ... 41
3.3.1 Smooth ... 43
3.3.2 Residual ... 46
3.3.3 Restriction and Interpolation ... 46
3.4 Optimization Challenges ... 46
4. COMMUNICATION-AVOIDING OPTIMIZATIONS ... 49
4.1 Types of Communication ... 49
4.2 Communication-Avoiding Optimizations ... 50
4.2.1 Fusing Components of Smooth ... 50
4.2.2 Deep Ghost Zones ... 51
4.2.3 Wavefront Computation ... 53
4.2.4 Parallel Code Generation ... 61
4.2.5 Residual-Restriction Fusion ... 65
4.2.6 Smooth-Residual-Restriction Wavefront ... 66
4.3 Autotuning Opportunities ... 70
4.4 Compiler Implementation ... 71
4.4.1 Fusing Components of Smooth ... 72
4.4.2 Overlapping Ghost Zones ... 73
4.4.3 Stencils as Accumulation ... 74
4.4.4 Rebuilding the Dependence Graph ... 77
4.4.5 Wavefronts ... 78
4.4.6 Smooth Residual Restriction Fusion ... 80
4.4.7 Parallel Code Generation ... 80
4.5 Putting It Together ... 81
4.6 Results ... 82
4.6.1 Problem Specification ... 85
4.6.2 miniGMG Configuration ... 85
4.6.3 Manually Optimized and Baseline miniGMG ... 85
4.6.4 Evaluated Platforms ... 86
4.6.5 Analysis of GSRB Smooth Performance ... 86
4.6.5.1 Computing Roofline Memory Bounds ... 87
4.6.5.2 Optimizations and Smooth Performance ...
90
4.6.6 Smooth, Residual and Restriction Fusion ... 94
4.7 Summary and Conclusion ... 95
5. OPTIMIZATIONS FOR COMPUTE-INTENSIVE STENCILS ... 96
5.1 Stencil Definitions and Accuracy ... 97
5.2 Stencil Reordering: Partial Sums ... 98
5.2.1 Composition of Optimizations ... 105
5.3 Compiler Implementation ... 105
5.3.1 Background Review ... 106
5.3.2 Abstractions for Partial Sums ... 106
5.3.3 Deriving Stencil Points and Bounding Box (BB) ... 107
5.3.4 Deriving Coefficients (Coeff) ... 107
5.3.5 Deriving Partial Sums and Buffers ... 108
5.3.6 Exploiting Reuse in Partial Sums ... 109
5.3.7 Exploiting Symmetry to Reduce Floating-Point Operations ... 109
5.3.8 Code Generation ... 110
5.4 Experimental Results ... 111
5.4.1 Review of Evaluated Platforms ... 112
5.4.2 Problem Solved on miniGMG ... 112
5.4.3 Experimental Methodology ... 113
5.4.4 Computing Roofline Memory Bounds ... 116
5.4.5 Stencil Performance ... 117
5.4.6 Smooth Performance on the Fine Grid ... 119
5.4.7 Smooth Performance Throughout the V-Cycle ... 121
5.4.8 miniGMG Solver Performance and Error ... 122
5.4.9 Distributed Memory Results ... 127
5.5 Conclusions ... 127
6. GEOMETRIC MULTIGRID ON GPUs ... 130
6.1 Gauss-Seidel Red-Black (GSRB) Smooth on GPUs ... 131
6.1.1 Strategy for Parallel Decomposition of Smooth ... 131
6.2 Code Generation with CUDA-CHiLL ... 132
6.2.1 Mapping to Blocks and Threads ...
135
6.2.2 Extensions to Code Generation in CUDA-CHiLL ... 135
6.2.3 Space of Generated GSRB Variants ... 137
6.3 Experimental Results ... 137
6.3.1 miniGMG on GPUs ... 139
6.3.2 Experimental Methodology ... 139
6.3.3 Optimizations in Handtuned Code ... 141
6.3.4 Performance of Smooth ... 141
6.4 Conclusions ... 142
7. RELATED WORK ... 143
7.1 Optimizations to Reduce Data Movement ... 143
7.2 Stencil Reordering Optimizations ... 144
7.2.1 Comparison with Array Common Subexpression Elimination ... 144
7.2.2 Other Stencil Reordering Techniques ... 145
7.3 Optimizations for Solvers ... 146
7.4 Stencil Computations on GPUs ... 147
7.4.1 Manual Optimization Efforts ... 147
7.4.2 Domain-Specific Automated Optimization Efforts ... 148
7.5 Summary and Conclusions ... 149
8. FUTURE RESEARCH AND CONCLUSIONS ... 150
8.1 Contributions ... 150
8.1.1 Optimizing Memory Bandwidth Limited Stencils ... 151
8.1.2 Optimizing Compute-Intensive Higher-Order Stencils ... 151
8.1.3 Parallel Code Generation ... 152
8.2 Future Work ... 153
8.2.1 Nonperiodic Boundary Conditions ... 153
8.2.2 Target Emerging Many-Core Architectures ... 154
8.2.3 Code Generation for Emerging Runtimes ... 154
8.3 Conclusion ...
155
APPENDIX: STENCILS ... 156
REFERENCES ... 159
LIST OF FIGURES
1.1 The Geometric Multigrid V-cycle. The hierarchical linear solver starts with large, fine-resolution grids and comes down the V-cycle by successive application of smooth, residual, and restriction. The GMG goes back up the V-cycle by applications of smooths and interpolation. ... 3
1.2 Visualization of stencils optimized: Figures (a)(b)(c)(d) are constant-coefficient 3D 7-, 13-, 27- and 125-point stencils, respectively. Figures (e) and (f) are 2D cross-sections of 3D stencils, (e) is a 7-point variable coefficient stencil, and (f) illustrates restriction and interpolation stencils. ... 4
1.3 Illustration of (a) Jacobi iterations and (b) Gauss-Seidel Red-Black (GSRB) iterations with a 2D 5-point stencil on a grid. Jacobi is an out-of-place sweep. To compute a stencil it reads in 5 points from an input grid and writes out a computed value to an output grid. GSRB partitions the grid points into either red or black points such that each red point is surrounded by black points and vice versa. Each GSRB sweep updates either the red or the black points. It is an in-place operation, as the grid being updated is also read. The figure illustrates the update of a red point, which requires its own value and values from the black neighbors. ... 8
1.4 Visualization of periodic boundary conditions for structured grids. Figure (a) shows a 2D 5-point stencil and (b) illustrates application of the stencil on the upper-left corner of a 2D grid. Periodic boundary conditions mean points outside the boundary wrap around the grid. For the 2D 5-point stencil which reads points 0-4, the points 3 and 4 wrap around the grid as illustrated by the dashed lines. ... 11
1.5 Autotuning based on the CHiLL compiler framework. CHiLL has a scripting language interface. Novel transformations were built into CHiLL. The autotuners implemented in the research presented here generated CHiLL scripts to direct optimization. CHiLL ran the scripts to generate code variants. ... 15
3.1 The multigrid V-cycle for solving Lu^h = f^h.
Note, superscripts denote grid spacing. ... 37
3.2 Visualization of the domain/process/subdomain hierarchy in miniGMG. ... 38
3.3 Execution of the miniGMG V-cycle. A node is assigned a 256^3 domain which is decomposed into a list of 64^3 subdomains (boxes). The list of subdomains is partitioned into MPI processes. The subdomains owned by each MPI process are then computed in parallel by a single OpenMP thread. ... 39
3.4 Application of a 2D 5-point stencil on a 4x4 2D grid. (a) Shape of the stencil. (b) Application of this stencil on an interior and a boundary point of the grid. Ghost zone is shaded gray. ... 40
3.5 Visualization of ghost zones and data exchange between subdomains. A 12x12 domain (grid) is geometrically decomposed into 9 4x4 subdomains (tiles). Each subdomain needs a ghost zone which must be exchanged between neighboring tiles after a stencil computation is applied to the entire domain. ... 40
4.1 Two applications of a 2D 5-point stencil on a 4x4 grid with a two-deep ghost zone. (a) The stencil is first applied on an 8x8 grid to compute the values for the 6x6 grid (blue points). The outer layer of grid points shaded grey are read and used as the ghost zone. (b) The second stencil sweep uses the 6x6 grid to compute the output of the 4x4 grid (red points). For the second stencil sweep, the blue points are used as the ghost zone. ... 54
4.2 A deeper ghost zone changes the communication pattern and volume between a subdomain and its neighbors. A one-deep ghost zone requires exchange with left, right, top, and bottom neighbors. An additional layer of ghost zone now requires exchange with neighbors on the corners as well. In addition, a larger volume of data must be exchanged. ... 54
4.3 Progress of the GSRB wavefront. ... 57
4.4 Jacobi wavefront. ... 58
4.5 Multiple threads working collaboratively to process a subdomain/box. ... 62
4.6 Example parallel decompositions on Hopper, which has 6 cores per socket.
All the boxes in a subdomain may work in parallel, or all the threads may work on one box collaboratively, or nested parallelism may be used. ... 65
4.7 Wavefront applying smooths, residual and restriction. ... 68
4.8 Code generation steps for accumulation transformation. ... 76
4.9 Loop skewing and loop permutation to create a wavefront. ... 79
4.10 Speedups of CHiLL-generated and manually tuned GSRB smooth relative to the baseline code on Hopper. The speedups are shown for all levels of the V-cycle. Generated code outperforms expert-written code on all sizes except the 32^3 box, and always outperforms baseline code. ... 88
4.11 Speedups of CHiLL-generated and manually tuned GSRB smooth relative to the baseline code on Edison. The speedups are shown for all levels of the V-cycle. Generated code outperforms expert-written code on all sizes except the 32^3 box, and always outperforms baseline code. ... 89
4.12 Speedup for the MG solver attained from incrementally enabled optimizations in the CHiLL compiler. The performance results are categorized by type of smooth and architecture. "VC" is variable-coefficient. ... 93
5.1 (top) Visualizations of the discretized 3D Laplacian operators (stencils) used in this chapter. (bottom) 2D cross sections through the centers of the 3D stencils. Color is used to denote the coefficient associated with that point. The 27- and 125-point stencils have complex symmetries that we exploit. ... 98
5.2 Visualization of 2D 9-point stencil application on a 2D grid. The figure shows the stencil operator being applied on three consecutive iterations of the inner loop (j,i), (j,i+1) and (j,i+2). The edge of points {(j+1,i+1), (j,i+1), (j-1,i+1)}, bound by the bold rectangle, gets reused by the three iterations. The coefficients of the stencil are color coded, blue = w1, green = w2, and red = w3. ... 101
5.3 Illustration of deriving partial sums. The right edge from the input array is loaded and multiplied by weights stored in the array of coefficients. The sums of products of the loaded points and the right, center, and left edges of the array of coefficients are B0, B1, and B2, respectively.
... 102
5.4 The reuse of the leading edge of points loaded at iteration <j,i> gets captured in three buffers R, C, and L (top). Buffer entries R[i] (1), C[i+1] (2), L[i+2] (3) correspond to B0, B1, and B2 from Figure 5.3. The loaded edge is factored into r1 and r2 based on symmetry of the color-coded coefficients. The factors are used to compute B0, B1, and B2. The final output out[j][i] is the sum of buffer entries. ... 102
5.5 Increasing symmetries in coefficients allow us to increasingly reduce floating-point computation. Symmetry about the j axis (in b) permits discarding half the coefficients, and symmetry about the i axis (c) and the diagonal (d) lets the compiler consider even fewer coefficients. ... 103
5.6 Visualization of a 2D plane from the 3D array of coefficients for the 125-point stencils (left). As shown in Figure 5.5(d), there are 6 unique coefficients 0-5. When applying the 125-point stencil at iteration <j,i> using partial sums, the leading 2D plane of 25 points is loaded and factored according to the unique coefficients (right). The six factors r0-r5 are multiplied by appropriate coefficients to compute partial sums which are then buffered. The factors are created by summing loaded points which are multiplied by the same coefficient, and coefficient 0 corresponds to the loaded point in[k][j][i+2]. ... 103
5.7 Code generation steps for partial sum transformation. ... 111
5.8 ApplyOp (y=Ax) stencil performance attained with the CHiLL compiler by optimization, operator, and platform. The Roofline memory bound is for the noncommunication-avoiding implementation. Partial sums can move compute-limited operations towards a memory-limited state. ... 118
5.9 Jacobi smooth performance attained with the CHiLL compiler by optimization, operator, and platform. The Roofline memory bound is for the noncommunication-avoiding (no wavefront) implementation and is lower than ApplyOp due to additional data movement like the RHS. The wavefront transformation allows CHiLL to exceed this limit.
... 120
5.10 Jacobi smooth performance on Edison attained with the CHiLL compiler as a function of level in the V-cycle (256^3 fine grids down to 16^3 coarse grids) for the 27- and 125-point operators. Observe that the reference implementation of the memory-limited 27-point operator receives a cache boost on the coarser levels, while the compute-limited 125-point does not. ... 123
5.11 miniGMG performance (millions of degrees of freedom solved per second) using either the Intel compiler (baseline) or the CHiLL compiler. The labels indicate the overall solver speedup attained via CHiLL. The performance of the tenth-order 125-point solver is within a factor of 2 of the second-order 7-point solver, but provides nearly a million times lower error. ... 125
5.12 Error attained in miniGMG as a function of operator and grid size. A 1GB vector represents a single 512^3 grid. Multigrid will require several of these grids. Observe that the tenth-order method delivers a three-digit increase in accuracy for every 8x increase in memory. ... 126
5.13 The performance benefit of CHiLL is not lost as one weak scales miniGMG with a 27-point stencil to 31,944 cores (1331 nodes). The above figure illustrates weak-scaled miniGMG time to solution using either the baseline code or the CHiLL compiler. Four weighted Jacobi relaxations are used at each level, with BiCGStab as the bottom solver, and a 256^3 domain per node. ... 128
6.1 Organization of CUDA threads into a 3D (2,4,64) grid with 2D (32,16) thread blocks. Each 2D (X,Y) plane in the 3D grid has 8 2D thread blocks of dimension (32,16). Each 2D plane in the 3D grid works on a single subdomain (box) in miniGMG, and there are 64 such planes to process 64 subdomains. ... 133
6.2 Illustration of work done by each thread block. Figure (a) shows 8 (32,16) thread blocks working on a 64^3 subdomain (box), with each thread computing a column of 64 output points. The blue column in figure (b) represents the 32*16*64 = 32,768 output grid points computed by each thread block.
... 133
6.3 Execution times for 480 iterations of GSRB (240 Red, 240 Black) smooth on 64^3 boxes. A lower bar means better performance. The horizontal lines correspond to the performance of manually tuned codes. The x-axis shows the 2D thread block sizes (TX,TY). The grid corresponding to each thread block is (64/TX, 64/TY, 64). Best performance was achieved with thread block (64,16) and grid (1,4,64). ... 140
6.4 The fraction of the effective DRAM bandwidth achieved by GSRB smooth. Higher values correspond to better performance. ... 140
LIST OF TABLES
1.1 Description of stencils optimized. VC and CC stand for variable and constant coefficient, respectively. #Flops and #Bytes are per update of a grid point, and we only account for compulsory cache misses. We also take into account write allocation, that is, we do not consider cache bypass. ... 5
2.1 A subset of transformations available in CHiLL. ... 24
4.1 The table illustrates the classification of optimizations described in this chapter as reducing vertical communication, horizontal communication, or both. Parallel code generation has been left out as it is not a communication-avoiding optimization but helps improve wavefront, which reduces vertical communication. ... 51
4.2 Overview of evaluated platforms. ... 87
4.3 Configurations of best-performing code variants for GSRB smooth on Hopper. ... 88
4.4 Configurations of best-performing code variants for GSRB smooth on Edison. ... 89
5.1 The table shows the Roofline memory (DRAM) bounds for the ApplyOp computation with four different stencils. ApplyOp reads in a grid, applies the stencil operator to it and writes out the computed values to a different output grid. The bounds are expressed in terms of Million Stencils per Second that can be computed when DRAM bandwidth is the bottleneck. The stencil was applied to a 256^3 domain which was split into a list of 64^3 subdomains.
To account for ghost zones, a 64^3 subdomain translates to 66^3 and 68^3 grids for stencils with radius 1 and 2, respectively. Stencils with larger radius require larger data volumes and have lower Roofline memory bounds. ... 117
5.2 The table shows the Roofline memory (DRAM) bounds for the Jacobi smooths using four different stencils. The bounds are expressed in terms of Million Stencils per Second that can be computed when DRAM bandwidth is the bottleneck. In addition to input and output grids, smooths require reading in two more grids (arrays), rhs and lambda. This extra data movement means smooths have lower Roofline bounds than ApplyOp. ... 117
5.3 CHiLL was able to select optimizations uniquely for each multigrid level and platform. <#,#> denotes the number of inter- and intra-box threads with nested OpenMP. ... 124
6.1 Overview of evaluated NVIDIA GPU. ... 139
ACKNOWLEDGEMENTS
First, I'd like to thank my dissertation advisor, Mary Hall, for her encouragement and guidance. Her dedication to research and emphasis on tackling important, challenging, and real problems has been inspiring. Over the years, she has provided an incredible amount of help in writing and presenting my research. This dissertation has greatly benefited from her advice and editing.
Most of the research in this dissertation has been done in collaboration with researchers from Berkeley Lab as part of the x-tune project. I would like to thank Samuel Williams, Leonid Oliker, Brian Van Straalen, and Phillip Colella. They have helped give my research a new direction.
Samuel Williams, who is also on my committee, has been pivotal to my PhD. Sam's technical knowledge, attention to detail, and most importantly, enthusiasm, have benefited me immensely. I am thankful to Sam for all the discussions in person and over email, and for being on my PhD committee.
I am thankful to my dissertation committee members Martin Berzins, Ganesh Gopalakrishnan, and Rajeev Balasubramonian for their feedback and encouragement. I hope some of our discussions can lead to future collaborations.
My labmates in the CTOP group have been good friends, collaborators, and proofreaders. I would like to thank Anand Venkat, who has helped me debug code at all hours of the day. I also want to thank Manu Shantharam, Saurav Muralidharan, and Amit Roy for giving me feedback on drafts of my papers and dissertation. I am thankful to Chun Chen, who had implemented CHiLL. I wish he was around longer.
My friends in Salt Lake City and elsewhere have been crucial to my life as a graduate student. I'd like to especially thank Dan, Keita, Mike, Didem, Nil, Anusua, Anand, Shusen, Megan, Grant, Kristina, Grace, Nic, and Avishek for making the last six years enjoyable, and helping me handle the ups and downs of the PhD process. I would like to thank my parents, Partha Sarathi and Subhra Basu, and my sister Rajashree Basu Kundu for their love, support and patience. I am forever grateful to them for the values they have instilled in me, and for believing in me even when most others didn't. I am humbled by their sacrifices that made it possible for me to travel to the other side of the world and pursue my graduate studies.
I would like to thank the National Energy Research Scientific Computing Center (NERSC) for access to their machines Hopper and Edison. NERSC is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. I would also like to express my gratitude to Berkeley Lab for hosting me for the 2013 and 2014 summers. This research was funded by the Department of Energy Office of Science award #DE-SC0008682, the National Science Foundation award #CCF-1018881, and the Department of Defense, through a contract to the University of Maryland.
CHAPTER 1
INTRODUCTION
Stencil computations are operations on structured grids. A structured grid [1] represents a uniformly discretized continuous domain, and a stencil application on the grid is simply an application of a differential operator. Stencil computations are a ubiquitous pattern in parallel computing, and they are frequently found at the heart of partial differential equation (PDE) solvers. As PDE solvers are used in a large fraction of scientific applications, ranging from fluid dynamics to electromagnetics, the performance of stencil computations is critical to scientific computing.
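As a concrete illustration of such a stencil (a minimal sketch, not code from the dissertation), one sweep of a constant-coefficient 3D 7-point stencil updates each interior point from its own value and its six face neighbors; the grid name, coefficients c0 and c1, and sizes below are illustrative choices:

```python
import numpy as np

def apply_7pt(in_grid, out_grid, c0, c1):
    """One sweep of a constant-coefficient 3D 7-point stencil.

    Each interior output point is a weighted sum of the grid point and its
    six face neighbors; the one-deep boundary acts as a ghost zone and is
    left untouched.
    """
    nz, ny, nx = in_grid.shape
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                out_grid[k, j, i] = (
                    c0 * in_grid[k, j, i]
                    + c1 * (in_grid[k - 1, j, i] + in_grid[k + 1, j, i]
                            + in_grid[k, j - 1, i] + in_grid[k, j + 1, i]
                            + in_grid[k, j, i - 1] + in_grid[k, j, i + 1])
                )

# Apply the stencil to a point source: a discrete Laplacian-style operator.
grid = np.zeros((6, 6, 6))
grid[3, 3, 3] = 1.0
out = np.zeros_like(grid)
apply_7pt(grid, out, c0=-6.0, c1=1.0)
```

Every output point touches seven inputs but performs only a handful of flops, which is why (as the next paragraph notes) these kernels are memory bandwidth-bound.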
Stencil computations traditionally have very low arithmetic intensities, where arithmetic intensity is the number of floating-point operations performed per byte read from memory. Stencil computations execute as low as 0.2 floating-point operations (flops) per byte. This low arithmetic intensity, combined with the fact that memory bandwidth of modern machines lags far behind floating-point power, makes stencil computations notoriously memory bandwidth-limited. Thus, improving performance of stencil computations requires reducing and managing data movement.
The importance of stencil computations in scientific computing, and the widening performance gap between computation and data movement, has motivated efforts from diverse research communities to develop optimizations for stencils. Code optimization experts have designed manual and automatic optimizations to improve memory bandwidth use, and applied mathematicians have developed compute-intensive higher-order stencils that need far less memory.
In this dissertation we optimize both traditional memory bandwidth limited stencils and the more compute-intensive higher-order stencils using a compiler-based approach. Novel optimizations for memory- and compute-bound stencils are implemented in a compiler framework. A compiler-based approach not only helps improve application programmer productivity, but also allows the code optimization expert to leverage decades of compiler research. This compiler-based approach is used to optimize stencil computations in isolation, and in the context of a linear solver, Geometric Multigrid.
1.1 Geometric Multigrid and Stencil Computations
Multigrid (MG) [2] methods are extensively used in a variety of numerical simulations to solve linear systems (Ax = b). Multigrid methods use a hierarchy of grids with different resolutions to accelerate the convergence of iterative linear solvers.
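To make the arithmetic-intensity figure concrete, here is a back-of-the-envelope calculation (an illustrative sketch, not taken from the dissertation; the flop and byte counts below are one common accounting, assuming double precision, perfect cache reuse of neighboring points, and write-allocate traffic):

```python
def arithmetic_intensity(flops_per_point, compulsory_bytes_per_point):
    """Flops per byte of compulsory DRAM traffic for one grid-point update."""
    return flops_per_point / compulsory_bytes_per_point

# Constant-coefficient 3D 7-point stencil, double precision:
#   8 flops  (6 adds + 2 multiplies for out = c0*x + c1*sum_of_neighbors)
#   24 bytes (8 to read one new input point, assuming the other six are
#             reused from cache, + 8 to write the output + 8 for the
#             write-allocate fill of the output cache line)
ai = arithmetic_intensity(8, 24)  # about 0.33 flops/byte
```

Variable-coefficient variants move additional coefficient arrays and land even lower, consistent with the roughly 0.2 flops/byte cited above; either way the figure sits far below the machine balance of modern processors.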
Traditional iterative solvers operate on grids at a single resolution and require a higher number of iterations. Multigrid uses corrections of the solution from iterations on the coarser levels to improve the convergence rate of the solution at the finest level. Geometric Multigrid (GMG) is a special case of MG where the linear operator A is simply a stencil computation applied to the grid x.

Geometric Multigrid (GMG) is a hierarchical approach to solving a linear system. GMG has four key operations: smooth, residual, restriction, and interpolation. These operations all involve computation of stencils and are applied in a sequence known as the V-cycle, illustrated in Figure 1.1. GMG starts with a large, fine-resolution grid. It applies several iterations of smooth on it, calculates the residual and then restricts into a smaller, coarser grid. This sequence is applied multiple times until a bottom grid size is reached. Further smooths are applied to the coarsest grid at the bottom level. Once the coarsest grid is solved, the algorithm applies the solution to the finer grids. This is done by consecutive applications of smooths followed by interpolation. Interpolation is the inverse of the restriction operation and interpolates values from a small coarse grid to a larger, finer one.

Smooth involves a stencil computation followed by pointwise grid updates. Multiple iterations of smooth are applied at each grid level in GMG; the time spent in smooth dominates runtime. Residual is also a stencil computation, but unlike smooth, only a single iteration is required at each level. Restriction and interpolation are stencil computations which are inverses of each other. Restriction is a reduction operation which averages multiple points on a fine grid and writes out the value to a smaller, coarser grid. Interpolation is a scatter operation which maps a single point in a coarse grid to multiple points in a finer grid.

Figure 1.1: The Geometric Multigrid V-cycle. The hierarchical linear solver starts with large, fine-resolution grids and comes down the V-cycle by successive application of smooth, residual, and restriction.
The GMG goes back up the V-cycle by applications of smooths and interpolation.

1.2 Classification of Stencil Computations

Figure 1.2 illustrates the various stencils optimized in this dissertation. The following four features of stencil computations have been used to characterize the various stencils. These stencil features help understand performance characteristics and identify optimization challenges and opportunities.

1. Stencil coefficient types
2. Radius of stencils
3. Stencil iteration types
4. Arithmetic intensity of stencils

Table 1.1 lists the stencils from Figure 1.2 with the corresponding characteristics mentioned above. The first five stencils in Figure 1.2 ((a)-(e)) are used in smooth and residual. In GMG, smooth and residual operations use the same stencil. For example, a GMG where smooth uses a 7-point stencil will also have a residual operation using the same 7-point stencil. The last stencil, Figure 1.2(f), represents both the restriction and interpolation operations. Restriction reads 8 points, averages their values, and writes the result to a coarser grid.

Figure 1.2: Visualization of stencils optimized: Figures (a)(b)(c)(d) are constant-coefficient 3D 7-, 13-, 27- and 125-point stencils, respectively. Figures (e) and (f) are 2D cross-sections of 3D stencils, (e) is a 7-point variable coefficient stencil, and (f) illustrates restriction and interpolation stencils.

Table 1.1: Description of stencils optimized. VC and CC stand for variable and constant coefficient, respectively. #Flops and #Bytes are per update of a grid point, and we only account for compulsory cache misses. We also take into account write allocation, that is, we do not consider cache bypass.
Stencil            Radius   Iterations     #Flops   #Bytes   Arithmetic Intensity
7-point VC         1        GSRB           17       80       0.21
7-point VC         1        Jacobi         17       48       0.35
7-point CC         1        Jacobi         8        24       0.33
13-point CC        2        Jacobi         15       24       0.63
27-point CC        1        Jacobi         32       24       1.33
125-point CC       2        Jacobi         134      24       5.58
8-pt Restriction   1        out-of-place   1        10       0.10

The coarse grid written by restriction is thus half the size of the fine grid in each dimension. Interpolation is the inverse operation, where a single point from the coarse grid is scattered to eight points in the fine grid.

1.2.1 Stencil Coefficient Types

Stencil computations sweep through grids, performing a weighted sum of points read from an input grid (array). If the weights used are constant scalars, it is termed a constant coefficient stencil. However, if the coefficients or weights change from one grid point to another, they are not constants and have to be stored in separate grids. These types of stencil computations are classified as variable coefficient stencils.

Stencils in Figures 1.2(a)(b)(c)(d) are constant coefficient stencils; the color at each stencil point in the figures represents the weight or coefficient. Figure 1.2(e) represents a 3D variable coefficient 7-point stencil. Like the stencil in Figure 1.2(a), this stencil also reads in 7 points, but does not use scalar constants for weights; instead it reads the weights from arrays beta_i, beta_j and beta_k. Code Listings 1.1 and 1.2 show simplified code for the 7-point constant coefficient and variable coefficient stencil operators, respectively. In variable coefficient stencils, in addition to the input and output grids, the grids corresponding to the coefficients are also accessed, creating higher traffic throughout the memory hierarchy.

double w1, w2;

// (ghost-zone/boundary handling omitted for clarity)
for (k = 0; k < N; k++)
  for (j = 0; j < N; j++)
    for (i = 0; i < N; i++) {
      phi_out[k][j][i] = w1 * phi_in[k][j][i]
        + w2 * ( phi_in[k][j][i+1] + phi_in[k][j][i-1]
               + phi_in[k][j+1][i] + phi_in[k][j-1][i]
               + phi_in[k+1][j][i] + phi_in[k-1][j][i] );
    }

Listing 1.1: Out-of-place grid sweeps for Jacobi iterations with a 3D 7-point constant coefficient stencil operator. The code snippet corresponds to the stencil in Figure 1.2(a). The constant coefficients are w1 and w2.
int sweep_color;

// (ghost-zone/boundary handling omitted for clarity)
for (k = 0; k < N; k++)
  for (j = 0; j < N; j++)
    for (i = 0; i < N; i++) {
      if ((i + j + k + sweep_color) % 2 == 0) {
        phi[k][j][i] =
            beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
          - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
          + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
          - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
          + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
          - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] );
      }
    }

Listing 1.2: In-place grid sweeps for Gauss-Seidel Red-Black iterations with a 3D 7-point variable coefficient stencil operator. The code snippet corresponds to the stencil in Figure 1.2(e). The variable coefficients are beta_i, beta_j and beta_k.

1.2.2 Radius of Stencils

In stencil computations, at each grid point, a set of neighboring points is read. The radius of the stencil is the offset of the farthest point read in each dimension. The 7- and 27-point stencils have a radius of 1, as they read ±1 points in each of the i, j, k dimensions. Similarly, the 13- and 125-point stencils read ±2 points in each of the i, j, k dimensions and have a radius of 2. Stencils with a larger radius need to read in more grid points, have a larger working set, and also perform an increased number of floating-point operations.

1.2.3 Common Stencil Iterations

Stencil iterations can be either out-of-place grid sweeps or in-place grid sweeps. In an out-of-place sweep, the grid being updated with computed values is different from the grids being read. In an in-place sweep, the grid being read and written is the same. The three common types of stencil iterations are Jacobi, Gauss-Seidel, and Gauss-Seidel Red-Black (GSRB).

1.2.3.1 Jacobi

Figure 1.3(a) illustrates Jacobi iterations with a 2D 5-point stencil, and Listing 1.1 shows code for Jacobi iterations with a 3D 7-point constant coefficient stencil. The grid being updated is different from the grid being read. All output grid points can be updated independently, making Jacobi iterations an embarrassingly parallel computation.
1.2.3.2 Gauss-Seidel and Gauss-Seidel Red-Black

Gauss-Seidel iterations are in-place grid sweeps, where the grid being updated is also being read. This means that at a given grid sweep, some points would have been updated and others would not. There is thus a dependence on the grid points being updated, which limits parallelism.

Gauss-Seidel Red-Black iterations aim to increase the parallelism in Gauss-Seidel iterations by partitioning the read/write grid into red and black points. The red points are surrounded by black points and vice-versa. A single Gauss-Seidel iteration then gets divided into two iterations: a red iteration followed by a black one. The red iteration updates the red grid points by reading a red point and its neighboring black points, and the black iteration updates the black points similarly. The red and black iterations are embarrassingly parallel, as all writes are independent.

Figure 1.3(b) illustrates the use of a 2D 5-point stencil in GSRB iterations, where a red point is being updated with values from itself and the neighboring black points. Listing 1.2 shows GSRB iterations for a 3D, variable-coefficient 7-point stencil, where phi is being read and updated. The if-condition which guards the statement ensures that only red or black points get updated based on the value of sweep_color.

Figure 1.3: Illustration of (a) Jacobi iterations and (b) Gauss-Seidel Red-Black (GSRB) iterations with a 2D 5-point stencil on a grid. Jacobi is an out-of-place sweep. To compute a stencil it reads in 5 points from an input grid and writes out a computed value to an output grid. GSRB partitions the grid points into either red or black points such that each red point is surrounded by black points and vice versa. Each GSRB sweep updates either the red or black points. It is an in-place operation, as the grid being updated is also read. The figure illustrates the update of a red point which requires its own value and values from the black neighbors.
1.2.4 Arithmetic Intensity of Stencils

Arithmetic intensity of a stencil computation is the ratio of floating-point operations (flops) performed to the memory (in bytes) that needs to be read in from the DRAM. By quantifying the balance between computation and communication, arithmetic intensity identifies a stencil as either being limited by memory bandwidth or floating-point intensity, and thus guides the selection of required optimizations. The arithmetic intensity for the stencils discussed in this dissertation is shown in Table 1.1.

To compute a stencil for Jacobi iterations of the 7-, 13-, 27- and 125-point constant coefficient stencils (rows 3-6), we need to read 8 bytes, write-allocate 8 bytes and write 8 bytes, a total of 24 bytes per stencil. The 7-, 13-, 27- and 125-point stencils need 8, 15, 32 and 134 flops, and thus they have approximate arithmetic intensities of 0.33, 0.63, 1.33 and 5.58, respectively.

Variable coefficient (VC) stencils read in more arrays (grids) and thus have much higher requirements, as can be seen from rows 1 and 2 in Table 1.1. The Jacobi iterations for the 7-point VC stencil read (phi, beta_i, beta_j, beta_k) in 32 bytes, write-allocate 8 bytes and write back 8 bytes for a total of 48 bytes per stencil. The computation executes 17 flops to compute each stencil output. The GSRB iterations only compute half the grid points in each grid sweep, thus need two sweeps and double the data movement to update all grid points. Computing each stencil in GSRB iterations needs 80 bytes per stencil (not 48 x 2 = 96: as the same grid is read and updated, write allocation is excluded) and executes 17 flops per stencil.

1.2.5 Summary of Stencils Optimized

In this dissertation the stencils optimized have been illustrated in Figure 1.2, and their features listed in Table 1.1. Table 1.1 classifies the stencils based on their shape (radius and #points), size (#points), types of coefficients, iterations, and finally, the demands they place on the memory bandwidth and floating-point units of the processor.
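The byte-and-flop accounting behind the Jacobi rows of Table 1.1 can be written down directly. The helper below is a back-of-the-envelope sketch of that model (our own illustration, not code from the dissertation): one 8-byte read, one 8-byte write allocation, and one 8-byte write per point, plus 8 bytes for every variable-coefficient grid streamed in.

```c
/* Per-grid-point DRAM traffic model behind Table 1.1 (doubles = 8 bytes):
 * read the input grid, write-allocate the output, write the output,
 * and stream in any variable-coefficient grids. */
static double arithmetic_intensity(int flops, int coeff_grids_read) {
    int bytes = 8               /* read phi_in            */
              + 8               /* write-allocate phi_out */
              + 8               /* write phi_out          */
              + 8 * coeff_grids_read;
    return (double)flops / bytes;
}
```

With `coeff_grids_read = 0`, the constant-coefficient Jacobi entries fall out (8/24, 15/24, 32/24, 134/24); with 3 coefficient grids and 17 flops, the 7-point VC Jacobi entry is 17/48.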
In addition to these stencils, the 8-point restriction operation used in Geometric Multigrid is also a target. This stencil is a reduction operation which takes as input a larger N^3 grid and outputs a smaller (N/2)^3 grid; it averages values of 8 points from the larger grid and writes it to a single point in the smaller one.

1.2.5.1 Higher-Order Methods

In addition to the features listed in Table 1.1, the 7-, 13-, 27- and 125-point stencils can be classified by the order of the PDE solver they are used in. The order of a PDE solver denotes how fast the computed error decreases as the grid spacing decreases (or grid size increases). If a solver computes an error e using a grid of size 64^3, as we decrease grid spacing by half, the grid size increases to 128^3 and the error drops to e/2^p, where p is the order of the method. Thus error drops exponentially as the order of a method increases. In this dissertation, the 7-point stencils are second-order stencils used in second-order solvers. The 13-, 27- and 125-point stencils are used in higher-order PDE solvers, and they are fourth-, sixth-, and tenth-order, respectively.

1.2.5.2 Grid Boundary Conditions

Stencil computations are applied on structured grids with boundaries which can be commonly classified into periodic or nonperiodic. Application of a 2D 5-point stencil with a periodic boundary condition is illustrated in Figure 1.4. Points on the boundary of the grid have neighbors which wrap around the grid. As illustrated in the figure, the top-left and top neighbors of the upper-left grid point are the bottom-left and top-right points, respectively. In this dissertation only periodic boundary conditions have been considered.
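Before moving on, the 8-point restriction described in Section 1.2.5 can be sketched in a few lines. The code below is a minimal illustration written for this discussion (not miniGMG's implementation): each coarse point averages the 2x2x2 fine-grid cell it covers, using 7 adds and 1 multiply.

```c
#define N 8   /* fine grid dimension; assumed even */

/* Average each 2x2x2 block of the fine N^3 grid into one point of
 * the coarse (N/2)^3 grid (weight 1/8 per fine point). */
void restrict_8pt(double coarse[N/2][N/2][N/2], double fine[N][N][N]) {
    for (int k = 0; k < N/2; k++)
      for (int j = 0; j < N/2; j++)
        for (int i = 0; i < N/2; i++)
          coarse[k][j][i] = 0.125 *
            ( fine[2*k][2*j][2*i]     + fine[2*k][2*j][2*i+1]
            + fine[2*k][2*j+1][2*i]   + fine[2*k][2*j+1][2*i+1]
            + fine[2*k+1][2*j][2*i]   + fine[2*k+1][2*j][2*i+1]
            + fine[2*k+1][2*j+1][2*i] + fine[2*k+1][2*j+1][2*i+1] );
}
```

Note the streaming pattern that produces the 0.10 intensity in Table 1.1: every fine point is read exactly once, while only one coarse point is written per eight fine points read.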
1.3 Approaches to Reducing Data Movement

Traditional stencil computations execute far less than a flop for every byte of data they need, thus their performance is limited by their heavy memory demands. In the past, on machines with smaller caches, operations on large structured grids could easily be bound by capacity misses in cache, leading to a variety of studies on blocking and tiling optimizations [3, 4, 5, 6, 7, 8, 9]. In recent years, numerous efforts have focused on improving memory bandwidth use by increasing temporal locality. Automated and manual optimizations have attempted to increase locality by fusing multiple stencil sweeps through techniques like cache-oblivious, time-skewing, wavefront, or overlapped tiling [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]. Almost all of these have concentrated their efforts on isolated stencil computations and have not considered an entire solver such as GMG. Solvers use a number of stencil computations and require several optimizations to be applied in sequence. This presents the twin challenges of developing optimizations that can be composed, and coming up with the best sequence of optimizations.

To reduce the heavy memory demands of stencil computations, applied mathematicians are adopting higher-order methods for solvers [22, 23]. Higher-order methods achieve desired accuracy while working on much smaller grids (lower resolution). Smaller grids lower memory capacity requirements, which leads to reduced data movement when computations stream through the grids. Higher-order methods achieve increased accuracy by using stencil computations with a higher number of floating-point operations.

Figure 1.4: Visualization of periodic boundary conditions for structured grids. Figure (a) shows a 2D 5-point stencil and (b) illustrates application of the stencil on the upper-left corner of a 2D grid. Periodic boundary conditions mean points outside the boundary wrap around the grid. For the 2D 5-point stencil which reads points 0-4, the points 3 and 4 wrap around the grid as illustrated by the dashed lines.
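To make the sweep-fusion idea from Section 1.3 concrete, here is a toy 1D sketch of our own (not taken from any of the cited papers): two applications of a 3-point averaging stencil are fused so the input grid is streamed through memory once, with only a 3-value window of the intermediate sweep kept live.

```c
#define NMAX 16   /* maximum grid size for the toy example */

static double avg3(const double *a, int i) {
    return (a[i-1] + a[i] + a[i+1]) / 3.0;
}

/* Baseline: two separate sweeps; the grid and the temporary each
 * stream through memory, doubling the traffic. Requires n <= NMAX. */
void two_sweeps(double *out, const double *in, int n) {
    double tmp[NMAX];
    tmp[0] = in[0]; tmp[n-1] = in[n-1];          /* boundaries copied */
    for (int i = 1; i < n-1; i++) tmp[i] = avg3(in, i);
    out[0] = tmp[0]; out[n-1] = tmp[n-1];
    for (int i = 1; i < n-1; i++) out[i] = avg3(tmp, i);
}

/* Fused: one pass over 'in'; only tmp[i-1], tmp[i], tmp[i+1] are kept,
 * recomputed on the fly instead of stored to memory. */
void fused_sweeps(double *out, const double *in, int n) {
    double t_prev = in[0];               /* tmp[0] */
    double t_cur  = avg3(in, 1);         /* tmp[1] */
    out[0] = in[0]; out[n-1] = in[n-1];
    for (int i = 1; i < n-1; i++) {
        double t_next = (i+1 < n-1) ? avg3(in, i+1) : in[n-1];
        out[i] = (t_prev + t_cur + t_next) / 3.0;
        t_prev = t_cur; t_cur = t_next;
    }
}
```

The fused version trades redundant recomputation of intermediate values for a halving of grid traffic, exactly the trade-off the time-skewing and wavefront techniques above make at larger scale.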
To update a grid point, higher-order stencils operate on a larger neighborhood of points; this often leads to performance being limited by the intensity of floating-point computation rather than memory bandwidth. Optimizing compute-intensive higher-order stencils has received far less attention than the memory-bound stencils. Manual optimizations [12, 24] and automated approaches [25, 26] alleviate performance bottlenecks of compute-intensive stencils by reducing loads and/or computation. Like optimizations for memory-bound stencils, these techniques have only targeted isolated stencil computations on a single large grid, and not in the context of a solver, and importantly, these optimization efforts have ignored the interplay between optimizations to reduce computation intensity and to reduce memory traffic.

1.4 Optimization Challenges

The wide variety of stencils presents a spectrum of optimization challenges. With arithmetic intensities that range from 0.2 to over 5 (Table 1.1), stencil computations stress different subsystems on a node. This section outlines the various optimization challenges presented by stencils and GMG.

Memory bandwidth optimizations

Traditionally stencil computations have had very low arithmetic intensities. This makes the performance of stencil computations notoriously limited by the memory bandwidth of the machines. To improve memory use, optimizations often fuse multiple stencil sweeps into one, and trade off redundant floating-point computation for data movement.

Difference between stencil applications and smooths

Stencil computations can be expressed as y = Ax, where A is the stencil being applied to the grid x. Smooths, represented as x_new = x + w D^{-1} (b - Ax), require two more arrays, b and D^{-1}. This means more data movement and makes smooths even more memory bandwidth-limited than stencil applications. Thus, optimizing smooths may require even more aggressive bandwidth optimization than simple stencil applications.
Managing floating-point computations

Stencils in Table 1.1 with arithmetic intensity greater than one have high intensities of floating-point operations. In fact, the 125-point stencil executes over 5 flops per byte moved. In such cases, the stencil computation may not even achieve performance corresponding to the DRAM bandwidth because it is limited by the large number of floating-point operations. Optimizing these compute-bound stencils requires identifying and exploiting reuse of computation to reduce floating-point operations. Furthermore, to achieve high floating-point performance on modern machines, it is essential for stencil codes to effectively use the SIMD (single instruction multiple data) instructions available on modern architectures. SIMD instructions process multiple data elements simultaneously and can potentially improve floating-point performance by several integer factors (by the width of the SIMD unit).

Composing a sequence of optimizations

Optimizations which reuse and reduce floating-point operations and target compute-bound stencils must be designed to work with optimizations that target memory bandwidth-limited stencils. Reducing floating-point intensity of a compute-bound stencil will improve its performance, and may make it limited by the memory bandwidth. At this point, the optimization needs to be combined with memory bandwidth optimizations to push the performance even higher.

Optimization across GMG operators

Geometric Multigrid has five principal operations, of which four (smooth, residual, restriction, and interpolation) involve calculation of stencils. In addition to optimizing each of these operators in isolation, it is possible to optimize across them. For example, fusing two or more of these operations will reduce the number of times grids must be streamed into memory and improve memory bandwidth optimization. Unfortunately, such fusion will also involve tradeoffs such as increased working set and redundant computation. Finally, effectiveness of such optimizations will depend on the stencil, iteration, and architecture. Thus optimizing GMG means searching a larger space of possible optimizations compared to optimizing a single stencil computation.
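As a toy illustration of the "reuse of computation" mentioned above (in the spirit of the partial-sums optimization developed later in this dissertation, though much simplified and our own construction), consider a 2D 3x3 box stencil with equal weights. Maintaining running column sums lets each output reuse two of its three column sums from the previous iteration:

```c
#define NX 8
#define NY 8

/* Naive 3x3 box stencil: 9 loads and 8 adds per output point. */
double box_naive(double g[NY][NX], int j, int i) {
    double s = 0.0;
    for (int dj = -1; dj <= 1; dj++)
        for (int di = -1; di <= 1; di++)
            s += g[j + dj][i + di];
    return s;
}

/* Column-sum variant: each 3-point column sum (2 adds) is computed once
 * and reused by the 3 outputs whose windows contain it, cutting the adds
 * per output point from 8 to roughly 4. */
void box_colsums(double out[NY][NX], double g[NY][NX]) {
    for (int j = 1; j < NY - 1; j++) {
        double c0 = g[j-1][0] + g[j][0] + g[j+1][0];
        double c1 = g[j-1][1] + g[j][1] + g[j+1][1];
        for (int i = 1; i < NX - 1; i++) {
            double c2 = g[j-1][i+1] + g[j][i+1] + g[j+1][i+1];
            out[j][i] = c0 + c1 + c2;   /* reuse c0, c1 from last step */
            c0 = c1; c1 = c2;
        }
    }
}
```

The same idea scales with stencil size: for the 125-point stencil the reusable partial sums are much larger, which is why reducing flops this way pays off most for higher-order operators.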
Mapping computation to parallel threads

The increasing number of hardware threads on modern architectures makes mapping parallelism in stencil codes to threads crucial to performance. There is a delicate balance of parallelism, locality, redundant computation and working set in stencil computations. Increasing one of them may adversely affect the others. To achieve high performance, we must be able to generate and search many possible parallel variants which implement the same stencil computation.

1.5 Domain-Specific Optimization Techniques

Even though a large number of optimizations for stencils are known to the compiler community, current commercial compilers fail to generate highly optimized stencil code. Static compiling techniques cannot anticipate all possible execution environments, such as architecture, problem input size, and trade-offs between optimizations. Thus, state-of-the-art commercial compilers do not typically risk potential slowdown by applying aggressive optimizations. To address the lack of support from commercial compilers, several domain-specific compilers and tools have been developed [27, 28, 29, 30, 31, 26].

These automated approaches tailor their optimizations and code generation to stencil computations. With the exception of the domain-specific language Halide [31], all other domain- or application-specific approaches have concentrated on isolated stencil applications. They have not applied their techniques to optimizing an entire solver with several stencil computations, and they have concentrated efforts on traditional memory-bound stencils with limited or no emphasis on higher-order stencils.

Another domain-specific approach to optimizing scientific codes, presented in [12, 32], builds application-specific code generators. These are then used to generate many optimized code variants implementing the same computation; these variants are then searched to find the optimal code. This approach generates very high-quality code, targets stencils both in isolation and in solvers, and has optimized compute-intensive stencils. Unfortunately, this approach requires code generators to be rewritten for each new application without much reuse of optimizations across applications.
This requires significant human effort and hurts the productivity of the optimization expert.

1.6 Domain-Specific Extensions and Autotuning

Autotuning has been used to remedy this situation. Autotuning a computation involves generating and searching a space of possible implementations of the input computation to find the optimal code for a given execution context.

This dissertation presents research that extends a compiler framework with known and novel domain-specific optimizations targeting stencil computations and GMG. This approach of adding novel optimizations in a compiler framework and leveraging the new and known optimizations through autotuning is referred to as compiler-directed autotuning. Building domain-specific optimizations into a compiler framework [33] allows reuse of known optimizations across applications and greatly reduces the effort of writing application-specific code generators. A domain-specific autotuner is then created to drive the augmented compiler framework to generate multiple code variants for a computation, and the best code variant for a given execution is then empirically found.

The novel transformations for stencils and Geometric Multigrid are built into the CHiLL [33] compiler framework. CHiLL is a code transformation framework which supports dependence analysis, loop transformation and code generation. CHiLL allows composition of optimizations and is developed to support autotuning. The compiler framework has a scripting language interface; the scripts (also called transformation recipes) are sequences of optimizations which direct compiler application of transformations [34]. Figure 1.5 illustrates this approach.

Figure 1.5: Autotuning based on the CHiLL compiler framework. CHiLL has a scripting language interface.
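To give a flavor of such a transformation recipe, the sketch below is a hypothetical CHiLL-style script composing three optimizations for a smooth operator. The command names, arguments, and layout are illustrative only (consult the CHiLL user manual for the actual interface); the point is that an autotuner can vary the numeric parameters to generate many variants.

```
# Hypothetical CHiLL-style recipe (illustrative syntax, not verbatim CHiLL)
source: smooth.c          # input C file
procedure: smooth         # function to transform
loop: 0                   # loop nest to target

permute([2,1,3])          # reorder loops for better locality
tile(0, 2, 64)            # tile loop level 2 with tile size 64
unroll(0, 3, 4)           # unroll the innermost loop by 4
```

An autotuner would then sweep, for example, the tile size and unroll factor, emit one script per point in the search space, and keep the variant that runs fastest.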
Novel transformations were built into CHiLL. The autotuners implemented in the research presented here generated CHiLL scripts to direct optimization. CHiLL ran the scripts to generate code variants. An autotuner generates multiple scripts which are used by CHiLL to generate multiple code variants for a computation, and the best performing variant is finally selected.

The benefit of compiler-directed autotuning has been demonstrated in this research by applying the technique to optimizing stencils in isolation and in the context of Geometric Multigrid. Novel domain-specific compiler optimizations for memory-limited stencils and compute-intensive higher-order stencils have been developed. Considerable interaction between optimizations for memory- and compute-bound stencils is seen, highlighting the need to compose sequences of optimizations. Furthermore, to target more complex and realistic structured grid codes seen in applications, optimizations are presented for a GMG benchmark (miniGMG [32]) that proxies block-structured AMR codes. Block-structured AMR codes such as CHOMBO [35] are commonly seen in modern scientific applications. They partition a grid into smaller sub-grids on which stencil computations are applied, and present more optimization challenges to the compiler than a stencil being applied to a single grid.

1.7 Contributions

This dissertation presents novel research which uses compiler optimizations and autotuning to optimize stencil computations and Geometric Multigrid. The contributions of this dissertation are outlined below.

1. Communication-avoiding optimizations

This research develops domain-specific compiler optimizations and uses autotuning to reduce inter- and intra-node communication. The generated code achieves up to 4x speedup for smooths, 3x for the GMG solver, and matches or betters highly optimized manually tuned code.

2.
Optimize compute-intensive higher-order stencils

Higher-order GMG solvers are implemented to illustrate the high accuracy obtained by these methods and identify increased intensity of floating-point operations as the performance bottleneck. Research presented in this dissertation develops and implements a novel optimization, partial sums, which reuses computation to reduce floating-point operations. By using partial sums in conjunction with communication-avoiding optimizations, speedups up to 4x for higher-order smooths are achieved.

3. Parallel code generation for many- and multicore architectures

Research presented in this dissertation uses a compiler framework to generate optimized OpenMP and CUDA parallel code for stencils to target many- and multicore architectures. The parallel code generation capability is used to explore different threading strategies.

1.8 Thesis Structure

The remaining chapters in this dissertation are:

Chapter 2 introduces the CHiLL compiler framework. It describes the basic compiler abstractions and known compiler optimizations in CHiLL which have been leveraged in this dissertation.

Chapter 3 describes the miniGMG benchmark which was used to implement and optimize Geometric Multigrid. The chapter explains in detail the principal operations in GMG and their implementation in the benchmark.

Chapter 4 describes the communication-avoiding optimizations used to optimize memory bandwidth-bound operations in GMG. The chapter first introduces the optimizations, then explains their implementation details and finally presents performance results and analysis.

Chapter 5 motivates the use of higher-order stencils and then identifies performance bottlenecks associated with these compute-intensive methods. Motivation for a new optimization targeting higher-order stencils is presented, followed by its description and implementation. The performance of the higher-order stencils is finally presented and the efficacy of higher-order methods is shown.

Chapter 6 describes code generation for graphics processing units (GPUs).
The chapter explores parallel decomposition strategies to map GMG to the large number of hardware threads available on these platforms.

Chapter 7 discusses research related to optimizing stencil computations and GMG.

Chapter 8 presents conclusions and outlines future research.

CHAPTER 2

CHILL

The novel transformations presented in this dissertation are built into the CHiLL compiler framework [36, 33]. CHiLL is a loop transformation and code generation framework with a scripting language interface. CHiLL was designed to support autotuning by allowing sequences of transformations to be composed and applied. CHiLL also exposes parameters of the transformations for autotuning. The input to CHiLL is a source code written in C (or Fortran), and a transformation script. The script describes the set of transformations to be composed to optimize the provided source [34]. In the research presented here, the script is either generated by an autotuner or written by an expert programmer, but it can also be derived automatically by a compiler decision algorithm [37]. After applying optimizations, CHiLL generates optimized C (or Fortran) code. To target CUDA code generation for NVIDIA GPUs, we use CUDA-CHiLL [34, 37]. It is a thin layer built on top of CHiLL to generate CUDA code. This chapter describes fundamental abstractions in CHiLL and CUDA-CHiLL, and gives examples of how they are used.
2.1 Organization of CHiLL

At the heart of CHiLL is a polyhedral framework that composes complex transformation sequences. Internally CHiLL uses Omega+, an enhanced version of Omega [38], and CodeGen+ [39]. A polyhedral model represents each statement's execution in the loop nest as a lattice point in the space constrained by loop bounds, known as the iteration space. Then a loop transformation can be simply viewed as a mapping from one iteration space to another. CHiLL manipulates iteration spaces derived from the original program, using a dependence graph as an abstraction to reason about the safety of the transformations under consideration [40]. In CHiLL, iteration spaces are represented as integer sets, and loop transformations are linear mappings applied to these integer sets. Omega+ is used to represent the integer sets and linear mappings, apply the mappings to the integer sets, and compute data dependences.

In a polyhedral model, after the relations representing loop transformations have been applied to input iteration spaces, optimized code is generated from the rewritten iteration spaces. Code generation involves scanning the polyhedra representing the iteration spaces of an optimized loop nest. The quality of the generated code directly impacts performance. Therefore, CHiLL uses a code generator called CodeGen+ that has advanced the state of the art in polyhedral scanning.

In the remainder of this chapter we describe the iteration space of a statement in a loop nest, relations used to transform iteration spaces, data dependence, and legality of transformations. We use these concepts to illustrate four well-known loop transformations implemented in CHiLL: loop permutation, loop fusion, loop tiling, and loop skewing. Finally we introduce CUDA-CHiLL, a CUDA code generation tool built on top of CHiLL.

2.2 Integer Sets and Relations

Iteration spaces and mappings between iteration spaces are represented mathematically using Omega+. Omega+ provides interfaces to represent integer sets and mappings using Presburger formulas [41]. Presburger formulas are constructed by combining affine (equality or inequality) constraints on integer variables with the logical operations NOT, AND, and OR, and the quantifiers "for all" and "there exists".
For example, let S be the set of integers between 0 and 64, and S_even be the set of even integers in that range. These sets are defined as follows:

S      := {[i] : 0 < i < 64}
S_even := {[i] : exists a : (0 < i < 64 && i = 2a)}
R      := {[i] -> [i'] : i' = i + 1}

Relations are used to map between sets. They are also expressed using Presburger arithmetic. The relation R, shown above, adds 1 to each element of the set it is applied to. Thus, if R is applied to S, it will map it to the new set of integers {[i'] : 1 < i' < 65}.

2.3 Iteration Spaces

In a polyhedral model, loops surrounding a statement can be described as a polyhedron in an integer linear space. Thus we can represent a loop nest's iteration space with a set of inequalities on loop index variables in the affine domain, where these variables have only integer coefficients. This representation of the iteration space is suitable for perfectly nested loops (all assignment statements in the innermost loop). For the loop nest in Listing 2.1, the iteration space of the statement can be simply represented as:

IS0 := {[i, j] : 0 <= i < N && 0 <= j < N}

For imperfectly nested loops as shown in Listing 2.2, additional information is needed to capture ordering of statements and represent the loop structure. For the imperfect loop nest, the statements S1, S2, and S3 are at different levels of nesting. To reason about the effect of transformations on imperfect loop nests, the iteration space representation must capture this loop structure.
To capture ordering constraints between statements in an imperfect loop nest, we add an auxiliary loop to each loop level, with an additional auxiliary loop as the last dimension. These auxiliary loops sink all statements to the loop level of the innermost loop. This means the iteration spaces of all the statements in the loop nest have the same dimensionality. Furthermore, the auxiliary loops are added ensuring that the order in which the statements are executed is preserved; in other words, the lexicographic order of the statements is correct [33]. Thus for an n-deep loop nest, we have (2n+1)-dimension iteration spaces, and all statements, irrespective of their nesting depth, have the same number of dimensions in their iteration space. Although auxiliary loops carry special meaning during loop transformations, auxiliary loops and other loops are treated equivalently during code generation. Auxiliary loop iteration spaces are always constant valued and do not show up in the final code. For the code in Listing 2.2, the iteration spaces of the statements are:

IS1 := {[i, j] : 0 <= i < N && j = -1}
IS2 := {[i, j] : 0 <= i < N && 0 <= j < i}
IS3 := {[i, j] : 0 <= i < N && j = i - 1}

From the 2D iteration spaces of the statements it can be clearly seen that all the statements have the same loop nesting level (iterations of the same dimensionality). Internally, the 2D space is stored as a 5D space with the help of auxiliary loops:

r1 := {[i, j] -> [0, i, 0, j, 0]}
r2 := {[i, j] -> [0, i, 1, j, 0]}
r3 := {[i, j] -> [0, i, 2, j, 0]}

2.4 Relations for Loop Transformations

CHiLL uses Omega+ to represent iteration spaces as sets, and uses relations to map between iteration spaces to represent loop transformations. The use of relations to represent loop transformations is illustrated with the example of loop shifting. The iteration space for the code in Listing 2.1 is shown below as IS0. The relation R, which shifts the iteration space in each dimension by 1, is applied to IS0; the output iteration space is IS0'.
IS0 := {[i,j] : 1 <= i < N && 1 <= j < N}
R := {[i,j] -> [i',j'] : i' = i + 1 && j' = j + 1}
IS0' := R(IS0) = {[i,j] : 2 <= i < N+1 && 2 <= j < N+1}

The loop nest corresponding to IS0' is shown in Listing 2.3. It can be easily seen that the loop bounds have been shifted by unity in each dimension. In addition, CHiLL adds a negative shift to the array references to counter the shift in the iteration space.

In the remaining discussion, relations will be abbreviated to R := {[i] -> [i']}, unless further details are necessary. Each loop transformation from an n-deep loop nest to a new m-deep loop nest is represented as a relation map. The loops marked as ci's in map are the auxiliary loops. Ignoring the auxiliary loops, we can rewrite map as map'.

map := {[c1, l1, ..., cn, ln, cn+1] -> [c'1, l'1, ..., c'm, l'm, c'm+1]}
map' := {[l1, ..., ln] -> [l'1, ..., l'm]}

#define N 64

void func0(){
  double A[N+2][N+2], B[N+2][N+2];
  int i, j;

  for(i=1; i<N; i++){
    for(j=1; j<N; j++){
      // Statement S0
      // iteration space IS0
      A[j][i] = B[j][i-1]
              + B[j][i] + B[j][i+1];
    }}
}

Listing 2.1: Simple perfectly nested loop.

#define N 64

void func1(){
  double Sum[N], A[N][N], B[N];
  int i, j;

  for(i=0; i<N; i++){
    // Statement S1
    // iteration space IS1
    Sum[i] = 0;
    for(j=0; j<i; j++){
      // Statement S2
      // iteration space IS2
      Sum[i] = Sum[i] + A[j][i]*B[j];
    }
    // Statement S3
    // iteration space IS3
    B[i] = B[i] - Sum[i];
  }
}

Listing 2.2: Loop nest with three statements at different nesting levels.
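As a sanity check on shifting, a C sketch (ours, not generated by CHiLL) can compare the loop of Listing 2.1 with its shifted counterpart and confirm that both write identical values:

```c
#include <string.h>
#include <stdbool.h>
#define N 64

// Original stencil loop (Listing 2.1).
static void unshifted(double A[N+2][N+2], double B[N+2][N+2]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            A[j][i] = B[j][i-1] + B[j][i] + B[j][i+1];
}

// Shifted variant: bounds moved by +1, references compensated by -1.
static void shifted(double A[N+2][N+2], double B[N+2][N+2]) {
    for (int i = 2; i <= N; i++)
        for (int j = 2; j <= N; j++)
            A[j-1][i-1] = B[j-1][i-1-1] + B[j-1][i-1] + B[j-1][i+1-1];
}

// Returns true when the two forms produce identical output on a test grid.
bool shift_preserves_result(void) {
    static double A1[N+2][N+2], A2[N+2][N+2], B[N+2][N+2];
    for (int j = 0; j < N+2; j++)
        for (int i = 0; i < N+2; i++)
            B[j][i] = (double)(j * 1000 + i);
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);
    unshifted(A1, B);
    shifted(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```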
#define N 64

void func0(){
  double A[N+2][N+2], B[N+2][N+2];
  int i, j;

  // Loop bounds have been shifted by (+1)
  for(i=2; i<=N; i++){
    for(j=2; j<=N; j++){
      // Statement S0
      // Array indices have been shifted by (-1)
      // Iteration space IS0'
      A[j-1][i-1] = B[j-1][i-1-1] + B[j-1][i-1] + B[j-1][i+1-1];
    }}
}

Listing 2.3: Loop nest after shifting.

2.5 Dependence Graph and Legality of Transformations

Dependence analysis [40,38] is a key component of any compiler framework, ensuring the correctness of generated code. CHiLL uses Omega+ for data dependence (flow, anti, output, and input dependence) analysis [38]. Using the data dependence information from Omega+, CHiLL creates a dependence graph for the statements in the input loop nest. The dependence graph is a standard component of loop restructuring compilers [40], and in CHiLL, the dependence graph is used to test for legality of loop transformations.

2.6 Loop Transformations in CHiLL

CHiLL implements a wide range of loop transformation algorithms which transform a loop nest from one state of representation to another, more suitable one. The representation of a loop nest includes the statements, iteration spaces, and the dependence graph. Common loop transformations implemented in CHiLL are listed in Table 2.1; for a complete list, refer to the CHiLL user manual [36]. The following subsections briefly outline the known loop transformations which have been used in the research presented in this dissertation.

Table 2.1: A subset of transformations available in CHiLL.

Transformation       Purpose
Loop permutation     Change the order of loops.
Loop tiling          Partition the iteration space into small blocks and iterate through the blocks in sequence.
Loop unrolling       Similar to tiling in changing the iteration order, but uses an explicitly unrolled loop body.
Loop fusion          Fuse distinct loops for different statements into one.
Loop distribution    Distribute different statements in a single loop nest into distinct subloops, each enclosing a separate statement.
Loop peeling         Unroll a number of iterations from the beginning or the end of a loop.
Loop splitting       Split the original loop into subloops, each representing a disjoint part of the original iteration space.
Loop shifting        Adjust the loop index by adding a specified amount to the nontransformed index.
Loop skewing         Modify the iteration space so that multiple dependences are carried by the same loop nesting level.

2.6.1 Loop Permutation

Loop permutation changes the ordering of the loops [40]. For an n-deep loop nest, given a permutation sigma of the loop order, the relation for permutation can be expressed as:

permute := {[l1, ..., ln] -> [l_sigma(1), ..., l_sigma(n)]}

CHiLL uses the dependence graph of the loop nest to make sure the permutation is legal and does not violate data dependences. CHiLL also carefully updates auxiliary loops to reflect the change in loop order [33].

Permutation using CHiLL is illustrated in Listings 2.4-2.6. Listing 2.4 shows the input code for a 2D stencil. The CHiLL script driving the transformation is shown in Listing 2.6. The first three lines direct CHiLL to the loop nest in the given file and procedure. The original() command initializes CHiLL, sets up the iteration space (with auxiliary loops), and collects dependence information for the statements in the loop nest. The permute command specifies the final ordering of the loop nest. The command can be represented by the relation permute', which is applied to the iteration space of the loop nest. Once the transformation is complete, polyhedra scanning generates the output code shown in Listing 2.5.

permute' := {[l1, l2] -> [l2, l1]}

2.6.2 Loop Skewing

Loop skewing changes the iteration space of a statement in a loop nest by adding an outer loop index value to an inner loop index. Skewing is illustrated using Listings 2.4, 2.7, and 2.8. In the generated code in Listing 2.7, the inner loop index i is now a linear function of the outer loop index j. To account for the change in inner loop bounds, the array references have also been updated. In general, skewing can be expressed as the relation skew, where the ai terms are integer constants. Thus, the modified loop index is simply a linear combination of the other loop indices. The CHiLL script shown in Listing 2.8 uses the command skew([0],2,[1,1]).
This directs CHiLL to skew loop level 2 (i) surrounding statement 0, such that the new loop level 2 is a linear combination of loop levels 1 and 2, with the constant terms a1 and a2 set to 1. The mapping corresponding to this transformation is skew'.

skew := {[l1, l2, ..., ln] -> [l'1, l2, ..., ln] : l'1 = a1*l1 + a2*l2 + ... + an*ln}
skew' := {[j,i] -> [j,i'] : i' = 1*j + 1*i}

2.6.3 Loop Tiling

Loop tiling involves decomposing a single loop into two loops, one which executes a tile of consecutive iterations and one which iterates over the tiles. Listings 2.9 and 2.10 illustrate tiling of the i loop into loops ii and i with 1D tiles of width 16. The CHiLL command tile(0,2,16,2,counted) tiles loop level 2 (i) for statement 0 into tiles of size 16, and the tile controlling loop (ii) is at loop level 2 after the transformation is complete. For clarity of presentation, we assume that the loop bounds are from 0 to 64 instead of starting from 1, as in the other examples. This simple tile command for the given problem and tile size can be represented by the mapping tile'16. Mappings for tiling in general are more complex, and further details can be found in [40,33]. Generated code where both loops i and j are tiled is shown in Listing 2.11, with the corresponding CHiLL script in Listing 2.12.

tile'16 := {[j,i] -> [j,ii,i'] : i' = 16*ii + k && 0 <= k < 16 && 0 <= ii < 4}

2.6.4 Loop Fusion

Combining statements in adjacent loops into a single loop nest is known as loop fusion [40]. Initially, CHiLL will have statements in a loop nest fused whenever possible. This automatic fusion falls out of CHiLL's algorithm to add auxiliary loops to ensure all statements have the same dimensionality of iteration space. The automatic fusion is illustrated in Listings 2.13 and 2.14. CHiLL will automatically transform the input code in Listing 2.13 to the fused code in Listing 2.14 by making a call to the original() command. In addition, CHiLL also provides explicit fuse commands for optimization purposes. The fusion algorithm takes a set of statements and the loop level as parameters.
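The legality of automatic fusion in this example can be checked directly: since B is only read, fusing the two i-loops cannot change the result. A small C sketch (ours, mirroring Listings 2.13 and 2.14):

```c
#include <string.h>
#include <stdbool.h>
#define N 64

// Distributed loops (as in Listing 2.13) vs the fused form (Listing 2.14).
// B is only read, so fusing the two i-loops is legal and must give
// bit-identical results for both A and C.
bool fusion_preserves_result(void) {
    double A1[N+2], C1[N+2], A2[N+2], C2[N+2], B[N+2];
    for (int i = 0; i < N+2; i++) B[i] = (double)(i * i);
    memset(A1, 0, sizeof A1); memset(C1, 0, sizeof C1);
    memset(A2, 0, sizeof A2); memset(C2, 0, sizeof C2);
    for (int t = 0; t < 2; t++) {             // distributed loops
        for (int i = 1; i < N; i++) A1[i] = B[i-1] + B[i] + B[i+1];
        for (int i = 1; i < N; i++) C1[i] = B[i-1] + B[i] + B[i+1];
    }
    for (int t = 0; t < 2; t++) {             // fused loop
        for (int i = 1; i < N; i++) {
            A2[i] = B[i-1] + B[i] + B[i+1];
            C2[i] = B[i-1] + B[i] + B[i+1];
        }
    }
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(C1, C2, sizeof C1) == 0;
}
```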
In addition to loop transformations such as tiling and fusion, CHiLL can also be used to compute properties of loop nest computations. The next section uses Listing 2.15 to illustrate how CHiLL computes the data footprint of an array in a loop.

#define N 64

void stencil2D()
{
  double A[N+2][N+2], B[N+2][N+2];
  int i, j;

  for(j=1; j<N; j++){
    for(i=1; i<N; i++){
      A[j][i] = B[j][i]
              + B[j][i+1] + B[j][i-1]
              + B[j+1][i] + B[j-1][i];
    }}
}

Listing 2.4: Input code for the 2D stencil computation.

#define N 64

void stencil2D()
{
  double A[66UL][66UL];
  double B[66UL][66UL];
  int i;
  int j;
  for (i=1; i<=63; i+=1)
    for (j=1; j<=63; j+=1)
      A[j][i] = B[j][i] + B[j][i+1]
              + B[j][i-1] + B[j+1][i]
              + B[j-1][i];
}

Listing 2.5: Generated code after loop permutation.

source: stencil2D.c
procedure: stencil2D
format: rose

original()
permute([2,1])

Listing 2.6: CHiLL script for loop permutation.

#define __rose_lt(x,y) ((x)<(y)?(x):(y))
#define __rose_gt(x,y) ((x)>(y)?(x):(y))
#define N 64

void stencil2D()
{
  double A[66UL][66UL];
  double B[66UL][66UL];
  int i;
  int j;
  for (j=1; j<=63; j+=1)
    for (i=j+1; i<=j+63; i+=1)
      A[j][-j+i] = B[j][-j+i]
                 + B[j][-j+i+1] + B[j][-j+i-1]
                 + B[j+1][-j+i] + B[j-1][-j+i];
}

Listing 2.7: Generated code after loop skewing.

source: stencil2D.c
procedure: stencil2D
format: rose

original()
skew([0],2,[1,1])

Listing 2.8: CHiLL script for loop skewing.

#define N 64

void stencil2D()
{
  double A[66UL][66UL];
  double B[66UL][66UL];
  int i, ii;
  int j;
  for (j=0; j<=63; j+=1)
    for (ii=0; ii<=3; ii+=1)
      for (i=16*ii; i<=16*ii+15; i+=1)
        A[j][i] = B[j][i]
                + B[j][i+1] + B[j][i-1]
                + B[j+1][i] + B[j-1][i];
}

Listing 2.9: Generated code after loop tiling.
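The skewed code of Listing 2.7 can likewise be checked against the input stencil of Listing 2.4; the following C sketch (ours) verifies that the two forms produce identical output:

```c
#include <string.h>
#include <stdbool.h>
#define N 64

// Input form of the 2D 5-point stencil (as in Listing 2.4).
static void plain2D(double A[66][66], double B[66][66]) {
    for (int j = 1; j < N; j++)
        for (int i = 1; i < N; i++)
            A[j][i] = B[j][i] + B[j][i+1] + B[j][i-1] + B[j+1][i] + B[j-1][i];
}

// Skewed form (as in Listing 2.7): the inner index runs j+1..j+63 and
// the array references subtract j to compensate.
static void skewed2D(double A[66][66], double B[66][66]) {
    for (int j = 1; j <= 63; j++)
        for (int i = j + 1; i <= j + 63; i++)
            A[j][-j+i] = B[j][-j+i] + B[j][-j+i+1] + B[j][-j+i-1]
                       + B[j+1][-j+i] + B[j-1][-j+i];
}

// Returns true when both forms produce identical output on a test grid.
bool skew_preserves_result(void) {
    static double A1[66][66], A2[66][66], B[66][66];
    for (int j = 0; j < 66; j++)
        for (int i = 0; i < 66; i++)
            B[j][i] = (double)(j * 100 + i);
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);
    plain2D(A1, B);
    skewed2D(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```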
source: stencil2D.c
procedure: stencil2D
format: rose

original()
tile(0,2,16,2,counted)

Listing 2.10: CHiLL script for loop tiling.

#define N 64

void stencil2D(){
  double A[66UL][66UL];
  double B[66UL][66UL];
  int i, ii;
  int j, jj;
  for (jj=0; jj<=3; jj+=1)
    for (ii=0; ii<=3; ii+=1)
      for (j=16*jj; j<=16*jj+15; j+=1)
        for (i=16*ii; i<=16*ii+15; i+=1)
          A[j][i] = B[j][i]
                  + B[j][i+1] + B[j][i-1]
                  + B[j+1][i] + B[j-1][i];
}

Listing 2.11: Code after tiling loops i, j.

source: stencil2D.c
procedure: stencil2D
format: rose

original()
tile(0,2,16,1,counted)
tile(0,2,16,1,counted)

Listing 2.12: CHiLL script for tiling loops i and j.

#define N 64

void stencil1D()
{
  double A[N+2];
  double B[N+2];
  double C[N+2];
  int i;
  int t;
  for (t=0; t<2; t++){
    for (i=1; i<64; i++)
      A[i] = B[i-1] + B[i] + B[i+1];
    for (i=1; i<64; i++)
      C[i] = B[i-1] + B[i] + B[i+1];
  }
}

Listing 2.13: Input code to CHiLL for the fusion example.

#define N 64

void stencil1D()
{
  double A[N+2];
  double B[N+2];
  double C[N+2];
  int i;
  int t;
  for (t=0; t<=1; t+=1){
    for (i=1; i<=63; i+=1){
      A[i] = B[i-1] + B[i] + B[i+1];
      C[i] = B[i-1] + B[i] + B[i+1];
    }
  }
}

Listing 2.14: Automatic loop fusion in CHiLL after invoking the command original().

#define N 64
for (i=0; i<N; i++)
  // Statement S0
  b[i] = a[i-1] + a[i] + a[i+1];

Listing 2.15: Simple loop nest used to compute the data footprint.

2.7 Computing the Footprint of Array References

CHiLL can compute a footprint space for each array reference in a loop by using the iteration space of the loop in conjunction with the array reference. Footprint computation is explained using Listing 2.15. The statement in the loop has four array references: b[i], a[i], a[i-1], and a[i+1]. Each array reference generates a linear mapping which maps a point in the iteration space of the loop to a point in the data footprint space, which is simply a transformed integer set.
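To make the footprint mapping concrete, the following C sketch (ours, not CHiLL's implementation) computes the extent of the union of the footprints of a[i-1], a[i], and a[i+1] over IS = {[i] : 0 <= i < 64}:

```c
// Union footprint of array a in Listing 2.15: apply the mappings for
// offsets -1, 0, and +1 to IS = {[i] : 0 <= i < 64} and track the
// smallest and largest index touched. The expected union is [-1, 64].
void footprint_of_a(int *lo, int *hi) {
    const int offsets[3] = {-1, 0, 1};
    *lo = 0 + offsets[0];
    *hi = 0 + offsets[0];
    for (int r = 0; r < 3; r++)
        for (int i = 0; i < 64; i++) {
            int p = i + offsets[r];
            if (p < *lo) *lo = p;
            if (p > *hi) *hi = p;
        }
}
```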
The linear mappings generated by the array references a[i-1], a[i], and a[i+1] are ref-1, ref0, and ref+1, respectively, and can be expressed as:

ref-1 := {[i] -> [i'] : i' = i - 1}
ref0 := {[i] -> [i'] : i' = i}
ref+1 := {[i] -> [i'] : i' = i + 1}

The footprint spaces of these array references are the application of the linear mappings to the iteration space (IS) of the loop. Thus the footprint spaces of these references can be represented as:

footprint-1 = ref-1(IS) := {[i] : -1 <= i < 63}
footprint0 = ref0(IS) := {[i] : 0 <= i < 64}
footprint+1 = ref+1(IS) := {[i] : 1 <= i <= 64}

The union of the footprint spaces gives the footprint accessed by array a. Computing the union can often result in an overapproximation. Further details on computing array footprints can be found in [33], and their use in transformations such as datacopy and in array dataflow analysis can be found in [33] and [42], respectively.

2.8 Extending Polyhedral Technology in CHiLL

CHiLL is not merely a polyhedral framework; it also allows manipulating the intermediate representation (IR) of the loops and statements. In fact, loop unrolling in CHiLL is not a polyhedral transformation [33]. The novel transformations added to CHiLL as part of this dissertation involve modifying the IR of the input statements, and are not polyhedral transformations. To maintain composability of transformations in CHiLL, the approach taken is to rebuild the polyhedral representation of the program after the new transformation is applied. This involves correctly updating the iteration spaces of the modified statements and their lexicographic order, and rebuilding the dependence graph.

2.9 CUDA-CHiLL

Graphics processing unit (GPU) accelerators have become a common hardware target for scientific computing, as they offer high computing power (teraflops per node) with better power efficiency than traditional multicores. Chapter 6 in this dissertation explores compiler optimizations and code generation for smooth operators on NVIDIA GPUs.
NVIDIA GPUs are programmed using the CUDA programming model. NVIDIA GPU architectures organize the parallelism on a node in a two-level hierarchy, with a number of streaming multiprocessors (SMs), each of which has a SIMD unit with several cores. CUDA reflects this two-level hierarchy. A CUDA program (called a CUDA kernel) describes a computation decomposition into a one- to three-dimensional space of thread blocks called a grid, where a block is mapped to one of the SMs. Each thread block likewise defines a one- to three-dimensional space of threads. A kernel thread program is executed for each point in the grid.

To help program NVIDIA GPUs, this dissertation uses CUDA-CHiLL. CUDA-CHiLL is a layer on top of CHiLL which generates parallel CUDA code from sequential code. Once loop transformations and other optimizations are applied using existing machinery in CHiLL, parallel CUDA code is generated using CUDA-CHiLL.

2.9.1 Parallel Decomposition

GPUs are a tiled architecture where each streaming multiprocessor (SM) represents a separate tile. Parallel code should be partitioned across SMs so that each thread operates on mostly independent, localized data. Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has been widely used when constructing parallel computations [43,44,45]. The shape and size of the tile can be chosen to take advantage of the target parallel hardware and memory architecture.

Tiling was described in Section 2.6.3. Here we use it to create thread and block loops in CUDA. After tiling, CUDA-CHiLL is used to map one, two, or three loop levels to block indices for grid dimensions, and to map up to three loops to thread indices. This is illustrated with an example of matrix-vector multiplication in Listings 2.16-2.19. The input sequential code is shown in Listing 2.16. The problem size, and thus the loop trip count, is N=1024. The statement in the loop body is called statement S0. The inputs are an NxN matrix and an N-wide vector. CUDA-CHiLL has a lua scripting-language interface. Listing 2.17 shows the lua script which directs CUDA-CHiLL to generate a CUDA kernel for matrix-vector multiply.
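The block/thread decomposition CUDA-CHiLL produces for this example can be emulated in plain C to see how blocks and threads partition the work; the loop structure below mirrors the tiled form (32 blocks of 32 threads, with i = tx + 32*bx), and the names are illustrative:

```c
#include <stdbool.h>

// Plain-C emulation of the block/thread decomposition for the 1024x1024
// matrix-vector multiply: the tile-controlling loop becomes 32 blocks (bx),
// the tile loop becomes 32 threads per block (tx), so each thread owns
// column i = tx + 32*bx and walks the 16 j-tiles of width 64 sequentially.
bool cuda_mapping_covers_matrix(void) {
    static int touched[1024][1024];          // zero-initialized
    for (int bx = 0; bx < 32; bx++)          // grid dimension (was loop ii)
        for (int tx = 0; tx < 32; tx++)      // thread dimension (was loop i)
            for (int k = 0; k <= 15; k++)    // sequential tile loop
                for (int j = 64*k; j <= 64*k + 63; j++)
                    touched[j][tx + 32*bx]++;
    // Every matrix element must be visited exactly once.
    for (int j = 0; j < 1024; j++)
        for (int i = 0; i < 1024; i++)
            if (touched[j][i] != 1) return false;
    return true;
}
```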
For statement 0 in the input code, the tile_by_index command in Listing 2.17 tiles the i and j loops. The tile sizes for loops i and j are TI=32 and TJ=64, respectively. The tile controlling loops for i and j are ii and k, respectively. The final order of the loops is given by {ii,k,i,j}. Listing 2.18 illustrates the loop restructuring effected by the tile_by_index command. Loop i has a trip count of TI=32, and the tile controlling loop ii has a trip count of N/TI=32. Similarly, loop j has a trip count of TJ=64, and the tile controlling loop k has a trip count of N/TJ=16.

After tiling, the loop levels are assigned to CUDA blocks and threads using the cudaize command. The cudaize command in Listing 2.17 assigns loop ii to a block dimension, and loop i to a thread dimension. As can be seen in Listing 2.18, the loops are marked to be assigned to blocks and threads, and Listing 2.19 shows the generated CUDA kernel executed by each thread. Loops ii and i have been replaced with per-block and per-thread identifiers bx and tx, and the corresponding array references have been updated by CUDA-CHiLL.

#define N 1024

void normalMV(float c[N][N], float a[N], float b[N]){
  int i, j;

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      a[i] = a[i] + c[j][i]*b[j];
}

Listing 2.16: The input sequential code for matrix-vector multiply.

TI=32
TJ=64
N=1024

tile_by_index(0, {"i","j"},
  {TI,TJ}, {l1_control="ii", l2_control="k"},
  {"ii","k","i","j"})

cudaize(0, "mv_GPU", {a=N, b=N, c=N*N},
  {block={"ii"}, thread={"i"}}, {})

Listing 2.17: CUDA-CHiLL script for matrix-vector multiply.

// ~cuda~ preferredIdx: bx
for(ii=0; ii<=31; ii++){
  for(k=0; k<=15; k++){
    // ~cuda~ preferredIdx: tx
    for(i=32*ii; i<=32*ii+31; i++){
      for(j=64*k; j<=64*k+63; j++){
        s0(i,j);
      }}}}

Listing 2.18: Tiled code with candidate loops for CUDA blocks and threads.

__global__ void mv_GPU(float *a, float (*c)[1024], float *b)
{
  int j;
  int k;
  int bx;
  bx = blockIdx.x;
  int tx;
  tx = threadIdx.x;
  for (k=0; k<=15; k+=1)
    for (j=64*k; j<=64*k+63; j+=1)
      a[tx+32*bx] = a[tx+32*bx] + c[j][tx+32*bx]*b[j];
}

Listing 2.19: Generated CUDA kernel for matrix-vector multiply.

2.10 Summary

This chapter describes the CHiLL compiler framework and the CUDA-CHiLL extension. CHiLL was designed to support autotuning by allowing easy and correct composition of transformations. CHiLL leverages polyhedral technology and internally uses Omega+ and CodeGen+ to mathematically represent and manipulate loops and to generate output code, respectively. The transformations described in this dissertation were built into CHiLL. This enabled composition of novel transformations with known compiler techniques developed over many decades of research.

CHAPTER 3

THE miniGMG BENCHMARK

This chapter describes Geometric Multigrid (GMG), a family of algorithms used to accelerate the convergence of iterative solvers. The basic operations in a Geometric Multigrid are essentially stencil computations or a mix of stencils and pointwise updates. In the past, most compiler research in optimizing stencils concentrated on stencil computations in isolation. In contrast, this dissertation focuses on optimizing a linear solver that uses multiple stencils.

To that end, optimized stencil kernels are generated for the miniGMG benchmark. miniGMG is a compact Geometric Multigrid benchmark which proxies multigrid solvers in Adaptive Mesh Refinement (AMR) applications; it has over 2000 lines of C code with a dozen performance-critical functions. Compiler techniques are used to optimize important stencil kernels in miniGMG which dominate runtime. The following sections present the baseline implementation of the miniGMG benchmark and highlight the stencil computations in the context of the overall solver. miniGMG has five principal operations: smooth, residual, restriction, interpolation, and ghost zone exchange, executed in a sequence known as the V-cycle. The next section presents details of the V-cycle in miniGMG and describes the data decomposition and parallelism used. Code skeletons are used to make our discussion concrete. We end the chapter by describing challenges to optimizing miniGMG.
3.1 V-cycle

Multigrid methods provide a powerful technique to accelerate the convergence of iterative solvers for linear systems and are therefore used extensively in a variety of numerical simulations. Conventional iterative solvers operate on data at a single resolution and often require too many iterations. Multigrid methods create a hierarchy of grid levels and use corrections of the solution from iterations on the coarser levels to improve the convergence rate of the solution at the finest level. Geometric multigrid (GMG) begins with a structured mesh, where each progressively coarser grid contains half the grid points in each dimension. Given the fact that the operators are the same irrespective of grid spacing, this exponential reduction in grid sizes can bound multigrid's computational complexity to O(N), where N is the number of variables. When performance is highly correlated to computational complexity, the time spent on the finer grids will dominate the runtime.

Figure 3.1 visualizes the structure of a multigrid V-cycle for solving Lu^h = f^h, in which L is the operator, u is the solution, f is the right-hand side, and superscripts represent grid spacings. At each grid spacing, multiple smooth operators reduce the error in the solution. The smooth can be a simple relaxation such as Jacobi, or something more complex, like Gauss-Seidel Red-Black (GSRB).

The right-hand side of the next coarser grid is defined as the restriction of the residual (f^h - Lu^h). Eventually, the grid (or collection of grids) cannot be coarsened any further using geometric multigrid. At that point, most algorithms switch to a bottom solver that can be as simple as multiple relaxations or as complicated as algebraic multigrid, a Krylov iterative solver, or a direct sparse solver. Once the coarsest grid is solved, the multigrid algorithm applies the solution (a correction) to progressively finer grids. This requires an interpolation of u^2h onto u^h. Smooth at the finer grid resolution is applied to the new correction.
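The O(N) bound follows from a geometric series: each coarser 3D level has one-eighth the points of the previous one, so the total across levels is less than N*(1 + 1/8 + 1/64 + ...) = (8/7)*N. A quick C sketch (ours) for a hierarchy running from a fine grid down to a coarsest grid:

```c
// Total grid points across a 3D multigrid hierarchy where each level has
// half the points per dimension: N + N/8 + N/64 + ... < (8/7)*N, which is
// the O(N) bound. Levels run from n_per_dim^3 down to coarsest_per_dim^3.
long total_points(long n_per_dim, long coarsest_per_dim) {
    long total = 0;
    for (long d = n_per_dim; d >= coarsest_per_dim; d /= 2)
        total += d * d * d;
    return total;
}
```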
3.2 Domain Decomposition and Parallelism

miniGMG executes the V-cycle on a 3D domain. As shown in Figure 3.2, the benchmark creates a global 3D domain and partitions it into subdomains of sizes similar to those found in multigrid solvers in real-world AMR applications such as CHOMBO [35]. Our configuration of miniGMG fixes the domain (problem) size to a 256^3 discretization on each multicore or GPU node, and uses subdomains of size 64^3. Thus at the finest level of the V-cycle, the 256^3 domain is decomposed into a list of 64 boxes or subdomains(1) of size 64^3. We subsequently use the terms subdomains and boxes interchangeably.

(1) miniGMG supports varying the subdomain (box) size; usually AMR applications use sizes from 32^3 to 128^3.

Figure 3.1: The multigrid V-cycle for solving Lu^h = f^h. Note, superscripts denote grid spacing.

Figures 3.2 and 3.3 present the problem sizes, parallel decomposition, and the V-cycle that were used for the research presented here. The global 256^3 domain was decomposed into 64^3 boxes or subdomains at the finest resolution. Smooth (Lu^h = f^h) and the residual (f^h - Lu^h) are computed on these 64^3 boxes, which are then coarsened to 32^3 boxes using the restrict operation (Section 1.1). This sequence is continued until we reach the bottom level of the V-cycle. In our implementation we use a truncated V-cycle where restriction stops at the 4^3 level (the bottom level). Thus our configuration of miniGMG has five levels (64^3, 32^3, 16^3, 8^3, 4^3) for the V-cycle. As this dissertation is focused on optimizing the multigrid V-cycle on a single node, a simple relaxation scheme using the smooth applied on the finer grids is also used at the bottom level. The simple relaxation scheme at the bottom level is sufficient to attain single-node multigrid convergence.(2) After smooths are applied at the coarsest 4^3 level, the boxes are interpolated to finer 8^3 boxes. The sequence of applying smooths and interpolations is continued until we reach the finest 64^3 boxes. At the finest level further smooths are applied and the V-cycle is complete.
The baseline implementation of miniGMG uses the MPI+OpenMP model to express parallelism on traditional multicore architectures. MPI is a library specification for message-passing [46].

(2) miniGMG includes both CG and BiCGStab bottom solvers to enable scalable multigrid implementations. For experiments where we have scaled to a larger number of nodes, the bottom solver has been modified to use BiCGStab.

Figure 3.2: Visualization of the domain/process/subdomain hierarchy in miniGMG (a collection of subdomains owned by an MPI process; one subdomain holds 64^3 elements).

miniGMG decomposes the global domain (256^3) into a list of subdomains which is then partitioned among MPI processes. For example, as illustrated in Figure 3.3, on a machine with four sockets we may have four MPI processes. Each process gets a (256 x 128 x 128) chunk of the global domain containing 16 boxes of size 64^3. Inside an MPI process the list of boxes is processed in parallel; each box is computed by an OpenMP thread. As we go down the V-cycle, moving from larger, finer grids to smaller, coarser ones, the number of boxes remains the same and the work per thread decreases, thus reducing parallelism.

Figure 3.3: Execution of the miniGMG V-cycle. A node is assigned a 256^3 domain which is decomposed into a list of 64^3 subdomains (boxes). The list of subdomains is partitioned among MPI processes. The subdomains owned by each MPI process are then computed in parallel, each box by a single OpenMP thread.

Applying a stencil on the boundary of a structured grid requires points exterior to the grid, as illustrated in Figure 3.4. The figure shows a 2D 5-point stencil being applied to an interior and a boundary point on a 4 x 4 grid. To apply the stencil on the boundary, an extra layer of points called a ghost zone(3) (gray colored points) is required. Thus to compute the illustrated stencil on a 4 x 4 grid, a larger 6 x 6 grid needs to be allocated.

(3) Ghost zones are also known as halo regions.

Figure 3.4: Application of a 2D 5-point stencil on a 4 x 4 2D grid. (a) Shape of the stencil. (b) Application of this stencil to an interior and a boundary point of the grid. The ghost zone is shaded gray.

When a grid is geometrically decomposed or tiled into smaller grid tiles, each grid tile needs a ghost zone. This is illustrated in Figure 3.5, where a 12 x 12 grid is decomposed into 4 x 4 tiles. Each grid tile or subdomain has a ghost zone which is a copy of the boundary points of its neighboring subdomains. This can be seen in the color coding used in the figure. The tile at the center has blue interior points, and the colors of its ghost zone points correspond to the interior grid points of its neighbors.

Figure 3.5: Visualization of ghost zones and data exchange between subdomains. A 12 x 12 domain (grid) is geometrically decomposed into nine 4 x 4 subdomains (tiles). Each subdomain needs a ghost zone which must be exchanged between neighboring tiles after a stencil computation is applied to the entire domain.

After a stencil computation sweeps through the entire domain and updates each point on the grid, the neighboring subdomains must exchange ghost regions to avoid having stale values. The ghost zone points are read but not updated during the stencil computation, and thus their exchange is necessary to ensure correctness. The size of the ghost zone and the data exchange pattern depend on the shape of the stencil used.

The global 3D domain (256^3 at the finest level) corresponds to equally-sized grids. The grids represent the correction, right-hand side, residual, and stencil coefficients, and each grid is stored as a separate array. It is important to note that the grids or arrays corresponding to the global domains are not allocated as contiguous chunks of memory. Instead, miniGMG allocates the subdomains within a level as equally sized grids (arrays), each grid contiguous in memory. This means at the finest level of the V-cycle a list of 64^3 grids is allocated instead of a single 256^3 grid.
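The per-subdomain allocation implied by this scheme is easy to state in code. A small C helper (ours, not miniGMG's API) computes the number of elements allocated for one grid, assuming a ghost zone of a given depth on every face:

```c
// Elements allocated for one subdomain's grid when a ghost zone of depth
// ghost_depth is added on every face: (box_dim + 2*ghost_depth)^3.
// For example, 64^3 boxes with a one-deep ghost zone occupy 66^3 elements.
long grid_alloc_elems(long box_dim, long ghost_depth) {
    long d = box_dim + 2 * ghost_depth;
    return d * d * d;
}
```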
The miniGMG benchmark allocates contiguous memory for the grids and includes the extra memory to support the ghost zones needed by the subdomains. Thus, at each level of the V-cycle, the memory allocated for the grid of a subdomain is (subdomain_size + 2 x ghost_zone)^3. At the finest level, where we use 64^3 subdomains with a one-deep ghost zone, the baseline allocates the 64^3 grids as 66^3 contiguous elements.

The benchmark implements optimized routines which exchange the ghost zones via buffers. The data exchange does not use floating-point operations but still needs significant time. The shape of the stencil determines the data exchange pattern between subdomains, and the data exchange is required after every stencil sweep of the domain is complete.

3.3 V-cycle and Operator Code Skeletons

This section uses code skeletons extracted from the miniGMG benchmark to make the smooth, residual, and restriction/interpolation operations concrete. It starts with code for the V-cycle to highlight the sequence in which these operations and the ghost zone exchanges are invoked.

Lines 1-20 of Listing 3.1 illustrate going down the miniGMG V-cycle, and lines 24-35 show the bottom solve. Code for going back up the V-cycle has been omitted for brevity.

Application of multiple smooths at each level (0 being the finest and NumLevel being the bottom) is shown in lines 2-15. Lines 10-14 show the smooth being applied to the subdomains (boxes) in parallel, and lines 6-7 highlight the ghost zone exchange required prior to and between applications of each smooth. Application of the residual (line 16) is similar to smooth. The residual is also applied in parallel on the boxes, and there is a ghost zone exchange prior to the residual computation.

Unlike smooth, the residual is only applied once. Restriction (line 19) follows the residual; it is computed once and applied in parallel on the boxes. No ghost zone exchange is required prior to applying restriction. Going back up the V-cycle is similar, but the residual is not applied, and restrict is replaced by interpolation. Lines 24-35 show the bottom solve being applied with the ghost zone exchanges between each application. The bottom smooth is applied many more times than the number of smooths at other levels.

 1 for(level=0; level<NumLevel; level++){
 2   for(smooth=0; smooth<NumSmooths; smooth++){
 3
 4     // communication phase...
 5     // the boxes exchange boundaries with neighbors
 6     exchange_boundary_phi();
 7
 8     // apply smooth on each box in parallel
 9     #pragma omp parallel for private(box)
10     for(box=0; box<NumBoxInSubdomain; box++){
11       color = smooth;
12       gsrb_smooth_function(Domain->SubDomain[box], phi, rhs, color);
13     }
14   }
15
16   compute_residual();
17   // restrict to form the coarser and smaller grid,
18   // i.e., going down the V-cycle from a 64^3 grid to a 32^3 grid
19   compute_restriction();
20 } // down
21
22 // bottom solve...
23
24 for(smooth=0; smooth<NumBottomSmooths; smooth++){
25
26   exchange_boundary_phi();
27
28   // apply smooth on each box in parallel
29   #pragma omp parallel for private(box)
30   for(box=0; box<NumBoxInSubdomain; box++){
31     color = smooth;
32     gsrb_smooth_function(Domain->SubDomain[box], phi, rhs, color);
33   }
34 } // bottom solve
35
36 // back up the V-cycle...

Listing 3.1: The miniGMG V-cycle.

3.3.1 Smooth

Smooths using a number of stencils have been optimized in this dissertation. miniGMG has been configured to use smooths with Gauss-Seidel Red-Black (GSRB) and Jacobi iterations of variable-coefficient stencils, and Jacobi iterations of constant-coefficient stencils. This section presents a smooth using a variable-coefficient stencil. Listing 3.2 shows code for a smooth (Lu^h = f^h), where L = a*alpha*I - b*div(beta*grad).

Programmers often wish to maintain flexibility and thus create smooth operators by composing multiple simpler operators, as illustrated in Listing 3.2. The smooth operator calculates the Laplacian, the Helmholtz operator, and a Gauss-Seidel relaxation in sequence. The first loop nest (lines 2-13) calculates div(beta*grad(u)), storing it to a temporary array. This loop executes the variable-coefficient stencil. It reads in points from the grid phi and multiplies them with the appropriate coefficients from beta_i, beta_j, and beta_k. The next loop nest (lines 15-20) updates that temporary by calculating a*alpha*u - b*div(beta*grad(u)).
The final loop nest (lines 22-29) performs the GSRB relaxation using the temporary array.

The same variable-coefficient stencil can be used with Jacobi relaxation. Listings 3.3 and 3.4 compare the loop structures corresponding to GSRB and Jacobi relaxes, respectively. For clarity, the statements for the Laplacian, Helmholtz, and relaxation from Listing 3.2 have been represented by S0, S1, and S2. Jacobi iterations have statements even_S0, even_S1, and even_S2, and odd_S0, odd_S1, and odd_S2. Statement odd_S0 executes the same stencil as even_S0, except that in odd_S0 the array temp is read and phi is updated. Similarly, in odd_S1 and odd_S2, temp and phi are interchanged. Thus, odd-numbered smooth applications with Jacobi relaxations update temp, and even-numbered smooths update phi. The crucial difference between the two relaxation schemes is the complex if-condition used in GSRB (line 26 in Listing 3.2). Jacobi updates every point on the grid and does not have this condition; it has only a simpler if-condition to check whether phi or temp gets updated.

 1 // Laplacian(phi) = b div beta grad phi
 2 for (k=0; k<N; k++)
 3   for (j=0; j<N; j++)
 4     for (i=0; i<N; i++)
 5       // statement S0
 6       temp[k][j][i] = b*h2inv*(
 7          beta_i[k][j][i+1]*( phi[k][j][i+1] - phi[k][j][i] )
 8         -beta_i[k][j][i]  *( phi[k][j][i]   - phi[k][j][i-1] )
 9         +beta_j[k][j+1][i]*( phi[k][j+1][i] - phi[k][j][i] )
10         -beta_j[k][j][i]  *( phi[k][j][i]   - phi[k][j-1][i] )
11         +beta_k[k+1][j][i]*( phi[k+1][j][i] - phi[k][j][i] )
12         -beta_k[k][j][i]  *( phi[k][j][i]   - phi[k-1][j][i] ));
13
14 // Helmholtz(phi) = (a alpha I - laplacian)*phi
15 for (k=0; k<N; k++)
16   for (j=0; j<N; j++)
17     for (i=0; i<N; i++)
18       // statement S1
19       temp[k][j][i] = a*alpha[k][j][i]*phi[k][j][i] - temp[k][j][i];
20
21 // GSRB relaxation: phi = phi - lambda*(helmholtz - rhs)
22 for (k=0; k<N; k++)
23   for (j=0; j<N; j++)
24     for (i=0; i<N; i++){
25       // color is 0 for the Red pass, 1 for Black
26       if((i+j+k+color)%2 == 0)
27         // statement S2
28         phi[k][j][i] = phi[k][j][i] - lambda[k][j][i]*(temp[k][j][i] - rhs[k][j][i]);
29     }

Listing 3.2: Smooth operator with Gauss-Seidel Red-Black relaxations.

for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      // Laplacian
      S0();

for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      // Helmholtz
      S1();

for (k=0; k<N; k++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      if((i+j+k+color)%2 == 0)
        // GSRB update
        S2();

Listing 3.3: GSRB relaxation.

if(smooth_application%2 == 0){

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        even_S0();

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        even_S1();

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        even_S2();

} else {

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        odd_S0();

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        odd_S1();

  for (k=0; k<N; k++)
    for (j=0; j<N; j++)
      for (i=0; i<N; i++)
        odd_S2();
}

Listing 3.4: Jacobi relaxation.

3.3.2 Residual

The residual uses the same stencil as smooth. The code is illustrated in Listing 3.5. For brevity of presentation, the residual computation has been shown as one loop nest instead of separate loop nests similar to the baseline smooth. Like smooth, the residual uses a stencil and requires ghost zone data, and thus ghost zones are exchanged before computing the residual. The residual is computed once per multiple applications of smooth and contributes far less to the overall solve time.

3.3.3 Restriction and Interpolation

Restriction and interpolation are integral operations of Geometric Multigrid. The restriction operation is applied when going down the multigrid V-cycle. Restriction takes as input a grid corresponding to the residual and computes a coarser-grained grid from it. The code for restriction is shown in Listing 3.6. The piecewise constant restriction used here is common to finite-volume methods. It is a constant-coefficient stencil which reads in eight points from the input fine-resolution grid, computes an average, and writes it to a coarser output grid. The output grid is half the size of the input grid in each dimension; this leads to the nonunit loop strides, and the indexing of the coarse grid involves a division by the constant coarsening factor.
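The averaging performed by this restriction stencil can be spot-checked in C; the sketch below (ours) applies the 2x2x2 average of the restriction operator to a small constant grid, which restriction must reproduce exactly:

```c
#include <stdbool.h>

// Piecewise-constant restriction: each 2x2x2 block of fine cells is
// averaged into one coarse cell, so a constant fine grid must restrict
// to the same constant. Small 4^3 -> 2^3 sketch.
bool restriction_preserves_constant(double value) {
    enum { F = 4, C = 2 };
    double res[F][F][F], coarse[C][C][C];
    for (int k = 0; k < F; k++)
        for (int j = 0; j < F; j++)
            for (int i = 0; i < F; i++)
                res[k][j][i] = value;
    for (int k = 0; k < F; k += 2)
        for (int j = 0; j < F; j += 2)
            for (int i = 0; i < F; i += 2)
                coarse[k/2][j/2][i/2] = 0.125 * (
                    res[k][j][i]     + res[k][j][i+1]   +
                    res[k][j+1][i]   + res[k][j+1][i+1] +
                    res[k+1][j][i]   + res[k+1][j][i+1] +
                    res[k+1][j+1][i] + res[k+1][j+1][i+1]);
    for (int k = 0; k < C; k++)
        for (int j = 0; j < C; j++)
            for (int i = 0; i < C; i++)
                if (coarse[k][j][i] != value) return false;
    return true;
}
```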
The difference in size between input and output grids for the restriction operation differentiates it from the other stencil operations, which use equal-sized grids. The other feature unique to the stencil used in the restriction operation is that it does not read the ghost zone points of the input fine grid. Thus no ghost zone exchange is required prior to applying the restriction.

Interpolation is a scatter operation which performs the inverse of restriction. It maps a point from a coarse grid to eight points in the fine grid when going up the V-cycle. Interpolation is not an optimization target in this dissertation.

 1 // Input: Residual Operator
 2 for(k=0;k<K;k++)
 3  for(j=0;j<J;j++)
 4   for(i=0;i<I;i++)
 5    // statement S3 : Compute residual
 6    res[k][j][i]=rhs[k][j][i]
 7     - a * alpha[k][j][i] * phi[k][j][i]
 8     + b*h2inv*(
 9       beta_i[k][j][i+1]*( phi[k][j][i+1]-phi[k][j][i] )
10      -beta_i[k][j][i]  *( phi[k][j][i]-phi[k][j][i-1] )
11      +beta_j[k][j+1][i]*( phi[k][j+1][i]-phi[k][j][i] )
12      -beta_j[k][j][i]  *( phi[k][j][i]-phi[k][j-1][i] )
13      +beta_k[k+1][j][i]*( phi[k+1][j][i]-phi[k][j][i] )
14      -beta_k[k][j][i]  *( phi[k][j][i]-phi[k-1][j][i] )
15     );

Listing 3.5: Residual operator.

 1 // Input: Restriction Operator
 2 for(k=0;k<K;k+=2)
 3  for(j=0;j<J;j+=2)
 4   for(i=0;i<I;i+=2)
 5    coarser_res[k/2][j/2][i/2]=0.125 *(
 6     res[k][j][i]+res[k][j][i+1]+
 7     res[k][j+1][i]+res[k][j+1][i+1]+
 8     res[k+1][j][i]+res[k+1][j][i+1]+
 9     res[k+1][j+1][i]+res[k+1][j+1][i+1]
10    );

Listing 3.6: Restriction operator.

3.4 Optimization Challenges

Optimizations for miniGMG need to address three broad performance challenges on current and future architectures:

Reducing data movement.
Stencil computations are commonly memory-bandwidth limited, and thus minimizing data traffic is the single most important factor in optimizing them for modern architectures. Geometric Multigrid uses a sequence of stencil computations, and thus optimizations for GMG must aim to increase locality both for individual stencils and across stencil computations.

Parallel code generation.
With the increasing number of threads on a node, expressing parallelism in the generated code is crucial to getting good performance. Parallel code generation for miniGMG is particularly interesting, as parallelism decreases on descending the V-cycle and threading strategies need to be tailored to adapt to the level of the V-cycle.

Managing floating-point computation.
Smooths in Geometric Multigrid can use compute-intensive stencils, leading to floating-point computations being the bottleneck. In such cases optimizations must aim to reuse computation to reduce floating-point operations. In addition to computation reuse, code generation must efficiently use architectural features such as SIMD units which boost the performance of floating-point operations.

There is considerable interaction between these optimizations; for example, locality-increasing optimizations for miniGMG may increase computation and decrease parallelism. To explore these trade-offs, automatic code-tuning frameworks must compose and apply sequences of optimizations. Generating high-performance code will require searching the space of such compositions to pick the one best suited for a given input stencil and target architecture.

CHAPTER 4

COMMUNICATION-AVOIDING OPTIMIZATIONS

The principal operations in Geometric Multigrid commonly execute less than one floating-point operation for every byte of data. This low arithmetic intensity, coupled with the fact that data movement is far more expensive than floating-point operations, makes the performance of miniGMG memory bandwidth limited. Thus, the key to achieving high performance is to reduce data movement.

In this chapter we introduce the types of data movement in the miniGMG benchmark, followed by the description of optimizations targeting data traffic. The description of the optimizations is followed by the details of their implementation in the compiler. The final part of the chapter presents the performance results and analysis of the generated code.
As the focus of this chapter is on reducing data movement, the variable-coefficient 3D 7-point stencil is used throughout. Variable-coefficient stencils read in many more arrays when compared to constant-coefficient ones, and consequently require higher volumes of data movement. Thus, applying the optimizations developed here to variable-coefficient stencils highlights their efficacy.

4.1 Types of Communication

Data movement in miniGMG can be classified as either horizontal communication or vertical communication:

Vertical Communication
The Geometric Multigrid operators smooth, residual, restriction, and interpolation (Listings 3.2, 3.5 and 3.6) are all three-deep loop nests which sweep over 3D grids. Each 3D grid is stored as an array, and sweeping or streaming through them generates data movement through the memory hierarchy. This memory traffic generated by each application of an operator is termed vertical communication.

Horizontal Communication
Sections 3.1 to 3.3 describe how miniGMG decomposes the global domain into subdomains which are processed in parallel using threads (OpenMP) and processes (MPI). After an operation such as a smooth is applied to the subdomains, a ghost zone exchange between neighboring subdomains is required. This data movement between threads and processes to update the ghost zones is called horizontal communication.

4.2 Communication-Avoiding Optimizations

Vertical communication is required for each operator application as it sweeps through 3D grids, and horizontal communication is required between operator applications to update ghost zones. This chapter presents optimizations to reduce both types of data movement. Table 4.1 lists the optimizations and the type of data movement they target. The optimizations are presented in the context of miniGMG used with a variable-coefficient smooth and residual, Listings 3.2 and 3.5, respectively. The variable-coefficient stencils used in these operations have a very low arithmetic intensity and highlight the data-movement optimizations.
4.2.1 Fusing Components of Smooth

Programmers often wish to maintain flexibility and thus create smooth operators by composing multiple simpler operators, as illustrated in Listings 3.2 and 3.4. The smooth operator calculates the Laplacian, Helmholtz, and either a Gauss-Seidel or Jacobi relaxation in sequence. Each of these simpler operators is a loop nest which sweeps through the grids. The finest level of the miniGMG V-cycle has a 256³ domain per node decomposed into 64³ subdomains. Each subdomain stores the grids for phi, temp, alpha, beta_i, beta_j, beta_k, lambda and rhs as 66³ arrays; the net memory requirement of these arrays is more than 1 GB of memory and does not fit into last-level caches. Since the finer grids do not fit into the last-level cache, each loop nest or grid sweep results in data movement between DRAM and the last-level cache.

Table 4.1: Classification of the optimizations described in this chapter as reducing vertical communication, horizontal communication, or both. Parallel code generation has been left out as it is not a communication-avoiding optimization but helps improve wavefront, which reduces vertical communication.

Optimization                           | Vertical | Horizontal
Smooth Operator Fusion                 |    X     |     -
Deep Ghost Zones                       |    -     |     X
Wavefront                              |    X     |     -
Residual-Restriction Fusion            |    X     |     -
Smooth-Residual-Restriction Wavefront  |    X     |     X

The three separate loop nests in smooth generate high DRAM traffic, as the grids are streamed into cache three times. To reduce this data movement, the compiler fuses the multiple smooth operators together. Fusion is itself a vertical communication-avoiding optimization, since the results computed by one operator will remain in cache when used as input by the next operator.

The output of fusion for the GSRB and Jacobi smooths is outlined in Listings 4.1 and 4.2, respectively. Loop fusion for the GSRB smooth requires the if-condition that previously guarded just the GSRB update statement S2() to now guard the execution of all three statements S0(), S1(), and S2().

An additional communication-avoiding optimization for the GSRB smooth is to replace the array temp with a scalar and not write it back to memory on completion.
The first two statements write to temp, and the last statement uses the value written to update phi. Replacing temp with a scalar saves the vertical communication associated with accessing temp. This is not possible in the Jacobi smooth, where temp is used across the two (k,j,i) loop nests.

 1 for (k=0;k<N;k++){
 2  for (j=0;j<N;j++){
 3   for (i=0;i<N;i++){
 4
 5    if((i+j+k+color)%2==0){
 6
 7     S0(k,j,i); /* Laplacian */
 8     S1(k,j,i); /* Helmholtz */
 9     S2(k,j,i); /* GSRB relaxation */
10
11    } /* end if */
12 }}}

Listing 4.1: Fused GSRB smooth.

 1 if(sweep%2==0){
 2
 3  for (k=0;k<N;k++){
 4   for (j=0;j<N;j++){
 5    for (i=0;i<N;i++){
 6
 7     even_S0(k,j,i); /* Laplacian */
 8     even_S1(k,j,i); /* Helmholtz */
 9     even_S2(k,j,i); /* Jacobi relaxation */
10  }}}
11
12 } else if(sweep%2==1){
13
14  for (k=0;k<N;k++){
15   for (j=0;j<N;j++){
16    for (i=0;i<N;i++){
17
18     odd_S0(k,j,i); /* Laplacian */
19     odd_S1(k,j,i); /* Helmholtz */
20     odd_S2(k,j,i); /* Jacobi relaxation */
21  }}}
22
23 } /* end if */

Listing 4.2: Fused Jacobi smooth.

4.2.2 Deep Ghost Zones

Smooth contains a stencil computation which requires ghost zones. Stencil computations performed on the boundary of the subdomains read these ghost zone points but do not compute values to update them. Thus, after a stencil computation sweeps through the grids, the ghost zone values become stale and they need to be updated by data exchange with neighboring subdomains. Deeper ghost zones reduce ghost zone exchanges by performing redundant computations to update ghost zones. This is illustrated in Figure 4.1, where a 2D 5-point stencil is applied to a 4x4 grid with a two-deep ghost zone. The first stencil application computes values for a 6x6 grid (blue points are updated); the following stencil sweep uses the 6x6 grid to compute the values for the final 4x4 grid (red points). The second stencil sweep was possible without a prior data exchange because the first sweep updated a 6x6 grid, which meant the ghost zone used to compute the 4x4 grid had updated values. The blue points shown in Figure 4.1(b) are computed redundantly, but they reduce horizontal communication, with each consecutive stencil sweep working on a smaller grid. Thus, deeper ghost zones
reduce the frequency of horizontal communication.

Deeper ghost zones exchange fewer messages but change the pattern of communication between neighboring subdomains. Figure 4.2 shows the communication pattern for a two-deep ghost zone used by a 2D 5-point stencil. When compared to Figure 3.5, it is evident that a larger message corresponding to the deeper ghost zone must be exchanged. In addition to the larger messages, the number of neighbors involved also increases, as corner points in the ghost zone also need to be updated.

Figure 4.1: Two applications of a 2D 5-point stencil on a 4x4 grid with a two-deep ghost zone. (a) The stencil is first applied on an 8x8 grid to compute the values for the 6x6 grid (blue points). The outer layer of grid points shaded grey are read and used as the ghost zone. (b) The second stencil sweep uses the 6x6 grid to compute the output of the 4x4 grid (red points). For the second stencil sweep, the blue points are used as the ghost zone.

Figure 4.2: A deeper ghost zone changes the communication pattern and volume between a subdomain and its neighbors. A one-deep ghost zone requires exchange with the left, right, top, and bottom neighbors. An additional layer of ghost zone now requires exchange with the neighbors on the corners as well. In addition, a larger volume of data must be exchanged.

The output code for GSRB and Jacobi smooths with deeper ghost zones is shown in Listings 4.3 and 4.4, respectively. A new t-loop has been added to apply smooth multiple times. The loop bounds for the k-, j-, and i-loops are now functions of t to ensure that with each application of smooth the region of the grid which is computed shrinks in all dimensions and uses only valid data. In addition to the loop bounds, the if-condition used in GSRB also has to include the loop index t, since consecutive smooths update either the red or the black points.

 1 /* d = ghost zone depth */
 2 for (t=0;t<d;t++){
 3  for (k=t-(d-1);k<N+(d-1)-t;k++){
 4   for (j=t-(d-1);j<N+(d-1)-t;j++){
 5    for (i=t-(d-1);i<N+(d-1)-t;i++){
 6
 7     if((i+j+k+t+color)%2==0){
 8
 9      S0(t,k,j,i); /* Laplacian */
10      S1(t,k,j,i); /* Helmholtz */
11      S2(t,k,j,i); /* GSRB relaxation */
12
13     } /* end if */
14 }}}}

Listing 4.3: Fused GSRB smooth with overlapped ghost zones.

 1 /* d = ghost zone depth */
 2 for (t=0;t<d;t++){
 3
 4  if(t%2==0){
 5
 6   for (k=t-(d-1);k<N+(d-1)-t;k++){
 7    for (j=t-(d-1);j<N+(d-1)-t;j++){
 8     for (i=t-(d-1);i<N+(d-1)-t;i++){
 9
10      even_S0(t,k,j,i); /* Laplacian */
11      even_S1(t,k,j,i); /* Helmholtz */
12      even_S2(t,k,j,i); /* Jacobi relaxation */
13    }}}
14
15  } else if(t%2==1){
16
17   for (k=t-(d-1);k<N+(d-1)-t;k++){
18    for (j=t-(d-1);j<N+(d-1)-t;j++){
19     for (i=t-(d-1);i<N+(d-1)-t;i++){
20
21      odd_S0(t,k,j,i); /* Laplacian */
22      odd_S1(t,k,j,i); /* Helmholtz */
23      odd_S2(t,k,j,i); /* Jacobi relaxation */
24    }}}
25
26 }} /* end if */ /* end t */

Listing 4.4: Fused Jacobi smooth with overlapped ghost zones.

Deeper ghost zones reduce horizontal communication at the expense of redundant computation. As one goes down the V-cycle, the coarser grids have much-reduced computation and cannot afford redundant computation. Thus ghost zone depth must be tuned for each level of the V-cycle to find the optimum balance.

4.2.3 Wavefront Computation

Deep ghost zones allow multiple smooths to be applied before a ghost zone exchange. As each smooth application is a grid sweep, the multiple grid sweeps generate high vertical communication between the caches and DRAM. A wavefront computation is used to reduce this vertical communication. A wavefront fuses multiple grid sweeps into one, thereby reducing DRAM traffic.

Wavefront computation is explained in terms of the GSRB smooth, which uses a 3D 7-point stencil, and then extended to the Jacobi smooth. The following discussion assumes that a four-deep ghost zone is used, which means four GSRB smooths are applied: red (R1), black (B1), red (R2), and finally a black (B2) sweep. GSRB partitions the grid into red and black points, where a red point has only black neighbors and vice-versa. A red sweep updates red points and a black sweep updates black ones.

A sweep through a 3D N x N x N grid can be visualized as streaming through N (k = 0,...,N-1) 2D N x N ij-planes.
A wavefront computation fuses the four grid sweeps R1-B1-R2-B2 into one by computing values for a set of four ij-planes at a time. This is illustrated in Figure 4.3(a), which shows the cross section of a 3D grid and illustrates how a wavefront works on ij-planes. The first red sweep, R1, on plane k = z + 3 is computed, followed by B1 on plane k = z + 2, R2 on k = z + 1, and B2 on plane k = z. At this stage, plane k = z has gone through all four sweeps, and the wavefront now performs the same sequence on planes k = z + 4 to k = z + 1, as shown in Figure 4.3(b). An ij-plane goes through all four sweeps R1, B1, R2, and B2 as the wavefront progresses through it, and in one sweep of the grid, four smooths are applied. The wavefront described here processes four planes, and is thus termed four deep.

The Jacobi smooth uses an out-of-place stencil computation where every odd-numbered smooth application reads phi and updates temp, and every even-numbered application reads temp and updates phi. Figure 4.4 illustrates a four-deep wavefront for the Jacobi smooth. The wavefront needs a four-deep ghost zone and fuses four smooths into one. The figure highlights the ping-pong between temp and phi.

Jacobi iterations, like GSRB, use a 3D 7-point stencil, and thus read in three ij-planes (top, center, and bottom) and output a single ij-plane. In the wavefront shown in Figure 4.4, the first plane output is updated to temp. Plane k = z + 7 of temp is computed using planes k = z + 8 (top), k = z + 7 (center), and the bottom

Figure 4.3: Progress of the GSRB wavefront.
Figure 4.4: Jacobi wavefront.

plane k = z + 6 of phi. The next step of the wavefront updates phi and computes plane k = z + 5 of phi using planes k = z + 6, k = z + 5, and k = z + 4 from temp. The next two updates proceed similarly, as shown in the figure.

In the wavefront for Jacobi iterations, there is a difference of 2 between the planes that are updated; for example, in Figure 4.4 planes z + 7, z + 5, z + 3, and finally z + 1 are updated in sequence. The GSRB wavefront, on the other hand, updates 4 consecutive planes: z + 3, z + 2, z + 1, and z. This crucial difference between wavefronts for GSRB and Jacobi iterations arises from data dependences and means that many more planes need to be read and held in memory for Jacobi iterations, leading to a larger working set.

Code Listings 4.5 and 4.6 show the skeleton code for the four-deep GSRB and Jacobi wavefronts applied on a 64³ box, respectively. It is important to note that in both code fragments the k- and t-loops have been interchanged (or permuted). The time step t-loop, which was previously outermost and was responsible for applying multiple grid sweeps, is now nested inside the outermost k-loop. This means the k-dimension is scanned or iterated through only once and corresponds to a single grid sweep. The two innermost j- and i-loops are not changed, as we do not modify how points in an ij-plane are updated/traversed.

The modified indexing used in the statements (S0, S1, ...) in the two code listings determines the order in which the ij-planes are updated in the wavefront computations.
The indexing of the statements prior to creating a wavefront was of the form (t,k,j,i), i.e., S0(t,k,j,i) (Listing 4.3). In the GSRB wavefront the indexing changes to (t,k-t,j,i), and for Jacobi it is (t,k-2*t,j,i). The modified array index expression for the k-dimension means that at each iteration of the t-loop, the ij-plane that is updated is one (GSRB) or two (Jacobi) planes lower than the one updated in the previous iteration of t. As seen previously, for four iterations of the t-loop (from 0 to 3), this translates to planes k = z + 3, k = z + 2, k = z + 1, and k = z getting updated for GSRB. And for Jacobi, these are planes k = z + 7, k = z + 5, k = z + 3, and k = z + 1.

 1 for (k=-3;k<=66;k++){
 2  for (t=0;t<=min(3,intFloor(k+3,2));t++){
 3   for (j=t-3;j<=-t+66;j++){
 4    for (i=t-3+intMod(-k-color-j-(t-3),2);i<=-t+66;i+=2)
 5    {
 6     S0(t,k-t,j,i); /* Laplacian */
 7     S1(t,k-t,j,i); /* Helmholtz */
 8     S2(t,k-t,j,i); /* GSRB */
 9 }}}}

Listing 4.5: GSRB wavefront for a 64³ box with a four-deep ghost zone.

 1 for (k=-3;k<=70;k++){
 2  for (t=max(0,k-66);t<=min(3,intFloor(k+3,3));t++){
 3   if(t%2==0){
 4    for (j=t-3;j<=66-t;j++){
 5     for (i=-3;i<=66;i++)
 6     {
 7      even_S0(t,k-2*t,j,i); /* Laplacian */
 8      even_S1(t,k-2*t,j,i); /* Helmholtz */
 9      even_S2(t,k-2*t,j,i); /* Jacobi relaxation */
10     }
11   }}
12   if(t%2==1){
13    for (j=t-3;j<=66-t;j++){
14     for (i=-3;i<=66;i++)
15     {
16      odd_S0(t,k-2*t,j,i); /* Laplacian */
17      odd_S1(t,k-2*t,j,i); /* Helmholtz */
18      odd_S2(t,k-2*t,j,i); /* Jacobi relaxation */
19     }
20    }
21   }
22 }}

Listing 4.6: Jacobi wavefront for a 64³ box with a four-deep ghost zone.

4.2.4 Parallel Code Generation

Wavefronts hold multiple planes in memory, thus increasing the working set. This may lead to spilling out of the faster caches (L1/L2). We generate nested multithreaded code via OpenMP to share planes across threads and reduce the working set per thread.

Figure 4.5 illustrates how an ij-plane in a box gets shared between a number of OpenMP threads for the GSRB smooth with a four-deep wavefront. The four threads tile the iteration space of the j-loop. Each thread processes the four R-B-R-B planes of the wavefront before needing to synchronize.
Code Listings 4.7 and 4.8 show skeleton codes for a four-deep GSRB and Jacobi wavefront applied on a 64³ box, respectively. In Listing 4.5 there is a single thread processing the box, and in Listing 4.7 there are 12 OpenMP threads collaboratively processing the box. The j-loop in Listing 4.5, which performed 70 iterations (-3 to 66), has been tiled, and each tile is assigned to a thread. With 12 threads, each thread gets ceil(70/12) = 6 iterations. Line 16 shows where the threads working on a box need to synchronize. The synchronization point is after the completion of the t-loop. This means that threads can process all four sweeps R-B-R-B before synchronizing.

The OpenMP barrier in line 16 of Listing 4.7 means that all the threads working on a box wait for all other threads to finish processing a set of 4 planes before proceeding. This is expensive and not required in this case, as each thread only needs to sync with its immediate neighbors. For example, in Figure 4.5 thread 2 needs to sync with thread 1 and thread 3. To ameliorate the effect of the expensive barrier, we followed the strategy of expert manual tuners in [32] and generated code to implement spin locks. An array of locks sized by the number of collaborating threads was created, and each thread only waited on its two neighbors. Spin locks improve performance, but unfortunately this is not a portable approach and breaks programming (OpenMP) abstractions.

Collaborative threading with multiple threads per box for the Jacobi wavefront is shown in Listing 4.8. The generated code is simpler, but the threading is more expensive. This is because, due to data dependences, the threads must synchronize after each plane is processed. This is more expensive than GSRB, where multiple planes can be processed before synchronization. For the Jacobi, the OpenMP parallel for directive was used rather than creating an OpenMP parallel region, as in the case of GSRB.

Figure 4.5: Multiple threads working collaboratively to process a subdomain/box.

Generating code where multiple threads process a box creates three strategies for thread decomposition. As illustrated in Figure 4.6, we can have interbox parallelism, nested parallelism, and intrabox parallelism. Each box is processed by a single thread in interbox parallelism, and in intrabox parallelism a single box is processed with all threads working on it. Nested parallelism has multiple boxes with multiple threads working inside each box and leverages nested parallelism in OpenMP. As shown in Figure 4.6, on a system with six threads we can have six boxes being processed in parallel by a single thread each, or x boxes being processed with y threads in them such that x*y = 6, or a single box with all 6 threads working on it. Larger boxes have a bigger working set than the smaller boxes down the V-cycle, which suggests that the system should assign more threads per box for the larger grids and fewer threads for the smaller grids; ultimately the thread distribution is optimized via autotuning.

 1 #pragma omp parallel private(...) num_threads(12)
 2 {
 3  tid=omp_get_thread_num();
 4
 5  for (k=-3;k<=66;k++){
 6   for (t=0;t<=min(3,intFloor(k+3,2));t++){
 7    for (j=6*tid-3;j<=min(6*tid+2,66);j++){
 8     for (i=t-3+intMod(-k-color-j-(t-3),2);i<=-t+66;i+=2)
 9     {
10      S0(t,k-t,j,i); /* Laplacian */
11      S1(t,k-t,j,i); /* Helmholtz */
12      S2(t,k-t,j,i); /* GSRB */
13     }
14    }
15   }
16 #pragma omp barrier /* (or explicit locks) */
17  }
18 }

Listing 4.7: Threaded GSRB wavefront for a 64³ box with a four-deep ghost zone.
 1 #pragma omp parallel private(...) num_threads(3)
 2
 3 for (k=-3;k<=70;k++){
 4  for (t=max(0,k-66);t<=min(3,intFloor(k+3,3));t++){
 5   if(t%2==0){
 6
 7 #pragma omp for
 8    for (j=t-3;j<=66-t;j++){
 9     for (i=-3;i<=66;i++)
10     {
11      even_S0(t,k-2*t,j,i); /* Laplacian */
12      even_S1(t,k-2*t,j,i); /* Helmholtz */
13      even_S2(t,k-2*t,j,i); /* Jacobi relaxation */
14     }
15    }
16   }
17   if(t%2==1){
18
19 #pragma omp for
20    for (j=t-3;j<=66-t;j++){
21     for (i=-3;i<=66;i++)
22     {
23      odd_S0(t,k-2*t,j,i); /* Laplacian */
24      odd_S1(t,k-2*t,j,i); /* Helmholtz */
25      odd_S2(t,k-2*t,j,i); /* Jacobi relaxation */
26     }
27    }
28   }
29
30  }
31 }

Listing 4.8: Threaded Jacobi wavefront for a 64³ box with a four-deep ghost zone.

Figure 4.6: Example parallel decompositions on Hopper, which has 6 cores per socket. All the boxes in a subdomain may be processed in parallel, or all the threads may work on one box collaboratively, or nested parallelism may be used. Thread configuration <X,Y> denotes X boxes with Y threads per box: interbox parallelism is <6,1>, nested parallelism <2,3>, and intrabox parallelism <1,6>.

4.2.5 Residual-Restriction Fusion

Wavefront computation, introduced in Section 4.2.3, fuses multiple sweeps of a box/grid into one and reduces vertical communication. We extend the same strategy of fusing multiple grid sweeps into one by fusing the residual and restriction computations into one sweep. Unfortunately, there exists a data dependence between residual (Listing 3.5) and restriction (Listing 4.9) which prevents this fusion. The dependence arises because every iteration of the triply nested ijk-loop of residual updates a single point on the output (finer) grid. Restriction needs to read 8 points from the finer grid and restrict them to a single output point of the coarser grid. If the loops were naively fused, restriction would read points on the input finer grid before they were correctly updated.
To break this data dependence, a novel compiler transformation was designed which converts the restriction stencil, an 8-point out-of-place gather, into an accumulation, enabling restriction to be fused with the preceding operators. Listing 4.10 shows the output of this transformation. In line 12 of the output code, each point from the input grid is read, multiplied by the coefficient, and scattered to the correct output point in the coarse grid. Instead of needing to read 8 points, restriction now reads a single point from the fine grid, and thus the data dependence is broken. Since we are generating an accumulation, care must be taken to zero out the values in the output grid points before accumulating to them; this is performed in line 6 of Listing 4.10.

The output of the residual-restriction fusion is shown in Listing 4.11. Line 6 zeroes out planes of the coarse grid before accumulating. The nested for-loop (lines 10-14) computes the residual (S3) and immediately scatters it to the correct point in the output coarse grid. Thus residual-restriction now completes in one grid sweep performed by the common outermost k-loop.

4.2.6 Smooth-Residual-Restriction Wavefront

We extend the wavefront strategy further to create an even deeper wavefront from fused smooths, residual, and restriction. This fuses six grid sweeps (4 smooths, residual, and restriction) into one sweep. To eliminate the ghost zone exchange required prior to the residual computation, we have to increase the ghost zone depth to five from the previously used four. The deeper ghost zone reduces horizontal communication further. The wavefront created then reduces vertical communication. The deeper wavefront means a larger working set. The working set is managed by generating nested parallel code.
The smooth-residual-restriction wavefront is illustrated in Figure 4.7. The cross-section of a box shows a wavefront that is six planes deep. The first four planes compute 2 GSRB sweeps, the fifth plane computes the residual, and the last plane computes the restriction and writes to the coarser output grid.

Listing 4.12 illustrates the code generated for this wavefront for the GSRB smooth applied to a 64³ box with a five-deep ghost zone. There are three threads working collaboratively inside the box, and they synchronize using spin locks (lines 36-38). The code illustrates that the k-loop was skewed against the time t-loop, and then they were permuted, making k the outer loop and giving a single grid sweep (in the k-dimension). The smooths (lines 9-16) are followed by initializing a plane of the output coarser grid (lines 18-23), and then the computation of residual and restriction

 1 /* Input: Restriction Operation */
 2 for(k=0;k<K;k+=2)
 3  for(j=0;j<J;j+=2)
 4   for(i=0;i<I;i+=2)
 5    coarser_res[k/2][j/2][i/2]=0.125 *(
 6     res[k][j][i]+res[k][j][i+1]+
 7     res[k][j+1][i]+res[k][j+1][i+1]+
 8     res[k+1][j][i]+res[k+1][j][i+1]+
 9     res[k+1][j+1][i]+res[k+1][j+1][i+1]
10    );

Listing 4.9: Restriction operation.

 1 /* Output: Restriction as a Scatter Operation */
 2 for(k=0;k<K;k+=2)
 3  for(j=0;j<J;j+=2)
 4   for(i=0;i<I;i+=2)
 5    /* statement S4 : Initialize coarse_res */
 6    coarser_res[k/2][j/2][i/2]=0;
 7
 8 for(k=0;k<K;k++)
 9  for(j=0;j<J;j++)
10   for(i=0;i<I;i++)
11    /* statement S5 : Restrict fine_res to coarse_res */
12    coarser_res[k/2][j/2][i/2]+=0.125* res[k][j][i];

Listing 4.10: Restriction as an accumulation.
 1 /* Output: Restriction as a Scatter Operation */
 2 for(k=0;k<K;k++){
 3  if(k%2==0){
 4   for(j=0;j<J;j+=2){
 5    for(i=0;i<I;i+=2){
 6     S4(); /* statement S4 : Initialize coarse_res */
 7   }}
 8  }
 9
10  for(j=0;j<J;j++){
11   for(i=0;i<I;i++){
12    S3(); /* Compute residual */
13    S5(); /* statement S5 : Restrict fine_res to coarse_res */
14  }}
15
16 } /* End K */

Listing 4.11: Fused residual-restriction.

Figure 4.7: Wavefront applying smooths, residual, and restriction. A cross section of the N³ fine grid is shared by four threads; the six wavefront planes perform the red smooth, black smooth, second red and black smooths, the residual, and the restriction into the (N/2)³ coarse grid.

 1 #pragma omp parallel private(...) \
 2   shared(locks) num_threads(3)
 3 {
 4  tid=omp_get_thread_num();
 5  num_threads=omp_get_num_threads();
 6  left=max(tid-1,0);
 7  right=min(tid+1,num_threads-1);
 8
 9  for(k=-4;k<=67;k++){
10   for(t=0;t<=min(3,intFloor(k+4,2));t++){
11    for(j=24*tid-4;j<=24*tid+19;j++){
12     for(i=-4+intMod(-k-color-j-(t-4),2);i<=67;i+=2){
13      S0(t,k-t,j,i); /* Laplacian */
14      S1(t,k-t,j,i); /* Helmholtz */
15      S2(t,k-t,j,i); /* GSRB */
16    }}} /* End t,j,i */
17
18    if(4<=k&&intMod(k,2)==0){
19     for(j=max(24*tid-4,0);
20       j<=min(62,24*tid+18);j+=2){
21      for(i=0;i<=62;i+=2){
22       S4(t,k-t,j,i); /* Initialize coarse_res */
23    }}} /* End if */
24
25    if(4<=k){
26     for(j=max(24*tid-4,0);
27       j<=min(62,24*tid+18);j++){
28      for(i=0;i<=63;i++){
29       S3(t,k-t,j,i); /* Compute residual */
30       S5(t,k-t,j,i); /* Restrict fine_res to coarse_res */
31    }}} /* End if */
32
33    /* After computing the 5-deep wavefront */
34    /* Threads sync with their right and left neighbors */
35
36    locks[tid]=k;
37    if(left!=tid){ while(locks[left]<k) pause(); }
38    if(right!=tid){ while(locks[right]<k) pause(); }
39
40  } /* End k */
41
42 } /* End OMP region */

Listing 4.12: Simplified generated code for a threaded wavefront with GSRB, residual, and restriction fused. The code is specialized for a 64³ box, with a five-deep ghost zone and 3 threads working inside a box.

to the coarser grid (lines 25-31). The threads synchronize after completing multiple smooths, the residual, and the restriction. This deep wavefront is applied when going down the V-cycle. On the way back up, residual and restriction are replaced by the interpolation operation. Hence we use a four-deep wavefront going back up the V-cycle, but the ghost zone depth is still fixed at five, resulting in excess communication.

In a similar manner, wavefront computations can be generated for Jacobi-style stencils, but Jacobi stencils present additional challenges to the memory system. Jacobi reads and writes to different arrays, leading to an even larger working set. Collaborative threading for Jacobi stencils is less effective, since due to de
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6dj8q0g |



