Using autotuning for accelerating tensor contraction on graphics processing units (GPUs)

Title	Using autotuning for accelerating tensor contraction on graphics processing units (GPUs)
Publication Type	thesis
School or College	College of Engineering
Department	Computing
Author	Rivera, Axel Y.
Date	2014-12
Description	Tensors are mathematical representations of physical entities that have magnitude with multiple directions. Tensor contraction is a form of creating these objects using the Einstein summation equation. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge seen in the computation of tensor contraction is the difference between the mathematical representation and an efficient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to find the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves a speedup of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version; while using an NVIDIA Tesla K20c GPU it achieved speedups of 8.87x and 17.62x. The parallel decompositions found by the tool were also tested with an OpenACC implementation and achieved a speedup of 8.87x and 10.42x for nekbone, while NWChem obtained a speedup of 7.25x and 10.34x compared to the choices made by default in the OpenACC compiler.
The contributions of this work are: (1) a simplified interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and, (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC.
Type	Text
Publisher	University of Utah
Subject	Autotuning; GPU; Tensor contraction
Dissertation Institution	University of Utah
Dissertation Name	Master of Science
Language	eng
Rights Management	Copyright © Axel Y. Rivera 2014
Format	application/pdf
Format Medium	application/pdf
Format Extent	1,137,271 bytes
Identifier	etd3/id/3332
ARK	ark:/87278/s66q55gv
DOI	https://doi.org/doi:10.26053/0H-Y4V6-1300
Setname	ir_etd
ID	196897
Reference URL	https://collections.lib.utah.edu/ark:/87278/s66q55gv