| Title | Designing a parallel dataflow architecture for streaming large-scale visualization on heterogeneous platforms |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Vo, Huy T. |
| Date | 2011-08 |
| Description | Dataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as an SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring task-parallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline, along with their APIs being considerably low-level, it remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes support for platforms ranging from multicore processors to clusters consisting of thousands of CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions.
We have implemented these techniques in HyperFlow, which comprises an API providing all basic dataflow constructs described in the dissertation and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU, and cluster-based platforms. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Dataflow architecture; Heterogeneous platforms; Multi-CPU; Multi-GPU; Parallel execution |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © Huy T. Vo 2011 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 35,190,222 bytes |
| Identifier | us-etd3,38255 |
| Source | Original housed in Marriott Library Special Collections, QA3.5 2011 .V6 |
| ARK | ark:/87278/s6gh9zqf |
| DOI | https://doi.org/doi:10.26053/0H-S91R-DYG0 |
| Setname | ir_etd |
| ID | 194543 |
| OCR Text | DESIGNING A PARALLEL DATAFLOW ARCHITECTURE FOR STREAMING LARGE-SCALE VISUALIZATION ON HETEROGENEOUS PLATFORMS

by Huy T. Vo

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing

School of Computing
The University of Utah
August 2011

Copyright © Huy T. Vo 2011. All Rights Reserved.

The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL
The dissertation of Huy T. Vo has been approved by the following supervisory committee members: Cláudio T. Silva, Chair (approved 03/18/2011); Juliana Freire, Member (approved 03/18/2011); Charles D. Hansen, Member; Valerio Pascucci, Member (approved 03/18/2011); and João L.D. Comba, Member (approved 03/18/2011); and by Al Davis, Chair of the School of Computing; and by Charles A. Wight, Dean of The Graduate School.

ABSTRACT
Dataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as an SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring task-parallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline, along with their APIs being considerably low-level, it remains a challenge to integrate these run-time libraries into current visualization systems.
Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes support for platforms ranging from multicore processors to clusters consisting of thousands of CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions. We have implemented these techniques in HyperFlow, which comprises an API providing all basic dataflow constructs described in the dissertation and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU, and cluster-based platforms.

To my mother, my endless source of support...

CONTENTS
ABSTRACT (p. iii)
LIST OF FIGURES (p. viii)
LIST OF TABLES (p. xii)
ACKNOWLEDGEMENTS (p. xiii)

CHAPTERS
1. INTRODUCTION (p. 1)
  1.1 Dissertation Objectives (p. 2)
  1.2 Dissertation Structure (p. 2)
  1.3 Remark (p. 3)
2. BACKGROUND AND RELATED WORK (p. 4)
  2.1 Dataflow Architecture (p. 4)
    2.1.1 Executive Policy and Scope (p. 4)
    2.1.2 Parallelism (p. 6)
    2.1.3 Dataflow vs. Workflow (p. 7)
  2.2 Visualization Systems (p. 7)
  2.3 Heterogeneous Programming Frameworks (p. 9)
3. STREAMING SIMPLIFICATION OF TETRAHEDRAL MESHES (p. 11)
  3.1 Streaming Mesh Layout (p. 12)
  3.2 Tetrahedral Simplification (p. 15)
    3.2.1 Quadric-Based Simplification (p. 15)
    3.2.2 Streaming Simplification (p. 16)
  3.3 Implementation Details and Optimizations (p. 18)
    3.3.1 Numerical Issues (p. 21)
  3.4 Results (p. 23)
    3.4.1 Stability and Error Analysis (p. 23)
    3.4.2 Performance (p. 26)
    3.4.3 Large-Scale Experiment (p. 26)
  3.5 Proposition (p. 27)
4. INTERACTIVE RENDERING OF LARGE UNSTRUCTURED GRIDS (p. 28)
  4.1 Interactive Out-of-Core Volume Rendering (p. 29)
    4.1.1 Preprocessing (p. 30)
    4.1.2 Out-of-Core Dataset Traversal (p. 30)
    4.1.3 Hardware-Assisted Rendering (p. 33)
    4.1.4 Distributed Rendering (p. 34)
  4.2 Results (p. 36)
  4.3 Discussion (p. 37)
    4.3.1 Limitations (p. 37)
    4.3.2 VTK (p. 41)
    4.3.3 iRun vs. iWalk (p. 41)
  4.4 Proposition (p. 42)
5. PARALLEL VISUALIZATION ON LARGE CLUSTERS USING MAPREDUCE (p. 43)
  5.1 MapReduce Overview (p. 44)
  5.2 Visualization Algorithms Using MapReduce (p. 46)
    5.2.1 Isosurface Extraction (p. 46)
    5.2.2 Rendering (p. 47)
    5.2.3 Mesh Simplification (p. 47)
  5.3 Experimental Analysis (p. 49)
    5.3.1 MapReduce Baseline (p. 50)
    5.3.2 Isosurface Extraction (p. 52)
    5.3.3 Rendering (p. 52)
    5.3.4 Mesh Simplification (p. 56)
  5.4 Discussion (p. 56)
  5.5 Proposition (p. 59)
6. INITIAL DESIGN FOR MULTICORE SYSTEMS (p. 60)
  6.1 Execution Model (p. 61)
  6.2 Scheduler (p. 63)
  6.3 Streaming Computation (p. 64)
  6.4 Framework Implementation (p. 66)
  6.5 Applications (p. 67)
    6.5.1 Multicore Image Processing (p. 67)
    6.5.2 Streaming Tetrahedral Mesh Simplification (p. 73)
  6.6 Discussion (p. 74)
7. HYPERFLOW: AN EFFICIENT DATAFLOW ARCHITECTURE FOR SYSTEMS WITH MULTIPLE CPUS AND GPUS (p. 76)
  7.1 HyperFlow Architecture (p. 76)
    7.1.1 Pipeline Construction Using Task-Oriented Modules (p. 77)
    7.1.2 Pipeline Execution Using Flows (p. 78)
    7.1.3 Virtual Processing Elements (VPEs) (p. 78)
    7.1.4 Execution Engine (p. 80)
    7.1.5 VPE Scheduler (p. 81)
    7.1.6 Memory Transfer (p. 82)
    7.1.7 Classes of Parallelism Supported (p. 83)
  7.2 Synthetic Applications (p. 87)
    7.2.1 Micro-Benchmarks (p. 87)
    7.2.2 Performance Results (p. 87)
  7.3 Real-Case Applications (p. 89)
    7.3.1 High Throughput Image Processing (p. 91)
    7.3.2 Multigrid for Gradient-Domain Operations (p. 91)
    7.3.3 Map/Reduce Word Count (p. 95)
    7.3.4 Broad-Phase Collision Detection (p. 98)
    7.3.5 Isocontouring Structured Grids (p. 99)
    7.3.6 Saturation of Memory Bandwidth (p. 103)
  7.4 Discussion (p. 105)
8. FUTURE WORKS (p. 106)
  8.1 Remote Computing Resources and VPE (p. 106)
  8.2 Distributed Scheduling (p. 107)
  8.3 Persistent Module (p. 108)
9. CONCLUSIONS (p. 110)
REFERENCES (p. 112)

LIST OF FIGURES
2.1 Executive Classification. Executives are classified based on their updating scope and policy: (a) a centralized push model, (b) a centralized pull model, (c) a distributed push model handling a data request, and (d) a distributed pull model. (p. 5)
2.2 Dataflow Parallelism. The three fundamental types of parallelism in the dataflow architecture: (a) Task Parallelism, (b) Pipeline Parallelism, and (c) Data Parallelism. (p. 6)
2.3 Dataflow vs. Workflow. Two main differences between dataflow and workflow: (a) pipeline topology and (b) execution strategy. (p. 7)
3.1 Streaming Simplification. Streaming simplification performed on a tetrahedral mesh ordered from bottom to top. The portion of the mesh that is in-core at each step is shown in green. (p. 12)
3.2 Mesh Format. Layout diagrams with cell indices on the horizontal and vertex indices on the vertical axis. A vertex is active from the time it is introduced until it is finalized, as indicated by the horizontal lines. (a) A standard format for tetrahedral meshes based on the OBJ format. (b) A streaming format that interleaves vertices with cells. Vertex finalization is provided with a negative index (i.e., relative to the most recent vertex). (p. 13)
3.3 Streaming Simplification Buffer. Example of a buffer moving across the surface of a tetrahedral mesh sorted from left to right. As new tetrahedra are introduced, the tetrahedra that have been in-core the longest are removed from memory. (p. 16)
3.4 QEM Data Structure. Data structures used for our quadric-based simplification. (p. 19)
3.5 Conjugate Gradient Solver. Conjugate gradient solver for positive semidefinite systems Ax = b. On input x is an estimate of the solution, n = 4 is the number of linear equations, and kmax is a tolerance on the condition number. tr(A) is the trace of A. (p. 22)
3.6 Simplified Scalar Result. Volume rendered images of the Fighter dataset show the preservation of scalar values. The original dataset is shown on the left (1,403,504 tetrahedra) and the simplified version is shown on the right (140,348 tetrahedra). (p. 25)
3.7 Simplified Surface Result. Views of the mesh quality on the surface of the Rbl dataset. The original dataset is shown on the left (3,886,728 tetrahedra) and the simplified version is shown on the right (388,637 tetrahedra). (p. 25)
3.8 Large-Scale Simplified Simulation Dataset. Isosurfaces of the fluid dynamics dataset. A very small portion of the isosurfaces is shown for the original dataset of over a billion tetrahedra (left) and the simplified dataset of only 12 million tetrahedra (right). The isosurfaces are shown up close using flat shading to enhance the details of the resulting surface. Our algorithm allows extensive simplification (almost 1%) with negligible numerical error (1.57% RMS) for the fluid dynamics dataset, which is too large to simplify with conventional approaches. (p. 27)
4.1 iRun Overview. (a) The user interacts with the UI by changing the camera, from which our system predicts future camera positions. (b) Our octree traversal algorithm selects the octree nodes that are in the frustum, determines the appropriate LOD, and arranges the nodes in visibility order. (c) The geometry cache keeps the working-set of triangles while a separate thread is used to fetch additional geometry from disk. (d) Finally, the geometry is sent to the renderer for object-space sorting followed by hardware-assisted image-space sorting and compositing, which is performed using a modified version of the HAVS algorithm. (p. 29)
4.2 iRun Visibility Sorter. Pseudo-code for the visibility sorter based on camera distance in iRun. (p. 31)
4.3 iRun LOD Traversal. Pseudo-code for the octree traversal algorithm. (p. 32)
4.4 A Snapshot of iRun Refining the LOD. The image on the left is rendered as the user would see it from the current camera position. On the right is a bird's-eye view of the same set of visible nodes. Different colors indicate different levels-of-detail. The geometry cache is limited to only 64MB of RAM in this case. (p. 32)
4.5 Distributed Rendering with iRun. (a) With a thumbnail client, the client receives thumbnails from the render-servers and composites them; (b) with a full client, the client accesses and renders the geometry directly. (p. 36)
4.6 Effects of Geometry Cache Size in iRun. iRun visualizes the Turbo Jet model consisting of over 10 million tetrahedra with multiple levels-of-detail. The geometry cache size of (a), (b), and (c) is 512MB while (d) is 1.5GB. (a) In interactive mode, iRun only displays 40K triangles. (b) In still-rendering mode, most of the geometry cache is used. (c) The user zooms in to a particular region. (d) The geometry cache size is tripled. (p. 38)
4.7 Geometry Cache Statistics. The difference between the displayed nodes of the geometry cache and the ones being fetched during interactions with the Turbo Jet at 2 fps. (p. 39)
4.8 Distributed Volume Rendering. The Bullet dataset rendered with a single machine versus distributed rendering with four clients. (p. 40)
5.1 Hadoop Architecture. Data transfer and communication of a typical MapReduce job in Hadoop. Data blocks are assigned to several Maps, which emit key/value pairs that are shuffled and sorted in parallel. The reduce step emits one or more pairs. (p. 45)
5.2 MapReduce Isosurface Extraction. The map phase generates an isovalue as key and triangle data as value. The reduce phase simply outputs the isosurface as a triangle soup to file. (p. 47)
5.3 MapReduce Rasterization. The map phase rasterizes each triangle and emits the pixel coordinates as key, and its color and depth as value. The reducer emits the smallest depth value for each location. (p. 48)
5.4 MapReduce Mesh Simplification. The first map phase generates the bin coordinate of a vertex as key and a quadric error measure along with the three triangle indices as value. The first reduce emits the representative vertex for each triangle. A second map and reduce is applied since chaining multiple reduce phases is not currently supported. (p. 48)
5.5 Hadoop Baseline Results. The Hadoop baseline test evaluates data transfer costs. In the local cluster we achieve rates up to 428MB/s. (p. 51)
5.6 Isosurface Extractions with MapReduce. Isosurface results using the MapReduce framework for the PPM Richtmyer-Meshkov instability and Skull datasets. (p. 53)
5.7 Surface Rendering with MapReduce. Rendering results for the St. Matthew (left) and Atlas (right) datasets. (p. 54)
5.8 Volume Rendering with MapReduce. Volume rendering of the earthquake dataset (SF1) using a 100MP image. The table shows volume rendering statistics for other tetrahedral meshes. The Bullet dataset with 36 million tetrahedra is rendered in 4min 19s. (p. 55)
5.9 Comparisons of Simplification with MapReduce. Simplified meshes using volumes from 8³ to 1024³ using decimation rates under 5%. (p. 57)
6.1 Parallel Dataflow Architecture for Multicore Platforms. The overall architecture of our system. The execution process is triggered by the first request of any pipeline module. The scheduler will distribute resources and schedule the next execution of the whole pipeline. Each module is allocated a specific amount of resources (CPU and GPU). Inputs will be converted into the appropriate medium for each resource before entering actual computation. (p. 61)
6.2 A Simple Rendering Pipeline. (p. 63)
6.3 Scheduler Assignments. Streaming priority assignments by our scheduler. (p. 63)
6.4 Gaussian Image Pipeline. (a) Using VTK; (b) using ours. (p. 68)
6.5 Gaussian Image Pipeline Results. CPU usage from 1 to 8 threads. (p. 69)
6.6 Gaussian Image Pipeline Results. Strong scaling plot (top) and efficiency plot (bottom). (p. 70)
6.7 ViSUS Performance with Our System. Two ViSUS pipelines performing: (a) the 12-Month Average, (b) Multi-View selective rendering, and (c) the scalability plot. (p. 72)
6.8 Streaming Simplification with Our System. Streaming simplification of tetrahedral meshes under our system (a) with data-parallelism and (c) with a complete parallelism. (p. 73)
7.1 HyperFlow Architecture Overview. (p. 77)
7.2 Scheduling Logic of HyperFlow. Pseudo-code for the main strategy of the VPE scheduler. (p. 82)
7.3 Execution and Parallelism in HyperFlow. Processing two data blocks, part one. (p. 85)
7.4 Execution and Parallelism in HyperFlow. Processing two data blocks, part two. (p. 86)
7.5 Topology of Micro-benchmarks. (p. 88)
7.6 Snapshot of the Micro-benchmark. Running with eight input flows and eight VPEs, at 2/3 of the total execution time. The first four inputs are completely processed at this point, as the first four flows on every module are inactive. (p. 88)
7.7 Micro-benchmarks Gantt Charts. Gantt charts and resource utilization when processing 8 input flows. The top half shows the Gantt chart for the complete execution of each benchmark, with one horizontal bar for each VPE. A green state corresponds to the execution of a module; yellow represents an idle state; red represents the end of execution. The bottom half shows the resource utilization plot (ratio of active VPEs to available VPEs). (p. 89)
7.8 Micro-benchmark Performance for HyperFlow. Micro-benchmark simulation with 1, 2, 4, and 8 threads (green, red, blue, and orange bars, respectively) and 1000 flows. (p. 90)
7.9 Edge Detection Result. Top: input composed of 624 images, each approximately 6MP for (a) and 3MP for (b) in resolution. Bottom: composition of input images after the edge detection pipeline. (p. 92)
7.10 Edge Detection Pipeline. Edge detection pipeline on approximately 4GB of image data. Speedup compared to a system with 1 CPU and 0 GPUs. (p. 93)
7.11 The Modified Multigrid Solver. (p. 95)
7.12 Map-Reduce Example in HyperFlow. (a) HyperFlow pipeline definition (left) and how it should be executed with high efficiency, i.e., some modules are desired to run in parallel while the rest have to be sequential since they use global shared resources. (b) Performance results. Mapping is the most time-consuming phase since it has to perform both word parsing and accumulating word counts for each text block. Fortunately, it can be parallelized with very little effort in HyperFlow. (p. 96)
7.13 Collision Detection Scene. 10^5 objects are contained in a 20×20×2 regular grid of spaces. Each object is colored differently according to the space where it is inserted. (p. 99)
7.14 OpenMP vs. HyperFlow. Comparison for collision detection. (p. 100)
7.15 Gantt Charts for the Parallel Broad-phase Collision Detection. The green states correspond to the execution of the broad-phase algorithm, blue is the constraint solving and integration phase, and yellow represents idle or unrelated computations. (p. 100)
7.16 Data Communication Diagram in HyperFlow for the Isocontouring Pipeline. (p. 101)
7.17 Pipeline Structure in HyperFlow for the Isocontouring Pipeline. (p. 102)
7.18 HyperFlow vs. Ghost [44]. Parallel computation of an isosurface. The speedup of both methods is measured relative to the wall clock time of HyperFlow running on a single processor. "No Mem HF" refers to the same HyperFlow implementation without memory accesses. (p. 104)
8.1 Node VPE. Computing resources on remote machines are utilized with tasks assigned from the Node VPE of local nodes. (p. 107)
8.2 Distributed Scheduling Without Persistent Module. (p. 108)
8.3 Implicit and Explicit Data Parallelism. Two types of data-parallelism that can be supported in HyperFlow: (a) implicit and (b) explicit. (p. 109)
8.4 Distributed Scheduling with Persistent Module. The stars specify persistent modules. (p. 109)

LIST OF TABLES
2.1 Summary of Current Visualization Dataflow Systems. (p. 9)
3.1 Analysis of Mesh Layout. (p. 15)
3.2 Dataset Statistics. (p. 15)
3.3 Priority-Queue (P) vs. Randomized (R) Approach. (p. 20)
3.4 Implementation Improvements. (p. 20)
3.5 Performance of Linear Solvers. (p. 23)
3.6 In-Core and Streaming Simplification Results. (p. 24)
4.1 Preprocessing Results. (p. 36)
5.1 Performance Results for Isosurface Extractions. (p. 53)
5.2 Performance Results for Surface Rendering. (p. 54)
5.3 Simplification with MapReduce. (p. 56)
6.1 Running Time and Efficiency Ratio of CPU Usage Between VTK and Our System for a Simple Gaussian Pipeline. (p. 70)
6.2 Timing for ViSUS. (p. 73)
6.3 Simplification Time for Achieving 10% Resolution. (p. 74)
7.1 Performance Results for Streaming Multigrid Gradient-Domain Image Editing. (p. 95)
7.2 Timing Results for the Word Count Pipeline (83 Million Words). (p. 98)
7.3 Large Iso-Surface Extraction Using Ghost [44] and HyperFlow. (p. 103)

ACKNOWLEDGEMENTS
I would like to thank Prof. Cláudio Silva deeply for offering me tremendous guidance and support throughout both my undergraduate and graduate years. I consider myself very fortunate to have had the opportunity to work with him during the past seven years. He has given me countless experiences to become a researcher that I could not have gained anywhere else. He has been greatly patient with me and my "random disappearing". I hope that I did not give him too much trouble. I would like to thank all of my committee members: Prof. Cláudio Silva, Prof. Juliana Freire, Prof. Charles Hansen, Prof. Valerio Pascucci, and Prof. João Comba, for the valuable comments and insightful discussions that made this dissertation possible. I would like to thank Prof. Cláudio Silva, Prof. Juliana Freire, and the entire VisTrails team for giving me an opportunity to work with such a talented group of researchers. It has been my honor. I would like to thank Louis Bavoil, Steve Callahan, and Carlos Scheidegger, from whom I learned so much during my early years in the VGC group. I also want to thank Brian Summa, Daniel Osmari, Luiz Scheidegger, and Jonathan Bronson for working hard with me through many projects. I would like to thank Nick Rathke and the IT team at SCI for picking up the phone at midnight, and Liz Jordan for granting me sodas after hours to keep me up during deadlines.
I would like to thank Emanuele Santos, Linh Ha, Hoa Nguyen, Anh Vo, Huong Nguyen, Trang Pham, and Khiem Nguyen for supplying me with many home-made meals. They were delicious; without them, I would have lived on instant noodles alone. I would like to thank Gustavo and Manu for our many running sessions together. I also want to thank Luciano Barbosa, Lauro Lins, Karane Vieira, Tiago Quieroz, and the Brazilian "gang" for their great friendship. I would like to thank D.A., N., L., C., V., T., H., and S.B. for being there for me. I would like to thank my buddy Wang Bung Bu simply because he is my best friend. With all my heart, I would like to express my deepest gratitude to my family: my dad, my mom, my sister, and recently my brother-in-law and my niece. My parents have supported me unconditionally with love and much more for the past 26 years. I hope this dissertation can somehow express my appreciation, though only a fraction of it, to them.

CHAPTER 1

INTRODUCTION

The recent popularization of commodity multicore CPUs and multi-GPU units has opened many alternatives for the design and implementation of efficient visualization and data-analysis algorithms. To exploit this hardware, however, one typically needs to manually design a solution that distributes the processing work among CPU cores and GPU units, an approach that can be very cumbersome in terms of development time. Combining the processing power of CPUs and GPUs in a more automatic and extensible fashion requires algorithms to encapsulate tasks into high-level modules capable of being executed on two or more kinds of fundamentally different processing elements. This raises an immediate challenge, since CPUs and modern GPUs behave very differently from one another. This heterogeneous integration also requires an underlying infrastructure to control execution and data transfer between modules efficiently.
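To make the encapsulation idea concrete, a module can carry one kernel per kind of processing element and let a scheduler choose among them at run-time. The sketch below is purely illustrative: the `TaskModule` class, the toy round-robin scheduler, and the element names are ours, not an existing API.

```python
# Hypothetical sketch: a task encapsulated as a high-level module with one
# implementation per kind of processing element, so a scheduler can run it on
# whichever CPU core or GPU is available.

class TaskModule:
    def __init__(self, name, impls):
        self.name, self.impls = name, impls   # e.g., {"cpu": fn, "gpu": fn}

    def run_on(self, kind, data):
        return self.impls[kind](data)

def schedule(tasks, data, elements):
    """Toy scheduler: hand each task to the next element (round-robin) that
    has a matching implementation, falling back to the CPU kernel."""
    out = data
    for i, task in enumerate(tasks):
        kind = elements[i % len(elements)]
        kind = kind if kind in task.impls else "cpu"
        out = task.run_on(kind, out)
    return out

# A module with two equivalent kernels; the "gpu" one merely stands in here.
scale = TaskModule("scale", {"cpu": lambda xs: [2 * x for x in xs],
                             "gpu": lambda xs: [2 * x for x in xs]})
offset = TaskModule("offset", {"cpu": lambda xs: [x + 1 for x in xs]})
print(schedule([scale, offset], [1, 2, 3], ["gpu", "gpu", "cpu"]))  # [3, 5, 7]
```

The point of the sketch is the fallback logic: a module with only a CPU kernel still runs correctly on a machine whose free elements are GPUs, which is the coherency problem a real heterogeneous scheduler must solve.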
In order to assist developers on these new architectures, libraries and frameworks have been created that abstract native threads and GPU stream processors during execution, and that decompose programs into task graphs that can be mapped efficiently to different processing elements at run-time. Among these are Intel Threading Building Blocks [42], AMD Stream SDK [5], HMPP [17], and StarPU [6]. StarPU and AMD Stream SDK are also capable of scheduling executions on both CPUs and GPUs simultaneously. On the other hand, existing pipeline models used in scientific and visualization applications, such as AVS [86], SCIRun [73], and VTK-based systems such as ParaView [50], VisIt [19], VisTrails [8], and DeVIDE [11], already decompose computing tasks into independent modules and transparently manage data communication inside a pipeline. However, these systems were designed around the assumption of a homogeneous processing model based only on multicore CPUs. Even the current parallel implementations fail to take advantage of the available shared-memory architecture to increase performance. Unfortunately, these systems cannot utilize existing parallel frameworks directly because of differences in execution models. Current parallel frameworks make use of task-graph execution, which is a workflow-based architecture, whereas visualization systems are mostly dataflow-based. Since workflows impose more constraints on pipeline topology and execution mechanism than dataflows do, simply transforming the pipelines in these systems into task graphs will not work. Moreover, most current systems also assume, to a large extent, that the data can be maintained in memory, or worse, that multiple copies of the data can be stored across different dataflow modules. These assumptions cause scalability problems when dealing with large data. Streaming data structures are often used in this case, though most current systems do not support them.
Besides the execution model, a full revision of the streaming data structures and algorithms is therefore necessary in order to extend current systems to use the processing power of both CPUs and GPUs.

1.1 Dissertation Objectives

In this dissertation, we present a parallel dataflow architecture that treats all processing elements in a heterogeneous system, including multicore CPUs and GPU units, as first-class computational units, along with a number of large-scale visualization experiments that have helped us determine crucial requirements for such a framework:

Parallel Constructs: the framework supports building pipelines with all three fundamental types of parallelism, task-, data-, and pipeline-parallelism, and provides methods for specifying on demand which of them should be used.

Heterogeneous Deployment: pipelines built with the framework should be able to run on machines with different configurations, ranging from a single machine with multiple CPUs and/or GPUs to a cluster of machines.

Streaming: large datasets are supported through the use of streaming data structures with all types of parallelism.

Extensible API: the framework API has to be flexible enough that new hardware configurations can be added without difficulty.

As part of this dissertation, we have built HyperFlow, a parallel dataflow architecture for efficient parallel execution on heterogeneous systems. It includes an adaptive execution scheduler that assigns tasks to different processing elements such as CPUs and GPUs, thus minimizing load-balancing problems without added effort from pipeline developers. HyperFlow also supports classical parallel constructs such as Map/Reduce templates, among others.

1.2 Dissertation Structure

The rest of the dissertation is organized as follows. Chapter 2 reviews current parallel programming environments and visualization systems. It also provides a background on dataflow architecture.
Next, Chapters 3, 4, and 5 describe our experiments with large-scale visualization in three applications: streaming simplification of tetrahedral meshes, interactive rendering of large unstructured grids, and visualization with MapReduce, respectively. Chapter 6 presents our initial design of a parallel dataflow architecture for multicore machines. Chapter 7 introduces HyperFlow, our design for systems with multiple CPUs and GPUs. Then, Chapter 8 discusses options for extending HyperFlow to distributed environments. Finally, Chapter 9 concludes the dissertation along with future work.

1.3 Remark

Many of the results presented in this dissertation have been published in journals and/or conference proceedings. In particular, Chapter 3, Chapter 4, and Chapter 6 are based on [87], [88], and [89], respectively.

CHAPTER 2

BACKGROUND AND RELATED WORK

Parallel rendering and visualization has been the focus of a large body of work [75, 83, 4, 62, 61]; even a cursory review is beyond the scope of this chapter, so we point the reader to [18, 47] for a complete introductory survey. In this chapter, we review design choices of dataflow architectures and provide an overview of current visualization systems.

2.1 Dataflow Architecture

Following the pioneering work of Haber and McNabb [38], many leading visualization systems (e.g., [51, 86, 73, 19, 8, 11]) have been based on the notion of a dataflow network, often called a pipeline. Modules (the nodes of the network) are processing elements, while connections between them represent data dependencies. The inputs of a module may or may not be independent of one another, so a level of coordination is required between the modules and their data dependencies during execution. In the simplest form, this coordination is implemented statically in the algorithm at the module level.
However, as dataflow systems become more complex, this coordination is typically assigned to a separate component called the executive. The executive is responsible for all coordination, instructing each module when and on which data it should operate. In this approach, algorithms can be implemented purely in terms of the computational task required, without consideration for the topology of the pipeline.

2.1.1 Executive Policy and Scope

Executives can be classified based on their updating scope, centralized vs. distributed, and their policy, pull vs. push (or demand- vs. event-driven, as stated in [78]). A centralized executive operates globally and is connected to every module so that it can track all changes to the network as well as handle all requests and queries in the pipeline. This gives the system control over the entire execution network, thereby allowing execution to be distributed easily across multiple machines. However, centralized control leads to high coordination overhead and reduces the scalability of such systems, especially when operating on complex pipelines. In the distributed executive scheme, each module contains its own executive, which is connected only to its immediate upstream and downstream neighbors. This approach does not suffer the same scalability issues as the centralized scheme, but its myopic view of the network makes distributing resources or implementing more complicated execution strategies a much more difficult problem. With respect to the updating policy, in a pull model a module is executed only when one of its outputs is requested. If a module's input is not up-to-date, it will demand, or pull, data from the corresponding upstream module. Therefore, only modules that are needed for the actual computation will be executed, avoiding wasteful calculations.
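This demand-driven behavior can be sketched in a few lines as a recursive upstream request; the code below is a hedged illustration (the `Module` class and its fields are our invention, not any system's API):

```python
# Minimal sketch of a pull (demand-driven) executive: requesting a module's
# output first pulls any stale upstream modules recursively, so only modules
# on the demanded path ever execute.

class Module:
    def __init__(self, fn, upstream=()):
        self.fn, self.upstream = fn, list(upstream)
        self.output, self.stale = None, True

    def pull(self):
        for m in self.upstream:          # demand data from upstream first
            if m.stale:
                m.pull()
        if self.stale:
            self.output = self.fn([m.output for m in self.upstream])
            self.stale = False
        return self.output

source = Module(lambda _: list(range(5)))
double = Module(lambda ins: [2 * x for x in ins[0]], [source])
unused = Module(lambda ins: sum(ins[0]), [source])   # never demanded

print(double.pull())   # executes source, then double; 'unused' never runs
```

Note how `unused` stays stale: in a pull model, a module attached to the network but not on the requested path contributes no work.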
However, since a request requires all upstream modules to be updated before computation can start, the module that initiates the request is locked for longer than its own computing time. In contrast, in a push model modules are updated as soon as an event occurs, e.g., a change in an input. All modules that depend on the updated module's output are then computed. This results in a shorter locking period, equivalent to the computation time of each module. Nevertheless, this approach has a key disadvantage: redundancy. Modules compute and generate data even when their results will not be used. Figure 2.1 illustrates the classification of executives outlined here.

Figure 2.1: Executive Classification. Executives are classified based on their updating scope and policy: (a) a centralized push model handling a new event; (b) a centralized pull model handling a data request; (c) a distributed push model handling a new event; (d) a distributed pull model handling a data request.

2.1.2 Parallelism

There are three fundamental types of parallelism in dataflow execution. The first is task-parallelism. This scheme allows independent branches of the network to run in parallel, each using its own thread. Task-parallelism requires pipelines with many independent modules in order to maximize concurrent execution. Figure 2.2a shows an example where modules A and B are independent of C; therefore, A and B are allowed to run concurrently with C. The second form of parallelism is pipeline-parallelism: pipelines are partitioned into subnetworks operating on different tasks. During execution, each subnetwork is allocated its own threads, and each is computed concurrently on streams of the data.
However, pipeline-parallelism is only applicable to separable or streamable data structures. In addition, the partitioning of the pipeline into subnetworks must be proportional to the computing resources (e.g., the number of cores) in order to maximize performance. This is not scalable, since the division is typically done by hand. Figure 2.2b is an example of streaming pipeline-parallelism between two modules. The last common form of parallelism is data-parallelism. In this scheme, the data are divided into independent pieces to be processed in parallel. At the end of execution, the pieces are collected and merged into a single result. This is an extremely useful method for large datasets that must be processed out-of-core, and it is highly scalable to clusters of machines. However, this performance gain is maintained only if there is a single merge operation at the end; otherwise, data must be redistributed multiple times, causing a large degradation in performance. In practice, a single merge is the usual case for rendering pipelines. Figure 2.2c illustrates the use of data-parallelism, where the pipeline AB is duplicated multiple times to work on separate pieces of the streaming data.

Figure 2.2: Dataflow Parallelism. The three fundamental types of parallelism in the dataflow architecture: (a) task-parallelism, (b) pipeline-parallelism, and (c) data-parallelism.

2.1.3 Dataflow vs. Workflow

The term workflow is sometimes used interchangeably with dataflow in the literature. Though both refer to a set of processes with data paths describing how data should move among them, they have two main differences, in pipeline topology and in execution strategy.

Pipeline Topology: a workflow is required to be a directed acyclic graph (DAG), while a dataflow is not.
As in Figure 2.3a, a pipeline expressed as a dataflow is allowed to have feedback connections, e.g., from the Viewer to the Fetch module. In a workflow, the same feedback connection has to be implemented and triggered manually by the developer at run-time.

Execution Strategy: in a workflow, the execution order of modules is usually fixed based on the dependencies of the data connections. Each time a workflow is executed, every module is executed exactly once. This is considered a task-driven approach, in contrast to the data-driven approach of dataflow, where only modules that receive input data, either as the result of an upstream execution or of programmatic commands, are triggered for execution. Figure 2.3b illustrates this difference. On the left, the execution order for the shown workflow is fixed: Fetch, Filter A, Filter B, and Viewer. On the right, depending on whether modules Filter A and Filter B produce any output data, the Viewer module will or will not be executed accordingly.

Figure 2.3: Dataflow vs. Workflow. Two main differences between dataflow and workflow: (a) pipeline topology and (b) execution strategy.

2.2 Visualization Systems

Kitware's Visualization ToolKit (VTK) [51] is considered the de facto standard visualization API and is used by thousands of researchers and developers around the world. Its underlying architecture has undergone substantial modifications over the years to allow for the handling of larger and more complex datasets (e.g., time-varying, AMR, unstructured, high-order). However, its execution model has a number of limitations with respect to parallelism.
First, it only supports concurrent execution at the module level, which means that no matter how many threads an algorithm is implemented to run with, the network as a whole has to be updated serially (that is, one module at a time). Moreover, by default only a small subset of VTK, such as the filters inherited from vtkThreadedImageAlgorithm, can run multithreaded. This limits performance and scalability for many applications. While much effort [54, 9] has been put into extending VTK's execution pipeline over the years, it is still challenging and problematic to build highly parallel pipelines using the existing pipeline infrastructure. This is partly due to the fact that VTK uses a demand-driven execution model, while some pipelines, in particular those that need streaming and/or time-dependent computations, fit more naturally in an event-driven architecture. Ahrens et al. [2, 1] proposed parallel visualization techniques for large datasets. Their work included various forms of parallelism, including task-parallelism (concurrent execution of independent branches in a network), pipeline-parallelism (concurrent execution of dependent modules but with different data), and data-parallelism. Their goals included support for streaming computations and for time-varying datasets. Their pioneering work led to many improvements to the VTK pipeline execution model and served as the basis for ParaView [50]. ParaView is designed for data-parallelism only, where pipeline elements can be instantiated more than once and executed in parallel on independent pieces of the data. Specifically, a typical parallel execution in ParaView involves a pipeline instantiating one or more processes based on the input data. ParaView then relies on MPI to distribute the processes to cluster nodes to finish the computation.
However, ParaView does not support a hybrid MPI/multithreaded model; on multicore architectures, MPI is also used for creating multiple processes on the same node. This may impose substantial overhead due to the additional expense of interprocess communication. A related system is VisIt [19], an interactive parallel visualization and graphical analysis tool for viewing scientific data. Though the pipeline topology in VisIt is fixed and relatively simple, it introduced the notion of contracts between modules, which are data structures that allow modules to negotiate with one another on how to transfer data. This method has proven to be very useful for optimizing many operations on the dataflow network. Recent versions of the VTK pipeline incorporate many ideas that were originally developed for VisIt and ParaView. DeVIDE [11] is a cross-platform software framework for the rapid prototyping, testing, and deployment of visualization and image processing algorithms. It was one of the first systems to fully support a hybrid execution model with both demand- and event-driven updating policies. However, it does not target the parallel, high-performance aspects of pipeline execution. Table 2.1 summarizes the feature set of the systems discussed above.

2.3 Heterogeneous Programming Frameworks

There exist several proposals of programming languages and environments that allow combined computation on CPUs and GPUs, such as the Brook [12] programming language, designed for general-purpose computation using GPUs [72]. At the time of its proposal, GPUs were programmable mainly through graphics APIs. Brook introduced a streaming architecture that exposes the GPU as a parallel programmable unit, composed of kernels (programs) that operate on streams (data). Similar to Brook, the Scout [64] programming language was designed with specific data constraints and a target application in the field of visualization and data exploration.
Only regular grid structures are supported in Scout, and they are directly mapped to OpenGL textures. Glift [55] is an abstraction layer and generic template library for representing complex random-access data structures (e.g., stacks, quadtrees, and octrees) on the GPU. Glift components were designed to support data access both on the CPU and the GPU using a streaming computation model. With the introduction of the CUDA [70] programming language by NVIDIA (and subsequently OpenCL [53]), the full infrastructure of the GPU was made available for general-purpose computation in a single system. Support for heterogeneous systems is described in several works. Instead of forcing applications to be rewritten into a streaming-processing format, HMPP [32, 10] is a programming environment that enhances CUDA programs with a set of compiler directives, tools, and a software runtime that supports multicore CPU parallel programming. HMPP offers a dynamic linking mechanism that allows the GPU to be used as a co-processor in a way that preserves legacy code. The CUDASA programming language [68] extends CUDA to allow distributed computation on multiple GPUs in a local system or across machines on a network. CUDASA introduces abstraction layers with different levels of parallelism. In particular, a bus layer allows manual task delegation to CPU and GPU cores in arbitrary combinations, while a network layer adds support for clusters of multiple interconnected computers. Shared memory across network nodes relies on the MPI Remote Memory Access (RMA) protocol. However, CUDASA lacks support for improved load balancing and distributed memory accesses.

Table 2.1: Summary of Current Visualization Dataflow Systems
(columns: Executive Scope, Policy; Parallelism Async., Data, Task, Pipeline; Streaming Memory, Update)
VTK: Dist., Pull; X; Shared, Serial
ParaView: Dist., Pull; X, X; Dist., Serial
VisIt: Dist., Pull; X, X; Dist., Serial
DeVIDE: Cent., Pull/Push; none; Serial
SCIRun: Cent., Push; X; Shared, N/A
VisTrails: Cent., Pull; none; N/A
FastFlow [3] introduces a low-level programming framework based on lock-free queues (no memory fences) to support streaming applications. It allows multiparty communication on multicore architectures, supports cyclic graphs of thread streaming networks, and is specially designed for fine-grained computations. Some recent work discusses more advanced scheduling algorithms for controlling parallel applications. Jimenez et al. [46] introduce a predictive user-level scheduler for a CPU/GPU heterogeneous system. They discuss and evaluate several scheduling algorithms that improve the performance of functions that have previously been profiled during execution. Unfortunately, current programming environments either make use of task graphs or a similar DAG-based approach in their execution models, or provide only a low-level scheduling API, and thus cannot be extended to current dataflow systems without major modifications.

CHAPTER 3

STREAMING SIMPLIFICATION OF TETRAHEDRAL MESHES

In this chapter, we present our experiments with streaming data structures and algorithms in the context of unstructured-grid simplification. Simplification techniques have been a major focus of research for the past decade due to the increasing size and complexity of geometric data. Scientific simulations and measurements from fluid dynamics and partial differential equation solvers have produced datasets that are too large to visualize with current hardware. Thus, approximations that maintain a volumetric mesh are necessary to achieve the level of interactivity required for proper analysis through visualization techniques such as isosurfacing or direct volume rendering. This experiment also shows the importance of streaming in handling large-scale datasets, guiding us toward a generalized streaming data structure that can optimize both performance and memory footprint.
Although significant work has been done on simplifying triangle meshes, relatively little has been done for tetrahedral meshes. Most of the work on tetrahedral simplification falls into two categories: edge-collapse methods and point-sampling methods. These algorithms assume that the entire mesh can be loaded into main memory. However, due to the high memory overhead of storing the mesh connectivity in addition to the geometry, there are limits on the size of the datasets that can be simplified in this manner. We present an algorithm that streams the data from disk through memory and performs the simplification on a localized portion of the entire mesh. Our streaming algorithm requires only one pass to simplify the entire mesh; thus, the layout of the mesh is of great importance for producing high-quality results. We reorder the tetrahedral cells and store them on disk using a streaming tetrahedral mesh format. This format provides concurrent access to coherently ordered vertices and tetrahedra. It also minimizes the duration that a vertex remains in-core, which limits the memory footprint of the simplification. Our tetrahedral simplification incrementally works on overlapping portions of the mesh in-core (see Figure 3.1). We use the quadric error metric to perform a series of edge collapses until a target decimation is reached. This results in a simplification algorithm that can efficiently simplify extremely large datasets. In addition, through the use of carefully optimized algorithms, linear solvers, and data structures, we show that significant improvements in speed and stability can be achieved over previous techniques.

Figure 3.1: Streaming Simplification. Streaming simplification performed on a tetrahedral mesh ordered from bottom to top. The portion of the mesh that is in-core at each step is shown in green.
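The shape of a one-pass streaming simplification can be sketched as follows. This is a toy illustration of the idea, not our implementation: the edge-selection and collapse policies below are deliberately trivial placeholders, whereas the real algorithm uses quadric errors and the finalization information in the streaming format.

```python
# Toy sketch of one-pass streaming simplification: tetrahedra stream into an
# in-core buffer, and edges are collapsed whenever the buffer exceeds a
# budget. pick_edge/collapse are hypothetical placeholder policies.

def stream_simplify(cells, budget, pick_edge, collapse):
    """cells: iterable of 4-tuples of vertex ids, in streaming order."""
    buffer = []
    for cell in cells:
        buffer.append(cell)
        while len(buffer) > budget:
            u, w = pick_edge(buffer)
            buffer = collapse(buffer, u, w)
    return buffer

def pick_edge(buf):
    # Placeholder: collapse the lexicographically smallest edge of the
    # newest cell (a real system would pick the cheapest quadric error).
    c = sorted(buf[-1])
    return c[0], c[1]

def collapse(buf, u, w):
    # Merge w into u and drop tetrahedra that became degenerate; the newest
    # cell always contains both u and w, so the buffer is guaranteed to shrink.
    out = []
    for cell in buf:
        cell = tuple(u if v == w else v for v in cell)
        if len(set(cell)) == 4:
            out.append(cell)
    return out

result = stream_simplify([(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7)],
                         budget=2, pick_edge=pick_edge, collapse=collapse)
print(result)   # decimated to [(0, 1, 2, 3)]
```

The essential property the sketch preserves is that memory use is bounded by the budget regardless of mesh size, which is exactly what the streaming layout in the next section is designed to make effective.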
3.1 Streaming Mesh Layout

Traditional object file formats consist of a list of vertices followed by a list of polygonal or polyhedral elements that are defined by indexing into the vertex list. Dereferencing such a mesh, i.e., accessing vertices via their indices, requires the whole vertex list to be in memory, since elements are generally not assumed to reference vertices in any particular order; an element can arbitrarily reference any vertex in the list. Furthermore, streaming such a mesh implies buffering all vertices before the first element is encountered in the stream. A logical progression for large meshes is to store them in a streaming mesh representation that interleaves the vertices and elements and stores them in a "compatible" order. This representation allows a vertex to be introduced (added to an in-core active set) when needed and finalized (removed from the active set) when no longer used. Isenburg and Lindstrom [43] provide a streaming mesh format for triangle meshes that extends the popular OBJ file format. We use their format with straightforward additions to handle tetrahedral meshes. This includes extending the vertices to four values ⟨x, y, z, f⟩ that represent the position in 3D space and a scalar value. In addition, we provide a new element type for a tetrahedral cell that indexes four vertices. The format allows us to finalize a vertex when it is no longer in use by referencing it with a negative index into the vertex list, which remains backward compatible with the OBJ format. Figure 3.2 shows an ASCII example of the streaming tetrahedral format. An important consideration with streaming meshes is the ability to analyze the efficiency of a mesh layout. Isenburg and Lindstrom developed several techniques for visualizing properties of a mesh to determine the effectiveness of its layout. We use similar tools to measure tetrahedral mesh quality. An important property is the front width, the maximum number of concurrently active vertices.
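The finalization rule can be made concrete with a small reader for this format. The code below is an illustrative sketch (function and variable names are ours, not the system's implementation) that tracks the in-core active set while decoding; the toy stream follows the streaming example of Figure 3.2(b), which omits the scalar value f.

```python
# Hedged sketch of reading the streaming tetrahedral format: a 'v' line
# introduces a vertex into the active set; in a 'c' line, a negative index
# both dereferences relative to the most recent vertex (-1 is the newest)
# and finalizes that vertex, shrinking the in-core front.

def read_streaming_mesh(lines):
    verts, cells, active = [], [], set()
    width = span = 0
    for line in lines:
        parts = line.split()
        if not parts or parts[0] == "#":
            continue
        if parts[0] == "v":
            verts.append(tuple(float(x) for x in parts[1:]))
            active.add(len(verts))                 # vertex indices are 1-based
        elif parts[0] == "c":
            cell = []
            for tok in parts[1:]:
                idx = int(tok)
                if idx < 0:                        # relative reference
                    idx = len(verts) + 1 + idx
                    active.discard(idx)            # negative index finalizes
                cell.append(idx)
            cells.append(tuple(cell))
        if active:
            width = max(width, len(active))
            span = max(span, max(active) - min(active) + 1)
    return verts, cells, width, span

stream = """v -1.0 0.0 0.0
v 0.0 0.0 -1.0
v 0.0 1.0 0.0
v 0.0 0.0 0.0
v 1.0 0.0 0.0
c 1 2 3 4
v 0.0 0.0 1.0
c -5 3 4 5
c -5 3 4 6
c -4 -3 -2 -1""".splitlines()
verts, cells, width, span = read_streaming_mesh(stream)
print(cells, width, span)   # four tetrahedra; front width 6 for this tiny mesh
```

On a mesh this small the whole vertex list is active at once, but on a well-laid-out large mesh the active set stays a small sliding window, which is what bounds the simplifier's memory footprint.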
An active vertex is one that has been introduced but not yet finalized. The width of a streaming mesh gives a lower bound on the amount of memory required for dereferencing (and thus processing) the mesh.

Figure 3.2: Mesh Format. Layout diagrams with cell indices on the horizontal axis and vertex indices on the vertical axis. A vertex is active from the time it is introduced until it is finalized, as indicated by the horizontal lines. (a) A standard format for tetrahedral meshes based on the OBJ format. (b) A streaming format that interleaves vertices with cells. Vertex finalization is indicated with a negative index (i.e., relative to the most recent vertex).

(a) standard:
v -1.0 0.0 0.0
v 0.0 0.0 -1.0
v 0.0 1.0 0.0
v 0.0 0.0 0.0
v 0.0 0.0 1.0
v 1.0 0.0 0.0
c 1 2 3 4
c 1 3 4 5
c 2 3 4 6
c 3 4 5 6

(b) streaming:
v -1.0 0.0 0.0
v 0.0 0.0 -1.0
v 0.0 1.0 0.0
v 0.0 0.0 0.0
v 1.0 0.0 0.0
c 1 2 3 4
v 0.0 0.0 1.0
c -5 3 4 5
c -5 3 4 6
c -4 -3 -2 -1

Another important property of the mesh is the front span, which measures the maximum index difference (plus one) of the concurrently active vertices. A low span allows a faster implementation, because optimizations can be applied that perform best when the span is similar to the width. A low span makes for efficient indexing in the file format, bounds the width, and, for simplification purposes, ensures that vertices do not become stagnant in the buffer, which would prevent all incident edges from being collapsed. The efficiency of our layout depends on quantifying these properties; thus, we analyze different layout techniques so that we can choose the most streamable layout for a given dataset. One simple mesh layout is to sort the vertices along a spatial direction, in particular one that crosses the most tetrahedra. Wu and Kobbelt [91] use this technique for triangle mesh simplification.
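The core of this spatial-sort technique, sorting the vertices along an axis, reindexing the cells, and interleaving each cell right after its last-needed vertex, can be sketched in-memory as follows (a simplified, hypothetical version with 0-based indices; the actual pipeline performs these steps out-of-core, as described next):

```python
# Simplified in-memory sketch of a spatial-sort mesh layout. The real
# pipeline runs out-of-core with mapping files; this captures only the logic.

def spatial_layout(verts, cells, axis=0):
    # Sort vertices along one spatial axis and build an old -> new index map.
    order = sorted(range(len(verts)), key=lambda i: verts[i][axis])
    remap = {old: new for new, old in enumerate(order)}
    # Reindex cells, then sort them by their largest (i.e., last-needed) vertex.
    new_cells = sorted((tuple(remap[i] for i in c) for c in cells), key=max)
    # Interleave: emit each cell right after its largest-index vertex appears.
    stream, ci = [], 0
    for new_idx, old_idx in enumerate(order):
        stream.append(("v", verts[old_idx]))
        while ci < len(new_cells) and max(new_cells[ci]) == new_idx:
            stream.append(("c", new_cells[ci]))
            ci += 1
    return stream

verts = [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
         (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
cells = [(0, 1, 2, 3), (0, 2, 3, 4)]
layout = spatial_layout(verts, cells)
for kind, item in layout:
    print(kind, item)
```

Each cell appears in the stream as soon as its last vertex does, which is precisely what lets a reader introduce and retire vertices in a single pass.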
This can be accomplished for large meshes by performing an out-of-core sort on the vertices [58] and writing them into a new file. An additional file is created to hold a mapping from the old ordering to the new one. Next, the tetrahedral cells are written to a new file and reindexed according to the mapping file. A sort is then performed on the file containing the tetrahedral cells, based on the largest index of each cell. Finally, the vertex file and the cell file are read simultaneously and interleaved into a new file by writing each cell immediately after the vertex corresponding to the cell's largest index has been written. Spatial layouts work especially well for meshes that have a dominant principal direction. Other techniques may be desirable if the mesh does not have a dominant principal direction, as with a sphere. One approach to handling this type of data is to use a bricking method similar to the one proposed by Cox and Ellsworth [29], in which the vertices are ordered into a fixed number of small cubes for better sequential access. A similar approach is to arrange the vertices along a Lebesgue space-filling curve (i.e., z-order), which provides better sequential access in the average case. This arrangement can be generated by creating an out-of-core octree [22] of the vertices and traversing them in order. The interleaved mesh is then written to a file in the same manner described above for spatial sorting. The layouts produced by bricking and z-order traversal are similar: when streamed, they provide a more contiguous portion of the mesh on average, but the front width and span are typically much larger than with spatial sorting. Another approach used for laying out the mesh on disk is spectral sequencing.
Spectral sequencing finds the first nontrivial eigenvector (the Fiedler vector) of the mesh's Laplacian matrix and was shown by Isenburg and Lindstrom [43] to be very effective at producing low-width layouts. They provide an out-of-core algorithm for generating this ordering for streaming triangle meshes, which we have extended to handle tetrahedra. This method works particularly well for curvy triangle meshes, but tetrahedral meshes are generally less curvy and more compact. Still, in most cases this ordering results in the lowest width, which is ideal for minimizing memory consumption. A final approach is to create a topological layout, which starts at a vertex on the boundary and grows out to neighboring vertices. To grow in a contiguous manner, we use a breadth-first traversal with optimizations to improve coherence [43]. Instead of the traditional FIFO queue, we assign priority using three keys. First, the oldest vertex on the queue is preferred, just as in standard breadth-first algorithms. However, if multiple vertices were added to the queue at the same time, a second and third key are used to achieve a more coherent order. The second key is boolean, and gives preference to a vertex if it is the final one in a cell that has not been processed. The third key prefers the vertex that was most recently put on the queue, which is more likely to be adjacent to the last vertex. These sort keys guarantee a layout that is compact [43], such that runs of vertices are referenced by the next cell, and runs of cells reference the previous vertex (e.g., as in Figure 3.2b). In practice, this traversal can be accomplished out-of-core by breaking the mesh into pieces. Using this approach, we were able to minimize the front span of the datasets in all of our experimental cases. This is ideal because having a span and width that are similar allows us to exploit the optimization techniques described in the simplification algorithm. Yoon et al.
[94] propose a layout based on local mesh optimizations that reduce cache misses. This approach, applied to tetrahedra, would also give a compact representation, as with our breadth-first layout, and could be used to achieve similar results. Table 3.1 shows the front width and span for five different datasets (their statistics are specified in Table 3.2) produced by the layout techniques described above. Spectral sequencing proves to be the superior choice when low width, and thus memory efficiency, is required. Breadth-first layouts are not as memory-efficient, but as we will see they can be processed quickly. Note that unlike [45] we do not require a face-connected order and we do not require that each vertex be finalized before it can be inserted into the in-core mesh. This allows us to avoid local reordering of the tetrahedra or the vertices.

Table 3.1: Analysis of Mesh Layout

             Spatial Sort       Z-Order             Spectral           Breadth-First
  Dataset    Width   Span       Width   Span        Width   Span       Width   Span
  Torso      3,118   20,784     7,256   122,174     2,894   13,890     5,528   6,370
  Fighter    3,894   110,881    9,382   215,697     3,916   28,638     16,629  19,523
  Rbl        2,814   5,270      10,232  371,269     2,291   21,764     3,206   3,495
  Mito       19,876  33,524     10,202  642,550     6,745   44,190     10,552  11,498
  SF1        16,898  65,921     48,532  1,958,212   12,851  131,152    30,258  33,378

Table 3.2: Dataset Statistics

  Dataset    Vertices     Tetrahedra
  Torso      168,930      1,082,723
  Fighter    256,614      1,403,504
  Rbl        730,273      3,886,728
  Mito       972,455      5,537,168
  SF1        2,461,694    13,980,162

3.2 Tetrahedral Simplification

3.2.1 Quadric-Based Simplification

To achieve high-quality approximations, we use the quadric error metric proposed by Garland and Zhou [36]. This metric measures the squared (geometric and field) distances from points to hyperplanes spanned by tetrahedra. The volume boundaries are preserved using a similar metric on the boundary faces and by weighting boundary and interior errors appropriately.
The generalized quadric error allows the flexibility of representing field data by extending the codimension of the manifold. Given a scalar function f : D ⊂ R^3 → R defined over a domain D represented by a tetrahedral mesh, we can represent the vertex at each point p as (x_p, y_p, z_p, f_p), which can be considered a point on a 3D manifold embedded in R^4. Thus, by extending the quadric to handle field data, the algorithm intrinsically optimizes the field approximation along with the geometric position. The quadric error of collapsing an edge to a single point is expressed as the sum of squared distances to all accumulated incident hyperplanes, and can in n dimensions be encoded efficiently as a symmetric n × n matrix A, an n-vector b, and a scalar c. It is sufficient to add these terms component-wise to combine the quadric errors of two collapsed vertices. Finding the point x that minimizes this measure amounts to solving a linear system of equations Ax = b. Once x is computed, we test whether collapsing to this point causes any tetrahedra to flip [20], i.e., changes the sign of their volume, in which case we disallow the edge collapse. Because A is not necessarily invertible, it is important to choose a linear solver that is numerically stable.

3.2.2 Streaming Simplification

Combining streaming meshes with quadric-based simplification, we introduce a technique for simplifying large tetrahedral meshes out-of-core. We base our streaming algorithm on [45] and [43], but make several general improvements and provide a list of optimizations that, compared to a less carefully engineered implementation, result in dramatic speed improvements. First, unless already provided with streaming input, we convert standard indexed meshes and optionally reorder them for improved streamability. Then, portions of the streaming mesh are loaded incrementally into a fixed-size main memory buffer and are simplified using the quadric-based method.
Once the in-core portion of the mesh reaches the user-prescribed resolution, simplified elements are output, e.g., to disk or to a downstream processing module. Thus input and output happen virtually simultaneously as the mesh streams through the memory buffer (see Figure 3.3).

Figure 3.3: Streaming Simplification Buffer. Example of a buffer moving across the surface of a tetrahedral mesh sorted from left to right. As new tetrahedra are introduced, the tetrahedra that have been in-core the longest are removed from memory.

To ensure that the final approximation is the desired size, two control parameters have been added: target reduction and boundary weight. Target reduction is the ratio between the number of tetrahedra in the output mesh and the number of tetrahedra in the original mesh. Alternatively, this parameter can be expressed as a target tetrahedral count for the resulting mesh. The boundary weight prevents the shape of the mesh from changing throughout the simplification. We use a fixed value of 100 times the maximum field value in the data for the weight in our experiments. As in [36], the scalar field is always normalized to the geometry range before actual simplification begins. Because we only keep a small set of tetrahedra in memory, we do not know the entire mesh connectivity. Thus, we keep the boundary between the set of tetrahedra that are currently in memory and all remaining elements (tetrahedra that have not yet been read or that have already been output) fixed to ensure that the simplified mesh is crack-free. We call this boundary the stream boundary; it consists entirely of faces from the interior of the mesh. We can identify the stream boundary faces as they are read in by utilizing the finalization information stored in the streaming mesh. A face of the current in-core mesh is part of the stream boundary if none of its three vertices are finalized. We disallow collapsing any edge that has one or both vertices on the stream boundary.
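The stream-boundary test follows directly from the finalization information. A minimal sketch of the rule above, with illustrative names (vertices are assumed to be identified by integer ids):

```python
def stream_boundary_faces(faces, finalized):
    """A face of the in-core mesh lies on the stream boundary when none
    of its three vertices have been finalized yet; such faces separate
    the in-core region from unread or already-output tetrahedra."""
    return [f for f in faces if not any(v in finalized for v in f)]

def edge_collapsible(u, v, boundary_vertices):
    """Disallow collapses with one or both endpoints on the stream
    boundary, which keeps the simplified mesh crack-free."""
    return u not in boundary_vertices and v not in boundary_vertices
```

The boundary vertex set is simply the union of the vertices of all stream-boundary faces, recomputed incrementally as finalization tags arrive.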
Due to the stream boundary constraint, if we read in one portion of the mesh, simplify it, and write it out to disk in one phase, our output mesh will have unsimplified areas along the stream boundaries. This results in an approximation that is over-simplified in some areas and under-simplified in others. To avoid this problem, we follow the algorithm proposed by Wu and Kobbelt [91]. Their algorithm consists of a main loop in which READ, DECIMATE, and WRITE operations are performed in each iteration. The READ operation introduces new elements until it fills the buffer. Next, DECIMATE simplifies the elements in the buffer until either the target ratio is reached or the buffer size is halved. Finally, in their method the WRITE operation outputs the elements with the largest error to file. As in [43], we improve upon [91, 45] by ensuring, to the extent possible, that the relative order among input elements is preserved in the output stream, with the caveat that tetrahedra whose vertices have not yet been finalized (i.e., are on the input stream boundary) must be delayed. Therefore, the output typically retains the coherence of the input. An error-driven output criterion, on the other hand, can considerably fragment the buffer and split off small "islands" that remain in the buffer for a long time without being eligible for simplification, and thus unnecessarily clog the stream buffer. Furthermore, such an output stream generally has poor stream qualities, which affects downstream processing. The front width (i.e., number of active vertices), for example, is particularly important for tetrahedral meshes, for which each active vertex affects on average four times as many elements as in a triangle mesh, and therefore more adversely affects memory requirements and
processing delay. Furthermore, we relax the requirement that the stream of tetrahedra (triangles) advance in a face (edge) adjacent manner [45], as this is of no particular value to us, and we allow any coherent ordering of mesh elements. Finally, using the more streamable layouts and the simpler streaming mesh formats and API from [43], we gain considerably in performance and memory usage over [45].

3.3 Implementation Details and Optimizations

Since our method processes different mesh portions of bounded size sequentially, a statically allocated data structure is more efficient than dynamic allocations, which collectively increase the memory footprint. The size of this buffer should be O(width), where width is the front width of the input mesh. In practice, we are able to simplify even a 14 million tetrahedra dataset using only 20 MB of RAM (see Section 3.4). In our implementation, we extended Rossignac's corner table [76] for triangle meshes to tetrahedral meshes. The original corner table requires two fixed-size arrays V and O indexed by corners (vertex-cell associations) c, where V[c] references the vertex of c and O[c] references the "opposite" corner of c. In the case of tetrahedral simplification, the most common query is to find all tetrahedra around a vertex. Therefore, we replace the O array with a link table L of equal size, which joins together all corners of a given vertex in a circular linked list. We additionally store with each vertex an index to one of its corners. We store the mesh internally as three fixed-size arrays of vertices, tetrahedra, and corners (i.e., the links L). Each vertex contains a pointer to one corner and the quadric error information (A, p, e) using a parameterization that explicitly represents the vertex geometry and scalar data in p. In Figure 3.4, only p and vidx are stored out-of-core while the rest are computed on-the-fly. This data structure uses 69 bytes per vertex and 37 bytes per tetrahedron.
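The link-table variant of the corner table can be sketched as follows. This is an illustrative reconstruction, not the dissertation's actual code: indices are 0-based and the arrays are built in one pass, whereas the real implementation uses fixed-size, statically allocated arrays.

```python
def build_corner_links(tets, nverts):
    """Build the arrays described above: V[c] is the vertex of corner c
    (corners 4t..4t+3 belong to tetrahedron t), L[c] links all corners
    of a vertex into a circular list, and corner_of[v] points at one
    corner of vertex v."""
    V = []
    for tet in tets:
        V.extend(tet)
    L = [0] * len(V)
    corner_of = [-1] * nverts
    for c, v in enumerate(V):
        if corner_of[v] < 0:
            corner_of[v] = c
            L[c] = c                       # singleton circular list
        else:
            L[c] = L[corner_of[v]]         # splice new corner into the circle
            L[corner_of[v]] = c
    return V, L, corner_of

def tets_around(v, L, corner_of):
    """The most common query in tetrahedral simplification: enumerate
    all tetrahedra incident on vertex v by walking the circular list."""
    c0 = corner_of[v]
    out, c = [c0 // 4], L[c0]
    while c != c0:
        out.append(c // 4)
        c = L[c]
    return out
```

Walking the circular list touches each incident corner exactly once, so the query costs O(valence) with no opposite-corner bookkeeping at all.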
The quadrics for the tetrahedra are calculated when we read in a new set of tetrahedra, and are then distributed to the vertices. For each finalized vertex, we compute boundary quadrics for all incident boundary faces (if any) that have no other finalized vertices, and distribute these quadrics to the boundary vertices. Garland and Zhou [36] use a greedy edge collapse method and maintain a priority queue for the edges ordered by quadric error. Forsaking greediness, we obtain comparable mesh quality by using a multiple choice randomized approach [91, 30] with eight candidates per collapse. There are several advantages of using randomized selection. One is that we no longer need a priority queue or an explicit representation of edges. Instead, an edge can be found by randomly picking a tetrahedron and then randomly selecting two of its vertices. Another advantage is that the randomized technique can be further accelerated by exploiting information readily available through our quadric representation.

  VERTEX
    float[10]  A        quadric matrix
    float[4]   p        position and field value
    float      e        quadric error at p
    int        idx      index to input/output stream
    int        corner   vertex-to-corner index
    bool       deleted
    bool       written

  TETRAHEDRON
    int[4]     vidx     vertex indices
    int[4]     link     corner links
    bool       deleted
    int        idx      position in input stream

Figure 3.4: QEM Data Structure. Data structures used for our quadric-based simplification.

Table 3.3 illustrates the performance of the randomized approach over the priority-queue based approach. Both algorithms simplified the models to 10% of their original resolutions. The randomized results were collected as an average of 3 runs on the same input with different random seeds. Intuition would suggest that the results generated by the priority-queue approach would have smaller error because it always picks the edge with the smallest error to collapse at each step. However, this is not optimal in many cases.
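The multiple-choice selection itself is only a few lines. A minimal sketch with eight candidates per collapse (the names and the edge_cost callback are illustrative; in the real system the cost would be the quadric error of the candidate collapse):

```python
import random

def pick_edge_multiple_choice(tets, edge_cost, k=8, rng=random):
    """Multiple-choice randomized selection: draw k candidate edges by
    picking a random tetrahedron and two of its vertices, and keep the
    cheapest; no priority queue or explicit edge list is needed."""
    best, best_cost = None, float('inf')
    for _ in range(k):
        tet = rng.choice(tets)             # random tetrahedron
        u, v = rng.sample(tet, 2)          # two distinct vertices of it
        cost = edge_cost(u, v)
        if cost < best_cost:
            best, best_cost = (u, v), cost
    return best, best_cost
```

Because each draw is independent, candidates naturally spread over the whole in-core region instead of clustering in the currently cheapest neighborhood.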
A series of minimal edge collapses can lead to a locked state where the edges with the smallest error cannot be collapsed without flipping tetrahedra. In practice, the problem is more likely to occur in the homogeneous regions of a dataset, which contain edges with zero error. The priority-queue approach will greedily simplify such a region as much as possible first, leaving it in a locked state. As a result, many neighboring regions (containing edges with small error) may not get simplified because doing so would violate the flipping constraint. On the other hand, the randomized approach tends to spread the simplification over the whole model, resulting in significantly fewer locked states. This explains why the maximum error of the Torso dataset using the priority-queue approach was very high compared to the randomized approach. In general, the randomized approach produces quality comparable to a priority queue, while demonstrating superior performance. Before we output a tetrahedron, we must ensure that its four vertices are output first. Once a vertex is output, we mark it as not being collapsible in future iterations. To enhance performance, we use a lazy deletion scheme, where all vertices and tetrahedra to be deleted are initially only marked. At the end of each WRITE phase, we make a linear pass through all vertices and tetrahedra to remove marked elements and compact the arrays. Since we do not allocate additional memory during simplification, keeping deleted vertices and tetrahedra does not increase the memory footprint. Table 3.3: Priority-Queue (P) vs.
Randomized (R) Approach

  Model     Approach   Mean    RMS     Max      Time
  Torus     P          0.08%   0.13%   0.56%    0.31s
            R          0.06%   0.10%   0.59%    0.13s
  SPX       P          1.37%   2.56%   22.60%   0.52s
            R          1.53%   2.19%   15.96%   0.30s
  Torso     P          0.00%   0.02%   1.12%    55.98s
            R          0.00%   0.00%   0.01%    24.11s
  Fighter   P          0.06%   0.13%   2.28%    79.03s
            R          0.08%   0.16%   3.46%    29.89s

Storing large datasets on disk in ASCII format can adversely affect performance because converting ASCII numbers to an internal binary representation can be surprisingly slow. We have extended the ASCII stream format in Figure 3.2 to a binary representation. Because our program spends over 30% of its time on disk I/O, this optimization results in a nonnegligible speedup. For example, on the SF1 dataset it improves overall performance by 17%. Since we only maintain a small portion of the mesh in-core, we require a way of mapping global vertex indices to in-core buffer indices. Usually a hash map is used, but with our low-span breadth-first mesh layout, this hash map can be replaced by a fixed-size array indexed using modular arithmetic. We move occasional high-span vertices that cause "collisions" in this circular array to an auxiliary hash map [43]. With all of the optimizations described above, our simplifier can run at high speed without any dynamic memory allocation at run time. A performance and memory summary can be found in Table 3.4. The results are for simplifying the Fighter dataset (1.4 M tetrahedra) completely in-core with 1 GB of RAM.

Table 3.4: Implementation Improvements

  Improvement                       Time (sec)   Memory (MB)
  Initial Implementation            212.95       310
  QEF + CG                          132.68       282
  Multiple Choice + Corner Table     38.35       130
  4D Normal, Floats                  35.99        78
  Final / Binary I/O                 22.10        78

3.3.1 Numerical Issues

Great care has to be taken when working with quadric metrics to ensure numerical stability while retaining efficiency.
To minimize quadric errors, a positive semidefinite system of linear equations must be solved, for which numerically accurate but heavy-duty techniques such as singular value decomposition (SVD) [56, 52] and QR factorization [48] have been proposed. However, even constructing, representing, and evaluating quadric errors require that special care be taken. We here outline an efficient representation of quadric error functions that leads to numerically stable operations, improved speed, and less storage. The standard representation [36] of quadric errors is parameterized by (A, b, c), and is evaluated as

  Q(x) = x^T A x - 2 b^T x + c    (3.1)

Typically the three terms in this equation are "large," but sum to a "small" value, resulting in a loss of precision. One can show that the roundoff error is proportional to ||A|| ||x||^2. Furthermore, in addition to this quadric information, it is common to store the vertex position (and field value) p that minimizes Q separately. Lindstrom [57] suggested an alternative representation that removes this redundancy:

  Q(x) = (x - p)^T A (x - p) + e    (3.2)

where A is the same as in the standard representation and

  A p = b
  p^T A p + e = c

This parameterization (A, p, e) provides direct access to the minimum quadric error e and the minimizer p. This not only saves memory but also results in a more stable evaluation of Q, as the roundoff error is now proportional to ||A|| ||x - p||^2, and we are generally interested in evaluating Q(x) near its minimum p as opposed to near the origin. Another significant benefit of this representation is that it provides a lower bound e_i + e_j on the minimum of Q_i + Q_j when collapsing two vertices v_i and v_j. Using randomized edge collapse [91], we can thus often avoid minimizing Q_i + Q_j if the lower bound already exceeds the smallest quadric error found so far.
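Equation (3.2) and the e_i + e_j lower bound translate directly into code. A minimal numpy sketch (function names are illustrative, and lstsq stands in for the CG solver of Section 3.3.1):

```python
import numpy as np

def eval_quadric(A, p, e, x):
    """Lindstrom's parameterization: Q(x) = (x - p)^T A (x - p) + e,
    evaluated near the minimizer p for numerical stability."""
    d = x - p
    return float(d @ A @ d) + e

def maybe_collapse(Ai, pi, ei, Aj, pj, ej, best_so_far):
    """e_i + e_j is a lower bound on the combined quadric minimum, so a
    candidate can often be rejected without solving any linear system."""
    if ei + ej >= best_so_far:
        return None                      # pruned: cannot beat current best
    A = Ai + Aj                          # quadrics add component-wise
    b = Ai @ pi + Aj @ pj                # since b_i = A_i p_i
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    q = eval_quadric(Ai, pi, ei, x) + eval_quadric(Aj, pj, ej, x)
    return x, q
```

With randomized candidate selection, the early-out on the first line rejects most candidates before any factorization work is done.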
In this chapter, this representation is used explicitly to speed up the algorithm, reduce in-core storage, and improve numerical robustness, rather than as a means of compressing quadric information for out-of-core storage. Our quadric representation also lends itself to an efficient and numerically stable iterative linear solver. To handle ill-conditioned matrices A, we have adapted the well-known conjugate gradient (CG) method [37] to work on semidefinite matrices (see Figure 3.5). As in SVD, we provide a tolerance κ_max on the condition number κ(A), and preempt the iterative solver when all remaining conjugate directions are deemed "insignificant" for reducing Q. The effect of this is similar to zeroing small singular values in SVD.

  SOLVE(A, x, b, n, κ_max)
    r = b - Ax                               (negative gradient of Q)
    p = 0
    for k = 1, ..., n                        (iterate up to n times)
      s = r^T r
      if s = 0 then exit                     (solution found?)
      p = p + r/s                            (update search direction)
      q = A p
      t = p^T q
      if s·t ≤ tr(A)/(n·κ_max) then exit     (insignificant direction?)
      r = r - q/t                            (update gradient)
      x = x + p/t                            (update solution)

Figure 3.5: Conjugate Gradient Solver. Conjugate gradient solver for positive semidefinite systems Ax = b. On input, x is an estimate of the solution, n = 4 is the number of linear equations, and κ_max is a tolerance on the condition number; tr(A) is the trace of A.

Using our quadric representation, we conveniently initialize the CG solver with the guess x = (p_i + p_j)/2. Whereas CG methods are typically used to quickly approximate solutions to very large systems using only a few iterations, our method can be considered "direct" in the sense that we solve for each of the n (i.e., 4) components, although in the Krylov basis rather than in the Euclidean basis as done by the Cholesky method. We never require more than n iterations, and only in the rank-deficient case do we perform fewer than n iterations.
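Figure 3.5 translates almost line-for-line into executable form. The sketch below reads the termination guard as s·t ≤ tr(A)/(n·κ_max); that reading of the garbled source, and the function name, are our assumptions:

```python
import numpy as np

def solve_psd(A, b, x, kappa_max=1e4):
    """Conjugate gradient for positive semidefinite A x = b, following
    Figure 3.5. x is the initial guess; the loop stops early when the
    remaining search direction is insignificant relative to the
    condition-number tolerance, similar to zeroing small singular
    values in SVD."""
    n = len(b)
    r = b - A @ x                    # negative gradient of Q
    p = np.zeros(n)
    for _ in range(n):               # never more than n iterations
        s = r @ r
        if s == 0:
            break                    # solution found
        p = p + r / s                # update search direction
        q = A @ p
        t = p @ q
        if s * t <= np.trace(A) / (n * kappa_max):
            break                    # insignificant direction
        r = r - q / t                # update gradient
        x = x + p / t                # update solution
    return x
```

On a full-rank system the loop runs all n iterations and returns the exact solution; on a rank-deficient system it exits early, leaving the components in the null-space directions at their initial-guess values.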
A final word of caution: the computation of generalized quadrics presented in [36] computes A = I - N, whose null space null(A) = range(N) is spanned by the tetrahedron, via subtraction, which due to roundoff error can leave A indefinite, i.e., with one or more negative eigenvalues. This causes Q to have a "saddle" shape with no defined minimum, and can cause numerical instability. Instead, we compute a 4D "volume normal" using a generalization of the 3D cross product to 4D:

  n = det | e1   e2   e3   e4  |
          | v1x  v1y  v1z  v1s |
          | v2x  v2y  v2z  v2s |
          | v3x  v3y  v3z  v3s |

where e_i is the i-th column of the 4 × 4 identity matrix and v1, v2, and v3 are three vectors from one of the tetrahedron's vertices to the others. The outer product of this normal with itself gives a positive semidefinite A for a tetrahedron. Because of our attention to numerical stability, with κ_max = 10^4 we are able to use single-precision floating point throughout our simplifier, even for the largest meshes. Since the 4D quadric information requires 15 scalars per vertex, this saves considerable memory and improves speed.

3.4 Results

3.4.1 Stability and Error Analysis

We have described a CG method for solving the linear equations that arise when minimizing the quadric error. The choice of solver is important because degenerate tetrahedra and regions of near-constant field value can cause singularity. For testing purposes, we constructed a dataset by subdividing a tetrahedron into hundreds of smaller tetrahedra by linearly interpolating the vertices and field data. Obviously, these small tetrahedra all lie on the hyperplane spanned by the original tetrahedron; thus they are solutions to the linear equations. We then picked a solution as a target for each collapsed edge. We experimented with several linear solvers, as shown in Table 3.5. The results are for simplifying the Fighter dataset (1.4 M tetrahedra) completely in-core. We experimented with Cholesky factorization, the least-squares QR factorization [37], and CG.
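Returning to the 4D volume normal above: it can be computed by expanding the formal determinant along its first row of basis vectors. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def volume_normal_4d(p0, p1, p2, p3):
    """Generalized cross product in 4D: expand the determinant with
    first row e1..e4 along that row. The result is orthogonal to the
    three edge vectors of the tetrahedron, and its outer product with
    itself is a positive semidefinite quadric matrix by construction."""
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    v = np.array([p1 - p0, p2 - p0, p3 - p0], dtype=float)  # 3 x 4 edge matrix
    n = np.empty(4)
    for i in range(4):
        cols = [j for j in range(4) if j != i]
        n[i] = (-1) ** i * np.linalg.det(v[:, cols])         # signed cofactor
    return n
```

Orthogonality holds because substituting any edge vector for the basis row yields a determinant with a repeated row, which is zero, so no subtraction of near-equal quantities is needed.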
Cholesky with pivoting provides stable solutions for solving Ax = b if A is positive definite. However, in order to solve this linear system when A is rank-deficient (the semidefinite case), we must solve the under-constrained least-squares problem. Unfortunately, solving this using normal equations requires us to be able to perform Cholesky factorizations on matrices with arbitrary dimensions less than 4 × 4. This defeats the purpose of optimizing our simplification for working only with 4 × 4 matrices. Thus, our implementation uses SVD to handle rank-deficient matrices detected by the Cholesky method. As a result, this method yields the fastest solution when A is positive definite, but it becomes slower than CG when handling rank-deficient matrices. Like Cholesky, QR does not by itself handle the problem of rank-deficiency. Nevertheless, the implementation for solving the least-squares problem using QR factorization is much simpler than using Cholesky with normal equations, since it does not explicitly require a general representation of matrices with arbitrary dimensions. We use the least-squares version of QR as suggested in [37]. Using all of our solvers, we were able to simplify our subdivided tetrahedron to its original shape with small error in both field and geometry. To compare the correctness of their solutions, we recorded the 2-norm residuals of the computed solutions to 15,000 linear systems using all three methods while simplifying the Fighter dataset. Denote by e_qr, e_ch, and e_cg the sum of all 2-norm residuals of computed solutions x̂ (i.e., ||b - A x̂||_2) using QR, Cholesky, and CG, respectively. Given e_avg = (e_qr + e_ch + e_cg)/3, the relative errors of solutions computed by QR, Cholesky, and CG are r_{qr,ch,cg} = e_{qr,ch,cg}/e_avg. Our experiment found r_qr to have the smallest error at 96%, followed by r_ch at 99%.

Table 3.5: Performance of Linear Solvers

  Dataset   QR       Cholesky   CG
  SPX       0.57s    0.30s      0.35s
  Blunt     16.47s   9.88s      7.72s
  Fighter   46.11s   30.57s     29.89s
Our CG approach obtained a comparable result, with r_cg at 105%. Overall, QR gives the best solution in terms of error, but it is approximately twice as slow as the others. On the other hand, while Cholesky is a good choice for both efficiency and accuracy, its implementation is quite complicated. Therefore, we chose CG over the others for its simplicity and performance while still producing solutions of near-optimal quality. To estimate the error in the simplified mesh we use two different methods. The first method is to measure the error on the surface boundary of the mesh using the tool Metro [23]. The second method is to measure the error in the field data using an approach similar to that of Cignoni et al. [21]. We sample the domain of the simplified dataset at points inside the domain of the original one. These points are not only vertices of the input but also interpolated points inside each tetrahedron. The error is then computed from the differences between their scalar values. Our implementation differs in that it ignores points outside the domain of the simplified mesh, since these points become part of the surface boundary error. Table 3.6 shows these measured error estimations. Field error percentages are relative to the range of the field, and surface error percentages are relative to the bounding box diagonal. Figure 3.6 shows an example of the quality of the resulting field and Figure 3.7 shows an example of the quality of the resulting surface.
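The field-error measurement can be sketched as follows. The sampling and point-location machinery are abstracted behind callbacks, which is our simplification of the procedure described above (names are illustrative):

```python
import numpy as np

def field_error(sample_points, f_orig, f_simpl):
    """Field-error estimate in the spirit of Section 3.4.1: sample the
    simplified field at points of the original domain and compare
    scalar values. f_simpl is assumed to return None for points
    outside the simplified domain; those points are skipped, since
    they are charged to the surface-boundary error instead."""
    diffs = []
    for p in sample_points:
        fs = f_simpl(p)
        if fs is None:
            continue                       # outside simplified domain
        diffs.append(abs(f_orig(p) - fs))
    diffs = np.array(diffs)
    return diffs.max(), np.sqrt((diffs ** 2).mean())   # max and RMS error
```

Dividing the two returned values by the field range gives the percentage errors reported in Table 3.6.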
Table 3.6: In-Core and Streaming Simplification Results

            Number of    Number of     Time     RAM    Field Error         Surface Error
  Dataset   Input Tets   Output Tets   (sec)    (MB)   Max (%)  RMS (%)    Max (%)  RMS (%)
  In-core
  Torso     1,082,723    108,271       14.88    57     0.012    0.000884   0.120    0.013360
  Fighter   1,403,504    140,348       15.46    78     4.845    0.280266   0.038    0.000352
  Rbl       3,886,728    388,668       59.10    212    0.020    0.002574   0.025    0.000055
  Mito      5,537,168    553,711       47.13    285    0.045    0.007355   0.001    0.000008
  SF1       13,980,162   1,398,013     191.69   709    5.626    0.262335   0.036    0.000811
  Streaming
  Torso     1,082,723    108,270       19.07    20     0.019    0.000879   0.161    0.001226
  Fighter   1,403,504    140,345       20.87    20     4.549    0.299081   0.102    0.000470
  Rbl       3,886,728    388,671       95.54    20     0.025    0.002833   0.036    0.000089
  Mito      5,537,168    553,716       73.58    20     0.045    0.007614   0.044    0.000009
  SF1       13,980,162   1,398,012     246.15   20     23.287   0.472869   0.169    0.004150
  SF1       13,980,162   1,398,017     244.42   50     5.834    0.315583   0.050    0.001394

Figure 3.6: Simplified Scalar Result. Volume rendered images of the Fighter dataset show the preservation of scalar values. The original dataset is shown on the left (1,403,504 tetrahedra) and the simplified version is shown on the right (140,348 tetrahedra).

Figure 3.7: Simplified Surface Result. Views of the mesh quality on the surface of the Rbl dataset. The original dataset is shown on the left (3,886,728 tetrahedra) and the simplified version is shown on the right (388,637 tetrahedra).

3.4.2 Performance

All timing results were generated on a 3.2 GHz Pentium 4 machine with 2.0 GB RAM. For the streaming experiments, we limit the operating system to only 64 MB RAM by using the Linux bootloader. Table 3.6 shows the results of simplifying a collection of datasets to 10% of their original size using our streaming algorithm and the same implementation optimized for in-core execution. Laying out the meshes in a stream-efficient manner is a one-time operation and can be performed in-core for all the datasets we tested.
Even the largest dataset (14 million tetrahedra) required only about 40 minutes to lay out using our breadth-first approach. We were able to achieve streaming simplification with only a slight increase in time and error compared to an in-core implementation. The streaming technique has the advantage of a much smaller memory footprint: with our algorithm, we were able to simplify 14 million tetrahedra while using only 20 MB of RAM. Due to the large size of the SF1 dataset, certain parts of the stream could not be simplified accurately, resulting in a larger error. By increasing the memory slightly, the quality of the simplification is greatly improved and approaches the in-core quality. This behavior is not due to the randomization algorithm, since the larger buffer size always produces better quality outputs across different random seeds. Instead, the quality is improved because each set of candidates has a wider range from which to select its targets. Consider a set of expensive edges that spans more elements than fit in the buffer: any edge collapse will result in a large error no matter how random the target is. However, if we increase the buffer size such that the buffer is larger than this set of expensive edges, randomized edge collapses will also take less expensive edges into account, thus improving the quality of the output.

3.4.3 Large-Scale Experiment

Although extremely large meshes exist, it is difficult to obtain unclassified access to them. To stress our algorithm on current PC hardware and to demonstrate the scalability of the technique, we performed streaming simplification on a huge fluid dynamics dataset on a 3.0 GHz Xeon machine. The dataset was created from sampling slices of a 2048^3 simulation and consists of over one billion tetrahedra that use 18 GB of disk space when stored in the binary format. The tetrahedra were laid out in the order in which they were sliced, which is comparable to sorting by axis.
An in-core approach to simplifying this dataset would require a machine with 64 GB of RAM. We were able to simplify the data to 12 million tetrahedra (1.2%) using only 829 MB of RAM in 10 hours on our test machine. Because error estimation was not possible with the entire dataset in-core, we computed the field error on subregions of the mesh separately to verify the results. In the regions that were measured, there was 10.85% maximum and 1.57% RMS field error. Figure 3.8 shows an isosurface extracted from the original and simplified fluid dynamics dataset.

Figure 3.8: Large-Scale Simplified Simulation Dataset. Isosurfaces of the fluid dynamics dataset. A very small portion of the isosurfaces is shown for the original dataset of over a billion tetrahedra (left) and the simplified dataset of only 12 million tetrahedra (right). The isosurfaces are shown up close using flat shading to enhance the details of the resulting surface. Our algorithm allows extensive simplification (to almost 1% of the original size) with negligible numerical error (1.57% RMS) for the fluid dynamics dataset, which is too large to simplify with conventional approaches.

3.5 Proposition

The use of streaming meshes for simplification reduces the memory footprint of a large mesh considerably. Moreover, the streaming format is general enough to work on meshes of other types, e.g., hexahedra. It would be interesting to take advantage of the coherent streaming format to perform other visualization techniques quickly in an out-of-core fashion, or simply to improve performance by increasing coherency with a lower memory footprint.

CHAPTER 4

INTERACTIVE RENDERING OF LARGE UNSTRUCTURED GRIDS

In this chapter, we present iRun, a system for interactively volume rendering large unstructured grids on commodity PC clusters. This work helps us determine crucial requirements for a large-scale data-parallel visualization system.
Since iRun is implemented on top of VTK, many limitations of current visualization systems, such as VTK, are also revealed to us. Interactive rendering of arbitrarily large datasets is a fundamental problem in computer graphics and scientific visualization, and a critical capability for many real applications. Interactive visualization of large datasets poses substantial challenges (see the survey by Silva et al. [81]). Current systems for rendering large datasets employ many of the elements proposed by Clark [24], including the use of hierarchical spatial data structures, level-of-detail (LOD) management, hierarchical view-frustum and occlusion culling, and working-set management (geometry caching). Systems along the lines of the one envisioned by Clark have been used effectively in industrial applications for scenes composed primarily of polygonal geometry. For more complex scenes, such as those composed of tetrahedral elements, the problem is not as well studied and can be more difficult for several reasons. First, rendering tetrahedra is not natively supported by current graphics hardware; thus, efficient algorithms for handling this type of data robustly are required. Second, tetrahedra must be projected in visibility order to accurately composite transparency. This requires special care to traverse the out-of-core hierarchy in the correct order. Finally, visibility techniques such as occlusion culling are not practical because the opacity of the volume is controlled by the user. iRun addresses these issues while still maintaining interactivity on extremely large datasets. The visualization pipeline may be broken down into four major stages: retrieval from storage, processing in main memory, rendering in the Graphics Processing Unit (GPU), and display on the screen. The performance of each of these stages is limited by several potential bottlenecks (e.g., disk or network bandwidth, main memory size, GPU triangle throughput, and screen resolution).
iRun uses out-of-core data management and speculative visibility prefetching to maintain a working-set of the geometry in memory. Our rendering approach uses GPU-assisted volume rendering with a dynamic set of tetrahedra and an out-of-core LOD traversal. Finally, our system was implemented in VTK [51] and allows distributed rendering for high-resolution displays. Using a single commodity PC, we show how our system can render datasets consisting of 36 million tetrahedra while maintaining interactive frame rates.

4.1 Interactive Out-of-Core Volume Rendering

iRun interactively renders large unstructured grids in several stages, as illustrated in Figure 4.1. First, a preprocessing step prepares the data for hierarchical traversal. Second, our algorithm interactively traverses the out-of-core data structure and keeps a working-set (geometry cache) of the geometry in memory by using visibility culling, speculative prefetching, and LOD management. Finally, the contents of the geometry cache are rendered using a hardware-assisted visibility sorting algorithm. This approach scales to multiple PCs for improved image quality or larger displays. The iRun design was inspired by iWalk [28]; however, volume rendering on unstructured grids requires a more complex solution. Due to the many similarities between the two approaches, we defer comparisons to Section 4.3.3.

Figure 4.1: iRun Overview. (a) The user interacts with the UI by changing the camera, from which our system predicts future camera positions. (b) Our octree traversal algorithm selects the octree nodes that are in the frustum, determines the appropriate LOD, and arranges the nodes in visibility order. (c) The geometry cache keeps the working-set of triangles while a separate thread is used to fetch additional geometry from disk.
(d) Finally, the geometry is sent to the renderer for object-space sorting followed by hardware-assisted image-space sorting and compositing, which is performed using a modified version of the HAVS algorithm.

4.1.1 Preprocessing

iRun utilizes an efficient and fully automatic preprocessing algorithm that operates out-of-core for large datasets. We begin by extracting the unique triangles that compose the tetrahedral mesh. This is done out-of-core by writing the four triangle indices of each tetrahedron into a file and using an external sort to arrange the indices from first to last. The resulting file contains adjacent duplicate entries for faces that are on the inside of the mesh and unique entries for boundary triangles. A cleanup pass is performed over the triangle index file to remove duplicates and to create a similar file that contains a boundary predicate for each triangle. iRun employs an out-of-core, hierarchical octree [77] in which each node contains an independent renderable set of vertices and triangles, similar to iWalk. Because the number of vertices in a tetrahedral mesh is generally much smaller than the connectivity information, for simplicity we keep the vertex array in-core while creating our out-of-core hierarchy. This allows us to keep the global indexing of the triangles throughout our preprocessing, which facilitates filtering in the final stages. Our octree is constructed by reading the triangle index and boundary predicate files in blocks and adding the triangles one-by-one to the out-of-core octree structure. While adding triangles to the octree, we perform triangle-box intersection to determine the one or more nodes that contain the triangle. If the triangle spans multiple nodes, we replicate it and insert it into each. This can lead to the insertion of a triangle into a node where any or all of the triangle vertices lie outside the node.
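The sort-based duplicate detection described above can be illustrated with a small in-memory sketch (the real system uses an external sort over files; `classify_faces` is a hypothetical helper name, not iRun's code). Each tetrahedron emits its four faces as sorted vertex-index triples; after sorting, faces that appear twice are interior and faces that appear once are on the boundary:

```python
from itertools import groupby

def classify_faces(tets):
    """Emit each tetrahedron's four faces as sorted vertex-index
    triples, sort them, then classify: faces appearing twice are
    interior, faces appearing once lie on the mesh boundary."""
    faces = []
    for a, b, c, d in tets:
        for tri in ((a, b, c), (a, b, d), (a, c, d), (b, c, d)):
            faces.append(tuple(sorted(tri)))
    faces.sort()  # stands in for the external sort used out-of-core
    boundary, interior = [], []
    for key, group in groupby(faces):
        (boundary if len(list(group)) == 1 else interior).append(key)
    return boundary, interior

# Two tetrahedra sharing face (1, 2, 3): that face is interior,
# the remaining six faces are on the boundary.
tets = [(0, 1, 2, 3), (1, 2, 3, 4)]
b, i = classify_faces(tets)
```

The same pass that detects duplicates can also write out the boundary predicate file, since both pieces of information fall out of a single scan over the sorted faces.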
When a node reaches a preset capacity, the node is split into octants and the triangles are redistributed. The result of this phase is a directory structure representing the octree and a hierarchy structure file that contains the octree structure information (see [28]). A subsequent LOD propagation stage works by populating a parent node with a subset of the nonboundary triangles from each of its children. Again, the triangles are replicated as they move up the octree to create self-contained nodes. The subset is selected based on a dynamic LOD strategy introduced in [14]. The idea is to sample the full-resolution geometry to achieve a subset that minimizes the image error. We choose to select the largest triangles to maximize node coverage and thereby minimize image error for that node. To ensure that there are no holes in the LOD structure, the boundary triangles are simplified to a reduced representation (e.g., 5%) of the original and inserted into every intersecting node except the leaf nodes. Finally, a cleanup pass on the octree is performed, which inserts the referenced vertices into each octree node and clips the triangles to the bounding box of the node.

4.1.2 Out-of-Core Dataset Traversal

iRun uses an out-of-core traversal algorithm that has been extensively optimized for volume rendering. For each camera received from the user interface, we apply frustum culling on the octree to find all nodes that are visible in this view and mark them as visible nodes. For polygonal meshes, visibility algorithms such as view-frustum and occlusion culling are important for maintaining interactive frame rates (see [26] for a recent survey). For volume rendering, occlusion culling is not feasible due to transparency; therefore, only view-frustum techniques [33, 27] are deployed here. Depending on whether or not the user is interacting with iRun, the LOD manager decides which nodes are to be rendered next.
Next, everything is passed to the visibility sorter (see Figure 4.2), and only the nodes that are already in the geometry cache are used for rendering, while the others are put onto the fetching queue. iRun also performs camera prediction for each frame by linearly extrapolating the previous camera parameters. All of the nodes selected by the predicted camera are also put on the fetching queue. The LOD management in iRun is a top-down approach working in a priority-driven manner. Given a priority function P(C, N) that assigns a priority to every node N of the octree with respect to the camera C, the LOD process starts by adding the root R to a priority queue with the key P(C, R). Next, iterations of replacing the highest-priority node of the queue with its children are repeatedly executed until such a refinement would exceed a predefined number of triangles (Figure 4.3). In our experiments, we use two different priority functions to control the LOD of iRun. The first is a Breadth-First-Search (BFS) based function that is used during user interaction: P_BFS(C, N) = <l, d>, where l is the depth of N and d is the distance of the bounding box of N to the camera C. In this case, each node's priority is primarily determined by how far it is from the root and subsequently, for nodes on the same level, by its distance to the camera. Briefly, our goal is to evenly distribute the data of the octree on the screen to improve the overall visualization of the dataset. An example of this scheme is shown in Figure 4.4. While interacting with iRun, the target frame rate can be achieved by setting a limit on the maximum number of triangles rendered in the current frame.
This number is calculated based on the number of triangles that were rendered, and the rendering time, for the previous frame.

Figure 4.2: iRun Visibility Sorter. Pseudo-code for the visibility sorter based on camera distance in iRun.

  SORT(Camera C, Node R, List SortedNodes)
    if R is not culled
      if R.Selected
        SortedNodes.Push(R)
      else if R.HasChildren
        SC = R's children sorted in ascending order of distance to C
        for each node N in SC
          SORT(C, N, SortedNodes)

Figure 4.3: iRun LOD Traversal. Pseudo-code for the octree traversal algorithm.

  LOD(Camera C, Node R, PriorityFunction P, int MaxTri)
    PriorityQueue Q
    R.Selected = true
    Q.Push(P(C, R), R)
    Total = 0
    while !Q.Empty()
      Node N = Q.Pop()
      if N.HasChildren
        TC = the total number of triangles in N's children
        if (Total - N.NumberOfTriangles + TC) < MaxTri
          Total = Total - N.NumberOfTriangles + TC
          N.Selected = false
          for i = 0 to 7
            if N.Children[i] is neither empty nor culled
              N.Children[i].Selected = true
              Q.Push(P(C, N.Children[i]), N.Children[i])

Figure 4.4: A Snapshot of iRun Refining the LOD. The image on the left is rendered as the user would see it from the current camera position. On the right is a bird's-eye view of the same set of visible nodes. Different colors indicate different levels-of-detail. The geometry cache is limited to only 64 MB of RAM in this case.

For increased image quality at a given view, iRun automatically increases the LOD, using as much memory as possible, when interaction stops. Since we want to cover as much of the screen as possible, a priority function reflecting the projected screen area is necessary for the LOD. We define P_area(C, N) = A, where A is the projected area of the bounding box of N on the screen. However, this function can easily be replaced by any other heuristic approach, such as one reflecting node scalar ranges, transfer functions, etc., to achieve the best image quality.
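The priority-driven refinement of Figure 4.3 can be made runnable as a small sketch. This is an illustrative version, not iRun's implementation: the `Node` class is a toy, the running total is seeded with the root's own triangle count, and the BFS priority uses only the depth component (a fixed stand-in replaces the camera distance):

```python
import heapq

class Node:
    """Toy octree node: a triangle count, children, and a depth."""
    def __init__(self, tris, children=(), depth=0):
        self.tris = tris
        self.children = list(children)
        self.depth = depth
        self.selected = False

def lod_select(root, max_tris, priority):
    """Top-down, priority-driven refinement (after Figure 4.3):
    repeatedly replace the best node in the queue with its children
    while the running triangle total stays under the budget."""
    heap = []
    root.selected = True
    total = root.tris                       # triangles currently selected
    heapq.heappush(heap, (priority(root), id(root), root))
    while heap:
        _, _, n = heapq.heappop(heap)
        if not n.children:
            continue
        tc = sum(c.tris for c in n.children)
        if total - n.tris + tc < max_tris:  # refinement fits the budget
            total += tc - n.tris
            n.selected = False
            for c in n.children:
                c.selected = True
                heapq.heappush(heap, (priority(c), id(c), c))
    return total

# BFS-style priority <l, d>: depth first, camera distance second
# (distance is a constant stand-in here).
kids = [Node(4, depth=1), Node(4, depth=1)]
root = Node(10, children=kids)
total = lod_select(root, max_tris=12, priority=lambda n: (n.depth, 0.0))
```

The `id(node)` tie-breaker keeps the heap from ever comparing `Node` objects directly, which is a common idiom when priorities collide.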
This approach, however, could raise a problem when the user begins interacting again and the geometry cache is already full. The next frame would be displayed incompletely, since a lower LOD is not available and the higher LOD is too large to be rendered at an interactive rate. To overcome this problem, we do not flush the current data on the screen when increasing the LOD, but only when the camera has changed. The trade-off in image quality is insignificant because the amount of memory used by this data is usually very small (e.g., 1%) compared to the total memory of the geometry cache.

iRun separates the fetching of nodes from the building of sets of visible nodes. If the fetching queue is empty, the fetching thread will wait until new requests arrive. Otherwise, it will read the requested node from disk and move it to the geometry cache. If the geometry cache is full, cached data will be flushed using a least-recently-used scheme. As a result, the target frame rate of iRun is guaranteed to stay the same throughout user interaction, since the rendering process never stalls while waiting for IO. This improves interaction and does not introduce any significant degradation in image quality. Because none of the visible geometry is culled by occlusion, the amount of geometry shared between frames is very similar: every frame will have at least as much data as the previous frame. In the worst case, there will be at most one level of difference in the LOD of the frame because of our BFS-based priority function.

4.1.3 Hardware-Assisted Rendering

Leveraging the performance of graphics processing units (GPUs) for direct volume rendering has received considerable attention (for a recent survey, see [82]). Shirley and Tuchman [80] introduced the Projected Tetrahedra (PT) algorithm, which splits tetrahedra into GPU-renderable triangles based on the view direction.
For correct compositing, the neighborhood information of the original mesh is used to determine an order dependence. More recent work by Weiler et al. [90] performs ray-casting on the mesh by storing the neighbor information on the GPU and marching through the tetrahedra in rendering passes. As with PT, the hardware ray caster requires the neighbor information of the tetrahedra for correct visibility ordering. A more extensible approach was introduced by Callahan et al. [15], which operates on the triangles that compose the mesh and requires no neighbor information. This makes immediate-mode rendering, working-set management, LOD, and preprocessing much simpler than they would be using the tetrahedra directly. Because of this, the algorithm has since been extended to perform dynamic LOD [14] as well as progressive volume rendering [13] for large datasets.

An important consideration for the interactive rendering of unstructured grids is the choice of volume rendering algorithm. Three aspects need to be considered: speed, quality, and the ability to handle dynamically changing data. By using a hardware-assisted volume rendering system, we can address both the speed and quality issues. In our system, we use the Hardware-Assisted Visibility Sorting (HAVS) algorithm of Callahan et al. [15] because it combines speed and quality and, most importantly, does not require topological information and thus handles dynamic data. HAVS operates by sorting the geometry in two phases. The first is a partial object-space sort that runs on the CPU. The second is a final image-space sort that runs on the GPU using a fixed-size A-buffer, called the k-buffer, implemented with fragment shaders. The HAVS algorithm considers only the triangles that make up the tetrahedral mesh; thus it does not require the original tetrahedra or the neighbor information of the mesh. This allows us to render a subset of the octree nodes independently without merging the geometry.
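The CPU phase of this two-phase sort can be pictured as a centroid-depth sort over the triangles. The sketch below is an illustrative stand-in, not the actual HAVS implementation: it orders triangles front-to-back by the squared distance of their centroids to the camera, leaving any residual ordering errors to be resolved by an image-space pass such as the k-buffer:

```python
def object_space_sort(triangles, camera):
    """Partial object-space sort (HAVS-style CPU phase, sketched):
    order triangles front-to-back by squared centroid distance to
    the camera; exact per-fragment ordering is left to the GPU."""
    def centroid_dist2(tri):
        cx = sum(v[0] for v in tri) / 3.0
        cy = sum(v[1] for v in tri) / 3.0
        cz = sum(v[2] for v in tri) / 3.0
        return ((cx - camera[0]) ** 2 + (cy - camera[1]) ** 2
                + (cz - camera[2]) ** 2)
    return sorted(triangles, key=centroid_dist2)

tris = [
    [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0)],  # far
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)],  # near
]
ordered = object_space_sort(tris, (0.0, 0.0, 0.0))  # near triangle first
```

Because the sort key is a per-triangle scalar, this phase needs no connectivity, which is exactly the property that lets octree nodes be rendered independently.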
Unlike rendering systems for opaque polygonal geometry, special care needs to be taken when rendering multiple octree nodes to ensure proper compositing. At each frame, our algorithm resolves the compositing issue by sorting the active set of octree nodes that are in memory in visibility order (front-to-back). When octree nodes of different sizes are in the active set, we sort by the largest common parent of the nodes. The original HAVS algorithm has also been modified to iterate over the active set of nodes in one pass and perform the object-space and image-space sorts on each piece in visibility order. To ensure a smooth transition between octree nodes, the k-buffer is not flushed until the last node is rendered.

4.1.4 Distributed Rendering

Though our algorithm can efficiently render large unstructured grids on a single commodity PC, it was designed to be employed on a cluster of PCs in a distributed manner. Chromium [41] is a system that was introduced to perform parallel rendering on a cluster of graphics workstations. Distributed rendering can also be used to visualize the data on larger displays. Moreland and Thompson [66] describe a parallel rendering algorithm that uses Chromium and an image-compositing engine (ICE-T) built with VTK for visualizing the results on a display wall. A major distinction between iRun and Chromium is that while Chromium pushes data through the pipeline to the render devices, iRun pulls data from a geometry cache to the render devices. This pulling approach results in a conceptually simpler framework for parallel rendering in which the CPU and GPU are tightly connected and data are fetched into the geometry cache as needed before being transformed into graphics primitives. iRun can partition rendering across a distributed network. This feature is useful for driving a display wall, where each system controls a single tile of the display. This approach can further improve performance when rendering scenes with complicated geometry.
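When each system drives one tile, its view frustum becomes an asymmetric (skewed) slice of the full frustum. A minimal sketch of that arithmetic, using a hypothetical helper (not iRun's code) that slices the near-plane window of a frustum into a grid of tiles:

```python
def tile_frustum(left, right, bottom, top, col, row, cols, rows):
    """Slice a full view frustum's near-plane window into a grid and
    return the asymmetric sub-frustum for tile (col, row); row 0 is
    the bottom row. Near/far plane distances are unchanged and
    omitted here."""
    w = (right - left) / cols
    h = (top - bottom) / rows
    return (left + col * w, left + (col + 1) * w,
            bottom + row * h, bottom + (row + 1) * h)

# A symmetric frustum window split across a 2x2 display wall:
sub = tile_frustum(-1.0, 1.0, -1.0, 1.0, col=1, row=0, cols=2, rows=2)
# → (0.0, 1.0, -1.0, 0.0): the bottom-right tile
```

The resulting (left, right, bottom, top) values plug directly into an off-axis projection, and culling against this sub-frustum restricts each machine to the geometry visible on its own tile.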
In iRun, each display on the display wall is driven by a dedicated render-server implemented in VTK. The portion of the display wall that a render-server is responsible for is specified at startup. A skewed view frustum is calculated based on the region of responsibility, and this frustum allows the render-server to cull the geometry to only the set visible on its display. Each render-server has access to the full geometry, but only loads the portions that it needs or anticipates needing. The render-servers are coordinated by a single system that controls the camera. They receive camera descriptions asynchronously with respect to the render cycle, and wait for the render cycle to complete before updating VTK's camera. The update mechanism is implemented as a vtkInteractorStyle to allow for seamless integration. The camera-server allows render-servers to establish and break TCP connections arbitrarily. Camera descriptions are sent out periodically, regardless of whether the camera has changed, to allow recently connected render-servers to quickly synchronize with the rest of the display wall. Two clients have been written to run on the camera-server, both implemented in VTK. The first (Figure 4.5a) is nearly identical to the render-servers, except that it broadcasts camera coordinates instead of receiving them. This is accomplished by listening for the timer event that the interactor styles use for motion updates. The camera is read inside the event handler and provided to the broadcast code. This client has access to the same geometry and renders it on its own. It requires an adequately functional GPU and relies on the automatic LOD adjustment to maintain interactive frame rates. The second client (Figure 4.5b) is able to run on a less capable system. It is also implemented in VTK, but uses VTK's interaction styles with no geometry loaded.
In this mode, the render-servers additionally capture frames of their rendered output and transmit downsampled versions back to the camera-server. The camera-server receives and assembles the snapshots asynchronously with respect to the render cycle, and periodically requests the message loop to execute code to display the composite.

Figure 4.5: Distributed Rendering with iRun. (a) With a thumbnail client, the client receives thumbnails from the render-servers and composites them. (b) With a full client, the client accesses and renders the geometry directly.

Table 4.1: Preprocessing Results. Times are h:m:s.

| Data Set | Input Verts | Input Tets | Input Tris | Input Size | Total Time | Clip Time | Tris/leaf | Output Verts | Output Tris | Output Size |
| Torso | 169K | 1.1M | 2.2M | 34MB | 01:45 | 01:12 | 13K | 3.2M | 6.7M | 158MB |
| Fighter | 257K | 1.4M | 2.8M | 50MB | 02:27 | 01:28 | 18K | 3.6M | 7.8M | 182MB |
| Rbl | 730K | 3.9M | 7.9M | 143MB | 06:13 | 04:08 | 42K | 7.8M | 17M | 410MB |
| Mito | 972K | 5.5M | 11M | 206MB | 10:59 | 05:00 | 30K | 11M | 22M | 538MB |
| Turbo Jet | 1.7M | 10M | 20M | 344MB | 15:49 | 07:58 | 27K | 12. | 31M | 697MB |
| Sf1 | 2.5M | 14M | 28M | 516MB | 37:41 | 25:56 | 58K | 24M | 63M | 1.4GB |
| Bullet | 6.3M | 36M | 62M | 1.3GB | 1:10:30 | 42:19 | 32K | 48M | 117M | 2.8GB |

4.2 Results

We generated all of our results on a 3.0-GHz Pentium D machine with 2.0 GB of RAM, a 500 GB SATA hard-disk, and an nVIDIA 7800 GTX. Table 4.1 shows timing results and data sizes before and after preprocessing. We were able to preprocess the largest dataset, which contains 36 million tetrahedra or 63 million triangles, in just over an hour. For all models in this chapter, we target the output octree to have at most 10 thousand triangles per leaf. Due to the triangles added during simplification and clipping, the final number of triangles per leaf is slightly higher. The rest of the section uses these datasets unless otherwise stated.
In Figure 4.6, we show an example of how iRun can give users various levels-of-detail on demand. Assume a user has a machine with approximately 512 MB of RAM available for rendering and wants to use iRun to explore different features of the Turbo Jet dataset using a 512 × 512 viewport. The user can load the dataset with 512 MB of RAM dedicated to the geometry cache and a target frame rate of, for example, 5 fps. Interaction at this frame rate only gives the user a low-LOD representation of the dataset, since iRun can render at most 40 thousand triangles per frame. It is almost impossible to get a coherent overview of the dataset at a resolution of less than 0.1%. Thus, the user would stop interacting so that iRun can refine its visualization. With more detail, the user can now select a specific region of interest to explore. For example, the user may wish to take a closer look at the region in the center of the dataset by zooming in. This causes the LOD to be automatically adjusted by iRun. With full use of the geometry cache, the quality of our image is very close to that of the full resolution at the same view, which would require 1.5 GB of RAM. Figure 4.7 is a plot showing the difference between the displayed and fetched nodes while interacting with the Turbo Jet as in the case study above. We selected a target frame rate of 2 frames per second to increase the readability of the plot, while the interaction speed is fairly high (mouse movement of over 30 pixels/frame). Our set of movements contains excessive rotations, zooms, and translations. As shown in the figure, the total number of nodes that need to be loaded for the next frame stays relatively low compared to the number that are rendered. For the distributed rendering results, we set up four machines, each with two dual-core AMD Opterons, as clients. Our goal was to render the Bullet dataset, consisting of 3 GB of data (117M faces), displayed on another server.
The render window on the server had a 1024 × 1024 viewport, and each client rendered only a 512 × 512 subwindow. Figure 4.8 illustrates the differences between distributed rendering and local rendering on a single client machine. The right image shows the results of the distributed rendering. With a target frame rate of 10, we were able to achieve a good volume-rendered image after filling 24 nodes of the buffer. The frame rate of the distributed renderer can fairly be reported as the minimum frame rate of the four clients; for this view we achieved a frame rate of 11.62 fps. The left image shows the same dataset with the same view rendered using only one of the clients with the same window size. Filling only 14 nodes of the buffer, the frame rate is reduced to 2.77 fps on a single machine. This is because there are four times as many pixels to be rendered at each frame. Moreover, disk accesses also increase up to fourfold compared to the distributed rendering.

Figure 4.6: Effects of Geometry Cache Size in iRun. iRun visualizes the Turbo Jet model consisting of over 10 million tetrahedra at multiple levels-of-detail. The geometry cache size of (a), (b) and (c) is 512 MB, while that of (d) is 1.5 GB. (a) In interactive mode, iRun only displays 40K triangles. (b) In still-rendering mode, most of the geometry cache is used. (c) The user zooms in to a particular region. (d) The geometry cache size is tripled.

4.3 Discussion

4.3.1 Limitations

There are issues in our current implementation that could use improvement. First, due to clipping and boundary triangles, node size can grow larger than desired. Bounding this size would be a useful feature. Another limitation is that our current LOD strategy may not be suitable for all
datasets. Therefore, a more automatic way of determining the LOD triangles would improve image quality. Finally, even though the number of vertices is generally much smaller than the number of tetrahedra, our current method of keeping the vertices in-core during preprocessing would not be feasible on a PC for extremely large datasets with hundreds of millions of tetrahedra.

Figure 4.7: Geometry Cache Statistics. The difference between the displayed nodes of the geometry cache and the ones being fetched during interactions with the Turbo Jet at 2 fps.

Figure 4.8: Distributed Volume Rendering. The Bullet dataset rendered with a single machine versus a distributed rendering with four clients.

4.3.2 VTK

In terms of coding, we found VTK to be very helpful when implementing our system. By leveraging existing functionality provided by this framework, we were able to focus on algorithmic instead of engineering contributions. For example, the simplification and clipping phases of the preprocessing are all done using VTK classes. Additionally, the client-server architecture and user interface are built on top of VTK. However, using VTK presented the following challenges:

- Because VTK is for general use, we were forced to modify existing classes to get the desired functionality. For example, we added our own timers and modified the rendering pipeline order. In addition, current VTK data structures for geometry are incompatible with OpenGL, which makes using fast display structures such as triangle arrays difficult.

- VTK is not thread-safe. iRun required solving many synchronization problems among threads. For example, one race condition occurred when the visibility sorter received the camera more slowly than the rendering window, causing compositing artifacts in the resulting image.
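One standard pattern for the sorter/renderer race just described is a mutex-protected "mailbox" for the latest camera, snapshotted once per frame so that the visibility sorter and the render window never see different cameras within a single frame. This is an illustrative sketch of the pattern, not iRun's actual fix:

```python
import threading

class CameraMailbox:
    """Mutex-protected holder for the most recent camera: the
    network thread publishes updates atomically, and the render
    thread takes one consistent snapshot per frame."""
    def __init__(self, camera):
        self._lock = threading.Lock()
        self._camera = camera

    def publish(self, camera):
        with self._lock:
            self._camera = camera

    def snapshot(self):
        with self._lock:
            return self._camera

box = CameraMailbox((0.0, 0.0, 5.0))
box.publish((1.0, 0.0, 5.0))          # e.g., from the receiving thread
frame_camera = box.snapshot()          # sorter and renderer share this
```

The key point is that all per-frame consumers read the same snapshot variable rather than re-reading the shared state mid-frame.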
To manage the client-server architecture for parallel rendering, modifications to the VTK interactors were necessary. Specifically, the addition of an asyncExec method to vtkRenderWindowInteractor and its derived classes was necessary to queue commands for later execution.

4.3.3 iRun vs. iWalk

iRun shares many ideas with iWalk [28]. However, due to the complexity of volume rendering tetrahedral data, many of the algorithms presented in iWalk were not suitable for iRun. In iRun, each node of the octree contains a self-describing vtkUnstructuredGrid of varying LOD that can be rendered independently. iWalk only keeps triangles in the leaf nodes and depends on occlusion culling instead of LOD to remain interactive. This also affects the traversal algorithm, which is more sophisticated in iRun because LOD as well as screen coverage need to be considered due to the transparent nature of the nodes. The fetching thread also works differently: iRun separates fetching from building visible sets of nodes because of the added complexity of visibility sorting for the nodes. Another difference is that, due to compositing problems that occur with overlapping triangles from neighboring nodes, we require a more exact triangle-box intersection that is based on the triangle, not the vertices. iRun also requires the clipping of triangles that extend beyond a node's bounding box to avoid incorrect visibility ordering across neighboring nodes. Unlike iWalk, we require boundary triangles (simplified or full resolution) at each level of the tree to avoid holes in the rendered image. Finally, we note that the implementations of the two systems are completely disjoint.
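The asyncExec mechanism described in Section 4.3.2 amounts to a thread-safe command queue drained at a safe point in the event loop. The sketch below illustrates that pattern only; the class and method names are hypothetical and are not VTK's API:

```python
from collections import deque
import threading

class AsyncExecQueue:
    """asyncExec-style command queue (illustrative, not VTK's API):
    any thread may enqueue a command; the event loop drains and runs
    the queued commands on its own thread at a safe point."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()

    def async_exec(self, command):
        with self._lock:
            self._pending.append(command)

    def process_events(self):
        with self._lock:
            commands, self._pending = self._pending, deque()
        for command in commands:
            command()   # runs on the event loop's thread

q = AsyncExecQueue()
results = []
q.async_exec(lambda: results.append("update camera"))
q.process_events()
```

Swapping the deque out under the lock, then executing outside it, keeps enqueueing threads from blocking on long-running commands.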
4.4 Proposition

Through our experiences with iRun, we have learned that (1) interactive large-scale visualization pipelines need both push and pull update policies, for their level-of-detail and view-dependent features, respectively, and (2) current visualization systems such as VTK are not designed for concurrent execution of modules (they are not even thread-safe), and their pipeline constructs are limited by the need to support backward compatibility.

CHAPTER 5

PARALLEL VISUALIZATION ON LARGE CLUSTERS USING MAPREDUCE

In this chapter, we take a first step toward investigating the suitability of cloud-based infrastructure for large-scale visualization. We have designed a set of MapReduce-based algorithms for memory-intensive visualization techniques and performed an extensive experimental evaluation. Our results indicate that MapReduce offers a potential foundation for a combined storage, processing, analysis, and visualization system that is capable of keeping pace with growth in data volume (attributable to scalability and fault-tolerance) as well as growth in application diversity (attributable to extensibility and ease of use). Because the MapReduce framework can be decomposed into a pair of data-parallel jobs with a serial (shuffling) phase in between, a careful study of this topic can help us determine the parallel constructs that a modern dataflow architecture requires to efficiently support large-data visualization. We also found that common visualization algorithms can be naturally expressed using the MapReduce abstraction, and that even simple implementations of these algorithms are highly scalable. Cloud computing has emerged as a viable, low-cost alternative for large-scale computing and has recently motivated both industry and academia to design new general-purpose parallel programming frameworks that work well with this new paradigm [31, 71, 95].
In contrast, large-scale visualization has traditionally benefited from specialized couplings between hardware and algorithms, suggesting that migration to a general-purpose cloud platform might incur development costs or sacrifice scalability.¹ The MapReduce framework [31] provides a simple programming model for expressing loosely coupled parallel programs using two serial functions, Map and Reduce. The Map function processes a block of input, producing a sequence of (key, value) pairs, while the Reduce function processes the set of values associated with a single key. The framework itself is responsible for "shuffling" the output of the Map tasks to the appropriate Reduce task using a distributed sort. The model is sufficiently expressive to capture a variety of algorithms and high-level programming models, while allowing programmers to largely ignore the challenges of distributed computing and focus instead on the semantics of their task. Additionally, as implemented in the open-source platform Hadoop [39], the MapReduce model has been shown to scale to hundreds or thousands of nodes [31]. MapReduce clusters can be constructed inexpensively from commodity computers connected in a shared-nothing configuration (i.e., neither memory nor storage is shared across nodes). Such advantages have motivated cloud providers to host Hadoop and similar frameworks for processing data at scale [25, 92, 7]. These platforms have been largely unexplored by the visualization community, even though these trends make it apparent that our community must inquire into their viability for large-scale visualization tasks. The conventional modus operandi of "throwing datasets" through a (parallel) graphics pipeline relegates data manipulation, conditioning, and restructuring tasks to an offline system and ignores their cost.

¹Scalability refers to the relative performance increase gained by allocating additional resources.
As data volumes grow, these costs, especially the cost of transferring data between a storage cluster and a visualization cluster, begin to dominate. Cloud computing platforms thus open new opportunities in that they afford both general-purpose data processing as well as large-scale visualization.

5.1 MapReduce Overview

MapReduce is a framework to
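The Map/shuffle/Reduce contract described above can be mimicked in-process to prototype such algorithms before committing to a cluster. The toy sketch below is our own illustrative code, not Hadoop's API; it uses a sort-based shuffle and, as a visualization-flavored example, histograms scalar samples into unit-width bins:

```python
from itertools import groupby

def map_reduce(blocks, mapper, reducer):
    """Minimal in-process model of MapReduce: run Map over input
    blocks, shuffle (group (key, value) pairs by key via a sort,
    mirroring the framework's distributed sort), then Reduce."""
    pairs = [kv for block in blocks for kv in mapper(block)]
    pairs.sort(key=lambda kv: kv[0])          # the "shuffle" phase
    return {k: reducer(k, [v for _, v in g])
            for k, g in groupby(pairs, key=lambda kv: kv[0])}

# Toy example: histogram scalar samples into unit-width bins.
def mapper(block):
    return [(int(x), 1) for x in block]

def reducer(key, values):
    return sum(values)

hist = map_reduce([[0.1, 0.9, 1.5], [1.2, 2.7]], mapper, reducer)
# → {0: 2, 1: 2, 2: 1}
```

Because the mapper sees one block at a time and the reducer sees one key at a time, the same two functions could be handed unchanged to a distributed runtime, which is the property that makes the abstraction attractive for data-parallel visualization kernels.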
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6gh9zqf |



