| Title | Managing provenance for knowledge discovery and reuse |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Koop, David Allen |
| Date | 2012-05 |
| Description | Serving as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, the management of those data is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism, which is NP-complete. Graph indexing methods identify features that help distinguish graphs of a collection to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and collections with significant overlap. However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse. By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Provenance; VisTrails; Workflow |
| Subject LCSH | Information retrieval |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © David Allen Koop 2012 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 5,032,164 bytes |
| Identifier | us-etd3/id/659 |
| Source | Original in Marriott Library Special Collections, ZA3.5 2012 .K66 |
| ARK | ark:/87278/s6hx1td5 |
| DOI | https://doi.org/10.26053/0H-9D7R-DE00 |
| Setname | ir_etd |
| ID | 194823 |
| OCR Text | MANAGING PROVENANCE FOR KNOWLEDGE DISCOVERY AND REUSE by David Allen Koop A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science School of Computing The University of Utah May 2012 Copyright © David Allen Koop 2012 All Rights Reserved

The University of Utah Graduate School STATEMENT OF DISSERTATION APPROVAL The dissertation of David Allen Koop has been approved by the following supervisory committee members: Juliana Freire, Chair 7/18/2011 Date Approved Cláudio T. Silva, Member 7/18/2011 Date Approved Valerio Pascucci, Member 7/18/2011 Date Approved Susan Davidson, Member 7/18/2011 Date Approved Matthias Troyer, Member 7/18/2011 Date Approved and by Alan Davis, Chair of the School of Computing, and by Charles A. Wight, Dean of The Graduate School.

ABSTRACT
Serving as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, the management of those data is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism, which is NP-complete. Graph indexing methods identify features that help distinguish graphs of a collection to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and collections with significant overlap. However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse.
By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations.

To my parents

CONTENTS
ABSTRACT iii
LIST OF FIGURES x
ACKNOWLEDGEMENTS xiv
CHAPTERS
1. INTRODUCTION 1
1.1 Motivation 1
1.2 Thesis Statement 2
1.3 Dissertation Objectives 2
2. BACKGROUND 4
2.1 Provenance 4
2.2 Scientific Workflow Systems 5
2.3 VisTrails 7
3. VISCOMPLETE: DATA-DRIVEN SUGGESTIONS FOR VISUALIZATION SYSTEMS 9
3.1 Introduction 9
3.2 Related Work 13
3.3 Generating Data-driven Suggestions 14
3.3.1 Problem Definition 15
3.3.2 Mining Pipelines 15
3.3.3 Generating Predictions 18
3.3.4 Biasing the Predictions 20
3.4 Implementation 21
3.4.1 Triggering a Completion 21
3.4.2 Computing the Suggestions 21
3.4.3 The Suggestion Interface 22
3.5 Use Cases 23
3.6 Evaluation 24
3.6.1 Data and Validation Process 24
3.6.2 Results 25
3.7 Discussion 28
3.8 Summary 29
4. EFFICIENT EVALUATION OF EXPLORATORY QUERIES OVER PROVENANCE COLLECTIONS 30
4.1 Introduction 30
4.2 Background 34
4.2.1 Provenance and Workflows 34
4.2.2 Queries Over Provenance Collections 35
4.2.3 Graphs and Isomorphisms 37
4.3 Indexing Framework 37
4.3.1 Standard Graph Indexing 38
4.3.1.1 Identifying Features 38
4.3.1.2 Index Construction and Query Processing 39
4.3.2 Wildcard Graph Indexing 39
4.3.2.1 2-Component Frequent Subgraphs 40
4.3.2.2 Summary Subgraphs 40
4.3.2.3 Index Construction and Query Processing 42
4.3.2.4 Verification 43
4.4 Implementation 43
4.4.1 Index Construction 44
4.4.1.1 Mining Frequent Subgraphs 45
4.4.1.2 Generating 2-component Frequent Subgraphs 45
4.4.1.3 Selecting Summary Graphs 45
4.4.1.4 Building the Discriminative Index 45
4.4.2 Query Processing 46
4.4.2.1 Wildcard Query Verification 47
4.4.3 Index Maintenance 49
4.5 Workflow Completions 49
4.5.1 Implementing Workflow Completions 51
4.6 Evaluation 51
4.6.1 Theoretical Costs 51
4.6.2 Data Sets 52
4.7 Discussion 53
4.7.1 Subworkflows 53
4.7.2 Scalability 53
4.7.3 Parameters 58
4.8 Related Work 58
4.9 Summary 59
5. VISUAL SUMMARIES FOR GRAPH COLLECTIONS 60
5.1 Introduction 60
5.2 Related Work 64
5.3 Graph Matching 65
5.3.1 Definitions 65
5.3.1.1 Matching 66
5.3.1.2 Graph Edit Distance 67
5.3.2 Computing Graph Edit Distance 68
5.3.2.1 A* Search 68
5.3.2.2 Edit Distance and the Assignment Problem 68
5.3.2.3 Including Neighborhood Information 69
5.3.3 Diffusion Matching 70
5.3.3.1 Similarity Flooding 70
5.3.3.2 Scoring Unmatched Nodes 72
5.4 Summary Graphs 73
5.4.1 Compound Similarity Scoring 74
5.4.2 Construction 74
5.5 Visualizing and Interacting with Graph Summaries 75
5.5.1 Layout and Display 75
5.5.2 Controlling the Amount of Summarization 76
5.5.3 Color 77
5.5.4 Manipulating the Summary Graph 77
5.6 Case Studies 82
5.6.1 Metabolic Pathways 82
5.6.2 Visualization Pipelines 84
5.6.3 Molecular Structures 85
5.7 Discussion 85
5.7.1 Overlaps 85
5.7.2 Multi-Edge Graphs 86
5.7.3 Scoring 86
5.7.4 How Much Summarization? 86
6. SUPPORTING REPRODUCIBLE AND REUSABLE PUBLICATIONS 87
6.1 Bridging Workflow and Data Provenance Using Strong Links 87
6.1.1 Persisting Data Provenance Links 89
6.1.1.1 Deriving Strong Links 89
6.1.1.2 File Management 93
6.1.1.2.1 Input files 93
6.1.1.2.2 Output files 93
6.1.1.2.3 Intermediate files 94
6.1.1.2.4 Customization 94
6.1.2 Linking Provenance 94
6.1.2.1 Algorithms for Querying Linked Provenance 94
6.1.2.2 Embedding Provenance with Data 96
6.1.3 Using Strong Links 96
6.1.3.1 Caching 96
6.1.3.1.1 In-memory caching 96
6.1.3.1.2 Persistent caching 98
6.1.3.2 Publishing 98
6.1.4 Sharing Data 99
6.1.4.1 Centralized Storage 99
6.1.4.2 Decentralized Storage 100
6.1.5 Implementation 100
6.1.5.1 Storing Data 102
6.1.5.2 Finding Data 102
6.1.6 ALPS Case Study 103
6.1.7 Related Work 106
6.1.8 Summary 108
6.2 The Provenance of Workflow Upgrades 108
6.2.1 Workflow Upgrades 112
6.2.1.0.1 Incompatible workflows 112
6.2.1.0.2 Provenance of module implementation 112
6.2.1.1 Detecting the Need for Upgrades 114
6.2.1.2 Processing Upgrades 114
6.2.1.2.3 Developer-defined upgrades 115
6.2.1.2.4 Automatic upgrades 115
6.2.1.2.5 User-assisted upgrades 115
6.2.1.3 Provenance Concerns 117
6.2.2 Implementation 118
6.2.2.1 Replace, Remap, and Copy 118
6.2.2.2 Algorithm 118
6.2.2.3 Subworkflows 119
6.2.2.4 Preferences 120
6.2.3 Discussion 120
6.2.4 Related Work 121
6.2.5 Summary 123
7. CONCLUSIONS AND FUTURE WORK 125
REFERENCES 127

LIST OF FIGURES
3.1 The VisComplete suggestion system and interface. (a) A user starts by adding a module to the pipeline. (b) The most likely completions are generated using indexed paths computed from a database of pipelines. (c) A suggested completion is presented to the user. The user can browse through suggestions using the interface and choose to accept or reject the completion. 11
3.2 Three of the first four suggested completions for a "vtkDataSetReader" are shown along with corresponding visualizations. The visualizations were created using these completions for a time step of the Tokamak Reactor dataset that was not used in the training data. 12
3.3 Deriving a path summary for the vertex D. 16
3.4 Predictions are iteratively refined. At each step, a prediction can be extended upstream and downstream; in the second step, the algorithm only suggests a downstream addition. Also, predictions in either direction may include branches in the pipeline, as shown in the center. 17
3.5 At each iteration, we examine all upstream paths to suggest a new downstream vertex. We select the vertex that has the largest frequency given all upstream paths. In this example, "vtkDataSetMapper" would be the selected addition. 19
3.6 One of the test visualization pipelines applied to a time step of the Tokamak Reactor dataset. VisComplete could have made many completions that would have reduced the amount of time creating the pipeline. In this case, about half of the modules and connections could have been completed automatically. 26
3.7 Box plot of the percentages of operations that could be completed per task (higher is better). The statistics were generated for each user by taking them out of the training data. 27
3.8 Box plot of the percentages of operations that could be completed given two types of tasks, novice and expert. The statistics were generated by evaluating the novice tasks using the expert tasks as training data (novice) and by evaluating the expert tasks using the novice tasks as training data (expert). 27
3.9 Box plot of the average prediction index that was used for the completions in Figure 3.7 (lower is better). These statistics provide a measure of how many suggestions the user would have to examine before the correct one was found. 28
4.1 A standard containment query searches a collection to find workflows with the specified subgraph. 32
4.2 An exploratory query allows wildcards to permit less-specific queries. The dashed lines in the query are wildcard paths; each result must contain a path between the connected modules. 33
4.3 A representative workflow from a collection of workflows used for habitat modeling. 35
4.4 Because the graphs identified by a feature may also be identified by subgraphs of that feature, we choose discriminative features to be those whose subgraphs collectively identify many more graphs. For example, F1 is selected because the number of graphs identified by the combination of F2 and F3 is 28 ≫ 10. 40
4.5 While the features F2 and F3 occur together often, they are usually disjoint as defined by the two-component feature F1: |supp(F1)| ≪ |supp(F2) ∩ supp(F3)|. 41
4.6 Because each subgraph of a frequent subgraph is also frequent, we choose summary features to be those whose supergraphs have much smaller frequency. 42
4.7 Our index has two tiers: the summary features, which summarize frequent features and provide verification-free answers, and the discriminative features, which point to both the original workflow database and the summary features. Note that for this illustration, many items have been omitted from the figure; in practice, each workflow is indexed by at least one discriminative feature. 43
4.8 The construction of our index involves feature mining, followed by the identification of summary features, which are used to determine discriminative features and build the index. 44
4.9 Query processing is faster because the discriminative index limits the number of candidates and summary graphs limit the number of computationally expensive verifications. 48
4.10 Workflow completions are generated from a completion query by replacing wildcards with modules and connections according to existing workflows in a collection. 50
4.11 Comparison of the number of subgraph isomorphism verifications required for queries with different numbers of results across different indexing schemes. For both the visualization workflows (a) and the Yahoo! Pipes workflows (b), we used the proposed scheme having both summary features and 2-component subgraphs (S+2C), a scheme using only summary features (S), and the original feature-based indexing scheme (Orig.). The actual number of results is plotted as a baseline (Actual) as well as the number of candidates (including summary graphs) after filtering for the proposed scheme (Cands.). 54
4.12 The effect of varying the thresholds for identifying the (a) discriminative and (b) summary features for the proposed index. 55
4.13 Mean ratio of the number of isomorphisms computed to the number of matching graphs according to the number of edges in the query graph, shown for the proposed scheme (Summary + 2C), only summary graphs (Summary), and the original feature-based scheme (Original). 57
5.1 A summary graph constructed from four molecules. The supplemental video shows the order of summarization and how edit operations allow users to tune the visualization. 61
5.2 A summary graph of enzyme relation graphs from the citric acid cycle for eight organisms. Notice that color can highlight differences while levels of gray indicate how common a graph component is. 63
5.3 A comparison of graph-matching algorithms when run on the same two starting graphs, shown in (a); mismatched vertices and edges in (b) and (c) are highlighted. A vertex-only matching (b) has issues with mismatched edges (e.g., the two Bob nodes from the red graph match the Bob and Robert nodes from the blue graph equally well). A neighborhood matching will correct some errors because it takes into account neighboring nodes, but it will not propagate this information to other nodes. For example, the neighborhoods of the two Cynthia/Cindy nodes in each graph (one neighborhood from each graph is highlighted in (a)) will match equally well and may cause an edge mismatch as shown in (c). Global methods, like A* search and diffusion matching, seek to resolve such problems, leading to matchings without mismatched edges (d). 71
5.4 The settings for GraphSum allow users to control summarization, adjust vertex and edge coloring, and toggle the display of individual graphs. 77
5.5 A summary of eight graphs representing visualization workflows generated by different students for a specific homework problem. A single student's work is highlighted in the context of the other graphs, allowing specific comparisons with the group as a whole. 78
5.6 We define an initial ordering to merge graphs to create the summary graph, but after each merge, we use the similarity scores for individual nodes to order the individual node merges. The figure shows this node merge ordering for the workflow summary graph shown in Figure 5.5. This ordering provides a natural method for navigating the amount of summarization in a linear fashion. Note that any dependent merges must occur before a given merge. 79
5.7 A summary graph of enzyme relation graphs from the citric acid cycle for eight organisms. We can show all colors to identify which entities appear in each graph (a). Our GraphSum application allows a user to highlight a subset of the graphs to better show individual differences (b). Users can also hide the other graphs to show only the similarities and differences between the selected graphs (c). 80
5.8 A piece of a molecular summary graph that shows how edit operations can be used to transform one summary into another via two break operations, (a) to (b), followed by two join operations, (b) to (c). With editing operations, the user can decide how the summary is best presented. 83
6.1 When provenance information references file-system paths, there is no guarantee those files will not be moved or modified. We propose references that are linked to a persistent repository which maintains that data and, with hashing and versioning, allows for querying, reuse, and data lineage. 90
6.2 The upstream signature S(M) for a module is calculated recursively as the signature of the module concatenated with the upstream signatures of the upstream subworkflow for each port and the signature of the connection. 92
6.3 Given a file which has been moved and renamed, we can use the managed file store and provenance to first locate the managed copy, and we can locate the original input files as well. 95
6.4 Embedding provenance with data: provenance can be either saved to a separate file or serialized to XML and embedded in an existing file. 97
6.5 The ManagedInputFile configuration allows the user to choose to create a new reference from a file on his local filesystem or use an existing reference from the managed store. 101
6.6 An ALPS workflow colored by execution status (a). Blue modules were not executed since intermediate data existed in persistent storage; yellow modules were cached in memory; and green modules were executed. The results of a parameter exploration of the fitting range (b). 104
6.7 A workflow comparing road maintenance and number of miles of road by state before and after upgrading two packages. In (a), the AggregateData module has been replaced, and the developer has specified an upgrade to combine multiple aggregation steps into a single ComposeData module. In (b), the interface of ExtractColumn has been updated to offer a new parameter. Finally, in (c), the interface of the plotting mechanism has not changed, but the implementation of that module has, as evidenced by the difference in the background of the resulting plots. 110
6.8 On the right, we show the provenance of upgrading workflow (A) to the updated workflow (B). Besides the provenance of the upgrade, here we show the provenance of the executions of both (A) and (B). Note that version information is maintained in both forms of provenance. 111
6.9 Incompatible (left) and valid (right) versions of a workflow. In an incompatible workflow, the implementation of modules is missing, and thus, no information is available about the input and output ports of these modules. 113
6.10 Upgrading a single module automatically involves deleting all connections, replacing the module with the new version, and finally adding the connections back. 116
6.11 Workflow evolution before and after upgrades as well as after retagging the nodes. 122

ACKNOWLEDGEMENTS
I would like to thank my advisors, Juliana Freire and Cláudio Silva, for their guidance, support, and direction throughout my work. I would also like to thank the other members of my committee for their help in advancing my work: Valerio Pascucci, Susan Davidson, and Matthias Troyer have given advice and helped organize my ideas. I have had the opportunity to work with a number of talented collaborators, and I am grateful for their contributions that have aided this work. Thank you to all of the members of the VisTrails team, especially Erik Anderson, Steven Callahan, Tommy Ellqvist, Emanuele Santos, Carlos Scheidegger, and Huy Vo, who have assisted me with my work and helped troubleshoot bugs; it has been a joy to work with such a talented team. I would also like to thank Bela Bauer, Brigitte Surer, and the rest of the ALPS group for their help in testing many of the techniques integrated with VisTrails. Other collaborators and co-authors, including Philippe Bonnet, Daniel Fink, Steve Kelling, and Jeff Morisette, have been influential in my understanding of the issues in computational science, reproducibility, and reuse. There have also been a number of people and organizations that have helped shape and facilitate my work, and I am grateful for this support. Thanks to folks I have worked with at VisTrails, Inc., including Douglas Alves, Benjamin Burnett, and Ramesh Pinnamaneni, for their work and insight. Also, thank you to the staff at the School of Computing and the Scientific Computing & Imaging Institute at the University of Utah who helped troubleshoot technical and procedural issues, dealt with scheduling and equipment needs, and made sure necessary paperwork was completed.
Thank you to Microsoft Research for a summer internship that expanded my understanding, and to everyone involved with the provenance challenge workshops, which helped grant me a broader understanding of the field. Thank you to the various funding agencies that have made the research possible, including the Department of Energy SciDAC (VACET and SDM centers) and the National Science Foundation (grants IIS-0746500, CNS-0751152, IIS-0713637, OCE-0424602, IIS-0534628, CNS-0514485, IIS-0513692, CNS-0524096, CCF-0401498, OISE-0405402, CCF-0528201, CNS-0551724). Also, thank you to the students in the scientific visualization courses of 2007 and 2008 at the University of Utah, who provided the provenance information used in this work. Thank you to my friends and family for their support. Thanks to all of the friends I've met in Utah as well as those from my years in Madison, Grand Rapids, and elsewhere. Specifically, thanks to John, Steve, Erik, Joel, and Carlos for dealing with rants and questions over lunch, as well as the disc golfers and ultimate players who have provided the excuse for a break from work. Thank you to my parents, Jan and Al, who always supported my education and continued to encourage my work throughout the years. Finally, thank you to Jen, my wife, for her support and understanding during this process.

CHAPTER 1
INTRODUCTION

1.1 Motivation
Serving as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Querying this information has been studied for feasibility and interoperability concerns, but applications that drive these queries have been limited. One of the applications for provenance is reproducibility: exactly replicating a process or computation. This work proposes reuse as an improved application, allowing provenance users to migrate work to new techniques or hardware and more easily extend published findings.

Provenance documents how something was accomplished; a collection of such information is thus extremely valuable in understanding solutions. Furthermore, using data mining, it is possible to determine similar solutions or common pieces of provenance information. Such parts can then be used to derive new complete or partial solutions. Note that an important component in such mining is the structure of the provenance; with more abstraction, it can be easier to locate patterns. To provide relevant suggestions, existing structure can be used to index into a summary of collected provenance. Along similar lines, while suggestions are targeted to help users with specific tasks, summaries of collections of provenance can be useful for browsing the information. Because provenance is often understood as a graph of dependencies, a textual summary of this information is usually difficult to parse. On the other hand, summarizing a collection of graphs visually is complicated by the fact that comparing graphs is NP-complete. This work presents algorithms to build summaries of collections of graphs. In addition, it demonstrates methods for editing these summaries by splitting and joining multinodes and multiedges.
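To make the node-pairing idea concrete, the following is a minimal, hypothetical sketch that pairs nodes across two labeled graphs by solving an assignment problem over label similarity, one ingredient of the summarization techniques developed in Chapter 5. It deliberately ignores edge structure, which is exactly the weakness that neighborhood information and diffusion matching address; the similarity function and example labels are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_similarity(a: str, b: str) -> float:
    """Crude label similarity: length of the shared prefix, normalized."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))

def match_nodes(nodes_g, nodes_h):
    """Pair nodes of two graphs to maximize total label similarity."""
    cost = np.zeros((len(nodes_g), len(nodes_h)))
    for i, u in enumerate(nodes_g):
        for j, v in enumerate(nodes_h):
            cost[i, j] = -label_similarity(u, v)  # negate: solver minimizes
    rows, cols = linear_sum_assignment(cost)
    return [(nodes_g[i], nodes_h[j], -cost[i, j]) for i, j in zip(rows, cols)]

# Toy labels echoing Figure 5.3: a vertex-only matcher cannot tell which
# "Bob" in one graph should pair with "Bob" vs. "Robert" in the other.
g = ["Bob", "Cynthia", "Alice"]
h = ["Robert", "Cindy", "Alice", "Bob"]
for u, v, s in match_nodes(g, h):
    print(f"{u} <-> {v}  (similarity {s:.2f})")
```

Because the pairing here is driven only by labels, two structurally different graphs with similar vocabularies would match equally well; propagating similarity along edges is what turns this local score into a usable global matching.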
To support queries, completions, and summaries for provenance, there must be infrastructure that provides efficient access to the information. Indexing techniques are often used to speed queries over certain fields in databases, but because provenance information is stored as a graph, a query can leverage both distinct node criteria and connectivity constraints. Thus, indexing must also encapsulate these features. However, because subgraph isomorphism is NP-complete, even comparing two graphs in a collection to test their equivalence can be difficult. Existing techniques for graph indexing leverage discriminative subgraphs that help filter candidates, limiting the number of full verifications that need to be calculated via subgraph isomorphism computations. However, provenance queries can be more vague, referencing only loosely-connected pieces of a subgraph, and may return large numbers of results. This work proposes a framework that adapts existing indexing techniques to make provenance queries more efficient.

A key concern in provenance is the data associated with the steps involved. For exploratory science, it is not always possible or efficient to curate data and ensure their longevity. At the same time, referencing data by filenames or URIs is problematic; a file can be moved or deleted, and linking provenance from outside the originating machine is difficult. This work proposes a framework for both identifying and managing the input and output data involved, inline with provenance information.

For the goal of reuse, not simply reproducibility, it is important to have the ability to migrate and adapt documented processes to use new hardware or techniques. To identify possible incompatibilities and how the necessary changes may be completed, documenting version information is required. At the same time, the provenance of the changes themselves can be invaluable when diagnosing differences in results. Thus, provenance plays an important role both in allowing upgrades and in documenting changes.

1.2 Thesis Statement
Techniques for the management and analysis of provenance enable applications for knowledge discovery and reuse by leveraging the information contained in provenance stores.

1.3 Dissertation Objectives
In this dissertation, we present a set of techniques for managing and analyzing provenance information as well as applications that use this framework to aid in future work. The goal is to use provenance, often viewed as archival data, to help develop solutions that use this information. The outline of this dissertation can be separated into four contributions:
- A method to suggest possible workflow completions using provenance information about previously constructed workflows [85]. This technique both offers starting points for novice users and reduces the effort for more experienced users in constructing workflows.
- A framework for indexing provenance information that permits more exploratory queries and supports queries that have large numbers of results. The techniques augment existing graph indexing techniques by adding an extra layer to the index for more quickly locating large numbers of results as well as incorporating disconnected features.
- A technique to display a collection of graphs in a visual summary that allows discovery of similarities and differences. This can be used to display collections of provenance graphs so users can discover changes, and the summaries are editable so they can serve to develop reusable analogies that can be applied to other work.
- An infrastructure to support new modes of publication. To support work on executable papers and Web-based publications, it is necessary to maintain links to data [84] and support the longevity of provenance for later use through upgrades [86].

The rest of this dissertation is organized as follows. Chapter 2 reviews background on provenance and other work in this area. Chapter 3 describes VisComplete, a recommendation system for workflows that uses provenance information to derive completions. Chapter 4 describes an indexing scheme for querying provenance information. Then, Chapter 5 describes techniques for visualizing collections of graphs, including provenance graphs. Chapter 6 describes contributions that enable greater reuse and longevity for publications. Finally, Chapter 7 presents conclusions and directions for future work.

CHAPTER 2
BACKGROUND

As this work relies on provenance, it is important to start by reviewing what constitutes provenance, how it is generated and captured, and what techniques exist to manipulate and access this information. The techniques and frameworks presented here have been implemented or integrated into computational work, usually via an existing system. The VisTrails scientific workflow system has served as a testbed for this work, and both the system and workflows in general have prompted and aided many of the applications. However, the framework and algorithms are general and can be integrated with other systems. This chapter begins by defining provenance before describing scientific workflow systems and their provenance capabilities. VisTrails is used as an example to highlight how provenance and scientific workflow systems are coupled.

2.1 Provenance
Provenance is the lineage or history of some object, including relationships to other objects that influence it. It can refer to the trail of ownership of a piece of artwork from painter to current owner, the steps in baking a cake from ingredient collection to finished product, or the processes involved in deriving a scientific result from experimental setup to analyses. While the term has not always been associated with science, the concepts are ingrained into both the work and mindset of scientists. Published results are derived from information about the exact procedures followed, captured data, annotations, and documented analyses. In addition, all of this information is documented in the publication so other scientists can validate procedures and reproduce and extend results. This provenance is often as important as, if not more important than, the results. As computing resources are used for more tasks and data are stored in digital form, the pace of work has accelerated and the complexity of tasks has increased. Manually keeping track of all steps followed, parameters set, and data used is burdensome and prone to error. For computational tasks, it is more efficient to have computers record this information. Computational provenance, then, tracks the steps and data involved in some computational task. The provenance (also referred to as the audit trail, lineage, or pedigree) of a data product contains information about the process and data used to derive the product [47, 137]. It provides important documentation that is key to preserving the data, to determining its quality and authorship, and to reproducing as well as validating the results. These are all important requirements of the scientific process. The scope and granularity of provenance information vary based on the task and capture mechanism.
For example, fine-grained information about the lineage of database tuples can be captured and used to analyze query results [144]. For more general tasks, the granularity of provenance varies from a listing of all low-level system/kernel calls [49, 107] to abstracted workflow descriptions [81, 145, 153]. Note that low-level capture is more general but requires significant work to obtain a high-level description. More abstract provenance can be more easily understood but may lack some of the details.

Another classification for provenance information involves the type of information being collected. Prospective provenance captures the specification of a computational task (i.e., a script or workflow); it corresponds to the steps that need to be followed (or a recipe) to generate a data product or class of data products. Retrospective provenance captures the steps that were executed as well as information about the execution environment used to derive a specific data product: a detailed log of the execution of a computational task. Note that retrospective provenance can be captured for any task, regardless of whether that task has prospective provenance. For example, information like which processes were run, who ran them, and how long they took can be captured without knowing the sequence of steps ahead of runtime. Provenance can also contain user-defined information: documentation that cannot be automatically captured but records important decisions and notes. These data are often captured in the form of annotations. Annotations can be added at different levels of granularity and associated with components of both prospective and retrospective provenance.

To investigate the capabilities of various systems, the relationships between them, and models for storage, a series of challenges were proposed and completed by a set of teams. The first challenge highlighted different methods for capturing and querying provenance information [121]. The second investigated interoperability of provenance models [122], and the third challenge focused on using the Open Provenance Model (OPM) as a model for exchanging provenance between systems [123]. As a very general model, OPM allows a variety of types of provenance information to be recorded without enforcing many constraints [105].

2.2 Scientific Workflow Systems
Computational tasks can be represented using a variety of mechanisms including computer programs, scripts, and workflows. They can also be constructed interactively using specialized tools (e.g., ParaView [82] for scientific visualization, GenePattern [54] for biomedical research) that often have their own internal format to represent a task. Some complex computational tasks require that different tools be woven together, including loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. For example, to analyze the results of a CT scan, it may be necessary to preprocess the data with different parameters, visualize each result, and compare them. To ensure reproducibility of the entire task, it is beneficial to have a description that captures these steps and the different parameter values used. Workflow and workflow-based systems have recently grown in popularity within the scientific community as a means to assemble complex processes [40, 46, 81, 104, 114, 138, 145, 149, 153, 156]. Not only do they support the automation of repetitive tasks, but they can also systematically capture provenance information for the derived data products [39].
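As a rough illustration of the prospective/retrospective distinction described above, the sketch below models the two kinds of records as plain Python dataclasses. The field names are assumptions chosen for clarity, not a schema from VisTrails or OPM; note how a retrospective record can stand alone, since an execution can be logged even when no workflow specification exists.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ProspectiveStep:
    """Prospective provenance: one step of the recipe that *should* run."""
    step_id: str
    module: str                                      # e.g., "vtkContourFilter"
    parameters: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)   # upstream step ids

@dataclass
class RetrospectiveRecord:
    """Retrospective provenance: what *did* run, and in what environment."""
    started: datetime
    finished: datetime
    user: str
    machine: str
    exit_status: int
    step_id: Optional[str] = None                    # None if no spec existed
    annotations: dict = field(default_factory=dict)  # user-defined notes
```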
Most workflow systems support provenance capture, although each adopts its own data and storage models [39, 47]. These range from specialized Semantic Web languages (e.g., RDF and OWL) and XML dialects that are stored as files in the file system, to tables stored in relational databases.

A workflow describes a set of computations as well as an order for these computations. To simplify the presentation, we focus on dataflows, but note that our approach is applicable to more general workflow models. In a dataflow, computational flow is dictated by the data requirements of each computation. A dataflow is represented as a directed acyclic graph where nodes are the computational modules and edges denote the data dependencies as connections between the modules: an edge connects the output port of a module to an input port of another. Often, a module has a set of associated parameters that can control the specifics of one computation. Some workflows also utilize subworkflows, where a single module is itself implemented by an underlying workflow.

Because workflows abstract computation, there must be an association between the module instances in a workflow and the underlying execution environment. This link is managed by the module registry, which maps module identifiers to their implementations. For convenience and maintenance, related modules are often grouped together in packages. Thus, a module identifier may consist of a package identifier, a module name, an optional namespace, and information about the version of the implementation. Version information can serve to inform us when implementations or interfaces in the environment change, and is part of provenance information. Consider, for example, the VisTrails system [153]. In VisTrails, each module corresponds to a Python class that derives from a predefined base class. Users define custom behaviors by implementing a small set of methods. These, in turn, might run some code in a third-party library or invoke a remote procedure call via a Web service. The Python class also explicitly describes the interface of the module: the set of allowed input and output connections, given by the module's ports. A VisTrails package consists of a set of Python classes.

One of the benefits of workflow systems is that they lend themselves to visual programming environments. Those browsing a collection of workflows should be able to gain an idea of the computation from a depiction of the general structure without reading a long code listing. Connections show relationships between modules without the need to trace variable names, and parameter settings can be located with the modules they affect. This enables users to more quickly set parameter values, add computational modules, or delete extraneous analyses.
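The module/registry coupling described above can be sketched in a few lines of Python. This is a loose illustration of the pattern, not the actual VisTrails API; the base class, the ReadCSV example, and the registry key layout are all assumptions made for the example.

```python
class Module:
    """Base class: subclasses declare ports and implement compute()."""
    input_ports: dict = {}    # port name -> expected type
    output_ports: dict = {}

    def compute(self, **inputs):
        raise NotImplementedError

class ReadCSV(Module):
    """A toy module wrapping an external task behind declared ports."""
    input_ports = {"path": str}
    output_ports = {"rows": list}

    def compute(self, path):
        with open(path) as f:
            return {"rows": [line.rstrip("\n").split(",") for line in f]}

# The registry maps a module identifier (package, name, namespace,
# version) to its implementation, mirroring the identifier parts
# named in the text above.
registry = {}

def register(package, name, version, cls, namespace=""):
    registry[(package, name, namespace, version)] = cls

register("org.example.tables", "ReadCSV", "1.0", ReadCSV)

# A connection in the dataflow DAG joins an output port to an input
# port, e.g., ("reader1", "rows") -> ("analyzer1", "table").
```

Keeping the version in the registry key is what lets provenance detect later that a recorded workflow was built against an implementation that has since changed, which is the hook the upgrade machinery of Chapter 6 relies on.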
2.3 VisTrails
VisTrails (http://www.vistrails.org) is an open-source system that supports data exploration and visualization. It combines and substantially extends useful features of scientific workflow and visualization systems. Similar to scientific workflow systems [81, 117, 145, 156], VisTrails allows the specification of computational processes which integrate existing applications, loosely-coupled resources, and libraries according to a set of rules; and similar to visualization systems [71, 82, 91, 154], VisTrails makes advanced scientific and information visualization techniques available to users, allowing them to explore and compare different visual representations of their data. As a result, users can create complex workflows that encompass important steps of scientific discovery, from data gathering and manipulation, to complex analyses and visualizations, all integrated in one system.

A distinguishing feature of VisTrails is a comprehensive provenance infrastructure that transparently captures and maintains detailed history information about the steps followed and data derived in the course of an exploratory task [48]. Whereas workflows have traditionally been used to automate repetitive tasks, for applications that are exploratory in nature, such as simulations, data analysis, and visualization, very little is repeated; change is the norm. As a user generates and evaluates hypotheses about the data under study, a series of different, albeit related, workflows are created as they are adjusted in an iterative process. VisTrails was designed to manage these rapidly-evolving workflows: it maintains provenance of data products (e.g., visualizations, plots), of the workflows that derive these products, and of their executions. The system also provides annotation capabilities that allow users to enrich the automatically captured provenance. Besides enabling reproducible results, VisTrails leverages provenance information through a series of operations and intuitive user interfaces that help users to collaboratively analyze data. Notably, the system supports reflective reasoning by storing temporary results and by providing users the ability to examine the actions that led to a result and to follow chains of reasoning backward and forward [113]. Users can navigate workflow versions in an intuitive way, undo changes without losing any results, visually compare multiple workflows, and show their results side-by-side in a visualization spreadsheet [17, 48, 135].

Because the need for data analysis and visualization is pervasive across disciplines, VisTrails was designed with usability and extensibility in mind. VisTrails addresses important usability issues that have hampered a wider adoption of workflow and visualization systems. To cater to a broader set of users, including many who do not have programming expertise, it provides a series of operations and user interfaces that simplify workflow design and use, including the ability to create and refine workflows by analogy, to query workflows by example, and to suggest workflow completions as users interactively construct their workflows using a recommendation system [85, 130]. VisTrails is also linked to a new framework that allows the creation of custom applications that can be more easily deployed to (nonexpert) end users [127, 128]. The extensibility of VisTrails comes from an infrastructure that makes it simple for users to integrate tools and libraries, as well as to quickly prototype new functions. This has been instrumental in enabling the use of the system in a wide range of application areas, including environmental sciences [16, 70], psychiatry [10], astronomy [147], cosmology [9], high-energy physics [42], quantum physics [7], and molecular modeling [64].

CHAPTER 3
VISCOMPLETE: DATA-DRIVEN SUGGESTIONS FOR VISUALIZATION SYSTEMS

3.1 Introduction
Data exploration through visualization is an effective means to understand and obtain insights from large collections of data. Not surprisingly, visualization has grown into a mature area with an established research agenda [109], and a number of software systems have been developed that support the creation of complex visualizations [30, 71, 82, 83, 101, 114, 153, 154].
However, wider adoption of visualization systems has been greatly hampered by the fact that these systems are notoriously hard to use, in particular for users who are not visualization experts. Even for systems that have sophisticated visual programming interfaces, such as DX, AVS, and SCIRun, the path from the raw data to insightful visualizations is laborious and error-prone.

Visual programming interfaces expose computational components as modules and allow the creation of complex visualization pipelines that combine these modules in a dataflow, where connections between modules express the flow of data through the pipeline. They have been shown to be useful for comparative visualization and efficient exploration of parameter spaces [17]. Through the use of a simple programming model (i.e., dataflows) and by providing built-in constraint checking mechanisms (e.g., disallowing a connection between incompatible module ports), they ease the creation of pipelines. Nevertheless, without detailed knowledge of the underlying computational components, it is difficult to understand which series of modules and connections ought to be added to obtain a desired result. In essence, there is no "roadmap"; systems provide very little feedback to help the user figure out which modules can or should be added to the pipeline. A novice user (i.e., an experienced programmer who is unfamiliar with the modules and the dataflow of the system), or even an advanced user performing a new task, often resorts to manually searching for existing pipelines to use as examples. These examples are then adapted and iteratively refined until a solution is found. Unfortunately, this manual, time-consuming process is the rule rather than the exception when creating visualizations.

Recent work has shown that provenance information (the metadata required for reproducibility) can be used to simplify the process of pipeline creation by allowing pipelines to be refined and queried by example [130]. For example, a pipeline refinement can act as an analogy template for creating new visualizations. This is a powerful tool and can be helpful in situations where the user knows in advance what they want the end result to be. However, during pipeline creation, the user does not always have an analogy template readily available for the desired visualization. In these cases, the user is relegated to manually searching for examples.

In this chapter, we present VisComplete, a system that aids users in the process of creating visualizations by using a database of previously created visualization pipelines. The system learns common paths used in existing pipelines and predicts a set of likely module sequences that can be presented to the user as suggestions during the design process. The quality and nature of the suggestions depend on the data from which they are derived. Whereas in a single-user environment, suggestions are derived based on pipelines created by a specific user, in a multi-user environment, the "wisdom of the crowds" can be leveraged to derive a richer set of suggestions that includes examples with which the user is not familiar. User collaboration and social data reuse have proven to be powerful mechanisms in various domains, such as recommendation systems in commercial settings (e.g., Amazon, eBay, Netflix), knowledge sharing on open Web sites (e.g., Wikipedia), image labeling for computer vision (e.g., the ESP Game), and visualization creation (e.g., ManyEyes).
The underlying theme shared by these systems is that they use information provided by many users to solve problems that would otherwise be difficult. We apply a similar concept to pipeline creation: pipelines created by many users enable the creation of visualizations by consensus. For the user, VisComplete acts as an auto-complete mechanism for pipelines, suggesting modules and connections in a manner similar to a Web browser suggesting URLs. The completions are presented graphically in a way that allows the user to easily explore and accept suggestions, or to disregard them and continue working as before. Figure 3.1 shows an example of VisComplete incorporated into a visual programming interface, and Figure 3.2 shows some example completions for a single module.

We propose a recommendation system that leverages information in a collection of pipelines to provide advice to users of visualization systems and aid them in the construction of pipelines. By modeling pipelines as graphs, we develop an algorithm for predicting likely completions that searches for common subgraphs in the collection. We also present an interface that displays the recommended completions in an intuitive way. Our preliminary experiments show that VisComplete has the potential to reduce the effort and time required to construct visualizations. We found that the suggestions derived by VisComplete could have reduced the number of operations performed by users to construct pipelines by an average of over 50%. Note that although in this chapter we focus on the use of VisComplete for visualization pipelines, the techniques we present can be applied to general workflows.

Figure 3.1: The VisComplete suggestion system and interface. (a) A user starts by adding a module to the pipeline. (b) The most likely completions are generated using indexed paths computed from a database of pipelines. (c) A suggested completion is presented to the user. The user can browse through suggestions using the interface and choose to accept or reject the completion.

Figure 3.2: Three of the first four suggested completions for a "vtkDataSetReader" are shown along with corresponding visualizations. The visualizations were created using these completions for a time step of the Tokamak Reactor dataset that was not used in the training data.

The rest of this chapter is organized as follows. In Section 3.2, we discuss related work. In Section 3.3, we present the underlying formalism for generating pipeline suggestions, and in Section 3.4, we describe a practical implementation that has been integrated into the VisTrails system [153]. We then detail the use cases we envision in Section 3.5, report our experiments and results in Section 3.6, and provide a discussion of our algorithm in Section 3.7. We conclude in Section 3.8, where we outline directions for future work.

3.2 Related Work

Visualization systems have been successfully used to bring powerful visualization techniques to a wide audience.
Seminal workflow-based visualization systems, such as AVS Explorer [154], Iris Explorer [111], and Visualization Data Explorer [71], have paved the way for more recent systems designed using an object-oriented approach, such as SCIRun [114] for computational steering and the Visualization Toolkit (VTK) [83] for visualization. Systems that incorporate standard point-and-click interfaces and operate on data at a larger scale, such as VisIt [30] and ParaView [82], still use workflows as their underlying execution engine. Development in workflow systems for visualization is ongoing, as seen in projects such as MeVisLab [102] for medical visualization and VisTrails [153] for incorporating existing visualization libraries with other tools in a provenance-capturing framework. Our completion strategy can be combined with, and enhance, workflow and workflow-based visualization systems.

Recommendation systems have been used in different settings. Like VisComplete, these are based on methods that predict users' actions based solely on the history of their previous interactions [68]. Examples include Unix command-line prediction [87], prediction of Web requests [50, 112], and autocompletion systems such as IntelliSense [103]. Senay and Ignatius have proposed incorporating expert knowledge into a set of rules that allow automated suggestions for visualization construction [132], while Gilson et al. incorporate RDF-based ontologies into an information visualization tool [55]. However, these approaches necessarily require an expert who can encode the necessary knowledge into a rule set or an ontology. Fu et al. [50] applied association rule mining [3] to analyze Web navigation logs and discover pages that co-occur with high frequency in navigation paths followed by different users. This information is then used to suggest potentially interesting pages to users. VisComplete also derives predictions based on user-derived data, and it does so in an automated fashion, without the need for explicit user feedback. However, the data it considers are fundamentally different from Web logs: VisComplete bases its predictions on a collection of graphs, and it leverages the graph structure to make these predictions. Because association rule mining computes rules over sets of elements, it does not capture relationships (other than co-occurrence) amongst these elements.

In graphics and visualization, recommendation systems have been proposed to simplify the creation of images and visualizations. Design Galleries [97] were introduced to allow users to explore the space of rendering parameters by suggesting a set of automatically generated thumbnails. Igarashi and Hughes [72] proposed a system for creating 3D line drawings that uses rules to suggest possible completions of 3D objects. Suggestions have also been used for viewpoint selection in volume rendering. Bordoloi and Shen [158] and Takahashi et al. [143] present methods that analyze the volume from various viewpoints to suggest the view that best shows the features within the volume. Like these systems, we provide the user with prioritized suggestions that the user may choose to utilize. However, our suggestions are data-driven and based on examples of previous interactions.

An emerging trend in image processing is to enhance images based on a database of existing images. Hays and Efros [61] recently presented a system for filling in missing regions of an image by searching a database for similar images. Along similar lines, Lalonde et al.
[90] recently introduced Photo Clip Art, a method for intelligently inserting clip art objects from a database into an existing image. Properties of the objects are learned from the database so that they may be sized and oriented automatically, depending on where they are inserted into the image. Databases have also been used for completion in 3D modeling. Tsang et al. [151] proposed a modeling technique that utilizes previously created geometry, stored in a database of shapes, to suggest completions of objects. Like these methods, our completions are computed by learning from a database to find similarities. But instead of images, our technique relies on workflow specifications to derive predictions.

Another important trend is that of social visualization. Web-based systems such as VisPortal [19, 76] provide the means for collaborative visualization from disjoint locations. Web sites such as sense.us [63], Swivel [141], and ManyEyes [157] allow many users to create, share, and discuss visualizations. One key feature of these systems is that they leverage the knowledge of a large group of people to effectively understand disparate data. Similarly, VisComplete uses a collection of pipelines, possibly created by many users, to derive suggestions.

3.3 Generating Data-driven Suggestions

VisComplete suggests partial completions (i.e., sets of structural changes) for pipelines as they are being created by a user. These suggestions are derived using structural information obtained from a collection G of already-completed pipelines. Pipelines are specified as graphs, where nodes represent modules (or processes) and edges determine how data flows through the modules. More formally, a pipeline specification is a directed acyclic graph G(M, C), where M is a set of modules and C is a set of connections between modules in M. A module is a complex object that contains a set of input and output ports through which data flows in and out of the module. A connection between two modules m_a and m_b connects an output port of m_a to an input port of m_b.

3.3.1 Problem Definition

The problem of deriving pipeline completions can be defined as follows. Given a partial graph G, we wish to find a set of completions C(G) that reflect the structures that exist in a collection of completed graphs. A completion G_c of G is a supergraph of G. Our solution to this problem consists of two main steps. First, we preprocess the collection of pipelines G and create G_path, a compact representation of G that summarizes relationships between common structures (i.e., sequences of modules) in the collection (Section 3.3.2). Given a partial pipeline p, completions are generated by querying G_path to identify modules and connections that have been used in conjunction with p in the collection (Section 3.3.3).

3.3.2 Mining Pipelines

To derive completions, we need to identify graph fragments that co-occur in the collection of pipelines G. Intuitively, if a certain fragment always appears connected to a second fragment in our collection, we ought to predict one of those fragments when we see the other. Because we are dealing with directed acyclic graphs, we can identify potential completions for a vertex v in a pipeline by associating subgraphs downstream from v with those that are upstream. A subgraph S is downstream (upstream) of a vertex v if for every v′ ∈ S, there exists a path from v to v′ (from v′ to v).
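To make these directional notions concrete, here is a minimal sketch; it assumes pipelines are stored as networkx digraphs, an illustrative choice rather than the system's prescribed representation.

    import networkx as nx

    def downstream(g, v):
        # All v' such that a path v -> v' exists (Section 3.3.2).
        return nx.descendants(g, v)

    def upstream(g, v):
        # All v' such that a path v' -> v exists; by symmetry, this is
        # the downstream set after reversing every edge.
        return nx.ancestors(g, v)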
In many cases where we wish to complete a graph, we will know either the downstream or upstream structure and wish to complete the opposite direction. Note that this problem is symmetric: we can change one problem into the other by simply reversing the direction of the edges. However, due to the (very) large number of possible subgraphs in G, generating predictions based on subgraphs can be prohibitively expensive. Thus, instead of subgraphs, we use paths, i.e., linear sequences of connected modules. Specifically, we compute the frequency of each path in G. Completions are then determined by finding which path extensions are likely given the existing paths.

To efficiently derive completions from a collection of pipelines G, we begin by generating a summary of all paths contained in the pipelines. Because completions are derived for a specific vertex v in a partial pipeline (we call this vertex the completion anchor), we extract all possible paths that end or begin with v and associate them with the vertices that are directly connected downstream or upstream of v. Note that this leads to many fewer entries than the alternative of extracting all possible subgraph pairs. And as we discuss in Section 3.6, paths are effective and lead to good predictions.

More concretely, we extract all possible paths of length N and split each into a path of length N − 1 and a single vertex. Note that we do this in both forward and reverse directions with respect to the directed edges. This allows us to offer completions for pipeline pieces whether they are built top-down or bottom-up. The path summary G_path is stored as a set of (path, vertex) pairs, sorted by the number of occurrences in the database and indexed by the last vertex of each path (the anchor). Since predictions begin at the anchor vertex, indexing the path summary by this vertex leads to faster access to the predictions.

As an example of the path summary generation, consider the graph shown in Figure 3.3. We have the following upstream paths ending with D: A → C → D, B → C → D, C → D, and D. In addition, we also have the following downstream vertices: E and F. The set of correlations between the upstream paths and downstream vertices is shown in Figure 3.3. As we compute these correlations for all starting vertices over all graphs, some paths will have higher frequencies than others. The frequency (or support) of the paths is used for ranking purposes: predictions derived from paths with higher frequency are ranked higher.

Besides paths, we also extract additional information that aids in the construction of completions. Because we wish to predict full pipeline structures, not just paths, we compute statistics for the in- and out-degrees of each vertex type. This information is important in determining where to extend a completion at each iteration (see Figure 3.4). We also extract the frequency of connection types for each pair of modules. Since two modules can be connected through different pairs of ports, this information allows us to predict the most frequent connection type.

Figure 3.3: Deriving a path summary for the vertex D. The example graph has edges A → C, B → C, C → D, D → E, and D → F; the resulting (path, vertex) pairs are:

    path           vertex
    A → C → D      E
    A → C → D      F
    B → C → D      E
    B → C → D      F
    C → D          E
    C → D          F
    D              E
    D              F
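As an illustration of this mining step, the following sketch builds such a summary; the graph representation (networkx digraphs with a 'type' attribute on each node) and the function names are assumptions for illustration, not the system's actual code.

    from collections import defaultdict
    import networkx as nx

    def upstream_type_paths(g, anchor, max_len):
        # Module-type paths of length <= max_len ending at `anchor`,
        # including the trivial path (anchor,).
        stack, paths = [[anchor]], []
        while stack:
            path = stack.pop()
            paths.append(tuple(g.nodes[u]["type"] for u in path))
            if len(path) < max_len:
                for p in g.predecessors(path[0]):
                    stack.append([p] + path)
        return paths

    def build_path_summary(graphs, max_len=4):
        # G_path: (upstream path, downstream vertex type) -> count.  The
        # reverse direction is obtained by running the same procedure on
        # the graphs with their edges reversed.
        summary = defaultdict(int)
        for g in graphs:
            for v in g:
                for path in upstream_type_paths(g, v, max_len):
                    for w in g.successors(v):
                        summary[(path, g.nodes[w]["type"])] += 1
        return summary

Applied to the graph of Figure 3.3, the summary contains, among entries anchored at the other vertices, exactly the eight (path, vertex) pairs listed for D, each with count 1.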
Figure 3.4: Predictions are iteratively refined. At each step, a prediction can be extended upstream and downstream; in the second step, the algorithm only suggests a downstream addition. Also, predictions in either direction may include branches in the pipeline, as shown in the center.

3.3.3 Generating Predictions

Predicting a completion given the path summary and an anchor module v is simple: given the set of paths associated with v, we identify the vertices that are most likely to follow these paths. As shown in Algorithm 1, we iteratively develop our list of predictions by adding new vertices using this criterion.

Algorithm 1: Generate Predictions
Input: a set of paths P
Output: a set of workflow completions 𝒫

    GENERATEPREDICTIONS(P)
        possibles ← FIRSTPREDICTION(P)
        𝒫 ← [ ]
        while |possibles| > 0
            p ← REMOVEFIRST(possibles)
            newPossibles ← REFINE(p)
            if |newPossibles| = 0
                then 𝒫 ← 𝒫 + [p]
                else possibles ← possibles + newPossibles
        return 𝒫

At each step, we refine existing predictions by generating new predictions that add a new vertex based on the path summary information. Note that because there can be more than one possible new vertex, we may add more than one new prediction for each existing prediction. Figure 3.4 illustrates two steps in the prediction process. To initialize the list of predictions, we use the specified anchor modules (provided as input). At this point, each prediction is simply a base prediction that describes the anchor modules and possibly how they connect to the pipeline. After initialization, we iteratively refine the list of predictions by adding to each suggestion.

Because there are a large number of predictions, we need some criterion to order them so that users can easily locate useful results. We introduce confidence to measure the goodness of the predictions. Given the set of upstream (or downstream, depending on which direction we are currently predicting) paths, the confidence of a single vertex, c(v), measures how likely that vertex is given the upstream paths. To compute the confidence of a single vertex, we need to take into account the information given by all upstream paths. For this reason, the values in G_path are not normalized; we use the exact counts. Then, as illustrated by Figure 3.5, we combine the counts from each path. This means we do not need any weighting based on the frequency of paths; the formula takes this into account automatically. Specifically,

c(v) = ( Σ_{P ∈ upstream(v)} count(v | P) ) / ( Σ_{P ∈ upstream(v)} count(P) )

Figure 3.5: At each iteration, we examine all upstream paths to suggest a new downstream vertex. We select the vertex that has the largest frequency given all upstream paths. In this example, "vtkDataSetMapper" would be the selected addition. (Panels: (a) a pipeline fragment; (b) the paths database.)
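The sketch below mirrors this per-vertex computation; the two count tables are assumed to come from a mined path summary such as the one built in Section 3.3.2, and the names are hypothetical.

    def vertex_confidence(v, upstream_paths, follow_count, path_count):
        # c(v) = sum over P of count(v | P), divided by the sum over P of
        # count(P): raw counts are combined across all upstream paths, so
        # frequent paths carry more weight without explicit normalization.
        num = sum(follow_count.get((p, v), 0) for p in upstream_paths)
        den = sum(path_count.get(p, 0) for p in upstream_paths)
        return num / den if den else 0.0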
Then, the confidence of a graph G is the product of the confidences of each of its vertices:

c(G) = Π_{v ∈ G} c(v)

While the vertex confidences are not entirely independent, this measure gives a reasonable approximation of the total confidence of the graph. Because we perform our predictions iteratively, we calculate the confidence of the new prediction p_{i+1} as the product of the confidence of the old prediction p_i and the confidence of the new vertex v:

c(p_{i+1}) = c(p_i) · c(v)

For numerical stability, our implementation uses log-confidences, so the products are actually sums.

Because we wish to derive predictions that are not just paths, our refinement step begins by identifying the vertex in the current prediction from which we wish to extend the prediction. Recall that we computed the average in- and out-degree for each vertex type in the mining step. Then, for each vertex, we can compute the difference between the average degree for its type and its current degree in the current prediction direction. We choose to extend completions at vertices where the current degree is much smaller than the average degree. We also incorporate this measure into our vertex confidence so that predictions that contain vertices with too many edges are ranked lower:

c_d(v) = c(v) · degree-difference(v)

We stop iteratively refining our predictions after a given number of steps or when no new predictions are generated. At this point, we sort all of the suggestions by confidence and return them. If we have too many suggestions, we can choose to prune our set of predictions at each step by eliminating those that fall below a certain threshold.

3.3.4 Biasing the Predictions

The prediction mechanism described above relies primarily on the frequency of paths to rank the predictions. There are, however, other factors that can be used to influence the ranking. For example, if a user has been working on volume rendering pipelines, completions that emphasize modules related to that technique could be ranked higher than those dealing with other techniques. In addition, some users will prefer certain completions over others because they more closely mirror their own work or their own pipeline structures. Again, it makes sense to bias completions toward user preferences. We can adapt our algorithm to include such bias by incorporating a weighting factor into the confidence computation. Specifically, we adjust our counts by weighting the contribution of each path according to a pipeline importance factor determined by a user's preferences.

3.4 Implementation

Our implementation is split into three specific steps: determining when completion should be invoked, computing the set of possible completions, and presenting these suggestions to the user. Computing the possible completions requires the machinery developed in the previous section. The other steps are essential to make the approach usable. The interface, in particular, plays a significant role in allowing users to make use of suggestions while also being able to quickly dismiss them when they are not desired.

3.4.1 Triggering a Completion

We want to provide an environment where suggestions are offered automatically but do not interfere with a user's normal work patterns. There are two circumstances in pipeline creation where it makes sense to automatically trigger a completion: when a user adds a new module and when a user adds a new connection.
In each of these cases, we are given new information about the pipeline structure that can be used to narrow down possible completions. Because users may also wish to invoke completion without modifying the pipeline, we also provide an explicit command to start the completion process.

In each of the triggering situations, we begin the suggestion process by identifying the modules that serve as anchors for the completions. For new connections, we use both of the newly connected modules, and for a user-requested completion, we use the selected module(s). However, when a user adds a new module, it is not connected to the rest of the existing pipeline. Thus, it can be difficult to offer meaningful suggestions, since we have no surrounding structure to leverage. We address this issue by first finding the most probable connection to the existing pipeline and then continuing with the completion process. Finding the initial connection for an added module may be difficult when there are multiple modules in the existing pipeline that can be connected to the new module. However, because visual programming interfaces allow users to drag and place new modules in the pipeline, we can use the initial position of the module to help infer a likely connection. To accomplish this, we compute the user's layout direction based on the existing pipeline and locate the module that is nearest to the new module and can be connected to it.

3.4.2 Computing the Suggestions

As outlined in the previous section, we compute possible completions that emanate from a set of anchor modules in the existing pipeline using path summaries derived from a database of pipelines, and we rank them by their confidence values. Depending on the anchor modules, a very large set of completions can be derived, and a user is unlikely to examine a long list of suggestions. Therefore, we prune our predictions to avoid rare cases. This both speeds up computation and reduces the likelihood that we provide meaningless suggestions to the user. Specifically, because our predictions are refined iteratively, we prune a prediction if its confidence is significantly lower than its parent's confidence. Currently, this is implemented as a constant threshold, but we could use knowledge of the current distribution or iteration to improve our pruning.

VisComplete provides the user with suggestions that assist in the creation of the pipeline structure. Parameters are also essential components of visualizations, but because the choice of parameters is frequently data-dependent, we do not integrate parameter selection with our technique. Instead, we focus on helping users complete pipelines, and direct them to existing techniques [17, 77, 78, 96] to explore the parameter space. Note that it might be beneficial to extend VisComplete to identify commonly used parameters that a user might consider exploring, but we leave this for future work.

3.4.3 The Suggestion Interface

In concert with our goal of unobtrusiveness, we provide an intuitive and efficient interface that enables users to explore the space of possible completions. Auto-complete interfaces for text generally show a set of possible completions in a one-dimensional list that is refined as the user types. For pipelines, this task is more difficult because it is not feasible to show multiple completions at once, as this would result in visual clutter. The complexity of deriving the completion is also greater.
For this reason, our interface is two-dimensional: users can select from a list of full completions and then increase or decrease the extent of the completion. Current text completion interfaces defer to the user by showing completions but allowing the user to continue to type if they do not wish to use the completions. We strive for similar behavior by automatically showing a completion along with a simple navigation panel when a completion is triggered. The user can choose to interact with the completion interface or disregard it completely by continuing to work, which causes the completion interface to automatically disappear.

The navigation interface contains a set of arrows for selecting different completions (left and right) and depths of the current completion (up and down). In addition, the rank of the current completion is displayed to assist in navigation, and accept and cancel buttons are provided (see Figure 3.1(c)). All of these completion actions, along with the ability to start a new completion with a selected module, are also available in a menu and as shortcut keys.

The suggested completions appear in the interface as semitransparent modules and connections, so that they are easy to distinguish from the existing pipeline components. The suggested modules are also arranged in an intuitive way using a set of simple heuristics that respect the layout of the current pipeline. The first new suggested module is always placed near the anchor module. The offset of the new module from the anchor module is determined by averaging the direction and distance of each module in the existing pipeline. The offset for each additional suggested module is calculated by applying this same rule to the module to which it is appended. Branches in the suggested completion are simply offset by a constant factor. These heuristics keep the spacing uniform and can handle upstream or downstream completions whether pipelines are built top-down or left-to-right.

3.5 Use Cases

We envision VisComplete being used in different ways to simplify the task of pipeline construction. In what follows, we discuss use cases that consider different types of tasks and different user experience levels.

The types of tasks performed by a user can range from the very repetitive to the unique. Obviously, if the user performs tasks that are very similar to those in the database of pipelines, the completions that are suggested are very full; almost the entire pipeline can be created starting from one or two modules (see Figure 3.2 for examples). On the other hand, if the task being performed is not often repeated and nothing similar can be found in the database, VisComplete will only be able to assist with smaller portions of the pipeline at a time. This can still aid the user by showing possible directions in which to proceed with pipeline construction, albeit at a smaller scale.

The experience level of users who could take advantage of VisComplete also varies. For a novice user, VisComplete replaces the process of searching for and tweaking an example that performs the desired visualization. For example, a user who is new to VTK and wants to compute an isosurface of a volume might consult documentation to determine that a "vtkContourFilter" module is necessary and then search online for an example pipeline using this module. After downloading the example, they may be able to manipulate it to produce the desired visualization.
Using VisComplete, this process is simplified: the user needs only to start the pipeline by adding a "vtkContourFilter" module, and their pipeline will be constructed for them (see Figure 3.1). Multiple possible completions can easily be explored, and unlike examples downloaded from the Web, VisComplete can customize the suggestions by providing completions that more closely reflect a specific user's previous or more recent work.

For experienced users, VisComplete still offers substantial benefits. Because experts may not wish to see full pipelines as completions, the default depth of the completions can be adjusted as a preference so that only minor modifications are suggested at each step. Thus, at the smallest completion scale, a user can leverage just the initial connection completion to automatically connect new modules to their pipeline. The user could also choose to ignore suggested completions as they add modules until the pipeline is specific enough to shrink the number of suggestions. Unlike the novice user, who may iterate through many suggestions at each step, the experienced user will likely choose to ignore the suggestions until they provide the desired completion on the first try.

3.6 Evaluation

3.6.1 Data and Validation Process

To evaluate the effectiveness of our completion technique, we used a set containing 2875 visualization pipelines along with logs of the actions used to construct each pipeline. These pipelines were constructed by 30 students during a scientific visualization course.¹ Throughout the semester, the students were assigned five different tasks and carried them out using the VisTrails system, which captures detailed provenance of the pipeline design process: the series of actions a user followed to create and refine a set of related pipelines [48]. The first four tasks were straightforward and required little experimentation, but the final task was open-ended; users were given a dataset without any restrictions on the use of available visualization techniques. As these users learned about various techniques over the semester, their proficiency in the area of visualization presumably progressed from a novice level toward the expert level.

To predict the performance gains VisComplete might attain, we created user models based on the provenance logs captured by VisTrails. User modeling has been used in the HCI community for many years [23, 24], and we employed a low-level model for our evaluation. Specifically, we assumed that at each step of the pipeline construction process, a VisComplete user would either modify the pipeline according to the current action from the log or select a completion that adds a part of the pipeline they would eventually need. We assumed that a user would examine at most ten completions and could select a subgraph of any of these suggestions. Because VisComplete requires a collection of pipelines to derive suggestions, we divided our dataset into training and test sets. The training sets were used to construct the path summaries, while the test sets were used with the user models to measure performance. We note that this model presumes a user's foreknowledge of the completed pipeline, and this certainly is not always the case. Still, we believe this simple model approximates user behavior well enough to gauge performance. We also assumed a greedy approach in our model: a user would always take the largest completion that matched their final pipeline.
Note that this might not always yield the best performance, because the quality of the suggestions may improve as the pipeline is further specified.

¹ http://www.vistrails.org/index.php/SciVisFall2007

3.6.2 Results

Figure 3.6 shows one of the test pipelines, with the components that VisComplete could have completed highlighted, along with its resulting visualization. To evaluate the situation where a set of users create pipelines that all tend to follow a similar template, we performed a leave-one-out test for each task in our dataset. Figure 3.7 shows that our suggestion algorithm could have eliminated over 50%, on average, of the pipeline construction operations for each task. Because Task 1 was more structured than the other tasks, it achieved a higher percentage of reduction. Task 4 was more open-ended; although its average percentage is also high, the results show wider variation (between 30% and 75%). This indicates that the completion interface can be faster and more intuitive than manually choosing a template.

Because it is much more likely that our collection will contain pipelines from a variety of tasks, we also evaluated two cases that examined the type of knowledge captured by the pipelines. Since Task 5 was more open-ended and was completed after the four other tasks, we expected that most users would be proficient with the tool and closer to the expert user described in Section 3.5. We ran the completion tests using Tasks 1 through 4 as the training data (2250 pipelines) and Task 5 (625 pipelines) as the test data to represent a case where novice users are helping expert users, but we also ran this test in reverse to determine whether pipelines from expert users can aid beginners. Figure 3.8 shows that both tests achieved similar results; this implies that the variety of pipelines from the four novice tasks balanced the knowledge captured in the expert pipelines.

Our testing assumed that users would examine up to ten full completions before quitting. In reality, it is likely that users would give up even sooner. To evaluate how many predictions a user might need to examine before finding the desired completion, we recorded the index of the chosen completion in our tests. Figure 3.9 shows that the chosen completion was almost always among the first four. Note that we excluded completions that only specified the connection between the new module and the existing pipeline, because these trivial completions are possible at each prediction index.

Our results show that VisComplete can significantly reduce the number of operations required during pipeline construction. In addition, the completion percentages might be higher if our technique were available to the users, because it would likely change users' work patterns. For example, a user might select a completion that contains most of the structure they require plus some extraneous components and then delete or replace the extra pieces. Such a completion would almost certainly save the user time but was not captured by our user model. Finally, the parameters (e.g., pruning threshold, degree weighting) of the completion algorithms were not tuned. We plan to evaluate these settings to possibly improve our results.
The completion examples shown in the figures of this chapter, with the exception of Figure 3.6, used the entire collection of pipelines to generate predictions; Figure 3.6 used only the pipelines from Tasks 1-4.

Figure 3.6: One of the test visualization pipelines applied to a time step of the Tokamak Reactor dataset. VisComplete could have made many completions that would have reduced the amount of time spent creating the pipeline. In this case, about half of the modules and connections could have been completed automatically.

Figure 3.7: Box plot of the percentages of operations that could be completed per task (higher is better). The statistics were generated for each user by taking them out of the training data.

Figure 3.8: Box plot of the percentages of operations that could be completed given two types of tasks, novice and expert. The statistics were generated by evaluating the novice tasks using the expert tasks as training data (novice) and by evaluating the expert tasks using the novice tasks as training data (expert).

Figure 3.9: Box plot of the average prediction index that was used for the completions in Figure 3.7 (lower is better). These statistics provide a measure of how many suggestions the user would have to examine before the correct one was found.

3.7 Discussion

To our knowledge, VisComplete is the first approach for automatically suggesting pipeline completions using a database of existing pipelines. As large volumes of data continue to be generated and stored, and as analyses and visualizations grow in complexity, the creation of new content by consensus and the ability to learn by example are essential to enable broader use of data analysis and visualization tools.

The major difference between our automatic pipeline completion technique and the related work on creating pipelines by analogy [130] is that instead of using a single, known sequence of pipeline actions, our method uses an entire database of pipelines. Thus, instead of completing a pipeline based on a single example, VisComplete uses many examples. A second important difference is that instead of predicting a new set of actions, our method currently predicts new structure regardless of the ordering of the additions. This also means that VisComplete only adds to the structure, whereas analogies may delete from the structure as well. By incorporating more provenance information, as in analogies, VisComplete might be able to leverage more information about the order in which additions to a pipeline are made. This could improve the quality of the suggested completions.

We note that there will be situations where data about the types of completions that should occur are not available. Also, some suggestions might not correspond to the user's desires. If there are no completions, VisComplete will not derive any suggestions. If there are completions that do not help, the user can dismiss them by either continuing their normal work or explicitly canceling the completion. Currently, we determine the completions in an offline step (by precomputing the path summary, Section 3.3).
We could update the path summary incrementally, incorporating new pipelines as they are added to the repository. In addition, we could learn from user feedback by, for example, allowing users to remove suggestions that they do not want to see again. Completions could be further refined by assigning greater weight to those that more closely mirror the current user's actions, even if they are not the most likely in the database.

One important aspect of our technique is that it leverages the visual programming environment available in many visualization systems. In fact, it would be difficult to offer suggestions without a visual environment in which to display the structural changes. In addition, the information for the completions comes from the structured pipelines of previous work; without an interface for constructing pipeline structures, it would be more difficult to process the data used to generate completions. However, we should note that turnkey applications that are based on workflow systems, such as ParaView [82], may also be able to take advantage of completions in a more limited way by providing a more intelligent set of default settings for the user during their explorations.

3.8 Summary

We have described VisComplete, a new method for aiding in the design of visualization pipelines that leverages a database of existing pipelines. We have demonstrated that suitable pipeline fragments can be computed from the database and used to complete new pipelines in real time. Furthermore, we have shown how these completions can be presented to the user in an intuitive way that can potentially reduce the time required to create pipelines. Our results indicate that substantial effort can be saved using this method for both novice and expert users.

There are several areas of future work that we would like to pursue. As described above, we would like to update the database of pipelines incrementally, thus allowing the completions to be refined based on current information and feedback from the user. We plan to refine the quality of the results by formally investigating the confidence measure and its parameters. We would also like to explore suggesting finished pipelines from the database in addition to the constructed completions we currently generate. For finished pipelines, we could display not only the completed pipeline structure but also a thumbnail of the result from an execution of that pipeline.

CHAPTER 4

EFFICIENT EVALUATION OF EXPLORATORY QUERIES OVER PROVENANCE COLLECTIONS

4.1 Introduction

Increasingly, scientific exploration requires advanced computing capabilities to help researchers obtain insights into large datasets. The processes required to analyze and visualize data are often defined as workflows, which are iteratively refined as researchers formulate and test hypotheses. To manage these complex analyses, including the intermediate and final data products, workflow systems have been developed that track the provenance of the data products as well as of the workflow evolution [39, 47]. As the volume of provenance captured by these systems grows and is shared among users, new opportunities are created for knowledge reuse. Different kinds of queries can be posed against provenance [121].
Since workflow provenance can be represented as a graph [39], queries that seek the detailed derivation history of a given data product require that the provenance graph be recursively traversed (backwards), starting from the node that represents the data product. Another useful class of queries involves exploring the structure of the workflows that derive the data products. The workflows (and workflow traces) shared in provenance repositories expose users to examples of (sophisticated) uses of tools and libraries [33, 110]. By querying this information, users can leverage the collective wisdom it encodes. Not only can users find workflows that are relevant for a particular task and learn to assemble new workflows by example [18, 130, 131], but recommendation systems can be built that leverage this information to guide users in the workflow design process [85]. This is especially important given that, despite the growing popularity of workflow systems, constructing workflows is often a challenging and time-consuming task. Detailed knowledge of the underlying computational components is necessary to determine what modules and connections ought to be added to obtain a desired result.

While there has been work on speeding up recursive queries over provenance graphs [65], the problem of evaluating structural queries has been largely overlooked. In this chapter, we study the problem of efficiently evaluating structural queries that are exploratory in nature. Exploratory queries are naturally expressed as simple graphs that may contain wildcards, in contrast to standard containment queries like the one shown in Figure 4.1. For example, Figure 4.2 shows an exploratory query posed by a scientist interested in habitat modeling reports that were generated using the RandomForest model with a climate predictor layer. This query can be quickly defined without the need to understand exactly how the different components are connected. Queries with wildcards are useful to search for workflows (or subworkflows) that contain a given structural pattern, but can also be used to identify possible directions for completing an unfinished workflow. For example, when a workflow designer is faced with a known input and desired output, it is helpful to identify different subworkflows that can connect the source and sink nodes of the graph (see Section 4.5).

Although there has been substantial work on graph indexing techniques to speed up the evaluation of fully-specified structural queries [133, 166, 168], the same cannot be said of the problem of efficiently evaluating exploratory queries: existing approaches have focused on connected-graph queries, not queries that are disconnected or contain wildcards. In addition, while the filtering step in these indexing schemes significantly reduces the number of required (and costly) subgraph isomorphism checks, vague queries often have a large number of answers, all of which must be verified through subgraph isomorphism. FG-Index introduced a verification-free indexing scheme to address this issue [29], but this comes at a cost: when the number of frequent subgraphs is large, the index may become prohibitively large.

We propose a flexible, two-level framework to support exploratory queries over provenance collections. Building on graph indexing techniques, we add 2-component frequent subgraphs to the index to support vague queries such as those with wildcards, and summary graphs to limit the time spent verifying candidate graphs after the filtering step.
By augmenting the collection with summary graphs before constructing a discriminative index, we can process queries by verifying summary graphs first, reducing the total number of subgraph isomorphism checks required. We implemented a prototype mechanism and evaluated it on two large collections of provenance information.

This chapter is organized as follows. We review workflow definitions as well as graph terminology in Section 4.2 before introducing our indexing framework in Section 4.3. In Section 4.4, we detail our implementation, and Section 4.5 describes applying the framework to workflow completions. We evaluate our framework using provenance data from visualization and Yahoo! Pipes workflows in Section 4.6. We discuss extensions and limitations in Section 4.7 and review related work in Section 4.8 before concluding in Section 4.9.

Figure 4.1: A standard containment query searches a collection to find workflows with the specified subgraph.

Figure 4.2: An exploratory query allows wildcards to permit less-specific queries. The dashed lines in the query are wildcard paths; each result must contain a path between the connected modules.

4.2 Background

Before presenting our indexing framework for exploratory queries over provenance collections, we review terminology and definitions. Specifically, we wish to abstract these constructs to graphs in order to leverage and extend existing graph indexing techniques. We first review the correspondence between provenance and workflows, then define queries over workflow collections, and finally abstract this to graphs.

4.2.1 Provenance and Workflows

Provenance information is represented as a directed acyclic graph (DAG) encoding dependencies among computational steps. Similarly, workflows can also be represented as a graph specifying the order of computation, and most scientific workflows are dataflows, which are also DAGs. Furthermore, when provenance is generated during the execution of a workflow, the provenance graph directly reflects the structure of the workflow. Thus, a query over provenance graphs (or parts of that query) can often be translated into a query over workflows. In many cases, the workflow specification is shared among several provenance traces derived from multiple executions of similar workflows. In addition, the workflow graph can be much more compact than the provenance graph, especially for workflows that include looping constructs. Thus, while our indexing framework can be directly applied to provenance graphs, it is usually more efficient to index the workflows behind the provenance graphs.

A workflow is a set of steps usually associated with some partial order. The steps followed can be controlled by their order or a set of logical constraints, or dictated by human input.
A dataflow is a special kind of workflow that is a DAG.¹ In a dataflow, each node performs a computation and edges define the flow of data from the outputs of one node to the inputs of another [92]. While general workflows may contain cycles and explicit control constructs [1], their provenance can be represented as a DAG, with loops unrolled and branches selected. Formally, a workflow w is a set of computational modules linked by connections that define the flow of data from one module to another. This is often represented as a DAG whose vertices are modules and edges are connections. Each vertex and edge is labeled with the type of module or connection it represents. For example, the center module in Figure 4.3 has the type RunModels, and the type of the connection from it to the BuildMDS module is defined by the ports used to connect the modules.

¹ The dataflow model is the most prevalent model supported by scientific workflow systems.

Figure 4.3: A representative workflow from a collection of workflows used for habitat modeling.

4.2.2 Queries Over Provenance Collections

A provenance collection consists of a set of provenance records. The collection may contain records generated from multiple executions of a single workflow, from a variety of workflows created as part of a collaborative scientific project, or from an entire database of workflows built by members of a scientific research group over a period of many years. Note that large, distributed collections implicitly contain a wealth of scientific information, cataloging different strategies, experimental approaches, and results. As described earlier, because provenance records often contain (or link to) the specifications of the workflows that were run, these collections often contain an embedded collection of workflows.

Some queries can be posed against a single workflow, others involve the differences between two workflows, but many are best answered by examining an entire collection. If a user wishes to know exactly which predictors in the workflow shown in Figure 4.3 affect the maps generated in the report, they need only analyze that single workflow. Another important type of query is identifying differences between pairs of workflows [15]. However, users are often interested in searching a collection of workflows to find those that exhibit specific behaviors. For example, a user may wish to locate all workflows that use a RandomForest module and a climate predictor and that generate a report. We focus on this type of query.
This type of query can be used when looking for a particular region of functionality; for example, searching for all workflows that run a predictor and resamples its results.2 The problem with these containment queries is that the user must know exactly what to look for-the exact module types and connectivity. We suggest a more powerful form of workflow queries where the query allows wildcards for module or connection types. This relaxation allow queries to specify existence of paths in addition to direct connections, and existence of a module rather than a specific module type. More formally, an exploratory workflow query is a partial workflow q, a workflow where modules and connections can have the wildcard type meaning any type of module or connectivity is allowed. Then, a workflow w satisfies the exploratory query q if there exists an injective function f such that typepmq or typepmq typepfpmqq typepcpm1;m2qq ùñ D pathpfpm1q; fpm2qq P w typepcq ùñ typepcq typepfpcqq where fpcq fpcpm1;m2qq c1pfpm1q; fpm2qq. Note that exploratory queries offer far greater flexibility; users can query a collection without worrying about steps that are not important to their search. For example, a user may wish to find all workflows that use a RandomForest module and eventually output an HTML report that includes information from that model; whether or not BuildMap is used is not relevant to the user. In an exploratory query, wildcards can be used to indicate that a path must connect the two modules but with no restrictions on what modules that path connects. See Figures 4.1 and 4.2 for an example of the difference between containment and exploratory queries. 2Note that workflow queries may also include information about parameters: these can also be specified as part of the workflow. 37 4.2.3 Graphs and Isomorphisms Because we wish to make use of existing graph indexing approaches, we propose a translation from workflow queries to queries over collections of graphs. Workflows can be naturally represented as labeled graphs whose vertices and edges are labeled by the module or connection type. Formally, a workflow w can be represented by the labeled graph GpV;Eq where each module in w is repre-sented by a vertex in V and each connection is an edge in E. In addition, the labeling functions, LV pvq and LEpeq are defined as the types of the modules and connections, respectively. Then, a basic workflow query can be immediately translated into a subgraph isomorphism problem, and exploratory workflow queries can be translated to an extension of subgraph isomorphism involving wildcards. Two graphs G and H are isomorphic if there exists a bijective function f : V pGq Ñ V pHq such that for every edge pvi; vjq P EpGq, there exists an edge pfpviq; fpvjqq P EpHq and vice versa. If G and H are labeled graphs, then f must also preserve labels: LV pviq LV pfpviqq and LEppvi; vjqq LEppfpviq; fpvjqqq. If we relax f to be an injective function, then G is subgraph isomorphic to H, G H, again with the same restrictions for labeled graphs. Much of the existing graph indexing work has focused on speeding up graph containment queries: given a query graph Q, find all graphs G in the collection for which Q G. This type of query is analogous to our workflow containment query, and thus these approaches do not support exploratory queries with wildcards. To extend these techniques, we must first extend the definition of subgraph isomorphism to incorporate wildcards. 
Much of the existing graph indexing work has focused on speeding up graph containment queries: given a query graph $Q$, find all graphs $G$ in the collection for which $Q \subseteq G$. This type of query is analogous to our workflow containment query, and thus these approaches do not support exploratory queries with wildcards. To extend these techniques, we must first extend the definition of subgraph isomorphism to incorporate wildcards.

A wildcard graph $G^*$ is a labeled graph where any edge can have a special label $*$ that denotes a path (not necessarily a single connection) between two vertices. Then $G^*$ is wildcard subgraph isomorphic to $H$ if $G^* \setminus \{e \mid L_E(e) = *,\ e \in E(G^*)\}$ ($G^*$ excluding all wildcard edges) is subgraph isomorphic to $H$ and, for each wildcard edge $(v_i, v_j)$, there exists a path from $f(v_i)$ to $f(v_j)$ such that no internal vertex of this path is in $f(V(G^*))$. Note that the restriction on the path ensures that a vertex that is specified in the query, but not identified as part of a path, cannot be used as an internal vertex of a wildcard path.
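Verifying a wildcard match thus has two stages: match the non-wildcard part as usual, then check each wildcard edge for a connecting path that avoids the mapped vertices. Below is a minimal sketch of the second stage, assuming the mapping $f$ has already been found; the helper name and adjacency layout are illustrative assumptions.

```python
# A sketch of wildcard-edge verification via breadth-first search.
from collections import deque

def wildcard_edges_ok(f, wildcard_edges, w_adj):
    """f: query vertex -> workflow vertex (the image of the non-wildcard match).
    wildcard_edges: [(q_src, q_dst)] pairs joined by a '*' edge in the query.
    w_adj: adjacency of the workflow graph, vertex -> set of successors.
    Each wildcard edge needs a path whose internal vertices avoid f's image."""
    image = set(f.values())
    for qs, qd in wildcard_edges:
        src, dst = f[qs], f[qd]
        seen, frontier, found = {src}, deque([src]), False
        while frontier and not found:
            u = frontier.popleft()
            for v in w_adj.get(u, ()):
                if v == dst:                      # reached the mapped endpoint
                    found = True
                    break
                if v not in seen and v not in image:
                    seen.add(v)                   # internal vertices must lie
                    frontier.append(v)            # outside f(V(G*))
        if not found:
            return False
    return True
```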
4.3 Indexing Framework

Having abstracted provenance and workflow queries over provenance collections to graph queries, we propose extensions to existing graph indexing frameworks to support exploratory queries. The inherent graph structure in provenance queries means they are subject to theoretical constraints on subgraph isomorphism, which is known to be NP-Complete [31]. Thus, performing a subgraph isomorphism check for each graph in the collection will not scale. We propose a two-level framework that extends existing graph indexing techniques by incorporating summary graphs, which capture verification-free subgraph isomorphisms, and discriminative features defined over the extended provenance collection. Our goal is to reduce the total number of subgraph isomorphism checks while at the same time allowing less specific, and thus more exploratory, queries. The framework is rooted in the observation that even if we cannot resolve a query without any verification as in [29], we can reduce the number of subgraph isomorphism computations by finding a subset of the result set with limited verification.

4.3.1 Standard Graph Indexing

Standard graph indexing seeks to make subgraph containment queries over collections of graphs more efficient by limiting the number of subgraph isomorphism checks. Indexing strategies have primarily fallen into two categories: feature-based methods (see, e.g., [133, 166, 168]) and hierarchical organization [62, 162]. We focus on feature-based methods because, for exploratory searches, users are often querying for specific (and often disconnected) features.

Feature-based graph indexing identifies features that aid in distinguishing graphs in a collection from each other. Each feature is linked to the graphs that contain it, and all features are organized in a hierarchy according to feature size. Queries are evaluated by identifying a set of features contained by the query and computing the intersection of the graphs associated with each feature. A graph must contain the same set of features as the query, but this is not sufficient because the features do not necessarily uniquely identify a graph. Thus, we must check whether each candidate graph is subgraph isomorphic to the query. The fewer isomorphisms we compute, the faster the query execution. Thus, we wish to find a set of features that minimizes the size of the candidate set.

4.3.1.1 Identifying Features

The first ingredient in graph indexing is identifying features that will help to differentiate the graphs in our collection. To minimize the size of the index, we wish to find a set of features that serves to filter the collection into small subsets of graphs such that the features are not redundant. Formally, given a graph collection $\mathcal{G}$, a subgraph $H$ is frequent with respect to a threshold $T$ if $|\mathrm{supp}(H)| \geq T$, where the support of a subgraph $H$ is

$$\mathrm{supp}(H) = \{G \mid H \subseteq G \in \mathcal{G}\}.$$

Note that if a query graph contains a given frequent subgraph $H$, we can immediately exclude all graphs in $\mathcal{G}$ that are not in $\mathrm{supp}(H)$.

For a frequent subgraph $H$, any subgraph of $H$ is also frequent because any graph that contains $H$ must also contain all subgraphs of $H$. This means that there may exist a large number of frequent subgraphs when a dataset has a large pattern that occurs frequently. More generally, when frequent subgraphs have similar support values, they serve to prune nearly the same set of graphs. We wish to select a smaller set of frequent subgraphs that still provides good pruning power. This implies that selected feature subgraphs should not significantly overlap. Given the collection $\mathcal{G}$ and a set of subgraphs $\mathcal{F}$, a subgraph $F$ is discriminative if

$$|\mathrm{supp}(F)| \ll \Big| \bigcap_{F' \in \mathcal{F},\ F' \subset F} \mathrm{supp}(F') \Big|.$$

Figure 4.4 shows a set of frequent subgraphs, their respective supports, and the size of the intersection of the supports of their subgraphs. Note that $F_2$ and $F_3$ are well indexed by $F_4$, $F_5$, and $F_6$ and thus are not discriminative.

Figure 4.4: Discriminative features. Six frequent subgraphs $F_1, \ldots, F_6$ over vertex labels A, B, and C, with $|\mathrm{supp}(F_1)| = 10$, $|\mathrm{supp}(F_2)| = 32$, $|\mathrm{supp}(F_3)| = 35$, $|\mathrm{supp}(F_4)| = 70$, $|\mathrm{supp}(F_5)| = 54$, $|\mathrm{supp}(F_6)| = 49$, and $|\mathrm{supp}(F_4) \cap \mathrm{supp}(F_6)| = 36$, $|\mathrm{supp}(F_4) \cap \mathrm{supp}(F_5)| = 39$, $|\mathrm{supp}(F_2) \cap \mathrm{supp}(F_3)| = 28$. Because the graphs identified by a feature may also be identified by subgraphs of that feature, we choose discriminative features to be those whose subgraphs collectively identify many more graphs. For example, $F_1$ is selected because the combination of $F_2$ and $F_3$ identifies $28 \gg 10$ graphs.
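The selection criterion can be made operational by fixing a ratio that stands in for "much greater than," in the spirit of feature-based indexes such as gIndex. The following sketch assumes features are visited in order of increasing size and that support sets have already been computed by frequent-subgraph mining; the `gamma` threshold and helper names are illustrative assumptions.

```python
# A sketch of discriminative-feature selection over precomputed supports.

def select_discriminative(features, supp, subgraphs_of, gamma=2.0):
    """features: feature ids ordered by size; supp: id -> set of graph ids;
    subgraphs_of: id -> ids of the feature's proper subgraphs.
    Keep F only if its already-selected subfeatures prune far less than F."""
    selected, order = set(), []
    for f in features:
        subs = [s for s in subgraphs_of.get(f, []) if s in selected]
        if not subs:
            selected.add(f)       # no smaller selected feature covers it
            order.append(f)
            continue
        inter = set.intersection(*(supp[s] for s in subs))
        # |intersection of subfeature supports| >> |supp(F)| => discriminative
        if len(inter) >= gamma * len(supp[f]):
            selected.add(f)
            order.append(f)
    return order
```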
4.3.1.2 Index Construction and Query Processing

After identifying the discriminative features, we build an index by organizing the features into a hierarchy to facilitate a priori pruning. Note that this hierarchy may contain features that are not discriminative in order to simplify traversals during query processing. Because each feature is linked to a list of graphs that contain the feature, we can easily prune our search space for each feature in the query. A query is processed by starting with individual vertices and building features of increasing size by traversing the hierarchical index. Once we have the maximal features from the query, we intersect the lists of graphs associated with each of the features. The intersection of these graph lists forms the candidate set of graphs that may satisfy the query. Because we do not know whether the candidates actually match the query, we must then verify each candidate by computing a subgraph isomorphism. Because subgraph isomorphism can be costly, it is important to have features that prune a large portion of the collection.

4.3.2 Wildcard Graph Indexing

Standard graph indexing techniques present two major issues when dealing with exploratory provenance queries. The first is that they usually assume that queries are connected graphs, which is not necessarily the case when dealing with workflows. For example, suppose that a user wishes to find a workflow that uses a particular data source and produces a figure in a specific output format. In this case, the user does not care about the internals of the workflow, so the standard containment query does not apply. A second issue is that answering queries with a large number of satisfying workflows may result in many subgraph isomorphism calculations. A vague query, like one to find a common subworkflow, might produce many candidates after filtering, all of which need to be verified. We introduce 2-component frequent subgraphs and summary graphs to address these issues.

4.3.2.1 2-Component Frequent Subgraphs

Because exploratory queries frequently contain only pieces of a graph, we propose an indexing strategy that considers disconnected frequent subgraphs. Most existing frequent subgraph mining algorithms can be extended to also consider disconnected subgraphs. The problem with doing so is that the number of frequent subgraphs jumps exponentially: any frequent subgraph with $n$ vertices has on the order of $2^n$ possible disconnected subgraphs that are also frequent. We can classify these disconnected subgraphs by the number of components. An $m$-component subgraph is a subgraph whose vertices can be partitioned into no fewer than $m$ sets such that there does not exist any path from a vertex in one set to a vertex in another set. Including 2-component subgraphs in our set of frequent subgraphs increases the number of frequent subgraphs by only a quadratic amount. In addition, a frequent subgraph with $n$ components contains $O(n^2)$ 2-component subgraphs, so we still have a large number of features to help prune the search space. Figure 4.5 shows an example where the two-component subgraph $F_1$ filters many more graphs than $F_2$ and $F_3$. This usually occurs when the query requires the components to be nonoverlapping, but many of the graphs indexed by the single-component features have them overlapping.

Figure 4.5: A 2-component feature. $|\mathrm{supp}(F_1)| = 22$, $|\mathrm{supp}(F_2)| = 62$, $|\mathrm{supp}(F_3)| = 57$, and $|\mathrm{supp}(F_2) \cap \mathrm{supp}(F_3)| = 50$. While the features $F_2$ and $F_3$ occur together often, they are usually disjoint as captured by the two-component feature $F_1$: $|\mathrm{supp}(F_1)| \ll |\mathrm{supp}(F_2) \cap \mathrm{supp}(F_3)|$.

4.3.2.2 Summary Subgraphs

While frequent subgraphs prune the search space and help quickly locate graphs that may satisfy the query, we still need to check every graph that remains after pruning. This verification step involves the computation of a subgraph isomorphism, and it can be even more costly when wildcards are involved. Cheng et al. proposed FG-Index as a way to eliminate the verification step by noting that when the query is itself a frequent subgraph, the indexed graphs automatically satisfy the query [29]. While these verification-free answers are ideal, indexing all of the frequent subgraphs, not only discriminative ones, can lead to prohibitive index sizes. We propose summary subgraphs as a scalable way to limit the number of verification steps. A summary subgraph $F$ is linked to a subset of the graph collection where each graph $G$ is a supergraph of $F$. Then, if a summary subgraph satisfies the query, we know that all of the graphs the summary subgraph indexes also satisfy the query. In addition, we only include subgraphs that do not have immediate supergraphs that index a similar number of graphs. See Figure 4.6 for an example showing which subgraphs are selected as summary features. Formally, a subgraph $F$ is a summary subgraph in a set of subgraphs $\mathcal{F}$ if for all $F' \in \mathcal{F}$ with $F \subset F'$:

$$|\mathrm{supp}(F)| \gg |\mathrm{supp}(F')|.$$
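Under this definition, selecting summary subgraphs is a simple scan over the mined frequent subgraphs. The sketch below uses an illustrative ratio threshold in place of "much greater than"; the helper names are assumptions.

```python
# A sketch of summary-subgraph selection over precomputed supports.

def select_summary(frequent, supp, supergraphs_of, ratio=2.0):
    """frequent: frequent subgraph ids; supp: id -> set of graph ids;
    supergraphs_of: id -> ids of the subgraph's immediate frequent supergraphs.
    Keep F if every immediate supergraph indexes far fewer graphs than F."""
    return [f for f in frequent
            if all(len(supp[f]) >= ratio * len(supp[g])
                   for g in supergraphs_of.get(f, []))]
```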
A summary subgraph is analogous to the δ-tolerance closed frequent subgraph [29], but we use them differently. When a query graph $H$ is found to be a subgraph of a summary subgraph $G$, we know that all of the graphs that $G$ indexes also satisfy $H$. Thus, if this single verification of $H \subseteq G$ succeeds, we avoid verifying all of the graphs $G$ indexes. Note that graphs not indexed by $G$ may also satisfy $H$; we leave those remaining graphs to either other summary graphs or basic verification using subgraph isomorphism. In addition, because feature mining is already required for feature-based indexing techniques, finding summary subgraphs adds minimal computation.

Figure 4.6: A summary feature. Four frequent subgraphs $F_1, \ldots, F_4$ with $|\mathrm{supp}(F_1)| = 52$, $|\mathrm{supp}(F_2)| = 62$, $|\mathrm{supp}(F_3)| = 57$, $|\mathrm{supp}(F_4)| = 60$, and $|\mathrm{supp}(F_1) \cap \mathrm{supp}(F_2)| = 50$, $|\mathrm{supp}(F_1) \cap \mathrm{supp}(F_3)| = 51$, $|\mathrm{supp}(F_1) \cap \mathrm{supp}(F_4)| = 50$. Because each subgraph of a frequent subgraph is also frequent, we choose summary features to be those whose supergraphs have much smaller frequency.

4.3.2.3 Index Construction and Query Processing

Our index is composed of both summary and discriminative features. Both summary and discriminative features link to supergraphs of themselves that exist in the graph database, as illustrated in Figure 4.7. Because we need to identify the summary subgraphs during query processing just like any other candidate graph, our discriminative features index the summary graphs as well as the graphs in the database. Additionally, after identifying the summary subgraphs, any graph indexing scheme can be applied to this extended graph database. Index construction begins by mining a set of connected and 2-component frequent subgraphs. Then, we identify the summary graphs and add them to the collection. Next, we create an index over the augmented collection; because we have already mined features, it is more efficient to use a feature-based scheme. As described earlier, discriminative features are then selected from the mined subgraphs to complete the index.
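Query processing over this two-level index then has three steps: filter candidates by intersecting feature lists, let successful checks against summary graphs admit all of the graphs they index at once, and verify whatever remains individually. A minimal sketch under those assumptions follows; the helper names and data layout are illustrative, not the dissertation's implementation.

```python
# A sketch of two-level query processing: feature filtering, summary-graph
# shortcuts, then per-graph verification for the leftovers.

def process_query(query_features, feature_index, summaries, indexed_by,
                  is_subgraph, query):
    """query_features: nonempty set of features found in the query;
    feature_index: feature -> set of graph ids (summary graphs included);
    summaries: ids of summary graphs; indexed_by: summary id -> graph ids;
    is_subgraph(q, g): verification, e.g. a (wildcard) isomorphism check."""
    # 1. Filter: a candidate must contain every feature found in the query.
    candidates = set.intersection(*(feature_index[f] for f in query_features))
    results = set()
    # 2. One verification against a summary graph covers all graphs it indexes.
    for s in summaries:
        if s in candidates and is_subgraph(query, s):
            covered = indexed_by[s] & candidates
            results |= covered
            candidates -= covered
        candidates.discard(s)   # summary graphs are not query answers themselves
    # 3. Verify the graphs that no summary graph covered.
    results |= {g for g in candidates if is_subgraph(query, g)}
    return results
```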
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6hx1td5 |



