| Title | Auto-tuning methods for scaling resources in networks and cloud-based systems |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Novak, Joseph Herschel |
| Date | 2019 |
| Description | We address two fundamental problems in scaling network traffic and Cloud-based resources: eliminating the need to pre-tune to a specific operating environment and decreasing the effective launch time of virtual machines. Existing methods that adapt computing resources or network traffic to the operating environment have parameters and thresholds that must be manually pre-tuned. Determining appropriate values is a time-consuming and arduous task. When the environment or offered workload changes in unexpected ways, they may no longer be effective and systems must be re-tuned to the new conditions. In this dissertation, we develop methods to eliminate the need to pre-specify the parameters and thresholds. We design methods to adaptively set these parameters and thresholds, either algorithmically or by exploiting characteristics intrinsic to the system model. We refer to the latter as natural thresholds. We develop an active queue management algorithm that scales network traffic, relying only on network run-time measurements and a natural threshold, the knee on the delay-utilization curve. For scaling Cloud-based applications, we develop simple, knobless, reactive approaches that are robust against changing workload. We model the miss ratio curve of memory-intensive applications as a hyperbola and choose a natural threshold on the curve at run time. To decrease effective virtual machine launch time, we use Cloud functions as temporary computing resources while the virtual machine launches. We implement the active queue management algorithm in the Linux kernel in a testbed, which we build, and in a network simulator. We implement resource scaling on two major Cloud provider platforms. The natural threshold methods we develop are effective in scaling network traffic and Cloud-based resources. Our methods outperform state-of-the-art algorithms in most of our evaluations. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Active queue management; AQM; cloud; miss ratio curves; MRC; scaling |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | © Joseph Herschel Novak |
| Format | application/pdf |
| Format Medium | application/pdf |
| ARK | ark:/87278/s62k2bct |
| Setname | ir_etd |
| ID | 1699888 |
| OCR Text | AUTO-TUNING METHODS FOR SCALING RESOURCES IN NETWORKS AND CLOUD-BASED SYSTEMS

by Joseph Herschel Novak

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science. School of Computing, The University of Utah, May 2019. Copyright © Joseph Herschel Novak 2019. All Rights Reserved.

The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL

The dissertation of Joseph Herschel Novak has been approved by the following supervisory committee members: Sneha Kumar Kasera, Chair (approved 11/26/2018); Ryan Stutsman, Member (approved 11/26/2018); Neal Patwari, Member (approved 11/26/2018); Erik Brunvand, Member (approved 11/26/2018); and Hari Sundar, Member (approved 11/26/2018); by Ross Whitaker, Chair/Dean of the School of Computing; and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

We address two fundamental problems in scaling network traffic and Cloud-based resources: eliminating the need to pre-tune to a specific operating environment and decreasing the effective launch time of virtual machines. Existing methods that adapt computing resources or network traffic to the operating environment have parameters and thresholds that must be manually pre-tuned. Determining appropriate values is a time-consuming and arduous task. When the environment or offered workload changes in unexpected ways, they may no longer be effective and systems must be re-tuned to the new conditions. In this dissertation, we develop methods to eliminate the need to pre-specify the parameters and thresholds. We design methods to adaptively set these parameters and thresholds, either algorithmically or by exploiting characteristics intrinsic to the system model. We refer to the latter as natural thresholds. We develop an active queue management algorithm that scales network traffic, relying only on network run-time measurements and a natural threshold, the knee on the delay-utilization curve. For scaling Cloud-based applications, we develop simple, knobless, reactive approaches that are robust against changing workload. We model the miss ratio curve of memory-intensive applications as a hyperbola and choose a natural threshold on the curve at run time. To decrease effective virtual machine launch time, we use Cloud functions as temporary computing resources while the virtual machine launches. We implement the active queue management algorithm in the Linux kernel in a testbed, which we build, and in a network simulator. We implement resource scaling on two major Cloud provider platforms. The natural threshold methods we develop are effective in scaling network traffic and Cloud-based resources. Our methods outperform state-of-the-art algorithms in most of our evaluations.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Scaling Network Traffic
   1.2 Scaling Virtual Machines
   1.3 Scaling Memory-Intensive Applications
   1.4 Summary
   1.5 Publications
   1.6 Copyright Notice
2. AUTO-TUNING ACTIVE QUEUE MANAGEMENT
   2.1 Introduction
   2.2 DUK Goals and Approach
       2.2.1 Dependence on a Natural Threshold
   2.3 AQM Probability Determination
   2.4 DUK Algorithm
       2.4.1 Capacity Estimate
       2.4.2 Interval Determination
       2.4.3 Implementation
   2.5 Experimental Setup
       2.5.1 CoDel
       2.5.2 PIE
       2.5.3 Physical Hardware
       2.5.4 Simulation
   2.6 Evaluation
       2.6.1 The Importance of ECN
       2.6.2 Experimental Results
       2.6.3 Flow Completion Times
       2.6.4 Fairness
   2.7 Related Work
   2.8 Conclusions
3. CLOUD FUNCTIONS FOR FAST AND ROBUST RESOURCE AUTO-SCALING
   3.1 Introduction
   3.2 Related Work
       3.2.1 Reactive
       3.2.2 Predictive
       3.2.3 Hybrid
   3.3 Scaling Architecture
       3.3.1 Using CFs with VMs
       3.3.2 Model
   3.4 FEAT Scaling Algorithm
       3.4.1 Scaling Out
       3.4.2 Scaling In
       3.4.3 Combined Scaling
   3.5 Cost Analysis
       3.5.1 Provider-Dependent
       3.5.2 Provider-Independent
   3.6 Implementation
       3.6.1 Publisher/Subscriber
       3.6.2 Web Server
       3.6.3 Cloud Function Operation
   3.7 Evaluation
       3.7.1 Evaluation Algorithm Details
       3.7.2 Experiments
       3.7.3 Measurement Periods
       3.7.4 Observations
   3.8 Stability
   3.9 Conclusion
4. AUTO-SCALING CLOUD-BASED MEMORY-INTENSIVE APPLICATIONS
   4.1 Introduction
   4.2 Related Work
       4.2.1 Auto-Scaling
       4.2.2 Miss Ratio Curves
       4.2.3 Profiling Page Faults
       4.2.4 Natural Thresholds
   4.3 Background
       4.3.1 Nonlinear Impact of Memory on Performance
       4.3.2 MRC Model
       4.3.3 Miniature Simulation
       4.3.4 Application Classes
   4.4 Architecture
       4.4.1 System Components
       4.4.2 Cloud Function Operation
   4.5 Miss Ratio Curves
   4.6 Scaling Method
       4.6.1 MRC Threshold
       4.6.2 Probing Algorithm
   4.7 Implementation
       4.7.1 CPU Utilization
       4.7.2 Subnet
       4.7.3 Hardware
       4.7.4 Startup Time
       4.7.5 Code Deployment
       4.7.6 Web Server Cache - Redis
       4.7.7 Batch Processor - Image Watermarking
   4.8 Evaluation
       4.8.1 Metrics
       4.8.2 Caching Application
       4.8.3 Batch Application
   4.9 Conclusion
5. SUMMARY AND FUTURE WORK
   5.1 Summary
   5.2 Future Research Directions
       5.2.1 Alternative Natural Thresholds
       5.2.2 Reusing State of Cloud Functions
       5.2.3 Control Systems Models
       5.2.4 Heterogeneous VM Families
       5.2.5 I/O Bound Applications
       5.2.6 Combining Scaling Methods
REFERENCES

LIST OF FIGURES

2.1 Delay-Utilization Graph
2.2 Configuration 1 - Single Transmitter
2.3 Configuration 2 - Dual Transmitters
2.4 Large-Scale Configuration
2.5 Queuing Delay, Drop Packets
2.6 Queuing Delay, Mark Packets with ECN
2.7 Completion Time CDFs
3.1 Scaling System
3.2 Cost
3.3 Scale Out Event
3.4 Cost Savings During Scale Out Event
3.5 AWS Pub/Sub without Fast Launch
3.6 Azure Pub/Sub without Fast Launch
3.7 AWS Pub/Sub with Fast Launch
3.8 Azure Pub/Sub with Fast Launch
3.9 Network Trace
3.10 AWS Req/Resp Production Trace
3.11 Azure Req/Resp Production Trace
4.1 Effects of Physical Memory Size
4.2 MRC Model
4.3 Locations on MRCs
4.4 Architectural Model
4.5 System Components
4.6 MRC Memory Reference
4.7 Capturing Memory References
4.8 AWS Caching Application Plots
4.9 Azure Caching Application Plots
4.10 Production Trace
4.11 AWS Batch Application Plots
4.12 Azure Batch Application Plots

LIST OF TABLES

2.1 Experiment Summary Data, Drop Packets
2.2 Experiment Summary Data, Mark Packets with ECN
3.1 Synthetic Summary (Pub/Sub is listed on the left side of each column, Req/Resp on the right)
3.2 Production Summary, Request/Response
3.3 Measurement Period Summary
3.4 Simulation Ranges
3.5 Simulation Results
4.1 Function API
4.2 Algorithm Variables
4.3 Lines of Code
4.4 Hardware
4.5 Image Watermark Speedup
4.6 Caching Application Results
4.7 Batch Application Results

ACKNOWLEDGMENTS

I would like to express my great appreciation to my advisor, Sneha Kumar Kasera, for the guidance, encouragement, and sage counsel through this arduous journey. His reassuring words helped me see this work to its fruition. Thanks also to Ryan Stutsman for his creative ideas and brilliant editing. I would like to thank Neal Patwari for collaborating on earlier work and for his continued support. Special thanks go to Erik Brunvand and Hari Sundar for excellent questions and constructive criticism. I would like to thank Professor Vishal Misra at Columbia University for many discussions related to the active queue management research. Thanks to all my lab mates including Arijit, Naveen, Mojgan, Phil, Phillip, Christopher, Shamik, Harsimran, Aarushi, Shobhi, and Pavithra for help along the way. Thank you, my dear family and friends, for your patience and support. You are my motivation and inspiration. This material is based upon work supported by the National Science Foundation under Grant No. 1302688.
CHAPTER 1
INTRODUCTION

Methods that adapt computing resources or network traffic to the environment in which they operate often have parameters and thresholds that must be manually set or hardcoded [24, 25, 38, 48, 51, 53, 58, 59, 76, 91], including those claiming to be auto-tuned [68]. In this respect, they are not completely auto-tuned. When the network environment or offered workload changes in unexpected ways, the parameters and thresholds may no longer be appropriate. In this work, we explore methods to adaptively set these parameters and thresholds, either algorithmically or by exploiting characteristics intrinsic to the system model. We refer to the latter as natural thresholds. Through these methods, we create network and Cloud elements that are truly auto-tuned.

1.1 Scaling Network Traffic

The primary goal of Active Queue Management (AQM) is to determine the best way to preemptively drop packets to prevent unnecessary delays through a network while keeping utilization high. Many ideas have been proposed, yet despite the potential benefits of AQM, none of the ideas have been widely adopted in routers and other network elements. The key problem is that the existing ideas require pre-specification of AQM parameters and thresholds (e.g., RED [25], CoDel [68]). Given the dynamic nature of networks due to changing traffic patterns and also due to changes in link capacity or bandwidth, finding the right values of the AQM parameters and thresholds is very hard. Moreover, one cannot capture all network situations with the same set of parameters. Another well-known existing work, PIE [76], uses a pre-tuning approach where the AQM parameters are experimentally determined for the network element and the environment in which the AQM would be deployed. This approach also fails to capture wide-scale changes in the network conditions. We tackle the challenge of designing an AQM method that is based on the principle that no parameters should be pre-specified or pre-tuned. Instead, we look for a threshold that occurs naturally in queuing systems. Our AQM approach does not require pre-specification of even those parameters, including exponentially weighted moving average (EWMA) weights or protocol timer intervals, that are routinely specified in protocol design and implementation. We develop an AQM algorithm that relies only on network run-time measurements and a natural threshold, the knee on the delay-utilization curve [72].

1.2 Scaling Virtual Machines

A customer of a Cloud provider buys services from the provider to host applications on virtual machines (VMs). The customer then sells services provided by the applications to its clients. A client often has specific requirements regarding the response time of the service tied to an application. When the application's workload exceeds the capabilities of the VM on which it is hosted, the response time of the application suffers. One way to continue to keep the response times small is to add more computing power to the system by adding more or larger VMs. Cloud providers offer the customer the ability to automatically horizontally or vertically scale (auto-scale) applications to meet ever-changing processing demands. However, two critical problems must be addressed to effectively and automatically scale resources. First, VMs can take tens of seconds or even minutes to launch [60], and more time is needed to load the application on the VM. During this time, the demand on the existing VMs can exceed their capacity, resulting in increased response times.
Second, determination of the processing power to add or to remove for the application is tricky, as the application load can vary unpredictably over different time scales. Today, customers must specify thresholds to Cloud providers for triggering scaling operations, but these pre-specified thresholds are not robust against changing load demands. We design and build a novel scaling architecture that (1) uses Cloud functions (CFs) available at Cloud providers as interim processing resources to deal with the delay in launching VMs and (2) uses an auto-scaling algorithm, which we develop, that does not require pre-specified thresholds or parameters of any kind [73].

1.3 Scaling Memory-Intensive Applications

Accessing paged memory has higher latency than physical memory by up to 6 orders of magnitude [45]. Further, applications experience a nearly unbounded decrease in performance with a linear decrease in the amount of physical memory. Even a small increase in perceived latency can cause a significant loss of revenue. For example, Amazon reports a 1% reduction in sales (amounting to about $1.4 million in 2016) for 100 ms of additional latency [78], and Google reports 25% less search traffic for 500 ms of additional latency in rendering results [57]. Sizing a machine's physical memory correctly is therefore critical to operating cost and, more importantly, performance. Sizing physical memory too large leads to inefficient use of resources, and consequently an increase in operating cost. Sizing it too small causes the application to exhaust the available physical memory and resort to paging to disk. We design and build a reactive auto-tuning method to scale resources for Cloud-based memory-intensive applications. Existing reactive methods to scale Cloud-based resources including CPU and memory are primarily based on setting thresholds [27, 34, 53, 54]. The thresholds must be pre-specified using knowledge of the expected workload. If the run-time workload changes, they may no longer be sufficient to scale resources effectively. Therefore, we develop a reactive auto-tuning method to scale physical memory that does not depend on pre-specification of thresholds. We use memory miss ratio curves (MRCs) [61] to determine the amount of physical memory required by an application. (An MRC shows the miss rate for a given memory size.) We implement our system on two major Cloud provider platforms and evaluate it by horizontally and vertically scaling two classes of memory-intensive applications: caching and batch processing.

1.4 Summary

We address the following significant challenges in this dissertation:

• Existing methods to control queuing delay in networks require pre-specification of parameters and thresholds that are difficult to tune to the network environment. Further, when the network environment changes, the values may no longer be efficient. It is therefore necessary to find new methods that do not depend on pre-specification of parameters and thresholds.

• Existing methods to scale Cloud resources require a priori knowledge of workload patterns or some reference input such as a specific pre-specified CPU utilization level. They run at a specific time interval that must also be pre-specified. When workload patterns change, these methods may not scale appropriately, causing either excessive delays in processing the workload or under-utilization of Cloud resources. Thus, we need alternative methods to reduce these inefficiencies.
• Existing methods to scale memory-intensive applications in Cloud environments require setting of thresholds and knowledge of the expected workload. As workload patterns change over time, the pre-tuning may no longer be effective. We therefore need a new method of determining an appropriate amount of memory required by an application at run time.

To address these problems, this dissertation makes the following contributions:

• Auto-Tuning Active Queue Management (Chapter 2). We use natural thresholds to derive an equation to set a packet drop probability in an AQM algorithm we develop to automatically scale offered TCP load to available bandwidth. We develop a method to algorithmically adjust the time between iterations of our algorithm. We evaluate our algorithm under a variety of network conditions on physical hardware and in simulation.

• Cloud Functions for Fast and Robust Resource Auto-Scaling (Chapter 3). We create an algorithm that scales Cloud resources without the need to pre-specify parameters or thresholds. We use an event-driven model to eliminate the need to pre-specify time parameters. We develop a novel method of using Cloud functions as temporary computing resources to quickly scale these Cloud resources. We implement our system on two major Cloud provider platforms and evaluate it on both synthetic workloads and traces from a live production server farm.

• Auto-Scaling Cloud-Based Memory-Intensive Applications (Chapter 4). We develop a novel approach to scaling the physical memory of VMs in Cloud-based memory-intensive applications. The method does not require pre-specification of thresholds or knowledge of the expected workload. We scale the VMs both horizontally and vertically on two major Cloud provider platforms.

1.5 Publications

The following publications are related to the work contained in this dissertation:

• Joe H. Novak, Sneha Kumar Kasera, "Auto-Tuning Active Queue Management", in Ninth International Conference on Communications Systems and Networks (COMSNETS), 2017.

• Joe H. Novak, Sneha Kumar Kasera, Ryan Stutsman, "Cloud Functions for Fast and Robust Resource Auto-Scaling", in Eleventh International Conference on Communications Systems and Networks (COMSNETS), 2019.

1.6 Copyright Notice

Some material contained within this dissertation has been previously published and is used with permission.

• Copyright © 2017 IEEE. Reprinted, with permission, from J. Novak et al., "Auto-Tuning Active Queue Management" in Ninth International Conference on Communications Systems and Networks (COMSNETS), 2017.

• Copyright © 2019 IEEE. Reprinted, with permission, from J. Novak et al., "Cloud Functions for Fast and Robust Resource Auto-Scaling" in Eleventh International Conference on Communications Systems and Networks (COMSNETS), 2019.

CHAPTER 2
AUTO-TUNING ACTIVE QUEUE MANAGEMENT

2.1 Introduction

Active queue management (AQM) algorithms preemptively drop packets to prevent unnecessary delays through a network while keeping utilization high. Many AQM ideas have been proposed, but none have been widely adopted because these rely on pre-specification or pre-tuning of parameters and thresholds that do not necessarily adapt to dynamic network conditions. We develop an AQM algorithm that relies only on network run-time measurements and a natural threshold, the knee on the delay-utilization curve. We call our AQM algorithm Delay Utilization Knee (DUK) based on its key characteristic of keeping the system operating at the knee of the delay-utilization curve.
We implement and evaluate DUK in the Linux kernel in a testbed, which we build, and in the ns-3 [74] network simulator. We find that DUK can attain reduced queuing delay and reduced flow completion times compared to other algorithms, with virtually no reduction in link utilization, under varying network conditions.

Active Queue Management (AQM) tries to answer the question, "What is the best way to preemptively drop packets to prevent unnecessary delays through a network while keeping utilization high?" Many ideas have been proposed to answer this question; recently, it has been discussed in the context of the continuing or even worsening bufferbloat problem [30, 68, 76, 90]. (Bufferbloat is the oversizing of buffers in network devices, which can lead to excessive delays without any real benefit to network providers.) Yet despite the potential benefits of AQM, none of the ideas have been widely adopted in routers and other network elements. The key problem is that the existing ideas require pre-specification of AQM parameters and thresholds (e.g., RED [25] and CoDel [68], which, contrary to its claims, requires specification of parameters including a minimum queuing delay). Given the dynamic nature of networks due to changing traffic patterns and also due to changes in link capacity or bandwidth, finding the right values of the AQM parameters and thresholds is very hard. Moreover, one cannot capture all network situations with the same set of parameters. Another well-known existing work, PIE [76], uses a pre-tuning approach where the AQM parameters are experimentally determined for the network element and the environment in which the AQM would be deployed. This approach also fails to capture wide-scale changes in the network conditions.

In this chapter, we tackle the challenge of designing an AQM method that is based on the principle that no parameters should be pre-specified or pre-tuned. Instead, we look for a threshold that occurs naturally in queuing systems. Our AQM approach does not require pre-specification of even those parameters, including exponentially weighted moving average (EWMA) weights or protocol timer intervals, that are routinely specified in protocol design and implementation.

When we view the network link in terms of utilization and the queue corresponding to the link in terms of delay (Figure 2.1 shows a typical delay-utilization curve), we see that as utilization increases, delay also increases. At a certain point, however, there is a very large increase in delay for only a small improvement in utilization. This disproportionate increase in delay is of little to no value to the applications at the endpoints. We want to avoid this unstable region of high increase in delay with little increase in utilization. As a result, a natural threshold becomes apparent: the point at which the change in delay becomes greater than the change in utilization, the knee on the curve. Rather than pre-specified queue length thresholds or delay parameters, we use this natural threshold to derive an expression for the AQM packet drop or marking probability. (AQM algorithms probabilistically drop packets or mark them using Explicit Congestion Notification (ECN) [77] when they detect congestion.)

We recognize that the delay-utilization curve is valid over large time scales. However, networks operate over short time scales. We do not have the luxury of taking measurements or reacting over a long time. Conceptually, we base our approach on the delay-utilization curve, but implement it at short time scales. Our AQM algorithm adapts to network environments without the need for tuning or adjustment of thresholds.
Very importantly, it does not require any buffer sizing, i.e., it does not require setting limits on buffer sizes; hence, our AQM mechanism does not depend on tail drops. We call our AQM algorithm Delay Utilization Knee (DUK for short) based on its key characteristic of keeping the system operating at the naturally occurring knee of the delay-utilization curve. The primary goal of DUK is to quickly adapt to any bandwidth or offered load without reducing link utilization or unnecessarily increasing queuing delay.

We implement and evaluate DUK in the Linux kernel in a testbed, which we build, and in the ns-3 simulator under a variety of network operating conditions. Our results indicate we can obtain robustness to changing and differing network environments without sacrificing performance. Very importantly, we demonstrate that with ECN, our approach decreases flow completion times compared to other algorithms. We compare DUK with two contemporary algorithms, CoDel and PIE, and find that DUK can attain decreased queuing delay with virtually no reduction in the link utilization. We choose to compare DUK to CoDel because it claims to have no parameters or thresholds to tune to the network environment. We compare DUK to PIE to demonstrate the advantages of a self-adaptive algorithm compared to one that needs to be pre-tuned and may not react well when network conditions change. Thus, DUK offers a highly robust alternative to existing AQM methods and can be widely deployed without worrying about pre-selection or pre-tuning of AQM parameters.

2.2 DUK Goals and Approach

The key goals in designing DUK are as follows. First, DUK should not depend upon any pre-tuning or pre-specification of parameters but only on run-time measurements. Second, DUK should not make any traffic source behavior assumptions. The traffic at a network node can be a mix of different flavors of TCP, multimedia traffic, or other types of traffic with their own congestion control mechanisms. Thus, determining parameters based on any particular flavor of TCP is not likely to be optimal across different traffic types. In designing DUK, we do not necessarily aim for optimality. Unlike some of the existing work on self-tuning AQM [92], we do not make any TCP assumptions. DUK adapts its drop or marking probability using local measurements only. Third, DUK does not require any per-connection information such as the per-connection round trip times (RTTs) required by some existing approaches [37, 76]. Moreover, per-connection information is also not necessarily available when traffic is encrypted (e.g., when using IPsec).

2.2.1 Dependence on a Natural Threshold

Given that the primary goal of an AQM algorithm is to balance the queuing delay with the link utilization, we choose the knee on the delay-utilization curve as the desired natural threshold. At this point, the first derivative of the delay D with respect to the utilization U is equal to unity. We label this point dD/dU = 1 in Figure 2.1. We want to avoid the region that lies to the right of this point because this is where the delay increases more quickly than the utilization and makes the queuing system unstable. One could possibly argue for choosing dD/dU slightly greater than 1 or possibly less than 1. Our choice maximizes utilization while keeping the delays low without any flirtation with the unstable region.

The tangent at the knee is always 1 regardless of the time units we use for delay. As the granularity of run-time measurements becomes finer, the measured change in utilization decreases proportionally. Thus, the knee location naturally scales to the time units. For example, if we measure the change in delay over 1 second, the change in utilization will be proportionally higher than if we measure the change in delay over 10 ms. Although the delay-utilization curve is valid for long time scales, we base our approach on this concept. In practice, we must react over small time scales. We choose time units in terms of the resolution of the system clock for greatest accuracy. The time interval algorithm described later ensures that the time units are appropriate for the network environment.
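To make the unit-scaling behavior concrete, consider a worked illustration of our own (it is not an example the chapter derives): take an M/M/1-style delay curve $D(U) = \frac{s}{T}\cdot\frac{U}{1-U}$, where $s$ is the mean service time and $T$ is the time unit in which delay is measured. The knee then sits where

$$\frac{dD}{dU} = \frac{s}{T}\cdot\frac{1}{(1-U)^2} = 1 \quad\Longrightarrow\quad U^{*} = 1 - \sqrt{s/T}$$

With $s/T = 0.04$, the knee is at $U^{*} = 0.8$; a coarser time unit (larger $T$) moves the knee toward higher utilization, which is exactly the scaling to the time units described above.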
2.3 AQM Probability Determination

In this section, we derive an expression for the probability of dropping or marking packets with ECN using the natural threshold dD/dU = 1 for a router queuing system comprising a FIFO queue and a link with a certain bandwidth. DUK does not need a priori knowledge of the link bandwidth.

Let $\frac{dD}{dU}\big|_{U_i}$ be the point where the queuing system is initially operating. The utilization at this point is $U_i$. Let $\frac{dD}{dU}\big|_{U_f}$ be the point that the queuing system reaches after some packets are admitted or dropped (or marked). The utilization at this point is $U_f$. Let $U_{i-1}$ be the utilization during the previous evaluation of the algorithm. In order to determine the packet drop or marking probability, we make the simplifying assumption that the second derivative of the delay-utilization curve is constant over a short time. Note that we neither control nor enforce this assumption in our experiments. Under this assumption,

$$\frac{dD}{dU}\Big|_{U_f} = \frac{dD}{dU}\Big|_{U_i} + \frac{d^2D}{dU^2}(U_f - U_i) \quad (2.1)$$

DUK calculates $\frac{dD}{dU}\big|_{U_i}$ from measurements taken at run time. Given our goal to reach $\frac{dD}{dU}\big|_{U_f} = 1$, we have:

$$1 = \frac{dD}{dU}\Big|_{U_i} + \frac{d^2D}{dU^2}(U_f - U_i) \quad (2.2)$$

We derive an expression for $\frac{d^2D}{dU^2}(U_f - U_i)$ as follows. We compute the change in delay $dD$ over a short time $dt$ as the difference between the admit rate $a$ and the transmit rate $\beta$ divided by the capacity $C$, as shown in Equation 2.3. We assume the arrival rate $A$ and the capacity $C$ remain constant over $dt$. We do not control these assumptions in our experiments.

$$\frac{dD}{dt} = \frac{a - \beta}{C} = \frac{a - UC}{C} \quad (2.3)$$

Here, $U$ is the link utilization. The admit rate is equal to the probability of admitting or keeping a packet $p_k$ multiplied by the arrival rate:

$$a = p_k A \quad (2.4)$$

Using Equation 2.3 and the chain rule, we obtain:

$$\frac{dD}{dU} = \frac{dD}{dt}\,\frac{dt}{dU} = \frac{a - UC}{C} \Big/ \frac{dU}{dt} \quad (2.5)$$

We differentiate Equation 2.5 with respect to time to obtain the second derivative, as shown in Equation 2.6:

$$\frac{d^2D}{dU^2} = -1\Big/\frac{dU}{dt} \;-\; \frac{a - UC}{C}\,\frac{d^2U}{dt^2}\Big/\Big(\frac{dU}{dt}\Big)^{3} \quad (2.6)$$

To reach our desired point on the delay-utilization curve, we find the second derivative at $U = U_f$. Over a short time interval $\Delta t$, we approximate $dU/dt$ by $(U_f - U_i)/\Delta t$, and $d^2U/dt^2$ by $\big((U_f - U_i) - (U_i - U_{i-1})\big)/\Delta t^2$. We plug these into Equation 2.6 to obtain the following:

$$\frac{d^2D}{dU^2}(U_f - U_i) = -(1 + \Delta t) + \frac{U_i - U_{i-1}}{U_f - U_i} \quad (2.7)$$

At our reference point, $dD = dU$, we have $(U_f - U_i)/\Delta t = (a - U_f C)/C$. We also find $U_f - U_i$ to be equal to $\Delta t\,(\frac{a}{C} - U_i)/(1 + \Delta t)$ and use Equations 2.7 and 2.2 to obtain Equation 2.8.
$$1 = \frac{dD}{dU}\Big|_{U_i} - (1 + \Delta t) + \frac{(U_i - U_{i-1})(1 + \Delta t)}{\Delta t\,(\frac{a}{C} - U_i)} \quad (2.8)$$

Finally, we plug Equation 2.4 into Equation 2.8 and solve for $p_k$. We obtain the drop or marking probability $p = 1 - p_k$ from Equation 2.9, on which we base the DUK algorithm:

$$p = 1 - \frac{C}{A}\left[\frac{(U_i - U_{i-1})(1 + \Delta t)}{\Delta t\left(1 - \frac{dD}{dU}\big|_{U_i} + (1 + \Delta t)\right)} + U_i\right] \quad (2.9)$$

The various parameters in the above equation can be computed very efficiently with minimal overhead, as we discuss in the next section. Equation 2.9 can evaluate to values outside the range [0, 1] during abrupt changes in delay or utilization. We account for this in the DUK algorithm described below.

2.4 DUK Algorithm

We show the DUK algorithm in Algorithm 1. The router queuing system executes the algorithm on a periodic basis. The algorithm computes a drop or marking probability for packets received over the subsequent period. The device then drops or marks the packets with the probability determined by the algorithm. If a packet is not dropped, the device accepts it and appends it to the tail of the transmission queue. The algorithm consists of two sections. The first initializes variables and detects idle link conditions. Here, DUK sets the drop or marking probability to zero any time the queue becomes empty or when no data are being received. We measure all values used in Equation 2.9. The utilization is the transmission rate divided by the estimate of the link capacity, U = β/C. We develop an efficient algorithm for estimating C, described below. The second section computes the drop or marking probability according to Equation 2.9. It is possible for this equation to return a value outside of [0, 1] when abrupt changes in the delay-utilization curve occur. Our algorithm checks for this condition and truncates the probability to valid limits.

Algorithm 1: DUK Algorithm
```
// Section 1 - initialize and check idle
//   A          = arrival rate
//   C          = estimate of capacity
//   k          = queue length in packets
//   dD/dU|Ui   = current d(delay)/d(utilization)
//   dt         = estimate of measurement time interval
//   Ui - Ui-1  = change in utilization

if (k == 0 or A == 0) {
    p = 0;
    return;
}

// Section 2 - compute p as per Equation 2.9

if (p < 0) {
    p = 0;
} else if (p > 1) {
    p = 1;
}
```
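As a concrete rendering of Section 2 of Algorithm 1, the following is a minimal floating-point sketch of the Equation 2.9 computation. It is our illustration only: the actual kernel module uses the scaled-integer arithmetic described in Section 2.4.3, and the function and variable names here are ours.

```c
/* Evaluate the DUK drop/marking probability of Equation 2.9 (sketch).
 * A:      measured arrival rate
 * C:      estimated link capacity (same units as A)
 * U_i:    current utilization; U_prev: utilization one interval earlier
 * dDdU:   measured d(delay)/d(utilization) at U_i
 * dt:     measurement time interval, in the chosen time units
 * The caller has already handled the idle case (k == 0 or A == 0),
 * so A > 0 and dt > 0 hold here. A production version would also
 * guard against a zero denominator.
 */
double duk_probability(double A, double C, double U_i, double U_prev,
                       double dDdU, double dt)
{
    double denom = dt * (1.0 - dDdU + (1.0 + dt));
    double p = 1.0 - (C / A) * ((U_i - U_prev) * (1.0 + dt) / denom + U_i);

    /* Equation 2.9 can leave [0, 1] on abrupt changes; truncate. */
    if (p < 0.0)
        p = 0.0;
    else if (p > 1.0)
        p = 1.0;
    return p;
}
```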
Following the primary theme of this chapter, we use a measurement-based selfadaptive approach for determining ∆t, as we describe below and show in Algorithm 3, that does not rely on pre-specifications. Algorithm 3: Time Interval Algorithm 1 2 3 // Plast|cur = previous or current congestion level as per Equation 2.10 // β old|cur|new = old, current, or new estimate of transmission rate // ∆told|cur|new = old, current, or new measurement time interval 4 5 6 7 8 9 if (Pcur != Plast ) { if (β cur >=β old at Pcur and ∆tcur <∆told at Pcur ) { β new at Pcur = β cur ; ∆tnew at Pcur = ∆tcur ; } 10 Set m and c from curve fit of Equation 2.11 11 12 } 13 14 ∆tnew = mPcur + c We express the backlog or congestion at the network queue as a percentage as shown in Equation 2.10 where P is the percentage of congestion, M is the estimate of the ingress link capacity, and C is the estimate of the egress link capacity. P = 100 ∗ ( M − C )/M (2.10) We collect triples ( Pi , ∆ti , β i ) of congestion level Pi , time interval samples ∆ti , and transmission rate β i as the network environment changes. We want to find the minimum ∆ti for the current Pi . For each Pi we encounter, we update ∆ti and β i if the current transmission rate is greater than or equal to any previous sample and the current time interval is less than any previous sample. Effectively, we keep the smallest time interval for which the transmit rate does not decrease. When we encounter a new congestion level Pi , we do not have an initial value for ∆ti . To remedy this, we fit a line (Equation 2.11) to the data points we have collected so far and use this as the initial value for the new congestion level. In 14 Equation 2.11, m is the slope and c is the vertical axis intercept (corresponding to the time interval when there is no congestion). ∆t = mP + c (2.11) We use Algorithm 3 to determine the next time interval ∆tnew . We note the slope of the line in Equation 2.11 is expected to be negative because of the following intuition: the higher the congestion level, the faster the AQM algorithm should react, and correspondingly, the lower should be the value of ∆t. 2.4.3 Implementation We note that the computational overhead of DUK’s probability calculation is minimal because it is evaluated only once per time interval. However, to minimize computation cost, we use scaled integer rather than floating point arithmetic. We also reduce the number of divisions required by simplifying the ratio of C/A. DUK keeps track of the number of bytes of the numerator and denominator over the same time interval and performs a single division to obtain the ratio. With these simplifications, there are three divisions (one for each of dD/dU, C/A, and the primary quotient over ∆t) and three multiplications in the probability calculation. By comparison, PIE’s probability calculation involves one division and two multiplications. Our commodity Linux box can evaluate DUK’s probability 25 million times per second. At this rate, the minimum time interval is about 40 nanoseconds, which is more than sufficient for today’s highest network speeds. PIE estimates the current queuing delay by dividing the queue length by the average departure rate, thus requiring an EWMA parameter. We instead use a technique similar to CoDel, which uses timestamps to determine the queuing delay. We run the interval determination algorithm as a low-priority background task that refines the fitted line. It is not in the critical path and does not impact performance. 
2.5 Experimental Setup

We implement DUK in the Linux kernel on physical hardware in a testbed that we build. We extensively compare its performance with CoDel and PIE. Before we describe our setup, we briefly review CoDel and PIE and their parameters.

2.5.1 CoDel

The CoDel algorithm measures the sojourn time of each packet in the queue. CoDel always admits packets to the queue and makes a drop or marking decision each time a packet is removed from the queue based on the measured sojourn time. If the sojourn time is above a target time of 5 ms for a period of at least 100 ms, CoDel enters a drop/mark state. While in this state, CoDel drops or marks packets with exponentially increasing frequency until the sojourn time falls below 5 ms.

2.5.2 PIE

PIE is based on a feedback system that uses various parameters and thresholds to control its estimates and calculations. We use the default values for these parameters and thresholds. When a packet is dequeued, PIE estimates the queuing delay from the queue length and an EWMA of the dequeue rate. Using PIE's notation, every update period of T_update = 30 ms, it increments or decrements the probability of dropping or marking a packet by adding two weighted factors. The first is the difference between the delay estimate and a reference delay. The weight for this factor is α̃ = 0.125 Hz and the reference delay is delay_ref = 20 ms. The second is the difference between the current and previous delay estimates. The weight for this factor is β̃ = 1.25 Hz. Estimates are updated only if the queue length is greater than a threshold of dq_threshold = 10 KBytes. PIE uses a token bucket design that allows packets to be excluded from the drop or marking decision during an initial congestion transient. The duration of this is controlled by the parameter max_burst = 100 ms.
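For comparison with DUK's update, the PIE increment/decrement just described reduces to a two-term control-law step per update period. A minimal sketch follows, using the default parameters listed above; it is written in floating point for readability, while the real kernel and reference implementations differ in their fixed-point details.

```c
/* One PIE update step, run every T_update = 30 ms (per the description above).
 * p:         current drop/mark probability
 * delay_est: current queuing delay estimate (seconds)
 * delay_old: delay estimate from the previous update period
 * Returns the new drop/mark probability.
 */
double pie_update(double p, double delay_est, double delay_old)
{
    const double alpha     = 0.125; /* Hz, weight on (estimate - reference) */
    const double beta      = 1.25;  /* Hz, weight on (current - previous)   */
    const double delay_ref = 0.020; /* seconds, the 20 ms reference delay   */

    p += alpha * (delay_est - delay_ref) + beta * (delay_est - delay_old);
    if (p < 0.0)
        p = 0.0;
    else if (p > 1.0)
        p = 1.0;
    return p;
}
```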
We base our CoDel and PIE modules on their respective kernel [69] and reference [14] implementations. Additionally, we simulate DUK in ns-3 to conduct large-scale experiments. We use the ns-3 CoDel module [62]. We port PIE to ns-3. We run DUK, CoDel, and PIE in each experiment for comparison. To compare the performance of the base algorithms, we eliminate the maximum queue limits at which CoDel and PIE revert to DropTail. We add a variable RTT component to the network to mitigate synchronization effects.

2.5.3 Physical Hardware

We run Linux kernel 3.9.10 with TCP CUBIC in our testbed. Our modules are Linux qdisc modules [42]. (A qdisc, or queuing discipline, is a network scheduler that is part of the network traffic control subsystem in the Linux operating system.) We use two network configurations for our experiments. The first consists of a single transmitter, a router, and a receiver (Figure 2.2). We set the physical link speeds to either 100 Mbps or 1 Gbps depending on the experiment. We use a token bucket to create a bottleneck between the router and receiver. We program the token bucket to run a bandwidth schedule specific to the experiment. In the second configuration (Figure 2.3), we add an additional transmitter and receiver. We connect the transmitters to the router through a link layer switch. We connect the receivers to the router through a separate link layer switch. This configuration ensures the queuing algorithms in the switches do not interfere with the results. The link rates shown create a physical bottleneck between the router and the switch on the receiver side. We do not use a token bucket in this configuration.

We use commodity Intel-based hardware for all machines. The router is a 3 GHz Core i5-2320. In Figure 2.2, the transmitter is a 3.4 GHz i7-4770 and the receiver is a 2.26 GHz Core i5 430M. For diversity (Figure 2.3), we add a 2 GHz Pentium 4 transmitter and a 1.6 GHz Atom N2600 laptop receiver. We use TCP flows unless otherwise stated. We use the Linux netem qdisc [35] to add delay to the RTT. This delay simulates a link level delay and affects all traffic on the link. We set the netem buffer size high enough that it does not interfere with the experiments. The netem program takes delay and jitter as parameters. Delay is normally distributed, and the delay parameter is the mean of the distribution. Unless otherwise stated, we use a delay of 50 ms and a jitter of 10 ms. We set the packet size to 1448 bytes. We use a custom TCP application that simulates bulk transfer of a large data stream such as file transfer or video streaming. In some scenarios, we create short-lived flow transmitters to simulate web browsing. When a short-lived transmitter finishes sending a flow, we create another to take its place. The application sets the TCP buffer sizes as large as the operating system allows to minimize the effects of host limitations. We use iperf to generate UDP traffic.
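For reference, a netem configuration along these lines reproduces the delay setup above. The interface name eth0 and the limit value are our placeholders; the limit merely keeps netem's own buffer from becoming a bottleneck, as the text requires.

```
tc qdisc add dev eth0 root netem limit 10000 delay 50ms 10ms distribution normal
```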
As our drop experiments show, DUK is able to control the queue length much better than competing mechanisms; however, this comes at the cost of an increased drop probability. This is not surprising: with TCP flows, a decrease in queuing delay induces a quadratic increase in the drop or mark probability [37]. Using the well-known "square-root p" formula for TCP throughput, if we have N flows passing through a congested link, then C = √(2/p) · N/R, where p is the drop probability and R is the round trip delay. If the AQM mechanism is successfully able to reduce the delay to R_aqm, we have C = √(2/p) · N/R = √(2/p_aqm) · N/R_aqm. This implies p_aqm = (R/R_aqm)² · p. Thus, any gains from reducing the queuing delay are lost (quadratically) in the loss rate. With a higher loss rate comes a greater potential to cause timeouts and to introduce application jitter, especially in shorter flows. However, if we use ECN instead of drops, p_aqm is simply a feedback signal and we can get lower latency at zero cost. This logic applies to all AQM mechanisms and hence, it is our strong belief that AQM works best with ECN. Since ECN is widely deployed now, the time is ripe for successful deployments of AQM as well.
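As a quick numeric check of the relation above, the following sketch solves the square-root p formula for p and verifies the quadratic penalty. The flow count, RTTs, and link rate are illustrative values, not figures from our experiments.

```python
from math import sqrt

def drop_probability(N, R, C):
    """Solve C = sqrt(2/p) * N / R for p, the steady-state TCP loss rate.
    N: flows, R: round trip delay (s), C: link capacity (packets/s)."""
    return 2.0 * (N / (R * C)) ** 2

C = 20e6 / (1448 * 8)                         # 20 Mbps bottleneck in packets/sec
p = drop_probability(N=10, R=0.050, C=C)      # 50 ms queuing + propagation delay
p_aqm = drop_probability(N=10, R=0.020, C=C)  # AQM cuts the delay to 20 ms
# Halving-and-more of delay (ratio 2.5) costs 6.25x the loss rate:
assert abs(p_aqm / p - (0.050 / 0.020) ** 2) < 1e-9
```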
2.6.2 Experimental Results

We now present our results. We refer to Figure 2.2 as configuration 1 and Figure 2.3 as configuration 2. We summarize the drop experiments in Table 2.1 and the ECN experiments in Table 2.2. We show results after initial transients have settled. We show graphs of sample runs in Figures 2.5 and 2.6 for drops and ECN, respectively. We average the data once per second for clarity at the expense of resolution. This obscures oscillations; however, the standard deviation and IQR indicate their severity. Note that in each graph, DUK has the least queuing delay and is the lowest plot in each figure. DUK's queuing delay in most cases is less than half the queuing delay of either CoDel or PIE. In the VoIP experiment, it is about one tenth that of the other algorithms.

2.6.2.1 Experiment 1 - Vary Link Capacity

We use configuration 1 at 100 Mbps. Initially, the token bucket sets the capacity to 100 Mbps. We create 40 short-lived flows of 250 KB each. At 15 seconds, the token bucket sets the capacity to 50 Mbps, then 20 Mbps at 30 seconds, 50 Mbps at 60 seconds, and 100 Mbps at 75 seconds. We end the experiment at 90 seconds. The tables show metrics during the most severe congestion, which occurs at 20 Mbps capacity. We show sample runs in Figures 2.5(a) and 2.6(a) averaged over 2-second intervals for clarity. DUK exhibits a 10x improvement in queuing delay, standard deviation, and IQR in the drop experiment. We see that the third quartile ECN flow completion times for DUK are about 1.25 seconds shorter than PIE or CoDel. This improvement comes from the shorter queuing delay.

2.6.2.2 Experiment 2 - Vary Link Capacity, Data Center Environment

This experiment is similar to Experiment 1, but at higher link capacity. Figures 2.5(b) and 2.6(b) show sample runs. Unfortunately, testbeds that operate at the higher data rates typical of data center networks, and that would allow us to modify the queue management, are not available to us. We therefore use a 1 Gbps link rate, the maximum supported by our hardware, in configuration 1 to approximate a data center environment. We remove the netem delay to simulate the low RTTs typical of data centers [1]. Initially, the token bucket sets the capacity to 1 Gbps. The short-lived flow sender creates 100 flows of 2500 KB each. At 15 seconds, the token bucket sets the capacity to 500 Mbps, then 200 Mbps at 30 seconds, 500 Mbps at 60 seconds, and 1 Gbps at 75 seconds. We run the experiment for 90 seconds. The tables show data for the 200 Mbps interval. Utilization is nearly identical across algorithms, but DUK significantly outperforms the others in queuing delay. It also improves completion times. In a real data center, we expect the timer interval to be less than 1 ms; however, we are limited by the resolution of our Linux implementation. With a higher precision timer, DUK can attain lower delays.

2.6.2.3 Experiment 3 - Vary Flows

Figures 2.5(c) and 2.6(c) show sample runs. We evaluate the response to changes in offered load using configuration 2. We begin with 5 bulk TCP flows between the first sender-receiver pair. At 30 seconds, we add 5 flows between the second pair. At 60 seconds, we add 5 flows of cross traffic from the first sender to the second receiver. At 90 seconds, we add 5 flows from the second sender to the first receiver. At 120 and 150 seconds, we remove 5 flows from each sender. The tables show data for the 20-flow interval. DUK's queuing delay is 20x lower for drops and 3x lower for ECN.

2.6.2.4 Experiment 4 - Vary Round Trip Time

We set the RTT between the first sender-receiver pair to 25 ms with a jitter of 5 ms. We set the RTT between the second pair to 100 ms with a jitter of 15 ms. We run 3 bulk flows between each pair for 120 seconds. We show sample runs in Figures 2.5(d) and 2.6(d). With a light load, PIE becomes unstable. This is in line with the prediction in [38], which states that AQM systems become more unstable with fewer flows and higher RTTs. A principled approach to making the PI controller self-tuning has been presented in [92], but PIE is not self-adaptive in that sense. As a result, PIE begins to oscillate and introduces significant jitter. With ECN, CoDel and PIE mark only a few packets. DUK is able to maintain a consistent and low queuing delay.

2.6.2.5 Experiment 5 - VoIP

We evaluate DUK in the presence of unresponsive UDP flows. Note that ECN marking is not an option with UDP. We simulate a small VoIP network with 5 Mbps dedicated to voice calls. Each call is 87.2 Kbps [15] and the link saturates with 58 simultaneous calls. Using configuration 1, we add 10 UDP flows every 10 seconds for 60 seconds. Beginning at 90 seconds, we remove 10 UDP flows every 10 seconds. Figure 2.5(e) shows a sample run. We compute metrics for the saturated interval. DUK shows a 10x lower queuing delay.

2.6.2.6 Experiment 6 - Large-Scale Heterogeneous Network Simulation

We use ns-3 to evaluate DUK in a large-scale network with a mix of TCP implementations. We create one always-on sender of each TCP variant. We connect three receivers to each sender for a total of 9 bulk flows to simulate video streaming or large file transfers. We create one short-lived flow sender of each TCP variant and connect 9 receivers to each. Each pair runs 50 simultaneous connections (1350 total) to simulate web traffic. In total, there are 44 nodes and 1809 connections. This experiment illustrates DUK's ability to scale to large networks. DUK shows a 2x-3x improvement in delay, IQR, and standard deviation. Figure 2.5(f) shows a sample run.

With IQR, a lower number indicates higher stability. In nearly every case, DUK's IQR is lower than that of the other algorithms, and in many cases significantly lower. It is about one-third that of the others for VoIP and about one-half for the large-scale simulation.
2.6.3 Flow Completion Times

We perform additional ECN experiments to investigate flow completion times. Because of its shorter queue lengths, DUK's completion times are shorter. We illustrate this with CDFs (cumulative distribution functions). In the first experiment, we use configuration 1 at 100 Mbps with a 20 Mbps bottleneck. We run 25 senders that continually send short-lived flows of 250 KB each. We observe in Figures 2.7(a) and 2.7(b) that the improvement depends on RTT. The third quartile of DUK's completion times is nearly 50% lower with a 25 ms RTT and 33% lower with a 50 ms RTT. In the second experiment, we use configuration 1 at 1 Gbps with a 500 Mbps bottleneck. We remove the delay node. We run 250 bulk flows as background traffic. We run 50 senders that continually send short-lived flows of 25 KB each. The third quartile shows a 30% to 40% reduction in completion time.

2.6.4 Fairness

We measure the instantaneous jitter for each experiment. The instantaneous jitter is the difference in queuing delay between transmissions of successive packets. We find that the instantaneous jitter is less than 2 ms for each experiment for each algorithm. Fairness is not a prime motivation in our design. However, we compute Jain's Fairness Index [47] for experiments with long-lived TCP flows. Jain's Fairness Index is defined as J = (∑_{i=1}^{n} x_i)² / (n · ∑_{i=1}^{n} x_i²), where n is the number of flows and x_i is the throughput achieved by flow i. Higher index values indicate more fairness, with a maximum value of unity. In all experiments, all algorithms exhibit an index within 0.5% of each other.
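The index is straightforward to compute from per-flow throughputs; a minimal sketch follows (the throughput values are made up for illustration):

```python
def jain_fairness(throughputs):
    """Jain's Fairness Index: (sum x_i)^2 / (n * sum x_i^2).
    Returns 1.0 when all flows achieve equal throughput."""
    n = len(throughputs)
    return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

print(jain_fairness([5.0, 5.0, 5.0, 5.0]))  # 1.0, perfectly fair
print(jain_fairness([9.0, 1.0, 1.0, 1.0]))  # ~0.43, one flow dominates
```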
2.7 Related Work

AQM algorithms can be classified according to the metrics they use in their respective drop decisions. These metrics include queue length, network load, mismatch between arrival rate and transmission rate, and queuing delay.

RED computes a drop probability proportional to the queue length. If the queue length grows beyond an upper limit, RED always drops arriving packets with a fixed probability. One reason that RED has not been widely adopted is that its queue length thresholds can be difficult to configure. Variants of RED include additional characteristics in an attempt to assist in its configuration. For example, Self Configuring RED [23] includes the number of flows in the drop decision. BLUE [24] is a network load-based algorithm that increases the drop probability if it detects packet loss caused by buffer overflow. It decreases the probability when the link becomes idle. The user must choose values for the amounts by which BLUE increases or decreases the probability. Adaptive Virtual Queue (AVQ) [51] uses a token bucket as a virtual queue. When it overflows, AVQ drops packets from the actual queue. The user must make appropriate choices for the utilization and damping factor parameters. YELLOW [58] uses the mismatch between arrival rate and transmission rate as well as queue length. Its parameters include a reference queue length, queue drain coefficients, and a link utilization factor. REM [7] uses arrival rate and queue length to maintain a reference queue length. REM's parameters include a target queue length and a responsiveness coefficient. RaQ [81] uses arrival rate and queue length; its parameters include a reference arrival rate and queue length. GREEN [91] changes the probability based on a function of arrival rate and capacity. Capacity is a threshold, and the probability increase and decrease step sizes are parameters that must be correctly chosen.

More recent algorithms such as CoDel and PIE focus on queuing delay. CoDel begins to drop packets only after the queuing delay has remained above a 5 ms threshold for a duration of 100 ms. It increases the number of drops as long as the queuing delay stays above the 5 ms threshold. The drop decision is not probabilistic but is based on time: CoDel exponentially decreases the time between drops until the queuing delay falls below the threshold. Hollot et al. [37] use classical control theory to develop the PI (Proportional-Integral) controller. PIE is based on this work. PIE is a linear feedback control system with thresholds of queuing delay and maximum queue length. Its parameters include the deviation from a target queue delay and a balance between queuing delay deviation and jitter. Hong et al. [39] set a probability to maintain a reference queue length, which is an arbitrarily chosen constant tuned to the network. Our work differs in that we set the drop probability to maintain a naturally occurring position on the delay-utilization curve.

Chiu and Jain [12] define the knee of a throughput versus offered load curve. In [46], Jain presents CARD, a TCP end-point congestion avoidance scheme. It uses the gradient of the RTT vs. TCP window size curve to determine whether the TCP window size should be increased or decreased. The amounts by which CARD changes the TCP window size are parameters. Our approach differs in fundamental ways. We implement an AQM algorithm rather than a per-flow TCP end-point algorithm, and we view the network in terms of utilization. Consequently, we use a different curve as the basis of our calculations. We use measurements to determine the probability dynamically.

2.8 Conclusions

We developed an AQM approach called DUK that operates based on a natural threshold and run-time measurements instead of relying on pre-specified or pre-tuned parameters. We implemented DUK in the Linux kernel in a custom testbed. We also implemented it in the ns-3 simulator. We found that DUK achieves similar performance in terms of link utilization but reduces queuing delays and flow completion times compared to two of its leading contemporaries. We explored scaling network traffic to available resources. Next, we explore scaling end system resources in public Clouds to the offered workload without the need to pre-specify parameters or thresholds.

Table 2.1: Experiment Summary Data, Drop Packets

Experiment | Algorithm | Mean Queuing Delay (ms) | Std. Dev. of Queuing Delay (ms) | IQR of Queuing Delay (ms) | Utilization (%) | Packet Drops (%)
1 - Vary Capacity | DUK | 1.00 | 1.16 | 1.00 | 98.5 | 16.96
1 - Vary Capacity | CoDel | 30.76 | 11.99 | 18.00 | 99.7 | 9.73
1 - Vary Capacity | PIE | 25.82 | 9.18 | 13.00 | 99.8 | 9.86
2 - Vary Capacity, DC | DUK | 0.19 | 0.42 | 0.00 | 99.9 | 24.33
2 - Vary Capacity, DC | CoDel | 14.10 | 3.86 | 7.00 | 99.9 | 7.55
2 - Vary Capacity, DC | PIE | 18.85 | 0.90 | 1.00 | 100.0 | 5.91
3 - Vary Flows | DUK | 0.80 | 0.98 | 1.00 | 99.9 | 15.19
3 - Vary Flows | CoDel | 17.62 | 6.06 | 8.00 | 100.0 | 6.41
3 - Vary Flows | PIE | 19.50 | 6.02 | 8.00 | 100.0 | 5.73
4 - Vary RTT | DUK | 0.47 | 0.84 | 1.00 | 99.2 | 1.98
4 - Vary RTT | CoDel | 12.20 | 6.38 | 8.00 | 99.7 | 0.96
4 - Vary RTT | PIE | 24.77 | 5.94 | 8.00 | 99.8 | 0.60
5 - VoIP | DUK | 3.43 | 4.29 | 5.00 | 99.6 | 14.92
5 - VoIP | CoDel | 30.52 | 10.04 | 15.00 | 100.0 | 8.44
5 - VoIP | PIE | 27.50 | 9.68 | 14.00 | 100.0 | 7.27
6 - Large Scale | DUK | 37.73 | 31.89 | 45.55 | 93.0 | 25.81
6 - Large Scale | CoDel | 94.42 | 78.35 | 106.80 | 93.1 | 24.60
6 - Large Scale | PIE | 112.65 | 105.63 | 145.88 | 93.3 | 11.24

Table 2.2: Experiment Summary Data, Mark Packets with ECN

Experiment | Algorithm | Mean Queuing Delay (ms) | Std. Dev. of Queuing Delay (ms) | IQR of Queuing Delay (ms) | Utilization (%) | Packet Marks (%) | Third Quartile Completion Time (seconds)
1 - Vary Capacity | DUK | 11.86 | 6.67 | 9.00 | 99.9 | 38.54 | 6.17
1 - Vary Capacity | CoDel | 24.21 | 7.07 | 10.00 | 99.9 | 30.67 | 7.43
1 - Vary Capacity | PIE | 28.07 | 17.87 | 23.00 | 99.8 | 37.12 | 7.43
2 - Vary Capacity, DC | DUK | 11.05 | 0.55 | 0.00 | 100.0 | 39.13 | 9.49
2 - Vary Capacity, DC | CoDel | 17.13 | 0.74 | 1.00 | 100.0 | 13.58 | 11.96
2 - Vary Capacity, DC | PIE | 19.40 | 0.72 | 1.00 | 100.0 | 9.92 | 12.05
3 - Vary Flows | DUK | 5.70 | 5.38 | 9.00 | 99.8 | 19.28 | N/A
3 - Vary Flows | CoDel | 17.57 | 6.14 | 8.00 | 100.0 | 9.39 | N/A
3 - Vary Flows | PIE | 20.50 | 5.59 | 8.00 | 100.0 | 8.11 | N/A
4 - Vary RTT | DUK | 0.67 | 1.34 | 1.00 | 98.5 | 5.67 | N/A
4 - Vary RTT | CoDel | 11.33 | 5.03 | 7.00 | 100.0 | 0.00 | N/A
4 - Vary RTT | PIE | 24.26 | 7.25 | 10.00 | 100.0 | 0.00 | N/A

[Figure 2.1: Delay-Utilization Graph. The knee occurs where dD/dU = 1 (the 45° point on the delay vs. utilization curve) as utilization approaches 100%.]
[Figure 2.2: Configuration 1 - Single Transmitter. A transmitter, a router with a token bucket bottleneck, and a receiver, on 100 Mbps or 1 Gbps physical links.]
[Figure 2.3: Configuration 2 - Dual Transmitters. Two transmitters and two receivers connected to the router through separate link layer switches; the router-to-receiver-side switch link forms the bottleneck.]
[Figure 2.4: Large-Scale Configuration. New Reno, CUBIC, and HTCP always-on and on-off transmitters and receivers on 100 Mbps links, separated by a router (AQM) and a delay node, with a 20 Mbps bottleneck.]
[Figure 2.5: Queuing Delay, Drop Packets. (a) Experiment 1 - Vary Link Capacity; (b) Experiment 2 - Data Center Environment; (c) Experiment 3 - Vary Flows; (d) Experiment 4 - Vary RTT; (e) Experiment 5 - VoIP; (f) Experiment 6 - Large Scale.]
[Figure 2.6: Queuing Delay, Mark Packets with ECN. (a) Experiment 1 - Vary Link Capacity; (b) Experiment 2 - Data Center Environment; (c) Experiment 3 - Vary Flows; (d) Experiment 4 - Vary RTT.]
[Figure 2.7: Completion Time CDFs. (a) 25 ms RTT; (b) 50 ms RTT; (c) 1 Gbps Link Speed.]

CHAPTER 3

CLOUD FUNCTIONS FOR FAST AND ROBUST RESOURCE AUTO-SCALING

3.1 Introduction

Cloud providers offer their customers the ability to automatically horizontally and vertically scale applications to meet ever-changing processing demands, but the customer must define thresholds to trigger scaling operations. Correctly specifying these thresholds is challenging.
We develop an auto-scaling method based on run-time measurements that automatically adjusts to changing load conditions without the need to pre-specify any parameters. Cloud resources are mainly hosted on virtual machines (VMs) that can take tens of seconds or even minutes to launch [60]. We present a method to decrease effective startup times with minimal cost to the customer by exploiting Cloud functions. Cloud functions (CFs) are effective because they can nearly eliminate the time it takes to start a new instance of an application. Cloud providers have introduced CFs to host applications that receive a request, perform some function, and return a result. Unlike VMs, CFs are inexpensive computing resources with low-latency start times that execute for short durations. However, one cannot rely on CFs to maintain state when idle. They are of greatest benefit to (nearly) stateless applications. They are applicable to data centers because in this environment, stateless servers occur at a 5:1 ratio to stateful servers [27]. We use CFs to develop a method to decrease the effective launch time of these types of applications. Essentially, we run applications on CFs while the VM launch is still in progress. Once the VM launch is complete, the applications run on the newly launched VM(s). We design and build FEAT, a new scaling approach that uses (1) Cloud functions as interim processing resources to compensate for VM launch delays and (2) a reactive, knobless, auto-scaling algorithm that requires no pre-specified thresholds or parameters, making it robust against changing load. We implement FEAT on Amazon Web Services (AWS) and Microsoft Azure. Our evaluations clearly demonstrate the higher performance and robustness of FEAT in comparison to existing approaches.

A tenant (customer) buys services from a public Cloud provider to host applications on virtual machines (VMs). The tenant sells services provided by applications to its clients. Cloud providers offer the tenant the ability to automatically scale (auto-scale) applications to meet ever-changing processing demands. (Horizontal scaling adds/removes VMs, i.e., scale out/scale in; vertical scaling, i.e., scale up/scale down, adds/removes resources on a VM. We use the terms scale in and scale out for both.) However, two critical problems must be addressed to effectively and automatically scale resources. First, VMs can take tens of seconds or even minutes to launch [60], and more time is needed to load the application on the VM. During this time, the demand on the existing VMs can exceed their capacity, resulting in increased response times. Second, determining the processing power to add to or remove from the application is tricky, as the application load can vary unpredictably over different time scales. In existing scaling approaches, tenants must specify thresholds to Cloud providers for triggering scaling operations, but these pre-specified thresholds are not robust against changing load demands. In this chapter, we design and build a novel scaling approach, FEAT (Fast launch Event-driven Auto-Tuning), that uses (1) Cloud functions available at Cloud providers as interim resources to deal with the delay in launching VMs and (2) a reactive, knobless, auto-scaling algorithm, which we develop, that does not require any pre-specified thresholds, making it robust against changing load.

Provisioning compute in Cloud services is undergoing a transition as developers begin to deeply incorporate CFs into their applications. All major Cloud providers now offer CFs, including Amazon Web Services (AWS) Lambda [2], Microsoft Azure Functions [63], and IBM OpenWhisk [43].
Tenants benefit from CFs, since they provide low-latency launch times and are inexpensive for short-lived requests. Cloud providers benefit, since CFs are generally small, short, and stateless, which eases resource scheduling and reduces idle capacity. However, CFs have two serious drawbacks compared to VMs. First, they break the classic OS process abstraction, so applications must be reworked to take advantage of them. Second, their pricing model differs from VMs: certain lightweight workloads are more cost effective when run with CFs, whereas heavier workloads that can amortize costs are significantly more cost effective on VMs. This leaves developers with hard choices. For certain operating points, CFs are cost effective and scale more rapidly; for other workloads, VMs are cheaper. Worse, developers cannot fluidly trade between these resources as their workload shifts and scales.

FEAT uses heavy CFs, which adapt CFs to work with applications developed for conventional VMs without requiring code changes, and a knobless reactive scaling algorithm that employs these heavy CFs to cut costs. At a high level, FEAT keeps a pool of CFs to temporarily service requests when load grows and allocates them based on changes in queuing delay. To keep CF costs from growing, it allocates VMs and shifts load to the VMs as they slowly come online, at which point the CFs are idled. An idle CF is about 100× cheaper than an idle VM (§3.5), and given that these CFs eliminate VM over-provisioning for scale out, the savings are significant. Despite differences in programming model, providers' isolation of CFs is not fundamentally different from more conventional approaches, so heavy CFs can be implemented atop existing Cloud provider CF offerings. We implement and evaluate heavy CFs as part of FEAT on both AWS and Azure. FEAT not only improves cost savings over state-of-the-art auto-scaling approaches, but also significantly simplifies development and operational management of applications. Developers need not write code specialized for VMs or CFs; both run conventional application code designed for ordinary VM deployment. Developers also need not pre-specify any parameters or thresholds; FEAT automatically provisions CFs and VMs to adapt to changing load while minimizing cost.

We demonstrate FEAT's promise with two common classes of Cloud applications: publisher/subscriber and request/response-based applications. Importantly, we compare FEAT against state-of-the-art auto-scaling frameworks (including combined predictive, reactive, and machine learning-based approaches) and show that its simple, knobless, reactive approach supported by CFs reduces tenant-observed queuing delay by 40% in our experiments. Interestingly, even state-of-the-art baseline approaches can be improved with FEAT's heavy CFs. FEAT is the first approach using CFs with conventional application code to quickly add processing power when scaling out. Evaluating our implementation on AWS and Azure with both synthetic data sets and real-world network traces, we make the following significant observations:

• FEAT's CFs reduce latency up to 2× compared to approaches that do not use CFs, even under 10× changes in offered load. It reduces the number of cores by 15-20% over current Cloud provider scaling offerings.
• FEAT's parameterless approach reduces queuing delay by 5× compared to existing approaches that require tenants to pre-specify thresholds, and by 40% compared to existing state-of-the-art approaches.

• When comparing AWS to Azure, we find AWS has faster VM launch times and less restrictive security in the CF execution environment. Consequently, our CF method has greater benefit on Azure.

• We find that without CFs, horizontal scaling can attain about 25% lower queuing delay than vertical scaling. Our use of CFs tends to equalize the two.

We build a discrete-event simulator to examine the stability of FEAT. We find that FEAT maintains a stable queuing delay in all cases that we study. In summary, we present a new Cloud resource scaling architecture and implement it on two Cloud provider platforms. Our evaluation shows FEAT's higher performance and robustness compared to existing approaches.

3.2 Related Work

Since major Cloud providers began offering CFs or "lambdas," researchers have been investigating these platforms themselves [36] as well as CF-based applications. Commonly, CFs are used for distributed systems like compute platforms [49] or message brokers [66]. FEAT's use of CFs to temporarily absorb increases in load is new, but others have used them to run specific user-level executables; for example, distributed video processing [26], firewalls, and intrusion detection [80]. Reactive load control auto-scaling models scale in response to quantities measured in real time, such as CPU utilization or request rate. Predictive models use past workload training sets in machine learning (ML) to attempt to predict future resource demands. Hybrid models combine these approaches. Unlike our approach, most require pre-specification of parameters and thresholds.

3.2.1 Reactive

Reactive scaling has been explored in the past [27, 48, 53, 85]; however, existing methods require pre-specification of thresholds and parameters for their operations. We use first principles to develop a very simple, yet effective, scaling algorithm that does not require any pre-specification of parameters or thresholds. Cloud providers offer rule-based auto-scaling: when a metric such as CPU utilization exceeds a threshold, a tenant-defined rule fires to add VMs. We compare FEAT with this baseline algorithm as well as a hybrid approach, described below. Gandhi et al. [27] concentrate requests to a set of servers. They use idle timers to determine when to scale in the remaining servers. They infer the capacity of a VM and measure workload to scale out. Lim et al. implement feedback controllers that track CPU utilization [53] and target storage tiers [54].

3.2.2 Predictive

Tesauro et al. [82] use reinforcement learning to provision, but it must be trained offline, since training is costly. Nguyen et al. [67] use wavelets to determine resource demand. They clone VMs to mitigate launch delays. We use CFs to handle additional load until VMs have launched.

3.2.3 Hybrid

Wang et al. [89] combine reactive, predictive, and feedback control. Their system uses arrival rate to predict the number of cores and fine-tunes it with a latency feedback signal. We implement this existing state-of-the-art algorithm for comparison in our evaluation. The reactive component of Jiang et al. [48] triggers when the request queue exceeds a threshold. It solves a cost/latency trade-off optimization with an exhaustive search that does not necessarily scale. Parameters of the reactive component of Urgaonkar et al.
[85] include the quantity of work performed per request and an execution frequency.

3.3 Scaling Architecture

A major issue with scaling VMs is that they are slow to launch, on the order of minutes [60]. One way to deal with this is to keep spare VMs idle, but this is expensive and inefficient. Providers charge for VM runtime whether the VM is performing useful work or sitting idle. Further, it is difficult to predict how many spare VMs will be required. On the other hand, CFs have fast sub-second launch times and are inexpensive to idle. Providers charge for CFs by the request and not for idle time. It costs a tenant two orders of magnitude more to idle a spare VM than a similarly provisioned CF. CFs have been more resource limited than VMs, but trends are driving the two closer together. The memory-to-core ratio in a VM is typically between 2 and 4 GB/core. CFs are quickly approaching this ratio. AWS recently [3] doubled the available CF memory from 1.5 GB to 3 GB. We suspect this trend will continue with increasing demand for CFs and competition between Cloud providers, allowing CFs to run heavier workloads. We demonstrate in our evaluation that CFs are now able to run heavy workloads similar to VMs.

3.3.1 Using CFs with VMs

We create a system that allows sub-second launch of long-running processes. To achieve this, we implement the system shown in Figure 3.1a, consisting of Cloud-provided VMs and CFs plus a computer running our controller. The controller can be located on a separate VM in the Cloud or on the tenant's premises. AWS and Azure provide proprietary scaling of CFs; however, we control CF scaling directly to perform CF-to-VM hand-off. At startup, the controller creates a pool of CFs and loads applications on them. No subsequent CF configuration is required. When the controller scales out, it determines the number of required VMs and the number of cores on each VM. It selects an idle CF from the pool and associates it with a core. The controller simultaneously starts the VM and instructs the CF to resume processing. The CF begins processing requests immediately while the VM continues to launch. When the VM is ready, the controller suspends the CF and returns it to the pool.

When a CF is idle, the provider may swap it out of cache. Launching a CF swapped out of cache is a cold start; launching a cached CF results in a warm start. To minimize cold starts, the controller makes periodic requests to the CF [40]. This could be costly to the Cloud provider if the technique were widely used, but each tenant needs very few warm CFs to mitigate VM launch latency. Regardless, our technique provides significant improvement, since both warm and cold start CFs have significantly lower launch times than VMs. Cloud providers limit the amount of time a CF can process a single request, and the time it takes for a VM to launch can exceed this maximum processing time. The controller therefore continually makes sequential requests to the CF until its corresponding VM has launched, allowing the CF to run continuously until the VM is ready. AWS and Azure provide runtimes for languages including Java, JavaScript/NodeJS, and C#. It is possible to execute pre-compiled executables independent of the runtimes on a CF [26, 80, 86], as described in §3.6. Cloud providers set limits on the number of VM cores a tenant may allocate simultaneously. We use a CF pool size equal to our VM core limit.
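The controller logic for a single scale-out event, plus the keep-warm loop, can be summarized in a short sketch. This is illustrative Python, not our Java controller; names such as invoke_cf, launch_vm, and suspend_cf stand in for provider SDK calls and are hypothetical.

```python
import time

def scale_out_with_cf(cf_pool, launch_vm, invoke_cf, suspend_cf):
    """Bridge a slow VM launch with a warmed Cloud function.
    launch_vm/invoke_cf/suspend_cf are placeholders for provider calls."""
    cf = cf_pool.pop()           # take an idle, warmed CF from the pool
    vm = launch_vm()             # begin the (slow) VM launch
    while not vm.is_ready():
        invoke_cf(cf)            # sequential requests keep the CF
                                 # processing until the VM is up
    suspend_cf(cf)               # VM takes over; idle the CF
    cf_pool.append(cf)           # return it to the pool

def keep_warm(cf_pool, invoke_cf, period=300):
    """Ping each idle CF every 5 minutes so the provider keeps it cached."""
    while True:
        for cf in list(cf_pool):
            invoke_cf(cf)        # short (sub-second) warm-keeping request
        time.sleep(period)
```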
3.3.2 Model

We use the queuing model shown in Figure 3.1b, which allows us to implement both publisher/subscriber (pub/sub) and request/response (req/resp) applications. It consists of client producers that generate jobs/requests, a queue to hold pending jobs/requests, and server consumers to process them. Our pub/sub implementation uses an explicit work queue to hold pending jobs. This implementation can be used for a wide range of applications such as back-end batch processors or IoT systems. Moreover, the queue need not be explicit. In our req/resp implementation, the queue consists of the TCP backlog and receive queues in a custom HTTP load balancer that we build. This implementation can be used for web or machine-to-machine (M2M) applications.

3.4 FEAT Scaling Algorithm

In state-of-the-industry scaling, tenants must specify rules on the provider's platform. Tenants define thresholds, such as CPU utilization, that trigger the rules. Tenants also specify how many cores to add or remove when a rule fires. Setting these values requires analyzing the system behavior in response to load patterns; when the load patterns change, the values may no longer be valid. We use first principles to develop a reactive algorithm that requires no pre-tuning but quickly adapts to changing conditions using run-time measurements. Our simple approach outperforms existing state-of-the-art approaches while eliminating the need to pre-specify thresholds, which is a primary difficulty in other approaches. Our algorithm (Algorithm 4) makes scaling decisions based on run-time measurements of the queue. It can execute at any frequency (§3.8), but it is convenient to execute it in an event-driven manner, whenever the Cloud provider makes new measurements such as queuing delay available.

3.4.1 Scaling Out

Our scale out algorithm depends only on run-time measurements of the arrival rate λ (offered load), the queue growth rate g, and the departure rate. Let c0 be the number of cores before scaling and c1 be the number after. The number of cores required to process the current load is λ/µ, where µ is the processing rate per core. If a queue has formed before we can add cores, this ratio is insufficient to drain the queue. Our intuition is to include g to collapse the queue to zero delay over the next interval. The required number of cores c1 is the arrival rate plus the queue growth rate, divided by the processing rate per core, as shown in Equation 3.1. When scaling horizontally and vertically, we give preference to launching VMs with the largest core count available, not exceeding the number of cores remaining to launch.

c1 = ⌈(λ + g)/µ⌉ (3.1)

3.4.2 Scaling In

Scaling in is more difficult, since we need to detect spare capacity in the system. Our intuition is that if the measured processing rate is decreasing over two consecutive intervals, less capacity is required and we scale in. This allows us to predict the required capacity a short time in the future. We do not control this condition in our evaluation; we use it only to make the prediction. We measure the number of jobs, n, processed in the first interval and the time it takes to process them, ∑_{i=1}^{n} J_i, where J_i is the time it takes to process job i. The number of opportunities available to process jobs is given by Equation 3.2, where t is the time since the last iteration.

α0 = (n / ∑_{i=1}^{n} J_i) · t (3.2)

We then measure the number of jobs α1 processed in the second interval. The ratio h = α1/α0 gives the required decrease in capacity. The new number of cores is given by Equation 3.3.

c1 = ⌈(α1/α0) · c0⌉ (3.3)
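Algorithm 4, below, combines these rules with two sanity checks on the measurements. As a compact illustration, here is a minimal runnable sketch of the same decision rule under our reading of Equations 3.1-3.3; the function name feat_decide is ours, and the measurements are assumed to be supplied by the monitoring layer.

```python
from math import ceil

def feat_decide(c0, d, g, lam, x, alpha0, alpha1):
    """One FEAT iteration. c0: current cores; d: change in queuing delay;
    g: queue growth rate (jobs/s); lam: arrivals/s; x: departures/s;
    alpha0: predicted processing opportunities; alpha1: jobs processed."""
    mu = x / c0               # departure rate per core
    k = (lam + g) / mu        # required capacity (Eq. 3.1, before ceiling)
    h = alpha1 / alpha0       # predicted capacity ratio
    if d > 0:                 # queue is growing: consider scaling out
        if k < h:
            return c0         # inconsistent measurements; wait
        return ceil((lam + g) / mu)          # Equation 3.1
    if h > 1:                 # queue flat but prediction says grow: wait
        return c0
    return ceil((alpha1 / alpha0) * c0)      # Equation 3.3
```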
Algorithm 4: FEAT Scaling Algorithm

// c0 = current number of cores
// c1 = new number of cores
// t = time since last iteration
// d = measured change in queuing delay
// g = measured queue growth rate in jobs/sec
// λ = measured job arrivals/sec
// x = measured job departures/sec
// µ = x/c0 = departure rate per core
// k = (λ + g)/µ = required capacity
// α1 = jobs processed in current iteration
// n = jobs processed in previous iteration
// Ji = time to process job i in previous iteration
// α0 = nt/(∑_{i=1}^{n} Ji) = predicted opportunities
// h = α1/α0 = predicted capacity
if (d > 0) {
    // queue is growing, scale out
    if (k < h) {
        // Measured capacity < predicted, but the queue is growing.
        // Invalid measurement or change in traffic.
        // Wait for more measurements.
        c1 = c0;
    } else {
        c1 = ⌈(λ + g)/µ⌉;   // Equation 3.1
    }
} else {
    // queue is not growing, scale in
    if (1 < h) {
        // Queue is not growing but the prediction indicates scale out.
        // Wait for more measurements, do not scale.
        c1 = c0;
    } else {
        c1 = ⌈(α1/α0) · c0⌉;   // Equation 3.3
    }
}
return (c1);

3.4.3 Combined Scaling

If the queuing delay increases, we scale out according to Equation 3.1. If the queuing delay does not increase, we scale in according to Equation 3.3. When scaling out, we verify that the required capacity is greater than the predicted capacity. We do this to check our measurements, which can be in error because of timing or high variability in offered load. If we detect an error (k < h), we do not scale, but collect more measurements over the next interval. When scaling in, our prediction can be in error when load patterns change. Before scaling in, we verify that the ratio h = α1/α0 is less than 1, indicating we have excess processing opportunities.

3.5 Cost Analysis

FEAT saves costs in two ways from a tenant's perspective. One depends on the provider's CF and VM cost structure. The other is in terms of latency violations. Although FEAT requires a controller to monitor the system and to make scaling decisions, this should not add any significant cost over other approaches: all methods require a computing element to scale the application, whether or not it is part of the Cloud provider's infrastructure.

3.5.1 Provider-Dependent

We acknowledge that the costs in this subsection will vary as providers change their pricing models. However, it is likely that CFs will continue to be less expensive than VMs over short durations, for both the provider and the tenant, because CFs are implemented with containers that have less fragmentation than VMs and can share physical resources more efficiently. Cloud providers price CFs differently than VMs. VMs are intended to run for long periods of time, and AWS and Azure bill them by execution time. CFs are intended to handle a large volume of requests, and AWS and Azure bill them by the number of requests per month and the amount of memory used during each request. FEAT incurs cost only while launching a VM or keeping a CF warm. It uses few requests per month, so memory is the dominant cost. Our system keeps a CF warm by making short running requests every 5 min. These requests last less than 1 sec, but we round up to 1 sec here to simplify the analysis. Let C be the number of CFs in the pool. Let V be the number of VMs added per month. Let A be the cost per request and B be the cost per GB-sec. The total cost per month is given by Equation 3.4.
Cost = (C_req + V_req) · A + (C_mem + V_mem) · B (3.4)

where C_req is the number of requests per month to keep the CFs warm, V_req is the number of requests per month used to launch VMs, C_mem is the amount of memory per month (in GB-sec) to keep the CFs warm, and V_mem is the amount of memory per month used while launching VMs. While a VM is initializing, we make requests at a rate of 1 req/min (the provider's measurement frequency) to keep the application executing on the CF. When the VM is running, we allow the CF to return to an idle but warmed state. At the time of this writing, A = $2 × 10⁻⁷ and B = $1.667 × 10⁻⁵ for both AWS and Azure. If it takes 2 min to launch a VM and 3 GB RAM per CF (the worst case on AWS), then we have C_req = 8,760·C, V_req = 2·V, C_mem = 26,280·C, and V_mem = 360·V. Figure 3.2 shows the cost for a range of CFs and VMs. Concretely, the cost of running a spare idle VM with 1 CPU and 3 GB RAM for a month is about $50, whereas the cost of keeping a similarly provisioned CF warm for a month is about $0.50. It is two orders of magnitude less expensive for a tenant to keep a CF ready to process requests than to keep an idle VM.
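Plugging the numbers above into Equation 3.4 reproduces the warm-CF figure. The sketch below uses the constants derived in the text; the pool size and VM count passed in are example values.

```python
A = 2e-7      # $ per request (AWS/Azure at the time of writing)
B = 1.667e-5  # $ per GB-sec

def monthly_cost(cf_pool_size, vms_added, gb_per_cf=3.0, launch_sec=120):
    """Equation 3.4: one keep-warm request every 5 min (8,760 one-second
    requests per CF per month) plus continuous CF execution during each
    2-min VM launch."""
    c_req = 8_760 * cf_pool_size                 # keep-warm requests/month
    v_req = 2 * vms_added                        # launch-bridging requests
    c_mem = c_req * 1.0 * gb_per_cf              # GB-sec of keep-warm time
    v_mem = launch_sec * gb_per_cf * vms_added   # GB-sec while VMs launch
    return (c_req + v_req) * A + (c_mem + v_mem) * B

print(monthly_cost(cf_pool_size=1, vms_added=0))  # ~$0.44/month per warm CF
```

At these rates, a pool of 40 warm CFs costs under $20 per month, versus roughly $2,000 per month for 40 similarly provisioned idle VMs.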
3.5.2 Provider-Independent

We now present our model for the cost savings during a scale out event. This model is independent of the cost structure of the provider. For this model, we assume latency is caused by queuing delay. Let Γ be a target latency. If the mean queuing delay γ is greater than Γ during a time interval δ, we apply a cost penalty of β to the tenant's utility U [89]. The cost P of the request due to the penalty is P = β · (γ/Γ) · U. Consider an offered load increase from λ0 to λ1 jobs/sec. Figure 3.3a shows the time line of a scale out event. At time t0, the system is running with c0 VMs and the controller makes a scaling decision. The dotted line shows the time line if the system uses CFs and the solid line without them. Time t_CF is the time at which the CFs become ready if the system is using them; latency stops growing at t_CF if CFs are being used. Time t1 is the time at which the newly launched VMs become ready. The queuing delay increases from t0 until the CFs or VMs become ready. Figure 3.3b shows the time series of the increase in delay during the scaling event. While the expected value of queuing delay grows exponentially with utilization, the time series growth is linear [29]. At some time tΓ, the queuing delay becomes greater than the target Γ. Latency violations occur in the shaded area of the figure. The total penalty is the sum of the penalties in each δ. We use constant values of λ0 and λ1 in the figures to illustrate our point; however, in a real system, the arrival rate will vary over time. To compute the penalty at run time, a real system cannot simply compute the area of the shaded portion of Figure 3.3b. We compute the total penalty P_total at run time by summing the penalties at the end of each interval δ. There are (t1 − tΓ)/δ intervals between tΓ and t1. Let γi be the measured average latency between times tΓ + (i − 1)δ and tΓ + iδ, where i is the interval number; then we have Equation 3.5.

P_total = ∑_{i=1}^{(t1 − tΓ)/δ} β · (γi/Γ) · U (3.5)

Figure 3.4 shows a sample cost savings as a function of the increase in offered load. Let P_CF be the penalty when using CFs and P_VM be the penalty when not using them. The cost savings is given by Equation 3.6.

ΔP = P_VM − P_CF (3.6)

For this example, we use values on the order of those in our evaluation.

3.5.2.1 Key Observations

There is a fundamental cost advantage to decreasing the effective launch time of a VM. As a continuous function, the savings is the integral of a linear equation describing the increase in latency in Figure 3.3b. This manifests as a parabola in Figure 3.4. Consequently, the cost savings is higher for larger increases in offered load.

3.6 Implementation

We implement FEAT and three other approaches for comparison: (1) a constant number of cores (CONST), (2) a pre-tuned CPU utilization-based algorithm (MAN), and (3) a state-of-the-art hybrid algorithm [89] (FINE). CONST represents a statically-provisioned production system from which we obtain network traces in our evaluation. MAN is a manually pre-tuned state-of-the-industry algorithm available from most Cloud providers. FINE is an existing state-of-the-art algorithm using a hybrid of predictive, reactive, and feedback control techniques. We implement a scalable pub/sub system and a load-balanced req/resp web server farm. We write custom producer, consumer, and controller applications in Java, except for the Azure CF shell, which we write in C#. The two systems are similar in structure, with details below. On AWS, we use EC2 (for VMs) and Lambda (AWS's CFs). On Azure, we use the Virtual Machines and Functions offerings. The producer reads and recreates a pre-recorded network trace. It creates jobs (for pub/sub) or requests (for req/resp) and sends them to the job queue or load balancer.

3.6.1 Publisher/Subscriber

We use Apache ActiveMQ [4] for our job queue. The producers (publishers) send the job data to the work queue. The consumers (subscribers) pull jobs from the queue on a first-come-first-served basis. A consumer uses CPU resources proportional to the job size.

3.6.2 Web Server

For this application, the producers are HTTP clients. They make requests to a custom load balancer that we write. The load balancer proxies requests to Apache HTTP [5] running on the VMs and CFs. Apache HTTP proxies each request to the consumer on the same VM or CF. The consumer returns the response, which follows the reverse path back to the client. The queue consists of the TCP backlog and receive queues in the load balancer. We set the backlog and timeouts high enough that no requests are dropped, to avoid reducing to tail drop.

3.6.3 Cloud Function Operation

AWS Lambdas are preloaded with Java, JavaScript/NodeJS, Python, and C# runtimes on the Amazon Linux OS. Azure Functions are preloaded with C#, F#, JavaScript/NodeJS, Python, and PHP runtimes on the Windows Server 2012 Datacenter OS. Programmers write code for any of these runtimes with an SDK from the provider and upload it to the provider's CF system. A CF will run any code that compiles to a user-level executable [26, 86]. We regard our CF as a shell that interfaces to the underlying OS. We execute and control our consumer and Apache HTTP, which do not depend on the supported runtimes. We write our shells in Java on AWS and in C# on Azure. We write our consumer in Java. Apache HTTP is written in C. The CFs resume from a paused state in 20-40 sec on AWS and 40-50 sec on Azure. The VMs launch and start our applications in 2-3 min on AWS and 4-5 min on Azure. Azure queuing delays grow more than AWS's before the VMs are fully launched. As a result, Azure benefits more from CFs than AWS.
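The shell idea can be sketched in a few lines. Our shells are written in Java and C#, but the structure is the same in Python; the handler below is hypothetical, and the binary path and event fields are made up for illustration.

```python
import subprocess

consumer = None  # long-running user-level executable hosted by the CF

def handler(event, context):
    """Hypothetical CF entry point: start the unmodified VM application
    binary on first invocation, then let it process until this request's
    time budget is spent. The controller re-invokes the CF in a loop
    until the real VM is up, then sends a suspend request."""
    global consumer
    if event.get("action") == "suspend":
        if consumer is not None:
            consumer.terminate()   # VM is ready; idle this CF
            consumer = None
        return {"status": "suspended"}
    if consumer is None or consumer.poll() is not None:
        # Launch the conventional application as a child process.
        consumer = subprocess.Popen(["./consumer", "--port", "8080"])
    return {"status": "running"}
```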
3.7 Evaluation

We first evaluate the algorithms with a synthetic data set without CF fast launch to establish a baseline. Next, we add fast launch to show the improvement from the decrease in effective launch times. Finally, we evaluate them on a network trace from a statically and over-provisioned production web server to determine how robust they are in a real-world environment. We summarize mean results in Tables 3.1 and 3.2. Note that the results for all algorithms show an average queuing delay an order of magnitude larger than typical service level agreements (SLAs) [20]. In a real-world web server, when the TCP request rate exceeds capacity, the backlog fills and the server drops arriving TCP connections, so the queuing delay does not grow very large. In our pub/sub application, we use an explicit queue to hold pending jobs, and in our web server application, we set the TCP backlog to a very high value to evaluate the system without resorting to tail drop. We do not discard any requests. Consequently, queuing delay can grow large and stay at elevated levels until we add capacity to the system.

3.7.1 Evaluation Algorithm Details

CONST, MAN, and FINE require pre-tuning to the network. Fast launch has no significant effect on CONST because its VMs are always running. FEAT does not require pre-tuning.

3.7.1.1 FINE

FINE trains an autoregressive AR(2) predictor ahead of time on pre-recorded traces to predict the arrival rate [89]. It pre-computes a lookup table that maps arrival rate to the core count required to service that rate. At run time, it looks up the predicted rate to obtain the core count. It uses latency as a feedback signal to fine-tune the core count to achieve a target latency. One issue we encounter while pre-tuning FINE is that pre-computing the core count does not always give an appropriate computing capacity, because workloads vary over time. The feedback signal fine-tunes the capacity but introduces feedback delay. Another issue is pre-tuning the feedback gain. At high gain, the core count far exceeds the provider-imposed core quota. To pre-tune, we iteratively lower the gain until the core count stays below the quota without causing high queuing delays. In spite of our best effort, the core count occasionally exceeds the quota during the evaluations; we limit the core count to the quota in this case. We use the algorithm's default 0.6 ms target latency and a gain of 10⁻⁵.

3.7.1.2 MAN

If the average CPU utilization exceeds an upper threshold, MAN scales out by a pre-tuned number of cores. If it drops below a lower threshold, MAN scales in by another pre-tuned number of cores. We find that an upper threshold of 40% and a lower threshold of 20% work well in our environment. We set the upper threshold somewhat low to detect increased load as early as possible. We scale by ±3 cores per operation.

3.7.1.3 CONST

In each experiment, we set the number of cores in CONST to handle 3/4 of the peak request rate. The choice of core count is a trade-off between the cost of over-provisioning and the cost of slow response at high workloads. At low request rates, CONST is over-provisioned and wastes resources. At high request rates, it is under-provisioned and incurs long queuing delays. Provisioning CONST is difficult because it may be hard or even impossible to foresee peak rates in real-world systems.

3.7.2 Experiments

We first establish a baseline by running a synthetic workload with CFs disabled. We then run the workload with CFs enabled. Finally, we run a production workload to see how the system performs in a real-world environment.

3.7.2.1 Without Fast Launch

We establish a baseline for each algorithm by disabling the CFs.
We generate a square wave with a trough of 15 req/sec and two peaks of 35 req/sec (more than 2× peak-to-trough) and 150 req/sec (10× peak-to-trough). We alternate peak rates every period. We set the peak and trough durations to 10 min each. We evaluate first with horizontal scaling only and then with both horizontal and vertical scaling. With horizontal only, we use VMs with a single core. The representative traces in Figures 3.5 and 3.6 show horizontal scaling results for pub/sub on AWS and Azure, respectively. Figure 3.5c shows the queuing delay of FEAT and FINE enlarged for detail. We set CONST to 12 cores. It has the lowest queuing delay because it is over-provisioned most of the time; this is reflected in its low CPU utilization.

3.7.2.2 With Fast Launch

We see a significant improvement in queuing delay with CFs enabled, though they have no real effect on CONST, since its cores are always running. Our CFs allow capacity to be added quickly, so higher request rates have less opportunity to form a queue. We see up to 45% less queuing delay on AWS and up to 75% less on Azure. We see up to a 20% reduction in cores in most cases. The representative traces in Figures 3.7 and 3.8 show the combined horizontal and vertical scaling results for pub/sub on AWS and Azure, respectively. Figure 3.7c is enlarged for detail. We see FINE compute a core count beyond our quota near the 80 min mark. We see the queuing delay drop, but CPU utilization also drops, causing inefficient use of resources for an extended period. CONST is not able to control queuing delay during the peak traffic rates.

3.7.2.3 Production Trace Evaluation

We collect network traces from a production system that serves assets and metadata to hundreds of clients. It is over-provisioned, with multiple machines running Apache HTTP. It services both M2M and web browser requests. We execute many traces with similar results. To illustrate the issues with sudden changes in offered load, we choose a trace from a 24-hour period in February 2017 with a sudden spike in traffic near the 17th hour, shown in Figure 3.9. The figure shows req/sec, response size, and the response size distribution. We execute our web server with CF fast launch enabled and horizontal scaling. We show AWS results in Figure 3.10 and Azure results in Figure 3.11. Figures 3.10c and 3.11c are enlarged for detail. We break queuing delay results into 1-min intervals and show the percentage of delays greater than 10 ms and 50 ms in these intervals, representing the percentage of SLA violations. In this experiment, we run CONST over-provisioned (CONST1) and again with enough cores to handle 3/4 of the peak rate (CONST2). We show the latter run in the figure. The CONST1 data show inefficient use of cores (45% CPU utilization on AWS and 42% on Azure). It has fewer SLA violations, but at a higher core count. CONST2's queuing delay grows to excessive levels and the system does not recover by the end of the experiment. FINE performs well because there is less variability in request rate than in our square wave evaluations; however, FEAT performs nearly equally well without the need to pre-tune knobs. AWS results show FEAT decreases queuing delay by 40% over MAN and improves CPU utilization by 40% compared to the over-provisioned CONST1. This indicates more efficient use of resources, reflected in the decrease in core count. FEAT decreases the core count, and thus cost, by about 20% compared to MAN and 35% compared to CONST1.
3.7.3 Measurement Periods

Cloud providers periodically measure VM CPU utilization and managed queue delays, eliminating the need to modify existing code to obtain these measurements. AWS and Azure offer two periods: a 5-min free tier and a 1-min paid tier. Since FEAT is event-driven and operates at the Cloud provider's measurement period, we evaluate it at each. We run a square wave with fast launch enabled. We set a 35 req/sec peak and a 15 req/sec trough with durations of 30 min each and scale horizontally. Table 3.3 shows the results. FEAT performs much better at 1-min intervals, keeping delays very low with little change in the number of cores and CPU utilization. Therefore, we use a 1-min interval for all experiments in §3.7.

3.7.4 Observations

In general, without fast launch, we observe a lower queuing delay with horizontal scaling than with combined horizontal and vertical scaling. This is caused by variance in the time it takes to launch a VM. With more, smaller VMs, some launch quickly and begin to process requests while the remaining VMs continue to launch. Adding CFs tends to equalize this difference. We see lower queuing delay on AWS than on Azure without fast launch because of the longer VM launch times on Azure. Our use of CFs tends to decrease this difference.

3.8 Stability

We evaluate FEAT's stability with a discrete-event simulator, which we build, that models the system as an M/M/c queue [29] with c cores. Arrival and processing rates have exponential distributions with means λ and µ, respectively. We run the simulator on a wide range of values (Table 3.4) to probe the boundaries of stability. Experiments include (1) a single arrival rate with mean λ, (2) a square wave with λ alternating between two values, and (3) a wave where we vary λ as a sinusoid. We measure the mean queuing delay and compute the standard deviation and interquartile range (IQR) of the queuing delay. We also measure the mean core count and utilization. Table 3.5 shows results with λ = 15 jobs/sec, µ = 10 jobs/sec, and a launch time of 60 sec. In the case of Experiment 2, the table shows results for λ alternating between 15 and 35 jobs/sec. For Experiment 3, λ oscillates between 15 and 35 jobs/sec with a period of 30 min. The IQR remains low, and utilization indicates the queuing system remains stable in all cases in all experiments. We observe the same general trends in simulation as in the implementations: with a shorter metric period, queuing delay decreases, core count increases, and CPU utilization decreases.
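Our simulator is more elaborate, but the core of an M/M/c discrete-event loop fits in a short sketch. The following is a minimal illustration for a fixed core count c, not our simulator; the parameter values match the table entry above.

```python
import heapq, random

def mmc_delays(lam, mu, c, n_jobs=100_000, seed=1):
    """Simulate an M/M/c FCFS queue and return per-job queuing delays.
    lam: arrival rate (jobs/s), mu: per-core service rate, c: cores."""
    rng = random.Random(seed)
    t = 0.0
    free = [0.0] * c                 # next-free time of each core
    heapq.heapify(free)
    delays = []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)             # Poisson arrivals
        start = max(t, heapq.heappop(free))   # wait for earliest free core
        delays.append(start - t)              # time spent queued
        heapq.heappush(free, start + rng.expovariate(mu))
    return delays

d = mmc_delays(lam=15.0, mu=10.0, c=3)
print(sum(d) / len(d))   # mean queuing delay at 50% utilization
```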
ExperiAlgoQueuing Delay (sec) Number of Cores CPU (%) Scaling ment rightm AWS Azure AWS Azure AWS Azure CONST 24.02 25.59 34.38 22.40 12.00 12.00 12.00 12.00 24.65 24.54 24.64 22.79 Horizontal MAN 217.55 224.20 285.15 292.55 9.48 9.41 9.00 8.57 45.66 43.76 40.41 42.99 Square Only FINE 50.65 51.81 101.47 133.15 8.56 10.13 11.51 13.14 57.29 56.63 46.21 46.59 Wave FEAT 45.54 31.31 83.61 101.59 7.35 7.50 8.98 9.54 45.77 45.04 41.98 40.94 without CONST 23.01 24.51 25.52 25.25 12.00 12.00 12.00 12.00 30.19 29.15 29.26 33.26 CF Fast Horizontal MAN 232.62 242.65 388.29 318.19 8.93 9.34 7.30 7.91 43.72 43.33 44.43 44.30 Launch and FINE 53.40 60.43 88.87 116.98 9.09 9.98 9.69 13.29 51.30 48.13 46.22 43.16 Vertical FEAT 43.20 44.28 70.74 95.28 8.33 7.13 11.42 11.46 43.10 43.26 34.76 39.99 CONST 23.90 25.62 23.69 21.98 12.00 12.00 12.00 12.00 24.25 23.71 26.43 21.71 Horizontal MAN 148.05 169.16 247.20 203.59 7.62 7.70 7.52 7.27 43.54 42.64 42.14 38.74 Square Only FINE 41.28 31.31 41.73 34.79 9.09 8.62 12.26 10.41 56.65 51.05 40.87 40.94 Wave FEAT 25.71 18.44 30.41 19.77 6.48 6.61 8.92 8.93 45.41 38.52 37.14 35.53 with CONST 23.01 24.66 24.73 22.16 12.00 12.00 12.00 12.00 29.59 28.91 27.04 25.78 CF Fast Horizontal MAN 202.07 114.90 222.67 230.61 9.02 7.96 7.33 7.27 41.19 37.34 40.39 40.04 Launch and FINE 33.85 28.17 34.28 47.58 8.29 7.91 10.28 14.27 54.85 49.39 42.25 40.28 Vertical FEAT 24.66 25.69 20.76 51.69 7.96 8.89 10.37 9.73 42.88 37.44 36.44 36.83 51 CONST1 CONST2 MAN FINE FEAT Algorithm Table 3.2: Production Summary, Request/Response Queuing Delay (sec) Number of Cores CPU (%) 10 ms Violations (%) AWS Azure AWS Azure AWS Azure AWS Azure 0.01 0.01 5.00 5.00 45.08 42.00 3.55 2.15 662.99 242.68 3.00 3.00 73.42 75.03 36.49 34.67 1.05 0.47 4.07 3.42 62.97 69.20 14.93 15.77 0.73 0.32 3.14 3.21 82.04 80.62 10.67 6.74 0.66 0.24 3.19 3.20 79.16 77.92 10.57 6.53 50 ms Violations (%) AWS Azure 2.86 1.32 35.52 33.22 13.43 14.51 10.54 5.37 10.44 5.21 52 53 Table 3.3: Measurement Period Summary Delay (Sec) No. Cores CPU (%) AWS Azure AWS Azure AWS Azure 1 min 0.47 0.33 2.97 2.82 70.35 69.24 5 min 8.71 7.89 2.79 2.76 72.52 74.32 Period Table 3.4: Simulation Ranges Variable Minimum Maximum Arrival Rate λ 1 job/sec 1000 job/sec Processing Rate µ 1 job/sec 1000 jobs/sec Launch Time (Sec) 0 sec 10 min Measurement Period 10 sec 10 min Experiment 1 2 3 Table 3.5: Simulation Results IQR Std. Dev. Period (s) Delay (s) (ms) (s) 10 0.05 0.05 0.09 100 0.10 0.10 0.16 1000 0.12 0.19 0.20 10 0.83 0.06 3.19 100 4.59 0.22 13.22 1000 122.28 249.01 202.86 10 0.46 0.08 1.39 100 0.74 0.13 3.15 1000 102.09 174.80 144.20 CPU (%) 54.56 67.06 71.28 64.38 75.03 76.57 65.94 76.24 78.21 No. 
Figure 3.1: Scaling System. (a) Scaling Architecture: producers feed a job queue, and a controller directs consumers running on virtual machines and Cloud functions. (b) Queuing Model.
Figure 3.2: Cost. (a) Request cost ($/month) and (b) memory cost ($/month) as functions of VMs added per month and CF pool size.
Figure 3.3: Scale Out Event. (a) Scaling time line. (b) Queuing delay detail.
Figure 3.4: Cost Savings During Scale Out Event (cost versus increase in offered load).
Figure 3.5: AWS Pub/Sub without Fast Launch. (a) CPU % and cores. (b) Queuing delay (sec) and cores. (c) Queuing delay (sec) and cores, enlarged. Panels plot CONST, MAN, FINE, and FEAT against the square-wave workload (req/sec) over 120 minutes.
Figure 3.6: Azure Pub/Sub without Fast Launch. (a) CPU % and cores. (b) Queuing delay (sec) and cores.
Figure 3.7: AWS Pub/Sub with Fast Launch. (a) CPU % and cores. (b) Queuing delay (sec) and cores. (c) Queuing delay (sec) and cores, enlarged.
Figure 3.8: Azure Pub/Sub with Fast Launch. (a) CPU % and cores. (b) Queuing delay (sec) and cores.
Figure 3.9: Network Trace. (a) Request rate (requests/sec) over 25 hours. (b) Response size (KB) over 25 hours. (c) Response size distribution.
Figure 3.10: AWS Req/Resp Production Trace. (a) CPU % and cores. (b) Queuing delay (sec) and cores. (c) Queuing delay (sec) and cores, enlarged.
Figure 3.11: Azure Req/Resp Production Trace. (a) CPU % and cores. (b) Queuing delay (sec) and cores. (c) Queuing delay (sec) and cores, enlarged.

CHAPTER 4
AUTO-SCALING CLOUD-BASED MEMORY-INTENSIVE APPLICATIONS

4.1 Introduction
Today, Cloud providers offer simplistic scaling policies that rely on thresholds, forcing tenants to have a priori knowledge of their workloads. We develop a new method for scaling memory-intensive workloads that needs no thresholds. This makes it worry-free for tenants, and it adapts even as workloads evolve. Scaling is especially hard for memory-bound applications because even a small decrease in the amount of memory available to an application can have a dramatic, almost unbounded impact on performance, far larger than a comparable decrease in available CPU power. Hence, sizing a machine's physical memory correctly is critical to application performance and to operating cost. Even so, finding a good operating point without setting a specific threshold is possible.
To determine a natural threshold for memory-intensive applications, our approach automatically analyzes an application's miss ratio curve (MRC) and models it as a hyperbola. Intuitively, a memory scaling policy should operate at the point where the curve flattens, that is, at its intersection with its latus rectum (LR). Our system uses a new approach to constructing and analyzing MRCs at run time that captures memory references from a slice of any scalable application as it executes on standard virtual machines from any major Cloud provider. We demonstrate the approach of provisioning application memory capacity using the intersection of an application's MRC with its LR and show its benefits with multiple applications running on Amazon Web Services (AWS) and Microsoft Azure. Our implementation and evaluation show that, though the LR does not require tenants to set thresholds, it is effective in horizontally and vertically scaling memory-intensive workloads to save on operating costs while avoiding queuing, thrashing, or collapse.

Today, developers are commonly faced with building and deploying large-scale applications that span the resources of many virtual machines on Cloud infrastructure. Properly provisioning these machines remains a challenge. Workloads are constantly changing, and even small changes to application logic can result in hard-to-predict changes in resource requirements. Over-provisioned applications waste resources and increase costs, but, more critically, under-provisioned applications execute slowly and provide a poor user experience. Tenants want to make efficient use of resources while maintaining a good user experience. Cloud providers' scalable infrastructures help accommodate tenants; however, their scaling policies are simplistic, and they force tenants to set thresholds that require a priori (and quickly invalidated) knowledge of the application and its workload. Tenants must analyze their application and its workload to guess how to add or remove resources based on the run-time statistics (e.g., CPU and network load) that the Cloud provider makes available. This analysis is time consuming and has to be done continuously as the application and workload evolve in order to provide the desired experience without wasting resources.

Of course, estimating resource needs for some applications is easy. For example, the resource needs of a parallelizable, partitionable, compute-bound workload are simple to estimate. However, many realistic applications are much harder to provision properly; estimating the memory needs of many applications is difficult and fragile. The amount of memory available to an application has a significant impact on performance. Accessing paged memory has higher latency than physical memory by up to 6 orders of magnitude [45], so any increase in the number of page faults an application incurs causes significant slowdowns. Even some of the earliest works in operating systems recognized that applications can experience a nearly unbounded decrease in performance with a small decrease in available physical memory [21]. For example, Figure 4.1 shows that for the application described in §4.3.1, throughput decreases exponentially with a linear decrease in available memory. Hence, sizing a machine's physical memory correctly is critical to both performance and operating cost.
However, developers should not have to manually and continuously profile applications to do so, nor should they have to express anything about their application or workload to the Cloud provider. In this chapter, we provide a new approach to application scaling and provisioning. Applications undergo lightweight, online profiling to understand their compute and memory use. In particular, our approach extracts memory miss ratio curves (MRCs) [61] and identifies "natural thresholds" in these curves that map to operating points similar to those a tenant would choose. Our approach maps resources to a sensible, easy-to-identify operating point on these curves; it is conservative enough to avoid thrashing, aggressive enough to avoid wasting resources, and adapts well when the workload changes. Most importantly, our method uses no user-defined or pre-set thresholds.

The intuition of our natural threshold is to use phenomena that arise naturally from the system model [72]. For example, the "knee" could work as a natural threshold in a resource-performance graph, though later we show the knee is too conservative for memory-bound applications. Crucially, such natural points create the opportunity to avoid user-specified thresholds altogether. Figure 4.2 gives the intuition for how our algorithm estimates the memory needs of an application with no pre-specified threshold. The system continuously samples a small number of memory accesses to extract a miss ratio curve. A miss ratio curve describes how many misses/page faults the application would expect to experience per unit time (vertical axis) at various memory sizes (horizontal axis). Even a handful of page faults can dramatically hurt performance. To pick a natural threshold, our approach maps a quadratic curve over the sampled MRC extracted from the application. The knee in the curve (the "vertex") is a natural point at which adding more memory provides diminishing improvements in page faults. Nonetheless, it is too conservative: an operator would gladly double DRAM usage to halve page faults, since doing so would yield a nearly 2× performance benefit. Instead, our approach tries to model the tail of the MRC more accurately, and it chooses a different natural threshold that lies to the right of the knee: the intersection of the curve and its latus rectum (LR). Intuitively, memory scaling policy should operate at the point where the curve flattens, which is what the LR identifies [32]. MRCs have long tails, so this threshold avoids wasting memory where there is little gain while still delivering a low miss rate. Finally, modeling the MRC as a curve also tends to remove performance cliffs (MRC concavity that can be hard to optimize around) [8, 13], in which small changes in allocated memory lead to large changes in miss rate.

Currently, there is no method to obtain the MRC of an arbitrary application at run time without modifying the application or tracing only specific portions of it, such as the caching layer. We develop a novel approach of constructing and analyzing MRCs at run time by capturing memory references from a slice of any scalable application while it executes. We do this with standard virtual machines (VMs) available from any major Cloud provider.
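To make the latus rectum geometry concrete, consider the simplest special case: an MRC modeled as the rectangular hyperbola $xy = k$, with memory size on the horizontal axis and miss ratio on the vertical axis. This is only a worked example; the model fitted in §4.6.1 is a general hyperbola, so the constants below are illustrative.

For $xy = k$, the transverse axis is the line $y = x$, and
\[
\text{vertex (knee)}: (\sqrt{k}, \sqrt{k}), \qquad
\text{focus}: (\sqrt{2k}, \sqrt{2k}), \qquad
\text{latus rectum}: x + y = 2\sqrt{2k}.
\]
Substituting $y = 2\sqrt{2k} - x$ into $xy = k$ gives $x^2 - 2\sqrt{2k}\,x + k = 0$, so $x = \sqrt{2k} \pm \sqrt{k}$. The tail-side root,
\[
x^{\ast} = (\sqrt{2} + 1)\sqrt{k} \approx 2.41\sqrt{k},
\]
is the natural threshold: roughly 2.4 times the memory size at the knee in this special case, matching the intuition that the LR sits to the right of the knee without chasing the long tail.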
A significant problem with scaling memory-intensive applications is the tens of seconds or even minutes it takes for a newly provisioned VM to launch [60]. Further, it takes time to redistribute state to the new VMs, if the application requires it. We mitigate VM launch time by using Cloud functions (CFs) as interim resources until a new VM becomes ready. CFs are lightweight computing elements available from all major Cloud providers [2, 43, 63]. They are small, short-lived, and do not maintain state when decommissioned. However, for short durations, they provide additional computing power and memory, which we use while VMs continue to launch.

We build a reactive auto-tuning method to scale resources for Cloud-based memory-intensive applications. Existing reactive methods for scaling Cloud-based resources, including CPU and memory, require a priori knowledge of the expected workload [27, 34, 53, 54]; if the run-time workload changes, they may no longer scale resources effectively. Our method instead scales physical memory by adapting to the workload based on run-time measurements. Similar to the way TCP probes a network for available bandwidth, our method probes the system for the required memory, using the MRC threshold as a starting point. Our contributions are
• a real-time algorithm that reactively scales resources for memory-intensive applications in the Cloud and automatically adapts to workload changes with no pre-specified thresholds;
• a new approach to guiding scaling decisions that relies on extracting natural thresholds over models extracted from application run-time statistics;
• modeling application miss ratio curves as a hyperbola and identifying the intersection of the hyperbola and its latus rectum as a natural threshold that is well-suited to applications where cache misses/page faults are costly;
• a new and inexpensive mechanism for computing an application's miss ratio curve at run time that uses a modified hypervisor [9] to collect memory traces;
• a demonstration that our natural threshold effectively computes appropriate memory sizes that reduce operating costs, avoid thrashing or storage-bound performance, and avoid the need to pre-specify any thresholds such as a target latency; and
• the use of Cloud functions as temporary memory to absorb increases in offered workload as VMs launch.
We implement the system on two major Cloud provider platforms and evaluate it by horizontally and vertically scaling memory-intensive applications.

4.2 Related Work
We unite a number of concepts to create an auto-scaling system that does not require specification of thresholds.

4.2.1 Auto-Scaling
Database-as-a-Service systems use domain-specific knowledge [18, 19, 79] for scaling. Similarly, Huang et al. [41] create a reactive scaling algorithm based on service level agreement (SLA) latency to scale MongoDB [44]. Our algorithm is not restricted to database applications and does not require domain-specific knowledge. Memory-as-a-Service systems use remote memory to scale the amount of memory available to an application [10, 33]. They allocate blocks of memory from remote nodes connected by high-speed networks [28, 55, 56]. These systems are typically intended to scale an application vertically. Because memory is accessed across a network, latencies are about 3 orders of magnitude greater than for local memory. Our work focuses on right-sizing the memory of applications that can be scaled both vertically and horizontally, to take advantage of the low latency of local memory and to reduce paging to disk.

4.2.2 Miss Ratio Curves
A single-pass algorithm for generating MRCs under LRU (least recently used) replacement has existed since Mattson et al.'s
seminal 1970 paper [61]. Talus [8] shows how to ensure that MRCs decrease in a convex way for LRU caches. Cliffhanger [13] removes performance cliffs from MRCs that contain no more than a single cliff for caching applications. A recent system shows how MRCs can be computed for both LRU and non-LRU cache replacement policies using miniature simulation [87]. SHARDS [88] uses MRCs and miniature simulation to determine the cache size required (effectively, vertical scaling) to reduce average I/O latency in disk caches by a specified amount. We use MRCs to scale physical memory in virtual machines both horizontally and vertically to minimize virtual memory paging. Importantly, our method right-sizes memory without requiring pre-specification of a target latency. In general, existing MRC-based methods for memory provisioning require pre-specification of a target latency, which in turn requires careful tuning for each specific network environment; these methods use the MRC to determine what increase in memory is required to meet the latency requirement. Instead, we determine a natural threshold (operating point) to right-size memory without the need to pre-specify any thresholds.

4.2.3 Profiling Page Faults
Geiger [50] monitors disk I/O to infer page faults and make memory allocation decisions. It cannot determine how to decrease memory when it is over-allocated. Other work focuses on efficient MRC construction [93] or on using permission protection mechanisms to construct MRCs [94, 95].

4.2.4 Natural Thresholds
DUK [72] introduces the concept of a natural threshold in the context of the delay-utilization curve of active queue management. We use the concept in the context of right-sizing memory in VMs: we analyze a different curve, the MRC, in a memory-scaling context; we use a different model to determine our natural threshold; and we choose a different location on the curve.

4.3 Background
The nonlinear impact memory sizing has on performance makes memory scaling for opaque applications both challenging and important. Determining the amount of memory a scalable application requires at run time in a robust way requires combining and extending a number of different ideas, including MRCs, natural thresholds, and miniature simulation. Implemented properly, these techniques can apply to a wide range of application classes.

4.3.1 Nonlinear Impact of Memory on Performance
Performance versus memory exhibits a nonlinear relationship: Figure 4.1 demonstrates a nonlinear increase in latency with a linear decrease in physical memory. We generate these plots by running a benchmark application against the Redis key-value store [11] at various amounts of physical memory on a Linux machine with a maximum of 4 GB RAM and 4 GB swap space on a solid state drive (SSD). For each execution of the benchmark, we limit the physical memory available to the application and record the number of requests/sec the benchmark is able to make to the server to store keys. The figure shows both the number of requests/sec and the amount of swap used at each amount of available physical memory. In this application, the request rate drops exponentially with a linear increase in used swap space caused by a corresponding decrease in available physical memory. Running on SSD media shows an exponential decrease in performance; magnetic disks would show even faster decay. Clearly, it is beneficial to allocate sufficient main memory to prevent an application from paging to virtual memory.
4.3.2 MRC Model
A miss ratio curve (MRC) shows the ratio of cache misses to the total number of cache references as a function of cache size. The MRC will change over time as the patterns in the offered workload change. An MRC may have multiple performance cliffs, where a small change in the amount of memory leads to large changes in the miss ratio. Figure 4.3 shows two very different MRCs we obtain from the applications in the evaluation; the horizontal axis is normalized for illustration, and Figures 4.3a and 4.3b are enlarged for detail. In Figure 4.3a, the MRC (dotted line) has a gentle slope in the area around the vertex knee and does not contain a performance cliff. The MRC in Figure 4.3c has a sharp slope on each side of the knee and a performance cliff toward the right side of the figure; Figure 4.3b shows the detail of the cliff.

We model the MRC as a hyperbola (solid line). When the slope near the vertex knee is gentle, as in Figure 4.3a, the hyperbola models the entire curve very closely. When the slope is sharp, it fits the tail very well, and the tail is the portion of the MRC that matters for choosing the natural threshold. The figures show key locations on the model, including the vertex knee, where the change in the horizontal axis equals the change in the vertical axis (slope −1); the focus of the hyperbola; and the intersection of the LR with the hyperbola. The latter is our natural threshold.

Other operating points on the MRC are possible. For example, it is tempting to choose the vertex knee. The vertex knee of the hyperbola model cannot be used because, as Figure 4.3c shows, it can compute a threshold to the left of the vertical axis; the actual vertex knee of the MRC, however, can be obtained by searching the data for a slope of −1. This location tends to lead to oscillation and does not provide a cushion for a sudden change in offered load that may push the knee farther to the right, resulting in too little memory and more frequent page faults.

4.3.3 Miniature Simulation
Sampling all memory references of a running application is prohibitively expensive in terms of computing resources; further, it is not currently possible because of the lack of support from commodity hardware. A miniature simulation emulates a cache of any given size by sampling a spatially hashed subset of memory references at a sampling rate R. It yields a highly accurate estimate of the miss ratio curve, with a mean absolute error (MAE) on the order of 0.003 at a sampling rate R = 0.001 [87]. The output of the simulation can be passed to a standard MRC generator. The results are divided by the sampling rate R to obtain an estimate of the MRC of the full set of memory references.

Current commodity hardware does not support sampling memory references on a running VM. We therefore modify the QEMU hypervisor's memory management unit (MMU) emulator to capture memory references from the guest OS running on the hypervisor. Capturing memory references in this manner is agnostic to the guest OS and to the applications running on the guest. We run the hypervisor on a standard Cloud provider VM and run the emulated guest OS and application on the hypervisor. The emulated VM runs slowly and is not appropriate for applications with hard deadlines; however, it is sufficiently fast to capture a subset of memory references for MRC computation. Further, it would be expensive to emulate the full workload on multiple machines running the hypervisor.
Running a subset of the workload on a single emulated VM, similar in concept to cache miniature simulation, is sufficiently fast to compute the MRC. We obtain an estimate of the MRC by combining both techniques: running a subset of the workload on the hypervisor VM and performing a miniature simulation on the memory references captured on that VM. We divide the output of the MRC by the product of the sampling rates of the workload and the memory references to obtain the MRC estimate.

4.3.4 Application Classes
At the most general level, we can use MRCs on any memory-intensive application that scales. To demonstrate, we look specifically at the class of applications that have a queue that can be measured. This class includes (i) the caching tier of a web application, where the queue is the TCP connection backlog and request queue at the front-end web servers; (ii) batch processors with an explicit work queue; and (iii) IoT servers, also with an explicit work queue. We use first principles to design a simple scaling algorithm to evaluate the MRC thresholds on this class of applications.

We use CFs as temporary computing devices to run an instance of the application as a VM launches. Our use of CFs is applicable to any application we scale; however, since CFs are not guaranteed to maintain state over long durations, they are most beneficial to the class of applications that do not need to maintain state, specifically batch processors and IoT servers. We find that CFs do maintain state over short durations, and we use them only temporarily while launching VMs; thus, they are beneficial to caching applications as well. Caching applications accumulate state and become more efficient over time. Because CFs maintain state over a short time, they operate as efficient short-term caching nodes while they execute. When the VM has launched, it begins to accumulate state and we terminate the CF. At this time, there is a decrease in caching performance, but only for a short time while the VM accumulates as much state as the CF had when it terminated.

To demonstrate and evaluate these methods, we focus on two specific instances of the application classes: we implement a Redis caching cluster and a batch processing server farm that applies a watermark to high-resolution images.

4.4 Architecture
Figure 4.4 shows the architectural model of our applications. It is a producer/consumer or client/server model with a queue. This model supports a wide variety of applications, including publisher/subscriber (pub/sub) and web server-based request/response (req/resp) applications. For pub/sub, an explicit queue holds pending jobs. The jobs can either be pushed to VMs and CFs through a load balancer or pulled by the application. For web server applications, the queue is the TCP backlog and request queue at the load balancer. The load balancer routes requests to the VMs and CFs.

4.4.1 System Components
Figure 4.5 shows the system components. They consist of (i) a controller to coordinate the other components; (ii) CFs and VMs that run the application under evaluation; (iii) the job queue or load balancer to distribute jobs or requests to the application; (iv) metrics sources, including the provider's metrics systems (AWS CloudWatch or Azure Monitor), VMs, CFs, and the job queue or load balancer; (v) persistent storage to record data for analysis; and (vi) logging to the provider's logging subsystem.
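As a rough illustration of how these components might interact in one control period, consider the following sketch. All interface shapes and names below are simplifications introduced for illustration; they are not the actual module APIs.

    /** Illustrative wiring of the modules in Figure 4.5 for one control period. */
    interface MetricsManager   { double queueDelta(); boolean anyVmLowOnMemory(); }
    interface ScalingAlgorithm { double desiredMemoryGb(double delta, boolean lowMem, double currentGb); }
    interface VmManager        { void scaleTo(double memoryGb); }        // launch/terminate VMs
    interface FunctionManager  { void bridgeWithCfs(double memoryGb); }  // interim CF capacity

    class ControllerSketch implements Runnable {
        private final MetricsManager metrics;
        private final ScalingAlgorithm algorithm;
        private final VmManager vms;
        private final FunctionManager cfs;
        private double allocatedGb = 4.0;  // minimum configuration used in Section 4.8

        ControllerSketch(MetricsManager m, ScalingAlgorithm a, VmManager v, FunctionManager f) {
            metrics = m; algorithm = a; vms = v; cfs = f;
        }

        @Override public void run() {                  // one control period
            double delta = metrics.queueDelta();       // change in queuing delay
            boolean low = metrics.anyVmLowOnMemory();  // low-memory flag from any VM
            double target = algorithm.desiredMemoryGb(delta, low, allocatedGb);
            if (target > allocatedGb) cfs.bridgeWithCfs(target - allocatedGb); // absorb load while VMs launch
            if (target != allocatedGb) { vms.scaleTo(target); allocatedGb = target; }
        }
    }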
To execute pre-compiled applications on a CF, we create a shell in a provider-supported CF runtime. The shell interfaces to the underlying operating system to execute the applications. We run the same applications on CFs as on VMs. We add an API to the shell that the controller uses to instruct the shell to start, stop, pause, or resume the applications as needed.

In addition to monitoring VM statistics such as CPU utilization and memory consumption, the controller monitors the job queue/load balancer for metrics (m in Figure 4.4), including the arrival and departure rates. For pub/sub applications, we measure the queue in terms of the number of jobs waiting to be processed. For web server applications, we measure the number of pending TCP connections and the receive queue length.

4.4.1.1 Controller
This module coordinates the activities of the other modules. It initiates scaling operations and registers or unregisters VMs and CFs with the load balancer.

4.4.1.2 Function Manager
The Function Manager calls AWS Lambda or Azure Functions to create and destroy CFs. It calls our CF API to start or stop the applications.

4.4.1.3 VM Manager
The VM Manager instructs AWS EC2 or Azure Virtual Machines to create or destroy VMs.

4.4.1.4 MRC Manager
This module receives memory references from the MRC VM. It periodically regenerates the MRC and provides it to the Scaling Algorithm. We compute the MRC by executing a small fraction of the workload on a separate VM running the QEMU hypervisor. This method is applicable to any application where a small fraction of the workload can be replicated and distributed to the VM running QEMU. The MRC Manager collects memory references from the VM running the QEMU hypervisor. We demonstrate how an algorithm may use the MRC to make scaling decisions based on memory usage in §4.6.

4.4.1.5 Workload Generator
The Workload Generator produces the workload based on pluggable generator modules. We write generator modules that include Yahoo! Cloud Serving Benchmark (YCSB) [16] workloads and pre-recorded trace files.

4.4.1.6 Scaling Algorithm
The Scaling Algorithm is a pluggable module that determines the desired resources given the current set of metric measurements.

4.4.1.7 Metrics Manager
This module queries various sources for measurements, including the provider platform, the job queue or load balancer, application VMs, CFs, the MRC VM, and the Workload Generator. Cloud providers make CPU utilization and work queue metrics available through their platforms. We collect these metrics and also provide a custom component, running on the VMs, that collects metrics the Cloud providers do not implement. For example, AWS requires running a guest OS application to collect memory statistics; we implement our own such application that integrates directly with the controller. This gives us more control and extensibility than relying on provider implementations that may or may not give the same set of metrics. The module is configurable to collect metrics either at the Cloud provider's frequency or at a periodic interval.

4.4.1.8 Data Recorder
This module saves real-time performance data to persistent storage for later analysis.

4.4.1.9 Log Manager
This module manages Cloud-based logs.

4.4.1.10 Load Balancer
The load balancer routes data to applications running on the VMs. It marshals data to applications running on CFs through the CF API.
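Before describing CF operation in detail, here is a minimal sketch of the shell side of the start/stop/pause/resume API on the Linux-backed AWS runtime. The class and method names are ours for illustration; the real shells implement the full API of Table 4.1 and handle provider-specific restrictions.

    import java.io.IOException;
    import java.util.List;

    /** Minimal sketch of the CF shell's process control (Linux-backed AWS runtime). */
    class CfShellSketch {
        private Process app;  // the pre-compiled application binary

        void start(List<String> command) throws IOException {
            app = new ProcessBuilder(command).inheritIO().start();  // launch the user-level binary
        }

        void pause() throws IOException  { signal("STOP"); }  // suspend until the next scale out
        void resume() throws IOException { signal("CONT"); }  // resume during a scale out
        void stop()                      { app.destroy(); }   // orderly shutdown

        // Map pause/resume onto POSIX job-control signals via kill(1); this
        // assumes a Linux host, so the Windows-based Azure shell would differ.
        private void signal(String sig) throws IOException {
            new ProcessBuilder("kill", "-" + sig, Long.toString(app.pid())).start();
        }
    }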
4.4.2 Cloud Function Operation
CFs are lightweight processing units offered by all major Cloud providers, including Amazon Web Services (AWS) Lambda [2], Microsoft Azure Functions [63], and IBM OpenWhisk [43]. They are typically implemented as containers that run for at most on the order of 5 min per request. Providers pre-load CFs with runtimes including Java, JavaScript/NodeJS, Python, and C#. They provide an SDK for each runtime in which programmers develop code. The programmer registers the CF code with the provider's CF system and uploads it there. The interface from the outside world to a CF takes the form of a provider-specific API. We write a custom load balancer that interfaces to this API. It marshals application jobs and requests to the CFs so the application does not need to interface to the Cloud provider's CF API. The load balancer interface is modular and can be replaced with implementations for additional Cloud providers.

CFs typically have their own runtimes but run on a standard underlying OS. We can execute any application on them that can be compiled to a user-level binary. To execute pre-compiled code on a CF, we create a shell in a provider-supported CF runtime. The shell interfaces to the underlying operating system to execute the applications. We run the same applications on CFs as on VMs. We add an API to the shell that the controller uses to instruct the shell to start, stop, pause, or resume the applications as needed.

CFs are typically more resource-limited than VMs, but industry trends are closing the gap. For example, AWS recently [3] doubled the available CF memory from 1.5 GB to 3 GB. We suspect this trend will continue, so CFs will increasingly be able to run heavier workloads. Current CF resources are sufficient for many applications, such as those we evaluate.

With these observations, we use CFs to temporarily absorb increases in offered workload as VMs come online. When the controller determines that new VMs need to be launched to increase total system memory, it launches a VM paired with as many CFs as are necessary to provide aggregate memory equal to that of the VM. As the VM slowly comes online, we route requests to the CFs, which immediately begin processing them. When the VM is online, we terminate the CFs and shift the workload to the VM. We lose the state that has accumulated in a CF when we shift to the VM; however, since the CF is online for a short time, that state is minimal. For example, in our batch processor, no state is required between requests. In a caching application, the cache that accumulated in the CF will be lost, but it will quickly repopulate in the VM. In this case, there will be a temporary performance penalty as the cache repopulates; however, it is still a benefit over not using the CFs as temporary memory.

4.4.2.1 Executables
Although Cloud providers pre-load CFs with runtimes, we regard the CF as a shell that interfaces to the underlying OS. We execute applications on CFs that do not depend on the runtimes but instead run on the underlying OS. Application code may be written in any language that compiles to a user-level executable [22, 26, 86]. We write the shell in Java on AWS and in C# on Azure.

4.4.2.2 CF API
Providers allow the requester to pass query parameters to CFs. We create an API to communicate with our CF and embed it in the parameters. We use the API internally; the applications do not need to be coded to this API.
As shown in Table 4.1, the API includes commands to (i) start and stop the applications on the CF at system initialization and shutdown, respectively; (ii) pause and resume the applications during scale out operations; (iii) keep the CF warmed in the provider's CF cache to minimize cold starts; (iv) retrieve metrics from the CF; (v) give the CF time to execute on the provider's platform with a long-running query; and (vi) proxy and marshal jobs and web requests and responses from the load balancer to the application running on the CF.

4.4.2.3 Controller Interaction
When we start the controller, the Function Manager creates a pool of CFs. For each CF in the pool, the Function Manager sends a start command to load the applications into the CF. It then sends a pause command to suspend them until they are needed for a scale out operation. The Function Manager periodically sends keep alive commands to the CFs to keep them warm. When the controller starts a scale out operation, the Function Manager sends the CF a resume command to resume execution of the application. As long as the scale out is in progress, the Function Manager continually sends long-running status commands to the CF. The shell performs two actions when it receives this command: first, it retrieves metrics from the OS; second, it gives our CF, and consequently the applications, time to execute. When the controller determines that a scale out is complete, the Function Manager sends the CF a pause command to suspend the applications and discontinues sending long-running status commands. When we terminate the controller, the Function Manager sends the stop command to the CF to perform an orderly shutdown of the application.

4.4.2.4 CF Caching
We attempt to keep CFs in the provider's CF cache by sending requests to them. However, we do not require the CF to be cached; the cache state of a CF is provider implementation-dependent. Even a CF cold start is much faster than launching a VM, so we gain a benefit regardless of whether the CF is in the cache.

4.5 Miss Ratio Curves
A miss ratio curve (MRC) shows the ratio of cache misses to the total number of memory cache references as a function of cache size. If we view the physical main memory of a VM as the cache for the disk-based swapping of virtual memory, we can use MRCs to determine the amount of main memory required by an application. We then use this information to determine the memory requirements of a scaled-up version of the system. The MRC will change over time as the patterns in the offered workload change; we recompute the MRC periodically and scale memory resources at run time.

While it is computationally expensive to construct an MRC from all memory references, we sample a small subset of spatially hashed memory references to efficiently construct an MRC with a high degree of accuracy. We further reduce the size of our memory trace by recording page accesses rather than individual memory locations. A page (4096 bytes in Linux) is the minimum amount of memory the OS allocates, so we lose no MRC accuracy by recording the page reference. We record a spatially hashed subset of memory references [88], using MurmurHash3 [6] to determine whether a memory reference should be recorded. We capture memory references from the QEMU hypervisor running on a live VM and pass them to our controller. The controller divides the memory reference by the page size before computing the MRC.
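To illustrate the sampling decision, here is a minimal sketch. The names are illustrative, and fmix64 (the 64-bit MurmurHash3 finalizer) stands in for the full MurmurHash3 [6] our system uses; the cutoff arithmetic is likewise an assumption of this sketch.

    /** Sketch of spatially hashed sampling in the spirit of SHARDS [88]:
     *  a reference is recorded iff a hash of its page number falls below
     *  a cutoff corresponding to sampling rate R. */
    class SpatialSamplerSketch {
        static final double R = 0.001;                       // sampling rate from Section 4.3.3
        static final long CUTOFF = (long) (R * (1L << 62));  // hash-space cutoff for rate R

        static long fmix64(long k) {                         // MurmurHash3 64-bit finalizer
            k ^= k >>> 33; k *= 0xff51afd7ed558ccdL;
            k ^= k >>> 33; k *= 0xc4ceb9fe1a85ec53L;
            k ^= k >>> 33;
            return k;
        }

        /** Decide whether to record a guest virtual-memory reference. */
        static boolean record(long virtualAddress) {
            long page = virtualAddress >>> 12;               // 4096-byte pages: drop the offset
            return (fmix64(page) >>> 2) < CUTOFF;            // keeps ~R of all distinct pages
        }
    }

Because the hash is a function of the page alone, every access to a sampled page is recorded, which is what preserves reuse distances for the MRC.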
The output of the MRC generator is a page count scaled down by the sampling rate; the controller scales the results back up by the sampling rate to obtain the memory required by the VM. We distribute a fraction of the workload to the VM running QEMU so that we need only a single extra VM to compute the MRC. The controller then multiplies the MRC results by the fraction of jobs not sent to the QEMU VM to scale the MRC up to the full system workload.

Since each process has its own virtual memory space, we include the process identifier (PID) to create an OS-wide unique memory reference. As shown in Figure 4.6, we combine the 32-bit PID with the 52-bit page number to create an 84-bit memory reference; we discard the lower 12-bit page offset of the virtual memory address. We use an efficient open-source application, PARDA [71], to compute the MRC. The reference implementation of PARDA [70] has a 76-bit address space; we modify the code to extend it to 84 bits, allowing us to include the 32-bit process identifier in the memory reference.

Modern CPUs do not provide hardware support that allows software to capture accessed memory addresses, so we devise a method to capture the memory accesses necessary to compute an MRC on a running system. At run time, we duplicate and route a fraction of the offered workload to a single VM running the QEMU hypervisor [9]; we do not run QEMU on the other VMs in the system. We execute the application being analyzed on the guest OS of the hypervisor. We modify QEMU to sample a subset of memory references of the guest OS and stream them over a network to the controller running on a separate machine. The controller generates the MRCs to determine how much memory the system requires. The controller scales the VMs first vertically and then horizontally to meet the memory requirements, and it concurrently runs a probing algorithm to fine-tune the memory to the offered workload.

We run QEMU with an emulated memory management unit (MMU), which we modify to capture the spatially hashed subset of memory references. In this mode, the guest OS runs slowly compared to other hosts in the network. Consequently, we discard any results it generates and use it only to take the measurements that allow us to compute the MRC. Figure 4.7 illustrates how we capture the guest OS memory references. Each process running on the guest OS has its own virtual memory address space. The emulated MMU in the host intercepts virtual memory address accesses from the guest OS and maps each virtual address to an emulated physical address. If a single process dominates memory references, we do not need to know which process is accessing memory; however, if multiple processes are memory-intensive, then knowledge of the running process identifier is necessary. The host needs to know which process in the guest is accessing memory to form a unique memory reference for the MRC trace. If it does not know, it cannot distinguish memory references in one process from those in another, and we cannot create an accurate view of the required memory or the MRC. We therefore modify the guest OS (Linux kernel 4.4.0-116): immediately after a context switch, we record the process identifier (PID) that is scheduled to run on the core for which the context switch occurs. In the host MMU, we read the PID when a memory reference occurs. In this way, we obtain a guest OS-wide unique memory reference consisting of a PID and virtual memory address pair.
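The following sketch shows the 84-bit packing of Figure 4.6. The names are illustrative, and BigInteger merely stands in for the widened 84-bit address space that PARDA handles natively after our modification.

    import java.math.BigInteger;

    /** Sketch of the 84-bit reference of Figure 4.6: the 32-bit PID
     *  concatenated with the 52-bit virtual page number (the 12-bit
     *  page offset is discarded). */
    class ReferenceKeySketch {
        static BigInteger pack(int pid, long virtualAddress) {
            long page = virtualAddress >>> 12;                 // 52-bit page number
            return BigInteger.valueOf(pid & 0xFFFFFFFFL)       // 32-bit PID, kept unsigned
                             .shiftLeft(52)                    // make room for the page bits
                             .or(BigInteger.valueOf(page));    // PID:page, 84 bits total
        }
    }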
4.6 Scaling Method
We create an algorithm to evaluate the effectiveness of our scaling policy. Each iteration of the algorithm first checks whether the controller has updated the MRC or the queuing delay has changed. Next, it establishes an estimate of the required system memory. It then probes the system to fine-tune the memory to the offered workload.

4.6.1 MRC Threshold
A key challenge in finding the required system memory size is determining an appropriate point on the MRC at which to operate. Choosing a value too high is inefficient, and memory remains unused; choosing a value too low leads to poor performance as processes run out of memory. Existing techniques require specification of a target latency and use the MRC to determine how much memory to add to meet that target. We eliminate the need to pre-specify a target latency by choosing a natural threshold in our model of the MRC.

We model the curve as a hyperbola, which tends to remove cliffs from the curve, as shown in Figure 4.2. We use the Nelder-Mead simplex procedure [75, 83] to fit our data to the hyperbola model. The question remains: at what level on the curve should we run our system? We notice that MRCs tend to have a very long tail. Setting the threshold too far along the tail is an inefficient use of memory and provides little to no benefit to the application. We therefore look for the point at which the curve flattens. The LR is the line through the focus of the hyperbola, perpendicular to the transverse axis. We choose the intersection of this line with the hyperbola as the natural threshold; this point is the beginning of the flat part of the curve [32]. Our intuition is that this point indicates sufficient memory in the system: more memory adds little value, while less memory causes more page faults.

4.6.2 Probing Algorithm
We draw inspiration from TCP, which probes a network for available bandwidth. In a similar manner, we probe the system for the required memory. In this discussion, we use "thread" to refer to the set of processes and resources required to handle a request. Table 4.2 defines the variables in the algorithm.

We consider how two different types of applications use MRCs to scale memory: a caching cluster and a batch processing farm that processes high-resolution images. The two applications are distinct in their processing and memory requirements. We use the same probing algorithm for both applications but compute the initial required memory of a scale out event differently, as shown in Algorithm 5. The caching cluster accumulates state over time and consumes very little CPU. The batch processor requires significant memory and CPU for a short time while processing an image, but does not maintain state between jobs. The amount of memory in the caching cluster is determined by the accumulated state, distributed among all nodes in the cluster. In this case, the MRC threshold h_{i+1} multiplied by the fraction of requests not sent to the QEMU machine that computes the MRC (p_{i+1} = f) is the required memory in the cluster: m_tmp = p_{i+1} h_{i+1} = f h_{i+1}. On the other hand, the memory in the batch processor is determined by the number of threads in the system. The number of threads required to match the throughput to the offered workload is the arrival rate divided by the departure rate per thread: p_{i+1} = ⌈λ/µ⌉. The required memory is then the MRC threshold multiplied by the number of threads: m_tmp = p_{i+1} h_{i+1} = ⌈λ/µ⌉ h_{i+1}.
We use m_tmp if we detect a change in the required number of threads or if the MRC threshold has changed. We use it to establish the starting point of the next scale out event; however, we do not permit a starting point less than the currently allocated memory size. A scale out event begins when the queuing delay increases and the previous event was a scale in event. Algorithm 6 shows the probing algorithm.

Probing helps right-size memory in two situations. First, it helps when a threshold underestimates or overestimates memory requirements. Second, if the offered workload suddenly shifts, the MRC estimate will be incorrect until enough new memory samples are collected under the new load. We use run-time measurements to estimate the average amount of memory a request requires. If a VM does not have enough free memory to process a request, or if the VM is depleted of main memory, we set a low memory flag (l) for the VM. We track this condition for each VM and use it as a control signal to the probing algorithm. The algorithm probes for more memory if the queue is growing (i.e., there are not enough resources to meet demand) or if any VM is in a low memory state. The algorithm conservatively probes for less memory if the queue is empty and no VMs are low on memory; in this case, there is an excess of resources, and we can reduce the memory in the system.

There is a high performance penalty when too little memory is allocated to an application. We are therefore aggressive when probing for more memory and conservative when probing for less. We use multiplicative increase when probing for more memory, just as TCP does during slow start when probing for more bandwidth. However, we use a conservative additive decrease when probing for less memory, unlike TCP, since fairness considerations do not apply here. We probe lower by the granularity of VM memory size (s) available from the Cloud provider.

At the time of this writing, Cloud providers do not support changing the amount of memory allocated to a VM at run time. Consequently, to change the amount of memory allocated to an application, the controller selects a set of VMs to be stopped and creates a set of new VMs to be launched to meet the computed memory requirement. It first launches the new set of VMs. It then redistributes application state away from the set of VMs that will be stopped, if necessary: for the caching application, this requires a reshard operation followed by a rebalance operation, while no redistribution is necessary for the batch application. Finally, the controller terminates the set of VMs to be stopped when the state redistribution is complete.

Algorithm 5: Initial Memory Calculation

    p_{i+1} = f          // caching application
    p_{i+1} = ⌈λ/µ⌉      // batch processing application
    m_tmp = p_{i+1} h_{i+1}
    if (p_{i+1} ≠ p_i        // required threads changed
        or h_{i+1} ≠ h_i) {  // MRC threshold changed
        if (δ > 0 and last operation was scale in) {
            // establish new starting point for scale out
            m_{i+1} = max(m_tmp, m_i);
        }
    }

Algorithm 6: Probing Algorithm

    if (δ > 0 or l = true) {
        m_{i+1} = 2 m_i;     // multiplicative increase
    } else if (δ = 0 and l = false) {
        m_{i+1} = m_i − s;   // additive decrease
    }
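Restated as runnable code under the variable names of Table 4.2, Algorithms 5 and 6 look roughly like the following sketch. The scale-in bookkeeping and the memory floor are small additions of this sketch, not part of the algorithms as stated.

    /** Sketch of Algorithms 5 and 6 using the variable names of Table 4.2. */
    class MemoryProberSketch {
        double m;                       // m_i: currently allocated memory (GB)
        double p = 0, h = 0;            // p_i, h_i: existing thread count and MRC threshold
        final double s;                 // s: provider's minimum VM memory increment (GB)
        boolean lastWasScaleIn = true;  // bookkeeping added for this sketch

        MemoryProberSketch(double m0, double s) { this.m = m0; this.s = s; }

        /** Algorithm 5: starting point for the next scale out event. */
        void initialMemory(double pNext, double hNext, double delta) {
            double mTmp = pNext * hNext;  // f*h for caching, ceil(lambda/mu)*h for batch
            if ((pNext != p || hNext != h) && delta > 0 && lastWasScaleIn) {
                m = Math.max(mTmp, m);    // never start below the current allocation
            }
            p = pNext; h = hNext;
        }

        /** Algorithm 6: TCP-like probing each control period. */
        void probe(double delta, boolean lowMemory) {
            if (delta > 0 || lowMemory) {
                m = 2 * m;                  // multiplicative increase
                lastWasScaleIn = false;
            } else if (delta == 0 && !lowMemory) {
                m = Math.max(m - s, s);     // additive decrease (floor added for the sketch)
                lastWasScaleIn = true;
            }
        }
    }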
4.7 Implementation
We implement both a batch processing and a caching application on two major Cloud providers, namely AWS and Azure. The batch processor follows a publisher/subscriber (pub/sub) model, and the caching application follows a request/response (req/resp) model. We write custom producer and consumer applications, in Java, and run them on separate computing devices as shown in Figure 4.4. Table 4.3 shows the number of lines of code in each component, reported by the Linux tool cloc [17]. We make modifications to QEMU to extract the PID from the guest OS. As the application executes on the guest OS, we record memory references consisting of PID and virtual memory address pairs and stream them to the controller. We modify the Linux kernel to obtain the PID from the scheduler and export it to the QEMU host. The two applications are similar in structure; details follow.

On AWS, we use Elastic Compute Cloud (EC2) for VMs and Lambda for CFs. On Azure, we use the Virtual Machines and Functions offerings. The producer reads and recreates a pre-recorded network trace. It sends the job or request (d in Figure 4.4) to the job queue or load balancer.

We encounter a number of differences between AWS and Azure. We use Linux VMs on both platforms. At the time of this writing, there is no choice of operating system for CFs: AWS uses Linux, and Azure uses Windows. The security restrictions of each platform make the greatest difference in our CF shell implementations; we encounter fewer restrictions on AWS. We make the following observations and choices when implementing the applications on AWS and Azure CFs and VMs.

4.7.1 CPU Utilization
We use vmstat to obtain CPU utilization on CFs on AWS, but we cannot obtain it from Azure CFs: Azure CFs run under Internet Information Services (IIS) [64] with a restricted account that cannot access CPU statistics.

4.7.2 Subnet
On AWS, we specify that our CFs reside on the same subnet as our VMs. Our CFs connect to VMs over our AWS virtual private network (VPN). Over these connections, we measure CF-to-VM round trip times (RTTs) of 5 ms to 25 ms with a mean of about 18 ms. On Azure, we are not able to specify the subnet of the CF. Consequently, the CFs must connect to the external IP address of the VMs. Over these connections, RTTs vary from 5 ms to 50 ms with a mean of about 23 ms.

4.7.3 Hardware
Hardware differs between AWS and Azure. We choose the VMs shown in Table 4.4. The Cloud provider provisions CFs by the amount of memory we request; we request 1.5 GB CFs and determine the hardware specifications shown in the table. On AWS, we obtain this information from the Linux /proc/cpuinfo and /proc/meminfo files. On Azure, we use Windows Sysinternals PsInfo [65]. Although AWS limits the RAM available to our CF processes, /proc/meminfo shows 3.5 GB. Azure CF security prevents us from obtaining physical memory information.

4.7.4 Startup Time
Our CFs resume from a paused state in 20-40 sec on AWS and 40-50 sec on Azure. Our VMs launch and start our applications in 2-3 min on AWS and 4-5 min on Azure. This difference impacts scaling algorithm performance between the platforms. In general, Azure queuing delays grow more than AWS's before VMs are fully launched; consequently, Azure benefits more from CFs than AWS does.

4.7.5 Code Deployment
For AWS, we pre-load our CF executables onto AWS's Cloud storage platform, Simple Storage Service (S3). When we programmatically launch a CF, we instruct it to pull them from S3. After uploading the executables to S3, the deployment is completely automated. We automate Azure deployment with GitHub (an Internet-based source code repository) continuous integration [31]. We upload our Azure CF executables to GitHub, and when we programmatically create the CF, we instruct it to pull them from GitHub.
4.7.6 Web Server Cache - Redis
We examine the behavior of a caching application by scaling Redis key-value store servers. We build a Redis cluster controller to automatically add, remove, reshard, and rebalance nodes in a Redis cluster as we scale VMs. Resharding and rebalancing are based on the reference implementation [52]. We run a YCSB workload on the cluster. The producers are web servers that query the Redis cluster through the load balancer. The consumers are the Redis servers. The load balancer proxies the requests to the Redis servers running on the VMs and CFs. For our queuing metrics, we monitor the TCP backlog and receive queues in the load balancer. We set the TCP backlog and timeouts high enough in the load balancer that we do not drop any incoming requests, so that we can observe scaling algorithm behavior without resorting to tail drop.

4.7.7 Batch Processor - Image Watermarking
As photo and video resolutions continue to increase, memory becomes a significant factor in the ability to transcode them. Granted, it takes more powerful CPUs to process higher resolutions, but increasing the available memory makes a greater difference to throughput than increasing CPU performance. Further, CPUs are often at maximum utilization while transcoding, and a higher performance CPU may not be available (we use compute-optimized classes of VMs on AWS and Azure). We demonstrate with an experiment in which we watermark high-resolution 16K (15360x8640) images with the open-source transcoder ffmpeg [84]. The images at 24-bit color depth consume 380 MB when uncompressed. We run 4 simultaneous instances of ffmpeg on a compute-optimized AWS EC2 C5 VM; each instance consumes just over 1 GB. We use the same VM in each experiment and limit the number of cores available to the OS as necessary. Table 4.5 shows the results. Doubling the cores alone has a negligible effect on throughput because most of the time is spent servicing page faults. Doubling the memory improves throughput by 54%; doubling both cores and memory improves it by 74%. The greatest improvement comes from the increase in memory.

In the batch processor application, we use ffmpeg to apply a watermark to high-resolution images. We write a Java wrapper around Apache ActiveMQ [4] for our job queue. The producers (publishers) send the job data to the work queue. The transcoder consumers (subscribers) pull jobs from the queue on a first-come-first-served basis.

4.8 Evaluation
We evaluate two natural thresholds on the MRC: the knee computed from the raw MRC data and the LR intersection of the hyperbola model of the MRC. We run a YCSB workload on the caching application and a trace from a production system on the image processing application. We launch each experiment with a minimum configuration consisting of a single consumer VM with 2 cores and 4 GB RAM and a QEMU hypervisor VM with 4 cores and 8 GB RAM. We first establish a baseline of the performance of each application; in the baseline, we use neither the probing algorithm (Algorithm 6) nor CFs. We then add the probing algorithm, and finally CFs. Figures 4.8 and 4.9 show caching application plots for AWS and Azure, respectively. Figure 4.10 shows the production trace on which we evaluate the batch application. Figures 4.11 and 4.12 show batch application plots for AWS and Azure, respectively.

4.8.1 Metrics
For each evaluation, we report the mean metrics shown in Tables 4.6 and 4.7 for the caching and batch applications, respectively.
The metrics are (i) queuing delay, (ii) amount of memory allocated, (iii) millions of page faults per core, and (iv) throughput in jobs/sec.

4.8.2 Caching Application
It takes longer to deploy new instances and rebalance state on Azure because of longer launch times. Further, the AWS C5 family has a faster network (up to 25 Gbps vs. "extremely high" on Azure) and more storage bandwidth (up to 14 Gbps vs. 1.1 Gbps on Azure). We therefore run the experiments for 6 hours on AWS and 12 hours on Azure.

To establish a baseline for the caching application, we disable both probing and CFs. We make requests at a high rate to determine the maximum throughput. We generate the MRC every 30 min. The MRC threshold converges after one or two scaling events. For the baseline, Table 4.6 shows the memory size to which the system converges and the page faults per core and throughput at this memory size. The LR shows about a 1.5× speedup over the knee on AWS and 2.5× on Azure.

We run probing and CFs at a constant request rate. Probing begins to scale out before the controller computes the MRC, since it takes time to collect memory references. The system does not scale in enough before the second and subsequent scale out events for the MRC calculation to be evident. Probing computes a higher memory requirement than the MRC, and Algorithm 5 gives precedence to the higher value. Consequently, probing results are about the same for both the LR and knee thresholds. The slower VM launch times on Azure can cause much higher queuing delay compared to AWS. We see a decrease in queuing delay between probing without CFs and probing with CFs, on the order of 25% on AWS and 20% on Azure. The difference in queuing delay with AWS CFs comes from variation in start-up time. The probing and CF plots diverge as time progresses because of the faster effective launch time of the CFs; this is also the source of the lower queuing delay. These charts show that the LR is a very good estimate of the required memory, since probing oscillates about this value. Throughput charts (Figures 4.8e, 4.8f, 4.9e, and 4.9f) are also similar between the thresholds because probing takes precedence over MRCs in this evaluation. The throughput follows VM and CF allocation, with some variation as VMs and CFs launch.

To observe MRC behavior when it takes precedence over probing, we run a second caching experiment. We create a step function of request rate in which the first 2.5 hours run at 10 req/sec and the next 30 min at 500 req/sec; we produce no messages thereafter. Running at a low request rate for a long period of time allows the system to compute about the same MRC as in the first experiment while keeping the queue empty, avoiding a probing event. The scale out event at the step uses the MRC threshold. Figures 4.8g and 4.8h show the queuing delay on AWS. We omit memory and throughput charts, as they follow the step function, and we omit Azure charts because they are similar to AWS. Table 4.6 shows the mean results over the time from 2.5 to 3.5 hours. LR improves queuing delay by about 2.5× without CFs and up to 4× with CFs compared to the knee. CFs improve queuing delay by about 1.5× over VMs alone.

4.8.3 Batch Application
We evaluate the batch application on the trace shown in Figure 4.10, taken from a production system. The trace has a sudden increase in traffic near the 1.5-hour mark. Table 4.7 shows the mean values across the entire duration. Each experiment begins with minimal resources.
4.8.3 Batch Application

We evaluate the batch application on the trace shown in Figure 4.10, taken from a production system. The trace has a sudden increase in traffic near the 1.5-hour mark. Table 4.7 shows the mean values across the entire duration. Each experiment begins with minimal resources. Additionally, it takes time to collect memory references to compute an MRC. We compute MRCs every 15 min in this experiment. The first spike in the figures is caused by probing without the benefit of the MRC, since it has not yet been computed. The benefit of the LR and CF techniques is in collapsing the spikes in queuing delay. Azure VMs appear to have higher computing performance than AWS, resulting in lower latency when transcoding images in this application; however, both platforms exhibit the same general trends.

The baseline evaluation using the knee threshold is not able to control queuing delay, as shown in Figures 4.11a and 4.12a; the delay grows without bound as the offered workload increases. Using the LR threshold does control queuing delay, as shown in Figures 4.11b and 4.12b. The evaluations with probing show a TCP-like sawtooth pattern in the queuing delay. With little allocated memory, the queuing delay increases. The probing algorithm reacts by allocating more resources, as shown in Figures 4.11c, 4.11d, 4.12c, and 4.12d, until the queuing delay begins to decrease. When the queuing delay has reached zero, the probing algorithm decreases resources until the delay begins to increase again, and the process repeats.

After initial MRC generation, the baseline computes a constant memory requirement (16 GB for the knee and 28 GB for the LR), and we omit it from the memory charts for clarity. There is an increase in resources at the 1-hour mark. Without CFs, the LR computes a higher memory requirement (about 60 GB) than the knee (about 48 GB) at this point. As a result, the LR prevents the queuing delay from growing as quickly, and we see a much smaller spike in the queuing delay compared to the knee. With probing, AWS computes higher memory requirements than Azure. It requires more resources than Azure for this application because of Azure's higher CPU performance.

Figures 4.11e, 4.11f, 4.12e, and 4.12f show throughput. The baseline throughput for the knee on AWS is near constant because it is under-provisioned and cannot control queuing delay. Azure shows the same trend near the end of the experiment, when the offered load is high and it cannot keep up. The baseline for the LR follows the offered load closely because it is sufficiently provisioned for the load. In the baseline, throughput with the LR is about 1.5× that of the knee. With probing, the LR reduces queuing delay by half on AWS and by a third on Azure. It also shows about a 1.5× decrease in page faults/core. Adding probing improves the knee results significantly and allows it to control queuing delay. On average, it computes about the same amount of memory as the LR, which shows that even with a threshold that is too low, probing for memory allows the system to perform well. With the decrease in effective launch time, adding CFs further improves the knee threshold; however, we still see about a 15% decrease in queuing delay with the LR on AWS and 30% on Azure.

4.9 Conclusion

Sizing an application's memory correctly is critical to its performance. We created a new approach to application scaling that does not require specification of any thresholds. We modified the MMU emulator of the QEMU hypervisor to capture memory references from a running VM to generate the miss ratio curve (MRC) of an application. We modeled the MRC as a hyperbola to allow us to choose a natural threshold (operating point) on the curve. We chose the latus rectum of the hyperbola as the natural threshold.
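For reference, the relevant geometry in standard form: writing the fitted hyperbola in its own coordinate frame with semi-axes a and b (the rotation and translation of the fit onto the MRC axes are omitted here; this is a sketch of the standard relations, not of our fitting procedure),

    \[
      \frac{x^2}{a^2} - \frac{y^2}{b^2} = 1, \qquad
      c = \sqrt{a^2 + b^2}, \qquad
      \ell = \frac{b^2}{a},
    \]

where c locates the focus on the transverse axis and \ell is the semi-latus rectum, half the chord through the focus perpendicular to the transverse axis. Mapping the latus rectum point back to the original axes yields the memory size used as the LR threshold (cf. Figure 4.2).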
We used the threshold as a starting point for finding an appropriate amount of memory for the system. We developed a probing algorithm to fine-tune the allocated system memory. We evaluated the method for two classes of applications, batch processors and web server caches. The LR threshold showed a 1.5× throughput speedup in both the caching and batch applications. With probing, it reduced queuing delay by 2.5× in the caching application and by half in the batch application relative to the knee in our evaluation.

Table 4.1: Function API

  Command     Description
  ----------  ------------------------------------------------------------------
  start       Launches the applications.
  stop        Terminates the applications.
  pause       Suspends the applications.
  resume      Resumes the applications.
  keep alive  Performs no action. This is effectively a no-op that keeps the CF
              warmed on the provider's system to minimize cold starts.
  status      Long-running request to retrieve system metrics and to give the CF
              time to execute on the provider's system.
  proxy       Proxies jobs and web requests/responses to/from the applications.
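A minimal sketch of a CF entry point that dispatches these commands follows, using AWS Lambda's Java handler interface; the command field name, the "keep_alive" spelling, and the application hooks are illustrative assumptions rather than our exact implementation.

    import java.util.Map;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    // Dispatches the commands of Table 4.1 inside the function shell.
    public class FunctionShell
            implements RequestHandler<Map<String, String>, String> {
        @Override
        public String handleRequest(Map<String, String> event, Context ctx) {
            switch (event.getOrDefault("command", "")) {
                case "start":      return control("start");   // launch applications
                case "stop":       return control("stop");    // terminate applications
                case "pause":      return control("pause");   // suspend applications
                case "resume":     return control("resume");  // resume applications
                case "keep_alive": return "ok";               // no-op; keeps the CF warm
                case "status":     return metrics();          // long-running metrics request
                case "proxy":      return proxy(event);       // forward job/web request
                default:           return "unknown command";
            }
        }
        // Hypothetical hooks into the managed applications.
        private String control(String action)        { return "ok"; }
        private String metrics()                     { return "{}"; }
        private String proxy(Map<String, String> e)  { return "proxied"; }
    }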
Table 4.2: Algorithm Variables

  Variable     Definition
  -----------  ---------------------------------------------------
  p_i          existing number of threads
  p_{i+1}      new number of threads
  m_i          existing memory in system
  m_{i+1}      new memory in system
  f            fraction of requests not sent to MRC machine
  h_i          existing threshold determined by MRC
  h_{i+1}      new threshold determined by MRC
  x            measured total departures/sec
  λ            measured arrival rate
  µ = x/p_i    measured departure rate/thread
  δ            change in queuing delay
  l            true if memory of any VM is low, false if not
  s            minimum VM physical memory increment size
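To make the variables' roles concrete, consider one simplified scaling step below; the λ/µ sizing rule and the memory update are illustrative simplifications, not a restatement of Algorithms 5 and 6.

    // Illustrative use of the Table 4.2 variables in one scaling step.
    public class ScalingStep {
        // p_{i+1}: enough threads that the total service rate covers the
        // measured arrival rate lambda, given mu = x / p_i per thread.
        static int nextThreadCount(double lambda, double x, int p) {
            double mu = x / p;  // measured departure rate per thread
            return (int) Math.ceil(lambda / mu);
        }

        // m_{i+1}: grow memory toward the new MRC threshold h_{i+1}; if any
        // VM reports low memory (l), add at least one physical increment s.
        static double nextMemory(double m, double h, boolean l, double s) {
            double target = Math.max(m, h);
            return l ? Math.max(target, m + s) : target;
        }
    }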
Table 4.3: Lines of Code

  Component             Number of Lines
  --------------------  ---------------
  Controller                     21,543
  Consumer                        2,688
  Producer                        1,201
  Load Balancer                   1,668
  AWS Shell                       2,532
  Azure Shell                     2,167
  Common Libraries                7,433
  QEMU Modifications              2,143
  Kernel Modifications              312
  Redis Controller                8,354
  Total                          50,041

Table 4.4: Hardware

  Provider  Application     Machine Type      CPU Type                         CPU Clock (GHz)  RAM (GB)  Cores
  --------  --------------  ----------------  -------------------------------  ---------------  --------  -----
  AWS       Shell/Consumer  CF, Lambda        Intel(R) Xeon(R) E5-2666 v3      2.9              3.5       2
  AWS       Job Queue       VM, m3.medium     Intel(R) Xeon(R) E5-2670 v2      2.5              3.75      1
  AWS       Consumer        VM, c5 family     Intel(R) Xeon(R) Platinum 8124M  3.0              4-32      2-16
  AWS       Controller      VM, m3.large      Intel(R) Xeon(R) E5-2670 v2      2.5              7.5       2
  Azure     Shell/Consumer  CF                Intel(R) Xeon(R) E5-2660         2.2              N/A       1
  Azure     Job Queue       VM, DS v3         Intel(R) Xeon(R) E5-2673 v3      2.3              8.0       2
  Azure     Consumer        VM, FS v2 family  Intel(R) Xeon(R) Platinum 8168   2.7              4-32      2-16
  Azure     Controller      VM, DS v3         Intel(R) Xeon(R) E5-2673 v3      2.3              8.0       2

Table 4.5: Image Watermark Speedup

  Cores  RAM (GB)  Total Time (min:sec)  Time Savings (min:sec)
  -----  --------  --------------------  ----------------------
  2      4         5:38                  0:00
  4      4         5:30                  0:08
  2      8         2:36                  3:02
  4      8         1:27                  4:11

Table 4.6: Caching Application Results

  Provider  Experiment        Delay (Sec)  Memory (GB)  Faults (M/Core)  Throughput (Reqs/Sec)
  --------  ----------------  -----------  -----------  ---------------  ---------------------
  AWS       Baseline, LR      N/A          16.00        0.62             495.64
  AWS       Baseline, Knee    N/A          8.00         0.22             336.22
  AWS       Probe, LR         118.57       19.49        1.20             384.62
  AWS       Probe, Knee       118.20       18.92        0.97             385.18
  AWS       CFs, LR           85.51        19.72        0.79             386.51
  AWS       CFs, Knee         91.05        19.48        1.12             386.53
  AWS       Step Probe, LR    153.54       12.86        0.03             242.40
  AWS       Step Probe, Knee  378.06       11.90        0.03             242.57
  AWS       Step CFs, LR      83.35        12.79        0.03             242.45
  AWS       Step CFs, Knee    338.85       11.96        0.03             242.53
  Azure     Baseline, LR      N/A          16.00        1.35             401.23
  Azure     Baseline, Knee    N/A          8.00         0.72             147.09
  Azure     Probe, LR         1132.30      25.80        0.52             395.50
  Azure     Probe, Knee       1220.40      26.00        0.51             389.35
  Azure     CFs, LR           927.12       25.02        0.40             395.41
  Azure     CFs, Knee         1010.56      25.35        0.44             393.94
  Azure     Step Probe, LR    134.76       12.46        0.04             253.81
  Azure     Step Probe, Knee  345.67       11.50        0.05             253.80
  Azure     Step CFs, LR      120.52       12.72        0.05             274.26
  Azure     Step CFs, Knee    232.44       11.12        0.05             253.78

Table 4.7: Batch Application Results

  Provider  Experiment      Delay (Sec)  Memory (GB)  Faults (M/Core)  Throughput (Jobs/Sec)
  --------  --------------  -----------  -----------  ---------------  ---------------------
  AWS       Baseline, LR    124.97       26.86        5.09             14.00
  AWS       Baseline, Knee  1236.93      14.91        5.98             9.51
  AWS       Probe, LR       122.75       48.16        0.77             14.02
  AWS       Probe, Knee     255.84       47.59        1.71             13.07
  AWS       CFs, LR         22.09        31.28        2.29             13.80
  AWS       CFs, Knee       25.78        31.01        2.82             13.80
  Azure     Baseline, LR    45.58        26.82        2.52             13.85
  Azure     Baseline, Knee  1893.94      11.14        6.53             8.12
  Azure     Probe, LR       46.59        24.84        1.60             13.96
  Azure     Probe, Knee     142.26       24.27        2.34             13.92
  Azure     CFs, LR         40.08        21.68        1.63             13.98
  Azure     CFs, Knee       64.12        29.58        1.24             13.90

[Figure 4.1: Effects of Physical Memory Size. Panels: (a) Throughput (requests/second vs. physical memory, GB); (b) Swap Space Used (swap, GB vs. physical memory, GB).]

[Figure 4.2: MRC Model. The miss ratio curve and its hyperbola model on miss ratio vs. memory size axes, with the transverse axis, vertex, focus, latus rectum, long tail, and natural threshold marked.]

[Figure 4.3: Locations on MRCs. Panels: (a) Gentle Slope; (b) Sharp Slope Detail; (c) Sharp Slope. Each shows the MRC, the fitted model, and the vertex, focus, and LR threshold.]

[Figure 4.4: Architectural Model. The producer feeds a job queue or load balancer, which feeds the consumers; the controller observes and manages both paths.]

[Figure 4.5: System Components. The controller (scaling algorithm, VM manager, metrics manager, MRC manager, function manager, data recorder, log manager) coordinates the function shell, the application VMs with the MRC VM hypervisor, the load balancer, platform logging, the workload generator, and the monitored applications, VMs, functions, and job queue.]

[Figure 4.6: MRC Memory Reference. The 12-bit offset of each memory address is discarded; the 52-bit page number and the 32-bit process ID form an 84-bit reference.]

[Figure 4.7: Capturing Memory References. The emulated MMU and kernel scheduler produce a virtual memory reference trace stream, tagged with process IDs, which the hypervisor passes to the MRC generator in the controller.]
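As a concrete reading of Figure 4.6, the following sketch packs one reference, assuming 4 KB pages (hence the discarded 12-bit offset); only the bit widths come from the figure, and the field ordering in the code is an assumption.

    import java.math.BigInteger;

    // Pack a memory reference per Figure 4.6: drop the 12-bit page offset,
    // keep the 52-bit page number, and append the 32-bit process ID to form
    // an 84-bit reference for the MRC generator.
    public class ReferenceEncoder {
        static BigInteger encode(long virtualAddress, int pid) {
            long page = virtualAddress >>> 12;  // discard the 12-bit offset
            return BigInteger.valueOf(page)
                    .shiftLeft(32)
                    .or(BigInteger.valueOf(pid & 0xFFFFFFFFL));
        }
    }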
[Figure 4.8: AWS Caching Application Plots. Panels: (a) Queuing Delay, Knee; (b) Queuing Delay, LR; (c) Memory, Knee; (d) Memory, LR; (e) Throughput, Knee; (f) Throughput, LR; (g) Step Queuing Delay, Knee; (h) Step Queuing Delay, LR. Baseline, probing, and CF runs over 6 hours.]

[Figure 4.9: Azure Caching Application Plots. Panels (a)-(f) as in Figure 4.8, over 12 hours.]

[Figure 4.10: Production Trace. Offered load in jobs/sec over 2 hours, with a sudden increase near the 1.5-hour mark.]

[Figure 4.11: AWS Batch Application Plots. Panels: (a) Queuing Delay, Knee; (b) Queuing Delay, LR; (c) Memory, Knee; (d) Memory, LR; (e) Throughput, Knee; (f) Throughput, LR. Baseline, probing, and CF runs over 2 hours.]

[Figure 4.12: Azure Batch Application Plots. Panels (a)-(f) as in Figure 4.11.]

CHAPTER 5

SUMMARY AND FUTURE WORK

5.1 Summary

Managing offered workload and pre-tuning scalable Cloud-based applications is a time-consuming task. Further, when operating conditions change, systems must be re-tuned to the new conditions. In this dissertation, we have shown that this pre-tuning is not necessary for a wide variety of application classes. We developed methods that eliminate the need to pre-specify the parameters and thresholds that make this task so time-consuming.

In the second chapter, we introduced the concept of natural thresholds, in which we look for operating points that occur naturally in the system model. Using this technique, we created an active queue management method that scales offered workload to the available processing resources. By choosing the knee of the delay-utilization curve as the natural threshold, our method maintains a low queuing delay while keeping link utilization high.

In the third chapter, we turned our attention to scaling Cloud-based resources without adjusting the offered workload. Today, Cloud providers require the tenant to specify thresholds to trigger scaling operations. We developed an event-driven method to scale resources without the need to specify these thresholds. Further, we developed a method to decrease the effective launch time of virtual machines during a scaling operation.

In the fourth chapter, we focused on scaling memory-intensive applications. We designed a method that does not require the tenant to specify any thresholds for scaling memory resources. We did this by finding an appropriate natural threshold on the memory miss ratio curve of an application at run time. We also used Cloud functions as temporary memory resources while virtual machines launch.

5.2 Future Research Directions

In this dissertation, we looked at methods to auto-scale network traffic and Cloud-based resources without the need to specify thresholds. Several lines of future work could follow from this dissertation.

5.2.1 Alternative Natural Thresholds

We determined natural thresholds for active queue management and for scaling memory-intensive applications. These may not be the only natural thresholds for these applications. Our work does not preclude systems from running at other operating points, nor does it preclude applying the concept to other types of applications. Future research would investigate other curves and operating points for the applications we examined and also for other application classes.

5.2.2 Reusing State of Cloud Functions

We used Cloud functions as temporary computing resources while launching virtual machines. Cloud functions do not maintain state over long periods of time. Consequently, their usefulness is limited to application classes that require no or only short-term state. It would be beneficial to determine methods to maintain this state over longer periods of time or to shift the state to a fully launched virtual machine. Shifting state between Cloud functions and virtual machines at run time would lead to a new programming paradigm for Cloud-based applications.

5.2.3 Control Systems Models

Many of the methods we explored may be modeled as classical control systems. Exploring these models may lead to new discoveries and observations.

5.2.4 Heterogeneous VM Families

Our implementation allowed for only a single family of virtual machine. Extending the framework to allow multiple families simultaneously would open avenues for exploring trade-offs between core-to-memory ratios and for using differing amounts of CPU capacity for different applications or for subsystems of an application.

5.2.5 I/O Bound Applications

This dissertation explored scaling of CPU-intensive and memory-intensive applications. Although our queue-based methods should encompass delays incurred by I/O, a deep exploration of I/O-bound applications may reveal more opportunities to use natural thresholds.

5.2.6 Combining Scaling Methods

Combining the queue-based methods of Chapter 3 with the miss ratio curve methods of Chapter 4 may lead to a better scaling strategy for applications that are both CPU- and memory-intensive. Exploring a combined method may benefit a wider range of application classes.

REFERENCES

[1] M. Alizadeh et al., "Dctcp: Efficient packet transport for the commoditized data center," in Proceedings of SIGCOMM, 2010.
[2] Amazon, "Amazon web services lambda," http://aws.amazon.com/lambda, accessed: 2017-06-17.
[3] ——, "Aws lambda doubles maximum memory capacity for lambda functions," http://aws.amazon.com/about-aws/whats-new/2017/11/aws-lambda-doubles-maximum-memory-capacity-for-lambda-functions, accessed: 2018-02-03.
[4] Apache, "Apache activemq," http://activemq.apache.org, accessed: 2017-06-17.
[5] ——, "Apache http server project," http://httpd.apache.org, accessed: 2018-01-29.
[6] A. Appleby, "Smhasher," https://github.com/aappleby/smhasher, accessed: 2018-8-18.
[7] S. Athuraliya, S. H. Low, V. H. Li, and Q. Yin, "Rem: Active queue management," IEEE Network, vol. 15, no. 3, pp. 48–53, 2001.
[8] N. Beckmann and D. Sanchez, "Talus: A simple way to remove cliffs in cache performance," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 64–75.
[9] F. Bellard, "Qemu, a fast and portable dynamic translator," in USENIX Annual Technical Conference, FREENIX Track, vol. 41, 2005, p. 46.
[10] B. Caldwell, Y. Im, S. Ha, R. Han, and E. Keller, "Fluidmem: Memory as a service for the datacenter," arXiv preprint arXiv:1707.07780, 2017.
[11] J. L. Carlson, Redis in Action. Manning Publications Co., 2013.
[12] D.-M. Chiu and R. Jain, "Analysis of the increase and decrease algorithms for congestion avoidance in computer networks," Computer Networks and ISDN Systems, vol. 17, no. 1, pp. 1–14, 1989.
[13] A. Cidon, A. Eisenman, M. Alizadeh, and S. Katti, "Cliffhanger: Scaling performance cliffs in web memory caches," in NSDI, 2016, pp. 379–392.
[14] Cisco, "Pie source code," ftp://ftpeng.cisco.com/pie/linux_code/pie_code/pie_code.tar, accessed: 2013-09-18.
[15] ——, "Voice over ip - per call bandwidth consumption," http://www.cisco.com/c/en/us/support/docs/voice/voice-quality/7934-bwidth-consume.html, accessed: 2015-07-14.
[16] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with ycsb," in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 143–154.
[17] A. Danial, "cloc: count lines of code," https://github.com/AlDanial/cloc, accessed: 2018-9-12.
[18] S. Das, D. Agrawal, and A. El Abbadi, "Elastras: An elastic, scalable, and self-managing transactional database for the cloud," ACM Transactions on Database Systems (TODS), vol. 38, no. 1, p. 5, 2013.
[19] S. Das, F. Li, V. R. Narasayya, and A. C. König, "Automated demand-driven resource scaling in relational database-as-a-service," in Proceedings of the 2016 International Conference on Management of Data. ACM, 2016, pp. 1923–1934.
[20] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41. ACM, 2007, pp. 205–220.
[21] P. J. Denning, "Effects of scheduling on file memory operations," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 1967, pp. 9–21.
[22] D. Ebbo, "Creating a java azure function," https://github.com/Azure/azure-webjobs-sdk-script/wiki/Creating-a-Java-Azure-Function, accessed: 2017-6-15.
[23] W.-C. Feng et al., "A self-configuring red gateway," in INFOCOM'99, vol. 3. IEEE, 1999, pp. 1320–1328.
[24] W.-C. Feng et al., "Blue: A new class of active queue management algorithms," Technical Report, UM CSE-TR-387-99, 1999.
[25] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, 1993.
[26] S. Fouladi et al., "Encoding, fast and slow: Low-latency video processing using thousands of tiny threads," in NSDI, 2017, pp. 363–376.
[27] A. Gandhi et al., "Autoscale: Dynamic, robust capacity management for multi-tier data centers," ACM TOCS, vol. 30, no. 4, p. 14, 2012.
[28] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, "Network requirements for resource disaggregation," in OSDI, 2016, pp. 249–264.
[29] N. Gautam, Analysis of Queues: Methods and Applications. CRC Press, 2012.
[30] J. Gettys and K. Nichols, "Bufferbloat: Dark buffers in the internet," Communications of the ACM, vol. 9, no. 11, pp. 57–65, 2011.
[31] GitHub, "Github," https://github.com, accessed: 2017-07-15.
[32] F. A. Gonzalez-Horta et al., "Mathematical model for the optimal utilization percentile in m/m/1 systems: A contribution about knees in performance curves," arXiv preprint arXiv:1106.2380, 2011.
[33] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin, "Efficient memory disaggregation with infiniswap," in NSDI, 2017, pp. 649–667.
[34] R. Han, L. Guo, M. M. Ghanem, and Y. Guo, "Lightweight resource scaling for cloud applications," in Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on. IEEE, 2012, pp. 644–651.
[35] S. Hemminger et al., "Network emulation with netem," in Linux Conf Au, 2005, pp. 18–23.
[36] S. Hendrickson et al., "Serverless computation with openlambda," in 8th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2016, 2016.
[37] C. V. Hollot et al., "On designing improved controllers for aqm routers supporting tcp flows," in INFOCOM 2001, vol. 3. IEEE, 2001.
[38] C. Hollot et al., "Analysis and design of controllers for aqm routers supporting tcp flows," IEEE Transactions on Automatic Control, 2002.
[39] J. Hong et al., "Active queue management algorithm considering queue and load states," Computer Communications, vol. 30, no. 4, pp. 886–892, 2007.
[40] A. Hornsby and N. Undén, "Getting started with aws lambda and the serverless cloud," https://www.slideshare.net/AmazonWebServices/aws-lambda-and-serverless-cloud-61712836, accessed: 2017-09-11.
[41] C.-W. Huang, C.-C. Shih, W.-H. Hu, B.-T. Lin, and C.-W. Cheng, "The improvement of auto-scaling mechanism for distributed database - a case study for mongodb," in Network Operations and Management Symposium (APNOMS), 2013 15th Asia-Pacific. IEEE, 2013, pp. 1–3.
[42] B. Hubert, T. Graf, G. Maxwell, R. van Mook, M. van Oosterhout, P. Schroeder, J. Spaans, and P. Larroy, "Linux advanced routing & traffic control," in Ottawa Linux Symposium, vol. 213, 2002.
[43] IBM, "Ibm bluemix openwhisk," http://www.ibm.com/cloud-computing/bluemix/openwhisk, accessed: 2017-06-17.
[44] MongoDB Inc., "Mongodb," http://www.mongodb.com, accessed: 2017-10-09.
[45] A. Jacobs, "The pathologies of big data," Communications of the ACM, vol. 52, no. 8, pp. 36–44, 2009.
[46] R. Jain, "A delay-based approach for congestion avoidance in interconnected heterogeneous computer networks," ACM SIGCOMM Computer Communication Review, vol. 19, no. 5, pp. 56–71, 1989.
[47] R. Jain, D.-M. Chiu, and W. R. Hawe, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System. Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[48] J. Jiang et al., "Optimal cloud resource auto-scaling for web applications," in Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013, pp. 58–65.
[49] E. Jonas et al., "Occupy the cloud: Distributed computing for the 99%," in 2017 Symposium on Cloud Computing. ACM, 2017, pp. 445–451.
[50] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Geiger: Monitoring the buffer cache in a virtual machine environment," ACM SIGARCH Computer Architecture News, vol. 34, no. 5, pp. 14–24, 2006.
[51] S. S. Kunniyur and R. Srikant, "An adaptive virtual queue (avq) algorithm for active queue management," IEEE/ACM Transactions on Networking, vol. 12, no. 2, pp. 286–299, 2004.
[52] Redis Labs, "Redis-trib," http://download.redis.io/redis-stable/src/redis-trib.rb, accessed: 2018-09-15.
[53] H. Lim et al., "Automated control in cloud computing: Challenges and opportunities," in Proceedings of the 1st Workshop on Automated Control for Datacenters and Clouds. ACM, 2009, pp. 13–18.
[54] ——, "Automated control for elastic storage," in Proceedings of the 7th International Conference on Autonomic Computing. ACM, 2010, pp. 1–10.
[55] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 267–278.
[56] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, "System-level implications of disaggregated memory," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on. IEEE, 2012, pp. 1–12.
[57] G. Linden, "Marissa mayer at web 2.0," http://glinden.blogspot.ca/2006/11/marissa-mayer-at-web-20.html, accessed: 2017-10-07.
[58] C. Long et al., "The yellow active queue management algorithm," Computer Networks, vol. 47, no. 4, pp. 525–550, 2005.
[59] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 49.
[60] ——, "A performance study on the vm startup time in the cloud," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012, pp. 423–430.
[61] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, no. 2, pp. 78–117, 1970.
[62] A. Mcgregor, K. Nichols, V. Jacobson, M. D. Taht, and E. Dumazet, "Codel source code for ns-3," codereview.appspot.com/6463048, accessed: 2014-03-04.
[63] Microsoft, "Microsoft azure functions," http://azure.microsoft.com/en-us/services/functions, accessed: 2017-06-17.
[64] ——, "Microsoft iis," https://www.iis.net, accessed: 2017-07-15.
[65] ——, "Windows sysinternals," https://technet.microsoft.com/en-us/sysinternals/bb842062.aspx, accessed: 2017-07-22.
[66] P. Nasirifard, "A serverless topic-based and content-based pub/sub broker," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Posters and Demos. ACM, 2017, pp. 23–24.
[67] H. Nguyen et al., "Agile: Elastic distributed resource scaling for infrastructure-as-a-service," in ICAC, vol. 13, 2013, pp. 69–82.
[68] K. Nichols and V. Jacobson, "Controlling queue delay," Communications of the ACM, vol. 55, no. 7, pp. 42–50, 2012.
[69] K. Nichols, V. Jacobson, M. D. Taht, and E. Dumazet, "Codel source code," http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/sched/sch_codel.c and http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/net/codel.h, accessed: 2013-03-19.
[70] Q. Niu, "Parda source code," https://bitbucket.org/niuqingpeng/file_parda, accessed: 2018-8-18.
[71] Q. Niu, J. Dinan, Q. Lu, and P. Sadayappan, "Parda: A fast parallel reuse distance analysis algorithm," in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012, pp. 1284–1294.
[72] J. Novak and S. K. Kasera, "Auto-tuning active queue management," in 2017 9th International Conference on Communication Systems and Networks (COMSNETS), June 2017.
[73] J. Novak, S. K. Kasera, and R. Stutsman, "Auto-scaling cloud-based memory-intensive applications," in 2019 11th International Conference on Communication Systems and Networks (COMSNETS), June 2019.
[74] nsnam, "The network simulator (ns-3)," https://www.nsnam.org, accessed: 2018-10-19.
[75] D. M. Olsson and L. S. Nelson, "The nelder-mead simplex procedure for function minimization," Technometrics, vol. 17, no. 1, pp. 45–51, 1975.
[76] R. Pan et al., "Pie: A lightweight control scheme to address the bufferbloat problem," ftp://ftpeng.cisco.com/pie/documents/pie_hpsr2013_final.pdf, accessed: 2013-09-18.
[77] K. Ramakrishnan, S. Floyd, and D. Black, "The addition of explicit congestion notification (ECN)," RFC 3168.
[78] N. Shalom, "Amazon found every 100ms of latency cost them 1% in sales," https://blog.gigaspaces.com/amazon-found-every-100ms-of-latency-cost-them-1-in-sales, accessed: 2017-10-07.
[79] P. Shivam, A. Demberel, P. Gunda, D. Irwin, L. Grit, A. Yumerefendi, S. Babu, and J. Chase, "Automated and on-demand provisioning of virtual machines for database applications," in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 2007, pp. 1079–1081.
[80] A. Singhvi et al., "Granular computing and network intensive applications: Friends or foes?" in Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017, pp. 157–163.
[81] J. Sun and M. Zukerman, "Raq: A robust active queue management scheme based on rate and queue length," Computer Communications, vol. 30, no. 8, pp. 1731–1741, 2007.
[82] G. Tesauro et al., "A hybrid reinforcement learning approach to autonomic resource allocation," in International Conference on Autonomic Computing (ICAC'06). IEEE, 2006, pp. 65–73.
[83] D. M. Thomas, "Michael thomas flanagan's java scientific library," https://www.ee.ucl.ac.uk/~mflanaga, accessed: 2018-8-16.
[84] S. Tomar, "Converting video formats with ffmpeg," Linux Journal, vol. 2006, no. 146, p. 10, 2006.
[85] B. Urgaonkar et al., "Dynamic provisioning of multi-tier internet applications," in Second International Conference on Autonomic Computing (ICAC'05). IEEE, 2005, pp. 217–228.
[86] T. Wagner, "Running arbitrary executables in aws lambda," https://aws.amazon.com/blogs/compute/running-executables-in-aws-lambda, accessed: 2017-3-15.
[87] C. Waldspurger, T. Saemundsson, I. Ahmad, and N. Park, "Cache modeling and optimization using miniature simulations," in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, 2017, pp. 487–498.
[88] C. A. Waldspurger, N. Park, A. T. Garthwaite, and I. Ahmad, "Efficient mrc construction with shards," in FAST, 2015, pp. 95–110.
[89] C. Wang, A. Gupta, and B. Urgaonkar, "Fine-grained resource scaling in a public cloud: A tenant's perspective," in Cloud Computing (CLOUD), 2016 IEEE 9th International Conference on. IEEE, 2016, pp. 124–131.
[90] G. White and D. Rice, "Active queue management algorithms for docsis 3.0: A simulation study of codel, sfq-codel and pie in docsis 3.0 networks," http://www.cablelabs.com/downloads/pubs/Active_Queue_Management_Algorithms_DOCSIS_3_0.pdf, accessed: 2013-6-1.
[91] B. Wydrowski and M. Zukerman, "Green: An active queue management algorithm for a self managed internet," in ICC 2002, vol. 4. IEEE, 2002.
[92] H. Zhang et al., "A self-tuning structure for adaptation in tcp/aqm networks," in Globecom 2003, vol. 22, no. 1, 2003, pp. 3641–3645.
[93] W. Zhao, X. Jin, Z. Wang, X. Wang, Y. Luo, and X. Li, "Low cost working set size tracking," in USENIX Annual Technical Conference, 2011.
[94] W. Zhao, Z. Wang, and Y. Luo, "Dynamic memory balancing for virtual machines," ACM SIGOPS Operating Systems Review, vol. 43, no. 3, pp. 37–47, 2009.
[95] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar, "Dynamic tracking of page miss ratio curve for memory management," in ACM SIGOPS Operating Systems Review, vol. 38, no. 5. ACM, 2004, pp. 177–188. |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s62k2bct |



