Description |
Software developers often record critical system events and system status in log files, which are a readily available and valuable data source for data-driven analysis of system health and stability. As a result, system logs are increasingly leveraged for online monitoring and anomaly detection. This dissertation focuses on unleashing the power of data-driven analytics to analyze rich system-generated log data, with the goal of making computer systems easier to manage and understand. System logs are generally free-text events generated by log printing statements inserted in the source code, while most (if not all) data analytics methods require structured input. Therefore, automatic log parsing is an essential step towards further log analysis. We propose Spell, an online streaming method that uses a longest-common-subsequence-based approach to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in a streaming fashion. With structured data parsed from system logs, data analytics methods can be explored for automatic anomaly detection. We propose DeepLog, which automatically learns log patterns from normal execution and detects anomalies when log patterns deviate from the model trained on log data from normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog provides two solutions to help the user diagnose a detected anomaly. First, it constructs workflows from the underlying system logs to guide users in root cause analysis. Second, since system logs generated by multiple programs may be mixed together, we explore and evaluate multiple data analytics methods for system log classification, in order to present users with only the logs relevant for diagnosis.
Lastly, there is an increasing need to manage system logs from multiple computers together, e.g., to effectively monitor all computers in an organization. In such a scenario, the system logs generated by each computer node are constantly forwarded to a centralized controller, which stores and indexes them for users to query and visualize. We design a framework that not only reduces the network overhead of sending log values, but also performs online monitoring across multiple log files for more comprehensive anomaly detection. |
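To make the longest-common-subsequence idea behind streaming log parsing more concrete, the following is a minimal sketch and not the Spell implementation itself. It assumes whitespace tokenization, a `<*>` placeholder for variable fields, and a matching rule that accepts a template when the common subsequence covers at least half of the incoming message's tokens. The names `StreamingParser`, `lcs`, and `merge` are hypothetical and chosen only for illustration.

```python
from typing import List

WILDCARD = "<*>"  # assumed placeholder for variable parts of a message


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover the subsequence itself.
    out: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def merge(template: List[str], tokens: List[str], common: List[str]) -> List[str]:
    """Keep the common tokens and put a single wildcard wherever either input differs."""
    out: List[str] = []
    i = j = 0
    for c in common:
        gap = False
        while template[i] != c:  # skip template tokens not in the LCS
            i += 1
            gap = True
        while tokens[j] != c:  # skip message tokens not in the LCS
            j += 1
            gap = True
        if gap and (not out or out[-1] != WILDCARD):
            out.append(WILDCARD)
        out.append(c)
        i += 1
        j += 1
    if i < len(template) or j < len(tokens):  # trailing tokens that differ
        out.append(WILDCARD)
    return out


class StreamingParser:
    """Maintains a growing list of message templates, one log line at a time."""

    def __init__(self) -> None:
        self.templates: List[List[str]] = []

    def add(self, line: str) -> int:
        """Parse one log line and return the index of its message type."""
        tokens = line.split()
        best_idx, best_common = None, []
        for idx, template in enumerate(self.templates):
            common = lcs(template, tokens)
            if len(common) > len(best_common):
                best_idx, best_common = idx, common
        # Assumed threshold: the LCS must cover at least half of the message.
        if best_idx is not None and 2 * len(best_common) >= len(tokens):
            self.templates[best_idx] = merge(
                self.templates[best_idx], tokens, best_common
            )
            return best_idx
        self.templates.append(tokens)
        return len(self.templates) - 1


parser = StreamingParser()
first = parser.add("Temperature 41 exceeds warning threshold")
second = parser.add("Temperature 43 exceeds warning threshold")
third = parser.add("Connection closed by peer")
```

In this sketch the first two lines share a message type whose template has the numeric field replaced by the wildcard, while the third line starts a new type. Comparing each incoming line against every stored template is the simplest possible strategy; a practical parser would need a faster lookup structure for large template sets.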