Over the past five years, we have investigated methods to characterize global behavior in large distributed systems and have applied those methods to predict the effects of deploying alternate distributed control algorithms (a complete record of this research is available at http://www.nist.gov/itl/antd/emergent_behavior.cfm). The methods we used assess global behaviors under a wide range of conditions, enable significant understanding of overall system dynamics, and yield insightful comparisons of competing control regimes. On the other hand, such methods do not provide information about the potential for rare combinations of events to drive system dynamics into global failure regimes, leading to catastrophic collapse. Our ongoing research aims to address this topic through two complementary thrusts: (1) design-time methods that enable system architects to identify and evaluate global failure scenarios that could lead to system collapse and (2) run-time methods that alert system operators to an incipient transition into a global failure regime and the collapse that would follow. Effective design-time methods will enable architects to devise mechanisms that can prevent high-risk scenarios. Since no design-time method can identify all possible failure scenarios, effective run-time methods will signal operators when the system trajectory trends toward collapse, allowing remedial actions to forestall or mitigate catastrophic failure. We seek research collaborators who are also interested in design-time and run-time methods to predict global failure regimes in the infrastructures on which modern society increasingly depends.
Complex systems; Emergent behavior; Failure scenarios; Large-scale distributed systems; Phase transitions