What is characterized as the ability of a system to recover from failures and continue to function?

  1. The ability of a system to recover from failures and continue to function
  2. System Resilience: What Exactly is it?
  3. About Reliable and Resilient Cloud Topology Practices
  4. What is Fault Tolerance?
  5. Fault Tolerance and Disaster Recovery
  6. Tactics and Patterns for Software Robustness
  7. Reliability, availability and serviceability
  8. System Resilience Part 2: How System Resilience Relates to Other Quality Attributes


The ability of a system to recover from failures and continue to function

We're still on the topic of the WAF, the Well-Architected Framework: a set of guidelines for building the best cloud infrastructure possible. Today we've arrived at step 4: Reliability. It seems like a no-brainer: of course reliability is important in your cloud management. Zegert van der Linde has his questions lined up and puts them to Carlo Garavaglia and Jurjen Uijttenboogaart.

System Resilience: What Exactly is it?

Firesmith, Donald. "System Resilience: What Exactly is it?" Carnegie Mellon University, Software Engineering Institute's Insights (blog). Carnegie Mellon's Software Engineering Institute, November 25, 2019. https://insights.sei.cmu.edu/blog/system-resilience-what-exactly-is-it/.

As part of work on the development of resilience requirements for cyber-physical systems, I recently completed a literature study of existing standards and other documents related to resilience. My review revealed that the term resilience is typically used informally, as though its meaning were obvious. In those cases where it was defined, it was given similar, but somewhat inconsistent, meanings. Another issue I found was that the term resilience is used in two very different senses. This blog post, the first in a two-part series, focuses on system resilience and not organizational resilience, which has a much larger scope. Organizational resilience is primarily concerned with business continuity and includes the management of people, information, technology, and facilities. For more in...

About Reliable and Resilient Cloud Topology Practices

The architecture of a reliable application in the cloud is typically different from a traditional application architecture. Historically, you may have purchased redundant higher-end hardware to minimize the chance of an entire application platform failing; in the cloud, it's important to acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component, that is, to avoid a single point of failure (SPOF). Follow these best practices to build reliability into each step of your design process.

Reliable applications are:
• Resilient: they recover gracefully from failures and continue to function with minimal downtime and data loss before full recovery.
• Highly available (HA): they run as designed in a healthy state with no significant downtime.
• Protected from Region failure through good disaster recovery (DR) design.

Understanding how these elements work together, and how they affect cost, is essential to building a reliable application. It can help you determine how much downtime is acceptable, the potential cost to your business, and which functions are necessary during a recovery.

When creating a cloud application, use the following to build in reliability.
• Define the requirements. Define your availability and recovery requirements based on the workloads you are bringing to the cloud and business needs.
• Apply architectural best practices. Follow proven practices, identify possible failure points in the arch...
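As a rough illustration of why removing a single point of failure matters, the sketch below compares the availability of components chained in series with the same component duplicated in parallel. It assumes component failures are independent, which real deployments only approximate. The function names and the 99% figure are hypothetical and do not come from the source.

```python
from math import prod


def series_availability(availabilities):
    """All components must be up: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """At least one replica must be up: 1 minus the chance that all are down."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one component (assumption)
    print(f"two in series:   {series_availability([single, single]):.4f}")
    print(f"two in parallel: {parallel_availability([single, single]):.4f}")
```

Under the independence assumption, chaining two such components lowers overall availability, while duplicating one raises it. That is the arithmetic behind designing for failure instead of trying to prevent it.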

What is Fault Tolerance?

Fault tolerance refers to the ability of a system (computer, network, cloud cluster, etc.) to continue operating without interruption when one or more of its components fail. The objective of creating a fault-tolerant system is to prevent disruptions arising from a single point of failure. Fault-tolerant systems use backup components that automatically take the place of failed components, ensuring no loss of service. These include:
• Hardware systems that are backed up by identical or equivalent systems. For example, a server can be made fault tolerant by using an identical server running in parallel, with all operations mirrored to the backup server.
• Software systems that are backed up by other software instances. For example, a database with customer information can be continuously replicated to another machine. If the primary database goes down, operations can be automatically redirected to the second database.
• Power sources that are made fault tolerant using alternative sources. For example, many organizations have power generators that can take over in case main line electricity fails.

In similar fashion, any system or component which is a single point of failure can be made fault tolerant using redundancy.

Fault tolerance vs. high availability

High availability refers to a system’s ability to avoid loss of service by minimizing downtime. It’s expressed in terms of a system’s uptime, as a pe...
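The replicated-database example above can be sketched in a few lines. This is a minimal, in-memory illustration under stated assumptions, not how any particular database product implements failover: `Replica`, `FailoverClient` and `ReplicaDown` are hypothetical names, and replication between the two instances is assumed to have already happened.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request."""


class Replica:
    """Hypothetical in-memory stand-in for a database instance."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = dict(data)
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


class FailoverClient:
    """Tries replicas in order and redirects reads when one is down."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ReplicaDown:
                continue  # fall through to the next replica
        raise ReplicaDown("all replicas are down")


if __name__ == "__main__":
    records = {"customer:1": "Ada"}
    primary = Replica("primary", records)
    secondary = Replica("secondary", records)  # assumed to be kept in sync
    client = FailoverClient([primary, secondary])

    primary.healthy = False  # simulate the primary going down
    print(client.get("customer:1"))  # still answered, by the secondary
```

The caller never sees the primary's failure. Service is lost only when every replica is down, which is the point of adding redundancy around a single point of failure.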

Fault Tolerance and Disaster Recovery

The ability to recover from failures is critical to the proper function of any system, including Operations Manager. Although the two concepts are closely related, fault tolerance and disaster recovery are fundamentally different. Fault tolerance is the ability to continue operating even in the event of a failure. This ensures that failures don’t result in loss of service. Fault-tolerance mechanisms, such as clustering or load-balanced components, have activation times typically measured in seconds or minutes. These mechanisms typically also have high costs associated with them, such as duplicated hardware. On the other hand, disaster recovery is the ability to restore operations after a loss of service. ...

Source: Microsoft® System Center 2012 Unleashed (O’Reilly learning platform).

Tactics and Patterns for Software Robustness

Kazman, Rick. "Tactics and Patterns for Software Robustness." Carnegie Mellon University, Software Engineering Institute's Insights (blog). Carnegie Mellon's Software Engineering Institute, July 25, 2022. https://insights.sei.cmu.edu/blog/tactics-and-patterns-for-software-robustness/.

Robustness has traditionally been thought of as the ability of a software-reliant system to keep working, consistent with its specifications, despite the presence of internal failures, faulty inputs, or external stresses, over a long period of time. Robustness, along with other quality attributes, such as security and safety, is a key contributor to our trust that a system will perform in a reliable manner. In addition, the notion of robustness has more recently come to encompass a system’s ability to withstand changes in its stimuli and environment without compromising its essential structure and characteristics. In this latter notion of robustness, systems should be malleable, not brittle, with respect to changes in their stimuli or environments. Robustness, consequently, is a highly important quality attribute to design into a sy...

Reliability, availability and serviceability

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a quality of robustness of computer hardware. Computers designed with higher levels of RAS have many features that protect data integrity.

Definitions

RAS originated as a hardware-oriented term.
• Reliability can be defined as the probability that a system will produce correct outputs up to some given time t.
• Availability means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent.
• Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoi...
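To make the availability figures concrete, the sketch below converts an availability percentage into downtime per year and shows how a longer repair time lowers availability. The steady-state relation availability = MTBF / (MTBF + MTTR) is a common approximation that the excerpt does not state. The function names and sample numbers are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_percent):
    """Convert an availability percentage into expected downtime per year."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability as uptime / (uptime + repair time); an approximation."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.999):
        minutes = downtime_hours_per_year(pct) * 60
        print(f"{pct}% available: about {minutes:.1f} minutes of downtime per year")

    # Same failure rate, longer repairs: availability goes down.
    print(steady_state_availability(1000, 1))
    print(steady_state_availability(1000, 10))
```

At 99.999 percent the downtime budget works out to a little over five minutes per year, which is why high-availability systems often quote downtime in minutes or hours per year.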

System Resilience Part 2: How System Resilience Relates to Other Quality Attributes

Firesmith, Donald. "System Resilience Part 2: How System Resilience Relates to Other Quality Attributes." Carnegie Mellon University, Software Engineering Institute's Insights (blog). Carnegie Mellon's Software Engineering Institute, December 2, 2019. https://insights.sei.cmu.edu/blog/system-resilience-part-2-how-system-resilience-relates-to-other-quality-attributes/.

To most people, a system is resilient if it continues to perform its mission in the face of adversity. In other words, a system is resilient if it continues to operate appropriately and provide required capabilities despite excessive stresses that can or do cause disruptions.

System Resilience - a Brief Recap

Clearly,...