Difference between revisions of "Examples"

From Mobius Wiki

Revision as of 22:28, 14 March 2014

Fault-Tolerant Multiprocessor System

This section presents an example of a system that can be modeled using Möbius. It starts with a description of the system, and then guides you through one way to build a model of the system and solve it using both simulation and numerical solution. The example is intended to take you step-by-step through the process of creating and solving a model in Möbius, and to exhibit many of the capabilities and features of the tool.


System Description

The system under consideration is a highly redundant fault-tolerant multiprocessor system adapted from [1] and shown in <xr id="fig:ex_multiproc" />. At the highest level, the system consists of multiple computers. Each computer is composed of 3 memory modules, of which 1 is a spare module; 3 CPU units, of which 1 is a spare unit; 2 I/O ports, of which 1 is a spare port; and 2 non-redundant error-handling chips.


<figure id="fig:ex_multiproc">

Multiproc.png


<xr id="fig:ex_multiproc" nolink />: Fault-tolerant multiprocessor system.
</figure>


Internally, each memory module consists of 41 RAM chips (2 of which are spare chips) and 2 interface chips. Each CPU unit and each I/O port consists of 6 non-redundant chips. The system is considered operational if at least 1 computer is operational. A computer is classified as operational if, of its components, at least 2 memory modules, at least 2 CPU units, at least 1 I/O port, and the 2 error-handling chips are functioning. A memory module is operational if at least 39 of its 41 RAM chips, and its 2 interface chips, are working.
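The operational conditions above can be written as simple predicates. The following is a minimal sketch, not part of the Möbius model; the struct and function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Hypothetical snapshot of one computer's working components
// (illustrative names, not Möbius identifiers).
struct ComputerState {
    int memory_modules_ok;  // operational memory modules (0..3)
    int cpu_units_ok;       // operational CPU units (0..3)
    int io_ports_ok;        // operational I/O ports (0..2)
    int error_handlers_ok;  // operational error-handling chips (0..2)
};

// A memory module needs at least 39 of its 41 RAM chips and both interface chips.
bool memory_module_operational(int ram_chips_ok, int interface_chips_ok) {
    assert(ram_chips_ok >= 0 && ram_chips_ok <= 41);
    assert(interface_chips_ok >= 0 && interface_chips_ok <= 2);
    return ram_chips_ok >= 39 && interface_chips_ok == 2;
}

// A computer needs 2 memory modules, 2 CPU units, 1 I/O port,
// and both error-handling chips.
bool computer_operational(const ComputerState& c) {
    return c.memory_modules_ok >= 2 && c.cpu_units_ok >= 2 &&
           c.io_ports_ok >= 1 && c.error_handlers_ok == 2;
}

// The system is up as long as at least one computer is operational.
bool system_operational(int computers_ok) {
    return computers_ok >= 1;
}
```

For instance, a computer with 2 memory modules, 2 CPU units, 1 I/O port, and 2 error-handling chips still counts as operational, while losing either error-handling chip takes it down because those chips have no spares.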

Where there is redundancy (available spares) at any level of system hierarchy, there is a coverage factor associated with the component failure at that level. For example, following the parameter values used by Lee et al. [1], if one CPU unit fails, with probability 0.995 the failed unit will be replaced by the spare unit, if available, and the corresponding computer will continue to operate. On the other hand, there is also a 0.005 probability that the fault recovery mechanism will fail and the corresponding computer will cease to operate. <xr id="tab:ex_coverage" /> shows the redundant components and their associated fault coverage probability. Finally, the failure rate of every chip in the system, as in [1], is assumed to be 100 failures per billion hours (0.0008766 failures per year).


<figtable id="tab:ex_coverage">

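{| border="1" cellspacing="0" cellpadding="5" align="center"
|+ <xr id="tab:ex_coverage" nolink />: Coverage probabilities.
|-
! align=left| Redundant Component
! align=center| Fault Coverage Probability
|-
| align=left| RAM Chip
| align=center| 0.998
|-
| align=left| Memory Module
| align=center| 0.95
|-
| align=left| CPU Unit
| align=center| 0.995
|-
| align=left| I/O Port
| align=center| 0.99
|-
| align=left| Computer
| align=center| 0.95
|}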
</figtable>


Getting Started

A model of the system in this example is included with the Möbius distribution. Refer to Section C.1 for instructions on installing the example models. You are encouraged to open the model and follow the detailed discussions of its various components in the sections below.

From the Möbius Project Manager window, click Project → Unarchive. A dialog will present a list of archived projects in the project directory. Choose Multiproc-Paper and hit Unarchive. After the project has been successfully unarchived, you will be prompted to resave the project using Project → Resave. At the dialog, choose Multiproc-Paper again, hit Resave, and wait until all components have been built. The Multiproc-Paper project editor will appear as shown in Figure 3.1.


Atomic Models

To build a model for an entire system, begin by defining SAN submodels to represent the failures of various components in the system.

The SAN submodel of the CPUs is called cpu_module and is shown in <xr id="fig:ex_sancpu" />. To open this model, click the Atomic tab in the project panel, and then double-click on cpu_module or right-click on it and select Open. The places named cpus and computer_failed represent the current state of the CPUs and the current state of the multiprocessor system, respectively. That is, the number of tokens in cpus represents the number of operational CPUs in a given computer. Likewise, the number of tokens in computer_failed indicates the number of computers that have failed in the system. To open any of these places, right-click on the place and select Edit. This will bring up the Place Attributes dialog, in which you can edit the Name of the place and the initial marking (number of tokens) of the place. Note that the Tokens field can be specified with either a constant or a global variable name. For example, the place cpus has been initialized with three tokens, as each computer consists of three CPU units.


<figure id="fig:ex_sancpu">

Ex sancpu.png


<xr id="fig:ex_sancpu" nolink />: SAN submodel of cpu_module.
</figure>


To create a new place, either click the blue circle icon in the toolbar or select Elements → Place from the menu. Then click where you would like the place to go in the editor. The Place Attributes dialog will appear, and you can edit the Name of the place as well as the initial marking of the place in the Tokens field, as described earlier. To delete a place, right-click on it and select Delete, and hit OK to confirm.

The places labeled ioports, errorhandlers, and memory_failed are also included in this model to aid in reducing the size of the state space for the overall system model by lumping as many failed states together as possible. Additional state lumping (beyond that provided by the reduced base model construction method) can be achieved because once a computer fails, there is no need to keep track of which component failure caused the computer failure. More specifically, because of the assumption that all internal components of the failed computer have failed, the states that represent a computer failure due to a failure of a CPU unit, a memory module, an I/O port, or an error-handling chip are combined into a single state. The marking of the combined state is reached by setting the number of tokens in each of the places cpus, ioports, and errorhandlers to zero, setting the number of tokens in memory_failed to 2, and incrementing the number of tokens in computer_failed.

The failure of a CPU unit corresponds to the completion of timed activity cpu_failure. To open this activity, right-click on it and select Edit. This will bring up the Timed Activity Attributes dialog. In this dialog, you can edit the name of the activity and the distribution of its firing delay in the Time distribution function field. For this activity, the Exponential distribution should be selected. The activity completion rate is shown in <xr id="tab:ex_cpuact" />. This rate corresponds to six times the failure rate of a chip (each CPU unit consists of 6 non-redundant chips) times the number of operational CPU units in the computer.

If a spare CPU unit is available (i.e., cpus->Mark() == 3), three cases are associated with the activity completion, as designated in the Case quantity field. To define the case probabilities, click on the appropriate case number’s tab and type the expression in the box. The expression for the case probability can be a constant, a global variable, or a C++ statement returning a value as in this example.

The first case represents a successful coverage of a CPU unit failure. If that case occurs, the failed CPU unit is replaced by the spare unit, and its corresponding computer continues to operate. The second case represents the situation in which a CPU unit failure occurs that is not covered, but the failure of its corresponding computer is covered. If that happens and a spare computer is available, the failed computer is replaced by the spare computer and the system continues to operate. However, if no spare computer is available, the multiprocessor system fails. The third case represents the situation in which neither the CPU failure nor the corresponding computer failure is covered, resulting in a total system failure.


<figtable id="tab:ex_cpuact">

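{| border="1" cellspacing="0" cellpadding="5" align="center"
|+ <xr id="tab:ex_cpuact" nolink />: cpu_module activity time distributions.
|-
! align=center| Activity
! align=left| Distribution
|-
| align=center| cpu_failure
| align=left| <font face=Courier>expon(0.0052596 * cpus->Mark())</font>
|}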
</figtable>
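The constant 0.0052596 in the rate expression can be reproduced from the numbers given in the system description. The sketch below is only a worked check, not Möbius code: the names are hypothetical, and the figure of 8766 hours per year (365.25 days) is an assumption that happens to reproduce the per-year chip failure rate of 0.0008766.

```cpp
#include <cassert>

// Chip failure rate from the text: 100 failures per 10^9 hours.
constexpr double kChipRatePerHour = 100.0 / 1.0e9;
// Assumption: 8766 hours per year (365.25 days).
constexpr double kHoursPerYear = 8766.0;
// Per-year chip failure rate, approximately 0.0008766.
constexpr double kChipRatePerYear = kChipRatePerHour * kHoursPerYear;
// Each CPU unit consists of 6 non-redundant chips.
constexpr int kChipsPerCpuUnit = 6;

// Marking-dependent rate of cpu_failure: 6 * chip rate * operational CPU units.
double cpu_failure_rate(int cpus_marking) {
    assert(cpus_marking >= 0);
    return kChipsPerCpuUnit * kChipRatePerYear * cpus_marking;
}
```

With one operational CPU unit the function returns approximately 0.0052596, which is the coefficient that multiplies cpus->Mark() in the table; the rate scales linearly with the marking of cpus.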


On the other hand, if no spare CPU is available (i.e., cpus->Mark() == 2), then a CPU unit failure causes a computer failure. In this marking, two possible outcomes may result from the completion of activity cpu_failure. In the first, a spare computer is available, so that the computer failure can be covered. In the second, no spare computer is available, and system failure results. <xr id="tab:ex_cpucaseprob" /> shows the case numbers and the probabilities associated with each case for the activity cpu_failure. It is clear that the case probabilities are marking-dependent, since the coverage factors depend on the state of the system.


<figtable id="tab:ex_cpucaseprob">

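{| border="1" cellspacing="0" cellpadding="5" align="center"
|+ <xr id="tab:ex_cpucaseprob" nolink />: cpu_module case probabilities for activities.
|-
! align=center| Case
! align=left| Probability
|-
| colspan="2" align=center| cpu_failure
|-
| align=center| 1
| align=left| <font face=Courier>if (cpus->Mark() == 3) <br/>&nbsp;&nbsp;&nbsp;return(0.995); <br/>else <br/>&nbsp;&nbsp;&nbsp;return(0.0);</font>
|-
| align=center| 2
| align=left| <font face=Courier>if (cpus->Mark() == 3) <br/>&nbsp;&nbsp;&nbsp;return(0.00475); <br/>else <br/>&nbsp;&nbsp;&nbsp;return(0.95);</font>
|-
| align=center| 3
| align=left| <font face=Courier>if (cpus->Mark() == 3) <br/>&nbsp;&nbsp;&nbsp;return(0.00025); <br/>else <br/>&nbsp;&nbsp;&nbsp;return(0.05);</font>
|}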
</figtable>
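The constants in the case probabilities follow from the coverage table: with a spare CPU available, case 2 is (1 - 0.995) × 0.95 and case 3 is (1 - 0.995) × (1 - 0.95). The sketch below shows one way to derive them; it is an illustration with hypothetical names, whereas the model itself simply hard-codes the resulting values.

```cpp
#include <array>
#include <cassert>

constexpr double kCpuCoverage = 0.995;      // CPU Unit row of the coverage table
constexpr double kComputerCoverage = 0.95;  // Computer row of the coverage table

// Returns the probabilities of cases 1, 2, and 3 of cpu_failure
// for a marking of place cpus equal to 2 or 3.
std::array<double, 3> cpu_failure_case_probs(int cpus_marking) {
    assert(cpus_marking == 2 || cpus_marking == 3);
    // Without a spare CPU (marking 2) the CPU failure cannot be covered.
    const double cpu_cov = (cpus_marking == 3) ? kCpuCoverage : 0.0;
    return std::array<double, 3>{{
        cpu_cov,                                      // case 1: CPU failure covered
        (1.0 - cpu_cov) * kComputerCoverage,          // case 2: computer failure covered
        (1.0 - cpu_cov) * (1.0 - kComputerCoverage)   // case 3: nothing covered
    }};
}
```

In both markings the three probabilities sum to 1 (up to floating-point rounding), which is a useful sanity check whenever case probabilities are marking-dependent.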


The input gate Input_Gate1 is used to determine whether the timed activity cpu_failure is enabled in the current marking, and hence can complete. The cpu_failure activity is enabled only if at least 2 working CPU units are available and their corresponding computer and the system have not failed. <xr id="tab:ex_cpuig1" /> shows the enabling predicate and function associated with this gate.


<figtable id="tab:ex_cpuig1">

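{| border="1" cellspacing="0" cellpadding="5" align="center"
|+ <xr id="tab:ex_cpuig1" nolink />: cpu_module input gate predicates and functions.
|-
! align=center| Gate
! align=left| Enabling Predicate
! align=center| Function
|-
| align=center| Input_Gate1
| align=left| <font face=Courier>(cpus->Mark()>1) && <br/>(memory_failed->Mark()&lt;2) && <br/>(computer_failed->Mark()&lt;num_comp)</font>
| align=center| identity
|}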
</figtable>
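To experiment with the enabling predicate outside the tool, it can be mirrored on a plain struct. This is a minimal sketch under stated assumptions: the Marking struct and the function name are hypothetical stand-ins, whereas in Möbius the gate reads each place through its Mark() accessor and num_comp is a global variable of the model.

```cpp
#include <cassert>

// Hypothetical stand-in for the part of the SAN marking read by Input_Gate1.
struct Marking {
    int cpus;             // operational CPU units in this computer
    int memory_failed;    // failed memory modules in this computer
    int computer_failed;  // failed computers in the system
};

// Mirrors the enabling predicate of Input_Gate1.
bool input_gate1_enabled(const Marking& m, int num_comp) {
    assert(num_comp >= 1);
    return m.cpus > 1 &&                  // at least 2 working CPU units
           m.memory_failed < 2 &&         // this computer has not failed
           m.computer_failed < num_comp;  // the system has not failed
}
```

Because the output gates set memory_failed to 2 whenever a computer fails, the test on memory_failed is what disables cpu_failure in a computer that has already been lumped into the failed state.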


The output gates OG1, OG2, and OG3 are used to determine the next marking based on the current marking and the case chosen when cpu_failure completes. They correspond to the different situations that arise because of the coverage or non-coverage of system components. <xr id="tab:ex_cpuog" /> lists the output gates and the function of each gate.


<figtable id="tab:ex_cpuog">

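{| border="1" cellspacing="0" cellpadding="5" align="center"
|+ <xr id="tab:ex_cpuog" nolink />: cpu_module output gate functions.
|-
! align=center| Gate
! align=left| Function
|-
| align=center| OG1
| align=left| <font face=Courier>if (cpus->Mark() == 3) <br/>&nbsp;&nbsp;&nbsp;cpus->Mark()--;</font>
|-
| align=center| OG2
| align=left| <font face=Courier>cpus->Mark() = 0; <br/>ioports->Mark() = 0; <br/>errorhandlers->Mark() = 0; <br/>memory_failed->Mark() = 2; <br/>computer_failed->Mark()++;</font>
|-
| align=center| OG3
| align=left| <font face=Courier>cpus->Mark() = 0; <br/>ioports->Mark() = 0; <br/>errorhandlers->Mark() = 0; <br/>memory_failed->Mark() = 2; <br/>computer_failed->Mark() = num_comp;</font>
|}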
</figtable>
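The effect of the three output gates on the marking can be traced with a small stand-alone sketch. The struct and function names below are hypothetical; the assignments themselves follow the gate functions in the table, with plain integer fields standing in for the Mark() accessors.

```cpp
#include <cassert>

// Hypothetical stand-in for the places of cpu_module.
struct CpuModuleMarking {
    int cpus;
    int ioports;
    int errorhandlers;
    int memory_failed;
    int computer_failed;
};

// OG1: the CPU failure is covered, so the spare replaces the failed unit.
void og1(CpuModuleMarking& m) {
    if (m.cpus == 3) {
        m.cpus--;
    }
}

// Common part of OG2 and OG3: move this computer to the lumped failed state.
void set_lumped_failed_state(CpuModuleMarking& m) {
    m.cpus = 0;
    m.ioports = 0;
    m.errorhandlers = 0;
    m.memory_failed = 2;
}

// OG2: the computer fails, but the computer failure is covered.
void og2(CpuModuleMarking& m) {
    set_lumped_failed_state(m);
    m.computer_failed++;
}

// OG3: neither failure is covered, so every computer is counted as failed.
void og3(CpuModuleMarking& m, int num_comp) {
    assert(num_comp >= 1);
    set_lumped_failed_state(m);
    m.computer_failed = num_comp;
}
```

Starting from a marking with 3 CPUs and no failed computers, og1 leaves 2 CPUs, a subsequent og2 zeroes the component places and raises computer_failed to 1, and og3 instead jumps computer_failed straight to num_comp, which represents total system failure.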




<equation id="eqn:binom" shownumber>

f(k)=\binom{n}{k}p^k(1-p)^{n-k}\quad k=0,1,\dots,n

</equation>

Sort of like <xr id="eqn:binom" />, but not really.


References

  1. D. Lee, J. Abraham, D. Rennels, and G. Gilley. A numerical technique for the evaluation of large, closed fault-tolerant systems. In Dependable Computing for Critical Applications, pages 95–114. Springer-Verlag, Wien, 1992.
