Fault-Tolerant Multiprocessor Model


Fault-Tolerant Multiprocessor System

This section presents an example of a system that can be modeled using Möbius. It starts with a description of the system, and then guides you through one way to build a model of the system and solve it using both simulation and numerical solution. The example is intended to take you step-by-step through the process of creating and solving a model in Möbius, and to exhibit many of the capabilities and features of the tool.


System Description

The system under consideration is a highly redundant fault-tolerant multiprocessor system adapted from [1] and shown in <xr id="fig:ex_multiproc" />. At the highest level, the system consists of multiple computers. Each computer is composed of 3 memory modules, of which 1 is a spare module; 3 CPU units, of which 1 is a spare unit; 2 I/O ports, of which 1 is a spare port; and 2 non-redundant error-handling chips.


<figure id="fig:ex_multiproc">

Multiproc.png


<xr id="fig:ex_multiproc" nolink />: Fault-tolerant multiprocessor system.
</figure>


Internally, each memory module consists of 41 RAM chips (2 of which are spare chips) and 2 interface chips. Each CPU unit and each I/O port consists of 6 non-redundant chips. The system is considered operational if at least 1 computer is operational. A computer is classified as operational if, of its components, at least 2 memory modules, at least 2 CPU units, at least 1 I/O port, and the 2 error-handling chips are functioning. A memory module is operational if at least 39 of its 41 RAM chips, and its 2 interface chips, are working.
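The operational conditions above can be written as simple predicates. The following is a minimal sketch, not part of the Möbius model: the struct `ComputerState` and both function names are hypothetical and only restate the thresholds given in the text.

```cpp
#include <cassert>

// Hypothetical snapshot of the working components in one computer.
struct ComputerState {
    int memory_modules;  // working memory modules (3 installed, 1 is a spare)
    int cpu_units;       // working CPU units (3 installed, 1 is a spare)
    int io_ports;        // working I/O ports (2 installed, 1 is a spare)
    int error_handlers;  // working error-handling chips (2, non-redundant)
};

// A memory module needs at least 39 of its 41 RAM chips
// and both of its interface chips.
bool memory_module_operational(int ram_chips, int interface_chips) {
    assert(ram_chips >= 0 && ram_chips <= 41);
    assert(interface_chips >= 0 && interface_chips <= 2);
    return ram_chips >= 39 && interface_chips == 2;
}

// A computer needs 2 memory modules, 2 CPU units, 1 I/O port,
// and both error-handling chips.
bool computer_operational(const ComputerState& c) {
    return c.memory_modules >= 2 && c.cpu_units >= 2 &&
           c.io_ports >= 1 && c.error_handlers == 2;
}
```

For instance, a computer that has lost one CPU unit and one I/O port is still operational, while the loss of a single error-handling chip is enough to bring it down.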

Where there is redundancy (available spares) at any level of the system hierarchy, there is a coverage factor associated with the component failure at that level. For example, following the parameter values used by Lee et al. [1], if one CPU unit fails, with probability 0.995 the failed unit will be replaced by the spare unit, if available, and the corresponding computer will continue to operate. On the other hand, there is also a 0.005 probability that the fault recovery mechanism will fail and the corresponding computer will cease to operate. <xr id="tab:ex_coverage" /> shows the redundant components and their associated fault coverage probability. Finally, the failure rate of every chip in the system, as in [1], is assumed to be 100 failures per billion hours (0.0008766 failures per year).
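The per-year figure can be checked with a unit conversion. The sketch below assumes a year of 8766 hours (365.25 days of 24 hours), which is the value that reproduces the quoted number; the symbol for the chip failure rate is introduced here for illustration.

```latex
\[
\lambda_{\text{chip}}
  = \frac{100\ \text{failures}}{10^{9}\ \text{hours}}
    \times 8766\ \frac{\text{hours}}{\text{year}}
  = 8.766 \times 10^{-4}\ \text{failures per year}.
\]
```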


<figtable id="tab:ex_coverage">

<xr id="tab:ex_coverage" nolink />: Coverage probabilities.
Redundant Component | Fault Coverage Probability
RAM Chip | 0.998
Memory Module | 0.95
CPU Unit | 0.995
I/O Port | 0.99
Computer | 0.95
</figtable>


Getting Started

A model of the system in this example is included with the Möbius distribution. Refer to Section C.1 for instructions on installing the example models. You are encouraged to open the model and follow the detailed discussions of its various components in the sections below.

From the Möbius Project Manager window, click Project → Unarchive. A dialog will present a list of archived projects in the project directory. Choose Multi-Proc and hit Unarchive. After the project has been successfully unarchived, you will be prompted to resave the project using Project → Resave. At the dialog, choose Multi-Proc again, hit Resave, and wait until all components have been built. The Multi-Proc project editor will appear as shown in Figure 3.1.

Atomic Models

To build a model for an entire system, begin by defining SAN submodels to represent the failures of various components in the system.

The SAN submodel of the CPUs is called cpu_module and is shown in <xr id="fig:ex_sancpu" />. To open this model, click the Atomic tab in the project panel, and then double-click on cpu_module or right-click on it and select Open. The places named cpus and computer_failed represent the current state of the CPUs and the current state of the multiprocessor system, respectively. That is, the number of tokens in cpus represents the number of operational CPUs in a given computer. Likewise, the number of tokens in computer_failed indicates the number of computers that have failed in the system. To open any of these places, right-click on the place and select Edit. This will bring up the Place Attributes dialog, in which you can edit the Name of the place and the initial marking (number of tokens) of the place. Note that the Tokens field can be specified with either a constant or a global variable name. For example, the place cpus has been initialized with three tokens, as each computer consists of three CPU units.


<figure id="fig:ex_sancpu">

Ex sancpu.png


<xr id="fig:ex_sancpu" nolink />: SAN submodel of cpu_module.
</figure>


To create a new place, either click the blue circle icon in the toolbar or select Elements → Place from the menu. Then click where you would like the place to go in the editor. The Place Attributes dialog will appear, and you can edit the Name of the place as well as the initial marking of the place in the Tokens field, as described earlier. To delete a place, right-click on it and select Delete, and hit OK to confirm.

The places labeled ioports, errorhandlers, and memory_failed are also included in this model to aid in reducing the size of the state space for the overall system model by lumping as many failed states together as possible. Additional state lumping (beyond that provided by the reduced base model construction method) can be achieved because once a computer fails, there is no need to keep track of which component failure caused the computer failure. More specifically, because of the assumption that all internal components of the failed computer have failed, the states that represent a computer failure due to a failure of a CPU unit, a memory module, an I/O port, or an error-handling chip are combined into a single state. The marking of the combined state is reached by setting the number of tokens in each of the places cpus, ioports, and errorhandlers to zero, setting the number of tokens in memory_failed to 2, and incrementing the number of tokens in computer_failed.

The failure of a CPU unit corresponds to the completion of timed activity cpu_failure. To open this activity, right-click on it and select Edit. This will bring up the Timed Activity Attributes dialog. In this dialog, you can edit the name of the activity and the distribution of its firing delay in the Time distribution function field. For this activity, the Exponential distribution should be selected. The activity completion rate is shown in <xr id="tab:ex_cpuact" />. This rate corresponds to six times the failure rate of a chip (each CPU unit consists of 6 non-redundant chips) times the number of operational CPU units in the computer.

If a spare CPU unit is available (i.e., cpus->Mark() == 3), three cases are associated with the activity completion, as designated in the Case quantity field. To define the case probabilities, click on the appropriate case number’s tab and type the expression in the box. The expression for the case probability can be a constant, a global variable, or a C++ statement returning a value as in this example. The first case represents a successful coverage of a CPU unit failure. If that case occurs, the failed CPU unit is replaced by the spare unit, and its corresponding computer continues to operate. The second case represents the situation in which a CPU unit failure occurs that is not covered, but the failure of its corresponding computer is covered. If that happens and a spare computer is available, the failed computer is replaced by the spare computer and the system continues to operate. However, if no spare computer is available, the multiprocessor system fails. The third case represents the situation in which neither the CPU failure nor the corresponding computer failure is covered, resulting in a total system failure.


<figtable id="tab:ex_cpuact">

<xr id="tab:ex_cpuact" nolink />: cpu_module activity time distributions.
Activity | Distribution
cpu_failure | expon(0.0052596 * cpus->Mark())
</figtable>
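The constant 0.0052596 in the rate expression is six times the per-year chip failure rate. The following stand-alone sketch reproduces that arithmetic outside of Möbius; the function and constant names are hypothetical, and the plain integer argument stands in for cpus->Mark().

```cpp
#include <cassert>

// Chip failure rate quoted in the text, in failures per year.
constexpr double kChipFailuresPerYear = 0.0008766;
// Each CPU unit consists of 6 non-redundant chips.
constexpr int kChipsPerCpuUnit = 6;

// Marking-dependent rate of cpu_failure, in failures per year.
// working_cpus plays the role of cpus->Mark().
double cpu_failure_rate(int working_cpus) {
    assert(working_cpus >= 0);
    return kChipsPerCpuUnit * kChipFailuresPerYear * working_cpus;
}
```

With all three CPU units working, the rate is 3 × 0.0052596 = 0.0157788 failures per year; after one unit has been replaced by the spare, it drops to 2 × 0.0052596.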


On the other hand, if no spare CPU is available (i.e., cpus->Mark() == 2), then a CPU unit failure causes a computer failure. In this marking, two possible outcomes may result from the completion of activity cpu_failure. In the first, a spare computer is available, so that the computer failure can be covered. In the second, no spare computer is available, and system failure results. <xr id="tab:ex_cpucaseprob" /> shows the case numbers and the probabilities associated with each case for the activity cpu_failure. It is clear that the case probabilities are marking-dependent, since the coverage factors depend on the state of the system.


<figtable id="tab:ex_cpucaseprob">

<xr id="tab:ex_cpucaseprob" nolink />: cpu_module case probabilities for activities.
Activity: cpu_failure

Case 1:
if (cpus->Mark() == 3)
     return(0.995);
else
     return(0.0);

Case 2:
if (cpus->Mark() == 3)
     return(0.00475);
else
     return(0.95);

Case 3:
if (cpus->Mark() == 3)
     return(0.00025);
else
     return(0.05);
</figtable>
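The numeric case probabilities follow from the CPU unit coverage (0.995) and the computer coverage (0.95) in the coverage table. The sketch below derives them rather than hard-coding them; it is an illustration only, and the function name `cpu_failure_cases` is hypothetical.

```cpp
#include <array>
#include <cassert>

// Coverage probabilities taken from the coverage table.
constexpr double kCpuCoverage = 0.995;
constexpr double kComputerCoverage = 0.95;

// Returns {case 1, case 2, case 3} for cpu_failure.
// cpus_mark plays the role of cpus->Mark() and must be 2 or 3,
// since the activity is only enabled in those markings.
std::array<double, 3> cpu_failure_cases(int cpus_mark) {
    assert(cpus_mark == 2 || cpus_mark == 3);
    if (cpus_mark == 3) {
        // A spare CPU exists: the CPU failure itself may be covered.
        const double uncovered = 1.0 - kCpuCoverage;
        return {{kCpuCoverage,
                 uncovered * kComputerCoverage,
                 uncovered * (1.0 - kComputerCoverage)}};
    }
    // No spare CPU: the computer fails, and only computer coverage matters.
    return {{0.0, kComputerCoverage, 1.0 - kComputerCoverage}};
}
```

With a spare available this gives 0.995, 0.005 × 0.95 = 0.00475, and 0.005 × 0.05 = 0.00025, which sum to 1. Without a spare it gives 0, 0.95, and 0.05.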


The input gate Input_Gate1 is used to determine whether the timed activity cpu_failure is enabled in the current marking, and hence can complete. The cpu_failure activity is enabled only if at least 2 working CPU units are available and their corresponding computer and the system have not failed. <xr id="tab:ex_cpuig1" /> shows the enabling predicate and function associated with this gate.


<figtable id="tab:ex_cpuig1">

<xr id="tab:ex_cpuig1" nolink />: cpu_module input gate predicates and functions.
Gate: Input_Gate1

Enabling Predicate:
(cpus->Mark()>1) &&
(memory_failed->Mark()<2) &&
(computer_failed->Mark()<num_comp)

Function: identity
</figtable>


The output gates OG1, OG2, and OG3 are used to determine the next marking based on the current marking and the case chosen when cpu_failure completes. They correspond to the different situations that arise because of the coverage or non-coverage of system components. <xr id="tab:ex_cpuog" /> lists the output gates and the function of each gate.


<figtable id="tab:ex_cpuog">

<xr id="tab:ex_cpuog" nolink />: cpu_module output gate functions.
OG1:
if (cpus->Mark() == 3)
   cpus->Mark()--;

OG2:
cpus->Mark() = 0;
ioports->Mark() = 0;
errorhandlers->Mark() = 0;
memory_failed->Mark() = 2;
computer_failed->Mark()++;

OG3:
cpus->Mark() = 0;
ioports->Mark() = 0;
errorhandlers->Mark() = 0;
memory_failed->Mark() = 2;
computer_failed->Mark() = num_comp;
</figtable>
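To see how the input gate and the three output gates interact, the gate code can be mirrored in a stand-alone sketch. This is not Möbius code: the struct `Marking` replaces the SAN places with plain integers for a single computer, the initial values of ioports and errorhandlers are assumed from the system description, and all function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical plain-integer stand-in for the places of cpu_module.
struct Marking {
    int cpus = 3;             // working CPU units
    int ioports = 2;          // working I/O ports (assumed initial value)
    int errorhandlers = 2;    // working error-handling chips (assumed)
    int memory_failed = 0;    // failed memory modules
    int computer_failed = 0;  // failed computers in the system
};

// Enabling predicate of Input_Gate1.
bool cpu_failure_enabled(const Marking& m, int num_comp) {
    return m.cpus > 1 && m.memory_failed < 2 && m.computer_failed < num_comp;
}

// OG1: the CPU failure is covered and the spare takes over.
void og1(Marking& m) {
    if (m.cpus == 3) m.cpus--;
}

// Common part of OG2 and OG3: move to the lumped failed-computer marking.
void fail_computer(Marking& m) {
    m.cpus = 0;
    m.ioports = 0;
    m.errorhandlers = 0;
    m.memory_failed = 2;
}

// OG2: the computer fails, but the computer failure is covered.
void og2(Marking& m) {
    fail_computer(m);
    m.computer_failed++;
}

// OG3: nothing is covered, so every computer is marked as failed.
void og3(Marking& m, int num_comp) {
    fail_computer(m);
    m.computer_failed = num_comp;
}
```

After og2 or og3 runs, the predicate is false because cpus is 0 and memory_failed is 2, so cpu_failure cannot fire again for that computer. This is the lumped failed state described earlier.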


In a SAN model, relationships between elements are designated by connecting lines or arcs. For example, places and input gates may be connected to an activity to indicate they are enabling conditions for the activity. An activity (or one of its cases) may be connected to a place or an output gate to indicate that upon completion of the activity, the marking of the place is affected or the output gate function is executed. It is not necessary to connect an output gate to a place whose marking the output gate function changes. Such a connection exists only to ease understanding of the model. To draw a connecting line or arc, choose either Straight Connection, Connected Line, or Spline Curve from the Elements menu. To connect two model elements using the first option, click on the first element and then click on the second element to draw a straight line between them. Using the second or third options, click on the first element, then click on one or more points between the two elements, and finally click on the second element. The Connected Line option will connect the two elements by linear interpolation of all user-defined points between them. The Spline Curve option is similar, but will connect the two elements with a smooth curve. The order in which the two elements are clicked is important, since the arcs, although drawn as undirected edges, are actually specified in a directed manner. For instance, to connect an input gate to an activity, the arc must be drawn from the input gate to the activity, and not vice versa. Also, there are some combinations of elements that cannot be connected, such as one place with another place or an input gate with an output gate.

Another way to model the failure of CPU modules would be to model the failure of a single CPU module as a SAN and replicate this model three times. However, since the failure of any chip inside the CPU module causes the CPU to fail, and each chip is assumed to have an exponentially distributed time to failure, the failure rate of one CPU module is just the sum of the failure rates of its 6 chips. Therefore, modeling the failure of one CPU module, and then replicating this model three times, results in a model that is equivalent to the cpu_module submodel described above. Both approaches will generate the same number of states. In contrast, a significant state space reduction can be achieved by modeling one memory module as a SAN and replicating this model three times, instead of modeling the failure of the three memory modules in one SAN. The reason is that the failure of a single RAM chip does not by itself cause the memory module to fail, so a memory module cannot be modeled as a single entity with one failure rate.
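The rate-summing step for the CPU module relies on a standard property of independent exponential lifetimes. Writing $T_1, \dots, T_6$ for the lifetimes of the six chips (notation introduced here for illustration), each with rate $\lambda$, the CPU unit fails at the first chip failure:

```latex
\[
P\Bigl(\min_{1 \le i \le 6} T_i > t\Bigr)
  = \prod_{i=1}^{6} P(T_i > t)
  = \prod_{i=1}^{6} e^{-\lambda t}
  = e^{-6\lambda t}.
\]
```

The lifetime of a CPU unit is therefore again exponential, with rate $6\lambda$. For $\lambda = 0.0008766$ per year this gives $0.0052596$ per year, the constant used in the cpu_failure rate.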

The SAN submodels of the I/O ports, the memory module, and the two error-handling chips are shown in <xr id="fig:ex_sanio" />, <xr id="fig:ex_sanmem" />, and <xr id="fig:ex_sanerror" />, respectively. The line of reasoning followed in modeling each of these components is similar to that followed in modeling the CPU modules. Note the similarity between the io_port_module and cpu_module SANs. A more detailed discussion of creating SAN models can be found in Section 4.1 of Building Models.


<figure id="fig:ex_sanio">

Ex sanio.png


<xr id="fig:ex_sanio" nolink />: SAN submodel of io_port_module.
</figure>



<figure id="fig:ex_sanmem">

Ex sanmem.png


<xr id="fig:ex_sanmem" nolink />: SAN submodel of memory_module.
</figure>



<figure id="fig:ex_sanerror">

Ex sanerror.png


<xr id="fig:ex_sanerror" nolink />: SAN submodel of the errorhandlers.
</figure>


Composed Model

Now the replicate and join operations previously defined (see Section 5.1 of Building Models) are used to construct a complete composed model from the atomic models. <xr id="fig:ex_composed" /> shows the multi_proc composed model for the multiprocessor system. To open this model, click the Composed tab in the project panel, and then double-click on multi_proc or right-click on it and select Open.


<figure id="fig:ex_composed">

Ex composed.png


<xr id="fig:ex_composed" nolink />: Composed model multi_proc.
</figure>


The leaf nodes represent the individual submodels, or atomic models, that were defined in the previous section. The memory_module is replicated 3 times, corresponding to the number of memory modules in each computer, with the places computer_failed and memory_failed (see <xr id="fig:ex_sanmem" />) held in common among all the replicas. You can see where that is set by right-clicking on the Rep node whose child is the memory_module submodel, and choosing Edit. The Define Rep Node: REP1 window will appear. Here the name of the Rep node is specified as Rep1, and the Number of Reps is specified as the global variable num_mem_mod, which is later defined to be 3 in Section 1.6. The two lists Unshared State Variables and Shared State Variables define which state variables are shared, or held in common, among all replicas. To move a state variable from one list to the other use either the Share > or < Unshare button. To move all state variables use the Share All >> or << Unshare All button.

You can create a new Rep node by selecting the red Rep icon from the toolbar or choosing Elements → Rep from the menu. Then click inside the editor where the Rep node is to be placed and specify the name of the node and the number of Reps in the Define Rep Node dialog. A Rep node must have as its child either an atomic model or another composed model. Click on the black Submodel icon in the toolbar or select Elements → Submodel to add a submodel. Then you can draw a connecting line from the Rep node to the child submodel in the same way that you would draw connecting lines in the atomic model editor (see Section 1.3). Once a Rep node is given a child, the shared state variables can be defined by editing the Rep node again.

The three memory modules are then joined to the I/O ports model (<xr id="fig:ex_sanio" />), CPUs failure model (<xr id="fig:ex_sancpu" />), and error-handler model (<xr id="fig:ex_sanerror" />) to form a model of a computer. In the Join node, places with a common name are shared, and thus treated as single places among all system submodels. To open this node, right-click on the blue Join node and select Edit. This will bring up the Define Join Node dialog. Here, the Join node name is specified as Join1 and shared state variables can be created. The Join State Variables list shows all state variables that are shared across multiple submodels in the Join. Clicking on a shared variable in this list will display the corresponding name of the shared variable in each of the submodels among which it is shared under the Submodel Variables list. The # Shared column indicates how many submodels share each Join state variable. To share a state variable among submodels in a Join, click the Create New Shared Variable button, give a name for the new variable, and select the submodel state variables that are to be shared. In this example, places with a common name across different submodels are shared; this is achieved with the Share All Similar Variables button. A new Join node can be created by clicking on the blue Join icon in the toolbar or selecting Join from the Elements menu. Then the Join node must be connected to its children nodes with arcs as discussed previously. A Join node can have as its children submodels, Rep nodes, or other Join nodes.

Finally, the joined SAN model of one computer is replicated num_comp times by the ‘Rep2’ node to generate the complete model of the multiprocessor system. More information about creating composed models and the composed model editor can be found in Section 5 of Building Models.


Reward Variables

After the composed model of the multiprocessor system has been built, the next step in the model construction process is to define reward variables. Reward variables permit us to compute interesting measures from the model. This example, for instance, focuses on measuring the reliability of the multiprocessor system over a 20-year mission time. The system is considered unreliable by time t if all of the num_comp computers in the system have failed. In terms of this model, the system is unreliable when there are num_comp tokens in place computer_failed.

To define the reliability variable, click on Reward in the project panel, then click New (either in the toolbar or by right-clicking on Reward and selecting New) and specify the new performance variable model name. Or, to view the existing performance variable model for this example, click the Reward tab in the Project panel. All previously defined variables are listed under this tab. The reliability variable should already have been defined, and you can open it for revision either by double-clicking on the variable MultiProc_PV or by choosing it and then clicking on the Open button on the panel. That will open up the Reward Editor for the variable.

On the left-hand side of the Reward Editor window, there is a Variable List sub-window containing all defined reward variables for this model. In the example, unreliability is the only variable. Choose it for revision by clicking on it once. Then click on the Submodels tab to choose the submodels on which the reward is to be computed. Because unreliability is defined on the place computer_failed in the submodel cpu_module, choose this submodel by clicking on it once (see <xr id="fig:ExampleSubmodelTab" />).


<figure id="fig:ExampleSubmodelTab">

ExampleSubmodelTab.png


<xr id="fig:ExampleSubmodelTab" nolink />: Reward Submodels tab.
</figure>


Next, to define the rate reward for unreliability, click on the tab Rate Rewards. This will bring up two sub-windows. The top sub-window lists all available state variables in the model on which rate rewards can be defined. The bottom sub-window, Reward Function, is a text area for entering C++ code for computing reward for the currently selected reward variable (see <xr id="fig:ExampleRateTab" />). In this example, a reward of (1/num_comp) should be returned when all of the computers have failed, because the reward is evaluated over all submodels in the composed model. That is, a reward of (1/num_comp) is accumulated once for each computer, or a total of num_comp times, for a total reward of 1. Thus, the reward for a state in which all computers have failed is 1, and the mean unreliability of the system (for example) can be found by calculating the mean of this reward variable. The C++ code that should be entered in this sub-window is

if (cpu_module->computer_failed->Mark() == num_comp)
{
     return 1.0/num_comp;
}
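The accumulation argument can be checked with a small stand-alone sketch. It does not use the Möbius API: `replica_reward` mirrors the reward function for one replica of the computer submodel, and `total_reward` sums it over all num_comp replicas as described in the text. Both names are hypothetical, and returning 0 when the condition does not hold is an assumption of this sketch.

```cpp
#include <cassert>

// Reward contributed by one replica of the computer submodel.
// computer_failed plays the role of computer_failed->Mark(),
// which is shared by all replicas.
double replica_reward(int computer_failed, int num_comp) {
    assert(num_comp > 0);
    if (computer_failed == num_comp) {
        return 1.0 / num_comp;
    }
    return 0.0;  // assumed default when the system has not failed
}

// The reward is evaluated once per replica and the results are added.
double total_reward(int computer_failed, int num_comp) {
    double total = 0.0;
    for (int replica = 0; replica < num_comp; ++replica) {
        total += replica_reward(computer_failed, num_comp);
    }
    return total;
}
```

With num_comp = 3, a state in which all three computers have failed accumulates three contributions of 1/3 for a total of 1, while any state with fewer failed computers accumulates 0.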


<figure id="fig:ExampleRateTab">

ExampleRateTab.png


<xr id="fig:ExampleRateTab" nolink />: Rate Rewards tab.
</figure>


Now click the Simulation tab to view the parameters for simulation. Since the goal is to measure the unreliability of the system at a particular time (20 years), the Type has been set to Instant of Time and the Start time to 20.0 as in <xr id="fig:ex_rewardsim" />. You can ignore the Estimation and Confidence tabs for now.


<figure id="fig:ex_rewardsim">

Ex rewardsim.png


<xr id="fig:ex_rewardsim" nolink />: Simulation tab in Performance Variable Editor.
</figure>


At this point, if you wanted to define impulse rewards on your model, you could do it here. While rate rewards are evaluated when the model is in a particular state, impulse rewards are evaluated upon action firings. Click on the Impulse Rewards tab to view a list of available actions in the model. The three cases of activity cpu_failure in cpu_module will be listed, accompanied by a column indicating whether impulse rewards have been defined on any of them. Since there are no impulse rewards in this example, this column should read No for each activity. To define an impulse reward on an activity, click on the appropriate activity in the Action Name column and type the reward function in the Impulse Reward Function text box. For example, to count the number of times the cpu_failure activity fires during the first 20 years, the impulse function should return 1 for each of the three cases. Under the Simulation tab, you would set the Type to Interval of Time with a Start of 0.0 and Stop of 20.0. For the purposes of this example, however, do not define impulse rewards. More information about the Reward Editor can be found in Section 6 of Building Models. Save the reward variable definition with File → Save.


Study and Experiments

Once all of the variables of interest have been defined as described in the previous section, you can create a study, or set of experiments, to evaluate the model of the multiprocessor system. To do so, simply define values for the global parameters of the model. To begin, click on the Study tab in the Project panel. There is a pre-defined study, vary_num_comp, for this example. Open vary_num_comp for revision either by double-clicking on it or by choosing it and clicking on the Open button on the Project panel. This will bring up the Study Editor, in which the global parameters for the model are defined, as shown in <xr id="fig:ExampleStudyEditor" />.


<figure id="fig:ExampleStudyEditor">

ExampleStudyEditor.png


<xr id="fig:ExampleStudyEditor" nolink />: The study editor for the example Multi-Proc.
</figure>


To modify the parameters that have numerical values in the column Variable Value in <xr id="fig:ExampleStudyEditor" />, click directly on the values. For the other parameters that have labels, such as Incremental Range, Functional Range, Manual Range, or Random Range, you can modify the values by selecting the variable and then clicking on the corresponding button at the bottom of the editor window. For instance, the variable num_comp has an incremental range from 1 to 3 with increments of 1. You can change its values by selecting it and then clicking on the Incremental Range button. This brings up the editor window as shown in <xr id="fig:ExampleIncrementalRange" />. As shown in the figure, the variable ranges from 1 to 3 with additive increments of 1. You can reveal all the values in the range by pressing the button View Values.
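The values produced by an additive incremental range can be enumerated with a short sketch. It only illustrates the idea of stepping from a first value to a last value; `incremental_range` is a hypothetical helper, not a Möbius function.

```cpp
#include <cassert>
#include <vector>

// Enumerates first, first + step, ... up to and including last.
std::vector<int> incremental_range(int first, int last, int step) {
    assert(step > 0);
    std::vector<int> values;
    for (int v = first; v <= last; v += step) {
        values.push_back(v);
    }
    return values;
}
```

For num_comp, `incremental_range(1, 3, 1)` yields the three values 1, 2, and 3, one per experiment in the study.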


<figure id="fig:ExampleIncrementalRange">

ExampleIncrementalRange.png


<xr id="fig:ExampleIncrementalRange" nolink />: The editor window for editing values for the parameter num_comp.
</figure>


From the parameter values, three experiments have been created. The experiments differ in the values for the parameter num_comp. You can individually activate or deactivate the experiments by pressing the button Experiment Activator. Pressing this button brings up a window similar to the one shown in <xr id="fig:ExampleExpActivator" />.


<figure id="fig:ExampleExpActivator">

ExampleExpActivator.png


<xr id="fig:ExampleExpActivator" nolink />: The Experiment Activator window. The user can activate or deactivate individual experiments by pressing the checkboxes.
</figure>


Note the checkboxes in the row Active. Active experiments are indicated by the check marks. Deactivate experiments by clicking on these checkboxes to remove the check marks. For the example, leave these experiments activated and accept the existing values by pressing the button OK. This brings you back to the Study Editor. Save the values with File → Save.

Solving the Model

After the studies have been created, the next step in model construction is to solve the model for the measures of interest. Möbius provides two methods for solving models: numerical analysis and simulation. For this example, either method can be used to compute the transient solution of the model. The next subsection describes numerical analysis using a transient solver. The following subsection will describe the procedure for conducting simulation.

Numerical solvers

Section 1.5 described how reward variables are defined to allow computation of interesting measures in the model. The reward variable defined there is the unreliability of the multiprocessor system for a 20-year mission time. You can now compute the transient solution of the unreliability of the system.

Before the unreliability measure can be computed, the state spaces of the model must be generated. The state space is a description of all the states that the model may visit and reside in. To generate the state spaces, first open the state-space editor by clicking on the Solver tab in the Project panel. Then double-click on the tab ssg to open the State Space Editor for a previously created state space. That should bring up a window similar to the one shown in <xr id="fig:ExampleSSG#1" />.

<figure id="fig:ExampleSSG">

<subfigure>ExampleSSGEditor.png</subfigure> <subfigure>ExampleSSGOutput.png</subfigure>
(a) The parameters for the state-space generator. (b) The output of the state-space generator.
</figure>
<xr id="fig:ExampleSSG" nolink />: State-space generator editor.


There are a few things to note on this editor. The study vary_num_comp is the study defined in the previous section, and the state spaces that will be generated are based on the global parameters defined in vary_num_comp. There are three experiments for this study, as shown in the Experiment List, and three corresponding state spaces will be generated for them. You can prevent the state spaces for these experiments from being generated by deactivating the corresponding experiments. Do so by pressing the button Experiment Activator. This will bring up a window similar to the one in <xr id="fig:ExampleExpActivator" />. For this example, disable Experiment 3, because the computation time for it may be lengthy on machines without enough resources. Finally, press the button Start State Space Generation to generate the state spaces for the experiments.

As the state spaces are being generated, you can check their progress by pressing the tab SSG Output. This should show a window similar to the one in <xr id="fig:ExampleSSG#2" />. Note that in the figure, Experiment 2’s state space was generated with 10,114 states. At any time, you can stop the state-space generator by pressing the Stop button. When the two state spaces have been generated, save them and close the editor with File → Save and File → Close.

After the state spaces have been generated, the next step is to use a transient solver to compute the unreliability measure based on these state spaces. Möbius provides several transient solvers, but this example will demonstrate the transient solver based on standard uniformization. To create a transient solver, select the Numerical tab and click the New button. This brings up a menu of solvers from which you can choose. Select Transient Solver from this menu and type the name trs in the text box Numerical Name. Next, press OK to select the state spaces on which the transient solver will be used. Select ssg from the menu presented and press OK to bring up the transient solver editor, which will be similar to the one shown in <xr id="fig:ExampleTRS#1" />. In the transient solver editor, enter 20.0 in the text box Time 1 to indicate that the time point of interest is the twentieth year. If there is a file name in the text box Output File Name, erase it so that the results are displayed on the screen. Finally, press the button Solve to start the transient solver.
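For readers unfamiliar with uniformization, the following is a minimal textbook-style sketch of the technique on a small dense generator matrix. It is not the Möbius solver: the function name `uniformize`, the dense matrix representation, and the fixed number of Poisson terms are all assumptions made for illustration. A production solver would typically choose the truncation point from an error bound and guard against underflow of the Poisson weights for large time horizons.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Transient state probabilities pi(t) of a CTMC with generator Q and
// initial distribution pi0, by uniformization truncated after `terms`
// Poisson terms.
Vec uniformize(const Mat& Q, const Vec& pi0, double t, int terms) {
    const std::size_t n = Q.size();
    assert(pi0.size() == n && t >= 0.0 && terms > 0);

    // Uniformization rate: the largest exit rate of any state.
    double rate = 0.0;
    for (std::size_t i = 0; i < n; ++i) rate = std::max(rate, -Q[i][i]);
    if (rate == 0.0) return pi0;  // no transitions are possible

    // Embedded discrete-time chain P = I + Q / rate.
    Mat P(n, Vec(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            P[i][j] = Q[i][j] / rate + (i == j ? 1.0 : 0.0);

    Vec v = pi0;                           // v = pi0 * P^k
    Vec result(n, 0.0);
    double weight = std::exp(-rate * t);   // Poisson probability of k = 0
    for (int k = 0; k < terms; ++k) {
        for (std::size_t j = 0; j < n; ++j) result[j] += weight * v[j];
        Vec next(n, 0.0);                  // next = v * P
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                next[j] += v[i] * P[i][j];
        v = next;
        weight *= rate * t / (k + 1);      // Poisson probability of k + 1
    }
    return result;
}
```

As a sanity check, a two-state chain with a single failure transition of rate r from a working state to an absorbing failed state has a failure probability of 1 − e^(−rt) at time t, and the sketch reproduces that closed form when enough terms are used.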

<figure id="fig:ExampleTRS">

<subfigure>ExampleTRSEditor.png</subfigure> <subfigure>ExampleTRSOutput.png</subfigure>
(a) Transient solver input parameters. (b) The output from running the transient solver.
</figure>
<xr id="fig:ExampleTRS" nolink />: Transient Solver editor.


After the transient solver has completed, you can display the results by pressing the Output tab. <xr id="fig:ExampleTRS#2" /> shows the results from running the transient solver.

Note that the figure shows the output for Experiment_2, which corresponds to the state space for the multiprocessor system that has 2 computers (that is, num_comp = 2). The unreliability by the twentieth year is 0.017465.

Simulation

The numerical results obtained through the transient solver may be verified using discrete-event simulation. First, open the simulation editor by double-clicking on sim under the Solver tab in the project editor. This will bring up the window shown in <xr id="fig:ex_simpanet#1" />. This window presents the parameters for the simulation. The Current Study text box specifies the child study, which in this case is vary_num_comp. Note that the Simulation Type selected is Terminating Simulation. That means that you will obtain a transient solution, which is appropriate for the length of time you are studying (20 years). Möbius knows which type of simulation you are running because in the reward editor, unreliability is specified under the Simulation tab to be an Instant of Time performance variable with a start time of 20.0 (see <xr id="fig:ex_rewardsim" />). In the Maximum Batches text box, type 100000. This number sets the maximum number of batches to run for each experiment. The actual number of batches run depends on how long it takes the results to converge to the specified confidence interval. In this example, the default number of batches is not enough to allow the results to converge. The remaining default options (random number options, build type, run name, and so forth) should suffice for now. For more information on these options, consult Section 3 of Solving Models.

<figure id="fig:ex_simpanet">

<subfigure>Ex simparam.png</subfigure> <subfigure>Ex simnetwork.png</subfigure>
(a) Simulation parameters for Multi-Proc. (b) Network setup for simulation.
</figure>
<xr id="fig:ex_simpanet" nolink />: Simulation input and network configuration windows.


The Network Setup tab shown in <xr id="fig:ex_simpanet#2" /> displays the list of machines available for running a distributed simulation. Your local machine should be listed on the right under Selected Systems. Any other machines available will be listed under Available Systems. To add a network machine to the list of available systems, click the Edit Machine/Group Info button to bring up the Network Machine and Group Settings dialog. To add an available machine to a distributed simulation, select it in the left list and click the > button to move it to the Selected Systems list.

Click the Run Simulation tab and then the Start Simulation button. This begins the process of compiling the models and linking with the Möbius simulator libraries. The window (which appears in <xr id="fig:ex_simrunin#1" />) displays the status of the running simulation.

<figure id="fig:ex_simrunin">

<subfigure>Ex simrun.png</subfigure> <subfigure>Ex siminfo.png</subfigure>
(a) Running simulation. (b) Simulation results.
</figure>
<xr id="fig:ex_simrunin" nolink />: Simulation status and output windows.


The output of the simulation can be found under the Simulation Info tab shown in <xr id="fig:ex_simrunin#2" />. The table at the top of the dialog shows the status of each active experiment (i.e., whether it is finished, running, or waiting to run), as well as the number of CPUs dedicated to the simulation and the number of batches completed. Click on any experiment to view its status in more detail below. A progress bar, indicating how near the experiment is to completion, is displayed, along with the elapsed running time. The mean value for the performance variable unreliability is shown, along with its confidence interval, as the simulation progresses. A number in red indicates that the value has not yet converged, while a blue number indicates that the value has converged to the specified confidence interval. Finally, click the Show Results button to bring up a window containing more detailed simulation results. The mean unreliability at the end of the 20-year mission time is 0.016895 for Experiment 2, which corresponds to the multiprocessor system consisting of two computers. Observe that the unreliability obtained through numerical solution (0.017465) falls within the confidence interval for the mean obtained through simulation, and thus the results from the two solution techniques agree.
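The comparison between the two solution techniques amounts to checking whether a value lies inside a confidence interval. The sketch below shows one common way to build such an interval from batch results, using a normal approximation at the 95% level. How Möbius computes its intervals is not described here, so the method, the struct `Interval`, and both function names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Point estimate with a symmetric half-width.
struct Interval {
    double mean;
    double half_width;
};

// Normal-approximation 95% confidence interval for the mean of the
// per-batch observations (for example, 0 or 1 for "system failed by t").
Interval confidence_interval_95(const std::vector<double>& batches) {
    assert(batches.size() >= 2);
    const double n = static_cast<double>(batches.size());
    double sum = 0.0;
    for (double x : batches) sum += x;
    const double mean = sum / n;
    double squares = 0.0;
    for (double x : batches) squares += (x - mean) * (x - mean);
    const double variance = squares / (n - 1.0);  // sample variance
    return {mean, 1.96 * std::sqrt(variance / n)};
}

// True if value lies within mean +/- half_width.
bool contains(const Interval& ci, double value) {
    return std::fabs(value - ci.mean) <= ci.half_width;
}
```

Given the batch observations from a simulation run, `contains(confidence_interval_95(batches), 0.017465)` would express the check described above for Experiment 2.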

References

  1. D. Lee, J. Abraham, D. Rennels, and G. Gilley. A numerical technique for the evaluation of large, closed fault-tolerant systems. In Dependable Computing for Critical Applications, pages 95–114. Springer-Verlag, Wien, 1992.