Reliability, Maintenance & Managing Risk Conference

July 27-28, 2023
Minneapolis, Minnesota
R&M Resilience to a Rapidly Changing Environment

Keynote Speaker – Shelley Ford

Product Safety Officer, Collins Aerospace

Professional Background

Before joining Raytheon, Shelley spent her career working at Kennedy Space Center supporting NASA
programs, holding positions of increasing responsibility:
• Lead airframe engineer on the Space Shuttle Columbia
• Vehicle manager of the Space Shuttle
• Chief Patent Counsel within the Office of the Chief Counsel
• Key representative on a multi-billion-dollar Commercial Crew Program contract certifying new commercial launch vehicles and spacecraft to carry astronauts from U.S. soil to the International Space Station.


• She is the recipient of several NASA awards including a Silver Achievement Medal and a Silver
Snoopy, given by astronauts to less than 1% of the NASA workforce each year.
• Shelley has also received the Society of Women Engineers Distinguished New Engineer Award.


Education

• Juris Doctor from Barry University School of Law
• Master of Science in space science from Florida Institute of Technology
• Bachelor of Science in mathematics (with minors in physics and space studies) from Bemidji State University.



Session A1 – Design for Reliability: 

Is it New, Unique, or Difficult? – Daniel Conrad           

Abstract: Presentation on the use of New, Unique, Difficult items (NUDs) to manage risks and prioritize work within the Design for Reliability (DFR) process.  A review of the NUD concept will be covered, along with an example of how NUDs link and provide focus to the DFR processes (risk management, FMEAs, reliability test development and execution) to minimize resources and maximize robustness and risk mitigation within the design project.

Making the most from reliability experiments via multifactor testing – Mark Anderson

Abstract: This talk provides a briefing on design of experiments (DOE) for rapid and effective screening, characterization and optimization of factors affecting reliability. It covers a broad range of DOE, including split plots for hard-to-change factors, response surface methods (RSM) for achieving high-level robust performance, and optimal design. Anyone tasked with improving reliability will benefit by learning how to make the most from every experiment via multifactor testing.  

Apply Design for Reliability Process to Improve Medical Device Robustness – Brenda Ruan

Abstract: Design for Reliability (DFR) is a powerful approach for product design improvement.  When implemented proactively from a product’s early development phase, it can help ensure the product meets both internal and customer reliability objectives.  For over 75 years, Varex Imaging Corporation has been producing medical and industrial X-ray imaging components.  Effectively implementing a DFR process is an important key to our success.

     Driven by the reliability engineer, a key member of a cross-functional design team, the DFR approach has been introduced and widely practiced to ensure that newly developed products comply with all reliability requirements.  Implemented properly, DFR offers fundamental insights into the total design cycle and helps get the whole design team involved in better understanding product functions, design rationale, interactions of product components, and how the product will be used in medical systems.  In this presentation, the DFR process will be discussed first, followed by a mapping from the common DFR steps to the X-ray detector design process, to offer an understanding of why, when, and how to apply the DFR process during the product development cycle.

     In this talk, an X-ray detector design will be used as an example to walk through the DFR implementation, including how to set up system reliability goals, reliability analysis, reliability growth testing and accelerated life testing, a family approach to sample size determination, and test data analysis.

     An analysis of sanitized post-marketing field data will be presented to demonstrate that a threefold improvement in B10 life has been achieved on a hand-held portable radiography detector by adopting the DFR process.
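For readers unfamiliar with B10 life: it is the time by which 10% of units are expected to fail. Under a Weibull life model it follows directly from the shape and scale parameters; the sketch below uses illustrative parameters, not Varex data.

```python
import math

def weibull_b10(beta, eta):
    """Time by which 10% of units are expected to fail:
    solve F(t) = 1 - exp(-(t/eta)^beta) = 0.10 for t."""
    return eta * (-math.log(0.90)) ** (1.0 / beta)

# Hypothetical before/after parameters for a detector subassembly.
# With an unchanged shape parameter, tripling eta triples B10.
b10_before = weibull_b10(beta=1.8, eta=4000.0)
b10_after = weibull_b10(beta=1.8, eta=12000.0)
print(round(b10_before, 1), round(b10_after, 1), round(b10_after / b10_before, 2))
```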

Session B1 – FMEA: 

Harnessing the Power of Model-Based Systems Engineering for Effective and Automated Reliability and Safety Analysis – Soon Keat Ong 

Abstract: The development of Model-Based Systems Engineering (MBSE) has brought about new techniques for evaluating system reliability and safety. The reliability of a system can be viewed from the perspective of a success path, represented by a Reliability Block Diagram (RBD), or from failure analysis using Failure Mode and Effects Analysis (FMEA) and Fault Tree Analysis (FTA). These are key methods and artifacts that complement one another to ensure the reliability, safety, and fault tolerance of a complex system. However, they often take a back seat during the design process as they can be time-consuming and require significant effort to perform comprehensively. This presentation will focus on the Model-Based Fault Management Engineering (MBFME) methodology and associated tools we have developed for modeling failure information in the Systems Modeling Language (SysML), allowing rapid and automated generation of these safety and reliability products.  This enables the results to be fed back to systems engineers in a timely manner, leveraging this information to support rapid design iteration, especially early in the design cycle.  This methodology uses a collection of profiles and libraries to incorporate reliability, failure, and fault propagation information into the system model. This information is then extracted using a set of in-house developed MagicDraw/Cameo plugins to automatically generate FMEA, FTA, and RBD items in graphical and XML formats. In this presentation, we will provide a high-level overview of the methodology, explain how it facilitates reliability engineering, and demonstrate a few examples of its implementation. We will also discuss our roadmap to adapt our methodology and tools using the Object Management Group (OMG)’s Risk Analysis and Assessment Modeling Language (RAAML) specification and identify the synergies between the two concepts.
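For context, the RBD artifacts such a toolchain generates ultimately encode simple series/parallel reliability algebra. A minimal sketch (block reliabilities hypothetical, not from the presentation):

```python
def series(*rs):
    """Series RBD: the path succeeds only if every block works."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    """Parallel (redundant) RBD: the path succeeds if at least one block works."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Hypothetical system: two redundant power supplies feeding one controller
r_system = series(parallel(0.95, 0.95), 0.99)
print(round(r_system, 4))
```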

Enhancing Constant Temperature Reliability of Gas Water Heaters through FMEA Analysis and Targeted Improvements: A Case Study of Midea Group – Jian Gou

Abstract: Midea Group is a leading global manufacturer of gas water heaters, providing reliable and high-quality products to millions of households worldwide. However, the increasing demand for our products, particularly in harsh environments, has posed challenges in maintaining high reliability and stable temperature output, especially under windy conditions. This can cause discomfort and safety risks for users.

To tackle these challenges, we conducted a product redesign and engineering project aimed at enhancing the constant temperature reliability of our gas water heater products. Using the FMEA analysis method, we carried out a comprehensive analysis of the failure modes and causes of gas water heaters delivering constant temperatures. Our study identified the inadequate applicability of the combustion system and control algorithm in dynamic environments, such as gas source and water flow, as the primary factors affecting temperature consistency.

Based on our failure analysis, we proposed targeted improvement measures, including optimization of the combustion system, increasing the proportional valve margin, and adopting adaptive control technology. We then validated the effectiveness of these measures through experiments and user feedback, which demonstrated a significant improvement in our products’ constant temperature reliability. As a result of these improvements, our products can now meet market requirements and the challenges of the near-term future, until the next reliability challenge emerges.

Overall, this project showcases our commitment to addressing real-world reliability challenges and highlights the value of targeted improvement measures in enhancing product reliability and quality.

First and Beyond First Time Field Failure Analysis for Effective Field Service Planning – Sarath Jayatilleka          

Abstract: The lowest part reliability determines the overall system reliability. Therefore, field-replaced parts can bring down the overall system reliability. First-time failed parts are expected to have been assembled under controlled environments, such as on an assembly line with the right tools, ergonomics, and accessibility, assuming the right skill sets were available. On the other hand, field-replaced parts are assembled after a field diagnosis, then removed and replaced by field service personnel in varying field settings. The field environment is markedly different from the assembly line: it may not provide the comfort of the assembly line for service personnel, and accessibility is often a challenge. Therefore, a first-time field failure of a component or subsystem in a repairable system is markedly different from the failure of a once-repaired or once-replaced component or subsystem.

Modern systems have complex configurations with more components in limited spaces, losing service accessibility within the system. Large systems may be installed in places like the tops of high-rise buildings or in outer space, with difficult and expensive accessibility. Understanding that once-replaced parts have lower reliability than the pristine part population can lead to better design reliability and operational maintenance planning.

This paper presents a method of part reliability analysis that treats first-time failures separately from second-time failures and beyond. The key concept is to isolate the once-and-more-than-once-failed population from the pristine-population failures. In most cases, it was found that the once-and-more-than-once-failed population can be treated as one population that is different from the pristine population. This paper also provides elaborated methods and analysis with field examples in order to highlight the benefits that can be incorporated into design for reliability and serviceability, which in turn results in increased equipment availability.
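As a sketch of the underlying idea, the pristine and once-replaced populations can be fitted separately and compared; a simple median-rank Weibull fit suffices for illustration (all failure times below are hypothetical):

```python
import math

def weibull_fit(times):
    """Least-squares fit on Weibull probability paper using median ranks
    (Benard's approximation). Returns (beta, eta). Complete data only."""
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(sorted(times), start=1):
        f = (i - 0.3) / (n + 0.4)                 # median rank
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))   # linearized Weibull CDF
    mx = sum(xs) / n
    my = sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    eta = math.exp(mx - my / beta)
    return beta, eta

# Hypothetical times-to-failure (hours): pristine parts vs. field-replaced parts
pristine = [1400, 1900, 2300, 2600, 3100]
replaced = [500, 700, 900, 1200, 1500]
b1, e1 = weibull_fit(pristine)
b2, e2 = weibull_fit(replaced)
print(e2 < e1)  # replaced population shows a lower characteristic life
```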

Session A2 – Risk

Risk Management for Uninterruptible Power Supply Systems used in Server Farm Infrastructure – Gregory Zinkel

Abstract: Data centers around the world rely on uninterruptible power supplies (UPS) to ensure there is no data loss in the event of a power failure. Lengthy periods without power, such as failed backup restart procedures, may result in large data losses. The UPS systems must ramp up megawatts of power immediately for a short period, measured in seconds, to prevent data loss. The central part of a UPS system is usually diesel generators, which are initially supplemented by megawatts provided by battery systems. The batteries engage as a bridge until the generators come online. The battery systems typically use VRLA (lead acid), LIB (lithium ion), LFP (lithium iron phosphate), and more recently, sodium batteries. Each battery type has its own inherent problems, dangers, and reliability issues. For a reliable system, strict parameters must be established and maintained to avoid catastrophic losses.
Various inherent risk factors and their possible remedies will be discussed, including failure types for the components of the UPS systems. Server farm companies can employ several techniques to reduce their risk, including redundancy, FMEA and strict supplier quality practices in order to maintain system performance.
Risk reduction steps that manufacturers of the UPS components or systems can employ are discussed, including QC methods, PFMEA, and reliability testing using statistical sampling.
During battery product development, risk reduction can be achieved through many facets including QA, stage gate criteria, DFMEA, excellent release and engineering processes, systems approaches between the chemical, electro-mechanical and software components, and complying with UL and appropriate standards from ISO and IEEE, for example. Of course, a major risk reduction technique in product development is extensive reliability/ durability testing done under various potential environments.

Effective Use of FDA’s TPLC Database – Dan O’Leary 

Abstract: FDA has an information-filled medical device database called the Total Product Life Cycle, TPLC. Organized by device Product Code, it provides three major groups of information.

The first group is the list of “marketing authorizations” for the devices. It lists each company and provides links to the submission and summary information.

The second group is the adverse event data in the Manufacturer and User Facility Device Experience, MAUDE, database. It provides counts of the number of reports by year and includes counts of device problems and patient problems using the coding information from the reports. It includes links to the individual reports.

The third group is the recall data, organized by year and recall class, with a link to each recall report.

The database is an underutilized tool that can support analysis in a number of areas. For medical device risk management, patient problems can help identify hazards, hazardous situations, and harms. For reliability, device problems can help identify failures and modes. For post-market surveillance, the information can provide comparative analysis and support improvement projects.

The presentation discusses each data group and illustrates its application to medical device improvement.

Assessment of Oil Leak Containment Strategies in Refineries in Tropical Countries, using Quantitative Risk Assessment and ALARP (as Low as Reasonably Practicable) Principle – Muhamed Jamil Khan, Nurul Amin

Abstract: Oil spills occur in oil refining installations, and spill rates vary depending on the inherent design conditions of the facility and its operation and maintenance practices. The majority of these spills end up on the facility pavement. Part of a spill will be cleaned up using oil spill kits, and part will be washed away during rain or with the use of service/fire water, ultimately polluting waters if not contained. Addressing this issue should consider the need to contain the oil, separate clean water from contaminated water, and design systems to prevent the contaminated water from entering public water sources by eventually diverting it to treatment facilities.

In designing such systems, two major strategies can be adopted: (a) an end-of-pipe strategy with very large catchment ponds to contain contaminated water, followed by deploying a vacuum truck to remove the surface oil, or (b) a control-at-source strategy with a smaller Controlled Discharge Facility (CDF) to separate oil and water closer to the source and treat the separated contaminated water after any spill-and-rain/wash event. Each solution has advantages: notably, strategy (a) manages contaminated water in a centralized location away from the source, while strategy (b) manages contaminated water closer to the source. However, (a) is typically less sophisticated and may require a larger footprint, while (b) is typically equipped with oil-water separation technology and requires several smaller systems located at several points within a refinery.

This paper examines the best strategy and configuration to achieve the objective of separating clean and oil-contaminated water. The environment selected is a tropical region with high rain intensity. The methodology employed is quantitative risk assessment through geometric and behavior modelling, and simulation using the Monte Carlo method and Computational Fluid Dynamics (CFD). The parameters considered for the tests are realistic oil release from potential leak sources, leak clean-up effectiveness, rain intensity, paved area subjected to rain and leaks, rain frequency and duration, and basin/pond size.
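The Monte Carlo element of such an assessment can be illustrated with a deliberately simplified model; all distributions and volumes below are placeholders, not values from the study:

```python
import random

def simulate_overflow(pond_m3, years=10000, seed=1):
    """Crude Monte Carlo: each simulated year, a random spill volume plus
    rain runoff must fit in the containment pond. Returns the estimated
    annual overflow probability. Distributions are illustrative only."""
    rng = random.Random(seed)
    overflows = 0
    for _ in range(years):
        spill = rng.expovariate(1.0 / 20.0)   # mean 20 m3 of spilled oil/water
        runoff = rng.gauss(150.0, 40.0)       # rain runoff from paved area, m3
        if spill + max(runoff, 0.0) > pond_m3:
            overflows += 1
    return overflows / years

# A larger pond should overflow less often
print(simulate_overflow(300.0) < simulate_overflow(150.0))
```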

From the simulations and risk assessments (i.e., ALARP [as Low as Reasonably Practicable] demonstration), it was concluded that the best approach is to use a combination of CDF to trap oil, penstock for high viscosity oil, and adequate bund wall for very large potential leaks. Having large ponds to contain spills is deemed ineffective, as the surface area required is very large, posing challenges to empty-out and manage contaminated water. Furthermore, there is a risk of fire from nearby hot-work activities, contacting hydrocarbon vapor above water surface or pond walls.    

Session B2 – PHM

Improving Component Reliability and Human Productivity for Offshore Wind Turbines Operations and Maintenance – Yashwant Sinha               

Abstract: Offshore Wind Turbines (OWT) are assets of significant value. However, although the reliability and availability of OWT components and assemblies have improved over the years, much still needs to be done to reduce the high Levelized Cost of Energy (LCOE) of offshore wind farms and control the risk of hazards linked with offshore-based operations and maintenance (O&M) works. This work discusses a strategy using a modified FMECA model that can improve the Reliability, Availability, and Maintainability (RAM) of OWT and improve human productivity in offshore-based maintenance work. Both of these objectives can potentially reduce LCOE and control the risk of hazards associated with offshore-based O&M works. Additionally, this modified FMECA model can help design predictive maintenance and its financial management, which can help improve decision-making for O&M works.

In-Service Wind Turbine Blade Condition Monitoring – Affan Khan         

Abstract: Wind turbine blades occasionally experience fatigue cracking in the root region (the area where the blade is connected to the hub). The cracking can occur due to static loads that the wind turbine experiences, and the cracks can be propagated by the movement of the wind turbine rotor. A crack can increase in size until the blade itself becomes detached from the wind turbine; this is known as catastrophic failure. Its occurrence is very costly to the wind turbine operator, and it is also a danger to anything or anyone around the turbine.  It is for this reason that catastrophic failure must be avoided at all costs, and so an efficient condition monitoring system is desired that can monitor the onset as well as the propagation of cracks in a wind turbine blade.

Systematic Approach to Prioritizing Durability Improvements – Hemant Urdhwareshe     

Abstract: Most manufacturers of consumer durable products have a good amount of data on failures within the warranty period, although the warranty period varies between companies. Warranty failures are typically attributable to manufacturing defects. Analysis of such failures within warranty can be done using the Nevada chart format and life data analysis.
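A Nevada chart is essentially a shipments-by-returns matrix. A minimal sketch of flattening one into life-data records (the chart values below are hypothetical):

```python
# Hypothetical Nevada chart: rows = shipment month, columns = return month.
shipments = {"Jan": 1000, "Feb": 1200}
returns = {
    "Jan": {"Feb": 3, "Mar": 5},
    "Feb": {"Mar": 4, "Apr": 6},
}
month_index = {m: i for i, m in enumerate(["Jan", "Feb", "Mar", "Apr"])}

def to_failure_times(shipments, returns):
    """Flatten a Nevada chart into (months-in-service, count) failure
    records plus right-censored survivor counts per shipment cohort."""
    failures, survivors = [], []
    for ship_m, qty in shipments.items():
        failed = 0
        for ret_m, n in returns.get(ship_m, {}).items():
            age = month_index[ret_m] - month_index[ship_m]
            failures.append((age, n))
            failed += n
        survivors.append(qty - failed)
    return failures, survivors

f, s = to_failure_times(shipments, returns)
print(f, s)
```

The (age, count) pairs feed directly into life data analysis, while the survivor counts enter as right-censored observations.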

However, failures occurring after the warranty period are usually not captured, as there is typically no mechanism to do so. Therefore, additional effort is needed to collect and analyze data on post-warranty failures. Another important aspect of such data is that quite a few of these failures are likely to be related to the design of the product. Analysis of this data can reveal important conclusions that could lead to design improvement ideas and projects. These projects can be prioritized based on the analysis of post-warranty life data.

In this presentation, I will share a case study based on my personal experience. The case study will highlight:

  • Planning and collection of data of failures occurring after warranty.
  • Analysis and interpretation of data with multiple failure modes.
  • How B10 life was estimated for each failure mode.
  • Prioritization of improvement projects.
  • Process deployed for these improvement projects.
  • Benefits and lessons learnt.

Session A3 – Reliability Prediction and Machine Learning

Surface-Controlled Subsurface Safety Valve (SCSSV) Reliability Predictive Analytics – Nurul Aizad Md Safian   

Abstract: A Surface-Controlled Subsurface Safety Valve (SCSSV) is designed to automatically shut in the flow of a well in the event surface controls fail or surface equipment becomes damaged. It is identified as a Safety Critical Element (SCE) and assigned testing to ensure that the device functions on demand. Failure of an SCSSV leads to loss of the well-integrity downhole barrier (risk escalation for well integrity) and unplanned production deferment (UPD). A Malaysian asset lost a significant amount of production in 2020 due to SCSSV system failure. An SCSSV Reliability Improvement initiative was embarked upon, starting with SCSSV performance analysis, which includes measuring SCSSV reliability through Mean Time Between Failures (MTBF) monitoring, understanding equipment failure behavior and patterns through failure distribution and probability calculation, and identifying causes of low MTBF through Failure Mode and Effects Analysis (FMEA). Analysis of the SCSSV failure population found that jammed-closed/open, leakage, and tripped failures were caused mainly by hydraulic control line issues. An Inspection, Testing and Preventive Maintenance (ITPM) plan was developed through the FMEA methodology, and performing three (3)-monthly hydraulic oil analysis was found to be one way to improve control line fluid quality over time. Enterprise data was then extracted and churned in a reliability predictive analytics model able to establish a predicted list of wells that will fail within a period, based on PI data (PCP (kPa), THP (kPa) and THT (°C)) and UPD data. This presentation will cover the SCSSV Reliability Improvement Program journey, the analysis performed, and the results of SCSSV failure prediction modelling.
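As a minimal illustration of the MTBF monitoring step described above (all dates hypothetical), observed MTBF over a window can be computed as operating time divided by failure count:

```python
from datetime import date

def mtbf_days(failure_dates, start, end):
    """Observed MTBF over a monitoring window: operating days divided by
    failure count. Assumes downtime is negligible relative to the window
    (illustrative simplification)."""
    n = len(failure_dates)
    if n == 0:
        return float("inf")
    return (end - start).days / n

# Hypothetical SCSSV failure log for one well over a two-year window
fails = [date(2020, 3, 2), date(2020, 7, 18), date(2021, 1, 5)]
print(mtbf_days(fails, date(2020, 1, 1), date(2021, 12, 31)))
```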

Equipment Maintenance Reliability Strategy Review via Statistical and Cost Approach – Ping Hun Kok 

Abstract: Equipment Reliability Strategy (ERS) Review is used for maintenance optimization using a quantitative and statistical risk-based approach (i.e., incorporating cost of maintenance and cost of failure) to achieve As Low as Reasonably Practicable (ALARP) risk levels. It covers component- or equipment-level analyses for the Inspection, Test and Preventive Maintenance (ITPM) plan for all physical assets, from subsurface to surface utilities.

Methods, Procedures, Processes. The ERS review process involves the following: (1) data identification and gathering, with the objective of selecting reliability data (i.e., testing data) and cost data (i.e., production and repair cost); (2) data clustering (homogeneous data with the same failure mode) and mining, with the objective of ensuring data quality (accurate and representative data); (3) degradation analysis, with the objective of obtaining times-to-failure of components/equipment; (4) Life Data Analysis (LDA) to obtain failure characteristics (i.e., Weibull analysis); (5) consequence and cost input to evaluate the impact of failure and subsequently obtain the optimum maintenance interval; (6) review with the technical authority; (7) approval with relevant stakeholders; and (8) implementation of the agreed optimum maintenance interval.
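The cost trade-off step — balancing the cost of planned maintenance against the cost of failure — can be sketched as a classic age-replacement model; all parameters below are illustrative, not values from this review:

```python
import math

def _frange(a, b, step):
    """Yield a, a+step, ... up to (but not including) b."""
    t = a
    while t < b:
        yield t
        t += step

def cost_rate(T, beta, eta, c_pm, c_fail, dt=1.0):
    """Age-replacement policy: expected cost per unit time when replacing
    preventively at age T, under Weibull R(t) = exp(-(t/eta)^beta).
    Mean cycle length is the numeric integral of R(t) from 0 to T."""
    R = lambda t: math.exp(-((t / eta) ** beta))
    mean_cycle = sum(R(t) * dt for t in _frange(0.0, T, dt))
    exp_cost = c_pm * R(T) + c_fail * (1.0 - R(T))
    return exp_cost / mean_cycle

# Hypothetical wear-out failure (beta > 1) where a failure costs 10x a planned PM
best_T = min(range(100, 3000, 50),
             key=lambda T: cost_rate(T, 2.5, 2000.0, 1.0, 10.0))
print(best_T)
```

With wear-out behavior (beta > 1) and failures costlier than planned maintenance, the cost rate has an interior minimum, which is the optimum PM interval.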

Result. The review has been performed in one of the operating units in Malaysia, which focuses on gas transmission, particularly electrical equipment such as switchgear, transformers, and switchboards. In summary, statistical analysis showed (1) high Mean Time Between Failures (MTBF), of at least 5,000 days; (2) an exponential failure pattern, which depicts random failures; and (3) mitigated risk at an acceptable level.

Observation. Maintenance intervals were extended as follows: (1) switchboard major maintenance from 5-yearly to 10-yearly; (2) transformer major maintenance from 3-yearly to 6-yearly; (3) tan delta testing from 3-yearly to 6-yearly.

Conclusion. (1) The review has generated cost savings from the extended maintenance intervals amounting to close to 20% of the annual electrical maintenance cost. (2) It has been replicated in other operating units, with expansion to other engineering disciplines.

Novel/Additional Information. The above analysis can be used as more accurate predictive analysis in supporting, but not limited to, (a) self-regulated turnarounds, (b) prolonged intervals between shutdowns, and (c) maintenance cost optimization.

An Improved Exact Goodness-of-Fit Test for Reliability Data with Censored Observations – Trevor Craney 

Abstract: When confronted with fitting a distribution to reliability data, one typically encounters a mixture of failure times and right-censored observations. Typical approaches to assessing the goodness of the fit of the distribution to the data include theory, visual comparison on a probability plot, or analytical tests such as the Kolmogorov-Smirnov test or the Anderson-Darling test. This presentation will show the inadequacy of these methods to assess goodness-of-fit (GOF) by revealing the shortcomings of each. In particular, the analytical tests will be shown to be better suited for use as outlier tests than as GOF tests, and the typical deficiencies of visual assessment on a probability plot will be reviewed. An exact test, including its construction and performance, will be described and demonstrated. It will be shown to be an exact test for GOF by comparing it to an observed measure of interest with reliability data.
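For reference, the Kolmogorov-Smirnov statistic critiqued above is the maximum distance between the empirical CDF and the candidate CDF. A minimal sketch for a complete (uncensored) sample — the sample and candidate distribution below are hypothetical:

```python
import math

def ks_statistic(times, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF of complete
    (uncensored) failure times and a candidate fitted CDF. Checks the gap
    on both sides of each step of the empirical CDF."""
    n = len(times)
    d = 0.0
    for i, t in enumerate(sorted(times), start=1):
        f = cdf(t)
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d

# Hypothetical complete sample versus a candidate Weibull(beta=2, eta=100)
cdf = lambda t: 1.0 - math.exp(-((t / 100.0) ** 2.0))
sample = [35, 60, 80, 95, 110, 130, 150, 170]
print(round(ks_statistic(sample, cdf), 3))
```

Note that with censored observations this plain form no longer applies, which is part of the motivation for the exact test described in the talk.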

Session B3 – Resilience and Maintenance

Predictive Resilience Modeling – Lance Fiondella

Abstract: Resilience is the ability of a system to respond, absorb, adapt, and recover from a disruptive event. Dozens of metrics to quantify resilience have been proposed in the literature. However, fewer studies have proposed models to predict these metrics or the time at which a system will be restored to its nominal performance level after experiencing degradation. This talk presents alternative approaches to model and predict performance and resilience metrics with elementary techniques from reliability engineering and statistics. We will also present a free and open-source tool developed to apply the models without requiring detailed understanding of the underlying mathematics, enabling users to focus on resilience assessments in their day-to-day work.

Bayesian Multimodal Models for Risk Analyses of Low-Probability High-Consequence Events – Arda Vanli     

Abstract: This talk will present a set of Bayesian model updating methodologies for quantification of uncertainty in multimodal models for estimating failure probabilities in rare hazard events. Specifically, a two-stage Bayesian regression model is proposed to fuse an analytical capacity model with experimentally observed capacity data for residential building roof systems under severe wind loading. The ultimate goals are to construct fragility models accounting for uncertainties due to model inadequacy (epistemic uncertainty) and lack of experimental data (aleatory uncertainty) in estimating failure (exceedance) probabilities and number of damaged buildings in building portfolios. The proposed approach is illustrated on a case study involving a sample residential building portfolio under scenario hurricanes to compare the exceedance probability and aggregate expected loss to determine the most cost-effective wind mitigation options.              

Accelerated life testing of thermoelectric components in automotive applications –  Julio Pulido

Abstract: Innovative cooling technology is evaluated using accelerated degradation techniques. If the thermoelectric component is damaged, product performance will decrease rapidly until failure. The challenge is that new technologies lack the historical data that can guide reliability practitioners in selecting an approach.
In today’s competitive environment, accelerated life testing is becoming a competitive advantage when time from the conceptual stage to the final product development needs to be competitively small (project costs and development time) to succeed. Using accelerated life testing techniques for mechanical and structural applications has substantial challenges when defining the loading and the product life to represent actual field performance. The issue is how new technologies should be evaluated to determine which life stress relationship better represents the component performance.
Such common problems and helpful strategies using accelerated life testing are presented for faster planning of structural and mechanical testing of thermoelectric components used for cooling automotive or optoelectronic applications.
The paper reviews each element of the test planning process and how different life-stress relationships can be incorporated to evaluate the time-to-failure calculation effectively and efficiently. The paper covers several testing techniques, such as DOE combined with accelerated testing. It shows successes, as well as pitfalls that can be avoided when the correct tools are applied in a timely manner.
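One common building block in such accelerated test plans is the Arrhenius life-stress relationship for temperature. A minimal sketch (the activation energy and temperatures below are hypothetical, not from the paper):

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between a use temperature and an
    elevated test temperature, for activation energy ea_ev in eV."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use - 1.0 / t_stress))

# Hypothetical thermoelectric module: Ea = 0.7 eV, use at 55 C, test at 105 C
af = arrhenius_af(0.7, 55.0, 105.0)
print(round(af, 1))
```

Each hour at the stress temperature then counts as roughly `af` hours at the use temperature, under the (strong) assumption that the same failure mechanism dominates at both temperatures.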


Session A4 – Lean Manufacturing and Reliability

Transformational Leadership & Lean Strategy – Sean Nobari 

Abstract: What is required to successfully implement and sustain Lean Manufacturing principles? In recent years many organizations have faced continued challenges in implementing Lean Manufacturing, and many others have failed to sustain the implementation process. Successfully implementing and sustaining Lean Manufacturing requires:

  • Creating a long-range vision for change, an environment conducive to change, and team-based manufacturing,
  • Developing methods to identify and communicate clear goals and objectives to all levels of the organization,
  • Continually educating, leading, coaching, and empowering team members to lead the change, and
  • Developing an organizational assessment and feedback system to ensure continued commitment to change and to improve organizational performance at all levels of the organization.

The main focus of this presentation is to present “Transformational Leadership and Lean Strategy” and what is required to sustain the process in any organization.

Quality and reliability tools for capacity planning and costing –  Edward Jaeck 

Abstract: In this presentation we will present a derivative model of Overall Equipment Effectiveness (OEE) and demonstrate a cost modeling method based on historical data and activity-based costing.

During COVID-19, a lack of operators forced many companies to go lines-down. In this presentation, we will show how cell-level availability can be modeled as a function of sub-systems and operators, using a concept from reliability network theory. We then show how, once the complex math is complete, cell performance can be simulated using a binomial distribution. This high-level model is very useful when challenging capacity assumptions asserted by suppliers in a complex supply chain.  The second part of the presentation will cover the use of Monte Carlo simulation to combine activity-based costing and historical data to produce a variety of cost estimates.
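The availability portion of such a model can be sketched as a series structure (sub-systems plus operator) with an hourly up/down Bernoulli draw, whose sum is binomial; all availabilities below are hypothetical:

```python
import random

def cell_availability(a_subsystems, a_operator):
    """Series model: the cell runs only if every sub-system and the
    operator are simultaneously available."""
    a = a_operator
    for s in a_subsystems:
        a *= s
    return a

def simulate_output(a_cell, units_per_hour, hours, seed=7):
    """Binomial approximation: each hour the cell is independently
    either up (probability a_cell) or down."""
    rng = random.Random(seed)
    up_hours = sum(1 for _ in range(hours) if rng.random() < a_cell)
    return up_hours * units_per_hour

# Hypothetical cell: three sub-systems at 98% and an operator at 95%
a = cell_availability([0.98, 0.98, 0.98], 0.95)
print(round(a, 3), simulate_output(a, units_per_hour=40, hours=160))
```

Repeating the simulation over many months gives a distribution of monthly output, which is useful for stress-testing supplier capacity claims.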

Beyond the Algorithms: Strategic Implementation of AI/ML – Mindy R. Hotchkiss

Abstract: This talk highlights Industry 4.0 and machine learning related activities at Aerojet Rocketdyne (AR), which develops and produces propulsion systems and energetics for the space and defense arenas. AR has been working to leverage Artificial Intelligence and Machine Learning (AI/ML) methods and tools by linking existing data and engineering processes and enhancing digital infrastructure. Specific AI/ML solutions can be utilized to modernize work practices and enable new programs and projects. This presents opportunities to reduce design cycle time, optimize production processes, and reduce costs, if implemented strategically. This presentation also discusses challenges related to developing effective AI solutions, as well as implications and practical considerations associated with implementing AI/ML-enabled systems, including impacts on systems and reliability engineering.
