How I Learned to Stop Worrying and Love the Bathtub Curve, Part I
Computer-based Product Testing and Reliability Series Introduction
This article is the first in a series that will explore the topics of testing and reliability control, specifically oriented toward computer-based products. If you find value in these articles, you are welcome to subscribe to the website's mailing list and follow us on LinkedIn for news and more content on these topics.
In this series you will find theory along with examples from the field - real-world scars gathered throughout the years, brought in to demonstrate the main points of the articles.
All examples are censored, with no classified information or customer identification unless specifically approved. The examples also include minor technical disinformation to further protect against identification. They are brought in to clarify concepts, in the hope that real-world examples will bring additional value to people in similar situations.
Theoretical questions and general discussions are more than welcome, but please refrain from sharing anything that may expose non-disclosed information.
It should be noted that the examples come from a vast range of industries, project scales and companies. As you will see, they are widely applicable, as neither scale nor simplicity makes a company immune to the problems inherent in computer system integration. We have witnessed pitfalls at companies ranging from Forbes 500 corporations to seed-stage startups, in projects from tens of thousands of IPCs down to runs of just ten cybersecurity appliances, in full-scale datacenter deployments just the same as in the simplest SBCs inside medical appliances.

Bathtub Curve Introduction
Anyone responsible on any level for a computer-based product should familiarize themselves with the bathtub curve representation of failure rate. Its general guidelines are easy to grasp and implement, and there is no need to get too technical to understand its significant business implications.
The bathtub curve is a simple visual representation of a widely applicable failure rate function commonly used in reliability engineering of electronic devices. It is often overlooked in larger integrated systems such as computer-based OEM products and enterprise-grade IT deployments, which have been our home turf for over a decade now.
It is overlooked for various reasons. The most common ones we have encountered are: insufficient involvement of reliability or quality teams in the process; lack of expertise in the peculiarities of complex system integration; and the false assumption that since the sub-components have already been tested independently, often by top-of-the-line vendors, sufficient quality has been assured and further testing up the integration chain is redundant.
This assumption often seems intuitive to people who have not undergone quality assurance training, and that is precisely its danger. But why exactly is it false?
Most importantly, because any integration action introduces new complexity and risks that were not present in its sub-components (human error being just one of them), and because the reliability specifications of the sub-components do not necessarily align with those of the full system. A simple way to visualize the problem with this assumption is to draw the FTA (Fault Tree Analysis) diagram of such a scenario. Smaller is better, and this one will grow into a sequoia tree.
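To put rough numbers on that misalignment, below is a minimal sketch of how failure rates combine in a simple series system, i.e. one where any single sub-component failure takes the whole product down. The component list and MTBF figures are purely illustrative assumptions, not taken from any real product:

```python
# Minimal sketch: in a series system (any sub-component failure stops the
# product), the sub-component failure rates add up, so the integrated product
# is always less reliable than its most reliable part.
# All component names and MTBF figures below are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365

component_mtbf_hours = {
    "motherboard": 300_000,
    "psu_pair": 500_000,
    "boot_ssd": 1_500_000,
    "data_ssd": 1_500_000,
    "io_card": 400_000,
}

# lambda_system = sum(1 / MTBF_i)  ->  MTBF_system = 1 / lambda_system
system_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbf_hours.values())
system_mtbf = 1.0 / system_failure_rate

print(f"System MTBF: {system_mtbf:,.0f} hours "
      f"(~{system_mtbf / HOURS_PER_YEAR:.1f} years)")
```

Even with generous vendor datasheet figures, the combined MTBF lands far below the best individual number, and that is before a single integration-induced failure mode is added to the tree.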
Murphy's law does not sleep, and statistics show no mercy. As production quantity or complexity scales up, this false assumption starts to show its dire consequences. It is a good thing for us that this is a simple and predictable phenomenon that is easy to mitigate, once the bathtub curve is understood. We’ll get to the specifics of this example in part II once a baseline understanding is set. Let's dive into the basics:

The bathtub curve consists of three failure groups:
Decreasing hazard of infant mortality (blue dotted line)
Constant hazard of random failure (green dotted line)
Increasing hazard of wear-out failure (orange dotted line)
Once combined, the three form the observed failure rate (yellow solid line), which resembles the shape of a bathtub - hence the name.
The graph will not predict the failure rate of a single item, but rather the relative failure rate of an entire population of products over time, broken down into three cause groups.
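For readers who prefer to see the shape emerge from numbers, the following minimal Python sketch builds the curve from its three components, using Weibull-style hazards for the decreasing and increasing parts and a constant hazard for the random part. All parameters are illustrative assumptions, not fitted to any real data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal sketch of the bathtub curve: the observed failure rate is the sum of
# a decreasing, a constant and an increasing hazard. Parameters are chosen
# only to make the classic shape visible.

t = np.linspace(0.01, 10, 500)  # time in arbitrary units (e.g. years)

def weibull_hazard(t, shape, scale):
    # Weibull hazard: h(t) = (shape / scale) * (t / scale) ** (shape - 1)
    return (shape / scale) * (t / scale) ** (shape - 1)

infant_mortality = weibull_hazard(t, shape=0.5, scale=8.0)  # decreasing hazard
random_failures = np.full_like(t, 0.08)                     # constant hazard
wear_out = weibull_hazard(t, shape=4.0, scale=9.0)          # increasing hazard

observed = infant_mortality + random_failures + wear_out    # the "bathtub"

plt.plot(t, infant_mortality, "b:", label="infant mortality")
plt.plot(t, random_failures, "g:", label="random failures")
plt.plot(t, wear_out, ":", color="orange", label="wear-out")
plt.plot(t, observed, "y-", label="observed failure rate")
plt.xlabel("time")
plt.ylabel("failure rate")
plt.legend()
plt.show()
```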
Infant mortality failure group
Early failure rate, commonly called infant mortality, is an exceptionally high failure rate observed at the early life period of a product (or a component, system, infrastructure, etc.).
It is attributed to factors such as:
Manufacturing defects
Integration or installation process irregularities (unclear technical procedures)
Improper start-up or implementation
Inadequate component handling during storage and integration
A lack of QA processes
Sub-component reliability specification misalignments
It is characterized in the graph by a high start and a sharp drop as the system begins to undergo usage and stress (an important detail we will expand on in part II). Counterintuitively, the decrease decelerates as time goes on and never quite reaches zero, making the failure causes elusive and leaving them to affect the failure rate even after prolonged periods of time.
A point to keep in mind is that this failure group comprises the majority of our DOA (dead/defective on arrival) products. DOAs undermine customer confidence in the product, causing negative "priming" from the customer's first engagement with it. This failure group is the most damaging business-wise: unlike the other two failure groups, which customers expect and understand to be inevitable, DOAs are simply unacceptable from a customer satisfaction viewpoint.
This holds true for virtually any physical computer product, be it a point-of-sale kiosk, a cybersecurity appliance, multi-petabyte storage or a hyperconverged backend infrastructure serving the development team.
Normal failure group
Each system has a chance to experience random failure, attributed to factors such as:
Architecture and usage design flaws
Acceptable specification limitations
Acceptable deviations from manufacturing and testing processes
At small scales, normal failures seem to occur at random and to be unrelated. They converge to a constant rate when measured over a large sample and a long period of time. This group represents a low yet constant chance of failure at any point in time, and is therefore characterized on the graph by a straight horizontal line.
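To illustrate what a constant hazard means in practice, here is a small sketch that converts a constant failure rate into an expected number of annual failures across a fleet, using the standard exponential reliability model R(t) = exp(-λt). The MTBF and fleet size are made-up assumptions:

```python
import math

# Minimal sketch: for a constant hazard rate, reliability follows the
# exponential model R(t) = exp(-lambda * t) and MTBF = 1 / lambda.
# The MTBF and fleet size below are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365

mtbf_hours = 200_000   # hypothetical constant-hazard MTBF of a single unit
fleet_size = 500       # hypothetical number of units in the field

failure_rate = 1.0 / mtbf_hours                        # failures per unit-hour
p_survive_year = math.exp(-failure_rate * HOURS_PER_YEAR)
expected_failures_per_year = fleet_size * (1.0 - p_survive_year)

print(f"P(single unit survives a year): {p_survive_year:.3f}")
print(f"Expected failures per year across the fleet: "
      f"{expected_failures_per_year:.1f}")
```

A rate that looks negligible for a single unit (a ~96% chance of making it through the year in this example) still turns into a steady stream of RMAs once hundreds of units are in the field.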
Wear-out failure group
The wear-out failure rate, as the name suggests, consists of failures attributed to the natural deterioration of our products over time. It is often what dictates, in practice, the product's end of usable life, triggering a replacement or phase-out. The wear-out rate is affected by factors like:
Customer usage stress and usage variation
Environmental factors
Sub-component MTBFs (Mean Time Between Failures)
The wear-out failure rate is characterized in the graph by a start at zero, followed by an accelerating increase as time goes by and takes its toll on the product.
We'll go further into the bathtub curve's implications for all aspects of our product and its processes in the following posts. For the time being - a short story for context.

Story time
During my time as an integrator, a global OEM customer of ours designed an IPC (Industrial Personal Computer) to operate and monitor all of the robotics and optics in their product, a physically massive and highly complex cutting-edge machine. The IPC system was a neat and straightforward SuperMicro server with many IO cards, RHEL OS, redundant PSUs and overspecced dual Xeon processors. They even made sure to hook up IPMI OOB. Their machines were expensive and high quality; they didn't play around with subpar hardware and didn't cut corners on the architecture. The roadmap was hundreds of copy-exact systems per year, designed for 5-year longevity. So far, so good.
The problem
Our customer's R&D were immersed in delivering functionality and performance, and did produce a very neat enterprise-grade solution, but no one gave sufficient thought to the useful life period required from this product (more on this down the rabbit hole of this series). They failed to adequately analyze the system's use-case wear-out factors, and therefore did not correctly calculate the MTBFs (Mean Time Between Failures) and endurance requirements of the sub-components. The long-term support strategy was recognized as important, but in practice was regarded as "some other abroad department's problem, in the distant future".
Business impact
And so, a mature, very strictly defined and BOM-locked product (as is common in that industry) passed all the strict processes after over a year of development and engineering, and the whole machine was happily launched and sold by the hundreds to facilities worldwide. And then, after about a year in the field, the SSD failure rate started ramping up well beyond what was expected.
By the time the SSD pandemic was noticed, hundreds of units were already deployed in the field, with many hundreds more in the supply chain pipeline.
It turns out that our customer had an image capture data stream writing to an SSD as part of the machine operation at just above 2MB/s.
Yes, not a typo – just 2MB/s, so no one bothered to think twice about it. But as the machine works non-stop, 24/7, this seemingly minor write stream was as constant as the AC powering the IPC.
Data generated by the stream was not crucial for retention and was automatically overwritten twice a day, so a single SSD with sufficient performance and no RAID was defined in the spec. It was separated from the OS drive, which is often a good practice in such use cases, but it proved to be a SPOF (Single Point of Failure): once it failed, the core applications went down with their crucial storage resource, bringing the whole machine to a halt.
The reason the SSDs started failing after about a year was straightforward: the chosen SSD was of a good brand and quality, at a good price and with a stable supply roadmap, but its write endurance was rated at 75TBW (Terabytes Written). Sounds like plenty, but is it?
75TBW in an SSD spec means terabytes written over the duration of the product warranty, which in this case was 5 years.
75TBW / 5 years / 365 days / 24 hours / 60 minutes / 60 seconds = ~0.5MB/s for 5 years straight. Our customer’s data stream was generating over 4 times that, and the SSDs quickly wore out.
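The same back-of-the-envelope arithmetic as a small script, using the figures from the story (75TBW rated endurance, a 5 year warranty and a ~2MB/s sustained write stream). The decimal MB/TB convention is my assumption:

```python
# Back-of-the-envelope SSD endurance check with the figures from the story.
# Decimal units are assumed: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

rated_endurance_tbw = 75        # terabytes written over the warranty period
warranty_years = 5
actual_write_rate_mb_s = 2.0    # the "minor" image-capture stream

# Sustained write rate the endurance rating actually allows:
endurance_bytes = rated_endurance_tbw * 1e12
sustainable_mb_s = endurance_bytes / (warranty_years * SECONDS_PER_YEAR) / 1e6

# Time until the endurance budget is exhausted at the real write rate:
years_to_wear_out = (endurance_bytes / (actual_write_rate_mb_s * 1e6)
                     / SECONDS_PER_YEAR)

print(f"Sustainable write rate over the warranty: {sustainable_mb_s:.2f} MB/s")
print(f"Years to exhaust 75TBW at {actual_write_rate_mb_s} MB/s: "
      f"{years_to_wear_out:.1f}")
```

The output comes to roughly 0.48MB/s sustainable versus 2MB/s actual, with the endurance budget exhausted in about 1.2 years, which lines up with the failures starting after about a year in the field.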
An unexpectedly high failure rate produces a costly butterfly effect. In this case the end customers were global, often in hard-to-access locations, and the product was mission-critical, with the IPC being just a minor sub-component of a complex machine. Therefore, the whole IPC had to be defined as a single FRU (field replaceable unit) in the full product BOM, to make sure any issue could be resolved in a timely manner by the application engineers available onsite. This is a good trade-off as long as the failure rate is low and predictable: each time an SSD failed, the whole IPC was swapped in the field and the failed one shipped back for RMA (Return Merchandise Authorization). But suddenly our failure rate was no longer low.
Surely, the more failures, the more downtime and customer dissatisfaction, which are serious issues all by themselves. However, let's go further and imagine the raw operations overhead: having to immediately purchase more units as the rolling stock expanded unexpectedly; prioritizing emergency support shipments over new customer shipments; having to increase stocks in all global hubs and production facilities; numerous engineer flight dispatches; shipping costs back and forth (for a 30kg rack-mountable IPC, expedited shipping alone can easily surpass $1K per RMA swap); root cause analysis efforts; emergency replacement SSD qualification and ECO (engineering change order) implementation; SSD sub-component replacement and system refurbishment costs; the man-hours invested in the whole endeavor; and the list goes on and on.
All of these expenses and efforts comprise an unplanned overhead: substantial resources the company could have invested elsewhere.
How should our customer manage such a crisis? They now have to make a choice. Do they brace until the storm passes, replacing each unit as it fails with no idea how much it will end up costing or what the business impact will be, or do they issue a preventive action and replace all SSDs in the field?
Our intuition to maintain customer satisfaction and brand reputation might scream for the latter, but in our customer's case that easily sums up to over $1M plus thousands of engineer hours if they take the risk of SSD-only swaps in the field, or over $10M in equipment cost alone if they play it safe with full head-for-head IPC swaps.
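To make the trade-off concrete, here is a deliberately simplified cost comparison. Every figure in it is hypothetical, chosen only to land in the same order of magnitude as the numbers above; none of it comes from the actual case:

```python
# Deliberately simplified sketch of the take-the-risk vs. play-it-safe choice.
# Every figure below is hypothetical and only meant to match the order of
# magnitude discussed in the text.

fielded_units = 500

# Option A: preventive SSD-only swap in every fielded unit (field labor heavy)
ssd_part_cost = 300             # replacement SSD with adequate endurance
onsite_swap_cost = 1_800        # engineer time, travel and logistics per visit
engineer_hours_per_swap = 6

option_a_cost = fielded_units * (ssd_part_cost + onsite_swap_cost)
option_a_hours = fielded_units * engineer_hours_per_swap

# Option B: preventive head-for-head IPC swap in every fielded unit
ipc_unit_cost = 20_000          # fully integrated spare IPC

option_b_cost = fielded_units * ipc_unit_cost

print(f"SSD-only preventive swap: ~${option_a_cost:,} "
      f"and ~{option_a_hours:,} engineer hours")
print(f"Full IPC preventive swap: ~${option_b_cost:,} in equipment alone")
```

The exact numbers will differ per product, but the structure of the dilemma stays the same: labor-heavy and operationally risky on one side, capital-heavy but safe on the other.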
The price is high and the compromises are not simple; it clearly would have been best to avoid the situation to begin with.
In our field, such an event is not considered a major-scale issue, nor is it an uncommon one, so the prevention of such events should be taken seriously. For a mature enterprise it is just a bad day at the office, but for a medium-sized business or a startup that has only recently reached mass production, such events often both come as a surprise and carry more severe impacts.
I hope this example illustrates the importance of planning and managing our product failure rate.
The next chapter in this series will elaborate on the time phases of the product lifecycle, as dictated by the failure rate bathtub curve, along with additional stories from the field.