How I Learned to Stop Worrying and Love the Bathtub Curve, Part II
Setting an Acceptable Failure Rate
Before we can start talking about applying the bathtub curve insights to our product, we need to establish a baseline understanding, which is hopefully agreed upon after reading part I of this series: expecting our production teams, or our subcontractor’s integration, to supply mass-produced products with a zero failure rate is unrealistic. However good and optimized the process, there is no such thing as an actual zero failure rate in mass production (but, as greater minds have said - although we’ll never reach perfection, we should never stop trying). Also, there is no such thing as usage without fatigue, so we cannot expect our products not to wear out with time.
Once we are done drowning in sorrow following this harsh reality-check, we should shift focus to how we control the failure rate throughout the product's lifecycle, bringing it down to non-devastating rates and keeping it there consistently, as well as how to support our product at a known failure rate.
As we accept that some failure rate will always be present, we can conclude that a mature product should have a defined acceptable failure rate as part of its internal specification. Once mass production is reached, failure rate becomes a leading parameter, in the sense that it will dictate processes and highly affect business operations. Our product's failure rate cannot be a value calculated post-mortem, since it will single-handedly force us to adjust high impact operations and business aspects to accommodate it.
Assuming we are actively working to take ownership over our product, the main take-home message of this article is that the acceptable failure rate must be set and managed actively. It must be realistic, taking into consideration the complex implications of under- and over-investing in the reliability of the product. It must be a conscious decision, with processes structured accordingly and followed by controlled KPIs (key performance indicators). This is crucial to owning the product, instead of merely reacting in retrospect.
The Bathtub Curve Life Stages
Once we set the acceptable failure rate on our bathtub curve graph, we observe two intersections between it and the observed failure rate (yellow solid line). These intersections split the product's lifecycle timeline into its 3 major life stages. See figure 1 below, where the life stages are overlaid in color.
Startup Stage, highlighted in red
During the startup stage the failure rate of our product is above acceptable, with the main offender being the high infant mortality rate during our product’s early life stages. There is much to be said about this stage and about methods to reduce its effects; it will be the main topic we explore in one of the next chapters of this series.
Normal Operation Stage
This stage, as the name suggests, is the normal and useful life stage of the product. It is during this stage that our product is within our defined acceptable failure rate.
The longer we design this stage to last, the more use our customers get out of the product and the more warranty we can provide in a controlled manner and at a controlled cost.
Phase Out Stage, highlighted in green
During the phase out stage the failure rate once again exceeds our defined acceptable level, mainly due to the growing wear-out rate. Our product is no longer reliable enough by our standards, and phase out is planned.
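To make the two intersections concrete, here is a minimal sketch that locates them numerically. It assumes a purely illustrative bathtub model (a decreasing Weibull hazard for infant mortality, a constant random failure rate and an increasing Weibull hazard for wear-out). The function names, the parameters and the acceptable rate are all hypothetical and chosen only so that the curve has a bathtub shape; none of them come from a real product.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t):
    """Illustrative failure rate per unit per month at age t (months, t > 0)."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=2500.0)  # decreasing
    random_failures = 0.002                                        # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=120.0)           # increasing
    return infant_mortality + random_failures + wear_out


def find_crossings(rate, limit, t_start, t_end, steps=10000):
    """Scan [t_start, t_end] for sign changes of rate(t) - limit, then bisect."""
    crossings = []
    dt = (t_end - t_start) / steps
    prev_t = t_start
    prev_diff = rate(prev_t) - limit
    for i in range(1, steps + 1):
        t = t_start + i * dt
        diff = rate(t) - limit
        if prev_diff * diff < 0:
            lo, hi, lo_diff = prev_t, t, prev_diff
            for _ in range(60):  # bisection refinement
                mid = (lo + hi) / 2
                mid_diff = rate(mid) - limit
                if lo_diff * mid_diff <= 0:
                    hi = mid
                else:
                    lo, lo_diff = mid, mid_diff
            crossings.append((lo + hi) / 2)
        prev_t, prev_diff = t, diff
    return crossings


ACCEPTABLE_RATE = 0.01  # hypothetical: failures per unit per month

# Start slightly above t = 0, where the infant mortality term is undefined.
startup_end, phase_out_start = find_crossings(
    bathtub_rate, ACCEPTABLE_RATE, 0.01, 120.0
)
print(f"Startup stage ends after about {startup_end:.1f} months")
print(f"Phase out stage starts after about {phase_out_start:.1f} months")
print(f"Normal operation lasts about {phase_out_start - startup_end:.1f} months")
```

Raising or lowering ACCEPTABLE_RATE in this sketch moves both intersections, which is the trade-off described above: a stricter acceptable rate shortens the normal operation stage from both ends.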
A Few Words on Product Support
Considering the expected duration of the normal operation stage, the acceptable failure rate and the specific support needs of the customer (SLA, environment restrictions, support scope, etc.) during product specification design gives fundamental insights into the product's operational and business implications at a very early stage, when flexibility is still high.
This information structures the support contract and dictates what support length and features our sales team can sell. What happens when the restrictions deduced from this information do not meet the customer's requirements, such as the support period (read - normal operation stage)? We are forced to either compromise or adjust the product design. This is an easy problem if found while we are still at the product design stage, and a much harder one if found when we are already in mass production...
Next, once the reliability specification design is aligned with customer requirements, we can draw up an initial service contract, allowing us to gauge, at a very early stage, the support and RMA operations we will need, and therefore factor our expected support costs into the product pricing and business strategy with higher precision.
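As a rough illustration of how an acceptable failure rate feeds into pricing, the sketch below estimates RMA volume and support cost per unit sold. It assumes a constant failure rate across the normal operation stage and that every failure results in one RMA. The function names and all the figures (fleet size, rate, support period, cost per RMA) are hypothetical.

```python
def expected_rma_units(units_in_field, failure_rate_per_month, months):
    """Expected number of RMAs, assuming a constant failure rate."""
    return units_in_field * failure_rate_per_month * months


def support_cost_per_unit(failure_rate_per_month, months, cost_per_rma):
    """Expected support cost to factor into the price of each unit sold."""
    return failure_rate_per_month * months * cost_per_rma


# Hypothetical figures, for illustration only.
units = 10_000           # units in the field
rate = 0.002             # acceptable failures per unit per month
support_months = 36      # support period sold with the product
rma_cost = 150.0         # handling, shipping and replacement cost per RMA

print(f"Expected RMAs: {expected_rma_units(units, rate, support_months):.0f}")
print(f"Support cost per unit: {support_cost_per_unit(rate, support_months, rma_cost):.2f}")
```

Even a back-of-the-envelope model like this shows why the failure rate has to be a specified parameter: doubling it doubles both the RMA operation we need to staff and the support cost hidden in each unit's price.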
Hopefully, it is easier now to see the importance of actively controlling these parameters, and the significant business implications we will unavoidably endure if we do not. In the next chapters we will explore warranty expansion implications and the bathtub curve startup stage, going into why, when and how we should test our products. For now, story time from the field.
A few years back, when AI medical imaging was just emerging, a pioneering startup in the field contacted us to consult on an infrastructure solution they were drafting for their R&D teams. They were a 100-200 person startup with teams of researchers at 3 locations, running heavy training iterations over GPUs on a massive 1PB dataset they had gathered. The IT team was made responsible for providing a solution, but was overloaded with the bring-up of the new company's various environments and mass employee onboarding. Hence, they didn't have time to tailor a solution themselves, although they were perfectly capable. Being experienced enterprise veterans, the IT team heard GPGPUs and 1PB storage blocks of precious data at 3 locations and headed directly into familiar territory, contacting the account managers at the enterprise storage and server vendors they were familiar with, to get quotes.
When we came into the discussion it was already narrowed down to choosing between 3 storage quotes from leading vendors. The GPGPU solution was already decided to be dedicated 2U 2 GPU servers, one server per researcher. The management saw that the storage quotes alone ranged from 1.5 to 2M$ and asked the team to run a thorough technical comparison before committing to such an expense. The discussion was focused on topics such as: whether the seven nines (99.99999%) availability of the Infinibox was an automatic win for Infinidat, as the other two offers promised the traditional five nines (99.999%); what the peripheral cost of the infrastructure upgrades at each office from 40GbE to a 100GbE backbone would be; sufficient power and cooling; which caching and which deduplication engine is better; and again - where they would physically put 2-3 additional 42U racks at each site.
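For a sense of scale, the availability figures being compared translate into allowed downtime per year as follows. This is a minimal sketch assuming a 365-day year; the function name is just for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_downtime_seconds(availability_percent):
    """Maximum downtime per year allowed by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for label, availability in [("five nines", 99.999), ("seven nines", 99.99999)]:
    seconds = yearly_downtime_seconds(availability)
    print(f"{label} ({availability}%): {seconds:.2f} seconds of downtime per year")
```

Five nines allows a little over 5 minutes of downtime per year, while seven nines allows roughly 3 seconds. The gap matters a great deal for a system that must never stop, and very little for one whose jobs can simply be restarted.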
The thing is, the dataset was indeed crucial data, but it was to be an exact copy across all 3 locations and it was not going to be altered, only read from. The systems were used for internal R&D only and the impact of downtime was not high; the training jobs could just be restarted with no impact beyond the waste of a few work hours. The workload was very predictable: the storage's sole purpose was to feed all machine learning workloads run on the GPGPUs with chunks of the same huge dataset, a few tens of GB of raw data per iteration every few minutes. All traffic was literally big block read streams, raw throughput over whichever protocol, no writes and no importance to IO. Although the GPU server clients could pull over 100GbE if the reads were synchronized and done in parallel, the chances that the clients would pull even 40GbE in practice were slim, and there was no significant penalty if traffic was capped to 40GbE. As the solution was needed for temporary R&D needs, it really had to work for less than one year and was then to be repurposed into rack sized decorative ornaments, or whichever other use IT could find for it by then.
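A quick sanity check of that bandwidth claim can be sketched as below. The chunk size, iteration interval and number of clients are hypothetical values picked to match "a few tens of GB every few minutes"; they are not the customer's actual figures. The sketch also assumes reads are spread evenly over time, which is the opposite of the synchronized worst case.

```python
def average_gbit_per_s(gb_per_iteration, seconds_between_iterations, clients):
    """Average aggregate read throughput in Gbit/s for evenly spread reads."""
    gbit_per_iteration = gb_per_iteration * 8  # bytes to bits
    return gbit_per_iteration * clients / seconds_between_iterations


# Hypothetical workload, for illustration only.
chunk_gb = 30        # GB read per training iteration
interval_s = 180     # one iteration every 3 minutes
clients = 20         # GPU servers reading at one site

demand = average_gbit_per_s(chunk_gb, interval_s, clients)
print(f"Average demand: {demand:.1f} Gbit/s against a 40 Gbit/s link")
```

Under these assumptions the average demand stays below the 40GbE link, and a burst that briefly exceeds it only delays an iteration by a few seconds.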
We didn't start with the technical deep dive that yielded the above conclusions; those details just drove the nail further in as they unfolded later on. What pushed me to suggest rethinking the solution right away was the dissonance between the client's reliability needs and what the enterprise storage market is all about.
If we were to imagine the customer needs on our bathtub curve, in this customer's case the acceptable failure rate could be very high to begin with and the normal operation period could be very short. These are the exact opposites of traditional IT and enterprise storage architecture fundamentals. On top of that, add the fact that our customer literally didn't need a single feature of enterprise storage, except for its most basic functionality and the large capacity.
You don't go shopping for a Caterpillar haul truck when all you need is a vehicle to go grocery shopping and back, so why buy an enterprise storage solution when all you really need is a 1PB HDD over NFS?
Obviously enough, we offered to switch the approach and, after running a POC, provided 3 stacks at 10U each, made up of a single Dell 2U server with dual port 40GbE and a quad port LSI SAS HBA connecting two 4U 60 drive JBODs. The 2U node ran CentOS with ZFS, publishing a single NFS share. We added some performance optimization and some basic tailored monitoring SW, with global next business day Dell support for the hardware. 300K$ for a simplistic solution, which did the job flawlessly.
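A back-of-the-envelope sizing of one such stack might look like the sketch below. The drive count follows from the two 60 drive JBODs; the usable fraction (what remains after redundancy and filesystem overhead) is a hypothetical assumption, as the actual ZFS pool layout is not described here, and 1PB is taken as 1000 TB.

```python
def min_drive_tb(dataset_tb, drives, usable_fraction):
    """Smallest per-drive capacity (TB) that fits the dataset on one stack."""
    return dataset_tb / (drives * usable_fraction)


DATASET_TB = 1000        # 1PB, decimal units
DRIVES_PER_STACK = 2 * 60  # two 60 drive JBODs per stack
USABLE_FRACTION = 0.75   # hypothetical: capacity left after redundancy/overhead

needed = min_drive_tb(DATASET_TB, DRIVES_PER_STACK, USABLE_FRACTION)
print(f"Each drive needs at least {needed:.1f} TB")
```

The point of the exercise is that capacity was the only hard requirement, and it can be met with commodity drives in a 10U footprint.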
Since the customer needs were far from the normal use case, we were able to save over 1M$ by tailoring a solution to their needs. Off-the-shelf offerings are incomparably great, as long as your needs fit with the common use case the product was designed to meet. The less common your use case, the higher the price "penalty" to use generic products to meet your needs.
And for the record, yes, we did offer to consolidate the GPGPU servers into a smaller pool of 4U 10GPU nodes with a workload balancer to service all researchers more efficiently, and to combine the 3 sites into one resource pool in a single location.
In this series you will find theory along with examples from the field - real world scars gathered throughout the years, brought in to demonstrate the main points of the articles.
All examples are censored, with no classified information or customer identification unless specifically approved. The examples also include minor technical disinformation to further protect against identification. They are brought in to clarify concepts, in the hope that real world examples will bring additional value to people in similar situations.
Theoretical questions and general discussions are more than welcome, but please refrain from sharing anything that may expose non-disclosed information.