Battery energy storage systems are scaling faster than almost any other infrastructure category in the United States. Cumulative utility-scale battery storage capacity exceeded 26 GW in 2024, according to the U.S. Energy Information Administration. That growth is creating enormous pressure on manufacturers, engineers, and project developers to move quickly. Speed without rigor, however, creates risk. And nowhere is that risk more visible than in the failure modes that have caused real incidents at real facilities.
According to the IEA, the global market for BESS surged by 40 GW in 2023 alone, nearly doubling the capacity growth recorded in the previous year. As more systems come online, structured risk assessment tools like Failure Modes and Effects Analysis (FMEA) matter more than ever.
Angelo Zandona, founder of Keystone Fire Consultants, has conducted FMEA analysis across a wide range of BESS projects in the country’s highest-growth storage markets. His work is part of how the industry is building the analytical infrastructure to match its rapid physical expansion.
What FMEA Is and Why It Matters for BESS
Failure Modes and Effects Analysis is a systematic method for identifying how a system or component might fail, what the consequences of each failure would be, and how likely each failure is to occur. It has been used for decades in aerospace, automotive, and nuclear industries. Its application to battery energy storage is more recent, but it has become a standard requirement under NFPA 855 and the International Fire Code for utility-scale installations.
A BESS FMEA is a structured document that assigns severity ratings, occurrence probability scores, and detection capability scores to each identified failure mode. The product of those three scores tells the engineering team which failure modes deserve the most urgent mitigation attention.
For a lithium-ion-based BESS, the failure modes span a wide range: battery cell-level failures, battery management system (BMS) errors, cooling system breakdowns, inverter faults, fire suppression activation failures, and control system anomalies. Each of these can have cascading effects if not identified and managed early.
Critical failures can occur not only in battery cells, but also in control systems, transformers, fire suppression systems, and cooling mechanisms. If control systems are not correctly programmed or calibrated, they may fail to monitor battery performance, leading to overheating or thermal runaway. Inadequately designed fire suppression systems may also fail to activate promptly, allowing a situation to escalate.
A complete FMEA for a battery energy storage project generally follows six structured steps. Understanding each step is important for QA engineers, battery manufacturers, and project developers who want to use FMEA as an operational tool rather than a compliance document.
Step 1: Define the System Scope
The first step is to define exactly what is being analyzed. For a containerized BESS installation, this typically includes the battery racks, the battery management system (BMS), the thermal management system, the fire suppression system, the power conversion system (PCS), and the facility-level controls. At a multi-container site, the FMEA scope also addresses inter-container propagation risks and the centralized monitoring architecture.
This scoping step is often underestimated. An FMEA that only analyzes the battery cells but ignores the inverter, the cooling loop, or the suppression system is incomplete and will not satisfy most authorities having jurisdiction (AHJs) in California, Texas, or Arizona.
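A scope check of this kind can be sketched in a few lines of Python. The subsystem labels mirror the list in Step 1, while the names `EXPECTED_SCOPE` and `missing_from_scope` are hypothetical and not drawn from any standard; a real project would take its scope list from the site design.

```python
# Subsystems named in Step 1 for a containerized BESS (labels are illustrative).
EXPECTED_SCOPE = {
    "battery racks",
    "battery management system",
    "thermal management system",
    "fire suppression system",
    "power conversion system",
    "facility-level controls",
}


def missing_from_scope(analyzed):
    """Return the expected subsystems the FMEA does not cover, sorted by name."""
    return sorted(EXPECTED_SCOPE - set(analyzed))


# A hypothetical draft FMEA that only looked at the racks and the BMS.
draft_scope = ["battery racks", "battery management system"]
gaps = missing_from_scope(draft_scope)
print(gaps)
```

Running the check before scoring begins makes a scope gap visible while it is still cheap to fix.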
Step 2: Identify All Failure Modes
For each system component in scope, the team identifies every conceivable way that component could fail to perform its intended function. This step draws on OEM documentation, UL 9540A test data, historical incident records (including data from EPRI’s BESS Failure Incident Database), and the engineering team’s direct experience with similar systems.
EPRI’s BESS Failure Incident Database tracks grid-scale incidents globally. For incidents between 2018 and the present, 36% had root causes that could be identified, compared with zero percent for incidents recorded between 2011 and 2017. The difference reflects how much the industry has learned about failure modes over time.
Common failure modes identified in well-conducted BESS FMEAs include cell-level internal short circuit, overcharge events from BMS control failure, thermal runaway propagation from one cell or module to adjacent units, cooling system pump failure leading to elevated cell temperatures, and fire suppression nozzle blockage preventing system activation.
Step 3: Determine Failure Effects
Once failure modes are identified, the next step is to determine the effect of each failure at three levels: the local level (what happens to the component itself), the next higher level (what happens to the subsystem or container), and the system level (what happens to the entire BESS installation or, in the worst case, the surrounding community).
This is where FMEA connects most directly to fire and life safety. A cell-level internal short circuit may have a local effect of elevated temperature and venting. Its next-higher-level effect might be thermal runaway propagation to adjacent cells. Its system-level effect, in a worst case, is a full container fire with toxic off-gas release. Mapping those effects explicitly is what allows the engineering team to design proportionate safeguards for each scenario.
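One way to keep the three effect levels explicit is to record them as separate fields for every failure mode. The sketch below is a minimal illustration in Python; the class name `FailureEffects` and its fields are assumptions made for this example, and the sample values come from the short-circuit scenario described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEffects:
    failure_mode: str
    local: str        # effect on the component itself
    next_higher: str  # effect on the subsystem or container
    system: str       # effect on the whole installation or its surroundings

    def is_complete(self):
        """True when all three effect levels have been filled in."""
        return all(text.strip() for text in (self.local, self.next_higher, self.system))


short_circuit = FailureEffects(
    failure_mode="cell-level internal short circuit",
    local="elevated temperature and venting",
    next_higher="thermal runaway propagation to adjacent cells",
    system="full container fire with toxic off-gas release",
)
print(short_circuit.is_complete())
```

A record with a blank level fails `is_complete()`, which flags rows where the analysis stopped at the component and never reached the container or the site.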
Step 4: Assign Severity, Occurrence, and Detection Scores
Each failure mode is then scored on three dimensions, each typically on a scale of 1 to 10. Severity (S) rates how serious the worst-case effect would be. Occurrence (O) rates how frequently the failure mode is likely to occur given the current design. Detection (D) rates how difficult the failure is to detect before it causes its worst-case effect; under the usual FMEA convention, a higher D score means detection is less likely.
The Risk Priority Number (RPN) is calculated by multiplying the three scores: RPN = S × O × D. High RPN scores indicate failure modes that deserve priority attention in the design process. Low RPN scores indicate failure modes that are unlikely, detectable before becoming serious, or manageable in their consequences.
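The calculation is simple enough to sketch in Python. The scores below are invented for illustration and are not taken from any real project. The `FailureMode` class is hypothetical, and the sketch assumes the common convention that a higher Detection score means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # S, 1 to 10
    occurrence: int  # O, 1 to 10
    detection: int   # D, 1 to 10 (assumed: higher means harder to detect)

    def __post_init__(self):
        # Reject scores outside the 1-to-10 scale used in this sketch.
        for label in ("severity", "occurrence", "detection"):
            score = getattr(self, label)
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {score}")

    @property
    def rpn(self):
        """Risk Priority Number: S x O x D."""
        return self.severity * self.occurrence * self.detection


# Illustrative scores only.
modes = [
    FailureMode("cell internal short circuit", 9, 3, 6),
    FailureMode("cooling pump failure", 6, 4, 3),
    FailureMode("suppression nozzle blockage", 8, 2, 7),
]

# Highest RPN first, so the most urgent failure modes lead the list.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:4d}  {m.name}")
```

With these example scores the short circuit ranks first (9 × 3 × 6 = 162), ahead of the nozzle blockage (112) and the pump failure (72).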
Step 5: Develop Mitigation Actions
For every failure mode with an RPN score above the project’s threshold, the team develops specific mitigation actions. These might include design changes (adding temperature sensors at specific locations in the battery rack), procedural changes (requiring daily BMS log review as part of O&M protocols), or protective additions (upgrading fire suppression coverage to address a specific nozzle blind spot).
Each mitigation action is then assigned to a responsible party with a target completion date. The FMEA is updated with a revised RPN score reflecting the expected risk reduction from each mitigation.
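A mitigation register along these lines can be sketched as follows. The threshold of 100, the owners, the dates, and the RPN values are all hypothetical placeholders, and `MitigationAction` is a name chosen for this example. Each project sets its own threshold.

```python
from dataclasses import dataclass
from datetime import date

RPN_THRESHOLD = 100  # hypothetical project threshold


@dataclass
class MitigationAction:
    failure_mode: str
    action: str
    owner: str
    due: date
    rpn_before: int
    rpn_after: int  # expected RPN once the action is in place

    @property
    def still_above_threshold(self):
        return self.rpn_after > RPN_THRESHOLD


# Illustrative entries only.
actions = [
    MitigationAction("cell internal short circuit",
                     "add temperature sensors in the battery rack",
                     "design engineer", date(2030, 1, 15), 162, 81),
    MitigationAction("suppression nozzle blockage",
                     "upgrade suppression coverage at the nozzle blind spot",
                     "fire protection engineer", date(2030, 2, 1), 112, 48),
    MitigationAction("overcharge from BMS control failure",
                     "require daily BMS log review in O&M protocols",
                     "O&M lead", date(2030, 3, 1), 180, 120),
]

# Failure modes that need further mitigation after the planned actions.
open_items = [a.failure_mode for a in actions if a.still_above_threshold]
print(open_items)
```

Keeping the before and after scores side by side shows which actions actually bring a failure mode under the threshold and which need a second round.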
Step 6: Document, Review, and Update
A BESS FMEA should be reviewed whenever the system design changes materially, whenever a new failure mode is identified through operating experience, and whenever the applicable technology (battery chemistry, BMS version, suppression system) is updated.
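These review triggers can be written down as a simple rule so that change management does not depend on memory. The sketch below is one possible encoding in Python; the `Change` enum and the `review_reasons` function are hypothetical names, and routine maintenance is included only as an example of a change that does not trigger a review in this sketch.

```python
from enum import Enum


class Change(Enum):
    DESIGN_CHANGE = "material design change"
    NEW_FAILURE_MODE = "new failure mode from operating experience"
    CHEMISTRY_UPDATE = "battery chemistry update"
    BMS_UPDATE = "BMS version update"
    SUPPRESSION_UPDATE = "suppression system update"
    ROUTINE_MAINTENANCE = "routine maintenance"  # not a trigger in this sketch


REVIEW_TRIGGERS = {
    Change.DESIGN_CHANGE,
    Change.NEW_FAILURE_MODE,
    Change.CHEMISTRY_UPDATE,
    Change.BMS_UPDATE,
    Change.SUPPRESSION_UPDATE,
}


def review_reasons(changes):
    """Return the descriptions of changes that call for an FMEA review."""
    return [c.value for c in changes if c in REVIEW_TRIGGERS]


reasons = review_reasons([Change.ROUTINE_MAINTENANCE, Change.BMS_UPDATE])
print(reasons)
```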
The U.S. Department of Energy has noted that plans, documentation, and installation practices form the foundation of safety for the entire lifecycle of a BESS and that simply requiring code compliance without adequate documentation context can lead to unacceptably long delays in project delivery.
The Most Common FMEA Failures in Real-World BESS Projects
Angelo Zandona and the team at Keystone Fire Consultants have reviewed dozens of FMEA documents produced for projects across California, Texas, and Arizona. Three recurring gaps stand out:
● The first is scope limitation. Many FMEAs analyze the battery cells in detail but treat the fire suppression system, the BMS, and the cooling architecture as outside the scope of analysis. This creates blind spots that AHJs in experienced jurisdictions will catch immediately.
● The second is the use of generic scoring without site-specific justification. A severity score of 8 for a thermal runaway event means nothing without documentation of why that score was selected for this specific chemistry, in this specific container layout, at this specific site.
● The third is the absence of failure propagation analysis. NFPA 855 and UL 9540A both require analysis of how a failure in one cell or module might propagate to adjacent units. An FMEA that treats each failure mode as isolated will not meet the standard.
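The three gaps above lend themselves to a simple automated pre-check before a document goes to review. The sketch below assumes each FMEA row is a dictionary with hypothetical keys (`subsystem`, `failure_mode`, `score_justification`, `propagation_analysis`). The required subsystem list is illustrative, and a check like this only finds missing entries; it cannot judge whether the analysis itself is sound.

```python
REQUIRED_SUBSYSTEMS = {"battery cells", "BMS", "cooling", "fire suppression"}


def audit_fmea(rows):
    """Return findings for scope, scoring justification, and propagation gaps."""
    findings = []

    # Gap 1: subsystems with no failure modes listed at all.
    covered = {row["subsystem"] for row in rows}
    for subsystem in sorted(REQUIRED_SUBSYSTEMS - covered):
        findings.append(f"scope: no failure modes listed for {subsystem}")

    for row in rows:
        # Gap 2: scores without a site-specific justification.
        if not row.get("score_justification", "").strip():
            findings.append(f"scoring: {row['failure_mode']} has no justification")
        # Gap 3: failure modes treated as isolated.
        if not row.get("propagation_analysis", "").strip():
            findings.append(f"propagation: {row['failure_mode']} has no analysis")

    return findings


# Illustrative rows only.
rows = [
    {"subsystem": "battery cells", "failure_mode": "internal short circuit",
     "score_justification": "",
     "propagation_analysis": "module-to-module spread assessed against test data"},
    {"subsystem": "BMS", "failure_mode": "overcharge from control failure",
     "score_justification": "based on this site's container layout",
     "propagation_analysis": ""},
]

findings = audit_fmea(rows)
for finding in findings:
    print(finding)
```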
FMEA in California, Texas, and High-Growth Markets
The relevance of FMEA extends beyond fire safety compliance. In states like California and Texas, where community opposition to BESS siting has intensified following high-profile incidents, a thorough FMEA serves as a communication tool as well as a safety document.
Local siting and permitting decisions for utility-scale battery storage are increasingly challenged or delayed due to community concerns about safety and fire risks. Project developers who share best practices for navigating the permitting process report that addressing community concerns proactively is one of the most effective strategies for avoiding permitting delays.
When a developer can show a county commissioner or a local fire marshal a credible, site-specific FMEA that addresses propagation risks, community evacuation scenarios, and suppression system design, the conversation shifts from “is this safe?” to “how are you ensuring this is safe?” That shift is the difference between a smooth approval and a two-year contested permitting process.
FMEA is the analytical foundation that tells a project’s engineering team, its AHJ, and its community stakeholders that every meaningful failure mode has been thought about, scored, and addressed.
For battery energy storage projects in California, Texas, Arizona, and other high-growth markets, the quality of the FMEA is directly correlated with permitting speed and community acceptance. A generic template document will not clear a sophisticated AHJ review. A thorough, site-specific, six-step FMEA that addresses real failure modes for real equipment in a real location will.
Angelo Zandona and the team at Keystone Fire Consultants specialize in exactly this kind of rigorous, site-specific risk analysis. Their work helps developers turn a compliance requirement into a genuine safety tool, one that also moves projects through regulatory review faster.
Is FMEA required for all BESS projects under NFPA 855?
ANS: FMEA is a standard documentation requirement for utility-scale BESS projects under NFPA 855 and the International Fire Code. The level of detail required varies by system size and jurisdiction, but most AHJs in California and Texas expect a site-specific FMEA as part of the permit package.
How is an RPN score calculated in a BESS FMEA?
ANS: The Risk Priority Number (RPN) is calculated by multiplying three scores: Severity (how serious is the worst-case effect?), Occurrence (how likely is the failure?), and Detection (how hard is the failure to catch before it becomes critical?). Each dimension is typically scored on a scale of 1 to 10, with higher RPN values indicating failure modes that need priority mitigation.
What are the most common battery failure modes in utility-scale BESS systems?
ANS: The most frequently documented failure modes include internal cell short circuits (leading to thermal runaway), BMS control errors that allow overcharging, cooling system failures that elevate cell temperatures, and fire suppression activation failures. Propagation from one cell or module to adjacent units is a key concern in all of these scenarios.
How does FMEA interact with UL 9540A test data?
ANS: UL 9540A provides test-based data on thermal runaway propagation behavior for specific battery chemistries and form factors. That data feeds directly into the FMEA’s occurrence and severity scoring for propagation-related failure modes. An FMEA that does not incorporate available UL 9540A test results for the project’s specific equipment is likely to be challenged during AHJ review.
How often should a BESS FMEA be updated during the project lifecycle?
ANS: The FMEA should be reviewed whenever the system design changes materially, whenever the battery chemistry or BMS version is updated, and periodically during operations when new failure modes are identified through incident reports or operating experience.
























