How to Replace a Faulty Cisco Switch in a Stack

Introduction

A Cisco StackWise stack groups multiple physical switches into a single logical unit — one management IP, one configuration file, one control plane. This design dramatically simplifies operations, but it also means that replacing a failed member must be done carefully. An incorrect sequence can trigger a spanning-tree reconvergence event, partition the stack ring, or — in the worst case — accidentally take down the Active switch and black out the entire segment.

This guide walks you through every stage: understanding stack roles, preparing the replacement hardware, executing the physical swap, and confirming a fully healthy stack before closing the change window.

Following the correct procedure ensures minimal downtime and a smooth return to full stack redundancy.

How a Cisco StackWise Stack Works

Before touching any hardware, it helps to understand what is running under the hood.

Stack Member Roles

Every switch in the stack holds one of three roles:

| Role | How It Is Elected | Responsibility |
| --- | --- | --- |
| Active | Highest priority → longest uptime → lowest MAC | Runs the control plane, owns the running-config, and builds the Layer 2/3 forwarding tables that are distributed to every member |
| Standby | Second-highest priority | Mirrors the Active state via SSO; takes over in under one second if the Active fails |
| Member | All remaining switches | Forwards data-plane traffic only; receives its configuration from the Active |
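
The tie-break chain in the table can be sketched in a few lines of Python. This is a simplified model of the ordering described above and nothing more: the StackMember class, its field names, and the sample values are hypothetical, and real StackWise elections apply additional platform-specific rules (for example, a switch that is already Active normally keeps the role).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackMember:
    number: int
    priority: int        # 1-15, higher wins
    uptime_seconds: int  # longer uptime wins a priority tie
    mac: str             # lowest MAC wins the final tie


def election_order(members):
    """Rank members by priority, then uptime, then lowest MAC address."""
    return sorted(
        members,
        key=lambda m: (-m.priority, -m.uptime_seconds, m.mac.lower()),
    )


# Hypothetical three-member stack used only for illustration.
members = [
    StackMember(1, priority=15, uptime_seconds=9000, mac="0cd0.f8e4.6b00"),
    StackMember(2, priority=14, uptime_seconds=9000, mac="0cd0.f8e4.7000"),
    StackMember(3, priority=14, uptime_seconds=9500, mac="0cd0.f8e4.7f80"),
]

ranked = election_order(members)
print("Active candidate:", ranked[0].number)
print("Standby candidate:", ranked[1].number)
```

Switches 2 and 3 share a priority, so the longer uptime of Switch 3 decides the Standby slot in this model. Setting explicit, distinct priorities removes that dependence on boot timing.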

Ring Topology

StackWise cables connect switches in a physical ring. Each switch has two stack ports, and each port connects to a different neighbour. The ring provides a redundant path — if one cable fails, traffic automatically reroutes in the opposite direction around the ring. The stack stays operational but degraded. This is why restoring a failed member (and the ring) matters even when traffic is still flowing.
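
The redundancy argument can be checked with a small graph model. The sketch below treats switches as nodes and stack cables as links; it is an illustration of the ring behaviour described above, not a model of the StackWise protocol itself, and all function names are made up for this example.

```python
from collections import deque


def ring_links(n):
    """Return the links of a ring of n switches numbered 1..n (n >= 3)."""
    if n < 3:
        raise ValueError("this sketch models rings of three or more switches")
    return {frozenset((i, i % n + 1)) for i in range(1, n + 1)}


def is_connected(switches, links):
    """Breadth-first search: can every switch reach every other switch?"""
    switches = set(switches)
    if not switches:
        return True
    start = min(switches)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for link in links:
            if current in link:
                (other,) = link - {current}
                if other in switches and other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen == switches


def without_switch(links, failed):
    """Drop every link that terminates on the failed switch."""
    return {link for link in links if failed not in link}


full_ring = ring_links(4)                        # 1-2, 2-3, 3-4, 4-1
one_cable_down = full_ring - {frozenset((1, 2))}
member_3_gone = without_switch(full_ring, 3)     # leaves the chain 4-1-2
second_fault = member_3_gone - {frozenset((1, 2))}

print(is_connected({1, 2, 3, 4}, one_cable_down))  # ring absorbs one cable fault
print(is_connected({1, 2, 4}, member_3_gone))      # survivors still form a chain
print(is_connected({1, 2, 4}, second_fault))       # a second fault splits the stack
```

With Switch 3 gone, the surviving members are a chain rather than a ring, so any further cable fault partitions the stack. That is the exposure the replacement removes.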

What Happens When a Member Fails?

When a member drops off the stack, the Active switch detects the loss, marks that switch as Removed or Failed, and brings down all ports on that member. All other members continue forwarding normally. Critically, the Active switch retains a provisioned entry for the missing switch in the running-config — this preserves the interface configuration so it can be applied automatically when a replacement joins.

Prerequisites Before Replacement

1. Take a Full Configuration Backup

Save all relevant state before making any change. If something goes wrong, you will need this output.

show running-config
show version
show switch
show switch stack-ports
show switch neighbors

Copy all output off-device — to a TFTP server, your change-management system, or at minimum a local text file.
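
If you collect the baseline with a script, one file per command keeps later comparisons simple. The following is a minimal sketch: run_command is a hypothetical callable that you would back with your own SSH or console library, and fake_runner exists only so the example runs without a device.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

BASELINE_COMMANDS = [
    "show running-config",
    "show version",
    "show switch",
    "show switch stack-ports",
    "show switch neighbors",
]


def save_baseline(run_command, out_dir, label="pre-change"):
    """Run each baseline command and store its output in its own text file.

    run_command is a caller-supplied function (hypothetical here) that takes
    a CLI command string and returns the device output as a string.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / f"{label}-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    for command in BASELINE_COMMANDS:
        filename = command.replace(" ", "_") + ".txt"
        (target / filename).write_text(run_command(command) + "\n", encoding="utf-8")
    return target


def fake_runner(command):
    """Stand-in for a real device session, used only for this demo."""
    return f"(output of '{command}' would appear here)"


with tempfile.TemporaryDirectory() as scratch:
    folder = save_baseline(fake_runner, scratch)
    saved_files = sorted(path.name for path in folder.iterdir())

print(saved_files)
```

Calling save_baseline again with label="post-change" after the swap gives you a matching set of files to compare.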

2. Identify the Faulty Switch

show switch

Look for a member whose State column shows Removed, Failed, or Provisioned. Note its switch number and its current role. Confirm it is not the Active switch — replacing the Active requires an additional failover step covered in the procedure below.
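
When the stack is large, a short parser can pick out the non-Ready members for you. The sample text below is illustrative only: column layout differs between platforms and releases, so treat the regular expression as an assumption to adjust against your own show switch output.

```python
import re

# Illustrative output; real formatting varies by platform and software release.
SAMPLE_SHOW_SWITCH = """\
Switch/Stack Mac Address : 0cd0.f8e4.6b00 - Local Mac Address
Mac persistency wait time: Indefinite
                                             H/W   Current
Switch#   Role    Mac Address     Priority Version  State
------------------------------------------------------------
*1       Active   0cd0.f8e4.6b00     15     V01     Ready
 2       Standby  0cd0.f8e4.7000     14     V01     Ready
 3       Member   0000.0000.0000     0      0       Removed
"""

ROW = re.compile(
    r"^\s*\*?(?P<number>\d+)\s+(?P<role>\S+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+"
    r"(?P<priority>\d+)\s+(?P<hw>\S+)\s+(?P<state>.+?)\s*$"
)


def parse_show_switch(text):
    """Return one dict per member row; header and separator lines are skipped."""
    members = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            members.append({
                "number": int(match["number"]),
                "role": match["role"],
                "priority": int(match["priority"]),
                "state": match["state"],
            })
    return members


def not_ready(members):
    """Members whose state is anything other than Ready."""
    return [m for m in members if m["state"] != "Ready"]


members = parse_show_switch(SAMPLE_SHOW_SWITCH)
for member in not_ready(members):
    print(f"Switch {member['number']} ({member['role']}): {member['state']}")
```

The role field of each flagged member tells you whether the Active failover step in the procedure applies.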

3. Verify the Replacement Hardware

The replacement switch must match the existing stack on all three of the following attributes:

| Attribute | Why It Matters | Command to Check |
| --- | --- | --- |
| Switch Model (PID) | Incompatible hardware cannot join the stack | show inventory \| include PID |
| IOS Version | A mismatch triggers an automatic IOS upgrade (when auto-upgrade is enabled) that can add 10–20 minutes to your window | show version \| include Version |
| Boot Mode | Install mode and Bundle mode are not interchangeable; a mismatch prevents a clean boot | show version \| include Mode |

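If the new switch has a different IOS version, the Active switch can push the correct image over the stack interconnect when the replacement joins, provided auto-upgrade is enabled. On Catalyst 3850 and 9000 series stacks this requires software auto-upgrade enable in the configuration and works only in Install mode; without it, the new member stays in Version Mismatch. Treat auto-upgrade as a safety net: it adds significant time, and pre-loading the correct IOS on the replacement switch before connecting it to the stack avoids the delay entirely.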
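
The three-attribute check is easy to automate once you have noted the values from both sides. This is a minimal sketch; the dictionary keys and the sample model, version, and mode values are hypothetical placeholders, not recommendations.

```python
def compatibility_issues(stack, replacement):
    """Compare model, IOS version, and boot mode; return readable mismatches."""
    issues = []
    for attribute in ("model", "ios_version", "boot_mode"):
        expected = stack[attribute].strip().upper()
        actual = replacement[attribute].strip().upper()
        if expected != actual:
            issues.append(
                f"{attribute}: stack has {expected}, replacement has {actual}"
            )
    return issues


# Hypothetical values copied by hand from the show commands in the table.
stack = {"model": "C9300-24P", "ios_version": "17.6.5", "boot_mode": "INSTALL"}
spare = {"model": "C9300-24P", "ios_version": "17.3.4", "boot_mode": "BUNDLE"}

issues = compatibility_issues(stack, spare)
for issue in issues:
    print("MISMATCH", issue)
if not issues:
    print("Replacement matches the stack on all three attributes")
```

An empty result means the spare can be connected without budgeting for an image copy or a boot-mode conversion.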

4. Schedule a Maintenance Window

Even replacing a Member switch will cause a brief spanning-tree recalculation on segments connected to that member's ports. Notify stakeholders and schedule accordingly. Replacing the Active switch requires additional planning — a controlled failover before any physical work.

Step-by-Step Replacement Procedure

  1. Confirm Stack and Ring Health

    show switch
    show switch stack-ports
    show switch neighbors

    Verify the faulty switch number and confirm all other members are in Ready state. Check that the stack ports of the failed member, or the neighbour ports facing it, show Absent or Down. This confirms the ring is already broken at that point, so removing the switch will not cause any additional disruption.

  2. Handle Active Switch Failover (If the Faulty Switch Is the Active)

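    If the failed switch is a Member or Standby, skip to Step 3. If it is still running as the Active, lower its priority so it does not reclaim the role at the next election, then force a switchover to the Standby:

    switch <active-number> priority 1
    redundancy force-switchover

    Do not use a plain reload here: issued on the Active, it restarts every member of the stack. On platforms that lack redundancy force-switchover, reload slot <active-number> reloads only that member.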

    Wait for the Standby to be elected as the new Active before proceeding. Verify with show switch.

    Never remove the Active switch without a planned failover. An unplanned Active loss forces an unscheduled switchover, and if no synchronised Standby is available it triggers a full re-election that can drop traffic for 1–3 minutes or longer.

  3. Factory Reset the Replacement Switch in Standalone Mode

    Power up the new switch with no stack cables connected and erase any previous configuration:

    write erase
    delete flash:vlan.dat
    reload

    On Catalyst 9000 series running Install mode, also clean up old package files:

    install remove inactive

    Starting from a blank slate prevents stale VLANs, old hostnames, or conflicting configurations from interfering with the stack join.

  4. Pre-Assign the Switch Number and Priority

    Assigning the same switch number as the failed member ensures the provisioned interface configuration stored on the Active is applied automatically at join time. Still on the standalone switch (before connecting stack cables), set the priority against the switch's current number, then renumber it:

    switch 1 priority 7
    switch 1 renumber 3
    write memory
    reload

    A factory-reset standalone switch is normally number 1, and the new number takes effect only after the reload. Adjust the target number and priority to match your environment. After the reboot, the switch identifies itself as Switch 3.

    Match the priority level of the original switch so the Active/Standby election result does not change after the replacement joins.

  5. Remove Stale Provision Entries on the New Switch

    After renumbering, check for leftover provisioned entries that could cause a number conflict:

    show running-config | include provision

    If you see a line for a number other than the one you just assigned, such as switch 1 provision WS-C3850-24P, remove it in global configuration mode:

    no switch 1 provision
    write memory

    Leaving a stale provision entry causes the switch to report a number conflict when it joins and prevents the correct interface configuration from loading.

  6. Power Off and Remove the Faulty Switch

    Label every cable before disconnecting — data uplinks, stack cables, and power. Then:

    1. Power off the faulty switch.
    2. Disconnect both StackWise stack cables.
    3. Disconnect all data uplinks.
    4. Unmount and remove the switch from the rack.

    Double-check you are working on the correct physical switch. Removing the wrong unit from an active stack can take down a live segment immediately.

  7. Install the New Switch and Reconnect the Stack Ring

    Rack the replacement switch in the same position. Reconnect the StackWise cables to restore the ring topology:

    • Stack port 1 of the new switch → Stack port 2 of its lower neighbour
    • Stack port 2 of the new switch → Stack port 1 of its upper neighbour

    Do not reconnect data uplinks yet — wait until the switch is confirmed Ready.

    Use only Cisco-approved StackWise cables (e.g., CAB-STK-E-0.5M, CAB-STK-E-1M). Third-party cables are not supported and can cause intermittent stack errors that are difficult to diagnose.

  8. Power On the New Switch and Monitor the Join

    Apply power and attach a console cable to the new switch. A successful join produces log messages like:

    %STACKMGR-6-SWITCH_ADDED: Switch 3 has been ADDED to the stack
    %STACKMGR-6-SWITCH_READY: Switch 3 is READY

    If an IOS version mismatch is detected and auto-upgrade is enabled, the Active copies the correct image automatically:

    %IMAGEMGR-6-AUTO_COPY_SW: IOS version mismatch. Copying software to switch 3...
    %IMAGEMGR-6-AUTO_COPY_SW_DONE: Auto-copy of software to switch 3 complete.

    Do not interrupt an auto-upgrade in progress. The switch will reload automatically once the image copy is complete and then join the stack normally.

  9. Reconnect Data Uplinks

    Once show switch confirms the new member is in Ready state, reconnect all data uplinks to their labelled ports. Interfaces will come up with the configuration that was stored in the provisioned entry on the Active.

  10. Monitor Console Logs

    Watch for any error messages related to IOS, stack formation, spanning-tree, or port-channel on both the new switch console and the Active switch console. Allow 60–90 seconds for STP to reconverge before drawing conclusions.

  11. Verify Full Stack Status

    show switch
    show switch stack-ports

    All members must show Ready. Both stack ports on every switch must show OK. This confirms the ring is fully restored.

  12. Perform Final Validation

    show interfaces status
    show ip interface brief
    show spanning-tree summary
    show etherchannel summary
    show logging

    All interfaces and protocols should be stable. Compare against your pre-change baseline output. When everything checks out, save the configuration: write memory
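
Comparing against the pre-change baseline is less error-prone with a line-level diff than by eye. The sketch below uses Python's standard difflib module; the two sample strings are simplified, made-up excerpts in the spirit of show interfaces status, so substitute the text you actually captured.

```python
import difflib


def changed_lines(before, after):
    """Return only the removed ('- ') and added ('+ ') lines between two outputs."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Simplified, illustrative excerpts; real output has more columns.
pre_change = (
    "Gi3/0/1   uplink-a   connected   trunk\n"
    "Gi3/0/2   uplink-b   connected   trunk"
)
post_change = (
    "Gi3/0/1   uplink-a   connected   trunk\n"
    "Gi3/0/2   uplink-b   err-disabled   trunk"
)

differences = changed_lines(pre_change, post_change)
if differences:
    print("Investigate before closing the change:")
    for line in differences:
        print(line)
else:
    print("Output matches the pre-change baseline")
```

Counters and timestamps will always differ between captures, so apply this to state-oriented outputs such as interface status, EtherChannel summary, and spanning-tree summary instead of the full log.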

Post-Replacement Verification Reference

Use this table as a quick checklist after the replacement is complete.

| Command | What to Check | Expected Result |
| --- | --- | --- |
| show switch | All member states, switch numbers | All members show Ready |
| show switch stack-ports | Stack port link and sync status | Link: OK, Sync: OK on all ports |
| show switch neighbors | Ring topology | Every switch sees both neighbours |
| show version | IOS version on all members | Identical version on every switch |
| show interfaces status | Port states on replaced switch | No unexpected err-disabled ports |
| show spanning-tree summary | Root bridge, topology change counters | Stable STP, no active topology changes |
| show etherchannel summary | Port-channel bundle state | All links show bundled (P) state |
| show logging | Recent syslog messages | No repeated errors after join |
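
The stack-port row of the checklist lends itself to a scripted check. The sample below assumes the short two-column form of show switch stack-ports; the layout is illustrative and differs between platforms, so adjust the pattern to what your stack prints.

```python
import re

# Illustrative output; verify the column layout on your own platform.
SAMPLE_STACK_PORTS = """\
  Switch #    Port 1       Port 2
  --------    ------       ------
    1           Ok           Ok
    2           Ok          Down
    3          Down          Ok
"""

ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s*$")


def ports_not_ok(text):
    """Return (switch, port, status) for every stack port that is not Ok."""
    problems = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or separator line
        switch = int(match.group(1))
        for port, status in ((1, match.group(2)), (2, match.group(3))):
            if status.lower() != "ok":
                problems.append((switch, port, status))
    return problems


problems = ports_not_ok(SAMPLE_STACK_PORTS)
if problems:
    for switch, port, status in problems:
        print(f"Switch {switch} stack port {port}: {status}")
else:
    print("Ring fully restored: every stack port reports Ok")
```

In the sample, the two Down ports sit on either end of the same cable between Switch 2 and Switch 3, which is the pattern to expect when a single stack cable is unseated.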

Troubleshooting Common Issues

New switch does not appear in show switch

Check both StackWise cables — one may be seated incorrectly or internally damaged. Run show switch stack-ports from the Active to identify which port reports a problem. Try swapping in a known-good cable. Also confirm the new switch completed its full boot cycle and any IOS auto-upgrade before expecting it to appear in the stack.

Switch joins but remains in Provisioning or Version Mismatch state

An IOS auto-upgrade may be in progress. Do not interrupt it; wait for the image copy to complete and the switch to reload automatically. If the state never clears (for example, because auto-upgrade is disabled) or the switch loops repeatedly, manually copy the correct IOS image to its flash:

copy tftp://<server-ip>/<ios-image> flash:

Then set the boot variable and reload.

Switch joins but interfaces have no configuration

The switch joined under a different member number than the provisioned entry. Run show switch and compare the reported number to the provisioned stanzas in show running-config | include provision. Renumber the switch to match and reload it.
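
A numbering mismatch is quick to spot by comparing the provision lines in your pre-change backup with the member numbers currently in the stack. The sketch below assumes the switch N provision MODEL line format shown earlier in this guide; the sample configuration and the function names are hypothetical.

```python
import re

PROVISION = re.compile(r"^switch\s+(\d+)\s+provision\s+(\S+)", re.MULTILINE)


def provisioned_numbers(config_text):
    """Map each provisioned member number to its model string."""
    return {int(number): model for number, model in PROVISION.findall(config_text)}


def numbering_mismatch(pre_change_config, present_numbers):
    """Return (missing, unexpected) member numbers relative to the backup."""
    expected = set(provisioned_numbers(pre_change_config))
    present = set(present_numbers)
    return sorted(expected - present), sorted(present - expected)


# Hypothetical excerpt from a pre-change running-config backup.
BACKUP_CONFIG = """\
hostname STACK-01
switch 1 provision ws-c3850-24p
switch 2 provision ws-c3850-24p
switch 3 provision ws-c3850-24p
interface GigabitEthernet3/0/1
 switchport mode trunk
"""

# Member numbers read from show switch after the replacement joined.
missing, unexpected = numbering_mismatch(BACKUP_CONFIG, [1, 2, 4])
print("Provisioned but absent:", missing)
print("Present but not provisioned before the change:", unexpected)
```

One missing number paired with one unexpected number, as in this sample, points to a replacement that joined under the wrong number and needs renumbering.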

Spanning-tree topology changes persist after replacement

A brief topology change is expected immediately after join — allow 60–90 seconds for convergence. If changes continue, check whether a new root bridge was elected unexpectedly. Run show spanning-tree detail to identify the originating port and address the root cause.

Port-channel stays suspended on the replaced switch

Re-check the LACP or PAgP configuration on both ends of the link. Common causes include a mismatched channel-group mode (e.g., active vs on), missing VLANs on a trunk, or a duplex/speed mismatch. Run show etherchannel detail and look at the affected port-channel for the reason its member ports are suspended.

If the stack does not stabilise within 30 minutes of joining, collect show tech-support output and open a Cisco TAC case. Have a rollback plan ready — reconnect backup uplinks or restore the original hardware if it can power on.

Best Practices

  • Always perform the replacement during a scheduled maintenance window
  • Use the identical switch model, IOS version, and boot mode as the rest of the stack
  • Pre-load the correct IOS on the replacement switch before connecting it to avoid auto-upgrade delays
  • Label all stack cables and data uplinks before removal so reconnection is accurate
  • Keep a factory-reset spare switch with the correct IOS pre-loaded on the shelf at all times
  • Always restore the full ring topology — a degraded chain leaves the stack without cable redundancy
  • Avoid removing the Active switch unless a planned failover has already transferred the role
  • Save the configuration with write memory once the stack is confirmed healthy

Common Mistakes to Avoid

  • Removing the Active switch without first triggering a controlled failover to the Standby
  • Connecting a switch with a mismatched IOS version without budgeting time for auto-upgrade
  • Forgetting to erase the new switch before joining, causing VLAN or config conflicts
  • Skipping switch renumbering, causing interfaces to come up unconfigured
  • Not removing stale provision entries on the new switch before connecting it
  • Using third-party or damaged StackWise cables
  • Reconnecting data uplinks before the new switch reaches Ready state
  • Ignoring Install vs Bundle boot mode differences between stack members
  • Closing the change window without running a full post-replacement verification

Conclusion

Replacing a faulty Cisco stack switch is a structured process, not a rushed hardware swap. The steps most often skipped — removing stale provision entries, pre-matching the IOS version, and restoring the full ring topology — are the ones most likely to cause a second outage or a re-do. Work through the procedure in order, monitor the join on the console, and run the full verification checklist before declaring the change complete.

With the right preparation, a member switch replacement can be completed within a maintenance window with no impact on end users beyond the ports physically located on that switch.

Always run write memory after confirming the stack is healthy. A successful hardware swap means nothing if the next reload brings back a stale or missing configuration.