2020-09-09 – SIP Trunk Outage
A misconfiguration of the SER Trunk SBC resulted in a failure to properly respond to a telecore router failure and switch to the redundant unit.
On 09/09/20, calling issues were reported around 10pm and isolated to SIP Trunking at 10:15pm. AT&T was engaged, and additional troubleshooting led to an SBC SWACT that restored service at 11:45pm. AT&T engagement was difficult, and tickets were not opened until 12am. The next morning a failed telecore router in SER was identified as the primary cause of the failure, and reconfiguring the ports onto existing equipment restored redundancy until the equipment is moved to the new Network Access Point (the router at the old Network Access Point was eliminated). However, questions remained as to why normal HA failover did not occur on the External Trunk SBC when the router failed. A case was opened with Ribbon, and the investigation revealed that monitoring of the port was not enabled, so no HA trigger to change to the paired COM SBC would ever have occurred. The configuration was corrected on 09/14/2020.
20:42 Logs indicate router failure and Service Disruption Begins
21:42 Call from Ann Treffer at UTPD to KJ. Callers can't reach dispatchers at 4714441, the UTPD AutoAttendant
22:05 KJ and MM confirm that on-campus calls to the UTPD AA are OK; calls to and from campus are failing
22:12 Report from William, who received an escalation that the Operators' phones were not working
22:17 Message to ITS Problems regarding investigation of reported voice problems
22:25 Update to ITS Problems regarding SIP Trunking and a ticket being opened with AT&T
22:30 MT and PR confirm internal systems seem good, with no alarms indicating internal issues; MT engages AT&T TMAC to report a possible IPFlex trunking issue.
22:43 UDC opens Sr Staff conference bridge
22:45 Still in queue with TMAC; NR engages Anthony Bomba; AT&T management portal noted as down
22:50 Anthony Bomba finally gets us in contact with TMAC; ticket opened. Much confusion on their part (circuit down, yet working?)
23:26 ITS Problems updated that there was no change, and that external calls to/from the university were still not working
23:45 MT's continued troubleshooting while awaiting TMAC prompts a Trunk SBC SWACT
23:46 Call failures are shown as cleared by the SWACT; service restored
00:00 Troubleshooting and monitoring; AT&T investigating
00:05 ITS Problems informed that external calls were completing, and that additional testing was underway to confirm
00:22 ITS Problems informed that service was restored, and the problem resolved.
08:00 AT&T management portal reachable; a message is discovered indicating maintenance and downtime 09/10/2020 00:00 - 06:00. It appears they started early.
10:10 MT discovers ser-voip-rtr dark, probably chassis failure.
11:30 DB advises MT that ser-voip-rtr recovery is not likely; needed ports rebuilt on other switches, completed by 13:30
14:00 Root cause of HA failing to switch to the offline side discovered by Ribbon support: port monitoring was not enabled on the interface.
Monitoring and Detection
Detection was via customer reporting. No alarms or monitoring detected the condition. The Legacy Campus SBC detected the condition and performed a SWACT, but since it is deprecated following lifecycle, it was not being actively monitored. The Trunk SBC monitoring was incorrectly configured and neither raised an alarm nor performed a SWACT to the standby unit.
The failed router was not escalated by Networking.
Inbound and outbound SIP Trunk calls were restored after a manual system SWACT on the Trunk SBCs at 09/09/2020 23:45. System redundancy was restored at 09/10/2020 13:30 by bypassing the failed router and moving the interfaces to existing switches in the SER Switchroom. The missing HA functionality was configured on 09/14/2020 at 14:00, following the Ribbon support investigation into why HA had failed to automatically SWACT to the standby SBC.
Inbound/Outbound call traffic for the University was non-functional for nearly 3 hours from 20:42:30-23:46:30 09/09/2020. E911 was not affected by the outage, and internal calls still completed
The service disruption spanned approximately 3 hours, or 10,180 seconds. The overall service SLA will be treated as unavailable even though E911 and internal calls remained functional
10,180 seconds total downtime (YTD)
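As a rough check of the downtime arithmetic, the sketch below computes the span between the two timestamps stated above and converts a downtime figure into an availability percentage. The one-year measurement period (calendar year 2020, 366 days) is an assumption; the report does not state the SLA period. The span between 20:42:30 and 23:46:30 works out to 11,040 seconds, which differs from the reported 10,180 seconds; the report does not say how the reported figure was measured.

```python
from datetime import datetime

# Outage window exactly as stated in this report (09/09/2020).
start = datetime(2020, 9, 9, 20, 42, 30)
end = datetime(2020, 9, 9, 23, 46, 30)
window_seconds = int((end - start).total_seconds())

# Downtime figure as reported in this document.
reported_seconds = 10_180

# Assumption: availability is measured over calendar year 2020 (a leap year).
SECONDS_IN_2020 = 366 * 24 * 60 * 60


def availability_pct(downtime_seconds: int, period_seconds: int = SECONDS_IN_2020) -> float:
    """Percentage of the period during which the service was available."""
    return 100.0 * (1 - downtime_seconds / period_seconds)


print(f"stated window: {window_seconds} s")
print(f"availability from reported figure: {availability_pct(reported_seconds):.4f}%")
print(f"availability from stated window:   {availability_pct(window_seconds):.4f}%")
```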
Initial notification to ITS-Problems was performed by William Green approximately 15 minutes after troubleshooting began. Initial responders should have indicated earlier that a potential problem was being investigated. Recurring follow-up ITS-Problems notifications and reports were sent, with regular 30-minute updates on the issues. A major problem conference bridge was initiated, with William Green providing updates to participants. UTPD, the reporting customer, was called directly and informed of the resolution, and a follow-up explanation of the problem was provided.
Root Cause Analysis
Root cause has been attributed to a Trunk SBC misconfiguration, which prevented the SBC from properly exercising its HA capability and switching to its standby partner when the initiating event, the failed telecore router, occurred.
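One way to catch this class of misconfiguration ahead of a failure is a periodic audit that flags any interface expected to trigger HA failover but lacking link monitoring. The following is a minimal sketch under stated assumptions: the `Interface` record, its field names, and the interface names are hypothetical and do not reflect Ribbon's configuration schema or any vendor API; in practice the inventory would have to be exported from the SBC by whatever means the vendor supports.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Interface:
    """Hypothetical inventory record; not a vendor data model."""
    name: str
    ha_critical: bool           # loss of this port should trigger a switchover
    link_monitor_enabled: bool  # port state is watched by the HA mechanism


def unmonitored_ha_ports(interfaces: Iterable[Interface]) -> List[str]:
    """Return names of HA-critical interfaces that have no link monitoring."""
    return [
        iface.name
        for iface in interfaces
        if iface.ha_critical and not iface.link_monitor_enabled
    ]


# Hypothetical inventory for illustration only.
inventory = [
    Interface("mgmt0", ha_critical=False, link_monitor_enabled=False),
    Interface("pkt0", ha_critical=True, link_monitor_enabled=True),
    Interface("pkt1", ha_critical=True, link_monitor_enabled=False),
]

missing = unmonitored_ha_ports(inventory)
if missing:
    print("HA-critical ports without link monitoring:", ", ".join(missing))
else:
    print("All HA-critical ports are monitored.")
```

An audit like this only confirms that monitoring is configured; it does not replace exercising an actual failover.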
Mitigation and Lessons Learned
Trouble Shooting lessons:
The expectation that HA functionality and alarm reporting would function correctly was very high, and this misdirected initial troubleshooting: because a switchover had not occurred, this was interpreted as an indication that all was well with UT equipment. Adding to this belief was the lack of any other alarms; however, because no other monitored equipment remained in SER to alert to the failure, this was also a bad assumption. Coupled with the failure of the AT&T management portal and the inability to reach AT&T trouble reporting, efforts were misdirected towards vendor support, delaying actual problem resolution for about 30 minutes, at which time a manual SWACT restored the service. Lesson learned: perform a manual SWACT earlier in the troubleshooting cycle.
More careful monitoring:
Since transition periods for moves/deprecated equipment can be prolonged, review the monitoring and reporting capability of deprecated but not yet decommissioned equipment (the telecore router in this instance). In addition, explore a call completion testing mechanism to proactively detect outages when SIP Trunk traffic is low.
Voice needs to monitor the network infrastructure it relies upon (monitoring for this router had been removed because the equipment was being moved).
Networking failed to detect and escalate the failed Telecore router.
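The call completion testing idea above can be sketched as a small monitor that raises an alarm after several consecutive failed test calls, so that a single failed probe does not page anyone. This is an illustrative sketch only: the class name, the threshold of three, and the simulated results are assumptions, and actually placing a test call through the SIP trunk is left out because the report does not specify a mechanism.

```python
class CallCompletionMonitor:
    """Tracks synthetic test-call results (hypothetical helper, not an existing tool)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, call_completed: bool) -> bool:
        """Record one test-call result; return True while the alarm condition holds."""
        if call_completed:
            # Any completed call resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


# Simulated probe results: True means the external test call completed.
monitor = CallCompletionMonitor(failure_threshold=3)
simulated_results = [True, False, True, False, False, False, True]
alarm_states = [monitor.record(result) for result in simulated_results]
print(alarm_states)
```

Because the probe generates its own traffic, it can detect a trunk failure during low-traffic periods such as late evening, when few real callers would notice and report it.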
UT Trouble reporting:
UT Voice needs an escalation telephone number, independent of UT Voice infrastructure, to reach on-call personnel following retirement of the on-call pager. Need to engage PagerDuty for a telephone number and Live Call Routing in a manner similar to Edge Networking.
AT&T Trouble reporting:
Issues were encountered trying to open a ticket with AT&T TMAC to engage the vendor in troubleshooting. A delay of over 90 minutes occurred before an incident could be logged. Follow-up actions are needed to determine why this occurred. The AT&T management portal was unavailable during the outage; it appears to have been scheduled for maintenance later in the morning, but the maintenance was initiated early.
UDC Incident procedures:
UDC personnel encountered issues utilizing the AT&T POTS line, and initiating the Sr Staff conference bridge.
Annual HA exercise simulating actual equipment failure
Information learned from outage
Trouble Shooting improvements needed
Early manual SWACT initiation in problem resolution process
Monitoring improvements needed
Initiate quarterly review of component monitoring performance. Explore automated trunk call testing. Review Telecore router monitoring for shortfalls (Networking has corrected monitoring issues). Add Failure Testing (scheduled) to gain confidence and exercise monitoring tools.
UT Trouble Reporting improvements
Need to re-establish a UT Voice infrastructure contact mechanism for on-call personnel. Exploring LCR for PagerDuty service.
AT&T TMAC Contact issues
Report from AT&T regarding issues, and re-establish agreed SLAs for ticket opening and response (AT&T has undertaken process improvements). Status: Completed. Owner: Voice.
Regular HA function testing
Annual or Semi-Annual maintenance window action to test failover functionality for HA systems
ServiceNow Ticket RITM0614274