2020-09-09 – SIP Trunk Outage

...

On 09/09/20, calling issues were reported around 10:00pm and isolated to SIP trunking at 10:15pm. AT&T was engaged, and additional troubleshooting led to an SBC SWACT that restored service at 11:45pm. Engaging AT&T was difficult, and tickets were not opened until 12:00am. The next morning, a failed Telecore router in SER was identified as the primary cause of the failure. Reconfiguring the ports to the equipment restored redundancy until the equipment is moved to the new Network Access Point (the router at the old Network Access Point was eliminated). However, questions remained as to why normal HA failover did not occur on the External Trunk SBC when the router failed. A case was opened with Ribbon, and the investigation revealed that monitoring of the port was not enabled, so no HA trigger to change to the paired COM SBC would ever have occurred. The configuration was corrected on 09/14/2020.
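The failover gap described above can be illustrated with a small model. This is only a sketch of the reasoning, not Ribbon's actual HA logic or configuration syntax; the `Port` class, the `monitored` flag, and the port name `trunk-uplink` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    link_up: bool
    monitored: bool  # hypothetical flag standing in for "port monitoring enabled"


def should_fail_over(ports):
    """Return True if any monitored port has lost link.

    Ports with monitoring disabled are invisible to the decision,
    so their failure can never raise an HA trigger.
    """
    return any(p.monitored and not p.link_up for p in ports)


# Router-facing port is down, but monitoring was never enabled: no trigger.
before_fix = [Port("trunk-uplink", link_up=False, monitored=False)]
# Same failure with monitoring enabled: the trigger fires.
after_fix = [Port("trunk-uplink", link_up=False, monitored=True)]

print(should_fail_over(before_fix))  # False
print(should_fail_over(after_fix))   # True
```

In this model the link failure is identical in both cases; only the monitoring flag decides whether the standby SBC would ever be asked to take over.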

...

Information learned from outage

Mitigation

Status

Assignee

Troubleshooting improvements needed

Early manual SWACT initiation in problem resolution process

Completed

Voice

Monitoring improvements needed

Initiate a quarterly review of component monitoring performance. Explore automated trunk call testing. Review Telecore router monitoring for shortfalls (Networking has corrected monitoring issues). Add scheduled failure testing to gain confidence in, and to exercise, the monitoring tools.

With the Trunk SBC lifecycle completing, need to make sure monitoring is well implemented. 11/11/2021: Implemented/Completed

Voice/Networking

UT Trouble Reporting improvements

Need to re-establish a UT Voice infrastructure contact mechanism for on-call personnel. Exploring LCR for PagerDuty service.

Implemented/OpsGenie

Voice

AT&T TMAC Contact issues

Report from AT&T regarding issues, and re-establish agreed SLAs for ticket opening and response (AT&T has undertaken process improvements)

Completed

Voice

Regular HA function testing

Annual or semi-annual maintenance window action to test failover functionality for HA systems

Completed

Voice
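One lightweight way to approach the "automated trunk call testing" item is a periodic SIP OPTIONS probe against the trunk-facing SBC address. The sketch below is an assumption about how such a probe might look, not the tooling the team adopted. It checks signaling reachability only and does not place a real call. The function names and the `probe@` user part are hypothetical; the message layout follows RFC 3261.

```python
import socket
import uuid


def build_options(target_host, target_port, local_host, local_port):
    """Build a minimal SIP OPTIONS request (RFC 3261) as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 magic-cookie prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status(response):
    """Return the numeric status code from a SIP response, or None."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = first_line.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


def probe(target_host, target_port=5060, timeout=2.0):
    """Send one OPTIONS over UDP; return the status code, or None on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind(("", 0))  # let the OS pick a free local port
        local_port = sock.getsockname()[1]
        local_host = socket.gethostname()
        request = build_options(target_host, target_port, local_host, local_port)
        try:
            sock.sendto(request, (target_host, target_port))
            data, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and network errors
            return None
    return parse_status(data)
```

A scheduler could call `probe()` against each trunk address every few minutes and alert on-call personnel when it returns `None` or a 5xx code. A fuller test would place an actual call through the trunk, since a healthy OPTIONS response does not prove that media or the carrier side is working.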

...