On 06/26/20, Voice Services was notified that calls to UTPD were failing. Troubleshooting soon revealed that calls were functional, but message playback on the Auto-Attendant system was failing to inform callers of the proper menu choices, and they were hanging up thinking the call had failed. Analysis revealed that 10-40% of all calls to either the voicemail function of the IMM system or the Auto-Attendant function were failing.
- 13:03 – Troubleshooting revealed that replication errors began appearing in the error log.
- 13:08 – UTPD submitted a ticket (need ticket) reporting that incoming Admin lines to UTPD Dispatch were down. Outgoing calls were working.
- 14:31 – UTPD sends campus alert message: UTPD's main phone line is down. Dial 9-1-1 for emergencies and 512-232-4091 for admin.
- 15:25 – Replication errors stopped appearing in the error log.
- 16:32 – UTPD sends campus alert message: UTPD Phone lines have been restored to normal operations.
Monitoring and Detection
Calls to any number that used messages from the IMM Auto-Attendant (notably UTPD) or that were directed to IMM voicemail were potentially affected between 13:03 and 15:25 on 06/26/20. Test calls during the outage were reported failing 10-40% of the time. This was disruptive enough to UTPD that they sent a campus broadcast message. (Determine if any other tickets were opened by other departments)
Target SLA – 99.9%
Current SLA (before outage) – 99.885% (x Seconds total downtime YTD)
Post-incident SLA – 99.816% (x + y Seconds total downtime YTD – y Seconds this incident)
The outage was calculated from 13:03 to 15:25, the period during which the IMM error logs show failures occurring. No message playback failures were reported after Ribbon Support restarted the replication tasks and they returned to normal function at 15:25.
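The SLA arithmetic above can be sketched as follows. This is a minimal illustration, not the tooling used by Voice Services: the function names and the timestamp format are assumptions, and the year-to-date totals (x and y above) are deliberately left as inputs because this report does not state them. The sketch counts the whole 13:03-15:25 window as downtime, as the report does, even though only 10-40% of calls failed during it.

```python
from datetime import datetime


def outage_seconds(start: str, end: str, fmt: str = "%m/%d/%y %H:%M") -> int:
    """Return the length of an outage window in whole seconds.

    The timestamp format is an assumption chosen to match the dates
    and times as they are written in this report.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def availability_pct(period_seconds: int, downtime_seconds: int) -> float:
    """Return availability as a percentage of the measurement period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


# Outage window taken from the IMM error log timestamps in this report.
incident_seconds = outage_seconds("06/26/20 13:03", "06/26/20 15:25")
print(incident_seconds)  # 8520 (2 hours 22 minutes)
```

With the year-to-date period length and prior downtime filled in, the post-incident figure would be `availability_pct(period_seconds, prior_downtime_seconds + incident_seconds)`.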
Trouble Tickets were reported as follows:
- Help Desk –
- Telephone Switchroom –
- Telecom IT Manager –
Root Cause Analysis
[Under investigation. Appears to be related to the failure of the LDAP and MySQL replication tasks and an improper HA switchover.]
Mitigation and Lessons Learned
A follow-on incident showed that the IMM HA configuration was the source of the IMM issues. Since then, the IMM has been converted to a simplex arrangement with an active backup and a manual, rather than automatic, failure recovery procedure. This has prevented the Auto-Attendant audio playback issue from recurring. As of May 2022, we have been migrating the Auto-Attendants (AAs) to the Mirta PBX. The migration is about 50% finished. This section will be updated when it is complete.