On behalf of Team Apptentive, I apologize. After several years of near-perfect uptime, we let you down multiple times over the past month. We believe in transparency, and it’s our job as your partner to tell you exactly what happened, what it means for you and your customers, and what we’ll be doing to fix it.
This FAQ is intended to give you insight into what happened and what we’re doing about it. It’s comprehensive, but if we’ve missed anything, please let me know so that we can improve it. If you have questions or need to talk in real time – again, just let me know. My email is email@example.com and I appreciate your feedback.
Robi Ganguly, Co-founder and CEO
What caused your outages and issues with logging in?
Apptentive performs routine maintenance as well as non-routine upgrades to our system to improve security, increase capacity, and maintain its overall reliability objectives. The outage on March 29th was caused by a non-routine procedure to migrate event statistics into a higher-capacity system. This procedure had the unintended side effect of dropping data from the primary data store, and the best recovery procedure for this type of incident was to restore from the most recent full backups. Incidents of this scale trigger a full post-mortem and diagnosis so that additional actions can be taken to prevent issues of that type in the future.
Unfortunately, as part of the urgent actions taken after the March 29th incident to decrease recovery time and increase redundancy, we made a configuration mistake. That mistake caused the second outage, on April 15th. So, in the process of improving our system, we made an error and the system came down for a second time during business hours.
Finally, on April 20th, our authentication provider, Auth0, had a massive outage of its own. This was no solace to our team or to our customers: authentication is required for access to our dashboard, and as a result, many of you were unable to log in for 5 hours on April 20th.
Each of these outages was unacceptable, and we don’t intend to repeat these mistakes. Disrupting your work and the work of your team is not something we take lightly.
In the near term, our team has paused all non-routine/unscheduled work on our production systems, and we are implementing a formal Production Change Management process to add further scrutiny and gates to these types of tasks. Items of this nature will be scheduled during low activity times in the future to lessen the impact of any undesired outcomes.
Why did the outages and recovery take so long?
As part of its business, Apptentive collects, stores, and manages petabytes of data. Critical incidents that require us to restore from “cold” backups have a minimum recovery time of 6 hours with a recovery point of 2 hours. After a restore has completed, the system needs to “warm up” in order to support its normal traffic as well as the traffic that could not be sent during the outage window. The two large-scale outages required the same recovery procedure. We missed our 6-hour goal in the first outage. The second outage fell within our recovery time and point objectives, and we intend to ensure that, should there be outages in the future, we beat our recovery goal.
What is Apptentive doing to mitigate outage time and frequency in the future?
Our outage time for this critical class of incident is directly correlated to the speed at which we are able to fully recover from our backups.
Recently, AWS has provided additional options including “Fast snapshot restore” to accelerate this process and reduce our mean time to recover (MTTR). We have been able to implement these improvements to our disaster recovery plan.
In addition, our teams are investigating alternative infrastructure changes and recovery strategies to further reduce our MTTR. These may include hot, read-only replicas for large data stores, as well as server-side queuing for API requests to reduce the number of critical systems required for availability.
As for frequency, we are reviewing and updating our process and guidelines to ensure that we reduce the dependency on inherently risky work, and that we have sufficient validation checks and tests in place in the event that work is necessary.
Are interactions still shown when Apptentive has an outage? How are my customers impacted by Apptentive outages?
For any interaction set up prior to an outage, your customers will not be impacted by the outage.
The SDKs cache all localized UI content and targeting rules locally using a technique we call “manifest caching”. The Apptentive SDKs use these local manifests to determine when and where to show an interaction, and they work fully offline. Because of this, platform disruptions never impact the ability of an end consumer to see an interaction.
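To illustrate the manifest-caching idea, here is a minimal Python sketch. The manifest structure, field names, and targeting rule below are hypothetical and do not reflect Apptentive’s actual manifest format; the point is that the decision is made entirely from locally cached data, with no network call.

```python
import json

# Hypothetical cached manifest: interactions plus the targeting rules
# that decide when each one is shown. Cached at the last successful sync.
CACHED_MANIFEST = json.loads("""
{
  "interactions": {"love_prompt": {"title": "Enjoying the app?"}},
  "targets": {"app_launched": [{"interaction": "love_prompt", "min_launches": 3}]}
}
""")

def interaction_for_event(event, launches, manifest=CACHED_MANIFEST):
    """Decide offline, from the cached manifest, which interaction
    (if any) to show for an event. No network call is needed."""
    for rule in manifest["targets"].get(event, []):
        if launches >= rule["min_launches"]:
            return manifest["interactions"][rule["interaction"]]
    return None
```

Because the lookup runs against the cached copy, an end consumer still sees interactions configured before an outage even while the platform is unreachable.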
What happens to customer data during an outage? Is it lost?
No data from iOS and Android devices is ever lost.
Apptentive’s platform services at api.apptentive.com receive billions of events per day from our SDK consumers. In order to deal with large bursts of consumer traffic and overall mobile traffic growth, we have a multi-tier strategy to address these aspects in a balanced way for reliability, scale, quality, and performance. To account for errors, downtime, bursts of traffic, or degraded performance by the platform, our iOS and Android SDKs have a retry strategy.
All data is stored locally on the device prior to being sent to the API. When our SDKs receive a 5xx error response from the Apptentive API, or the device goes offline, they retry sending to the server using an exponential backoff strategy, perpetually. This means that no data from iOS and Android devices is ever lost.
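The retry loop described above can be sketched as follows. This is a minimal Python sketch, not Apptentive’s actual implementation: the function signature, delay values, and error handling are assumptions; the 2xx-success and 4xx-stop conditions come from the Zendesk integration description later in this FAQ.

```python
import time

def send_with_retry(payload, send, base_delay=1.0, max_delay=300.0):
    """Sketch of an SDK-style retry loop.

    `send` is assumed to return an HTTP status code. Data stays in the
    local store until a retry succeeds (2xx) or the request is no longer
    authorized (4xx); 5xx responses and network errors back off and retry.
    """
    delay = base_delay
    while True:
        try:
            status = send(payload)
        except OSError:                # device offline / network error
            status = None
        if status is not None and 200 <= status < 300:
            return status              # delivered; safe to drop from local store
        if status is not None and 400 <= status < 500:
            return status              # no longer authorized; stop retrying
        time.sleep(delay)              # wait, then retry
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Exponential backoff spaces retries further and further apart, so a long platform outage does not turn every device into a constant source of traffic when service resumes.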
How does the retry strategy impact performance and battery on a device?
Retried network calls can have an impact on performance and battery on the device, but we believe that our batching strategy and generally high uptime make this a minimal impact. Please let us know if your teams experienced any unusual activity levels or errors as a result of these outages.
What has your uptime been over the past year?
Our internal Service Level Objective for our API surface is 99.5% availability; our actual availability over the last year, including the recent outages, has been 99.58%. The uptime for our Dashboard (be.apptentive.com) over the last year is 99.77%.
How did the outages impact Zendesk messaging?
The Apptentive-Zendesk integration is impacted by the outages in the following ways:
- New messages created by consumers cannot be sent to Apptentive. When Apptentive SDKs cannot send messages to the platform, the messages are saved locally on the mobile device and perpetually retried using an exponential backoff policy until a retry succeeds (2xx status code) or is no longer authorized (4xx status code).
- Messages received by the Apptentive platform but not yet sent to Zendesk cannot be forwarded as new tickets in Zendesk. Generally, Apptentive will continue retrying to forward tickets to external helpdesk systems (Zendesk) until they succeed or are no longer authorized.
- Zendesk status updates and replies cannot be sent to Apptentive during outages. Apptentive receives these updates from Zendesk through its “Notify external target” webhook functionality, which has specific limitations: in a failure situation, it does not retry failed requests and automatically disables itself after 21 consecutive failures.
How did the outages impact Apptentive for Web?
Apptentive for Web requires all systems to be online to function properly, as there is no offline mode. During outages, Apptentive for Web consumers cannot receive new “manifests” or send events and responses to the Apptentive platform. Within a session, Apptentive for Web will retry sending items to the Apptentive platform up to 10 times with a 3-second delay between attempts. Items that cannot be sent to Apptentive within that window/session are lost.
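The contrast with the mobile SDKs’ perpetual retry can be sketched as a bounded loop. The 10-attempt and 3-second parameters come from the description above; the function name and the `send` callback returning an HTTP status code are assumptions of this sketch.

```python
import time

def web_send_with_retry(payload, send, attempts=10, delay=3.0):
    """Bounded, in-session retry as described for Apptentive for Web:
    up to 10 attempts with a 3-second delay between them. Items that
    still fail within the session are lost (no local persistence)."""
    for attempt in range(attempts):
        try:
            status = send(payload)
            if 200 <= status < 300:
                return True            # delivered within the session
        except OSError:                # network error; treat as a failed attempt
            pass
        if attempt < attempts - 1:
            time.sleep(delay)          # fixed delay, not exponential backoff
    return False                       # dropped: nothing survives the session
```

Because the browser environment has no durable local queue in this model, a `False` result means the item is gone, which is why web data loss is possible during an outage while mobile data is not.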
I appreciate you making it to the end of this FAQ. On behalf of Team Apptentive, I apologize – we let you down. I hope that this documents what happened and gives you confidence that we recognize the gravity of these issues and have a plan in place to improve your experience. Please let me know personally if you have any questions or would like to speak directly. I appreciate your candid feedback and accept responsibility for not meeting your expectations.
Robi Ganguly, Co-founder and CEO