Monitoring - We have identified the root cause of the intermittent 502 errors affecting POST requests in both the Test and Production environments. The issue was traced to an incompatibility between the Google Cloud Load Balancer and the NGINX controller in how each handles the keepalive connection window: the gateway sent a TCP FIN (finish) signal to the load balancer on connections that the load balancer still considered active.
This created a race condition in which a connection could be closed while the load balancer was still using it, leading to 502 errors.
To address this, we have increased the NGINX keepalive time to exceed the configured timeout interval of the load balancer. This adjustment ensures that the keepalive window in NGINX remains open longer than the load balancer's timeout, preventing premature connection terminations.
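As a rough illustration of the relationship described above, the following Python sketch computes the window during which a load balancer might still reuse an idle connection that the backend has already started closing. The function names and the timeout values are hypothetical and chosen only for illustration; they are not the values configured on our platform.

```python
def race_window_seconds(backend_keepalive_s: float, lb_idle_timeout_s: float) -> float:
    """Seconds during which the load balancer may still reuse an idle
    connection that the backend has already begun closing (0.0 if none)."""
    return max(0.0, lb_idle_timeout_s - backend_keepalive_s)


def keepalive_is_safe(backend_keepalive_s: float, lb_idle_timeout_s: float) -> bool:
    """True when the backend keeps idle connections open longer than the
    load balancer does, so the load balancer is the side that closes them."""
    return backend_keepalive_s > lb_idle_timeout_s


# Hypothetical values, for illustration only.
LB_IDLE_TIMEOUT_S = 600.0

# Backend closes first: a race window exists.
print(race_window_seconds(75.0, LB_IDLE_TIMEOUT_S), keepalive_is_safe(75.0, LB_IDLE_TIMEOUT_S))

# Backend outlives the load balancer timeout: no race window.
print(race_window_seconds(620.0, LB_IDLE_TIMEOUT_S), keepalive_is_safe(620.0, LB_IDLE_TIMEOUT_S))
```

The sketch only models the timing relationship; it does not inspect or change any NGINX or load balancer setting.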
We continue to monitor the system for stability and will provide further updates as needed. Thank you for your patience and understanding as we worked to resolve this issue.
We are pleased to report that in the past hour we have observed a significant reduction in the number of 502 errors compared to our usual baseline. While this is an encouraging sign that the fix is effective, we will continue to monitor the system closely to ensure sustained improvement and stability.
Nov 10, 2024 - 15:38 GMT-03:00
Identified - We have identified an underlying structural issue affecting our platform running on Google infrastructure (SaaS BR and US). This issue causes some POST requests originating from the internet to sporadically return 502 errors in both the Test and Production environments.
While the issue affects a very low percentage of total requests, customers with high traffic volumes may experience this error more frequently. It is important to note that the impact is primarily limited to POST requests. Despite this, our platform’s availability remains above 99.95%, which is higher than the contracted SLA.
Current Status: We are actively working on a permanent resolution. In the meantime, we recommend that customers experiencing higher impact implement a fast retry mechanism, as subsequent retries will process the request successfully.
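A minimal sketch of such a fast retry, in Python, is shown here for reference. The function name `post_with_fast_retry`, the `send` callable, and the default attempt count and delay are all hypothetical; `send` stands in for whatever HTTP client call you already use and is assumed to return a status code and response body. The sketch also assumes the request is safe to repeat.

```python
import time
from typing import Callable, Tuple


def post_with_fast_retry(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 3,
    delay_seconds: float = 0.05,
) -> Tuple[int, bytes]:
    """Call send() and retry quickly while it returns HTTP 502.

    send is a caller-supplied function that performs one POST and
    returns (status_code, body). Other status codes are returned as-is.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status, body = send()
    for _ in range(max_attempts - 1):
        if status != 502:
            break
        # Short pause so the retry goes out on a fresh attempt.
        time.sleep(delay_seconds)
        status, body = send()
    return status, body
```

If every attempt returns 502, the last 502 response is returned so the caller can handle it as usual.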
Please note that not all 502 errors are related to this specific issue.
Nov 08, 2024 - 15:08 GMT-03:00
Here you can check the platform's current status and historical data on past incidents. We keep this page updated with real-time information collected from our systems, so you can check regularly or sign up for SMS or email updates.
SaaS BR: Operational, 99.96% uptime over the past 90 days
BR - Portal: Operational, 100.0% uptime over the past 90 days
BR - Test Environment: Operational, 99.95% uptime over the past 90 days
BR - Prod Environment: Operational, 99.94% uptime over the past 90 days
BR - Core APIs: Operational, 99.96% uptime over the past 90 days
SaaS US: Operational, 99.99% uptime over the past 90 days
US - Portal: Operational, 100.0% uptime over the past 90 days
US - Core APIs: Operational, 100.0% uptime over the past 90 days
US - Prod Environment: Operational, 99.99% uptime over the past 90 days
US - Test Environment: Operational, 99.99% uptime over the past 90 days
Completed - The scheduled maintenance has been completed.
Nov 19, 09:00 GMT-03:00
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 19, 08:00 GMT-03:00
Scheduled - We will rotate the SSL certificates for our Core APIs, Portal, and Test/Prod Environments. No disruption is expected during this maintenance.
Nov 11, 23:38 GMT-03:00
Nov 18, 2024
No incidents reported.
Nov 17, 2024
No incidents reported.
Nov 16, 2024
No incidents reported.
Nov 15, 2024
No incidents reported.
Nov 14, 2024
No incidents reported.
Nov 13, 2024
No incidents reported.
Nov 12, 2024
No incidents reported.
Nov 11, 2024
No incidents reported.
Nov 10, 2024
Unresolved incident: Intermittent 502 Errors for POST Requests.