Elevated error rates due to backpressure
Resolved
Sep 16 at 02:48pm CST
Overview
The Steel Sessions API entered a degraded state with an elevated error rate due to intermittent failures in the session acquisition and reservation logic. The switching logic responsible for managing session lifecycles behaved unpredictably, leaving users unable to establish or maintain browser sessions. The issue was surfaced by a large spike in request volume: the resulting backpressure exposed this flaw in the switching logic.
Timeline
- 6:03 PM UTC - Initial surge in request volume to the Sessions API
- 6:53 PM UTC - Engineering team identified intermittent failures in session management, coinciding with the first user reports
- 7:48 PM UTC - Root cause traced to switching logic in session acquisition/reservation system
- 8:12 PM UTC - Issue isolated and fix implemented
- 8:48 PM UTC - Service fully restored and monitoring confirmed stability
Root Cause
The session acquisition and reservation system contained flawed switching logic that intermittently failed to properly:
- Allocate available browser sessions to incoming requests
- Release reserved sessions back to the available pool
- Handle concurrent session requests during high load periods
This resulted in a cascading failure where the session pool became exhausted due to sessions being incorrectly marked as reserved but not properly allocated or released.
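To make the failure mode concrete, here is a minimal, hypothetical sketch (the types and method names below are illustrative, not Steel's actual internals) of how a non-atomic reserve-then-allocate sequence leaks sessions when the second step fails under load:

```typescript
// Hypothetical, simplified model of the flawed switching logic; not the real code.
type SessionState = "available" | "reserved" | "allocated";

interface Session {
  id: string;
  state: SessionState;
}

class SessionPool {
  constructor(protected sessions: Session[]) {}

  // Flawed acquisition: reserve and allocate are separate, non-atomic steps.
  async acquire(): Promise<Session> {
    const session = this.sessions.find((s) => s.state === "available");
    if (!session) throw new Error("session pool exhausted");

    session.state = "reserved";          // step 1: mark as reserved
    await this.attachBrowser(session);   // step 2: can fail or time out under load
    session.state = "allocated";         // step 3: never reached if step 2 throws

    return session;
  }

  // When attachBrowser fails, nothing returns the session to "available",
  // so it is neither allocated to the caller nor released back to the pool.
  // Under sustained backpressure, the pool drains until it is exhausted.
  protected async attachBrowser(_session: Session): Promise<void> {
    /* connect the underlying browser; omitted in this sketch */
  }
}
```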
Impact
- User Impact: Progressively increasing failures when creating or maintaining browser sessions for automation tasks
- Duration: 2h45m (6:03 PM to 8:48 PM UTC)
- Affected Users: Users attempting to create new browser sessions during the incident window
- Business Impact: Service interruption affecting customer workflows and API integrations
Resolution
The engineering team identified and corrected the faulty switching logic in the session management system. The fix involved:
- Refactoring the session state transitions to ensure atomic operations
- Implementing proper error handling for edge cases in session allocation
- Adding additional validation checks for session pool consistency
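As a rough illustration of the shape of such a fix (reusing the hypothetical Session types from the sketch above; this is not the deployed code), the reserve step becomes a single synchronous transition and every failure path releases the session back to the pool:

```typescript
// Illustrative sketch of the corrected state transitions; names are hypothetical.
class FixedSessionPool {
  constructor(private sessions: Session[]) {}

  async acquire(): Promise<Session> {
    // Reserve synchronously so there is no interleaving window between
    // finding an available session and marking it reserved.
    const session = this.reserve();
    try {
      await this.attachBrowser(session);
      session.state = "allocated";
      return session;
    } catch (err) {
      // Edge-case handling: a failed allocation always returns the session.
      this.release(session);
      throw err;
    }
  }

  private reserve(): Session {
    const session = this.sessions.find((s) => s.state === "available");
    if (!session) throw new Error("session pool exhausted");
    session.state = "reserved";
    return session;
  }

  release(session: Session): void {
    session.state = "available";
  }

  // Pool-consistency check: "reserved" should only ever be a transient state.
  countByState(state: SessionState): number {
    return this.sessions.filter((s) => s.state === state).length;
  }

  private async attachBrowser(_session: Session): Promise<void> {
    /* connect the underlying browser; omitted in this sketch */
  }
}
```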
Current Status
✅ Issue Resolved: The switching logic has been fixed and deployed
✅ Service Restored: All session acquisition functionality is operating normally
✅ Monitoring Active: Enhanced monitoring is in place to detect similar issues
Next Steps and Preventive Measures
Immediate Actions (Next 7 Days)
- Enhanced Monitoring Implementation
  - Deploy additional alerting for session pool health metrics
- Load Testing
  - Conduct comprehensive load testing of session management under various scenarios
  - Validate fix effectiveness under simulated high-concurrency conditions (see the sketch after this list)
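As one example of what such a test can assert (a sketch built on the hypothetical pool above, not Steel's actual test suite), firing many concurrent acquisitions and then checking that no sessions are stranded in the transient reserved state would catch this class of bug:

```typescript
// Illustrative concurrency test against the hypothetical pool sketched above.
async function loadTestAcquire(pool: FixedSessionPool, concurrency: number): Promise<void> {
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () => pool.acquire()),
  );

  const acquired = results.filter((r) => r.status === "fulfilled").length;
  const failed = results.filter((r) => r.status === "rejected").length;

  // Invariant: every request either holds an allocated session or failed cleanly;
  // nothing may remain stuck in "reserved" once all requests have settled.
  const stuck = pool.countByState("reserved");
  if (stuck > 0) {
    throw new Error(`${stuck} sessions leaked into the reserved state`);
  }
  console.log(`acquired=${acquired} failed=${failed} reservedLeaks=${stuck}`);
}
```

The same reserved-session count is the kind of pool health metric the additional alerting can watch for early detection.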
Medium-term Actions (Next 30 Days)
- Code Review and Testing Enhancement
  - Comprehensive audit of the session management codebase
  - Implement additional unit and integration tests for session lifecycle edge cases
  - Establish chaos engineering practices for session management resilience
- Infrastructure Improvements
  - Evaluate session pool sizing and auto-scaling mechanisms, validating them with the load-testing approach from the immediate actions above
  - Design graceful degradation strategies for session pool exhaustion (see the sketch after this list)
- Documentation and Runbooks
  - Create detailed runbooks for session management incidents
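One possible shape for that degradation path (a hypothetical Express-style sketch reusing the pool types above; the route and behavior are illustrative, not a committed design): when the pool is exhausted, fail fast with a retryable response instead of holding requests open.

```typescript
// Hypothetical handler sketch for pool-exhaustion degradation; not Steel's API code.
import express from "express";

const pool = new FixedSessionPool(
  Array.from({ length: 10 }, (_, i): Session => ({ id: `s-${i}`, state: "available" })),
);

const app = express();

app.post("/v1/sessions", async (_req, res) => {
  try {
    const session = await pool.acquire();
    res.status(201).json({ id: session.id });
  } catch (err) {
    if (err instanceof Error && err.message.includes("exhausted")) {
      // Reject quickly with a retryable status rather than queueing indefinitely.
      res.status(503).set("Retry-After", "5").json({ error: "no sessions available" });
      return;
    }
    res.status(500).json({ error: "internal error" });
  }
});
```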
Lessons Learned
- High-concurrency scenarios expose edge cases not apparent under normal load
- Proactive monitoring of internal system states (session pools) is critical for early detection
- Automated testing should include concurrent access patterns and resource exhaustion scenarios
Post-Incident Review
A detailed post-incident review meeting will be scheduled within 48 hours to discuss:
- Technical deep-dive into the root cause
- Evaluation of response time and communication
- Assessment of proposed preventive measures
- Assignment of follow-up action items
Created
Sep 16 at 01:00pm CST
Several services are being affected, but we're working on some upgrades