Elevated error rates due to backpressure
Resolved
Sep 16 at 02:48pm CST
Overview
The Steel Sessions API entered a degraded state with an elevated error rate due to intermittent failures in the session acquisition and reservation logic. The switching logic responsible for managing session lifecycles behaved unpredictably, leaving users unable to establish or maintain browser sessions. The issue was surfaced by a large spike in request volume: the resulting backpressure exposed this flaw in the switching logic.
Timeline
- 6:03 PM UTC - Initial surge in request volume to the Sessions API
- 6:53 PM UTC - Engineering team identified intermittent failures in session management, coinciding with the first user reports
- 7:48 PM UTC - Root cause traced to switching logic in session acquisition/reservation system
- 8:12 PM UTC - Issue isolated and fix implemented
- 8:48 PM UTC - Service fully restored and monitoring confirmed stability
Root Cause
The session acquisition and reservation system contained flawed switching logic that intermittently failed to properly:
- Allocate available browser sessions to incoming requests
- Release reserved sessions back to the available pool
- Handle concurrent session requests during high load periods
This resulted in a cascading failure where the session pool became exhausted due to sessions being incorrectly marked as reserved but not properly allocated or released.
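To make the failure mode concrete, here is a minimal, hypothetical sketch (the types and method names below are illustrative, not Steel's actual internals) of how a non-atomic reserve-then-allocate sequence leaks sessions when the second step fails under load:

```typescript
// Hypothetical, simplified model of the flawed switching logic; not the real code.
type SessionState = "available" | "reserved" | "allocated";

interface Session {
  id: string;
  state: SessionState;
}

class SessionPool {
  constructor(protected sessions: Session[]) {}

  // Flawed acquisition: reserve and allocate are separate, non-atomic steps.
  async acquire(): Promise<Session> {
    const session = this.sessions.find((s) => s.state === "available");
    if (!session) throw new Error("session pool exhausted");

    session.state = "reserved";          // step 1: mark as reserved
    await this.attachBrowser(session);   // step 2: can fail or time out under load
    session.state = "allocated";         // step 3: never reached if step 2 throws

    return session;
  }

  // When attachBrowser fails, nothing returns the session to "available",
  // so it is neither allocated to the caller nor released back to the pool.
  // Under sustained backpressure, the pool drains until it is exhausted.
  protected async attachBrowser(_session: Session): Promise<void> {
    /* connect the underlying browser; omitted in this sketch */
  }
}
```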
Impact
- User Impact: Progressively increasing failures when creating or maintaining browser sessions for automation tasks
- Duration: 2h45m (6:03 PM to 8:48 PM UTC)
- Affected Users: Users attempting to create new browser sessions during the incident window
- Business Impact: Service interruption affecting customer workflows and API integrations
Resolution
The engineering team identified and corrected the faulty switching logic in the session management system. The fix involved:
- Refactoring the session state transitions to ensure atomic operations
- Implementing proper error handling for edge cases in session allocation
- Adding additional validation checks for session pool consistency
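As a rough illustration of the shape of such a fix (reusing the hypothetical Session types from the sketch above; this is not the deployed code), the reserve step becomes a single synchronous transition and every failure path releases the session back to the pool:

```typescript
// Illustrative sketch of the corrected state transitions; names are hypothetical.
class FixedSessionPool {
  constructor(private sessions: Session[]) {}

  async acquire(): Promise<Session> {
    // Reserve synchronously so there is no interleaving window between
    // finding an available session and marking it reserved.
    const session = this.reserve();
    try {
      await this.attachBrowser(session);
      session.state = "allocated";
      return session;
    } catch (err) {
      // Edge-case handling: a failed allocation always returns the session.
      this.release(session);
      throw err;
    }
  }

  private reserve(): Session {
    const session = this.sessions.find((s) => s.state === "available");
    if (!session) throw new Error("session pool exhausted");
    session.state = "reserved";
    return session;
  }

  release(session: Session): void {
    session.state = "available";
  }

  // Pool-consistency check: "reserved" should only ever be a transient state.
  countByState(state: SessionState): number {
    return this.sessions.filter((s) => s.state === state).length;
  }

  private async attachBrowser(_session: Session): Promise<void> {
    /* connect the underlying browser; omitted in this sketch */
  }
}
```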
Current Status
✅ Issue Resolved: The switching logic has been fixed and deployed
✅ Service Restored: All session acquisition functionality is operating normally
✅ Monitoring Active: Enhanced monitoring is in place to detect similar issues
Next Steps and Preventive Measures
Immediate Actions (Next 7 Days)
- Enhanced Monitoring Implementation
  - Deploy additional alerting for session pool health metrics
- Load Testing
  - Conduct comprehensive load testing of session management under various scenarios
  - Validate fix effectiveness under simulated high-concurrency conditions (see the sketch after this list)
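As one example of what such a test can assert (a sketch built on the hypothetical pool above, not Steel's actual test suite), firing many concurrent acquisitions and then checking that no sessions are stranded in the transient reserved state would catch this class of bug:

```typescript
// Illustrative concurrency test against the hypothetical pool sketched above.
async function loadTestAcquire(pool: FixedSessionPool, concurrency: number): Promise<void> {
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () => pool.acquire()),
  );

  const acquired = results.filter((r) => r.status === "fulfilled").length;
  const failed = results.filter((r) => r.status === "rejected").length;

  // Invariant: every request either holds an allocated session or failed cleanly;
  // nothing may remain stuck in "reserved" once all requests have settled.
  const stuck = pool.countByState("reserved");
  if (stuck > 0) {
    throw new Error(`${stuck} sessions leaked into the reserved state`);
  }
  console.log(`acquired=${acquired} failed=${failed} reservedLeaks=${stuck}`);
}
```

The same reserved-session count is the kind of pool health metric the additional alerting can watch for early detection.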
Medium-term Actions (Next 30 Days)
- Code Review and Testing Enhancement
  - Comprehensive audit of the session management codebase
  - Implement additional unit and integration tests for session lifecycle edge cases
  - Establish chaos engineering practices for session management resilience
- Infrastructure Improvements
  - Evaluate session pool sizing and auto-scaling mechanisms, validating them with the load-testing approach from the immediate actions above
  - Design graceful degradation strategies for session pool exhaustion (see the sketch after this list)
- Documentation and Runbooks
  - Create detailed runbooks for session management incidents
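One possible shape for that degradation path (a hypothetical Express-style sketch reusing the pool types above; the route and behavior are illustrative, not a committed design): when the pool is exhausted, fail fast with a retryable response instead of holding requests open.

```typescript
// Hypothetical handler sketch for pool-exhaustion degradation; not Steel's API code.
import express from "express";

const pool = new FixedSessionPool(
  Array.from({ length: 10 }, (_, i): Session => ({ id: `s-${i}`, state: "available" })),
);

const app = express();

app.post("/v1/sessions", async (_req, res) => {
  try {
    const session = await pool.acquire();
    res.status(201).json({ id: session.id });
  } catch (err) {
    if (err instanceof Error && err.message.includes("exhausted")) {
      // Reject quickly with a retryable status rather than queueing indefinitely.
      res.status(503).set("Retry-After", "5").json({ error: "no sessions available" });
      return;
    }
    res.status(500).json({ error: "internal error" });
  }
});
```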
Lessons Learned
- High-concurrency scenarios expose edge cases not apparent under normal load
- Proactive monitoring of internal system states (session pools) is critical for early detection
- Automated testing should include concurrent access patterns and resource exhaustion scenarios
Post-Incident Review
A detailed post-incident review meeting will be scheduled within 48 hours to discuss:
- Technical deep-dive into the root cause
- Evaluation of response time and communication
- Assessment of proposed preventive measures
- Assignment of follow-up action items
Created
Sep 16 at 01:00pm CST
Several services are being affected, but we're working on some upgrades