Back to overview
Degraded

Elevated error rates due to request backpressure

Sep 16 at 01:00pm CST
Affected services
Main API

Resolved
Sep 16 at 02:48pm CST

Overview

The Steel Sessions API entered a degraded state with an elevated error rate due to intermittent failures in the session acquisition and reservation logic. The switching logic responsible for managing session lifecycles behaved unpredictably, leaving users unable to establish or maintain browser sessions. The issue surfaced when a large volume of requests created backpressure that exposed this specific flaw.

Timeline

  • 6:03 PM UTC - Initial increase in number of requests to the Sessions API
  • 6:53 PM UTC - Engineering team identified intermittent failures in session management, corroborated by early user reports
  • 7:48 PM UTC - Root cause traced to switching logic in session acquisition/reservation system
  • 8:12 PM UTC - Issue isolated and fix implemented
  • 8:48 PM UTC - Service fully restored and monitoring confirmed stability

Root Cause

The session acquisition and reservation system contained flawed switching logic that intermittently failed to properly:

  • Allocate available browser sessions to incoming requests
  • Release reserved sessions back to the available pool
  • Handle concurrent session requests during high load periods

This resulted in a cascading failure where the session pool became exhausted due to sessions being incorrectly marked as reserved but not properly allocated or released.
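
For illustration only (Steel's internal implementation is not shown here, so the pool structure and function names below are hypothetical), the failure mode described above is consistent with an acquisition path that marks a session "reserved" but has no guard against a failure before it becomes "active":

```typescript
// Hypothetical, simplified model of the failure mode -- not Steel's actual code.
type SessionState = "available" | "reserved" | "active";

interface PooledSession {
  id: string;
  state: SessionState;
}

const pool: PooledSession[] = [
  { id: "s-1", state: "available" },
  { id: "s-2", state: "available" },
];

async function launchBrowser(id: string): Promise<void> {
  // Stand-in for the real browser launch / readiness check for session `id`,
  // which can fail or time out under heavy load.
  await new Promise((resolve) => setTimeout(resolve, 10));
}

// Buggy pattern: the session is marked "reserved", but if the launch throws,
// nothing ever releases it. Under sustained backpressure, failed launches
// accumulate and the pool of available sessions slowly drains to zero.
async function acquireSessionBuggy(): Promise<PooledSession | undefined> {
  const candidate = pool.find((s) => s.state === "available");
  if (!candidate) return undefined;

  candidate.state = "reserved";
  await launchBrowser(candidate.id); // a failure here strands the session in "reserved"
  candidate.state = "active";
  return candidate;
}
```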

Impact

  • User Impact: Progressively increasing unavailability of browser automation tasks
  • Duration: 2h46m
  • Affected Users: Many users attempting to create new browser sessions during the incident window
  • Business Impact: Service interruption affecting customer workflows and API integrations

Resolution

The engineering team identified and corrected the faulty switching logic in the session management system. The fix involved the following (a simplified sketch follows this list):

  • Refactoring the session state transitions to ensure atomic operations
  • Implementing proper error handling for edge cases in session allocation
  • Adding additional validation checks for session pool consistency
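
Continuing the simplified, hypothetical model from the sketch above, the corrected pattern guards the reservation so a failed launch always returns the session to the pool, keeps release idempotent, and exposes a consistency check over the pool:

```typescript
// Corrected pattern for the same simplified model: state transitions are guarded
// so a failed launch always puts the session back, and release is idempotent.
async function acquireSessionFixed(): Promise<PooledSession | undefined> {
  const candidate = pool.find((s) => s.state === "available");
  if (!candidate) return undefined;

  candidate.state = "reserved";
  try {
    await launchBrowser(candidate.id);
    candidate.state = "active";
    return candidate;
  } catch (err) {
    candidate.state = "available"; // edge-case handling: never strand a reserved session
    throw err;
  }
}

function releaseSession(session: PooledSession): void {
  // Idempotent release: calling this twice cannot corrupt the pool's bookkeeping.
  session.state = "available";
}

// Validation check for pool consistency, suitable for periodic monitoring/alerting.
function poolHealth(): { available: number; reserved: number; active: number } {
  const counts = { available: 0, reserved: 0, active: 0 };
  for (const s of pool) counts[s.state] += 1;
  return counts;
}
```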

Current Status

Issue Resolved: The switching logic has been fixed and deployed

Service Restored: All session acquisition functionality is operating normally

Monitoring Active: Enhanced monitoring is in place to detect similar issues

Next Steps and Preventive Measures

Immediate Actions (Next 7 Days)

  1. Enhanced Monitoring Implementation
    • Deploy additional alerting for session pool health metrics
  2. Load Testing
    • Conduct comprehensive load testing of session management under various scenarios
    • Validate fix effectiveness under simulated high-concurrency conditions (see the sketch below)
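
As a starting point for that load testing, a minimal concurrency smoke test could look like the sketch below. The endpoint URL, auth header, and request body are placeholders taken from environment variables, not Steel's documented API:

```typescript
// Minimal concurrency smoke test: fire N session-create requests at once and
// report the error rate. Requires Node 18+ for the global fetch API.
const BASE_URL = process.env.SESSIONS_API_URL ?? "https://api.example.com/v1/sessions";
const API_KEY = process.env.SESSIONS_API_KEY ?? "";
const CONCURRENCY = 200;

async function createSession(): Promise<boolean> {
  try {
    const res = await fetch(BASE_URL, {
      method: "POST",
      headers: { "content-type": "application/json", authorization: `Bearer ${API_KEY}` },
      body: JSON.stringify({}),
    });
    return res.ok;
  } catch {
    return false;
  }
}

async function main(): Promise<void> {
  const results = await Promise.all(
    Array.from({ length: CONCURRENCY }, () => createSession())
  );
  const failures = results.filter((ok) => !ok).length;
  console.log(
    `${failures}/${CONCURRENCY} requests failed (${((failures / CONCURRENCY) * 100).toFixed(1)}%)`
  );
}

main();
```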

Medium-term Actions (Next 30 Days)

  1. Code Review and Testing Enhancement
    • Comprehensive audit of session management codebase
    • Implement additional unit and integration tests for session lifecycle edge cases
    • Establish chaos engineering practices for session management resilience
  2. Infrastructure Improvements
    • Evaluate session pool sizing and auto-scaling mechanisms, validating them with the load-testing approach above
    • Design graceful degradation strategies for session pool exhaustion (one possible shape is sketched after this list)
  3. Documentation and Runbooks
    • Create detailed runbooks for session management incidents
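
One possible shape for the graceful degradation strategy above, sketched with Express purely as an assumption (this report doesn't describe the actual service stack); the route paths and pool counters are illustrative:

```typescript
import express from "express";

// Stubbed pool state for the sketch; the real values would come from the session manager.
let availableSessions = 50;
const totalSessions = 50;

const app = express();

// Graceful degradation: when the pool is exhausted, fail fast with a clear signal
// and a Retry-After hint instead of letting requests stall inside the switching logic.
app.post("/v1/sessions", (_req, res) => {
  if (availableSessions === 0) {
    res.set("Retry-After", "15");
    return res.status(503).json({ error: "session_pool_exhausted" });
  }
  availableSessions -= 1; // placeholder for the real acquisition path
  return res.status(201).json({ status: "created" });
});

// Pool-health endpoint that the enhanced alerting can scrape.
app.get("/internal/pool-health", (_req, res) => {
  res.json({ available: availableSessions, total: totalSessions });
});

app.listen(3000);
```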

Lessons Learned

  • High-concurrency scenarios expose edge cases not apparent under normal load
  • Proactive monitoring of internal system states (session pools) is critical for early detection
  • Automated testing should include concurrent access patterns and resource exhaustion scenarios

Post-Incident Review

A detailed post-incident review meeting will be scheduled within 48 hours to discuss:

  • Technical deep-dive into the root cause
  • Evaluation of response time and communication
  • Assessment of proposed preventive measures
  • Assignment of follow-up action items

Created
Sep 16 at 01:00pm CST

Several services are affected; we're working on a fix.