Cybereason Inc.

Company

Key Value
Company Cybereason Inc.
Website https://www.cybereason.com/
Description Cybereason is a global cybersecurity company operating in 40 countries. We provide EDR, XDR, and MDR solutions to stop cyberattacks. The core system runs on 12,000 servers and handles 80M events per second.
Location San Diego, US

Team (Software Engineer - Backend)

Key Value
Title Software Engineer (Backend)
Mission My mission was to develop new features and refactor legacy microservices for the core system.
Term 2025-03 - 2025-08
Type Permanent

Team (Site Reliability Engineer)

Key Value
Title Site Reliability Engineer
Mission My mission was to work with a global team to improve system reliability for the core system.
Term 2024-10 - 2025-03
Type Permanent

Projects

Key Value
Summary CR1. I designed and developed custom metrics to improve observability across 3,000 servers, which cut problem detection time for enterprise customers.
Situation
  • An API running on 3,000 servers had many connected components, which made troubleshooting difficult.
  • Our monitoring did not expose enough information, which slowed issue detection.
Task My mission was to improve system observability, make debugging easier, and reduce problem detection time.
Action
  • I implemented custom metrics:
    • API server latency and error rate
    • Elasticsearch throughput
    • MongoDB query performance
  • I created real-time dashboards and monitors
Result As a result, the improved visibility and dashboards helped engineers detect problems faster, which cut problem detection time for enterprise customers.
Challenge A key challenge was making metrics that gave engineers useful insights, not just data points.
Solution To solve this, I led the communication between the SRE, DevOps, and Product teams. We worked together to define the most important metrics based on critical user journeys (CUJs).
Learning I learned that effective monitoring is not about collecting data but about providing clear, actionable insights that help engineers solve problems faster.
Skill Teamwork
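A minimal sketch of the kind of latency and error rate metric described in CR1. It is an illustration only, not the production code: the class name `RequestMetrics`, the rule that 5xx responses count as errors, and the nearest-rank percentile are all assumptions.

```python
from dataclasses import dataclass, field
from math import ceil


@dataclass
class RequestMetrics:
    """Collects per-request latency and error counts for one API endpoint (hypothetical)."""

    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    def observe(self, latency_ms: float, status_code: int) -> None:
        # Record one finished request; assumption: 5xx responses count as errors.
        self.latencies_ms.append(latency_ms)
        if status_code >= 500:
            self.errors += 1

    def error_rate(self) -> float:
        # Fraction of observed requests that failed (0.0 when nothing was observed).
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile of the observed latencies, for p in (0, 100].
        if not self.latencies_ms:
            raise ValueError("no requests observed")
        ordered = sorted(self.latencies_ms)
        rank = max(1, ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


metrics = RequestMetrics()
for latency, status in [(12.0, 200), (18.0, 200), (250.0, 503), (15.0, 200)]:
    metrics.observe(latency, status)

print(f"error rate: {metrics.error_rate():.2%}")
print(f"p95 latency: {metrics.percentile(95)} ms")
```

A tail percentile is shown alongside the error rate because an average latency can hide the slow requests that users actually notice.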
Key Value
Summary CR2. I troubleshot and fixed a critical server issue with 2,000 concurrent threads, which resolved crashes affecting an enterprise customer.
Situation
  • An API server running 2,000 concurrent threads kept crashing.
  • Standard monitoring tools did not provide useful answers.
Task My mission was to find what caused the crashes and fix it urgently.
Action
  • I analyzed server performance data:
    • I collected thread dumps and heap dumps from the affected server
    • I found that an old API was blocking HTTP threads
    • I identified that it caused health check failures and led to automatic restarts
  • I implemented the solution:
    • I fixed the thread blocking issue completely
Result As a result, the fix restored reliability for the enterprise customer, which has 100,000 employees, and earned their trust.
Challenge A key challenge was that the pipeline automatically restarted the server right after each crash, which prevented me from collecting the dumps I needed.
Solution To solve this, I modified the pipeline to collect dumps before restarting the server.
Learning I learned that solving complex issues requires analyzing the system's core behavior, not just its surface-level symptoms.
Skill Business Impact
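A minimal sketch of the "collect dumps before restarting" step from CR2, assuming a JVM process and the standard JDK tools `jstack` and `jmap` on the PATH. The function names, the output file names, and the `restart` callback are hypothetical; the real pipeline change is not shown here.

```python
import subprocess
from pathlib import Path


def dump_commands(pid: int, out_dir: Path) -> list:
    """Build the diagnostic commands to run against a JVM before it is restarted."""
    return [
        # jstack -l prints every thread's stack plus lock ownership details.
        ["jstack", "-l", str(pid)],
        # jmap writes a binary heap dump (.hprof) of the running process.
        ["jmap", f"-dump:format=b,file={out_dir / 'heap.hprof'}", str(pid)],
    ]


def collect_then_restart(pid, out_dir, restart, run=subprocess.run):
    """Collect dumps first, then restart; a failed dump must not block recovery."""
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        for cmd in dump_commands(pid, out_dir):
            result = run(cmd, capture_output=True, text=True, timeout=120)
            if cmd[0] == "jstack":
                # jstack writes to stdout, so the thread dump is saved explicitly.
                (out_dir / "threads.txt").write_text(result.stdout)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"dump collection failed, restarting anyway: {exc}")
    finally:
        # The restart always runs, whether or not the dumps succeeded.
        restart()
```

The restart sits in a `finally` block so that a slow or failing dump delays recovery by at most the timeout and never prevents it.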
Key Value
Summary CR3. I fixed a critical bug in the core API that required thorough testing across various versions and feature flags, which restored the correct incident status for millions of endpoints.
Situation
  • The core API that returns incident information had a bug that caused it to return an incident status different from the intended one.
  • The bug only occurred with specific feature flags and specific versions, making it extremely difficult to reproduce.
  • This API affected millions of endpoints, so any fix required thorough testing.
Task My mission was to identify the root cause of the bug and fix it without impacting millions of endpoints.
Action
  • I investigated the issue:
    • I set up multiple test environments with different configurations
    • I tested various scenarios with different versions and feature flag combinations
  • I identified the exact conditions to reproduce the bug:
    • I discovered that a critical variable used in incident status calculation became null under specific conditions
  • I fixed the code and ensured quality:
    • I implemented the fix with minimal code changes
    • I conducted thorough testing and QA before shipping
Result As a result, the bug was fixed and deployed to production without incidents, restoring correct incident status for millions of endpoints.
Challenge A key challenge was that the investigation took a long time because the bug only appeared under very specific conditions.
Solution To solve this, I used scripts to test the combinations systematically, which led me to the exact reproduction conditions.
Learning I learned that reproducing complex bugs is both challenging and crucial: systematic testing and detailed documentation are essential for solving edge-case issues.
Skill Technical Challenge
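A minimal sketch of the scripted combination testing described in CR3. Everything here is illustrative: the version strings, the flag names, and the toy `incident_status` function (which simulates a variable becoming null in one narrow case) are hypothetical stand-ins, not the real API.

```python
from itertools import product

VERSIONS = ["1.0", "1.1", "2.0"]  # hypothetical version strings
FLAGS = ["flag_a", "flag_b"]      # hypothetical feature flag names


def incident_status(version, flags):
    """Toy stand-in for the API under test; returns None in one narrow case."""
    severity = "high"
    if version == "1.1" and flags["flag_a"] and not flags["flag_b"]:
        severity = None  # simulated null variable in the status calculation
    return None if severity is None else "open"


def find_failing_combinations(versions, flag_names, check):
    """Try every version x flag on/off combination and return those that fail."""
    failures = []
    flag_values = product([False, True], repeat=len(flag_names))
    for version, values in product(versions, flag_values):
        flags = dict(zip(flag_names, values))
        if not check(version, flags):
            failures.append((version, flags))
    return failures


failing = find_failing_combinations(
    VERSIONS, FLAGS, lambda v, f: incident_status(v, f) == "open"
)
for version, flags in failing:
    print(version, flags)
```

An exhaustive sweep grows exponentially with the number of flags, so it suits a small flag set; with many flags, a sampled or pairwise subset would be a more practical starting point.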

Technology

Value Tag
Apache Kafka Backend
Elasticsearch Backend
GraphQL Backend
gRPC Backend
Java Backend
MongoDB Backend
PostgreSQL Backend
Python Backend
Redis Backend
Spring Boot Backend
AWS Infrastructure
Google Cloud Infrastructure
Jenkins Infrastructure
Kubernetes Infrastructure
Oracle Cloud Infrastructure
Terraform Infrastructure
Elastic Stack Observability
Grafana Observability
Jaeger Observability
Prometheus Observability