Cybereason Inc.

Company

Key Value
Company Cybereason Inc.
Employees 1,000
Founded 2012
Web Site https://www.cybereason.com/
Description Cybereason is a global cybersecurity company operating in 40 countries. It provides EDR, XDR, and MDR solutions to stop cyberattacks. The core system runs on 12,000 servers and handles 80M events per second.
Location San Diego, US

Team (Software Engineer - Backend)

Key Value
Title Software Engineer (Backend)
Mission My mission was to develop new features and refactor legacy microservices for the core system.
Term 2025-03 - 2025-08
Team Size 9
Type Permanent

Team (Site Reliability Engineer)

Key Value
Title Site Reliability Engineer
Mission My mission was to work with a global team to improve system reliability for the core system.
Term 2024-10 - 2025-03
Team Size 5
Type Permanent

Projects

Key Value
Summary CR1. I designed and developed custom metrics to improve observability across 3,000 servers, which cut problem resolution time by 50% for enterprise customers.
Situation
  • An API running on 3,000 servers had many connected components, which made troubleshooting difficult.
  • Our monitoring did not provide enough information, so resolving issues took a long time.
Task My mission was to improve system observability, make debugging easier, and reduce problem resolution time.
Action
  • I implemented custom metrics:
    • API server latency and error rate
    • Elasticsearch throughput
    • MongoDB query performance
  • I created real-time dashboards and alerts
Result As a result, better visibility and useful dashboards made troubleshooting faster. We cut problem resolution time by 50% for enterprise customers.
Challenge A key challenge was designing metrics that gave engineers useful insights, not just data points.
Solution To solve this, I led the communication between SRE, DevOps, and Product teams. We worked together to define the most important metrics.
Learning I learned that effective monitoring isn't about collecting data but about providing clear, actionable insights that help engineers solve problems faster.
Skill Observability / Monitoring / Collaboration / Teamwork
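Example (CR1) A minimal sketch of the kind of latency and error-rate metric described above. The `ApiMetrics` class, its method names, and the choice of a nearest-rank 95th percentile are assumptions made for illustration; they are not the production implementation, and exporting to a monitoring backend is not shown.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApiMetrics:
    """Hypothetical in-memory recorder for request latency and error rate."""

    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, status_code: int) -> None:
        # Every request contributes a latency sample; 5xx responses count as errors.
        self.latencies_ms.append(latency_ms)
        if status_code >= 500:
            self.errors += 1

    def error_rate(self) -> float:
        # Fraction of recorded requests that ended in a server error.
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def p95_latency_ms(self) -> float:
        # Nearest-rank percentile: the smallest sample that is >= 95% of samples.
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]
```

A percentile is used rather than an average because a handful of slow requests can be hidden by a mean, and slow outliers are usually what an alert should surface.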
Key Value
Summary CR2. I troubleshot and fixed a critical server issue with 2,000 concurrent threads, which resolved crashes affecting an enterprise customer.
Situation
  • An API server running 2,000 concurrent threads kept crashing.
  • Standard monitoring tools didn't provide any answers.
Task My mission was to find what caused the crashes and fix it urgently.
Action
  • I analyzed server performance data:
    • I collected thread dumps and heap dumps from the crashing server
    • I found that an old API was blocking HTTP threads
    • I identified that it caused health check failures and led to automatic restarts
  • I implemented the solution:
    • I migrated the problematic old API to an efficient v2 API
    • I fixed the thread blocking issue completely
Result As a result, the fix made the system reliable for a customer with 100,000 employees, and we earned their trust.
Challenge A key challenge was that the pipeline automatically restarted the server right after each crash, which prevented me from collecting the dumps I needed.
Solution To solve this, I modified the pipeline to collect dumps before restarting the server.
Learning I learned that solving complex issues requires analyzing the system's core behavior, not just its surface-level symptoms.
Skill Incident Response / Troubleshooting / Automation / Difficult Problem
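Example (CR2) A minimal sketch of the "collect dumps before restarting" idea from the Solution above. It assumes the server is a JVM process and that the standard JDK tools `jstack` and `jmap` are available. The function names, the output path layout, and the callback-based design are hypothetical and do not reflect the actual pipeline.

```python
from typing import Callable, List


def dump_commands(pid: int, out_dir: str) -> List[List[str]]:
    """Build JDK commands for a thread dump and a heap dump of process `pid`."""
    return [
        # Thread dump with lock information, printed to stdout.
        ["jstack", "-l", str(pid)],
        # Binary heap dump of live objects, written to a file (hypothetical path).
        ["jmap", f"-dump:live,format=b,file={out_dir}/heap-{pid}.hprof", str(pid)],
    ]


def restart_with_diagnostics(
    collect: Callable[[], None], restart: Callable[[], None]
) -> bool:
    """Run `collect` before `restart`; restart even if collection fails.

    Returns True when diagnostics were collected successfully.
    """
    collected = True
    try:
        collect()
    except Exception:
        # A failed dump must never block recovery of the service.
        collected = False
    restart()
    return collected
```

In a real pipeline, `collect` could run the commands from `dump_commands` with `subprocess.run` and a timeout, so that a hung dump cannot delay the restart indefinitely.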
Key Value
Summary CR3. I fixed a critical bug in the core API that required thorough testing across various versions and feature flags, which restored the correct Malop status for millions of endpoints.
Situation
  • The core API that returns Malicious Operation (Malop) information had a bug that caused it to return a status different from the intended one.
  • The bug only occurred with specific feature flags and specific versions, making it extremely difficult to reproduce.
  • This API affected millions of endpoints, so any fix required thorough testing.
Task My mission was to identify the root cause of the bug and fix it without impacting millions of endpoints.
Action
  • I investigated the issue systematically:
    • I analyzed the code and documentation to understand the expected behavior
    • I set up multiple test environments with different configurations
    • I tested various scenarios with different versions and feature flag combinations
  • I identified the exact conditions to reproduce the bug:
    • I discovered that a critical variable used in Malop status calculation became null under specific conditions
  • I fixed the code and ensured quality:
    • I implemented the fix with minimal code changes
    • I conducted thorough testing and QA before shipping
Result As a result, the bug was fixed successfully and deployed to production without any incidents, restoring correct Malop status for millions of endpoints.
Challenge A key challenge was that the investigation took a long time because the bug only appeared under very specific conditions.
Solution To solve this, I systematically tested different combinations and documented each test result, which eventually led me to identify the exact reproduction conditions.
Learning I learned that reproducing complex bugs is both challenging and crucial: systematic testing and detailed documentation are essential for solving edge-case issues.
Skill Difficult Problem / Troubleshooting / Automation
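Example (CR3) A minimal sketch of testing every combination of version and feature flags to find where a status calculation yields null. The version strings, flag names, and the stand-in `example_status` function are all hypothetical; the real Malop status logic and the actual reproduction conditions are not shown here.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Config = Tuple[str, Dict[str, bool]]


def find_failing_configs(
    versions: Sequence[str],
    flags: Sequence[str],
    compute_status: Callable[[str, Dict[str, bool]], Optional[str]],
) -> List[Config]:
    """Return every (version, flag settings) pair for which the status is None."""
    failing: List[Config] = []
    for version in versions:
        # Enumerate all on/off combinations of the feature flags.
        for values in product([False, True], repeat=len(flags)):
            enabled = dict(zip(flags, values))
            if compute_status(version, enabled) is None:
                failing.append((version, enabled))
    return failing


def example_status(version: str, enabled: Dict[str, bool]) -> Optional[str]:
    # Stand-in for the buggy calculation: the status is lost only for
    # version "2.1" with the hypothetical "new_scoring" flag enabled.
    if version == "2.1" and enabled["new_scoring"]:
        return None
    return "active"
```

Recording the full list of failing combinations, rather than stopping at the first one, makes it easier to see which flag or version the failures have in common.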

Technology

Value Tag
Aerospike Backend
Apache Kafka Backend
Apache ZooKeeper Backend
Consul Backend
Elasticsearch Backend
GraphQL Backend
gRPC Backend
Java Backend
MongoDB Backend
PostgreSQL Backend
Python Backend
Redis Backend
Spring Boot Backend
AWS Infrastructure
Google Cloud Infrastructure
Jenkins Infrastructure
Kubernetes Infrastructure
Oracle Cloud Infrastructure
Terraform Infrastructure
Elastic Stack Monitoring
Grafana Monitoring
Jaeger Monitoring
Prometheus Monitoring