Cybereason Inc.
Company
| Key | Value |
|---|---|
| Company | Cybereason Inc. |
| Website | https://www.cybereason.com/ |
| Description | Cybereason is a global cybersecurity company operating in 40 countries. It provides EDR, XDR, and MDR solutions to stop cyberattacks. Its core system runs on 12,000 servers and handles 80M events per second. |
| Location | San Diego, US |
Team (Software Engineer - Backend)
| Key | Value |
|---|---|
| Title | Software Engineer (Backend) |
| Mission | My mission was to develop new features and refactor legacy microservices for the core system. |
| Term | 2025-03 - 2025-08 |
| Type | Permanent |
Team (Site Reliability Engineer)
| Key | Value |
|---|---|
| Title | Site Reliability Engineer |
| Mission | My mission was to work with a global team to improve system reliability for the core system. |
| Term | 2024-10 - 2025-03 |
| Type | Permanent |
Projects
| Key | Value |
|---|---|
| Summary | CR1. I designed and developed custom metrics to improve observability across 3,000 servers, which cut problem detection time for enterprise customers. |
| Situation | |
| Task | My mission was to improve system observability, make debugging easier, and reduce problem detection time. |
| Action | |
| Result | As a result, better visibility and useful dashboards helped engineers detect problems faster, which cut problem detection time for enterprise customers. |
| Challenge | A key challenge was designing metrics that gave engineers actionable insights rather than just data points. |
| Solution | To solve this, I led communication among the SRE, DevOps, and Product teams, and together we defined the most important metrics based on critical user journeys (CUJs). |
| Learning | I learned that effective monitoring isn't about collecting data but about providing clear, actionable insights that help engineers solve problems faster. |
| Skill | Teamwork |
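
A minimal sketch of how a CUJ can be turned into a metric, assuming each event is tagged with the CUJ it belongs to and whether it succeeded. The `Event` type, `cuj_success_ratio` function, and CUJ names are hypothetical illustrations, not the production metrics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    cuj: str       # hypothetical CUJ name, e.g. "view-incident"
    success: bool  # whether this step of the journey succeeded


def cuj_success_ratio(events: list[Event]) -> dict[str, float]:
    """Return, per CUJ, the fraction of events that succeeded."""
    totals: defaultdict[str, int] = defaultdict(int)
    good: defaultdict[str, int] = defaultdict(int)
    for event in events:
        totals[event.cuj] += 1
        if event.success:
            good[event.cuj] += 1
    # Only CUJs that actually produced events appear in the result.
    return {cuj: good[cuj] / totals[cuj] for cuj in totals}
```

Grouping by journey rather than by server is what makes the number actionable: a drop in one CUJ's ratio points at a user-facing flow instead of a raw data point.
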
| Key | Value |
|---|---|
| Summary | CR2. I troubleshot and fixed a critical issue on a server running 2,000 concurrent threads, which resolved crashes affecting an enterprise customer. |
| Situation | |
| Task | My mission was to identify the root cause of the crashes and fix it urgently. |
| Action | |
| Result | As a result, the fix made the system reliable for the customer, an enterprise with 100,000 employees, and earned their trust. |
| Challenge | A key challenge was that the pipeline automatically restarted the server right after each crash, which prevented me from collecting the dumps I needed. |
| Solution | To solve this, I modified the pipeline to collect dumps before restarting the server. |
| Learning | I learned that solving complex issues requires analyzing the system's core behavior, not just its surface-level symptoms. |
| Skill | Business Impact |
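
A minimal sketch of the "collect dumps before restarting" ordering, assuming the pipeline exposes a dump step and a restart step that can be called in sequence. `handle_crash`, `collect_dumps`, and `restart` are hypothetical names; the real pipeline's tooling is not described here.

```python
from pathlib import Path
from typing import Callable, Optional


def handle_crash(
    server_id: str,
    collect_dumps: Callable[[str], Path],  # hypothetical hook, e.g. a thread-dump collector
    restart: Callable[[str], None],        # hypothetical hook that restarts the server
    log: list[str],
) -> Optional[Path]:
    """On a crash, collect dumps first and only then restart the server.

    A failed dump collection is logged but never blocks the restart.
    """
    dump_path: Optional[Path] = None
    try:
        dump_path = collect_dumps(server_id)
        log.append(f"collected dumps for {server_id}: {dump_path}")
    except Exception as exc:
        log.append(f"dump collection failed for {server_id}: {exc}")
    finally:
        # Restart runs whether or not the dump step succeeded.
        restart(server_id)
        log.append(f"restarted {server_id}")
    return dump_path
```

Putting the restart in `finally` keeps the pipeline's original recovery behavior intact even when dump collection fails, so diagnostics are added without trading away availability.
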
| Key | Value |
|---|---|
| Summary | CR3. I fixed a critical bug in the core API that required thorough testing across various versions and feature flags, which restored the correct incident status for millions of endpoints. |
| Situation | |
| Task | My mission was to identify the root cause of the bug and fix it without impacting millions of endpoints. |
| Action | |
| Result | As a result, the bug was fixed and deployed to production without incidents, restoring correct incident status for millions of endpoints. |
| Challenge | A key challenge was that the investigation took a long time because the bug only appeared under very specific conditions. |
| Solution | To solve this, I systematically tested different combinations using scripts, which eventually led me to identify the exact reproduction conditions. |
| Learning | I learned that reproducing complex bugs is both challenging and crucial: systematic testing and detailed documentation are essential for solving edge-case issues. |
| Skill | Technical Challenge |
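
A minimal sketch of systematically testing version and feature-flag combinations, assuming each combination can be checked by a scripted `reproduces(version, flags)` call. The function names, versions, and flag names are hypothetical; the actual scripts are not described here.

```python
from itertools import product
from typing import Callable, Iterable


def find_repro_conditions(
    versions: Iterable[str],
    flag_names: list[str],
    reproduces: Callable[[str, dict[str, bool]], bool],  # hypothetical test hook
) -> list[tuple[str, dict[str, bool]]]:
    """Run every (version, feature-flag assignment) pair and keep those that trigger the bug."""
    hits: list[tuple[str, dict[str, bool]]] = []
    for version in versions:
        # Enumerate all 2 ** len(flag_names) on/off assignments of the flags.
        for values in product([False, True], repeat=len(flag_names)):
            flags = dict(zip(flag_names, values))
            if reproduces(version, flags):
                hits.append((version, flags))
    return hits
```

Exhaustive enumeration grows as (number of versions) × 2^(number of flags), so it only fits a small set of flags; with many flags, pairwise testing would be a cheaper alternative.
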
Technology
| Value | Tag |
|---|---|
| Apache Kafka | Backend |
| Elasticsearch | Backend |
| GraphQL | Backend |
| gRPC | Backend |
| Java | Backend |
| MongoDB | Backend |
| PostgreSQL | Backend |
| Python | Backend |
| Redis | Backend |
| Spring Boot | Backend |
| AWS | Infrastructure |
| Google Cloud | Infrastructure |
| Jenkins | Infrastructure |
| Kubernetes | Infrastructure |
| Oracle Cloud | Infrastructure |
| Terraform | Infrastructure |
| Elastic Stack | Observability |
| Grafana | Observability |
| Jaeger | Observability |
| Prometheus | Observability |