Cybereason Inc.
Company
Key | Value |
---|---|
Company | Cybereason Inc. |
Employees | 1,000+ |
Founded | 2012 |
Website | https://www.cybereason.com/ |
Description | A global cybersecurity company operating in 40+ countries, providing EDR, XDR, and MDR solutions to combat cyberattacks |
Location | San Diego, US - Japan Office |
Team (Software Engineer)
Key | Value |
---|---|
Title | Software Engineer |
Mission | My mission is to develop new features and refactor legacy microservices for a distributed system that operates across 12,000+ servers and processes 80M+ events per second. |
Task | |
Term | March 1, 2025 - current |
Team Size | 9 |
Type | Permanent |
Team (Site Reliability Engineer)
Key | Value |
---|---|
Title | Site Reliability Engineer |
Mission | As a member of a cross-functional global team, my mission was to enhance the reliability of a distributed system (operating across 12,000+ servers and processing 80M+ events per second) by developing custom metrics and distributed tracing. |
Task | |
Term | October 1, 2024 - March 31, 2025 |
Team Size | 5 |
Type | Permanent |
Projects
Key | Value |
---|---|
Summary | CR1. I designed and developed custom metrics, significantly improving observability across 3,000 servers. |
Problem | |
Mission | Enhance Observability and Reduce MTTR: My objective was to improve system observability, streamline the debugging process, and significantly reduce MTTR. |
Action | |
Challenge | Designing Actionable Metrics: The main challenge was to design custom metrics that gave engineers actionable insight rather than raw data points. |
Overcome | Fostered Cross-Team Communication: I initiated and facilitated communication between SRE, DevOps, and Product teams to collaboratively define the most impactful metrics. |
Result | Reduced MTTR by 50% for Enterprise Customers: The enhanced observability and actionable dashboards made troubleshooting more efficient, cutting MTTR for enterprise customers in half. |
Skill | Observability / Monitoring |
Key | Value |
---|---|
Summary | CR2. I analyzed thread dumps and heap dumps to troubleshoot a critical server issue involving 2,000 concurrent threads. |
Problem | |
Mission | Identify Bottleneck and Ensure Stability: My mission was to pinpoint the performance bottleneck causing the crashes and implement a solution to ensure system stability. |
Action | |
Challenge | Automated Restarts Hindered Data Collection: An automated pipeline would restart the UI server immediately upon failure, preventing manual collection of necessary thread and heap dumps for analysis. |
Overcome | Enhanced Data Collection Pipeline: I modified the pipeline to automatically collect thread dumps and heap dumps before initiating a server restart. |
Result | Resolved Critical Availability Issues: The dumps collected before each restart made root-cause analysis possible, and the resulting fix resolved the critical availability issues for an enterprise customer with 100,000 employees. |
Skill | Incident Response / Troubleshooting |
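The collect-before-restart step from CR2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual pipeline change: the use of `jcmd` (`Thread.print` and `GC.heap_dump`) is an assumption, since the tooling is not named above, and `DumpStep`, `plan_dump_steps`, and `collect_then_restart` are hypothetical names.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass(frozen=True)
class DumpStep:
    argv: tuple                   # command to run
    stdout_path: Optional[Path]   # file for stdout when the tool prints the dump


def plan_dump_steps(pid: int, out_dir: Path, stamp: str) -> list:
    """Plan a thread dump and a heap dump for the JVM with the given pid."""
    # The target JVM writes the heap dump itself, so pass an absolute path.
    heap_file = out_dir.resolve() / f"heap-{stamp}.hprof"
    return [
        # Assumption: jcmd is on PATH. Thread.print writes the dump to stdout.
        DumpStep(("jcmd", str(pid), "Thread.print", "-l"),
                 out_dir / f"threads-{stamp}.txt"),
        DumpStep(("jcmd", str(pid), "GC.heap_dump", str(heap_file)), None),
    ]


def collect_then_restart(pid: int, out_dir: Path, stamp: str,
                         restart: Callable[[], None],
                         run=subprocess.run) -> None:
    """Collect diagnostics first, then restart, even if a dump fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for step in plan_dump_steps(pid, out_dir, stamp):
        try:
            result = run(list(step.argv), capture_output=True,
                         text=True, timeout=120)
            if step.stdout_path is not None:
                step.stdout_path.write_text(result.stdout)
        except (OSError, subprocess.SubprocessError):
            # A failed or slow dump must never block the restart.
            pass
    restart()
```

The restart is passed in as a callback so the same collection step can sit in front of whatever restart mechanism the pipeline already uses. The timeout keeps a hung JVM from delaying recovery indefinitely.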
Technology
Technology | Category |
---|---|
Aerospike | Backend |
Apache Kafka | Backend |
Apache ZooKeeper | Backend |
Consul | Backend |
Elasticsearch | Backend |
GraphQL | Backend |
gRPC | Backend |
Java | Backend |
MongoDB | Backend |
PostgreSQL | Backend |
Python | Backend |
Redis | Backend |
Spring Boot | Backend |
AWS | Infrastructure |
Google Cloud | Infrastructure |
Jenkins | Infrastructure |
Kubernetes | Infrastructure |
Oracle Cloud | Infrastructure |
Terraform | Infrastructure |
Elastic Stack | Monitoring |
Grafana | Monitoring |
Jaeger | Monitoring |
Prometheus | Monitoring |