Cybereason Inc.

Company

Key	Value
Company	Cybereason Inc.
Employees	1,000+
Founded	2012
Website	https://www.cybereason.com/
Description	A global cybersecurity company operating in 40+ countries, providing EDR, XDR, and MDR solutions to combat cyberattacks
Location	San Diego, US - Japan Office

Team (Software Engineer)

Key	Value
Title	Software Engineer
Mission	My mission is to develop new features and refactor legacy microservices in a distributed system that operates across 12,000+ servers and processes 80M+ events per second.
Task	I develop new features and refactor legacy microservices.
Term	March 1, 2025 - current
Team Size	9
Type	Permanent

Team (Site Reliability Engineer)

Key	Value
Title	Site Reliability Engineer
Mission	As a member of a cross-functional global team, my mission was to enhance the reliability of a distributed system (operating across 12,000+ servers and processing 80M+ events per second) by developing custom metrics and distributed tracing.
Task	Observability Development: I developed custom metrics and distributed tracing to improve system observability. Incident Response: I troubleshot cross-microservice issues to identify their root causes.
Term	October 1, 2024 - March 31, 2025
Team Size	5
Type	Permanent

Projects

Key	Value
Summary	CR1. I designed and developed custom metrics, significantly improving observability across 3,000 servers.
Problem	Complex Troubleshooting Environment: The API server consisted of multiple interconnected components, making troubleshooting a complex task. High Mean Time to Resolution (MTTR): Existing monitoring capabilities lacked sufficient insights, resulting in unacceptably long MTTR.
Mission	Enhance Observability and Reduce MTTR: My objective was to improve system observability, streamline the debugging process, and significantly reduce MTTR.
Action	Developed Custom Metrics: API Server: I implemented metrics for latency and error rates. Elasticsearch: I developed metrics to monitor throughput. MongoDB: I created metrics for query performance. Improved Dashboards: I enhanced real-time dashboards to provide engineers with actionable insights for quicker issue identification.
Challenge	Designing Actionable Metrics: The main challenge was to design custom metrics that were not just data points but provided truly actionable insights for engineers.
Overcome	Fostered Cross-Team Communication: I initiated and facilitated communication between SRE, DevOps, and Product teams to collaboratively define the most impactful metrics.
Result	Reduced MTTR by 50% for Enterprise Customers: The new metrics and dashboards made troubleshooting faster, cutting MTTR for enterprise customers in half.
Skill	Observability / Monitoring
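
To illustrate the kind of latency and error-rate metric described in this project, here is a minimal sketch in plain Python. It is not the production implementation: the class name `ApiMetrics`, the bucket bounds, and the rule that 5xx responses count as errors are all assumptions made for the example. The cumulative bucket layout loosely follows the convention used by Prometheus histograms.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical bucket upper bounds in seconds; real values would be tuned per endpoint.
DEFAULT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


@dataclass
class ApiMetrics:
    """In-memory latency histogram and error counter for one API endpoint."""

    buckets: tuple = DEFAULT_BUCKETS
    bucket_counts: list = field(default_factory=list)
    total: int = 0
    errors: int = 0
    latency_sum: float = 0.0

    def __post_init__(self):
        # One slot per bucket plus a final overflow slot (+Inf).
        self.bucket_counts = [0] * (len(self.buckets) + 1)

    def observe(self, latency_seconds: float, status_code: int) -> None:
        """Record one request; 5xx responses count as errors (an assumption)."""
        # bisect_left returns the first bucket whose upper bound is >= the latency.
        self.bucket_counts[bisect_left(self.buckets, latency_seconds)] += 1
        self.total += 1
        self.latency_sum += latency_seconds
        if status_code >= 500:
            self.errors += 1

    def error_rate(self) -> float:
        """Fraction of recorded requests that were errors."""
        return self.errors / self.total if self.total else 0.0

    def cumulative_buckets(self) -> dict:
        """Return {upper_bound: number of requests at or below that bound}."""
        result, running = {}, 0
        bounds = self.buckets + (float("inf"),)
        for bound, count in zip(bounds, self.bucket_counts):
            running += count
            result[bound] = running
        return result
```

A dashboard could then plot `error_rate()` next to the cumulative buckets, so an engineer can see at a glance whether slow requests and failures rise together.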

Key	Value
Summary	CR2. I analyzed thread dumps and heap dumps to troubleshoot a critical server issue involving 2,000 concurrent threads.
Problem	Intermittent Server Crashes: A UI server for an enterprise customer (with 100,000 employees) was crashing intermittently. Insufficient Monitoring Data: The root cause was unknown, and conventional monitoring tools did not provide adequate insights for diagnosis.
Mission	Identify Bottleneck and Ensure Stability: My mission was to pinpoint the performance bottleneck causing the crashes and implement a solution to ensure system stability.
Action	Conducted Root Cause Analysis: I collected and analyzed thread dumps and heap dumps from the affected server and found that a legacy API was blocking HTTP threads, which triggered health check failures and automatic server restarts. Implemented Solution: I migrated the legacy API to a more efficient v2 API, which resolved the blocking issue.
Challenge	Automated Restarts Hindered Data Collection: An automated pipeline would restart the UI server immediately upon failure, preventing manual collection of necessary thread and heap dumps for analysis.
Overcome	Enhanced Data Collection Pipeline: I modified the pipeline to automatically collect thread dumps and heap dumps before initiating a server restart.
Result	Resolved Critical Availability Issues: The migration resolved the critical availability issues and improved system reliability for the enterprise customer with 100,000 employees.
Skill	Incident Response / Troubleshooting

Technology

Technology	Category
Aerospike	Backend
Apache Kafka	Backend
Apache ZooKeeper	Backend
Consul	Backend
Elasticsearch	Backend
GraphQL	Backend
gRPC	Backend
Java	Backend
MongoDB	Backend
PostgreSQL	Backend
Python	Backend
Redis	Backend
Spring Boot	Backend
AWS	Infrastructure
Google Cloud	Infrastructure
Jenkins	Infrastructure
Kubernetes	Infrastructure
Oracle Cloud	Infrastructure
Terraform	Infrastructure
Elastic Stack	Monitoring
Grafana	Monitoring
Jaeger	Monitoring
Prometheus	Monitoring