Cybereason Inc.

Company

Key Value
Company Cybereason Inc.
Employees 1,000+
Founded 2012
Website https://www.cybereason.com/
Description A global cybersecurity company operating in 40+ countries, providing EDR, XDR, and MDR solutions to combat cyberattacks
Location San Diego, US - Japan Office

Team (Software Engineer)

Key Value
Title Software Engineer
Mission My mission is to develop new features and refactor legacy microservices for a distributed system that runs on 12,000+ servers and processes 80M+ events per second.
Task
  • I develop new features.
  • I refactor legacy microservices.
Term March 1, 2025 - present
Team Size 9
Type Permanent

Team (Site Reliability Engineer)

Key Value
Title Site Reliability Engineer
Mission As a member of a cross-functional global team, my mission was to enhance the reliability of a distributed system (running on 12,000+ servers and processing 80M+ events per second) by developing custom metrics and distributed tracing.
Task
  • Observability Development: I developed custom metrics and distributed tracing to improve system observability.
  • Incident Response: I troubleshot cross-microservice issues to identify their root causes.
Term October 1, 2024 - March 31, 2025
Team Size 5
Type Permanent

Projects

Key Value
Summary CR1. I designed and developed custom metrics, significantly improving observability across 3,000 servers.
Problem
  • Complex Troubleshooting Environment: The API server consisted of multiple interconnected components, making troubleshooting a complex task.
  • High Mean Time to Resolution (MTTR): Existing monitoring capabilities lacked sufficient insights, resulting in unacceptably long MTTR.
Mission Enhance Observability and Reduce MTTR: My objective was to improve system observability, streamline the debugging process, and significantly reduce MTTR.
Action
  • Developed Custom Metrics:
    • API Server: I implemented metrics for latency and error rates.
    • Elasticsearch: I developed metrics to monitor throughput.
    • MongoDB: I created metrics for query performance.
  • Improved Dashboards: I enhanced real-time dashboards to provide engineers with actionable insights for quicker issue identification.
Challenge Designing Actionable Metrics: The main challenge was designing custom metrics that gave engineers actionable insight rather than raw data points.
Overcome Fostered Cross-Team Communication: I initiated and facilitated communication between SRE, DevOps, and Product teams to collaboratively define the most impactful metrics.
Result Reduced MTTR by 50% for Enterprise Customers: The new metrics and dashboards made troubleshooting faster, halving MTTR for enterprise customers.
Skill Observability / Monitoring
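
To illustrate the kind of API server metrics described in CR1 (latency and error rate), here is a minimal Python sketch using only the standard library. It is not the production implementation: the class name `EndpointMetrics`, its methods, and the choice to count only 5xx responses as errors are all assumptions made for illustration. A real deployment would more likely export such values through a monitoring client such as Prometheus.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EndpointMetrics:
    """Hypothetical in-memory recorder for one API endpoint."""

    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    def observe(self, latency_ms: float, status_code: int) -> None:
        """Record one finished request."""
        self.latencies_ms.append(latency_ms)
        # Assumption: only server-side failures (5xx) count as errors.
        if status_code >= 500:
            self.errors += 1

    def error_rate(self) -> float:
        """Fraction of recorded requests that failed (0.0 if none recorded)."""
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank latency percentile, e.g. p=95 for p95."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

A dashboard panel could then plot `error_rate()` and `percentile(95)` per endpoint, which gives an engineer a direct signal of which component is degrading instead of a raw request log.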

Key Value
Summary CR2. I analyzed thread dumps and heap dumps to troubleshoot a critical server issue involving 2,000 concurrent threads.
Problem
  • Intermittent Server Crashes: A UI server for an enterprise customer (with 100,000 employees) was crashing intermittently.
  • Insufficient Monitoring Data: The root cause was unknown, and conventional monitoring tools did not provide adequate insights for diagnosis.
Mission Identify Bottleneck and Ensure Stability: My mission was to pinpoint the performance bottleneck causing the crashes and implement a solution to ensure system stability.
Action
  • Conducted Root Cause Analysis:
    • I collected and analyzed thread dumps and heap dumps from the affected server.
    • I discovered that a legacy API was blocking HTTP threads, which in turn triggered health check failures and led to automatic server restarts.
  • Implemented Solution: I migrated the problematic legacy API to an efficient v2 API, which resolved the blocking issue.
Challenge Automated Restarts Hindered Data Collection: An automated pipeline would restart the UI server immediately upon failure, preventing manual collection of necessary thread and heap dumps for analysis.
Overcome Enhanced Data Collection Pipeline: I modified the pipeline to automatically collect thread dumps and heap dumps before initiating a server restart.
Result Resolved Critical Availability Issues: The fix resolved the critical availability issues, improving reliability for the enterprise customer with 100,000 employees.
Skill Incident Response / Troubleshooting
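
The thread dump analysis in CR2 can be sketched roughly as follows. This is an illustrative Python example, not the tooling actually used: it assumes a jstack-style text dump, and the function name `summarize_thread_dump`, the `http-nio` thread name prefix, and the `com.example` class names in the sample are all hypothetical. It counts thread states and the top stack frame of each matching thread, which is one simple way to see where HTTP threads pile up.

```python
import re
from collections import Counter

# Patterns for the three line types of interest in a jstack-style dump.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
STACK_FRAME = re.compile(r'^\s+at (?P<frame>[^(\s]+)')


def summarize_thread_dump(dump: str, name_prefix: str = ""):
    """Count thread states and top stack frames for threads whose
    name starts with name_prefix (all threads if the prefix is empty)."""
    states = Counter()
    top_frames = Counter()
    current = None          # thread being parsed, or None if filtered out
    seen_top_frame = False  # only the first "at ..." line per thread counts
    for line in dump.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            name = header.group("name")
            current = name if name.startswith(name_prefix) else None
            seen_top_frame = False
            continue
        if current is None:
            continue
        state = THREAD_STATE.match(line)
        if state:
            states[state.group("state")] += 1
            continue
        frame = STACK_FRAME.match(line)
        if frame and not seen_top_frame:
            top_frames[frame.group("frame")] += 1
            seen_top_frame = True
    return states, top_frames


# Hypothetical sample data, shaped like jstack output.
SAMPLE_DUMP = '''\
"http-nio-8080-exec-1" #31 daemon prio=5 tid=0x01 nid=0x11 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.legacy.LegacyApi.handle(LegacyApi.java:42)
    at com.example.web.Controller.get(Controller.java:17)

"http-nio-8080-exec-2" #32 daemon prio=5 tid=0x02 nid=0x12 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.legacy.LegacyApi.handle(LegacyApi.java:42)

"http-nio-8080-exec-3" #33 daemon prio=5 tid=0x03 nid=0x13 runnable
   java.lang.Thread.State: RUNNABLE
    at com.example.web.Controller.get(Controller.java:17)

"GC Thread#0" os_prio=0 tid=0x04 nid=0x14 runnable
'''

states, top_frames = summarize_thread_dump(SAMPLE_DUMP, "http-nio")
```

In a dump with thousands of threads, a summary like this narrows the search quickly: if most HTTP worker threads are blocked on the same frame, that frame is the first place to look. The top frame is not always the informative one, so a fuller analysis would also inspect deeper frames and lock owners.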

Technology

Value Tag
Aerospike Backend
Apache Kafka Backend
Apache ZooKeeper Backend
Consul Backend
Elasticsearch Backend
GraphQL Backend
gRPC Backend
Java Backend
MongoDB Backend
PostgreSQL Backend
Python Backend
Redis Backend
Spring Boot Backend
AWS Infrastructure
Google Cloud Infrastructure
Jenkins Infrastructure
Kubernetes Infrastructure
Oracle Cloud Infrastructure
Terraform Infrastructure
Elastic Stack Monitoring
Grafana Monitoring
Jaeger Monitoring
Prometheus Monitoring