SRE Team Leader

  • Claroty
  • Israel
  • Full Time Employee
About The Position

We’re growing and looking to hire an SRE Team Leader who embodies our core values: People First, Customer Obsession, Strive for Excellence, and Integrity.

About Claroty:   

Claroty is on a mission to secure cyber-physical systems across industrial, healthcare, commercial and public sector environments: the Extended Internet of Things (XIoT). The Claroty Platform integrates with customers’ existing infrastructure to provide a full range of controls for visibility, exposure management, network protection, threat detection, and secure access. Our solutions are deployed by over 1,000 organizations at thousands of sites across all seven continents.

Claroty is headquartered in New York City, with employees across the Americas, Europe, Asia-Pacific, and Tel Aviv. The company is widely recognized as the industry leader in cyber-physical systems protection, with backing from the world’s largest investment firms and industrial automation vendors, as well as recognition from KLAS Research as Best in KLAS for Healthcare IoT Security, the Deloitte Technology Fast 500, the Forbes Cloud 100, and the Fortune Cyber 60. 

About the Engineering team 

The Claroty Engineering (R&D) Department is a group of talented engineers specializing in backend (BE), frontend (FE), DevOps, QA, and automation. They come from a variety of backgrounds and organizations and bring strong experience and skills in software development and cybersecurity.

Our engineers build our products with state-of-the-art technologies, including Kafka, Kubernetes (K8s), Spark, the latest React and Angular, AWS Lambda, Argo Workflows, and many others, ensuring fast, high-quality delivery to our customers.

We are solving some of the most complex technical challenges in the industry today: in-depth OS-level activity, network traffic analysis, big-data analysis, multi-tenant architecture, design and implementation under limited resources to meet high performance requirements, and sophisticated UX concepts.

Overview

The SRE and NOC Team Leader is tasked with leading, coordinating, and overseeing the management of the Production Cloud environment and infrastructure. This role includes ensuring efficient, seamless rollouts, high system performance, and quick response times when disruptions occur. The Team Leader works with both the SRE and NOC teams to balance rapid technology rollouts and upgrades with reliability.

The SRE and NOC Team Leader role is a strategic and critical position tasked with leading, coordinating, and overseeing the management of two international squads:

The Site Reliability Engineering (SRE) team is based primarily in Israel and the US, and the 24/7 Network Operations Center (NOC) squad will be based in a location to be determined.

The role requires the candidate to be available mainly during local working hours.

A significant part of this role also includes reestablishing both squads and recruiting their members, as well as defining and implementing the relevant tools and processes.

This is a tremendous opportunity for the right candidate to build these squads from the ground up, shaping the future direction of our SRE and NOC operations.

Requirements:

As an SRE Team Leader, your impact will be:

Site Reliability Engineering (SRE)

  1. Production Gatekeeper: Design and enforce the rollout strategy for new technologies and oversee their execution to ensure minimal disruption to existing systems.
  2. Production On-Call: Act as the first line of response for critical incidents, assessing issues, triaging, and coordinating with the team to prevent further issues and swiftly restore services.
  3. Monitor Production Performance and Degradation: Keep a close eye on system performance metrics and detect any degradation early to prevent outages and disruptions.
  4. Production Maintenance: Conduct regular infrastructure upgrades to accommodate changes, developments, and advancements in the technological landscape.
  5. Manage Release Flow: Oversee the release of updates and new functionalities, ensuring a seamless transition while handling any potential negative impacts on production.
  6. Staging Management: Oversee the management of the staging environment, ensuring that it accurately represents the production environment for effective testing and simulation.

Network Operations Center (NOC)

  1. Build Playbooks: Develop and maintain comprehensive playbooks for managing system issues and incidents, setting guidelines for troubleshooting, escalation, and resolution processes.
  2. Build Monitoring Dashboards: Design, set up, and maintain monitoring dashboards to visualize and track system performance and incidents in real-time.
  3. Alerts and Incident Management: Establish protocols for issuing alerts in the event of system issues or anomalies and lead the team in incident resolution.
