Site Reliability Engineer (SRE) – FedRAMP
Location: New York, USA
Description
We’re growing and looking to hire a Site Reliability Engineer who embodies our core values: People First, Customer Obsession, Strive for Excellence, and Integrity.
Claroty’s Public Sector practice is rapidly expanding to secure the mission-critical systems that our society’s safety and stability depend on. We are looking for mission-driven professionals who want to join a high-growth team dedicated to protecting critical infrastructure and ensuring essential services remain resilient and uninterrupted.
About the Role:
We are seeking a skilled Site Reliability Engineer (SRE) to support and maintain Claroty’s FedRAMP-compliant deployment in AWS GovCloud for public sector customers. The SRE will be responsible for ensuring high availability, security, and compliance of cloud-based environments while driving automation, monitoring, and incident response best practices.
As an SRE, your responsibilities will include:
- AWS GovCloud Operations: Manage and optimize Claroty’s cloud-based infrastructure in AWS GovCloud, ensuring FedRAMP compliance and high availability.
- Reliability & Performance: Monitor and enhance system performance, scalability, and reliability through observability tools, automation, and best practices.
- Security & Compliance: Implement and maintain security controls aligned with FedRAMP, NIST SP 800-53, and other federal cybersecurity standards.
- Infrastructure as Code (IaC): Develop and manage infrastructure automation using Terraform and Ansible.
- CI/CD & Automation: Enhance DevSecOps pipelines, automate deployments, and improve system resilience through tools like GitLab CI/CD, Jenkins, and Kubernetes.
- Incident Response & Monitoring: Implement and manage monitoring solutions (Prometheus, Grafana, ELK Stack), respond to incidents, and conduct post-mortems.
- Networking & Security: Configure and maintain VPCs, VPNs, security groups, and firewalls in AWS GovCloud, ensuring compliance with FedRAMP requirements.
- GOV Production Gatekeeper: Manage rollout strategy for new technologies and oversee their execution to ensure minimal disruption to existing systems.
- GOV Production On-Call: Act as the first line of response for critical incidents, assessing issues, triaging, and coordinating with the team to prevent further problems and swiftly restore services.
- Production Performance Monitoring: Track system performance metrics closely and detect degradation early to prevent outages and disruptions.
- Production Maintenance: Perform regular infrastructure upgrades to keep pace with changes in the technology landscape.
- Release Management: Oversee the release of updates and new functionality, ensuring a seamless transition and mitigating any negative impact on production.
- Collaboration: Work closely with DevOps, security teams, developers, and federal stakeholders to maintain a compliant and secure cloud environment.