Posted at: 23 March

Manager, Site Reliability Engineering

Company

NVIDIA

NVIDIA Corporation is a Santa Clara-based technology company specializing in designing GPUs and AI solutions for gaming, professional visualization, and cloud services, operating in both B2B and B2C markets globally.

Remote Hiring Policy:

NVIDIA supports flexible remote work arrangements and hires from various regions globally, including the Americas, Europe, Asia, and the Middle East, with roles that may require collaboration across time zones.

Job Type

Full-time

Allowed Applicant Locations

United States

Salary

$208,000 to $333,500 per year

Job Description

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in self-driving cars, machine learning, supercomputing, gaming and visualization. NVIDIA gives automakers, tier-1 suppliers, automotive research institutions, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems for self-driving vehicles. We are developing the software and driving the processes for software development. We are looking for a seasoned and experienced SRE manager to drive the Infrastructure and Operations team.

What you’ll be doing:

- Lead the team of site reliability engineers responsible for automating maintenance of 10,000+ hosts and supporting customers in debugging workflows.
- Be responsible for maintaining service-level agreements (SLAs).
- Be passionate about continuous improvement, driving critical metrics for customer responsiveness and delivering to service-level agreements.
- Reuse AI techniques and data analytics to extract useful signals about machines and jobs to ensure high availability and resiliency of the systems in the data center.
- Take part in prototyping, designing and developing cloud infrastructure for NVIDIA.

What we need to see:

- Solid programming background in Python and/or relevant scripting languages.
- Experience maintaining large-scale cloud infrastructure applications.
- Excellent debugging and problem-solving skills.
- An extraordinary teammate who can collaborate well across time zones.
- Proven track record of delivering solutions using Agile processes and methodologies.
- BS/MS in Computer Science, Computer Engineering or equivalent experience.
- 8+ years of overall industry experience, including at least 2 years of people management experience.

Ways to stand out from the crowd:

- Previous experience in managing and leading small engineering teams.
- Experience with using and improving data centers.
- Experience with computer algorithms and the ability to choose the best possible algorithms to meet the scaling challenge.
- Ability to divide complex problems into simple sub-problems and then reuse available solutions to implement most of them.
- Ability to design simple systems that can work reliably without needing much support.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people in the world working for us. If you're creative and autonomous, we want to hear from you!

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 333,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 26, 2026.

This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.