Posted at: 29 April
Manager, Infra Tools AI
Company
NVIDIA Corporation is a Santa Clara-based technology company specializing in designing GPUs and AI solutions for gaming, professional visualization, and cloud services, operating in both B2B and B2C markets globally.
Remote Hiring Policy:
NVIDIA supports flexible remote work arrangements and hires from various regions globally, including the Americas, Europe, Asia, and the Middle East, with roles that may require collaboration across time zones.
Job Type
Full-time
Allowed Applicant Locations
Asia, Israel
Job Description
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. We are now seeking a highly motivated Infrastructure, Tools & AI Engineering Manager to join our Ethernet Switching group, working on SONiC Network OS. In this role, you will own and drive the engineering infrastructure that powers the full product development lifecycle, from development environments and CI pipelines through regression, code coverage, and test efficiency. You will apply cutting-edge AI and LLM capabilities to transform how we analyze failures, generate test coverage, and accelerate product quality.

What you’ll be doing:
Design, build, and maintain scalable infrastructure for development, integration, and test environments supporting SONiC OS
Architect and deliver LLM-based tools for intelligent regression analysis: failure classification, root cause clustering, anomaly detection, and test flakiness prediction
Lead efforts to reduce regression runtime through parallelization, smart test selection, and dependency-aware scheduling
Develop deep technical knowledge of SONiC Network OS internals, including its subsystem architecture, SAI/ASIC abstraction layer, and management plane
Lead and mentor a team of infrastructure and tooling engineers; set technical direction, define priorities, and grow team capabilities

What we need to see:
B.Sc. degree or higher in Computer Science, Software Engineering, or a related field, or equivalent experience
8+ overall years of software engineering experience, with at least 3 years in an infrastructure, DevOps, or tooling leadership role
Strong Python programming skills; experience building production-quality automation frameworks and tooling
Demonstrated experience designing and operating CI/CD systems at scale (Jenkins, GitLab CI, GitHub Actions, or equivalent)
Hands-on experience with LLMs or AI-assisted developer tooling: building, integrating, or productizing AI capabilities in an engineering workflow
Proven ability to lead technical teams: hiring, mentoring, technical roadmapping, and cross-team influence
Strong analytical and problem-solving skills with a bias toward measurable outcomes and data-driven decisions

Ways to stand out from the crowd:
Deep Linux expertise: system internals, networking stack, process management, and scripting
Prior experience building LLM-powered test analysis pipelines or AI-enhanced DevOps tooling in a real production environment
Knowledge of networking protocols and hardware: Ethernet switching, L2/L3 protocols, QoS, VLANs, high-performance data center networking
Experience with code coverage instrumentation in large-scale C/Python codebases and using coverage data for test prioritization
Track record of measurably improving regression runtime, test reliability, or CI throughput in a complex embedded or systems software environment