Software Engineer - Codec Avatar ML Compute Team

at Meta

Full Time

Reality Labs Research (RL-R) brings together a diverse and highly interdisciplinary team of researchers and engineers to create the future of augmented and virtual reality. On the Codec Avatars ML Compute team, you’ll work on building tools, libraries, and frameworks that will help researchers collaborate with each other and empower their research towards the generation of Codec Avatars. Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage a strong sense of ownership and embrace the ambiguity that comes with working on the frontiers of research.

In this software engineer role on the Codec Avatar ML Compute team, you will serve as the point of contact for Meta's research GPU super clusters, managing and optimizing compute resources to enable groundbreaking research in relightable avatars, full-body avatars, and generative AI for codec avatars.

Software Engineer - Codec Avatar ML Compute Team Responsibilities

Build, scale, and secure the HPC clusters within Meta research labs, a heterogeneous environment containing diverse operating systems and applications
Provide on-call support and lead incident root cause analysis through multiple infrastructure layers (compute, storage, network) for HPC clusters and act as a final escalation point
Collaborate in a diverse team environment across multiple scientific and engineering disciplines, making the architectural tradeoffs required to rapidly deliver software and infrastructure solutions
Find ways to leverage the scale and complexity of the larger Meta production infrastructure to solve problems for Reality Labs researchers
Provide guidance to other engineers on best practices to build mature services which are highly available, reliable, secure, and scalable
Work independently, handle multiple large projects simultaneously, and prioritize the team roadmap and deliverables by balancing required effort with resulting impact

Minimum Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
Experience in automating the management of infrastructure and services
3+ years of experience in distributed system performance measurement, logging, and optimization
3+ years of experience coding in at least one of the following languages: C++, Python, Rust, or Go
Thorough understanding of Linux operating system internals, including the networking subsystem
Experience with Python library management systems such as Conda or Python venv
Experience in writing system level infrastructure, libraries, and applications
Experience with software development practices such as source control, code reviews, unit testing, debugging and profiling
Proven track record of shipping software
Experience in developing performant software and systems

Preferred Qualifications

Experience with managing HPC scheduler libraries like Slurm, Kubernetes, or LSF
Prior experience in building out HPC clusters, handling compute, storage, network, operating systems, schedulers, and stakeholder discussions
Prior experience in cluster oncall operations, including troubleshooting server/scheduler/storage errors, maintaining compute/storage environments/libraries/tools, helping onboard users to the cluster, and answering general questions from users
Prior experience in cluster coordination and strategy planning, including collecting/understanding needs of users, developing tools to improve user experience, providing guidance on best practices, coordinating distribution of compute/storage resources, forecasting compute/storage needs, and developing long-term user experience/compute/storage strategies
Prior experience building tooling for monitoring and telemetry
Prior experience supporting configuration management in a multi-region environment
Prior experience optimizing multi-tenant HPC clusters for performance and maintenance
Prior experience with containerization technologies like Docker or Virtual Machines
Prior experience building services
Prior experience building PaaS or internal clouds
Prior experience in developing/managing distributed network file systems
Prior academic or development experience with machine learning and/or deep learning
Prior experience in ML libraries such as PyTorch, TensorFlow or cuDNN
Prior experience in GPGPU development with CUDA, OpenCL or DirectCompute
Prior experience in network security
Experience in database and data management systems at scale
Familiarity with Linux observability tools, such as eBPF

About Meta

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today, beyond the constraints of screens, the limits of distance, and even the rules of physics.

Meta is committed to providing reasonable support (called accommodations) in our recruiting processes for candidates with disabilities, long-term conditions, mental health conditions or sincerely held religious beliefs, or who are neurodivergent or require pregnancy-related support. If you need support, please reach out to accommodations-ext@fb.com.

$146,994/year to $208,000/year + bonus + equity + benefits

Individual pay is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base salary, Meta offers benefits. Learn more about benefits at Meta.

Location

Pittsburgh, PA

Engineer Machine Learning

Job Overview

Job Posted:

7 months ago
