As a Senior Infrastructure Software Engineer, you will focus on automating infrastructure installations and decommissions at scale. You will build tools that continually improve the scale and speed of our deployments. You will bring a passion for an “automate everything” approach that makes systems failure-resistant and ready to scale.

Your work will enable our partners to bring up new data centers for AI and replace servers and networking in existing data centers as quickly and efficiently as possible without impacting running services. You will also review hardware changes, plan deployments, and aggressively execute to expand our network.

The ideal candidate has a passionate curiosity about how the Internet, GPUs, and computers fundamentally work and has a strong knowledge of Linux and AI or GPU hardware. We require strong coding ability in Python, Go, or similar languages. This is a highly visible position that requires a deep technical understanding of data center infrastructure, physical and logical networking, and Linux, as well as basic experience with project management.

Requirements

  • 5 years of relevant development experience
  • Intermediate-level software development skills in Python, Go, or similar
  • Linux systems administration experience
  • Strong skills in network services, including REST APIs and HTTP
  • Strong tooling and automation development experience
  • Network fundamentals: DHCP, ARP, subnetting, routing, firewalls, IPv6
  • Experience with configuration management and infrastructure-as-code systems such as SaltStack, Chef, Puppet, Ansible, or Terraform
  • Experience with continuous / rapid release engineering
  • Experience working in a 24/7/365 service environment
  • Familiarity with day-to-day tasks and projects common in Data Center Operations
  • Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking and storage, performance, and scale
  • Experience with Kubernetes and containerization, VPNs, AI workloads, and blockchain-based protocols a plus
  • Deep knowledge of network engineering and protocols used in data center switching and routing, Internet routing, and optical line systems a plus
  • GPU programming, NCCL, CUDA knowledge a plus
  • Experience with PyTorch or TensorFlow a plus

Responsibilities

  • Aggressively seek opportunities to introduce cutting-edge technology and automation solutions that are effective, efficient, and scalable in order to improve our ability to deploy and maintain our global infrastructure
  • Provision, monitor, and maintain hardware, software, and networks in new data centers
  • Perform architecture and research work for decentralized AI workloads
  • Work with vendors to obtain, debug, and maintain the most efficient and effective next-generation hardware and software for Together AI’s workloads
  • Collaborate with Together AI’s partners to make informed decisions about hardware strategy
  • Plan and implement network and server installations, including in the areas of facility power (AC/DC), cooling, security/access, rack layout, and cable management
  • Provide technical leadership and guidance during deployment activities
  • Create and maintain documentation, plans, SOPs, MOPs, etc.
  • Communicate your results and updates through blog posts, internal talks, and tickets

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy  

Location

San Francisco