Job Role: Site Reliability Engineer – 2
Location: Bangalore
Job Description
At [24]7.ai, we’re passionate about building software that solves problems. We count on our site reliability engineers (SREs) to give our users a rich feature set, high availability, and stellar performance as they pursue their missions. As we expand our customer deployments, we are seeking an experienced SRE to deliver insights from massive-scale data in real time. Specifically, we are searching for someone who brings fresh ideas, demonstrates a unique and informed viewpoint, and enjoys collaborating with a cross-functional team to develop real-world solutions and positive user experiences at every interaction.
Objectives of this Role:
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Work during US hours and get direct experience communicating with and supporting our US clients.
- Build software and systems to manage platform infrastructure and applications.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
- Provide primary operational support and engineering for multiple large, distributed software applications.
Required Skills:
- Strong working knowledge of Red Hat Linux environments.
- Good knowledge of Python and Bash shell script development.
- Ability to program in one or more high-level languages, such as Python or Perl.
- Experience with logging, monitoring, alerting, CI/CD, and big data platform tools.
- Strong communication and analytical/problem-solving skills.
- Good understanding of cloud technologies such as GCP and Azure.
Preferred Qualifications
- Bachelor’s degree in computer science or another highly technical, scientific discipline.
- Previous success in technical engineering.
- Coding experience beyond simple scripts.
- Good understanding of networking concepts (load balancers, TCP/IP, firewalls).
- Experience with logging and monitoring tools such as Logstash, Kibana, and Grafana.
- Strong debugging skills.
Responsibilities:
- Be flexible to work during PST working hours.
- Perform Incident Management and Change Management to maintain the continuous availability of all Cloud Infrastructure services.
- Ensure all SRE and operating procedures are maintained and executed.
- Work in partnership with stakeholders to design, implement, manage, and support a highly available and secure infrastructure.
- Maintain a 24x7 production environment with a high level of service availability, perform quality reviews, and manage operational issues.
- Partner with development teams in defining and implementing improvements in service architecture.
- Interface with Dev, QA, and Ops teams to perform root cause analysis and re-instrument triggers to prevent future network degradation and outages.
- Explore new cloud technologies, features, and tools to improve the platform, and automate using Bash, Python, or Perl.
- Implement automation and orchestration for manual processes required to operate and deploy cloud services, and help turn new ideas into internal tools by working closely with teams.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Analyze alarms and dashboards to identify problem areas, report incidents, troubleshoot, and escalate as required.
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
- Perform ticket reviews and updates through the JIRA ticketing tool.
- Manage, coordinate, and document all types of maintenance and events.
- Must take initiative and be proactive.
- Must take on the responsibility to learn new products and procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Conduct post-incident reviews, create actionable reports, and recommend application optimizations for engineering teams.
- Implement proactive monitoring, alerting, trend analysis, and self-healing systems.
- Understand the existing architecture and work with various engineering teams to develop and execute strategies to provide a high-quality global production service.
About [24]7.ai
[24]7.ai is a leader in the Conversational AI market, with 250+ Fortune 500/1000 customers. We continue to transform our business to drive greater value to our team, shareholders, and customers through new product development and market growth.
[24]7.ai is redefining the way companies interact with consumers. Using Artificial Intelligence and Machine Learning to understand consumer intent, [24]7.ai’s technology helps companies create a personalized, predictive and effortless customer experience across all channels. The world’s largest and most recognizable brands are using intent-driven engagement from [24]7.ai to assist several hundred million visitors annually, through more than 1.5 billion conversations, most of which are automated and learn from each consumer experience.
For more information, visit: www.247.ai