Location: Robbinsville, New Jersey, United States
Job Type: Full-Time
Apping Technology is an innovative technology company specializing in AI-driven solutions, ERP systems, and cloud-based SaaS platforms. We focus on delivering scalable, high-performance applications with a strong emphasis on security, automation, and reliability.
Job Overview
As a Site Reliability Engineer (SRE) at Apping Technology, you will be responsible for designing, building, and maintaining scalable infrastructure, ensuring system reliability, automating workflows, and improving incident response. You will work closely with development, operations, and security teams to enhance our cloud-based services and ensure seamless performance.
Key Responsibilities
Reliability & Performance
- Design, implement, and manage highly available and scalable infrastructure in cloud environments (AWS/Azure/DigitalOcean).
- Monitor system performance, identify bottlenecks, and optimize for speed, resilience, and cost-efficiency.
- Establish SLAs, SLOs, and error budgets to balance reliability and feature development.
Automation & Infrastructure as Code (IaC)
- Develop and maintain IaC using Terraform, Ansible, or equivalent tools.
- Automate deployment processes with CI/CD pipelines (GitHub Actions, GitLab CI/CD, Jenkins).
- Implement auto-scaling, failover mechanisms, and automated recovery strategies.
Incident Management & Monitoring
- Set up observability tools (Prometheus, Grafana, New Relic, Datadog, ELK stack) for proactive monitoring.
- Handle incident response, root cause analysis (RCA), and post-mortem processes.
- Ensure log management and monitoring solutions are in place for system health tracking.
Security & Compliance
- Implement cloud security best practices (IAM, firewalls, encryption, vulnerability management).
- Ensure compliance with industry standards and regulations such as ISO 27001, SOC 2, and GDPR.
- Conduct periodic security audits and penetration testing.
Database & Infrastructure Management
- Optimize and manage PostgreSQL, MySQL, and NoSQL databases for performance and availability.
- Maintain regular backups, failover mechanisms, and disaster recovery plans.
- Scale database solutions to meet business needs.
Collaboration & DevOps Culture
- Work closely with development teams to integrate reliability into the software development lifecycle.
- Enable developers to adopt DevOps best practices through self-service infrastructure and automation.
- Provide training and documentation for incident response and best practices.
Qualifications & Skills
Must-Have:
- 3+ years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Engineering.
- Strong experience with AWS, Azure, or DigitalOcean and their core services (for example EC2, RDS, S3, IAM, and managed Kubernetes).
- Expertise in Linux administration, networking, and shell scripting.
- Hands-on experience with Docker and container orchestration using Kubernetes.
- Proficiency in Terraform, Ansible, Helm, or other IaC tools.
- Experience with monitoring & logging tools like Prometheus, Grafana, ELK, Datadog.
- Familiarity with CI/CD pipelines and automation (GitHub Actions, GitLab CI, Jenkins).
- Strong programming and scripting skills in Python, Go, or Bash.
Good to Have:
- Knowledge of serverless architectures and event-driven cloud computing.
- Experience with cloud cost optimization strategies.
- Exposure to AI/ML infrastructure in cloud environments.
- Familiarity with multi-cloud and hybrid cloud setups.
Why Join Us?
- Cutting-edge projects in AI, SaaS, and cloud computing.
- Flexible work environment (remote options available).
- Continuous learning & development opportunities.
- Competitive salary & benefits package.