Site Reliability Engineer
Do you want to build and manage scalable, self-healing, globally distributed systems? Our Site Reliability Engineers keep Yelp fast, available, and growing, connecting users to great local businesses. No matter how many times we get searched, scraped, scanned, spammed, pinged, paged, or queried, we gotta keep our cool - and keep the site running smoothly.
We work in both the dev and systems worlds, implementing key parts of the core architecture and supporting devs as they try to do the same. We get to tackle interesting challenges that you can only find at the kind of scale that serves over 100 million users per month.
You'll work to empower product teams and developers. At Yelp, spinning up infrastructure should always be a git commit and a code review away; automation and self-service are at the core of what we do.
Where You Come In
You will work closely with developers in supporting new features and services
You will build tools to monitor site stability and performance
You will help scale our AWS-based infrastructure (no racking servers or swapping hard drives here!)
You will troubleshoot site issues using industry-leading tools like Splunk and SignalFx
You will automate everything with Puppet, Git, Jenkins, and Terraform
You will develop custom tools when off-the-shelf solutions don’t work at our scale and contribute upstream to open source projects
You will design new systems, tests, and procedures
You will participate in light on-call rotations - we have geographically distributed SRE teams for follow-the-sun support, which means no 2:00 AM pages!
What It Takes To Succeed
You have a mastery of Linux (we use Ubuntu but any distro is fine)
You have command of your favourite modern programming language: Python, Ruby, Go, Rust, Java, C++, etc.
You've got a solid understanding of fundamental technologies like TCP/IP, HTTP, and DNS
You have knowledge of best practices related to security, performance, and disaster recovery
You have experience with web server configuration (Apache/Nginx/HAProxy), monitoring, trending, and high availability
You've got strong scripting and automation skills
You have expertise in configuration management (Puppet/Ansible/Chef/etc.)
You have experience with public cloud platforms (we use AWS, but Azure/GCP are fine) and related tooling (Terraform, etc.)
You have experience with Docker or other container technologies
You have excellent communication and documentation skills
What You'll Get
Full responsibility for projects from day one, an awesome team, and a dynamic work environment
Competitive salary with equity in the company, a pension scheme, and an optional employee stock purchase program
25 days paid holiday initially, rising to 29 with service
Private health insurance, including dental and vision
Flexible working hours and meeting-free Thursdays
Regular 3-day Hackathons and weekly learning groups, always with interesting topics
Opportunities to participate in events and conferences throughout Europe and the US
Public transportation season ticket loan and £50 per month toward any exercise of your choice
Monthly personal development allowance
Central location, a fully stocked kitchen, adjustable sitting/standing desks, quarterly offsites, locally roasted coffee, happy hours, and more!