Site Reliability Engineer/Lead (f/m)

Zagreb, Croatia

What do we do?

Photomath is a fast-growing EdTech company whose mobile app is the world's #1 app for learning math. Powered by advanced machine learning technology, the app instantly scans, accurately solves, and intuitively explains printed and handwritten math problems to users through step-by-step explanations.

 

With over 220 million downloads globally, Photomath is the most popular mobile application from Croatia and one of the most popular educational apps of all time. Since its launch in 2014, our award-winning app has topped the App Store and Google Play Store education charts, and Apple recently named it its app of the day.

 

Today, we employ almost 100 people and have offices in Zagreb and Silicon Valley. We are a team of people with diverse backgrounds, experiences and skills, united by a passion for technology and innovation. We believe that math is an increasingly crucial skill, particularly as problem-solving and quantitative analysis become prerequisites for many occupations.

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, distributed, low-latency, fault-tolerant systems serving tens of millions of users. SRE ensures that Photomath's services are reliable and performant. Additionally, SREs keep an eye on our systems' capacity. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating manual work through automation. This is done in collaboration with software developers and DevOps engineers to ensure high standards for our services' scale, performance, robustness, uptime and responsiveness.


As a Site Reliability Engineer at Photomath, you'll have the mandate to build the SRE team from the ground up and manage complex challenges of scale, using your expertise in coding, algorithms, complexity analysis and distributed system design. The engineer in this role will develop a broad knowledge of the system's architecture and will work across the organization to identify and help close gaps in design, test coverage, monitoring and key processes that affect our ability to meet the desired SLAs, SLOs and SLIs. The candidate will need to think at a cross-team, cross-service level and have (or build) the breadth of knowledge needed to support a very broad group of teams on two continents (US and EU).

The engineering culture at Photomath is built on diversity, intellectual curiosity, problem solving and openness. Our organization brings together talented people who are encouraged to collaborate, think big and take risks in a blame-free environment.


Our Stack: GCP, Kubernetes, Terraform, Jenkins, Debian, Bitbucket, Java/Kotlin, C/C++, Python, Node.js, Shell, React, TypeScript, MySQL, PostgreSQL, MongoDB, Spring Boot, Micronaut, Flyway, Redis, Elasticsearch, Apache Kafka, Knative


Qualifications:

  • CS or similar technical degree/study or equivalent practical experience
  • Proven system architecture experience
  • Several years of programming experience in Java, C, C++, Python or similar
  • 5+ years of experience with Linux/Unix systems and networking
  • Experience analyzing and troubleshooting systems
  • Lives by the motto “Your problem is my problem”


It's a plus if you have:

  • Experience with a cloud platform such as GCP, AWS or similar
  • Ability to debug and optimize code
  • Systematic problem solving, curiosity, communication skills and interest in service troubleshooting
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems


What is this role about:

  • Build an SRE group, identify SLIs/SLOs/SLAs, define error budgets, create playbooks and evangelize SEVs
  • Optimize existing systems end to end, build infrastructure aimed at automatic and intelligent failure recovery, and push for changes that improve reliability and velocity
  • Set up processes and tools for logging, monitoring and tracking anomalies, and for resolving or reporting the resulting issues (Jira tasks)
  • Engage in service design, creation, deployment, operation and refinement
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews
  • Practice sustainable incident response and blameless postmortems

Salary range:

  • 22.000 - 28.000 kn gross, depending on the candidate's competencies, with possible exceptions
  • Stock options 

What we offer:

  • Chance to grow your team and skills as we tackle even more demanding projects as the company grows
  • Flexible working hours and work from home arrangements 
  • Dedication to a healthy work-life balance and various benefits for parents 
  • A diverse environment with agile and talented individuals across the career spectrum - to teach and be taught
  • A friendly, collaboration-heavy team atmosphere
  • A culture that recognizes and rewards dedication and success
  • A dedicated person (mentor/buddy) to help you navigate your first weeks in a new role
  • Learning and growth opportunities through knowledge sharing, education and conferences, individual development plan with a dedicated budget, weekly time devoted to learning new things
  • Cutting-edge hardware and equipment, budget for additional equipment
  • Company events and celebrations, company retreat, team budget for team building activities
  • Birthday and holiday presents for employees and their kids
  • Generous vacation and paid leave policy, sick leave without a doctor's note, annual physical exam (check-up)
  • Multisport card for various discounts at sport facilities
  • Underground bicycle parking garage
  • Modern office design, great view :) and great location (Zagreb, Strojarska 20)
