Cloud Infrastructure Reliability (SRE) Lead

Zagreb, Croatia

What do we do?

Photomath is a fast-growing EdTech company whose mobile app is the #1 app in the world for learning math. Powered by advanced machine learning technology, the app instantly scans and accurately solves printed and handwritten math problems, then walks users through them with intuitive step-by-step explanations.

With over 270 million downloads globally, Photomath is the most popular mobile application from Croatia and one of the most popular educational apps of all time. Since its launch in 2014, our award-winning app has topped the App Store and Google Play Store education charts, and Apple recently named it app of the day.

Today, we employ almost 100 people and have offices in Zagreb and Silicon Valley. We are a team of people with diverse backgrounds, experiences and skills, united by a passion for technology and innovation. We believe that math is an increasingly crucial skill, particularly as problem-solving and quantitative analysis become prerequisites for many occupations.

A Cloud Infrastructure Reliability Engineer combines software and systems engineering to build and run large-scale, distributed, low-latency, fault-tolerant systems serving tens of millions of users. The engineer in this role ensures that Photomath's services are reliable and performant, and keeps an eye on our systems' capacity. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating manual work through automation. This is done in collaboration with software developers and Core Infrastructure engineers to ensure high standards for our services' scale, performance, robustness, uptime and responsiveness.

As a Cloud Infrastructure Reliability Engineer at Photomath you’ll have the mandate to build an SRE-like (Site Reliability Engineering) team from the ground up and manage complex challenges of scale, using your expertise in coding, algorithms, complexity analysis and distributed system design. The engineer in this role will develop a broad knowledge of the system's architecture and will work across the organization to identify and help close gaps in design, test coverage, monitoring and key processes that affect our ability to meet our desired SLAs, SLOs and SLIs. The candidate will need to think at a cross-team, cross-service level and have (or build) the breadth of knowledge needed to support a very broad group of teams on two continents (US and EU).

Our engineering culture of diversity, intellectual curiosity, problem solving and openness is key to our success. Our organization brings together talented people who are encouraged to collaborate, think big and take risks in a blame-free environment.


Stack: GCP, Kubernetes, Terraform, Jenkins, Debian, Bitbucket, Java/Kotlin, C/C++, Python, Node.js, Shell, React, TypeScript, MySQL, PostgreSQL, MongoDB, Spring Boot, Micronaut, Flyway, Redis, Elasticsearch, Apache Kafka, Knative


Qualifications:

  • CS or similar technical degree/study or equivalent practical experience
  • Proven system architecture experience
  • Several years of programming experience in Java, C, C++, Python or similar
  • 5+ years of experience with Linux/Unix systems and networking
  • Experience analyzing and troubleshooting systems
  • Lives by the motto “Your problem is my problem”


Plus if you have:

  • Cloud experience with GCP, AWS or similar
  • Ability to debug and optimize code
  • Systematic problem solving, curiosity, communication skills and interest in service troubleshooting
  • Expertise in designing, analyzing and troubleshooting large-scale distributed systems


What is this role about:

  • Build a decentralized SRE-like group: identify SLIs, SLOs and SLAs, define error budgets, create playbooks and evangelize SEVs
  • Optimize existing systems end to end, build infrastructure aimed at automatic and intelligent failure recovery, and push for changes that improve reliability and velocity
  • Set up processes and tools for logging, monitoring and tracking anomalies, and for resolving or reporting the resulting issues (Jira tasks)
  • Engage in service design, creation, deployment, operation and refinement
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews
  • Practice sustainable incident response and blameless postmortems


Salary range:

  • 22.000 - 28.000 kn gross, depending on the candidate's competencies, with possible exceptions
  • Stock options 


What we offer:

  • Chance to grow your team and skills as we tackle even more demanding projects as the company grows
  • Flexible working hours and work-from-home arrangements
  • Dedication to a healthy work-life balance and various benefits for parents 
  • A diverse environment with agile and talented individuals across the career spectrum - to teach and be taught
  • A friendly, collaboration-heavy team atmosphere
  • A culture that recognizes and rewards dedication and success
  • A dedicated person (mentor/buddy) to help you navigate your first weeks in a new role
  • Learning and growth opportunities through knowledge sharing, education and conferences, individual development plan with a dedicated budget, weekly time devoted to learning new things
  • Cutting-edge hardware and equipment, plus a budget for additional equipment
  • Company events and celebrations, company retreat, team budget for team building activities
  • Birthday and holiday presents for employees and their kids
  • Generous vacation and paid leave policy, sick leave without a doctor's note, annual physical exam (check-up)
  • Multisport card for various discounts at sport facilities
  • Underground bicycle parking garage
  • Modern office design, great view :) and great location (Zagreb, Strojarska 20)
