Software Developer - Operations and Reliability
- Bengaluru, Karnataka, India
With offices all over the globe, the Moving Picture Company (MPC) is one of the world's leading visual effects (VFX) studios, creating high-end VFX for the advertising and feature film industries. We are constantly looking for the best talent in the world: enthusiastic people who come to work every day with the desire to be a part of some of the best work in the industry.
About The Team and Our Work
Our status as a world-class VFX facility has been achieved through the development of industry-leading software which empowers our artists to create stunning imagery. We have curated a considerable portfolio of off-the-shelf and in-house software to meet these unique requirements, and continue to evolve and improve our technology as new needs emerge. The Core Engineering team operates within a larger R&D division to provide the business-critical infrastructure that enables multiple projects with thousands of shots to be completed simultaneously across the globe, efficiently and to the highest quality.
A blend of globally distributed software, systems and operations experts, we are responsible for building and maintaining key infrastructure and services in collaboration with site-local engineering teams and other specialized development teams. With users in all areas of the company, our solution (the Core Platform) is based on a distributed micro-services environment providing capabilities in areas such as compute, storage, and digital asset management. Development and operational support spans the stack from operating system through to desktop/web application front-ends.
In recognition of the flexibility, scalability and maturity of the Core Platform that has been created by the team for MPC, we have been tasked with establishing the Core Platform elsewhere, making it available to other business units that are part of Technicolor’s portfolio.
About The Role
As a member of the Core Engineering team you'll be joining a highly passionate team that is responsible for the efficiency, performance, availability and monitoring of MPC's micro-services framework, forming the backbone of MPC's VFX pipeline, as well as the services that surround it such as the monitoring platform and render farm.
This requires a mix of engineering skills ranging from systems knowledge to software architecture and scalability design. One day you will be designing and provisioning dashboards, the next trying to catch runaway threads with your bare hands, and the following day assisting development teams in writing scalable code or improving the performance of the micro-services platform by fine-tuning Python code.
The role is in a fast-paced environment where at times going from 0 to 100 in a jiffy is expected. This requires a systematic approach to problem solving as well as being able to think several steps ahead. We are often the last line of defense in case of a problem. This means the problems reaching the team are the harder but also more interesting ones, which often do not have a single solution.
To diagnose and resolve some of these problems we depend heavily on monitoring and alerting, and would like a person who is up to date with today's tooling. We currently use a variety of products (ELK, TICK, OpenTracing) to collect metrics from our micro-services as well as our entire render farm, collecting thousands of metrics a second, which brings its own scalability problems. We would like someone who is familiar with the various stacks from both a usage perspective and a deployment perspective.
Running at this scale means we love automation: we automate as many repetitive tasks as possible to give the team time to tackle the more interesting design problems and advance our entire stack. We use a mix of SaltStack and Ansible.
The role is aimed at improving the systems already in place at MPC as well as helping with future developments to deploy these systems into more modern environments, such as Kubernetes, and into different business units inside Technicolor, each with their own challenges.
All in all, this role covers a broad range of subjects, challenges and opportunities. It requires someone who truly enjoys squeezing the last microsecond out of a thread, doesn't mind getting their hands dirty and taking apart an entire system to really get a feel for how it is working, and, where possible, provides more tooling, insight and automation.
In this role, your responsibilities will include:
- Run, maintain and improve MPC's existing micro-services platform and metrics platform.
- Help extract these platforms and provide them as part of a modern, efficient and monitored platform to Technicolor's business units.
- Participate in an out-of-hours on-call rota to resolve incidents escalated outside of our normal working hours.
- Advocate for professional standards of development.
- Review events that impact availability and performance to guide future improvements.
- Troubleshoot problems across different levels of the stack, and in production environments.
- Contribute to maintaining an authoritative source of documentation.
- Coordinate with leadership to define and prioritise projects.
- Support the introduction of new technologies where and when appropriate.
- Evangelise use of the Core Platform within MPC and other Technicolor business units.
- Take initiative to improve the developer experience in small or large ways.
- Mentor and pair with others in the team to encourage the professional and technical growth of others.
- Own your personal development plan and identify training opportunities for others in the team.
- Provide some level of operational support for the platform as required.
To succeed in the role, the following experience and competencies are required:
- A passion for solving scalability problems on live systems during an outage as well as for future projects.
- Writing efficient, scalable, concurrent code.
- A good understanding of concurrency as it applies to CPython.
- A desire to create and/or improve performance, availability and monitoring on existing and new systems.
- Have system administrator skills.
- Knowledge of the various components used to develop micro-service frameworks (nginx, RabbitMQ, uWSGI, Flask).
- Experience administering and gaining insight from monitoring tools (TICK and ELK stacks).
- Experience using lower level tooling to gain insight from running systems and networking (strace, tcpdump).
- A love of good automation (Ansible, SaltStack a plus).
- Comfortable working in a fast-paced and dynamic environment where requirements change.
- Familiarity with source control, in particular Git, and associated best practices.
- Comfortable working in a primarily Linux based development and runtime environment.
- A solid knowledge of testing principles, in particular TDD and/or BDD.
- Be able to consider a technical solution from different perspectives, including algorithms, complexity, correctness, maintainability.
- Collaborative and team oriented approach to product development, working with teams across locations, timezones and cultures.
- Excellent verbal and written communication skills.
- Be self-motivated and demonstrate strong organisational skills.
Nice to Have
The following are not essential to be successful in the role; however, prior experience or the desire to grow in these areas will be of benefit:
- Experience with OpenTracing and Jaeger.
- Experience with CI/CD pipelines.
- SQL and general database knowledge.
- Experience with building and running Kubernetes clusters.
- Engage in engineering practices that avoid incidents and share knowledge of best practices for monitoring, alerting, etc.
- Cross-platform development on Windows and OS X.
- An interest in the architectural perspective - contributing to architectural decisions and other technical documentation.
- Experience of Agile and lean methodologies, and an interest in process improvement in these areas.
- Able to present technical concepts to a broad audience with varying levels of technical understanding.
- An awareness of security and keeping content secure.