DevOps Engineer - Big Data
- 305 Main St, Redwood City, CA 94063, USA
PubMatic is a digital advertising technology company for premium content creators. The PubMatic platform empowers independent app developers and publishers to control and maximize their digital advertising businesses. PubMatic’s publisher-first approach enables advertisers to maximize ROI by reaching and engaging their target audiences in brand-safe, premium environments across ad formats and devices. Since 2006, PubMatic has created an efficient, global infrastructure and remains at the forefront of programmatic innovation. Headquartered in Redwood City, California, PubMatic operates 13 offices and nine data centers worldwide.
Major Duties and Responsibilities:
- Manage large-scale Hadoop cluster environments, including capacity planning, cluster setup, performance tuning, monitoring, and alerting.
- Perform proofs of concept on scaling, reliability, performance, and manageability.
- Work with core production support personnel in IT and Engineering to automate deployment and operation of the infrastructure. Manage, deploy, and configure infrastructure with Ansible or other automation tool sets.
- Monitor Hadoop jobs and recommend optimizations:
  - Job monitoring
  - Rerunning jobs
  - Job tuning
  - Spark optimizations
  - Data monitoring and pruning
- Create metrics and measures of utilization and performance.
- Capacity planning, management, and troubleshooting for HDFS, YARN/MapReduce, and Spark workloads.
- Capacity planning and implementation of new/upgraded hardware and software releases as well as for storage infrastructure.
- Research and recommend innovative and, where possible, automated approaches for system administration tasks.
- Integrating ML libraries
- Hardware accelerations
- SQream / Kinetica / Wallaroo monitoring and maintenance
- Should be able to develop and apply patches
- Debugging infrastructure issues (such as underlying network issues or issues with the nodes)
- Addition/replacement of Kafka clusters/consumers
- Testing/support of infrastructure component changes (such as changing the load balancer to F5)
- Deployment during the release.
- Help the QA team with production parallel testing and performance testing.
- Help the Dev team with POC/ad hoc execution of jobs for debugging/cost analysis.
- Partner with program management, network engineering, and other cross-functional teams on larger initiatives.
- 9 years of professional experience in Java, Scala and Python.
- 5+ years of experience with Spark/MapReduce in a production environment
- Expert understanding of Linux-based systems and deep expertise in Hadoop/YARN/Spark-based technologies
- Expertise in designing, implementing, and administering large Hadoop clusters and related infrastructure such as Hive, Spark, HDFS, HBase, Oozie, Presto, Flume, Airflow, and ZooKeeper
- Experience in managing the life cycle of data services from inception and design to deployment, operation, migration, administration, and sunset.
- Proficiency with programming languages such as Python and Java
- Being an active member of or contributor to open source Apache Hadoop projects is a plus
- Multi-datacenter deployment / Disaster Recovery experience is a plus
- A deep understanding of Hadoop design principles, cluster connectivity, security, and the factors that affect distributed system performance.
- Experience with Kafka, HBase, and Hortonworks is mandatory.
- Prior experience with remote monitoring and event handling using Nagios and ELK
- Knowledge of best practices related to security, performance, and disaster recovery
- BE/BTech/BS/BCS/MCS/MCA in Computers or equivalent
- Excellent interpersonal, written, and verbal communication skills
All your information will be kept confidential according to EEO guidelines.