Postdoctoral Research Associate in Resilience for Extreme-scale High-Performance Computing Systems

Company Description

Oak Ridge National Laboratory (ORNL) is the US Department of Energy’s largest multi-program science and energy laboratory, with scientific and technical capabilities spanning the continuum from basic to applied research. ORNL is located in the city of Oak Ridge in eastern Tennessee, in the foothills of the Great Smoky Mountains.

 

Job Description

The Computer Science Research Group (CSR) in the Computer Science and Mathematics Division at Oak Ridge National Laboratory has an opening for one or more Postdoctoral Researchers in the field of resilience for extreme-scale high-performance computing systems.

Major Duties/Responsibilities
The successful candidate will carry out research and development in one or more of the following areas: (a) characterization of faults in extreme-scale high-performance computing (HPC) systems, (b) modeling of fault propagation and impact, (c) metrics to evaluate resilience, (d) mechanisms and interfaces to coordinate flexible fault management across hardware and software components, and (e) optimizing the cost-benefit trade-offs among performance, resilience, and power consumption. Emphasis will be placed on: (1) statistical methods for identifying faults, errors and failures, their root causes, and propagation paths using system monitoring data; (2) fault-aware and fault-tolerant hardware and software design, including HPC system software, parallel programming models and algorithm-based fault tolerance for science applications; and (3) design space exploration and adaptation for trading off performance, resilience, and power consumption at design time and runtime. This work will be carried out in a multi-disciplinary and multi-institutional team in support of the Oak Ridge Leadership Computing Facility (OLCF, http://olcf.ornl.gov) and in collaboration with Argonne National Laboratory and Lawrence Livermore National Laboratory. Publication of research results in peer-reviewed journals and conference proceedings is expected.


Qualifications

Qualifications Required
•    A Ph.D. in computer science, computer engineering, or a closely related field
•    Background in fault tolerance for parallel and distributed computing systems
•    Demonstrated research experience in fault tolerance for distributed systems
•    Strong analytical, software development and programming skills
•    Experience with parallel computing
•    Demonstrated written and oral communication skills
•    Effective interpersonal skills

Additional Qualifications Desired
•    Experience with statistical data analysis
•    Expertise in fault modeling
•    Experience with fault-tolerant HPC system software
•    Expertise in fault tolerance approaches for parallel applications
•    Experience with performance, resilience, and power consumption modeling
•    Expertise in parallel discrete event simulation
•    Experience with pattern-based software design

Additional Information

Applicants cannot have received their Ph.D. more than five years prior to the date of application and must complete all degree requirements before starting their appointment. Certain exceptions may be considered. This appointment will initially be for 24 months, with the possibility of an extension of up to 12 months. Initial appointments and extensions are subject to performance and availability of funding.


ORNL is an equal opportunity employer.  All qualified applicants, including individuals with disabilities and protected veterans, are encouraged to apply.