
Parallelization and performance

JURASSIC is designed for efficient execution on modern multicore CPUs and high-performance computing (HPC) systems. To achieve this, the code implements a hybrid MPI–OpenMP parallelization model that allows it to scale from laptops to large clusters.

This page describes the parallelization strategy, typical execution modes, and best practices for performance tuning.


Parallelization model overview

JURASSIC uses two complementary levels of parallelism:

  • MPI (Message Passing Interface) for distributed-memory parallelism across nodes and processes
  • OpenMP for shared-memory parallelism within a single process

This hybrid approach is well suited for radiative transfer applications, where many calculations (e.g. individual rays or observations) are largely independent but each involves non-trivial computational work.


MPI parallelization

What is parallelized?

At the MPI level, JURASSIC typically parallelizes over:

  • independent observations / rays,
  • sets of profiles or time steps,
  • ensembles of forward or retrieval runs.

Each MPI process works on a subset of the total workload, minimizing communication overhead.

Characteristics

  • MPI communication is relatively lightweight.
  • Most computation occurs locally within each process.
  • Scaling is close to linear as long as enough independent work items are available.

Typical usage

mpirun -np 16 ./formod run.ctl

This launches 16 MPI processes, each handling a portion of the observation set.
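
On systems managed by a batch scheduler, the MPI launcher may differ (e.g. mpiexec, or srun under SLURM). The following line is an equivalent sketch for SLURM; the exact command and options depend on the site configuration:

# launch 16 MPI tasks via the SLURM launcher (site-specific)
srun -n 16 ./formod run.ctl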


OpenMP parallelization

What is parallelized?

Within each MPI process, OpenMP is used to parallelize:

  • ray tracing and radiative transfer along different rays,
  • loops over spectral channels or windows,
  • internal numerical kernels.

Controlling OpenMP threads

The number of OpenMP threads is controlled via the OMP_NUM_THREADS environment variable:

export OMP_NUM_THREADS=4

OpenMP threads share memory within a process and therefore have low overhead compared to MPI communication.
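
On multi-socket nodes, thread placement can also matter. The settings below are standard OpenMP environment variables rather than JURASSIC-specific options; whether they improve performance depends on the system:

export OMP_NUM_THREADS=4
# pin each thread to a physical core, close to its parent process
export OMP_PLACES=cores
export OMP_PROC_BIND=close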


Hybrid MPI–OpenMP execution

In a hybrid setup, MPI distributes work across processes, while OpenMP accelerates computations within each process.

Example:

export OMP_NUM_THREADS=4
mpirun -np 8 ./formod run.ctl

This configuration uses up to 32 CPU cores in total (8 MPI processes × 4 OpenMP threads each).

Hybrid execution is usually more efficient than pure MPI or pure OpenMP, especially on multi-socket nodes.
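
On a cluster, the same hybrid layout is usually expressed through the batch system. The following SLURM script is a sketch only; the resource numbers are placeholders, and directive names and any module setup vary between sites:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# one OpenMP team per MPI task, sized to the cores reserved per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./formod run.ctl

This sketch requests 2 × 8 × 4 = 64 cores and runs 16 MPI processes with 4 OpenMP threads each.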


Load balancing considerations

For good scalability, it is important that each MPI process receives a similar amount of work.

Potential load-imbalance sources include:

  • uneven distribution of observation geometries,
  • variable ray-path lengths (e.g. limb vs. nadir),
  • retrieval runs with different convergence behavior.

Best practices:

  • group similar observation types together,
  • avoid mixing very small and very large workloads in one run,
  • test scaling behavior for representative cases.

Parallel retrievals

Retrieval applications benefit strongly from parallelization because:

  • each observation can often be retrieved independently,
  • forward-model and Jacobian calculations are computationally intensive.

MPI parallelization is usually applied across retrieval cases, while OpenMP accelerates the internal linear algebra and radiative transfer.
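
A retrieval run is launched in the same way as the forward model. In the sketch below, the executable name retrieval and the control file retrieval.ctl are assumptions; substitute the names used in your actual retrieval setup:

export OMP_NUM_THREADS=4
mpirun -np 16 ./retrieval retrieval.ctl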


I/O considerations

Parallel performance can be limited by file-system access if not handled carefully.

Recommendations:

  • store lookup tables on fast local or parallel file systems (see the staging sketch after this list),
  • avoid excessive per-process file I/O,
  • minimize diagnostic output in large production runs,
  • reuse lookup tables across many runs to amortize I/O cost.
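
If node-local scratch storage is available, staging the lookup tables there once per job can reduce pressure on the shared file system. In this sketch, the table location and the $SCRATCH variable are placeholders; the run must then be pointed at the staged copy via the table path in the control file:

# copy the lookup tables to node-local scratch (paths are placeholders)
cp -r /project/jurassic/tables "$SCRATCH"/tables

# update the table path in the control file to the staged copy, then launch as usual
mpirun -np 16 ./formod run.ctl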

Numerical reproducibility

Because parallel execution can change the order of floating-point operations, and floating-point arithmetic is not associative:

  • floating-point round-off differences may occur,
  • bitwise-identical results across different MPI/OpenMP layouts are not guaranteed.

These differences are typically negligible for scientific applications but should be considered when comparing results across platforms.


Performance tuning tips

  • Prefer hybrid MPI–OpenMP over pure MPI for multicore nodes.
  • Choose MPI process counts that align with node topology.
  • Tune OMP_NUM_THREADS to avoid oversubscribing cores (a quick check is sketched below).
  • Validate performance with small test cases before large runs.
  • Disable expensive diagnostics unless explicitly needed.
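
A quick, generic way to check the oversubscription point before submitting a large job (plain shell, nothing JURASSIC-specific):

# logical CPUs available on this node (may include hardware threads)
CORES=$(nproc)

# intended layout: MPI processes per node x OpenMP threads per process
PROCS_PER_NODE=8
export OMP_NUM_THREADS=4

# the product should not exceed the available core count
echo "requesting $((PROCS_PER_NODE * OMP_NUM_THREADS)) of $CORES cores per node"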

Scaling expectations

JURASSIC has demonstrated good scalability for:

  • large numbers of independent observations,
  • global simulations,
  • long time series,
  • ensemble studies.

Actual scaling depends on the balance between computation and I/O and on the characteristics of the HPC system.


Summary

The hybrid MPI–OpenMP parallelization in JURASSIC enables efficient use of modern HPC systems while remaining flexible for smaller-scale runs. With appropriate configuration, JURASSIC can process large atmospheric datasets and retrieval problems within practical time constraints.