In my chemistry work I often run into computationally intensive problems where I need to parallelize the calculations. Here I list the typical problem cases and the strategies I used to solve them:

Processing a big MD trajectory.

In this case, a big MD trajectory file (e.g. a LAMMPS dump) needs to be processed frame by frame to extract information. The file is too big to fit in memory, and each frame consists of tens of thousands of atoms and is therefore expensive to process. My strategy splits the processing into two steps:

  1. Read the file frame by frame, doing no processing but merely marking the starting byte offset of each frame. This index is written to a cache file.
  2. Read the index and use it to map segments of the file into memory with mmap. The frames are then processed in parallel (in Python) using joblib.Parallel and the results are combined.

This strategy bypasses the I/O bottleneck and avoids explicitly managing a parallel environment; everything is taken care of by joblib. However, it only works for trivially parallelizable jobs with no communication between the workers. A demo of this strategy can be found in ale_scripts/thickness.py and ale_scripts/fast_lammpsdump.py, and a minimal sketch is shown below.
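
Below is a minimal sketch of the two-step pattern (not the actual ale_scripts code). The file name traj.lammpsdump, the cache name traj_index.npy, and the placeholder per-frame analysis are made up for illustration; the only assumption about the format is that each frame of a LAMMPS dump starts with an "ITEM: TIMESTEP" line.

import mmap
import numpy as np
from joblib import Parallel, delayed

DUMP = "traj.lammpsdump"        # hypothetical trajectory file
INDEX_CACHE = "traj_index.npy"  # hypothetical cache of byte offsets

def build_index(path):
    """Step 1: record the starting byte offset of every frame, no parsing."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            if line.startswith(b"ITEM: TIMESTEP"):
                offsets.append(pos)
            pos += len(line)
    offsets.append(pos)  # end-of-file sentinel so the last frame has an end
    np.save(INDEX_CACHE, np.array(offsets, dtype=np.int64))

def process_frame(path, start, end):
    """Step 2 worker: map one frame into memory and extract something from it."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        frame = mm[start:end].decode()
        # real analysis would parse the frame text; as a placeholder, count lines
        return len(frame.splitlines())

if __name__ == "__main__":
    build_index(DUMP)
    offsets = np.load(INDEX_CACHE)
    results = Parallel(n_jobs=-1)(
        delayed(process_frame)(DUMP, start, end)
        for start, end in zip(offsets[:-1], offsets[1:])
    )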

Building a big subgraph-distance matrix.

Here we need to perform \(O(N^2)\) subgraph isomorphism calculations between \(N\) graphs. The graph algorithms are implemented in a C program and called through the command line, with file-based input and stdout-based output. The construction of the input and the interpretation of the output are most conveniently done in Python. I therefore chose to parallelize the construction of the distance matrix using mpi4py and wrap the single-process, multithreaded C program in Python. Creation of the MPI communicator and distribution of the work are managed in Python.

While this problem is also trivially parallelizable, this strategy distinguishes between the master and the workers and requires communication between them, so it is more general. However, the worker programs themselves are not parallel, and the scheduling of the worker processes is left to the OS. I did not attempt to use more than one node. A demo of this strategy can be found in ale_scripts/kart_helpers.py; a minimal master/worker sketch follows.
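
Below is a minimal master/worker sketch of this pattern (not the actual kart_helpers.py code). The executable name subgraph_dist, the input file names graph_<i>.in, and the number of graphs are hypothetical, and it assumes the C program prints a single distance to stdout.

import subprocess
from itertools import combinations
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_DONE = 1, 2

def distance(i, j):
    """Wrap the (hypothetical) command-line C program and parse its stdout."""
    out = subprocess.run(
        ["subgraph_dist", f"graph_{i}.in", f"graph_{j}.in"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

if rank == 0:
    # master: hand out (i, j) pairs one at a time and collect distances
    N = 100  # hypothetical number of graphs
    pairs = list(combinations(range(N), 2))
    D = np.zeros((N, N))
    status = MPI.Status()
    n_dispatched = n_received = 0
    # prime each worker with one pair (or tell it to stop if none are left)
    for w in range(1, size):
        if n_dispatched < len(pairs):
            comm.send(pairs[n_dispatched], dest=w, tag=TAG_WORK)
            n_dispatched += 1
        else:
            comm.send(None, dest=w, tag=TAG_DONE)
    while n_received < len(pairs):
        i, j, d = comm.recv(source=MPI.ANY_SOURCE, status=status)
        n_received += 1
        D[i, j] = D[j, i] = d
        w = status.Get_source()
        if n_dispatched < len(pairs):
            comm.send(pairs[n_dispatched], dest=w, tag=TAG_WORK)
            n_dispatched += 1
        else:
            comm.send(None, dest=w, tag=TAG_DONE)
    np.save("distance_matrix.npy", D)
else:
    # worker: keep receiving pairs until the master says stop
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, status=status)
        if status.Get_tag() == TAG_DONE:
            break
        i, j = task
        comm.send((i, j, distance(i, j)), dest=0)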

Nudged elastic band (NEB) using VASP and ASE.

Now things get more complicated. ASE implements a VASP calculator that launches VASP from Python via an mpirun command line, presumably as a subprocess. In NEB, the multiple images need to exchange forces so that position updates are coordinated. I did not pursue this due to resource constraints, but the following discussion shows the possibilities and the difficulties.

In the job script we run python neb.py, and the Python interpreter is unaware it is running inside a parallel environment. The command to launch VASP is simply mpirun -np <ncores> vasp, and mpirun has no way of telling which cores it should launch the processes on.

Now, suppose we launch 128 MPI ranks (say we call mpirun -np 16 eight times). If these all land on the same node (e.g. a shared node), we can more or less rely on the operating system to decide which cores to use for which process, or (better) if we own the whole node, we can explicitly specify which MPI rank (process) goes to which cores. So this strategy will work.

However, we do not have enough cores on a shared node to fit the whole NEB. When using dc*, the allocation of resources is left to the scheduler (Sun Grid Engine), and in the general case the cores made available to you are scattered across many nodes. Since the Python interpreter is still running on just one node (which could be any node in the allocated pool), each mpirun -np 16 command will repeatedly spawn its 16 ranks on that same node. Therefore, even if you have, say, 20 cores on host1, 20 cores on host2, and so on, mpirun launches all 128 ranks on the single node where the Python interpreter is running. This creates enormous oversubscription and the jobs have abysmal performance.

One way to work around this is to programmatically decide which nodes/sockets/cores to use for each mpirun command. For Intel MPI (IMPI), this is done through the -machinefile switch together with the I_MPI_PIN_PROCESSOR_LIST environment variable (passed via -env). The machine file specifies which nodes to use and how many ranks to place on each node, and I_MPI_PIN_PROCESSOR_LIST defines how the ranks are pinned to cores within a node. The missing piece is how to obtain this information (node list and node topology) from the grid engine. A hedged sketch of such a launch is given below.
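
As a rough sketch (not the actual setup), the following shows how one mpirun launch could be built programmatically once the target node and cores are known. The machine-file format host:nranks and the -env flag are Intel MPI conventions; the node name, core numbering, and the executable name vasp_std are assumptions for illustration.

import subprocess
import tempfile

def launch_image(node, cores, nranks, exe="vasp_std"):
    """Launch nranks MPI ranks of exe on node, pinned to the given cores."""
    # Intel MPI machine file: one line per host, "<hostname>:<ranks on that host>"
    mf = tempfile.NamedTemporaryFile("w", suffix=".machinefile", delete=False)
    mf.write(f"{node}:{nranks}\n")
    mf.close()
    pin_list = ",".join(str(c) for c in cores)  # logical processor ids
    cmd = [
        "mpirun", "-np", str(nranks),
        "-machinefile", mf.name,
        "-env", "I_MPI_PIN_PROCESSOR_LIST", pin_list,
        exe,
    ]
    return subprocess.Popen(cmd)

# e.g. 16 ranks of one NEB image on a hypothetical node "n1972", cores 0-15:
# proc = launch_image("n1972", range(16), 16)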

SGE defines a host-file variable, $PE_HOSTFILE, which can be read while the job is running. The file looks like this:

n1992 11 pod_apollo_6240.q@n1992 <NULL>
n1972 20 pod_apollo_6240.q@n1972 <NULL>
n1974 18 pod_apollo_6240.q@n1974 <NULL>
n1906 16 pod_apollo_6240.q@n1906 <NULL>
n1888 2 pod_apollo_6240.q@n1888 <NULL>
n1969 14 pod_apollo_6240.q@n1969 <NULL>
n1978 13 pod_apollo_6240.q@n1978 <NULL>
n1874 13 pod_apollo_6240.q@n1874 <NULL>
n1894 2 pod_apollo_6240.q@n1894 <NULL>
n1889 2 pod_apollo_6240.q@n1889 <NULL>
n1903 1 pod_apollo_6240.q@n1903 <NULL>

The columns are: node, number of cores allocated on that node, per-node queue, and core binding. Here the last column is <NULL>, which leaves the on-node scheduling to Linux/IMPI (I am not sure which). Now, if we add the SGE parameter #$ -binding pe linear:12, we are demanding that, if possible, the scheduler give us 12 consecutive cores on each socket. This changes the $PE_HOSTFILE to the following:

n1972 3 pod_ib100_apollo10.q@n1972 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0,15:0,16:0,17:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1,10:1,11:1,12:1,13:1,14:1,15:1,16:1,17
n1974 3 pod_ib100_apollo10.q@n1974 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0,15:0,16:0,17:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1,10:1,11:1,12:1,13:1,14:1,15:1,16:1,17

Essentially, this file now carries the binding suggestion from the scheduler, which Intel MPI respects when possible. However, the binding suggestion is produced only after the resources are allocated; in the hostfile above, only 3 cores are actually assigned on each node, which obviously makes the binding suggestion impossible to fulfill. We can go one step further and demand that a certain node topology be satisfied using m_topology in the -l resource list:

#$ -l h_data=4G,h_rt=24:00:00,m_topology=SCCCCCCCCCCCCSCCCCCCCCCCCCSCCCCCCCCCCCC

This SCCC... string specifies that on each socket (S) we must have 12 cores (C) assigned, and that we demand 3 such sockets. If this is granted, the $PE_HOSTFILE should contain the desired binding information, which we can parse in the Python interpreter and pass to IMPI's mpirun as a machine file and I_MPI_PIN_PROCESSOR_LIST. However, after testing for a night I was not able to obtain a single allocation with the topology demand satisfied. This is not surprising given the stringent requirements I put up, and it also more or less defeats the purpose of using a distributed/heterogeneous computing environment. A sketch of how the $PE_HOSTFILE could be parsed is given below.
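
For completeness, here is a hedged sketch of how $PE_HOSTFILE could be parsed, assuming the fourth column is either <NULL> or the socket,core:socket,core form shown above. The mapping from (socket, core) pairs to the logical processor ids expected by I_MPI_PIN_PROCESSOR_LIST is topology-dependent and is only approximated here with an assumed cores_per_socket.

import os

def parse_pe_hostfile(path=None, cores_per_socket=18):
    """Return {node: {"slots": n, "cores": [logical ids] or None}} from $PE_HOSTFILE."""
    path = path or os.environ["PE_HOSTFILE"]
    hosts = {}
    with open(path) as f:
        for line in f:
            node, nslots, _queue, binding = line.split()
            if binding == "<NULL>":
                cores = None  # no binding granted; leave pinning to the OS/IMPI
            else:
                pairs = [tuple(map(int, p.split(","))) for p in binding.split(":")]
                # crude (socket, core) -> logical id mapping; real numbering
                # depends on the node topology (this is an assumption)
                cores = [s * cores_per_socket + c for s, c in pairs]
            hosts[node] = {"slots": int(nslots), "cores": cores}
    return hosts

# hosts = parse_pe_hostfile()
# for each node, one could then write a one-line machine file and an
# I_MPI_PIN_PROCESSOR_LIST string and hand them to mpirun as sketched earlier.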