Splitting up huge ExoMol .trans files


Some ExoMol .trans files are very large (up to 100 GB), which can cause memory problems and make multiprocessing inefficient or impossible. Therefore, line racer provides a function that splits these huge .trans files into smaller .npz files, which store the important line parameters directly. Only these smaller files then have to be read for the opacity calculation, which is much faster and more memory-efficient.

Splitting is recommended if you want to calculate opacities from line lists that contain many lines but are distributed over only a few files. One example is the ExoMol line list for H2CS, which contains 43 billion lines in just eight files. Splitting is also recommended for the ExoMol MM line list for CH4, where some files are larger than 35 GB, which makes multiprocessing on a single node very difficult with typical node memory. Once the split files exist, the multiprocessing uses them automatically. Use this function if you have run into memory problems or if you want to optimize the multiprocessing and therefore the speed of the calculation.
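To decide whether splitting is worthwhile, it can help to check how large the transition files in a folder actually are. The following is a minimal sketch using only the Python standard library. The function name `find_large_trans_files` and the 10 GB default threshold are assumptions made for illustration and are not part of line racer.

```python
import os


def find_large_trans_files(folder, threshold_gb=10.0):
    """Return (file name, size in GB) for transition files at or above a size threshold.

    Hypothetical helper, not part of line racer. The default threshold
    is an arbitrary illustration value.
    """
    large = []
    for name in sorted(os.listdir(folder)):
        # ExoMol transition files end in .trans or, if compressed, .trans.bz2
        if not (name.endswith(".trans") or name.endswith(".trans.bz2")):
            continue
        size_gb = os.path.getsize(os.path.join(folder, name)) / 1e9
        if size_gb >= threshold_gb:
            large.append((name, size_gb))
    return large
```

Calling `find_large_trans_files('line_list/CH4/')` would then list the files that are candidates for splitting.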

For demonstration purposes, we split one file from the ExoMol MM line list for CH4. First, we need to download the .trans file(s) we want to split and the .states file, which contains the energy levels and is required for the splitting. The smaller files store the important line parameters directly: the effective wavenumber, the Einstein A coefficient, the lower-state energy, the upper-state degeneracy, and the upper- and lower-state rotational quantum numbers. The lines are split into wavenumber bins of a fixed width, which can be set as an input. The smaller files are stored in the same folder as the original .trans file and are named with the original file name and the bin wavenumbers, so that they can be easily identified and used for the opacity calculation. The original .trans file is not deleted, so you can always go back to it if needed. If you want to save disk space, you can delete it after the splitting, since it is no longer needed for the opacity calculation.
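The idea of splitting lines into bins of fixed width can be sketched in a few lines of plain Python. This is only an illustration of the binning principle, not line racer's implementation: the function names, the tuple layout of a line, and the file naming pattern are all assumptions.

```python
import math
from collections import defaultdict


def bin_lines(lines, bin_width):
    """Group lines by wavenumber bin.

    Each line is assumed to be a tuple whose first entry is the
    wavenumber in cm^-1 (hypothetical layout for illustration).
    """
    bins = defaultdict(list)
    for line in lines:
        index = math.floor(line[0] / bin_width)  # bin that contains this wavenumber
        bins[index].append(line)
    return bins


def bin_file_name(original_name, index, bin_width):
    """Build a file name from the original name and the bin edges.

    The pattern is hypothetical; the names line racer actually writes may differ.
    """
    low = index * bin_width
    high = low + bin_width
    return f"{original_name}__{low:.1f}-{high:.1f}.npz"


# Three made-up lines: (wavenumber, Einstein A coefficient)
example = [(1000.5, 1.2e-3), (1004.2, 3.4e-5), (1012.7, 8.9e-4)]
for index, members in sorted(bin_lines(example, bin_width=10.0).items()):
    print(bin_file_name("12C-1H4__MM__01000-01100.trans", index, 10.0), len(members))
```

With a bin width of 10, the first two example lines fall into the same bin and the third into the next one, so two output files would result.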

[1]:
import os
import urllib.request
import line_racer.line_racer as lr
[2]:
os.makedirs('line_list/CH4', exist_ok=True)

base = "https://www.exomol.com/db/CH4/12C-1H4/MM"

files = {
    f"{base}/12C-1H4__MM.states.bz2" : "line_list/CH4/12C-1H4__MM.states.bz2",
    f"{base}/12C-1H4__MM__01000-01100.trans.bz2" : "line_list/CH4/12C-1H4__MM__01000-01100.trans.bz2"
}

for url, dest in files.items():
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)

In this case we use the compressed .bz2 files, which slows down the process but is convenient for demonstration purposes. The splitting also works with uncompressed files and is then about twice as fast, since the files can be read more quickly.
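The difference between compressed and uncompressed input comes down to how the file is opened and decompressed while reading. As a minimal sketch using only the standard library, a reader can choose the opener from the file extension. The helper names `open_trans` and `count_lines` are hypothetical and do not describe how line racer reads files internally.

```python
import bz2


def open_trans(path):
    """Open a transition file as text, decompressing on the fly for .bz2 files."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")  # streams and decompresses while reading
    return open(path, "r")


def count_lines(path):
    """Count the lines in a transition file without loading it into memory."""
    with open_trans(path) as handle:
        return sum(1 for _ in handle)
```

Because both branches return a file-like object that yields text lines, the rest of a reader does not need to know whether the input was compressed. The decompression step is extra work on every read, which is consistent with the compressed files being slower to process.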

[3]:
pressures, temperatures = lr.LineRacer.prt_pressure_temperature_grid()

CH4_racer = lr.LineRacer(database='exomol',
                         mass=16.03130013,
                         input_folder='line_list/CH4/',  # path to folder with input files
                         temperatures=temperatures,  # in K
                         pressures=pressures,  # in bar
                         species_isotope_dict={'12C-1H4': 1.0},
                         line_list_name='MM',
                         broadening_type='exomol_table',
                         broadening_species_dict={'H2': 0.85, 'He': 0.15},
                         )

CH4_racer.split_huge_transition_files(transition_files_list=None, bin_width=10.0, use_mpi=True)
Splitting line_list/CH4/12C-1H4__MM__01000-01100.trans.bz2 took 80.1137318611145 seconds

For the line racer object used here, only the information relevant to the splitting matters, such as the input folder and the line list name. The pressures, temperatures and broadening information are not used for the splitting, but they are required when constructing the object. For the splitting, you can either provide a list of transition files via transition_files_list or, as with transition_files_list=None above, let line racer search for all .trans files in the input folder. The bin_width parameter controls the width of the bins into which the lines are split. The smaller the bin width, the smaller the resulting files, but the more files are created. A good value for the bin width depends on the line density of the line list, the size of the original files and the available memory. For very large files and a high line density, a smaller bin width is recommended to make sure that the resulting files do not become too large. The splitting is parallelized with MPI, so you can set use_mpi=True if you have MPI set up on your cluster.
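To get a rough feeling for the trade-off behind bin_width, one can estimate how many files a given bin width produces and how large they are on average. The sketch below is back-of-the-envelope arithmetic only. It assumes a uniform line density and 48 bytes per line (six 8-byte values, matching the six stored parameters), both of which are assumptions, and the helper `estimate_split` is hypothetical. Real line lists have strongly varying line density, so individual files can be much larger than the mean.

```python
import math


def estimate_split(nu_min, nu_max, n_lines, bin_width, bytes_per_line=48):
    """Estimate the number of bin files and their mean size in MB.

    Assumes uniformly distributed lines and a fixed storage cost per
    line; both are simplifications for illustration.
    """
    n_bins = math.ceil((nu_max - nu_min) / bin_width)
    mean_size_mb = n_lines * bytes_per_line / n_bins / 1e6
    return n_bins, mean_size_mb


# Made-up line count, chosen only to show the scaling with bin_width
for width in (50.0, 10.0, 1.0):
    n_bins, size_mb = estimate_split(1000.0, 1100.0, 100_000_000, width)
    print(f"bin_width={width}: {n_bins} files, about {size_mb:.0f} MB each")
```

Halving the bin width doubles the number of files and halves their mean size, so the choice is mainly limited by the memory available per process.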