Splitting up huge line list files

Some of the ExoMol .trans and HITEMP .par files are very large (up to 100 GB), which can cause memory problems and make multiprocessing inefficient or impossible. Therefore, line racer provides a function that splits these huge line list files into smaller files, which store the important line parameters directly in .npz format. Only these smaller files then have to be read in for the opacity calculation, which is much faster and uses less memory.

Splitting is recommended if you want to calculate opacities from line lists that contain many lines but are not already divided into small files. One example is the ExoMol line list for H2CS, which contains 43 billion lines in just eight files. For HITEMP, one example is the CO2 line list, which consists of a single 50 GB file. Splitting is also recommended for the ExoMol MM line list for CH4, since some of its files are over 35 GB in size, which makes multiprocessing on one node very difficult with typical node memory. Once split files exist, the multiprocessing automatically uses them. Use this function if you have run into memory problems or if you want to optimize the multiprocessing and therefore the speed of the calculation.

ExoMol `.trans` files

For demonstration purposes, we will split up one file from the ExoMol MM line list for CH4. First we need to download the .trans file(s) we want to split and the .states file, which contains the energy levels and is needed for the splitting. The smaller files directly store the important line parameters: the effective wavenumber, the Einstein A coefficient, the lower energy level, the upper state degeneracy, and the upper and lower state rotational quantum numbers.

The lines are split into wavenumber bins of a fixed width, which can be set as an input. The smaller files are stored in the same folder as the original .trans file and are named with the original file name and the bin wavenumbers, so that they can be easily identified and used for the opacity calculation. The original .trans file is not deleted, so you can always go back to it if needed. If you want to save disk space, you can delete it after the splitting, since it is no longer needed for the opacity calculation.

[1]:

import os
import urllib.request
import line_racer.line_racer as lr

[2]:

os.makedirs('line_list/CH4', exist_ok=True)

base = "https://www.exomol.com/db/CH4/12C-1H4/MM"

files = {
    f"{base}/12C-1H4__MM.states.bz2" : "line_list/CH4/12C-1H4__MM.states.bz2",
    f"{base}/12C-1H4__MM__01000-01100.trans.bz2" : "line_list/CH4/12C-1H4__MM__01000-01100.trans.bz2"
}

for url, dest in files.items():
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)

In this case we use the compressed .bz2 files, which slows down the process but is simpler for demonstration purposes. The splitting also works with the uncompressed files and is then about twice as fast, since the files can be read in more quickly.
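If reading speed matters, the downloaded archives can be decompressed once before splitting. The following is a minimal sketch using only the Python standard library; the helper name `decompress_bz2` is hypothetical and not part of line racer, and it assumes that the uncompressed file should keep the same name without the `.bz2` suffix.

```python
import bz2
import shutil


def decompress_bz2(src_path, dest_path=None, chunk_size=16 * 1024 * 1024):
    """Decompress a .bz2 file in chunks, so the whole file is never held in memory.

    Hypothetical helper for illustration; not part of line racer.
    """
    if dest_path is None:
        if not src_path.endswith(".bz2"):
            raise ValueError("expected a path ending in .bz2")
        # Assumption: the uncompressed file keeps the name without '.bz2'.
        dest_path = src_path[: -len(".bz2")]
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest, chunk_size)
    return dest_path
```

For the example above, this would be called as `decompress_bz2("line_list/CH4/12C-1H4__MM__01000-01100.trans.bz2")`. How line racer behaves when both the compressed and the uncompressed version of a file are present is not covered here, so it may be safest to keep only one of them in the input folder.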

[3]:

pressures, temperatures = lr.LineRacer.prt_pressure_temperature_grid()

CH4_racer = lr.LineRacer(database='exomol',
                         mass=16.03130013,
                         input_folder='line_list/CH4/',  # path to folder with input files
                         temperatures=temperatures,  # in K
                         pressures=pressures,  # in bar
                         species_isotope_dict={'12C-1H4': 1.0},
                         line_list_name='MM',
                         broadening_type='exomol_table',
                         broadening_species_dict={'H2': 0.85, 'He': 0.15},
                         )

CH4_racer.split_huge_transition_files(transition_files_list=None, bin_width=10.0, use_mpi=True)

Splitting line_list/CH4/12C-1H4__MM__01000-01100.trans.bz2 took 80.1137318611145 seconds

For the line racer object here, only the information relevant to the splitting is used, such as the input folder and the line list name. The pressures, temperatures and broadening information do not affect the splitting, but they are required when constructing the object. You can either provide the list of transition files directly or let line racer search for all .trans files in the input folder (transition_files_list=None, as above).

The bin_width parameter controls the width of the wavenumber bins into which the lines are split. The smaller the bin width, the smaller the resulting files, but the more files are created. A good value depends on the line density of the line list, the size of the original files and the available memory. For very large files and high line density, a smaller bin width is recommended to make sure that the resulting files are not too large. The splitting can be parallelized with MPI: set use_mpi=True if MPI is set up on your cluster.
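To make the effect of bin_width concrete, here is a small sketch of wavenumber binning in plain Python. It is not line racer's implementation: the function names are hypothetical, and it assumes that bins are aligned to multiples of the bin width and that a line on a bin edge belongs to the upper bin.

```python
import math
from collections import defaultdict


def bin_bounds(wavenumber, bin_width):
    """Return the (lower, upper) edges in cm^-1 of the bin containing a line.

    Assumption: bins start at multiples of bin_width.
    """
    lower = math.floor(wavenumber / bin_width) * bin_width
    return lower, lower + bin_width


def group_by_bin(wavenumbers, bin_width):
    """Group line wavenumbers by the bin they fall into."""
    groups = defaultdict(list)
    for nu in wavenumbers:
        groups[bin_bounds(nu, bin_width)].append(nu)
    return dict(groups)


def number_of_bins(nu_min, nu_max, bin_width):
    """Upper estimate of how many files one input file is split into."""
    return math.ceil((nu_max - nu_min) / bin_width)
```

Under these assumptions, the 01000-01100 file from the example with bin_width=10.0 would give `number_of_bins(1000.0, 1100.0, 10.0)`, that is ten bins. Halving the bin width doubles the number of files and, for a roughly uniform line density, halves their size.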

HITEMP `.par` files

Splitting HITEMP .par files works in a similar way as for the ExoMol .trans files. The important line parameters are stored in the smaller files, which are named with the original file name and the bin wavenumbers. The original .par file is not deleted, so you can always go back to it if needed. If you want to save disk space, you can delete it after the splitting, since it is no longer needed for the opacity calculation.

[1]:

import os
import urllib.request
import line_racer.line_racer as lr

Since HITRAN does not allow automatic downloading of the line list file, you need to download the file manually. In this example we use the HITEMP CH4 line list, which you can find at https://hitran.org/hitemp/. After downloading the file, move it to the desired folder, which is line_list/CH4/ in this case. The rest of the setup is similar to the ExoMol example: we define the line racer object with the information relevant to the splitting and then call the splitting function.

As for ExoMol, the bin_width parameter controls the width of the wavenumber bins, and a good value depends on the line density of the line list, the size of the original file and the available memory. In this example we use a bin width of 1000 cm^-1, which results in files of a few GB in size. This value is chosen for demonstration purposes only; for a real calculation it should be adapted to the available resources.

Most importantly, only the isotopes specified in species_isotope_dict are included in the smaller files; lines of all other isotopes are left out during the splitting.
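The idea behind this filtering can be illustrated with the fixed-width HITRAN .par record layout, in which characters 1-2 hold the molecule ID, character 3 the isotopologue code and characters 4-15 the transition wavenumber in cm^-1. The sketch below is only an illustration of that layout, not line racer's code; the function names are hypothetical, and the mapping from a species_isotope_dict key such as '12C-1H4' to a HITRAN isotopologue code is assumed to be handled elsewhere.

```python
def parse_par_prefix(record):
    """Return (molecule_id, isotopologue_code, wavenumber) of one .par record."""
    molecule_id = int(record[0:2])     # e.g. 6 for CH4
    isotopologue_code = record[2]      # single character, '1' = most abundant
    wavenumber = float(record[3:15])   # transition wavenumber in cm^-1
    return molecule_id, isotopologue_code, wavenumber


def keep_isotopologues(records, wanted_codes):
    """Yield only the records whose isotopologue code is in wanted_codes."""
    for record in records:
        _, isotopologue_code, _ = parse_par_prefix(record)
        if isotopologue_code in wanted_codes:
            yield record
```

With `wanted_codes={"1"}`, only records of the main isotopologue would be kept, which mirrors what the single entry in species_isotope_dict does in the cell below.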

[7]:

pressures, temperatures = lr.LineRacer.prt_pressure_temperature_grid()

CH4_racer = lr.LineRacer(database='hitemp',
                         input_folder='line_list/CH4/',  # path to folder with input files
                         temperatures=temperatures,  # in K
                         pressures=pressures,  # in bar
                         species_isotope_dict={'12C-1H4': 1.0}, # IMPORTANT: only the specified isotope here will be in the smaller files
                         line_list_name='HITEMP2020',
                         broadening_type='hitran_table',
                         broadening_species_dict={'air': 1.0},
                         )

CH4_racer.split_huge_par_files(transition_file_path=None, bin_width=1000.0)

Splitting line_list/CH4/06_HITEMP2020.par.bz2 took 428.5 seconds

Although the compressed file is only 0.4 GB in size, it still takes a long time to read in and process. Since the decompressed file is 5.13 GB, splitting it can still be a good idea, especially if you want to use multiprocessing for the opacity calculation: the smaller files can be read in much faster and the work is distributed more efficiently. The function is most useful for line lists such as HITEMP CO2, which consists of a single 50 GB file.
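Before deleting an original .trans or .par file, it can be useful to check which split files were written and how much disk space they take. The following sketch assumes, as described above, that the split files are .npz files in the same folder as the original; the function names are hypothetical and the exact file name pattern is not assumed.

```python
import glob
import os


def summarize_split_files(folder, pattern="*.npz"):
    """Return {file name: size in bytes} for all split files in a folder."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return {os.path.basename(path): os.path.getsize(path) for path in paths}


def total_size_gb(sizes):
    """Total size of the listed files in GB (1 GB = 1e9 bytes)."""
    return sum(sizes.values()) / 1e9
```

For the examples on this page, `summarize_split_files("line_list/CH4/")` would list the files produced by both splitting calls. If the largest file is still too big for the memory available per process, rerunning the splitting with a smaller bin_width is the natural next step.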
