Practical Introduction to the Dataset Module

Newer versions of KLIFF add considerable functionality for dataset I/O while maintaining backward compatibility. This example walks through the dataset module and its features.

Dataset and Configuration

The dataset module contains two classes: Dataset and Configuration.

Configuration

The Configuration class holds a single unit of trainable data in a dataset. Its members are:

| Sr. no. | Data | Class member name | Data type |
|---|---|---|---|
| 1 | Coordinates of atoms in the configuration | coords | numpy float64 array |
| 2 | Species | species | List of atomic symbols (str) |
| 3 | "Global" energy of the configuration | energy | Python float (double precision) |
| 4 | Per-atom forces of the configuration | forces | numpy float64 array (same shape as coords) |
| 5 | Periodic boundaries of the configuration | PBC | List of 3 bools indicating the periodic boundaries along X, Y, and Z |
| 6 | Cell vectors (row-wise, i.e. cell[0,:] is the first vector and cell[2,:] is the last) | cell | 3x3 numpy float64 array |
| 7 | Global stress on the configuration | stress | numpy array of shape (6,) (Voigt notation) |
| 8 | Weight to apply to this configuration during training | weight | Instance of the Weight class, see below |
| 9 | Structural fingerprint of the configuration (descriptors, graphs, etc.) | fingerprint | Any user-defined object; usually a numpy array, torch tensor, or PyGGraph object |
| 10 | Per-configuration metadata key-value pairs | metadata | dict of arbitrary key-value pairs |
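
To make the cell and stress conventions in the table concrete, here is a minimal NumPy-only sketch. It assumes the common Voigt ordering (xx, yy, zz, yz, xz, xy), which is the ordering ASE uses; check the KLIFF API docs before relying on it. The helper name `to_voigt` is illustrative and is not part of KLIFF.

```python
import numpy as np


def to_voigt(stress_3x3):
    """Convert a symmetric 3x3 stress tensor to a (6,) Voigt vector.

    Assumed ordering: xx, yy, zz, yz, xz, xy.
    """
    s = np.asarray(stress_3x3, dtype=np.float64)
    if s.shape != (3, 3):
        raise ValueError("expected a 3x3 tensor")
    return np.array([s[0, 0], s[1, 1], s[2, 2], s[1, 2], s[0, 2], s[0, 1]])


# Row-wise cell: cell[0, :] is the first lattice vector.
cell = np.array([[4.0, 0.0, 0.0],
                 [0.0, 5.0, 0.0],
                 [0.0, 0.0, 6.0]])
first_vector = cell[0, :]

# A symmetric example tensor with made-up values.
stress = np.array([[1.0, 0.1, 0.2],
                   [0.1, 2.0, 0.3],
                   [0.2, 0.3, 3.0]])
voigt = to_voigt(stress)
print(first_vector)
print(voigt)
```

A conversion like this can be useful when a source file stores the full nine-component stress tensor, while the stress member expects six components.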

ASE Version

The current Configuration implementation works with ase <= 3.22, so please pin ase to that version. Support for newer ase versions is planned.
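
As a quick sanity check, the installed ase version can be compared against this limit using only the standard library. The sketch below reads "ase <= 3.22" as a limit on the major.minor part of the version (so 3.22.1 passes), which is an assumption; the helper `is_supported` is illustrative and not part of KLIFF or ASE.

```python
from importlib import metadata

MAX_SUPPORTED = (3, 22)  # from the note above: ase <= 3.22


def is_supported(version):
    """Return True if the major.minor part of `version` is <= 3.22.

    Assumes a plain 'major.minor[.patch]' version string.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= MAX_SUPPORTED


try:
    installed = metadata.version("ase")
    print("ase", installed, "supported:", is_supported(installed))
except metadata.PackageNotFoundError:
    print("ase is not installed")
```

Under the same reading, a requirement such as `ase<3.23` in your environment file would pin the dependency.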

You can easily initialize a Configuration from an ase.Atoms object:

import numpy as np
from ase.build import bulk

from kliff.dataset import Configuration

Si = bulk("Si")
configuration = Configuration.from_ase_atoms(Si)
print(configuration.coords)
print(configuration.species)

There are other I/O functions that initialize the Configuration class directly, e.g.

  1. Configuration.from_file : from an extxyz file

  2. Configuration.from_colabfit : from a ColabFit Exchange database

However, it is best to use the Dataset to load these configurations, as the Dataset object is better equipped to handle exceptions raised while reading these files.

Direct initialization

To convert from newer or unsupported data formats, you can initialize the Configuration object directly:

cell = np.eye(3)  # 3x3 identity matrix
species = ["Al", "Al", "Al", "Al"]
coords = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
pbc = [True, True, True]

config = Configuration(
    cell=cell,
    species=species,
    coords=coords,
    PBC=pbc,
    energy=-3.5,
    forces=np.random.randn(4, 3),  # random forces as an example
    stress=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # Voigt notation
)

# Let's print some info:
print("Number of atoms:", config.get_num_atoms())
print("Species:", config.species)
print("Energy:", config.energy)
print("Forces:\n", config.forces)
Number of atoms: 4
Species: ['Al', 'Al', 'Al', 'Al']
Energy: -3.5
Forces:
 [[-1.15896365 -2.00961247  1.07234515]
 [-0.55897191  1.3880019  -0.09160773]
 [-1.41068291 -0.54503868 -0.07134876]
 [-1.03509015  0.33842744 -0.71063483]]

Exporting the configuration

You can convert a Configuration object back to an ase.Atoms object using Configuration.to_ase_atoms, or write it to an extxyz file using Configuration.to_file. For more details, please refer to the API docs.

ase_atoms = configuration.to_ase_atoms()
print(np.allclose(ase_atoms.get_positions(), configuration.coords))

configuration.to_file("config1.extxyz")
print("\nSaved extxyz header: ")
print("="*80)
!head -2 config1.extxyz
True

Saved extxyz header: 
================================================================================
2
Lattice="0 2.715 2.715 2.715 0 2.715 2.715 2.715 0" PBC="1 1 1" Properties=species:S:1:pos:R:3

Exception handling for Configuration

Accessing a property that is absent raises a ConfigurationError exception. Users should handle these exceptions as they see fit.

configuration.forces # raises exception
---------------------------------------------------------------------------
ConfigurationError                        Traceback (most recent call last)
Cell In [16], line 1
----> 1 configuration.forces

File ~/Projects/COLABFIT/kliff/kliff/kliff/dataset/dataset.py:376, in Configuration.forces(self)
    372 """
    373 Return a `Nx3` matrix of the forces on each atoms.
    374 """
    375 if self._forces is None:
--> 376     raise ConfigurationError("Configuration does not contain forces.")
    377 return self._forces

ConfigurationError: Configuration does not contain forces.
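
One way to handle the exception is to fall back to a default value when a property is missing. The sketch below uses small stand-in classes (`ConfigurationError` and `ToyConfiguration` are defined here only so the snippet runs without KLIFF; they mirror the property check visible in the traceback above). The helper `get_or_default` is hypothetical.

```python
class ConfigurationError(Exception):
    """Stand-in for KLIFF's ConfigurationError (not the real class)."""


class ToyConfiguration:
    """Stand-in mirroring the property behaviour shown in the traceback."""

    def __init__(self, energy=None, forces=None):
        self._energy = energy
        self._forces = forces

    @property
    def energy(self):
        if self._energy is None:
            raise ConfigurationError("Configuration does not contain energy.")
        return self._energy

    @property
    def forces(self):
        if self._forces is None:
            raise ConfigurationError("Configuration does not contain forces.")
        return self._forces


def get_or_default(config, name, default=None):
    """Return config.<name>, or `default` if the property is absent."""
    try:
        return getattr(config, name)
    except ConfigurationError:
        return default


config = ToyConfiguration(energy=-3.5)
energy = get_or_default(config, "energy")
forces = get_or_default(config, "forces")
print(energy, forces)
```

With KLIFF installed, the same pattern should apply after importing the real exception class; the traceback suggests it is defined in kliff/dataset/dataset.py.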

Warning

Configuration stores data without any notion of units, so ensuring that the units of the input and output data are consistent is the user's responsibility.
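
Because no units are tracked, any conversion has to happen before the data reaches the Configuration. The following sketch converts made-up values from kcal/mol and GPa into eV-based units with NumPy; the choice of eV, Å, and eV/Å^3 as target units is an assumption for illustration, not a KLIFF requirement.

```python
import numpy as np

# Conversion factors: 1 eV = 23.060548 kcal/mol, 1 eV/A^3 = 160.21766 GPa.
KCAL_PER_MOL_TO_EV = 1.0 / 23.060548
GPA_TO_EV_PER_A3 = 1.0 / 160.21766

# Made-up source data in kcal/mol, kcal/mol/A, and GPa.
energy_kcal = -80.711918
forces_kcal = np.array([[23.060548, 0.0, 0.0],
                        [-23.060548, 0.0, 0.0]])
stress_gpa = np.array([160.21766, 0.0, 0.0, 0.0, 0.0, 0.0])

# Convert everything to one consistent unit system before storing it.
energy_ev = energy_kcal * KCAL_PER_MOL_TO_EV
forces_ev = forces_kcal * KCAL_PER_MOL_TO_EV
stress_ev = stress_gpa * GPA_TO_EV_PER_A3

print(energy_ev)
print(forces_ev)
print(stress_ev)
```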

Dataset

Dataset is essentially a collection of Configurations, with member functions to read and write those configurations. In the simplest terms, a Dataset object works like a list of Configurations.

Initializing the Dataset

You can initialize the Dataset object from a variety of storage options, including:

1. List of ASE Atoms objects (with the keyword ase_atoms_list explicitly specified)

from kliff.dataset import Dataset

configs = [bulk("Si"), bulk("Al"), bulk("Al", cubic=True)]
ds = Dataset.from_ase(ase_atoms_list=configs)
print(len(ds))
2025-02-26 12:54:51.241 | INFO     | kliff.dataset.dataset:_read_from_ase:957 - 3 configurations loaded using ASE.
2025-02-26 12:54:51.243 | INFO     | kliff.dataset.dataset:add_weights:1124 - No explicit weights provided.
3

2. extxyz file (all configurations in a single extxyz file, read using ase.io; this is the default behaviour)

Let us download an extxyz dataset from the web (in this case, a graphene dataset in extxyz format from the ColabFit Exchange).

# Download the dataset, and print header
!wget https://materials.colabfit.org/dataset-xyz/DS_jasbxoigo7r4_0.tar.gz
!tar -xvf DS_jasbxoigo7r4_0.tar.gz
!xz -d DS_jasbxoigo7r4_0_0.xyz.xz
!head -2 DS_jasbxoigo7r4_0_0.xyz
--2025-02-26 13:37:03--  https://materials.colabfit.org/dataset-xyz/DS_jasbxoigo7r4_0.tar.gz
Resolving materials.colabfit.org (materials.colabfit.org)... 216.165.12.42
Connecting to materials.colabfit.org (materials.colabfit.org)|216.165.12.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36567 (36K) [application/x-tar]
Saving to: ‘DS_jasbxoigo7r4_0.tar.gz’

DS_jasbxoigo7r4_0.t 100%[===================>]  35.71K  --.-KB/s    in 0.1s    

2025-02-26 13:37:03 (362 KB/s) - ‘DS_jasbxoigo7r4_0.tar.gz’ saved [36567/36567]

./
./DS_jasbxoigo7r4_0_0.xyz.xz
48
Lattice="7.53 0.0 0.0 0.0 8.694891 0.0 0.0 0.0 6.91756" Properties=species:S:1:pos:R:3:forces:R:3 po_id=PO_1073537155164130421524433 co_id=CO_1056372038821617091165957 energy=-468.61686026192723 stress="-0.05233445077383756 0.003984624736573388 3.332094089548831e-06 0.003984624736573388 -0.03689214199484896 -6.99536080196756e-06 3.332094089548831e-06 -6.99536080196756e-06 -0.004744008663708218" pbc="T T T"

Two things to note in the header of the xyz file are (i) Properties=species:S:1:pos:R:3:forces:R:3 and (ii) energy=-468.61686026192723. You may need to pass these energy and forces keys (forces and energy in the example above) explicitly to the function so that the properties are mapped correctly onto the KLIFF configuration.

from kliff.utils import get_n_configs_in_xyz # how many configs in xyz file 
# Read the dataset from DS_jasbxoigo7r4_0_0.xyz
ds = Dataset.from_ase("./DS_jasbxoigo7r4_0_0.xyz", energy_key="energy", forces_key="forces")

assert len(ds) == get_n_configs_in_xyz("./DS_jasbxoigo7r4_0_0.xyz")
2025-02-26 13:38:10.139 | INFO     | kliff.dataset.dataset:_read_from_ase:957 - 41 configurations loaded using ASE.
2025-02-26 13:38:10.140 | INFO     | kliff.dataset.dataset:add_weights:1124 - No explicit weights provided.
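
The header inspection described above can also be scripted. This is a rough sketch using a regular expression on an abbreviated copy of the comment line printed earlier (the po_id, co_id, and stress entries are left out for brevity); it is not a full extxyz parser, and the variable names are illustrative.

```python
import re

# Abbreviated copy of the comment line from the downloaded file.
header = (
    'Lattice="7.53 0.0 0.0 0.0 8.694891 0.0 0.0 0.0 6.91756" '
    'Properties=species:S:1:pos:R:3:forces:R:3 '
    'energy=-468.61686026192723 pbc="T T T"'
)

# key=value pairs, where the value is either quoted or a single token.
pairs = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', header))
info = {key: value.strip('"') for key, value in pairs.items()}

# Properties is a flat sequence of name:type:ncols triplets.
fields = info["Properties"].split(":")
per_atom_keys = fields[0::3]

# Everything that is not structural is a candidate per-configuration key.
per_config_keys = [k for k in info if k not in ("Lattice", "Properties", "pbc")]

print(per_atom_keys)
print(per_config_keys)
```

The names found this way are the candidates for the energy_key and forces_key arguments used in the call above.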

After loading the dataset, you can index it like a list, using single indices, slices, or lists of indices.

Tip

Note that indexing with a slice or a list of indices returns a new Dataset object containing the selected configurations (as opposed to a Python list).

# access individual configs
print(ds[1], ds[-1])

# access slices
print(len(ds[2:5]))

# access using multiple indices
print(len(ds[1,3,5]))
<kliff.dataset.dataset.Configuration object at 0x7f8265aa5730> <kliff.dataset.dataset.Configuration object at 0x7f8265ab8ca0>
3
3
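
Since an index list yields a new dataset, a simple train/validation split can be expressed with shuffled indices. The sketch below uses a toy stand-in class (`ToyDataset`, which is not KLIFF's implementation) that follows the indexing rules described above, so that it runs without KLIFF; the 80/20 ratio and the seed are arbitrary choices.

```python
import random


class ToyDataset:
    """Toy stand-in (not KLIFF's Dataset) with the indexing rules above."""

    def __init__(self, configs):
        self.configs = list(configs)

    def __len__(self):
        return len(self.configs)

    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self.configs[idx]  # a single configuration
        if isinstance(idx, slice):
            return ToyDataset(self.configs[idx])  # a new dataset
        return ToyDataset([self.configs[i] for i in idx])  # a new dataset


ds = ToyDataset(f"config-{i}" for i in range(41))

# Shuffle the indices reproducibly, then split 80/20.
indices = list(range(len(ds)))
random.Random(0).shuffle(indices)
n_train = int(0.8 * len(ds))
train, val = ds[indices[:n_train]], ds[indices[n_train:]]
print(len(train), len(val))
```

Given the behaviour shown above, the same index-list split should carry over to a real Dataset object, though that is worth verifying against your KLIFF version.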

3. List of extxyz files (with one configuration per file)

The Dataset can also be initialized from a set of xyz files, with one configuration per file. The example below shows how to load a toy dataset of 4 configurations from a directory.

!wget https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
!tar -xvf Si_training_set_4_configs.tar.gz
--2025-02-26 13:48:52--  https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7691 (7.5K) [application/octet-stream]
Saving to: ‘Si_training_set_4_configs.tar.gz’

Si_training_set_4_c 100%[===================>]   7.51K  --.-KB/s    in 0s      

2025-02-26 13:48:52 (30.0 MB/s) - ‘Si_training_set_4_configs.tar.gz’ saved [7691/7691]

Si_training_set_4_configs/
Si_training_set_4_configs/Si_alat5.431_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.409_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.442_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.420_scale0.005_perturb1.xyz
ds = Dataset.from_path("./Si_training_set_4_configs") # 4 configs in ./Si_training_set_4_configs
assert len(ds) == 4
2025-02-26 13:50:16.834 | INFO     | kliff.dataset.dataset:add_weights:1124 - No explicit weights provided.
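
Rather than hard-coding the expected count, the number of xyz files can be computed from the directory itself. This standard-library sketch demonstrates the idea on a temporary directory of empty files; `count_xyz_files` is a hypothetical helper, and it only counts top-level files, which may differ from how KLIFF walks a directory.

```python
import tempfile
from pathlib import Path


def count_xyz_files(directory):
    """Count files ending in .xyz directly inside `directory`."""
    return sum(1 for p in Path(directory).iterdir() if p.suffix == ".xyz")


# Demonstrate on a throwaway directory with four empty .xyz files.
with tempfile.TemporaryDirectory() as tmp:
    for alat in ("5.409", "5.420", "5.431", "5.442"):
        (Path(tmp) / f"Si_alat{alat}_scale0.005_perturb1.xyz").touch()
    n_files = count_xyz_files(tmp)

print(n_files)
```

The resulting count could then replace the literal 4 in the assertion above.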

4. From a ColabFit Exchange database instance

You can also stream data from the ColabFit Exchange:

ds = Dataset.from_colabfit("my_colabfit_database", "DS_xxxxxxxxxxxx_0", colabfit_uri = "mongodb://localhost:27017")

Warning

The ColabFit interface is under heavy development, so please check back for changes until this warning is removed.

Custom Dataset Class

For unsupported I/O formats, such as VASP or Siesta output files, you can extend the Dataset class manually, using the default Configuration.__init__ method to populate the configurations. You will need to store the list of loaded configurations in the Dataset.config member variable.

class CustomDataset(Dataset):
    @classmethod
    def from_custom(cls, files_path):
        # Assumes Dataset can be constructed without arguments.
        dataset = cls()
        dataset.config = []
        ...  # get cell, species, coords, pbc, energy, forces from the file
        dataset.config.append(Configuration(cell=cell,
                                            species=species,
                                            coords=coords,
                                            PBC=pbc,
                                            energy=energy,
                                            forces=forces))
        return dataset
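
The "get data from the file" step depends entirely on the format. As a rough illustration, the sketch below parses a made-up text layout (energy on line 1, three cell vectors, then one atom per line with symbol, position, and force) into the keyword arguments used by Configuration earlier in this example. The format, the function `parse_toy_format`, and the all-periodic PBC default are assumptions for demonstration only.

```python
import numpy as np

# Made-up file contents in a hypothetical format.
TOY_FILE = """\
-3.5
4.0 0.0 0.0
0.0 4.0 0.0
0.0 0.0 4.0
Al 0.0 0.0 0.0 0.1 0.0 0.0
Al 2.0 2.0 0.0 -0.1 0.0 0.0
"""


def parse_toy_format(text):
    """Parse the hypothetical format into Configuration keyword arguments."""
    rows = [line.split() for line in text.strip().splitlines()]
    energy = float(rows[0][0])
    cell = np.array([[float(x) for x in row] for row in rows[1:4]])
    atoms = rows[4:]
    species = [row[0] for row in atoms]
    coords = np.array([[float(x) for x in row[1:4]] for row in atoms])
    forces = np.array([[float(x) for x in row[4:7]] for row in atoms])
    return dict(
        cell=cell,
        species=species,
        coords=coords,
        PBC=[True, True, True],  # assumed: fully periodic
        energy=energy,
        forces=forces,
    )


kwargs = parse_toy_format(TOY_FILE)
print(kwargs["species"], kwargs["energy"])
print(kwargs["coords"].shape, kwargs["forces"].shape)
```

With KLIFF installed, the resulting dictionary could be passed on as Configuration(**kwargs), matching the keywords in the direct-initialization example.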

Weights

Configurations in a KLIFF dataset can be given fine-grained weights for training through the Weight class.