kliff.dataset¶

class kliff.dataset.Configuration(cell, species, coords, PBC, energy=None, forces=None, stress=None, weight=None, identifier=None)[source]¶

Class of atomic configuration. This is used to store the information of an atomic configuration, e.g. supercell, species, coords, energy, and forces.

Parameters:

cell (ndarray) – A 3x3 matrix of the lattice vectors. The first, second, and third rows are $a_1$ , $a_2$ , and $a_3$ , respectively.
species (List[str]) – A list of N strings giving the species of the atoms, where N is the number of atoms.
coords (ndarray) – An Nx3 matrix of the coordinates of the atoms, where N is the number of atoms.
PBC (List[bool]) – A list with 3 components indicating whether periodic boundary condition is used along the directions of the first, second, and third lattice vectors.
energy (Optional[float]) – energy of the configuration.
forces (Optional[ndarray]) – An Nx3 matrix of the forces on atoms, where N is the number of atoms.
stress (Optional[List[float]]) – A list with 6 components in Voigt notation, i.e. $\sigma=[\sigma_{xx},\sigma_{yy},\sigma_{zz},\sigma_{yz},\sigma_{xz}, \sigma_{xy}]$ . See: https://en.wikipedia.org/wiki/Voigt_notation
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.
identifier (Union[str, Path, None]) – a (unique) identifier of the configuration
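
The Voigt ordering used for the stress parameter can be illustrated with a minimal, library-independent sketch. The helper names tensor_to_voigt and voigt_to_tensor are hypothetical and are not part of KLIFF; the sketch only shows how the six components map onto a symmetric 3x3 tensor.

```python
from typing import List, Sequence

# Index pairs in the documented order: xx, yy, zz, yz, xz, xy.
VOIGT_ORDER = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def tensor_to_voigt(tensor: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a symmetric 3x3 tensor into its 6 Voigt components."""
    return [tensor[i][j] for i, j in VOIGT_ORDER]


def voigt_to_tensor(voigt: Sequence[float]) -> List[List[float]]:
    """Expand 6 Voigt components back into a symmetric 3x3 tensor."""
    tensor = [[0.0] * 3 for _ in range(3)]
    for value, (i, j) in zip(voigt, VOIGT_ORDER):
        tensor[i][j] = value
        tensor[j][i] = value  # enforce symmetry
    return tensor


# Illustrative values only.
stress_tensor = [
    [1.0, 0.6, 0.5],
    [0.6, 2.0, 0.4],
    [0.5, 0.4, 3.0],
]
stress_voigt = tensor_to_voigt(stress_tensor)
```

A list in this form is what the stress argument is documented to expect.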

classmethod from_file(filename, weight=None, file_format='xyz')[source]¶

Read configuration from file.

Parameters:

filename (Path) – Path to the file that stores the configuration.
file_format (str) – Format of the file that stores the configuration (e.g. xyz).

to_file(filename, file_format='xyz')[source]¶

Write the configuration to file.

Parameters:

filename (Path) – Path to the file that stores the configuration.
file_format (str) – Format of the file that stores the configuration (e.g. xyz).

classmethod from_colabfit(database_client, data_object, weight=None)[source]¶

Read configuration from a colabfit database.

Parameters:

database_client (MongoDatabase) – A connected MongoDatabase client instance, used to fetch data from the colabfit-tools dataset.
data_object (dict) – colabfit data object dictionary to be associated with current configuration and property.
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.

to_colabfit(database_client, data_object, weight=None)[source]¶

Save configuration to a colabfit database.

Parameters:

database_client (MongoDatabase)
data_object (dict)
weight (Optional[Weight])

Returns:

classmethod from_ase_atoms(atoms, weight=None, energy_key='energy', forces_key='forces', stress_key='stress')[source]¶

Read configuration from ase.Atoms object.

Parameters:

atoms (Atoms) – ase.Atoms object.
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.
energy_key (str) – Name of the field in extxyz that stores the energy.
forces_key (str) – Name of the field in extxyz that stores the forces.
stress_key (str) – Name of the field in extxyz that stores the stress.

to_ase_atoms()[source]¶

Convert the configuration to ase.Atoms object.

Returns:: ase.Atoms representation of the Configuration

property cell: ndarray¶: 3x3 matrix of the lattice vectors of the configuration.

property PBC: List[bool]¶: A list with 3 components indicating whether periodic boundary condition is used along the directions of the first, second, and third lattice vectors.

property species: List[str]¶: Species string of all atoms.

property coords: ndarray¶: An Nx3 matrix of the Cartesian coordinates of all atoms.

property energy: float | None¶: Potential energy of the configuration.

property forces: ndarray¶: Return an Nx3 matrix of the forces on each atom.

property stress: List[float]¶: Stress of the configuration. The stress is given in Voigt notation, i.e. $\sigma=[\sigma_{xx},\sigma_{yy},\sigma_{zz},\sigma_{yz},\sigma_{xz}, \sigma_{xy}]$ .

property weight¶: Get the weight class of the loss function.

property identifier: str¶: Return identifier of the configuration.

property fingerprint¶: Return the stored fingerprint of the configuration.

property path: Path | None¶: Return the path of the file containing the configuration. If the configuration is not read from a file, return None.

property metadata: dict¶: Return the metadata of the configuration.

get_num_atoms()[source]¶

Return the total number of atoms in the configuration.

Return type:: int

get_num_atoms_by_species()[source]¶

Return a dictionary of the number of atoms with each species.

Return type:: Dict[str, int]

get_volume()[source]¶

Return volume of the configuration.

Return type:: float
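
For a cell given as three lattice vectors, the cell volume is the absolute value of the scalar triple product $|a_1 \cdot (a_2 \times a_3)|$. The sketch below computes it with plain Python lists; it assumes that get_volume() refers to this supercell volume, and the function names are hypothetical rather than KLIFF's own.

```python
from typing import List, Sequence

Vector = Sequence[float]


def cross(u: Vector, v: Vector) -> List[float]:
    """Cross product of two 3-vectors."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def cell_volume(cell: Sequence[Vector]) -> float:
    """Volume spanned by the rows a1, a2, a3 of a 3x3 cell matrix."""
    a1, a2, a3 = cell
    return abs(dot(a1, cross(a2, a3)))


cubic_cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
sheared_cell = [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
cubic_volume = cell_volume(cubic_cell)
sheared_volume = cell_volume(sheared_cell)
```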

count_atoms_by_species(symbols=None)[source]¶

Count the number of atoms by species.

Parameters:

symbols (Optional[List[str]]) – species to count the occurrence. If None, all species present in the configuration are used.

Returns:

A dictionary whose keys are the species strings and whose values are the number of atoms of each species.

Return type:

{species: count}
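
The counting described for count_atoms_by_species can be sketched with collections.Counter. This is an illustration under assumptions, not the library's code: the function name count_by_species is hypothetical, and reporting a count of 0 for a requested symbol that is absent is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional


def count_by_species(
    species: List[str], symbols: Optional[List[str]] = None
) -> Dict[str, int]:
    """Count atoms per species, optionally restricted to the given symbols."""
    counts = Counter(species)
    if symbols is None:
        # Use every species present, in first-seen order.
        symbols = list(dict.fromkeys(species))
    # Counter returns 0 for symbols that do not occur (assumed behaviour).
    return {symbol: counts[symbol] for symbol in symbols}


species = ["Si", "C", "Si", "Si"]
all_counts = count_by_species(species)
selected_counts = count_by_species(species, symbols=["C", "O"])
```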

order_by_species()[source]¶: Order the atoms according to the species such that atoms with the same species have contiguous indices.
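
Reordering by species has to permute every per-atom array in the same way. A minimal sketch, assuming alphabetical grouping (the actual grouping order used by KLIFF is not stated here) and using a hypothetical free function rather than the method itself:

```python
from typing import List, Optional, Tuple

Rows = List[List[float]]


def order_by_species(
    species: List[str], coords: Rows, forces: Optional[Rows] = None
) -> Tuple[List[str], Rows, Optional[Rows]]:
    """Group atoms of the same species into contiguous indices."""
    # sorted() is stable, so atoms of one species keep their relative order.
    order = sorted(range(len(species)), key=lambda i: species[i])
    new_species = [species[i] for i in order]
    new_coords = [coords[i] for i in order]
    new_forces = None if forces is None else [forces[i] for i in order]
    return new_species, new_coords, new_forces


species = ["Si", "C", "Si", "C"]
coords = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
sorted_species, sorted_coords, sorted_forces = order_by_species(species, coords)
```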

to_dict()[source]¶

Return type:: dict

classmethod bulk(**kwargs)[source]¶

Transparent wrapper to get KLIFF configuration from bulk ASE atoms. Mostly for convenience.

Parameters:: **kwargs – All the args that will be passed to ase.build.bulk
Return type:: Configuration
Returns:: Configuration

get_supercell(nx=1, ny=1, nz=1)[source]¶

Generate supercell from a configuration.

Parameters:

nx (int) – repetition along x-axis
ny (int) – repetition along y-axis
nz (int) – repetition along z-axis

Return type:

Configuration

Returns:

Configuration
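
Supercell generation can be pictured as copying every atom once per image cell and scaling the lattice vectors. The sketch below is a plain-Python illustration, assuming the repetitions nx, ny, nz are applied along the first, second, and third lattice vectors; make_supercell is a hypothetical helper, not the KLIFF method.

```python
from itertools import product
from typing import List, Sequence, Tuple

Rows = List[List[float]]


def make_supercell(
    cell: Sequence[Sequence[float]],
    species: List[str],
    coords: Sequence[Sequence[float]],
    nx: int = 1,
    ny: int = 1,
    nz: int = 1,
) -> Tuple[Rows, List[str], Rows]:
    """Replicate atoms nx * ny * nz times and scale the cell accordingly."""
    a1, a2, a3 = cell
    new_species: List[str] = []
    new_coords: Rows = []
    for i, j, k in product(range(nx), range(ny), range(nz)):
        # Translation of this image: i*a1 + j*a2 + k*a3.
        shift = [i * a1[d] + j * a2[d] + k * a3[d] for d in range(3)]
        for symbol, position in zip(species, coords):
            new_species.append(symbol)
            new_coords.append([position[d] + shift[d] for d in range(3)])
    new_cell = [
        [nx * c for c in a1],
        [ny * c for c in a2],
        [nz * c for c in a3],
    ]
    return new_cell, new_species, new_coords


cell = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
big_cell, big_species, big_coords = make_supercell(
    cell, ["Al"], [[0.0, 0.0, 0.0]], nx=2
)
```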

class kliff.dataset.Dataset(configurations=None)[source]¶

A dataset of multiple configurations (Configuration).

Parameters:: configurations (Optional[Iterable]) – A list of Configuration objects.

classmethod from_colabfit(cls, colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)¶

Read configurations from colabfit database and initialize a dataset.

Parameters:

weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
colabfit_database (str) – Name of the colabfit Mongo database to read from.
colabfit_dataset (str) – Name of the colabfit dataset instance to read from; it is usually of the form “DS_xxxxxxxxxxxx_0”.
colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.

Return type:

Dataset

Returns:

A dataset of configurations.

add_from_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)¶

Read configurations from colabfit database and add them to the dataset.

Parameters:

colabfit_database (str) – Name of the colabfit Mongo database to read from.
colabfit_dataset (str) – Name of the colabfit dataset instance to read from; it is usually of the form “DS_xxxxxxxxxxxx_0”.
colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.
weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

classmethod from_path(path, weight=None, file_format='xyz')[source]¶

Read configurations from path and initialize a dataset using KLIFF’s own parser.

Parameters:

path (Union[Path, str]) – Path to the directory (or file) storing the configurations.
weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.
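
The weight file described for the weight parameter (four whitespace-separated columns, with either one row or one row per configuration) can be parsed along the following lines. This is a sketch under assumptions: parse_weight_lines is a hypothetical helper, the file is assumed to have no header row, and blank lines are assumed to be ignored.

```python
from typing import List, Tuple

# (config_weight, energy_weight, forces_weight, stress_weight)
WeightRow = Tuple[float, ...]


def parse_weight_lines(lines: List[str], n_configs: int) -> List[WeightRow]:
    """Parse weight rows and broadcast a single row to all configurations."""
    rows: List[WeightRow] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are skipped
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        rows.append(tuple(float(x) for x in fields))
    if len(rows) == 1:
        # One row: the same weights apply to every configuration.
        return rows * n_configs
    if len(rows) != n_configs:
        raise ValueError(
            f"weight file has {len(rows)} rows but there are {n_configs} configurations"
        )
    return rows


shared = parse_weight_lines(["1.0 0.0 10.0 0.0"], n_configs=3)
per_config = parse_weight_lines(["1.0 1.0 1.0 0.0", "0.5 1.0 2.0 0.0"], n_configs=2)
```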

add_from_path(path, weight=None, file_format='xyz')[source]¶

Read configurations from path and append them to dataset.

Parameters:

path (Union[Path, str]) – Path to the directory (or file) storing the configurations.
weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]¶

Read configurations from ase.Atoms objects and initialize a dataset. The expected input is either a pre-initialized list of ase.Atoms objects or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset_from_list = Dataset.from_ase(ase_atoms_list=ase_configs)
>>> dataset_from_file = Dataset.from_ase(path="configs.xyz", energy_key="Energy")

Parameters:

path (Union[str, Path, None]) – Path to the directory (or file) storing the configurations.
ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.
weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.
forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.
stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.
slices (Union[slice, str]) – Slice of the configurations to read. It is used only when path is a file.
file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.
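
The slices argument accepts a slice object or a string such as ':' (the default). How such a string maps to a Python slice can be sketched as follows; parse_slice is a hypothetical helper written for illustration and is not the parser that KLIFF or ASE actually uses.

```python
def parse_slice(text: str) -> slice:
    """Convert a string such as ':', '0:10' or '::2' into a slice object."""
    if ":" not in text:
        raise ValueError(f"not a slice string: {text!r}")
    # Empty fields (as in ':' or '::2') become None, i.e. "no bound".
    parts = [int(p) if p.strip() else None for p in text.split(":")]
    if len(parts) > 3:
        raise ValueError(f"too many fields in slice string: {text!r}")
    return slice(*parts)


frames = list(range(10))  # stand-in for configurations read from a file
every_frame = frames[parse_slice(":")]
every_other = frames[parse_slice("::2")]
last_three = frames[parse_slice("-3:")]
```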

add_from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]¶

Read configurations from ase.Atoms objects and append them to a dataset. The expected input is either a pre-initialized list of ase.Atoms objects or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset = Dataset()
>>> dataset.add_from_ase(ase_atoms_list=ase_configs)
>>> dataset.add_from_ase(path="configs.xyz", energy_key="Energy")

Parameters:

path (Union[str, Path, None]) – Path to the directory (or file) storing the configurations.
ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.
weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.
forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.
stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.
slices (str) – Slice of the configurations to read. It is used only when path is a file.
file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_lmdb(lmdb_file, n_configs=None, config_key_prefix=None, coords_key='coords', species_key='species', pbc_key='PBC', cell_key='cell', energy_key='energy', forces_key='forces', stress_key='stress', config_weight_key='config_weight', energy_weight_key='energy_weight', forces_weight_key='forces_weight', stress_weight_key='stress_weight', metadata_keys=None, weight_file=None)[source]¶

Load dataset from an LMDB file.

Parameters:

lmdb_file (Path) – Path to the LMDB file.
n_configs (Optional[int]) – Number of configurations to load.
config_key_prefix (Optional[str]) – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.
coords_key (str) – Key to get coordinates from the lmdb configuration.
species_key (str) – Key to get species from the lmdb configuration.
pbc_key (str) – Key to get PBC array from the lmdb configuration.
cell_key (str) – Key to get cell vectors from the lmdb configuration.
energy_key (str) – Key to get energy from the lmdb configuration.
forces_key (str) – Key to get forces from the lmdb configuration.
stress_key (str) – Key to get stress from the lmdb configuration.
config_weight_key (str) – Key to get config_weight from the lmdb configuration.
energy_weight_key (str) – Key to get energy_weight from the lmdb configuration.
forces_weight_key (str) – Key to get forces_weight from the lmdb configuration.
stress_weight_key (str) – Key to get stress_weight from the lmdb configuration.
metadata_keys (Optional[List[str]]) – List of keys to get all metadata from the lmdb configuration.
weight_file (Optional[Path]) – Path to the KLIFF weight file.

Return type:

Dataset

Returns:

Dataset object.
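
The “prefix{idx}” key convention mentioned for config_key_prefix can be sketched with an ordinary dictionary standing in for the LMDB store. Everything here is illustrative: load_by_prefix is a hypothetical helper, indices are assumed to start at 0, and a real LMDB environment would use byte-string keys and serialized values.

```python
from typing import Any, Dict, List


def load_by_prefix(
    store: Dict[str, Any], n_configs: int, prefix: str = ""
) -> List[Any]:
    """Fetch the first n_configs entries stored under keys 'prefix{idx}'."""
    configs = []
    for idx in range(n_configs):
        key = f"{prefix}{idx}"  # e.g. "config_0", "config_1", ...
        configs.append(store[key])
    return configs


# A plain dict used as a stand-in for the key-value store.
store = {
    "config_0": {"species": ["Si"], "energy": -1.0},
    "config_1": {"species": ["Si", "Si"], "energy": -2.1},
    "config_2": {"species": ["C"], "energy": -0.7},
}
first_two = load_by_prefix(store, n_configs=2, prefix="config_")
```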

add_from_lmdb(lmdb_file, n_configs, config_key_prefix, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key, config_weight_key, energy_weight_key, forces_weight_key, stress_weight_key, metadata_keys)[source]¶

Add configurations from an LMDB file.

Parameters:

lmdb_file – Path to the LMDB file.
n_configs – Number of configurations to load.
config_key_prefix – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.
coords_key – Key to get coordinates from the lmdb configuration.
species_key – Key to get species from the lmdb configuration.
pbc_key – Key to get PBC array from the lmdb configuration.
cell_key – Key to get cell vectors from the lmdb configuration.
energy_key – Key to get energy from the lmdb configuration.
forces_key – Key to get forces from the lmdb configuration.
stress_key – Key to get stress from the lmdb configuration.
config_weight_key – Key to get config_weight from the lmdb configuration.
energy_weight_key – Key to get energy_weight from the lmdb configuration.
forces_weight_key – Key to get forces_weight from the lmdb configuration.
stress_weight_key – Key to get stress_weight from the lmdb configuration.
metadata_keys – List of keys to get all metadata from the lmdb configuration.

to_lmdb(lmdb_file)[source]¶

classmethod from_huggingface(hf_id, split='train', n_configs=None, coords_key='positions', species_key='atomic_numbers', pbc_key='pbc', cell_key='cell', energy_key='energy', forces_key='atomic_forces', stress_key=None, weights_file=None, **load_kwargs)[source]¶

Load dataset from a HuggingFace Hub dataset.

Parameters:

hf_id (str) – HuggingFace dataset ID, e.g. “colabfit/xxMD-CASSCF_train”
split (str) – which split to load, e.g. “train”
n_configs (Optional[int]) – optionally limit to the first N configs
*_key – column names in the HF dataset
load_kwargs – passed through to datasets.load_dataset

Return type:

Dataset

Returns:

Dataset

add_from_huggingface(hf_id, split, n_configs, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key=None, weights_file=None, **load_kwargs)[source]¶

Add configurations from a HuggingFace Hub dataset.

Parameters:

hf_id – HuggingFace dataset ID, e.g. “colabfit/xxMD-CASSCF_train”
split – which split to load, e.g. “train”
n_configs – optionally limit to the first N configs
*_key – column names in the HF dataset
load_kwargs – passed through to datasets.load_dataset

to_path(path, prefix=None)[source]¶

Save the dataset to a folder, as per the KLIFF xyz format. The folder will contain multiple files, each containing a configuration. Prefix is added to the filename of each configuration. Path is created if it does not exist.

Parameters:

path (Union[Path, str]) – Path to the directory to save the dataset.
prefix (Optional[str]) – Prefix to add to the filename of each configuration.

Return type:

None

to_ase(path)[source]¶

Save the dataset to a file in ASE format. The file will contain multiple configurations, each separated by a newline. The file will be saved in the specified path. The file format is determined by the extension of the path.

Parameters:: path (Union[Path, str]) – Path to the file to save the dataset.
Return type:: None

to_ase_list()[source]¶

Convert the dataset to a list of ase.Atoms objects.

Return type:: List[Atoms]
Returns:: List of ase.Atoms objects.

to_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017')[source]¶

Save dataset to a colabfit database.

Parameters:

colabfit_database (str)
colabfit_dataset (str)
colabfit_uri (str)

Returns:

get_configs()[source]¶

Get shallow copy of the configurations.

Return type:: List[Configuration]
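
A shallow copy means the returned list is new, but its elements are the same objects held by the dataset. The sketch below demonstrates that semantics with hypothetical stand-in classes (MiniConfig and MiniDataset are not KLIFF classes).

```python
from typing import List


class MiniConfig:
    """Stand-in for a configuration; only stores an energy."""

    def __init__(self, energy: float) -> None:
        self.energy = energy


class MiniDataset:
    """Stand-in for a dataset that hands out shallow copies of its list."""

    def __init__(self, configs: List[MiniConfig]) -> None:
        self._configs = list(configs)

    def get_configs(self) -> List[MiniConfig]:
        return self._configs[:]  # new list, same element objects


dataset = MiniDataset([MiniConfig(-1.0), MiniConfig(-2.0)])
configs = dataset.get_configs()
configs.pop()             # changes only the copy
configs[0].energy = -5.0  # changes the shared object
```

Removing items from the returned list therefore leaves the dataset untouched, while modifying a configuration in place is visible through the dataset as well.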

save_weights(path)[source]¶

Save the weights of the configurations to a file.

Parameters:: path (Union[Path, str]) – Path of the file to save the weights.

static add_weights(configurations, source)[source]¶

Load weights from a text file or a Weight instance. The text file should contain 1 to 4 whitespace-separated columns, formatted as:

```
Config Energy Forces Stress
1.0    0.0    10.0   0.0
```

Note: The column headers are case-insensitive, but must have the same names as above. A weight of 0.0 sets the respective weight to None. The length of a column can be either 1 (all configurations share the same weight) or n, where n is the number of configurations in the dataset.

Missing columns are treated as 0.0, i.e. the example file above can also be written as:

```
Config Forces
1.0    10.0
```

YAML weight files are also supported. The YAML file should be formatted as:

```
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
```

Any missing key is treated as 0.0. The weights are assumed to be in the same order as the dataset configurations.

Parameters:

configurations (Union[List[Configuration], Dataset]) – List of configurations to add weights to.
source (Union[Path, str, Weight]) – Path to the weight file, or a Weight instance.
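
The text-file rules described for add_weights (case-insensitive headers, missing columns treated as 0.0, a weight of 0.0 stored as None, and one row broadcast to all configurations) can be sketched as follows. parse_weight_table is a hypothetical helper that works on the file contents as a string; it is not the library's parser.

```python
from typing import Dict, List, Optional

COLUMNS = ("config", "energy", "forces", "stress")


def parse_weight_table(text: str, n_configs: int) -> List[Dict[str, Optional[float]]]:
    """Parse a headed weight table into one dict of weights per configuration."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = [name.lower() for name in lines[0]]  # headers are case-insensitive
    unknown = set(header) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown column(s): {sorted(unknown)}")
    rows = lines[1:]
    if len(rows) == 1:
        rows = rows * n_configs  # a single row applies to every configuration
    if len(rows) != n_configs:
        raise ValueError(f"expected 1 or {n_configs} rows, got {len(rows)}")
    weights = []
    for row in rows:
        given = dict(zip(header, (float(x) for x in row)))
        entry: Dict[str, Optional[float]] = {}
        for name in COLUMNS:
            value = given.get(name, 0.0)  # missing columns count as 0.0
            entry[name] = None if value == 0.0 else value
        weights.append(entry)
    return weights


table = "Config Forces\n1.0    10.0\n"
weights = parse_weight_table(table, n_configs=2)
```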

add_metadata(metadata)[source]¶

Add metadata to the dataset object.

Parameters:: metadata (dict) – A dictionary containing the metadata.

get_metadata(key)[source]¶

Get the metadata of the dataset.

Parameters:: key (str) – Key of the metadata to get.
Returns:: Value of the metadata.

property metadata¶: Return the metadata of the dataset.

check_properties_consistency(properties=None)[source]¶

Check which of the properties of the configurations are consistent. These consistent properties are saved as a list, which can be used to get the attributes from the configurations. “Consistent” in this context means that the same property is available for all the configurations. A property is not considered consistent if it is None for any of the configurations.

Parameters:: properties (Optional[List[str]]) – List of properties to check for consistency. If None, no properties are checked. All consistent properties are saved in the metadata.
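
The consistency rule stated above (a property counts only if it is not None for every configuration) can be expressed in a few lines. This is a sketch with hypothetical names; SimpleNamespace objects stand in for configurations.

```python
from types import SimpleNamespace
from typing import Any, List, Sequence


def consistent_properties(configs: Sequence[Any], properties: List[str]) -> List[str]:
    """Return the properties that are available (not None) on all configs."""
    return [
        name
        for name in properties
        if all(getattr(config, name, None) is not None for config in configs)
    ]


configs = [
    SimpleNamespace(energy=-1.0, forces=[[0.0, 0.0, 0.0]], stress=None),
    SimpleNamespace(energy=-2.0, forces=[[0.0, 0.0, 0.1]], stress=[0.0] * 6),
]
consistent = consistent_properties(configs, ["energy", "forces", "stress"])
```

Here stress is dropped because it is None for the first stand-in configuration.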

static get_manifest_checksum(dataset_manifest, transform_manifest=None)[source]¶

Get the checksum of the dataset manifest.

Parameters:

dataset_manifest (dict[str, Any]) – Manifest of the dataset.
transform_manifest (Optional[dict[str, Any]]) – Manifest of the transformation.

Return type:

str

Returns:

Checksum of the manifest.
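
One common way to derive a reproducible checksum from manifest dictionaries is to hash a canonical serialization. The sketch below is only an assumption about how such a checksum could be built: the function name, the JSON canonicalization, and the choice of SHA-256 are all hypothetical and may differ from what KLIFF does.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def manifest_checksum(
    dataset_manifest: Dict[str, Any],
    transform_manifest: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash a canonical JSON form of the manifests (illustrative scheme)."""
    payload = {"dataset": dataset_manifest, "transform": transform_manifest}
    # sort_keys makes the result independent of dictionary insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest_a = {"type": "ase", "path": "Si.xyz", "shuffle": False}
manifest_b = {"shuffle": False, "path": "Si.xyz", "type": "ase"}
checksum_a = manifest_checksum(manifest_a)
checksum_b = manifest_checksum(manifest_b)
checksum_with_transform = manifest_checksum(manifest_a, {"name": "example"})
```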

static get_dataset_from_manifest(dataset_manifest)[source]¶

Get a dataset from a manifest.

Examples

1. Manifest file for initializing dataset using ASE parser:

```yaml
dataset:
    type: ase           # ase or path or colabfit
    path: Si.xyz        # Path to the dataset
    save: True          # Save processed dataset to a file
    save_path: /folder/to   # Save to this folder
    shuffle: False      # Shuffle the dataset
    weights: /path/to/weights.dat  # or dictionary with weights
    keys:
        energy: Energy  # Key for energy, if ase dataset is used
        forces: forces  # Key for forces, if ase dataset is used
```

2. Manifest file for initializing dataset using KLIFF extxyz parser:

```yaml
dataset:
    type: path          # ase or path or colabfit
    path: /all/my/xyz   # Path to the dataset
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    weights:            # same weight for all, or file with weights
        config: 1.0
        energy: 0.0
        forces: 10.0
        stress: 0.0
```

3. Manifest file for initializing dataset using ColabFit parser:

```yaml
dataset:
    type: colabfit      # ase or path or colabfit
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    weights: None
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:
```

Parameters:: dataset_manifest (dict) – Dictionary containing the dataset manifest.
Return type:: Dataset
Returns:: A dataset of configurations.

kliff.dataset.read_extxyz(filename)[source]¶

Read an atomic configuration stored in the extended xyz file format.

Parameters:

filename (Path) – Path to the extended xyz file.

Returns:

cell: 3x3 array, supercell lattice vectors
species: species of atoms
coords: Nx3 array, coordinates of atoms
PBC: periodic boundary conditions
energy: potential energy of the configuration; None if not provided in file
forces: Nx3 array, forces on atoms; None if not provided in file
stress: 1D array of size 6, stress on the cell in Voigt notation; None if not provided in file

Return type:

tuple

kliff.dataset.write_extxyz(filename, cell, species, coords, PBC, energy=None, forces=None, stress=None, bool_as_str=False)[source]¶

Write configuration info to a file in the extended xyz file format.

Parameters:

filename (Path) – Path to the extended xyz file.
cell (ndarray) – 3x3 array, supercell lattice vectors
species (List[str]) – species of atoms
coords (ndarray) – Nx3 array, coordinates of atoms
PBC (List[bool]) – periodic boundary conditions
energy (Optional[float]) – potential energy of the configuration; if None, it is not written to the file
forces (Optional[ndarray]) – Nx3 array, forces on atoms; if None, they are not written to the file
stress (Optional[List[float]]) – 1D array of size 6, stress on the cell in Voigt notation; if None, it is not written to the file
bool_as_str (bool) – If True, write PBC as “T” or “F”; otherwise, write PBC as 1 or 0.
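
The general layout of an extended xyz record (atom count, a key=value comment line, then one line per atom) and the effect of bool_as_str can be sketched as below. This is an illustration, not KLIFF's writer: format_extxyz is a hypothetical helper, the key names Lattice, PBC, Energy and the Properties string are assumptions about the header, and stress is omitted for brevity.

```python
from typing import List, Optional, Sequence


def format_extxyz(
    cell: Sequence[Sequence[float]],
    species: List[str],
    coords: Sequence[Sequence[float]],
    PBC: Sequence[bool],
    energy: Optional[float] = None,
    forces: Optional[Sequence[Sequence[float]]] = None,
    bool_as_str: bool = False,
) -> str:
    """Build the text of one extended xyz record (assumed key names)."""
    lattice = " ".join(f"{x:.8f}" for row in cell for x in row)
    if bool_as_str:
        pbc = " ".join("T" if p else "F" for p in PBC)
    else:
        pbc = " ".join(str(int(p)) for p in PBC)
    header = f'Lattice="{lattice}" PBC="{pbc}"'
    if energy is not None:
        header += f" Energy={energy:.8f}"
    properties = "species:S:1:pos:R:3"
    if forces is not None:
        properties += ":for:R:3"  # extra per-atom columns for forces
    header += f" Properties={properties}"

    lines = [str(len(species)), header]
    for i, symbol in enumerate(species):
        values = list(coords[i])
        if forces is not None:
            values += list(forces[i])
        lines.append(symbol + " " + " ".join(f"{v:.8f}" for v in values))
    return "\n".join(lines) + "\n"


cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
text = format_extxyz(
    cell, ["Al"], [[0.0, 0.0, 0.0]], [True, True, False],
    energy=-3.5, bool_as_str=True,
)
record_lines = text.splitlines()
```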