kliff.dataset

class kliff.dataset.Configuration(cell, species, coords, PBC, energy=None, forces=None, stress=None, weight=None, identifier=None)[source]

Class representing an atomic configuration. It stores the information of the configuration, e.g. supercell, species, coords, energy, and forces.

Parameters:
  • cell (ndarray) – A 3x3 matrix of the lattice vectors. The first, second, and third rows are a_1, a_2, and a_3, respectively.

  • species (List[str]) – A list of N strings giving the species of the atoms, where N is the number of atoms.

  • coords (ndarray) – An Nx3 matrix of the coordinates of the atoms, where N is the number of atoms.

  • PBC (List[bool]) – A list with 3 components indicating whether periodic boundary conditions are used along the directions of the first, second, and third lattice vectors.

  • energy (Optional[float]) – energy of the configuration.

  • forces (Optional[ndarray]) – An Nx3 matrix of the forces on atoms, where N is the number of atoms.

  • stress (Optional[List[float]]) – A list with 6 components in Voigt notation, i.e. \sigma=[\sigma_{xx},\sigma_{yy},\sigma_{zz},\sigma_{yz},\sigma_{xz},\sigma_{xy}]. See: https://en.wikipedia.org/wiki/Voigt_notation

  • weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.

  • identifier (Union[str, Path, None]) – a (unique) identifier of the configuration
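
The Voigt ordering expected by `stress` can be illustrated with a small helper. This is only a sketch and not part of KLIFF: `full_to_voigt` is a hypothetical name, and the input is assumed to be a symmetric 3x3 stress tensor given as nested lists.

```python
from typing import List, Sequence


def full_to_voigt(sigma: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a symmetric 3x3 tensor to [xx, yy, zz, yz, xz, xy]."""
    # (row, column) index pairs in the Voigt order used by the `stress` argument
    voigt_pairs = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
    return [sigma[i][j] for i, j in voigt_pairs]


sigma = [
    [1.0, 0.6, 0.5],
    [0.6, 2.0, 0.4],
    [0.5, 0.4, 3.0],
]
stress = full_to_voigt(sigma)
print(stress)  # [1.0, 2.0, 3.0, 0.4, 0.5, 0.6]
```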

classmethod from_file(filename, weight=None, file_format='xyz')[source]

Read configuration from file.

Parameters:
  • filename (Path) – Path to the file that stores the configuration.

  • file_format (str) – Format of the file that stores the configuration (e.g. xyz).

to_file(filename, file_format='xyz')[source]

Write the configuration to file.

Parameters:
  • filename (Path) – Path to the file that stores the configuration.

  • file_format (str) – Format of the file that stores the configuration (e.g. xyz).

classmethod from_colabfit(database_client, data_object, weight=None)[source]

Read configuration from colabfit database.

Parameters:
  • database_client (MongoDatabase) – Instance of a connected MongoDatabase client, which can be used to fetch data from a colabfit-tools dataset.

  • data_object (dict) – colabfit data object dictionary to be associated with current configuration and property.

  • weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.

to_colabfit(database_client, data_object, weight=None)[source]

Save configuration to colabfit database.

Parameters:
  • database_client (MongoDatabase)

  • data_object (dict)

  • weight (Optional[Weight])

Returns:

classmethod from_ase_atoms(atoms, weight=None, energy_key='energy', forces_key='forces', stress_key='stress')[source]

Read configuration from ase.Atoms object.

Parameters:
  • atoms (Atoms) – ase.Atoms object.

  • weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.

  • energy_key (str) – Name of the field in extxyz that stores the energy.

  • forces_key (str) – Name of the field in extxyz that stores the forces.

  • stress_key (str) – Name of the field in extxyz that stores the stress.

to_ase_atoms()[source]

Convert the configuration to ase.Atoms object.

Returns:

ase.Atoms representation of the Configuration

property cell: ndarray

3x3 matrix of the lattice vectors of the configuration.

property PBC: List[bool]

A list with 3 components indicating whether periodic boundary conditions are used along the directions of the first, second, and third lattice vectors.

property species: List[str]

Species string of all atoms.

property coords: ndarray

An Nx3 matrix of the Cartesian coordinates of all atoms.

property energy: float | None

Potential energy of the configuration.

property forces: ndarray

Return an Nx3 matrix of the forces on each atom.

property stress: List[float]

Stress of the configuration. The stress is given in Voigt notation, i.e. \sigma=[\sigma_{xx},\sigma_{yy},\sigma_{zz},\sigma_{yz},\sigma_{xz},\sigma_{xy}].

property weight

Get the weight class of the loss function.

property identifier: str

Return identifier of the configuration.

property fingerprint

Return the stored fingerprint of the configuration.

property path: Path | None

Return the path of the file containing the configuration. If the configuration is not read from a file, return None.

property metadata: dict

Return the metadata of the configuration.

get_num_atoms()[source]

Return the total number of atoms in the configuration.

Return type:

int

get_num_atoms_by_species()[source]

Return a dictionary of the number of atoms with each species.

Return type:

Dict[str, int]

get_volume()[source]

Return volume of the configuration.

Return type:

float
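
For a cell whose rows are a_1, a_2, and a_3, the volume is the absolute value of the scalar triple product a_1 · (a_2 × a_3). The sketch below shows that formula in plain Python. It is an illustration rather than a statement of how `get_volume` is implemented, and `cell_volume` is a hypothetical helper.

```python
from typing import Sequence


def cell_volume(cell: Sequence[Sequence[float]]) -> float:
    """Return |a1 . (a2 x a3)| for a cell whose rows are the lattice vectors."""
    a1, a2, a3 = cell
    # Cross product a2 x a3
    cross = (
        a2[1] * a3[2] - a2[2] * a3[1],
        a2[2] * a3[0] - a2[0] * a3[2],
        a2[0] * a3[1] - a2[1] * a3[0],
    )
    return abs(sum(x * y for x, y in zip(a1, cross)))


# Orthorhombic cell with edge lengths 2, 3, and 4
print(cell_volume([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]))  # 24.0
```

Taking the absolute value keeps the result positive even if the lattice vectors form a left-handed set.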

count_atoms_by_species(symbols=None)[source]

Count the number of atoms by species.

Parameters:

symbols (Optional[List[str]]) – species whose occurrences are counted. If None, all species present in the configuration are used.

Returns:

{species: count}, a dictionary with the species string as key and the number of atoms of that species as value.
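
The counting behaviour can be sketched with `collections.Counter`. This is an assumption-laden illustration, not KLIFF code: `count_by_species` is a hypothetical function, and reporting 0 for requested symbols that are absent is a choice made here, not documented behaviour.

```python
from collections import Counter
from typing import Dict, List, Optional


def count_by_species(
    species: List[str], symbols: Optional[List[str]] = None
) -> Dict[str, int]:
    """Count atoms per species; restrict to `symbols` if it is given."""
    counts = Counter(species)
    if symbols is None:
        # Use every species present, in order of first appearance
        symbols = list(dict.fromkeys(species))
    # Counter returns 0 for symbols that do not occur
    return {s: counts[s] for s in symbols}


species = ["Si", "C", "Si", "Si", "C"]
print(count_by_species(species))  # {'Si': 3, 'C': 2}
print(count_by_species(species, symbols=["C", "O"]))  # {'C': 2, 'O': 0}
```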

order_by_species()[source]

Order the atoms according to the species such that atoms with the same species have contiguous indices.
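
A minimal sketch of such a reordering is shown below, using plain lists in place of a Configuration. The function is a hypothetical stand-in, and sorting the species groups alphabetically is an assumption. The key point is that every per-atom array (coordinates, and forces if present) has to be permuted with the same index order.

```python
from typing import List, Sequence, Tuple


def order_by_species(
    species: List[str], coords: List[Sequence[float]]
) -> Tuple[List[str], List[Sequence[float]]]:
    """Reorder atoms so that equal species occupy contiguous indices."""
    # sorted() is stable, so atoms of one species keep their relative order
    order = sorted(range(len(species)), key=lambda i: species[i])
    return [species[i] for i in order], [coords[i] for i in order]


species = ["Si", "C", "Si", "C"]
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
new_species, new_coords = order_by_species(species, coords)
print(new_species)  # ['C', 'C', 'Si', 'Si']
print([c[0] for c in new_coords])  # [1.0, 3.0, 0.0, 2.0]
```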

to_dict()[source]

Return the configuration as a dictionary.

Return type:

dict

classmethod bulk(**kwargs)[source]

Transparent wrapper to get KLIFF configuration from bulk ASE atoms. Mostly for convenience.

Parameters:

**kwargs – All the args that will be passed to ase.build.bulk

Return type:

Configuration

Returns:

Configuration

get_supercell(nx=1, ny=1, nz=1)[source]

Generate supercell from a configuration.

Parameters:
  • nx (int) – repetition along x-axis

  • ny (int) – repetition along y-axis

  • nz (int) – repetition along z-axis

Return type:

Configuration

Returns:

Configuration
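
The idea behind a supercell can be sketched in plain Python. The code below assumes that the x, y, and z repetitions correspond to the lattice vectors a_1, a_2, and a_3. `make_supercell` is a hypothetical helper, and the atom ordering it produces may differ from what `get_supercell` returns.

```python
from itertools import product
from typing import List, Tuple

Vector = List[float]


def make_supercell(
    cell: List[Vector], species: List[str], coords: List[Vector],
    nx: int = 1, ny: int = 1, nz: int = 1,
) -> Tuple[List[Vector], List[str], List[Vector]]:
    """Replicate a configuration nx, ny, nz times along a_1, a_2, a_3."""
    new_species, new_coords = [], []
    for i, j, k in product(range(nx), range(ny), range(nz)):
        # Translation vector of this image: i*a_1 + j*a_2 + k*a_3
        shift = [i * cell[0][d] + j * cell[1][d] + k * cell[2][d] for d in range(3)]
        for s, r in zip(species, coords):
            new_species.append(s)
            new_coords.append([r[d] + shift[d] for d in range(3)])
    # Each lattice vector is scaled by its repetition count
    new_cell = [[n * x for x in vec] for n, vec in zip((nx, ny, nz), cell)]
    return new_cell, new_species, new_coords


cell = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
new_cell, new_species, new_coords = make_supercell(
    cell, ["Al"], [[0.0, 0.0, 0.0]], nx=2, ny=1, nz=1
)
print(new_cell[0])  # [4.0, 0.0, 0.0]
print(new_coords)   # [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
```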

class kliff.dataset.Dataset(configurations=None)[source]

A dataset of multiple configurations (Configuration).

Parameters:

configurations (Optional[Iterable]) – A list of Configuration objects.

classmethod from_colabfit(cls, colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)

Read configurations from colabfit database and initialize a dataset.

Parameters:
  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • colabfit_database (str) – Name of the colabfit Mongo database to read from.

  • colabfit_dataset (str) – Name of the colabfit dataset instance to read from, usually of the form “DS_xxxxxxxxxxxx_0”.

  • colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.

Return type:

Dataset

Returns:

A dataset of configurations.

add_from_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)

Read configurations from colabfit database and add them to the dataset.

Parameters:
  • colabfit_database (str) – Name of the colabfit Mongo database to read from.

  • colabfit_dataset (str) – Name of the colabfit dataset instance to read from (usually of the form “DS_xxxxxxxxxxxx_0”).

  • colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

classmethod from_path(path, weight=None, file_format='xyz')[source]

Read configurations from path and initialize a dataset using KLIFF’s own parser.

Parameters:
  • path (Union[Path, str]) – Path to the directory (or filename) storing the configurations.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.

add_from_path(path, weight=None, file_format='xyz')[source]

Read configurations from path and append them to dataset.

Parameters:
  • path (Union[Path, str]) – Path to the directory (or filename) storing the configurations.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]

Read configurations from ase.Atoms objects and initialize a dataset. The expected input is either a pre-initialized list of ase.Atoms objects, or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset_from_list = Dataset.from_ase(ase_atoms_list=ase_configs)
>>> dataset_from_file = Dataset.from_ase(path="configs.xyz", energy_key="Energy")
Parameters:
  • path (Union[str, Path, None]) – Path to the directory (or filename) storing the configurations.

  • ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.

  • forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.

  • stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.

  • slices (Union[slice, str]) – Slice of the configurations to read. It is used only when path is a file.

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.
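
The `slices` argument accepts either a slice object or a string. How KLIFF and ASE interpret the string internally is not described here, so the following is only a sketch of the general idea, with a hypothetical helper `to_slice` that turns a string such as ":" or "0:10:2" into a Python slice.

```python
from typing import Union


def to_slice(spec: Union[slice, str]) -> slice:
    """Convert a string such as ':' or '0:10:2' to a slice object."""
    if isinstance(spec, slice):
        return spec
    # Empty fields mean "no bound", as in Python's own slice syntax
    parts = [int(p) if p.strip() else None for p in spec.split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected a slice such as ':' or '0:10:2', got {spec!r}")
    return slice(*parts)


frames = list(range(10))
print(frames[to_slice(":")] == frames)  # True
print(frames[to_slice("0:6:2")])        # [0, 2, 4]
print(frames[to_slice("-2:")])          # [8, 9]
```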

add_from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]

Read configurations from ase.Atoms objects and append them to a dataset. The expected input is either a pre-initialized list of ase.Atoms objects, or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset = Dataset()
>>> dataset.add_from_ase(ase_atoms_list=ase_configs)
>>> dataset.add_from_ase(path="configs.xyz", energy_key="Energy")
Parameters:
  • path (Union[str, Path, None]) – Path to the directory (or filename) storing the configurations.

  • ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.

  • forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.

  • stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.

  • slices (str) – Slice of the configurations to read. It is used only when path is a file.

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_lmdb(lmdb_file, n_configs=None, config_key_prefix=None, coords_key='coords', species_key='species', pbc_key='PBC', cell_key='cell', energy_key='energy', forces_key='forces', stress_key='stress', config_weight_key='config_weight', energy_weight_key='energy_weight', forces_weight_key='forces_weight', stress_weight_key='stress_weight', metadata_keys=None, weight_file=None)[source]

Load dataset from an LMDB file.

Parameters:
  • lmdb_file (Path) – Path to the LMDB file.

  • n_configs (Optional[int]) – Number of configurations to load.

  • config_key_prefix (Optional[str]) – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.

  • coords_key (str) – Key to get coordinates from the lmdb configuration.

  • species_key (str) – Key to get species from the lmdb configuration.

  • pbc_key (str) – Key to get PBC array from the lmdb configuration.

  • cell_key (str) – Key to get cell vectors from the lmdb configuration.

  • energy_key (str) – Key to get energy from the lmdb configuration.

  • forces_key (str) – Key to get forces from the lmdb configuration.

  • stress_key (str) – Key to get stress from the lmdb configuration.

  • config_weight_key (str) – Key to get config_weight from the lmdb configuration.

  • energy_weight_key (str) – Key to get energy_weight from the lmdb configuration.

  • forces_weight_key (str) – Key to get forces_weight from the lmdb configuration.

  • stress_weight_key (str) – Key to get stress_weight from the lmdb configuration.

  • metadata_keys (Optional[List[str]]) – List of keys to get all metadata from the lmdb configuration.

  • weight_file (Optional[Path]) – Path to the KLIFF weight file.

Return type:

Dataset

Returns:

Dataset object.
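
The "prefix{idx}" key convention can be sketched without an LMDB installation. In the example below a plain dictionary stands in for the key-value store, and the records are serialized with pickle. Both the serialization format and the helper `load_records` are assumptions made for illustration only.

```python
import pickle
from typing import Any, Dict, List, Optional

# A plain dict stands in for the LMDB key-value store in this sketch
store: Dict[bytes, bytes] = {
    f"cfg_{i}".encode(): pickle.dumps(
        {"species": ["Al"], "coords": [[0.0, 0.0, float(i)]], "energy": -3.0 - i}
    )
    for i in range(3)
}


def load_records(
    store: Dict[bytes, bytes], prefix: str, n_configs: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Read records stored under the keys prefix0, prefix1, ... in order."""
    records = []
    idx = 0
    while n_configs is None or idx < n_configs:
        raw = store.get(f"{prefix}{idx}".encode())
        if raw is None:  # no more configurations
            break
        records.append(pickle.loads(raw))
        idx += 1
    return records


records = load_records(store, prefix="cfg_", n_configs=2)
print(len(records))          # 2
print(records[1]["energy"])  # -4.0
```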

add_from_lmdb(lmdb_file, n_configs, config_key_prefix, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key, config_weight_key, energy_weight_key, forces_weight_key, stress_weight_key, metadata_keys)[source]

Add configurations from an LMDB file.

Parameters:
  • lmdb_file – Path to the LMDB file.

  • n_configs – Number of configurations to load.

  • config_key_prefix – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.

  • coords_key – Key to get coordinates from the lmdb configuration.

  • species_key – Key to get species from the lmdb configuration.

  • pbc_key – Key to get PBC array from the lmdb configuration.

  • cell_key – Key to get cell vectors from the lmdb configuration.

  • energy_key – Key to get energy from the lmdb configuration.

  • forces_key – Key to get forces from the lmdb configuration.

  • stress_key – Key to get stress from the lmdb configuration.

  • config_weight_key – Key to get config_weight from the lmdb configuration.

  • energy_weight_key – Key to get energy_weight from the lmdb configuration.

  • forces_weight_key – Key to get forces_weight from the lmdb configuration.

  • stress_weight_key – Key to get stress_weight from the lmdb configuration.

  • metadata_keys – List of keys to get all metadata from the lmdb configuration.

to_lmdb(lmdb_file)[source]

Save the dataset to an LMDB file.

classmethod from_huggingface(hf_id, split='train', n_configs=None, coords_key='positions', species_key='atomic_numbers', pbc_key='pbc', cell_key='cell', energy_key='energy', forces_key='atomic_forces', stress_key=None, weights_file=None, **load_kwargs)[source]

Load dataset from a HuggingFace Hub dataset.

Parameters:
  • hf_id (str) – Huggingface id e.g. “colabfit/xxMD-CASSCF_train”

  • split (str) – which split to load, e.g. “train”

  • n_configs (Optional[int]) – optionally limit to the first N configs

  • *_key – column names in the HF dataset

  • load_kwargs – passed through to datasets.load_dataset

Return type:

Dataset

Returns:

Dataset

add_from_huggingface(hf_id, split, n_configs, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key=None, weights_file=None, **load_kwargs)[source]

Add configurations from a HuggingFace Hub dataset.

Parameters:
  • hf_id – Huggingface id e.g. “colabfit/xxMD-CASSCF_train”

  • split – which split to load, e.g. “train”

  • n_configs – optionally limit to the first N configs

  • *_key – column names in the HF dataset

  • load_kwargs – passed through to datasets.load_dataset

to_path(path, prefix=None)[source]

Save the dataset to a folder, as per the KLIFF xyz format. The folder will contain multiple files, each containing a configuration. Prefix is added to the filename of each configuration. Path is created if it does not exist.

Parameters:
  • path (Union[Path, str]) – Path to the directory to save the dataset.

  • prefix (Optional[str]) – Prefix to add to the filename of each configuration.

Return type:

None

to_ase(path)[source]

Save the dataset to a file in ASE format. The file will contain multiple configurations, each separated by a newline. The file will be saved in the specified path. The file format is determined by the extension of the path.

Parameters:

path (Union[Path, str]) – Path to the file to save the dataset.

Return type:

None

to_ase_list()[source]

Convert the dataset to a list of ase.Atoms objects.

Return type:

List[Atoms]

Returns:

List of ase.Atoms objects.

to_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017')[source]

Save dataset to a colabfit database.

Parameters:
  • colabfit_database (str)

  • colabfit_dataset (str)

  • colabfit_uri (str)

Returns:

get_configs()[source]

Get shallow copy of the configurations.

Return type:

List[Configuration]
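
What a shallow copy means in practice can be shown with ordinary Python objects. In this sketch dictionaries stand in for Configuration objects: the returned list is new, but its elements are the same objects as in the original.

```python
import copy

# Stand-ins for Configuration objects: any mutable object behaves the same way
configs = [{"energy": -1.0}, {"energy": -2.0}]

shallow = copy.copy(configs)  # new list, same element objects

# Changing the copied list does not affect the original list ...
shallow.append({"energy": -3.0})
print(len(configs))  # 2

# ... but the elements are shared, so edits to them are visible in both
shallow[0]["energy"] = 0.0
print(configs[0]["energy"])  # 0.0
```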

save_weights(path)[source]

Save the weights of the configurations to a file.

Parameters:

path (Union[Path, str]) – Path of the file to save the weights.

static add_weights(configurations, source)[source]

Load weights from a text file or a Weight instance. The text file should contain 1 to 4 whitespace-separated columns, formatted as:

    Config Energy Forces Stress
    1.0    0.0    10.0   0.0

Note: The column headers are case-insensitive, but should have the same names as above. A weight of 0.0 sets the respective weight to None. The number of rows can be either 1 (all configurations get the same weight) or n, where n is the number of configurations in the dataset.

Missing columns are treated as 0.0, i.e. the example file above can also be written as:

    Config Forces
    1.0    10.0

YAML weight files are also supported. The YAML file should be formatted as:

```
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
```

Any missing key is treated as 0.0. The weights are assumed to be in the same order as the dataset configurations.

Parameters:
  • configurations (Union[List[Configuration], Dataset]) – List of configurations to add weights to.

  • source (Union[Path, str, Weight]) – Path to the weight file, or a Weight instance.
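
The rules for the text weight file (case-insensitive headers, missing columns treated as 0.0, 0.0 mapped to None, one row applied to all configurations) can be sketched as a small parser. `parse_weights` is a hypothetical function written for illustration and is not KLIFF's own reader.

```python
from typing import Dict, List, Optional

COLUMNS = ("config", "energy", "forces", "stress")


def parse_weights(text: str, n_configs: int) -> List[Dict[str, Optional[float]]]:
    """Parse a whitespace-separated weight table into one dict per configuration."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = [name.lower() for name in lines[0]]  # headers are case-insensitive
    rows = lines[1:]
    if len(rows) == 1:
        # A single row applies to every configuration
        rows = rows * n_configs
    if len(rows) != n_configs:
        raise ValueError("need 1 row or one row per configuration")
    weights = []
    for row in rows:
        given = dict(zip(header, map(float, row)))
        # Missing columns default to 0.0, and 0.0 is stored as None
        weights.append({c: (given.get(c, 0.0) or None) for c in COLUMNS})
    return weights


table = """
Config Forces
1.0    10.0
"""
print(parse_weights(table, n_configs=2)[0])
# {'config': 1.0, 'energy': None, 'forces': 10.0, 'stress': None}
```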

add_metadata(metadata)[source]

Add metadata to the dataset object.

Parameters:

metadata (dict) – A dictionary containing the metadata.

get_metadata(key)[source]

Get the metadata of the dataset.

Parameters:

key (str) – Key of the metadata to get.

Returns:

Value of the metadata.

property metadata

Return the metadata of the dataset.

check_properties_consistency(properties=None)[source]

Check which of the properties of the configurations are consistent. These consistent properties are saved in a list, which can be used to get the attributes from the configurations. “Consistent” in this context means that the same property is available for all the configurations. A property is not considered consistent if it is None for any of the configurations.

Parameters:

properties (Optional[List[str]]) – List of properties to check for consistency. If None, no properties are checked. All consistent properties are saved in the metadata.
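
The definition of consistency given above translates into a short check. The sketch below uses dictionaries as stand-ins for Configuration objects, and `consistent_properties` is a hypothetical helper, not the method itself.

```python
from typing import Any, Dict, List, Sequence


def consistent_properties(
    configs: Sequence[Dict[str, Any]], properties: Sequence[str]
) -> List[str]:
    """Return the properties that are not None in every configuration."""
    return [
        prop
        for prop in properties
        if all(cfg.get(prop) is not None for cfg in configs)
    ]


# Dicts stand in for Configuration objects
configs = [
    {"energy": -1.0, "forces": [[0.0, 0.0, 0.0]], "stress": None},
    {"energy": -2.0, "forces": [[0.1, 0.0, 0.0]], "stress": [0.0] * 6},
]
print(consistent_properties(configs, ["energy", "forces", "stress"]))
# ['energy', 'forces']
```

Here `stress` is dropped because it is None for the first configuration.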

static get_manifest_checksum(dataset_manifest, transform_manifest=None)[source]

Get the checksum of the dataset manifest.

Parameters:
  • dataset_manifest (dict[str, Any]) – Manifest of the dataset.

  • transform_manifest (Optional[dict[str, Any]]) – Manifest of the transformation.

Return type:

str

Returns:

Checksum of the manifest.
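
One common way to obtain a reproducible checksum of a dictionary is to serialize it with sorted keys and hash the result. The sketch below illustrates that approach. The serialization format and hash function KLIFF actually uses are not stated here, so JSON, SHA-256, and the name `manifest_checksum` are all assumptions.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def manifest_checksum(
    dataset_manifest: Dict[str, Any],
    transform_manifest: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash the manifests in a way that ignores dictionary key order."""
    payload = {"dataset": dataset_manifest, "transform": transform_manifest}
    # sort_keys makes the serialized form independent of insertion order
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


a = manifest_checksum({"type": "ase", "path": "Si.xyz"})
b = manifest_checksum({"path": "Si.xyz", "type": "ase"})
print(a == b)  # True
print(len(a))  # 64
```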

static get_dataset_from_manifest(dataset_manifest)[source]

Get a dataset from a manifest.

Examples

  1. Manifest file for initializing dataset using ASE parser:

```yaml
dataset:
    type: ase # ase or path or colabfit
    path: Si.xyz # Path to the dataset
    save: True # Save processed dataset to a file
    save_path: /folder/to # Save to this folder
    shuffle: False # Shuffle the dataset
    weights: /path/to/weights.dat # or dictionary with weights
    keys:
        energy: Energy # Key for energy, if ase dataset is used
        forces: forces # Key for forces, if ase dataset is used
```

  2. Manifest file for initializing dataset using KLIFF extxyz parser:

```yaml
dataset:
    type: path # ase or path or colabfit
    path: /all/my/xyz # Path to the dataset
    save: False # Save processed dataset to a file
    shuffle: False # Shuffle the dataset
    weights: # same weight for all, or file with weights
        config: 1.0
        energy: 0.0
        forces: 10.0
        stress: 0.0
```

  3. Manifest file for initializing dataset using ColabFit parser:

```yaml
dataset:
    type: colabfit # ase or path or colabfit
    save: False # Save processed dataset to a file
    shuffle: False # Shuffle the dataset
    weights: None
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:
```

Parameters:

dataset_manifest (dict) – Dictionary containing the dataset manifest, structured as in the examples above.

Return type:

Dataset

Returns:

A dataset of configurations.

kliff.dataset.read_extxyz(filename)[source]

Read an atomic configuration stored in the extended xyz file format.

Parameters:

filename (Path) – Path to the extended xyz file.

Returns:
  • cell – 3x3 array, supercell lattice vectors

  • species – species of atoms

  • coords – Nx3 array, coordinates of atoms

  • PBC – periodic boundary conditions

  • energy – potential energy of the configuration; None if not provided in file

  • forces – Nx3 array, forces on atoms; None if not provided in file

  • stress – 1D array of size 6, stress on the cell in Voigt notation; None if not provided in file

kliff.dataset.write_extxyz(filename, cell, species, coords, PBC, energy=None, forces=None, stress=None, bool_as_str=False)[source]

Write configuration info to a file in the extended xyz file format.

Parameters:
  • filename (Path) – Path to the extended xyz file.

  • cell (ndarray) – 3x3 array, supercell lattice vectors

  • species (List[str]) – species of atoms

  • coords (ndarray) – Nx3 array, coordinates of atoms

  • PBC (List[bool]) – periodic boundary conditions

  • energy (Optional[float]) – potential energy of the configuration; if None, it is not written to the file

  • forces (Optional[ndarray]) – Nx3 array, forces on atoms; if None, they are not written to the file

  • stress (Optional[List[float]]) – 1D array of size 6, stress on the cell in Voigt notation; if None, it is not written to the file

  • bool_as_str (bool) – If True, write PBC as “T” or “F”; otherwise, write PBC as 1 or 0.
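
The effect of `bool_as_str` on the PBC flags can be sketched as follows. `format_pbc` is a hypothetical helper, and joining the three flags with single spaces is an assumption about the output layout.

```python
from typing import List


def format_pbc(PBC: List[bool], bool_as_str: bool = False) -> str:
    """Format the three PBC flags as 'T'/'F' or as 1/0."""
    if bool_as_str:
        return " ".join("T" if flag else "F" for flag in PBC)
    return " ".join(str(int(flag)) for flag in PBC)


print(format_pbc([True, True, False]))                    # 1 1 0
print(format_pbc([True, True, False], bool_as_str=True))  # T T F
```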