kliff.dataset.dataset

class kliff.dataset.dataset.Dataset(configurations=None)[source]

A dataset of multiple configurations (Configuration).

Parameters:

configurations (Optional[Iterable]) – A list of Configuration objects.

classmethod from_colabfit(cls, colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)

Read configurations from colabfit database and initialize a dataset.

Parameters:
  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • colabfit_database (str) – Name of the colabfit Mongo database to read from.

  • colabfit_dataset (str) – Name of the colabfit dataset instance to read from, usually of the form “DS_xxxxxxxxxxxx_0”.

  • colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.

Return type:

Dataset

Returns:

A dataset of configurations.

add_from_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)

Read configurations from colabfit database and add them to the dataset.

Parameters:
  • colabfit_database (str) – Name of the colabfit Mongo database to read from.

  • colabfit_dataset (str) – Name of the colabfit dataset instance to read from, usually of the form “DS_xxxxxxxxxxxx_0”.

  • colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

classmethod from_path(path, weight=None, file_format='xyz')[source]

Read configurations from path and initialize a dataset using KLIFF’s own parser.

Parameters:
  • path (Union[Path, str]) – Path to the directory (or file) storing the configurations.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.

add_from_path(path, weight=None, file_format='xyz')[source]

Read configurations from path and append them to dataset.

Parameters:
  • path (Union[Path, str]) – Path to the directory (or file) storing the configurations.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]

Read configurations from ase.Atoms objects and initialize a dataset. The input is either a pre-initialized list of ase.Atoms objects or a path from which the dataset can be read (usually an extxyz file). Configurations stored in a file or a directory are read with ase.io.read(), so the file format must be one that ASE supports.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset_from_list = Dataset.from_ase(ase_atoms_list=ase_configs)
>>> dataset_from_file = Dataset.from_ase(path="configs.xyz", energy_key="Energy")
Parameters:
  • path (Union[str, Path, None]) – Path to the directory (or file) storing the configurations.

  • ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.

  • forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.

  • stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.

  • slices (Union[slice, str]) – Slice of the configurations to read. It is used only when path is a file.

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

Return type:

Dataset

Returns:

A dataset of configurations.
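
The slices argument uses Python slice notation written as a string. The sketch below shows how such a string corresponds to a slice object. The helper parse_slice_string is hypothetical and is not part of KLIFF or ASE, which do their own parsing internally, and the list of integers merely stands in for the configurations stored in a file.

```python
def parse_slice_string(text):
    """Convert a slice string such as ':' or '0:10:2' into a slice object."""
    parts = text.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"not a slice string: {text!r}")
    # An empty field means "use the default", as in Python's a[start:stop:step].
    fields = [int(part) if part.strip() else None for part in parts]
    return slice(*fields)


frames = list(range(10))  # stand-in for ten configurations in one file

all_frames = frames[parse_slice_string(":")]        # every configuration
every_other = frames[parse_slice_string("0:10:2")]  # indices 0, 2, 4, 6, 8
last_three = frames[parse_slice_string("-3:")]      # the final three
```

With the default slices=':', every configuration in the file is read.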

add_from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]

Read configurations from ase.Atoms objects and append them to the dataset. The input is either a pre-initialized list of ase.Atoms objects or a path from which the dataset can be read (usually an extxyz file). Configurations stored in a file or a directory are read with ase.io.read(), so the file format must be one that ASE supports.

Example

>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset = Dataset()
>>> dataset.add_from_ase(ase_atoms_list=ase_configs)
>>> dataset.add_from_ase(path="configs.xyz", energy_key="Energy")
Parameters:
  • path (Union[str, Path, None]) – Path to the directory (or file) storing the configurations.

  • ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.

  • weight (Union[Weight, Path, None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).

  • energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.

  • forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.

  • stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.

  • slices (str) – Slice of the configurations to read. It is used only when path is a file.

  • file_format (str) – Format of the file that stores the configuration, e.g. xyz.

classmethod from_lmdb(lmdb_file, n_configs=None, config_key_prefix=None, coords_key='coords', species_key='species', pbc_key='PBC', cell_key='cell', energy_key='energy', forces_key='forces', stress_key='stress', config_weight_key='config_weight', energy_weight_key='energy_weight', forces_weight_key='forces_weight', stress_weight_key='stress_weight', metadata_keys=None, weight_file=None)[source]

Load dataset from an LMDB file.

Parameters:
  • lmdb_file (Path) – Path to the LMDB file.

  • n_configs (Optional[int]) – Number of configurations to load.

  • config_key_prefix (Optional[str]) – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.

  • coords_key (str) – Key to get coordinates from the lmdb configuration.

  • species_key (str) – Key to get species from the lmdb configuration.

  • pbc_key (str) – Key to get PBC array from the lmdb configuration.

  • cell_key (str) – Key to get cell vectors from the lmdb configuration.

  • energy_key (str) – Key to get energy from the lmdb configuration.

  • forces_key (str) – Key to get forces from the lmdb configuration.

  • stress_key (str) – Key to get stress from the lmdb configuration.

  • config_weight_key (str) – Key to get config_weight from the lmdb configuration.

  • energy_weight_key (str) – Key to get energy_weight from the lmdb configuration.

  • forces_weight_key (str) – Key to get forces_weight from the lmdb configuration.

  • stress_weight_key (str) – Key to get stress_weight from the lmdb configuration.

  • metadata_keys (Optional[List[str]]) – List of keys to get all metadata from the lmdb configuration.

  • weight_file (Optional[Path]) – Path to the KLIFF weight file.

Return type:

Dataset

Returns:

Dataset object.
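
The “prefix{idx}” convention for config_key_prefix can be illustrated with a short sketch. The helper lmdb_config_keys is hypothetical, and both the zero-based index and the UTF-8 encoding of the keys are assumptions made for illustration; check them against the file you are reading.

```python
def lmdb_config_keys(n_configs, config_key_prefix=""):
    """Return the keys "prefix0", "prefix1", ... for n_configs entries."""
    # Assumption: indices start at 0 and keys are UTF-8 encoded byte strings,
    # since LMDB itself stores raw bytes.
    return [
        f"{config_key_prefix}{idx}".encode("utf-8") for idx in range(n_configs)
    ]


keys = lmdb_config_keys(3, config_key_prefix="config_")
```

An LMDB file written with a different key layout would need a matching config_key_prefix.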

add_from_lmdb(lmdb_file, n_configs, config_key_prefix, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key, config_weight_key, energy_weight_key, forces_weight_key, stress_weight_key, metadata_keys)[source]

Add configurations from an LMDB file.

Parameters:
  • lmdb_file – Path to the LMDB file.

  • n_configs – Number of configurations to load.

  • config_key_prefix – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.

  • coords_key – Key to get coordinates from the lmdb configuration.

  • species_key – Key to get species from the lmdb configuration.

  • pbc_key – Key to get PBC array from the lmdb configuration.

  • cell_key – Key to get cell vectors from the lmdb configuration.

  • energy_key – Key to get energy from the lmdb configuration.

  • forces_key – Key to get forces from the lmdb configuration.

  • stress_key – Key to get stress from the lmdb configuration.

  • config_weight_key – Key to get config_weight from the lmdb configuration.

  • energy_weight_key – Key to get energy_weight from the lmdb configuration.

  • forces_weight_key – Key to get forces_weight from the lmdb configuration.

  • stress_weight_key – Key to get stress_weight from the lmdb configuration.

  • metadata_keys – List of keys to get all metadata from the lmdb configuration.

to_lmdb(lmdb_file)[source]

classmethod from_huggingface(hf_id, split='train', n_configs=None, coords_key='positions', species_key='atomic_numbers', pbc_key='pbc', cell_key='cell', energy_key='energy', forces_key='atomic_forces', stress_key=None, weights_file=None, **load_kwargs)[source]

Load dataset from a HuggingFace Hub dataset.

Parameters:
  • hf_id (str) – HuggingFace dataset ID, e.g. “colabfit/xxMD-CASSCF_train”

  • split (str) – which split to load, e.g. “train”

  • n_configs (Optional[int]) – optionally limit to the first N configs

  • *_key – column names in the HF dataset

  • load_kwargs – passed through to datasets.load_dataset

Return type:

Dataset

Returns:

Dataset

add_from_huggingface(hf_id, split, n_configs, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key=None, weights_file=None, **load_kwargs)[source]

Add configurations from a HuggingFace Hub dataset.

Parameters:
  • hf_id – HuggingFace dataset ID, e.g. “colabfit/xxMD-CASSCF_train”

  • split – which split to load, e.g. “train”

  • n_configs – optionally limit to the first N configs

  • *_key – column names in the HF dataset

  • load_kwargs – passed through to datasets.load_dataset

to_path(path, prefix=None)[source]

Save the dataset to a folder, as per the KLIFF xyz format. The folder will contain multiple files, each containing a configuration. Prefix is added to the filename of each configuration. Path is created if it does not exist.

Parameters:
  • path (Union[Path, str]) – Path to the directory to save the dataset.

  • prefix (Optional[str]) – Prefix to add to the filename of each configuration.

Return type:

None

to_ase(path)[source]

Save the dataset to a file in ASE format. The file will contain multiple configurations, each separated by a newline. The file will be saved in the specified path. The file format is determined by the extension of the path.

Parameters:

path (Union[Path, str]) – Path to the file to save the dataset.

Return type:

None

to_ase_list()[source]

Convert the dataset to a list of ase.Atoms objects.

Return type:

List[Atoms]

Returns:

List of ase.Atoms objects.

to_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017')[source]

Save the dataset to a colabfit database.

Parameters:
  • colabfit_database (str) – Name of the colabfit Mongo database.

  • colabfit_dataset (str) – Name of the colabfit dataset instance.

  • colabfit_uri (str) – Connection URI of the colabfit Mongo database.

get_configs()[source]

Get a shallow copy of the configurations.

Return type:

List[Configuration]

save_weights(path)[source]

Save the weights of the configurations to a file.

Parameters:

path (Union[Path, str]) – Path of the file to save the weights.

static add_weights(configurations, source)[source]

Load weights from a text file or a Weight instance. The text file should contain 1 to 4 whitespace-separated columns, formatted as

```
Config Energy Forces Stress
1.0    0.0    10.0   0.0
```

Note: The column headers are case-insensitive but must use the names shown above. A weight of 0.0 sets the respective weight to None. Each column can have either 1 entry (the same weight for all configurations) or n entries, where n is the number of configurations in the dataset.

Missing columns are treated as 0.0, i.e. the example file above can also be written as

```
Config Forces
1.0    10.0
```

YAML weight files are also supported. The YAML file should be formatted as

```
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
```

Any missing key is treated as 0.0. The weights are assumed to be in the same order as the dataset configurations.

Parameters:
  • configurations (Union[List[Configuration], Dataset]) – List of configurations to add weights to.

  • source (Union[Path, str, Weight]) – Path to the weight file, or a Weight instance.
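
The rules for the text weight file can be made concrete with a small parser. This is only a sketch of the documented behaviour, not KLIFF's implementation: the function read_weight_table and the dictionary layout it returns are hypothetical.

```python
def read_weight_table(text, n_configs):
    """Parse a whitespace-separated weight table into one dict per configuration."""
    columns = ("config", "energy", "forces", "stress")
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = [name.lower() for name in lines[0]]  # headers are case-insensitive
    rows = [[float(value) for value in row] for row in lines[1:]]
    if len(rows) not in (1, n_configs):
        raise ValueError("expected 1 row or one row per configuration")
    if len(rows) == 1:
        rows = rows * n_configs  # a single row applies to every configuration
    weights = []
    for row in rows:
        given = dict(zip(header, row))
        # Missing columns default to 0.0, and a weight of 0.0 becomes None.
        weights.append({name: (given.get(name, 0.0) or None) for name in columns})
    return weights


table = """
Config Forces
1.0    10.0
"""
weights = read_weight_table(table, n_configs=2)
```

Here the two-column file yields the same weights as the full four-column example, with the energy and stress weights set to None for both configurations.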

add_metadata(metadata)[source]

Add metadata to the dataset object.

Parameters:

metadata (dict) – A dictionary containing the metadata.

get_metadata(key)[source]

Get the metadata of the dataset.

Parameters:

key (str) – Key of the metadata to get.

Returns:

Value of the metadata.

property metadata

Return the metadata of the dataset.

check_properties_consistency(properties=None)[source]

Check which of the properties of the configurations are consistent. These consistent properties are saved as a list, which can be used to get the attributes from the configurations. “Consistent” in this context means that the same property is available for all the configurations. A property is not considered consistent if it is None for any of the configurations.

Parameters:

properties (Optional[List[str]]) – List of properties to check for consistency. If None, no properties are checked. All consistent properties are saved in the metadata.
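
The definition of “consistent” above amounts to a simple all-configurations test. The sketch below uses plain dictionaries as stand-ins for Configuration objects; the function consistent_properties is hypothetical and only mirrors the stated rule.

```python
def consistent_properties(configs, properties):
    """Return the properties that are not None for every configuration."""
    return [
        name
        for name in properties
        if all(cfg.get(name) is not None for cfg in configs)
    ]


# Plain dictionaries stand in for Configuration objects here.
configs = [
    {"energy": -3.4, "forces": [[0.0, 0.0, 0.1]], "stress": None},
    {"energy": -3.1, "forces": [[0.0, 0.1, 0.0]], "stress": [0.0] * 6},
]
consistent = consistent_properties(configs, ["energy", "forces", "stress"])
```

Because the first stand-in configuration has no stress, only the energy and the forces count as consistent.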

static get_manifest_checksum(dataset_manifest, transform_manifest=None)[source]

Get the checksum of the dataset manifest.

Parameters:
  • dataset_manifest (dict[str, Any]) – Manifest of the dataset.

  • transform_manifest (Optional[dict[str, Any]]) – Manifest of the transformation.

Return type:

str

Returns:

Checksum of the manifest.
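
One common way to derive a checksum from manifest dictionaries is to hash a canonical serialization. The sketch below assumes JSON with sorted keys and SHA-256; KLIFF's actual serialization and hash function are not documented here and may differ, and manifest_checksum is a hypothetical name.

```python
import hashlib
import json


def manifest_checksum(dataset_manifest, transform_manifest=None):
    """Hash a canonical JSON serialization of the two manifests."""
    payload = {"dataset": dataset_manifest, "transform": transform_manifest}
    # sort_keys makes the result independent of key insertion order.
    text = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = manifest_checksum({"type": "ase", "path": "Si.xyz"})
reordered = manifest_checksum({"path": "Si.xyz", "type": "ase"})
changed = manifest_checksum({"type": "path", "path": "Si.xyz"})
```

Two manifests with the same content give the same checksum regardless of key order, while any change to a value gives a different one.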

static get_dataset_from_manifest(dataset_manifest)[source]

Get a dataset from a manifest.

Examples

1. Manifest file for initializing dataset using ASE parser:

```yaml
dataset:
    type: ase           # ase or path or colabfit
    path: Si.xyz        # Path to the dataset
    save: True          # Save processed dataset to a file
    save_path: /folder/to   # Save to this folder
    shuffle: False      # Shuffle the dataset
    weights: /path/to/weights.dat # or dictionary with weights
    keys:
        energy: Energy  # Key for energy, if ase dataset is used
        forces: forces  # Key for forces, if ase dataset is used
```

2. Manifest file for initializing dataset using KLIFF extxyz parser:

```yaml
dataset:
    type: path          # ase or path or colabfit
    path: /all/my/xyz   # Path to the dataset
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    weights:            # same weight for all, or file with weights
        config: 1.0
        energy: 0.0
        forces: 10.0
        stress: 0.0
```

3. Manifest file for initializing dataset using ColabFit parser:

```yaml
dataset:
    type: colabfit      # ase or path or colabfit
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    weights: None
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:
```

Parameters:

dataset_manifest (dict) – Dictionary describing the dataset, with the keys shown in the examples above.

Return type:

Dataset

Returns:

A dataset of configurations.
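
Since the method takes a dictionary, the first YAML example can also be written directly in Python. The sketch below assumes that the method expects the mapping found under the top-level dataset key rather than the enclosing document, which should be checked against your KLIFF version. The wrapper load_from_manifest is hypothetical and imports KLIFF lazily so that the manifest can be inspected without KLIFF installed.

```python
# Python equivalent of the ASE manifest example, without the optional
# save_path and weights entries.
dataset_manifest = {
    "type": "ase",       # ase or path or colabfit
    "path": "Si.xyz",    # path to the dataset
    "save": False,       # do not save the processed dataset
    "shuffle": False,    # keep the original order
    "keys": {"energy": "Energy", "forces": "forces"},
}


def load_from_manifest(manifest):
    """Build a Dataset from a manifest dictionary (needs KLIFF and the data file)."""
    from kliff.dataset import Dataset  # lazy import: only needed when called

    return Dataset.get_dataset_from_manifest(manifest)
```

Calling load_from_manifest(dataset_manifest) would then read Si.xyz through the ASE parser, provided the file exists.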

exception kliff.dataset.dataset.DatasetError(msg)[source]