kliff.dataset¶
- class kliff.dataset.Configuration(cell, species, coords, PBC, energy=None, forces=None, stress=None, weight=None, identifier=None)[source]¶
Class of atomic configuration. This is used to store the information of an atomic configuration, e.g. supercell, species, coords, energy, and forces.
- Parameters:
cell (ndarray) – A 3x3 matrix of the lattice vectors. The first, second, and third rows are $a_1$, $a_2$, and $a_3$, respectively.
species (List[str]) – A list of N strings giving the species of the atoms, where N is the number of atoms.
coords (ndarray) – A Nx3 matrix of the coordinates of the atoms, where N is the number of atoms.
PBC (List[bool]) – A list with 3 components indicating whether periodic boundary condition is used along the directions of the first, second, and third lattice vectors.
energy (Optional[float]) – energy of the configuration.
forces (Optional[ndarray]) – A Nx3 matrix of the forces on atoms, where N is the number of atoms.
stress (Optional[List[float]]) – A list with 6 components in Voigt notation, i.e. $\sigma = [\sigma_{xx}, \sigma_{yy}, \sigma_{zz}, \sigma_{yz}, \sigma_{xz}, \sigma_{xy}]$. See: https://en.wikipedia.org/wiki/Voigt_notation
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.
identifier (Union[str,Path,None]) – a (unique) identifier of the configuration
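The 6-component stress list can be related to a full 3x3 tensor as in the sketch below. It assumes the common Voigt ordering [xx, yy, zz, yz, xz, xy]; the helper names `stress_tensor_to_voigt` and `voigt_to_stress_tensor` are hypothetical and are not part of KLIFF.

```python
from typing import List, Sequence

# Index pairs for the assumed Voigt ordering [xx, yy, zz, yz, xz, xy].
VOIGT_PAIRS = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def stress_tensor_to_voigt(tensor: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a symmetric 3x3 stress tensor into 6 Voigt components.

    Hypothetical helper for illustration; it is not part of KLIFF.
    """
    return [float(tensor[i][j]) for i, j in VOIGT_PAIRS]


def voigt_to_stress_tensor(voigt: Sequence[float]) -> List[List[float]]:
    """Rebuild the symmetric 3x3 tensor from 6 Voigt components."""
    tensor = [[0.0] * 3 for _ in range(3)]
    for value, (i, j) in zip(voigt, VOIGT_PAIRS):
        # Fill both triangles so the result is symmetric.
        tensor[i][j] = tensor[j][i] = float(value)
    return tensor
```

A list produced this way has the shape expected by the `stress` argument, provided the ordering assumption holds for your data.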
- classmethod from_file(filename, weight=None, file_format='xyz')[source]¶
Read configuration from file.
- Parameters:
filename (Path) – Path to the file that stores the configuration.
file_format (str) – Format of the file that stores the configuration (e.g. xyz).
- to_file(filename, file_format='xyz')[source]¶
Write the configuration to file.
- Parameters:
filename (Path) – Path to the file that stores the configuration.
file_format (str) – Format of the file that stores the configuration (e.g. xyz).
- classmethod from_colabfit(database_client, data_object, weight=None)[source]¶
Read configuration from colabfit database.
- Parameters:
database_client (MongoDatabase) – Instance of a connected MongoDatabase client, which can be used to fetch data from a colabfit-tools dataset.
data_object (dict) – colabfit data object dictionary to be associated with the current configuration and property.
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.
- to_colabfit(database_client, data_object, weight=None)[source]¶
Save configuration to colabfit database.
- Parameters:
database_client (MongoDatabase)
data_object (dict)
weight (Optional[Weight])
Returns:
- classmethod from_ase_atoms(atoms, weight=None, energy_key='energy', forces_key='forces', stress_key='stress')[source]¶
Read configuration from ase.Atoms object.
- Parameters:
atoms (Atoms) – ase.Atoms object.
weight (Optional[Weight]) – an instance that computes the weight of the configuration in the loss function.
energy_key (str) – Name of the field in extxyz that stores the energy.
forces_key (str) – Name of the field in extxyz that stores the forces.
stress_key (str) – Name of the field in extxyz that stores the stress.
- to_ase_atoms()[source]¶
Convert the configuration to ase.Atoms object.
- Returns:
ase.Atoms representation of the Configuration
- property cell: ndarray¶
3x3 matrix of the lattice vectors of the configuration.
- property PBC: List[bool]¶
A list with 3 components indicating whether periodic boundary condition is used along the directions of the first, second, and third lattice vectors.
- property species: List[str]¶
Species string of all atoms.
- property coords: ndarray¶
A Nx3 matrix of the Cartesian coordinates of all atoms.
- property energy: float | None¶
Potential energy of the configuration.
- property forces: ndarray¶
Return a Nx3 matrix of the forces on each atom.
- property stress: List[float]¶
Stress of the configuration. The stress is given in Voigt notation, i.e. $\sigma = [\sigma_{xx}, \sigma_{yy}, \sigma_{zz}, \sigma_{yz}, \sigma_{xz}, \sigma_{xy}]$.
- property weight¶
Get the weight class of the loss function.
- property identifier: str¶
Return identifier of the configuration.
- property fingerprint¶
Return the stored fingerprint of the configuration.
- property path: Path | None¶
Return the path of the file containing the configuration. If the configuration is not read from a file, return None.
- property metadata: dict¶
Return the metadata of the configuration.
- get_num_atoms_by_species()[source]¶
Return a dictionary of the number of atoms with each species.
- Return type:
Dict[str,int]
- count_atoms_by_species(symbols=None)[source]¶
Count the number of atoms by species.
- Parameters:
symbols (Optional[List[str]]) – Species whose occurrences are counted. If None, all species present in the configuration are used.
- Returns:
A dictionary {species: count}, with the species string as key and the number of atoms of that species as value.
- Return type:
dict
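The counting behaviour described above can be sketched with the standard library. `count_by_species` is a hypothetical stand-in, not KLIFF's implementation, and returning 0 for requested symbols that do not occur is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional


def count_by_species(
    species: List[str], symbols: Optional[List[str]] = None
) -> Dict[str, int]:
    """Count atoms per species; hypothetical stand-in, not KLIFF's code."""
    counts = Counter(species)
    if symbols is None:
        # Use every species present, in order of first appearance.
        symbols = list(dict.fromkeys(species))
    # Counter returns 0 for symbols that do not occur (assumed behaviour).
    return {symbol: counts[symbol] for symbol in symbols}
```

For example, `count_by_species(["Si", "C", "Si"])` gives one entry per species present, while passing `symbols=["C"]` restricts the result to carbon.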
- order_by_species()[source]¶
Order the atoms according to the species such that atoms with the same species have contiguous indices.
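A minimal sketch of such a reordering, using plain lists, is given below. The function name `order_atoms_by_species` and the alphabetical species order are assumptions; the source does not say which species order KLIFF uses.

```python
from typing import List, Sequence, Tuple


def order_atoms_by_species(
    species: Sequence[str], coords: Sequence[Sequence[float]]
) -> Tuple[List[str], List[Sequence[float]]]:
    """Group atoms of the same species into contiguous indices.

    Hypothetical illustration; alphabetical species order is assumed.
    """
    # sorted() is stable, so atoms of one species keep their relative order.
    order = sorted(range(len(species)), key=lambda i: species[i])
    return [species[i] for i in order], [coords[i] for i in order]
```

Any per-atom array, such as the forces, would need the same permutation applied to stay aligned with the coordinates.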
- classmethod bulk(**kwargs)[source]¶
Transparent wrapper to get KLIFF configuration from bulk ASE atoms. Mostly for convenience.
- Parameters:
**kwargs – All the args that will be passed to ase.build.bulk
- Return type:
Configuration
- Returns:
Configuration
- class kliff.dataset.Dataset(configurations=None)[source]¶
A dataset of multiple configurations (Configuration).
- Parameters:
configurations (Optional[Iterable]) – A list of Configuration objects.
- classmethod from_colabfit(cls, colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)¶
Read configurations from colabfit database and initialize a dataset.
- Parameters:
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
colabfit_database (str) – Name of the colabfit Mongo database to read from.
colabfit_dataset (str) – Name of the colabfit dataset instance to read from, usually of the form “DS_xxxxxxxxxxxx_0”.
colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.
- Return type:
Dataset
- Returns:
A dataset of configurations.
- add_from_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017', weight=None, **kwargs)¶
Read configurations from colabfit database and add them to the dataset.
- Parameters:
colabfit_database (str) – Name of the colabfit Mongo database to read from.
colabfit_dataset (str) – Name of the colabfit dataset instance to read from, usually of the form “DS_xxxxxxxxxxxx_0”.
colabfit_uri (str) – connection URI of the colabfit Mongo database to read from.
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
- classmethod from_path(path, weight=None, file_format='xyz')[source]¶
Read configurations from path and initialize a dataset using KLIFF’s own parser.
- Parameters:
path (Union[Path,str]) – Path to the directory (or filename) storing the configurations.
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
file_format (str) – Format of the file that stores the configuration, e.g. xyz.
- Return type:
Dataset
- Returns:
A dataset of configurations.
- add_from_path(path, weight=None, file_format='xyz')[source]¶
Read configurations from path and append them to dataset.
- Parameters:
path (Union[Path,str]) – Path to the directory (or filename) storing the configurations.
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
file_format (str) – Format of the file that stores the configuration, e.g. xyz.
- classmethod from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]¶
Read configurations from ase.Atoms objects and initialize a dataset. The expected input is either a pre-initialized list of ase.Atoms, or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.
Example
>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset_from_list = Dataset.from_ase(ase_atoms_list=ase_configs)
>>> dataset_from_file = Dataset.from_ase(path="configs.xyz", energy_key="Energy")
- Parameters:
path (Union[str,Path,None]) – Path to the directory (or filename) storing the configurations.
ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.
forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.
stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.
slices (Union[slice,str]) – Slice of the configurations to read. It is used only when path is a file.
file_format (str) – Format of the file that stores the configuration, e.g. xyz.
- Return type:
Dataset
- Returns:
A dataset of configurations.
- add_from_ase(path=None, ase_atoms_list=None, weight=None, energy_key='energy', forces_key='forces', stress_key='stress', slices=':', file_format='xyz')[source]¶
Read configurations from ase.Atoms objects and append them to a dataset. The expected input is either a pre-initialized list of ase.Atoms, or a path from which the dataset can be read (usually an extxyz file). If the configurations are in a file or a directory, ase.io.read() is used to read them, so the file format must be supported by ASE.
Example
>>> from ase.build import bulk
>>> from kliff.dataset import Dataset
>>> ase_configs = [bulk("Al"), bulk("Al", cubic=True)]
>>> dataset = Dataset()
>>> dataset.add_from_ase(ase_atoms_list=ase_configs)
>>> dataset.add_from_ase(path="configs.xyz", energy_key="Energy")
- Parameters:
path (Union[str,Path,None]) – Path to the directory (or filename) storing the configurations.
ase_atoms_list (Optional[List[Atoms]]) – A list of ase.Atoms objects.
weight (Union[Weight,Path,None]) – an instance that computes the weight of the configuration in the loss function. If a path is provided, it is used to read the weight from the file. The file must be a plain text file with 4 whitespace-separated columns: config_weight, energy_weight, forces_weight, and stress_weight. Length of the file must be equal to the number of configurations, or 1 (in which case the same weight is used for all configurations).
energy_key (str) – Name of the field in extxyz/ase.Atoms that stores the energy.
forces_key (str) – Name of the field in extxyz/ase.Atoms that stores the forces.
stress_key (str) – Name of the field in extxyz/ase.Atoms that stores the stress.
slices (str) – Slice of the configurations to read. It is used only when path is a file.
file_format (str) – Format of the file that stores the configuration, e.g. xyz.
- classmethod from_lmdb(lmdb_file, n_configs=None, config_key_prefix=None, coords_key='coords', species_key='species', pbc_key='PBC', cell_key='cell', energy_key='energy', forces_key='forces', stress_key='stress', config_weight_key='config_weight', energy_weight_key='energy_weight', forces_weight_key='forces_weight', stress_weight_key='stress_weight', metadata_keys=None, weight_file=None)[source]¶
Load dataset from an LMDB file.
- Parameters:
lmdb_file (Path) – Path to the LMDB file.
n_configs (Optional[int]) – Number of configurations to load.
config_key_prefix (Optional[str]) – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.
coords_key (str) – Key to get coordinates from the lmdb configuration.
species_key (str) – Key to get species from the lmdb configuration.
pbc_key (str) – Key to get PBC array from the lmdb configuration.
cell_key (str) – Key to get cell vectors from the lmdb configuration.
energy_key (str) – Key to get energy from the lmdb configuration.
forces_key (str) – Key to get forces from the lmdb configuration.
stress_key (str) – Key to get stress from the lmdb configuration.
config_weight_key (str) – Key to get config_weight from the lmdb configuration.
energy_weight_key (str) – Key to get energy_weight from the lmdb configuration.
forces_weight_key (str) – Key to get forces_weight from the lmdb configuration.
stress_weight_key (str) – Key to get stress_weight from the lmdb configuration.
metadata_keys (Optional[List[str]]) – List of keys to get all metadata from the lmdb configuration.
weight_file (Optional[Path]) – Path to the KLIFF weight file.
- Return type:
Dataset
- Returns:
Dataset object.
- add_from_lmdb(lmdb_file, n_configs, config_key_prefix, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key, config_weight_key, energy_weight_key, forces_weight_key, stress_weight_key, metadata_keys)[source]¶
Add configurations from an LMDB file.
- Parameters:
lmdb_file – Path to the LMDB file.
n_configs – Number of configurations to load.
config_key_prefix – KLIFF assumes that configurations can be loaded as “prefix{idx}” where idx is the index of the configuration in the LMDB file.
coords_key – Key to get coordinates from the lmdb configuration.
species_key – Key to get species from the lmdb configuration.
pbc_key – Key to get PBC array from the lmdb configuration.
cell_key – Key to get cell vectors from the lmdb configuration.
energy_key – Key to get energy from the lmdb configuration.
forces_key – Key to get forces from the lmdb configuration.
stress_key – Key to get stress from the lmdb configuration.
config_weight_key – Key to get config_weight from the lmdb configuration.
energy_weight_key – Key to get energy_weight from the lmdb configuration.
forces_weight_key – Key to get forces_weight from the lmdb configuration.
stress_weight_key – Key to get stress_weight from the lmdb configuration.
metadata_keys – List of keys to get all metadata from the lmdb configuration.
- classmethod from_huggingface(hf_id, split='train', n_configs=None, coords_key='positions', species_key='atomic_numbers', pbc_key='pbc', cell_key='cell', energy_key='energy', forces_key='atomic_forces', stress_key=None, weights_file=None, **load_kwargs)[source]¶
Load dataset from a HuggingFace Hub dataset.
- Parameters:
hf_id (str) – Huggingface id e.g. “colabfit/xxMD-CASSCF_train”
split (str) – which split to load, e.g. “train”
n_configs (Optional[int]) – optionally limit to the first N configs
*_key – column names in the HF dataset
load_kwargs – passed through to datasets.load_dataset
- Return type:
Dataset
- Returns:
Dataset
- add_from_huggingface(hf_id, split, n_configs, coords_key, species_key, pbc_key, cell_key, energy_key, forces_key, stress_key=None, weights_file=None, **load_kwargs)[source]¶
Add configurations from a HuggingFace Hub dataset.
- Parameters:
hf_id – Huggingface id e.g. “colabfit/xxMD-CASSCF_train”
split – which split to load, e.g. “train”
n_configs – optionally limit to the first N configs
*_key – column names in the HF dataset
load_kwargs – passed through to datasets.load_dataset
- to_path(path, prefix=None)[source]¶
Save the dataset to a folder, as per the KLIFF xyz format. The folder will contain multiple files, each containing a configuration. Prefix is added to the filename of each configuration. Path is created if it does not exist.
- Parameters:
path (Union[Path,str]) – Path to the directory to save the dataset.
prefix (Optional[str]) – Prefix to add to the filename of each configuration.
- Return type:
None
- to_ase(path)[source]¶
Save the dataset to a file in ASE format. The file will contain multiple configurations, each separated by a newline. The file will be saved in the specified path. The file format is determined by the extension of the path.
- Parameters:
path (Union[Path,str]) – Path to the file to save the dataset.
- Return type:
None
- to_ase_list()[source]¶
Convert the dataset to a list of ase.Atoms objects.
- Return type:
List[Atoms]
- Returns:
List of ase.Atoms objects.
- to_colabfit(colabfit_database, colabfit_dataset, colabfit_uri='mongodb://localhost:27017')[source]¶
Save dataset to a colabfit database.
- Parameters:
colabfit_database (str)
colabfit_dataset (str)
colabfit_uri (str)
Returns:
- get_configs()[source]¶
Get a shallow copy of the configurations.
- Return type:
List[Configuration]
- save_weights(path)[source]¶
Save the weights of the configurations to a file.
- Parameters:
path (Union[Path,str]) – Path of the file to save the weights.
- static add_weights(configurations, source)[source]¶
Load weights from a text file or a Weight instance. The text file should contain 1 to 4 whitespace-separated columns, formatted as:
```
Config Energy Forces Stress
1.0    0.0    10.0   0.0
```
Note: The column headers are case-insensitive, but should have the same names as above. A weight of 0.0 sets the respective weight to None. The length of a column can be either 1 (all configs get the same weight) or n, where n is the number of configs in the dataset.
Missing columns are treated as 0.0, i.e. the example file above can also be written as:
```
Config Forces
1.0    10.0
```
YAML weight files are also supported. The YAML file should be formatted as:
```
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
- config: [1.0, 1.0, 1.0]
  energy: 0.0
  forces: [1.0, 1.0, 1.0]
  stress: 0.0
```
Any missing key is treated as 0.0. The weights are assumed to be in the same order as the dataset configurations.
- Parameters:
configurations (Union[List[Configuration],Dataset]) – List of configurations to add weights to.
source (Union[Path,str,Weight]) – Path to the weight file, or a Weight instance.
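The rules for the text weight file (case-insensitive headers, missing columns treated as 0.0, 0.0 stored as None, one row shared by all configurations) can be sketched as follows. `parse_weight_text` is a hypothetical helper for illustration and is not KLIFF's parser.

```python
from typing import Dict, List, Optional

COLUMNS = ("config", "energy", "forces", "stress")


def parse_weight_text(text: str, n_configs: int) -> List[Dict[str, Optional[float]]]:
    """Parse a whitespace-separated weight table into one dict per configuration.

    Hypothetical helper for illustration; it is not KLIFF's parser.
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header = [name.lower() for name in lines[0]]  # headers are case-insensitive
    unknown = set(header) - set(COLUMNS)
    if unknown:
        raise ValueError(f"Unknown weight columns: {sorted(unknown)}")

    rows = []
    for fields in lines[1:]:
        given = dict(zip(header, map(float, fields)))
        # Missing columns default to 0.0, and 0.0 is stored as None.
        rows.append({name: (given.get(name, 0.0) or None) for name in COLUMNS})

    if len(rows) == 1:
        # A single row applies to every configuration.
        rows = [dict(rows[0]) for _ in range(n_configs)]
    if len(rows) != n_configs:
        raise ValueError(f"Expected 1 or {n_configs} rows, got {len(rows)}")
    return rows
```

Calling `parse_weight_text("Config Forces\n1.0 10.0\n", n_configs=2)` yields two identical entries in which the energy and stress weights are None.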
- add_metadata(metadata)[source]¶
Add metadata to the dataset object.
- Parameters:
metadata (dict) – A dictionary containing the metadata.
- get_metadata(key)[source]¶
Get the metadata of the dataset.
- Parameters:
key (str) – Key of the metadata to get.
- Returns:
Value of the metadata.
- property metadata¶
Return the metadata of the dataset.
- check_properties_consistency(properties=None)[source]¶
Check which of the properties of the configurations are consistent. These consistent properties are saved as a list, which can be used to get the attributes from the configurations. “Consistent” in this context means that the same property is available for all the configurations. A property is not considered consistent if it is None for any of the configurations.
- Parameters:
properties (Optional[List[str]]) – List of properties to check for consistency. If None, no properties are checked. All consistent properties are saved in the metadata.
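The consistency rule described above can be sketched with plain dictionaries standing in for Configuration objects. `consistent_properties` is a hypothetical name and this is not KLIFF's implementation.

```python
from typing import Any, Dict, List, Sequence


def consistent_properties(
    configs: Sequence[Dict[str, Any]], properties: List[str]
) -> List[str]:
    """Return the properties that are not None in every configuration.

    Dictionaries stand in for Configuration objects (an assumption made
    only to keep the sketch self-contained).
    """
    return [
        prop
        for prop in properties
        # A missing key is treated like None, i.e. not consistent.
        if all(config.get(prop) is not None for config in configs)
    ]
```

A property such as stress drops out of the result as soon as a single configuration lacks it.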
- static get_manifest_checksum(dataset_manifest, transform_manifest=None)[source]¶
Get the checksum of the dataset manifest.
- Parameters:
dataset_manifest (dict[str,Any]) – Manifest of the dataset.
transform_manifest (Optional[dict[str,Any]]) – Manifest of the transformation.
- Return type:
str
- Returns:
Checksum of the manifest.
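One way a manifest checksum could be computed is sketched below. The hashing scheme (SHA-256 over key-sorted JSON) and the name `manifest_checksum` are assumptions; the source does not state which algorithm KLIFF uses.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def manifest_checksum(
    dataset_manifest: Dict[str, Any],
    transform_manifest: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash the manifests into a hex digest (assumed scheme, not KLIFF's)."""
    payload = {"dataset": dataset_manifest, "transform": transform_manifest}
    # sort_keys makes the serialization independent of key insertion order;
    # default=str covers values such as Path objects.
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the keys are sorted before hashing, two manifests that differ only in key order map to the same digest.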
- static get_dataset_from_manifest(dataset_manifest)[source]¶
Get a dataset from a manifest.
Examples
1. Manifest file for initializing dataset using ASE parser:
```yaml
dataset:
    type: ase                      # ase or path or colabfit
    path: Si.xyz                   # Path to the dataset
    save: True                     # Save processed dataset to a file
    save_path: /folder/to          # Save to this folder
    shuffle: False                 # Shuffle the dataset
    weights: /path/to/weights.dat  # or dictionary with weights
    keys:
        energy: Energy             # Key for energy, if ase dataset is used
        forces: forces             # Key for forces, if ase dataset is used
```
2. Manifest file for initializing dataset using KLIFF extxyz parser:
```yaml
dataset:
    type: path                     # ase or path or colabfit
    path: /all/my/xyz              # Path to the dataset
    save: False                    # Save processed dataset to a file
    shuffle: False                 # Shuffle the dataset
    weights:                       # same weight for all, or file with weights
        config: 1.0
        energy: 0.0
        forces: 10.0
        stress: 0.0
```
3. Manifest file for initializing dataset using ColabFit parser:
```yaml
dataset:
    type: colabfit                 # ase or path or colabfit
    save: False                    # Save processed dataset to a file
    shuffle: False                 # Shuffle the dataset
    weights: None
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:
```
- Parameters:
dataset_manifest (dict) – Dictionary describing the dataset, in the layout of the examples above.
- Return type:
Dataset
- Returns:
A dataset of configurations.
- kliff.dataset.read_extxyz(filename)[source]¶
Read an atomic configuration stored in extended xyz file format.
- Parameters:
filename (Path) – filename to the extended xyz file
- Returns:
cell: 3x3 array, supercell lattice vectors
species: species of atoms
coords: Nx3 array, coordinates of atoms
PBC: periodic boundary conditions
energy: potential energy of the configuration; None if not provided in file
forces: Nx3 array, forces on atoms; None if not provided in file
stress: 1D array of size 6, stress on the cell in Voigt notation; None if not provided in file
- kliff.dataset.write_extxyz(filename, cell, species, coords, PBC, energy=None, forces=None, stress=None, bool_as_str=False)[source]¶
Write configuration info to a file in extended xyz file format.
- Parameters:
filename (Path) – filename to the extended xyz file
cell (ndarray) – 3x3 array, supercell lattice vectors
species (List[str]) – species of atoms
coords (ndarray) – Nx3 array, coordinates of atoms
PBC (List[bool]) – periodic boundary conditions
energy (Optional[float]) – potential energy of the configuration; if None, it is not written to the file
forces (Optional[ndarray]) – Nx3 array, forces on atoms; if None, they are not written to the file
stress (Optional[List[float]]) – 1D array of size 6, stress on the cell in Voigt notation; if None, it is not written to the file
bool_as_str (bool) – If True, write PBC as “T” or “F”; otherwise, write PBC as 1 or 0.
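The effect of bool_as_str can be illustrated with a minimal formatter for the general extended XYZ layout. This is a sketch under assumptions: `format_extxyz` is a hypothetical name, and the exact header keys and number formatting that KLIFF writes are not shown in the source, so the `Lattice`, `PBC`, and `Properties` fields below follow the common convention only.

```python
from typing import List, Sequence


def format_extxyz(
    cell: Sequence[Sequence[float]],
    species: List[str],
    coords: Sequence[Sequence[float]],
    PBC: Sequence[bool],
    bool_as_str: bool = False,
) -> str:
    """Format a configuration as extended XYZ text (assumed header layout)."""
    lattice = " ".join(f"{x:.8f}" for row in cell for x in row)
    if bool_as_str:
        pbc = " ".join("T" if p else "F" for p in PBC)
    else:
        pbc = " ".join(str(int(p)) for p in PBC)
    header = f'Lattice="{lattice}" PBC="{pbc}" Properties=species:S:1:pos:R:3'
    # First line: number of atoms; second line: header; then one line per atom.
    lines = [str(len(species)), header]
    for symbol, (x, y, z) in zip(species, coords):
        lines.append(f"{symbol} {x:.8f} {y:.8f} {z:.8f}")
    return "\n".join(lines) + "\n"
```

With `bool_as_str=True` the PBC field reads like `"T T F"`, and with the default it reads like `"1 1 0"`.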