Dataset#
This module contains classes and functions for working with datasets.
A dataset comprises a set of configurations, which provide the training data used to optimize a potential (its parameters) or the test data used to assess the quality of a potential.
A configuration should have three lattice vectors of the simulation cell, flags to indicate periodic boundary conditions (PBC), and the species and coordinates of all atoms. These collectively define a configuration and are typically considered the input for potential fitting.
A configuration should also contain a set of outputs (targets), which the optimization algorithm adjusts the potential (parameters) to match. For example, if the force-matching scheme is used for the fitting, the output can be the forces on individual atoms. The currently supported outputs are energy, forces, and stress.
See also
See kliff.dataset.Configuration for a complete list of the member functions of the Configuration class.
To create a dataset, do:
from kliff.dataset import Dataset
path = 'path_to_my_dataset_files'
dset = Dataset(path, format='extxyz')
where path is a file storing a configuration or a directory containing multiple files. If given a directory, all the files in this directory and its subdirectories with the extension corresponding to the specified format will be read. For example, if format='extxyz', all the files with the extension .xyz in path and its subdirectories will be read.
The size of the dataset can be obtained by:
dset_size = dset.get_num_configs()
and a list of configurations constituting the dataset can be obtained by:
configs = dset.get_configs()
See also
See kliff.dataset.Dataset for a complete list of the member functions of the Dataset class.
Inspect dataset#
KLIFF provides a command-line tool that reports statistics for a dataset of files.
For example, for Si_training_set.tar.gz (the tarball can be extracted with $ tar xzf Si_training_set.tar.gz), running:
$ kliff dataset --count Si_training_set
prints the following information:
================================================================================
KLIFF Dataset Count
Notation: "──dir_name (a/b)"
a: number of .xyz files in the directory "dir_name"
b: number of .xyz files in the directory "dir_name" and its subdirectories
Si_training_set (0/1000)
├──NVT_runs (600/600)
└──varying_alat (400/400)
================================================================================
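The "(a/b)" notation can be reproduced with a few lines of standard-library Python. The sketch below is only an illustration of what the two numbers mean; it is not KLIFF's implementation, and the helper name count_xyz and the file names in the throwaway directory tree are hypothetical.

```python
import tempfile
from pathlib import Path


def count_xyz(directory):
    """Return (a, b): .xyz files directly in `directory`, and in it plus all subdirectories."""
    directory = Path(directory)
    direct = sum(1 for p in directory.glob("*.xyz") if p.is_file())
    recursive = sum(1 for p in directory.rglob("*.xyz") if p.is_file())
    return direct, recursive


def demo():
    # Build a small temporary tree that mimics the layout above (hypothetical file names).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "Si_training_set"
        for sub, n in (("NVT_runs", 3), ("varying_alat", 2)):
            (root / sub).mkdir(parents=True)
            for i in range(n):
                (root / sub / f"config_{i}.xyz").write_text("")
        return {
            "Si_training_set": count_xyz(root),
            "NVT_runs": count_xyz(root / "NVT_runs"),
            "varying_alat": count_xyz(root / "varying_alat"),
        }


counts = demo()
for name, (a, b) in counts.items():
    print(f"{name} ({a}/{b})")
```

The top-level directory holds no .xyz files itself, so its first number is zero while its second number is the sum over both subdirectories.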
Dataset Format#
More often than not, your dataset is generated from first-principles calculations using packages such as VASP, SIESTA, and Quantum Espresso, among others. Their output file formats may not be supported by KLIFF. You can parse these outputs to get the necessary data, and then convert it to a format supported by KLIFF using the functions kliff.dataset.write_config() and kliff.dataset.read_config().
Currently supported dataset formats include:
extended XYZ (.xyz)
Extended XYZ#
The Extended XYZ format is an enhanced version of the basic XYZ format that allows extra columns to be present in the file for additional per-atom properties, and standardizes the format of the comment line to include the cell lattice and other per-frame parameters. It typically has the .xyz extension.
The format is easiest to explain with an example. Below is an example of the extended XYZ format supported by KLIFF:
8
Lattice="4.8879 0 0 0 4.8879 0 0 0 4.8879" PBC="1 1 1" Energy=-29.3692121943 Properties=species:S:1:pos:R:3:force:R:3
Si 0.00000e+00 0.00000e+00 0.00000e+00 2.66454e-15 -8.32667e-17 4.02456e-16
Si 2.44395e+00 2.44395e+00 0.00000e+00 1.62370e-15 7.21645e-16 8.46653e-16
Si 0.00000e+00 2.44395e+00 2.44395e+00 0.00000e+00 3.60822e-16 2.01228e-16
Si 2.44395e+00 0.00000e+00 2.44395e+00 1.33227e-15 -4.44089e-16 8.74350e-16
Si 1.22198e+00 1.22198e+00 1.22198e+00 4.44089e-15 1.80411e-16 1.87350e-16
Si 3.66593e+00 3.66593e+00 1.22198e+00 9.29812e-16 -2.67841e-15 -3.22659e-16
Si 1.22198e+00 3.66593e+00 3.66593e+00 5.55112e-17 3.96905e-15 8.87786e-16
Si 3.66593e+00 1.22198e+00 3.66593e+00 -2.60902e-15 -9.43690e-16 6.37999e-16
The first line lists the number of atoms in the system.
The second line follows the key=value structure. If a value contains any space (e.g. Lattice), it should be placed in quotation marks " ". The supported keys are:
Lattice. The three Cartesian lattice vectors: the first three numbers denote a1, the next three numbers denote a2, and the last three numbers denote a3. Note that a1, a2, and a3 should follow the right-hand rule such that the volume of the cell can be obtained by (a1 × a2) · a3.
PBC. Three integers of 1 or 0 (or three characters of T or F) to indicate whether to use periodic boundary conditions along a1, a2, and a3, respectively.
Energy. A real value of the total potential energy of the system.
Properties. Provides the names, sizes, and types of the data that are listed in the body of the file. For example, the Properties in the above example means that the atomic species information (a string) is listed in the first column of the body, the next three columns list the atomic coordinates, and the last three columns list the forces on atoms.
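To make the key=value structure concrete, here is a minimal standard-library sketch that splits the second line of the example into its keys and computes the cell volume as the scalar triple product (a1 × a2) · a3. The helper names parse_comment_line and cell_volume are hypothetical and are not part of KLIFF; in practice the file would be read through KLIFF itself.

```python
import re


def parse_comment_line(line):
    """Split an extended XYZ comment line into a dict of key -> raw string value."""
    # A value is either a double-quoted string (may contain spaces)
    # or a run of non-space characters.
    pattern = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')
    fields = {}
    for key, quoted, bare in pattern.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


def cell_volume(lattice):
    """Volume (a1 x a2) . a3 from the nine numbers of the Lattice value."""
    v = [float(x) for x in lattice.split()]
    a1, a2, a3 = v[0:3], v[3:6], v[6:9]
    cross = (
        a1[1] * a2[2] - a1[2] * a2[1],
        a1[2] * a2[0] - a1[0] * a2[2],
        a1[0] * a2[1] - a1[1] * a2[0],
    )
    return sum(c * x for c, x in zip(cross, a3))


header = (
    'Lattice="4.8879 0 0 0 4.8879 0 0 0 4.8879" PBC="1 1 1" '
    "Energy=-29.3692121943 Properties=species:S:1:pos:R:3:force:R:3"
)
fields = parse_comment_line(header)
volume = cell_volume(fields["Lattice"])
pbc = [flag in ("1", "T") for flag in fields["PBC"].split()]
energy = float(fields["Energy"])
```

A negative volume from cell_volume would indicate that the three lattice vectors do not follow the right-hand rule.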
Each line in the body lists the information, indicated by Properties in the second line, for one atom in the system, taking the form:
species x y z fx fy fz
The coordinates x y z should be given in Cartesian values, not fractional values. The forces fx fy fz can be skipped if you do not want to use them.
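The way Properties drives the layout of each body line can be sketched as follows. This is an illustrative reader written for this page, with hypothetical helper names (parse_properties, parse_atom_line); it handles only the string (S) and real (R) column types that appear in the example above.

```python
def parse_properties(spec):
    """Turn 'species:S:1:pos:R:3:force:R:3' into [(name, type, ncols), ...]."""
    parts = spec.split(":")
    return [
        (parts[i], parts[i + 1], int(parts[i + 2]))
        for i in range(0, len(parts), 3)
    ]


def parse_atom_line(line, properties):
    """Read one body line into a dict keyed by property name."""
    tokens = line.split()
    atom, start = {}, 0
    for name, kind, ncols in properties:
        chunk = tokens[start:start + ncols]
        if len(chunk) != ncols:
            raise ValueError(f"expected {ncols} column(s) for {name!r}")
        # 'S' columns stay strings; 'R' columns are converted to floats.
        values = chunk if kind == "S" else [float(t) for t in chunk]
        atom[name] = values[0] if ncols == 1 else values
        start += ncols
    return atom


props = parse_properties("species:S:1:pos:R:3:force:R:3")
atom = parse_atom_line(
    "Si 2.44395e+00 2.44395e+00 0.00000e+00 1.62370e-15 7.21645e-16 8.46653e-16",
    props,
)

# Without forces, Properties simply omits the force entry.
atom_no_forces = parse_atom_line("Si 0 0 0", parse_properties("species:S:1:pos:R:3"))
```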
Note
An atomic configuration stored in the extended XYZ format can be visualized using the OVITO program.
Weight#
As mentioned in Theory, the reference can be any material property, and different properties can carry different physical units. The weights in the loss function can be used to put quantities with different units on a common scale. The weights also let us specify which properties or configurations are more important, for example, when developing a potential for a certain application (see Define your weight class).
KLIFF uses a weight class to compute and store the weight information for each configuration. The basic structure of the class is shown below.
class Weight():
    """A class to deal with weights for each configuration."""

    def __init__(self):
        #... Do necessary steps to initialize the class
        pass

    def compute_weight(self, config):
        #... Compute the weights for the given configuration
        pass

    @property
    def some_weight(self):
        #... Add properties to retrieve the weight values
        pass
Default weight class#
KLIFF has several built-in weight classes. By default, KLIFF uses kliff.dataset.weight.Weight, which assigns a single weight to each property.
from kliff.dataset import Dataset
from kliff.dataset.weight import Weight

path = 'path_to_my_dataset_files'
weight = Weight()
dset = Dataset(path, weight=weight, format='extxyz')
configs = dset.get_configs()

# Retrieve the weights
config_weight = configs[0].config_weight
energy_weight = configs[0].energy_weight
forces_weight = configs[0].forces_weight
stress_weight = configs[0].stress_weight
config_weight is the weight for the configuration, and energy_weight, forces_weight, and stress_weight are the weights for energy, forces, and stress, respectively. The default value for each weight is 1.0.
One can also specify different values for these weights. For example, one might want to weigh the energy 10 times as heavily as the forces. This can be done by specifying the weight values when instantiating kliff.dataset.weight.Weight.
weight = Weight(
    config_weight=1.0, energy_weight=10.0, forces_weight=1.0, stress_weight=1.0
)
Note
Another use case is excluding a property, for example the energy, from the loss function, which can be done by setting energy_weight=0.0.
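The effect of these weights can be illustrated with a toy loss. The sketch below assumes, purely for illustration, that the loss for one configuration is a weighted sum of squared differences between predictions and references; the exact form KLIFF uses is the one given in Theory and may differ. The function weighted_loss and all numbers are made up.

```python
def weighted_loss(pred, ref, energy_weight=1.0, forces_weight=1.0, config_weight=1.0):
    """Illustrative loss for one configuration: weighted sum of squared residuals."""
    energy_term = energy_weight * (pred["energy"] - ref["energy"]) ** 2
    forces_term = forces_weight * sum(
        (p - r) ** 2 for p, r in zip(pred["forces"], ref["forces"])
    )
    return config_weight * (energy_term + forces_term)


# Made-up numbers, used only to show the effect of the weights.
ref = {"energy": -29.0, "forces": [0.0, 0.5, -0.5]}
pred = {"energy": -28.0, "forces": [0.25, 0.5, -0.5]}

default_loss = weighted_loss(pred, ref)                      # energy and forces both count
energy_heavy = weighted_loss(pred, ref, energy_weight=10.0)  # energy weighted 10 times
forces_only = weighted_loss(pred, ref, energy_weight=0.0)    # energy excluded
```

With energy_weight=0.0 the energy residual drops out entirely, so only the force mismatch contributes.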
Magnitude-inverse weight#
KLIFF also provides another weight class that computes the weight based on the magnitude of the data, applying a different weight to each data point. The weight calculation is motivated by the formulation suggested by Lenosky et al. [lenosky1997]. It uses two parameters, c1 and c2, which can be thought of as a padding term and a fractional scaling term, together with the norm of the data point. When the data point corresponds to energy, the norm is the absolute value of the energy. When it corresponds to forces, the norm is the vector norm of the force vector acting on the corresponding atom. This also means that each force component acting on the same atom will have the same weight. If it corresponds to stress, then the norm is the Frobenius norm of the stress tensor, giving the same weight to each component of the stress tensor.
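The three norms described above can be sketched with the standard library alone. This only illustrates the norms, not the weight formula itself; the function names are hypothetical, and the stress is assumed here to be a full 3x3 tensor given as nested lists, which may differ from how KLIFF stores it.

```python
import math


def energy_norm(energy):
    """Absolute value of the total energy."""
    return abs(energy)


def force_norms(forces):
    """One Euclidean norm per atom; all three components of that atom share it."""
    return [math.sqrt(sum(c * c for c in f)) for f in forces]


def stress_norm(stress):
    """Frobenius norm of a 3x3 stress tensor given as nested lists."""
    return math.sqrt(sum(c * c for row in stress for c in row))


# Made-up values chosen so the norms are easy to check by hand.
forces = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
stress = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

e_norm = energy_norm(-29.5)
f_norms = force_norms(forces)
s_norm = stress_norm(stress)
```

Because there is one force norm per atom, the x, y, and z components of the force on a given atom all receive the same weight, as stated above.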
To use this weight, we instantiate the MagnitudeInverseWeight weight class:
from kliff.dataset.weight import MagnitudeInverseWeight

weight = MagnitudeInverseWeight(
    config_weight=1.0,
    weight_params={
        "energy_weight_params": [c1e, c2e],
        "forces_weight_params": [c1f, c2f],
        "stress_weight_params": [c1s, c2s],
    }
)
config_weight specifies the weight for the entire configuration. weight_params is a dictionary containing c1 and c2 for energy, forces, and stress. The default value is:
weight_params = {
    "energy_weight_params": [1.0, 0.0],
    "forces_weight_params": [1.0, 0.0],
    "stress_weight_params": [1.0, 0.0],
}
Additionally, for each key, we can pass in a float, which sets the value of c1 with c2 = 0.
- lenosky1997
Lenosky, T.J., Kress, J.D., Kwon, I., Voter, A.F., Edwards, B., Richards, D.F., Yang, S., Adams, J.B., 1997. Highly optimized tight-binding model of silicon. Phys. Rev. B 55, 1528-1544. https://doi.org/10.1103/PhysRevB.55.1528
Define your weight class#
We can also define a custom weight class to use in KLIFF. As an example, suppose we are developing a potential that will be used to investigate fracture properties. The training set includes configurations both with and without cracks. For this application, we might want to put larger weights on the configurations with cracks. Below is an example of a weight class that achieves this goal.
from kliff.dataset.weight import Weight


class WeightForCracks(Weight):
    """An example weight class that puts a larger weight on the configurations with
    cracks. This class inherits from ``kliff.dataset.weight.Weight``. We just need to
    modify the ``compute_weight`` method to put a larger weight on the configurations
    with cracks. Other modifications might be needed for a different weight class.
    """

    def __init__(self, energy_weight, forces_weight):
        super().__init__(energy_weight=energy_weight, forces_weight=forces_weight)

    def compute_weight(self, config):
        identifier = config.identifier
        if 'with_cracks' in identifier:
            self._config_weight = 10.0
With this weight class, we can use the built-in residual_fn to achieve the same result as the implementation in Use your own residual function.