Trainer Manifest¶
KLIFF uses YAML configuration files to control the training of interatomic force fields with machine-learning models. A typical configuration file is divided into the following top-level sections:
workspace
dataset
model
transforms
training
export (optional)
Each section is itself a dictionary with keys and values that specify particular settings. The required sections are typically workspace, dataset, model, and training; transforms and export are optional but often useful. In particular, transforms is almost always used with ML models, where it converts the atomic coordinates into model inputs such as descriptors or graphs.
Below is a general explanation of each section, along with examples. Refer to the provided example configuration files to see these in practice.
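As a rough illustration of the layout described above, the following sketch checks the top-level sections of a manifest dictionary. The helper `check_manifest_sections` is hypothetical and not part of KLIFF; it only encodes the required/optional split listed in this section.

```python
# Hypothetical helper, not a KLIFF API: it only mirrors the section list above.
REQUIRED_SECTIONS = ("workspace", "dataset", "model", "training")
OPTIONAL_SECTIONS = ("transforms", "export")


def check_manifest_sections(manifest: dict) -> list:
    """Return human-readable problems found in the top-level layout."""
    problems = []
    known = set(REQUIRED_SECTIONS) | set(OPTIONAL_SECTIONS)
    for section in REQUIRED_SECTIONS:
        if section not in manifest:
            problems.append(f"missing required section: {section}")
    for section, value in manifest.items():
        if section not in known:
            problems.append(f"unknown section: {section}")
        elif not isinstance(value, dict):
            problems.append(f"section {section!r} must be a dictionary")
    return problems


manifest = {
    "workspace": {"name": "test_run", "seed": 12345},
    "dataset": {"type": "ase", "path": "Si.xyz"},
    "model": {"name": "SW_StillingerWeber_1985_Si__MO_405512056662_006"},
}
print(check_manifest_sections(manifest))
# -> ['missing required section: training']
```

A check like this can be run before handing the dictionary to a trainer, so that a misspelled section name is caught early.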
1. workspace¶
Purpose¶
The workspace section manages where training runs are stored, random seeds, and other essential housekeeping. By specifying a seed here, you ensure reproducible results.
Common Keys¶
name: Name of the main workspace folder to create or use.
seed: Random seed for reproducibility.
resume: (Optional) Whether to resume from a previous checkpoint.
Example¶
workspace:
  name: test_run
  seed: 12345
  resume: False
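To see why a fixed seed gives reproducible runs, here is a minimal sketch using only the Python standard library. It does not show how KLIFF consumes the seed internally; the function `shuffled_indices` is a hypothetical stand-in for any seeded random step such as dataset shuffling.

```python
import random

workspace = {"name": "test_run", "seed": 12345, "resume": False}


def shuffled_indices(n: int, seed: int) -> list:
    """Shuffle 0..n-1 with a private generator seeded from the manifest."""
    rng = random.Random(seed)  # separate generator, global state untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, workspace["seed"])
second = shuffled_indices(10, workspace["seed"])
print(first == second)  # True: same seed, same order
```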
2. dataset¶
Purpose¶
Specifies how to load and configure the training (and validation) data. KLIFF can process data from various sources (ASE, file paths, ColabFit, etc.). This section tells KLIFF how to interpret your dataset and which properties (energy, forces, etc.) to use.
Common Keys¶
type: Dataset format, e.g. ase, path, or colabfit.
path: Path to the dataset if using ase or path (ignored for colabfit).
shuffle: Whether to shuffle the data.
save: Whether to store a preprocessed version of the dataset on disk.
dynamic_loading: (Optional) If true, loads data in chunks at runtime (for large datasets).
keys: A sub-dict mapping property names in the raw dataset to standardized ones recognized by KLIFF (energy, forces, etc.).
Example¶
dataset:
  type: ase
  path: Si.xyz
  save: False
  shuffle: True
  keys:
    energy: Energy
    forces: forces
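The role of the keys sub-dict can be sketched as a simple renaming step. This assumes, as in the example above, that each entry reads `standard_name: raw_name`; the function `standardize_properties` and the numerical values are made up for illustration and are not KLIFF code or real data.

```python
def standardize_properties(raw: dict, keys: dict) -> dict:
    """Pick the mapped raw entries and store them under the standard names."""
    standardized = {}
    for standard_name, raw_name in keys.items():
        if raw_name not in raw:
            raise KeyError(f"dataset has no property named {raw_name!r}")
        standardized[standard_name] = raw[raw_name]
    return standardized


keys = {"energy": "Energy", "forces": "forces"}
# Placeholder record, as it might look after reading one configuration.
raw = {
    "Energy": -8.0,
    "forces": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    "comment": "unused",
}
standard = standardize_properties(raw, keys)
print(sorted(standard))  # ['energy', 'forces']
```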
3. model¶
Purpose¶
Defines the model used to fit the interatomic force field. KLIFF supports multiple backends, including KIM models (kim type) and Torch/PyTorch-based ML models (torch type).
Common Keys¶
type: (Optional) Potential backend, such as kim or torch.
name: Identifier for the model; for KIM, a recognized KIM model name; for Torch, a .pt file or descriptive string.
path: Filesystem path where the model is loaded/saved.
input_args: (Torch-specific) Lists the data fields that feed into the model’s forward pass (e.g., z, coords, etc.).
precision: (Torch-specific) Set to double or single; currently double is typically used.
Tip
For a custom model, or one that cannot be exported with TorchScript, the user needs to manually instantiate the trainer class with the model and the config dict.
Example (KIM Model)¶
model:
  path: ./
  name: SW_StillingerWeber_1985_Si__MO_405512056662_006
Example (Torch Model)¶
model:
  path: ./model_dnn.pt
  name: "TorchDNN"
Example (Torch GNN Model)¶
Model to be provided manually at runtime
model:
  type: torch
  path: ./
  name: "TorchGNN2"
  input_args:
    - z
    - coords
    - edge_index0
    - contributions
  precision: double
4. transforms¶
Purpose¶
Allows modifications to the data or the model parameters before or during training. These can be transformations on classical potential parameters (e.g., applying a log transform) or on the configuration data (e.g., generating descriptors or graph representations for ML models).
Common Keys¶
parameter: A list of classical potential parameters that can be optimized or transformed. Parameters can be simple strings or dictionaries defining a transform (e.g., LogParameterTransform with bounds).
configuration: Typically used for ML-based or Torch-based models to specify data transforms. For instance, computing a descriptor or building a graph adjacency.
properties: Transform the dataset-wide properties like energy and forces. Usually it is used to normalize the energy/forces.
Example (Parameter Transform for KIM)¶
Allow the model to sample in log space. The transformed parameter list in KIM models will be treated as the parameters which are to be trained.
transforms:
  parameter:
    - A
    - B
    - sigma:
        transform_name: LogParameterTransform
        value: 2.0
        bounds: [[1.0, 10.0]]
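The idea of optimizing in log space can be sketched as follows. This is only an illustration of the mathematics, not KLIFF's LogParameterTransform implementation, and it does not cover how bounds are handled; the function names are hypothetical.

```python
import math


def to_log_space(value: float) -> float:
    """Forward transform: the optimizer works on log(value)."""
    return math.log(value)


def from_log_space(transformed: float) -> float:
    """Inverse transform: recover the physical parameter."""
    return math.exp(transformed)


sigma = 2.0  # initial value from the example above
t = to_log_space(sigma)

# Any step taken in log space, however large, maps back to a positive value.
t_after_step = t - 5.0
sigma_after_step = from_log_space(t_after_step)

print(math.isclose(from_log_space(t), sigma))  # True: round trip is lossless
print(sigma_after_step > 0.0)                  # True: positivity is preserved
```

Working in log space is a common choice for parameters that must stay positive or that span several orders of magnitude.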
Example (Configuration Transform for Torch)¶
Map the coordinates to Behler symmetry functions (all keywords are case sensitive).
transforms:
  configuration:
    name: Descriptor
    kwargs:
      cutoff: 4.0
      species: ["Si"]
      descriptor: SymmetryFunctions
      hyperparameters: "set51"
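To give a feel for what such a descriptor computes, here is a minimal sketch of a single radial symmetry function of the Behler type with a cosine cutoff. It is not the KLIFF descriptor code, and the values of `eta` and `r_shift` are arbitrary placeholders rather than entries from the "set51" hyperparameter set.

```python
import math


def cutoff_function(r: float, r_cut: float) -> float:
    """Cosine cutoff: decays smoothly from 1 at r = 0 to 0 at r = r_cut."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)


def radial_g2(center, neighbors, r_cut, eta, r_shift=0.0):
    """Sum of Gaussians of the neighbor distances, damped by the cutoff."""
    total = 0.0
    for position in neighbors:
        r = math.dist(center, position)
        total += math.exp(-eta * (r - r_shift) ** 2) * cutoff_function(r, r_cut)
    return total


center = (0.0, 0.0, 0.0)
# Two environments that differ only by a rotation of the neighbor.
g_a = radial_g2(center, [(2.0, 0.0, 0.0)], r_cut=4.0, eta=0.5)
g_b = radial_g2(center, [(0.0, 2.0, 0.0)], r_cut=4.0, eta=0.5)
print(g_a == g_b)  # True: the value depends only on distances
```

Because only interatomic distances enter, the resulting value is unchanged by rotations and translations of the configuration, which is the reason coordinates are mapped to descriptors before being fed to an ML model.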
Example (Graph Transform)¶
Generate radial edge graphs for GNNs.
transforms:
  configuration:
    name: RadialGraph
    kwargs:
      cutoff: 8.6
      species: ["H", "He", "Li", ..., "Og"] # entire periodic table example
      n_layers: 1
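The core of a radial graph is an edge list connecting atoms that lie within the cutoff of each other. The sketch below shows that idea for an isolated cluster; it ignores periodic images and the n_layers setting, and `radial_edges` is a hypothetical name rather than the KLIFF RadialGraph transform.

```python
import math
from itertools import permutations


def radial_edges(positions, cutoff):
    """Return directed edges (i, j) for all atom pairs closer than cutoff."""
    edges = []
    for i, j in permutations(range(len(positions)), 2):
        if math.dist(positions[i], positions[j]) < cutoff:
            edges.append((i, j))
    return edges


# Three atoms on a line: the third is out of range of the other two.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = radial_edges(positions, cutoff=8.6)
print(edges)  # [(0, 1), (1, 0)]
```

This brute-force loop scales quadratically with the number of atoms; production code would normally use a neighbor list instead.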
5. training¶
Purpose¶
Controls the training loop, including the loss function, optimizer, learning rate scheduling, dataset splitting, and other hyperparameters like batch size and epochs.
Subsections¶
5.1 loss¶
function: Name of the loss function, e.g., MSE.
weights: Dictionary or path to a file specifying relative weighting of different terms (energy, forces, stress, etc.).
loss_traj: (Optional) Log the loss trajectory.
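How the weights combine the individual terms can be sketched as a weighted sum of per-property mean squared errors. This is an illustration under that assumption; KLIFF's exact loss definition (for example per-atom normalization) may differ, and the numbers are made up.

```python
def weighted_mse(predicted: dict, reference: dict, weights: dict) -> float:
    """Sum over properties of weight * mean squared error."""
    total = 0.0
    for name, weight in weights.items():
        pred, ref = predicted[name], reference[name]
        mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
        total += weight * mse
    return total


# Properties are given as flat lists of numbers for simplicity.
predicted = {"energy": [1.0], "forces": [0.0, 2.0]}
reference = {"energy": [0.0], "forces": [0.0, 0.0]}
weights = {"energy": 1.0, "forces": 0.5}

loss = weighted_mse(predicted, reference, weights)
print(loss)  # 2.0 = 1.0 * 1.0 (energy) + 0.5 * 2.0 (forces)
```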
5.2 optimizer¶
name: Name of the optimizer (e.g., L-BFGS-B, Adam).
provider: If needed, indicates which library (e.g., Torch).
learning_rate: Base learning rate.
kwargs: Additional args for the optimizer (e.g., tol for L-BFGS).
ema: (Optional) Exponential moving average parameter for advanced training stabilization.
5.3 lr_scheduler¶
name: Learning rate scheduler type (ReduceLROnPlateau, etc.).
args: Arguments that configure the scheduler (e.g., factor, patience, min_lr).
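The meaning of factor, patience, and min_lr can be illustrated with a simplified plateau scheduler written in plain Python. It is a sketch of the general idea only; the real ReduceLROnPlateau in PyTorch has further options (threshold, cooldown, mode) that are left out here, and `reduce_on_plateau` is a hypothetical name.

```python
def reduce_on_plateau(losses, lr, factor, patience, min_lr):
    """Return the learning rate in effect after each epoch's loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


losses = [1.0, 0.9, 0.9, 0.9, 0.9]
history = reduce_on_plateau(losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6)
print(history)  # the rate is halved only at the last epoch
```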
5.4 training_dataset / validation_dataset¶
train_size, val_size: Number of configurations or fraction of the total data.
train_indices, val_indices: (Optional) File paths specifying which indices belong to the train/val sets.
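Since the sizes may be given as a count or as a fraction, a splitting step has to resolve them first. The sketch below assumes that floats in (0, 1] are fractions and integers are counts; that convention and both function names are assumptions made for illustration, not documented KLIFF behavior.

```python
def resolve_size(size, n_total: int) -> int:
    """Turn a train_size/val_size entry into a number of configurations."""
    if isinstance(size, float):
        if not 0.0 < size <= 1.0:
            raise ValueError("fractional size must be in (0, 1]")
        return int(size * n_total)
    if not 0 < size <= n_total:
        raise ValueError("size must be between 1 and the dataset size")
    return size


def split_indices(n_total: int, train_size, val_size):
    """Assign the first block of indices to training, the next to validation."""
    n_train = resolve_size(train_size, n_total)
    n_val = resolve_size(val_size, n_total)
    if n_train + n_val > n_total:
        raise ValueError("train and validation sets would overlap")
    indices = list(range(n_total))
    return indices[:n_train], indices[n_train:n_train + n_val]


train, val = split_indices(4, train_size=3, val_size=1)
print(train, val)  # [0, 1, 2] [3]
```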
5.5 Additional Controls¶
batch_size: Number of configurations in each mini-batch.
epochs: How many iterations (epochs) to train.
device: Computation device, e.g. cpu or cuda.
num_workers: Parallel data loading processes.
ckpt_interval: How often (in epochs) to save a checkpoint.
early_stopping: Criteria for terminating training early.
  patience: Epochs to wait for improvement.
  min_delta: Smallest improvement threshold.
verbose: Print detailed logs if true.
log_per_atom_pred: Log predictions per atom.
Example¶
training:
  loss:
    function: MSE
    weights: "./weights.dat"
    normalize_per_atom: true
  optimizer:
    name: Adam
    learning_rate: 1.e-3
  lr_scheduler:
    name: ReduceLROnPlateau
    args:
      factor: 0.5
      patience: 5
      min_lr: 1.e-6
  training_dataset:
    train_size: 3
  validation_dataset:
    val_size: 1
  batch_size: 2
  epochs: 20
  device: cpu
  ckpt_interval: 2
  early_stopping:
    patience: 10
    min_delta: 1.e-4
  log_per_atom_pred: true
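The interplay of the early_stopping settings patience and min_delta can be sketched in a few lines. This shows the usual interpretation of the two settings, not the trainer's actual implementation; `stopping_epoch` is a hypothetical helper and the loss values are invented.

```python
def stopping_epoch(val_losses, patience, min_delta):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            # Improvement is large enough: record it and reset the counter.
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# After epoch 1 the loss improves by less than min_delta per epoch.
val_losses = [1.0, 0.5, 0.49995, 0.49994, 0.49993]
print(stopping_epoch(val_losses, patience=3, min_delta=1e-4))  # 4
```

With the values from the example above (patience 10, min_delta 1.e-4) and only 20 epochs, stopping would require ten consecutive epochs without an improvement larger than 1.e-4.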
6. export (Optional)¶
Purpose¶
Used to export the trained model for external usage (for instance, creating a KIM-API model or packaging everything into a tar file).
Common Keys¶
generate_tarball: Boolean deciding whether to create a .tar archive of the trained model and dependencies.
model_path: Directory to store the exported model.
model_name: Filename for the exported model.
Example¶
export:
  generate_tarball: True
  model_path: ./
  model_name: SW_StillingerWeber_trained_1985_Si__MO_405512056662_006
Example: Training a KIM Potential¶
Let us define a key-value dict directly and train a simple Stillinger-Weber Si potential.
Step 0: Get the dataset¶
!wget https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
!tar -xvf Si_training_set_4_configs.tar.gz
Si_training_set_4_configs/
Si_training_set_4_configs/Si_alat5.431_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.409_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.442_scale0.005_perturb1.xyz
Si_training_set_4_configs/Si_alat5.420_scale0.005_perturb1.xyz
Step 1: workspace config¶
Create a folder named SW_train_example, and use it for everything
workspace = {"name": "SW_train_example", "seed": 12345}  # `seed` is the key described in the workspace section
Step 2: define the dataset¶
dataset = {"type": "path", "path": "Si_training_set_4_configs", "shuffle": True}
Step 3: model¶
Install the KIM model if not already installed.
Tip
You can also provide a custom KIM model by defining the path to a valid KIM portable model. In that case KLIFF will install the model for you.
!kim-api-collections-management install user SW_StillingerWeber_1985_Si__MO_405512056662_006
model = {"name": "SW_StillingerWeber_1985_Si__MO_405512056662_006"}
Step 4: select parameters to be trained¶
transforms = {"parameter": ["A", "B", "sigma"]}
Step 5: training¶
Let's train it with SciPy's L-BFGS-B optimizer (physics-based models only work with SciPy optimizers), using a 3:1 train/validation split.
training = {
    "loss": {"function": "MSE"},
    "optimizer": {"name": "L-BFGS-B"},
    "training_dataset": {"train_size": 3},
    "validation_dataset": {"val_size": 1},
    "epochs": 10,  # plural, matching the `epochs` key described in the training section
}
Step 6: (Optional) export the model?¶
export = {"model_path":"./", "model_name": "MySW__MO_111111111111_000"} # name can be anything, but better to have KIM-API qualified name for convenience
Step 7: Put it all together, and pass to the trainer¶
training_manifest = {
"workspace": workspace,
"model": model,
"dataset": dataset,
"transforms": transforms,
"training": training,
"export": export
}
from kliff.trainer.kim_trainer import KIMTrainer
trainer = KIMTrainer(training_manifest)
trainer.train()
trainer.save_kim_model()
The model should now be trained. You can install it as:
!kim-api-collections-management install user MySW__MO_111111111111_000
Found local item named: MySW__MO_000000000000_000.
In source directory: /home/amit/Projects/COLABFIT/kliff/kliff/docs/source/introduction/MySW__MO_000000000000_000.
(If you are trying to install an item from openkim.org
rerun this command from a different working directory,
or rename the source directory mentioned above.)
Found installed driver... SW__MD_335816936951_005
[100%] Built target MySW__MO_000000000000_000
Install the project...
-- Install configuration: "Release"
-- Installing: /home/amit/.kim-api/2.3.0+v2.3.0.GNU.GNU.GNU.2022-07-11-20-25-52/portable-models-dir/MySW__MO_000000000000_000/libkim-api-portable-model.so
-- Set non-toolchain portion of runtime path of "/home/amit/.kim-api/2.3.0+v2.3.0.GNU.GNU.GNU.2022-07-11-20-25-52/portable-models-dir/MySW__MO_000000000000_000/libkim-api-portable-model.so" to ""
Success!
Let us quickly check the trained model. Here we use the ASE calculator to compute the energy and forces:
from ase.calculators.kim.kim import KIM
from ase.build import bulk
si = bulk("Si")
model = KIM("MySW__MO_111111111111_000")
si.calc = model
print(si.get_potential_energy())
print(si.get_forces())
Errors¶
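libstdc++ errors
/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /opt/mambaforge/mambaforge/envs/kliff/lib/libkim-api.so.2)
This indicates that your conda environment is not setting LD_LIBRARY_PATH correctly, so the system libstdc++ is picked up instead of the one shipped with the environment. You can fix this by prepending the environment's library directory to the LD_LIBRARY_PATH variable, for example with export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH (assuming the environment is activated, so that CONDA_PREFIX is set).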