Trainer Manifest
================

KLIFF uses YAML configuration files to control the training of
interatomic potentials with machine-learning models. A typical
configuration file is divided into the following top-level sections:

1. **workspace**
2. **dataset**
3. **model**
4. **transforms**
5. **training**
6. **export** (optional)

Each section is itself a dictionary with keys and values that specify
particular settings. The minimal required sections are typically
``workspace``, ``dataset``, ``model``, and ``training``, while
``transforms`` and ``export`` are optional but often useful. Especially
``transforms`` is almost always used for ML models, for transforming the
coordinates.

Below is a general explanation of each section, along with examples.
Refer to the provided example configuration files to see these in
practice.

1. ``workspace``
----------------

Purpose
~~~~~~~

The ``workspace`` section manages where training runs are stored, random
seeds, and other essential housekeeping. By specifying a seed here, you
ensure reproducible results.

Common Keys
~~~~~~~~~~~

-  **name**: Name of the main workspace folder to create or use.
-  **seed**: Random seed for reproducibility.
-  **resume**: (Optional) Whether to resume from a previous checkpoint.

Example
~~~~~~~

.. code:: yaml

   workspace:
     name: test_run
     seed: 12345
     resume: False

2. ``dataset``
--------------

.. _purpose-1:

Purpose
~~~~~~~

Specifies how to load and configure the training (and validation) data.
KLIFF can process data from various sources (ASE, file paths, ColabFit,
etc.). This section tells KLIFF how to interpret your dataset and which
properties (energy, forces, etc.) to use.

.. _common-keys-1:

Common Keys
~~~~~~~~~~~

-  **type**: Dataset format, e.g. ``ase``, ``path``, or ``colabfit``.
-  **path**: Path to the dataset if using ``ase`` or ``path`` (ignored
   for ``colabfit``).
-  **shuffle**: Whether to shuffle the data.
-  **save**: Whether to store a preprocessed version of the dataset on
   disk.
-  **dynamic_loading**: (Optional) If true, loads data in chunks at
   runtime (for large datasets).
-  **keys**: A sub-dict mapping property names in the raw dataset to
   standardized ones recognized by KLIFF (``energy``, ``forces`` etc.).

.. _example-1:

Example
~~~~~~~

.. code:: yaml

   dataset:
     type: ase
     path: Si.xyz
     save: False
     shuffle: True
     keys:
       energy: Energy
       forces: forces

3. ``model``
------------

.. _purpose-2:

Purpose
~~~~~~~

Defines the model used to fit the interatomic potential. KLIFF
supports multiple backends, including KIM models (``kim`` type) and
Torch/PyTorch-based ML models (``torch`` type).

.. _common-keys-2:

Common Keys
~~~~~~~~~~~

-  **type**: (Optional) Potential backend, such as ``kim`` or ``torch``.
-  **name**: Identifier for the model; for KIM, a recognized KIM model
   name; for Torch, a ``.pt`` file or descriptive string.
-  **path**: Filesystem path where the model is loaded/saved.
-  **input_args**: (Torch-specific) Lists the data fields that feed into
   the model’s forward pass (e.g., ``z``, ``coords``, etc.).
-  **precision**: (Torch-specific) Set to ``double`` or ``single``;
   currently ``double`` is typically used.

.. tip::

   For a custom/ non-torch script exportable model, the user need to manually intantiate the trainer class with the model, and config dict.

Example (KIM Model)
~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   model:
     path: ./
     name: SW_StillingerWeber_1985_Si__MO_405512056662_006

Example (Torch Model)
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   model:
     path: ./model_dnn.pt
     name: "TorchDNN"

Example (Torch GNN Model)
~~~~~~~~~~~~~~~~~~~~~~~~~

**Model to be provided manually at runtime**

.. code:: yaml

   model:
     type: torch
     path: ./
     name: "TorchGNN2"
     input_args:
       - z
       - coords
       - edge_index0
       - contributions
     precision: double

--------------

4. ``transforms``
-----------------

.. _purpose-3:

Purpose
~~~~~~~

Allows modifications to the data or the model parameters before or
during training. These can be transformations on classical potential
parameters (e.g., applying a log transform) or on the configuration data
(e.g., generating descriptors or graph representations for ML models).

.. _common-keys-3:

Common Keys
~~~~~~~~~~~

-  **parameter**: A list of classical potential parameters that can be
   optimized or transformed. Parameters can be simple strings or
   dictionaries defining a transform (e.g., ``LogParameterTransform``
   with bounds).
-  **configuration**: Typically used for ML-based or Torch-based models
   to specify data transforms. For instance, computing a descriptor or
   building a graph adjacency.
-  **properties**: Transform the dataset-wide properties like energy and
   forces. Usually it is used to normalize the energy/forces.

Example (Parameter Transform for KIM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Allow the model to sample in log space. The transformed parameter list
in KIM models will be treated as the parameters which are to be trained.

.. code:: yaml

   transforms:
     parameter:
       - A
       - B
       - sigma:
           transform_name: LogParameterTransform
           value: 2.0
           bounds: [[1.0, 10.0]]

Example (Configuration Transform for Torch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Map the coordinates to Behler symmetry function (all keywords are case
sensitive).

.. code:: yaml

   transforms:
     configuration:
       name: Descriptor
       kwargs:
         cutoff: 4.0
         species: ["Si"]
         descriptor: SymmetryFunctions
         hyperparameters: "set51"

Example (Graph Transform)
~~~~~~~~~~~~~~~~~~~~~~~~~

Generate radial edge graphs for GNNs.

.. code:: yaml

   transforms:
     configuration:
       name: RadialGraph
       kwargs:
         cutoff: 8.6
         species: ["H", "He", "Li", ..., "Og"]  # entire periodic table example
         n_layers: 1

5. ``training``
---------------

.. _purpose-4:

Purpose
~~~~~~~

Controls the training loop, including the **loss function**,
**optimizer**, **learning rate scheduling**, dataset splitting, and
other hyperparameters like batch size and epochs.

Subsections
~~~~~~~~~~~

5.1 ``loss``
^^^^^^^^^^^^

-  **function**: Name of the loss function, e.g., ``MSE``.
-  **weights**: Dictionary or path to a file specifying relative
   weighting of different terms (energy, forces, stress, etc.).
-  **loss_traj**: (Optional) Log the loss trajectory.

5.2 ``optimizer``
^^^^^^^^^^^^^^^^^

-  **name**: Name of the optimizer (e.g., ``L-BFGS-B``, ``Adam``).
-  **provider**: If needed, indicates which library (e.g., Torch).
-  **learning_rate**: Base learning rate.
-  **kwargs**: Additional args for the optimizer (e.g., ``tol`` for
   L-BFGS).
-  **ema**: (Optional) Exponential moving average parameter for advanced
   training stabilization.

5.3 ``lr_scheduler``
^^^^^^^^^^^^^^^^^^^^

-  **name**: Learning rate scheduler type (``ReduceLROnPlateau``, etc.).
-  **args**: Arguments that configure the scheduler (e.g., ``factor``,
   ``patience``, ``min_lr``).

5.4 ``training_dataset`` / ``validation_dataset``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-  **train_size**, **val_size**: Number of configurations or fraction of
   the total data.
-  **train_indices**, **val_indices**: (Optional) File paths specifying
   which indices belong to the train/val sets.

5.5 Additional Controls
^^^^^^^^^^^^^^^^^^^^^^^

-  **batch_size**: Number of configurations in each mini-batch.
-  **epochs**: How many iterations (epochs) to train.
-  **device**: Computation device, e.g. ``cpu`` or ``cuda``.
-  **num_workers**: Parallel data loading processes.
-  **ckpt_interval**: How often (in epochs) to save a checkpoint.
-  **early_stopping**: Criteria for terminating training early.

   -  **patience**: Epochs to wait for improvement.
   -  **min_delta**: Smallest improvement threshold.

-  **verbose**: Print detailed logs if ``true``.
-  **log_per_atom_pred**: Log predictions per atom.

.. _example-2:

Example
~~~~~~~

.. code:: yaml

   training:
     loss:
       function: MSE
       weights: "./weights.dat"
       normalize_per_atom: true
     optimizer:
       name: Adam
       learning_rate: 1.e-3
       lr_scheduler:
         name: ReduceLROnPlateau
         args:
           factor: 0.5
           patience: 5
           min_lr: 1.e-6

     training_dataset:
       train_size: 3
     validation_dataset:
       val_size: 1

     batch_size: 2
     epochs: 20
     device: cpu
     ckpt_interval: 2
     early_stopping:
       patience: 10
       min_delta: 1.e-4
     log_per_atom_pred: true

6. ``export`` (Optional)
------------------------

.. _purpose-5:

Purpose
~~~~~~~

Used to export the trained model for external usage (for instance,
creating a KIM-API model or packaging everything into a tar file).

.. _common-keys-4:

Common Keys
~~~~~~~~~~~

-  **generate_tarball**: Boolean deciding whether to create a ``.tar``
   archive of the trained model and dependencies.
-  **model_path**: Directory to store the exported model.
-  **model_name**: Filename for the exported model.
-  **driver_version**: Specific driver version you want to target for export. Only supported for TorchML driver currently.

.. _example-3:

Example
~~~~~~~

.. code:: yaml

   export:
     generate_tarball: True
     model_path: ./
     model_name: SW_StillingerWeber_trained_1985_Si__MO_405512056662_006

--------------

Example: Training a KIM Potential
=================================

Let us define a vey value dict directly and try to train a simple
Stillinger-Weber Si potential

Step 0: Get the dataset
-----------------------

In your shell (or notebook with ``!``).

.. code-block:: bash

    wget https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
    tar -xvf Si_training_set_4_configs.tar.gz


.. parsed-literal::

    --2025-02-27 12:10:06--  https://raw.githubusercontent.com/openkim/kliff/main/examples/Si_training_set_4_configs.tar.gz
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 7691 (7.5K) [application/octet-stream]
    Saving to: ‘Si_training_set_4_configs.tar.gz.1’
    
    Si_training_set_4_c 100%[===================>]   7.51K  --.-KB/s    in 0s      
    
    2025-02-27 12:10:07 (30.7 MB/s) - ‘Si_training_set_4_configs.tar.gz.1’ saved [7691/7691]
    
    Si_training_set_4_configs/
    Si_training_set_4_configs/Si_alat5.431_scale0.005_perturb1.xyz
    Si_training_set_4_configs/Si_alat5.409_scale0.005_perturb1.xyz
    Si_training_set_4_configs/Si_alat5.442_scale0.005_perturb1.xyz
    Si_training_set_4_configs/Si_alat5.420_scale0.005_perturb1.xyz


Step 1: workspace config
------------------------

Create a folder named ``SW_train_example``, and use it for everything

.. code-block:: python

    workspace = {"name": "SW_train_example", "random_seed": 12345}

Step 2: define the dataset
--------------------------

.. code-block:: python

    dataset = {"type": "path", "path": "Si_training_set_4_configs", "shuffle": True}

Step 3: model
-------------

Install the KIM model if not already installed.

.. tip::

   You can also provide custom KIM model by defining the `path` to a valid KIM portable model. In that case KLIFF will install the model for you.

.. code:: bash

    kim-api-collections-management install user SW_StillingerWeber_1985_Si__MO_405512056662_006


.. parsed-literal::

    Item 'SW_StillingerWeber_1985_Si__MO_405512056662_006' already installed in collection 'user'.
    
    Success\!


.. code-block:: python

    model = {"name": "SW_StillingerWeber_1985_Si__MO_405512056662_006"}

Step 4: select parameters to be trained
---------------------------------------

.. code-block:: python

    transforms = {"parameter": ["A", "B", "sigma"]}

Step 5: training
----------------

Lets train it using scipy, lbfgs optimizer (physics based models can
only work with scipy optimizers). With test train split of 1:3.

.. code-block:: python

    training = {
        "loss" : {"function" : "MSE"},
        "optimizer": {"name": "L-BFGS-B"},
        "training_dataset" : {"train_size": 3},
        "validation_dataset" : {"val_size": 1},
        "epoch" : 10
    }

Step 6: (Optional) export the model?
------------------------------------

.. code-block:: python

    export = {"model_path":"./", "model_name": "MySW__MO_111111111111_000"} # name can be anything, but better to have KIM-API qualified name for convenience

Step 7: Put it all together, and pass to the trainer
----------------------------------------------------

.. code-block:: python

    training_manifest = {
        "workspace": workspace,
        "model": model,
        "dataset": dataset,
        "transforms": transforms,
        "training": training,
        "export": export
    }

.. code-block:: python

    from kliff.trainer.kim_trainer import KIMTrainer
    
    trainer = KIMTrainer(training_manifest)
    trainer.train()
    trainer.save_kim_model()


.. parsed-literal::

    2025-02-27 13:31:08.806 | INFO     | kliff.trainer.base_trainer:initialize:343 - Seed set to 12345.
    2025-02-27 13:31:08.809 | INFO     | kliff.trainer.base_trainer:setup_workspace:390 - Either a fresh run or resume is not requested. Starting a new run.
    2025-02-27 13:31:08.811 | INFO     | kliff.trainer.base_trainer:initialize:346 - Workspace set to SW_train_example/SW_StillingerWeber_1985_Si__MO_405512056662_006_2025-02-27-13-31-08.
    2025-02-27 13:31:08.818 | INFO     | kliff.dataset.dataset:add_weights:1126 - No explicit weights provided.
    2025-02-27 13:31:08.819 | INFO     | kliff.dataset.dataset:add_weights:1131 - Weights set to the same value for all configurations.
    2025-02-27 13:31:08.820 | INFO     | kliff.trainer.base_trainer:initialize:349 - Dataset loaded.
    2025-02-27 13:31:08.822 | WARNING  | kliff.trainer.base_trainer:setup_dataset_transforms:524 - Configuration transform module name not provided.Skipping configuration transform.
    2025-02-27 13:31:08.823 | INFO     | kliff.trainer.base_trainer:setup_dataset_split:601 - Training dataset size: 3
    2025-02-27 13:31:08.824 | INFO     | kliff.trainer.base_trainer:setup_dataset_split:609 - Validation dataset size: 1
    2025-02-27 13:31:08.827 | INFO     | kliff.trainer.base_trainer:initialize:354 - Train and validation datasets set up.
    2025-02-27 13:31:09.208 | INFO     | kliff.models.kim:get_model_from_manifest:782 - Model SW_StillingerWeber_1985_Si__MO_405512056662_006 is already installed, continuing ...
    2025-02-27 13:31:09.220 | INFO     | kliff.trainer.base_trainer:initialize:358 - Model loaded.
    2025-02-27 13:31:09.221 | INFO     | kliff.trainer.base_trainer:initialize:363 - Optimizer loaded.
    2025-02-27 13:31:09.227 | INFO     | kliff.trainer.base_trainer:save_config:475 - Configuration saved in SW_train_example/SW_StillingerWeber_1985_Si__MO_405512056662_006_2025-02-27-13-31-08/4b78c8b75efa6dbe06a2bb42588dfa5d.yaml.
    2025-02-27 13:31:09.361 | INFO     | kliff.trainer.kim_trainer:train:201 - Optimization successful: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
    2025-02-27 13:31:09.364 | INFO     | kliff.models.kim:write_kim_model:657 - KLIFF trained model write to `/home/amit/Projects/COLABFIT/kliff/kliff/docs/source/introduction/MySW__MO_000000000000_000`
    2025-02-27 13:31:11.476 | INFO     | kliff.trainer.kim_trainer:save_kim_model:239 - KIM model saved at MySW__MO_000000000000_000


The model should now be trained, you can install it as:

.. code:: bash

    !kim-api-collections-management install user MySW__MO_111111111111_000


.. parsed-literal::

    Found local item named: MySW__MO_000000000000_000.
    In source directory: /home/amit/Projects/COLABFIT/kliff/kliff/docs/source/introduction/MySW__MO_000000000000_000.
       (If you are trying to install an item from openkim.org
        rerun this command from a different working directory,
        or rename the source directory mentioned above.)
    
    Found installed driver... SW__MD_335816936951_005
    [100%] Built target MySW__MO_000000000000_000
    Install the project...
    -- Install configuration: "Release"
    -- Installing: /home/amit/.kim-api/2.3.0+v2.3.0.GNU.GNU.GNU.2022-07-11-20-25-52/portable-models-dir/MySW__MO_000000000000_000/libkim-api-portable-model.so
    -- Set non-toolchain portion of runtime path of "/home/amit/.kim-api/2.3.0+v2.3.0.GNU.GNU.GNU.2022-07-11-20-25-52/portable-models-dir/MySW__MO_000000000000_000/libkim-api-portable-model.so" to ""
    
    Success!


Let us quickly check the trained model, here we are using the ASE
calculator to check the energy and forces

.. code-block:: python

    from ase.calculators.kim.kim import KIM
    from ase.build import bulk
    
    si = bulk("Si")
    model = KIM("MySW__MO_111111111111_000")
    si.calc = model
    print(si.get_potential_energy())
    print(si.get_forces())

Errors
------

1. ``libstd++`` errors

..

   /lib/x86_64-linux-gnu/libstdc++.so.6: version \`GLIBCXX_3.4.29’ not
   found (required by
   /opt/mambaforge/mambaforge/envs/kliff/lib/libkim-api.so.2)

This indicates that your conda environment is not properly setting up
the ``LD_LIBRARY_PATH``. You can fix this by running the following
command:

.. code:: bash

   export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

This should prepend the correct ``libstd++`` path to the
``LD_LIBRARY_PATH`` variable.