PolySet-dev

A framework for representing polymers as ensembles rather than single objects.

⛔️

Why PolySet?

In polymer science, one principle is foundational:

A polymer is not a single molecule,
but a distribution of chains.

This population — spanning short, long, and rare chains — governs the material’s behavior.
It shapes mechanical response, diffusion, entanglement, and every property that depends on chain statistics. The ensemble is not an approximation; it is the defining reality of polymer matter.

Yet, when polymers enter machine-learning workflows, this fundamental principle is often lost. Current representations collapse a whole distribution into a single idealized structure,
treating a polymer as if it were monodisperse and deterministic.

This simplification erases the very feature that distinguishes polymers from small molecules: their inherent stochastic nature.

PolySet restores this missing dimension.
It represents a polymer as a finite, weighted ensemble of chains — a form that respects polymer physics while remaining fully compatible with modern machine-learning encoders.

💡

What PolySet is?

PolySet represent a polymer as a collection of representative chains, each carrying a weight that reflects its likelihood in the material. Instead of a monolithic description, PolySet builds a finite ensemble that captures the variability naturally present in a polymer sample.

This ensemble serves as a foundation for:

downstream machine-learning tasks,
property prediction,
data augmentation,
or simply a more faithful representation of the material itself.

PolySet changes how the polymer is presented to the model,
not how the model learns.

💡

How PolySet works?

In real materials, the molecular-weight distribution (MWD) is determined by the synthesis process. Yet current polymer databases almost never report this distribution and typically provide only Mn and Đ.

PolySet reconstructs a plausible chain-length distribution consistent with these two quantities by generating a finite ensemble of representative chains with statistical weights.

Because different synthesis routes can produce different tail behaviors, PolySet offers four plausible sampling modes (basic, short, middle and long) as illustrated below.

These modes do not change the polymer itself. They simply reflect different, physically reasonable shapes that a distribution may take when only (Mn, Đ) are known.

Each chain in the ensemble:

corresponds to a plausible member of the polymer population,
respects the provided Mn and Đ,
carries a weight derived from the reconstructed distribution,
contributes proportionally when constructing polymer-level embeddings.

This provides machine-learning models with a representation that reflects the statistical nature of polymer matter.

💡

What PolySet Enables?

PolySet replaces the traditional “one polymer = one object” paradigm with a more realistic ensemble-based representation. This shift unlocks four key capabilities:

Physical realism:

The representation incorporates the true variability of the sample, rather than collapsing it into a single symbolic structure.

Expressiveness:

Two polymers with identical Mn and Đ can differ substantially in their underlying chain-length distributions. PolySet makes these differences visible to downstream models. The example below illustrates how embeddings derived from the four sampling modes naturally separate in feature space — even though all modes share the same Mn and Đ.

Stability:

Properties driven by long or rare chains become easier to learn, reducing noise and improving the robustness of ML workflows.

Compatibility:

PolySet does not replace chemical encoders. It acts as a representation layer that can be plugged into any downstream ML model.

Together, these features make PolySet a universal bridge between polymer physics
and modern machine learning.

💡

Vision

Polymers are among the most intrinsically statistical materials in nature. Yet their digital representations have remained deterministic for decades.

In its current release, the framework addresses homopolymers, representing them as distributions of chain lengths — the most fundamental expression of their stochastic nature.

This first step opens the door to a broader reframing:

random copolymers,
block copolymers,
grafted and star polymers.

The ambition of PolySet is to build a unified language for all forms of non-deterministic polymer matter — a language faithful to the statistical nature of the materials it seeks to describe.

🌟

Try PolySet

PyPI package:
pip install polysetlib

GitHub repository:
https://github.com/kFERJI/PolySet

Dataset (Zenodo):
A curated dataset of PolySet ensembles for testing and benchmarking is available here.
ferji, . khalid . (2025). PolySet: Statistical Ensemble Polymer Embeddings Dataset for Machine Learning [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17861022