This Blog is Untitled: Dealing with aggregates: I don't like tedious work

I used to study nanoparticles (NPs) self-assembly in polymer matrices before 2017 by coarse-grained simulation method. Analyzing morphologies of self-assemblies is a tedious work: one needs to repeatedly watch frames from simulation trajectories, finding appropriate view points, counting number of aggregates, measure size of aggregates, etc. Obtaining integrate aggregates under periodic boundary conditions is also a challenging work. In addition, one usually needs to run plenty of simulations under different conditions to find a desired morphology; selecting desired morphologies automatically is even more challenging: most of the morphologies are ill-defined, it is hard to use some common characterization methods, e.g., $g(r)$ or $S(q)$ of NPs give little differences amongst percolated morphologies.
Based on such demands, I designed a co-pilot which can:

Automatically clustering the clusters;
Remove periodic boundary conditions and make the center-of-mass at $(0,0,0)^T$;
Adjust the view vector along the minor axis of the aggregate;
Classify the aggregate.

In step 1, firstly, I calculate the $g(r)$ of NP centers and find $r$ where $g(r)$ reaches its 1st valley, simultaneously, the distance matrix under periodic boundary condition. Using this distance matrix and $r$, a DBSCAN or HDBSCAN algorithm is performed to cluster NPs. In simulations, usually systems with very low loading of NPs are considered, DBSCAN is efficient enough; for some co-polymer self-assembly systems, a coarse-grained method is preferred: divide simulation box into grids and DBSCAN on the grids with density as weights, then perform DBSCAN on each coarse-grained clusters. DFS or BFS + Linked Cell List is also a good choice if particles are too many. All these struggles are due to that kd-tree is not working under periodic boundary conditions.

After obtaining clusters in step 1, we must remove periodic boundary conditions of the cluster. If in step 1, one uses BFS or DFS + Linked Cell List method, then one can remove periodic boundary condition during clustering; but this method has limitations, it does not work properly if the cluster is percolated throughout the box. Therefore, in this step, I use circular statistics to deal with the clusters. In periodic boundary condition simulation box, distance between an NP in an aggregate and the midpoint will never exceed $L/2$ in corresponding box dimension. Midpoint here is not center-of-mass, e.g., distance between a nozzle point and center-of-mass of an ear-syringe-bulb is clearly larger than its half length; midpoint is a actually a "de-duplicated" center-of-mass. Besides, circular mean also puts most points in the center in case of percolation. Therefore, in part 2, we have following steps:

Choose a $r_\text{cut}$ to test whether the aggregate is percolate;
If the aggregate is percolate, evaluate the circular mean of all points $r_c$;
Set all coordinates $r$ as $r\to pbc(r-r_c)$;
If the aggregate is not percolate, midpoint is evaluated by calculating circular mean of coordinates $r$ where $\rho(r)>0$, $\rho(r)$ is calculated using bin size that smaller than $r_\text{cut}$ used in step 1;
Same as step 3, update coordinates;
After step 5, the aggregates are unwrapped from the box, set $r\to r-\overline{r}$ to set center-of-mass at $(0,0,0)^T$

Circular mean $\alpha$ of samples $\lbrace\alpha_i\rbrace$ is defined as the minimization of summation of a circular metric function $d(\alpha,\beta):=1-\cos(\alpha-\beta)$
$$\alpha=\underset{\beta}{\operatorname{argmin}}\sum_i (1-\cos(\alpha_i-\beta))$$
Adjusting the view vector is simple, evaluate the eigenspace of gyration tensor as $rr^T/n$ and sort the eigenvectors by eigenvalue, i.e., $\lambda_1\ge\lambda_2\ge\lambda_3$, then the minor axis is corresponding eigenvector $v_3$, the aggregate then can be rotated by $[v_1, v_2, v_3]$ so that the minor axis is $z$-axis.

The last step is a bit more tricky, the best trail I attempted was to use SVC, a binary classification method. I used about 20 samples labeled as "desired", these 20 samples were extended to 100 samples by adding some noises to the samples, e.g., moving coordinates a little bit, adding several NPs into the aggregate or removing several NPs randomly, without "breaking" the category of the morphology. Together with 100 "undesired" samples, I trained the SVC with a Gaussian kernel. The result turned out to be pretty good. I also tried to use ANN to classify all 5 categories of morphologies obtained from simulations, but ANN model did not work very well, perhaps the reason was lack of samples or the model I built was too rough. I didn't try other multi-class methods, anyway, that part of work was done, I stopped developing this co-pilot long time ago.