Analyze
=======

Enter the output directory:

.. code-block:: bash

    cd out

Extract scores
--------------

.. code-block:: bash

    extract_scores.py

This should create a couple of files, including ``all_scores.csv`` and ``all_scores_sorted_uniq.csv``.

Create a CIF file of the top 10 models
--------------------------------------

.. code-block:: bash

    rebuild_atomic.py --top 10 --project_dir ../ elongator.json all_scores_sorted_uniq.csv

Open the EM map and the resulting models in UCSF Chimera and Xlink Analyzer. You should see that the structures fit the map, while the crosslinks are satisfied only to some extent (they are violated, although only slightly). Spoiler: this will be fixed in the refinement.

Quick checks
------------

* Quick test of convergence

  Run a quick test to assess the convergence of the model score in randomly selected modelling trajectories:

  .. code-block:: bash

      plot_convergence.R total_score_logs.txt 20

  (change ``20`` to include fewer or more trajectories in the plots).

  Open the resulting ``convergence.pdf`` to visualize the convergence. It plots the score evolution for the 20 runs. If the plots reach a plateau for all or most runs, the sampling has converged.

* Visualize score distributions

  .. code-block:: bash

      plot_scores.R all_scores.csv

  Open the resulting ``scores.pdf``. This can be used to:

  * evaluate the value ranges of the restraints and use this information to re-scale the weights, for example to bring all restraints to the same scale,
  * select score thresholds for the analysis below and for selecting models for the refinement.

Assess sampling exhaustiveness
------------------------------

Run the sampling performance analysis with the imp-sampcon tool (described by `Viswanath et al. 2017`_).

.. warning::
   For the global optimization, the sampling exhaustiveness test is not always applicable. In some cases, the optimization at this stage works so well that all or most models end up being the same, resulting in very few clusters. The sampling is then exhaustive under the assumptions encoded in the JSON file, but the sampling precision cannot be estimated. In such cases we recommend intensively refining the top (or all) models (e.g. with high initial temperatures in simulated annealing) to create a diverse set of models for the analysis.

#. Prepare the ``density.txt`` file:

   .. code-block:: bash

       create_density_file.py --project_dir ../ elongator.json --by_rigid_body

#. Prepare the ``symm_groups.txt`` file, which stores the information necessary to properly align homo-oligomeric structures:

   .. code-block:: bash

       create_symm_groups_file.py --project_dir ../ elongator.json params.py

#. Run the ``setup_analysis.py`` script to prepare the input files for the sampling exhaustiveness analysis:

   .. code-block:: bash

       setup_analysis.py -s all_scores.csv -o analysis -d density.txt

   Optionally, you can restrict the analysis to models scoring better than a given threshold:

   .. code-block:: bash

       setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh -16000

   ``--score_thresh`` filters out the rare, very poorly scoring models (the threshold can be adjusted based on the ``scores.pdf`` generated above).

#. Run the ``imp-sampcon exhaust`` tool (a command-line tool provided with IMP) to perform the actual analysis:

   .. code-block:: bash

       cd analysis
       imp_sampcon exhaust -n elongator \
           --rmfA sample_A/sample_A_models.rmf3 \
           --rmfB sample_B/sample_B_models.rmf3 \
           --scoreA scoresA.txt --scoreB scoresB.txt \
           -d ../density.txt \
           -m cpu_omp \
           -c 4 \
           -gp \
           -g 5.0 \
           --ambiguity ../symm_groups.txt
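
   The exhaustiveness test compares two independently drawn halves of the models. As a quick, optional sanity check (a minimal sketch using only the files referenced in the command above; one score per model is expected in each score file), you can confirm that both halves were created and are of comparable size:

   .. code-block:: bash

       # both samples should exist and contain a comparable number of models
       wc -l scoresA.txt scoresB.txt
       ls -lh sample_A/sample_A_models.rmf3 sample_B/sample_B_models.rmf3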

#. In the output you will get, among other files:

   * ``elongator.Sampling_Precision_Stats.txt``

     Estimation of the sampling precision. In this case it will be around 5-10 Angstrom.

   * Clusters obtained after clustering at the above sampling precision, in directories and files with names starting with ``cluster``, containing information about the models in each cluster and the cluster localization densities.

   * ``elongator.Cluster_Precision.txt``, listing the precision of each cluster, in this case between 7-20 Angstrom.

   * PDF files with the plots resulting from the exhaustiveness tests.

   See `Viswanath et al. 2017`_ for a detailed explanation of these concepts.

#. Optimize the plots

   The fonts and the value ranges of the X and Y axes in the default plots from ``imp_sampcon exhaust`` are frequently not optimal, so you may have to adjust them manually:

   #. Copy the original ``gnuplot`` scripts to the current ``analysis`` directory by executing:

      .. code-block:: bash

          copy_sampcon_gnuplot_scripts.py

      This will copy four scripts to the current directory:

      * ``Plot_Cluster_Population.plt`` for the ``elongator.Cluster_Population.pdf`` plot
      * ``Plot_Convergence_NM.plt`` for the ``elongator.ChiSquare.pdf`` plot
      * ``Plot_Convergence_SD.plt`` for the ``elongator.Score_Dist.pdf`` plot
      * ``Plot_Convergence_TS.plt`` for the ``elongator.Top_Score_Conv.pdf`` plot

   #. Edit the scripts to adjust the plots to your liking.

   #. Run the scripts again::

          gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
          gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
          gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
          gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt

#. Extract cluster models for visualization:

   .. code-block:: bash

       extract_cluster_models.py \
           --project_dir <project directory> \
           --outdir <output directory> \
           --ntop <number of top models to extract> \
           --scores <scores file> \
           Identities_A.txt Identities_B.txt <cluster models file> <project JSON file>

   For example, to extract the 5 top scoring models from cluster 0:

   .. code-block:: bash

       extract_cluster_models.py \
           --project_dir ../../ \
           --outdir cluster.0/ \
           --ntop 5 \
           --scores ../all_scores.csv \
           Identities_A.txt Identities_B.txt cluster.0.all.txt ../elongator.json

   The models are saved in the CIF format to the ``cluster.0`` directory.

#. If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:

   .. code-block:: bash

       mkdir recluster
       cd recluster/
       cp ../Distances_Matrix.data.npy .
       cp ../*ChiSquare_Grid_Stats.txt .
       cp ../*Sampling_Precision_Stats.txt .
       imp_sampcon exhaust -n elongator \
           --rmfA ../sample_A/sample_A_models.rmf3 \
           --rmfB ../sample_B/sample_B_models.rmf3 \
           --scoreA ../scoresA.txt --scoreB ../scoresB.txt \
           -d ../../density.txt \
           -m cpu_omp \
           -c 4 \
           -gp \
           --ambiguity ../../symm_groups.txt \
           --skip \
           --cluster_threshold 40

   Then generate the cluster models, updating the paths accordingly:

   .. code-block:: bash

       extract_cluster_models.py \
           --project_dir ../../../ \
           --outdir cluster.0/ \
           --ntop 1 \
           --scores ../../all_scores.csv \
           ../Identities_A.txt ../Identities_B.txt cluster.0.all.txt ../elongator.json
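
To inspect a cluster visually, you can open its localization densities together with the extracted models, for example in UCSF Chimera. This is only a sketch: it assumes you are in the ``analysis`` directory, that ``chimera`` is available on your command line, and that the densities were written as MRC files into the cluster directory (the exact file names depend on the groups defined in ``density.txt`` and on your IMP version):

.. code-block:: bash

    # load the localization densities and the extracted top-scoring models of cluster 0
    chimera cluster.0/*.mrc cluster.0/*.cif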

Conclusions
-----------

The analysis shows that the sampling converged and was exhaustive:

* the individual runs converge,
* the exhaustiveness has been reached at 5-10 Angstrom according to the statistical test,
* the two random samples of models have similar score distributions and are similarly distributed over the clusters.

.. warning::
   The precision estimates should be taken with caution and do not represent "a resolution", because:

   * the exhaustiveness is reached only under the assumptions taken (weights, rigid body definitions, etc.),
   * the global optimization samples from a set of pre-calculated fits, which constrains the search space; thus the precision is only defined within this space.

Thus, the recombination step is not necessary. Since, however, the crosslinks are still violated, we will still perform the refinement step.