Analyze
=======

Enter the output directory:

.. code-block:: bash

   cd out/refinement

Extract scores
--------------

.. code-block:: bash

   extract_scores.py --multi

Visualize the score distributions (they will be needed later to infer score thresholds):

.. code-block:: bash

   plot_scores.R all_scores.csv

Create a CIF file of the top 10 models
--------------------------------------

.. code-block:: bash

   rebuild_atomic.py --project_dir <project_dir> --top 10 all_scores_sorted_uniq.csv --rmf_auto

(``<project_dir>`` is a placeholder for the path to your project directory.)

Note that no JSON project file is given as an argument: here each refinement run had its own project file (pointing to a different input structure), and ``rebuild_atomic.py`` will locate the JSON project files automatically based on the model IDs in the ``all_scores_sorted_uniq.csv`` file.

``--project_dir`` is necessary if you use relative paths in the JSON project file.

``--rmf_auto`` will read the beads from the RMF file and use Modeller to re-build the full atomic loops!

For example:

.. code-block:: bash

   rebuild_atomic.py --project_dir ../../ --top 10 all_scores_sorted_uniq.csv --rmf_auto

Open the EM map and the models in UCSF Chimera and Xlink Analyzer. You will see that the structures still fit the map well, but the crosslinks are now satisfied!

Quick convergence check
-----------------------

.. code-block:: bash

   plot_convergence.R total_score_logs.txt 20

Open the resulting ``scores.pdf``.

Assess sampling exhaustiveness
------------------------------

Run the sampling performance analysis with the ``imp_sampcon`` tool (described by Viswanath et al. 2017):

#. Prepare the ``density.txt`` file:

   .. code-block:: bash

      create_density_file.py --project_dir <project_dir> <run_dir>/out/elongator_refine.json --by_rigid_body

   e.g.:

   .. code-block:: bash

      create_density_file.py --project_dir ../../ 0000171/out/elongator_refine.json --by_rigid_body

   replacing ``0000171`` with the name of any directory in your refinement directory.

#. Prepare the ``symm_groups.txt`` file storing the information necessary to properly align homo-oligomeric structures:
   .. code-block:: bash

      create_symm_groups_file.py --project_dir <project_dir> <run_dir>/out/elongator_refine.json <run_dir>/out/params_refine.py

   e.g.:

   .. code-block:: bash

      create_symm_groups_file.py --project_dir ../../ 0000171/out/elongator_refine.json 0000171/out/params_refine.py

#. Run the ``setup_analysis.py`` script to prepare the input files for the sampling exhaustiveness analysis:

   .. code-block:: bash

      setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh 70000

   Here we use a score threshold derived from the corresponding score distribution in ``scores.pdf`` to filter out poorly fitting models.

#. Run the ``imp_sampcon exhaust`` tool (a command-line tool provided with IMP) to perform the actual analysis:

   .. code-block:: bash

      cd analysis
      imp_sampcon exhaust -n elongator \
          --rmfA sample_A/sample_A_models.rmf3 \
          --rmfB sample_B/sample_B_models.rmf3 \
          --scoreA scoresA.txt --scoreB scoresB.txt \
          -d ../density.txt \
          -m cpu_omp \
          -c 4 \
          -gp \
          -g 5.0 \
          --ambiguity ../symm_groups.txt

#. In the output you will get, among other files:

   * ``elongator.Sampling_Precision_Stats.txt`` with an estimate of the sampling precision (in this case it will be around 20 Angstrom)

   * the clusters obtained after clustering at the above sampling precision, in directories and files whose names start with ``cluster``, containing information about the models in each cluster and the cluster localization densities

   * ``elongator.Cluster_Precision.txt`` listing the precision for each cluster (in this case around 10-20 Angstrom)

   * PDF files with plots of the results of the exhaustiveness tests

   See Viswanath et al. 2017 for a detailed explanation of these concepts.

#. Optimize the plots.

   The fonts and the X- and Y-axis value ranges in the default plots from ``imp_sampcon exhaust`` are frequently not optimal, so you have to adjust them manually.

#. Copy the original ``gnuplot`` scripts to the current ``analysis`` directory by executing:
   .. code-block:: bash

      copy_sampcon_gnuplot_scripts.py

   This will copy four scripts to the current directory:

   * ``Plot_Cluster_Population.plt`` for the ``elongator.Cluster_Population.pdf`` plot
   * ``Plot_Convergence_NM.plt`` for the ``elongator.ChiSquare.pdf`` plot
   * ``Plot_Convergence_SD.plt`` for the ``elongator.Score_Dist.pdf`` plot
   * ``Plot_Convergence_TS.plt`` for the ``elongator.Top_Score_Conv.pdf`` plot

#. Edit the scripts to adjust the plots to your liking.

#. Run the scripts again::

      gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt

#. Extract the cluster models, for example for the top cluster:

   .. code-block:: bash

      extract_cluster_models.py \
          --project_dir ../../../ \
          --outdir cluster.0/ \
          --ntop 5 \
          --scores ../all_scores.csv \
          --rebuild_loops \
          Identities_A.txt Identities_B.txt cluster.0.all.txt

   Note that no JSON file is given as the last argument (contrary to :doc:`analysis_denovo`): for the refinement of multiple models the program will find the JSON project files itself (as they are different for each model).

#. If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:

   .. code-block:: bash

      mkdir recluster
      cd recluster/
      cp ../Distances_Matrix.data.npy .
      cp ../*ChiSquare_Grid_Stats.txt .
      cp ../*Sampling_Precision_Stats.txt .
      imp_sampcon exhaust -n elongator \
          --rmfA ../sample_A/sample_A_models.rmf3 \
          --rmfB ../sample_B/sample_B_models.rmf3 \
          --scoreA ../scoresA.txt --scoreB ../scoresB.txt \
          -d ../density.txt \
          -m cpu_omp \
          -c 4 \
          -gp \
          --ambiguity ../../symm_groups.txt \
          --skip \
          --cluster_threshold 25 \
          --voxel 2

   and then generate the cluster models, updating the paths:
   .. code-block:: bash

      extract_cluster_models.py \
          --project_dir ../../../../ \
          --outdir cluster.0/ \
          --ntop 1 \
          --scores ../../all_scores.csv \
          --rebuild_loops \
          ../Identities_A.txt ../Identities_B.txt cluster.0.all.txt

Conclusions
-----------

The analysis shows that the sampling was exhaustive:

* the individual runs converge
* the exhaustiveness has been reached at 10-20 Angstrom according to the statistical test
* two random samples of models have similar score distributions and are similarly distributed over the clusters

.. warning:: The precision estimates should still be taken with caution, as for the previous stage, but here the estimates are a bit more realistic because the original models were allowed to explore a larger conformational space.

The crosslinks are now satisfied in the top-scoring model.

For further biological hypothesis generation based on the model, we would use the top-scoring models from each cluster and from all runs. We might use the top-scoring model as a representative model for figures. As the "final output", however, we would use the ensemble of models (e.g. from the clusters, or the top 10 models from all runs) for depicting the uncertainty.
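As a minimal sketch of how such an ensemble could be gathered in practice, the snippet below collects the per-cluster models into a single directory. It assumes that ``extract_cluster_models.py`` left model files (assumed here to be ``*.pdb``) inside the ``cluster.*`` directories; both the directory layout and the file extension are illustrative assumptions, not guaranteed by the tools above.

.. code-block:: bash

   #!/bin/sh
   # collect_ensemble: copy every PDB file found under cluster.*/ into an
   # "ensemble" directory, prefixing each file with its cluster name so that
   # models from different clusters do not overwrite each other.
   # ASSUMPTION: extract_cluster_models.py wrote *.pdb files into cluster.*/.
   collect_ensemble() {
       mkdir -p ensemble
       for d in cluster.*/; do
           [ -d "$d" ] || continue          # no cluster directories yet
           c=${d%/}                          # e.g. "cluster.0"
           for f in "$d"*.pdb; do
               [ -e "$f" ] || continue       # cluster without PDB files
               cp "$f" "ensemble/${c}_$(basename "$f")"
           done
       done
   }

Running ``collect_ensemble`` from the ``analysis`` directory would then leave one prefixed copy of every cluster model under ``ensemble/``, ready for side-by-side inspection in UCSF Chimera.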