Analyze

Enter the output directory.

cd out/refinement

Extract scores

extract_scores.py --multi

Visualize score distributions (will be needed later to infer score thresholds)

plot_scores.R all_scores.csv

Create a CIF file of the top 10 models

rebuild_atomic.py --project_dir <full path to the original project directory> --top 10 all_scores_sorted_uniq.csv  --rmf_auto

Note that no JSON project file is passed as an argument: here each refinement run had its own project file (pointing to a different input structure), and rebuild_atomic.py will locate the JSON project files automatically based on the model IDs in the all_scores_sorted_uniq.csv file.

--project_dir is necessary if you use relative paths in the JSON project file. --rmf_auto will read the beads from the RMF file and use Modeller to rebuild full atomic loops.

For example here:

rebuild_atomic.py --project_dir ../../ --top 10 all_scores_sorted_uniq.csv  --rmf_auto

Open the EM map and the models in UCSF Chimera and Xlink Analyzer. You should see that the structures still fit the map well and that the crosslinks are now satisfied.

Quick convergence check:

plot_convergence.R total_score_logs.txt 20

Open the resulting scores.pdf

Assess sampling exhaustiveness

Run the sampling performance analysis with the imp-sampcon tool (described by Viswanath et al. 2017).

Prepare the density.txt file

create_density_file.py --project_dir <full path to the original project directory> <path_to_one_of_the_refinement_folders>/out/elongator_refine.json --by_rigid_body

e.g.

create_density_file.py --project_dir ../../ 0000171/out/elongator_refine.json --by_rigid_body

replacing 0000171 with the name of any directory in your refinement directory.
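If you would rather pick such a directory programmatically, the following is a minimal Python sketch. It assumes that every refinement run directory contains out/elongator_refine.json, as in the example above. The function name find_refinement_projects is hypothetical and not part of Assembline.

```python
from pathlib import Path


def find_refinement_projects(refinement_dir=".", json_name="elongator_refine.json"):
    """Return the sorted paths matching <run_dir>/out/<json_name>.

    Assumption: each refinement run lives in its own sub-directory of
    refinement_dir and keeps its project file under out/.
    """
    root = Path(refinement_dir)
    return sorted(root.glob(f"*/out/{json_name}"))
```

Calling find_refinement_projects(".") from the refinement directory returns a list of paths, and any element of that list could be passed to create_density_file.py.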

Prepare the symm_groups.txt file storing information necessary to properly align homo-oligomeric structures

create_symm_groups_file.py --project_dir <full path to the original project directory> <path_to_one_of_the_refinement_folders>/out/elongator_refine.json <path_to_one_of_the_refinement_folders>/out/params_refine.py

e.g.

create_symm_groups_file.py --project_dir ../../ 0000171/out/elongator_refine.json 0000171/out/params_refine.py

Run the setup_analysis.py script to prepare input files for the sampling exhaustiveness analysis:

setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh 70000

Here we use a score threshold derived from the corresponding score distribution in scores.pdf, to filter out poorly fitting models.
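Before committing to a threshold, it can help to check how many models it would retain. The Python sketch below is only an illustration under several assumptions that you should verify against your own files: that all_scores.csv has a header row, that the score column name you pass in exists (the name "score" in the usage note is a placeholder, not a documented column name), and that lower scores are better, so models at or below the threshold are kept. All function names are hypothetical.

```python
import csv
import math


def read_scores(csv_path, score_column):
    """Read one numeric column from a CSV file that has a header row (assumption)."""
    with open(csv_path, newline="") as handle:
        return [float(row[score_column]) for row in csv.DictReader(handle)]


def fraction_kept(scores, threshold):
    """Fraction of models with score <= threshold (assumes lower is better)."""
    if not scores:
        raise ValueError("no scores given")
    return sum(1 for s in scores if s <= threshold) / len(scores)


def quantile(scores, q):
    """Nearest-rank quantile of the scores, for q between 0 and 1."""
    if not scores:
        raise ValueError("no scores given")
    ordered = sorted(scores)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]
```

For example, fraction_kept(read_scores("all_scores.csv", "score"), 70000) would report the share of models passing the threshold used above, and quantile(...) gives a feel for where a given percentile of the distribution lies.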

Run imp-sampcon exhaust tool (command-line tool provided with IMP) to perform the actual analysis:

cd analysis

imp_sampcon exhaust -n elongator \
--rmfA sample_A/sample_A_models.rmf3 \
--rmfB sample_B/sample_B_models.rmf3 \
--scoreA scoresA.txt --scoreB scoresB.txt \
-d ../density.txt \
-m cpu_omp \
-c 4 \
-gp \
-g 5.0 \
--ambiguity ../symm_groups.txt

In the output you will get, among other files:

- elongator.Sampling_Precision_Stats.txt

  Estimation of the sampling precision. In this case it will be around 20 Angstrom.
- Clusters obtained after clustering at the above sampling precision, in directories and files whose names start with cluster, containing information about the models in the clusters and the cluster localization densities
- elongator.Cluster_Precision.txt listing the precision for each cluster, in this case around 10-20 Angstrom
- PDF files with plots of the results of the exhaustiveness tests

See Viswanath et al. 2017 for a detailed explanation of these concepts.
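One of the ideas behind these tests is that two independent samples of models should have similar score distributions. As a rough, independent sanity check, the Python sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions). It is not a re-implementation of imp_sampcon, and it assumes that scoresA.txt and scoresB.txt contain one score per line, with the score as the first whitespace-separated field. The function names are hypothetical.

```python
import bisect


def read_score_file(path):
    """Read scores, assuming one score per line (first field on the line)."""
    with open(path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|.

    The maximum over all x is reached at one of the observed values, so it
    is enough to evaluate both empirical CDFs at every data point.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value of D close to 0 means the two score distributions are nearly identical, while a value close to 1 means they barely overlap. For example, ks_statistic(read_score_file("scoresA.txt"), read_score_file("scoresB.txt")) could be run from the analysis directory.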

Optimize the plots

The fonts and value ranges on the X and Y axes in the default plots from imp_sampcon exhaust are frequently not optimal. In that case, you have to adjust them manually.
1. Copy the original gnuplot scripts to the current analysis directory by executing:
  copy_sampcon_gnuplot_scripts.py
  
  This will copy four scripts to the current directory:
  
  Plot_Cluster_Population.plt for the elongator.Cluster_Population.pdf plot
  
  Plot_Convergence_NM.plt for the elongator.ChiSquare.pdf plot
  
  Plot_Convergence_SD.plt for the elongator.Score_Dist.pdf plot
  
  Plot_Convergence_TS.plt for the elongator.Top_Score_Conv.pdf plot
2. Edit the scripts to adjust the fonts and axis ranges to your liking
3. Run the scripts again:
  gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
  gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
  gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
  gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
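If you re-run the plots often while tuning them, the four gnuplot calls can be wrapped in a small helper. The Python sketch below builds the same commands as listed in step 3. The helper names are hypothetical, and it assumes that gnuplot is on your PATH and that the four scripts are in the current directory.

```python
import subprocess

# The four scripts copied by copy_sampcon_gnuplot_scripts.py
PLOT_SCRIPTS = (
    "Plot_Cluster_Population.plt",
    "Plot_Convergence_NM.plt",
    "Plot_Convergence_SD.plt",
    "Plot_Convergence_TS.plt",
)


def gnuplot_commands(sysname, scripts=PLOT_SCRIPTS):
    """Build one gnuplot argument list per script, setting sysname via -e."""
    return [["gnuplot", "-e", f"sysname='{sysname}'", script] for script in scripts]


def run_gnuplot_scripts(sysname, scripts=PLOT_SCRIPTS):
    """Run every script in turn and stop at the first failure."""
    for command in gnuplot_commands(sysname, scripts):
        subprocess.run(command, check=True)
```

Calling run_gnuplot_scripts("elongator") from the analysis directory would then regenerate all four plots in one go.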

Extract cluster models:

For example, for the top cluster:

extract_cluster_models.py \
    --project_dir ../../../ \
    --outdir cluster.0/ \
    --ntop 5 \
    --scores ../all_scores.csv \
    --rebuild_loops \
    Identities_A.txt Identities_B.txt cluster.0.all.txt

Note that no JSON file is passed as the last argument (contrary to Analyze): for refinement of multiple models, the program will find the JSON files itself, as they are different for each model.

If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:

mkdir recluster
cd recluster/
cp ../Distances_Matrix.data.npy .
cp ../*ChiSquare_Grid_Stats.txt .
cp ../*Sampling_Precision_Stats.txt .
imp_sampcon exhaust -n elongator \
--rmfA ../sample_A/sample_A_models.rmf3 \
--rmfB ../sample_B/sample_B_models.rmf3 \
--scoreA ../scoresA.txt --scoreB ../scoresB.txt \
-d ../../density.txt \
-m cpu_omp \
-c 4 \
-gp \
--ambiguity ../../symm_groups.txt \
--skip \
--cluster_threshold 25 \
--voxel 2

Then generate the cluster models, updating the paths accordingly:

extract_cluster_models.py \
    --project_dir ../../../../ \
    --outdir cluster.0/ \
    --ntop 1 \
    --scores ../../all_scores.csv \
    --rebuild_loops \
    ../Identities_A.txt ../Identities_B.txt cluster.0.all.txt

Conclusions

The analysis shows that the sampling converged and was exhaustive:

the individual runs converge

the exhaustiveness has been reached at 10-20 Angstrom according to the statistical test

two random samples of models have similar score distributions and are similarly distributed over the clusters

Warning

The precision estimates should still be taken with caution, as in the previous stage, but here the estimates are a bit more realistic because the original models were allowed to explore a larger conformational space.

The crosslinks are now satisfied in the top scoring model.

For further biological hypothesis generation based on the model, we would use the top scoring models from each cluster and from all runs.

We might use this top scoring model as a representative model for figures.

As the “final output”, however, we would use the ensemble of models (e.g. from the clusters or top 10 models from all runs) for depicting the uncertainty.