Analyze
Enter the output directory.
cd out/refinement
Extract scores
extract_scores.py --multi
Visualize the score distributions (these will be needed later to infer score thresholds)
plot_scores.R all_scores.csv
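If you prefer to read a threshold off the data rather than from the plots, a percentile of the score distribution is a simple starting point. The sketch below is a minimal, dependency-free illustration; the column name ``total_score`` in all_scores.csv is an assumption, not something the tutorial scripts guarantee.

```python
import csv
import math

def score_threshold(scores, percentile=90.0):
    """Return the score value below which `percentile` percent of models fall.

    With IMP-style scores, lower is better, so keeping models below this
    threshold discards the worst-scoring tail.
    """
    ordered = sorted(scores)
    # nearest-rank percentile (simple, dependency-free)
    rank = max(0, math.ceil(percentile / 100.0 * len(ordered)) - 1)
    return ordered[rank]

def read_scores(path, column="total_score"):
    """Read one score column from a CSV such as all_scores.csv.

    The column name is an assumption; adjust it to match your file.
    """
    with open(path, newline="") as fh:
        return [float(row[column]) for row in csv.DictReader(fh)]
```

Always sanity-check any programmatically chosen threshold against the histograms in scores.pdf.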
Create a CIF file of the top 10 models
rebuild_atomic.py --project_dir <full path to the original project directory> --top 10 all_scores_sorted_uniq.csv --rmf_auto
Note: no JSON project file is given as an argument. Here, each refinement run had its own project file (pointing to a different input structure), and rebuild_atomic.py will locate the JSON project files automatically based on the model IDs in the all_scores_sorted_uniq.csv file. --project_dir is necessary if you use relative paths in the JSON project file.
``--rmf_auto`` will read the beads from the RMF file and use Modeller to rebuild full atomic loops!
For example here:
rebuild_atomic.py --project_dir ../../ --top 10 all_scores_sorted_uniq.csv --rmf_auto
Open the used EM map and the models in UCSF Chimera and Xlink Analyzer. You will see that the structures still fit the map well, and the crosslinks are now satisfied!
Quick convergence check:
plot_convergence.R total_score_logs.txt 20
Open the resulting scores.pdf
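Conceptually, a run has converged when the running best total score stops improving. The sketch below illustrates that check on synthetic data; mapping the window size to the ``20`` argument of plot_convergence.R is an assumption, and the real script's internals may differ.

```python
def running_best(scores):
    """Running minimum of total scores (lower is better)."""
    best, out = float("inf"), []
    for s in scores:
        best = min(best, s)
        out.append(best)
    return out

def has_converged(scores, window=20, tol=1e-6):
    """Consider a run converged if the best score did not improve over the
    last `window` recorded steps (window size is an assumption here)."""
    best = running_best(scores)
    if len(best) <= window:
        return False
    return best[-1] > best[-window - 1] - tol
```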
Assess sampling exhaustiveness
Run the sampling performance analysis with the imp-sampcon tool (described by Viswanath et al. 2017)
Prepare the density.txt file:
create_density_file.py --project_dir <full path to the original project directory> <path_to_one_of_the_refinement_folders>/out/elongator_refine.json --by_rigid_body
e.g.
create_density_file.py --project_dir ../../ 0000171/out/elongator_refine.json --by_rigid_body
replacing 0000171 with the name of any directory in your refinement directory.
Prepare the symm_groups.txt file storing the information necessary to properly align homo-oligomeric structures:
create_symm_groups_file.py --project_dir <full path to the original project directory> <path_to_one_of_the_refinement_folders>/elongator.json
e.g.
create_symm_groups_file.py --project_dir ../../ 0000171/out/elongator_refine.json 0000171/out/params_refine.py
Run the setup_analysis.py script to prepare input files for the sampling exhaustiveness analysis:
setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh 70000
Here we use a score threshold derived from the corresponding score distribution in scores.pdf to filter out poorly fitting models.
Run the imp_sampcon exhaust tool (a command-line tool provided with IMP) to perform the actual analysis:
cd analysis
imp_sampcon exhaust -n elongator \
    --rmfA sample_A/sample_A_models.rmf3 \
    --rmfB sample_B/sample_B_models.rmf3 \
    --scoreA scoresA.txt --scoreB scoresB.txt \
    -d ../density.txt \
    -m cpu_omp \
    -c 4 \
    -gp \
    -g 5.0 \
    --ambiguity ../symm_groups.txt
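The --score_thresh filtering performed by setup_analysis.py can be pictured as simply discarding models whose score exceeds the cutoff. A minimal sketch (the column name ``total_score`` is an assumption; the real script does more than this):

```python
def filter_models(rows, score_thresh=70000.0, score_key="total_score"):
    """Keep only models scoring below the threshold (lower is better).

    `rows` are dicts such as those produced by csv.DictReader on
    all_scores.csv; the score column name is an assumption.
    """
    return [r for r in rows if float(r[score_key]) < score_thresh]
```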
In the output you will get, among other files:
elongator.Sampling_Precision_Stats.txt - an estimation of the sampling precision; in this case it will be around 20 Angstrom
directories and files starting with cluster in their names - the clusters obtained after clustering at the above sampling precision, containing information about the models in the clusters and the cluster localization densities
elongator.Cluster_Precision.txt - listing the precision for each cluster, in this case around 10-20 Angstrom
PDF files with plots of the results of the exhaustiveness tests
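One of the criteria behind these tests is that two independent samples of models (A and B) should populate the clusters in similar proportions. The sketch below illustrates that comparison with a plain chi-square statistic on synthetic cluster labels; it is an illustration of the idea, not the actual imp_sampcon implementation.

```python
def cluster_populations(labels):
    """Count how many models fall into each cluster."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return counts

def population_chi2(labels_a, labels_b):
    """Chi-square statistic comparing cluster populations of two samples.

    0.0 means identical populations; larger values mean the two samples
    are distributed differently over the clusters.
    """
    ca, cb = cluster_populations(labels_a), cluster_populations(labels_b)
    chi2 = 0.0
    for lab in set(ca) | set(cb):
        a, b = ca.get(lab, 0), cb.get(lab, 0)
        total = a + b
        # expected counts if A and B were drawn from the same distribution
        ea = total * len(labels_a) / (len(labels_a) + len(labels_b))
        eb = total - ea
        chi2 += (a - ea) ** 2 / ea + (b - eb) ** 2 / eb
    return chi2
```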
See Viswanath et al. 2017 for detailed explanation of these concepts.
Optimize the plots
The fonts and the value ranges of the X and Y axes in the default plots from imp_sampcon exhaust are frequently not optimal, so you may have to adjust them manually.
Copy the original gnuplot scripts to the current analysis directory by executing:
copy_sampcon_gnuplot_scripts.py
This will copy four scripts to the current directory:
Plot_Cluster_Population.plt for the elongator.Cluster_Population.pdf plot
Plot_Convergence_NM.plt for the elongator.ChiSquare.pdf plot
Plot_Convergence_SD.plt for the elongator.Score_Dist.pdf plot
Plot_Convergence_TS.plt for the elongator.Top_Score_Conv.pdf plot
Edit the scripts to adjust the plots to your liking.
Run the scripts again:
gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt
Extract cluster models:
For example, for the top cluster:
extract_cluster_models.py \
    --project_dir ../../../ \
    --outdir cluster.0/ \
    --ntop 5 \
    --scores ../all_scores.csv \
    --rebuild_loops \
    Identities_A.txt Identities_B.txt cluster.0.all.txt
Note that there is no JSON file as the last argument (contrary to the Analyze stage): for the refinement of multiple models, the program will find the JSON files itself, as they are different for each model.
If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:
mkdir recluster
cd recluster/
cp ../Distances_Matrix.data.npy .
cp ../*ChiSquare_Grid_Stats.txt .
cp ../*Sampling_Precision_Stats.txt .
imp_sampcon exhaust -n elongator \
    --rmfA ../sample_A/sample_A_models.rmf3 \
    --rmfB ../sample_B/sample_B_models.rmf3 \
    --scoreA ../scoresA.txt --scoreB ../scoresB.txt \
    -d ../density.txt \
    -m cpu_omp \
    -c 4 \
    -gp \
    --ambiguity ../../symm_groups.txt \
    --skip \
    --cluster_threshold 25 \
    --voxel 2
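The idea behind --cluster_threshold is that models whose pairwise distance falls below the threshold end up in the same cluster, so raising it merges clusters. The sketch below illustrates this with a simple single-linkage grouping on a synthetic distance matrix; it is a conceptual illustration, not the clustering algorithm imp_sampcon actually uses.

```python
def cluster_at_threshold(dist, threshold):
    """Single-linkage clustering: group items connected by distances below threshold.

    `dist` is a symmetric n x n matrix (e.g. pairwise RMSDs between models).
    Returns the clusters as sorted lists of item indices.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```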
And generate cluster models updating paths:
extract_cluster_models.py \
    --project_dir ../../../../ \
    --outdir cluster.0/ \
    --ntop 1 \
    --scores ../../all_scores.csv \
    --rebuild_loops \
    ../Identities_A.txt ../Identities_B.txt cluster.0.all.txt
Conclusions
The analysis shows that the sampling converged and was exhaustive:
the individual runs converge
the exhaustiveness has been reached at 10-20 Angstrom according to the statistical test
two random samples of models have similar score distributions and are similarly distributed over the clusters
Warning
The precision estimates should still be taken with caution, as in the previous stage, but here they are a bit more realistic because the original models were allowed to explore a larger conformational space.
The crosslinks are now satisfied in the top scoring model.
For further biological hypothesis generation based on the model, we would use the top scoring models from each cluster and from all runs.
We might use this top scoring model as a representative model for figures.
As the “final output”, however, we would use the ensemble of models (e.g. from the clusters or top 10 models from all runs) for depicting the uncertainty.