Analyze
=======

Enter the output directory:

.. code-block:: bash

   cd out/refinement

Extract scores
--------------

.. code-block:: bash

   extract_scores.py --multi

Visualize the score distributions (they will be needed later to infer score thresholds):

.. code-block:: bash

   plot_scores.R all_scores.csv

Create a CIF file of the top 10 models
--------------------------------------

.. code-block:: bash

   rebuild_atomic.py --project_dir <project_dir> --top 10 all_scores_sorted_uniq.csv --rmf_auto

(``<project_dir>`` is a placeholder for the path to your project directory.)

Note that no JSON project file is given as an argument: here each refinement run had its own project file (pointing to a different input structure), and ``rebuild_atomic.py`` will locate the JSON project files automatically based on the model IDs in the ``all_scores_sorted_uniq.csv`` file.

``--project_dir`` is necessary if you use relative paths in the JSON project file.

``--rmf_auto`` will read the beads from the RMF file and use Modeller to re-build the full atomic loops!

For example:

.. code-block:: bash

   rebuild_atomic.py --project_dir ../../ --top 10 all_scores_sorted_uniq.csv --rmf_auto

Open the EM map and the models in UCSF Chimera and Xlink Analyzer. You will see that the structures still fit the map well, but the crosslinks are now satisfied!

Quick convergence check
-----------------------

.. code-block:: bash

   plot_convergence.R total_score_logs.txt 20

Open the resulting ``scores.pdf``.

Assess sampling exhaustiveness
------------------------------

Run the sampling performance analysis with the ``imp_sampcon`` tool (described by Viswanath et al. 2017):

#. Prepare the ``density.txt`` file:

   .. code-block:: bash

      create_density_file.py --project_dir <project_dir> <run_dir>/out/elongator_refine.json --by_rigid_body

   e.g.:

   .. code-block:: bash

      create_density_file.py --project_dir ../../ 0000171/out/elongator_refine.json --by_rigid_body

   replacing ``0000171`` with the name of any directory in your refinement directory.

#. Prepare the ``symm_groups.txt`` file storing the information necessary to properly align homo-oligomeric structures:
   .. code-block:: bash

      create_symm_groups_file.py --project_dir <project_dir> <run_dir>/out/elongator_refine.json <run_dir>/out/params_refine.py

   e.g.:

   .. code-block:: bash

      create_symm_groups_file.py --project_dir ../../ 0000171/out/elongator_refine.json 0000171/out/params_refine.py

#. Run the ``setup_analysis.py`` script to prepare the input files for the sampling exhaustiveness analysis:

   .. code-block:: bash

      setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh 70000

   Here we use a score threshold derived from the corresponding score distribution in ``scores.pdf`` to filter out poorly fitting models.

#. Run the ``imp_sampcon exhaust`` tool (a command-line tool provided with IMP) to perform the actual analysis:

   .. code-block:: bash

      cd analysis
      imp_sampcon exhaust -n elongator \
          --rmfA sample_A/sample_A_models.rmf3 \
          --rmfB sample_B/sample_B_models.rmf3 \
          --scoreA scoresA.txt --scoreB scoresB.txt \
          -d ../density.txt \
          -m cpu_omp \
          -c 4 \
          -gp \
          -g 5.0 \
          --ambiguity ../symm_groups.txt

#. In the output you will get, among other files:

   * ``elongator.Sampling_Precision_Stats.txt`` with an estimate of the sampling precision (in this case it will be around 20 Angstrom)

   * the clusters obtained after clustering at the above sampling precision, in directories and files whose names start with ``cluster``, containing information about the models in each cluster and the cluster localization densities

   * ``elongator.Cluster_Precision.txt`` listing the precision for each cluster (in this case around 10-20 Angstrom)

   * PDF files with plots of the results of the exhaustiveness tests

   See Viswanath et al. 2017 for a detailed explanation of these concepts.

#. Optimize the plots.

   The fonts and the X- and Y-axis value ranges in the default plots from ``imp_sampcon exhaust`` are frequently not optimal, so you have to adjust them manually.

#. Copy the original ``gnuplot`` scripts to the current ``analysis`` directory by executing:
   .. code-block:: bash

      copy_sampcon_gnuplot_scripts.py

   This will copy four scripts to the current directory:

   * ``Plot_Cluster_Population.plt`` for the ``elongator.Cluster_Population.pdf`` plot
   * ``Plot_Convergence_NM.plt`` for the ``elongator.ChiSquare.pdf`` plot
   * ``Plot_Convergence_SD.plt`` for the ``elongator.Score_Dist.pdf`` plot
   * ``Plot_Convergence_TS.plt`` for the ``elongator.Top_Score_Conv.pdf`` plot

#. Edit the scripts to adjust the plots to your liking.

#. Run the scripts again::

      gnuplot -e "sysname='elongator'" Plot_Cluster_Population.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_NM.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_SD.plt
      gnuplot -e "sysname='elongator'" Plot_Convergence_TS.plt

#. Extract the cluster models, for example for the top cluster:

   .. code-block:: bash

      extract_cluster_models.py \
          --project_dir ../../../ \
          --outdir cluster.0/ \
          --ntop 5 \
          --scores ../all_scores.csv \
          --rebuild_loops \
          Identities_A.txt Identities_B.txt cluster.0.all.txt

   Note that no JSON file is given as the last argument (contrary to :doc:`analysis_denovo`): for the refinement of multiple models the program will find the JSON project files itself (as they are different for each model).

#. If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:

   .. code-block:: bash

      mkdir recluster
      cd recluster/
      cp ../Distances_Matrix.data.npy .
      cp ../*ChiSquare_Grid_Stats.txt .
      cp ../*Sampling_Precision_Stats.txt .
      imp_sampcon exhaust -n elongator \
          --rmfA ../sample_A/sample_A_models.rmf3 \
          --rmfB ../sample_B/sample_B_models.rmf3 \
          --scoreA ../scoresA.txt --scoreB ../scoresB.txt \
          -d ../density.txt \
          -m cpu_omp \
          -c 4 \
          -gp \
          --ambiguity ../../symm_groups.txt \
          --skip \
          --cluster_threshold 25 \
          --voxel 2

   and then generate the cluster models, updating the paths:
   .. code-block:: bash

      extract_cluster_models.py \
          --project_dir ../../../../ \
          --outdir cluster.0/ \
          --ntop 1 \
          --scores ../../all_scores.csv \
          --rebuild_loops \
          ../Identities_A.txt ../Identities_B.txt cluster.0.all.txt

Conclusions
-----------

The analysis shows that the sampling was exhaustive:

* the individual runs converge
* the exhaustiveness has been reached at 10-20 Angstrom according to the statistical test
* two random samples of models have similar score distributions and are similarly distributed over the clusters

.. warning:: The precision estimates should still be taken with caution, as for the previous stage, but here the estimates are a bit more realistic because the original models were allowed to explore a larger conformational space.

The crosslinks are now satisfied in the top-scoring model.

For further biological hypothesis generation based on the model, we would use the top-scoring models from each cluster and from all runs. We might use the top-scoring model as a representative model for figures. As the "final output", however, we would use the ensemble of models (e.g. from the clusters, or the top 10 models from all runs) for depicting the uncertainty.
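As a minimal sketch of how such an ensemble could be gathered in practice, the snippet below collects the per-cluster models into a single directory. It assumes that ``extract_cluster_models.py`` left model files (assumed here to be ``*.pdb``) inside the ``cluster.*`` directories; both the directory layout and the file extension are illustrative assumptions, not guaranteed by the tools above.

.. code-block:: bash

   #!/bin/sh
   # collect_ensemble: copy every PDB file found under cluster.*/ into an
   # "ensemble" directory, prefixing each file with its cluster name so that
   # models from different clusters do not overwrite each other.
   # ASSUMPTION: extract_cluster_models.py wrote *.pdb files into cluster.*/.
   collect_ensemble() {
       mkdir -p ensemble
       for d in cluster.*/; do
           [ -d "$d" ] || continue          # no cluster directories yet
           c=${d%/}                          # e.g. "cluster.0"
           for f in "$d"*.pdb; do
               [ -e "$f" ] || continue       # cluster without PDB files
               cp "$f" "ensemble/${c}_$(basename "$f")"
           done
       done
   }

Running ``collect_ensemble`` from the ``analysis`` directory would then leave one prefixed copy of every cluster model under ``ensemble/``, ready for side-by-side inspection in UCSF Chimera.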