Frequently Asked Questions
What do the nonredundant / all member counts refer to?
The nonredundant member counts refer to sequences that are cluster representatives at 90% identity and 90% coverage clustering thresholds. This gives an overview of the sequence count that excludes highly similar sequences reported in UniprotKB (which could be isoforms, incomplete transcripts or sequences differing by a few mutations).
By contrast, the all member counts take into account all sequences regardless of their similarity.
What do the NL and NBS classes stand for within the NLR classification?
The NL class contains sequences in which no CC / TIR / RPW8 domains were annotated, but which contain NBS and LRR domains.
The NBS class contains sequences that consist only of a complete / incomplete NBS domain, with no other NLR-associated domains present.
Sequences in these two groups could be incomplete sequences of a proper CNL/TNL/RNL group, could be standalone NLRs (in the case of the NL class), or could belong to a different functional class of proteins apart from NLRs. On a case-by-case basis, users can inspect the neighbouring clusters inside the cluster visualization page to assess which case is more likely. For example, starting from a given NL-type cluster, if there is a nearby cluster with CNL architecture and high sequence similarity, this is a good indication that the inspected NL cluster might correspond to a subgroup of incomplete sequences separated from the main CNL cluster by the overlap (coverage) cut-off.
Why do some pages load slowly?
Several pages require computations on our servers or complex queries on our database, so they may load more slowly. More complex computations, such as generating custom clusters, take 1-10 minutes; for these a job ID is created, and the job remains reachable at the generated link for at least 7 days.
Are user data private?
This website is in compliance with the EU General Data Protection Regulation (GDPR).
The only personal information collected is the name and e-mail address provided by submitters who write to us through the contact page. We need this information to be able to contact the submitter in order to respond to their questions / suggestions. The data is kept until the message is fully processed. If at any time you would like your information to be completely erased from our database, please send an e-mail to the head of our group, and we will do so immediately. No other personal information is collected by our website.
In order to protect the privacy and anonymity of researchers, we did not implement a mechanism for user login or for sending job results via e-mail. All jobs are accessed only via the randomly generated job ID. Anyone holding this job ID could in theory access the results, but the job ID is a random, complex alphanumeric string, which makes it almost impossible to guess given the large number of possible combinations. Moreover, users do not have access to the list of existing job IDs, making it practically impossible to profile other researchers' work and interests.
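To give a feel for why a random alphanumeric ID is hard to guess, here is a minimal Python sketch. The ID length (20 characters) and the alphabet (letters and digits) are assumptions chosen for illustration; the actual format used by the website is not documented here.

```python
import math
import secrets
import string

# Assumed parameters (hypothetical, not taken from the website):
# a 20-character ID drawn from upper/lower-case letters and digits.
ALPHABET = string.ascii_letters + string.digits  # 62 symbols
ID_LENGTH = 20


def make_job_id(length: int = ID_LENGTH) -> str:
    """Return a random alphanumeric ID from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def id_space_bits(length: int = ID_LENGTH, alphabet_size: int = len(ALPHABET)) -> float:
    """Size of the ID space, expressed in bits (log2 of the number of combinations)."""
    return length * math.log2(alphabet_size)


job_id = make_job_id()
print(job_id, f"(about {id_space_bits():.0f} bits of search space)")
```

Under these assumptions there are 62^20 possible IDs, so guessing an existing one by trial and error is not practical, especially since the list of existing IDs is never exposed.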
How to cite data computed within Plant NLRscape?
NLRscape: An atlas of plant NLR proteins. Martin EC, Ion CF, Ifrimescu F, Spiridon L, Bakker J, Goverse A, Petrescu AJ. Nucleic acids research, 2022, gkac1014. https://doi.org/10.1093/nar/gkac1014.
Additional third-party publications that should be cited, depending on the content type:
RaptorX Predict Property, used for secondary structure predictions in variability plots and MSA annotations.
Sheng Wang, Jian Peng, Jianzhu Ma, Jinbo Xu. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Scientific Reports, 2016
LRRpredictor, used for LRR motif predictions in variability plots and MSA annotations.
Eliza C. Martin, Octavina C. A. Sukarta, Laurentiu Spiridon, Laurentiu G. Grigore, Vlad Constantinescu, Robi Tacutu, Aska Goverse, Andrei-Jose Petrescu. LRRpredictor - a new LRR motif detection method for irregular motifs of plant NLR proteins using ensemble of classifiers. Genes 2020, 11, 286.
MAFFT, used for sequence alignments in variability plots and MSA annotations.
Kazutaka Katoh, Daron M. Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 2013, 30:772-780.
Logomaker, used for variability plots.
Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020;36(7):2272-2274.
MSAViewer, used for interactive MSA visualization.
Cytoscape, used for representing cluster graphs.
Franz M, Lopes CT, Huck G, Dong Y, Sumer O, Bader GD. (2016). Cytoscape.js: a graph theory library for visualization and analysis. Bioinformatics (2016) 32 (2): 309-311 first published online September 28, 2015
AlphaFold 3D models, used for mapping cluster variability on 3D structure.
Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Varadi M, Anyango S et al., AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research, Volume 50, Issue D1, 7, 2022, Pages D439–D444
Plotly library, used for interactive visualization of histograms, taxa sunburst and identity matrices.
Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC: Plotly Technologies Inc. Retrieved from https://plot.ly
Clusters by homology
What does homology cluster mean?
A homology cluster is a group of sequences sharing similarity above the identity and coverage percentages used as cutoffs. For each cluster, the representative (cluster center) is the sequence with the most connections, i.e. the largest number of sequences with an identity percent above the cutoff. The UniprotKB ID of this representative is then used as the cluster ID. Details about the clustering approach can be found in the user guide.
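The "most connections" rule can be sketched in a few lines of Python. This is only an illustration of the selection criterion described above, not the actual clustering implementation; the function name, the input format and the sequence IDs are hypothetical, and the alphabetical tie-break is an assumption.

```python
from collections import defaultdict


def pick_representative(pairs, identity_cutoff):
    """Return the sequence ID with the most connections.

    pairs: iterable of (seq_a, seq_b, identity_percent) tuples.
    A connection is a pair whose identity is above the cutoff.
    """
    connections = defaultdict(int)
    for seq_a, seq_b, identity in pairs:
        # Register both IDs so that unconnected sequences are also known.
        connections[seq_a] += 0
        connections[seq_b] += 0
        if identity > identity_cutoff:
            connections[seq_a] += 1
            connections[seq_b] += 1
    # Ties are broken alphabetically (an assumption made for determinism).
    return max(sorted(connections), key=lambda seq_id: connections[seq_id])


# Hypothetical IDs and identity values, for illustration only.
example_pairs = [
    ("P1", "P2", 92.0),
    ("P1", "P3", 95.0),
    ("P2", "P3", 70.0),
    ("P1", "P4", 91.0),
]
print(pick_representative(example_pairs, identity_cutoff=90.0))
```

In this toy example P1 is connected to three other sequences while every other sequence has a single connection, so P1 would give the cluster its ID.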
What does edge length mean within the graph?
The graph representation tries to find the best 2D projection, using the identity percent between two cluster representatives as edge length, by employing an edge-weighted spring-embedded layout solver implemented in Cytoscape (Franz et al., 2016; Shannon et al., 2003). As the protein space is highly dimensional, a 2D representation of all these relationships cannot be perfectly achieved; we therefore suggest also inspecting the identity values shown on each edge label to avoid misleading interpretations. This 2D representation is intended to give a general perspective of the relationships between the main NLR groups.
Why does the search bar inside the graph visualization tool not find my cluster ID?
Only clusters having more than 25 members are included in this graph, in order to ease visualisation. Separately, on each cluster detailed page, a subgraph of its neighbour clusters can be further visualised.
Cluster visualization page
Why are some homologs missing when I inspect the predefined cluster containing my protein of interest?
The predefined homology clusters are built by selecting as cluster representative the sequence with the most ‘connections’ (i.e. the number of neighbour sequences with identity percents above the threshold). Therefore, if the protein of interest is located at the extremity of a cluster, some of its homologs will be placed in a different cluster. This occurs because the landscape of identity relationships is complex, and even though the clustering algorithm (MMseqs2 - Steinegger et al., 2018) tries to find the best partition of the NLR space, particular clusters might be very ‘close’ to each other. To overcome this, please use the ‘Generate custom cluster’ feature, which builds a cluster centered on your protein of interest and includes all homologs above the selected identity and coverage cutoffs. This feature can also be accessed from the sequence detailed page. Details and discussion of the clustering approach, as well as instructions for creating custom clusters, can be found in the user guide.
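The idea behind a cluster centered on a protein of interest can be sketched as follows. This is a simplified illustration under stated assumptions, not the website's implementation: the function name, the hit tuple layout, the inclusive comparison and all IDs and values are hypothetical.

```python
def custom_cluster(center, hits, min_identity, min_coverage):
    """Collect all sequences directly related to `center` above both cutoffs.

    hits: iterable of (query, target, identity_percent, coverage_percent).
    The comparison is inclusive (>=), which is an assumption.
    """
    members = {center}
    for query, target, identity, coverage in hits:
        if identity < min_identity or coverage < min_coverage:
            continue
        if query == center:
            members.add(target)
        elif target == center:
            members.add(query)
    return members


# Hypothetical scenario: X sits at the edge of the predefined cluster of RA,
# while its close homolog Y was assigned to the cluster of RB.
example_hits = [
    ("X", "RA", 92.0, 95.0),
    ("X", "Y", 93.0, 96.0),
    ("Y", "RB", 94.0, 97.0),
    ("X", "Z", 80.0, 95.0),
]
print(sorted(custom_cluster("X", example_hits, min_identity=90.0, min_coverage=90.0)))
```

Here Y is recovered as a homolog of X even though the predefined partition placed it elsewhere, while Z is excluded because its identity falls below the cutoff and RB is excluded because it is not directly related to X.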
Identity matrices: What is the difference between ‘with gaps’ and ‘without gaps’?
Identity matrices: Why are there sequences inside the matrix with identity percents below the expected cutoff?
The identity percents shown in the identity matrices (with / without gaps) are computed on the multiple sequence alignment (MSA) of the cluster members, which is built with MAFFT after the cluster has been defined. This offers a more reliable view of the similarity relationships between the sequences, whereas the initial clustering method (MMseqs2 - Steinegger et al., 2018) relies on some approximations and speed optimisations due to the large size of the NLR database. In addition, the identity percents depend on the MSA algorithm used and on how gaps are accounted for; we therefore provide two types of identity measure: with or without considering gaps.
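The effect of gap handling on an identity value can be illustrated with a small Python sketch. The exact formulas used for the matrices are not stated here, so the two definitions below are assumptions: in the "with gaps" mode a gap facing a residue counts as a mismatch, while in the "without gaps" mode such columns are skipped. The sequences are made up.

```python
def pairwise_identity(row_a: str, row_b: str, count_gaps: bool) -> float:
    """Percent identity between two rows of the same alignment.

    count_gaps=True  : gap-vs-residue columns count as mismatches (assumed definition).
    count_gaps=False : only columns with a residue in both rows are compared.
    Columns that are gaps in both rows are ignored in either mode.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    compared = 0
    for res_a, res_b in zip(row_a, row_b):
        if res_a == "-" and res_b == "-":
            continue
        if res_a == "-" or res_b == "-":
            if count_gaps:
                compared += 1
            continue
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned rows: identical residues, but each row has one gap.
row_1 = "MKT-LLA"
row_2 = "MKTALL-"
print(pairwise_identity(row_1, row_2, count_gaps=False))
print(round(pairwise_identity(row_1, row_2, count_gaps=True), 1))
```

The two rows agree at every position where both have a residue, so the value without gaps is 100%, whereas counting the two gapped columns lowers it to 5 matches out of 7 columns. This is one way a pair can end up below the clustering cutoff in one matrix and not in the other.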
Identity matrices: Why are some of the labels in the identity matrices not shown?
If the matrix is large, only a fraction of the labels will be shown on the screen. To zoom in, select a rectangular frame inside the matrix using the mouse. Alternatively, the matrices can be downloaded as ASCII files for custom display.
Variability plots: What does “no gap dropout” / “5% gaps dropout” mean?
The “no gap dropout” plot shows all positions of the alignment, including every insertion occurring in at least one sequence. For large clusters this plot might be difficult to read, as it will contain many insertions. In this case the “5% gap” plot might be more practical, as it will “drop” (not show) alignment positions where more than 95% of the sequences have a gap. Both plots are informative for different purposes: the first, “original” plot shows regions subjected to insertion / deletion events, while the 95% gap dropout option is more practical for inspecting the overall profile common to the clustered sequences.
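The column filtering described above can be sketched in Python as follows. This is a minimal illustration, not the website's code; the function name and the toy alignment are hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.95):
    """Remove alignment columns in which more than `max_gap_fraction`
    of the sequences have a gap ('-').

    alignment: list of equal-length aligned sequences (strings).
    Returns the filtered rows in the same order.
    """
    if not alignment:
        return []
    n_rows = len(alignment)
    kept_columns = [
        col
        for col in range(len(alignment[0]))
        if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in kept_columns) for row in alignment]


# Toy alignment: 39 sequences share a gap at the third position,
# and a single sequence carries an insertion there.
toy_alignment = ["MK-L"] * 39 + ["MKAL"]
filtered = drop_gappy_columns(toy_alignment)
print(filtered[0], filtered[-1])
```

In the toy alignment the third column is a gap in 39 of 40 sequences (97.5%), so it is dropped and every row is reduced to its three shared positions. With the default threshold a column is only removed when it is almost entirely gaps, which is why the unfiltered plot remains useful for spotting insertion / deletion events.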
Variability plots: Why are some domains / motifs shown in faded / lighter shades within the General overview variability plot?
The color intensity used for domains and motifs is a function of the percentage of sequences carrying that annotation. The margins of a domain therefore appear in lighter shades, as the annotated domain boundary varies slightly between sequences. Similarly, if a domain is present in only a few of the cluster members, its color will be very faint. The same rule applies to sequence motifs, as we consider it very informative to have a measure of which motifs are consistent within the cluster and which are not, especially for the first LRR motifs within the LRR domain, where higher irregularities of the LxxLxL motif occur.
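A per-position annotation fraction of this kind could be computed as in the sketch below, and then used directly as a color opacity. The interval representation (0-based, end-exclusive, in alignment coordinates), the function name and the example values are all assumptions made for illustration.

```python
def annotation_fraction(annotations, alignment_length):
    """Fraction of sequences annotated at each alignment position.

    annotations: one list of (start, end) intervals per sequence,
                 0-based and end-exclusive (assumed convention).
    Returns a list of floats in [0, 1], one per alignment position.
    """
    counts = [0] * alignment_length
    for intervals in annotations:
        covered = set()
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, alignment_length)))
        # Each sequence contributes at most once per position.
        for position in covered:
            counts[position] += 1
    n_sequences = len(annotations)
    return [count / n_sequences if n_sequences else 0.0 for count in counts]


# Three hypothetical sequences whose domain boundaries differ slightly.
example_annotations = [[(2, 8)], [(3, 8)], [(3, 7)]]
fractions = annotation_fraction(example_annotations, alignment_length=10)
print([round(value, 2) for value in fractions])
```

In this example the core of the domain (positions 3 to 6) is annotated in all three sequences and would be drawn at full intensity, while the two boundary positions are annotated in only one or two sequences and would appear lighter.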
Variability plots: Why are some LRR motifs missing from the LRR 2D plot although they appear in the general overview variability plot?
In the LRR 2D plot, only LRR motifs predicted in more than 40% of the sequences of the cluster are shown.
There are two possible scenarios:
- There are some recent incompatibilities between the third-party tool used and the newest versions of Chrome (or other web browsers), which the tool's developers need to address. When a fix is released, we will update it in the Plant NLRscape atlas as well. Until then, we suggest using a different browser (e.g. Firefox).
- The cluster is very large: this tool is not suitable for large alignments and will load very slowly. On small and medium-sized clusters, however, it offers useful interactive features.
In either case, we suggest using the PHP MSA viewer which, despite not offering the same interactive features, loads fast even for large clusters. Separately, the alignments can be downloaded in FASTA format, and the secondary structure and LRR motif predictions in GFF format, for import into standalone MSA software.
Franz M, Lopes CT, Huck G, Dong Y, Sumer O, Bader GD. (2016). Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics (2016) 32 (2): 309-311 first published online September 28, 2015
Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498-2504.
Steinegger M and Soeding J. Clustering huge protein sequence sets in linear time. Nature Communications, 2018.