The most exposed regions of SARS-CoV-2 structural proteins are subject to positive selection and gene overlap may locally modify this behavior

arubval/JABI2023

The SARS-CoV-2 virus pandemic that emerged in 2019 has been an unprecedented event in international science, as it has been possible to sequence millions of genomes, tracking their evolution very closely. This has enabled various types of secondary analyses of these genomes, including the measurement of their sequence selection pressure. In this work we have been able to measure the selective pressure of all the described SARS-CoV-2 genes, even analyzed by sequence regions, and we show how this type of analysis allows us to separate the genes between those subject to positive selection (usually those that code for surface proteins or those exposed to the host immune system) and those subject to negative selection because they require greater conservation of their structure and function. We have also seen that when another gene with an overlapping reading frame appears within a gene sequence, the overlapping sequence between the two genes evolves under a stronger purifying selection than the average of the non-overlapping regions of the main gene. We propose this type of analysis as a useful tool for locating and analyzing all the genes of a viral genome, when an adequate number of sequences are available.

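The Ka/Ks ratio is used to measure selection pressure. It is the ratio between the number of non-synonymous substitutions per non-synonymous site (Ka) and the number of synonymous substitutions per synonymous site (Ks). A ratio below 1 indicates negative (purifying) selection, and a ratio above 1 indicates positive selection. We have previously shown that this ratio can measure evolutionary pressure in bacteria such as Helicobacter pylori or Acinetobacter baumannii.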
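
As an illustration of how the ratio is built from substitution counts, here is a minimal Python sketch. It assumes that the numbers of synonymous and non-synonymous differences and sites have already been counted for one pairwise alignment, and it applies a Jukes-Cantor correction for multiple substitutions, as in the Nei-Gojobori approach. This repository description does not state which estimation method was used, so the function names (`jukes_cantor`, `ka_ks`) and the choice of correction are assumptions made only for this example.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple hits.

    Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3), defined for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) for one pairwise alignment.

    Ka = non-synonymous substitutions per non-synonymous site.
    Ks = synonymous substitutions per synonymous site.
    The ratio is NaN when Ks is zero (no synonymous change observed).
    """
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    ratio = ka / ks if ks > 0 else float("nan")
    return ka, ks, ratio


# Illustrative counts only (not taken from the study):
# 3 changes over 300 non-synonymous sites, 10 changes over 100 synonymous sites.
ka, ks, ratio = ka_ks(3, 300, 10, 100)
print(f"Ka={ka:.4f} Ks={ks:.4f} Ka/Ks={ratio:.3f}")
```

With these illustrative counts the ratio is well below 1, which would be read as purifying selection.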

  1. Rubio A, Pérez-Pulido AJ. Protein-Coding Genes of Helicobacter pylori Predominantly Present Purifying Selection though Many Membrane Proteins Suffer from Selection Pressure: A Proposal to Analyze Bacterial Pangenomes. Genes. 2021; 12(3):377. https://doi.org/10.3390/genes12030377
  2. Rubio A, Jimenez J, Pérez-Pulido AJ. Assessment of selection pressure exerted on genes from complete pangenomes helps to improve the accuracy in the prediction of new genes. Brief Bioinform. 2022 Mar 10;23(2):bbac010. doi: 10.1093/bib/bbac010. PMID: 35108356.

Fig. 1. Ka/Ks ratio in the pangenome of H. pylori. Distribution of the Ka/Ks ratio in all the protein-coding genes from the pangenome, where different groups are highlighted (different colors) and genes that encode for uncharacterized proteins (asterisks). Black dots represent the median of the distribution. Note the double grouping of genes inside both +1 and −2 reading frames.

In this work, we used the genome of the SARS-CoV-2 virus, which caused the pandemic that began in 2019. Its genome codes for non-structural proteins (nsp), structural proteins (S, M, E and N) and accessory factors that help correct assembly. The protocol has been adapted for use with approximately 2000 strains sequenced in a standardized manner at the Hospital Universitario San Pedro.

Fig. 2. Viral genes and protocol for calculating the selection pressure. (A) SARS-CoV-2 genome organization. The genome is divided into non-structural genes (nsp genes coming from both ORF1a and ORF1ab), structural genes (S, E, M, and N) and accessory factors. (B) Procedure for calculating the Ka/Ks ratio: 1) Coding sequences (CDS) are obtained from the reference strain (GenBank:MN908947.3). 2) The six putative reading frames for each CDS in the reference strain are extracted, and homologous sequences are searched for in all the strains. 3) The Ka/Ks ratio for each frame can then be calculated using pairwise alignments of the homologs (the same genes from the other viral strains). The analyzed gene is highlighted in blue, and nucleotide changes are shown in red (when they correspond to nonsynonymous changes) or in green (synonymous changes). 4) Finally, the distribution of Ka/Ks ratios from all the viral genes can be shown, where we expect a value lower than 1 for most of the genes in frame +1 (negative selection), and slightly higher values in frame −2. The other four frames, however, should show values greater than 1 (positive selection).
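
Step 2 of the procedure in Fig. 2B (extracting the six putative reading frames of a CDS) can be sketched in Python as follows. This is a minimal sketch and not the code used in the study: the helper names (`reverse_complement`, `six_frames`) are hypothetical, and the numbering of the reverse-strand frames (−1, −2, −3 counted from the 5' end of the reverse complement) is an assumption, since conventions differ between tools.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(cds):
    """Return the six reading frames of a CDS as a dict.

    Keys are '+1', '+2', '+3' (forward strand) and '-1', '-2', '-3'
    (reverse complement). Each value is trimmed to a whole number of codons.
    """
    frames = {}
    strands = (("+", cds), ("-", reverse_complement(cds)))
    for sign, strand in strands:
        for offset in range(3):
            sub = strand[offset:]
            sub = sub[: len(sub) - len(sub) % 3]  # drop the incomplete last codon
            frames[f"{sign}{offset + 1}"] = sub
    return frames


# Toy sequence, used only to show the output shape.
for name, seq in six_frames("ATGAAACCCGGGTAA").items():
    print(name, seq)
```

Each of the six sequences could then be aligned codon by codon against its homolog in another strain to count synonymous and non-synonymous changes.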

The structural and accessory genes of the virus, which encode the proteins most exposed to the host immune system and that interact with proteins of the infected cell, show positive selection. In contrast, the genes for non-structural proteins, which are involved in the processes that take place once the viral genome has entered the cell, show negative selection, suggesting that these proteins have essential functions.

Fig. 3. Ka/Ks ratio for SARS-CoV-2 genes. (A) Ka/Ks ratio versus the p-value obtained (which depends on both the number of pairwise alignments used and the length of the sequence) for all SARS-CoV-2 genes. The different shapes of the dots highlight the six possible reading frames, and the different colors highlight the different gene clusters (see legend). The points corresponding to frame +1 are labeled with the gene name. The right-hand panel zooms in on p-values between 0.001 and 0.1 to show the dispersion more clearly. (B) Ka/Ks ratio distribution separated by the 6 reading frames analyzed and by gene type (structural, non-structural, and accessory).

The next step was to determine whether these variations in the Ka/Ks ratio were global or localized to specific regions of each gene. A sliding window protocol was designed for this purpose. Most structural genes show global changes in the Ka/Ks ratio. However, the spike (S) gene has higher ratios in the regions whose encoded protein segments interact with host cells (the receptor binding domain, which binds the ACE2 receptor).

Fig. 4. Ka/Ks ratio distribution along the sequence of structural genes. (A) The Ka/Ks ratio is calculated from each pairwise alignment of each gene in a window of 57 nucleotides. The window slides in steps of 9 nucleotides, and the complete profile is finally plotted along the entire length of the gene. For the E gene, a window of 30 nucleotides with a step of 6 was used because of its short length. (B) Distribution of Ka/Ks along the length of genes S, M, N and E (black line). The percentage of mutations per position obtained from the Nextstrain database is also shown for comparison (https://nextstrain.org/ncov/gisaid/global/6m). The primary Y-axis represents the Ka/Ks ratio, and the secondary Y-axis the percentage of mutations in relative value. In addition, variants of concern (VOC) from the Outbreak.info database (colored stars) and the Pfam domains have been included (below): S → bCoV_S1_N (PF16451, Betacoronavirus-like spike glycoprotein S1, N-terminal), bCoV_S1_RBD (PF09408, Betacoronavirus spike glycoprotein S1, receptor binding), CoV_S1_C (PF19209, Coronavirus spike glycoprotein S1, C-terminal), CoV_S2 (PF01601, Coronavirus spike glycoprotein S2); M → CoV_M (PF01635, Coronavirus M matrix/glycoprotein); N → bCoV_lipid_BD (PF09399, Betacoronavirus lipid binding protein), bCoV_Orf14 (PF17635, Betacoronavirus uncharacterised protein 14), CoV_nucleocap (PF00937, Coronavirus nucleocapsid); E → CoV_E (PF02723, Coronavirus small envelope protein E). The blue line marks the Ka/Ks value of 1.
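
The window coordinates described in Fig. 4A can be generated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `sliding_windows` is hypothetical, coordinates are 0-based and half-open, and only windows that fit entirely inside the gene are produced. How the original protocol treats the end of the gene is not described here, so that last choice is an assumption.

```python
def sliding_windows(length, window=57, step=9):
    """Yield (start, end) coordinates of windows along a gene.

    Coordinates are 0-based and half-open, and only complete windows are
    returned. Window and step must be multiples of 3 so that every window
    stays aligned to codon boundaries.
    """
    if window % 3 or step % 3:
        raise ValueError("window and step must be multiples of 3")
    for start in range(0, length - window + 1, step):
        yield start, start + window


# Default parameters (57-nt window, 9-nt step) on a toy 75-nt sequence.
print(list(sliding_windows(75)))

# Shorter parameters, as used for the E gene (30-nt window, 6-nt step).
print(list(sliding_windows(42, window=30, step=6)))
```

A Ka/Ks value computed on each window, plotted against the window position, gives a profile like the black line in Fig. 4B.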

Fig. 5. SARS-CoV-2 Spike protein structure highlighting regions with a higher Ka/Ks ratio. (A) Surface and cartoon representation of Spike protein (PDB:6VXX). The receptor binding domain has been marked at the top. (B) Receptor binding domain viewed from above. (C) Surface and cartoon representation of the N-terminal region of the N protein (PDB:6M3M, positions 41-174). Amino acids involved in binding to the virus genome, whose mutations are known to affect this binding, have been labeled along with their position in the protein sequence. (D) Surface and cartoon representation of the C-terminal region of the N protein (PDB:6WJI, positions 257-364). All the structures have been colored with different intensities of red depending on the value of the Ka/Ks ratio.

The Ka/Ks ratio was also analyzed in ORFs that overlap important structural genes. In general, a low ratio was obtained, which could partly explain the low Ka/Ks values that structural genes show in alternative reading frames (frames +2 and +3) relative to the value of their reading frame +1.

Fig. 6. Ka/Ks ratio in overlapping genes. (A) Overlapping regions of S, N and ORF3 genes. The frame relative to the main gene has been differently colored: +2 (red), +3 (green). (B) Ka/Ks ratio versus p-value for overlapping ORFs. Genes were distinguished by different colors, and frames by different shapes. (C) Ka/Ks ratio distribution separated by the 6 reading frames analyzed. (D) Distribution of Ka/Ks ratio along the length of gene ORF3. The percentage of mutations per position obtained from Nextstrain database is also shown for comparison (https://nextstrain.org/ncov/gisaid/global/6m). The primary Y-axis represents the Ka/Ks ratio, and the secondary Y-axis the percentage of mutations in relative value. In addition, variants of concern (VOC) from the Outbreak.info database were added (colored stars). The blue line marks the Ka/Ks value of 1.

The results presented here show how to analyze the selection pressure to which the genes of a viral genome are subjected. This analysis is useful not only for locating highly conserved regions and drug targets, but also for studying overlapping genes.
