High content of proteins containing 21st and 22nd amino acids, selenocysteine and pyrrolysine, in a symbiotic deltaproteobacterium of gutless worm Olavius algarvensis

ABSTRACT

Selenocysteine (Sec) and pyrrolysine (Pyl) are rare amino acids that are cotranslationally inserted into proteins and known as
the 21st and 22nd amino acids in the genetic code. Sec and Pyl are encoded by UGA and UAG codons, respectively, which normally serve as stop signals. Herein, we report on unusually large selenoproteomes and pyrroproteomes in a symbiont metagenomic dataset of a marine gutless worm, Olavius algarvensis. We identified 99 selenoprotein genes that clustered into 30 families, including 17 new selenoprotein genes that belong to six families. In addition, several Pyl-containing proteins were identified in this dataset. Most selenoproteins and Pyl-containing proteins were present in a single deltaproteobacterium, δ1 symbiont, which contained the largest number of both selenoproteins and Pyl-containing proteins of any organism reported to date. Our data contrast with previous observations that symbionts and host-associated bacteria either lose Sec utilization or possess a limited number of selenoproteins, and suggest that the environment in the gutless worm promotes Sec and Pyl utilization. Anaerobic conditions and a consistent selenium supply might be the factors that support the use of amino acids that extend the genetic code.

INTRODUCTION

Selenium (Se) is an essential micronutrient due to
its requirement for biosynthesis and function of the 21st amino acid, selenocysteine (Sec). This amino acid is typically found in the active sites of a small number of selenoproteins in all three domains of life: archaea, bacteria and eukaryotes (1–4). Biosynthesis of Sec and its cotranslational insertion into polypeptides require a complex molecular machinery that recodes in-frame UGA codons, which normally function as stop signals, to serve as Sec codons (5–9). Although the occurrence of selenoprotein genes is limited, the Sec UGA codon has become the first addition to the universal genetic code since the code was deciphered 40 years ago (10).

The mechanism of Sec insertion differs in the three domains of life. In bacteria, this process has been most thoroughly elucidated in Escherichia coli (1,2,6). Translation of bacterial selenoprotein mRNA requires both a selenocysteine insertion sequence (SECIS) element, which is a stem-loop structure immediately downstream of the Sec-encoding UGA codon (5,11,12), and trans-acting factors dedicated to Sec incorporation (8). In archaea and eukaryotes, SECIS elements are located in 3′-UTRs and some factors involved in Sec biosynthesis and insertion are different. The recent identification of Sec synthase, SecS, in eukaryotes, which is different from the bacterial Sec synthase, SelA, provided important insights into Sec biosynthesis in these organisms (13).

Recently, an additional rare amino acid, pyrrolysine (Pyl), was identified, which expanded the
canonical genetic code to 22 amino acids (14,15). Pyl is inserted in response to a UAG codon in several methanogenic archaea (14). Although the mechanism of Pyl biosynthesis and incorporation into protein is not fully understood, the presence of a tRNAPyl gene (pylT) with the CUA anticodon and of a class II aminoacyl-tRNA synthetase gene (pylS) argued for cotranslational incorporation of Pyl (15). In Desulfitobacterium hafniense, the single bacterium in which a Pyl-containing protein was found, PylS consists of two proteins: PylSn and PylSc (15).

In recent years, large-scale genome sequencing projects, including both organism-specific and environmental metagenomic projects, provided a large volume of gene and protein sequence information. However, selenoprotein genes are almost universally misannotated in these datasets because UGA has the dual function of encoding Sec and terminating translation, and only the latter function is recognized by current annotation programs. Several bioinformatics tools have been developed to address this problem and can be used to identify selenoprotein genes (16–22). These programs have successfully identified many new selenoproteins in both prokaryotic and
eukaryotic genomes, as well as in the Sargasso Sea environmental samples (23).

Complex symbiotic relationships between bacteria and multicellular eukaryotes have evolved in several environments, but science has traditionally focused on interactions that are pathogenic (24). Recently, there has been
increased recognition of symbiotic interactions that benefit both the microorganism and the host (25). A recent metagenomic analysis of the symbiotic microbial consortium of the marine oligochaete Olavius algarvensis, a worm lacking a mouth, gut and nephridia, revealed four major co-occurring symbionts, which belong to Deltaproteobacteria (δ1 and δ4) and Gammaproteobacteria (γ1 and γ3), as well as one minor Spirochaete species. Since some Deltaproteobacteria are selenoprotein-rich organisms (27), we analyzed the selenoproteomes of these symbionts to examine a possible relationship between selenium
and symbiosis.

To characterize the selenoproteomes of these symbionts, we adopted a Sec/cysteine (Cys) homology-based search approach, which has been successfully used to characterize the selenoproteomes of both prokaryotes (22) and one of the largest prokaryotic sequencing projects, the Sargasso Sea microbial sequencing project (23). We detected known selenoproteins present in this metagenomic dataset and identified several novel selenoproteins. Interestingly, one deltaproteobacterium, δ1 symbiont, contains at least 57 selenoproteins, which is the largest number of selenoproteins reported to date in
any organism. In addition, several Pyl-containing proteins were identified and most were also found in the same δ1 symbiont. Our results provide new insights into the evolution and function of these rare amino acids.

MATERIALS AND METHODS

Databases and resources

Assembled sequences of the Olavius symbionts metagenome were obtained from NCBI with the project accession number AASZ00000000 (ftp://ftp.ncbi.nih.gov/genbank/wgs/wgs.AASZ.1.gbff.gz). The database contained 5597 genomic sequences, which corresponded to a total of 23.7 million nucleotides. The non-redundant (NR) protein database was downloaded from the NCBI FTP server. This dataset contained a total of 4 644 764 protein sequences (1 603 127 260 amino acids). BLAST (28) was also obtained from NCBI.

Identification of Cys/TGA pairs in homologous sequences and minimal ORFs

Each Cys-containing protein sequence in the NR database was initially searched against the Olavius symbionts metagenomic database for possible TGA/TAG/TAA-containing homologs using TBLASTN with default parameters. Only local alignments in which Cys in the query protein was aligned with a TGA codon in the nucleotide sequence from the Olavius symbionts metagenomic database were selected for further analysis. For each TGA-containing nucleotide sequence identified in the metagenomic database, regions upstream and downstream of the putative in-frame TGA codon were analyzed to identify a minimal ORF.
If a stop codon was found between the in-frame TGA codon and an initiation codon (ATG or GTG), such a TGA-containing sequence was discarded.

Analyses of TGA-flanking regions and sequence clustering

We analyzed the conservation of TGA-flanking regions in all six reading frames using BLASTX. If the best
hit, which covered the TGA codon with at least a 10-nt overlap, was in a different reading frame than the TGA codon, the corresponding sequence was filtered out. RPS-BLAST was then used to search against the Conserved Domain Database (CDD). If the best hit that covered the TGA codon with at least a five-residue overlap was in a different reading frame, or additional stop codons appeared within the conserved domain in the same frame, the sequence was removed.

We used BL2SEQ to cluster the remaining protein sequences into different groups. If a local alignment of two proteins had an E-value below 10⁻⁴ and was at least 20 amino acids long, and the predicted Sec residues were located at the same position or very close (no more than three residues apart) in the alignment, the two proteins were assigned to the same cluster.

Cysteine conservation and selenoprotein classification

All clusters were automatically searched against NCBI NR and microbial databases using BLASTX and TBLASTX. Each predicted ORF containing an in-frame TGA was considered further only if at least two corresponding Cys-containing homologs were detected and the proportion of TGA/Cys pairs in the set of homologs was 50%.

The
remaining clusters were analyzed for the occurrence of bacterial SECIS elements, located immediately downstream of the in-frame TGA codons, using the bSECISearch program (19). The final clusters were manually analyzed and divided into three groups: known selenoproteins, new selenoproteins (clusters containing at least two different sequences with conserved in-frame TGA codons) and selenoprotein candidates (clusters containing only one sequence). It should be noted that sequencing errors that generate in-frame UGA codons could not be excluded for selenoprotein candidates.

Identification of Pyl operon proteins and known Pyl-containing proteins

PylT and PylS sequences from Methanosarcina barkeri (accession number AY064401) were used to search for possible homologs in the metagenomic dataset. Candidate tRNAPyl was further analyzed to identify
structural features associated with known tRNAPyl, including a six base-pair acceptor stem and a base between the D and acceptor stems (15). Other genes in the Pyl operon (pylB, pylC, pylD) were also analyzed by comparative sequence analyses.

The TBLASTN program with default parameters was used to search for known Pyl-containing methylamine methyltransferases. Open reading frames (ORFs) and conservation of UAG-flanking regions were examined manually. Multiple alignments were generated with ClustalW (29).
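
As an illustration of the minimal-ORF filter described under "Identification of Cys/TGA pairs in homologous sequences and minimal ORFs", the following Python sketch scans in frame around a candidate recoded codon. It is not the authors' implementation: the function name minimal_orf, its inputs, the choice of the nearest upstream ATG or GTG as the ORF start, and the handling of contig edges are all assumptions made for this example.

```python
from typing import Optional, Tuple

START_CODONS = {"ATG", "GTG"}        # initiation codons named in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons


def minimal_orf(seq: str, recoded_pos: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of a minimal ORF around an in-frame recoded codon.

    seq         -- coding-strand nucleotide sequence (hypothetical input)
    recoded_pos -- 0-based index of the first base of the in-frame TGA
    Returns None when the candidate would be discarded.
    """
    seq = seq.upper()
    if seq[recoded_pos:recoded_pos + 3] not in STOP_CODONS:
        raise ValueError("recoded_pos does not point at a stop codon")

    # Walk upstream codon by codon, staying in the frame of the recoded codon.
    start = None
    for i in range(recoded_pos - 3, -1, -3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            return None      # stop codon reached before any start: discard
        if codon in START_CODONS:
            start = i        # nearest upstream start gives the minimal ORF
            break
    if start is None:
        return None          # assumption: no start on the contig also discards

    # Walk downstream to the first stop codon after the recoded one.
    for j in range(recoded_pos + 3, len(seq) - 2, 3):
        if seq[j:j + 3] in STOP_CODONS:
            return start, j + 3

    # Assumption: if the contig ends first, truncate at the last full codon.
    remaining = len(seq) - (recoded_pos + 3)
    return start, recoded_pos + 3 + (remaining // 3) * 3
```

Pointing recoded_pos at an in-frame TAG would apply the same kind of check to a candidate Pyl codon, although the text states that ORFs and UAG-flanking regions were examined manually.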