Sequences - peptide, DNA, RNA

    Peptide sequence format

    Code: peptide

    Peptides can be entered using one-letter or three-letter amino acid abbreviations.

    A text file containing sequences should contain only one type of sequence (either one-letter or three-letter abbreviations, but not both). Each sequence must occupy exactly one continuous line in the text file, without spaces. Abbreviations used:

    3-letter Ala Arg Asn Asp Asx Cys Gln Glu Glx Gly His Ile Leu Lys Met Phe Pro Pyl Sec Ser Thr Trp Tyr
    1-letter A R N D B C Q E Z G H I L K M F P O U S T W Y
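
    The two rows above pair up position by position, so a lookup table can be built directly from them. The following is a minimal Python sketch of a three-letter to one-letter conversion; the function name three_to_one is hypothetical and is not part of any ChemAxon API.

```python
# Abbreviation rows copied from the table above, in the same order.
THREE = ("Ala Arg Asn Asp Asx Cys Gln Glu Glx Gly His Ile "
         "Leu Lys Met Phe Pro Pyl Sec Ser Thr Trp Tyr").split()
ONE = "A R N D B C Q E Z G H I L K M F P O U S T W Y".split()
THREE_TO_ONE = dict(zip(THREE, ONE))


def three_to_one(sequence: str) -> str:
    """Convert a continuous three-letter sequence to one-letter codes.

    Hypothetical helper for illustration only.
    """
    if len(sequence) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    # Cut the line into consecutive three-character abbreviations.
    codes = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    try:
        return "".join(THREE_TO_ONE[code] for code in codes)
    except KeyError as err:
        raise ValueError(f"unknown abbreviation: {err.args[0]}") from None
```

    For instance, three_to_one("ProProProAlaLeuProProLysLysArg") yields the one-letter form of the same peptide.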

    Example

    Valid files
    
    PPPALPPKKR
    APTMLPPASDFA
    
    ProProProAlaLeuProProLysLysArg
    AlaProThrMetProProProLeuProPro

    Invalid files
    
    PPPALPPKKR
    AlaProThrMetProProProLeuProPro
    
    ProProProAlaLeuProProLysLysArg
    AlaProThrMetPPPLPP
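
    The rule illustrated by the files above (one abbreviation style per file, one continuous sequence per line) can be checked with a short script. This is a sketch under assumptions: the names line_kind and file_is_valid are hypothetical, and blank lines are treated as invalid, which the format description does not state explicitly.

```python
ONE_LETTER = set("ARNDBCQEZGHILKMFPOUSTWY")
THREE_LETTER = set(
    "Ala Arg Asn Asp Asx Cys Gln Glu Glx Gly His Ile "
    "Leu Lys Met Phe Pro Pyl Sec Ser Thr Trp Tyr".split()
)


def line_kind(line):
    """Return 'one', 'three', or None for a single sequence line."""
    if not line:
        return None  # assumption: an empty line is not a valid sequence
    if all(ch in ONE_LETTER for ch in line):
        return "one"
    # Cut into three-character chunks; a short last chunk never matches.
    triples = [line[i:i + 3] for i in range(0, len(line), 3)]
    if all(t in THREE_LETTER for t in triples):
        return "three"
    return None


def file_is_valid(text):
    """True if every line uses the same, recognised abbreviation style."""
    kinds = {line_kind(line) for line in text.splitlines()}
    return kinds == {"one"} or kinds == {"three"}
```

    Because one-letter codes are all uppercase and three-letter codes contain lowercase letters, a line cannot be classified as both styles.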

    Custom amino acids

    In addition to the amino acids listed above, a custom amino acid dictionary can be defined.

    The custom_aminoacids.dict file is stored in the .chemaxon directory (UNIX) or in the user's chemaxon directory (MS Windows).

    The usual format of the dictionary file is:

    
    molName=L-Alanine   Ala A   [#6;A;H3X4][#6@H;H1X4]([#7;A;X3])-[#6]=O |wD:1.1,(3.85,-1.33,;2.31,-1.33,;1.54,-2.67,;1.54,,;)| 3   4
    molName=L-Cysteine  Cys C   [#7;A;X3][#6@@H;H1](-[#6H2]-[#16H1])-[#6]=O |wD:1.0,(1.54,-2.67,;2.31,-1.33,;3.85,-1.33,;4.62,-2.67,;1.54,,;)|  1   5   4

    where the corresponding columns are:

    name
        Optional (introduced in Marvin 6.2). Written as molName=name.
    long (three-letter code) abbreviation
        A capital letter followed by two lowercase letters, e.g. Ala.
    short (one-letter code) abbreviation
        X followed by characters in parentheses. Allowed characters are the letters of the alphabet, numbers and the dash character, e.g. molName=Sarcosine Sar X(Sar) ....
    SMARTS representation of the amino acid fragment without terminal OH
        The SMARTS strings representing amino acid fragments denote the hydrogens, and sometimes the connection numbers, to avoid ambiguity. For example, if only the C[C@H](N)C=O string were used for L-alanine in the first example, it would also match many other amino acids, because they contain this string as a substructure. No query bonds are allowed.
    coordinates of the structure
        Molecular coordinates are needed for cleaning. If they are missing, Ctrl+2 creates the coordinates for the structure. Coordinates can be generated by Molconvert using the cxsmarts:c option.
    the number of the backbone nitrogen in the SMARTS string
        3 for Ala in the first example.
    the number of the C-terminal carbon
        4 for Ala in the first example.
    the number of the other attachment point, if needed
        4 for L-cysteine in the second example.

    The columns should be separated by Tab characters.

    The name is an optional field. If omitted, the entry should start directly with the 3-letter abbreviation (no Tab character is required).
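
    To make the column layout concrete, here is a minimal Python sketch that splits one dictionary line into its fields. It is not ChemAxon code: the names AminoAcidEntry and parse_entry are hypothetical, and the sketch assumes that the SMARTS string and the coordinate block share one Tab-separated column, separated by a single space, as they appear to in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AminoAcidEntry:
    name: Optional[str]         # value after "molName=", if present
    long_code: str              # three-letter abbreviation
    short_code: str             # one-letter abbreviation or X(...)
    smarts: str                 # fragment without terminal OH
    coordinates: Optional[str]  # "|...|" block, if present
    n_index: int                # backbone nitrogen (1-based in the examples)
    c_index: int                # C-terminal carbon
    other_index: Optional[int]  # extra attachment point, if any


def parse_entry(line: str) -> AminoAcidEntry:
    """Parse one Tab-separated dictionary line (illustrative only)."""
    fields = line.rstrip("\r\n").split("\t")
    name = None
    if fields[0].startswith("molName="):
        name = fields.pop(0)[len("molName="):]
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    long_code, short_code, structure, n_index, c_index = fields[:5]
    # Assumption: SMARTS and coordinates are separated by the first space.
    smarts, _, coordinates = structure.partition(" ")
    return AminoAcidEntry(
        name=name,
        long_code=long_code,
        short_code=short_code,
        smarts=smarts,
        coordinates=coordinates or None,
        n_index=int(n_index),
        c_index=int(c_index),
        other_index=int(fields[5]) if len(fields) == 6 else None,
    )
```

    The optional name is detected by its molName= prefix, so an entry that starts directly with the three-letter abbreviation is parsed with the same code path.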

    Example

    Phosphoserine can be added as a custom amino acid to the dictionary as follows:

    
    molName=Phosphoserine   Sep     X(Sep)  [#7]-[#6@@H](-[#6]-[#8]P([#8])([#8])=O)-[#6]=O |(1.54,-2.67,;2.31,-1.33,;3.85,-1.33,;4.62,-2.67,;6.16,-2.67,;7.7,-2.67,;6.16,-1.13,;6.16,-4.21,;1.54,,;)|       1       9

    Note: To describe an aromatic custom amino acid, both the aromatic and the Kekulé forms should be included in the custom_aminoacids.dict file with the same short and long names.

    See also Peptide import and export options

    DNA/RNA sequence format

    DNA/RNA sequences can be entered using one-letter nucleic acid abbreviations. Each sequence must occupy exactly one continuous line in the text file, without spaces. Abbreviations used:

    DNA A C G T
    RNA A C G U

    Code: dna, rna

    Example
    Valid files:
    
    ACGTACGT
    ACCCCGTGGGT
    
    A-C-G-T-A-C-G-T
    A-C-C-C-C-G-T-G-G-G-T
    
    dA-dC-dG-dT-dA-dC-dG-dT
    dA-dC-dC-dC-dC-dG-dT-dG-dG-dG-dT

    Invalid files
    
    acgtacgt
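
    The accepted line shapes can be approximated with regular expressions. The sketch below is inferred only from the examples above, so it rests on several assumptions: letters must be uppercase, a line is either continuous or dash-separated throughout, and the d prefix is accepted for DNA only. The name is_valid_line is hypothetical.

```python
import re

# Letters per sequence type, taken from the abbreviation table above.
LETTERS = {"dna": "ACGT", "rna": "ACGU"}


def is_valid_line(line: str, kind: str) -> bool:
    """Check one line against the shapes seen in the examples."""
    letters = LETTERS[kind]
    patterns = [
        rf"[{letters}]+",                # ACGTACGT
        rf"[{letters}](-[{letters}])*",  # A-C-G-T
    ]
    if kind == "dna":
        # Assumption: the d-prefixed form applies to DNA only.
        patterns.append(rf"d[{letters}](-d[{letters}])*")  # dA-dC-dG-dT
    return any(re.fullmatch(p, line) for p in patterns)
```

    With these patterns, the lowercase line acgtacgt is rejected, in line with the invalid example.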