Sequences - peptide, DNA, RNA

Peptide sequence format

Code: peptide

Peptides can be entered using one-letter or three-letter amino acid abbreviations.

A text file containing sequences should use only one type of abbreviation (either one-letter or three-letter, but not both). Each sequence must be on its own line, written continuously without spaces. Abbreviations used:

3-letter    1-letter
Ala         A
Arg         R
Asn         N
Asp         D
Asx         B
Cys         C
Gln         Q
Glu         E
Glx         Z
Gly         G
His         H
Ile         I
Leu         L
Lys         K
Met         M
Phe         F
Pro         P
Pyl         O
Sec         U
Ser         S
Thr         T
Trp         W
Tyr         Y

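Example

Valid files

One-letter abbreviations only:
PPPALPPKKR
APTMLPPASDFA

Three-letter abbreviations only:
ProProProAlaLeuProProLysLysArg
AlaProThrMetProProProLeuProPro

Invalid files

One-letter and three-letter lines mixed in the same file:
PPPALPPKKR
AlaProThrMetProProProLeuProPro

One-letter and three-letter abbreviations mixed within a line:
ProProProAlaLeuProProLysLysArg
AlaProThrMetPPPLPP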
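
The rules above can be checked before a file is imported. The following is a minimal sketch and not part of Marvin: the function names classify_line and check_peptide_file are hypothetical, and the accepted abbreviations are assumed to be exactly those listed in the table above.

```python
import re

# Abbreviations copied from the table above; anything else is rejected
# by this sketch (assumption: the table is the complete list).
THREE_LETTER = {
    "Ala", "Arg", "Asn", "Asp", "Asx", "Cys", "Gln", "Glu", "Glx", "Gly",
    "His", "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Pyl", "Sec", "Ser",
    "Thr", "Trp", "Tyr",
}
ONE_LETTER = set("ARNDBCQEZGHILKMFPOUSTWY")


def classify_line(line):
    """Return 'one', 'three', or None for a single sequence line."""
    if line and all(ch in ONE_LETTER for ch in line):
        return "one"
    # Split into Xxx tokens; the tokens must cover the whole line.
    tokens = re.findall(r"[A-Z][a-z]{2}", line)
    if tokens and "".join(tokens) == line and all(t in THREE_LETTER for t in tokens):
        return "three"
    return None


def check_peptide_file(lines):
    """True if every line is valid and all lines use the same abbreviation type."""
    kinds = {classify_line(line) for line in lines}
    return None not in kinds and len(kinds) == 1
```

Applied to the examples above, check_peptide_file accepts both valid files and rejects both invalid ones, because a line with spaces or with mixed abbreviations is classified as None.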

Custom amino acids

In addition to the amino acids listed above, a custom amino acid dictionary can be defined.

The custom_aminoacids.dict file is stored in the .chemaxon directory (UNIX) or in the user's chemaxon directory (MS Windows).


The usual format of the dictionary file is:

molName=L-Alanine Ala A [CX4H3][C@HX4H1]([NX3])C=O |wD:1.1,(3.85,-1.33,;2.31,-1.33,;1.54,-2.67,;1.54,,;)| 3 4
molName=L-Cysteine Cys C [NX3][C@@HH1]([CH2][SH1])C=O |wD:1.0,(1.54,-2.67,;2.31,-1.33,;3.85,-1.33,;4.62,-2.67,;1.54,,;)| 1 5 4


where the corresponding columns are:

name
    Optional field (introduced in Marvin 6.2).
    Example: molName=name

long (three-letter code) abbreviation
    A capital letter followed by two lowercase letters.
    Example: Ala

short (one-letter code) abbreviation
    A single capital letter, or the letter X followed by further characters in parentheses.
    Allowed characters are the letters of the alphabet, digits, and the dash character.
    Example: molName=Sarcosine    Sar    X(Sar) ....

SMARTS representation of the amino acid fragment without the terminal OH
    The SMARTS strings representing amino acid fragments specify the hydrogen counts, and sometimes the number of connections, to avoid ambiguity.
    For example, if only the string C[C@H](N)C=O were used for L-alanine in the first example, it would also match many other amino acids, because they contain this string as a substructure.
    Query bonds are not allowed.

coordinates of the structure
    Molecular coordinates are needed for cleaning. If they are missing, Ctrl+2 creates the coordinates for the structure.
    Coordinates can be generated by Molconvert using the cxsmarts:c option.

the number of the backbone nitrogen in the SMARTS string
    3 for Ala in the first example

the number of the C-terminal carbon
    4 for Ala in the first example

the number of another attachment point, if needed
    4 (the sulfur atom) for L-cysteine in the second example

The name and the coordinates are optional fields.

The columns should be separated by tab characters.

To describe an aromatic custom amino acid, both the aromatic and the Kekulé forms should be in the custom_aminoacids.dict file with the same short and long names.
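
To make the column layout concrete, here is a minimal sketch of reading one dictionary line. It is not Marvin's own parser: AminoAcidEntry and parse_entry are hypothetical names, and the sketch assumes that every column, including the coordinates, is its own tab-separated field and that the coordinates field starts with the | character, as in the examples above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Patterns derived from the column descriptions above (assumed, not official).
LONG_CODE = re.compile(r"[A-Z][a-z]{2}")
SHORT_CODE = re.compile(r"[A-Z]|X\([A-Za-z0-9-]+\)")


@dataclass
class AminoAcidEntry:
    name: Optional[str]            # optional molName= column
    long_code: str                 # e.g. Ala
    short_code: str                # e.g. A or X(Sar)
    smarts: str                    # fragment without the terminal OH
    coordinates: Optional[str]     # optional |...| column
    nitrogen: int                  # number of the backbone nitrogen
    c_terminal: int                # number of the C-terminal carbon
    other_attachment: Optional[int]


def parse_entry(line: str) -> AminoAcidEntry:
    fields = line.rstrip("\r\n").split("\t")
    name = None
    if fields[0].startswith("molName="):
        name = fields.pop(0)[len("molName="):]
    if len(fields) < 5:
        raise ValueError("expected long code, short code, SMARTS and two atom numbers")
    long_code, short_code, smarts = fields[:3]
    rest = fields[3:]
    coordinates = None
    if rest[0].startswith("|"):
        coordinates = rest.pop(0)
    if not LONG_CODE.fullmatch(long_code):
        raise ValueError(f"bad long abbreviation: {long_code!r}")
    if not SHORT_CODE.fullmatch(short_code):
        raise ValueError(f"bad short abbreviation: {short_code!r}")
    numbers = [int(value) for value in rest]
    if len(numbers) not in (2, 3):
        raise ValueError("expected two or three atom numbers")
    other = numbers[2] if len(numbers) == 3 else None
    return AminoAcidEntry(name, long_code, short_code, smarts,
                          coordinates, numbers[0], numbers[1], other)
```

For the L-cysteine line above, this sketch yields nitrogen 1, C-terminal carbon 5, and other attachment point 4. A line without the name and coordinates columns parses the same way, with both fields set to None.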

DNA/RNA sequence format

DNA/RNA sequences can be entered using one-letter nucleic acid abbreviations. Each sequence must be on its own line, written continuously without spaces. Abbreviations used:

DNA    RNA
A      A
C      C
G      G
T      U

Code: dna, rna

Example

Valid files
ACGTACGT
ACCCCGTGGGT
A-C-G-T-A-C-G-T
A-C-C-C-C-G-T-G-G-G-T
dA-dC-dG-dT-dA-dC-dG-dT
dA-dC-dC-dC-dC-dG-dT-dG-dG-dG-dT
Invalid files
acgtacgt
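
The accepted line shapes in the examples above can be summarized with regular expressions. This is a minimal sketch, not Marvin's own check: is_valid_nucleotide_line is a hypothetical name, and the sketch assumes that the dash-separated form is allowed for both DNA and RNA, while the d prefix is only accepted for DNA, since the examples show it only with DNA letters.

```python
import re

# Plain, dash-separated, or d-prefixed dash-separated upper-case letters.
DNA_LINE = re.compile(r"[ACGT]+|[ACGT](?:-[ACGT])+|d[ACGT](?:-d[ACGT])*")
# Assumption: RNA accepts the plain and dash-separated forms only.
RNA_LINE = re.compile(r"[ACGU]+|[ACGU](?:-[ACGU])+")


def is_valid_nucleotide_line(line, code="dna"):
    """Check one sequence line against the format code 'dna' or 'rna'."""
    pattern = {"dna": DNA_LINE, "rna": RNA_LINE}[code]
    return pattern.fullmatch(line) is not None
```

With these patterns, all six valid example lines are accepted as DNA, and the lower-case line acgtacgt is rejected.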