English Paper: A Study on SVM-Based Prediction of Protein–DNA Interactions (2)

iii. INTRODUCTION

DNA-binding proteins are generally involved in regulating the expression of genes. Some bind to specific DNA motifs, whereas histones, which form part of chromatin structure, bind to DNA less specifically. Proteins that repair DNA, such as uracil-DNA glycosylase, also interact closely with it. Proteins generally bind to DNA in the major groove, although there are exceptions. Past reports have shown that 2%–3% of a prokaryotic genome and 6%–7% of a eukaryotic genome encode DNA-binding proteins [3]. The interactions can be formed by different domains, such as the zinc finger or the helix-turn-helix. These interactions are involved in a variety of biological processes, including DNA replication, DNA repair, viral infection, DNA packing, and DNA modification [4]. Understanding the molecular mechanisms by which proteins called transcription factors (TFs) recognize their specific binding sites encoded in genomic DNA is one of the main long-standing problems in molecular biophysics. Surprisingly, some experiments have demonstrated that the DNA surrounding a specific TF binding site greatly influences binding specificity. We expect that our results will significantly advance the understanding of the molecular and biophysical principles of transcriptional regulation, and greatly improve the ability to predict how DNA sequences influence gene expression programs in the cells of living organisms.

The use of computational methods to predict DNA–protein interactions has many advantages. It is less tedious and time-consuming than performing the physical experiments. It is also financially beneficial: fewer samples need to be obtained for experimentation, so much of the money spent on buying and maintaining materials is saved.

The practical applications of DNA–protein interaction prediction are vast. Drugs can target DNA-binding proteins by means of molecules that bind to parts of the double helix and interfere with the interactions between DNA and proteins. One example is the use of telomerase inhibitors, an area related to cancer treatment. Telomeres lie at the ends of chromosomes; they protect the ends from damage and help ensure that DNA replication occurs correctly. Somatic cells have a limited life span. A tumour cell, on the other hand, keeps its telomere ends stable so that it can continue to survive. This presents a predicament for treatment, and telomerase inhibitors offer researchers a way to address it.

Several data sets have been created that hold information for DNA-binding site identification. DBP374 was the largest data set used and is well suited for initiating novel studies. New research on DNA–protein interactions may employ a data set that has already been used in the literature, which allows direct comparison with previous studies. Additionally, two specific databases are devoted to protein–DNA interactions, built from information available in the PDB.

The Protein Data Bank (PDB) is a database of three-dimensional structural data for large biological molecules such as proteins and nucleic acids. The data are usually obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and are submitted by biologists and biochemists from around the world. They are freely accessible on the Internet through the websites of the PDB's member organisations. The PDB is a very important tool in biological research. Scientists are now required by many journals to submit their structure data to the PDB. Many other databases use protein structures deposited in the PDB; SCOP and CATH, for example, do so.

In a protein–DNA complex, an amino acid residue in the protein is defined as a binding site if the distance between any atom of this residue and any atom of the DNA molecule is less than a specific cut-off value. Previous studies on DNA–protein binding site prediction have used various definitions of DNA-binding sites [6]. Kuznetsov stated that a cut-off distance of 4.5 Å gave the best separation between binding and nonbinding residues when using evolutionary and structural information to predict binding sites, while Si et al. [6] applied cut-off distances of 3.5, 4.0, 4.5, 5.0, 5.5, and 6.0 Å, together with the solvent accessible surface area (ASA) of binding sites, in two data sets and chose 3.5 Å as the most appropriate definition. DNA-binding residues tend to be exposed to solvent so that they can make contacts with the DNA, which makes relative solvent accessibility a useful predictive feature. Some studies have focused only on surface residues in the prediction [12]. Like secondary structure, the relative ASA can be predicted from the protein sequence or calculated from the protein structure using specific software. The relative ASA of each residue in a protein was calculated with the DNA molecule absent (non-complexed), where the non-complexed form was taken to be the protein structure extracted alone from the PDB file. Surface accessible residues were defined as residues with a relative ASA of >5%. The sequence similarity among proteins in the data set is important to the prediction outcome. Current methods hold that the similarity level should be kept below 30%–35%. A single representative from each protein set was identified, and sub-sequences of other proteins in the data set were eliminated.
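The distance-based definition of a binding residue and the relative ASA filter described above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited studies. The function names, residue labels, coordinates, and ASA values are hypothetical toy data chosen for illustration; in practice the coordinates would be parsed from a PDB file and the relative ASA computed by dedicated software.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def label_binding_residues(protein_atoms, dna_atoms, cutoff=3.5):
    """Return the residue ids that have at least one atom closer than
    `cutoff` (in Å) to at least one DNA atom."""
    binding = set()
    for residue_id, atoms in protein_atoms.items():
        if any(dist(a, d) < cutoff for a in atoms for d in dna_atoms):
            binding.add(residue_id)
    return binding


def surface_residues(relative_asa, threshold=5.0):
    """Return the residue ids whose relative ASA (in percent) is above
    `threshold`; 5% is the value quoted in the text."""
    return {rid for rid, asa in relative_asa.items() if asa > threshold}


# Hypothetical toy data: residue id -> atom coordinates (x, y, z) in Å.
protein_atoms = {
    "ARG10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "GLY11": [(10.0, 0.0, 0.0)],
    "LYS12": [(0.0, 4.0, 0.0)],
}
# Hypothetical DNA atom coordinates in Å.
dna_atoms = [(4.5, 0.0, 0.0), (0.0, 8.2, 0.0)]
# Hypothetical relative ASA values (percent) from the non-complexed protein.
relative_asa = {"ARG10": 42.0, "GLY11": 3.0, "LYS12": 27.5}

binding_35 = label_binding_residues(protein_atoms, dna_atoms, cutoff=3.5)
binding_45 = label_binding_residues(protein_atoms, dna_atoms, cutoff=4.5)
surface = surface_residues(relative_asa)

print("Binding at 3.5 Å:", sorted(binding_35))
print("Binding at 4.5 Å:", sorted(binding_45))
print("Surface residues:", sorted(surface))
```

In this toy example the residue labelled LYS12 lies 4.2 Å from the nearest DNA atom, so it counts as a binding residue at the 4.5 Å cut-off but not at 3.5 Å. This shows how the choice of cut-off changes the labels that a classifier is trained on.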