
From Genome to Proteome (extra info)

Genome Projects


Why sequence whole genomes rather than just retrieve the expressed genes from cDNA libraries? The principal reasons are to identify all genes (i.e. the coding potential of a species), determine the broad genomic landscape, study the evolution of the genome, classify genes into families of related functions and analyze genetic polymorphisms.
Genome mapping strategies follow two major, interactive approaches: genetic mapping and physical mapping. Genetic mapping in higher eukaryotes is based on identifying meiotic recombination frequencies between pairs of polymorphic sites on the genome. Physical mapping makes use of the shotgun sequencing strategy, i.e. fragmentation of genomic DNA, determination of partial sequences and assembly of a contiguous region through the fragment overlaps. Both approaches apply old concepts (Morgan, 1910; Sanger, 1953) with the tools of modern technology.
Genes correspond to 25% of the DNA in humans. Of the extragenic DNA (75%), most is repetitive (50%), mainly transposable elements (45%). One fifth of the genome is covered with gene deserts (sequences of >0.5 Mb without any gene). Most human genes are highly interrupted, with 10 introns on average and introns much larger than exons. In total, the coding sequences (exons) represent a tiny 1% of the genome. A set of DNA markers developed through the Human Genome Project is of major value for pinpointing the location of genes. Such genomic markers include Alu elements, GC-rich regions, light G-bands, CpG islands and ESTs.
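The assembly step of the shotgun strategy (joining fragments through their overlaps) can be illustrated with a toy greedy merger. This is only a minimal sketch: the function names, the `min_len` parameter and the example reads are assumptions made for illustration, and real assemblers must also cope with sequencing errors, both strands and repetitive DNA, none of which is handled here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Three hypothetical overlapping reads from one short region
reads = ["ATGCGTA", "GTACGTT", "CGTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads the three fragments collapse into a single contig. When fragments share no sufficiently long overlap, the function returns several contigs, which corresponds to gaps in the physical map.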
How does the genome evolve? Several theories emerge as more and more data assemble from different sequenced species. In the bacterial world, where repetitive and non-coding DNA are uncommon, the coding potential (total gene number) correlates well with the size of the genome (C value). No such correlation is evident in Eukarya, where differences in the gene organization and repetitive DNA elements appear to be critical. With respect to the repetitive DNA and the copy number of genes in gene families, the observed evolutionary differences are explained by a mechanism of DNA duplication-and-transposition. Evidence for the extent of other phenomena, such as horizontal gene transfer and adaptive gene losses, also derives from the comparative analysis of genomic sequences.

Identification of genes and gene products


How many genes are there in the human species? In 2002, Celera Genomics estimated 40 thousand, and the International Human Sequencing Consortium estimated 30 thousand. Today, the estimated number is about 21 thousand. By any measure, the number of human genes is relatively small on the scale of evolutionary complexity; however, it can be explained in terms of the tendency toward larger introns, expanded combinations of domain architectures, and alternative splicing.
How are the gene numbers derived? They are based on a battery of in vitro and in silico methods, both at the genome level (study of sequence homologies and characteristic gene/exon indices) and at the transcriptome level (detection of expressed sequences by use of exon traps, ESTs or microarrayed oligonucleotides). Expressed sequence tags (ESTs) are important in both cases. 5’ESTs (derived from evolutionarily conserved regions) can be used to pinpoint the location of homologous genes in many different species, while 3’ESTs can be used to locate genes specific to the particular species. Overall, the strategic procedure of locating, identifying and finally isolating a gene based on the genome databanks and the combination of the genome mining technologies is known as positional cloning.
En masse transcriptional expression analysis of the whole set of genes in a genome is achieved with the high-density DNA microarray technology. The appropriate oligonucleotides (corresponding to each one of the predicted gene sequences) are arrayed on a miniaturized solid support and allowed to hybridize with fluorescently tagged RNA prepared from the tissue or cell source under study. Quantification of the fluorescent signals emitted from the hybridized samples yields the relative expression level of each gene. Major applications of this transcriptomic type of microarrays include determination of constitutively expressed and tissue-specific genes, determination of alterations in gene expression levels in response to defined molecular stimuli, or identification of genes that become up-regulated or down-regulated in pathological situations. Independently, there are also two genomic types of microarrays, one for DNA sequencing and SNP genotyping, and one for determination of the gene copy numbers.
Assignment of gene products to families of functional relatives (gene families) is based primarily on DNA sequence alignments between different gene homologs that might be orthologs (evolutionary homologs from different species), paralogs (homologs derived from duplication and divergence in the same species) or xenologs (homologs in evolutionarily distant species attributed to horizontal DNA transfer). In the terminology of Molecular Phylogeny, gene families are equivalent to taxa; evolutionary dendrograms for genes (gene trees) are constructed in much the same way as evolutionary dendrograms for species (species trees); however, gene trees can differ from the corresponding species trees, depending on the relative evolution rate of the gene family in question. As a general rule, with increasing evolutionary complexity of Eukarya, the number of gene families increases but approximates asymptotically an upper limit, while the number of members in each family increases dramatically without approximating an apparent upper limit. The same is true of the domain architecture families.
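The step from quantified fluorescent signals to up-regulated or down-regulated genes can be sketched as a log2 ratio between two samples. This is a simplified illustration only: the gene names, the intensity values, the pseudocount and the two-fold threshold are assumptions, and a real analysis would also include background correction, normalization and replicate statistics.

```python
import math


def log2_ratios(sample, reference, pseudocount=1.0):
    """Per-gene log2(sample / reference) from spot intensities.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for spots with no signal.
    """
    return {
        gene: math.log2((sample[gene] + pseudocount) /
                        (reference[gene] + pseudocount))
        for gene in sample
        if gene in reference
    }


def classify(ratios, threshold=1.0):
    """Label each gene; a threshold of 1.0 corresponds to a two-fold change."""
    labels = {}
    for gene, ratio in ratios.items():
        if ratio >= threshold:
            labels[gene] = "up-regulated"
        elif ratio <= -threshold:
            labels[gene] = "down-regulated"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical intensities for three genes in diseased vs. healthy tissue
diseased = {"geneA": 799.0, "geneB": 99.0, "geneC": 199.0}
healthy = {"geneA": 199.0, "geneB": 399.0, "geneC": 199.0}

ratios = log2_ratios(diseased, healthy)
print(ratios)
print(classify(ratios))
```

Working on a log2 scale makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric around zero, which is why expression ratios are usually reported this way.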
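As a rough illustration of how sequence alignments feed into gene trees, the sketch below computes pairwise p-distances (the proportion of differing sites) between aligned sequences. Such a distance matrix is the usual input for distance-based tree building. The sequences and their labels are invented for the example, and the p-distance is only the simplest choice; evolutionary models that correct for multiple substitutions are normally preferred.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping name -> aligned sequence."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Hypothetical aligned fragments of three homologs
aligned = {
    "human_gene": "ATGGCCAAGT",
    "mouse_ortholog": "ATGGCTAAGT",
    "human_paralog": "ATGACTAGGT",
}

for pair, dist in distance_matrix(aligned).items():
    print(pair, round(dist, 2))
```

In this toy data set the human gene is closer to its mouse ortholog than to its own paralog, the pattern expected when the duplication predates the speciation event.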

Proteome Analysis


Why Proteomics? What is the value of the data produced from proteomic analysis, i.e. analysis of the gene products beyond the level of the genome, or the transcriptome, alone? Concisely, proteomics is indispensable for (a) complete annotation of the genome, evaluating false- or true-positive coding sequences, (b) protein expression profiling, which differs from the mRNA expression profiling of the genes, (c) protein function analysis, especially with respect to the "unknown-function" portion of gene products, (d) post-translational modification analysis, (e) the study of subcellular localization and targeting of proteins, (f) protein-interaction profiling.
The typical proteomic analysis proceeds with 3 major steps: (1) Separation and display of proteins of the proteome in a 2-dimensional electrophoresis (2-DE) gel, (2) determination of mass spectra of individual polypeptides with mass spectrometry (MS), (3) identification of polypeptide sequences with in silico software, using the electronic databases. The core step in proteomic analysis is MS, including typically in-gel digestion of the protein sample, ionization of the peptide components through MALDI (matrix-assisted laser desorption/ionization) or ESI (electrospray ionization), separation of the peptide ions in an electric field according to their m/z ratio, and detection-documentation of the mass spectra. There are two main MS procedures used in proteomics, MALDI-TOF (time-of-flight), for peptide mass fingerprinting, and ESI-QqQ (triple quadrupole), for peptide sequencing. There are 3 major applications of Proteomics: Protein expression profiling, Structural proteomics, and Functional proteomics. The classical realm of Proteomics, connected to the 2-DE and MS technologies, applies mostly to the protein expression profiling part. Both structural (the analysis of protein complexes and interactions) and functional aspects (the analysis of post-translational protein modifications with respect to different functional states) can also be studied with MS technology. However, in addition to MS, more specialized strategies have been developed and used for such studies.
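The in silico side of peptide mass fingerprinting can be sketched as follows: a candidate protein sequence is digested theoretically, the mass and m/z of each peptide are computed, and these values are compared with the observed peaks. Several assumptions are made here for illustration: trypsin is taken as the digestion enzyme (cleaving after K or R, but not before P), ions are singly protonated as is typical for MALDI, the residue masses are standard monoisotopic values rounded to five decimals, and the protein sequence, peak list and tolerance are invented. Missed cleavages and modifications are ignored.

```python
import re

# Standard monoisotopic residue masses in daltons (rounded; not from the text)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # mass of the charge carrier


def tryptic_digest(protein: str):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(mass: float, charge: int = 1) -> float:
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def match_peaks(observed_mz, protein: str, tolerance: float = 0.01):
    """Peptides whose singly protonated m/z lies within tolerance of a peak."""
    hits = []
    for pep in tryptic_digest(protein):
        theoretical = mz(peptide_mass(pep))
        if any(abs(theoretical - peak) <= tolerance for peak in observed_mz):
            hits.append(pep)
    return hits


protein = "MKWVTFRPLLKAG"           # hypothetical sequence
print(tryptic_digest(protein))
print(match_peaks([147.076], protein))  # hypothetical observed peak
```

In a database search the same comparison is repeated for every protein entry, and the entry whose theoretical peptides explain the largest share of the observed peaks is reported as the most likely identification.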
More relevant to the Functional Proteomics part is the combination of approaches used for the study of structure-function relationships. These include, in principle, the detailed structural analyses of a protein (at all levels, from the primary sequence to the tertiary and quaternary structure) at a rationally designed series of functional or non-functional conformational states, in order to deduce information on the molecular mechanism. Data utilized for designing structure-function experiments can be derived (a) from in vitro studies on the ligand counterparts of a protein, such as substrates, inhibitors, allosteric regulators, etc., (b) from the physicochemical properties of a given protein that are either known experimentally or deduced theoretically, and (c) from bioinformatic analysis of gene trees and sequence or domain consensus signals, in silico. Data produced from structure-function experiments may include X-ray or NMR structures, site-directed biophysical evidence, in situ experimental evidence with use of radioactive or fluorescent probes, results from in vivo functional assays, or even the construction of novel gene family dendrograms in silico. The converging evidence from all these experimental lines is used to derive models for the detailed working mechanism of a protein.
More relevant to the Structural Proteomics part is the combination of approaches used to delineate interactions of a protein with other proteins or non-protein molecules. Approaches of this type aim at defining the interactome, i.e. the whole range of protein interactions in a given proteome, and indicating novel components of protein complexes involved in such important functions as metabolism, transcription regulation, or signaling. The first such methodology designed for high-throughput data analysis is the yeast 2-hybrid system (1989). Yeast 2-hybrid (that exploits the domain-reconstitution properties of the yeast transcription factor GAL4) is only one of several, subsequently developed, 2-hybrid systems; all of these systems use the reconstitution of a split biochemical function (a protein split in two domains) guided by two interacting protein moieties, as an assay to measure protein interaction.

Biomedical Applications


Genetic polymorphism is responsible for phenotypic variation of a species and, on the geological time scale, presents the molecular substrate for evolution through speciation. Due to polymorphism, a species genome is actually a total of all individual genomes of the species; genome sequencing is meant not only to produce analytical physical maps of a species genome, but also to delineate the variations in the individual genome sequences. Mapping of the human genome reveals that the vast majority of sequence variation between individuals is due to Single Nucleotide Polymorphisms (SNPs). The catalog includes more than 6 million SNPs and the estimate is that one SNP occurs every 100 bp of the genome. Why is it important to catalog SNPs? Although many SNPs do not produce physical changes, other SNPs may affect susceptibility to disease and even influence a patient’s response to a drug regimen. Thus, SNPs can help (a) understand the molecular template of multifactorial disorders, and (b) develop a data-based, "personalized" approach to medicine. They also serve as genome markers to pinpoint the position of genes, or other sequences, on the human genome map.
How are SNP profiles (SNP genotypes) analyzed? In principle, the analysis of SNP genotypes can be based on (a) the DNA polymerase reaction (Sanger dideoxy sequencing, kinetic PCR, ARMS, primer extension assay), (b) the reaction of other DNA modifying enzymes (ligase, exonuclease, endonuclease, restriction endonuclease assays), (c) DNA-DNA hybridization (microarray sequencing technique). Detection of an SNP can be finally accomplished in (a) an electrophoresis gel, (b) a fluorescent plate-reader (real-time PCR), (c) a microarray reader, or even (d) a MALDI-mass spectrometer. For high-throughput SNP analysis, conventional Sanger sequencing is not the method of choice; other methods have been used, like DNA microarray sequencing or, lately, next-generation sequencing which makes use of nanotechnology.
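One of the enzyme-based assays listed above, the restriction endonuclease assay, lends itself to a short sketch: when a SNP falls inside a recognition site, one allele is cut and the other is not, so the two alleles give different fragment lengths on a gel. The example uses the EcoRI site (GAATTC, cut after the first G); the two allele sequences are invented for illustration and only one DNA strand is considered.

```python
def fragment_lengths(seq: str, site: str = "GAATTC", cut_offset: int = 1):
    """Fragment lengths after cutting a linear sequence at every `site`.

    cut_offset is the position of the cut within the site
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Hypothetical PCR products differing by one base inside the EcoRI site
allele_1 = "CCTAGGAATTCTTGACA"   # site intact
allele_2 = "CCTAGGAGTTCTTGACA"   # A>G SNP destroys the site

print(fragment_lengths(allele_1))  # two fragments: the allele is cut
print(fragment_lengths(allele_2))  # one fragment: the allele is not cut
```

A heterozygous individual would show all three fragment sizes in the electrophoresis gel, while each homozygote shows only the pattern of its own allele.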
Discovering genes and molecular mechanisms underlying human diseases is a major application of the technology and bioinformatics background developed from the Human Genome Project. Linkage disequilibrium analysis in conjunction with positional cloning is used to locate disease susceptibility chromosomal loci and identify candidate disease-associated genes within them; molecular diagnostic tools similar to the ones of SNP analysis are used to correlate aberrations of a candidate gene sequence with disease-specific lesions in patients, in population-based studies. New genes revealed from such studies are further examined in experimental cell culture and animal systems that model human disease phenotypes, in order to (a) understand the mechanism of disease development and progression at the molecular, cellular and biochemical level, (b) provide novel diagnostic and/or prognostic disease markers, and (c) investigate potentials for therapeutic strategies. The above line of experimentation is best applied to unigenic genetic disorders (attributed to aberration of one gene); many such genes have been discovered and are under study to improve diagnosis and therapy in this way. More intense and combined studies are needed, however, in the more common case of multifactorial, multigenic disorders (attributed to the concerted action of many genes), like cancer, asthma, cardiovascular diseases, inflammatory bowel diseases, etc. In the latter case, an important emerging tool for diagnosis of disease phenotypes and evaluation of candidate genes is DNA microarray analysis to identify characteristic global changes of gene expression profiles at the transcriptomic level; such studies can be complemented with mass spectrometric analysis of expression levels and protein microarray analysis of molecular interactions, at the proteomic level, to provide a more comprehensive approach to the mechanism of disease.
With respect to development of diagnostic and/or prognostic markers and design of therapeutic strategies, molecular analysis of a human disease provides the experimental background to focus on selected target molecules. The diagnostic and therapeutic potential of such target molecules can be tested in cell culture and/or animal models of the disease, where modern technologies like RNA-interference, gene targeting and transgenes offer the tools to functionally disrupt, knock out, mutate, or overexpress, accordingly, the genes of interest. Therapeutic intervention in the course of disease on the basis of such molecular targets can follow two major routes: (a) gene therapy strategies and (b) novel drug design and use. The application of gene therapy protocols currently suffers from problems of inadequate gene delivery-expression inside the targeted diseased cells and of safety risk due to the use of viral delivery vectors; application of nonviral DNA vectors is now being tested to improve efficacy and safety of gene therapy protocols. Drug design, optimization and development is yet another field of application of the post-genomic scientific knowledge, using integrated high-throughput cycles of structure-function analysis, chemical synthesis, probe-target interaction assays and biological testing. In addition, the realization that individual genotypes can strongly affect each patient’s positive, neutral or negative reaction to a certain drug has initiated application of high-throughput genotype (SNP) screening for development of more "personalized" drugs adapted optimally to the corresponding patient’s pharmacogenetic profile.
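The linkage disequilibrium analysis mentioned at the start of the paragraph above rests on a simple quantity: how much more often two alleles occur together on the same haplotype than expected from their individual frequencies. The sketch below computes the standard coefficient D and the squared correlation r² from haplotype counts at two biallelic sites. The counts and the function name are assumptions made for illustration, not data from any study.

```python
def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    D   = p(AB) - p(A) * p(B)
    r^2 = D^2 / (p(A) * p(a) * p(B) * p(b))
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes counted")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denominator


# Hypothetical haplotype counts: marker allele A and disease allele B
d, r2 = linkage_disequilibrium(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
print(round(d, 3), round(r2, 3))
```

An r² close to 1 means the marker almost perfectly predicts the allele at the second site, which is what allows a genotyped SNP to flag a nearby, untyped disease-associated variant; an r² near 0 means the two sites are inherited independently.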