One of the main principles of the unique "Phystech system" that underlies education at MIPT is the careful selection of gifted young people who are inclined toward creative work. Phystech applicants are the most talented and best-educated school graduates from all over Russia and from dozens of other countries.

Student life at MIPT is busy and varied. Students actively combine their studies with sports and with taking part in, and organizing, cultural and social events. The institute's administration supports student initiative in every way and looks after students' well-being: for example, work on expanding the campus and improving students' living conditions is ongoing.


Advanced Bioinformatics I

Vsevolod Yu. Makeev


Dr. Sci. (Phys. & Math.)


Vavilov Institute of General Genetics


Laboratory of System Biology and Computational Genetics

Information about course:


Recommended texts:

  1. R.Durbin, S.Eddy, A.Krogh, G.Mitchison. Biological sequence analysis. Cambridge University Press, 1998 (2006 reprinting).
  2. M.Borodovsky, S.Ekisheva. Problems and solutions in biological sequence analysis, Cambridge University Press, 2006.
  3. D.Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
  4. A.Apostolico, C.Guerra, S.Istrail, P.Pevzner, M.Waterman. RECOMB06, Proceedings of the Tenth Conference on Research in Computational Molecular Biology, Springer LNCS Vol. 3909, Berlin, 2006.
  5. P.Baldi and S.Brunak. Bioinformatics: the Machine Learning Approach. MIT Press, 2001.

Recommended web site (to start with): bioinformatics.  


Bioinformatics is the field of science growing from the application of mathematics, statistics and information technology to the study and analysis of very large biological and particularly genetic data sets.


This course is devoted to the study of mathematical models and computer algorithms used in DNA and protein sequence analysis. Students will implement simplified versions of the algorithms studied in the course as computer programs. In addition, students will gain experience with sequence analysis tools available either locally or via the Internet.


The lab time may also be used for tests, student presentations and additional lectures.


Closed-book, closed-notes "surprise" quizzes (10 min) may take place on Mondays or Fridays.


Homework: Small-group efforts are encouraged, but you are responsible for writing or typing up, and understanding, your own solution.


Course outline:


Introduction. The natural science paradigm. Molecular biology as a study of the information processing in the cell. Bioinformatics as an interdisciplinary field. Genes and Genetic Code. Sequencing genomes. Major public data resources: US: Entrez (NCBI); European Union: EMBL-EBI (EBI) and SwissProt (SIB); Japan: DDBJ (NIG).


Models and algorithms of sequence analysis.


Developing sequence analysis tools. Strings, graphs and algorithms. Deterministic (string based) models and algorithms for string matching.


Probabilistic models of DNA sequences. Multinomial models. Models for protein-coding and non-coding regions. Bayesian inference. Estimation of model parameters. Supervised and unsupervised classification and clustering of DNA sequences. Machine learning approach.


Probabilistic models of sequences conserved in evolution. Gibbs sampling algorithm for multiple sequence alignment. Algorithms for prediction of functional sites in DNA sequences (ribosome binding sites, promoters, splice sites).


Markov models. Homogeneous and inhomogeneous Markov models. Hidden Markov models. Pattern recognition with (hidden) Markov models. Example: identification of CpG islands in genomic DNA.


Algorithms for gene identification in genomic DNA. Three-periodic Markov models for protein-coding regions. Hidden Markov models for prokaryotic and eukaryotic genes. Viterbi algorithm, Forward and Backward algorithms, posterior decoding algorithm.
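
As an illustration of the Viterbi algorithm named above, here is a minimal sketch in Python. It assumes a toy two-state HMM (a GC-rich "island" state `I` and a background state `B`). All probabilities are hypothetical values chosen for the example, not parameters from the course, and realistic gene or CpG-island models use many more states.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log-probability."""
    # v[i][s]: best log-probability of any state path ending in s at position i
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for x in obs[1:]:
        prev = v[-1]
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            col[s] = prev[best] + log_trans[best][s] + log_emit[s][x]
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[-1][last]


def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(p) for k, p in probs.items()}


# Hypothetical toy parameters (assumptions for illustration only)
states = ["I", "B"]
log_start = logs({"I": 0.5, "B": 0.5})
log_trans = {"I": logs({"I": 0.9, "B": 0.1}),
             "B": logs({"I": 0.1, "B": 0.9})}
log_emit = {"I": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
            "B": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})}

path, score = viterbi("ATATCGCGCGATAT", states, log_start, log_trans, log_emit)
print("".join(path))  # BBBBIIIIIIBBBB
```

Working in log space avoids numerical underflow on long sequences, which matters once the same recursion is applied to genome-scale input.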


Next generation sequencing. Sequence assembly. External information: physical mapping. Shortest common superstring. Difficulties of assembly in eukaryotic genomes with transposable elements and repeats.


Analysis of sequence pairs. 


Pairwise alignment of biomolecular sequences. Search for similarities. Global alignment of two sequences. Needleman-Wunsch algorithm. Local alignment. Smith-Waterman algorithm.
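
The global alignment recursion can be sketched as follows. This is a simplified version of the Needleman-Wunsch algorithm with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) are arbitrary defaults assumed for the example, whereas real applications would use a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # F[i][j]: best score of aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Traceback from the bottom-right corner to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```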


Search for similarities in biomolecular sequences. The need for a scoring system. Dot-matrix method. Statistical distributions of similar words. Common words in two random sequences. Distribution of the maximum length of a common word in two random sequences.


Derivation of scoring functions for amino acid substitutions observed in pairwise alignments. The notion of relative entropy. Dayhoff's series of scoring matrices (PAM matrices). Dayhoff's approach. Estimation of parameters of mutation matrices using alignments of closely related sequences. Derivation of scoring functions for amino acid substitutions observed in the BLOCKS database. The BLOSUM series of matrices.
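
For orientation, the standard log-odds form of a substitution score and the associated relative entropy can be written as below. The notation is an assumption introduced here, not taken from the course materials: $q_{ab}$ is the frequency of the aligned pair $(a, b)$ in alignments of related sequences, $p_a$ is the background frequency of residue $a$, and $\lambda$ is a scaling constant.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b},
\qquad
H = \sum_{a,b} q_{ab}\,\log\frac{q_{ab}}{p_a\,p_b}
```

Here $H$ is the relative entropy of the pair distribution $q_{ab}$ with respect to the independent background model $p_a p_b$. It equals the expected score per aligned position when $\lambda = 1$.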


Markov models of DNA sequence evolution. Jukes-Cantor and Kimura models. Mutation rate matrices and transition probability matrices. Amino acid classification. Markov model of protein sequence evolution.
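
A small sketch of the Jukes-Cantor model may help connect rate matrices with transition probabilities. It assumes the usual parametrization, in which all substitutions occur at the same rate and the branch length `d` is measured in expected substitutions per site. The function names are illustrative.

```python
import math


def jc_transition_probability(same, d):
    """Probability of observing the same (or one specific different) base
    after a branch of length d substitutions per site under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jukes_cantor_distance(a, b):
    """Estimate d from the fraction p of mismatched sites in two aligned,
    gap-free sequences of equal length: d = -3/4 * ln(1 - 4p/3)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        raise ValueError("mismatch fraction too high: distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(round(jukes_cantor_distance("ACGTACGT", "ACGTACGA"), 4))  # 0.1367
```

The correction matters because the observed mismatch fraction underestimates the true number of substitutions: multiple hits at the same site are counted at most once.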


Lab schedule:


Lab 1. Introduction. Biological data resources on the Internet. Major data formats. Downloading sequence data from GenBank and SwissProt. Basic operations with sequences.


Lab 2. Generation of random sequences. Maximum likelihood estimation of the parameters of a probabilistic model of a DNA sequence. Computing the posterior probability of a model (for the case of competing models).
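
One possible starting point for this lab is sketched below, assuming the simplest case of an independent (multinomial) model over the four nucleotides. The model names and probabilities in the usage part are hypothetical, and the lab itself may call for a different design.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"


def random_sequence(length, probs, rng):
    """Draw an i.i.d. sequence from a multinomial model over ACGT."""
    weights = [probs[c] for c in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=length))


def mle_frequencies(seq):
    """Maximum likelihood estimate: relative frequency of each nucleotide."""
    counts = Counter(seq)
    return {c: counts[c] / len(seq) for c in ALPHABET}


def log_likelihood(seq, probs):
    """Log-probability of seq under an independent-sites model."""
    return sum(math.log(probs[c]) for c in seq)


def posterior(seq, models, priors):
    """Posterior probability of each competing model (Bayes' rule, log space)."""
    scores = {name: math.log(priors[name]) + log_likelihood(seq, probs)
              for name, probs in models.items()}
    top = max(scores.values())  # subtract the maximum to avoid underflow
    weights = {name: math.exp(v - top) for name, v in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical competing models (assumptions for illustration only)
models = {
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
priors = {"gc_rich": 0.5, "at_rich": 0.5}

rng = random.Random(0)  # fixed seed so that runs are repeatable
seq = random_sequence(200, models["gc_rich"], rng)
print(mle_frequencies(seq))
print(posterior(seq, models, priors))
```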


Lab 3. Study of local variation in GC composition. Implementation of a DNA sequence segmentation algorithm.
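
A first look at local GC variation can be obtained with a sliding window, as in the sketch below. The window and step sizes are arbitrary assumptions. This covers only the exploratory part of the lab and is not the segmentation algorithm itself, which the outline does not specify.

```python
def gc_windows(seq, window=100, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        result.append((start, gc))
    return result


print(gc_windows("GGCCAATT", window=4, step=2))
# [(0, 1.0), (2, 0.5), (4, 0.0)]
```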


Labs 4–5. Implementation of a Bayesian algorithm for recognizing protein-coding sequences.


Lab 6. Implementation of the dot-plot algorithm.
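
A minimal word-based dot-plot might look like the following sketch. It marks every pair of positions at which the two sequences share an identical word of length `word`. The default word length and the text rendering are assumptions made for the example.

```python
def dot_plot(a, b, word=3):
    """Return the set of (i, j) such that a[i:i+word] == b[j:j+word]."""
    # Index every word of b by its start positions
    index = {}
    for j in range(len(b) - word + 1):
        index.setdefault(b[j:j + word], []).append(j)
    return {(i, j)
            for i in range(len(a) - word + 1)
            for j in index.get(a[i:i + word], [])}


def render(a, b, dots):
    """Draw the plot as text: rows follow a, columns follow b."""
    rows = ["  " + b]
    for i, ch in enumerate(a):
        line = "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        rows.append(ch + " " + line)
    return "\n".join(rows)


dots = dot_plot("ACGTAC", "ACGTAC")
print(render("ACGTAC", "ACGTAC", dots))
```

Indexing the words of one sequence first keeps the running time close to linear in the sequence lengths plus the number of dots, rather than comparing every pair of positions directly.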


Lab 7. Implementation of the Needleman-Wunsch algorithm for global pairwise DNA sequence alignment and the Smith-Waterman algorithm for local pairwise DNA sequence alignment.
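
For the local alignment part, a simplified Smith-Waterman sketch is given below. Compared with global alignment, cell scores are floored at zero and the traceback starts from the highest-scoring cell and stops at the first zero. The scoring values are assumptions chosen for the example, with a linear gap penalty.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a zero-scoring cell is reached
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```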


List of selected publications:

  1. I.Kulakovskiy, V.Levitsky, D.Oshchepkov, L.Bryzgalov, I.Vorontsov, V.Makeev. From binding motifs in ChIP-Seq data to improved models of transcription factor binding sites. J.Bioinform.Comput.Biol. 2013,Feb.; 11(1): 1340004.
  2. V.Makeev. Predictive biology using systems and integrative analysis and methods. J Biomol Struct Dyn. 2013; 31(1):1–3.
  3. E.Permina, Y.Medvedeva, P.Baeck, S.Hegde, S.Mande, V.Makeev. Identification of self-consistent modulons from bacterial microarray expression data with the help of structured regulon gene sets. J.Biomol.Struct.Dyn. 2013; 31(1):115–24.
  4. I.Kulakovskiy, Y.Medvedeva, U.Schaefer, A.Kasianov, I.Vorontsov, V.Bajic, V.Makeev. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013 Jan 1; 41(D1): D195–D202. Epub 2012 Nov 21. PubMed PMID: 23175603.
  5. A.Nikulova, A.Favorov, R.Sutormin, V.Makeev, A.Mironov. CORECLUST: identification of the conserved CRM grammar together with prediction of gene regulation. Nucleic.Acids.Res. 2012,Mar.15.
  6. A.A.Nikulova, M.S.Polishchuk, V.G.Tumanyan, V.Yu.Makeev, A.A.Mironov, A.V.Favorov. Correlation between binding site clusters and experimental data on protein-DNA binding suggests the structures of regulatory modules. Biofizika. 2012; 57: 212–214.
  7. Y.Hara, N.Kadotani, H.Izui, J.I.Katashkina, T.Kuvaeva, I.Andreeva, L.Golubeva, D.Malko, V.Makeev, S.Mashko, Y.Kozlov. The complete genome sequence of Pantoea ananatis AJ13355, an organism with great biotechnological potential. Appl.Microbiol.Biotechnol. 2012, Jan.; 93(1): 331–41.
  8. S.Hegde, E.Yu.Klimova, S.Mande, Yu.A.Medvedeva, V.Yu.Makeev, E.A.Permina. Using pairs of genes from the same operon to determine the significance threshold for the correlation coefficient of gene expression levels. Biofizika. 2011; 56(6): 1062–1064.
  9. I.Kulakovskiy, A.Belostotsky, A.Kasianov, N.Esipova, Y.Medvedeva, I.Eliseeva, V.Makeev. A Deeper Look Into Transcription Regulatory Code By Preferred Pair Distance Templates For Transcription Factor Binding Sites. Bioinformatics. 2011, 27: 2621–2624.
  10. M.Logacheva, A.Kasianov, D.Vinogradov, T.Samigullin, M.Gelfand, V.Makeev, A.Penin. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum). BMC Genomics. 2011,Jan.,13; 12(1): 30.
  11. I.Kulakovskiy, V.Boeva, A.Favorov, V.Makeev. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010, Oct.15; 26(20): 2622–3.
  12. Y.Medvedeva, M.Fridman, N.Oparina, D.Malko, E.Ermakova, I.Kulakovskiy, A.Heinzel, V.Makeev. Intergenic, gene terminal, and intragenic CpG islands in the human genome. BMC Genomics. 2010,Jan.,19; 11: 48.

© 2001-2016 Moscow Institute of Physics and Technology
(State University)
