MSc in Biosciences from ENS Lyon, France. PhD in epigenetics and cancer from the Cancer Research Centre of Lyon, France. Since then, working on RNA-seq and ChIP-seq data at the Roslin Institute, Edinburgh, UK.
Brief description of research project
The amount of publicly available sequencing data in transcriptomics and epigenomics is rapidly increasing, outpacing our ability to analyse it thoroughly and extract value from it. In this program we have sought to produce tools and analyses that narrow this growing gap between data and knowledge.
We analysed four datasets of transcription factor (TF) chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) in human and mouse, totalling more than a thousand experiments (in press). Notably, we classified TFs according to the location of their binding sites relative to gene transcription start sites (TSS), analysed TF ‘hotspots’ (regions with many binding events), identified TFs with similar binding sites using three distinct methods, and integrated the TF ChIP-seq data with the positions of known variants involved in human diseases.
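As an illustration of what classifying a TF by binding-site location can look like, here is a minimal Python sketch. It is not the method used in the study: the 1,000 bp window, the 0.5 threshold, the two class labels and all function names are assumptions chosen for the example. Peaks are reduced to a single position (for instance the summit), and peaks on chromosomes without any annotated TSS are simply skipped.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, sorted_tss):
    """Absolute distance (bp) from a peak position to the closest TSS.

    `sorted_tss` must be a non-empty, ascending list of TSS coordinates
    for the chromosome the peak lies on.
    """
    i = bisect_left(sorted_tss, position)
    # Only the TSS just before and just after the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in candidates)


def promoter_fraction(peaks, tss_by_chrom, window=1000):
    """Fraction of peaks lying within `window` bp of a TSS.

    `peaks` is a list of (chromosome, position) tuples.
    `window` is a hypothetical cut-off, not a value taken from the study.
    """
    near = 0
    total = 0
    for chrom, position in peaks:
        tss = tss_by_chrom.get(chrom)
        if not tss:
            continue  # no annotated TSS on this chromosome: skip the peak
        total += 1
        if distance_to_nearest_tss(position, tss) <= window:
            near += 1
    return near / total if total else 0.0


def classify_tf(peaks, tss_by_chrom, window=1000, threshold=0.5):
    """Label a TF by where most of its peaks fall (assumed two-class scheme)."""
    fraction = promoter_fraction(peaks, tss_by_chrom, window)
    return "TSS-proximal" if fraction >= threshold else "TSS-distal"


# Toy data: two TSS on chr1 and the peaks of two hypothetical TFs.
tss_by_chrom = {"chr1": [1000, 50000]}
peaks_tf_a = [("chr1", 1200), ("chr1", 900), ("chr1", 49500), ("chr1", 20000)]
peaks_tf_b = [("chr1", 20000), ("chr1", 30000)]

label_a = classify_tf(peaks_tf_a, tss_by_chrom)
label_b = classify_tf(peaks_tf_b, tss_by_chrom)
```

With these toy inputs, three of the four peaks of the first TF fall within the window, so it is labelled TSS-proximal, while the second TF is labelled TSS-distal. A real analysis would also need to account for strand, alternative TSS and peak width.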
We integrated the analysed datasets with other public TF ChIP-seq, RNA-seq and CAGE datasets in human, mouse and fly (for now) in a web application we developed in this program, Heat*seq¹. Heat*seq allows easy browsing of the datasets and exploration of the relationships between experiments, and lets researchers compare their own experiments with all experiments in a dataset in minutes.
We are finalizing an analysis of Roadmap Epigenomics data, focusing on DNA methylation (WGBS), accessibility (DNase-seq), histone post-translational modifications (ChIP-seq) and expression (RNA-seq) data, to further unravel the links between epigenetic marks and human transcriptome diversity (30 epigenetic marks in 33 different cell types / tissues). Different associations are investigated:
- Between epigenetic marks at TSS and gene expression level.
- Between epigenetic marks at gene transcription termination sites (TES) and gene expression level.
- Between epigenetic marks at middle exons (neither first nor last) and exon expression level.
- Between epigenetic marks at middle exons and exon inclusion ratios.
Crucially, the analyses are performed for all Gencode-annotated genes together and for each gene type (protein-coding genes, lncRNA, miRNA, rRNA, tRNA, pseudogenes, etc.) individually, to uncover gene-type-specific relationships. Analyses are performed both by comparing all genes within each cell type, one cell type at a time, and by comparing each gene or exon across all 33 cell types. As these analyses generate thousands of visualisations due to combinatorial explosion, we are developing a web application to allow easy exploration of the results (available soon), and we are writing an academic paper summarising the results.
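The two directions of comparison described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the project: it assumes that one epigenetic mark and expression have each been summarised as one value per gene and cell type, and it uses Pearson correlation as a stand-in for whatever association measure is actually applied. The function names and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


def correlation_per_cell_type(mark, expression):
    """One correlation per cell type, computed across all genes.

    `mark` and `expression` map gene -> list of values, one per cell type,
    with cell types in the same order in both dictionaries.
    """
    genes = sorted(mark)
    n_cell_types = len(mark[genes[0]])
    return [
        pearson([mark[g][c] for g in genes], [expression[g][c] for g in genes])
        for c in range(n_cell_types)
    ]


def correlation_per_gene(mark, expression):
    """One correlation per gene, computed across all cell types."""
    return {gene: pearson(mark[gene], expression[gene]) for gene in mark}


# Toy data: 3 hypothetical genes measured in 3 hypothetical cell types.
mark = {"geneA": [1, 2, 3], "geneB": [3, 2, 1], "geneC": [2, 4, 6]}
expression = {"geneA": [10, 20, 30], "geneB": [1, 2, 3], "geneC": [5, 5, 5]}

by_cell_type = correlation_per_cell_type(mark, expression)
by_gene = correlation_per_gene(mark, expression)
```

The same two functions could be run on a subset of genes to obtain the per-gene-type results. Genes with constant values across cell types (geneC in the toy data) yield an undefined correlation and would need to be filtered or reported separately.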
During the fellowship, I also started working on single-cell RNA-seq data: I developed an R package for feature selection in single-cell RNA-seq analyses (submitted), and we generated single-cell data in mammalian ES cells.
A side project was also carried out: a study investigating the use of co-expression networks to provide experimental evidence for functional annotation transfer between homologous genes in different species.
¹ Heat*seq: a web application to compare high-throughput sequencing (HTS) experiments with public datasets.