1 zip-compressed file archive
Digital origin: born digital
MD5 checksum: 5b13bebb8ab23efcaa606d697507ddea
Downloaded from the Insight Centre web site at http://mlg.ucd.ie/dynamic/index.html, 2014-08-28
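The MD5 checksum recorded above can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `archive_is_intact` are hypothetical, and the path passed in is whatever local file name the archive was saved under (this record does not state one).

```python
import hashlib

# Checksum value copied from this record.
EXPECTED_MD5 = "5b13bebb8ab23efcaa606d697507ddea"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path):
    """True if the file's MD5 digest matches the value in this record."""
    return md5_of_file(path) == EXPECTED_MD5
```

A mismatch suggests the download is incomplete or is a different version of the archive from the one described here.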
This corpus is made available for non-commercial and research purposes only. If you make use of this dataset please reference the publication: Greene, D., Bryan, K. and Cunningham, P. (2008), "Parallel Integration of Heterogeneous Genome-Wide Data Sources", Proc. 8th International Conference on BioInformatics and BioEngineering (BIBE 2008).
We retrieved a set of 38,661 yeast-related MEDLINE abstracts from PubMed, corresponding to the references enumerated in the SGD literature curation database (as downloaded in May 2008). Since the database provides links between references and genes, we can form a "meta-document" for each gene consisting of the concatenation of all abstracts annotated as pertaining to that gene. From this we constructed a bag-of-words model, represented in the form of a term-gene matrix. To pre-process the data we removed dubious ORFs, and applied standard stop-word removal and stemming techniques to the abstracts. We subsequently removed terms occurring in fewer than three documents. Our final dataset consists of 6,013 ORFs described by 62,859 unique terms.
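The construction described above can be illustrated with a small sketch. It is not the authors' code: the function name `build_term_gene_counts`, the whitespace tokenisation, and the input shapes are assumptions made for illustration. Stemming and the removal of dubious ORFs are omitted, and "documents" in the frequency filter is interpreted here as gene meta-documents, which the description does not state explicitly.

```python
from collections import Counter, defaultdict


def build_term_gene_counts(abstracts, gene_links, stop_words=frozenset(), min_docs=3):
    """Build a sparse term-gene count matrix from gene meta-documents.

    abstracts:  dict mapping a reference id to its abstract text (assumed shape)
    gene_links: iterable of (reference id, gene name) pairs (assumed shape)
    Returns (terms, genes, entries), where entries is a list of
    (term_index, gene_index, frequency) triples.
    """
    # Concatenate the tokens of every abstract linked to a gene into one meta-document.
    meta_docs = defaultdict(list)
    for ref_id, gene in gene_links:
        tokens = [t for t in abstracts[ref_id].lower().split() if t not in stop_words]
        meta_docs[gene].extend(tokens)

    # Term frequencies within each gene's meta-document.
    counts = {gene: Counter(tokens) for gene, tokens in meta_docs.items()}

    # Document frequency: the number of meta-documents that contain each term.
    doc_freq = Counter()
    for gene_counts in counts.values():
        doc_freq.update(gene_counts.keys())

    # Keep only terms that occur in at least `min_docs` meta-documents.
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    genes = sorted(counts)
    entries = [
        (i, j, counts[gene][term])
        for i, term in enumerate(terms)
        for j, gene in enumerate(genes)
        if counts[gene][term] > 0
    ]
    return terms, genes, entries
```

Rows of the result correspond to terms and columns to genes, matching the orientation of the term-gene matrix described in this record.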
The corpus is provided in pre-processed matrix format. The files contained in the archive given above have the following formats: yeast.mtx: Term frequencies stored in a sparse term-gene matrix in Matrix Market format; yeast.terms: Complete list of content-bearing terms in the corpus, with each line corresponding to a row of the term-gene matrix; yeast.docs: List of ORFs, with each line corresponding to a column of the term-gene matrix.
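As an illustration of how the three files fit together, the sketch below reads them with the Python standard library only. The function names are hypothetical, and the reader assumes the common Matrix Market "coordinate" layout with real or integer values in "general" form (a `%%MatrixMarket` header, optional `%` comment lines, a size line, then 1-based `row column value` entries); whether yeast.mtx uses exactly this variant is an assumption. In practice a library routine such as `scipy.io.mmread` would normally be used instead.

```python
import os


def read_lines(path):
    """Return the lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\r\n") for line in fh]


def read_matrix_market(path):
    """Read a coordinate-format Matrix Market file.

    Returns (n_rows, n_cols, entries), where entries maps 0-based
    (row, column) pairs to float values.
    """
    with open(path, encoding="utf-8") as fh:
        header = fh.readline()
        if not header.startswith("%%MatrixMarket matrix coordinate"):
            raise ValueError("expected a Matrix Market coordinate file")
        # Skip comment and blank lines; the first remaining line gives the sizes.
        rows = (ln.split() for ln in fh if ln.strip() and not ln.startswith("%"))
        n_rows, n_cols, _nnz = map(int, next(rows))
        # Matrix Market indices are 1-based; convert them to 0-based.
        entries = {(int(i) - 1, int(j) - 1): float(v) for i, j, v in rows}
    return n_rows, n_cols, entries


def load_corpus(directory):
    """Load yeast.terms, yeast.docs and yeast.mtx from an extracted archive."""
    terms = read_lines(os.path.join(directory, "yeast.terms"))
    genes = read_lines(os.path.join(directory, "yeast.docs"))
    n_rows, n_cols, entries = read_matrix_market(os.path.join(directory, "yeast.mtx"))
    # Rows correspond to terms and columns to ORFs, as described in this record.
    if (n_rows, n_cols) != (len(terms), len(genes)):
        raise ValueError("matrix dimensions do not match the term and ORF lists")
    return terms, genes, entries
```

With the archive extracted to a directory, `terms, genes, entries = load_corpus(directory)` gives the frequency of `terms[i]` in the meta-document for `genes[j]` as `entries.get((i, j), 0.0)`.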
All rights, including copyright, in the content of the original abstracts are owned by the original authors.
Greene, Derek; Bryan, Kenneth; Cunningham, Pádraig, 1962-
8th IEEE International Conference on BioInformatics and BioEngineering, 2008. BIBE 2008
Parallel integration of heterogeneous genome-wide data sources, pp. 1-7. https://doi.org/10.1109/BIBE.2008.4696710
Prepared by staff of UCD Library, University College Dublin