Yeast Literature Corpus - UCD Digital Library

Yeast Literature Corpus

Abstract: We provide here a new text corpus, mined from biomedical literature, which refers to the terms used to describe Saccharomyces cerevisiae ORFs.

In collection Insight Centre for Data Analytics

Label: Yeast Literature Corpus
Extent: 1 zip-compressed file archive, 12785134 bytes
Date created:
Type of Resource
software, multimedia
Physical description
1 zip-compressed file archive
12785134 bytes — Digital origin: born digital (application/zip)
MD5 checksum: 5b13bebb8ab23efcaa606d697507ddea
Downloaded from the Insight Centre web site, 2014-08-28
This corpus is made available for non-commercial and research purposes only. If you make use of this dataset please reference the publication: Greene, D., Bryan, K. and Cunningham, P. (2008), "Parallel Integration of Heterogeneous Genome-Wide Data Sources", Proc. 8th International Conference on BioInformatics and BioEngineering (BIBE 2008).
Dataset construction
We retrieved a set of 38,661 yeast-related MEDLINE abstracts from PubMed, corresponding to the references enumerated in the SGD literature curation database (as downloaded in May 2008). Since the database provides links between references and genes, we can form a "meta-document" for each gene consisting of the concatenation of all abstracts annotated as pertaining to that gene. From these meta-documents we constructed a bag-of-words model, represented in the form of a term-gene matrix. To pre-process the data we removed dubious ORFs and applied standard stop-word removal and stemming techniques to the abstracts. We subsequently removed terms occurring in fewer than three documents. Our final dataset consists of 6,013 ORFs described by 62,859 unique terms.
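The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_term_gene_matrix`, the tiny stop-word list, and the regular-expression tokenizer are all hypothetical, stemming is omitted, and "documents" in the frequency filter is assumed to mean per-gene meta-documents.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the record does not name the list that was used.
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}


def build_term_gene_matrix(abstracts, gene_refs, min_docs=3):
    """Build a sparse term-gene count matrix from per-gene meta-documents.

    abstracts: dict mapping a reference id to its abstract text.
    gene_refs: dict mapping a gene/ORF name to a list of reference ids.
    Returns (terms, genes, matrix), where matrix[(row, col)] is the count of
    terms[row] in the meta-document of genes[col].
    """
    genes = sorted(gene_refs)

    # One meta-document per gene: concatenate its abstracts, then tokenize.
    # Stemming is omitted here to keep the sketch dependency-free.
    meta_docs = []
    for gene in genes:
        text = " ".join(abstracts[r] for r in gene_refs[gene] if r in abstracts)
        tokens = re.findall(r"[a-z]+", text.lower())
        meta_docs.append(Counter(t for t in tokens if t not in STOP_WORDS))

    # Document frequency: number of meta-documents that contain each term
    # (assumption: the record does not say which kind of document is counted).
    doc_freq = Counter()
    for counts in meta_docs:
        doc_freq.update(counts.keys())

    # Keep only terms that occur in at least min_docs meta-documents.
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    row_of = {t: i for i, t in enumerate(terms)}

    # Sparse matrix as a dict keyed by (term row, gene column).
    matrix = {}
    for col, counts in enumerate(meta_docs):
        for term, n in counts.items():
            if term in row_of:
                matrix[(row_of[term], col)] = n
    return terms, genes, matrix
```

With the default `min_docs=3`, a term is dropped when it appears in fewer than three meta-documents, mirroring the filter described above. Rows index terms and columns index genes, matching the term-gene orientation of the distributed matrix.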
Data format
The corpus is provided in pre-processed matrix format. The files contained in the archive given above have the following formats:
yeast.mtx: Term frequencies stored in a sparse term-gene matrix in Matrix Market format.
yeast.terms: Complete list of content-bearing terms in the corpus, with each line corresponding to a row of the term-gene matrix.
A list of ORFs, with each line corresponding to a column of the term-gene matrix.
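As a rough guide to working with these files, the following Python sketch reads a coordinate-format Matrix Market file and looks up the highest-frequency terms for one ORF column. It assumes the matrix uses the sparse coordinate layout with numeric values; the helper names `read_matrix_market`, `read_lines`, and `top_terms_for_orf` are hypothetical, and the ORF list file name is left to the caller because this record does not state it.

```python
def read_matrix_market(path):
    """Parse a coordinate-format Matrix Market file.

    Returns (n_rows, n_cols, entries), where entries maps a zero-based
    (row, col) pair to its numeric value.
    """
    with open(path, encoding="utf-8") as fh:
        header = fh.readline()
        if not header.startswith("%%MatrixMarket matrix coordinate"):
            raise ValueError("expected a coordinate-format Matrix Market file")
        # Drop comment lines (starting with '%') and blank lines.
        data_lines = [ln for ln in fh if ln.strip() and not ln.startswith("%")]
    if not data_lines:
        raise ValueError("missing size line")

    # First data line: number of rows, columns, and stored entries.
    n_rows, n_cols, _n_entries = (int(x) for x in data_lines[0].split())

    entries = {}
    for ln in data_lines[1:]:
        i, j, value = ln.split()
        # Matrix Market indices are one-based; convert to zero-based.
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    return n_rows, n_cols, entries


def read_lines(path):
    """Read a label file (one term or ORF per line) into a list."""
    with open(path, encoding="utf-8") as fh:
        return [ln.rstrip("\n") for ln in fh]


def top_terms_for_orf(entries, terms, col, k=10):
    """Return the k (term, frequency) pairs with the highest frequency in column col."""
    column = [(terms[i], v) for (i, j), v in entries.items() if j == col]
    return sorted(column, key=lambda tv: -tv[1])[:k]
```

A typical call would be `read_matrix_market("yeast.mtx")` followed by `read_lines("yeast.terms")`, with the ORF list loaded the same way from its own file. If the archive matches the description above, the matrix should report 62,859 rows and 6,013 columns, which makes a convenient sanity check after loading.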
All rights, including copyright, in the content of the original abstracts are owned by the original authors.
Dataset (dct); Computer dataset (rda)

Referenced by
Greene, Derek; Bryan, Kenneth; Cunningham, Pádraig, 1962-. Parallel integration of heterogeneous genome-wide data sources. 8th IEEE International Conference on BioInformatics and BioEngineering, 2008 (BIBE 2008), pp. 1-7.
Record source
Prepared by staff of UCD Library, University College Dublin

Rights & Usage Conditions