1 Trillion words coming from Google

Google will soon be releasing a dataset of over 1 trillion words, a move they suggest might help advance various text-processing arts (e.g., machine translation, speech recognition, spelling correction, and entity detection). Here’s a blurb from their research blog:

“We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.”

Watch for an announcement at the LDC, who will be distributing it soon, and then order your set of 6 DVDs.

LDC is the Linguistic Data Consortium at the University of Pennsylvania.
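
For readers wondering what "counts for all five-word sequences that appear at least 40 times" looks like in practice, here is a minimal Python sketch of thresholded n-gram counting. It is only an illustration, not Google's actual pipeline: the function names `ngram_counts` and `frequent_words` are made up for this post, whitespace splitting stands in for whatever tokenization Google used, and the toy thresholds are far smaller than the real cutoffs of 40 and 200.

```python
from collections import Counter


def ngram_counts(tokens, n=5, min_count=40):
    """Count every run of n consecutive tokens; keep those seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}


def frequent_words(tokens, min_count=200):
    """Return the set of words that appear at least min_count times."""
    counts = Counter(tokens)
    return {word for word, c in counts.items() if c >= min_count}


# Toy corpus (an assumption for illustration): 18 tokens, split on whitespace.
tokens = ("the cat sat on the mat " * 3).split()

# Two-word sequences seen at least 3 times, and words seen at least 6 times.
frequent_bigrams = ngram_counts(tokens, n=2, min_count=3)
vocabulary = frequent_words(tokens, min_count=6)
```

On this toy corpus, the pair ("mat", "the") occurs only twice and is dropped, while "the" is the only word frequent enough to survive the vocabulary cutoff. At the scale of a trillion words the same idea would need a distributed, out-of-memory approach rather than a single in-memory `Counter`.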
