by Matt Revelle
A few months back I started working on a news classifier for Highput. I had originally planned to use Clojure to write Hadoop jobs, but found that most of the input data was available as a stream and the allure of continuously producing results was too much. RSS feeds, twitter streams, and collaborative ranking (e.g, HN, reddit) are continuous, the classifier should be as well. The data retrieval, extraction, transformation, and loading (ETL) was done using a distributed workflow built from three simple components...
January, 3 2010 • 0 Comments • 1 Faves