Posts

by Matt Revelle 

Distributed processing and Unix philosophy

A few months back I started working on a news classifier for Highput. I had originally planned to use Clojure to write Hadoop jobs, but most of the input data was available as a stream, and the allure of continuously producing results was too much to resist. RSS feeds, Twitter streams, and collaborative ranking sites (e.g., HN, reddit) are continuous, so the classifier should be as well. The data retrieval, extraction, transformation, and loading (ETL) were done using a distributed workflow built from three simple components: Python, Beanstalk, and Tokyo Tyrant. The producer/consumer processes are Python programs; Beanstalk is a message queue that divvies up jobs among consumers; and Tokyo Tyrant stores intermediate results.

Python is a great programming language for getting things done: it has loads of libraries and a pseudocode-like syntax, and it keeps simple things simple. Aside from the throughput gained by distributing work across multiple machines, one of the advantages of using a workflow for data processing is the reduction in complexity that comes from dividing a larger problem. Each processing step in the workflow was implemented as a separate Python program. Each instance of a program requests a job from Beanstalk, does its work, then throws the result into a different job queue.

Any of the processes can be killed at any time without losing the task. That's because Beanstalk keeps track of a job even after a worker has reserved it: if the worker fails to report back, the job is returned to the ready state and made available to other workers. Beanstalk also supports persistent queues, which protects against losing jobs if the beanstalkd process dies.

Finally, there is Tokyo Tyrant, the network interface to Tokyo Cabinet. Tokyo Cabinet received well-deserved attention in 2009. It's the best little database, supporting multiple modes that are useful in various circumstances: simple key/value pairs, fixed-size arrays, B-trees, and schema-free tables. Ultimately, the news classifier in production would use something with an automated process for balancing storage nodes, but Tokyo Cabinet is great for prototyping even if it's not the long-term solution.
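The reserve-and-report-back behaviour described above can be sketched with a toy, in-memory queue. This is only an illustration under stated assumptions: `ToyQueue`, `expire`, and `run_step` are hypothetical names invented here, not the beanstalkd protocol or any client library's API, and a worker's failure to report back is simulated by calling `expire` by hand rather than by a real timeout.

```python
import collections
import itertools


class ToyQueue:
    """In-memory stand-in for one job queue (hypothetical, not the beanstalkd API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._ready = collections.deque()  # (job_id, body) pairs waiting for a worker
        self._reserved = {}                # job_id -> body, handed out but not finished

    def put(self, body):
        """Add a new job in the ready state and return its id."""
        job_id = next(self._ids)
        self._ready.append((job_id, body))
        return job_id

    def reserve(self):
        """Hand the oldest ready job to a worker, or return None if none is ready."""
        if not self._ready:
            return None
        job_id, body = self._ready.popleft()
        self._reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        """The worker reported success, so the job is removed for good."""
        del self._reserved[job_id]

    def expire(self, job_id):
        """Simulate a worker that never reported back: the job becomes ready again."""
        self._ready.append((job_id, self._reserved.pop(job_id)))


def run_step(source, sink, transform):
    """One workflow step: reserve a job, do the work, queue the result, then delete."""
    job = source.reserve()
    if job is None:
        return False
    job_id, body = job
    sink.put(transform(body))  # the result goes into a different queue
    source.delete(job_id)      # only now is the input job considered done
    return True
```

Because the job is deleted only after the result has been queued, a worker killed partway through leaves the job reserved, and in this sketch a call to `expire` puts it back in front of the next worker.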

With these tools, a distributed system prototype for ETL can be built in a day. And not just a hacked-up system that should be thrown away, but one that is nearly good enough for production: add deployment management, failure detection, and recovery, and it's complete. That is significant work, but those are concerns for users of most heavier distributed system frameworks too. While this solution is ad hoc and inappropriate for some situations, it's amazing what can be done in a short period of time with a small set of well-designed, task-specific tools.

Filed under  //   distributed   programming   unix   workflow  

Sleeping dogs

Logo draft

Been working on a logo for Lightpost Software; think it's getting closer... Have a couple of other layout and color variants, but this is my current pick.

Filed under  //   design   graphic  

Little brown bat

Hanging out next to the barber shop.

Pictograph in progress

A new draft for Lightpost Software's pictograph:

Or, alternate coloring:

Hint: it's supposed to be a lightpost.

Filed under  //   design   graphic  

Finger, decentralized social networks, and ownership

A group of developers has started WebFinger, an open source project to revive finger by implementing it atop HTTP. Although finger was historically used for supplying contact information and personal news, much like a personal blog, WebFinger combined with online storage would let users consolidate their online persona and retain ownership of content such as photos, videos, blog posts, and personal opinions. This is a step towards reclaiming personal ownership of online content and no longer filling out the same personal information and preferences for every site or social network platform.
 
The future of the Internet needs to combine decentralization and privacy with collaboration tools such as social networks.

Filed under  //   decentralized   distributed   finger   network   social  

Reactable in Chicago

Got to play with a Reactable in Chicago's Museum of Science & Industry.

Leaves opening

Clojure's RestFn

Earlier today on #clojure, there was a brief discussion of how functions with larger (including infinite!) arity are implemented in Clojure. Here is an example similar to the one that started the discussion:

 
(apply + (range 100)) 

 
The interesting bit is in RestFn and in how the compiler lays out the bytecode.
 
The implementation of + is:
 
(defn + 
  "Returns the sum of nums. (+) returns 0." 
  {:inline (fn [x y] `(. clojure.lang.Numbers (add ~x ~y))) 
   :inline-arities #{2}} 
  ([] 0) 
  ([x] (cast Number x)) 
  ([x y] (. clojure.lang.Numbers (add x y))) 
  ([x y & more] 
   (reduce + (+ x y) more))) 

 
There are four implementations of +; which one gets used depends on the number of arguments provided. The + function is a RestFn instance, and its applyTo method is called from apply.

So far we have:

 
(apply f xs) -> f.applyTo(xs). 
 

This then calls:

f.doInvoke(xs.first(), (xs = xs.next()).first(), xs.next())

This matches the implementation that handles the parameter list [x y & more]. When the Clojure compiler emits the bytecode for +, it places the implementation defined for [x y & more] under the appropriate doInvoke method, overriding RestFn's default implementation, which throws an exception about an unsupported arity.
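The same split can be sketched in Python as a rough analogy, not as a description of Clojure's actual Java code. The names `apply_to` and `plus_do_invoke` are hypothetical, the required arity is passed in explicitly, and the sketch covers only the variadic path: it raises for too few arguments, whereas + itself has fixed-arity implementations for those cases.

```python
import functools
import operator


def apply_to(do_invoke, required_arity, xs):
    """Peel off the required arguments and pass everything else as one 'rest' list."""
    xs = list(xs)
    if len(xs) < required_arity:
        # Simplification: this sketch does not model the fixed-arity implementations.
        raise TypeError("too few arguments for the variadic implementation")
    return do_invoke(*xs[:required_arity], xs[required_arity:])


def plus_do_invoke(x, y, more):
    """Mirrors ([x y & more] (reduce + (+ x y) more))."""
    return functools.reduce(operator.add, more, x + y)


total = apply_to(plus_do_invoke, 2, range(100))  # same shape as (apply + (range 100))
```

Here `total` is 4950, the sum of 0 through 99: the first two values become x and y, and the remaining 98 arrive as `more`.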

Filed under  //   clojure   jvm   programming  

The amsmath matrix environment for alignment

I just spent a good hour searching for a way to left-align a couple of lines of type definitions with amsmath. The matrix environment, which provides matrix formatting without delimiters, is a solution.

Filed under  //   latex   tex  
