Pivotal Labs

Screen Scrape No More...Seriously!

Posted by Parker Thompson on Sunday February 10, 2008 at 02:48PM

This week I had the pleasure of attending Dapper Camp put on by the folks at Dapper for their user community. Mitch Kapor kicked it off with a talk on disruptive technology, openness, and innovation. We then got to hear both from the Dapper team about new and upcoming features, and folks like Aaron Fulkerson of MindTouch about using Dapper to repurpose data. All around it was pretty interesting.

What I love about Dapper is that it helps solve one of the big issues I see our clients have: data. We can build just about anything, but if an application needs some specific data (and many do), products must be launched with sub-par (but available) data or, worse, launches can end up being delayed. In many cases, we end up spending a large amount of time (aka money) getting and munging data rather than developing features. Note: I also think the ease of pushing data out of apps via instant gadgets makes Dapper very interesting, but that's a whole separate post.

The "Dapp Factory" is a Rhino-based server application and web front-end that deals with just about any site by proxying your requests and modeling the DOM on the proxy, then recording your actions for later replay. But their secret sauce is a super-cool algorithm that figures out the structure of pages in such a way that your API can withstand changes to the target site, making your feed resilient to all but massive site overhauls. You then simply consume an XML or JSON feed, or use a simple API to dynamically construct parameterized feeds.
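
To make the consuming side concrete, here is a rough Ruby sketch of building a parameterized feed URL and parsing a JSON feed body. The base URL, the query parameter names, and the shape of the payload are all placeholders I made up for illustration; they are not Dapper's actual API.

```ruby
require 'json'
require 'uri'

# Build a feed URL from a dapp name and search parameters.
# The base URL and parameter names are placeholders, not Dapper's real API.
def feed_url(dapp_name, params = {}, base = 'http://example.com/dapp/feed')
  query = URI.encode_www_form({ 'dapp' => dapp_name, 'format' => 'json' }.merge(params))
  "#{base}?#{query}"
end

# Turn a JSON feed body into an array of plain hashes.
# Assumes (hypothetically) a payload shaped like {"items": [{...}, ...]}.
def parse_feed(json_body)
  JSON.parse(json_body).fetch('items', [])
end

# A canned response stands in for the HTTP call so the sketch runs offline.
sample_body = '{"items": [{"airline": "Example Air", "price": 199}]}'

puts feed_url('fares', 'airport' => 'SFO')
puts parse_feed(sample_body).first['airline']
```

In a real application the canned string would be replaced by an HTTP request to whatever feed URL the service actually hands you.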

There are other companies trying to make data less painful. Metaweb, for example, provides an incredibly fast graph engine and relational schemas (think RDF) that make real-time use interesting. But if the data you need isn't in Freebase (a likelihood until they get larger), or your data is continually being updated, you will still be stuck scraping and relating the data, and that's generally where most of the work is.

Take as a small example Dav's awesome Vacation Planner. The concept is simple, the feature set is small, but getting the data is a pain (see article). Some sites don't have APIs, and the APIs that do exist are unstandardized, sometimes buggy, and often missing the data you need.

I could imagine writing a dapp parser akin to ActiveResource pretty easily (I hear a ruby SDK came out of DapperCamp, but I can't find it). With a little more work, it would probably be easy to add cache_fu support, and ruby modules that could be mixed into models (for asynchronous data gathering) and controllers (for serve now vs. polling) to easily support Dav's polling mechanism.
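
As a very rough sketch of what the core of such a library could look like, the Ruby below defines a base class with `dapp` and `parameter` class macros and a `find` that checks required parameters. Everything here is hypothetical: `ActiveDapp` does not exist, the macro names are my own, and the fetcher is a stub returning canned rows so the example runs without touching any real feed.

```ruby
# Hypothetical sketch of an "ActiveDapp" base class. Nothing here is a real
# library; the macro names are invented for illustration.
module ActiveDapp
  class MissingParameter < ArgumentError; end

  class Model
    class << self
      # A fetcher is any callable taking (url, params) and returning an array
      # of hashes. It is injectable so the sketch runs without a network.
      attr_writer :fetcher

      def fetcher
        @fetcher || (superclass.respond_to?(:fetcher) ? superclass.fetcher : nil)
      end

      # Declare (or read back) which dapp feed backs this model.
      def dapp(url = nil)
        @dapp_url = url if url
        @dapp_url
      end

      # Declare a search parameter, optionally required.
      def parameter(name, options = {})
        own_parameters[name] = options
      end

      def own_parameters
        @own_parameters ||= {}
      end

      # Parameters declared on this class and its ancestors.
      def parameters
        inherited = superclass.respond_to?(:parameters) ? superclass.parameters : {}
        inherited.merge(own_parameters)
      end

      # Validate required parameters, fetch rows, wrap each row in a model.
      def find(params = {})
        missing = parameters.select { |name, opts| opts[:required] && !params.key?(name) }.keys
        raise MissingParameter, "missing: #{missing.join(', ')}" unless missing.empty?
        fetcher.call(dapp, params).map { |row| new(row) }
      end
    end

    def initialize(attributes)
      @attributes = attributes
    end

    # Expose feed fields as reader methods, e.g. fare.price.
    def method_missing(name, *args)
      @attributes.key?(name) ? @attributes[name] : super
    end

    def respond_to_missing?(name, include_private = false)
      @attributes.key?(name) || super
    end
  end
end

class Fare < ActiveDapp::Model
  parameter :airport, :required => true
end

class ExciteFare < Fare
  dapp 'url/to/excite/dapp'
end

# Stub fetcher returning canned rows instead of calling a real feed.
ActiveDapp::Model.fetcher = lambda do |_url, _params|
  [{ :airline => 'Example Air', :price => 250 }, { :airline => 'Sample Jet', :price => 180 }]
end

cheapest = ExciteFare.find(:airport => 'SFO').min_by { |fare| fare.price }
puts "#{cheapest.airline}: #{cheapest.price}"
```

Keeping the fetcher injectable is a deliberate choice: swapping in an HTTP client or a caching layer later would not require touching the models.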

This would leave Dav with pretty simple model (data) code, and the luxury of focusing on whether to add Wikipedia integration for population figures or the Big Mac Index, rather than tweaking his Mechanize xpaths all weekend. I vote for the Big Mac Index.

So, the next time someone suggests you screen scrape to get that data you need, tell them to give Dapper a shot. And if anyone wants to write 'ActiveDapp' let me know. It could be really fun.

Note: In the spirit of full disclosure, Jon Aizen (Dapper CTO) is a friend of mine and they gave me a free t-shirt...and a sandwich (thanks).

Comments

  1. Parker Thompson on February 11, 2008 at 03:13AM

    Here's what I imagine an implementation of Dav's vacation finder using the (not-yet-existent) "ActiveDapp" library might look like:

    # base class for results from
    # various travel sites
    class Fare < ActiveDapp::Model
      # don't even search without these
      parameter :airport, :required => true
      parameter :city, :required => true
    end
    
    class ExciteFare < Fare
      dapp "url/to/excite/dapp"
    end
    
    class FareCompareFare < Fare
      dapp "url/to/farecompare/dapp"
    end
    
    # model representing dapp results of VRBO
    class Rental < ActiveDapp::Model
      dapp "url/to/VRBO/dapp"
      # map fields from one dapp to another's search parameters
      has_many :fares, :join => { :airport => :renter_location }
    end
    
    
    # model that has knowledge of dapps, but isn't one
    class Vacation < ActiveDapp::Model
      property :airport, :required => true
    
      has_many :rentals, :params => { :renter_location => :airport }
    
      # extend the association with a helper that picks the cheapest fare
      has_many :fares, :through => :rentals do
        def lowest
          proxy.sort { |a, b| a.price <=> b.price }.first
        end
      end
    
      def price
        self.rental.price + self.rental.fares.lowest.price
      end
    end
    
    @vacations = Vacation.find(:airport => 'SFO', :sort_by => :price)
    @vacations.each do |v|
      puts v.city
      puts v.price
      puts v.rental.start_date
      puts v.rental.end_date
      puts v.fare.airline
      puts v.fare.price
    end
    
  2. Viktor on February 11, 2008 at 12:25PM

    It is a better way of handling data, especially for me, since I deal with several layers for clients. Good job.

  3. Peyton on February 12, 2008 at 11:44AM

    This is a good system for handling data. Users, specifically companies that mostly handle large volumes of data, will not find it complicated to operate.
