Docendo discimus

$self->explain

  • Calendar

    May 2010
    M T W T F S S
    « Apr   Jul »
     12
    3456789
    10111213141516
    17181920212223
    24252627282930
    31  
  • Archives

  • Recent Posts

  • Bling

Vedalken scrapes the Web

Posted by brunorc on May 2, 2010

In order to have a Magic cards database, I need some data. There are different options – for example I can just sit down and write down all my cards, as they go. This method have several disadvantages: first of all it is time consuming, I have to add all new cards by hand, and I will never know anything about all those cool cards which I don’t have (yet). And of course, the coolest cards are almost always those missing ones.

So I decided to go for another option. Almost all the information one would ever want is already published in the Internet. For example, you can get all the Magic cards ever released using this URL.

However, that would involve many mouse clicks and would be definitely boring. Another option is to scrap the site – and scrapping the sites is something Perl is really good at! Using the LWP module one can create a browser, which is able to send HTTP requests and receive HTTP responses. LWP comes very handy, but anything sent in the response has to be parsed and then filtered. Luckily, there’s WWW::Mechanize module, built on top of LWP. Find every link of class X and follow it, then frobnicate the resulted HTML – at that kind of tasks WWW::Mechanize can really excel.

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;

my %cards; # here we will store the cards info

my $url = "http://www.essentialmagic.com/cardsets/default.asp";

my $mech = WWW::Mechanize->new;

$mech->get($url);

Now the page with the links to all sets is fetched and stored in the $mech object. All sets are presented in three different formats, but for building the simple database the “Checklist” will be the most appropriate. Quick look at some links reveals they always include the word “Checklist” in the URL, so can be easily indentified:

my @sets = $mech->find_all_links( url_regex => qr/Checklist/ );

I don’t know about you, but I certainly not interested in having all cards in my database (for instance the Vanguard edition). Thus I would prepare a list of sets that should be skipped.

If the set shouldn’t be skipped, all links leading to the card info should be parsed and the card info should be registered. Since I would like to have the full list of names only (the rest will be scrapped from Gatherer), the text of the link would be sufficient. However, I have my cards separated by colors, so I’d like to have the names to be divided into “colored” lists. Unfortunately, the color is not a part of the link… but every link has a mouseOver event with detailed information about the given card! Of course, the event is a part of link definition, so all the necessary data can be retrieved from link:

foreach my $set (@sets) {
    next
        if grep { $set->url =~ /ID=$_$/ } @unwanted_sets;

    $mech->get( $set->url );

    my @cards = $mech->find_all_links( url_regex => qr{em2/Cards/default.aspx} );

    foreach my $card (@cards) {
        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }
}

After running this code I discovered a lot of warnings about undefined or uninitialized values. Quick look at the HTML code of the page revealed, that every card features the link “Buy”, which also has an URL matching the pattern. Some workaround was necessary:

    foreach my $card (@cards) {
        next
            if $card->text eq 'Buy';

        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }

Once the cards are fetched, I’d like to sort them by color, which can be determined on the base of mana cost. This should be easy, since all the information is included in the JavaScript call:

onmouseover="showCardPopup(this, 46518, 'Felidar Sovereign', '4WW', 'Creature', 'Cat Beast', 4, 6, 75, 'Mythic', 'Vigilance, lifelink\nAt the beginning of your upkeep, if you have 40 or more life, you win the game.');"

So one can just split the value of the onmouseover attribute on comma and use the fourth element. But not so fast – sometimes the name can include a comma as well, for instance “Iona, Shield of Emeria”. Also, all text fields are written in apostrophes, which have to be removed; escaped apostrophes have to be unescaped. After looking at the HTML code of the page I came up with the following solution:

  • I don’t care about anything but mana cost;
  • mana cost comes after the name;
  • name may contain comma(s);
  • so just remove elements from the beginning of the list returned by split, until mana cost is on its proper position.

However, recognizing the mana cost can be somewhat tricky. Luckily, there are many fields after the mana cost which don’t include commas – card type & subtype, power, toughness, set number and rarity. Rarity seems a good choice for a check, since it only comes with four different values (some debugging revealed that in fact it may have five values, “Unknown” being the fifth one, and used only for Basic Lands). My plan is to shift elements of the list until the tenth element matches one of the possible rarity values:

my @rarity = qw/Common Uncommon Rare Mythic Unknown/;

foreach my $card ( sort keys %cards ) {
    my @info = split( /, /, $cards{$card}->{raw} );

    @info = map {
        $_ =~ s/^'//;
        $_ =~ s/'$//;
        $_ =~ s/\\'/'/g;
        $_ } @info;

    while ( scalar @info > 9 and not grep { $info[9] eq $_ } @rarity ) {
        shift @info;
    }

The scalar @info > 9 condition eliminates the possibility of iterating over the empty list forever (which was the case of the “Buy” links) – ugly, but works; even if the “Buy” links have been eliminated, I decided to keep this piece “just in case”. Then, I’m not interested in Basic Lands. Any other card should have its mana cost analyzed and then should be assigned to the proper color category. It will be done for all cards, so I just loop over sort keys %cards, obtaining the alphabetical per-color lists.

    my $mana = $info[3];
    my $type = $info[4];
    my $color;

    next
        if $type =~ /Basic/;

    if ( $type eq 'Land' ) {
        $color = 'Land';
    }
    elsif ( $mana =~ /^[0-9]+$/ ) {
        $color = 'None';
    }
    elsif ( $mana =~ /W/ and not $mana =~ /[BURG]/ ) {
        $color = 'White';
    }
    elsif ( $mana =~ /B/ and not $mana =~ /[WURG]/ ) {
        $color = 'Black';
    }
    elsif ( $mana =~ /U/ and not $mana =~ /[BWRG]/ ) {
        $color = 'Blue';
    }
    elsif ( $mana =~ /R/ and not $mana =~ /[BUWG]/ ) {
        $color = 'Red';
    }
    elsif ( $mana =~ /G/ and not $mana =~ /[BURW]/ ) {
        $color = 'Green';
    }
    else {
        $color = 'Multicolor';
    }

    push @{ $cards_by_color{$color} }, { name => $card };
}

Now it’s time to put the whole stuff into the database. My preferred choice is PostgreSQL, but it can be an overkill for one table. Maybe SQLite would be better, but I was not sure if it’s installed.

perl -MDBI -MDBD::SQLite -e '1'

No errors, so I can use SQLite.

After adding some SQL, the script looks like this:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=vedalken.db", "", "" )
    or die "Cannot open or create the DB file: $DBI::errstr";

$dbh->do( q/
    CREATE TABLE IF NOT EXISTS Card (
        id          INTEGER PRIMARY KEY,
        name        VARCHAR(64) UNIQUE NOT NULL,
        manacost    VARCHAR(8) NOT NULL DEFAULT 0,
        color       VARCHAR(10),
        amount      INTEGER NOT NULL DEFAULT 0
    )
/ ) or die "Cannot create table: ", $dbh->errstr;

my $sth = $dbh->prepare( q/
    INSERT INTO Card (name, manacost, color) VALUES (?, ?, ?)
/ ) or die "Cannot prepare insert statement: ", $dbh->errstr;

my %cards;
my %cards_by_color = (
    Land        => [],
    None        => [],
    Blue        => [],
    Red         => [],
    Black       => [],
    Green       => [],
    White       => [],
    Multicolor  => [],
);

my $url = "http://www.essentialmagic.com/cardsets/default.asp";
my @unwanted_sets = qw/32 23 1 30 53/;

my $mech = WWW::Mechanize->new;

$mech->get($url);

my @sets = $mech->find_all_links( url_regex => qr/Checklist/ );

foreach my $set (@sets) {
    next
        if grep { $set->url =~ /ID=$_$/ } @unwanted_sets;

    print "Processing set pointed by ", $set->url;

    $mech->get( $set->url );

    my @cards = $mech->find_all_links( url_regex => qr{em2/Cards/default.aspx} );

    print " - ", scalar @cards, " cards found\n";

    foreach my $card (@cards) {
        next
            if $card->text eq 'Buy';

        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }
}

my @rarity = qw/Common Uncommon Rare Mythic Unknown/;

foreach my $card ( sort keys %cards ) {
    my @info = split( /, /, $cards{$card}->{raw} );

    @info = map {
        $_ =~ s/^'//;
        $_ =~ s/'$//;
        $_ =~ s/\\'/'/g;
        $_ } @info;

    while ( scalar @info > 9 and not grep { $info[9] eq $_ } @rarity ) {
        shift @info;
    }

    my $mana = $info[3];
    my $type = $info[4];
    my $color;

    next
        if $type =~ /Basic/;

    if ( $type eq 'Land' ) {
        $color = 'Land';
    }
    elsif ( $mana =~ /^[0-9]+$/ ) {
        $color = 'None';
    }
    elsif ( $mana =~ /W/ and not $mana =~ /[BURG]/ ) {
        $color = 'White';
    }
    elsif ( $mana =~ /B/ and not $mana =~ /[WURG]/ ) {
        $color = 'Black';
    }
    elsif ( $mana =~ /U/ and not $mana =~ /[BWRG]/ ) {
        $color = 'Blue';
    }
    elsif ( $mana =~ /R/ and not $mana =~ /[BUWG]/ ) {
        $color = 'Red';
    }
    elsif ( $mana =~ /G/ and not $mana =~ /[BURW]/ ) {
        $color = 'Green';
    }
    else {
        $color = 'Multicolor';
    }

    $sth->execute( $card, $color, $mana )
        or die "Cannot execute insert statement: ", $dbh->errstr;
}

$dbh->disconnect;

This version was run and collected the information about 11257 unique cards. If you are lazy and/or you don’t want to scrap essentialmagic.com, the SQLite file is only 200kB after bzipping, so just contact me (unfortunately I cannot upload a non-media file here).

Advertisements

3 Responses to “Vedalken scrapes the Web”

  1. Dries Van Looy said

    Hello, that’s some handy stuff! I never worked with Perl though, and I haven’t much programming experience but I would try to make an application with magic cards, to learn my schoolstuff with it!

    It would be awesome if you could send me the SQL file!

    Greets from Belgium,
    Dries Van Looy

  2. Since you can’t upload a non-media file here, have you considered using a site such as Box.net for storage space? It should be able to support a file 200 kB even with a free account. I use it as a place to store my deck files since I publish on Blogger.

    ~Olin Gallet Magic the Gathering Reviews

  3. Brian said

    Hi there, I’m interested in playing around with the database you mentioned in this post, would it be possible to get it, or the most recent one (looks like you’ve written a couple of different scrapers). If you still have it and are sharing would you mind emailing me a copy? Thanks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: