Docendo discimus

$self->explain

  • Calendar

    July 2017
    M T W T F S S
    « Sep    
     12
    3456789
    10111213141516
    17181920212223
    24252627282930
    31  
  • Archives

  • Recent Posts

  • Bling

Posts Tagged ‘scraping’

Vedalken reborn (in Alara)

Posted by brunorc on May 9, 2010

I planned to later update the Vedalken’s database with the contents of Gatherer, the official database of Magic cards, run by Wizards of the Coast. However, I discovered that the content scraped from Essential Magic was not good enough – probably because I was parsing comma separated list, which was a part of JavaScript function call. Moreover I discovered that some basic lands still made it to get to the database. And finally, I started to check how to scrape the Gatherer just to find that there’s no easy way to get the given card just by its name. The only unique identifier in Gatherer is the multiverse ID. Frankly, one card may have many multiverse IDs, because every time it gets printed in a different edition, new multiverse ID is assigned.

Because of all those issues I decided to start again from scratch – even if I already put all my non-basic lands and colorless spells in the database. I can always use this information and just do UPDATE… so I renamed the table (that was very good move, as it occurred later) and started almost from scratch.

This time I wanted to isolate the code into functions to make the flow clear and understandable. Here’s the top part of the script:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;
use HTML::TreeBuilder;
use DBI;

my %cards;

my $mech = WWW::Mechanize->new;
my $BASE_URL = "http://gatherer.wizards.com/Pages";

my $dbh;
my $sth;

scan_set( $_, \%cards, $mech )
    foreach get_sets($mech);

print scalar keys %cards, " unique cards found\n";

scan_card( $_, $cards{$_}, $mech )
    foreach keys %cards;

$dbh->disconnect;

There’s a new module – HTML::TreeBuilder, part of HTML::Tree suite – which is necessary to parse HTML in a little bit more advanced way than it’s possible with WWW::Mechanize only. The logic is separated into three functions: get_sets (providing a list of all sets, since cards will be searched by set names), scan_set (updating the global hash with info about cards from the particular set) and scan_card (fetching to the DB block of HTML code with card description). At the end of run I should have raw HTML for each card stored in the database, so I can parse it and update detailed information later. Another advantage is that I don’t have to make final decisions about the data model, just focusing on the scraping part.

I only define my database-related variables without any initialization – I will discuss it later, since it had really interesting effects.

The first function takes the $mech object as an argument and returns the list of sets (fetched from Essential Magic, as before), processed by the next function:

sub get_sets {
    my $browser = shift;
    my @sets;

    $browser->get('http://www.essentialmagic.com/cardsets/default.asp');

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    my $link_check = sub {
         $_[0]->attr('href') =~ m!CardSets/Overview!;
    };

    foreach my $class ( qw/Row1 Row2/ ) {
        my @rows = $tree->look_down( '_tag' => 'td', 'class' => $class );

        foreach my $row ( @rows ) {
            if ( my $link = $row->look_down(  '_tag' => 'a', $link_check ) ) {
                ( my $id = $link->attr('href') ) =~ s/^.+ID=//;

                next
                    if grep { $id == $_ } qw/1 23 30 32 53/;

                $sets[$id] = $row->look_down( '_tag' => 'img' )->attr('alt');
                # fix differences
                $sets[$id] =~ s/Timeshifted/"Timeshifted"/;
                $sets[$id] =~ s/WorldWake/Worldwake/;
                $sets[$id] =~ s/10th/Tenth/;
                $sets[$id] =~ s/9th/Ninth/;
                $sets[$id] =~ s/8th/Eighth/;
                $sets[$id] =~ s/7th/Seventh/;
                $sets[$id] =~ s/6th/Sixth/;
                $sets[$id] =~ s/5th/Fifth/;
                $sets[$id] =~ s/4th/Fourth/;
                $sets[$id] =~ s/Portal-/Portal/;
                $sets[$id] =~ s/.*Starter 1999.*/Starter 1999/;
            }
        }
    }
    return reverse grep { defned $_ } @sets;
}

In this function HTML::TreeBuilder is used to find particular elements of HTML code. In line 14 I want to get all <td> tags of the given class. However, in line 17 beside passing a list of parameters specifying tag type or attributes, I also use the $link_check variable. This variable is in fact a reference to an anonymous function. The function itself is defined in line 9. The documentation for look_down function shows an example where the anonymous function is passed directly in the call, but since the look_down is called in the if clause, I wanted to make the code shorter. Anyway, instead or beside of parameter => value pairs, one can pass a reference to a function, which would return true or false (indicating if the condition was met). I was only interested in links, where href parameter was matching some specific pattern. In case of WWW::Mechanize I could have used something like this:

my @links = $mech->find_all_links( url_regex => qr{CardSets/Overview} );

but HTML::TreeBuilder only allows simple comparisons. In fact, everything else can be put in a function, so it’s a good deal.

All matching links have to be prepared and stored. Preparation include removing everything from the beginning, so only the ID of the set is left. I’m not particularly interested in those IDs, but they will make the list organized. Also, I want to skip some of them. Then I use the alternate text (since sets are indicated in the table using images…) and fix it in case of some sets; at the end I return the reversed list of all sets. The list is reversed, because I want to have cards with their latest multiverse IDs (for example Glory Seeker was recently reprinted in the Rise of the Eldrazi, but its first version appeared in Onslaught).

Once the list is ready (and undefined values are filtered out by grep), it is passed to the foreach loop, where every item will be processed by the scan_set function:

sub scan_set {
    my ( $set_name, $cards_href, $browser ) = @_;
    print "Processing set \"$set_name\" ";

    my $set_url = "%s/Search/Default.aspx?output=checklist&action=advanced&set=[%%22%s%%22]";

    $browser->get( sprintf( $set_url, $BASE_URL, $set_name ) );

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    my @rows = $tree->look_down( '_tag' => 'tr', 'class' => 'cardItem' );
    print " - ", scalar @rows, " cards found\n";

    foreach my $row ( @rows ) {
        my $link = $row->look_down( '_tag' => 'a', 'class' => 'nameLink' );

        my $name = $link->as_text;

        next
            if grep { $name eq $_ } qw/Plains Forest Swamp Island Mountain/;

        ( my $id = $link->attr('href') ) =~ s/^.*multiverseid=//;

        if ( exists $cards_href->{$name} ) {
            push @{ $cards_href->{$name}{gids} }, $id;
        }
        else {
            $cards_href->{$name} = {
                gid     => $id,
                gids    => [ $id ],
            };
        }
    }
}

The first argument of this function is the set name, and the third one is a “browser” (WWW::Mechanize object). The second one is passed in the call as \%cards. Again, it’s a reference – but this time it’s a reference to a hash (usually called a “hashref”). Usage of the reference allows one to pass arguments by reference (quite surprising, right?), instead of passing them by value:

my @list = qw/foo bar baz/;

sub by_val {
    my @args = @_;

    foreach (@args) {
        $_ = uc $_;
    }
}

sub by_ref {
    ( my $args ) = @_;

    foreach (@$args) {
        $_ = uc $_;
    }
}

by_val(@list);
print join( ', ', @list ), $/;

by_ref(\@list);
print join( ', ', @list ), $/;

The first function receives a list of arguments passed by their values and stores them in the @args list. But in fact they are copied, one by one, and the function operates only on those copies. That’s why changes are not stored in the original (elements of @list are not converted to upper case). The latter function receives one argument – which is the reference to the list (references are denoted by the backslash before the original variable). In the foreach loop it has to be dereferenced, so Perl will use the object pointed by the reference and treat is like an array.

Actually, the same effect can be achieved by using the original argument list:

sub by_direct {
    foreach (@_) {
        $_ = uc $_;
    }
}

since accessing arguments like $_[0], $_[1] will also allow to modify the original passed value. But naming arguments is always a good idea; also passing by reference is the only way to pass complex data structures to a function.

sub scan_card {
    my ( $cardname, $storage, $browser ) = @_;

    my $card_url = "%s/Card/Details.aspx?multiverseid=%d";

    $browser->get( sprintf( $card_url, $BASE_URL, $storage->{gid} ) );

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    $storage->{raw} = ( $tree->look_down( '_tag' => 'div', 'class' => 'smallGreyMono' ) )[1];

    $storage->{gids} = join( ', ', @{ $storage->{gids} } );

    $sth->execute( $cardname, @{$storage}{ qw/gid gids raw/ } );
}

Since $storage is a reference to a hash, instead of using $storage{something} the arrow notation is used: $storage->{something}. Also, the “gids” field contains a reference to an array, created in line 30 of the scan_set function, and dereferenced in line 25 of the same function, to have push being called on it. Here, in line 12 it’s dereferenced in the same way. And in line 14 I want to pass the values of three fields to the function call; instead of writing every single one like $storage->{gid} and so on, I use hash slices. If the $storage was an ordinary hash, I could write @storage{ ... }, but since it’s a reference, I have to dereference it: @{ $storage }{ ... }.

Perl is very flexible in treating data structures according to the programmer’s will. In the line 10 I call the look_down function, which – when called in so called “list context” – returns a list of objects matching the given conditions. In scalar context it returns the first element found. But I’m interested in the second one… thus I call the function in the list context, I treat the whole call as a list and return only the second object of this “list”.

Last but not least I had to initialize my DB-related variables. I put them on the bottom of the file in the BEGIN block. Those named blocks are quite useful, but I used the wrong one. The problem is, the BEGIN blocks (since there may be more than one) are executed in the compilation phase. So when I ran perl -c scraper.pl (to check if there are no stupid syntax errors), this code got executed. Luckily, there was no DROP TABLE IF EXISTS, and moreover I renamed the already populated table. But actually I wanted this code to initialize my variables, so I should have used the INIT block (or use the $COMPILING variable to check if the code is compiled or run, but that would be completely wrong).

# used to be BEGIN...
INIT {
    $dbh = DBI->connect( "dbi:SQLite:dbname=vedalken.db", "", "" )
        or die "Cannot open or create the DB file: $DBI::errstr";

    $dbh->do( q/
        CREATE TABLE IF NOT EXISTS Card (
            id      INTEGER PRIMARY KEY,
            name    VARCHAR(64) UNIQUE NOT NULL,
            gid     INTEGER NOT NULL,
            gids    TEXT,
            raw     TEXT
        )
    / ) or die "Cannot create table: ", $dbh->errstr;

    $sth = $dbh->prepare( q/
        INSERT INTO Card (name, gid, gids, raw)
        VALUES (?, ?, ?, ?)
    / ) or die "Cannot prepare insert statement: ", $dbh->errstr;
}

Of course every BEGIN or INIT block is still just a block, so any variable defined inside its scope will not be visible or accessible from outside. Thus $dbh and $sth had to be defined beforehand, in the global scope (because closures would be an overkill).

Anyway – I ended up with the database populated with raw HTML scraped from Gatherer (10961 cards). Next time I plan to use HTML::TreeBuilder to parse it and fill the data model.

Posted in Perl | Tagged: , , , , , , | 1 Comment »

Vedalken improves

Posted by brunorc on May 2, 2010

Nobody’s perfect.

Vedalken, apart from their strictly structured minds and a tendency towards logic, aren’t perfect either. After some heavy testing and inspecting the database populated by the script, I came up with several bugfixes.

Here’s the new version:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;
use Data::Dumper;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=vedalken.db", "", "" )
    or die "Cannot open or create the DB file: $DBI::errstr";

$dbh->do( q/
    CREATE TABLE IF NOT EXISTS Card (
        id          INTEGER PRIMARY KEY,
        name        VARCHAR(64) UNIQUE NOT NULL,
        manacost    VARCHAR(8) NOT NULL DEFAULT 0,
        color       VARCHAR(10),
        cardtype    VARCHAR(32),
        subtype     VARCHAR(32),
        raw         TEXT,
        amount      INTEGER NOT NULL DEFAULT 0
    )
/ ) or die "Cannot create table: ", $dbh->errstr;

my $sth = $dbh->prepare( q/
    INSERT INTO Card (name, manacost, color, cardtype, subtype, raw) 
    VALUES (?, ?, ?, ?, ?, ?)
/ ) or die "Cannot prepare insert statement: ", $dbh->errstr;

my %cards;

my $url = "http://www.essentialmagic.com/cardsets/default.asp";
my @unwanted_sets = qw/32 23 1 30 53/;

my $mech = WWW::Mechanize->new;

$mech->get($url);

my @sets = $mech->find_all_links( url_regex => qr/Checklist/ );

foreach my $set (@sets) {
    next
        if grep { $set->url =~ /ID=$_$/ } @unwanted_sets;

    print "Processing set pointed by ", $set->url;

    $mech->get( $set->url );

    my @cards = $mech->find_all_links( url_regex => qr{em2/Cards/default.aspx} );

    print " - ", scalar @cards, " cards found\n";

    foreach my $card (@cards) {
        next
            if $card->text eq 'Buy';

        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }
}

print scalar keys %cards, " unique cards found\n";

my @rarity = qw/Common Uncommon Rare Mythic Unknown/;

my %colors = (
    W => 'White',
    B => 'Black',
    U => 'Blue',
    R => 'Red',
    G => 'Green'
);

foreach my $card ( sort keys %cards ) {
    ( my $raw = join( ' ', split( /\n/, $cards{$card}->{raw} ) ) )
        =~ s/^showCardPopup\((.+)\);$/$1/;

    my @info = split( /, /, $raw );

    splice @info, 0, 2;

    @info = map {
        $_ =~ s/^'//;
        $_ =~ s/'$//;
        $_ =~ s/\\'/'/g;
        $_ } @info;

    while ( scalar @info > 6 and not grep { $info[6] eq $_ } @rarity ) {
        shift @info;
    }

    my $mana = $info[0];
    my $type = $info[1];
    my $subt = $info[2];
    my $color;

    next
        if $type =~ /^Basic/;

    $color = $type eq 'Land'
        ?   'Land'
        :   $mana =~ /^[0-9]+$/
            ?   'Colorless'
            :   'Multicolor';

    if ( $color eq 'Multicolor' ) {
        foreach my $c ( keys %colors ) {
            $color = $colors{$c}
                if $mana =~ /$c/
                    and not grep { $mana =~ /$_/ } grep { $_ ne $c } keys %colors;
        }
    }

    $sth->execute( $card, $mana, $color, $type, $subt, join( ', ', @info ) )
        or die "Cannot execute insert statement: ", $dbh->errstr;
}

$dbh->do( q/UPDATE Card SET cardtype = 'Instant' WHERE cardtype IN ('Interrupt', 'Mana Source')/ );
$dbh->do( q/UPDATE Card SET cardtype = 'Creature' WHERE cardtype = 'Summon'/ );

$dbh->disconnect;

First of all, since the information about card type and subtype is already present, why not to use it? So I added new columns in the table. Also, because of the database the %cards_by_color hash is no longer needed, so it was removed. I added a hash with colors, which is later used to loop over it in lines 106-112 in order to determine the single color. In fact, the whole logic of discovering the color was rewritten – as well as handling the call to showCardPopup:

    ( my $raw = join( ' ', split( /\n/, $cards{$card}->{raw} ) ) )
        =~ s/^showCardPopup\((.+)\);$/$1/;

First, I get rid of multiline descriptions (there were some), by splitting on a newline and joining on a space. The result of this operation is then assigned to $raw. Then I use a regular expression to get rid of JavaScript bits; since the parentheses are used in regular expressions to capture matches, I have to escape the “real” parentheses. Actually, I should have used the /x modifier for regexp, which would definitely improve the readiness of this code:

    ( my $raw = join( ' ', split( /\n/, $cards{$card}->{raw} ) ) )
        =~ s/^showCardPopup \( (.+) \) ;$/$1/x;

And now for determining the colors:

    $color = $type eq 'Land'
        ?   'Land'
        :   $mana =~ /^[0-9]+$/
            ?   'Colorless'
            :   'Multicolor';

    if ( $color eq 'Multicolor' ) {
        foreach my $c ( keys %colors ) {
            $color = $colors{$c}
                if $mana =~ /$c/
                    and not grep { $mana =~ /$_/ } grep { $_ ne $c } keys %colors;
        }
    }

The ternary operator is being chained to allow setting the “colorless” colors. After that, every card is assumed to be multicolor. Then, for every color in the table, this color is set if mana cost matches the letter describing this color, and in the same time mana cost doesn’t match (first grep) any color different from the particular color being checked (second grep).

Since the vocabulary of Magic cards has been changed many times, some cards may get registered in the database under some archaic categories. So I added two lines which fixes those cases, calling SQL UPDATE on the populated database.

Update: Some cards appears more than one time with postfixes from “(1)” to “(4)”. I fixed it by deleting them from the table, because I really didn’t want to scrape Essential Magic over and over again.

Posted in Perl | Tagged: , , | Leave a Comment »

Vedalken scrapes the Web

Posted by brunorc on May 2, 2010

In order to have a Magic cards database, I need some data. There are different options – for example I can just sit down and write down all my cards, as they go. This method have several disadvantages: first of all it is time consuming, I have to add all new cards by hand, and I will never know anything about all those cool cards which I don’t have (yet). And of course, the coolest cards are almost always those missing ones.

So I decided to go for another option. Almost all the information one would ever want is already published in the Internet. For example, you can get all the Magic cards ever released using this URL.

However, that would involve many mouse clicks and would be definitely boring. Another option is to scrap the site – and scrapping the sites is something Perl is really good at! Using the LWP module one can create a browser, which is able to send HTTP requests and receive HTTP responses. LWP comes very handy, but anything sent in the response has to be parsed and then filtered. Luckily, there’s WWW::Mechanize module, built on top of LWP. Find every link of class X and follow it, then frobnicate the resulted HTML – at that kind of tasks WWW::Mechanize can really excel.

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;

my %cards; # here we will store the cards info

my $url = "http://www.essentialmagic.com/cardsets/default.asp";

my $mech = WWW::Mechanize->new;

$mech->get($url);

Now the page with the links to all sets is fetched and stored in the $mech object. All sets are presented in three different formats, but for building the simple database the “Checklist” will be the most appropriate. Quick look at some links reveals they always include the word “Checklist” in the URL, so can be easily indentified:

my @sets = $mech->find_all_links( url_regex => qr/Checklist/ );

I don’t know about you, but I certainly not interested in having all cards in my database (for instance the Vanguard edition). Thus I would prepare a list of sets that should be skipped.

If the set shouldn’t be skipped, all links leading to the card info should be parsed and the card info should be registered. Since I would like to have the full list of names only (the rest will be scrapped from Gatherer), the text of the link would be sufficient. However, I have my cards separated by colors, so I’d like to have the names to be divided into “colored” lists. Unfortunately, the color is not a part of the link… but every link has a mouseOver event with detailed information about the given card! Of course, the event is a part of link definition, so all the necessary data can be retrieved from link:

foreach my $set (@sets) {
    next
        if grep { $set->url =~ /ID=$_$/ } @unwanted_sets;

    $mech->get( $set->url );

    my @cards = $mech->find_all_links( url_regex => qr{em2/Cards/default.aspx} );

    foreach my $card (@cards) {
        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }
}

After running this code I discovered a lot of warnings about undefined or uninitialized values. Quick look at the HTML code of the page revealed, that every card features the link “Buy”, which also has an URL matching the pattern. Some workaround was necessary:

    foreach my $card (@cards) {
        next
            if $card->text eq 'Buy';

        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }

Once the cards are fetched, I’d like to sort them by color, which can be determined on the base of mana cost. This should be easy, since all the information is included in the JavaScript call:

onmouseover="showCardPopup(this, 46518, 'Felidar Sovereign', '4WW', 'Creature', 'Cat Beast', 4, 6, 75, 'Mythic', 'Vigilance, lifelink\nAt the beginning of your upkeep, if you have 40 or more life, you win the game.');"

So one can just split the value of the onmouseover attribute on comma and use the fourth element. But not so fast – sometimes the name can include a comma as well, for instance “Iona, Shield of Emeria”. Also, all text fields are written in apostrophes, which have to be removed; escaped apostrophes have to be unescaped. After looking at the HTML code of the page I came up with the following solution:

  • I don’t care about anything but mana cost;
  • mana cost comes after the name;
  • name may contain comma(s);
  • so just remove elements from the beginning of the list returned by split, until mana cost is on its proper position.

However, recognizing the mana cost can be somewhat tricky. Luckily, there are many fields after the mana cost which don’t include commas – card type & subtype, power, toughness, set number and rarity. Rarity seems a good choice for a check, since it only comes with four different values (some debugging revealed that in fact it may have five values, “Unknown” being the fifth one, and used only for Basic Lands). My plan is to shift elements of the list until the tenth element matches one of the possible rarity values:

my @rarity = qw/Common Uncommon Rare Mythic Unknown/;

foreach my $card ( sort keys %cards ) {
    my @info = split( /, /, $cards{$card}->{raw} );

    @info = map {
        $_ =~ s/^'//;
        $_ =~ s/'$//;
        $_ =~ s/\\'/'/g;
        $_ } @info;

    while ( scalar @info > 9 and not grep { $info[9] eq $_ } @rarity ) {
        shift @info;
    }

The scalar @info > 9 condition eliminates the possibility of iterating over the empty list forever (which was the case of the “Buy” links) – ugly, but works; even if the “Buy” links have been eliminated, I decided to keep this piece “just in case”. Then, I’m not interested in Basic Lands. Any other card should have its mana cost analyzed and then should be assigned to the proper color category. It will be done for all cards, so I just loop over sort keys %cards, obtaining the alphabetical per-color lists.

    my $mana = $info[3];
    my $type = $info[4];
    my $color;

    next
        if $type =~ /Basic/;

    if ( $type eq 'Land' ) {
        $color = 'Land';
    }
    elsif ( $mana =~ /^[0-9]+$/ ) {
        $color = 'None';
    }
    elsif ( $mana =~ /W/ and not $mana =~ /[BURG]/ ) {
        $color = 'White';
    }
    elsif ( $mana =~ /B/ and not $mana =~ /[WURG]/ ) {
        $color = 'Black';
    }
    elsif ( $mana =~ /U/ and not $mana =~ /[BWRG]/ ) {
        $color = 'Blue';
    }
    elsif ( $mana =~ /R/ and not $mana =~ /[BUWG]/ ) {
        $color = 'Red';
    }
    elsif ( $mana =~ /G/ and not $mana =~ /[BURW]/ ) {
        $color = 'Green';
    }
    else {
        $color = 'Multicolor';
    }

    push @{ $cards_by_color{$color} }, { name => $card };
}

Now it’s time to put the whole stuff into the database. My preferred choice is PostgreSQL, but it can be an overkill for one table. Maybe SQLite would be better, but I was not sure if it’s installed.

perl -MDBI -MDBD::SQLite -e '1'

No errors, so I can use SQLite.

After adding some SQL, the script looks like this:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=vedalken.db", "", "" )
    or die "Cannot open or create the DB file: $DBI::errstr";

$dbh->do( q/
    CREATE TABLE IF NOT EXISTS Card (
        id          INTEGER PRIMARY KEY,
        name        VARCHAR(64) UNIQUE NOT NULL,
        manacost    VARCHAR(8) NOT NULL DEFAULT 0,
        color       VARCHAR(10),
        amount      INTEGER NOT NULL DEFAULT 0
    )
/ ) or die "Cannot create table: ", $dbh->errstr;

my $sth = $dbh->prepare( q/
    INSERT INTO Card (name, manacost, color) VALUES (?, ?, ?)
/ ) or die "Cannot prepare insert statement: ", $dbh->errstr;

my %cards;
my %cards_by_color = (
    Land        => [],
    None        => [],
    Blue        => [],
    Red         => [],
    Black       => [],
    Green       => [],
    White       => [],
    Multicolor  => [],
);

my $url = "http://www.essentialmagic.com/cardsets/default.asp";
my @unwanted_sets = qw/32 23 1 30 53/;

my $mech = WWW::Mechanize->new;

$mech->get($url);

my @sets = $mech->find_all_links( url_regex => qr/Checklist/ );

foreach my $set (@sets) {
    next
        if grep { $set->url =~ /ID=$_$/ } @unwanted_sets;

    print "Processing set pointed by ", $set->url;

    $mech->get( $set->url );

    my @cards = $mech->find_all_links( url_regex => qr{em2/Cards/default.aspx} );

    print " - ", scalar @cards, " cards found\n";

    foreach my $card (@cards) {
        next
            if $card->text eq 'Buy';

        $cards{ $card->text } = { raw => $card->attrs->{onmouseover} };
    }
}

my @rarity = qw/Common Uncommon Rare Mythic Unknown/;

foreach my $card ( sort keys %cards ) {
    my @info = split( /, /, $cards{$card}->{raw} );

    @info = map {
        $_ =~ s/^'//;
        $_ =~ s/'$//;
        $_ =~ s/\\'/'/g;
        $_ } @info;

    while ( scalar @info > 9 and not grep { $info[9] eq $_ } @rarity ) {
        shift @info;
    }

    my $mana = $info[3];
    my $type = $info[4];
    my $color;

    next
        if $type =~ /Basic/;

    if ( $type eq 'Land' ) {
        $color = 'Land';
    }
    elsif ( $mana =~ /^[0-9]+$/ ) {
        $color = 'None';
    }
    elsif ( $mana =~ /W/ and not $mana =~ /[BURG]/ ) {
        $color = 'White';
    }
    elsif ( $mana =~ /B/ and not $mana =~ /[WURG]/ ) {
        $color = 'Black';
    }
    elsif ( $mana =~ /U/ and not $mana =~ /[BWRG]/ ) {
        $color = 'Blue';
    }
    elsif ( $mana =~ /R/ and not $mana =~ /[BUWG]/ ) {
        $color = 'Red';
    }
    elsif ( $mana =~ /G/ and not $mana =~ /[BURW]/ ) {
        $color = 'Green';
    }
    else {
        $color = 'Multicolor';
    }

    $sth->execute( $card, $color, $mana )
        or die "Cannot execute insert statement: ", $dbh->errstr;
}

$dbh->disconnect;

This version was run and collected the information about 11257 unique cards. If you are lazy and/or you don’t want to scrap essentialmagic.com, the SQLite file is only 200kB after bzipping, so just contact me (unfortunately I cannot upload a non-media file here).

Posted in Perl | Tagged: , , , , , | 3 Comments »