Docendo discimus

$self->explain

  • Calendar

    January 2022
    M T W T F S S
     12
    3456789
    10111213141516
    17181920212223
    24252627282930
    31  
  • Archives

  • Recent Posts

  • Bling

Posts Tagged ‘Perl’

Perl 6 in shops near you

Posted by brunorc on July 20, 2010

In case you didn’t know it yet – Rakudo Star will be released soon. Well, actually you can know exactly the day when Rakudo Star will be released.

So prepare your browser, download it, try it and then go to YAPC::EU 2010 in Pisa. And there will be much rejoicing!

And Magic. I still work on Vedalken (in fact that’s the reason why I not write about it) and I hope to reveal more code here soon. And of course in Pisa I will have a deck or two, just in case – I think Giel will bring them as well :-)

Posted in Perl, Perl 6 | Tagged: , , , , | Leave a Comment »

Vedalken reborn (in Alara)

Posted by brunorc on May 9, 2010

I planned to later update the Vedalken’s database with the contents of Gatherer, the official database of Magic cards, run by Wizards of the Coast. However, I discovered that the content scraped from Essential Magic was not good enough – probably because I was parsing comma separated list, which was a part of JavaScript function call. Moreover I discovered that some basic lands still made it to get to the database. And finally, I started to check how to scrape the Gatherer just to find that there’s no easy way to get the given card just by its name. The only unique identifier in Gatherer is the multiverse ID. Frankly, one card may have many multiverse IDs, because every time it gets printed in a different edition, new multiverse ID is assigned.

Because of all those issues I decided to start again from scratch – even if I already put all my non-basic lands and colorless spells in the database. I can always use this information and just do UPDATE… so I renamed the table (that was very good move, as it occurred later) and started almost from scratch.

This time I wanted to isolate the code into functions to make the flow clear and understandable. Here’s the top part of the script:

#!/usr/bin/env perl

use strict;
use warnings;

use WWW::Mechanize;
use HTML::TreeBuilder;
use DBI;

my %cards;

my $mech = WWW::Mechanize->new;
my $BASE_URL = "http://gatherer.wizards.com/Pages";

my $dbh;
my $sth;

scan_set( $_, \%cards, $mech )
    foreach get_sets($mech);

print scalar keys %cards, " unique cards found\n";

scan_card( $_, $cards{$_}, $mech )
    foreach keys %cards;

$dbh->disconnect;

There’s a new module – HTML::TreeBuilder, part of HTML::Tree suite – which is necessary to parse HTML in a little bit more advanced way than it’s possible with WWW::Mechanize only. The logic is separated into three functions: get_sets (providing a list of all sets, since cards will be searched by set names), scan_set (updating the global hash with info about cards from the particular set) and scan_card (fetching to the DB block of HTML code with card description). At the end of run I should have raw HTML for each card stored in the database, so I can parse it and update detailed information later. Another advantage is that I don’t have to make final decisions about the data model, just focusing on the scraping part.

I only define my database-related variables without any initialization – I will discuss it later, since it had really interesting effects.

The first function takes the $mech object as an argument and returns the list of sets (fetched from Essential Magic, as before), processed by the next function:

sub get_sets {
    my $browser = shift;
    my @sets;

    $browser->get('http://www.essentialmagic.com/cardsets/default.asp');

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    my $link_check = sub {
         $_[0]->attr('href') =~ m!CardSets/Overview!;
    };

    foreach my $class ( qw/Row1 Row2/ ) {
        my @rows = $tree->look_down( '_tag' => 'td', 'class' => $class );

        foreach my $row ( @rows ) {
            if ( my $link = $row->look_down(  '_tag' => 'a', $link_check ) ) {
                ( my $id = $link->attr('href') ) =~ s/^.+ID=//;

                next
                    if grep { $id == $_ } qw/1 23 30 32 53/;

                $sets[$id] = $row->look_down( '_tag' => 'img' )->attr('alt');
                # fix differences
                $sets[$id] =~ s/Timeshifted/"Timeshifted"/;
                $sets[$id] =~ s/WorldWake/Worldwake/;
                $sets[$id] =~ s/10th/Tenth/;
                $sets[$id] =~ s/9th/Ninth/;
                $sets[$id] =~ s/8th/Eighth/;
                $sets[$id] =~ s/7th/Seventh/;
                $sets[$id] =~ s/6th/Sixth/;
                $sets[$id] =~ s/5th/Fifth/;
                $sets[$id] =~ s/4th/Fourth/;
                $sets[$id] =~ s/Portal-/Portal/;
                $sets[$id] =~ s/.*Starter 1999.*/Starter 1999/;
            }
        }
    }
    return reverse grep { defned $_ } @sets;
}

In this function HTML::TreeBuilder is used to find particular elements of HTML code. In line 14 I want to get all <td> tags of the given class. However, in line 17 beside passing a list of parameters specifying tag type or attributes, I also use the $link_check variable. This variable is in fact a reference to an anonymous function. The function itself is defined in line 9. The documentation for look_down function shows an example where the anonymous function is passed directly in the call, but since the look_down is called in the if clause, I wanted to make the code shorter. Anyway, instead or beside of parameter => value pairs, one can pass a reference to a function, which would return true or false (indicating if the condition was met). I was only interested in links, where href parameter was matching some specific pattern. In case of WWW::Mechanize I could have used something like this:

my @links = $mech->find_all_links( url_regex => qr{CardSets/Overview} );

but HTML::TreeBuilder only allows simple comparisons. In fact, everything else can be put in a function, so it’s a good deal.

All matching links have to be prepared and stored. Preparation include removing everything from the beginning, so only the ID of the set is left. I’m not particularly interested in those IDs, but they will make the list organized. Also, I want to skip some of them. Then I use the alternate text (since sets are indicated in the table using images…) and fix it in case of some sets; at the end I return the reversed list of all sets. The list is reversed, because I want to have cards with their latest multiverse IDs (for example Glory Seeker was recently reprinted in the Rise of the Eldrazi, but its first version appeared in Onslaught).

Once the list is ready (and undefined values are filtered out by grep), it is passed to the foreach loop, where every item will be processed by the scan_set function:

sub scan_set {
    my ( $set_name, $cards_href, $browser ) = @_;
    print "Processing set \"$set_name\" ";

    my $set_url = "%s/Search/Default.aspx?output=checklist&action=advanced&set=[%%22%s%%22]";

    $browser->get( sprintf( $set_url, $BASE_URL, $set_name ) );

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    my @rows = $tree->look_down( '_tag' => 'tr', 'class' => 'cardItem' );
    print " - ", scalar @rows, " cards found\n";

    foreach my $row ( @rows ) {
        my $link = $row->look_down( '_tag' => 'a', 'class' => 'nameLink' );

        my $name = $link->as_text;

        next
            if grep { $name eq $_ } qw/Plains Forest Swamp Island Mountain/;

        ( my $id = $link->attr('href') ) =~ s/^.*multiverseid=//;

        if ( exists $cards_href->{$name} ) {
            push @{ $cards_href->{$name}{gids} }, $id;
        }
        else {
            $cards_href->{$name} = {
                gid     => $id,
                gids    => [ $id ],
            };
        }
    }
}

The first argument of this function is the set name, and the third one is a “browser” (WWW::Mechanize object). The second one is passed in the call as \%cards. Again, it’s a reference – but this time it’s a reference to a hash (usually called a “hashref”). Usage of the reference allows one to pass arguments by reference (quite surprising, right?), instead of passing them by value:

my @list = qw/foo bar baz/;

sub by_val {
    my @args = @_;

    foreach (@args) {
        $_ = uc $_;
    }
}

sub by_ref {
    ( my $args ) = @_;

    foreach (@$args) {
        $_ = uc $_;
    }
}

by_val(@list);
print join( ', ', @list ), $/;

by_ref(\@list);
print join( ', ', @list ), $/;

The first function receives a list of arguments passed by their values and stores them in the @args list. But in fact they are copied, one by one, and the function operates only on those copies. That’s why changes are not stored in the original (elements of @list are not converted to upper case). The latter function receives one argument – which is the reference to the list (references are denoted by the backslash before the original variable). In the foreach loop it has to be dereferenced, so Perl will use the object pointed by the reference and treat is like an array.

Actually, the same effect can be achieved by using the original argument list:

sub by_direct {
    foreach (@_) {
        $_ = uc $_;
    }
}

since accessing arguments like $_[0], $_[1] will also allow to modify the original passed value. But naming arguments is always a good idea; also passing by reference is the only way to pass complex data structures to a function.

sub scan_card {
    my ( $cardname, $storage, $browser ) = @_;

    my $card_url = "%s/Card/Details.aspx?multiverseid=%d";

    $browser->get( sprintf( $card_url, $BASE_URL, $storage->{gid} ) );

    my $tree = HTML::TreeBuilder->new_from_content( $browser->content );

    $storage->{raw} = ( $tree->look_down( '_tag' => 'div', 'class' => 'smallGreyMono' ) )[1];

    $storage->{gids} = join( ', ', @{ $storage->{gids} } );

    $sth->execute( $cardname, @{$storage}{ qw/gid gids raw/ } );
}

Since $storage is a reference to a hash, instead of using $storage{something} the arrow notation is used: $storage->{something}. Also, the “gids” field contains a reference to an array, created in line 30 of the scan_set function, and dereferenced in line 25 of the same function, to have push being called on it. Here, in line 12 it’s dereferenced in the same way. And in line 14 I want to pass the values of three fields to the function call; instead of writing every single one like $storage->{gid} and so on, I use hash slices. If the $storage was an ordinary hash, I could write @storage{ ... }, but since it’s a reference, I have to dereference it: @{ $storage }{ ... }.

Perl is very flexible in treating data structures according to the programmer’s will. In the line 10 I call the look_down function, which – when called in so called “list context” – returns a list of objects matching the given conditions. In scalar context it returns the first element found. But I’m interested in the second one… thus I call the function in the list context, I treat the whole call as a list and return only the second object of this “list”.

Last but not least I had to initialize my DB-related variables. I put them on the bottom of the file in the BEGIN block. Those named blocks are quite useful, but I used the wrong one. The problem is, the BEGIN blocks (since there may be more than one) are executed in the compilation phase. So when I ran perl -c scraper.pl (to check if there are no stupid syntax errors), this code got executed. Luckily, there was no DROP TABLE IF EXISTS, and moreover I renamed the already populated table. But actually I wanted this code to initialize my variables, so I should have used the INIT block (or use the $COMPILING variable to check if the code is compiled or run, but that would be completely wrong).

# used to be BEGIN...
INIT {
    $dbh = DBI->connect( "dbi:SQLite:dbname=vedalken.db", "", "" )
        or die "Cannot open or create the DB file: $DBI::errstr";

    $dbh->do( q/
        CREATE TABLE IF NOT EXISTS Card (
            id      INTEGER PRIMARY KEY,
            name    VARCHAR(64) UNIQUE NOT NULL,
            gid     INTEGER NOT NULL,
            gids    TEXT,
            raw     TEXT
        )
    / ) or die "Cannot create table: ", $dbh->errstr;

    $sth = $dbh->prepare( q/
        INSERT INTO Card (name, gid, gids, raw)
        VALUES (?, ?, ?, ?)
    / ) or die "Cannot prepare insert statement: ", $dbh->errstr;
}

Of course every BEGIN or INIT block is still just a block, so any variable defined inside its scope will not be visible or accessible from outside. Thus $dbh and $sth had to be defined beforehand, in the global scope (because closures would be an overkill).

Anyway – I ended up with the database populated with raw HTML scraped from Gatherer (10961 cards). Next time I plan to use HTML::TreeBuilder to parse it and fill the data model.

Posted in Perl | Tagged: , , , , , , | 1 Comment »

Catalyst is easy – using the database

Posted by brunorc on September 19, 2009

There was not too much about doing the Catalyst things recently. I went into some musings about Perl: why it is cool, relevant and blah.

But recently I had a nice conversation with a friend, who works in the same company. He has a strong Java background, but frankly it didn’t turn him into a brainless, corporative ant. It’s quite interesting, but even now, when I work in the Perl-driven company, I find myself having conversations with Java people :-) And those conversations are mostly nice, so maybe that’s why I keep having them. Or maybe just those people are interesting and nice, regardless of their favourite programming language? Nevertheless, this guy also plays Magic – and he’s quite good at this. So there’s at least one thing that really unites us.

Anyway, I like his pragmatic point of view. And this conversation gave me the impulse to once again think about coolness and relevancy. A programming language is cool and relevant, if it helps you in writing programs, that solve your problems. If you can do it in a fast and consistent, elegant manner, then everything is OK. This compels me to cut my musings and go back to the real Perl.

First “useful” application shown here was a non-persistent life counter for a duel of Magic planeswalkers. If someone liked to make it more persistent, he would probably think about the database – you know, those huge things Oracle used to sell before people discovered that SQL must die, because Erlang is cooler. But for the Magic counter we don’t need Oracle, SQLite will be quite enough.

Now let’s go to the application directory – I assume you are able to type in your console:

sqlite3 magicount.db

and get output similar to this one:

SQLite version 3.4.0
Enter ".help" for instructions
sqlite>

which roughly translates to it works! If it doesn’t, go to the link above, download the SQLite for your operating system, install it and try again. If it works for you as well, it is the time to create the database and populate it:

sqlite> CREATE TABLE player ( id INTEGER PRIMARY KEY, name TEXT );
sqlite> INSERT INTO player (name) VALUES ('Bruno');
sqlite> INSERT INTO player (name) VALUES ('Betka');
sqlite> INSERT INTO player (name) VALUES ('Chris');
sqlite> INSERT INTO player (name) VALUES ('Giel');
sqlite> .quit

In SQLite, INTEGER PRIMARY KEY means that field will be autoincremented, so we don’t need to specify the value for id. After quitting we can check if there’s something inside:

sqlite3 magicount.db
SQLite version 3.4.0
Enter ".help" for instructions
sqlite> select * from player;
1|Bruno
2|Betka
3|Chris
4|Giel
sqlite>

Who would expect, eh?

Now it is the time to create the new Model for the database. It is achieved by the following incantation:

script/magicount_create.pl model MyModelName DBIC::Schema MySchemaClass create=static dbi:DB_type:how_to_get_the_DB

MyModelName looks rather obvious – it is the name of the model; it will be used later to access it, so it should be meaningful. MySchemaClass is – sure – the name of the class, representing the DB schema. The create option can have two values: static and dynamic, but since nowadays you can recreate your DB-based classes without losing the code written by yourself, static is the best choice. The last argument is the DSN, used by underlying DBI module – it should include the type of the database (so DBI can use the appropriate driver) and the way to access it (if you want some more elaborated description, check the documentation of the DBIC::Schema helper). So, in the end, everything should look like this:

script/magicount_create.pl model MCDB DBIC::Schema MagiCount::Schema create=static dbi:SQLite:magicount.db

MCDB stands for MagiCount DataBase – not very fancy, but quite short. We use the SQLite driver and give the name of the file, where the database is stored. If everything is OK and you have installed all necessary modules (you can check the list here – usually they will be installed with the most important one, Catalyst::Model::DBIC::Schema, but this list may be handy in case of obstacles), you should get the similar output:

exists "/Users/bruno/devel/MagiCount/script/../lib/MagiCount/Model"
exists "/Users/bruno/devel/MagiCount/script/../t"
Dumping manual schema for MagiCount::Schema to directory /Users/bruno/devel/MagiCount/script/../lib ...
Schema dump completed.
created "/Users/bruno/devel/MagiCount/script/../lib/MagiCount/Model/MCDB.pm"
created "/Users/bruno/devel/MagiCount/script/../t/model_MCDB.t"

Go take a look at the created files – probably the most interesting one will be lib/MagiCount/Schema/Result/Player.pm:

package MagiCount::Schema::Result::Player;

use strict;
use warnings;

use base 'DBIx::Class';

__PACKAGE__->load_components("InflateColumn::DateTime", "Core");
__PACKAGE__->table("player");
__PACKAGE__->add_columns(
  "id",
  {
    data_type => "INTEGER",
    default_value => undef,
    is_nullable => 1,
    size => undef,
  },
  "name",
  {
    data_type => "TEXT",
    default_value => undef,
    is_nullable => 1,
    size => undef,
  },
);
__PACKAGE__->set_primary_key("id");


# Created by DBIx::Class::Schema::Loader v0.04006 @ 2009-09-19 15:06:06
# DO NOT MODIFY THIS OR ANYTHING ABOVE! md5sum:R0fDKDbCt11OQ9KmMzHd/Q


# You can replace this text with custom content, and it will be preserved on regeneration
1;

As you can see, it is the Perl representation of the SQL code. Thanks to this Perl is aware of the structure of tables and all fields included. Also, the library that powers this stuff – DBIx::Class – is not only able to create such files from the existing database; you can use it the other way round, shaping your database with the Perl code, as DBIx::Class will create the structure of tables described in modules (check the documentation for DBIC). Also, you can freely add your code below the MD5 sum line and it will be maintained.

This all looks cool, but there’s a lot of stuff behind and around it, so now I will only show how to use it on a very simple example. Instead of writing the names of players in the form, we’re going to use the HTML select elements. In the create.tt2 file we can change:

      <label for="player1">Name of the first player</label>
      <input type="text" name="player1" size="12" /><br />
 

to:

      <label for="player1">Name of the first player</label>
      <select name="player1">
        [% FOREACH player IN c.model('MCDB::Player').all %]
        <option value="[% player.name %]">[% player.name %]</option>
        [% END %]
      </select><br />
 

doing the same for the second player. And if we want to have players ordered alphabetically:

        [% FOREACH player IN c.model('MCDB::Player').all.sort('name') %]
        <option value="[% player.name %]">[% player.name %]</option>
        [% END %]
 

Of course we’re accessing Model from the View and doubling the amount of database calls. But at least our players are persistent! In the next episode: updating the contents of the table.

Posted in Catalyst for intimidated | Tagged: , , , , , , | Leave a Comment »