How to Mine Craigslist

Posted by Prolific Programmer Wed, 23 Jul 2008 21:04:00 GMT

I've been working with Jameel on filtering the local Craigslist board. Lifehacker suggests an alternative approach using Google: run your query through the search API, parse what comes out the other end, and apply whatever filters you have configured. It would be good for people like Elie, who replies to posts "wanting to talk, usually on the phone ... i always post in different cities ... then i don't have to meet anyone in person ... and i call with my number blocked". Perhaps if she ever gets some social skills, she'll ask for help, but I'm not holding my breath.
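The filtering stage of that pipeline might look something like the sketch below. It assumes the search results have already been parsed into hashes with title, url and snippet keys; the %filters layout, the function names and the sample posts are all made up for illustration and are not taken from any Craigslist or Google API.

```perl
use strict;
use warnings;

# Hypothetical filter configuration: a post must match every "require"
# pattern and none of the "reject" patterns.
my %filters = (
    require => [ qr/\bbike\b/i ],
    reject  => [ qr/\bwanted\b/i, qr/\bparts only\b/i ],
);

# True if a single post (a hashref) passes the configured filters.
sub passes_filters {
    my ($config, $post) = @_;
    my $text = join ' ', map { defined $_ ? $_ : '' } @{$post}{qw(title snippet)};
    for my $re (@{ $config->{require} || [] }) {
        return 0 unless $text =~ $re;
    }
    for my $re (@{ $config->{reject} || [] }) {
        return 0 if $text =~ $re;
    }
    return 1;
}

# Return only the posts that pass the filters.
sub filter_posts {
    my ($config, @posts) = @_;
    return grep { passes_filters($config, $_) } @posts;
}

# Made-up sample data standing in for parsed search results.
my @results = (
    { title => 'Road bike, barely used', url => 'http://example.org/1', snippet => '54cm frame' },
    { title => 'WANTED: kids bike',      url => 'http://example.org/2', snippet => 'any condition' },
    { title => 'Sofa, free',             url => 'http://example.org/3', snippet => 'must collect' },
);

print "$_->{title}\n  $_->{url}\n" for filter_posts(\%filters, @results);
```

Keeping the filters as plain lists of patterns means new rules can be added without touching the code that fetches or parses the results.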

How to Summarise Auntie's Front Page

Posted by Prolific Programmer Tue, 24 Jun 2008 06:55:00 GMT

My automated news summariser has been enhanced and made faster. I am certain this isn't the most efficient way of solving the problem, but it does work, and reasonably fast. Also, I've standardised on PAR as the distribution format. If you still prefer the old method, the script's source is pasted below. To run the PAR, type perl -MPAR BBC.par and just let it do its thing.

As for the "auntie" moniker, it is a nickname for the BBC, which is the script's news source.

use strict;
use XML::RSS::Parser::Lite;
use LWP::Simple;
use HTML::TreeBuilder;

my $url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml";
my $xml = get($url);
die "Could not fetch $url\n" unless defined $xml;
my $rp = XML::RSS::Parser::Lite->new;
$rp->parse($xml);
for (my $count = 0; $count < $rp->count(); $count++) {
  # Separator between stories, but not before the first one.
  print "\n" . ("-" x 80) . "\n" if $count != 0;
  my $item = $rp->get($count);
  my $page = get($item->get('url'));
  next if not defined $page;    # skip stories that fail to download
  my $h = HTML::TreeBuilder->new_from_content($page);
  # Print every paragraph that has more than three words.
  foreach my $p ($h->look_down('_tag', 'p')) {
    my $paragraph = $p->as_text;
    next if not defined $paragraph;
    my @words = split ' ', $paragraph;
    print "$paragraph " if @words > 3;
  }
  $h->delete;
}
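The word-count test in the script above can be pulled out into a small function and tried on its own, without fetching anything. This is only a sketch: the name long_paragraphs and the sample strings are made up, and the threshold of three is simply the one the script uses.

```perl
use strict;
use warnings;

# Keep only the paragraphs that have more than $min_words
# whitespace-separated words.
sub long_paragraphs {
    my ($min_words, @paragraphs) = @_;
    return grep {
        my @words = split ' ', $_;
        @words > $min_words;
    } @paragraphs;
}

# Made-up sample text: a navigation stub, a real sentence, an empty string.
my @sample = (
    'Share this page',
    'Ministers met on Tuesday to discuss the plan.',
    '',
);

print "$_\n" for long_paragraphs(3, @sample);
```

Short fragments such as captions and navigation links tend to fall under the threshold, which is presumably why the script drops them.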

How to Summarise the BBC

Posted by Prolific Programmer Thu, 12 Jun 2008 06:27:00 GMT

I've long been interested in data mining and current affairs. I also need to cut back on the time I spend reading news every day.

The Perl script below generates some pretty accurate summaries of the BBC front page. There are some dodgier aspects: it is automated, so it won't do as good a job as a human reader, but I'll put my faith in it. It's posted below and made available for others to use. Please let me know how you find it.


use warnings;
use strict;
use diagnostics;
use XML::RSS::Parser::Lite;
use LWP::Simple;
use HTML::TreeBuilder;

my $url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml";
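my $xml = get($url);
die "Could not fetch $url\n" unless defined $xml;
my $rp = XML::RSS::Parser::Lite->new;
$rp->parse($xml);
for (my $count = 0; $count < $rp->count(); $count++) {
  # Separator between stories, but not before the first one.
  print "\n------------------------------------------------------------------------\n" if $count != 0;
  my $item = $rp->get($count);
  my $page = get($item->get('url'));
  next if not defined $page;    # skip stories that fail to download
  my $h = HTML::TreeBuilder->new_from_content($page);
  $h->elementify();
  # Collect the text of every <p> element in the story.
  my @outline = map { $_->as_text } $h->look_down('_tag', 'p');
  foreach my $text (@outline) {
    # The old pattern /(.*)([!?.])?/ never set $2, so printing "$1$2"
    # only produced warnings; print the paragraph text directly.
    print "$text ";
  }
  $h->delete;
}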
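The pattern /(.*)([!?.])?/ in the script looks as if it was meant to separate a paragraph from its closing punctuation, but the greedy (.*) swallows everything and leaves the second group unset. A minimal sketch of one way to do the split follows; the function name split_terminator and the sample strings are made up, and this is a guess at the intent rather than the script's actual behaviour.

```perl
use strict;
use warnings;

# Split a paragraph into its body and an optional closing '.', '!' or '?'.
# The lazy (.*?) together with the end anchor lets the second group match.
sub split_terminator {
    my ($text) = @_;
    my ($body, $mark) = $text =~ /^(.*?)([!?.])?\s*$/s;
    return ($body, defined $mark ? $mark : '');
}

# Made-up sample paragraphs.
for my $p ('Talks resume on Monday.', 'Who is to blame?', 'Related links') {
    my ($body, $mark) = split_terminator($p);
    print "[$body] [$mark]\n";
}
```

Returning an empty string instead of undef for the missing mark keeps "use warnings" quiet when the two parts are interpolated into a string.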