How to Summarise the BBC
I've long been interested in data mining and current affairs, but I also need to cut back on the time I spend reading the news every day.
The Perl script below generates fairly accurate summaries of the stories on the BBC front page. Some of the output is dodgier: it is automated, so it won't do as good a job as a human reader, but I'll put my faith in it. It's posted below for others to use. Please let me know how you find it.
use warnings;
use strict;
use diagnostics;
use XML::RSS::Parser::Lite;
use LWP::Simple;
use HTML::TreeBuilder;
my $url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml";
# Fetch the feed; LWP::Simple's get() returns undef on failure.
my $xml = get($url) or die "Could not fetch $url\n";
my $rp = XML::RSS::Parser::Lite->new;
$rp->parse($xml);
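for (my $count = 0; $count < $rp->count(); $count++) {
    # Separator between articles
    print "\n------------------------------------------------------------------------\n" if $count != 0;
    my $item = $rp->get($count);
    my $link = $item->get('url');
    # Skip any article that cannot be downloaded
    my $html = get($link) or next;
    my $h = HTML::TreeBuilder->new_from_content($html);
    $h->elementify();
    # Print the first sentence of every <p> element on the page
    foreach my $p ($h->look_down('_tag', 'p')) {
        my $text = $p->as_text;
        # Text up to the first '.', '!' or '?' that is followed by whitespace
        # or the end of the text; otherwise the whole paragraph
        if ($text =~ /^\s*(.+?[!?.])(?:\s|$)/s) {
            print "$1 ";
        } elsif ($text =~ /\S/) {
            print "$text ";
        }
    }
    # Free the parse tree
    $h->delete;
}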
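The regex step in the loop above can be tried out on its own, without fetching anything. What follows is only a minimal sketch, assuming the aim is to keep the first sentence of each paragraph. The helper name first_sentence and the sample strings are made up for illustration and are not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the first
# sentence of a paragraph, or the whole trimmed paragraph when no
# sentence terminator is found.
sub first_sentence {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    # Lazily match up to the first '.', '!' or '?' that is followed by
    # whitespace or the end of the text, so "3.5" does not end a sentence.
    return $1 if $text =~ /^(.+?[!?.])(?:\s|$)/s;
    return $text;
}

# Made-up sample paragraphs
print first_sentence("Talks resumed on Monday. Officials declined to comment."), "\n";
print first_sentence("Shares rose 3.5 percent today! More follows."), "\n";
print first_sentence("No terminator here"), "\n";
```

This is a rough heuristic: an abbreviation followed by a space, such as "U.S. ", will still end the sentence early, so the output is worth a quick look before trusting it.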
