How to Clean Your Inbox
Lifehacker relays how to clear your inbox using GMail. Basically, you create a filter that marks all matching messages as read and apply it to all matching conversations. Remember to remove the filter after you're done.
How to Search GMail from the Comfort of Your Command-Line
The command-line gmail search is working. Next step: see how to speed it up. It's still taking almost a minute to search 317 messages. Code pasted after the flip, as with the last message.
package com.prolificprogrammer.lucenegmail;
import java.io.File;
import java.util.logging.Logger;
import java.util.logging.Level;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.internet.InternetAddress;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
public class SearchGMail {
    private static final Logger logger = Logger.getLogger(SearchGMail.class.getCanonicalName());
    public static void main (String[] args) throws Exception {
        //logger.setLevel(java.util.logging.Level.FINE);
        try {
            // The index lives in a scratch directory and is rebuilt on every run.
            File path = new File(System.getProperty("java.io.tmpdir")+File.separator+"gmail.index");
            path.mkdir();
            path.deleteOnExit();
            long starttime = System.currentTimeMillis();
            IndexWriter index = new IndexWriter(path.getAbsolutePath(), new StandardAnalyzer(), true);
            Session session = Session.getDefaultInstance(System.getProperties(), null);
            Store store = session.getStore("pop3s");
            store.connect("pop.gmail.com", args[0], args[1]);
            logger.fine("Connected!");
            Folder folder = store.getDefaultFolder();
            folder = folder.getFolder("INBOX");
            folder.open(Folder.READ_ONLY);
            logger.fine("Opened INBOX");
            Message[] messages = folder.getMessages();
            int x;
            for (x = 0; x != messages.length; x++) {
                try {
                    // Index the sender address and the subject together in one searchable field.
                    Document document = new Document();
                    String allField = ((InternetAddress)messages[x].getFrom()[0]).getAddress()+"\n"+messages[x].getSubject();
                    document.add(new Field("all", allField, Field.Store.YES, Field.Index.TOKENIZED));
                    // Stored but not indexed: only used to report which message matched.
                    Field messageNumberField = new Field("messageNumber", Integer.toString(x), Field.Store.YES, Field.Index.NO);
                    messageNumberField.setBoost(0.0f);
                    document.add(messageNumberField);
                    index.addDocument(document);
                    logger.fine("Message "+x+" added.");
                } catch (OutOfMemoryError e) {
                    // Merge segments and carry on with the next message.
                    index.optimize();
                    continue;
                }
            }
            index.optimize();
            index.close();
            logger.info("Index Constructed -- now searching");
            IndexSearcher searcher = new IndexSearcher(path.getAbsolutePath());
            Analyzer analyzer = new StandardAnalyzer();
            String query = args[2];
            QueryParser queryParser = new QueryParser("all", analyzer);
            Query parsedQuery = queryParser.parse(query);
            Hits hits = searcher.search(parsedQuery);
            for (int i = 0; i != hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println("Message "+doc.getField("messageNumber").stringValue()+" matches "+query+" with a score of "+hits.score(i));
            }
            searcher.close();
            long endtime = System.currentTimeMillis();
            // File.length() is unspecified for a directory, so treat the byte count as approximate at best.
            logger.info("program took "+(endtime-starttime)+" milliseconds to search "+x+" messages, which occupy "+path.length()+" bytes.");
            java.awt.Toolkit.getDefaultToolkit().beep();
        } catch (ArrayIndexOutOfBoundsException e) {
            // Thrown when fewer than three arguments are supplied.
            logger.severe("Usage: "+SearchGMail.class.getName()+" [google login] [password] [query]\nAll required");
            System.exit(-1);
        }
    }
}
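One caveat about the byte count in that last log line: path is a directory, and File.length() is documented as unspecified for directories, so the number may not reflect the size of the index. Here is a minimal sketch of one way to total the files instead, using only the standard library. The names IndexSize and directorySize are made up for illustration and are not part of the program above.

```java
import java.io.File;

class IndexSize {
    // Sums the sizes of all regular files under dir, descending into subdirectories.
    static long directorySize(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // Not a directory (or unreadable): fall back to the file's own size, if it is a file.
            return dir.isFile() ? dir.length() : 0L;
        }
        long total = 0L;
        for (File entry : entries) {
            total += entry.isDirectory() ? directorySize(entry) : entry.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Same location the search program uses for its index.
        File path = new File(System.getProperty("java.io.tmpdir"), "gmail.index");
        System.out.println(directorySize(path) + " bytes in " + path.getAbsolutePath());
    }
}
```

Swapping path.length() for a call like directorySize(path) in the timing message should give a figure that actually tracks the index files on disk.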
How to Search Gmail from the comfort of your Keyboard 2
The Java code below leverages Lucene 2.3.1 and JavaMail to create a command-line search of your GMail inbox. It's actually quite slow, so I'd like to speed it up over time, but it does give updates as it runs, perhaps too many. Any (and all) suggestions appreciated.
package com.prolificprogrammer.lucenegmail;
import java.io.File;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.internet.InternetAddress;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
public class SearchGMail {
    public static void main (String[] args) throws Exception {
        // The index lives in a scratch directory and is rebuilt on every run.
        File path = new File(System.getProperty("java.io.tmpdir")+File.separator+"gmail.index");
        path.mkdir();
        path.deleteOnExit();
        long starttime = System.currentTimeMillis();
        IndexWriter index = new IndexWriter(path.getAbsolutePath(), new StandardAnalyzer(), true);
        Session session = Session.getDefaultInstance(System.getProperties(), null);
        Store store = session.getStore("imaps");
        store.connect("imap.gmail.com", args[0], args[1]);
        System.err.println("Connected!");
        Folder folder = store.getDefaultFolder();
        folder = folder.getFolder("INBOX");
        folder.open(Folder.READ_ONLY);
        System.err.println("Opened INBOX");
        Message[] messages = folder.getMessages();
        System.err.println("Messages retrieved!");
        int x;
        for (x = 0; x != messages.length; x++) {
            // Index the sender address and the subject together in one searchable field.
            Document document = new Document();
            String allField = ((InternetAddress)messages[x].getFrom()[0]).getAddress()+"\n"+messages[x].getSubject();
            document.add(new Field("all", allField, Field.Store.YES, Field.Index.TOKENIZED));
            document.add(new Field("messageNumber", Integer.toString(x), Field.Store.YES, Field.Index.NO));
            index.addDocument(document);
            System.err.println("Message "+x+" added.");
        }
        index.optimize();
        index.close();
        System.err.println("Ok, index constructed with "+x+" messages in "+path.getAbsolutePath()+", now searching it");
        IndexSearcher searcher = new IndexSearcher(path.getAbsolutePath());
        Analyzer analyzer = new StandardAnalyzer();
        // Arguments are login, password, query, so the query is the third one.
        String query = args[2];
        QueryParser queryParser = new QueryParser("all", analyzer);
        Query parsedQuery = queryParser.parse(query);
        Hits hits = searcher.search(parsedQuery);
        for (int i = 0; i != hits.length(); i++) {
            Document doc = hits.doc(i);
            // Print the stored value rather than the Field object itself.
            System.out.println(doc.getField("messageNumber").stringValue());
        }
        searcher.close();
        long endtime = System.currentTimeMillis();
        // Parentheses are required: without them the subtraction is applied to a String and does not compile.
        System.err.println("program took "+(endtime-starttime)+" milliseconds to search "+x+" messages, which occupy "+path.length()+" bytes.");
        java.awt.Toolkit.getDefaultToolkit().beep();
    }
}
How to Search Your Gmail in One Command
So, tonight, aside from reminiscing about old flames, Tareeq and I got to implementing a command-line search for Gmail. It leverages JavaMail and Lucene and maintains no notion of state whatsoever. It allows you to type ./search.sh from:Tareeq subject:Tunis and returns the subject lines of messages that match the query, sorted by score. This is my first Maven-managed project and so far, I'm liking it much better than Ant. I haven't timed a run yet; maybe at the weekend.
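To make the from:/subject: idea concrete, here is a rough sketch of fielded matching with a score, in plain Java and without Lucene or JavaMail. It is not the project's code: the class FieldedSearchSketch, the Mail type, the sample messages and the toy score (one point per matching field) are all assumptions made for illustration. In the real tool, Lucene's query parser and scoring would do this work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldedSearchSketch {
    // Hypothetical in-memory stand-in for a mail message.
    static class Mail {
        final String from;
        final String subject;
        Mail(String from, String subject) { this.from = from; this.subject = subject; }
    }

    // Turns arguments like "from:Tareeq" into a field -> term map; arguments without "field:term" shape are ignored.
    static Map<String, String> parse(String[] args) {
        Map<String, String> terms = new LinkedHashMap<String, String>();
        for (String arg : args) {
            int colon = arg.indexOf(':');
            if (colon > 0 && colon < arg.length() - 1) {
                terms.put(arg.substring(0, colon).toLowerCase(), arg.substring(colon + 1).toLowerCase());
            }
        }
        return terms;
    }

    // Toy score: the number of query terms found (case-insensitively) in their field.
    static int score(Mail mail, Map<String, String> terms) {
        int score = 0;
        for (Map.Entry<String, String> term : terms.entrySet()) {
            String value = "";
            if ("from".equals(term.getKey())) value = mail.from;
            if ("subject".equals(term.getKey())) value = mail.subject;
            if (value.toLowerCase().contains(term.getValue())) score++;
        }
        return score;
    }

    // Returns the subject lines of messages that match at least one term, best score first.
    static List<String> search(List<Mail> mails, Map<String, String> terms) {
        List<Mail> hits = new ArrayList<Mail>();
        for (Mail mail : mails) {
            if (score(mail, terms) > 0) hits.add(mail);
        }
        hits.sort((a, b) -> Integer.compare(score(b, terms), score(a, terms)));
        List<String> subjects = new ArrayList<String>();
        for (Mail hit : hits) subjects.add(hit.subject);
        return subjects;
    }

    public static void main(String[] args) {
        // Made-up sample inbox; a real run would read these from the mail store.
        List<Mail> inbox = Arrays.asList(
            new Mail("tareeq@example.com", "Lunch on Friday"),
            new Mail("tareeq@example.com", "Flights to Tunis"),
            new Mail("someone@example.com", "Hello"));
        for (String subject : search(inbox, parse(args))) {
            System.out.println(subject);
        }
    }
}
```

Run it the same way as the script, for example java FieldedSearchSketch from:Tareeq subject:Tunis. Messages that match both fields come out ahead of those that match only one.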
How to Improve GMail's Spam Filter
GoogleMail has a spam filter second to none, but it could be better. The one failure that's very low-hanging fruit (at least from my view) is language identification. Indeed, I'm including code below to identify a given message's language and give it a confidence:
#!/usr/bin/env perl
use warnings;
use strict;
use diagnostics;
use Lingua::Identify qw/:language_identification/;
use Mail::POP3Client;
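# $USER and $PASSWORD must be declared under "use strict"; they are assumed
# here to come from the command line (login first, then password).
my ($USER, $PASSWORD) = @ARGV;
my $pop = Mail::POP3Client->new(USER => $USER,
                                PASSWORD => $PASSWORD,
                                HOST => 'pop.gmail.com',
                                PORT => 995,
                                USESSL => 'true'
                               );
$pop->Connect;
my $count = $pop->Count;
my $debug = 1;
# POP3 message numbers run from 1 to $count inclusive.
for (my $counter = 1; $counter <= $count; $counter++) {
    # Fetch the body once; looping on Body() would never terminate.
    my $text = $pop->Body($counter);
    next unless defined $text && length $text;
    print $text if $debug;
    # Keep the first language/confidence pair that langof returns.
    my ($language, $prob) = langof($text);
    print "Message $counter is $language, $prob probability.\n";
}
$pop->Close;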
So, basically, you analyse the sent mail as a control group to determine which languages the user knows. You store these languages, and any incoming message that doesn't match one of them can be assumed to be spam. Then you just apply the standard Bayesian filter to the remaining mail.
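The control-group idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions: language identification is taken as already done elsewhere (for instance by a script like the one above), and the class LanguagePrefilterSketch, the Guess type and the 0.5 confidence cutoff are all hypothetical choices, not part of any existing filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LanguagePrefilterSketch {
    // A language guess for one message, e.g. ("en", 0.93).
    static class Guess {
        final String language;
        final double confidence;
        Guess(String language, double confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    // Assumed cutoff: guesses below this are treated as too weak to act on.
    static final double MIN_CONFIDENCE = 0.5;

    // Control group: the languages the user confidently writes in, taken from sent mail.
    static Set<String> knownLanguages(List<Guess> sentMail) {
        Set<String> known = new HashSet<String>();
        for (Guess guess : sentMail) {
            if (guess.confidence >= MIN_CONFIDENCE) known.add(guess.language);
        }
        return known;
    }

    // True when a message is confidently in a language the user has never sent mail in.
    static boolean looksForeign(Guess incoming, Set<String> known) {
        return incoming.confidence >= MIN_CONFIDENCE && !known.contains(incoming.language);
    }

    public static void main(String[] args) {
        // Made-up guesses standing in for real sent and incoming mail.
        Set<String> known = knownLanguages(Arrays.asList(new Guess("en", 0.9), new Guess("fr", 0.8)));
        List<Guess> incoming = Arrays.asList(new Guess("en", 0.95), new Guess("ru", 0.9));
        for (Guess guess : incoming) {
            String verdict = looksForeign(guess, known) ? "assume spam" : "pass to Bayesian filter";
            System.out.println(guess.language + ": " + verdict);
        }
    }
}
```

Low-confidence guesses are deliberately let through to the Bayesian filter rather than discarded, since a short or mixed-language message can easily be misidentified.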
