My nightly SA mass-checks have been hanging up this week on a 1MB email (not sure how a 1MB message got in my mass-check corpus, but that’s not important). It turns out that it was Mail::DKIM v0.29 that was taking about 150 seconds to process the message, while the rest of SA was only taking about 10 seconds. Upgrading to Mail::DKIM v0.30.1 resolves the problem… the DKIM check is fast (I didn’t time it, probably under a second).
The speed-up may be due to Mark Martinec’s optimizations in v0.30. It could be that the optimization was just to not do the crypto on the body, though, since the message in question did not have a signature (the sender doesn’t sign mail).
February 18th, 2008
One thing I’ve forgotten to do, twice now, before attempting to run a job on Hadoop clusters that I’ve set up in a hurry to demo something is to make sure that all of the nodes are registered in DNS, or at least have entries in their hosts files for every node (datanodes, the namenode, the master and all the slaves) in the cluster.
If you start out with the master not knowing what names the slaves’ IPs map to, the slaves won’t be able to connect to the master, even if you use IPs in the conf/slaves file. This seems silly to me, but that’s the way it is, at least as of 0.14.4. You’ll discover and fix this first. The slaves will then connect and the first level of links will start to work in the master’s web interface.
Now the TaskTrackers on the slave nodes will successfully run tasks and will probably complete the map stage. If the nodes have varying performance levels, or your data isn’t well distributed on your HDFS file system, the map stage may appear to hang (or repeat the same percentages over and over). If you make it through the map stage, the reduce stage will fail to complete, for the same reason the map stage may fail, if you didn’t also configure each of the slave nodes to know the names of all of the other slave nodes. As soon as a slave node needs data from another datanode (for a map task, a reduce task, etc.), it’ll face the same problem it initially had contacting the master node before you configured DNS for it, this time with the other slave nodes.
So… make sure that every machine in the cluster knows the hostname (and its IP) of every other machine in the cluster. Configuring just the master to know all of the slaves’ names/IPs, or just the slaves to know the master’s name/IP, will not work. You need both, even if you haven’t used a single hostname in either the conf/masters or conf/slaves file.
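A quick way to catch this before starting a job is to check each machine's hosts file against the list of cluster nodes. The following is only a rough sketch, not anything that ships with Hadoop: the function names, node names and addresses are made up for illustration, and it only looks at hosts-file text (it knows nothing about DNS).

```python
def parse_hosts(text):
    """Map each hostname/alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first address seen for a name, as a resolver would.
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_nodes(hosts_text, cluster_nodes):
    """Return the cluster hostnames that have no entry in the hosts text."""
    known = parse_hosts(hosts_text)
    return [node for node in cluster_nodes if node.lower() not in known]


if __name__ == "__main__":
    # Hypothetical node names and addresses, for illustration only.
    nodes = ["master", "slave1", "slave2"]
    sample = """
    127.0.0.1   localhost
    10.0.0.1    master
    10.0.0.2    slave1   # slave2 has no entry
    """
    print(missing_nodes(sample, nodes))  # ['slave2']
```

You would run something like this on every node, master and slaves alike, feeding it the contents of that node's own hosts file; an empty list from all of them is what you're after.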
February 8th, 2008
In early December I was nominated and elected to become a member of the Apache Software Foundation (ASF). I’ve been contributing to SpamAssassin, an Apache project, since just after the release of 3.0.0 in October 2004 (and have been a committer since March of 2005). I’ve been using and working on SpamAssassin since Justin Mason’s first release back around 2001. I guess, so far, I’ve failed to do more harm than good and was elected a member as punishment.
January 26th, 2008
We released Apache SpamAssassin 3.2.4 on Monday. It fixes a number of significant bugs in sa-compile and includes some improvements to the async DNS code. Also included are fixes for problems with non-SQL based user configs and bayes databases. If you use either you may benefit from these fixes, although I recommend that you use SQL for both.
There have been reports that 3.2.4 is faster than previous 3.2 versions. I haven’t benchmarked it but I would guess that, while individual messages may be processed faster, overall message throughput may only improve a little.
You can download SpamAssassin here.
January 13th, 2008
Cool. I just noticed today that you can configure Thunderbird to trust SpamAssassin’s “junk mail headers” (i.e. the X-Spam-* headers that SA adds by default). Apparently it’s done this for a while, as the wiki page that describes the feature was last edited in March 2006. I’m assuming there’s a downside though, as I don’t suspect that there’s anything to prevent Thunderbird from trusting forged SA headers added by bad dudes.
January 11th, 2008
As I, and probably anyone with a brain, suspected, the start of November brought the start of increasing spam volumes yet again. I suspect that the volume of spam being sent will again reach a plateau (in comparison to the current rapid growth) around March, after which it’ll increase slowly all the way through October. November will then kick off the annual spam volume increase season once again.
Below is a chart of attempted spam message delivery to my personal domains throughout the last few months. The number for each week represents the number of messages in my rolling 60 day corpus at that time. I’ve gone from a weekly increase of 3% in the 60 day rolling corpus to a weekly 10% increase. The numbers are a little weird to look at given that they’re a measure of a 60 day rolling corpus, but that doesn’t really matter as they’re meaningless anyway (don’t tell that to the “few” spam filtering companies that feel compelled to remind us that 183% or so of the mail they process is spam)… numbers from domain to domain will vary wildly. The bottom line, though, is that spam volumes are definitely increasing and that if you recall previous years there is a pattern to it.
December 31st, 2007
Back in the summer I finally got tired of a number of my UPS-backed machines being unexpectedly (although properly, via automation) shut down. Since the automated shutdowns weren’t due to power outages (I do notice when the power goes out, and those shutdowns aren’t unexpected), I figured they were due to prolonged instances of under voltage. Penetanguishene used to be notorious for prolonged periods of under voltage; so much so that, during the late 80s and early 90s, we used to literally blow up the transformers on the utility poles in front of my father’s plastics factory due to excessive current draw during the low-voltage periods.
Now it turns out that in every instance since the summer in which I’ve noticed an abnormal voltage situation, it has been a case of over voltage. In one case this summer there was an over voltage of 25+ volts (20%+) with a reading of 145 VAC, on a 120 VAC nominal circuit, for more than 5 minutes. Instances of 15% overages have been seen for more than 15 minutes at a time. It’s no wonder I have to replace so many destroyed electronics every year.
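For anyone who wants to check the percentages, the arithmetic is just the difference from nominal divided by nominal. A tiny sketch, assuming the 120 VAC nominal figure from above (the function name is made up for illustration):

```python
NOMINAL_VAC = 120.0  # nominal circuit voltage quoted in the post


def overvoltage_percent(reading_vac, nominal_vac=NOMINAL_VAC):
    """Percentage by which a reading exceeds the nominal voltage."""
    return (reading_vac - nominal_vac) / nominal_vac * 100.0


if __name__ == "__main__":
    # The 145 VAC reading is 25 V over nominal.
    print(round(overvoltage_percent(145.0), 1))  # 20.8
```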
So, since I graph nearly everything that changes state, I started graphing the AC input voltage of a small APC UPS at work. The resolution is only 1 VAC, but so far, there hasn’t been an instance of over voltage since I started graphing the input voltage. Which is great since I’m not graphing it to pinpoint the cause so that I can fix the problem; there’s nothing I can do to fix the problem, aside from complaining to Barrie Hydro, which I do regularly anyway.
November 11th, 2007
Cool! I’ve been waiting quite some time for this. Jay Chandler reported on a spam discussion mailing list that he received a stock spam this evening that had an MP3 attachment with audio of a speech synthesizer reading a stock spam. It wasn’t detected by the Razor only spam filtering appliance it went through.
I want one! An MP3 spam, that is.
October 17th, 2007
If the trial is successful, and the airline determines that cell phones don’t interfere with navigational and communications systems, Qantas may expand cell phone use to all its flights and aircraft.
Ah crap, it did interfere with the TCAS. Oh well, live and learn. Well not so much live, or learn really, now that we’re all dead.
September 5th, 2007
Polaris IP, a patent-suit-for-money firm, has accused AOL, Amazon, Borders, Google, IAC, and Yahoo of violating a patent (that was licensed to Polaris IP). Yet another patent that falls under the “f’in obvious to anyone who’s not a tard” category.
A method for automatically interpreting an electronic message, including the steps of (a) receiving the electronic message from a source; (b) interpreting the electronic message using a rule base and case base knowledge engine; and (c) classifying the electronic message as at least one of (i) being able to be responded to automatically; and (ii) requiring assistance from a human operator. The method for automatically interpreting an electronic message may also include the step of retrieving one or more predetermined responses corresponding to the interpretation of the electronic message from a repository for automatic delivery to the source.
Whatever. I hope that Google and IAC use their massive bank rolls to destroy them. That certainly wouldn’t be doing any evil.
August 30th, 2007