Context : BotsOnWikis

Bot by RichardP which cleans wiki spam. Invaluable in the fight against SpammingThoughtStorms.

No longer relevant to this wiki. And probably not running anymore.

But a good symptom of the spontaneous co-operation of the era.

Via this page, many people asked to have WikiMinion clean their wikis too, and he generously helped.

How it worked

The code uses two approaches for identifying edits made by spammers: examining external links and examining source IP addresses. In both approaches, if a clean version can be identified, the page is reverted to it. If instead all versions of a page appear to have been created by a spammer, the page is marked for deletion.

The primary approach identifies spammer edits by checking a database of spammer domain names against the external links included in an edit. This approach catches changes by spammers, even if they rotate through IP addresses (some wiki spammers use IP addresses assigned dynamically from IP address pools or proxy their edits through a collection of compromised machines). Once a page containing external links to spammer domains is identified, a clean version to which to revert is located by examining the links in the historical versions of the page. The problem with this approach is that it is rather slow, consumes significant server resources, and doesn't identify some spammers.
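The link-based check can be sketched roughly as follows. This is a minimal illustration in Python, not WikiMinion's actual code: the blocklist contents, the URL regex, the subdomain matching, and all function names are assumptions made for the example, and page history is modelled simply as a list of text versions ordered oldest to newest.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist; the real database contents are not given here.
SPAM_DOMAINS = {"spam-pills.example", "cheap-loans.example"}

# Rough pattern for external links in page text (an assumption, not wiki-specific).
URL_RE = re.compile(r"https?://[^\s\[\]()<>\"']+")

def external_domains(text):
    """Return the set of lowercased hostnames linked from the page text."""
    hosts = set()
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts

def is_spam_domain(host, blocklist=SPAM_DOMAINS):
    """True if host is a blocklisted domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in blocklist)

def has_spam_links(text, blocklist=SPAM_DOMAINS):
    """True if any external link in the text points at a blocklisted domain."""
    return any(is_spam_domain(h, blocklist) for h in external_domains(text))

def find_clean_version(versions, blocklist=SPAM_DOMAINS):
    """Walk the history from newest to oldest and return the index of the
    newest version without spam links, or None if every version is spammed
    (in which case the page would be marked for deletion)."""
    for i in range(len(versions) - 1, -1, -1):
        if not has_spam_links(versions[i], blocklist):
            return i
    return None

history = [
    "Welcome to the wiki.",
    "Welcome. See http://example.org/notes",
    "Buy now http://www.spam-pills.example/buy",
]
clean_index = find_clean_version(history)  # index of the version to revert to
```

Scanning every historical version of every flagged page for links is what makes this pass comparatively expensive.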

The secondary approach identifies spammer edits by checking a database of spammer IP addresses against the source IP address of the most recent edit. This approach catches the spammers who make changes from a stable set of IP addresses, but who switch domain names regularly (in fact, a few spammers switch domain names every couple of days). Once an edit made from an IP address known to be used by a spammer is identified, the clean version to which to revert is found by selecting the newest version that did not originate from a spammer IP address. Before the page is reverted, the older version is checked for external links to spammer domains, and if any are found the revert is aborted. This approach is quite fast and doesn't consume significant server resources.
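A sketch of the IP-based pass, under the same caveats: the Version record, the example addresses (taken from reserved documentation ranges), the plan_ip_revert name, and the returned action labels are all hypothetical. The link check is passed in as a function so the sketch stays self-contained.

```python
from dataclasses import dataclass

# Hypothetical blocklist of spammer source addresses.
SPAM_IPS = {"203.0.113.7", "198.51.100.23"}

@dataclass
class Version:
    text: str
    source_ip: str

def plan_ip_revert(versions, spam_ips, has_spam_links):
    """Decide what to do with a page whose history is ordered oldest to newest.

    Returns a pair (action, index):
      ("skip", None)   the most recent edit is not from a known spammer IP
      ("revert", i)    revert to version i, the newest one from a clean IP
      ("abort", i)     version i came from a clean IP but contains spam links
      ("delete", None) every version came from a spammer IP
    """
    if not versions or versions[-1].source_ip not in spam_ips:
        return ("skip", None)
    # Search backwards from the version just before the most recent edit.
    for i in range(len(versions) - 2, -1, -1):
        if versions[i].source_ip not in spam_ips:
            if has_spam_links(versions[i].text):
                return ("abort", i)
            return ("revert", i)
    return ("delete", None)

history = [
    Version("Welcome", "192.0.2.10"),
    Version("Welcome, cheap pills!", "203.0.113.7"),
    Version("more pills", "198.51.100.23"),
]
# A stand-in link check for the example; a real one would inspect domains.
decision = plan_ip_revert(history, SPAM_IPS, lambda text: "pills" in text)
```

Only the stored source address of each version is compared until a candidate is found, which is why this pass is cheap relative to the link-based one.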

The code uses the secondary approach first, in order to take care of as many reversions as possible with the lighter-weight approach. The primary approach is then invoked, and it should catch the remaining spammers' edits. The biggest problem currently is the effort required to keep the database of spammer domains and IP addresses up to date. It is currently mostly a manual process, although I've written some code that attempts to suggest new IP addresses and domains to add to the database. As it stands, without constant maintenance of the database, the utility of the anti-spam bot gradually decreases. After about two weeks without database updates, it begins to miss a significant number of spammer edits. I'm looking into perhaps automating the discovery process or maybe pulling the database from a centralized location (such as a wiki page) in order to leverage the efforts of a community to keep the database current.
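The ordering of the two passes (cheap IP check first, link check second) might look something like this. It is an illustrative sketch only: run_cleanup, the shape of the pages mapping, and the convention that a revert is stored by re-saving the clean version as the newest one are assumptions, not details taken from the bot.

```python
def run_cleanup(pages, ip_pass, domain_pass):
    """Run the lightweight IP pass over all pages, then the link pass.

    pages maps a page name to its list of versions, oldest to newest.
    Each pass is a function that takes a version list and returns the index
    of the version to revert to, or None if it has nothing to do.
    Returns a log of (page, pass_name, index) entries in the order performed.
    """
    log = []
    for pass_name, find_target in (("ip", ip_pass), ("domain", domain_pass)):
        for name, versions in pages.items():
            target = find_target(versions)
            if target is not None:
                # Assumed revert convention: re-save the clean version on top.
                versions.append(versions[target])
                log.append((name, pass_name, target))
    return log

# Toy data: each version is a (text, source_ip) pair.
pages = {
    "PageA": [("ok", "192.0.2.10"), ("spam", "203.0.113.7")],
    "PageB": [("ok", "192.0.2.10"), ("spam", "192.0.2.99")],
    "PageC": [("ok", "192.0.2.10")],
}

def toy_ip_pass(versions):
    if versions[-1][1] != "203.0.113.7":
        return None
    clean = [i for i, v in enumerate(versions) if v[1] != "203.0.113.7"]
    return clean[-1] if clean else None

def toy_domain_pass(versions):
    if versions[-1][0] != "spam":
        return None
    clean = [i for i, v in enumerate(versions) if v[0] != "spam"]
    return clean[-1] if clean else None

log = run_cleanup(pages, toy_ip_pass, toy_domain_pass)
```

Because PageA is already reverted by the IP pass, the slower link pass only has to act on PageB, which was spammed from an address not yet in the database.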


Compare :