For the next release optional DelayedIndexing is planned. Some more discussion is at wiki.chongqed.org.
So far there seem to be no bad effects in my testing. PageRank comes back after a page has been removed from the index temporarily. If a page happened to be out of the index during the period when Google recalculates PageRank, the result could be different, but that doesn't happen very frequently. I am relying on the PageRank info we can get from the Google Toolbar, which may not be totally accurate or up to date with what they actually use for ranking, but it's all we can do. I will give it one more try just to make sure. – Joe [at] chongqed [dot] org 2005-05-07
Chongqed provides some spam prevention tips, including the recommendation to hide recent changes and diffs from spiderbots that respect robots.txt. Using links provided on that page, I developed my own robots.txt to hide that functionality from bots searching DokuWiki sites.
User-agent: *
Disallow: /wiki/doku.php?do=revisions&id=*
Disallow: /wiki/doku.php?do=recent&id=*
Disallow: /wiki/doku.php?id=*&rev=*&do=diff
I validated that file using an online robots.txt validator, which reported that the above was correct syntax. The exact text will be different if you use pretty URLs. Also, apparently some robots don't recognize wildcards in filenames.
Are more disallows needed in your opinion, or should the above be modified? Should an example file be included with the DokuWiki bundle? – Brennan
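To illustrate the wildcard caveat, here is a minimal Python sketch that compares the two ways a robot might read the rules above. It is not any real crawler's code: the function names are made up for illustration, and the "prefix only" reading is an assumption based on the original robots.txt convention of treating each Disallow value as a literal URL prefix.

```python
import re

# The three Disallow values from the robots.txt above.
RULES = [
    "/wiki/doku.php?do=revisions&id=*",
    "/wiki/doku.php?do=recent&id=*",
    "/wiki/doku.php?id=*&rev=*&do=diff",
]

def blocked_prefix_only(rule, url):
    """Robot without wildcard support: the rule is a literal URL prefix."""
    return url.startswith(rule)

def blocked_with_wildcards(rule, url):
    """Wildcard-aware robot: '*' matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, url) is not None

recent_url = "/wiki/doku.php?do=recent&id=start"
diff_url = "/wiki/doku.php?id=start&rev=1107000000&do=diff"

for rule in RULES:
    for url in (recent_url, diff_url):
        print(rule, url,
              blocked_prefix_only(rule, url),
              blocked_with_wildcards(rule, url))
```

Under these assumptions, a prefix-only robot blocks none of the URLs, because the literal "*" never appears in a real URL. Dropping the trailing "*" from the first two rules would make them work for both kinds of robot; the diff rule has wildcards in the middle and cannot be expressed as a plain prefix.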
This shouldn't be necessary because DokuWiki sets <meta name="robots" content="noindex,nofollow" /> in all pages except for the current revision… Another interesting method could be DelayedIndexing. However, I haven't had a single spam here since introducing the blacklist feature. — Andreas Gohr 27.01.2005 09:09
Hi. Is it possible that Google ignores these META tags? Because I see that Google only indexes the start page and does not follow any other links. Other robots, like those from Yahoo or MSN, do find all the pages correctly. Any hints? (Maybe it's got something to do with the URL rewriting?) — Jan Tammen 07.02.2005
Google does obey these headers. — Andreas Gohr 2005-02-07 10:33
When I view the source of my wiki pages I find meta name="robots" content="noindex,nofollow" in all pages, not just old revisions. (The Nofollow flag in my configuration is off.) Google robots are not indexing me. How can I get rid of that meta line? Sorry if this is off-topic; just desperate… For an example, see my WCpedia: Topical Index page and use View HTML Source in the browser…
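A quick way to check which robots directives a page really carries, without reading the source by hand, is to parse the HTML. The following is a small Python sketch using only the standard library; the class and function names are invented for this example, and fetching the page itself is left out.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower()
                                for d in content.split(",") if d.strip()]

def robots_directives(html_text):
    """Return the list of robots directives found in the given HTML."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives

page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
print(robots_directives(page))
```

If the current revision of a page reports noindex here, search engines that obey the tag will skip it, which would match the symptom described above.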
Email addresses are clearly displayed on the site. Wouldn't it be safer to modify them, like user-NOSPAM@site.com or user_AT_site.com, so that spamming engines can't use them to spam our mailboxes? — cumulus 13.11.2004 12:58
I'm using the hex mailguard config option here, so mail addresses are encoded as hex entities.
Let's start a short poll: Should we switch to the visible encoding here? — Andreas Gohr 13.11.2004 13:11
I'm not really convinced about the efficiency of the hex config option: if my browser can read it, spam engines can too, so is it really efficient? What about a mix of the two solutions (some kind of bin2hex("foobar [at] site [dot] com") ..)? Who says paranoia? :) I vote for visible if you think it is more efficient.
Okay I changed it to the visible encoding which should be safe enough — Andreas Gohr 08.01.2005 14:10
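For readers unfamiliar with the two encodings discussed above, here is a rough Python sketch of the idea. It is not DokuWiki's actual code; the function names are hypothetical, and the visible form simply follows the "foobar [at] site [dot] com" example from the discussion.

```python
import html

def encode_hex(address):
    """Encode every character as a hexadecimal HTML entity."""
    return "".join(f"&#x{ord(ch):x};" for ch in address)

def encode_visible(address):
    """Replace '@' and '.' with human-readable placeholders."""
    return address.replace("@", " [at] ").replace(".", " [dot] ")

addr = "foobar@site.com"
print(encode_hex(addr))
print(encode_visible(addr))

# Any harvester that decodes HTML entities gets the address straight back.
print(html.unescape(encode_hex(addr)))
```

The last line shows the concern raised in the poll: hex entities are undone by a single standard-library call, whereas the visible form needs a harvester to guess the placeholder convention.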
What about using javascript to translate me[at]somewhere[dot]com into me@somewhere.com on the load event of a web page?
I think using JavaScript is a good idea, since most people don't. My justification is that, for a spammer, running all the JavaScript code to search for email addresses is more time consuming than just reading the HTML code. I hate spam, including wiki spam, so I'll say this: I made a page where you can automatically create JavaScript links. Check it out to see if it works. If anyone reads this and thinks this is forum spam, delete all or part of this comment. The page is at http://www.addressmunger.com Juan Rodriguez 05.10.2006 01:10
What about putting the email address into a table? <table><tr><td>me</td><td>@</td><td>somewhere</td><td>.</td><td>com</td></tr></table> You can see it but you can't paste into it. CSS can be used for positioning and text wrapping.
Why not just do the easy thing and put the email address into a graphic? There's already some support somewhere out there for creating graphics on the fly (LaTeX support comes to mind), and using the PHP to scan for a mailto: URL should be good enough, right?
IMHO, using graphics won't help; I have several forums and spammers' tools are sophisticated enough to get through visual confirmation no problem. One idea I've toyed with but haven't done anything with is using off-line 3D rendering to animated GIFs. Imagine you were to engrave an email address as a relief bumpmap on a sheet of glass or transparent plastic (partly reflective, partly refractive), and imagine you're in a forest, daytime, looking at this piece of plastic in front of your eyes: you can't read the bumpmapped writing on it unless you shift and rotate the angle to see how the reflections and refractions change with angle. By rendering multiple images offline with a raytracer and putting them into an animated GIF, you create an animation that only the kind of image processing our visual cortex is capable of can decipher. Trying to write software that separates the characters from the noise (trees) in the reflections and refractions would be a nightmare.
In the meantime, though, something that might help would be to add a "Can View Emails" attribute to user groups, so that only trusted users can see them. Another idea would be to have email access pass through a textual confirmation box: I've completely stopped spammers at my forums since I installed the "Textual Confirmation" mod. The trick in using the mod, though, is to NOT have any questions that can be answered yes or no. The spambots try answering "yes", so the question "are you human?" that comes with the mod by default is actually a hole. Good questions are like "Across what sea did Columbus sail?"; though don't use it now, because if everybody who reads this uses that one, eventually the bots will start trying the obvious answer. Questions must be unique at each website.
So, yes; I guess having a question like that come up the first time a user clicks on an email link would help, but it could be brute-forced easily if users can anonymously try again and again. It would be much better to have textual confirmation at registration time, with a default limit of 2 attempts in a 24 hr period, and then to have the Can-View-Emails group attribute set by default only for registered users. –Chuck Starchaser (email address withheld for obvious reasons…)
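The "2 attempts in a 24 hr period" limit could be sketched roughly as follows in Python. This is only an in-memory illustration: the class name is invented, the client identifier (an IP address, say) is an assumption, and a real site would need persistent storage.

```python
import time
from collections import defaultdict

class AttemptLimiter:
    """Allows at most max_attempts per client within a sliding time window."""

    def __init__(self, max_attempts=2, window_seconds=24 * 60 * 60):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(list)  # client id -> attempt timestamps

    def allow(self, client_id, now=None):
        """Record an attempt and return True, or return False if over the limit."""
        now = time.time() if now is None else now
        # Keep only attempts that are still inside the window.
        recent = [t for t in self._attempts[client_id]
                  if now - t < self.window_seconds]
        allowed = len(recent) < self.max_attempts
        if allowed:
            recent.append(now)
        self._attempts[client_id] = recent
        return allowed

limiter = AttemptLimiter()
print(limiter.allow("203.0.113.7"))  # first try of this client
```

Refused attempts are not recorded in this sketch, so a blocked client becomes eligible again once its oldest counted attempt falls out of the window.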
Have you considered adding captchas to distinguish spambots from users? There are several open source captcha generators listed on FreshMeat. This was also once discussed for Wikipedia. If you do implement it, however, consider adding an admin option to disable it for visually impaired individuals.
I would highly appreciate such an option. –dp
It's possible that captchas might be a good idea not quite done right. Here's a recent article on sans.org which provides alternatives. I thought the KittenAuth (at the bottom) was cute, to say the least. –jc
Google announced today that it will honour a new tag for links to prevent comment spam. It will also be used by MSN and Yahoo. The question is: should we use this tag in DokuWiki for external links? Or should it be optional? I'm not sure what I think about it for wikis. Because when I set a link on a wiki page I think this page is important and its PageRank should increase… — Andreas Gohr 19.01.2005 20:41
I understand the basic idea behind this. But it seems to me that it is "targeted" more at blogging and comment spam than at external links. The idea of a wiki is that everyone (sometimes just authorized people) can contribute content, and that the WikiCommunity of a WikiSite will help handle WikiVandals, as has been happening so far in this WikiSite. Just take a look at the "Old revision restored" summaries that appear from time to time. Getting back to this topic, it seems that if any change is made in DokuWiki to handle this tag then it should be made optional. But I don't see much of a point in making any changes just for this tag. To make this comment short, you answer your own question when you say that when you set an external link it's because it's an important page and you don't mind that its PageRank increases.
I agree. The keyword to focus on is comment spam. A wiki shouldn't bother about such things, unless it allows pages to have a comment section where users are free to go beyond the original intention of the page. And if DokuWiki ever gets this tag, it should be optional, like all good things (TM) in life — SameerDS 20 Jan 2005
What about an optional feature that would provide some version of the following behaviors:
* Any link submitted by an editor who is not a logged-in user gets the nofollow tag added by default. Of course, this can be re-edited by an interactive user if need be.
I understand the opinion that the ideal wiki community will handle spam. However, remember that dokuwiki is designed for the needs of people working on documentation projects, which are by their nature very specific at times. That niche range of content can limit the size of a wiki's community along with the problems any new wiki faces in attracting readers. A quick browse of the dokuwiki users page shows that several people are using dokuwiki as a general CMS or using dokuwiki to create a blog alongside their wiki. While dokuwiki's blacklist functionality certainly helps those users, I believe every content generation software needs to play its part in reducing the effectiveness of POST spam. — Brennan
On another discussion page someone else asked for HTML edit links instead of form buttons. The answer was that the form buttons prevented spidering. I made the rel="nofollow" comment then as a possible alternative. To restate: you may want to use the rel="nofollow" anchor tag attribute for known non-content links, like an edit or config page. Also, you may consider giving all new comments/pages a "moderation status" property, which logged-in or privileged users could vote on, rank, or approve. If approved, any "nofollow" attributes could be removed. I think this may also be useful for teams evaluating stuff within their DWs - Isao Yagi
Okay if someone does a feature request I will add the tag as an option.
On the suggested ideas of using the tag only for users who are not logged in, or even using a voting system: this is simply not possible with the way DokuWiki is designed - the parser doesn't know who added a certain link to the page.
And for using it instead of form buttons - there are two points against it:
— Andreas Gohr 2005-02-07 22:01
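Since the parser cannot tell who added a link, an optional setting would have to treat all external links alike. A rough Python sketch of that behaviour on rendered HTML follows; it is not DokuWiki code, the function name, the `enabled` switch and the `own_host` parameter are all assumptions, and the regular expression only handles simple double-quoted href attributes.

```python
import re
from urllib.parse import urlparse

# Matches an opening <a ...> tag with a double-quoted href attribute.
ANCHOR = re.compile(r'<a\s+([^>]*?)href="([^"]+)"([^>]*)>', re.IGNORECASE)

def add_nofollow(html_text, own_host, enabled=True):
    """Add rel="nofollow" to links pointing at hosts other than own_host."""
    if not enabled:
        return html_text

    def fix(match):
        before, href, after = match.groups()
        host = urlparse(href).netloc
        is_external = host != "" and host.lower() != own_host.lower()
        # Leave internal links and links that already carry a rel attribute.
        if not is_external or "rel=" in (before + after).lower():
            return match.group(0)
        return f'<a {before}href="{href}"{after} rel="nofollow">'

    return ANCHOR.sub(fix, html_text)

sample = '<a href="http://example.com/x">x</a> <a href="/wiki/doku.php?id=start">start</a>'
print(add_nofollow(sample, "wiki.example.org"))
```

With the switch off the HTML is returned untouched, which keeps the feature optional as requested earlier in the thread.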
It appears (after some experimenting) that Google and Yahoo do still follow a link with rel="nofollow" even though the name implies they shouldn't. The tag still hurts spammers trying to increase PageRank, and their sites would have been found and indexed anyway, so it's not really hurting anything. The other main search engines (MSN and Yahoo) support this tag; others likely will too. I don't know yet if MSN also follows these links. – Joe [at] chongqed [dot] org 2005-04-26
Now it appears that Google does not follow a link with rel="nofollow". I am still a little confused about why it appeared the other way at first, but I am trying to figure that out. Yahoo does follow them. Not enough data from other engines yet. See the experiment. – Joe [at] chongqed [dot] org 2005-05-07
One easy way to help prevent DokuWiki being targeted as an application would be to allow end users to change the form variable names via the configuration file. Pretty sure spammers are writing scripts that target particular applications, and that works because all the forms are exactly the same.
This will at least raise the "entry level" for people trying to spam DokuWiki sites - they'd need to parse the HTML to extract the form variable names.
This could be a good way to trap spammers that are parsing the HTML to get form variables.
While this might work, it would also confuse a lot of people and render a site less accessible. Just consider all those people browsing the web with text-only browsers (which greatly increase performance and use far less bandwidth, hence making browsing faster and cheaper). And don't forget people using screenreaders or Braille devices, which will either ignore CSS altogether or interpret it in a way neither intended by you nor actually helping the people. You must never forget that a web page may be supported by CSS (and, for what it's worth, JavaScript) but it should never depend on it. Otherwise you're creating nothing more than a closed door or just an annoying site. So, by and large, your suggested CSS trap will strike legitimate users more than a robot (which doesn't have to care about faked form fields but may just as well try them all), with the additional drawbacks of increased bandwidth usage for each page served and an 'uglier' markup. So beware!
Given CSS like this (generated per request by PHP):
<style type="text/css" media="all">
/* Generate this CSS per request on the server side */
#name1 { display: block; }
#name2 { display: none; }
#email1 { display: none; }
#email2 { display: block; }
#comment1 { display: none; }
#comment2 { display: block; }
</style>
And a form, also generated by PHP, like:
<form action="blog_comment.php?id=someUniqueId" method="POST">
  Name: <input id="name1" type="text" name="name1">
  <input id="name2" type="text" name="name2">
  Email: <input id="email1" type="text" name="email1">
  <input id="email2" type="text" name="email2">
  Comment: <textarea id="comment1" name="comment1"></textarea>
  <textarea id="comment2" name="comment2"></textarea>
  <br>
  <input type="submit" value="Post">
</form>
The knowledge of which form fields are actually meant to be filled in is contained in the CSS. If spammers get as far as parsing that, it could be made more difficult by relating styles to tags via CSS class selectors. The uniqueId in the POST URL identifies which set of fields contains the real data, while a script which parses the form could be fooled into submitting data in the wrong fields, thereby identifying itself.
This was raised in these two blog posts:
http://www.sitepoint.com/blog-post-view.php?id=217519
http://www.sitepoint.com/blog-post-view.php?id=220357
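The server side of this trap could look roughly like the Python sketch below (the thread assumes PHP; Python is used here only for illustration). All names are hypothetical, and the sketch assumes the layout chosen for a request is stored on the server under the unique id from the POST URL.

```python
import secrets

FIELDS = ("name", "email", "comment")

def new_layout():
    """Pick, per request, which of the two inputs of each field is the real one."""
    return {field: secrets.choice((1, 2)) for field in FIELDS}

def render_css(layout):
    """Show the real input of each field and hide its decoy."""
    rules = []
    for field, real in layout.items():
        for n in (1, 2):
            display = "block" if n == real else "none"
            rules.append(f"#{field}{n} {{ display: {display}; }}")
    return "\n".join(rules)

def looks_like_bot(layout, posted):
    """A submission that fills in any hidden decoy field gives itself away."""
    for field, real in layout.items():
        decoy = 2 if real == 1 else 1
        if posted.get(f"{field}{decoy}", "").strip():
            return True
    return False

# The layout matching the CSS example above: name1, email2, comment2 are real.
layout = {"name": 1, "email": 2, "comment": 2}
print(render_css(layout))
print(looks_like_bot(layout, {"name1": "Ann", "email2": "ann@example.org", "comment2": "Hi"}))
```

A script that fills in every field it finds would be flagged, while a browser user who only sees the visible fields would pass.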
Might be useful for everyone if, when DokuWiki detects an attempted spam, it publishes the URLs contained in the spam via RSS. Could be a first step towards a kind of “P2P” reactive anti-spam network.
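As a rough illustration of the RSS idea, the following Python sketch pulls the URLs out of a rejected edit and wraps them in a minimal RSS 2.0 feed. Nothing like this exists in DokuWiki as far as this page says; the function names, feed title and feed link are placeholders, and the URL pattern is deliberately simple.

```python
import re
import xml.etree.ElementTree as ET

URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

def extract_urls(text):
    """Return the distinct URLs in text, in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(text):
        if url not in seen:
            seen.append(url)
    return seen

def spam_urls_to_rss(spam_text,
                     feed_title="Rejected spam URLs",
                     feed_link="http://wiki.example.org/"):
    """Build an RSS 2.0 document with one item per URL found in spam_text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "URLs found in edits rejected as spam"
    for url in extract_urls(spam_text):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = url
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")

spam = "buy http://spam.example/pills now http://spam.example/pills or https://other.example/x"
print(spam_urls_to_rss(spam))
```

Other wikis could subscribe to such a feed and add the listed URLs to their own blacklists, which is the "reactive" part of the suggestion.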
Maybe interesting
The newest trick of spammers and harvesters is to download the site via export_??? or in edit mode, which makes the text easier to read and yields the email addresses in clear text, without protection.
New self-defense required.
Using an identifiable PHP file like doku.php tells spammers how to attack a site and which edit protocol is used to post text. This is especially interesting for bots, and for scanning the web via Google (simply searching for doku.php, guestbook.php, comments.php and so on…)
I found a tool for spamming the web automatically, XR…. (I won't put the full name here, so as not to give these pigs any publicity; they sell the program for several hundred dollars). They claim to be able to decode captchas like those used in DokuWiki with OCR (maybe true, maybe a lie). And they spam via proxies. For now it is mainly blogs and forums that have problems with this "tool"…