davidlyness.com

Things I find interesting.


Google's PageRank algorithm assigns a number between 0 and 10 to web domains (such as facebook.com or davidlyness.com), and is one of the main factors in determining how high to place a website in the list of Google search results. The method by which a domain's PageRank is calculated is patented (and hence public), but a domain's PageRank is never actually shown in the search results themselves. It is also notoriously difficult to find a simple way to query Google's servers for this information.

When Google released the Google Toolbar back in 2000, it included a feature which displayed the PageRank of the website currently being viewed. With a bit of reverse-engineering, I developed the function below, which returns the PageRank of a given domain. Interestingly, the returned rank is always between 1 and 9 - so sites like google.com that have a de facto PageRank of 10 appear as having a PageRank of 9.

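function getPageRank($domain) {
	$domainlen = strlen($domain);
	// Seed string mixed into the checksum alongside the domain name.
	$seed = "Mining PageRank is AGAINST GOOGLE'S TERMS OF SERVICE. Yes, I'm talking to you, scammer.";
	$seedlen = strlen($seed);
	$result = 0x01020345;
	for ($i = 0; $i < $domainlen; $i++) {
		$pos = $i % $seedlen;
		$result ^= ord($seed[$pos]) ^ ord($domain[$i]);
		// Rotate left by 9 bits. The final mask keeps the value to 32 bits,
		// which matters on 64-bit PHP builds where integers are wider.
		$result = ((($result >> 23) & 0x1ff) | ($result << 9)) & 0xffffffff;
	}
	// The checksum is the digit 8 followed by the hash in hexadecimal.
	$checksum = '8' . dechex($result);
	$url = sprintf("http://toolbarqueries.google.com/tbr?client=navclient-auto&ch=%s&features=Rank&q=info:%s", $checksum, urlencode($domain));
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$response = curl_exec($ch);
	curl_close($ch);
	if ($response === false) {
		return false; // the request itself failed
	}
	// The rank is the single character after the last colon in the response.
	$rank = substr(strrchr($response, ":"), 1, 1);
	return $rank;
}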


Disclaimer:
  • As can be seen from $seed, abuse of this script is in violation of Google's Terms of Service. Use at your own risk.
  • Google can change the initialised value of $result at any time - doing so will render this script unusable until this value is updated.


When choosing a password for a new service, there are often restrictions on the password you can choose or the policies enforced. For example, requiring the password to be a certain length, using certain character sets or requiring a new password be chosen after a fixed time period. While some of these restrictions are effective in preventing the password from being compromised through a brute force (exhaustive search) attack, others are arbitrary and some actually lessen the security of the password, which is the very thing they're designed to protect. Looking at each policy I've come across in turn, I'll label them as effective, ineffective or harmful.

Throughout this post, I use the term "key space" to refer to the set of all possible passwords given that certain policies are in place. For example, the key space of all one-character lowercase passwords is \(\{a, b, c, ..., z\}\), and has size 26; and the key space of all two-character lowercase passwords is \(\{aa, ab, ac, ..., ba, bb, bc, ..., zz\}\) and has size \(26^2=676\).

Password construction policies



Minimum length - effective

If a minimum length of 6 characters were enforced, pass is invalid and password is valid. This is one of the simplest policies to implement and understand, and it is also one of the most effective. By mandating that all passwords are of a certain length, it makes it much more difficult for an attacker to try all passwords in the key space in an attempt to find one that works. If we're using just the lowercase English alphabet with 26 characters, the number of passwords of length 5 is \(11881376\), whereas the number of passwords of length 6 is \(308915776\) - increasing the key space by a factor of 26. A side effect is that the attacker would know not to try any passwords of length 5 or less, but this is negligible next to the increased password length.
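The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch: keySpaceSize is a helper name I've made up for illustration, and it counts passwords of exactly the given length.

```javascript
// Size of the key space for passwords of exactly `length` characters
// drawn from an alphabet of `alphabetSize` symbols.
// BigInt avoids precision loss once the numbers get large.
function keySpaceSize(alphabetSize, length) {
  return BigInt(alphabetSize) ** BigInt(length);
}

const lengthFive = keySpaceSize(26, 5);
const lengthSix = keySpaceSize(26, 6);

console.log(lengthFive.toString()); // 11881376
console.log(lengthSix.toString()); // 308915776
console.log((lengthSix / lengthFive).toString()); // 26
```

Each extra character multiplies the key space by the size of the alphabet, which is why a minimum length pays off so quickly.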

Maximum length - harmful

If a maximum length of 6 characters were enforced, password is invalid and pass is valid. In complete opposition to the previous policy, this limits the key space and makes passwords much easier to crack. It also calls into question the method by which the service is storing the passwords in the database - more on that later.

Multiple character sets - effective

This policy enforces the use of multiple character sets - usually lowercase, uppercase, numeric and special symbols. For example, abcd is invalid and Ab1! is valid. Similar to having a minimum length, this vastly increases the size of the key space and makes a brute force attack more difficult. As an example, let's consider a password of 4 characters. If we are using only lowercase letters, the size of the key space is \(26^4=456976\). If we now allow lowercase letters (26), uppercase letters (26), numbers (10) and the "?" and "!" special symbols (2), this gives us a total of 64 characters, making our key space \(64^4=16777216\) in size - almost 37 times larger.
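A check for this policy might look something like the sketch below. The four character classes and the function names are my own assumptions for illustration - real services differ in which symbols they count as "special".

```javascript
// Hypothetical character classes; adjust to match the policy in question.
const CHARACTER_SETS = {
  lowercase: /[a-z]/,
  uppercase: /[A-Z]/,
  numeric: /[0-9]/,
  special: /[^a-zA-Z0-9]/,
};

// Returns the names of the character sets that the password does not use.
function missingCharacterSets(password) {
  return Object.entries(CHARACTER_SETS)
    .filter(([, pattern]) => !pattern.test(password))
    .map(([name]) => name);
}

// A password passes the policy when no character set is missing.
function usesAllCharacterSets(password) {
  return missingCharacterSets(password).length === 0;
}

console.log(usesAllCharacterSets("abcd")); // false
console.log(usesAllCharacterSets("Ab1!")); // true
console.log(missingCharacterSets("abcd")); // [ 'uppercase', 'numeric', 'special' ]
```

Returning the missing sets rather than a bare true/false makes it easier to tell the user what to fix.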

Password cannot be an English word - effective

A website could maintain a wordlist (for example, a list of all the words in the English dictionary), and disallow any passwords that match a word in the list. For example, cat is invalid, whereas abc is valid. At first glance this may appear to be ineffective as it limits the key space, giving an attacker fewer passwords to search. However, a brute force attack is often not the first line of attack - instead, the attacker will try all words in the dictionary, with the thinking that most users will choose an English word as their password. By removing this possibility it forces the attacker to perform a brute force attack instead, which is much more cumbersome.

Password cannot contain an English word - ineffective

This is similar to the previous policy, but here the English word could be a substring of the password, rather than the whole password. cat is still invalid, but so is qqqcatqqq. While the previous policy is designed to force attackers to use a brute force attack rather than a dictionary attack, an attacker would already be using a brute force attack before they came across a password such as qqqcatqqq. (Technically, the attacker could use a heuristic brute force algorithm that tries candidates containing dictionary words sooner, on the basis that they are more probable - but this is overkill for minimal gain, and may even slow down the candidate throughput if CPU time is being consumed by the heuristic engine.)
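The difference between these two wordlist policies is easy to see side by side. The sketch below uses a three-word stand-in for a real dictionary, and the function names are hypothetical.

```javascript
// Tiny stand-in word list; a real deployment would load a full dictionary.
const WORD_LIST = new Set(["cat", "dog", "password"]);

// "Cannot be an English word": the whole password must not match a word.
function isDictionaryWord(password) {
  return WORD_LIST.has(password.toLowerCase());
}

// "Cannot contain an English word": no word may appear anywhere inside it.
function containsDictionaryWord(password) {
  const lowered = password.toLowerCase();
  for (const word of WORD_LIST) {
    if (lowered.includes(word)) return true;
  }
  return false;
}

for (const candidate of ["cat", "abc", "qqqcatqqq"]) {
  console.log(candidate, isDictionaryWord(candidate), containsDictionaryWord(candidate));
}
```

Only the second check rejects qqqcatqqq, a password that a plain dictionary attack would never have tried anyway.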

Password cannot contain your other information - effective

When signing up to a new web service, you're often asked to provide additional information like your full name, your email address and a username. Many users may choose to use something like their username as their password, making their password very easy to remember, but also easily guessable. Unfortunately this was a common practice before this password policy was implemented on a widespread basis.

Password cannot contain repeating characters - ineffective

As an example, yyy is invalid and xyz is valid. While this may be a sensible policy when using a keypad (it is often easy to see which keys are pressed most often), when implemented as a password policy all it serves to do is restrict the key space.
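To put a number on how much this restricts the key space, here is a rough sketch. It assumes "repeating" means the same character twice in a row, which is only one possible reading of the policy; both function names are made up for illustration.

```javascript
// True if any character is immediately followed by the same character.
function hasAdjacentRepeat(password) {
  return /(.)\1/.test(password);
}

// Number of passwords of the given length with no adjacent repeats:
// any symbol for the first position, then any symbol except the previous one.
function countWithoutAdjacentRepeats(alphabetSize, length) {
  if (length === 0) return 1n;
  return BigInt(alphabetSize) * BigInt(alphabetSize - 1) ** BigInt(length - 1);
}

console.log(hasAdjacentRepeat("yyy")); // true
console.log(hasAdjacentRepeat("xyz")); // false
console.log(countWithoutAdjacentRepeats(26, 3).toString()); // 16250
console.log((26n ** 3n).toString()); // 17576
```

For three lowercase characters the policy removes 1326 of the 17576 candidates, and the attacker gets that saving for free.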

Password administration policies



Password lockout after a number of failed attempts - effective

This essentially thwarts any online brute force attack, where the attacker does not have access to the verification details stored in the database (i.e. it forces the attacker into an offline attack, which requires obtaining that database first). For example, by requiring administrator intervention if an incorrect password is entered 5 times, it strikes a balance between convenience for the user (who is not likely to incorrectly enter their password 5 times) and security against a brute force attack. An attacker, who would normally try hundreds or thousands of passwords per second using an automated program, would instead find that they are locked out after only a few attempts.
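A minimal sketch of a lockout counter is shown below. The LoginGuard class is hypothetical, it keeps its state in memory, and the passwordIsCorrect argument stands in for whatever verification the real system performs.

```javascript
class LoginGuard {
  constructor(maxFailures = 5) {
    this.maxFailures = maxFailures;
    this.failures = new Map(); // username -> consecutive failed attempts
  }

  isLocked(username) {
    return (this.failures.get(username) || 0) >= this.maxFailures;
  }

  // Records one login attempt and reports "ok", "rejected" or "locked".
  attempt(username, passwordIsCorrect) {
    if (this.isLocked(username)) return "locked";
    if (passwordIsCorrect) {
      this.failures.delete(username); // a success resets the counter
      return "ok";
    }
    this.failures.set(username, (this.failures.get(username) || 0) + 1);
    return this.isLocked(username) ? "locked" : "rejected";
  }

  // Administrator intervention: clear the lock.
  unlock(username) {
    this.failures.delete(username);
  }
}

const guard = new LoginGuard(5);
for (let i = 0; i < 5; i++) {
  console.log(guard.attempt("alice", false));
}
console.log(guard.attempt("alice", true)); // locked, even with the right password
```

Once the account is locked, even the correct password is refused until unlock is called, which is what stops the attacker from simply carrying on.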

New password must be different from previous password(s) - effective

If an attacker manages to gain knowledge of our password, we would want to be able to change our password so that they no longer have access. If the current password is pass and a policy is in place to prevent identical passwords in the most recent three passwords, the user would need to change their password 3 times before changing it back to pass (for example, pass to pass1 to pass2 to pass3 to pass). This helps to mitigate the possibility that the attacker will re-gain access after a user changes their password. (This policy is often combined with password expiration, below.)

New password must not be similar to previous password(s) - harmful

This seems like a very similar policy to the previous one, but with a crucial difference. To understand it, we need to delve into how these passwords are being stored. As discussed in a previous post, the proper practice is for passwords to be hashed (and salted) when stored in a database, rather than being stored in plaintext. This prevents an attacker who gains control of the password database from instantly knowing all users' passwords. A good cryptographic hash function's outputs should reveal nothing about the plaintext passwords - for example, the sha1 hash of abcd is 81fe8bfe87576c3ecb22426f8e57847382917acf, and the sha1 hash of abce is 0a431a7631cabf6b11b984a943127b5e0aa9d687 - that is, the hashes are completely different regardless of the similarity of the plaintexts. A system that can tell whether the new password is similar to the old one(s) is storing the previous passwords in plaintext (or using a form of reversible encryption, which is almost as bad).
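The contrast between "different from" and "similar to" can be sketched with Node.js's built-in crypto module. SHA-1 is used here only to match the example above, and salting is left out for brevity; sha1Hex, isReused and differingDigits are names I've made up for illustration.

```javascript
const nodeCrypto = require("crypto");

// Hex-encoded SHA-1 digest of a string.
function sha1Hex(text) {
  return nodeCrypto.createHash("sha1").update(text, "utf8").digest("hex");
}

// An exact-match history check needs only the stored hashes.
function isReused(newPassword, previousHashes) {
  return previousHashes.includes(sha1Hex(newPassword));
}

// Counts the positions at which two equal-length hex strings differ.
function differingDigits(a, b) {
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) count++;
  }
  return count;
}

const history = ["pass", "pass1", "pass2"].map(sha1Hex);
console.log(isReused("pass1", history)); // true
console.log(isReused("pass3", history)); // false

// Similar plaintexts, unrelated hashes: nothing here says "one letter apart".
console.log(differingDigits(sha1Hex("abcd"), sha1Hex("abce")), "of 40 hex digits differ");
```

The history check works because equal passwords give equal hashes. There is no equivalent trick for "similar" passwords, which is why a similarity check points to plaintext or reversible storage.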

Minimum length of time between password changes - harmful

A 24-hour policy of this nature simply means "If a password is changed at 9am on Monday, it cannot be changed again until 9am on Tuesday at the earliest". I cannot see any advantages to this policy, and would welcome feedback from anyone who believes it has its uses. On the other hand, this gives a would-be attacker a 24-hour window to have guaranteed access to a user's system if they can find out their new password.

Maximum length of time between password changes (password expiration) - debatable

I believe this is the most debatable policy on this list. While it guarantees that attackers who have discovered a password have only a finite usage time, it often forces the user to choose similar passwords every time the current password expires (pass to pass1 to pass2 to pass3, etc). This practice is undetectable to the system administrator if they are following correct password storage procedure, but is intuitive for an attacker to guess even if a lockout policy is in place - an expired password of pass3 would likely prompt an attacker to guess pass4. An interesting (mathsy) paper was published in 2010 that weighs the benefits of password expiration.

Conclusion



The following password policies are effective:
  • Minimum length
  • Multiple character sets
  • No English words
  • No personal information
  • Lockout policy
  • New password different from previous passwords


The following password policies are ineffective:
  • No English words contained in password
  • No repeating characters


The following password policies are harmful:
  • Maximum length
  • New password not similar to previous passwords
  • Minimum length of time between password changes


The benefits of the following password policies are debatable:
  • Maximum length of time between password changes (password expiration)


While weighing up the above methods of keeping passwords safe from attackers, we often forget that users are only human and will exert the minimum amount of effort required to comply with policy. If we enforce all of the above recommended policies, an average user may take to writing their password down and storing it under their keyboard or on their monitor. While this may be against corporate policy, it is less enforceable than technical password restrictions and is much more prone to discovery by nearby colleagues - whose intentions may not be benign.


Although less common nowadays with the pervasiveness of social media, you still see people posting their email addresses online and obfuscating them in some way. For example, user@example.com may be displayed as user[at]example[dot]com or user@example.com.nospam. The purpose of this obfuscation is to allow regular visitors to learn your email address, but prevent automated bots from harvesting it to send you spam. There are many such techniques - both effective and ineffective - for obfuscating email addresses, and since I couldn't find any quantitative comparison of the various methods I decided to compile some evidence myself.

The experiment

In 2010, I set up a web page containing 9 davidlyness.com email addresses - one in plaintext and each other obfuscated using a different method. I then directed each of these email addresses to separate mail accounts, and ensured that no received messages were marked and deleted as spam by the email software. I ensured it was indexed by Google and other search engines by including a hidden link on this website's homepage and including the page in the sitemap. Then I let two years pass, allowing bots to harvest the email addresses and send spam.

Obfuscation methods

Plain text (control)

This email address was displayed in plaintext to act as a baseline for the other addresses.

Use [at] and [dot]

Nothing too special going on here - we simply replace the @ with [at] and the . with [dot] as described above.

Add a "nospam" comment

Also shown above: a portion is added to the email address that the sender is expected to know to remove before sending.

Intersperse HTML comments

In HTML, a comment is included using the following syntax: <!-- this is a comment -->. HTML comments are not rendered by browsers, meaning an email address with embedded comments should display as normal, but may fool a bot attempting to harvest email addresses - it is likely looking at the source code of the page, not the final rendered version. This email address looks like <!-- blah -->user<!-- blah -->@<!-- blah -->example<!-- blah -->.<!-- blah -->com.
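Markup like this could be generated with something along the lines of the sketch below. The function names are hypothetical, and the second function shows how little work a harvester would need to do to undo it.

```javascript
// Puts an HTML comment in front of each part of the address,
// splitting on "@" and "." while keeping those separators.
function intersperseComments(address, comment = "<!-- blah -->") {
  return address
    .split(/([@.])/)
    .filter((part) => part !== "")
    .map((part) => comment + part)
    .join("");
}

// What a harvester would need to do to recover the address.
function stripComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

const obfuscated = intersperseComments("user@example.com");
console.log(obfuscated);
console.log(stripComments(obfuscated)); // user@example.com
```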

Percent encoding

Using a method similar to PHP's urlencode function, we can encode each character of the email address as a percentage sign followed by two hexadecimal characters, which is automatically parsed by the browser when the page is rendered. Again, this is obfuscated in the source code but not in the final rendered output. For example, user@example.com will be encoded as %75%73%65%72%40%65%78%61%6d%70%6c%65%2e%63%6f%6d.
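For reference, here is a sketch of the same encoding in JavaScript. It assumes the address contains only ASCII characters, and percentEncodeAll is a made-up name; the built-in decodeURIComponent reverses it.

```javascript
// Encodes every character as "%" followed by its two-digit hex code.
// Assumes ASCII input; other characters would need UTF-8 byte encoding first.
function percentEncodeAll(text) {
  return Array.from(text, (ch) =>
    "%" + ch.charCodeAt(0).toString(16).padStart(2, "0")
  ).join("");
}

const encoded = percentEncodeAll("user@example.com");
console.log(encoded); // %75%73%65%72%40%65%78%61%6d%70%6c%65%2e%63%6f%6d
console.log(decodeURIComponent(encoded)); // user@example.com
```

Unlike encodeURIComponent, which leaves letters and digits alone, this encodes every character so that nothing recognisable is left in the source.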

ROT13 on alphabetic characters

The ROT13 function encodes alphabetic characters by shifting them 13 places. (Forwards or backwards makes no difference as there are 26 characters in total.) So, user@example.com becomes hfre@rknzcyr.pbz. If this is encoded by server-side software, it can be decoded using JavaScript by performing the function again.
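The client-side half might look like the sketch below (the function name is my own). The same function both encodes and decodes, so a page would call it on the ROT13 text before displaying the address.

```javascript
// Shifts each ASCII letter 13 places, wrapping around the alphabet.
// Digits, "@" and "." are left untouched.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of "A" or "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

console.log(rot13("user@example.com")); // hfre@rknzcyr.pbz
console.log(rot13("hfre@rknzcyr.pbz")); // user@example.com
```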

JavaScript string concatenation

We can "build" the email address in JavaScript using something like this:

var string1, string2, string3;
string1 = "user";
string2 = "example";
string3 = "com";
document.write(string1 + "@" + string2 + "." + string3);


Insert element with display:none CSS property

A span element is added to the email string, similar to the HTML comment method above, but we also add a rule for this element to have the display:none CSS property. This allows the user to copy the email address without obfuscation, but obfuscates the address at both a source code and non-stylesheet level.
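A rough sketch of the idea is below. The decoy text, the inline style attribute and both function names are assumptions for illustration, not the markup used in the experiment.

```javascript
// Inserts a hidden decoy span just before the "@".
function withHiddenSpan(address, decoy = "REMOVE-THIS") {
  const at = address.indexOf("@");
  return (
    address.slice(0, at) +
    '<span style="display:none">' + decoy + "</span>" +
    address.slice(at)
  );
}

// A naive harvester that strips tags but ignores CSS.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

const markup = withHiddenSpan("user@example.com");
console.log(markup); // user<span style="display:none">REMOVE-THIS</span>@example.com
console.log(stripTags(markup)); // userREMOVE-THIS@example.com
```

A browser that applies the stylesheet shows user@example.com, while a bot that only strips the tags ends up with an address that doesn't exist.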

Encoding the email address as a picture

The last method is also a common one, and it is to encode the email address as a picture so that the email address does not exist at all in text form on the page. So, user@example.com would be displayed as an image of the address: [image: user@example.com picture].

Results

[Chart: email obfuscation results]

The ROT13 and picture methods of obfuscation were the only two methods that yielded a perfect score of zero spam messages received, which I believe means that the addresses themselves were not harvested. The CSS display:none method was close behind, with only 36 messages received over the two-year period. (Interestingly, all 36 messages came in the last 4 months of the experiment - meaning that spammers likely updated their harvesting software to decode this method.)

Conclusion

If you have to make your email address publicly available, from this experiment it looks like the ROT13 and picture methods of obfuscation are currently the most effective. However, they each have their own disadvantages:
  • ROT13 requires the use of server-side software like PHP to encode the address (or have the address hard-coded in ROT13 form already), and also client-side JavaScript to re-encode / decode it on the other side. If the user's browser does not support JavaScript (or they are using a tool like NoScript to selectively enable JavaScript), the address will continue to appear in obfuscated form.
  • By placing your email address in a picture, the user can neither copy and paste the email address into their mail client nor click a mailto link for the address, discouraging them from sending an email in the first place. An image also takes up significantly more bandwidth than text, but as the email image above is less than 4KB in size this is not a major concern.


As mentioned at the beginning of the post, the prevalence of social media websites has mitigated the severity of this problem. Sites like Facebook and Twitter take on the responsibility of protecting your mailbox from spam, and you don't have to deal with obfuscating a link to your profile. Alternatively, you can handle the email delivery yourself using a web-based contact form like this one.