davidlyness.com

Things I find interesting.

Posts in the Maths category (page 1 of 2)

This post is inspired by someone who asked me question today along the lines of "can a string's md5 hash be its own value?". The answer is that it's possible - nothing fundamental about the algorithm prevents it - but the more interesting challenge is to find the probability that such an event can occur.

The md5 hash algorithm takes a string of arbitrary length and returns a 128 bit result - usually displayed as a 32-digit hexadecimal number. For example:

md5("test123") = c38f4ec93b73956709932e6464f4f339
md5("test456") = 0f3c7ce16f5c313b81a9e54758bcae8a


In order for a string to hash to itself, it's not difficult to see that the string itself must be exactly 128 bits, or 32 hexadecimal characters. Therefore, a naive (and somewhat pointless) approach would be to try all possible strings for a potential match. We've limited our search space to all strings of exactly 32 hexadecimal characters in length, which gives us a total of \(16^{32}=2^{128}\) strings to search. At a modest rate of calculating 1,000,000 hashes per second, it would take us \(1.079 \times 10^{25}\) years to try all possible combinations. A quicker way would be nice.

Let's start from a purely mathematical standpoint, and come back to specific functions like md5 later. At the most basic level, for a given function \(f\), we would like to check whether there exists an input \(x\) such that \[f\left(x\right)=x.\] We also know a few things about \(f\). It is a cryptographic hash function, meaning that the output will always be a string of fixed length - let's say length \(n\) when represented in binary. This means that \(x\) - our candidate for a fixed point of \(f\) - must also be of length \(n\). This gives a total of \(2^n\) candidates for \(x\).

Another property of cryptographic hash functions is that they aim to distribute their results over its range as uniformly as possible to minimise hash collisions. Modelling \(f\) as a uniform distribution, \[\mathbb{P}\left({f\left(x\right)=x}\right)=\frac{1}{2^n}\] for a given \(x\). As an intermediate step, let's calculate the probability that, for all possible values of \(x\), \(x\) is not a fixed point of \(f\). This will be \(1-\frac{1}{2^n}\) multiplied by the number of possible values of \(x\) - so, \[\mathbb{P}\left({\forall x:f\left(x\right)\not=x}\right)=\left({1-\frac{1}{2^n}}\right )^{2^n},\]and since \(\lim_{\alpha \to \infty} \left(1-\frac{1}{\alpha} \right )^\alpha = \frac{1}{e}\), we get \[\lim_{n \to \infty}\mathbb{P}\left({\forall x:f\left(x\right)\not=x}\right)=\frac{1}{e}.\] Now, to get the probability that some \(x\) does has this property, we calculate the opposite: \[\lim_{n \to \infty}\mathbb{P}\left({\exists x:f\left(x\right)=x}\right)=1-\frac{1}{e}.\] Therefore, the probability that a hash function of sufficient length has a fixed point is approximated by \(1-\frac{1}{e}\approx 0.6321=63.21\%\).

Coming back to the md5 function, it outputs a string of binary length 128. So, \(n=128\) and as \(2^{128}\) is a very large number, the probability of a fixed point of the md5 algorithm is very well approximated as 63.21%. As our mathematical method is generic and only dependent on certain properties common to all cryptographic hash functions, it also works for other algorithms like SHA-1 (\(n=160\)) and SHA-2 (\(n=224\), \(n=256\), \(n=384\) or \(n=512\)).

While the above calculation predicts the existence of a fixed point, it doesn't give us any details about how to find it. The process of locating arbitrary hash collisions is relatively easier due to the birthday paradox, but even these are infeasible to determine for a cryptographically secure hash function. As it stands, we're unlikely to find such a fixed point in these large-\(n\) hash functions unless calculation speeds hugely increase, a mathematical weakness is found in the algorithm itself, or someone gets very, very lucky.


In computing, a quine is a computer program which outputs a copy of its own source code. (See the Wikipedia page on quines for more information and examples.) I recently discovered a different sort of quine: mathematical ones.

Consider the function \[{1\over 2} < \left\lfloor \mathrm{mod}\left(\left\lfloor {y \over 17} \right\rfloor 2^{-17 \lfloor x \rfloor - \mathrm{mod}(\lfloor y\rfloor, 17)},2\right)\right\rfloor\] where
  • \(\lfloor x \rfloor\) is the floor function, i.e. the largest integer less than \(x\);
  • \(\bmod(b,m)\) is the modulo operation, which, when graphed over \(0\leq x\leq 106\) and \(n\leq y\leq n+17\) with n = 4 858 450 636 189 713 423 582 095 962 494 202 044 581 400 587 983 244 549 483 093 085 061 934 704 708 809 928 450 644 769 865 524 364 849 997 247 024 915 119 110 411 605 739 177 407 856 919 754 326 571 855 442 057 210 445 735 883 681 829 823 754 139 634 338 225 199 452 191 651 284 348 332 905 131 193 199 953 502 413 758 765 239 264 874 613 394 906 870 130 562 295 813 219 481 113 685 339 535 565 290 850 023 875 092 856 892 694 555 974 281 546 386 510 730 049 106 723 058 933 586 052 544 096 664 351 265 349 363 643 957 125 565 695 936 815 184 334 857 605 266 940 161 251 266 951 421 550 539 554 519 153 785 457 525 756 590 740 540 157 929 001 765 967 965 480 064 427 829 131 488 548 259 914 721 248 506 352 686 630 476 300.


This is known as Tupper's Formula (created by Jeff Tupper in 2001). It produces the following "graph".

quine picture

If you look closely, this graph is a bitmap which has been chosen out of all possible bitmaps of this size by the value of \(n\) specified above, making it not truly self-referential. A real (and complex) self-referential mathematical formula was created by Jakub Trávník earlier this year.


Last year I wrote and gave a presentation on the development of calculus through the ages - from the Ancient Egyptians through to modern times. I thought it might be interesting to give people a chance to look through it - and if you stick through it to the end you get a maths joke too!




(Use the fullscreen button to view it at a more readable size.)