The problem with your p(n) and q(n) curves is that they measure the probability of at least one match. What we really want to measure is the expected number of matches, since with a difficulty setting, having more than one match matters. This is what gives the relationship of hashes/second * amount of RAM. Even if we only increase the hashes/second by a factor of two, we still get twice as many matches, giving linear scaling in exactly the same way as just using whatever BirthdayHash algorithm you choose by itself.
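To make that concrete, here is a minimal Python sketch (the parameter names and values are mine, purely for illustration) comparing the at-least-one-match probability with the expected number of matching pairs:

```python
from math import prod

D = 365  # number of possible "birthdays" (hash outputs)

def p_at_least_one(n, d):
    """Probability that n random birthdays contain at least one match."""
    return 1.0 - prod((d - i) / d for i in range(n))

def expected_pairs(n, d):
    """Expected number of matching pairs among n random birthdays."""
    return n * (n - 1) / (2 * d)

for n in (23, 46, 92, 184):
    print(n, round(p_at_least_one(n, D), 3), round(expected_pairs(n, D), 2))
```

The probability curve saturates at 1.0 almost immediately, while the expected pair count keeps growing, which is why the expected count is the quantity that scales with hashes/second.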
Actually I've given it some more thought, and now I'm finding this very convincing.
Let's say there are 365 possible birthdays, and a CPU can generate 10 per second and process a maximum of 10 per second into memory.
Now add a GPU that can generate 100 per second: reduce the search space by 90%, discard the ~90 birthdays that fall outside the smaller space, and send the remaining 10 to the CPU, which now processes them into a search space of only 36 days.
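Here's a toy simulation of that prefilter attack, using the numbers above (the helper and trial counts are my own):

```python
import random

DAYS = 365

def avg_matches(n_processed, space, trials=10_000):
    """Average matches after processing n_processed birthdays drawn
    uniformly from a space of `space` days."""
    total = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n_processed):
            b = random.randrange(space)
            if b in seen:      # collides with something already in memory
                total += 1
            seen.add(b)
    return total / trials

# The CPU processes only 10 birthdays/second either way; the GPU's only
# job is to make sure those 10 all land in a 10x smaller space.
print("full space :", avg_matches(10, DAYS))        # ~0.12 matches
print("prefiltered:", avg_matches(10, DAYS // 10))  # ~1.2 matches
```

The memory-bound CPU does the same work in both cases, but the prefiltered stream finds roughly ten times as many matches.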
So the q(n) might only be half as effective as the p(n), but hashing can be added without limit and we're getting linear scaling.
We need to prevent easy reduction of the search space.
You want hash-speed to be linear once the target memory has been consumed. As the number of items in memory grows, the performance of the algorithm approaches the linear speed of the birthday hash alone; however, it does not eliminate the need for the memory to be present. Putting the momentum property aside, having a linear hash function that requires multiple GB of RAM to perform at peak means that the cost of going parallel with the hash function is an order of magnitude higher.
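One way to see the linearity is a back-of-envelope model (mine, not from the thread): once some number of entries is resident, each new hash collides with one of them with probability stored/space.

```python
def match_rate(hash_rate, stored, space):
    """Matches per second once `stored` birthdays are resident in RAM:
    each new hash collides with a stored entry with probability stored/space."""
    return hash_rate * stored / space

# Doubling hash_rate doubles the match rate (linear) -- but halving the
# resident memory halves it right back, so the RAM still has to be there.
print(match_rate(1_000, 500_000, 2**32))
print(match_rate(2_000, 500_000, 2**32))
print(match_rate(2_000, 250_000, 2**32))
```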
But let's see if we can keep things non-linear. Suppose your BirthdayHash has 32 bits, but your nonce search space only has 16 bits. You will run out of nonces long before you are able to generate enough birthdays to get very high on the q(n) curve. I would guess you could tune the parameters in the following way:
Limit the nonce search space such that the expected number of birthday matches is 1. If this were the actual birthday problem, we would set the nonce search space to ~28 and the number of days to 365 (28·27/2 ≈ 365 possible pairs, so one match in expectation). On average you would find 1 match, though any given pass might turn up 2, or none (see the sketch below).
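The spread around that average follows a Poisson distribution reasonably well; working the numbers (my arithmetic):

```python
from math import comb, exp, factorial

DAYS = 365
n = 28                            # nonce search space
lam = comb(n, 2) / DAYS           # expected matching pairs: ~1.04

# Match counts are approximately Poisson(lam):
for k in range(4):
    print(k, round(exp(-lam) * lam**k / factorial(k), 3))
# ~35% chance of 0 matches, ~37% of exactly 1, ~28% of 2 or more.
```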
Once all nonces have been searched, you have to start over again.
Using these parameters you would require ~28x as much memory to find a match as to verify one. If you were to grow both the nonce search space and the bits of the BirthdayHash you would maintain the same principle while requiring any amount of RAM.
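To keep the expected match count at 1 as both grow, the nonce space only has to scale as the square root of the birthday space. A sketch of the resulting memory requirement (the 8-byte entry size is an assumption):

```python
ENTRY_BYTES = 8  # assumed size of one stored (birthday, nonce) entry

for bits in (32, 40, 48, 56):
    days = 2 ** bits
    # expected pairs n*(n-1)/2/days ~= 1  =>  n ~= sqrt(2*days)
    n = int((2 * days) ** 0.5)
    print(f"{bits}-bit hash: ~{n:,} nonces, "
          f"~{n * ENTRY_BYTES / 2**20:,.1f} MiB to find, O(1) to verify")
```

At 56 bits the finder already needs a few GB of resident memory while verification stays trivial.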
The performance of the hash function would now be a matter of how quickly one could fill the RAM. This is of course linear with respect to processing power. The goal is to require as much RAM as possible per unit of processing power. All we are really achieving with this hash function is increasing the number and type of transistors required.
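The fill-rate arithmetic is simple (the figures below are assumptions for illustration):

```python
hash_rate   = 1_000_000      # BirthdayHashes per second (assumed)
ram_bytes   = 4 * 2**30      # 4 GiB target table
entry_bytes = 8              # assumed, as above

entries = ram_bytes // entry_bytes
print(f"{entries / hash_rate:.0f} s to fill")  # ~537 s
```

Twice the processing power fills the same RAM twice as fast, which is exactly the linear scaling we can't avoid; what we can control is how much RAM each unit of hash power drags along with it.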
Some goals:
1) Get the fastest possible BirthdayHash algorithm, because this will generate data via a CPU at a rate that nears the memory throughput. Any ASIC that could perform the hash faster would be stuck waiting on the memory bus.
Once the memory bus becomes the bottleneck in performance we have won.