Evaluating difficulty will depend on the individual miner's experience after reading the dataset's metadata and any existing reported results, which are also recorded on the blockchain.
How to prevent Sybil attacks on the evaluation process? Or is reading and evaluating a dataset also subject to a processing fee?
How would existing reported results help in determining the difficulty of a challenge?
What metric is used to objectively define the difficulty of a challenge to begin with?
1) The evaluation process is similar to validating a Hashcash hash: verification costs essentially nothing, and as long as 51% of the miners are incentivized to be honest, the evaluation will be sound.
2) Existing reported results could be the results of previous epochs, e.g., for dataset 236 (hash signature fadc432ad...., random seed 342):

epoch   datetime stamp   winner's MSE   winning margin over runner-up
1       10:33            0.233          0.20
2       10:34            0.210          0.15
3       10:35            0.200          0.12
....
3) Metric = 10-fold cross-validated prediction/classification error on dataset 236 using random seed 342 (the seed is used so every miner generates the same K=10 equal-sized partitions of dataset 236; a verification sketch follows this list).
3a) Will someone manipulate the metric for his own benefit? Only if the manipulated results help his own submission, and that submission will in turn be validated by everyone else.
3b) At the beginning of the challenge, with zero reported results, miners can run a quick training pass to get results for the first few epochs and get a feel for the problem's difficulty.
3c) Experts in the relevant field, e.g., computer vision, will know how difficult the problem is after looking at the dataset.
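To make 1) and 3) concrete, here is a minimal sketch in Python using scikit-learn. The function name and the assumption that a submission ships one fitted model per fold are mine, not part of the proposal; the point is only that the on-chain seed pins down the same 10 partitions for every node, so checking a claimed MSE costs a prediction pass over the held-out folds rather than a full training run.

# Sketch only: how any node could recompute a submitted 10-fold CV score.
# Dataset id, seed 342, and fold count come from the challenge header on chain;
# "submitted_models" (one fitted model per fold) is an assumed submission format.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cross_validated_mse(X, y, submitted_models, seed=342, n_splits=10):
    """Recompute the challenge metric: mean MSE over the K held-out folds.

    Every node derives identical folds from the same seed, so a reported
    score can be checked directly against a resubmitted set of models."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for (train_idx, test_idx), model in zip(folds.split(X), submitted_models):
        preds = model.predict(X[test_idx])      # one cheap prediction pass per fold
        fold_errors.append(mean_squared_error(y[test_idx], preds))
    return float(np.mean(fold_errors))

# A verifier would accept the submission only if the recomputed score matches
# the claimed one up to floating-point tolerance:
# assert abs(cross_validated_mse(X, y, models) - claimed_mse) < 1e-9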
Sybil attacks can be discouraged by requiring each dataset with a defined problem to come with an attached processing fee. X% of the miners could collude to pick one fake dataset and pretend to work on it, where X is simply the share of miners behind the most heavily backed (colluded) dataset, but in the end only one winner earns the reward. And this does not prevent the remaining miners from also working on it, possibly with better algorithms or more experience, and eventually winning the competition.
How to bootstrap such a cryptocurrency if mining a block requires a processing fee? I.e., where do the first miners get their coins for paying the processing fee if no coins have yet been mined?
How to determine if a dataset is fake?
Note that when referring to Sybil attacks I'm not yet talking about the mining process. I'm talking about the dataset contribution process that happens beforehand, where no miner is yet involved.
Or are you suggesting that the miners solving the challenges should also be the ones contributing the datasets, each dataset submission being attached to a fee?
1) Bootstrapping is similar to how Bitcoin started: no miners at first, so it was easy to control 51% of the hash power, and the coin's value was low. The pioneers will have to bootstrap the chain by giving out coins, etc., or follow in the footsteps of ETH. More research is needed on how those two started.
2) Yes, authentic dataset creation is a tough problem. Dataset contribution can be constrained as follows: only aggregate datasets are considered, e.g., 100 individual user profile pictures contributed by 100 distinct users. This increases the cost of a Sybil attack (a sketch of this constraint follows the list). The mining problem could then be to classify the user pictures by gender, race, etc.
3) A fake dataset contributed by a single person is hard to detect; that's why I suggested the above. Someone posting a dataset and solving it himself would not reap the benefits unless he mobilizes an army of miners, as described earlier.
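A rough sketch of the aggregation constraint in 2). The field names, the 100-contributor threshold, and the eligibility rule are all illustrative assumptions; the idea is only that a dataset becomes a candidate for the next block when it bundles fee-paying records from many distinct contributors, so faking one costs an attacker that many identities and fees.

# Sketch of the aggregation constraint: eligibility requires many distinct,
# fee-paying contributors. Signature verification is omitted for brevity.
from dataclasses import dataclass

@dataclass
class Contribution:
    contributor_pubkey: str   # identity that signed the record
    record_hash: str          # hash of the contributed item, e.g. one profile picture
    fee_paid: int             # processing fee, in the chain's smallest unit

def dataset_is_eligible(contributions, min_contributors=100, min_fee=1):
    """A Sybil attacker must fund `min_contributors` identities and
    `min_contributors * min_fee` in fees just to fake a single dataset."""
    payers = {c.contributor_pubkey for c in contributions if c.fee_paid >= min_fee}
    return len(payers) >= min_contributors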
This leads to another problem: if the dataset is artificially generated by the colluder with a known formula or DNN, then he already knows the answer (a perfect-fit model), so he can guarantee a win. But he must then control enough miners to make this dataset the chosen one for the next block, so in effect he has to gather on average 30% of the node power in order to fake the win.
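A toy illustration of that collusion risk (the formula, noise level, and sizes below are entirely made up): the colluder generates the data from a function only he knows, so he starts with a near-perfect model while honest miners must search for it from scratch.

# Sketch of the synthetic-dataset attack: the generator keeps the formula secret
# and therefore holds a near-zero-error model before the challenge even starts.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 5))
secret = lambda X: 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2     # known only to the colluder
y = secret(X) + rng.normal(scale=0.01, size=len(X))        # tiny noise, looks like real data

# The colluder's "submission" is just the secret formula: near-zero MSE, no search.
colluder_mse = np.mean((y - secret(X)) ** 2)               # on the order of 1e-4

# Honest miners must discover the relationship themselves, so the colluder wins
# as soon as he can also get this dataset nominated for the next block.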
How did you reach the conclusion that 30% of computational power would be sufficient for faking a challenge win? 30% seems an awfully low threshold for maintaining security.
30% is just an illustrative number, assuming a free-form nomination process in which the highest-voted candidate gets 30% of the votes, the second highest 25%, the third 15%, and so on.
How to stop rogue clients from DDoSing the network by flooding it with wrong timestamps and turning 2-minute block intervals into 2 days, weeks, or years?
By attaching a cost/fee to each broadcasted result?
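A fee would help, and it could be paired with a cheap verifier-side filter. The sketch below is modeled on Bitcoin's own timestamp rules; the 2-hour drift window and the median over the last 11 timestamps are Bitcoin's constants, borrowed here purely for illustration. A node simply drops any broadcast whose timestamp runs too far ahead of its clock or is not later than the median of recently accepted ones, so flooding with bogus timestamps cannot drag the apparent block interval around.

# Illustrative timestamp sanity check, patterned after Bitcoin's rules.
import time
from statistics import median

MAX_FUTURE_DRIFT = 2 * 60 * 60   # seconds a timestamp may run ahead of local time
MEDIAN_WINDOW = 11               # how many recent timestamps the median covers

def timestamp_is_acceptable(ts, recent_timestamps, now=None):
    now = time.time() if now is None else now
    if ts > now + MAX_FUTURE_DRIFT:
        return False                                        # too far in the future
    if recent_timestamps and ts <= median(recent_timestamps[-MEDIAN_WINDOW:]):
        return False                                        # not ahead of recent history
    return True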