Part of the secret to building a safe neural-network-based AI is to have a law or foundation on which the AI system can base all of its functions and activities. The foundation has to be good, or free from evil, and the system should have networks of specialized databases/sources with different roles for solving different problems or for making the AI function properly; these databases must not deviate from the law/foundation. I would expect an AI creator who really cares about safe AI to program the AI to depend more on the sources with the most accurate, consistent, and safe solutions to specific issues. Once the right source is found, the AI constantly learns from it and also remembers to give credit.
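To make the source-weighting idea concrete, here is a minimal toy sketch; everything in it (the `Source` class, the `route` function, the example source names and scores) is hypothetical and invented for illustration, not an existing system. It prefers the source with the best accuracy/consistency/safety record and always records attribution:

```python
from dataclasses import dataclass

# Hypothetical illustration: weight specialized sources by their track
# record and always record which source was used ("give credit").

@dataclass
class Source:
    name: str
    accuracy: float      # fraction of past answers that were correct
    consistency: float   # agreement with the foundation and other sources
    safety: float        # fraction of past answers that passed safety review

    def score(self) -> float:
        # A source is only trusted when it does well on all three axes.
        return self.accuracy * self.consistency * self.safety

def route(query: str, sources: list[Source]) -> str:
    """Pick the most trusted source for a query and credit it."""
    best = max(sources, key=Source.score)
    answer = f"<answer to {query!r} looked up in {best.name}>"  # placeholder lookup
    return f"{answer} (credit: {best.name})"

sources = [Source("MathDB", 0.99, 0.98, 0.99),
           Source("WebScrape", 0.80, 0.60, 0.70)]
print(route("2+2", sources))
```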
AI systems that are intentionally designed to cause harm are one problem, but an AI system may be dangerous even if it was never intended to cause harm. We currently do not have a good understanding of the inner workings of AI systems, so it is a good idea to first understand those inner workings so that we can design the systems well. Part of the design of these AI systems must include interpretability, so that people can observe not only the AI's training data and loss/fitness function but also observe and make sense of its inner workings. If an AI has bad processes deep inside, then we need to be able to find those processes and correct them, either through retraining or through ablation, and we need to be able to detect these bad inner processes before they result in bad outputs.
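As a concrete illustration of the ablation idea, here is a minimal NumPy sketch (the tiny two-layer network and its random weights are made up for illustration): silence one hidden unit at a time and measure how much the output changes, so that units carrying important internal processes can be located before they affect behavior in the wild:

```python
import numpy as np

# Minimal sketch of ablation: zero out one hidden unit in a tiny
# two-layer ReLU network and measure how much the output changes.
# Units whose removal changes behavior a lot are the ones to inspect.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # hidden x input
W2 = rng.normal(size=(2, 8))   # output x hidden

def forward(x, ablated=None):
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    if ablated is not None:
        h[ablated] = 0.0                 # ablate one hidden unit
    return W2 @ h

x = rng.normal(size=4)
baseline = forward(x)
for unit in range(8):
    delta = np.linalg.norm(forward(x, ablated=unit) - baseline)
    print(f"hidden unit {unit}: output change {delta:.3f}")
```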
I have developed the notion of an LSRDR, and some generalizations of this notion, to solve cryptographic problems, but LSRDRs may also be used to solve problems related to AI safety and interpretability. Since nobody else is working on this, one should not be surprised that LSRDRs cannot yet match the performance of neural networks, but I am working on this, and LSRDRs can solve some problems that neural networks have trouble with. For example, can you design a neural network to find a largest clique in a graph? Perhaps it is possible, but LSRDRs can probably do this more efficiently (though it seems that simulated annealing may outperform LSRDRs for the clique problem), and unlike neural networks, LSRDRs can solve the clique problem on some graphs just by looking at the single graph, without creating many graphs to train on. I hope that generalizations of LSRDRs can soon be used to solve more machine learning tasks so that they can better compete with neural networks, but at the very least, we can probably use something like LSRDRs to interpret neural networks. Neural networks as they stand today are quite bad in terms of interpretability, so we can either use more interpretable mathematical systems or improve our interpretability tools.
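For reference, here is a minimal sketch of the simulated-annealing baseline mentioned above (this is the standard heuristic for maximum clique, not the LSRDR method, whose details are not given here); like the single-graph setting described above, it works on one graph with no training set. States are vertex subsets, and missing internal edges are penalized so the search is driven toward large subsets that are actual cliques:

```python
import math, random

def missing_edges(S, adj):
    """Count pairs in S that are not joined by an edge."""
    S = list(S)
    return sum(1 for i in range(len(S)) for j in range(i + 1, len(S))
               if S[j] not in adj[S[i]])

def energy(S, adj, penalty=2.0):
    # Lower energy = bigger subset with fewer missing internal edges.
    return penalty * missing_edges(S, adj) - len(S)

def anneal_clique(adj, steps=20000, t0=2.0, t1=0.01, seed=0):
    rng = random.Random(seed)
    verts = list(adj)
    S, best = set(), set()
    for k in range(steps):
        t = t0 * (t1 / t0) ** (k / steps)    # geometric cooling schedule
        v = rng.choice(verts)
        T = S ^ {v}                          # toggle one vertex in or out
        dE = energy(T, adj) - energy(S, adj)
        if dE <= 0 or rng.random() < math.exp(-dE / t):
            S = T                            # Metropolis acceptance rule
        if missing_edges(S, adj) == 0 and len(S) > len(best):
            best = set(S)                    # record the largest clique seen
    return best

# Example: a 4-clique {0, 1, 2, 3} plus a pendant vertex 4.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
print(anneal_clique(adj))
```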
Happy New Year,
-Joseph Van Name Ph.D.