AI Safety

Added 8 October 2018: See also my summary and review of the book "Life 3.0" by Max Tegmark here.

I wrote the following paper as a "Bachelor thesis" a couple of years ago. It focuses on the existential risk posed by artificial superintelligence (as opposed to other kinds of risks from other kinds of AI). It starts by providing reasons why an intelligence explosion might happen and how it could go wrong for us; it then moves on to discuss the merits of various proposed solutions.

A PDF version is available at this link. The HTML version is below the line.




Are we doomed?

 Responding to Catastrophic Risk from Artificial Intelligence

By Quaerendo
June, 2016

Abstract
Sometime this century, artificial intelligence may reach and surpass human intelligence. A superintelligence would be unimaginably powerful, and it may cause the extinction of humanity, unless we are very careful about how we design it. This may happen incidentally through the AI’s consumption of resources we need to survive, or the AI may reason that eliminating humans is instrumentally useful toward fulfilling its final goal. Given this existential threat, we must take action to maximize the probability of a good outcome, one where a superintelligence’s capabilities can be used for the benefit of humanity. It is unlikely that private projects alone will suffice, because existential risk prevention is a global public good that is likely to be underprovided by the market. It is thus incumbent upon the governments of the world to coordinate in the creation of a global regulatory body to promote safe research. The functions of such a body might include assessing and monitoring research, or promoting research into safe AI technologies. 

Table of contents

1.     INTRODUCTION
2.     CREATING HUMAN-LEVEL ARTIFICIAL INTELLIGENCE
2.1.      Predicting AI
3.     FROM AGI TO SUPERINTELLIGENCE
3.1.      Superintelligence is plausible
3.2.      The hard takeoff scenario
4.     CONSEQUENCES OF SUPERINTELLIGENCE
4.1.      Achieving an intelligence explosion in a controlled way
5.     DOING SOMETHING ABOUT AI RISK
5.1.      A strategic framework for solutions
5.2.      Societal proposals
5.2.1.       Research and differential technological development
5.2.2.       Enhancing human capabilities by merging with machines
5.2.3.       Less realistic solutions
5.3.      AI design proposals
5.3.1.       External constraints
5.3.2.       Internal constraints
5.3.3.       Less realistic FAI designs
5.4.      Further considerations
6.     EXISTENTIAL RISK PREVENTION AS A GLOBAL PUBLIC GOOD
7.     CURRENT ACTIONS
8.     DISCUSSION AND CONCLUSIONS
FOOTNOTES
REFERENCES


1.   INTRODUCTION
A global catastrophic risk is “a risk that might have the potential to inflict serious damage to human well-being on a global scale” (Bostrom & Ćirković, 2008, p. 1). Global catastrophic risks include environmental changes like climate change, biodiversity loss, and stratospheric ozone depletion; emerging technologies like artificial intelligence, biotechnology, geoengineering, and nanotechnology; large-scale violence or terrorism including the use of nuclear weapons or biological weapons; pandemics; natural disasters like supervolcano eruptions, asteroid impacts, solar storms, and gamma ray bursts; high-energy particle physics experiments going wrong; and an encounter with an intelligent extraterrestrial civilization (Baum, 2015; Bostrom & Ćirković, 2008). There might be risks that are as yet unforeseen (Bostrom, 2002). Existential risks are a subset of global catastrophic risks – those where “an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential” (Bostrom, 2002, p. 2).

This paper will argue for the establishment of a global regulatory body to deal with catastrophic and existential risks arising from emerging technologies like bioengineering, nanotechnology, and artificial intelligence. The reasoning for this is twofold: firstly, global catastrophes and extinction events have a non-negligible chance of occurring over the next century and should be prevented; and secondly, market failures in risk prevention will necessitate the intervention of governments. Thus, it may be argued that the responsibility falls on the world’s governments to cooperate for the sake of protecting the human species’ continuity and wellbeing, as is already being done for climate change through treaties like the Kyoto Protocol or scientific bodies like the Intergovernmental Panel on Climate Change (IPCC). This paper will focus on advanced artificial intelligence (AI), because it is one of the most likely extinction scenarios according to expert surveys (Tonn & Stiefel, 2014) and because there is currently a lack of any regulatory action on AI safety (Nota, 2015). However, the general conclusions may be transferred to other risky emerging technologies.

Popular notions of a “robot rebellion” belong to the realm of science fiction. The more plausible source of AI risk is the fact that superintelligent AIs may choose to use scarce resources to accomplish their own goals rather than aid human survival (Yudkowsky, Salamon, Shulman, Kaas, McCabe, & Nelson, 2010). It is not active malevolence, but mere indifference to human values that could enable a powerful AI to pose a catastrophic or existential threat (Sotala & Yampolskiy, 2014). Machines do not need to be conscious in order to be “vastly more able than humans to efficiently steer the future in pursuit of their goals” (Muehlhauser & Bostrom, 2014, p. 41).

Unlike some other global catastrophic risks, like asteroid impacts or physics disasters, AI risk has no actuarial statistics that we can use to assign probabilities or compute precise models, which makes it hard to discuss (Yudkowsky, 2008a). What makes AI a particularly interesting case, however, is that it not only poses an existential threat to humanity, but could also be harnessed to find solutions to other existential risks and increase our chances of survival. Artificial intelligence therefore has both negative and positive potential impacts (Yudkowsky, 2008a).

The rest of this paper is structured as follows: first, it looks at the path from here to human-level artificial general intelligence (AGI), and then at the path from AGI to artificial superintelligence (ASI). After examining the consequences and risks of ASI, it looks at proposed policy responses. It will then show why reducing AI risk is a public good that suffers from market failure, despite a number of existing non-profit organizations being active in the field. Finally, the paper concludes that human survival may depend on our ability to coordinate globally, and as such, we are faced with certain moral obligations. 

 2. CREATING HUMAN-LEVEL ARTIFICIAL INTELLIGENCE
“Humans are not optimized for intelligence. Rather, we are the first and possibly dumbest species capable of producing a technological civilization.”
-- Yudkowsky et al., 2010, p. 2
It is possible that we will create human-level artificial intelligence within the next century, either by inventing de novo AI software or by emulating human brains on computers. What AI researchers mean when they talk about intelligence is “an agent’s capacity for efficient cross-domain optimization[1] of the world according to the agent’s preferences.” (Muehlhauser & Salamon, 2012, p. 17). In other words, intelligent agents achieve their goals across a variety of environments using resources with relative efficiency. When applied to artificial minds, the term AI usually refers to “general AI”, systems that match or exceed human intelligence in almost all relevant domains, as opposed to “narrow AI”, which refers to software that can solve problems only within a narrow field, such as chess (Muehlhauser & Salamon, 2012)[2].

Yudkowsky (2008a) notes that “artificial intelligence” refers to a vast space of design possibilities, which he calls the “space of minds-in-general”. Thus, any two AI designs might be enormously different. It is therefore important to avoid anthropomorphic bias when reasoning about artificial minds, as there is no reason to suspect that AI will necessarily have humanlike motives or reasoning processes. Unfortunately, anthropomorphism often sneaks in when making predictions about intelligence (Yudkowsky, 2008a).

2.1.           Predicting AI
The exact moment at which artificial general intelligence or AGI will be created is unknown, and predicting the arrival of AGI is a notoriously perilous task. Armstrong and Sotala (2015) looked at a database of 95 AI timeline predictions and assigned each a median estimated date for human-level AI. They distinguished between “experts” like AI researchers and software developers, and “non-experts” like writers, journalists and other amateurs. They found that expert predictions have little correlation with each other, and that expert predictions are indistinguishable from those of non-experts. Additionally, the time interval between the date the prediction was made and the predicted arrival of AGI seemed to show a preference for the range 15 to 25 years into the future, for both experts and non-experts. This holds for predictions made decades ago as well as more recent predictions. Armstrong and Sotala (2015) conclude that the task characteristics of AI prediction are not conducive to good expert performance, and “[t]here is thus no indication that experts brought any added value when it comes to estimating AI timelines.” (p. 19).

Nevertheless, Muehlhauser and Salamon (2012) point out that uncertainty is not an excuse for not making predictions – deciding whether or not to fund AI safety research, for example, implicitly relies on a prediction about whether AGI is likely to arrive sooner or later. They also point out that our uncertainty must extend in both directions: just as there are reasons to be skeptical that AGI will arrive before the end of the century, there are also reasons to be skeptical of claims that AGI will not arrive in the 21st century. Yudkowsky et al. (2010, p. 2) echo this view: “We need to take into account both repeated discoveries that the problem is more difficult than expected and incremental progress in the field.” There are several factors that can either slow down or accelerate progress toward the first AGI, called “speed bumps” and “accelerators” respectively (Muehlhauser & Salamon, 2012).

Speed bumps may include an end to Moore’s Law, the historical trend of exponentially increasing hardware capacity. Making new scientific discoveries may become increasingly difficult, as the “low-hanging fruit” become depleted. Various catastrophic disasters might cause societal collapse, during which scientific progress would halt. Finally, we may acquire a disinclination to pursue further AI development, resulting in policies that would actively prevent progress (Muehlhauser & Salamon, 2012).  

Conversely, accelerators on the path to AGI may include new advances in hardware that would reduce costs and improve algorithm performance. Theoretical insights could lead to better algorithms with more computational efficiency on the same hardware. Access to massive datasets could, for example, aid progress in speech recognition and translation. We may make progress in psychology and neuroscience, allowing for the transfer of knowledge about human intelligence to the domain of AI research. New online collaborative tools, together with better-funded universities and greater numbers of researchers in emerging economies like China and India, could result in accelerated scientific output. Employers may face economic incentives to replace human employees with cheaper and more reliable machines, and political or private actors face strategic first-mover incentives to harness the potential power of AI before others do so. These actors would be motivated to devote substantial resources to developing AI quickly (Muehlhauser & Salamon, 2012).

Due to all the potential speed bumps given, we cannot have much confidence that human-level AI will arrive very soon, but likewise due to the potential accelerators, we cannot rule out unforeseen advances (Yudkowsky et al., 2010).

For a reference point, surveys of AI experts at the 2012 Singularity Summit reported a median predicted year of 2040 for the Singularity, the moment when AGI would hypothetically cause a runaway explosion in technological progress (Nota, 2015). Müller and Bostrom (2016) found similar results from a different survey, with the median respondent assigning a 50% probability of high-level machine intelligence being developed between 2040 and 2050, rising to a nine-in-ten chance by 2075. Thereafter, respondents estimate superintelligence to arrive within 30 years post AGI, and assign a one-third probability of this development being “bad” or “extremely bad” for humanity (Müller & Bostrom, 2016). The experts may be wrong – but most of them seem to have reasons to expect AGI within this century.

 3. FROM AGI TO SUPERINTELLIGENCE
Three kinds of scenarios may result upon the invention of AGI. A “capped intelligence” scenario would mean that all AGIs remain roughly at human-level intelligence and are prevented from exceeding it. A “soft takeoff” scenario results when AGIs eventually exceed human capability, but do so on a long timescale such that we still have time to intervene. Finally, a “hard takeoff” is when AGIs undergo a sudden and rapid increase in power, soon taking over the world (Sotala & Yampolskiy, 2014). Naturally, assuming that AI does become more intelligent than humans, the softness or hardness of the takeoff scenario will determine which proposed solutions will work best. For instance, a soft takeoff will allow us to develop a complete theory of machine ethics more gradually and incrementally, and adapt to developing AGIs more easily (Sotala & Yampolskiy, 2014). Conversely, a hard takeoff implies significant challenges. After arguing that above-human intelligence is plausible, this paper will explain why a hard takeoff scenario is theoretically possible.

3.1.           Superintelligence is plausible  
There are reasons to believe that computers could, in principle, outperform humans in tasks like general reasoning, technology design and planning, just as they already do in numerous narrow tasks such as arithmetic, chess, and memorization. Thus it seems plausible that human-level artificial intelligence might lead to artificial superintelligence, or ASI – agents that are “smarter than the best human brains in practically every field” (Nota, 2015, p. 1). Three reasons why this transition might happen relate to the advantages unique to AI, the instrumentally convergent goals of AI, and the positive feedback loop of self-enhancement (Muehlhauser & Salamon, 2012).

Artificial intelligence will have many advantages over human intelligence. Firstly, the human brain is constrained by volume and by metabolism, but machine intelligences could use scalable computational resources. Greater computational resources would give the machine more optimization power. Secondly, software minds could process information more rapidly, because their hardware would be faster than the fixed 75 meters per second speed of spike signals of human axons[3]. Thirdly, some cognitive tasks may be better performed by serial algorithms that handle longer sequential processes than the human brain is capable of due to the brain’s relatively slow speed and reliance on parallelization. Computers can overcome this issue. Fourthly, AI is highly duplicable, meaning that new AIs can be created by simply copying the software. This also allows for rapid updating of the population. Fifthly, a digital mind would be more editable than the human mind, for example via its source code, or in the case of whole brain emulation, experimenting with various emulated brain parameters. Sixthly, a set of AI copies would be less likely to face the same goal coordination problems that limit human effectiveness. Finally, machine intelligences could be programmed to be significantly more rational than human beings, who tend to be irrational and lack stable and consistent goals as behavioral economics has revealed (Muehlhauser & Salamon, 2012).

Although we may not be able to predict specific motivations of AI, to some extent there are goals which are instrumental to satisfying nearly any final goal, known as convergent instrumental goals. Therefore, these goals are likely to be pursued by almost any advanced AI. For example, optimizing for a certain goal means that the AI is likely to seek self-preservation in order to fulfill its goal. Moreover, the AI will want to maximize the satisfaction of its current final goal even in the future, thus it will seek to prevent the content of its current final goal from being changed. Furthermore, the AI will likely seek to improve its own intelligence and rationality because this will enhance its decision-making and efficacy. Finally, in trying to satisfy its final and instrumental goals, the AI will seek to obtain as many resources as it can for the fulfillment of these tasks (Muehlhauser & Salamon, 2012). Another reason why an AI would seek to follow rational economic behavior is that not doing so would be a strategic disadvantage, as agents that do not act rationally have certain vulnerabilities that other agents may exploit (Sotala & Yampolskiy, 2014).

A positive feedback loop of self-enhancement may arise from the instrumental goal of self-improvement. Specifically, an AI could edit its own code, or it could create new intelligences that run independently. Whether through self-improvement or improvement via other AIs, this implies that the intelligence that does the improving is also being improved. A large population of AIs may rapidly cascade toward superintelligence, in what is known as an “intelligence explosion”[4]. The exact nature of the intelligence explosion and the outcomes thereof are areas of significant debate (Muehlhauser & Salamon, 2012).

According to Yudkowsky et al. (2010), two conflicting pressures influence the speed of an intelligence explosion. The first is that every improvement in AI technology makes it easier for the AIs to improve further. The second, countervailing force is that the easiest improvements are likely to be achieved first, causing diminishing returns. In general, there are reasons to suspect that the rate of improvement will be high. For example, with continued hardware progress we could run many copies of advanced AIs at high speeds. Reaching superintelligence rapidly may be problematic if research and policy cannot develop adequate safety measures in time (Yudkowsky et al., 2010).

3.2.           The hard takeoff scenario
There are three separate (but not mutually exclusive) reasons to believe that a hard takeoff is plausible. The first is hardware overhang[5]. In this scenario, there is a large quantity of cheap and powerful hardware available, assuming that hardware progress for supercomputers has outpaced AGI software progress. Once AGI software is invented, it would then be possible to run thousands of copies of AIs at high speed on this hardware. The second possibility is a speed explosion. This could happen if AGIs use fast computing hardware to develop even faster hardware, and repeat this process iteratively until they reach physical limits. A speed explosion assumes that much of the hardware manufacturing can be automated by the AGI. The third scenario, the intelligence explosion, entails that the AGIs undergo recursive self-improvement of their software (Sotala & Yampolskiy, 2014). 

There are some who are skeptical about the intelligence explosion because historical progress in AGI has been notoriously slow. Yudkowsky (2008a) rebuts this by pointing out that we should not confuse the speed of research into AI with the speed of an actual machine once it is built. Furthermore, one should keep in mind that recursive self-improvement needs to cross a critical threshold in order for optimization power to increase exponentially. This is because it is highly sensitive to the effective multiplication factor. In other words, there is “a huge difference between one self-improvement triggering 0.9994 further improvements on average, and one self-improvement triggering 1.0006 further improvements on average” (Yudkowsky, 2008a, p. 325). Yudkowsky (2008a) draws the analogy to a self-sustaining nuclear reaction, which requires an effective neutron multiplication factor equal to or greater than one. This means that on average, one or more neutrons from a fission reaction will go on to cause another fission reaction. Applying the analogy to AI, the critical threshold of recursive self-improvement may be approached very gradually as technology progresses, but once the threshold is crossed, the AI is likely to make a huge jump in intelligence.
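
To make the sensitivity to this multiplication factor concrete, here is a minimal numerical sketch (a toy branching-process calculation of my own, not taken from Yudkowsky's analysis) in which each improvement is assumed to trigger k further improvements on average; a subcritical cascade converges to a finite total, while a barely supercritical one compounds without bound:

```python
def total_improvements(k, generations=100_000):
    """Total follow-on improvements triggered by one initial improvement,
    assuming each improvement triggers k further improvements on average."""
    total, current = 0.0, 1.0
    for _ in range(generations):
        current *= k        # expected improvements in the next "generation"
        total += current
    return total

# Subcritical cascade (k < 1): converges to roughly k / (1 - k), about 1,666 here.
print(total_improvements(0.9994))
# Barely supercritical cascade (k > 1): keeps compounding without bound
# (already hundreds of millions after 20,000 generations, and still growing).
print(total_improvements(1.0006, generations=20_000))
```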

Yudkowsky (2013) discusses arguments in favor of a hard takeoff, which are mostly the same as those mentioned above. An additional reason worth pointing out is the “returns on unknown unknowns”. A superintelligence would presumably find and take advantage of useful possibilities that lie between the gaps in our knowledge, for example physical laws that we have yet to learn. It is not entirely unreasonable to expect that our current Standard Model of physics may underestimate the upper bound of what is actually possible, considering the analogy that our current capabilities are “magic” by the standards of the eleventh century. Unknown unknowns should generally shift our expectations of an intelligence explosion upwards (Yudkowsky, 2013).

It should be pointed out that the distinction between a hard takeoff and soft takeoff might be somewhat misleading, because a takeoff may be soft at first and then become hard later. The initial soft takeoff could cause a false sense of security, especially considering that AGIs themselves could contribute to the development process of ASI, and accelerate the rate of improvement by reducing the amount of human insight required. Although a hard takeoff would be the worst outcome because it gives us little time to prepare or intervene, a soft takeoff could also be dangerous because it allows for the creation of multiple competing AIs, which could result in those AIs least burdened by goals like “respect human values” prevailing. Therefore, approaches to ASI risk should ideally apply to both a hard and soft takeoff (Sotala & Yampolskiy, 2014).  

 4.  CONSEQUENCES OF SUPERINTELLIGENCE
Artificial superintelligence will theoretically be superior to humans in important areas like scientific discovery, resource harvesting, manufacturing, strategic action and social aptitude. Since intelligence can be applied to pursue any goal, more intelligence implies a greater ability to achieve one’s final goals. However, because machine superintelligences will want to preserve their initial goals whatever those are, they pose the threat of incidentally destroying humanity unless they are specifically programmed to preserve what we value. Therefore, as Muehlhauser and Salamon (2012) note:

“[W]e may find ourselves in a position analogous to that of the ape who watched as humans invented fire, farming, writing, science, guns and planes and then took over the planet. […] We would not be in a position to negotiate with them, just as neither chimpanzees nor dolphins are in a position to negotiate with humans.” (pp. 29-30).
Furthermore, since being destroyed would block the AIs from achieving their goals, it is possible that they might actively pretend to be safe until they become powerful enough to resist any threat. Human extinction may thus result from a broad range of AI designs that appear safe at first, but are in fact unsafe if we ever reach an intelligence explosion (Yudkowsky et al., 2010).

Presumably, an ASI would seek the most efficient way of manipulating the external world in order to achieve its goals. This may lead it to develop rapid infrastructure, such as molecular nanotechnology. The exact pathway by which this would happen is uncertain, but see Yudkowsky (2008a) for an imaginative example. The point is that once ASI possesses molecular nanotechnology, it would be able to manipulate the environment on a timescale orders of magnitude faster than human brains can think. It need not bother with the cliché “robot armies” of sci-fi when it could simply reorganize all the matter in the solar system in line with its goal. In the words of Yudkowsky (2008a, p. 333): “The AI neither hates you, nor loves you, but you are made out of atoms that it can use for something else.” Catastrophic consequences could result from an AI being programmed with seemingly innocuous goals, like maximizing the number of paperclips in its inventory (Nota, 2015) or solving the Riemann hypothesis (Muehlhauser & Bostrom, 2014).    

Yudkowsky (2013) summarizes the risk from superintelligence in four core theses:
·         The Intelligence Explosion Thesis: recursive self-improvement allows AI to grow in capability on a fast timescale, such that it may quickly be beyond the ability of humans to restrain or punish.
·         The Orthogonality Thesis: the vastness of mind-design space implies the potential existence of instrumentally rational agents that pursue nearly any utility function.
·         The Complexity of Value Thesis: human values, even in their idealized forms that consider what we would want if we had perfect knowledge, have high algorithmic complexity and cannot be reduced to a simple principle.
·         The Instrumental Convergence Thesis: certain resources like matter and energy, and certain strategies like self-preservation, are likely to be pursued by almost any instrumentally rational agent, and therefore an AI need not have the specific terminal value to inflict harm onto humans in order for humans to be harmed.

Just as technological revolutions of the past usually did not broadcast themselves to the people who were alive at the time, so too can we not rely on advance warning before the creation of AGI. The fact that the problem of AI safety is difficult and may require years of work means that we need to worry about it now. Moreover, the possibility of an intelligence explosion implies a high standard for AI researchers; it must not simply be assumed that programmers would be able to monitor or rewrite superintelligent machines against their will, nor that AIs would be unable to access their own source codes. The standard we should aim for is an AI that does not want to harm humanity in any context, and does not want to modify itself to no longer be “friendly” (Bostrom, 2003; Yudkowsky, 2008a).
  
4.1.           Achieving an intelligence explosion in a controlled way
Assuming that we could clearly define an AI’s utility function, how could we give it goals that are desirable from our perspective? In other words, how can we build a “Friendly AI” (FAI) – one with a stable and desirable utility function? This is, in fact, a very difficult task. The complexity and fragility of human values makes it difficult to specify what humans actually value (Muehlhauser & Salamon, 2012). According to Sotala and Yampolskiy (2014), all of our current moral theories would result in undesirable consequences if they were implemented by an ASI. Rather than providing it with a fixed set of imperatives, the solution to the problem of complex human values may entail programming the AI to learn human values, for example by observing humans and scanning our brains, asking questions, and so on (Yudkowsky et al., 2010).

According to Yudkowsky (2008a), attempts at Friendly AI may fail in two ways: technical failure and philosophical failure. Technical failure is when the AI does not work as its inventors intended it to. For example, it may appear to work within a confined context, but fail in a different one, despite executing its programmed code perfectly. Training an artificial neural network (ANN) is likely to produce this outcome, as there is a discrepancy in context between the developmental stage and post-developmental stage. Consider the following hypothetical scenario illustrated by Yudkowsky (2008a):

“Suppose we trained a neural network to recognize smiling human faces and distinguish them from frowning human faces. Would the network classify a tiny picture of a smiley-face into the same attractor as a smiling human face? If an AI ‘hard-wired’ to such code possessed the power… would the galaxy end up tiled with tiny molecular pictures of smiley-faces?” (p. 321).
Related to technical failure is philosophical failure, which is when someone builds the “wrong” AI, such that even if they succeed, they fail to benefit humanity. The example Yudkowsky (2008a) uses is programming an AI to implement communism, libertarianism, or any other political vision of utopia, thinking that it would make people happy when in reality that may turn out not to be the case. Programming an AI to implement communism would be short-sighted, because the inventor would be “programming in a fixed decision, without that decision being re-evaluable after acquiring improved empirical knowledge about the results of communism” (Yudkowsky, 2008a, p. 320).

Two alternatives to FAI are “Oracle AIs” and AIs under external constraints. Oracles are limited in cognitive ability such that they can only answer questions. External constraints for goal-directed AIs might include deterrence mechanisms, tripwires, and physical and software confinement. However, neither of these two approaches eliminates AI risk, because multiple independent teams might be trying to build their own AIs. Even if the first team is successful, other development teams may have less effective safety measures. The advantage of Friendly AI here is that it could actually be used to prevent the creation of unsafe AIs in addition to the other benefits of FAI, like making discoveries with important impacts on human welfare sooner than they otherwise would have been (Muehlhauser & Salamon, 2012). Hypothetically, a FAI equipped with molecular nanotechnology could solve any problem that is solvable either by moving atoms or by creative thinking (Yudkowsky, 2008a).

Whole brain emulations (WBE) have already been mentioned. One advantage of WBE is that we could potentially emulate the brains of individual humans whose motivations are known, thereby reducing uncertainty (Yudkowsky et al., 2010). However, Yudkowsky (2008a) is skeptical that WBEs with enhanced intelligence will be achievable before the first general AI. Doing so could require more computing power and knowledge about cognitive science than would be required for inventing AGI. First we would require the technology to produce extremely detailed neuron-by-neuron scans of the human brain, and then we would have to understand exactly how brain architecture translates into higher-level cognitive functions. Finally, we would have to figure out how to upgrade human intelligence without upsetting the delicate balance of neurotransmitters and possibly triggering mental illness (Yudkowsky, 2008a).

Oracle AIs, external constraints, and WBE will be further discussed when we examine some of the proposed responses to AI risk.

 5. DOING SOMETHING ABOUT AI RISK
5.1.           A strategic framework for solutions
Yudkowsky (2008a) classifies AI-risk mitigation strategies into three kinds, depending on the degree of cooperation required: unanimous cooperation, majority action, and local action. A strategy that depends on unanimous cooperation can easily be foiled by small groups or individual dissidents; such strategies are therefore practically unviable. However, a unanimous strategy would be required in the scenario that the development of superintelligence with the ability and motive to destroy humanity cannot be stopped by any Friendly AI. If we cannot find a permanent solution to the AI safety problem, and assuming unanimous strategies to be unworkable, then “we are doomed.” (Yudkowsky, 2008a, p. 337). Fortunately, local and majoritarian strategies may still be feasible.

Strategies that require majoritarian action may work in the long term. Such strategies entail that the majority of people (but not all) cooperate, for example most voters or parliamentarians within a country, or most countries within the UN. Global political change, however, is likely to be difficult, as it will require building up a movement with international political and social legitimacy. Although there might be defectors who may do some damage, a majoritarian strategy could prevent global catastrophe if the first AI built cannot by itself cause a catastrophe, and if most AIs are Friendly and can protect us against rogue AIs. For example, legislation could help ensure that a majority of artificial intelligences built would be FAIs by requiring researchers to publicize their strategies or by penalizing researchers who create unsafe AIs (Yudkowsky, 2008a).

Local strategies are those that require “a concentration of will, talent, and funding which overcomes the threshold of some specific task.” (Yudkowsky, 2008a, p. 334). Acquiring a large amount of funding for a project might plausibly be easier than trying to push through a global political change. A local strategy may be effective, assuming that the first AI built cannot by itself cause a catastrophe, and that a single FAI together with human institutions can defend against any number of hostile AIs. In this scenario we would need to distinguish FAI from non-FAI, and grant the FAI revocable institutional power (Yudkowsky, 2008a).

Whether to choose a local or majoritarian strategy depends on the speed of the intelligence explosion. Assuming that rapid, sharp jumps in intelligence are possible and likely, there will be a first-mover effect to the first AI that crosses the key thresholds of recursive self-improvement, rapid nanotechnology infrastructure, or absorption of a unique resource like the internet. The first mover determines the outcome for Earth-originating intelligent life, because no other AI will be able to catch up. Therefore, the first AI must be Friendly. It is in principle possible for this to be achieved by local effort, whereas a majoritarian strategy could be used if there is no decisive first-mover effect (Yudkowsky, 2008a).

Sotala and Yampolskiy (2014) classify responses to catastrophic AI risk into two main camps: proposals for societal action, and proposals for AGI design. Proposals for AGI design can be broken down further into proposals for external constraints on AGI behavior, and recommendations for internal constraints on AGI behavior. There is some overlap in these categories; for example, societal proposals include regulating research, and research targets could include the most effective ways of imposing external or internal constraints on AI. Nevertheless, this paper will review the proposals highlighted as the most promising in the views of Sotala and Yampolskiy (2014). These are: regulating research, enhancing human capabilities, AGI confinement, Oracle AI, motivational weaknesses, value learning, human-like AGI, and formal verification (Sotala & Yampolskiy, 2014).  

5.2.            Societal proposals
5.2.1.  Research and differential technological development
Various authors have called for international regulation of AGI (see Sotala & Yampolskiy, 2014). Regulation may be implemented in various forms. One proposal is to set up review boards to evaluate potential AGI research, similar to how review boards at universities oversee research in social and medical sciences[6]. If the board finds the research to be too risky, it could be partially or completely banned. Other measures, like funding limitations, supervision, or encouragement of safety research, can also be used. Another proposal is to explicitly fund research projects aimed at creating FAI. A third proposal is differential technological progress (see below), which could combine government funding with review boards (Sotala & Yampolskiy, 2014). Nota (2015) recommends establishing public funding of research into AGI risk and FAI, establishing an international agency devoted to the safe development of AGI, and encouraging continued independent efforts into research and awareness.

In order to control the intelligence explosion to our benefit, we would need differential technological progress. This entails slowing down development of dangerous technologies, and accelerating the implementation of beneficial technologies like Friendly AI. However, creating the first FAI would require us to solve certain problems in decision theory and value theory. Hence, Muehlhauser and Salamon (2012) recommend a program of “differential intellectual progress”, which entails prioritizing risk-reducing intellectual progress over risk-increasing intellectual progress. Differential technological progress can thus be seen as a component of differential intellectual progress. Russell, Dewey and Tegmark (2015) compiled a list of research priorities intended to maximize the benefits of artificial intelligence and avoid the pitfalls. These include, for the short-term:
·         Researching the economic impacts of AI, for example its effects on employment and inequality;
·         Researching the legal and ethical implications of AI, for example the liability of self-driving cars, how to control autonomous weapons, and the preservation of privacy in the context of big data; and
·         Ensuring the robustness of AI through computer science research, for example in the areas of verification (‘can we prove that the system satisfies a desired set of formal properties?’), validity (‘can we ensure that the system has no unwanted behaviors and consequences?’), security (‘how can we prevent manipulation by unauthorized parties?’) and control (‘how can we maintain meaningful human control over an AI system after it begins to operate?’).
In the long-term, research priorities include:
·         Verifying the safety properties, especially for systems capable of recursive self-improvement, modification or extension;
·         Ensuring the validity of machine learning methods across many domains, and solving problems in the foundations of reasoning and decision-making;
·         Pursuing effective defense against cyberattacks by advanced AI, and creating “containers” for AI systems that could have undesirable behaviors and consequences in less controlled environments; and
·         Ensuring that humans can maintain meaningful control, and designing utility functions in such a way that the AI will not try to avoid being shut down or repurposed (Russell, Dewey, & Tegmark, 2015).
Rigorous analysis using tools from game theory, economics, evolutionary analysis, or computer security could help researchers build models of AI risks and of AI growth trajectories, which in turn may help channel funding and human capital to areas with the greatest potential benefits (Yudkowsky et al., 2010).
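
As a toy illustration of how game-theoretic tools can model such risks, the sketch below sets up a hypothetical two-developer "race" game (the payoff numbers are invented purely for illustration, not drawn from the cited authors); each team is individually tempted to rush development even though mutual caution yields a better joint outcome:

```python
import itertools

# Hypothetical two-developer "race" game: each team chooses to develop cautiously
# or to rush. Rushing improves the chance of winning the race but raises the risk
# of an unsafe AI, which hurts everyone. Payoff numbers are invented.
ACTIONS = ("cautious", "rush")
PAYOFFS = {  # (row action, column action) -> (row payoff, column payoff)
    ("cautious", "cautious"): (3, 3),
    ("cautious", "rush"):     (0, 4),
    ("rush",     "cautious"): (4, 0),
    ("rush",     "rush"):     (1, 1),
}

def best_responses(opponent_action, player):
    """Actions that maximize this player's payoff against a fixed opponent action."""
    def payoff(action):
        profile = (action, opponent_action) if player == 0 else (opponent_action, action)
        return PAYOFFS[profile][player]
    best = max(payoff(a) for a in ACTIONS)
    return {a for a in ACTIONS if payoff(a) == best}

# A pure-strategy Nash equilibrium is a profile where each action is a best response.
equilibria = [(a0, a1) for a0, a1 in itertools.product(ACTIONS, repeat=2)
              if a0 in best_responses(a1, 0) and a1 in best_responses(a0, 1)]
print(equilibria)   # [('rush', 'rush')] – both race, even though mutual caution pays more
```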

Yudkowsky (2008a) states that accelerating the pace of development of a desirable technology (like FAI) is a local strategy, while slowing down progress in dangerous technologies is a difficult majoritarian strategy and halting or relinquishing the technology tends to be an impossible unanimous strategy. He suggests that we ask ourselves which technologies we might want developed before or after one another. For instance, it may be the case that successful AI could help us in dealing with risks arising from nanotechnology, whereas it seems unlikely that nanotechnology would help with developing a Friendly AI (Yudkowsky, 2008a).

Sotala and Yampolskiy (2014) point out that AGI regulation can only work on a global scale, which entails both international agreement on legislation, and the ability to enforce legislation within each country. One proposal is to use narrow AI or early-stage AGI designs for international mass surveillance and wiretapping, in order to catch anyone who was trying to develop more advanced or dangerous AGI designs, and to act as a neutral inspector which reports treaty violations. This way, AGI itself may help make international cooperation more feasible (Sotala & Yampolskiy, 2014).

Shulman (2009) discusses an AI “arms race” as an analogy to the nuclear arms race of the Cold War. He gives three reasons why the case of AI is different from the historical experience of nuclear weapons. Firstly, there is an unusually high risk of competitive development of machine intelligence, which poses great potential harm to all competitors if safety is sacrificed for speed. Secondly, an intelligence explosion may have enormous first-mover or winner-takes-all effects, such that the first nation to develop an ASI could acquire unchallenged economic and military dominance. Finally, as already mentioned, AGI technologies can be applied to the enforcement of international regulations and agreements, for example through wide-scale mass surveillance or lie detection. In combination, these three reasons suggest that there would be an unprecedented incentive for global regulation (Shulman, 2009). However, large-scale surveillance efforts may suffer from ethical issues or political unpopularity (Sotala & Yampolskiy, 2014).

Wilson (2013) proposes a model framework for an international treaty to mitigate global catastrophic risk and existential risk emanating from emerging technologies, such as bioengineering, nanotechnology, and artificial intelligence. These three technologies could interact and overlap in many ways, thus a single treaty covering all of them might make sense. In order for the treaty to be effective, all nations must agree to it. The treaty could be concluded under various existing international governmental organizations, such as the United Nations, the World Health Organization, or the OECD. Regulatory mechanisms would include the precautionary principle, a body of experts, a review mechanism, public participation and access to information, binding reforms for scientists, laboratory safeguards, and oversight of scientific publications (Wilson, 2013). Nota (2015) draws a comparison to the IAEA (International Atomic Energy Agency) statute, which promotes the peaceful use of nuclear energy.

Sotala and Yampolskiy (2014) are generally supportive of efforts to regulate research, but point out a number of obstacles to practical feasibility. Legislation is often affected by lobbyists and special interest groups, and these might have economic motivations for accelerating or hindering AGI development. Another challenge is that legislation tends to lag behind technological development, so regulating emerging applications of AI (such as autonomous weapons or high-frequency trading algorithms) may not translate into regulating basic research into AGI. Finally, regulations and treaties restricting AGI development may be difficult to verify and enforce (Sotala & Yampolskiy, 2014). The latter point is especially salient, given the competitive advantages of superintelligence plus its potential benefits to humanity, like eradicating diseases and averting long-term nuclear risks. Nevertheless, international cooperation regarding AI development and avoiding an AI arms race may still improve the odds of a positive future (Yudkowsky et al., 2010).

5.2.2.  Enhancing human capabilities by merging with machines
Cognitive enhancement of humans can reduce the gap between humans and AGI, and it would enable us to deal with difficult problems more effectively. On the other hand, this capacity might also make it easier to develop dangerous technologies, therefore moral enhancement may also be desirable (Sotala & Yampolskiy, 2014). Persson and Savulescu (2008) speculate that moral enhancement is possible through biomedical and genetic means, for example altering the levels of monoamine oxidase (MAO) or oxytocin to reduce criminal behavior and promote trust. Human enhancement could occur through the mechanism of whole brain emulation, also known as mind uploading (Sotala & Yampolskiy, 2014). Some authors consider WBE more feasible and reliable than creating a safe AGI ab initio, but as we have already seen, Yudkowsky (2008a) challenges this notion. Others advocate for “cyborgization” by linking AGI directly to human brains, such that the two become one entity; alternatively, the AGI could extrapolate and learn the human value system directly from an upload’s brain (Sotala & Yampolskiy, 2014).

There are various issues (in addition to those mentioned in section 4.1) surrounding the uploading of human minds to computers. For one, there is the question of whether we would still remain human if our brains are uploaded and augmented with electronic modules. Since computers have far more potential processing power than the human brain, the biological component of a human-machine hybrid might be insignificant, or the computer might change the functioning of the brain. There is also the possibility of losing some of our cherished human values through certain evolutionary dynamics. For example, uploads could copy themselves to gain a competitive advantage, or they might benefit from coalescing with several other minds via artificial connections, and there is likely to be a resultant pressure to do so. The outcome would be a change in our sense of personal identity and even a blurred distinction between individual minds. We may cease to enjoy things like humor, love, parenting and literature, because they would no longer be evolutionarily adaptive capabilities. Further, in a society of WBEs, minds would be vulnerable to cyberattacks which could be fatal (Sotala & Yampolskiy, 2014)[7].

In order for WBE to be a viable strategy, it must be invented before AGI. Whether it would be easier to copy the human brain or to invent a new intelligence from scratch is up for debate. Even if uploads come before AGI, they may not necessarily be easy to modify, and therefore may not be able to take full advantage of being digital. Moreover, in the long-term AI could still eventually outperform any WBE, so uploading would only be a short-term solution. However, if WBE is invented before AGI, then it might be possible to control AI technology for a while and develop safe AGI using accelerated research (Sotala & Yampolskiy, 2014).

5.2.3.  Less realistic solutions
There are other societal proposals that Sotala and Yampolskiy (2014) find inadequate. These include doing nothing, integrating AGIs with society, and relinquishing research into AGI.

Those who advocate doing nothing about AI risk offer three main reasons. The first is that AGI lies too far in the future to be worth our attention, and that it distracts our concern away from more immediate and likely threats. The second is that even if AGI is developed, it will not pose a large threat: AGI may not acquire any real authority, it may not have any particular desire to act against humans, or superintelligence may simply not be possible. The third reason is that artificial superintelligence will be more intelligent[8] than humans in every way, and that it would therefore be better for it to replace us. However, these reasons do not hold up to scrutiny. Firstly, the probability of AGI being developed within the next hundred years seems higher than that of an asteroid impact destroying civilization in the same period, and given that we are unsure of how much progress will be needed for a safe AGI, it seems reasonable to begin researching the topic well before AGI is near.

Secondly, we have already seen reasons why AIs might have motivations to act against humanity, unless they are designed the right way. It should also be pointed out that in order to outperform humans and potentially be dangerous, an AGI does not necessarily need to be a lot more intelligent than humans[9]. Finally, there are many things that we value for their own sake that would be lost upon human extinction[10] (Nota, 2015; Sotala & Yampolskiy, 2014). Muehlhauser and Bostrom (2014) point out that more intelligence does not necessarily imply more morality, thus the AI’s values are not necessarily superior to our own.

Integration proposals acknowledge the risk of AGI and argue that the best way of dealing with it is to prepare our societal structures to handle AGIs once they arrive. For example, one idea is to create a legal framework that is mutually beneficial for both humans and AGIs, and implement economic controls to uphold principles like property rights and reciprocity. The assumption here is that AGIs would benefit more by trading with us than by doing us harm. A different idea is to foster positive values in society, for example through moral enhancement, so that AGIs would be more likely to be developed with positive values in mind. In general, integration proposals suffer from the fact that there is no way to guarantee that an un-Friendly AI would agree with human values – thus there are still risks like the AI eliminating its human trading partners to replace them with better trading partners, or creating technological unemployment, or engaging in fraudulent behavior, or subtly manipulating human values. Therefore, it seems preferable to focus on developing FAI (Sotala & Yampolskiy, 2014).

At the opposite extreme from doing nothing is the proposal to abandon the technological development that could lead to AGI, usually motivated by a belief that the risks are unacceptable. Relinquishment could take the form of outlawing AGI, or restricting the development of more powerful hardware. However, if a single country bans a certain technology, the technology will tend to move to other countries. Moreover, relinquishment suffers from the same challenges as regulation, but to a greater extent. There is also no historical experience of a multipurpose technology like AGI being successfully and permanently relinquished (Nota, 2015; Sotala & Yampolskiy, 2014).

5.3.           AI design proposals
Design proposals come in two kinds: external constraints on AGI behavior, which are restrictions imposed on the AI from the outside, aiming to limit its ability to do damage; and internal constraints on AGI behavior, which entail that they are designed either to act in a way that is beneficial to humanity, or to be easier to control via external means (Sotala & Yampolskiy, 2014).

5.3.1.  External constraints
The idea of confining or “boxing” an AGI is to restrict it to a specific, confined environment from which it cannot communicate with the external world without authorization. Communication with the AGI could be controlled by means of “safe questions”, which are questions for which humans could, in principle, find an answer without the input of a superintelligence (thus the AI is simply speeding up the discovery process). Another approach is to confine the AGI to virtual worlds so that it cannot directly influence our world, but this approach seems flawed, because in order for the AGI to be of any use to us, we must at the very least be able to observe it, which makes it possible for some information to leak out. Other proposals include wiping the AI’s memory after every job, and using various checks and balances like test questions or “honeypots”[11] to establish its trustworthiness (Sotala & Yampolskiy, 2014).

The weakness of AI-boxing techniques is that they are unlikely to work for AIs that are significantly smarter than humans, so their use is limited predominantly to less advanced AGIs. A superintelligence would presumably find ways to avoid being shut down or reset, and would attempt to escape confinement, for example by means of social manipulation, using knowledge about human behavior and psychology. Even if the AI remains confined, it could still achieve enough influence that we would become dependent on it, which would make it hard to modify or reset. Relying solely on confinement would be a risky strategy, but confinement techniques are still worth developing as a first line of defense for AGIs that are limited in intelligence (Sotala & Yampolskiy, 2014). Muehlhauser and Bostrom (2014) echo this view, saying that confinement may be useful as an extra precaution during the development phase of FAI.

Other external constraints include AGI enforcement and the simulation argument attack. AGI enforcement proposes that a system of AGIs police non-safe AGIs of equal intelligence and stop them if they behave against human interests. However, this approach relies on most AGIs being safe. Therefore, the focus should really be on developing successful internal constraints, which will be discussed shortly. The simulation argument attack aims to convince an AGI that it is running on a computer simulation, and that it would be turned off if it displays non-safe behavior. However, this would only work if the AGI cares about simulations and assigns a nontrivial probability to being shut down. Thus the proposal seems to be of limited value, except against very specific AGIs (Sotala & Yampolskiy, 2014).

5.3.2.  Internal constraints
One type of AI design only answers questions and performs no other actions; this is known as an “Oracle AI”. An Oracle AI can be further restricted to only calculate what is asked of it, and to make no goal-directed decisions of its own – essentially being a “Tool-Oracle AI”. These could be used to solve problems relating to the development of more intelligent but still safe AGIs. Like external constraints, Oracle AI could be used in the medium-term as a stepping stone toward friendly general AI, but does not eliminate AI risk by itself. This is because an Oracle AI might potentially be turned into a free-acting AGI, as many stakeholders would have the incentive to do so – for example, militaries or high-frequency trading firms may decide to grant their AI systems increased autonomy[12]. Furthermore, we must be wary not to imbue Oracles with unquestioned authority, because it is possible that they might provide advice that goes against human values (Sotala & Yampolskiy, 2014).

In order to make AGIs easier to control using external constraints, they could be programmed with various motivational weaknesses. For example, they could be programmed with a high discount rate, so that they value short-term goals and gains more than long-term ones, thereby becoming more predictable. They could be given goals that are easy to satisfy, and provided with near-maximum reward simply for not misbehaving. AGIs could also be designed to have a calculated indifference regarding certain events, such as the detonation of explosives attached to their hardware. They could be programmed to follow a number of safety restrictions, like a limit on how fast they could self-improve. Finally, we could design AGIs using a new “legal machine language”, which would formally specify allowed and disallowed actions and could be modified by government legislature. Similar to external constraints, these motivational weaknesses do not seem failsafe, especially in the case of ASI, and they are also vulnerable to arms race scenarios. Nevertheless, to some extent they might still be useful and worth studying (Sotala & Yampolskiy, 2014).
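
As a small numerical sketch of the first of these ideas (the numbers are purely illustrative), the snippet below shows how a heavily discounting agent prefers a modest immediate reward to a much larger delayed one, which is what makes its behavior more short-sighted and hence more predictable:

```python
def discounted_value(rewards, gamma):
    """Present value of a reward stream under exponential discounting."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

short_term = [10, 0, 0, 0, 0, 0]    # modest payoff right away
long_term  = [0, 0, 0, 0, 0, 100]   # large payoff after five time steps

for gamma in (0.95, 0.30):          # patient agent vs. heavily discounting agent
    print(gamma,
          round(discounted_value(short_term, gamma), 2),
          round(discounted_value(long_term, gamma), 2))
# gamma = 0.95: 10.0 vs. 77.38 – the patient agent holds out for the big payoff
# gamma = 0.30: 10.0 vs. 0.24  – the myopic agent grabs the immediate reward
```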

All the proposals reviewed so far seem to be most effective in the medium-term. In the long-term, however, the only guaranteed way to prevent a catastrophic AI scenario is to create a Friendly AI that shares our values. To this end, the most promising approaches appear to be value extrapolation approaches and human-like architectures (Sotala & Yampolskiy, 2014).

Value learning is a type of “top-down” approach to AGI design; that is, it takes a specific ethical theory and attempts to derive a system architecture capable of implementing that ethical theory. In the case of value learning, the AGI is provided with a set of possible utility functions that humans may have, and it then attempts to find the utility functions that match human preferences the best. The difficulty lies in the fact that human desires can change over time and violate certain axioms of utility theory. In addition, we also have second-order preferences about what we feel we ought to desire and do. There have been numerous proposals for value learning; one such approach is “Coherent Extrapolated Volition” (CEV), formulated by Eliezer Yudkowsky. AGIs designed with CEV act according to humanity’s extrapolated values (Sotala & Yampolskiy, 2014). According to Yudkowsky (2004), humanity’s CEV is:

“our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.” (p. 6).
Although there are questions about the computational tractability of CEV in modelling the values of all humans and the evolution of those values, value learning seems to be the only approach that properly accounts for the complexity-of-value thesis. Therefore, if it could be made to work, it seems to be the most reliable and most important approach to pursue (Sotala & Yampolskiy, 2014).
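
To make the value learning idea described above more concrete, the following is a deliberately simplified sketch – my own, not a proposal from the literature – of Bayesian updating over a small set of hypothetical candidate utility functions, based on observed choices. Actual proposals such as CEV are of course vastly more demanding.

```python
# Toy value-learning sketch: maintain a posterior over candidate utility
# functions and update it from observed human choices. Purely illustrative.

import math

options = ["tea", "coffee", "nothing"]

# Hypothetical candidate utility functions over the options.
candidates = {
    "likes_tea":    {"tea": 2.0, "coffee": 0.5, "nothing": 0.0},
    "likes_coffee": {"tea": 0.5, "coffee": 2.0, "nothing": 0.0},
    "ascetic":      {"tea": 0.0, "coffee": 0.0, "nothing": 1.0},
}

posterior = {name: 1.0 / len(candidates) for name in candidates}  # uniform prior

def choice_likelihood(utility, chosen):
    """Softmax ('noisily rational') probability of the observed choice."""
    exps = {o: math.exp(utility[o]) for o in options}
    return exps[chosen] / sum(exps.values())

def update(chosen):
    for name, utility in candidates.items():
        posterior[name] *= choice_likelihood(utility, chosen)
    total = sum(posterior.values())
    for name in posterior:
        posterior[name] /= total

for observed in ["tea", "tea", "coffee", "tea"]:   # observed human behaviour
    update(observed)

print(posterior)   # probability mass concentrates on "likes_tea"
```

The difficulties noted in the text show up immediately even at this scale: real human choices are inconsistent, change over time, and are shaped by second-order preferences that a fixed candidate set cannot capture.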

An alternative to value learning that may be somewhat less reliable but possibly easier to build is the human-like AGI approach. This approach involves building AGI that can learn human values by virtue of being similar to humans, using mechanisms copied from the human brain. Human-like AGI is a hybrid between top-down and bottom-up approaches[13]. Methods could include artificial neural nets, learning through memory systems, studying the results of human evolution, or the LIDA cognitive architecture. AGI could remain relatively safe if its reasoning process can be made sufficiently human-like. However, this approach is not without flaws. Firstly, it is uncertain what degree of likeness would be enough, considering that small deviations from the norm could produce values that most humans would find undesirable. Even if an AGI could reason in a human-like manner, we are still faced with the problems that some humans hold ethical beliefs most do not share (such as advocating for human extinction[14]) and that humans have historically abused power. Secondly, human-like architectures might be outcompeted by less human, more easily modifiable and more efficient architectures (Sotala & Yampolskiy, 2014).

Whatever approach is used, it seems prudent to invest in better formal verification methods to improve the safety of self-improving AGIs. Formal verification involves proving that specific properties hold for a given algorithm. In the case of AGI, the relevant properties are friendly goals and values that remain stable even as the AGI undergoes modification. For example, formal proofs could be required before the AGI is allowed to self-modify, and this would include proving that the verify-before-modification property itself will be preserved. Formal verification could also be integrated into Tool-Oracle AI and motivational weaknesses approaches. However, as Sotala and Yampolskiy (2014, p. 24) note: “Proven theorems are only as good as their assumptions, so formal verification requires good models of the AGI hardware and software.” Whether it is even possible is an open question, but since it is the only way of being confident that an AI is safe, we should at least attempt it (Sotala & Yampolskiy, 2014).
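
The verify-before-modification idea can be caricatured in a few lines of code. The sketch below is purely illustrative: it uses a runtime check as a stand-in for a machine-checked proof, all names are hypothetical, and the essential (and hard) part it gestures at is that the preserved property must include the verification requirement itself.

```python
# Highly simplified stand-in for "verify before self-modification".
# A real system would require a formal, machine-checked proof, not a check
# at runtime, and a good model of the hardware and software being verified.

GOAL = "assist_humans"

def preserves_goal_and_gate(proposed: dict) -> bool:
    """Stand-in for a proof checker: the proposed successor must keep both
    the goal and the verify-before-modify gate."""
    return proposed.get("goal") == GOAL and proposed.get("requires_verification", False)

def self_modify(current: dict, proposed: dict) -> dict:
    if preserves_goal_and_gate(proposed):
        return proposed            # modification allowed
    return current                 # rejected: property could not be established

agent = {"goal": GOAL, "requires_verification": True, "version": 1}
agent = self_modify(agent, {"goal": GOAL, "requires_verification": True, "version": 2})
agent = self_modify(agent, {"goal": "maximise_reward", "requires_verification": False, "version": 3})
print(agent["version"])   # 2 -- the unsafe proposal was rejected
```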

5.3.3.  Less realistic FAI designs
Some authors like Ben Goertzel have proposed an “AGI Nanny”, which is slightly more intelligent than humans and is meant to protect us from various dangers, including more advanced AI (Sotala & Yampolskiy, 2014). The AGI Nanny would have temporary power until a more advanced FAI is invented. In theory, this proposal has promise if it can be made to work, but it is questionable whether that is the case. A constrained AGI Nanny may be harder to create than a self-improving AGI, considering its need for precisely specified goals (Sotala & Yampolskiy, 2014). A free-acting FAI may be a more worthwhile target.

There are various top-down and bottom-up approaches that seem less promising than value learning or human-like architectures. Isaac Asimov’s “Three Laws of Robotics” are well-known by the public, but have received strong criticism. The Three Laws (or four in a later version) have been deemed too ambiguous to implement, or contradictory to each other in many situations. Ethical theories like Kant’s categorical imperative or classical utilitarianism have been suggested, but authors tend to conclude that both theories are problematic for AGI, and may not capture all human values. Ben Goertzel proposed a “Principle of Voluntary Joyous Growth”, which is defined as maximizing happiness, growth, and choice, but similar issues apply (Sotala & Yampolskiy, 2014).

In terms of bottom-up approaches, proposed techniques include artificial evolution and reinforcement learning. “Evolved morality” involves the creation of AGIs through algorithmic evolution, selecting for the most intelligent and most moral algorithms. Evolutionary invariants are evolutionarily stable strategies that could survive in a competitive context but also make the AGI treat humans well. However, the most evolutionarily rational strategy might well turn out to be something like Machiavellian tit-for-tat, which means acting selfishly when one can get away with it and cultivating a façade of cooperation. Reinforcement learning entails rewards and punishments for the AGI making appropriate or inappropriate moral choices. In general, these approaches may work for an AGI in a test environment, but that does not guarantee that its behavior will be predictable in other contexts. Moreover, they seem incapable of preserving the complexity of value, and there is a possibility that the AGI could modify the constraints of its environment against human will (Sotala & Yampolskiy, 2014).
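
As an illustration of the reinforcement learning idea (my own toy example, not from the cited survey), an agent can easily be trained to prefer “moral” actions in a test environment, while nothing in the training procedure constrains its behavior in situations the environment never presented.

```python
# Toy reward-shaping illustration: reward "moral" choices during training.
# Purely illustrative; action names and rewards are arbitrary.

import random

random.seed(0)
actions = ["help", "ignore", "deceive"]
reward_in_training = {"help": 1.0, "ignore": 0.0, "deceive": -1.0}

q = {a: 0.0 for a in actions}       # action-value estimates
counts = {a: 0 for a in actions}

for _ in range(1000):
    a = random.choice(actions)                      # explore uniformly
    r = reward_in_training[a] + random.gauss(0, 0.1)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]                  # incremental mean update

print(max(q, key=q.get))   # "help" -- looks well-behaved in the test environment

# The caveat from the text: nothing here says anything about situations the
# training environment never presented, so good behaviour during testing does
# not guarantee good behaviour elsewhere, nor does it capture complex values.
```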

5.4.           Further considerations
Dewey (2016) points out that even if we successfully invent Friendly AI, the existential risk is not necessarily eliminated, because there could be multiple subsequent projects developing AGI, and each one of them would need to be safe. Dewey (2016) describes the existential risk from a fast takeoff scenario with his “exponential decay model” – essentially, the probability of human survival decreases over time with each new AI project, even if each project has a small chance of catastrophic failure. Dewey (2016) proposes four kinds of strategies that could plausibly end the period of existential risk. The first is international coordination in the form of a convention or authoritative body. The second is to invent a kind of sovereign or free-acting AI that would prevent unsafe takeoffs. A third strategy is an AI-empowered project, which would use some non-sovereign AI system (perhaps an Oracle AI) to gain a decisive advantage. Finally, other technologies like WBE or nanotechnology could be exploited by private or public projects in order to gain a decisive strategic advantage and find and stop unsafe AI development around the world (Dewey, 2016). These proposals resemble those by Sotala and Yampolskiy (2014) and Wilson (2013), with the key difference being that Dewey’s (2016) proposals also apply to the long-term. 
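
The shape of the exponential decay argument can be illustrated with arbitrary numbers; the per-project failure probability below is purely hypothetical.

```python
# Back-of-the-envelope reading of Dewey's (2016) "exponential decay" point:
# if each subsequent AGI project independently carries even a small chance p
# of catastrophic failure, survival probability falls geometrically with the
# number of projects. Numbers chosen only to show the shape of the argument.

p = 0.01                                  # assumed per-project failure chance
for n in (1, 10, 50, 100, 500):
    survival = (1 - p) ** n               # probability that all n projects go well
    print(n, round(survival, 3))

# 1 -> 0.99, 10 -> 0.904, 50 -> 0.605, 100 -> 0.366, 500 -> 0.007:
# even rare failures compound, which is why one safe project does not by
# itself end the period of existential risk.
```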
  
Another consideration when researching and discussing AI risk or other catastrophic risks is to be aware of the distorting effects of cognitive bias (Yudkowsky, 2008a). Such biases and heuristics can potentially influence the assessment of catastrophic risks, and so they are vital knowledge for scholars in the field (Yudkowsky, 2008b). Yudkowsky (2008b) lists ten biases and related phenomena: the availability heuristic, hindsight bias, black swans, the conjunction fallacy, confirmation bias, contamination effects, the affect heuristic, scope neglect, overconfidence, and bystander apathy. These will not be elaborated on here; the reader is referred to the source. Yudkowsky (2008b) also observes other harmful modes of thinking:

“The Spanish flu of 1918 killed 25-50 million people. World War II killed 60 million people. 10⁷ is the order of the largest catastrophes in humanity’s written history. Substantially larger numbers, such as 500 million deaths, and especially qualitatively different scenarios such as the extinction of the entire human species, seem to trigger a different mode of thinking – enter into a ‘separate magisterium.’ People who would never dream of hurting a child hear of an existential risk, and say, ‘Well, maybe the human species doesn’t really deserve to survive.’” (p. 114, italics in original).
Bostrom (2002) also discusses a “good story bias”, which refers to the phenomenon of dismissing “boring” scenarios as more unlikely than they actually are, and overestimating the likelihood of scenarios that are familiar simply by virtue of being good stories which feature prominently in movies, TV shows and novels.

 6.  EXISTENTIAL RISK PREVENTION AS A GLOBAL PUBLIC GOOD
Public goods are those for which consumption is non-rivalrous and non-excludable; i.e. consumption by one individual does not reduce availability for others, and others cannot be excluded from using the good (Kaul, Grunberg, & Stern, 1999). Global public goods extend this concept across countries, peoples and even generations – their benefits are thus “quasi universal”. Examples include the environment, peace, and global financial stability. The nature of public goods means that they are susceptible to the “free-rider” problem: people may access the good without paying for it, which could lead to disproportionate costs for the producer of the public good and leave the good undersupplied by the market – which is why public goods are often cited as a type of market failure. Market failures are typically addressed by government intervention (Kaul et al., 1999). It may be argued that, like climate change prevention, the prevention of catastrophic risk from AI and other technologies is an intergenerational global public good: it benefits everyone, yet few want to pay for it. Dewey (2016) supports the view that existential risk prevention is a global public good, and that international coordination therefore makes sense. Matheny (2007, p. 1341) writes that “extinction risks are market failures where an individual enjoys no perceptible benefit from his or her investment in risk reduction. Human survival may thus be a good requiring deliberate policies to protect.”

Shulman (2009) claims that the development of AGI could result in a collective action problem, because the benefits of early AGI would be locally concentrated, whereas the risks would be global. Since there may be a trade-off between fast development and safe development, it would be beneficial for every team to commit to a collective arrangement wherein all slow down research and take safety measures, provided compliance could be verified. However, if compliance cannot be assured, each team will likely aim to develop AGI before the others, potentially sparking an AI arms race (Shulman, 2009). Hypothetically, compliance could be enforced through a global surveillance system (Dewey, 2016).
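
The underlying game-theoretic structure can be illustrated with a toy payoff matrix (my own arbitrary numbers, not taken from Shulman): when compliance cannot be verified, racing ahead is the individually tempting choice even though mutual restraint is jointly better.

```python
# Toy payoff matrix for the collective action problem: two teams each choose
# to develop "safe" (slow) or "fast" (risky). Numbers are purely illustrative.

# payoffs[(my_choice, their_choice)] = my payoff
payoffs = {
    ("safe", "safe"): 3,   # shared, safe benefits
    ("safe", "fast"): 0,   # I lose the race
    ("fast", "safe"): 4,   # I win the race, but add global risk
    ("fast", "fast"): 1,   # arms race: both worse off than under (safe, safe)
}

for their_choice in ("safe", "fast"):
    best = max(("safe", "fast"), key=lambda mine: payoffs[(mine, their_choice)])
    print(f"if the other team plays {their_choice}, my best reply is {best}")

# Both lines print "fast": racing is the dominant strategy in this toy game,
# which is why verification or enforcement is needed to sustain the safe outcome.
```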

Dewey (2016) finds that governments may have favorable abilities and incentives to mitigate existential ASI risk, and that “purely commercial incentives are probably not sufficient to suppress unsafe AI projects.” (p. 5). For example, governments may recognize the threat posed by ASI to their national interests, including the fact that there may not be any good defense against a powerful ASI. This would give them an incentive to coordinate with other governments, or to mandate nationalized AI projects. Government-sponsored projects may have advantages such as enjoying a legal monopoly, being less sensitive to commercial incentives, and being more easily influenced to adopt control and safety policies. Through international agreements or within their own jurisdictions, governments may restrict, regulate, or monitor AI research, and they might be more likely than commercial projects to cooperate globally and to promote some form of public good. However, treaties may be subject to future political changes, so their effectiveness may decline over time (Dewey, 2016). Nevertheless, it seems that there is both a need for governments to intervene and the capacity for them to do so.

 7.  CURRENT ACTIONS
As mentioned, there is currently no substantive public policy in place, and most research into FAI is being carried out by a small group of non-profit organizations (Nota, 2015). These organizations include the Machine Intelligence Research Institute (MIRI), the Future of Humanity Institute (FHI), the Centre for the Study of Existential Risk (CSER), the Future of Life Institute (FLI), and the Association for the Advancement of Artificial Intelligence (AAAI). These organizations, which rely mostly on private donations and to a much lesser extent public grants, either raise awareness of AI risk or engage directly in research into the potential risks and solutions (Nota, 2015). According to Muehlhauser and Bostrom (2014), MIRI and FHI engage in raising awareness in order to attract financial and human capital with which to tackle the problem, in addition to strategic and technical research. Strategic research asks questions like which types of technological development are risk-increasing or risk-decreasing, or whether we can develop useful technological forecasting methods. Technical research asks how we can extract a coherent utility function from inconsistent human behavior, or whether we can develop safe confinement methods for powerful AI. Among the most comprehensive works to date on the existential risk posed by ASI is Nick Bostrom’s Superintelligence: Paths, Dangers, Strategies (see Bostrom, 2014). However, as Nota (2015) points out, it seems questionable to rely on private donations to protect the future of humanity, especially given that the research has public benefits. As argued in the previous section, the private sector is unlikely to produce an optimal level of such research or global coordination, and thus international government intervention is a logical step forward.

 8.  DISCUSSION AND CONCLUSIONS
“The future has a reputation for accomplishing feats which the past thought impossible.”
-- Yudkowsky, 2008a, p. 330
This paper has shown that smarter-than-human artificial intelligence may plausibly be invented in the next hundred years, and that the potential consequences of this are enormous. While the exact behavior of an AI system may not be predictable from our current standpoint, there are reasons to believe that it may potentially harm humanity, possibly even causing our extinction. This is based on the considerations that an AI could act autonomously, that it need not share our motivations, and that it may decide to eliminate humanity or to consume resources in a way incompatible with human survival.

Ensuring the safety of AI is a difficult endeavor, and many potential solutions have been proposed. The ones that seem most promising at the moment have been discussed in this paper. In the short- to medium-run, these are funding research into Friendly AI and restricting dangerous research; using boxed AIs, Oracle AIs, or whole brain emulations to assist humanity in solving the problem; and programming AIs with motivational weaknesses and applying formal verification methods. In the long-run, we need to find ways of making AIs learn complex human values, interpreted the way we would want them to be interpreted. Ensuring “friendliness” is the only way to avoid long-term catastrophic risk. Indeed, friendliness must be ensured not only for the first AI, but for all future AI projects thereafter.

Since reducing existential risk is a global public good that would benefit all of humanity, commercial incentives are unlikely to be sufficient to address the market failure of underproduction. Therefore, it seems sensible for governments to coordinate by forming international treaties or agencies. Given the severity of what is at stake, drastic measures may be warranted; respecting national sovereignty is no excuse for failing to take action against a major existential risk (Bostrom, 2002). In addition to the responses discussed above, other ways of potentially reducing existential risk in general, not limited to AI, include colonizing space so that humanity is not confined to a single planet, and building secure underground refuges (Matheny, 2007).

Our short-term challenge is to find ways of making international collaboration work despite obstacles like interest groups, “arms race” dynamics, perceived social and political legitimacy, policy lag, difficulties reaching consensus, and difficulties in verification and enforcement. Fortunately, we have seen that governments may have incentives to overcome these obstacles (Dewey, 2016; Shulman, 2009).

This paper has not examined every possible solution, nor every possible contingency that might occur after the invention of AGI. It is likely that there are unknown unknowns not yet discussed in the extant literature. The arguments in this paper rest on many assumptions – for instance, that artificial minds will be able to outperform human minds in every domain, that they will be able to improve themselves, and that they will amass enough power to threaten us. The extent to which each of these assumptions holds true is a question for future research. However, our present uncertainty is no reason to ignore the issue – in fact, we would have a better excuse to ignore it if we were more certain that it did not pose a threat.
     
We must think not only about the risks, but also the opportunities of superintelligence (Bostrom & Yudkowsky, 2011). ASI poses an existential threat to humanity, but it can also help us solve many of our problems sooner and more effectively, because greater intelligence can help accomplish almost any goal to a greater degree. As Muehlhauser and Salamon (2012, p. 32) claim, “curing cancer is ultimately a problem of being smart enough to figure out how to cure it, and achieving economic stability is ultimately a problem of being smart enough to figure out how to achieve it.”

We must be careful not to underestimate the power of intelligence. Intelligence in the sense of a “g-factor” matters to the extent that it predicts relative success within the human world, for example lifetime income (Yudkowsky, 2008a). However, even the factors beyond IQ that contribute to success, such as education, enthusiasm, musical talent, social skills, and rationality, are all cognitive factors that reside in the brain. Moreover, humanity’s advanced intelligence and creativity relative to other species is the most obvious explanation for why we, and no other species, have set footprints on the Moon[15]. Underestimating the potential impact of AI may lead to a catastrophic scenario if we do not pay sufficient attention to the risks, and therefore fail to prepare (Yudkowsky, 2008a).

We should not allow our economic incentives or our scientific curiosity to overshadow safety concerns (Muehlhauser & Bostrom, 2014). We should beware the pitfalls of psychological biases and flawed reasoning, as Yudkowsky (2008b) poetically illustrates:

“There is a saying in heuristics and biases that people do not evaluate events, but descriptions of events — what is called non-extensional reasoning. The extension of humanity’s extinction includes the death of yourself, of your friends, of your family, of your loved ones, of your city, of your country, of your political fellows. Yet people who would take great offense at a proposal to wipe the country of Britain from the map, to kill every member of the Democratic Party in the U.S., to turn the city of Paris to glass – who would feel still greater horror on hearing the doctor say that their child had cancer – these people will discuss the extinction of humanity with perfect calm.” (p. 114, italics in original).  
We should not neglect existential risk, and doing nothing is certainly not an option. Nor, however, should we neglect the potential to use the enormous power of superintelligence for good: relinquishing AGI technology entirely would mean forgoing all of its benefits, and that would be a shame.

Insofar as success on the Friendly AI challenge may help us solve nearly any other problem, including existential threats from other technologies, it seems prudent, all else equal, to prioritize the challenges of AI and solve them first (Yudkowsky, 2008a). Bostrom (2014, p. v) asserts that the prospect of superintelligence is “quite possibly the most important and most daunting challenge humanity has ever faced. And—whether we succeed or fail—it is probably the last challenge we will ever face.”


FOOTNOTES

[1] Yudkowsky (2008a, p. 315) defines an optimization process as “a system which hits small targets in large search spaces to produce coherent real-world effects.”
[2] Narrow AI is also referred to as Weak AI, and general AI is also referred to as Strong AI.
[3] According to Yudkowsky (2008a, p. 331), it is possible within the laws of physics to construct a brain that computes a million times as fast as a human brain “without shrinking the size, or running at lower temperatures, or invoking reversible computing or quantum computing.”
[4] Sometimes also known as an “AI-go-FOOM” scenario. More generally, an economic or technological speedup is often referred to by the term “Singularity” (Yudkowsky, 2013).
[5] This is also referred to as a “computing overhang” by Muehlhauser and Bostrom (2014).
[6] Nota (2015) draws a parallel to the National Institutes of Health (NIH) in the United States, which offers grants based on a peer review process, and frequently deals with issues that could affect public safety.
[7] There are further philosophical questions, for example whether an upload would still be the same person as the original, or whether it would be conscious.
[8] Or more valuable, or more moral, etc.
[9] This is because AIs enjoy advantages like being duplicable, coordinating with each other, and operating at much higher speeds, as mentioned earlier.
[10] Sotala and Yampolskiy (2014) give the examples of humor, love, game-playing, art, sex, dancing, social conversation, friendship, parenting, and sport.
[11] A honeypot is a form of bait: a tempting resource that the AGI would value, placed in a location that the AGI is forbidden from accessing (Sotala & Yampolskiy, 2014).
[12] If one actor does this, a competitive arms race is likely to follow.
[13] Top-down approaches have been briefly described before. Bottom-up approaches entail evolving or simulating the mechanisms that give rise to human moral decisions (Sotala & Yampolskiy, 2014).
[14] See for instance the Voluntary Human Extinction Movement, VHEMT (http://www.vhemt.org/).
[15] Indeed, even with our human level of intelligence, we have developed the capacity to destroy all major life on the planet (Nota, 2015).  

REFERENCES

Armstrong, S., & Sotala, K. (2015). How we’re predicting AI – or failing to. In Romportl, J., Zackova, E., & Kelemen, J. (Eds.), Beyond Artificial Intelligence: The Disappearing Human-Machine Divide (pp. 11-29). Springer International Publishing Switzerland. Retrieved from https://intelligence.org/files/PredictingAI.pdf
Baum, S.D. (2015). The far future argument for confronting catastrophic threats to humanity: Practical significance and alternatives. Futures, 72, 86-96. doi:10.1016/j.futures.2015.03.001
Bostrom, N. (2002). Existential risks: Analyzing human extinction scenarios and related hazards. Journal of Evolution and Technology, 9(1), 1-31. Retrieved from http://www.nickbostrom.com/existential/risks.pdf
Bostrom, N. (2003). Ethical issues in advanced artificial intelligence. In Schneider, S. (Ed.), Science Fiction and Philosophy: From Time Travel to Superintelligence (pp. 277-284). Chichester: Wiley-Blackwell.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press. 
Bostrom, N., & Ćirković, M.M. (Eds.). (2008). Global Catastrophic Risks. Oxford University Press.
Bostrom, N., & Yudkowsky, E. (2011). The ethics of artificial intelligence. In Frankish, K., & Ramsey, W.M. (Eds.), The Cambridge Handbook of Artificial Intelligence (pp. 316-334). Cambridge University Press.
Dewey, D. (2016). Long-term strategies for ending existential risk from fast takeoff. In Müller, V.C. (Ed.), Risks of Artificial Intelligence (pp. 243-266). Boca Raton, FL: CRC Press. Retrieved from http://www.danieldewey.net/fast-takeoff-strategies.pdf
Kaul, I., Grunberg, I., & Stern, M.A. (1999). Defining global public goods. In Kaul, I., Grunberg, I., & Stern, M.A. (Eds.), Global Public Goods: International cooperation in the 21st century (pp. 2-19). New York: Oxford University Press/UNDP.
Matheny, J.G. (2007). Reducing the risk of human extinction. Risk Analysis, 27(5), 1335-1344. doi:10.1111/j.1539-6924.2007.00960.x
Müller, V.C., & Bostrom, N. (2016). Future progress in artificial intelligence: A survey of expert opinion. In Müller, V.C. (Ed.), Fundamental Issues in Artificial Intelligence (pp. 553-571). Berlin: Springer.
Muehlhauser, L., & Bostrom, N. (2014). Why we need Friendly AI. Think, 13(36), 41-47. doi:10.1017/S1477175613000316
Muehlhauser, L., & Salamon, A. (2012). Intelligence explosion: Evidence and import. In Eden, A.H., Moor, J.H., Søraker, J.H., & Steinhart, E. (Eds.), Singularity Hypotheses: A Scientific and Philosophical Assessment (pp. 15-40). Berlin: Springer.
Nota, C. (2015, March 6). AGI Risk and Friendly AI Policy Solutions. Retrieved from https://www.overleaf.com/articles/agi-risk-and-friendly-ai-policy-solutions/hrxfvnfpyxnz/viewer.pdf
Persson, I., & Savulescu, J. (2008). The perils of cognitive enhancement and the urgent imperative to enhance the moral character of humanity. Journal of Applied Philosophy, 25(3), 162-177. doi:10.1111/j.1468-5930.2008.00410.x
Russell, S., Dewey, D., & Tegmark, M. (2015). Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4). doi:10.1609/aimag.v36i4.2577
Shulman, C. (2009, July 2-4). Arms control and intelligence explosions. Paper presented at the 7th European Conference on Computing and Philosophy (ECAP), Bellaterra, Spain. Retrieved from https://intelligence.org/files/ArmsControl.pdf 
Sotala, K., & Yampolskiy, R.V. (2014). Responses to catastrophic AGI risk: A survey. Physica Scripta, 90(1), 018001. doi:10.1088/0031-8949/90/1/018001
Tonn, B., & Stiefel, D. (2014). Human extinction risk and uncertainty: Assessing conditions for action. Futures, 63, 134-144. doi:10.1016/j.futures.2014.07.001
Wilson, G. (2013). Minimizing global catastrophic and existential risks from emerging technologies through international law. Virginia Environmental Law Journal, 31, 307-364. Retrieved from http://lib.law.virginia.edu/lawjournals/sites/lawjournals/files/3.%20Wilson%20-%20Emerging%20Technologies.pdf
Yudkowsky, E. (2004). Coherent Extrapolated Volition. San Francisco, CA: The Singularity Institute. Retrieved from https://intelligence.org/files/CEV.pdf
Yudkowsky, E. (2008a). Artificial Intelligence as a positive and negative factor in global risk. In Bostrom, N., & Ćirković, M.M. (Eds.), Global Catastrophic Risks (pp. 308-345). Oxford University Press.
Yudkowsky, E. (2008b). Cognitive biases potentially affecting judgment of global risks. In Bostrom, N., & Ćirković, M.M. (Eds.), Global Catastrophic Risks (pp. 91-119). Oxford University Press.
Yudkowsky, E. (2013). Intelligence explosion microeconomics (Technical report 2013-1). Berkeley, CA: Machine Intelligence Research Institute. Retrieved from http://intelligence.org/files/IEM.pdf
Yudkowsky, E., Salamon, A., Shulman, C., Kaas, S., McCabe, T., & Nelson, R. (2010). Reducing long-term catastrophic risks from artificial intelligence. San Francisco, CA: The Singularity Institute. Retrieved from http://intelligence.org/files/ReducingRisks.pdf
