EFTA00813960.pdf
From: Eliezer Yudkowsky
To: Jeffrey Epstein <jeevacation@gmail.com>
Subject: What MIRI does, and why
Date: Wed, 19 Oct 2016 21:49:52 +0000
To be sent to a hedgefundish person who asked for more reading. Rob, Nate, any priority objections to this text,
especially the financial details at bottom?
This is probably the best overall online intro to what we do and why:
https://intelligence.org/2015/07/27/miris-approach/
In my own words:
MIRI is the original organization (since 2005) trying to get from zero to not-zero on the AGI alignment problem.
This is not an easy problem on which to do real work as opposed to fake work.
Our subjects of study were and are selected by the criterion: "Can we bite off any chunk of the alignment
problem where we can think of anything interesting, non-obvious, and discussable to say about it?" (As opposed
to handwaving, "Well, we don't want the AI to kill humans," great, now what more can you say about that which
would not be obvious in five seconds to somebody skilled in the art?) We wanted to get started on building a
sustained technical conversation about AGI alignment.
There is indeed a failure mode where people go off and do whatever math interests them, and afterwards argue
for it being relevant. We honestly did not do this. We stared at a big confusing problem and tried to bite off
pieces such that we could do real work on them, pieces that would force us to think deeply and technically,
instead of waving our hands and saying "We must sincerely ponder the ethics of blah blah etcetera." If the math
we work on is interesting, I would regard this as validation of our having picked real alignment problems that
forced us to do real thinking, which is why we eventually arrived at interesting and novel applied math.
Why consider AGI alignment a math/conceptual problem at all, as opposed to running out and building modern
deep learning systems that don't kill people?
- The first chess paper in 1950, by Shannon, gave in passing the algorithm for a program that played chess using
unbounded computing power. Edgar Allan Poe in 1836, on the other hand, gave an a priori argument that no
mechanism, such as the calculating engine of Mr. Babbage, could ever play chess, when arguing that the
Mechanical Turk chessplayer must have a concealed human operator. When you don't know how to solve a
problem even using unbounded computing power, you are in some sense confused about the problem. We are
presently unable to code, even in principle, the sort of superintelligent program that would e.g. synthesize a
strawberry on a plate and then stop without destroying the universe, never mind all the stuff humans think of as
deep moral dilemmas. (Modern computer science has a much stronger idea of how to give an unbounded
formula for a superintelligence without the alignment part, e.g. Marcus Hutter's AIXI.)
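The Shannon point can be made concrete with a toy sketch (my own illustrative code, not Shannon's): exhaustive minimax on a trivially small game, here Nim with take-1-to-3 moves. Given unbounded compute, the very same short recursion plays perfect chess; that is the sense in which the problem was "solved in principle" in 1950 while remaining unrunnable in practice.

```python
from functools import lru_cache

# Toy illustration of play-by-unbounded-search: exhaustive minimax on Nim.
# (Take 1-3 stones per turn; whoever takes the last stone wins.)
# On chess the identical recursion is correct but astronomically infeasible.

@lru_cache(maxsize=None)
def wins(stones: int) -> bool:
    """True iff the player to move can force a win from this position."""
    return any(not wins(stones - take) for take in (1, 2, 3) if take <= stones)

def best_move(stones: int) -> int:
    """Return a winning move if one exists, else take a single stone."""
    for take in (1, 2, 3):
        if take <= stones and not wins(stones - take):
            return take
    return 1
```

The recursion makes the "confusion" framing precise: for Nim or chess we can write this down; for alignment we cannot yet write its analogue at all.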
- Almost all of what we expect to be the interesting and difficult problems of AGI alignment do not
spontaneously materialize at humanity's current level of AI capabilities, and if they did, other people would
provide funding to work on them. For example, an AGI doesn't spontaneously resist having its off-switch pressed
until the AGI understands enough of the big picture to realize what an off-switch is and that its other goals will
not be achieved if there is no AGI trying to achieve them. Some of these problems can be sort of modeled in toy
systems, but not all, and they're hard to toy-model in an interesting way that doesn't just produce trivial results
that anyone skilled in the art would easily predict in advance. To build a sustained conversation about what
would happen later in more advanced AI systems, we need to say things precisely enough that a critic can say
"No, that just kills everybody, because..." and the proposer is forced to agree. Precise talk about unbounded
systems can sometimes well-incarnate a stumbling-block, or something we are confused about, which by default
we'd also expect to show up in a sufficiently advanced bounded system. Then we can talk precisely about the
stumbling-block, and have conversations where somebody says "Wrong" and the original proposer stares off for
a second and nods.
One of the major classes of interesting problem we tried to bite off has to do with stability of self-modification,
or coherent cognitive reflection. For example, we'd want to show that "be concerned for other sentient life"
sticks around as an invariant when the AI self-improves or builds other AIs.
There's an obvious informal argument that an agent optimizing a utility function U would by default try to build
another agent that optimized U (because that leads to higher expected U than building an agent that optimizes
not-U). When we tried to formalize this argument in what seemed like the obvious way, we started running into
Gödelian problems which seemed to imply issues like: "If the AI reasons in a system T, will it be able to trust a
theorem asserted by its own memory, or by an exact copy of itself, given that things usually blow up when a
system T asserts that anything T proves can be trusted?"
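The "blow up" in question is Löb's theorem, stated here in standard metamathematical notation (standard textbook material, not anything specific to this email's setting):

```latex
% Löb's theorem: for any sentence P, a theory T extending basic arithmetic
% satisfies
\text{if } T \vdash \mathrm{Prov}_T(\ulcorner P \urcorner) \to P,
\quad \text{then } T \vdash P .
% Instantiating P := \bot: a consistent T can never prove
% \mathrm{Prov}_T(\ulcorner \bot \urcorner) \to \bot, i.e. T cannot
% endorse the soundness of its own proofs as a blanket schema.
```

So an agent that adopted "anything my current proof system proves can be trusted" as an unrestricted principle would, by Löb, prove everything; the self-trust has to be weakened or restructured somehow.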
The point of this wasn't to investigate reflectivity and self-description in mathematical systems for the sake of
their pure interestingness. We were trying to get to the point where we could write down any coherent
description whatsoever of a self-modifying stable AI. It turned out that we actually did run into the obvious
Gödelian obstacle. And so far as we could tell, there was no trivial cunning way to dance around the Gödelian
obstacle without losing what seemed like some critical desideratum of a bounded self-modifying AI. For
example, System X can readily reason about System Y if System X is much larger than System Y; but a bounded
AI should expect its future self to be larger and more capable than its current self, not the other way around.
On this avenue, we've recently published a major new result of which Scott Aaronson and others have spoken
favorably. We can now exhibit a system that can talk sensibly and probabilistically about logical propositions
(like the probability that the trillionth decimal digit of pi is a 9, in advance of computing that digit exactly) and
get many nice desiderata simultaneously. E.g., the probability distribution learns from experience as it
discovers more mathematical facts, without any obvious Gödelian limitations on what can be learned. More
importantly from our perspective, the probability distribution is able to talk about itself with almost perfect
precision (absolute precision still being unattainable for Gödelian reasons). We took "math talking about math"
out of the realm of logical statements and into the realm of probabilistic statements, much further than had
previously been done.
https://intelligence.org/2016/09/12/new-paper-logical-induction/
This in turn takes us a lot closer to being able to write down a description of a not-totally-boundless stably
self-modifying AI that reasons about itself or other cognitive systems that assign probabilities to logical propositions.
Which in turn gets us closer to being able to talk formally about many other reflective and self-modeling and
cognitive-systems-modeling problems in a sufficiently advanced AI that we'd like to be stable.
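The digit-of-pi behavior described above can be toy-sketched as follows. To be clear about what this is not: the logical induction algorithm itself runs a market of polynomial-time traders, and nothing of that appears here. This is only a toy of the surface behavior, a distribution over an uncomputed digit that starts uniform and sharpens as mathematical facts about it are discovered:

```python
# Toy of the *surface behavior* only -- NOT the logical induction algorithm
# (which involves a market of polynomial-time traders). A distribution over
# an uncomputed digit starts uniform and updates as facts are discovered,
# without ever being asked to compute the digit outright.

from fractions import Fraction

def condition(dist, fact):
    """Bayes-condition a {digit: probability} dict on a predicate over digits."""
    total = sum(p for d, p in dist.items() if fact(d))
    return {d: (p / total if fact(d) else Fraction(0))
            for d, p in dist.items()}

# Prior: no relevant facts computed yet, so every digit is equally likely.
dist = {d: Fraction(1, 10) for d in range(10)}

dist = condition(dist, lambda d: d % 2 == 1)   # discovered: "the digit is odd"
dist = condition(dist, lambda d: d != 3)       # discovered: "the digit is not 3"
```

After those two discoveries the probability that the digit is a 9 has risen from 1/10 to 1/4, which is the qualitative shape of "learning mathematical facts" the paper formalizes.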
The current academically dominant version of formal decision theory, "causal decision theory", is problematic
from our own standpoint in that CDT is not reflectively consistent: if you have a CDT agent, it will immediately
run off and try to self-modify into something that's not a CDT agent. The same theory also says that a formally-
rational selfish superintelligence should defect in the Prisoner's Dilemma against its own clone, which some
might regard as unreasonable.
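The clone case is small enough to compute directly. A minimal numeric illustration, using standard Prisoner's Dilemma payoffs of my own choosing rather than anything from the email:

```python
# Toy Prisoner's Dilemma against an exact clone (illustrative payoff numbers).
# payoff[(my move, clone's move)] = my payoff.
payoff = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# CDT treats the clone's move as causally fixed, and D strictly dominates C
# against any fixed move -- so the CDT agent defects.
assert all(payoff[("D", b)] > payoff[("C", b)] for b in "CD")
cdt_move = "D"

# But an exact clone runs the same deliberation, so its move mirrors yours.
# Conditioning on that correlation, C is the better policy.
mirrored_value = {a: payoff[(a, a)] for a in "CD"}
reflective_move = max("CD", key=mirrored_value.get)
```

Both conclusions hold at once: defection dominates against a fixed opponent, yet the mirrored outcome of cooperating (3) beats the mirrored outcome of defecting (1), which is exactly the tension between CDT and a reflectively consistent theory.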
Trying to resolve this issue led to another behemothic set of results we're still trying to write up, so I can only
point you to a relatively informal and unfinished online intro:
https://arbital.com/p/logical_dt/
But we have now shown, for example, how two dissimilar AIs with common knowledge of each other's source
code--and no other means of communication, coordination, or enforcement--could end up robustly and reliably
cooperating on the Prisoner's Dilemma:
https://arxiv.org/abs/1401.5577
This new decision theory also has economic implications about negotiations a la the Ultimatum Game, about
whether it's 'rational' to vote in elections, and a number of other informal problems in which the 'rational' choice
was previously said to be something unpalatable (see the first link above).
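The robust-cooperation result linked above can be shown in a deliberately degenerate form. The paper's agents use provability logic (FairBot cooperates iff it can prove the opponent cooperates with it); the sketch below is only the simplest "CliqueBot" special case, my illustration rather than the paper's construction: cooperate exactly when the opponent's source code is a byte-for-byte copy of your own.

```python
# Degenerate toy of program-equilibrium cooperation: agents exchange source
# code and cooperate iff the opponent is an exact copy. (The linked paper's
# FairBot is far more robust, cooperating with provably-cooperative agents
# that need not be textual copies.)

CLIQUEBOT_SOURCE = 'lambda opp_src, own_src: "C" if opp_src == own_src else "D"'
DEFECTBOT_SOURCE = 'lambda opp_src, own_src: "D"'

def play(src_a: str, src_b: str):
    """Run two source-code strings against each other; return their moves."""
    bot_a = eval(src_a)
    bot_b = eval(src_b)
    return bot_a(src_b, src_a), bot_b(src_a, src_b)
```

Two copies cooperate with no communication, coordination, or enforcement beyond reading each other's code, and CliqueBot is unexploitable by DefectBot, which is the qualitative property the provability-logic agents achieve much more generally.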
https://intelligence.org/files/Corrigibility.pdf
...is an overview of the "how do you make an AI that wants to keep its suspend button" problem, aka "how do
you build an agent that learns over two possible utility functions given one bit of information, without giving the
agent an incentive to modify that bit of information, or assigning it a definite value in advance of your trying to
press the button or not".
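One way to see the incentive problem in miniature, as a sketch of my own rather than the Corrigibility paper's formalism: a naive agent whose value depends on whether the button is pressed wants to control the button, while adding a compensation term equal to the gap makes the two outcomes equally valuable, removing any reason to touch the button bit.

```python
# Toy utility-indifference sketch (illustrative numbers; not the paper's
# exact construction). U_work is the value of continuing; U_shut is the
# value after shutdown.

U_work, U_shut = 10.0, 0.0

def naive_value(pressed: bool) -> float:
    # Naive agent: shutdown simply loses U_work, so preventing a press pays.
    return U_shut if pressed else U_work

def indifferent_value(pressed: bool) -> float:
    # Compensated agent: a bonus equal to the gap is paid only on shutdown,
    # so both branches are worth the same and the button is not worth
    # manipulating in either direction.
    compensation = U_work - U_shut
    return (U_shut + compensation) if pressed else U_work
```

Indifference has its own known pathologies (the compensated agent also has no incentive to preserve the button), which is part of why the paper treats this as an open problem rather than a solution.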
Everything above is on the more theory-laden side of our work. We only recently started spinning up work on
alignment within the paradigm of modern machine learning algorithms; half our researchers have decided to
allocate themselves to that going forward. Here's our current broad technical agenda for that research:
https://intelligence.org/files/AlignmentMachineLearning.pdf
https://intelligence.org/2016/05/04/announcing-a-new-research-program/
An example of a bit of specific work we may try to do there soon is figure out how to translate an autoencoder's
representation into something closer to a human-readable representation--a representation that more easily
'exposes' or makes easier to compute, the answers to the sort of questions that humans ask. (An unsupervised
system learns a representation; we want to learn a transform from that representation onto another representation,
such that the second representation makes it easy to compute the answers to a class of supervised questions.)
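The parenthetical can be sketched in miniature (hypothetical data throughout, my own toy rather than anything MIRI has published): take a fixed latent code as given, and learn a transform from it onto a quantity a human might ask about. Real autoencoder features would replace the synthetic codes; the probe-fitting step has the same shape.

```python
# Toy "learned transform" sketch. Pretend 2-d latent codes z stand in for an
# unsupervised model's representation, and the supervised question's answer
# happens to be a linear readout of them: y = 3*z0 - z1 (hypothetical).

import random
random.seed(0)

codes = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
answers = [3 * z0 - z1 for z0, z1 in codes]

# Fit the transform w by plain gradient descent on mean squared error.
w = [0.0, 0.0]
for _ in range(1000):
    g = [0.0, 0.0]
    for (z0, z1), y in zip(codes, answers):
        err = (w[0] * z0 + w[1] * z1) - y
        g[0] += err * z0
        g[1] += err * z1
    w = [w[0] - 0.05 * g[0] / len(codes),
         w[1] - 0.05 * g[1] / len(codes)]
```

The fitted w recovers the readout (approximately [3, -1]), making the supervised answer cheap to compute from the representation, which is the "exposes the answers" property the paragraph describes; the hard open part is doing this for questions humans actually ask of rich learned representations.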
MIRI has 8 fulltime researchers including me, plus 2 interns and a few more funded to do part-time work. 2
more researchers with signed offer letters are starting later this year/ early 2017. We have 6 ops staff, plus 1
more starting in November (handling research workshops, popular writeups, keeping the lights on, etcetera).
We're on track to spend $1.6M this year and we hope to spend $2.2M in 2017. We are currently at $257,000 out
of a needed $750,000 basic target in our annual public fundraiser, with 12 days remaining:
https://intelligence.org/donate/
Hope this conveys some general idea of what MIRI does and why!
Eliezer Yudkowsky
Research Fellow, Machine Intelligence Research Institute