Okay, it's my great pleasure to introduce Daniel Selsam.
Daniel is a PhD student at Stanford,
advised by Percy Liang and David Dill.
Daniel was here last summer.
He did a lot of great work on the Lean theorem prover.
Daniel is interested in both machine learning systems and
in using interactive theorem provers to make sure they work as expected.
He is also interested in developing machine
learning techniques for theorem proving.
And today he's going to present the work that he presented
at ICML.
>> Thanks Leo. >> Hi everybody,
thanks for coming today.
So I'll be presenting a new way to develop bug free machine
learning systems using formal mathematics.
And this is all joint work with my two advisors, Percy Liang
and David Dill.
So, here's the problem.
It's extremely difficult to detect implementation errors in
machine learning systems.
If you write a program to sort a list and
the program ever returns an unsorted list,
then you know your program has an error.
Even if you don't detect it when you're writing it.
Later on once it misbehaves you can detect after the fact that
an error has occurred.
But if you write a machine learning system,
you give it a bunch of data, and the loss function
doesn't go down that much, it doesn't work that well.
Maybe you just needed more data or more layers, or
a different model, or maybe it's just numerically unstable,
you don't really know.
But maybe the program should work and
there's just a bug in the system.
How would you even know?
So here's a recent example.
So Theano for those of you who don't know, was the most popular
machine learning system for about a decade.
And it was notoriously riddled with bugs.
Somebody posted an issue about a year ago.
They have a model that's training well for
about 20 iterations, and then the loss function diverges.
And the poster thought that the model should work.
And he posted an issue and the developer said, well,
maybe your model's just not supposed to work.
How are we supposed to know?
Then after doing some experiments and
comparing with TensorFlow,
a different system where the model behaved similarly for
20 iterations and continued to go down.
The developers finally granted that it probably is a bug.
But they don't know where the bug is,
they don't know which models are affected.
And the scary part is I've given up on models that didn't work
well in Theano many times, and I really don't know whether it was
just because of a bug or because the model is wrong.
There's so many other possible reasons a system can fail that I
really don't know whether the model was really at fault, yeah.
>> Memory corruption bug, what kind of bug was it in the end?
>> They don't know.
>> We don't know,
cuz we've had memory corruption bugs in software for
the last 30 years.
That's not a new problem.
There are also tools to find memory corruption.
>> That is just the poster's guess for what it was.
>> Okay.
>> The developers have no idea
whether it was a memory corruption bug or not.
Sorry?
>> That's an interesting story, right,
because again, if we get a memory corruption bug, you should be
able to inspect the program to see what it's doing.
But you're saying that what the program is doing is so
complicated that the developers who developed Theano,
even if they watched what it's doing, they don't know whether
it's correct or not.
>> Correct. >> Okay.
>> Here's another example.
So Hastie and Tibshirani are two famous statisticians.
And in 2002 they got very excited about a new method
called coordinate descent for the lasso.
And they implement it and they find it works poorly and
they abandon it.
But four years later they go to a talk on a similar method that
works and they get excited again.
They implement it again and find it state of the art.
They look back and they realize their original version just
had a bug. When they implemented it,
it didn't work well, and they didn't know it was supposed to
work well; they didn't have math that told them it should work.
So they just assumed that the model didn't
work well, like most other models. And it makes you wonder,
how many other great ideas have been abandoned because of bugs?
I mean they actually changed their entire research path
because something didn't work that should have worked.
So, the status quo methodology for software
development doesn't work perfectly for regular software,
but it really doesn't work for machine learning systems.
So the status quo is to test a program empirically, right.
You write your program and then you run the program.
And if the program ever does something that you think it
shouldn't do, you go back and debug the program and iterate.
And sometimes you can release the program, and
then users can run it and find bugs, and they can tell you,
and this process will iterate. Yeah?
>> [INAUDIBLE] you write
assertions, we have specifications.
It's more than just, you write it and let it run.
>> By let it run, I mean unit testing is still running.
A lot of testing is still done empirically.
>> Sure.
>> Being able to detect after the fact of an execution that it
didn't do what you want.
>> And in the real world,
the number of programs that have specifications is zero.
>> [LAUGH] >> Passing those assertions is
very much a specification.
I mean, you can, >> Yeah.
>> Represent.
So we're waiting to see what kind of properties of programs
you're talking about here, and outside that realm, waiting,
we're still here.
>> I will, I will.
>> Except for go ahead.
Go ahead. Yeah. >> I will go through a very.
>> Yeah. >> Extensive detail,
gory detail, soon enough.
>> Okay, go.
>> Okay, so this doesn't work well for
machine learning systems, because usually
when it runs and it doesn't work well,
you don't actually know that there's a bug.
You certainly can't localize it, but
you might not even be able to detect a bug.
So in our methodology, this won't be new at all to
verification people, but we test the program
mathematically instead of empirically.
So we first formally specify in logic what the program is
supposed to do.
We write the program,
and then we try to prove that the program does what we said it
should do.
And when the proof gets stuck, every bug in the program will
cause the proof to get stuck at some point.
And we can look at where the proof got stuck,
fix the program accordingly, and iterate.
And unlike in the status quo methodology, this loop will
eventually terminate with a bug-free system, up to your
specification, and a machine-checkable proof of correctness.
So this is probably obvious to the PL
people in the room, but I mostly presented this to machine
learning people, where this was a very, very wild idea.
>> [LAUGH] >> So
how do you specify and prove properties of programs?
What does that even mean?
So it's not possible with traditional programming
languages.
You can't write down a specification involving
integrals and random variables in
C, or know what that would even mean.
You need to somehow jointly reason about programs, theorems,
and proofs.
And the answer is to use, at least one of the main answers is
to use an interactive proof assistant.
So that's exactly what they're for,
and they've served two communities historically,
formal mathematics and software verification, which have had
relatively little communication between them.
But there have been major
successes already in both fields.
So in math, at MSR they've actually
formalized what was a 300-page informal proof.
And in software there's a fully verified C compiler, CompCert,
and many, many, many others here and elsewhere.
And to prove machine learning systems correct we actually need
to build on both of these fields because we want to be proving
properties about specific executable programs.
But their semantics involves advanced mathematics,
real analysis and integrals and things like that.
So in our work we used the Lean Theorem Prover, which
is developed right here by Leonardo de Moura.
And before explaining the methodology,
we need some background on Lean,
this might be familiar to many of you, I'm not sure.
So first, basically it's a functional programming language,
you can write functional programs.
So here's a function square, takes a number and
returns its square.
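As a sketch, the definition on the slide looks roughly like this in Lean (standard syntax, nothing assumed beyond the core library):

  def square (x : ℕ) : ℕ := x * x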
You can state mathematical theorems in a system like Lean.
So this is the statement of fermats_little_theorem,
and the universal quantifier and
the implication arrow are both part of the logic of Lean.
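Roughly, the statement has this shape; the exact form and the nat.prime name here are mathlib-style and illustrative rather than the exact slide:

  theorem fermats_little_theorem :
    ∀ (p a : ℕ), nat.prime p → ¬ (p ∣ a) → a ^ (p - 1) % p = 1 := sorry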
So this already has mathematical semantics analogous to what
we're used to in informal math.
And you can state theorems about programs also.
So the program, the functional programs and
the mathematical theorems live in the same universe,
so you can state theorems about programs.
In this case we're saying that the program sort always returns
a sorted list.
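Schematically, assuming a sort function and an is_sorted predicate have already been defined (the names here are illustrative), the statement is just:

  theorem sort_correct : ∀ (xs : list ℕ), is_sorted (sort xs) := sorry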
And you can prove theorems interactively,
which I'll explain in more detail shortly
when we walk through an example.
And whenever I say proof I mean machine-checkable proof.
So literally a data structure, some kind of
formal certificate that a computer program can check, and
make sure it actually proves the theorem that's been claimed.
So let's walk through a toy example to warm up.
So suppose you're trying to compute
the gradient of the softplus function,
which is a common function in machine learning.
I'll explain what it is in a second.
So the first step is to specify what we're trying to implement.
So this is the softplus function, log (1 + exp x).
And we wanna write a function grad_splus that takes a real
number and gives you a real number.
And we're gonna leave it as a placeholder for now.
And then here's the specification,
grad_splus is correct if for
all x, it computes the gradient of the softplus function at x.
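In Lean it looks roughly like this; deriv stands in for whatever derivative operator the library provides, and all the names here are illustrative:

  noncomputable def softplus (x : ℝ) : ℝ := real.log (1 + real.exp x)
  noncomputable def grad_splus (x : ℝ) : ℝ := sorry  -- placeholder for now
  theorem grad_splus_correct :
    ∀ (x : ℝ), grad_splus x = deriv softplus x := sorry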
Everybody who wants to follow this, is following?
[LAUGH] Okay.
So then we program. And I don't remember,
I don't wanna do the derivation by hand,
I don't know what the program is supposed to compute.
Let's just pretend leaving it as a placeholder is just
another bug and we'll try to fix it through this iterative loop.
So we're trying to prove it correct, so
in lean when you're trying to prove something correct,
it's an interactive process.
At every step, it'll display what the remaining goals
are that you need to prove.
So in this case, the remaining goal is the statement of
the original theorem.
And we're trying to prove for all x, something.
So we're gonna introduce x, which is the same thing as,
let x be an arbitrary real number.
And we can then unfold the definition of softplus to expose
the log and the 1 + exp x.
Then, we can just rewrite exhaustively with known
simplification rules for basic gradients.
And as a result,
the right-hand side will simplify to exp x / (1 + exp x).
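That simplification is just the chain rule: d/dx log(1 + exp x) = (1 / (1 + exp x)) · exp x = exp x / (1 + exp x).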
So our theorem is not true,
because this doesn't return anything.
But it tells us exactly what it needs to return.
It needs to return this.
So we go back and debug our program.
And we fill in what it's supposed to compute, and
we go back to proving.
And now the same proof goes through and we're done.
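So the finished definition, in the same illustrative sketch as before, is just:

  noncomputable def grad_splus (x : ℝ) : ℝ := real.exp x / (1 + real.exp x)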
Everybody okay with that example?
>> Simplify?
>> Yeah, so Lean has a general-purpose simplifier.
You just tag a few lemmas as rules and- >> It does gradients and
integrals?
>> It does after you add those lemmas and mark them as simp rules.
It doesn't a priori.
>> Okay.
>> You need a library for it.
>> You need a library, yeah.
>> For the things you need in machine learning,
you made it already, like gradients.
>> Exactly, yeah.
>> Right, all right.
>> I'll say a little bit more about that later but
not that much more.
So if you wanna know how the library is structured, ask me or
we can talk offline.
But I'll tell you a little bit more about that as we go along.
So that was a toy example, let's look at a real example.
So as a case study,
we used our methodology to build a system called Certigrad for
optimizing over stochastic computation graphs.
I'll explain what that means in a second for
people who are not in machine learning.
So here's some background.
So the computation graph is the basic programming abstraction of
machine learning, especially of deep learning.
And all of the popular systems like TensorFlow and Theano
are based on this abstraction, and here is an example.
So it's basically just a simple program,
usually a directed acyclic graph where each intermediate node
represents some kind of computational operation.
This is matrix multiplication,
this is the softplus function component wise.
And you're gonna have some inputs, like the ones in black
that are just data that are fixed that you know.
And you have some inputs which I've put in red which
are parameters that you don't know.
And then you have some loss function, which is some notion
of how unhappy you are with the parameters at any given state.
So usually if you're doing image recognition,
your loss function will be how much probability do you assign
to the correct human labels?
And the point is that you want to assign more probability to
the labels that the humans provided.
And then the goal is to find W1 and
W2 that will minimize the loss or make the loss small.
So the way you do this is by computing gradients.
So the main goal of these machine learning systems is to
compute the gradient of the loss
function with respect to the parameters.
And then the standard learning algorithm is called stochastic
gradient descent, where basically it's very simple.
At every iteration you have some values of W1 and W2.
You compute the gradient of the loss with respect to them and
you take a little step, and then you repeat.
And that's basically how deep learning works.
So this is a very nice abstraction,
but it's reaching its limits.
There are lots of new models that are a little too
expressive to be supported by computation graphs.
So many models require random variables or non-differentiable
simulators, especially in reinforcement learning.
So Go is not differentiable itself.
You have a simulator for Go, but you have a differentiable
network spitting out actions that it should take.
And you need a way to back propagate information from
the result of the Go game to the network parameters, for example.
And the answer is stochastic computation graphs,
which are just computation graphs with random variables.
So this is the analog of the example we just saw where,
in the middle we throw in a random variable.
So we sample a Gaussian whose mean is
the output of the previous layer of the neural network, and
now the loss function is no longer a scalar.
It's gonna be a random variable itself.
So the quantity we're interested in minimizing is the expected
value over the random choices in the graph of the loss
function nodes.
And now the main goal of Certigrad is to sample unbiased
estimates of the gradient of the expected value of
the loss with respect to the parameters.
Cuz you can no longer do this in closed form in general.
So we're actually going to be running a randomized algorithm
and only proving properties about it probabilistically.
Okay so let's see how to build Certigrad using our methodology.
The first step is to specify what we're trying to build,
which I already stated informally.
But by analogy to the pseudocode, here's the pseudomath:
we want a function stochastic backprop,
Such that the expected value of stochastic backprop is
the gradient of the expected value of the loss.
This is what it means to sample unbiased estimates of
the gradient of the expected value of the loss.
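In symbols, roughly:

  E[sbackprop(g, θ)] = ∇_θ E_{x ∼ p(x | g, θ)}[loss(x)]

where the expectation on the left is over the same random choices in the graph.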
Now this might look formal to some people who have no
experience with formal reasoning.
But it's just LaTeX, it has no semantics.
It's just characters;
the computer has no idea what expected value means,
what a graph is, for example, what gradients are.
So in order to use this as a specification for the computer,
we need to make it formal.
So we need to import libraries for real analysis so
the gradient makes sense,
for probabilities so that expected values make sense.
We need to define what a stochastic computation graph
and a node are.
We need to say how a graph and
a parameter give you a probability distribution.
And we're going to leave a placeholder for the function
we're trying to implement, stochastic backprop.
And then the theorem has some syntactic clutter, but
it's essentially isomorphic to the informal version.
We want: for all n, and for all stochastic computation
graphs with n nodes, and for all parameters,
The expected value over the distribution induced by
the graph and
the parameter by sampling of running stochastic backprop
is the gradient of the expected value of the loss function.
So it takes a little getting used to the syntax,
but there's no additional complexity here.
And unlike this representation here,
this actually is in a language with real semantics, and
the computer can understand and reason about it.
So now let's try to program it.
And as before, we're going to leave it as a placeholder for
now, because I have no idea how to write this.
There are a lot of tricky cases,
what if you have multiple incoming nodes, some
of which are stochastic, some of which aren't,
and multiple outgoing nodes, some of which are stochastic,
some of which aren't.
I don't really want to think about this, so
let's just leave this as a bug and
hopefully we can fix it during the proof process.
So now we start proving, and we don't actually know anything
about G here, so there's nothing we can do.
But we can figure we have to reason by induction.
So in this example,
I'm hiding a few details:
the graph is essentially just a list of nodes,
where each node has references to the previous nodes in the graph,
just some topologically sorted set of nodes.
I take a few steps,
manipulating the right hand side, and eventually I see
that the base case has to be equal to this term F.
So I go back to debug my program,
since I've learned something about what it needs to do.
And I decide, to mirror the induction from before
I'm gonna recurse on the graph.
And I'm gonna fill in the base case,
as we learned it needs to be.
Now I'm gonna start proving again,
and now I'm in the step case, blah blah blah.
But then I hit a proof obligation that some
function is integrable, and
I can't prove this cuz I haven't assumed anything about G.
So I realize that I need to debug my specification.
So I go back and add another precondition,
that some integrals exist on G.
And now I go back to prove one more time, take a few
more steps, and I massage the right-hand side just enough.
I do case analysis on whether my node is deterministic or random.
And I end up learning that the expected value of sbackprop on
this branch has to equal this term.
And so as you probably figured out by now, I go back and
debug my program.
I have the program do case analysis on whether the node is
random or deterministic, and I fill in the term.
Actually, I mentally figure out something that
has the right expectation, which is pretty easy to do locally.
And then I fill in that term on the right hand side, but
I still have a placeholder for the random case.
And I go back to proving.
And suddenly I need to prove something is differentiable.
I can't do it, I haven't assumed anything about it.
So I go back and debug my specification and
I add a new assumption that all the gradients I need exist, and
I try to prove again.
I learn something more about the program and debug the program.
I try to prove again, missing a precondition.
Go back to debug my specification.
And eventually I try to prove again, prove, prove, and
it works.
Now this is great, but the program is wretchedly slow.
As a general rule, the programs you synthesize this way,
by proof driven development, will always be the most naive,
most ridiculously slow and stupid programs imaginable.
For those of you who know back propagation, the key insight to
make it efficient is that you need to use dynamic programming.
So all we're gonna do is memoize this version, which we don't need
to understand, but we know roughly what the algorithm we're
trying to build looks like.
So we go back and debug our program, and
we rename the backprop that we wrote to slow_sbackprop.
And then we manually, eagerly
implement sbackprop by memoizing the slow version.
And now we only need to prove that sbackprop
is the same as slow_sbackprop.
We don't have to re-prove the original theorem again.
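So, schematically, the only new obligation is an equivalence with this shape; the names and argument shapes are illustrative, not Certigrad's exact ones:

  theorem sbackprop_eq_slow :
    ∀ g θ x, sbackprop g θ x = slow_sbackprop g θ x := sorry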
So now we start proving, blah blah blah.
And we get some goal where you've got some x on
the left hand side,
which is supposed to equal some x prime on the right hand side.
This was actually the bug I had.
I was missing some primes in something.
And so we realize that some x was supposed to be an x prime,
so we go back and debug our program,
replacing x with x prime.
We go back to prove, blah blah blah, and eventually we're done.
And now at this point, congratulations,
you really have an efficient and correct program.
Now once we complete the proof,
we know that sbackprop is correct.
We don't need to test it,
we don't need to think about edge cases,
we don't need to understand how or why it works.
The proof system absolves us of all of these
responsibilities.
Now, I can't speak for
every software engineer but, I find it extremely stressful
having to bear the burden of understanding my own systems.
And having to think about why they're correct and
whether all the invariants are satisfied.
I take great comfort in just having
this tell me that it's correct and I can sleep easy at night.
So we just saw a simplified version.
I'll quickly show you a little bit about
the complexity that I've hidden.
But in the real Certigrad every node can be an arbitrarily
shaped tensor, not just a real number.
And there can be multiple unknown parameters not just
a single parameter theta, but
it mostly just complicates the syntax.
So for those of you who are curious,
here's the actual specification.
It's much more confusing syntactically, but
it's pretty much the same.
The expected value of running sbackprop is the gradient of
the expected value of the loss function.
And then you have a couple extra pre-conditions I didn't
mention before.
So there's actually quite a bit of complexity hidden in this
one, the condition that you can differentiate under integrals.
So in machine learning papers you have a gradient and
you have an integral, done.
Push the gradient inside the integral.
But in mathematics that's actually not quite so
simple, you need smooth first derivatives.
Some functions need to be continuous.
Something needs to be uniformly integrable, so
there's some complexity hidden here.
These are pretty simple conditions, and
then this is the specification, which really does capture
functional correctness for stochastic backprop.
Okay, so let's talk about actually running Certigrad.
So the program is not executable as is, as was pointed out.
It's in terms of infinite-precision reals,
which are just a mathematical abstraction.
So to actually run, we compile the program to byte code, and
we replace reals with floating points.
Now you can introduce numerical instability errors here.
But I consider them an orthogonal problem.
You can separately verify that your system is
numerically stable.
There's many different kinds of metrics you can use.
But it has nothing to do with the mathematical or
algorithmic semantics of your problem,
which only makes sense in terms of reals.
So it's not just that the spec is
invalidated when we switch to floating points.
The spec doesn't really make sense, because integrals and
gradients only make sense on real numbers,
not on floating points.
>> In ML, do you have evidence that floating points
change the semantics?
>> No, so TensorFlow, which is Google's machine learning system
will apply algebraic simplifications without any
regard for numerical stability.
The only times people in machine learning
really care about floating point is when a few
of the hard-coded functions require special techniques.
So basically, whenever you're computing log-sum-exp,
you need to do this trick, where you subtract the max first, so
you don't get infs.
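The trick is the identity log Σᵢ exp(xᵢ) = m + log Σᵢ exp(xᵢ − m), with m = maxᵢ xᵢ, so the shifted exponentials can no longer overflow.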
But in general, as in machine learning user level,
you just never have floating point issues,
at least not that we know of.
Or if you do, you get instant NaNs.
Nobody I know of has been burned by silent floating point errors,
but I can't actually be sure.
[LAUGH]
>> Are you saying here that you are not going to discuss about
including reals as floating points in this work?
That's all you're saying?
>> That's all I'm saying.
>> You're not saying there's no problem in there.
>> No there's a potential problem,
I'm just saying it's a different problem.
>> But it could be, you could have a theory of floating
points and then a balance.
>> Yeah, yeah.
So actually I have abandoned it, but I had a separate
project to verify numerical stability of computation graphs.
The strategy would be to prove a form of backwards stability.
But I consider it very distinct work.
>> But you think that the proofs are separable,
that you won't have to change your proof of correctness when
you have a separate proof of numerical [CROSSTALK].
>> Absolutely, you end up with two unrelated proofs.
That the real version- >> That compose,
that compose is what you're saying.
>> Well not, they don't perfectly compose.
You have a proof that the real version does what you want in
real space, and then you have the proof that the floating
point version is faithful to the real version.
In essence, they compose.
>> I have a quite different question, and
I wonder if I can throw the ball at you?
>> Sure.
>> So when you derive the program,
you use certain patterns,
induction and memoization.
So some areas of program synthesis are based
on the algorithms following certain designs, divide and
conquer or whatever scheme you decide.
If you could state each of those design patterns in
a way that you're [INAUDIBLE] >> Yeah, so-
>> [INAUDIBLE]
particular
[INAUDIBLE] >> Yeah,
I think they're very compatible.
The main thing that's new here is not really the decomposition
patterns, it's the fact that for the individual branches,
you can manipulate the proof state to figure out,
deductively, what the program is supposed to do.
It's very compatible with other strategies for-
>> You might even put into
this set of systems a theory of divide and
conquer algorithms for [INAUDIBLE], and
all of the manifestation of [INAUDIBLE], and
interpose that on the proof search and [INAUDIBLE].
>> Yeah, so let me go back a little.
Essentially, when I call that induction tactic,
at first, that is the dual of saying that no one ever curse.
So you can imagine the tactics that have to figure out
different induction principles to try that would correspond to
the program that's- >> Correct.
>> Yeah, I didn't need that for this work,
but that's definitely an interesting direction.
>> And then, in particular, when it comes to machine
learning algorithms, I wonder if the algorithms all follow
the same particular pattern of algorithm design.
Are they all memoization, right,
are they all dynamic programming-based?
>> So I think I'll partially answer that as we go along.
So the core algorithms in the systems
that people write machine learning programs in,
the main algorithm is back propagation.
But the algorithms in machine learning programs,
like specific machine learning programs, are much more varied.
And I'll talk about a few examples later on.
Where was I?
24?
[LAUGH] Okay,
so in addition to replacing reals with floating points,
we're gonna replace tensors with an optimized tensor library.
So a cute fact about most machine learning systems is that
nothing has any impact on performance except matrix
multiplication.
You spend all your time multiplying huge matrices.
So the cost of using functional data structures and
an interpreter instead of
a compiler is just completely washed out in the noise.
wrapping a library for matrix multiplication and
convolution and a few other really low-level computations.
And then we execute the result in a virtual machine,
and as a result, Certigrad is efficient, as I just said,
even though we use functional data structures and
lack some optimizations, it just really doesn't matter.
All the time is spent in GEMM, which is matrix multiplication.
And we're as fast as TensorFlow on CPUs.
Now, we could do GPUs also, but
we're not proving anything about the devices anyway.
So, from the perspective of Certigrad, we assume we have
a matrix multiplication operation that has
the gradient rules that we understand.
So we could just as easily wrap a GPU library or
a library that can schedule different devices.
So, so far we've talked about building the core machine
learning libraries that compute gradients for
all sorts of models.
Now, let's talk about proving properties of specific machine
learning models.
So as many of you probably guessed,
not all machine learning models have useful specifications.
Sometimes they're just heuristics, you're just playing around,
putting layers here and there and seeing what happens.
But many models actually do have useful specifications cuz
they're derived by incrementally improving naive ones using
mathematics.
So here's an example,
this is the naive variational auto encoder.
It's the same graph I showed you before and I'll quickly walk
through what this model actually is supposed to do.
It's a quite interesting model.
Let's say you're feeding in an image x and you're going to
apply a little neural network to encode it into a small number
of bits, actually 40 numbers between zero and one.
And then, you're going to sample a Gaussian with that
as the mean, and then you're going to decode it back to
the original high dimensional space.
And then, your loss function is going to look at the old x and
the reconstructed x and penalize you for differences.
So the goal is to reconstruct x while you have to
funnel it through a small number of bits.
Now, this is a really nice model intuitively, but
it has some problems.
So the induced gradient estimator will have very high
variance, because random variables appear
on the path between the parameters
you're interested in and the loss.
So backpropagation cannot be applied
in a straightforward way.
So, we refine this incrementally.
So the first step is to reparameterize the Gaussian.
So we first sample from a unit Gaussian and
then we just add the mean to it.
And you can prove that this preserves the expected loss.
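Concretely, in the general form of this trick: instead of sampling z ∼ N(μ, σ²) directly, you sample ε ∼ N(0, 1) and set z = μ + σ · ε. The distribution of z is unchanged, but μ now sits on a deterministic path, so backpropagation can reach it.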
And then, the second transformation is to integrate
out the KL-divergence, integrate out part of the loss function.
I'll simplify this a little bit.
Part of this loss function, the loss, is actually going to be
some integral over a normal PDF of this stuff.
And, a lot of this stuff,
you don't actually need to estimate with samples.
You can just compute in closed form analytically,
you can solve part of the integral.
So we're gonna do that here and we're gonna have
some of the encoding loss which can be computed
outside of the integral without looking at the sample and
then the decoding loss is the same as before.
And now this model is state of the art.
This is called the auto-encoding variational Bayes model.
And you can do some very cool things, which I won't go into.
So the status quo is to do pages and
pages of tricky math on paper to derive the AEVB.
You have to compute all these integrals,
you need to justify a bunch of different transformations, and
it's quite tricky.
But then, you usually implement the AEVB at the end and assume
that you did your math right and hope you did your math right.
But we can do better.
We can encapsulate the derivation steps I showed you as
program transformations.
And we can prove the program transformations correct once and
for all in a library, and then apply them automatically
to the naive auto encoder, or to any other naive model to improve
the variance of the gradient estimators.
So let's see how to apply our methodology to this problem
of writing a program transformation to integrate out
the KL-divergence.
So, as always,
we start by specifying what we're trying to implement.
And here's the placeholder,
first of all I'm just gonna do the computer algebra part.
Here's a placeholder for the closed form of the KL-divergence,
which we don't know what it is yet.
I don't wanna do the math by hand.
And here's the spec, that the integral of the Gaussian PDF of
this empirical KL-divergence actually equals this
thing that we were looking for in closed form.
And we program, as always, and, as always,
we leave it as a placeholder and
see what information we find from the proof process.
And we start proving, so we're gonna use this calculational
environment, which lets us state the intermediate steps of
the proof and justify them as we go along.
So first, we're gonna unfold the definition of the empirical kl
to expose these log PDFs.
And we just keep going as if we were doing a derivation by hand
but with some of the steps automated or justified.
And eventually,
we realize we need sigma greater than 0 to apply some lemma.
So we realize that our theorem's actually wrong, so
we debug our specification, add an extra precondition,
sigma greater than zero, and go back to proving.
Step, step, step.
And eventually we're gonna actually come up with something
in closed form that we can use as the result of the program
we're looking for.
So this theorem is still not true, because this is
just a placeholder.
But we go back and debug our program and
fill in the value we computed.
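For reference, in the one-dimensional case the standard closed form is KL(N(μ, σ²) ‖ N(0, 1)) = (μ² + σ² − log σ² − 1) / 2; something of exactly this shape is what the proof forces us to discover.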
And we start proving it again,
and this time the proof goes through and we're done.
Now we still need to write the graph transformation itself;
that was just the mathematical justification.
But this part is routine, I mean,
there's no way to make a mistake, so
we don't really have to think that hard.
Pretty much just keep turning the crank without a care in
the world.
And with that, I'll try to start wrapping up.
So it's hard to build machine learning systems,
even when they don't need to be perfect.
So an interactive proof assistant can make
the process easier.
Not just facilitate perfect code or
bug-free code with limited developer effort.
But it can actually make the process easier in many ways.
So it can help you find bugs as we saw.
It can do algebra for you.
It can partially synthesize program fragments to let you
figure out what programs need to return in various branches.
And most importantly,
it reduces the overall cognitive demand, because you just don't
need to be constantly worrying about whether all
the little parts of your system line up and match up and whether
it actually does something mathematically meaningful.
And as a general principle,
the more precisely you state your goal,
the more the computer can help.
And once we write a functional correctness specification,
in principle, it can build your entire system for you.
I mean, it's a little far-fetched, but
not that far-fetched, to automate this iterative process
I've just been describing.
We're not there yet, but in principle, there's
potentially a path to full large scale system program synthesis,
using this interactive process.
And a really important note is that our methodology can be
adopted incrementally.
So not all code fragments need specifications.
You can have large parts of your system where you don't prove
anything about it.
Even once you state your lemmas and theorems,
you can use them just as documentation,
just to get clarity.
And only prove some parts of it, and leave other things that you
think are probably true unproven.
And just like we wrapped Eigen for
tensor operations, you can wrap a lot of existing code.
So if you really only cared about verifying state of the art
machine learning programs,
you could wrap all of TensorFlow, for example.
Just give it some axioms and
then prove properties about TensorFlow programs.
And actually, overall, this methodology
is right now still substantially more work than just writing
the code, especially if you don't care about it being correct or
you're willing to tolerate silent failures.
But I'd say I got the vast
majority of these benefits, especially reduced
cognitive demand, way before I had verified the system.
I mean I finished the system with extremely high confidence
it was correct.
Just from the functional correctness spec without
the preconditions, with sorrys everywhere.
With some obvious lemmas with no proofs.
You really get the vast majority of the benefits for
just a tiny fraction of the work and then all of the extra effort
that makes it not economical doesn't really add all that much
to the overall trustworthiness of the system.
So I think if this methodology were adopted I would hope people
would not feel burdened by the expectation of full purity.
Yes?
>> So the methodology you described in the slides, of
synthesizing from the process of generating the proof,
is that really the way you developed it? I mean, because it
seems like the same paper from which you get the spec,
the same chapter in an ML textbook from where you get the spec,
will also provide you the program that you need to use.
So why not just start with that?
>> Maybe in some cases. Here I didn't find one,
the pseudocode was very confusing.
Sometimes if you have really good pseudocode,
you could just write it and then just use this to fix the errors.
Sometimes the pseudocode really comes later.
You know mathematically what you want an algorithm to do,
like in this case, the spec is one line.
Unbiased estimate grade into the expected loss.
And the algorithm it depends on implementation details of
your graph. So-
>> It's interesting cuz I would
think that the ML community would be a community
where you could very easily get pseudocode and algorithms.
But it was hard to get specs.
And you're saying it's the other way around,
that it's easy to get specs, it's hard to get pseudocode.
>> I think it depends.
I mean, the example that I chose just happened to have great
simple specs.
There's always a counterexample of something that
is just some simple procedure, heuristic,
that does not have an interesting spec.
But there are a lot of important problems where the spec is
much simpler than the program, so
I'll just leave it at that for now.
And also, there are limits to this.
It didn't find the memoization, for example.
It found the very stupid version.
So you need to be willing to figure out how to write
a program or else you might hit a brick wall.
But you often get a lot of useful information
from the proof process itself.
Yeah?
>> Yes, my question is related to this memoization thing.
Especially, I think a lot of bugs come in when programmers
try to do complex performance optimizations, where
you think you have this property,
and you code up something that is incredibly complex, which
seems like it's working correctly for
a bunch of cases. I think [INAUDIBLE] is one of these
things, like reduced precision in a lot of these new machine
learning models is one of these things.
Is there any way to comment about those things, where
you don't get perfectly
the same program, like- >> So reduced precision, what
you'd wanna prove is something on the floating point side.
You'd wanna prove that for a specific program,
your errors are still bounded by something reasonable.
>> Right, but it is actually not just the floating point
precision.
The fact that it doesn't affect the optimization or
the generalization error that you're seeing,
is that correct?
>> So that's part of the claim.
>> So it relates back to the composing [INAUDIBLE]
in the sense that,
is there any way to comment on things that are known to be
approximations?
That's what I'm, absolutely, no,
we don't have specs for those things yet, but.
>> So I mean I think
the general answer is it's a case by case thing.
So for these graphs, you can do additional transformations that
introduce bias to your estimator but will lower the variance.
And you can prove that you'll have lower variance
even though you're no longer unbiased.
So you have to switch spec to prove the property that you care
about, given the approximation tradeoffs.
I think it's case by case.
There are many cases where some approximation will have
a spec that's meaningful and useful.
And some cases where it really is just, I think this will work
better, and there's no mathematical justification.
Yes?
>> I have a question.
This is really cool work.
I'm thinking, you're saying that TensorFlow applies algebraic
optimizations to their graphs, substitutions. Could you hook
this up to TensorFlow, so that you can validate the transformations
that it does, for- >> Well,
this is, it's not production-quality like TensorFlow, but
it is a TensorFlow competitor in that sense.
>> I understand that.
>> It does what TensorFlow does.
>> Yeah, but let's go with TensorFlow, and
if it makes errors along the way, then if you hook this up as
a validator, you could detect those errors.
>> There are a lot of bugs in TensorFlow.
>> So you could find lots of bugs.
>> So, most of the bugs that I'm aware of are in gradients of
complicated functions, which they
need to derive mathematically and then implement in C++,
and you could use a tool like this for that.
So a lot of the errors there are on the math side,
not just the implementation,
like they've derived some of the things wrong.
And another source of errors, I would presume, at least
this was a source of errors in an older system, was not
the algebraic optimizations in the compiler, but
optimizations saying, okay, this update can be done in place.
There are lots of things that are further separated from
the math, that are more like normal compiler optimizations.
So the short answer's maybe, but the hardest part would be just
connecting the C++ code that they're writing with
anything that has a mathematical meaning, because then you'd
verify the pseudocode of TensorFlow, and then implement it.
>> So like TensorFlow will take a graph and
then try to apply,
it'll apply optimizations to simplify it and
make it more efficient.
>> Yes.
>> Okay.
So that sort of optimization, I mean,
it's got nothing to do with machine learning, but
then one finds bugs in the course of doing that, and so
I'm just thinking >> Yeah.
>> Cuz it's gonna take a lot of time to get a lot of people to
use your system, I would guess.
>> I'm not planning to compete with TensorFlow here,
this is a research prototype.
>> Sure, sure.
>> Yeah, I think they probably have a ton of
bugs in the TensorFlow optimizer and compiler
that have not even been discovered yet.
For people looking for a bug hunt where
people would care about the bugs,
I think that's a great target, and
a lot of them won't require the formal mathematics.
A lot of the errors are just gonna be normal compiler errors.
[LAUGH] Some of them will require the formal mathematics,
but not all of them.
Yeah?
>> On your conclusion slide, at the very top,
you had: a lot of systems do not need to be perfect.
You said that TensorFlow has lots of bugs,
how severe are those bugs?
There are so many unknowns when you apply this stuff, basically.
There's no overall specification in any case, so
how important is verification in that space, really?
And what kind of verification is important?
>> So, I think the worse thing that can happen, so
right now, we're still in the early stages of machine learning
where most training is done offline.
So I'll train my system offline, and
I'll only deploy it if it's learned something from the data,
and it works well on my test set.
So in that scenario it's not really about safety critical
systems, because if it's trained well, and it works well already
then you can probably deploy it and none of the bugs
that affected training are going to leak into deployment.
>> When you fix your bugs you may not get the right model that
you wanted in the first place, right?
>> Yeah, so you can still have the wrong model.
>> So in some sense whatever, if you get a result,
a good model of your training set, problem solved.
You don't care whether there are bugs underneath.
>> Well so there's two issues though.
The main thing, as I pointed out in the beginning with
the example of the statisticians,
is that the models that have already been scaled up and
are deployed for vision, for translation,
those models don't have bugs, or they don't have major bugs, or
they work despite the possibility of having bugs.
>> Yeah, because they are bug tolerant, it doesn't matter.
>> I don't think there's evidence of that, though.
>> However you got the model, when you get
a model at the end, it's the result that matters.
>> Yeah, I- >> How you got-
>> I agree.
>> Okay.
>> The problem is when you're experimenting with a new kind of model,-
>> Yeah.
>> And it doesn't work. >> Yeah.
>> And you don't get the results.
>> Yeah, yeah.
>> Then, if that's because of a bug, and
it's higher- >> But
it could be for many different reasons,
perhaps a bug in the implementation.
I wonder how often it is that, versus
because you just set the knobs the wrong way.
>> It's more often the- >> [INAUDIBLE],
you pick the wrong algorithm, you have the wrong training set,
and that's the huge mess that we are in right now.
>> I agree completely.
Most of the time it's not because of
an implementation bug.
It's just especially tragic and
preventable when an entire field abandons a line of research
because of an implementation bug.
But the other answer is that we're not far from
the world where systems are actually learning from data in
deployment.
So right now we have this two stage system where you train,
you get something that works well, and then you deploy.
But once you have models that are learning on the fly you
don't want them to fail to learn.
It could actually become a safety-critical issue
if they fail to learn things that
you know empirically they should learn, because they learned
similar things, but they fail on this new kind of data.
>> And don't get me wrong, I don't wanna,
this is very interesting, but, your four bullets there.
I would love, in the years to come, to get more and
more evidence that this is really true.
We have not much evidence of that today.
So this is interesting research, great early work.
Continue, but
I'm not convinced that it's moving the needle [INAUDIBLE].
>> So just to clarify, I'm not saying that this methodology-
>> That's what I meant about
this, to make sure you agree with me, or not,
or where is the evidence for that?
>> So just to clarify what I'm claiming here.
I'm not claiming that this methodology will help you find
bugs in TensorFlow.
You asked about that.
That's a separate problem as far as I'm concerned.
I'm saying if you are using this methodology it absolutely helps
you find bugs, because you can't finish your proof.
>> Yes but, Andy was suggesting that you use your methodology
and try to look at TensorFlow to find bugs in TensorFlow.
>> Can you say that more slowly?
>> It's okay?
[LAUGH] Andy was suggesting that you use your methodology and
try to look at TensorFlow to find bugs in TensorFlow.
>> Yeah, if I wrote the TensorFlow algorithms in this
system, in this methodology, and there were bugs,
I would absolutely find them.
>> Yes.
>> But it's not an automatic tool that's gonna point
at TensorFlow, yeah.
>> No, no, no that's why I suggested, right?
>> Yeah. >> And
that's a good suggestion [LAUGH].
>> Yeah, that's a great suggestion, I agree.
>> You'll get much more attention and then you'd get more
results [INAUDIBLE] for the first bullet, for instance, for
me, a skeptic. >> And the first class of
bugs I would get is the gradient calculations for the individual
kernels in TensorFlow, because that's something where it would be
fairly easy to automate the derivations in Lean.
So that might be close to free.
>> Right, and I was thinking that's the thing,
what you said about this incremental point is probably
really interesting, in the sense that you're making the
point that there are bugs at various levels.
You sometimes have bad linear algebra operators.
You sometimes have bad method specifications, and
then you sometimes have just the wrong data, the wrong model,
and so on.
And I think the way I look at it is, what you're trying to say
here is we can build better building blocks underneath these
layers, and if you're doing something at a higher level,
then you know at least the lower levels are intact.
>> Yeah, here's a summary image.
You have a system that does something stupid,
you have a big checklist, lots of possible what-went-wrongs,
and there's just one thing that's crossed out.
From a safety perspective, that's all I'm claiming.
Your system will still go wrong forever for
any number of reasons. Yeah?
>> If you wanted to use
this methodology to find a bug in an existing piece of code,
like TensorFlow,
wouldn't you need the ability to synthesize a counterexample?
Because your methodology allows you to take a buggy program and
turn it into a correct program, but it doesn't let you
demonstrate to somebody- >> Correct.
>> That your buggy program is buggy,
unless you could generate counter-examples,
like here's an input, look and see the gradient is wrong.
How could you synthesize a counter-example?
>> Well, sometimes the way it finds bugs will point at
a counter-example.
Let me go back.
[LAUGH] A lot of slides.
So in this case, I get to some goal with x on one side, and
x prime on the other side.
It's pretty easy to go from that to say, okay, I find them in
the program, set them to different values,
and I can construct the counter-example.
That doesn't always work like that.
But often the information you learn about the program,
when the proof fails, indicates what the counter-example would
need to look like.
>> And another one, [INAUDIBLE] should be able to differentiate
under the integral.
>> Yeah, yeah, yeah. >> Yeah, those counter-examples
are way more complicated >> Well,-
>> You have to construct some
crazy thing.
>> Right, or for the sigma bigger than zero,
maybe you can make one that's not right.
>> Yeah.
>> That's a good example.
>> Yeah.
>> So is this your PhD project?
>> No, this specifically is not; follow-up work might be.
>> So you're still working on it.
Awesome.
[APPLAUSE] >> Thanks.
>> [APPLAUSE] >> That was my last slide, okay.
[LAUGH] Great.