WEBVTT

NOTE Created by CaptionSync from Automatic Sync Technologies www.automaticsync.com

00:00:00.276 --> 00:00:05.216 align:middle
Good morning, bom dia.

00:00:05.216 --> 00:00:10.926 align:middle
So today I want to talk to you
about a little bit of a story.

00:00:10.926 --> 00:00:15.456 align:middle
I want to do a very different talk than most
other conference talks that you have seen

00:00:15.456 --> 00:00:17.256 align:middle
or that you would have here today.

00:00:17.906 --> 00:00:22.736 align:middle
Rather than telling you what to do or
telling you what technologies to use

00:00:22.736 --> 00:00:25.836 align:middle
or how to do something well,
I'm going to tell you a story

00:00:25.836 --> 00:00:27.386 align:middle
of something that I've royally screwed up.

00:00:27.806 --> 00:00:30.196 align:middle
We all make mistakes.

00:00:30.196 --> 00:00:32.336 align:middle
We've all gone out and built systems.

00:00:32.336 --> 00:00:35.766 align:middle
And a team that I had the
opportunity to lead a number

00:00:35.766 --> 00:00:37.686 align:middle
of years ago, we wound up building a platform.

00:00:37.686 --> 00:00:39.496 align:middle
And I'm gonna tell you a
little about the background.

00:00:39.496 --> 00:00:44.786 align:middle
But we made a bunch of mistakes and rather
than let those mistakes die out and let me

00:00:45.116 --> 00:00:46.836 align:middle
and the team be the ones that learned from it.

00:00:47.236 --> 00:00:50.606 align:middle
I really want to take all of you on a little
bit of a journey and tell you about some

00:00:50.606 --> 00:00:52.766 align:middle
of the things that we did right
some of the things we did wrong.

00:00:52.976 --> 00:00:56.986 align:middle
So today is really going to be
a journey about what we built.

00:00:57.546 --> 00:00:58.576 align:middle
Why we built it.

00:00:58.576 --> 00:00:59.456 align:middle
How we built it.

00:00:59.806 --> 00:01:03.776 align:middle
Where we ran into significant trouble,
and where everything worked well.

00:01:04.296 --> 00:01:07.866 align:middle
And this is my puppy Adda, who's going to help
me on some of these transition slides.

00:01:08.106 --> 00:01:09.596 align:middle
So let's jump into the background.

00:01:10.006 --> 00:01:14.296 align:middle
I walked in to a growing team.

00:01:14.546 --> 00:01:19.516 align:middle
They had recently received a major round of
venture investment, grew the engineering team

00:01:19.516 --> 00:01:22.846 align:middle
from five to 18 within a year; it
would later be about 30 people.

00:01:23.466 --> 00:01:28.236 align:middle
And it was a very good mix between PHP
developers, a couple people with Go experience,

00:01:28.656 --> 00:01:32.846 align:middle
a couple of really, really good frontend
engineers, data engineers, et cetera.

00:01:33.216 --> 00:01:35.076 align:middle
They really cared about software quality.

00:01:35.076 --> 00:01:38.536 align:middle
The people that they hired in
were very, very good engineers.

00:01:39.656 --> 00:01:44.696 align:middle
But they were dealing with a legacy system, and
we've all dealt with legacy systems in the past.

00:01:45.266 --> 00:01:47.666 align:middle
This was quite an interesting one.

00:01:47.826 --> 00:01:50.456 align:middle
The one I really want to
highlight here is this bottom one.

00:01:51.016 --> 00:01:55.576 align:middle
There were over a thousand cron jobs that
ran as frequently as every two minutes,

00:01:56.116 --> 00:01:58.626 align:middle
which corrected data in the database.

00:01:59.466 --> 00:02:02.106 align:middle
That should give you a little bit
of an idea about the state of the system.

00:02:03.396 --> 00:02:08.286 align:middle
It was so hard to find and fix
bugs that rather than fix them,

00:02:08.286 --> 00:02:10.056 align:middle
they just patched over them and band-aided it.

00:02:10.086 --> 00:02:12.666 align:middle
And that's not a knock on the developers.

00:02:12.666 --> 00:02:16.296 align:middle
Like this system was around for five
years, was built very, very quickly,

00:02:16.766 --> 00:02:19.236 align:middle
was scaled with
varying amounts of talent.

00:02:19.466 --> 00:02:21.316 align:middle
You know, a classical legacy problem.

00:02:22.566 --> 00:02:24.866 align:middle
But the business needed more.

00:02:25.226 --> 00:02:26.746 align:middle
The business needed stability.

00:02:27.296 --> 00:02:30.506 align:middle
They needed an application that worked
for them, that scaled with them.

00:02:30.916 --> 00:02:35.676 align:middle
When I walked in, they had just gotten
off of a nine month product freeze,

00:02:36.206 --> 00:02:38.816 align:middle
which means that engineering
said, for nine months we're going

00:02:38.816 --> 00:02:41.536 align:middle
to do nothing but solve technical debt.

00:02:42.256 --> 00:02:44.366 align:middle
We're going to do nothing but
try to clean up the system.

00:02:44.856 --> 00:02:49.006 align:middle
It was another three months before the first
meaningful product release happened from there.

00:02:49.446 --> 00:02:52.256 align:middle
And it was incredibly, incredibly
painful to work with.

00:02:52.906 --> 00:02:58.686 align:middle
So we were left with a decision,
do we refactor or do we rebuild?

00:02:59.846 --> 00:03:05.646 align:middle
The team had spent about nine months, like I
just mentioned, trying to refactor and kind

00:03:05.646 --> 00:03:07.896 align:middle
of not really having very much success with it.

00:03:08.336 --> 00:03:11.246 align:middle
And so what we decided to do was not to refactor.

00:03:12.586 --> 00:03:17.366 align:middle
We went into a room - head of
product, head of UX and about five

00:03:17.366 --> 00:03:19.336 align:middle
or six other individual contributors
and myself -

00:03:19.816 --> 00:03:21.996 align:middle
and we went into a room with
four whiteboard walls.

00:03:22.586 --> 00:03:26.976 align:middle
And we started to ask ourselves,
what does this system do?

00:03:27.426 --> 00:03:31.436 align:middle
What are the core things that
this application assumes?

00:03:31.906 --> 00:03:33.386 align:middle
And I'll give you an example right now.

00:03:33.926 --> 00:03:38.306 align:middle
Your system - any application that you
work on - probably has a unique column

00:03:38.306 --> 00:03:40.216 align:middle
on the user table for email address.

00:03:40.876 --> 00:03:42.516 align:middle
Emails are unique within the system.

00:03:42.756 --> 00:03:43.586 align:middle
That is an assumption.

00:03:44.736 --> 00:03:49.876 align:middle
However, in most well-designed systems, that
assumption is isolated to that user system.

00:03:50.366 --> 00:03:55.686 align:middle
If you wanted to change that, if you wanted
to allow a single email to register

00:03:55.686 --> 00:03:59.836 align:middle
for multiple user accounts,

00:04:00.656 --> 00:04:05.206 align:middle
you just remove that unique, change your login,
change your registration and you're done.

00:04:05.966 --> 00:04:09.146 align:middle
This application, it was all over the place.

00:04:09.756 --> 00:04:15.626 align:middle
Things inside of systems that had no business
knowing about an email address relied

00:04:15.626 --> 00:04:16.896 align:middle
on the fact that emails were unique.

00:04:17.706 --> 00:04:21.016 align:middle
And so in order to change that
assumption, we would have needed

00:04:21.046 --> 00:04:24.716 align:middle
to touch about 60 percent of the code.

00:04:24.906 --> 00:04:27.676 align:middle
So what we did is we went on the
four walls of this whiteboard room

00:04:27.876 --> 00:04:32.786 align:middle
and we filled every single wall with a
core assumption of what our platform was.

00:04:33.236 --> 00:04:36.686 align:middle
And then we asked product:
which out of these do you want

00:04:36.686 --> 00:04:40.736 align:middle
to change significantly in
the next 6 to 12 months?

00:04:41.496 --> 00:04:43.816 align:middle
We wound up with, out of four entire walls,

00:04:43.816 --> 00:04:46.456 align:middle
exactly three that were not going
to be changed significantly.

00:04:47.166 --> 00:04:50.226 align:middle
And so what we realized is
that it's not a refactor.

00:04:50.226 --> 00:04:52.756 align:middle
It's not a rebuild; it's actually a V2.

00:04:52.836 --> 00:04:56.016 align:middle
It's actually a completely separate
product that we want to build.

00:04:56.476 --> 00:04:59.886 align:middle
At the highest of high level, it
solves the same business problem,

00:04:59.886 --> 00:05:02.596 align:middle
but how it does it is drastically different.

00:05:03.736 --> 00:05:08.616 align:middle
We could have taken the time and refactored that
in, but it would have taken three or four years

00:05:08.616 --> 00:05:11.456 align:middle
to actually get to the end goal
of where the product wanted to be.

00:05:12.006 --> 00:05:16.006 align:middle
And so instead, we set
ourselves a four-month goal

00:05:16.006 --> 00:05:18.676 align:middle
to get an MVP of the V2 up and running.

00:05:19.736 --> 00:05:21.756 align:middle
And this is the story of that MVP process.

00:05:23.106 --> 00:05:27.846 align:middle
So stepping into the technical architecture
here, when we started building this MVP,

00:05:27.846 --> 00:05:30.386 align:middle
we had to have some kind of a guiding framework.

00:05:30.986 --> 00:05:35.966 align:middle
And what we settled on was this, and I'll
walk you through each one of the pieces.

00:05:36.046 --> 00:05:42.936 align:middle
But the basic concept here is everything
that runs on a server only does API calls.

00:05:44.136 --> 00:05:47.156 align:middle
The frontend is the only thing
- this frontend server here -

00:05:47.156 --> 00:05:51.316 align:middle
is the only thing that actually
knows about HTML,

00:05:51.316 --> 00:05:53.156 align:middle
that actually knows anything about a browser.

00:05:54.946 --> 00:06:00.066 align:middle
Everything else talks through this
gateway to a service over REST.

00:06:00.476 --> 00:06:01.766 align:middle
So API-first development.

00:06:02.356 --> 00:06:05.946 align:middle
The API gateway is the first thing
I want to focus on because it's one

00:06:05.946 --> 00:06:08.816 align:middle
of the things I think we
got really, really right.

00:06:09.216 --> 00:06:12.356 align:middle
This was in the neighborhood of 2015

00:06:12.826 --> 00:06:20.336 align:middle
and Amazon had just released their API
gateway project two days before we decided

00:06:20.336 --> 00:06:22.456 align:middle
to settle on Tyk.

00:06:22.456 --> 00:06:26.486 align:middle
Tyk is an open-source project; it's
written in Go, uses MongoDB on the backend,

00:06:26.486 --> 00:06:31.046 align:middle
and it integrates well with Consul and
with a lot of modern DevOps tools.

00:06:32.046 --> 00:06:37.906 align:middle
But basically what Tyk allows you to do
is configure REST API endpoints with JSON.

00:06:38.426 --> 00:06:43.056 align:middle
So I can hit an API and say, create this
new endpoint, here's the backend server

00:06:43.056 --> 00:06:46.846 align:middle
that it's going to deal with, I
want you to handle OAuth for me,

00:06:46.876 --> 00:06:49.396 align:middle
so terminate OAuth and just give me a user ID.

00:06:49.396 --> 00:06:51.866 align:middle
I don't want to know any of that stuff.

00:06:51.866 --> 00:06:56.306 align:middle
Handle rate limiting, handle
quotas, do all of this stuff.

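What that configuration amounted to was one JSON document per endpoint. A rough sketch of the idea; the field names here are illustrative, not Tyk's exact schema:

```python
# Sketch: the kind of JSON definition you hand to an API gateway to create
# a new endpoint. Field names are illustrative, not Tyk's exact schema.
import json

def make_api_definition(name, listen_path, target_url):
    """Build one endpoint definition: route, auth, and rate limits."""
    return {
        "name": name,
        "proxy": {
            "listen_path": listen_path,  # public path the gateway exposes
            "target_url": target_url,    # backend server it proxies to
        },
        "use_oauth2": True,              # gateway terminates OAuth for us
        "rate_limit": {"per_second": 100},
        "quota": {"per_day": 100000},
    }

definition = make_api_definition(
    "lessons", "/api/v1/lessons/", "http://lessons.internal:8080/")
payload = json.dumps(definition)  # body we would POST to the gateway's admin API
```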
00:06:56.306 --> 00:07:01.516 align:middle
And one of the really cool
parts is that it had support for middleware

00:07:02.206 --> 00:07:07.106 align:middle
where we could actually have a very
slowly-evolving frontend REST API,

00:07:07.796 --> 00:07:12.486 align:middle
that our frontend servers built on, that
our clients built integrations against,

00:07:12.486 --> 00:07:21.296 align:middle
etc. While our internal servers were more RPC
based, a lot faster moving and didn't have

00:07:21.406 --> 00:07:24.326 align:middle
to worry about backwards
compatibility nearly as much.

00:07:24.976 --> 00:07:26.276 align:middle
There were still some challenges there.

00:07:26.376 --> 00:07:31.096 align:middle
But the API gateway pattern is definitely
something that I'm really proud

00:07:31.096 --> 00:07:34.376 align:middle
of from this build; it actually
worked really, really well.

00:07:36.346 --> 00:07:40.226 align:middle
So stepping back out, the next
really key part is RabbitMQ.

00:07:40.796 --> 00:07:48.346 align:middle
Again, this is the 2015 timeframe; the stack
that we're building on is mostly PHP.

00:07:48.346 --> 00:07:55.336 align:middle
We had some Go services, et cetera, and at the
time Kafka did not really support PHP well.

00:07:55.676 --> 00:07:59.266 align:middle
And Go did not support Kafka well either.

00:07:59.836 --> 00:08:02.166 align:middle
And so we decided to go with RabbitMQ.

00:08:02.536 --> 00:08:05.866 align:middle
If I was doing this decision today,
I would 100 percent pick Kafka

00:08:06.156 --> 00:08:08.866 align:middle
for a reason that's going to be apparent.

00:08:09.836 --> 00:08:15.926 align:middle
We were using this as a pseudo event sourcing
database, meaning every time we made a change

00:08:15.926 --> 00:08:19.426 align:middle
in the application, we would emit
an event describing that change.

00:08:20.356 --> 00:08:25.256 align:middle
Theoretically you could replay those events in
order and get the system back into the state.

00:08:26.126 --> 00:08:31.006 align:middle
I said theoretically because in
practice it didn't really work that well.

00:08:31.006 --> 00:08:35.166 align:middle
One of the big mistakes
that we made was that none

00:08:35.166 --> 00:08:39.086 align:middle
of the services relied on
Rabbit to set the state.

00:08:39.736 --> 00:08:42.896 align:middle
So whenever they emitted the state
changes it was kind of just advisory.

00:08:43.346 --> 00:08:47.976 align:middle
So we would run into problems where
the JSON wasn't fully populated

00:08:47.976 --> 00:08:55.036 align:middle
or messages were just completely blank by
accident and they weren't caught for a while.

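The blank-message problem is cheap to guard against at the source. A minimal sketch; the envelope fields are my own invention, not anything we actually ran:

```python
# Sketch: validate an event envelope before handing it to the publisher,
# so blank or half-populated messages are rejected at the source.
# The envelope fields below are illustrative, not a real schema.
import json
import time
import uuid

REQUIRED = ("event_type", "entity_id", "payload")

def make_event(event_type, entity_id, payload):
    """Wrap one state change in a self-describing envelope."""
    return {
        "event_id": str(uuid.uuid4()),
        "emitted_at": time.time(),
        "event_type": event_type,
        "entity_id": entity_id,
        "payload": payload,
    }

def validate(event):
    """Raise if any required field is missing or empty; return the body."""
    for field in REQUIRED:
        if not event.get(field):
            raise ValueError(f"event missing or empty field: {field}")
    return json.dumps(event)  # the body we would publish to the broker

ok = make_event("user.email_changed", "user-42", {"email": "a@example.com"})
body = validate(ok)
```

Publishing only what passes a check like this would have caught those blank messages the day they first appeared.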
00:08:55.156 --> 00:09:02.106 align:middle
If I were to do this again, I would
absolutely use a similar-style system,

00:09:02.506 --> 00:09:04.916 align:middle
but I would make it event sourcing first.

00:09:04.916 --> 00:09:10.136 align:middle
Meaning the event list is the single source
of truth and the services mainly use that,

00:09:10.136 --> 00:09:14.016 align:middle
have a query against it, have an API
where they can look at those events.

00:09:14.636 --> 00:09:18.676 align:middle
So this was something that
was really difficult to get running,

00:09:19.046 --> 00:09:23.426 align:middle
caused us a lot of frustration, but in the
end actually gave us a lot of insights.

00:09:23.426 --> 00:09:28.946 align:middle
So it was kind of a mixed bag because having
everything talking to a single archive meant

00:09:28.946 --> 00:09:31.776 align:middle
that we had one single picture
over everything that was happening

00:09:31.776 --> 00:09:34.156 align:middle
in the application from the event stream.

00:09:34.156 --> 00:09:36.926 align:middle
So while it was a pain in the
neck, it also did help us a lot.

00:09:37.966 --> 00:09:42.216 align:middle
The other thing I want to talk about
at this layer, is this service layer.

00:09:43.226 --> 00:09:51.326 align:middle
So we divided our services into roughly three
categories, domain services which are meant

00:09:51.326 --> 00:09:55.966 align:middle
to sound like domain objects because that's
kind of how we were thinking about them.

00:09:55.966 --> 00:10:01.556 align:middle
So our business entities, all of our business
logic, all lived in these types of services.

00:10:01.556 --> 00:10:05.476 align:middle
They communicated over HTTP and
they all had their own persistence.

00:10:05.916 --> 00:10:08.356 align:middle
So they all had their own
databases and caching systems.

00:10:09.306 --> 00:10:15.166 align:middle
We also had asynchronous services that
purely listened over RabbitMQ to do things

00:10:15.166 --> 00:10:20.606 align:middle
like long running jobs, batch processing, we
did a lot of video transcoding, et cetera.

00:10:20.606 --> 00:10:23.676 align:middle
So all of that stuff was handled
via asynchronous services.

00:10:23.726 --> 00:10:27.546 align:middle
And then we finally had these
things that we called meta services.

00:10:27.646 --> 00:10:31.866 align:middle
It's kinda hard to explain
what a meta service is.

00:10:31.866 --> 00:10:37.016 align:middle
So I'll give you an example in just
a second. But one of the challenges:

00:10:37.236 --> 00:10:44.506 align:middle
we approached this design as normal object
oriented design: your domain entities,

00:10:44.506 --> 00:10:48.006 align:middle
you split them apart, you find your
boundaries, you create your services just

00:10:48.006 --> 00:10:51.056 align:middle
like you would do it in PHP, whether
it's Symfony or Laravel or Zend

00:10:51.056 --> 00:10:53.186 align:middle
or whatever framework you want to do.

00:10:53.256 --> 00:10:55.766 align:middle
And those services talk to each other.

00:10:57.156 --> 00:11:01.576 align:middle
We modeled our microservice architecture
off of very, very similar principles.

00:11:02.296 --> 00:11:03.766 align:middle
There is something that we
didn't consider though.

00:11:05.696 --> 00:11:09.846 align:middle
How unreliable a service call
is in relation to a method call.

00:11:10.986 --> 00:11:12.616 align:middle
So ask yourself a question.

00:11:13.356 --> 00:11:16.156 align:middle
How often do you expect a
method to fail randomly?

00:11:17.106 --> 00:11:20.246 align:middle
And I don't mean the method to not return

00:11:20.246 --> 00:11:22.276 align:middle
because that would be somewhere
around one in a billion.

00:11:22.816 --> 00:11:28.526 align:middle
If you look at Symfony in a normal default
configuration running in production,

00:11:29.026 --> 00:11:33.666 align:middle
your front page may take
10,000 method calls to render.

00:11:34.376 --> 00:11:36.856 align:middle
And how often does one of them fail randomly?

00:11:37.526 --> 00:11:39.486 align:middle
Maybe one every 100,000 requests?

00:11:40.206 --> 00:11:43.996 align:middle
But we're actually not talking about
what happens inside that method

00:11:43.996 --> 00:11:46.166 align:middle
because that stays the same
when you go to services.

00:11:46.766 --> 00:11:49.456 align:middle
We're actually talking about
the method call itself.

00:11:50.666 --> 00:11:54.186 align:middle
It's so infrequent that you've
probably never even thought about it.

00:11:54.816 --> 00:11:58.566 align:middle
You've never thought that: hey, I have
this object, I know it's a valid object,

00:11:58.656 --> 00:12:02.496 align:middle
I'm going to call this method on
it, maybe that's not going to work.

00:12:03.046 --> 00:12:08.146 align:middle
Whereas in a service architecture,
you absolutely, positively have to.

00:12:08.146 --> 00:12:13.596 align:middle
HTTP calls, if you're very, very good at
operating an HTTP service, you're maybe going

00:12:13.596 --> 00:12:19.426 align:middle
to get five nines of uptime,
99.999 percent uptime,

00:12:19.826 --> 00:12:23.686 align:middle
which translates to one in every 100,000
requests failing randomly.

00:12:23.936 --> 00:12:25.396 align:middle
And there's nothing that you can do about it.

00:12:26.286 --> 00:12:29.936 align:middle
So compare those two numbers, one
in infinity versus one in 100,000.

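That gap compounds as requests fan out. A quick back-of-the-envelope, assuming each remote call fails independently one time in 100,000:

```python
# Back-of-the-envelope: probability a request fails when it fans out into
# n remote calls, each independently failing once per 100,000 calls.
def request_failure_probability(n_calls, per_call_failure=1e-5):
    """P(at least one of n independent calls fails)."""
    return 1.0 - (1.0 - per_call_failure) ** n_calls

# One remote call: roughly one failure per 100,000 requests.
single = request_failure_probability(1)
# Fan out to 10 services and it's roughly one failure per 10,000 requests.
fanout = request_failure_probability(10)
```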
00:12:31.066 --> 00:12:34.806 align:middle
This is what got us into a lot of trouble.

00:12:34.916 --> 00:12:37.766 align:middle
So the question is, how small
should you build your services?

00:12:39.006 --> 00:12:41.716 align:middle
We started with the idea of one week.

00:12:42.886 --> 00:12:47.646 align:middle
One week should be about the amount of time that
if you have a solid specification for a service,

00:12:47.646 --> 00:12:50.556 align:middle
you should be able to go from
zero to a working service.

00:12:50.956 --> 00:12:53.456 align:middle
That way, if we made a mistake,
if we had a significant issue,

00:12:53.456 --> 00:12:55.206 align:middle
we could literally throw a service away.

00:12:55.826 --> 00:12:58.816 align:middle
I've heard some people talk about
that number as a rough benchmark.

00:12:59.136 --> 00:13:03.446 align:middle
And at this point in my career
and after this experience,

00:13:03.446 --> 00:13:05.186 align:middle
I will tell you that is an absolute mistake.

00:13:06.876 --> 00:13:11.456 align:middle
Here's why, this was a rough
model of our domain.

00:13:11.456 --> 00:13:16.856 align:middle
We did an e-learning platform that would
deliver lessons via a web platform.

00:13:17.376 --> 00:13:21.436 align:middle
And so we had users, we had assignments
which assigned lessons to users,

00:13:21.786 --> 00:13:25.836 align:middle
we had a history which showed what
lessons a user interacted with.

00:13:26.356 --> 00:13:30.196 align:middle
We had content associated with lessons
and we had assets associated with content.

00:13:31.066 --> 00:13:36.256 align:middle
And the way we modeled this as a system
architecture was roughly every one

00:13:36.256 --> 00:13:38.246 align:middle
of those entities became its own service.

00:13:38.716 --> 00:13:43.456 align:middle
Now it's actually a bit more
complicated than this.

00:13:43.456 --> 00:13:46.156 align:middle
This is a little bit simplified
to make the point:

00:13:46.156 --> 00:13:49.406 align:middle
each one of these services did
have more than one database table.

00:13:49.676 --> 00:13:53.186 align:middle
It did have more logic built into
it, but this is the rough concept.

00:13:53.186 --> 00:13:55.376 align:middle
And so this raises a question.

00:13:56.146 --> 00:14:00.506 align:middle
If you are building for the frontend, how
would you get everything that you need

00:14:00.786 --> 00:14:06.056 align:middle
to service a request or to render
a page that shows an assignment?

00:14:07.696 --> 00:14:12.266 align:middle
Today and in fact, back then, one
answer could be GraphQL, right?

00:14:12.316 --> 00:14:14.886 align:middle
GraphQL is phenomenal at stitching
things like this together.

00:14:15.546 --> 00:14:19.706 align:middle
The problem was that this need
wasn't only on the frontend.

00:14:20.396 --> 00:14:23.686 align:middle
Backend services needed to look
at things and aggregate as well.

00:14:24.026 --> 00:14:26.426 align:middle
So that's where these meta services came in.

00:14:27.556 --> 00:14:29.526 align:middle
They had domain knowledge.

00:14:30.286 --> 00:14:32.156 align:middle
So it knew what an assignment was.

00:14:32.156 --> 00:14:34.436 align:middle
It wasn't just stitching
things randomly together.

00:14:34.746 --> 00:14:38.126 align:middle
It knew what it was creating so we
could actually add some business logic

00:14:38.126 --> 00:14:39.216 align:middle
in there around that.

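In code, a meta service is just an aggregator that knows the domain. A sketch with stubbed-in fetchers standing in for the real HTTP calls; the entities and fields are hypothetical:

```python
# Sketch of a meta service: it aggregates several domain services into one
# response and knows what an "assignment" means while doing so.
# The fetchers stand in for real HTTP calls; the fields are hypothetical.
def fetch_assignment(assignment_id):
    return {"id": assignment_id, "user_id": "u1", "lesson_id": "l1"}

def fetch_user(user_id):
    return {"id": user_id, "name": "Ada"}

def fetch_lesson(lesson_id):
    return {"id": lesson_id, "title": "Intro to Go"}

def assignment_view(assignment_id):
    """Compose the model the frontend needs to render one assignment."""
    assignment = fetch_assignment(assignment_id)
    user = fetch_user(assignment["user_id"])
    lesson = fetch_lesson(assignment["lesson_id"])
    # Domain logic can live here too, e.g. hiding unpublished lessons.
    return {
        "assignment_id": assignment["id"],
        "assignee": user["name"],
        "lesson_title": lesson["title"],
    }

view = assignment_view("a1")
```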
00:14:39.706 --> 00:14:43.996 align:middle
But most importantly, it acted as
a pain function for developers.

00:14:44.866 --> 00:14:50.176 align:middle
If we got our backend model wrong, we would
feel it when we built that meta service.

00:14:50.696 --> 00:14:57.546 align:middle
And so by forcing us to maintain a service
to fill in the gap, it forced developers

00:14:57.546 --> 00:15:00.056 align:middle
to realize that: hey, this other model was bad.

00:15:00.716 --> 00:15:02.366 align:middle
Or this other model had problems.

00:15:04.636 --> 00:15:05.816 align:middle
So let's ask another question.

00:15:07.456 --> 00:15:11.936 align:middle
How would you get a list of
lessons ordered by the author name

00:15:12.016 --> 00:15:13.746 align:middle
of the content within those lessons?

00:15:14.446 --> 00:15:15.776 align:middle
Think about where that data lives.

00:15:15.776 --> 00:15:21.146 align:middle
You want to order by this user service
over here, but you want to join

00:15:21.146 --> 00:15:24.136 align:middle
against content and then finally return lessons.

00:15:24.856 --> 00:15:27.836 align:middle
That would be an absolute
nightmare to do over REST.

00:15:28.496 --> 00:15:33.406 align:middle
I mean, you basically would have to
query every single row from every single service

00:15:33.936 --> 00:15:35.406 align:middle
and try to stitch that back together.

00:15:36.756 --> 00:15:40.906 align:middle
It took about six months to solve that
problem once we sat down and tried to do it,

00:15:41.256 --> 00:15:45.906 align:middle
which is where I think the real failing of this
architecture lay: going this small on services.

00:15:46.256 --> 00:15:49.076 align:middle
We ignored what the business domain was.

00:15:49.536 --> 00:15:53.546 align:middle
We ignored the bounded context and
went way smaller than was necessary.

00:15:53.546 --> 00:15:57.696 align:middle
And to be fair, we didn't know that this type of
requirement was going to exist when we built it.

00:15:58.396 --> 00:16:04.246 align:middle
So keep in mind, keep things as wide as
you can and only really cut those services

00:16:04.666 --> 00:16:06.996 align:middle
when those boundaries are clear and easy.

00:16:08.046 --> 00:16:12.896 align:middle
By the way, the way we solved this was
with a service that used Elasticsearch

00:16:13.376 --> 00:16:16.616 align:middle
and basically kept its own
model of everything in here

00:16:16.616 --> 00:16:21.056 align:middle
and became a generic search
service, which yeah, was challenging.

00:16:21.536 --> 00:16:24.036 align:middle
Let's talk about infrastructure.

00:16:24.416 --> 00:16:26.676 align:middle
I just told you something
that we got horribly wrong.

00:16:26.676 --> 00:16:29.736 align:middle
Now I'm going to tell you about
something that we got ridiculously right.

00:16:30.346 --> 00:16:31.086 align:middle
I think at least.

00:16:31.996 --> 00:16:38.326 align:middle
We had been running Apache Mesos for a
while at that point because we were using Spark.

00:16:38.966 --> 00:16:44.866 align:middle
And Apache Mesos is basically very similar
to Kubernetes but for arbitrary jobs.

00:16:45.496 --> 00:16:47.196 align:middle
You can run a farm of servers.

00:16:47.706 --> 00:16:53.236 align:middle
You give Mesos jobs and it figures out how to
run them and it runs them across the cluster.

00:16:54.056 --> 00:16:59.666 align:middle
So we had a lot of experience running Mesos and
at the time Kubernetes was really not a thing.

00:16:59.666 --> 00:17:01.246 align:middle
I think they had just announced it.

00:17:01.546 --> 00:17:04.756 align:middle
It may have even been alpha, but none
of the cloud services supported it.

00:17:05.176 --> 00:17:09.446 align:middle
And so we decided to go this
direction, running Marathon,

00:17:09.446 --> 00:17:12.216 align:middle
which was the Docker scheduler on top of Mesos.

00:17:13.636 --> 00:17:14.876 align:middle
Took a little bit to get running.

00:17:14.876 --> 00:17:17.866 align:middle
But once it was running, it
worked phenomenally well.

00:17:18.916 --> 00:17:20.426 align:middle
And so I want to take you through the life

00:17:20.426 --> 00:17:23.536 align:middle
of an actual request just to
show how powerful this was.

00:17:24.746 --> 00:17:26.136 align:middle
The very first thing that happened

00:17:26.426 --> 00:17:32.846 align:middle
when you called an API was it would hit an
external elastic load balancer, an Amazon ELB,

00:17:33.676 --> 00:17:37.986 align:middle
which are very, very, very reliable,
but kind of slow to reconfigure.

00:17:38.656 --> 00:17:43.786 align:middle
Whereas Marathon would want to reconfigure
things every couple of milliseconds at times.

00:17:44.096 --> 00:17:50.936 align:middle
And so the ELB would talk to an HAProxy
instance, which was a little bit less reliable

00:17:51.046 --> 00:17:54.096 align:middle
but was very, very, very fast to update.

00:17:55.396 --> 00:17:57.756 align:middle
Those requests would then
go into Tyk into our gateway

00:17:58.636 --> 00:18:03.056 align:middle
and then Tyk would call an internal service
inside the firewall to an internal ELB,

00:18:04.096 --> 00:18:06.816 align:middle
which would then go through the
same process and hit our service.

00:18:07.626 --> 00:18:14.076 align:middle
This looks heavy, but including Tyk, including
OAuth termination, including rate limiting,

00:18:14.426 --> 00:18:18.356 align:middle
including the REST deserialization in
any middleware that we had in here,

00:18:18.696 --> 00:18:21.506 align:middle
this entire process took
about 10 to 15 milliseconds.

00:18:22.106 --> 00:18:25.276 align:middle
So really, really, really fast,
and it gave us near-infinite,

00:18:25.276 --> 00:18:29.506 align:middle
well, not near-infinite scalability,
let's not go that far, but a good bit

00:18:29.506 --> 00:18:33.096 align:middle
of horizontal scalability with this.

00:18:33.856 --> 00:18:40.096 align:middle
And then if that internal service wanted to
talk to another API, it basically just bypassed

00:18:40.096 --> 00:18:44.126 align:middle
that external step and talked straight
to that other service through that ELB.

00:18:44.866 --> 00:18:51.696 align:middle
So what we wound up having was a system where if
we wanted to add nodes, we could drag a slider

00:18:52.206 --> 00:18:58.346 align:middle
and within 30 to 50 milliseconds have
every single machine running a new service

00:18:58.706 --> 00:19:00.226 align:middle
and the system would reconfigure itself.

00:19:00.536 --> 00:19:05.816 align:middle
We were deploying on average, I think it was
about 500 machines per day where we would spin

00:19:05.816 --> 00:19:08.006 align:middle
down old machines and spin up new machines

00:19:08.006 --> 00:19:13.926 align:middle
and we had almost 100 percent uptime during the
time we actually ran this, while I was there.

00:19:13.926 --> 00:19:19.176 align:middle
It turned out to be very, very
reliable once we got it up and running.

00:19:19.176 --> 00:19:20.556 align:middle
Touching on logging really quickly.

00:19:21.366 --> 00:19:24.686 align:middle
Logging is insanely important when
you're building a distributed system.

00:19:24.686 --> 00:19:30.486 align:middle
This was one of the core things that we
did initially, using Logspout to collect logs

00:19:30.576 --> 00:19:32.076 align:middle
and ship them into Datadog.

00:19:32.916 --> 00:19:36.046 align:middle
And StatsD as well for application metrics.

00:19:36.336 --> 00:19:38.716 align:middle
And we also used something called Zipkin.

00:19:39.586 --> 00:19:45.466 align:middle
Zipkin was simultaneously one of the
biggest pains in the rear end that we worked

00:19:45.466 --> 00:19:49.666 align:middle
with as well as one of the most
powerful tools that we had.

00:19:50.436 --> 00:19:54.266 align:middle
Getting it running at least at
the time was an utter nightmare.

00:19:55.696 --> 00:19:58.476 align:middle
Once it was running, the data
that we got was incredible.

00:19:59.266 --> 00:20:04.346 align:middle
Basically, when you have a
service that calls other services,

00:20:05.006 --> 00:20:08.306 align:middle
you pass along a request id on that other call.

00:20:08.796 --> 00:20:12.266 align:middle
And Zipkin looks at that data
and is able to correlate requests.

00:20:12.636 --> 00:20:17.676 align:middle
So you can see for one service, it may have
taken 100 milliseconds to serve that request.

00:20:18.126 --> 00:20:22.976 align:middle
You can look at every single sub request,
every single part, no matter how far it fans

00:20:22.976 --> 00:20:25.056 align:middle
out into your system, all from one graph.
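
The request-id propagation just described can be sketched in a few lines. This is a hand-rolled illustration rather than Zipkin's actual client library; the B3 header names are Zipkin's real propagation convention, but the function and its wiring here are hypothetical.

```python
import uuid

# Zipkin's B3 propagation headers: the trace id ties every hop of one
# request together; each hop gets its own span id and records its parent.
TRACE_HEADER = "X-B3-TraceId"
SPAN_HEADER = "X-B3-SpanId"
PARENT_HEADER = "X-B3-ParentSpanId"

def outgoing_headers(incoming: dict) -> dict:
    """Build headers for a downstream call, continuing the incoming trace.

    If there is no incoming trace (we are the edge service), start one.
    """
    trace_id = incoming.get(TRACE_HEADER, uuid.uuid4().hex)
    headers = {
        TRACE_HEADER: trace_id,         # unchanged across the whole fan-out
        SPAN_HEADER: uuid.uuid4().hex,  # fresh span for this hop
    }
    if SPAN_HEADER in incoming:
        headers[PARENT_HEADER] = incoming[SPAN_HEADER]
    return headers
```

Because every hop forwards the same trace id, a collector can later stitch all the spans of one request back into that single fan-out graph.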

00:20:25.646 --> 00:20:31.866 align:middle
You can see the network effects,
the cascade that happens

00:20:32.196 --> 00:20:36.226 align:middle
when you have one service fail,
how that affects other services.

00:20:36.716 --> 00:20:42.056 align:middle
Zipkin took us a very long time to get up
and running and we paid a significant price.

00:20:42.056 --> 00:20:44.546 align:middle
A lot of things would have been a lot easier

00:20:44.546 --> 00:20:46.816 align:middle
to debug had we gotten
that up and running sooner.

00:20:48.516 --> 00:20:52.676 align:middle
The final piece on the infrastructure
side, we implemented something

00:20:52.676 --> 00:20:54.626 align:middle
that we called the service.json file.

00:20:55.676 --> 00:20:59.226 align:middle
This basically lived inside
of every single service's repo.

00:21:00.126 --> 00:21:03.776 align:middle
It described the name of the service,
which would turn into a DNS name.

00:21:04.406 --> 00:21:10.996 align:middle
It would describe what other services this
one required to run, what database it needed,

00:21:11.246 --> 00:21:16.156 align:middle
and so we could actually spin up
databases, run migrations against them,

00:21:16.156 --> 00:21:19.986 align:middle
configure the credentials
100 percent automatically.

00:21:21.016 --> 00:21:27.516 align:middle
Same thing with health checks and the APIs that
that service exposed, as well as all the things

00:21:27.516 --> 00:21:30.076 align:middle
that Mesos Marathon needed to run it properly.
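
A service.json along the lines just described might have looked roughly like this. The field names are guesses reconstructed from what the talk says the file covered (name, dependencies, database, health check, Marathon settings), not the real schema; the sketch is in Python so the DNS derivation is runnable.

```python
# Hypothetical reconstruction of a service.json; every field name here
# is an assumption based on what the talk says the file described.
SERVICE = {
    "name": "content-service",  # turned into the service's DNS name
    "depends_on": ["auth-service", "user-service"],
    "database": {"engine": "postgres", "migrations": "db/migrations"},
    "health_check": {"path": "/health", "interval_seconds": 10},
    "marathon": {"cpus": 0.5, "mem_mb": 512, "instances": 3},
}

def internal_dns(service: dict, domain: str = "services.internal") -> str:
    """Derive the internal DNS name other services use to reach this one."""
    return f"{service['name']}.{domain}"
```

With one such file per repo, the pipeline has everything it needs: what to provision, what to wire up, and how to health-check the result.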

00:21:30.726 --> 00:21:35.936 align:middle
And so with one file to get a new service
into production, all we needed to do was go

00:21:35.936 --> 00:21:38.626 align:middle
into CircleCI and say, build this project.

00:21:38.626 --> 00:21:42.706 align:middle
And it would automatically deploy everything
into production when you merged into master.

00:21:44.156 --> 00:21:45.726 align:middle
Worked actually really quite well.

00:21:46.926 --> 00:21:50.336 align:middle
One thing I'd point out: does anything
that I just talked about look familiar?

00:21:50.906 --> 00:21:53.856 align:middle
This is basically Kubernetes.

00:21:54.376 --> 00:21:57.396 align:middle
That's what we wound up reinventing
and looking back on it,

00:21:57.396 --> 00:22:01.166 align:middle
it feels a lot like we reinvented
a lot of the wheel.

00:22:01.546 --> 00:22:06.936 align:middle
But the reality was we were just a little
bit too far ahead of where things were coming

00:22:06.936 --> 00:22:08.296 align:middle
and I don't mean that in a good way.

00:22:08.826 --> 00:22:11.716 align:middle
But we wound up reinventing a
lot that was ultimately coming

00:22:11.716 --> 00:22:13.866 align:middle
down the pike for major open source projects.

00:22:14.276 --> 00:22:16.286 align:middle
And today just use Kubernetes.

00:22:16.286 --> 00:22:19.456 align:middle
You don't need to build the rest of it
and you can get the same exact benefits.

00:22:19.986 --> 00:22:25.096 align:middle
So moving along, that's how it worked
in production, at least in theory.

00:22:25.326 --> 00:22:28.566 align:middle
Local dev was a little bit of a different story.

00:22:29.786 --> 00:22:36.106 align:middle
So the initial intention and what we built
at first was a command line tool.

00:22:36.516 --> 00:22:41.476 align:middle
So you would check out a repository for a service
and you would run this command line tool in it.

00:22:41.646 --> 00:22:48.046 align:middle
And it would read that service.json and figure
out what you needed to run your service,

00:22:48.926 --> 00:22:52.776 align:middle
configure a docker compose file to
spin up all of the other services

00:22:53.126 --> 00:22:58.286 align:middle
and run all the migration files to get your
databases all set up and everything like that,

00:22:58.286 --> 00:23:01.836 align:middle
and get everything up and running
so that you had a local dev

00:23:02.316 --> 00:23:07.446 align:middle
that basically exactly mirrored
production, in theory.

00:23:08.756 --> 00:23:12.096 align:middle
The problem was nobody actually used it.

00:23:13.026 --> 00:23:16.266 align:middle
Every engineer ran their own
service natively on their machine.

00:23:17.656 --> 00:23:22.446 align:middle
And when they needed to run a dependency,
another service that they had to talk to,

00:23:23.026 --> 00:23:27.566 align:middle
they would either mock it themselves by
creating a little, you know, Node.js script

00:23:27.566 --> 00:23:31.526 align:middle
or a little PHP script to simulate
that endpoint, or they would talk

00:23:31.526 --> 00:23:34.876 align:middle
to another developer to get the other
service running on their machine.

00:23:36.306 --> 00:23:40.656 align:middle
And this wound up causing big problems because
the amount of times that they would update

00:23:40.656 --> 00:23:44.026 align:middle
that destination service was
not really that frequent.

00:23:44.316 --> 00:23:49.276 align:middle
And when they did, they weren't really in the
habit of running migrations, and so when we went

00:23:49.276 --> 00:23:53.036 align:middle
to integrate this in production, services
weren't used to talking to each other.

00:23:54.346 --> 00:23:55.746 align:middle
It became an absolute nightmare.

00:23:55.976 --> 00:23:57.306 align:middle
And so we stopped and we asked why.

00:23:58.066 --> 00:24:01.136 align:middle
Why weren't the developers using
this tool that was built and worked?

00:24:01.136 --> 00:24:02.486 align:middle
That was built and actually worked?

00:24:02.486 --> 00:24:05.076 align:middle
And the answer was actually pretty simple.

00:24:05.686 --> 00:24:06.816 align:middle
There's two layers to it.

00:24:07.306 --> 00:24:09.496 align:middle
The first is that the tool was slow.

00:24:09.526 --> 00:24:12.506 align:middle
It took, on average, because of the number
of dependencies that had to spin up,

00:24:12.976 --> 00:24:19.576 align:middle
and because it basically started from
a fresh slate every single time you booted,

00:24:19.886 --> 00:24:21.486 align:middle
about 20 minutes to get up and running.

00:24:21.996 --> 00:24:24.396 align:middle
And you figure you do that once
a day, once every couple of days.

00:24:24.916 --> 00:24:27.376 align:middle
But that also means anytime
you want to reset the state

00:24:27.376 --> 00:24:29.036 align:middle
of the system, you have to wait 20 minutes.

00:24:29.696 --> 00:24:31.966 align:middle
That's kind of a ridiculous
amount of time to wait.

00:24:32.056 --> 00:24:35.866 align:middle
It was also really unreliable: about
half the time it wouldn't start.

00:24:36.276 --> 00:24:39.466 align:middle
But there was another key point.

00:24:40.816 --> 00:24:45.586 align:middle
The way we had built the team and
the way we had built this tool was

00:24:45.586 --> 00:24:48.596 align:middle
such that it was somebody else's problem.

00:24:49.236 --> 00:24:53.276 align:middle
There was somebody who was designated on
the team to build and maintain this tool.

00:24:53.936 --> 00:24:59.936 align:middle
Developers had agency over their service.json
file, but they had no agency over whether

00:24:59.936 --> 00:25:04.596 align:middle
or not their system ran in this tool,
because ultimately the tool didn't matter.

00:25:04.596 --> 00:25:08.346 align:middle
What mattered was prod and as long
as prod worked, nobody really cared.

00:25:08.346 --> 00:25:13.156 align:middle
And so that was a really, really
huge fault and failure on my side.

00:25:13.156 --> 00:25:17.276 align:middle
And one of the big things I learned
as a takeaway is you've got to,

00:25:17.276 --> 00:25:21.506 align:middle
especially in a services architecture, you
have to get the local environment right

00:25:21.506 --> 00:25:25.686 align:middle
and consistent because that will
be the difference between a system

00:25:25.686 --> 00:25:27.766 align:middle
that works and a system that doesn't work.

00:25:28.756 --> 00:25:32.596 align:middle
And speaking of that, I want to talk about
actually getting this thing off the ground.

00:25:33.166 --> 00:25:38.736 align:middle
So we spent about three weeks building the
first few services: authentication,

00:25:38.736 --> 00:25:41.366 align:middle
a user service, and a basic content service.

00:25:42.196 --> 00:25:46.186 align:middle
And we had all three of them running in local
dev and we went to put it into production.

00:25:46.186 --> 00:25:51.156 align:middle
And it took a month from that point before the
first successful login on the frontend happened.

00:25:51.626 --> 00:25:57.116 align:middle
A month, from working code to serving traffic.

00:25:57.876 --> 00:25:59.936 align:middle
That's just absolutely ludicrous.

00:26:00.646 --> 00:26:03.976 align:middle
And so we started to ask ourselves why; we
had retrospective after retrospective

00:26:03.976 --> 00:26:06.216 align:middle
and there was a whole bunch of reasons for it.

00:26:06.616 --> 00:26:10.826 align:middle
One of the key ones being the local
development environment was not stable.

00:26:10.826 --> 00:26:15.536 align:middle
So therefore getting into prod was actually
the first time these services ever talked

00:26:15.536 --> 00:26:16.076 align:middle
to each other.

00:26:16.076 --> 00:26:20.476 align:middle
And so the services were built in isolation.

00:26:21.056 --> 00:26:23.766 align:middle
It really caused a lot of pain.

00:26:23.766 --> 00:26:30.286 align:middle
Some of this was just normal growing pains,
you know, you're building your infrastructure

00:26:30.286 --> 00:26:34.696 align:middle
to kind of an ideal, you expect health
checks to behave exactly a certain way

00:26:35.096 --> 00:26:37.696 align:middle
and in practice there's a little
bit of fudge room for that.

00:26:37.696 --> 00:26:42.436 align:middle
And so there's always little kinks to iron
out, but it was really, really challenging.

00:26:43.006 --> 00:26:45.816 align:middle
A little more detail on a couple of these.

00:26:45.816 --> 00:26:48.226 align:middle
I'm not going to read these each independently.

00:26:48.306 --> 00:26:51.486 align:middle
I'll get to the coordination in a second.

00:26:51.546 --> 00:26:54.546 align:middle
But getting the local
and staging environments

00:26:54.546 --> 00:26:57.516 align:middle
into a known state was insanely challenging.

00:26:58.396 --> 00:27:02.036 align:middle
So what I mean by known state is,
let's say you have a bug in production.

00:27:02.496 --> 00:27:04.556 align:middle
How would you replicate that
in a local environment?

00:27:05.656 --> 00:27:06.856 align:middle
If you're dealing with a monolith,

00:27:07.566 --> 00:27:11.026 align:middle
maybe depending upon your security
requirements, you clone the database?

00:27:11.386 --> 00:27:15.096 align:middle
Or if you're actually doing
things well and are GDPR-compliant,

00:27:15.096 --> 00:27:18.406 align:middle
you actually have a system in
place to anonymize that data?

00:27:18.806 --> 00:27:22.526 align:middle
Or even better yet, you're
able to recreate it directly

00:27:22.526 --> 00:27:24.886 align:middle
on your local without having to copy any data.

00:27:26.056 --> 00:27:27.816 align:middle
That was literally impossible here.

00:27:28.236 --> 00:27:30.086 align:middle
Every service had its own database.

00:27:30.506 --> 00:27:35.296 align:middle
There was data in all sorts of places: in
RabbitMQ queues, in S3 buckets that the system used.

00:27:35.296 --> 00:27:42.236 align:middle
And so to get the system into a
known state was literally impossible.

00:27:42.606 --> 00:27:47.006 align:middle
The only way people could actually do it was
by going into the interface and clicking things

00:27:47.006 --> 00:27:50.666 align:middle
in the interface to create content and
to create assignments and all this stuff.

00:27:50.876 --> 00:27:55.226 align:middle
So both from a debugging standpoint,
but also from a build standpoint,

00:27:55.426 --> 00:27:57.516 align:middle
it was insanely, insanely challenging.

00:27:58.656 --> 00:28:05.156 align:middle
We had an idea for a tool to solve this
problem, which was basically to detail the state

00:28:05.156 --> 00:28:10.836 align:middle
in a YAML file and give it to a tool which
would then coordinate with all the services

00:28:11.156 --> 00:28:14.756 align:middle
to get everything into a known state.

00:28:14.756 --> 00:28:19.656 align:middle
It seemed like it would have worked to solve
the problem, but it was just a matter of time;

00:28:19.656 --> 00:28:23.666 align:middle
we didn't actually get a chance to finish it.
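
The unfinished tool might have worked something like this sketch: flatten a declarative desired-state spec into per-service seeding calls, in dependency order. The spec shape and the idea of a per-service seeding endpoint are both hypothetical reconstructions; the real tool was never completed.

```python
def seeding_plan(state: dict, order: list) -> list:
    """Flatten a desired-state spec into (service, entity, payload) calls.

    `order` lists services so dependencies are seeded before the services
    that reference them (users before the lessons that belong to them).
    """
    calls = []
    for service in order:
        for entity, rows in state.get(service, {}).items():
            for row in rows:
                # The real tool would POST each row to that service's
                # (hypothetical) seeding endpoint instead of collecting it.
                calls.append((service, entity, row))
    return calls

# A desired known state, as it might have been written in that YAML file.
STATE = {
    "user-service": {"users": [{"id": 1, "name": "test-learner"}]},
    "content-service": {"lessons": [{"id": 10, "owner": 1}]},
}

plan = seeding_plan(STATE, ["user-service", "content-service"])
```

The point is that the known state becomes data you can check in and replay, instead of clicks in an interface.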

00:28:23.666 --> 00:28:26.866 align:middle
The high-level coordination is
a really interesting challenge.

00:28:28.296 --> 00:28:29.496 align:middle
How do you deal with change?

00:28:30.426 --> 00:28:34.246 align:middle
So change is always a natural part
of software development cycles

00:28:34.246 --> 00:28:35.686 align:middle
and of getting things into production.

00:28:36.696 --> 00:28:40.696 align:middle
You know, we're never going to get it
right the first time, whether it's feedback

00:28:40.696 --> 00:28:45.926 align:middle
from clients telling you that you built the wrong
solution, whether it's an executive stakeholder

00:28:45.926 --> 00:28:48.566 align:middle
that walks in and goes, I
demand that you add this feature

00:28:48.566 --> 00:28:50.966 align:middle
because I'm an executive and I demand things.

00:28:51.946 --> 00:28:55.486 align:middle
Or maybe it's because we actually literally
screwed up the first time that we did it

00:28:55.486 --> 00:28:58.746 align:middle
and we misunderstood the problem
from an engineering standpoint.

00:28:58.746 --> 00:28:59.666 align:middle
We built the wrong thing.

00:29:00.256 --> 00:29:03.906 align:middle
So change is a natural part and,
you know, it's going to happen.

00:29:03.906 --> 00:29:08.106 align:middle
Here's an example of a real
change that we faced.

00:29:08.546 --> 00:29:13.446 align:middle
This is a rough abstraction of one
of the hierarchies in our system

00:29:13.966 --> 00:29:18.466 align:middle
where a program has many
topics, a topic has many lessons,

00:29:18.466 --> 00:29:21.616 align:middle
a lesson has many cards,
and a card has many assets.

00:29:21.956 --> 00:29:23.556 align:middle
Don't worry too much about
what these things are.

00:29:23.556 --> 00:29:25.446 align:middle
Just think of them as entities.

00:29:26.116 --> 00:29:29.756 align:middle
What happens when the business comes
in and says, I need a new layer?

00:29:30.356 --> 00:29:34.716 align:middle
In a normal monolithic application,
this would be trivial.

00:29:35.526 --> 00:29:37.426 align:middle
You would add a new database table, right?

00:29:37.426 --> 00:29:41.936 align:middle
Maybe a little bit of a migration and 90
percent of the time things will just work

00:29:42.386 --> 00:29:44.676 align:middle
and maybe you spend a couple
more days cleaning stuff up

00:29:44.676 --> 00:29:46.646 align:middle
and adding the UI elements and stuff like that.

00:29:47.376 --> 00:29:52.026 align:middle
It took a month to do something
like this because, remember,

00:29:52.276 --> 00:29:54.156 align:middle
everything is a separate service.

00:29:54.726 --> 00:29:56.576 align:middle
How would you make a change to this hierarchy?

00:29:56.576 --> 00:30:02.926 align:middle
Well, it turns out, you first
have to make the change

00:30:02.926 --> 00:30:05.656 align:middle
in anything that depends upon that service.

00:30:06.226 --> 00:30:10.356 align:middle
And you have to make the change such that
it can accept either the old or the new way.

00:30:11.506 --> 00:30:13.396 align:middle
And then you need to add that new service.

00:30:13.816 --> 00:30:17.996 align:middle
You need to run all the migrations to get all
the data into that new service and then you have

00:30:17.996 --> 00:30:22.376 align:middle
to change all of the other services
again to get rid of that duplication,

00:30:23.286 --> 00:30:27.656 align:middle
to get rid of accepting the old
way of doing things.

00:30:28.116 --> 00:30:34.506 align:middle
And so basically what would have otherwise
been a very simple refactor turned

00:30:34.506 --> 00:30:36.996 align:middle
into massive coordinated surgery.
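
The expand phase of that surgery, where every consumer has to accept both the old and the new shape at once, tends to look like this tolerant-reader sketch. The field names are invented for illustration:

```python
def lesson_parent(payload: dict) -> str:
    """Accept both payload shapes during the migration.

    Old hierarchy: a lesson points straight at a topic.
    New hierarchy: a new layer (call it a "unit") sits in between.
    This "accept either" branch lives in every consuming service for the
    whole expand phase, and is deleted again in the contract phase.
    """
    if "unit_id" in payload:               # new shape wins when present
        return f"unit:{payload['unit_id']}"
    return f"topic:{payload['topic_id']}"  # fall back to the old shape
```

Multiply that branch, plus the data migration behind it, across every service in the chain, and the month-long timeline starts to make sense.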

00:30:37.796 --> 00:30:41.196 align:middle
Typically these types of refactorings
required half to three quarters

00:30:41.196 --> 00:30:44.256 align:middle
of the team working for weeks at a time.

00:30:44.406 --> 00:30:49.016 align:middle
And these are things that any application goes
through, like these are not complex things.

00:30:49.796 --> 00:30:56.896 align:middle
And so what we should have done, in retrospect,
what I would have rather done is this:

00:30:58.186 --> 00:31:05.706 align:middle
take our domain model and group it by bounded
context, group it by the things that are similar

00:31:05.706 --> 00:31:10.466 align:middle
that need to know about each other and put
these other services out in other areas.

00:31:11.596 --> 00:31:14.966 align:middle
The asset service is its own thing
here, it could be part of lessons,

00:31:15.516 --> 00:31:18.306 align:middle
but considering video transcoding
and stuff like that,

00:31:18.306 --> 00:31:20.546 align:middle
there's a lot of things that
are unique just to assets.

00:31:20.546 --> 00:31:23.166 align:middle
And this is ultimately what we wound up doing.

00:31:23.166 --> 00:31:26.296 align:middle
We did wind up throwing away three
quarters of those other services

00:31:26.626 --> 00:31:29.026 align:middle
and building what some people
would call microliths.

00:31:29.116 --> 00:31:33.826 align:middle
So I want to wrap up a little
bit here with some lessons

00:31:33.826 --> 00:31:35.706 align:middle
that we learned and some key takeaways here.

00:31:39.126 --> 00:31:41.456 align:middle
First thing that I would recommend
is don't do microservices.

00:31:42.426 --> 00:31:47.236 align:middle
Now this is a little bit of a joke because
you can very clearly see what we did

00:31:47.236 --> 00:31:49.886 align:middle
at least initially was not really microservices.

00:31:49.886 --> 00:31:51.066 align:middle
It was a distributed model.

00:31:51.566 --> 00:31:58.686 align:middle
But I would strongly, strongly say, do not
build a non-monolith unless you can invest

00:31:59.206 --> 00:32:02.876 align:middle
in operations, unless you can
invest in automation and tooling

00:32:03.266 --> 00:32:05.996 align:middle
and have every single developer
own and be responsible

00:32:05.996 --> 00:32:09.276 align:middle
for that tooling; otherwise it's not gonna work out well.

00:32:09.406 --> 00:32:11.886 align:middle
Or at least, it didn't work out well for us.

00:32:12.926 --> 00:32:17.446 align:middle
Start with big services, especially
when you don't understand the problem.

00:32:18.176 --> 00:32:21.936 align:middle
You may think that you understand the problem,
but unless you have solved that problem

00:32:21.936 --> 00:32:26.056 align:middle
in production, start with a big
service because it's far, far,

00:32:26.176 --> 00:32:31.886 align:middle
far easier to take something that's
big and complete and break it apart,

00:32:31.886 --> 00:32:34.816 align:middle
because you
know where your challenges are.

00:32:35.126 --> 00:32:37.676 align:middle
You understand your domain,
you understand the problems.

00:32:38.286 --> 00:32:43.166 align:middle
But to try to stitch two services together,
especially when you have dependencies

00:32:43.166 --> 00:32:46.536 align:middle
in other parts of the application,
becomes insanely challenging.

00:32:48.536 --> 00:32:50.786 align:middle
Automate absolutely everything.

00:32:51.226 --> 00:32:52.836 align:middle
And I don't just mean deploys.

00:32:53.376 --> 00:32:55.626 align:middle
I'm talking about your infrastructure changes.

00:32:55.626 --> 00:32:59.036 align:middle
If you're not using Terraform, use Terraform,

00:32:59.546 --> 00:33:02.896 align:middle
etc. Like make sure the backups
are 100 percent automated.

00:33:02.896 --> 00:33:07.966 align:middle
Make sure that how you get the application
into a known state is automated.

00:33:08.586 --> 00:33:16.096 align:middle
Automation and autonomous systems are absolutely
critical, especially as complexity increases.

00:33:16.576 --> 00:33:20.876 align:middle
This is, I think, the biggest
lesson that I learned.

00:33:21.106 --> 00:33:25.746 align:middle
This applies anytime we're
dealing with distributed systems,

00:33:26.266 --> 00:33:29.536 align:middle
but a lot more so to applications in general.

00:33:30.286 --> 00:33:34.326 align:middle
Normally the way we write code
is we write the happy path first

00:33:34.326 --> 00:33:37.076 align:middle
and then we test the sad paths
and we write the sad paths.

00:33:37.466 --> 00:33:42.876 align:middle
And even TDD with the red green
cycle is designed to kind of do this.

00:33:43.166 --> 00:33:47.236 align:middle
You don't write the failing test for your
failure case first, you write your failing test

00:33:47.386 --> 00:33:49.106 align:middle
for what the business problem is.

00:33:49.656 --> 00:33:51.976 align:middle
Then that fails and then you
write the code to solve that.

00:33:52.876 --> 00:33:54.746 align:middle
Think failure first.

00:33:55.306 --> 00:33:58.616 align:middle
Start off with: what happens
if this thing breaks?

00:33:58.996 --> 00:34:03.946 align:middle
And write the code to solve that
first, because things will break.

00:34:03.946 --> 00:34:08.746 align:middle
Your database will go down, you will have
corruption, you will have a network failure.

00:34:09.126 --> 00:34:13.276 align:middle
No matter whether you're building a monolith
or a distributed system, things will go wrong.

00:34:13.276 --> 00:34:18.076 align:middle
And changing your thinking to start
thinking about what's going wrong,

00:34:18.076 --> 00:34:22.326 align:middle
what's going to happen, what's failing
and how will I gracefully handle that,

00:34:22.896 --> 00:34:28.326 align:middle
becomes absolutely critical to maintaining
a scalable and highly available system.
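
In code, thinking failure first means deciding what a failure returns before writing the happy path. A minimal sketch with invented names, not a prescription:

```python
def fetch_recommendations(fetch, fallback=("most-popular",)):
    """Call a flaky downstream dependency with the failure case decided
    up front: a canned degraded answer instead of a user-facing error.

    `fetch` stands in for whatever talks to the downstream service.
    """
    try:
        result = fetch()
    except Exception:          # network failure, timeout, bad deploy...
        return list(fallback)  # degrade gracefully rather than break
    return result or list(fallback)  # treat an empty answer the same way
```

The fallback branch is written, and tested, before the downstream call ever succeeds.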

00:34:28.856 --> 00:34:32.556 align:middle
And then finally, this is
I think one of the things

00:34:32.556 --> 00:34:37.936 align:middle
that we thought we understood
in the beginning, but we didn't.

00:34:37.936 --> 00:34:40.976 align:middle
SLOs are a concept that came
out of Google's SRE book:

00:34:41.166 --> 00:34:44.196 align:middle
service-level objectives.

00:34:44.706 --> 00:34:48.596 align:middle
Basically, what does the business
care about for this service?

00:34:49.396 --> 00:34:52.136 align:middle
We measured everything in our services.

00:34:52.136 --> 00:34:53.436 align:middle
We measured requests per second.

00:34:53.436 --> 00:34:54.786 align:middle
We measured memory usage.

00:34:55.126 --> 00:34:59.406 align:middle
We looked at number of database
queries per call.

00:34:59.406 --> 00:35:02.906 align:middle
Like any technical metric you can
think of, we were probably tracking it.

00:35:03.406 --> 00:35:07.596 align:middle
But we weren't really tracking what
the business actually cared about.

00:35:08.166 --> 00:35:09.016 align:middle
Yeah, we thought we were.

00:35:09.016 --> 00:35:12.226 align:middle
We thought it was, you know, number of
pieces of content accessed per second,

00:35:12.666 --> 00:35:14.886 align:middle
but that's really not what
a business cares about.

00:35:15.046 --> 00:35:17.656 align:middle
The business cares: was a lesson consumed?

00:35:18.706 --> 00:35:19.866 align:middle
Did the learning happen?

00:35:19.866 --> 00:35:21.056 align:middle
It was an e-learning platform.

00:35:21.596 --> 00:35:28.136 align:middle
Did a service respond in a timely
fashion? And not a timely fashion

00:35:28.136 --> 00:35:32.676 align:middle
by a technical definition, but a timely fashion
by what the actual end user cares about.

00:35:33.346 --> 00:35:37.326 align:middle
And so by defining these SLOs
before you build your service,

00:35:37.626 --> 00:35:40.236 align:middle
you actually have a metric that
you can test your service against.
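
An SLO phrased in user-facing terms is something you can check mechanically. A sketch with made-up numbers, say "lessons load within 300 ms, 99 percent of the time":

```python
def slo_met(latencies_ms, threshold_ms=300.0, objective=0.99):
    """True if enough requests beat the latency threshold to meet the
    objective: the user-facing promise, not a CPU or memory graph."""
    if not latencies_ms:
        return True  # no traffic means nothing violated the promise
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms) >= objective
```

The threshold and objective here are illustrative; the point is that the metric is defined before the service exists, so the service can be tested against it.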

00:35:43.706 --> 00:35:47.666 align:middle
The biggest takeaway that I have
from this entire experience,

00:35:47.666 --> 00:35:51.766 align:middle
and I've talked about this a little bit on
Twitter and I know there's different opinions

00:35:51.766 --> 00:35:58.956 align:middle
on this, but I've learned that complexity is
the number one enemy in software development.

00:35:59.336 --> 00:36:02.426 align:middle
Everything that we do: when
you look at refactoring,

00:36:02.426 --> 00:36:08.166 align:middle
when you look at object-oriented
design, are ways of managing complexity.

00:36:08.626 --> 00:36:13.626 align:middle
And quite often what we do is we take
complexity from one part of the system

00:36:13.896 --> 00:36:15.596 align:middle
and we spread it out into the rest.

00:36:16.186 --> 00:36:18.066 align:middle
This file is easy for me to read.

00:36:18.066 --> 00:36:20.986 align:middle
This code in this method is easy for me to read.

00:36:20.986 --> 00:36:22.496 align:middle
Therefore, it's simple.

00:36:22.496 --> 00:36:23.966 align:middle
Therefore my application is simple.

00:36:24.586 --> 00:36:26.846 align:middle
When in reality you didn't simplify anything.

00:36:27.156 --> 00:36:29.626 align:middle
You just moved the complexity
from one place to another.

00:36:30.566 --> 00:36:36.026 align:middle
And so managing complexity becomes
the absolute most important thing

00:36:36.426 --> 00:36:37.986 align:middle
in building a reliable system.

00:36:38.386 --> 00:36:43.936 align:middle
The way I would phrase it is this, it
is insanely simple and insanely easy

00:36:43.936 --> 00:36:47.646 align:middle
to create a system that is so
complicated that you cannot understand it.

00:36:48.306 --> 00:36:51.256 align:middle
And if you can't understand
it, how can you run it?

00:36:51.566 --> 00:36:56.916 align:middle
This is basically the story
of the system that we built

00:36:56.916 --> 00:36:59.236 align:middle
and where it failed, where it broke down.

00:36:59.236 --> 00:37:02.826 align:middle
And so I want to say thank you
and I think I have a minute

00:37:02.826 --> 00:37:16.606 align:middle
or two for questions if anyone has any.

00:37:17.256 --> 00:37:19.016 align:middle
Come up to the podium.

00:37:19.866 --> 00:37:28.296 align:middle
Nope. I'll throw the cubes.

00:37:28.296 --> 00:37:30.286 align:middle
Ok, to anybody?

00:37:30.736 --> 00:37:31.426 align:middle
Or to questions?

00:37:32.196 --> 00:37:38.176 align:middle
Okay. Here you go.

00:37:38.676 --> 00:37:42.936 align:middle
Thank you for the awesome presentation.

00:37:44.346 --> 00:37:50.886 align:middle
Oh, okay. Who wants the cube?

00:37:50.886 --> 00:37:52.076 align:middle
Who wants a mic?

00:37:52.776 --> 00:37:55.956 align:middle
Up here. Can you hear me?

00:37:57.636 --> 00:38:06.296 align:middle
So, uh, I wanted to ask you a question like:
normally when you're trying to split the system

00:38:06.296 --> 00:38:11.266 align:middle
into smaller services, you often end up
splitting the complex logic into complex....

00:38:11.266 --> 00:38:11.896 align:middle
You can't hear me?

00:38:11.986 --> 00:38:12.516 align:middle
Can you hear me now?

00:38:12.856 --> 00:38:15.406 align:middle
Yeah, it's, I can't really hear
you over the background noise.

00:38:16.106 --> 00:38:17.756 align:middle
Uh, so I'm going to try my best.

00:38:17.866 --> 00:38:18.666 align:middle
Yeah, that's better.

00:38:18.826 --> 00:38:23.096 align:middle
Okay. So when you're trying to split the large
system into smaller components or microservices,

00:38:23.166 --> 00:38:25.916 align:middle
you often end up making the service simpler

00:38:25.916 --> 00:38:28.426 align:middle
but the communication harder
and more complex as you said.

00:38:28.466 --> 00:38:32.126 align:middle
So how would you tackle such a problem?

00:38:32.196 --> 00:38:36.936 align:middle
Would it be, like, making a service more
responsible for communication and data adaptation,

00:38:36.936 --> 00:38:39.346 align:middle
and then forwarding data and formatting it?

00:38:39.346 --> 00:38:42.456 align:middle
Or how, how would you advise
people to tackle that?

00:38:42.456 --> 00:38:46.976 align:middle
So I think that's where I'd go back to earlier,
where I said failure is the key, and thinking

00:38:46.976 --> 00:38:49.996 align:middle
about those failure modes when
you're splitting that apart.

00:38:50.906 --> 00:38:54.136 align:middle
Think beyond the happy paths:
if that service is offline

00:38:54.456 --> 00:38:59.516 align:middle
or if those data communication
channels fail, how would that behave?

00:38:59.516 --> 00:39:03.396 align:middle
And try to work out the system such
that they actually do behave correctly,

00:39:03.576 --> 00:39:05.136 align:middle
when those things happen.

00:39:05.136 --> 00:39:06.116 align:middle
Or that they fail gracefully.

00:39:06.376 --> 00:39:08.816 align:middle
Unless I misunderstood the
question, it's kind of hard.

00:39:08.816 --> 00:39:12.806 align:middle
Come up after, we'll chat afterwards about it.

00:39:13.776 --> 00:39:19.716 align:middle
Alright, thank you.

