Thanks for the introduction.
Like I said my name is Karun Japhet, I'm a developer at ThoughtWorks.
That's the short version of it.
Okay. Oops.
Maybe I should turn it on first. Okay.
Alternate between...
Hello? Is it okay?
It's okay, I can start. Oh, I think we're back, cool.
So three years ago we started working with this financial client
who wanted to scale their business up.
They had a fair share of their market with the couple of products
that they had launched a few years ago
and, you know, they wanted to essentially grow their market share.
They wanted to significantly increase the number of customers they had, and they had,
you know, projected the growth target that they wanted to meet in a couple of years.
After an evaluation, they realized that the space they were in did not provide enough growth opportunity,
so they decided to add new lines of product.
Now this is not something they had done in a really, really long time.
So what they ended up doing was they called an organization in
to try to evaluate their business processes and their technology infrastructure.
When they went through this activity, they realized that their business processes could scale;
they had the ability to launch the new products and support them from a business perspective.
However, their technology infrastructure, their hardware as well as their software
could not scale to take the new, you know, inflow of customers
and essentially their systems would start breaking.
As a result, they decided that,
you know, they had to launch a new platform, one with,
you know, a more modern platform which allowed scalability,
which over the next decade could help them launch all the products that they wanted to
and achieve the growth target that they had set for themselves.
Now this kind of sounds like a lot of the projects that we might have worked on, right?
Quick show of hands, how many of you have worked on projects
where the key reason you were trying to do something was really because,
you know, your software could not scale or had performance issues
or the legacy codebase just took too long to refactor?
Right, I mean, decent population here, and you guys share this experience as well.
So it does not sound like a new experience, however,
like the reason why this particular project really stood out for me
was purely the number of integrations that we had to do, right?
The ecosystem that we had to work with, not just inside this financial institution,
but the organizations we had to communicate with to make sure that the products went live.
So to make sure the first product actually was released to market,
we had to integrate with over 27 third party systems.
To bring on the existing products, the integrations would double.
So you were really looking at,
you know, over 75 integrations
within the first three years of this new platform going live.
Which meant integration had to be planned
to be at the heart of this development effort, otherwise this plan for,
you know, building this new platform would not work.
Now this along with the 370 services in production
meant that you were looking at an absolutely massive system in production.
One which worked at scale was available all the time and took a lot of traffic.
Now the communication interfaces for speaking to these
74, 75 third party systems
included some standard ones like plain HTTP and REST,
but also JMS, file transfer, and even a few mainframes.
Now most of us don't deal with mainframes all the time,
but this is a financial system,
so they have to continue to integrate with those legacy systems.
During this experience of integrating with all these downstream systems,
we realized that not all of them play nice with us, right?
They react in weird ways,
and we have to protect ourselves to make sure that our service quality does not go down
because that's the main reason why we are building this platform,
and we can't have, you know, outages or downtime
because a third party system is not available.
So today we're going to be speaking about some of the challenges that our team saw
and the way we actually tackled them,
to make sure that we could achieve the target that was set in front of us.
Before we do that, what we're gonna do is
look at a high level view of what their existing software patterns look like.
We'll do this by taking a simplified example of their setup.
So let's take an example of an order management service,
where you have a UI, where you place an order.
You'd have this UI speak to say an order service...
Which accepts the order and tries to confirm if inventory is actually available.
So the order service goes to the inventory service to check for available inventory
which has to go to its database to make sure that the information is available.
Let's say that the information...
Let's say you have enough items to fulfill this order.
So the inventory service says, "Yup, you can go ahead, everything looks great."
The call comes back and the order service records the fact
that there is an order and then you return, you know, the call to the UI.
Now at some point if you keep getting orders for the same thing again and again,
your inventory will run low,
so your inventory service probably has to speak to,
you know, a vendor to make sure that you replenish your inventory,
that you restock yourself, right?
So far this looks like a typical three-tier stack:
the database holds the current state,
and services communicate with different vendors.
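To make that flow concrete, here's a minimal Java sketch of the order flow just described. The class names and the in-memory stock map are illustrative, not the client's actual code:

```java
// Minimal sketch of the order flow described above: the order service asks the
// inventory service whether the item is in stock before recording the order.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InventoryService {
    private final Map<String, Integer> stock = new ConcurrentHashMap<>(Map.of("cheesecake", 5));

    boolean isAvailable(String itemId, int quantity) {
        return stock.getOrDefault(itemId, 0) >= quantity; // "Yup, you can go ahead"
    }
}

class OrderService {
    private final InventoryService inventory;

    OrderService(InventoryService inventory) { this.inventory = inventory; }

    String placeOrder(String itemId, int quantity) {
        if (!inventory.isAvailable(itemId, quantity)) {
            return "REJECTED: not enough stock for " + itemId;
        }
        // record the order (persisted to a database in the real system)
        return "CONFIRMED: " + quantity + " x " + itemId;
    }
}

public class OrderFlowDemo {
    public static void main(String[] args) {
        OrderService orders = new OrderService(new InventoryService());
        System.out.println(orders.placeOrder("cheesecake", 2)); // CONFIRMED
        System.out.println(orders.placeOrder("cheesecake", 9)); // REJECTED
    }
}
```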
Now if I were to draw a parallel and kind of use a metaphor for this system, you would have,
you know, the inventory service could be represented by Spock
and Kirk can represent the order management service on the right.
Now if you do not know Star Trek, don't worry,
this is not a Star Trek talk,
as much as I'd like to do that, this isn't one of those.
I'm just trying to draw a metaphor,
the colors behind the services kind of give you an idea as to what it is.
It's just friendly faces instead of going through just system names, right?
So if you stick with this metaphor,
these two characters are both from the Federation.
The communication that you see between them, the line between them
can be actually represented by one of the Federation-approved languages.
Let's take Vulcan as the Federation-approved language here.
So the conversation between Kirk and Spock is now in Vulcan.
Now if that's the case, when Spock has to speak to a third party,
that conversation will be done in a different language, a slightly different dialect.
If our vendor is actually Commander Kruge who is a Klingon,
then Spock is responsible for speaking to him in Klingon,
right, another language.
If you are a pretty large organization,
you'd probably have more than one vendor that you have to sort of replenish stock with.
So let's make the problem worse, let's have two vendors here,
the other one is now Commander Tomalak who is Romulan.
Now, I mean, as you can see,
Spock would have to know a third language, Romulan, to have this conversation.
This is where we see our first challenge in the way we design our services,
which is each of your services has to be multilingual.
Now if I map the services on the right hand side
to the languages that they need to speak:
when Spock is speaking with Kirk, the language used is Vulcan.
This is the language your services speak to each other in...
It's the language that your business speaks to you in,
the same kind of lingo that you use to actually represent your code,
it's your domain language.
Now if Spock has to speak with these other third parties...
Those languages can be termed as being external languages.
Each of these languages is actually represented in code,
so it's not just an abstract concept we're talking about.
We'll go into a few more details about this,
but what you'll see soon is that every time we talk about a language,
it is essentially the protocol that you use to communicate with that third party.
And right about now you have, you know, one of the languages
which is the domain language
which if you look at Eric Evans's Domain-Driven Design book,
it's called the ubiquitous language.
That's the language that you want to be writing your code in
and that's the language you want to model your domains in.
However, your system also has to learn these other languages
because it has to interact with vendors or vendor systems.
And it's almost as if you wish you had a universal translator from Star Trek
so that the conversation became easy.
You could continue to speak in Vulcan
and the machine would convert that into,
you know, translate that into the vendor-specific language.
Let's look at another challenge which is related,
shared responsibility in common integration models.
So, you know, when you set up communication between yourself and a vendor,
one of the first things you do is that you have a contract set up, right?
So that document essentially tells you what is the protocol for communication.
In this case, let's take an example of an item ID
which has to be passed whenever you send a request out from your system.
So whenever Spock has to send any kind of message which has an item ID,
let's say the rule states that the item ID, even though it's a number, has to be put in quotes.
So the request would look like this.
Now whenever you get the response,
let's assume the rule is that if the response has an item ID,
the vendor will make sure it's always five digits in length and in quotes.
Since the number one, two, three is shorter than five digits,
the vendor pads it with leading zeros, right?
Now this doesn't look like a real problem, but it's kind of a nuisance,
because every time Spock has to speak with any other system in the Federation,
essentially one of your own services,
it has to remove the quotes and use just the number one, two, three.
Whenever it gets a response, it has to remember that
the quotes have to be dropped and the padding zeros have to be dropped as well.
It looks like an arbitrary example,
but this is not far from many of the integrations that we have to do,
especially with legacy systems that use fixed-length formats, right?
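As a rough illustration of that kind of contract rule, a translation helper might look like this. The quoting and zero-padding rules are the made-up ones from the example above, not a real vendor specification:

```java
// Illustrative translation layer for the example contract above:
// outbound item IDs must be quoted; inbound IDs arrive quoted and
// zero-padded to five digits and must be converted back into plain numbers.
public class ItemIdTranslator {

    // Domain -> vendor: 123 becomes "123" (with quotes)
    static String toVendor(int itemId) {
        return "\"" + itemId + "\"";
    }

    // Vendor -> domain: "00123" becomes 123 (strip quotes, drop padding zeros)
    static int toDomain(String vendorItemId) {
        String digits = vendorItemId.replace("\"", "");
        return Integer.parseInt(digits); // parseInt ignores leading zeros
    }

    public static void main(String[] args) {
        System.out.println(toVendor(123));         // "123"
        System.out.println(toDomain("\"00123\"")); // 123
    }
}
```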
So...
To make matters worse,
when you have to do the same thing from multiple services
because one vendor kind of satisfies
more than one business requirement,
you're looking at the same logic now spreading to multiple systems.
And the concept of converting this into your domain language
now is spread into multiple domains.
So, I mean, the thing that you would really, really wish you had
was something to represent integration as a first class domain.
So why don't we have a system here which actually does that?
So this is where we introduce the protagonist of today's show,
Comms Officer Uhura.
She is responsible for making sure that the translation
between your domain language and your external languages
actually happens successfully.
If you draw a clear line in the sand, this is kind of like your network boundary:
anything above that is inside your company's network,
and anything below is outside the company's network.
So the way you design your code
when you are speaking with systems inside your organization
is based on the UL, the ubiquitous language,
and for anything outside,
you can choose to implement a vendor-specific language to speak.
So now that we've talked about, you know, some of the challenges that we faced,
let's go through a quick crash course of what the platform we set up looks like.
Let's look at what an event-driven domain service looks like on the inside,
and how event-driven systems actually work.
So if you have three, let's say three systems,
you have Kirk and Spock who you've already met.
The new guy is Bones I believe.
Yup, he is the medical officer.
So each of these represent a domain or a system or a piece of software in your system.
Now in a traditional SOA-based architecture,
you'd have them call each other
whenever they wanted to exchange information.
But when you are building an event-driven system,
this changes just a little bit.
So each of them will announce an event when something happens.
An event announces that something has happened in the past; it's a matter of fact.
For example, the customer's address changing is an event,
so a customer-address-changed event is how we normally name it.
And the reason this is important is that the domain
that owns the customer entity announces the fact that this change has happened.
No one else can really contest it; if you care about it, you will listen to it.
Now what happens is that this event actually is announced on the EventBus.
Now EventBus is kind of like a room much like this one
where each of the systems announce what has just happened
and there are other systems who will listen to it.
Now let's actually go through an example of what this looks like.
So if your first system announced that,
you know, an order was placed, this event is now spread,
you know, sent to the other domains.
Let's say it comes to Spock.
Spock cares about, you know, an order being placed and therefore listens to that event.
And when the same event goes to Kirk,
Kirk does not care about orders much like the character on the show.
And therefore he actually ends up ignoring the event, right?
So what you'll notice is that the EventBus is a channel for announcing change.
Each of the services will fire an event and forget about it,
there is no coupling between the services themselves.
None of the services know that someone else is going to react to a change.
I have done my job by announcing the fact that something happened,
if somebody reacts to it, that's fine, and if no one does, that's okay.
You also only pick up events that you subscribe to.
If you've not asked for an order-placed event, you will not get it, right?
So you have to subscribe to the events that you really, really, care about.
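Here's a toy sketch of that announce-and-forget idea. A real deployment would sit on a broker such as ActiveMQ; the class names here are illustrative:

```java
// A toy in-memory event bus: publishers fire events and forget about them,
// and subscribers only receive the event types they asked for.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    // Spock subscribes only to the event types he cares about.
    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>()).add(handler);
    }

    // The publisher fires the event and forgets; it never knows who reacted, if anyone.
    void publish(String eventType, String payload) {
        subscribers.getOrDefault(eventType, List.of()).forEach(h -> h.accept(payload));
    }
}

public class EventBusDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.subscribe("order-placed", payload -> System.out.println("Spock handles: " + payload));
        // Kirk never subscribed to order-placed, so he simply never sees it.
        bus.publish("order-placed", "{\"orderId\": 42}");
    }
}
```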
The other concept that we want to concentrate on today is
one of the styles of how we've modeled our events.
If you attended Martin Fowler's talk earlier,
he talked about some styles in which we design our events,
this is one of them: event-carried state transfer.
What you do here is that...
I mean the event name,
the customer changing their address kind of explains what happened.
There is a customer ID to explain which customer actually changed their address.
But if you notice in blue, there's a section
which actually has the address, the new address of the customer.
So what you've done is that this event
completely explains what the change is and where the change was applied.
The event is self-contained all in itself
and this style of event design is kind of important if you are doing event sourcing.
This is in contrast with something like event notification
where you would announce the fact that an event has happened
but not really tell the user what the actual change is.
You'd just provide a sort of link, an HTTP resource for example,
where the consumer can go and fetch the latest state of the object.
There is no right or wrong design.
The question is, you know, what suits your system better?
In our case, since we were building an event-sourced, event-driven system,
the event-carried state transfer model worked better for us.
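As a hedged illustration of the two styles, the payloads could be modeled roughly like this; the field names are mine, not from the talk:

```java
// Event-carried state transfer: the event carries the full new state.
record CustomerAddressChanged(String customerId, String newAddress) {}

// Event notification: the event only says "something changed" and points
// at a resource the consumer must call back to fetch the latest state.
record CustomerChangedNotification(String customerId, String resourceUrl) {}

public class EventStylesDemo {
    public static void main(String[] args) {
        var carried = new CustomerAddressChanged("C-101", "42 Baker Street, Springfield");
        var notified = new CustomerChangedNotification("C-101", "https://example.internal/customers/C-101");
        System.out.println(carried);  // self-contained: no callback needed
        System.out.println(notified); // consumer must fetch resourceUrl for details
    }
}
```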
If you want more details about this topic,
you can read through Martin Fowler's blog post "What do you mean by Event-Driven?".
Also, in the past couple of weeks,
Martin's talk at the GOTO conference was put up on YouTube.
It's a pretty good talk where he explains the entire thing
in 50 minutes, much like he did here.
So if you didn't attend the talk here,
you can go to YouTube and look up that video; it's a good watch.
Okay, coming back to, you know, our system design,
so now that we know what event-driven systems look like,
let's look at what our third party communication stack looks like.
So you have services on the top
which sort of announce changes as events on the EventBus.
You have vendors at the bottom.
And what you have in between is the gateway,
the service gateway, which listens to the events, figures out
which ones actually need to be sent to a third party, does the appropriate translation,
and then sends them off to the third party, right?
So if we go into sort of the details of how the service gateway looks on the inside,
and we have the EventBus on the left and the vendor on the right,
the gateway will be in the middle.
And the gateway has three major steps.
It has a reader, it has a transformer, and it has a writer.
These are the three basic steps of,
you know, making sure the message can be correctly converted and sent over.
This is sort of the model that we'll start off with
and over the next 10 or 15 minutes,
we will evolve this model to add more details into it.
But if you notice, you know, a reader is pretty generic,
the way you read messages off of a queue is pretty standard.
The transformer keeps changing every single time because,
you know, you have a different request
or a different incoming event and the outgoing message might be different.
So the way you transform is really custom logic every single time.
The writer really depends on how you are sending the message out.
If you are sending the message over HTTP, there will be one writer,
if you're sending it over JMS, there will be another one.
Since we are doing over 75 integrations,
we didn't want to keep rewriting this stack every single time;
we wanted to get some leverage for our entire development team and build a framework
which accelerated the pace at which we could do integrations.
So what we ended up doing was we built a set of standard readers and writers
based on where you want to read the message from and where you want to write it.
And for the transformer itself...
We added this layer on top called the adapter.
Now the adapter contains two sets of configurations.
One is the transformation rules.
The transformation rules will tell you
how to transform the message from say,
a customer address changed event to whatever it needs to be
when they send that message out to a third party.
The routing logic explains
where you read the message from and where it is supposed to go.
So details about how that third party vendor
actually accepts messages over JMS or HTTP,
and a bunch of other configurations
that they need to be able to correctly send that message out.
The service gateway here,
instead of having business logic about how to transform,
translate, and send messages,
is now more responsible for being a container.
A container, kind of like Tomcat,
where you can put in different adapters
and it would just know what to do: how to translate messages and send them.
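A minimal sketch of that reader/transformer/writer split, with the adapter as pure configuration, might look like this. The interfaces and names are illustrative, not the actual framework:

```java
// Sketch of the reader/transformer/writer split plus an adapter that bundles
// transformation rules with routing details.
import java.util.function.Function;

interface Reader  { String read(); }                      // e.g. pull the next event off a queue
interface Writer  { void write(String vendorMessage); }   // e.g. HTTP POST or JMS send
interface Transformer extends Function<String, String> {} // event -> vendor-specific payload

// The adapter is configuration: which transformation to apply and where to send the result.
record Adapter(Transformer transformationRules, Writer routing) {}

class ServiceGateway {
    private final Reader reader;
    private final Adapter adapter;

    ServiceGateway(Reader reader, Adapter adapter) {
        this.reader = reader;
        this.adapter = adapter;
    }

    void processOne() {
        String event = reader.read();
        String vendorMessage = adapter.transformationRules().apply(event);
        adapter.routing().write(vendorMessage);
    }
}

public class GatewayDemo {
    public static void main(String[] args) {
        Adapter adapter = new Adapter(
                event -> event.toUpperCase(),                  // stand-in transformation rule
                msg -> System.out.println("sending: " + msg)); // stand-in writer
        new ServiceGateway(() -> "inventory-low", adapter).processOne();
    }
}
```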
So let's move on to the next challenge
and see how this affects our system design.
The next one is vendor abstraction...
How we achieve vendor abstraction in our system.
So this is the stack that we've been working with so far.
And what if one of the services announces that the inventory is actually low?
The gateway realizes that since the inventory is low, it has to do something;
based on the configuration that we put in as an adapter,
it knows that it has to act on this information.
So the gateway decides to react.
One of the things the gateway needs to know is what the message looks like.
So let's take a sample event body that has an event ID.
And there's a couple of other things that we don't really need to focus on right now.
Now based on the configuration that we have inside the transformation rules,
you could say that an item ID of 403 needs to go to this specific vendor.
It's kind of like saying,
"If I want cheesecakes, then I go to a bakery."
And there's a preferred bakery which you configure.
But if you want let's say pizzas, you have a favorite pizza place that you go to,
to make sure that your stocks are maintained.
So this is kind of like that.
You're saying that an item with ID 403 comes from, you know, this specific vendor.
So you'll make sure that you transform that message and actually send it out to that vendor.
Let's actually look at what this means in code.
Like, how do you configure your code to be able to do that.
So at a high level,
since there are two vendors
and we have to make a decision about who gets which message,
we can have a service gateway where we put two different adapters.
Now on the first adapter,
we configure the fact that anything with an item ID
between, say, 200 and 300
will go to vendor 1.
So if adapter 1 picks up the message and actually transforms it,
it has been configured to send the message to vendor 1.
If the item ID is between 300 and 400,
the system has been configured to have adapter 2 process the message,
and as a result, the message will be sent to vendor 2.
This gives you the configurability to choose who gets which kinds of messages.
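As an illustrative sketch (not the real adapter configuration format), the range-based routing could be expressed like this:

```java
// Range-based routing between two adapters; the ranges and vendor names come
// from the example above, the code itself is illustrative.
import java.util.List;

record RoutingRule(int fromId, int toIdExclusive, String vendor) {
    boolean matches(int itemId) { return itemId >= fromId && itemId < toIdExclusive; }
}

public class AdapterRoutingDemo {
    static final List<RoutingRule> RULES = List.of(
            new RoutingRule(200, 300, "vendor-1"),  // adapter 1
            new RoutingRule(300, 400, "vendor-2")); // adapter 2

    static String routeFor(int itemId) {
        return RULES.stream()
                .filter(rule -> rule.matches(itemId))
                .map(RoutingRule::vendor)
                .findFirst()
                .orElse("no-adapter-configured");
    }

    public static void main(String[] args) {
        System.out.println(routeFor(250)); // vendor-1
        System.out.println(routeFor(350)); // vendor-2
    }
}
```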
The reason this is important for us is that now
your domains have been designed to announce business events
that the inventory was low.
Your domain does not care where the inventory actually comes from.
The gateway and the configuration that you have in there
decides where you send the messages.
So this means that your domain code can be truly abstracted.
It can be an actual business domain now.
And there's no leakage of non-domain specific logic in there.
The second interesting thing is that
you have now enabled your business to easily replace vendors.
So you are not vendor locked in as much anymore
because if during contract negotiations
you realize that a specific bakery is charging you a higher rate than the other,
and the business wants you to move away from it,
the only part of your system which has to change is the adapter.
And since the adapter is declarative in style and is basically configuration,
it means you have a quicker turnaround time to build another one, and say,
"Yup, if I want to move to another bakery tomorrow
even though their protocol for communication is different,
my turnaround time is now in days and not weeks, months
or even possibly longer than that."
Our client is essentially not locked in,
because we can change vendors at a very, very quick pace,
which allows our business to go into contract negotiations
and be a lot more comfortable with potentially changing vendors.
So far, we've been looking at one-way communication.
Like, we've just focused on requests going out,
which is not really a realistic representation of most of your systems, right?
Because you usually also want to work with responses.
So let's look at what synchronous as well as asynchronous communication here would look like.
So the simple stack that we had so far,
the reader, transformer, and writer together make up the request path.
And if it's synchronous communication, kind of like HTTP,
what you would do is, you know, send the message out over HTTP
and you'd get an immediate response.
When you get a response, what you do is that
you have another reader, transformer, and writer
to sort of take the message, immediately process the entire thing,
and this together we call the response path.
If you notice something, we have the same stack mirrored in a certain sense.
So the concept of a request and a response don't really mean anything for us anymore.
It's essentially a pipe of the way you process things.
And this conceptually remains true,
but there are a few things you'll notice later in the architecture
where we had to evolve it and the request and response paths do diverge.
But conceptually, you're doing the same things irrespective of whether it's a request or a response,
especially when it's a synchronous system.
So now if you move to an asynchronous system,
the main difference is that you will send the request,
and your thread is essentially done.
You don't want a blocking system.
So what we do is we send a message out and at some point the system responds.
Let's take JMS as an example.
The classical example of what you'd do when you have an asynchronous communication.
So when the request goes out,
we added a component called the correlation ID generator
which basically generates a random ID.
In our case we used a word; it doesn't matter whether it's a word or a number.
The vendor has been told that as part of the message, at a very specific place,
there will be a specific number.
They are responsible for taking that message and, when they send us the response,
building the packet with the exact same number.
That's why we can correlate the requests with the responses.
The correlation ID generator not only generates this random number, this correlation ID,
but also associates the kind of message we were sending out.
As a result, when we get the response and you look up that table for that ID,
we know what the request was and therefore, we know how to process the response.
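A minimal sketch of the correlation-ID idea, assuming an in-memory lookup table (the real system kept this in a database):

```java
// Generate a correlation ID, remember what kind of request it belonged to, and use
// the ID echoed back by the vendor to find the right way to process the response.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class CorrelationIdGenerator {
    // correlationId -> the kind of message we sent out
    private final Map<String, String> outstanding = new ConcurrentHashMap<>();

    String register(String requestType) {
        String correlationId = UUID.randomUUID().toString();
        outstanding.put(correlationId, requestType);
        return correlationId;
    }

    // Called when the vendor's response arrives with the same ID embedded in it.
    String lookup(String correlationId) {
        return outstanding.remove(correlationId); // null means we never sent this request
    }
}

public class CorrelationDemo {
    public static void main(String[] args) {
        CorrelationIdGenerator generator = new CorrelationIdGenerator();
        String id = generator.register("replenish-inventory");
        // ... request goes out over JMS with `id` at the agreed spot in the message ...
        // ... later, the asynchronous response comes back carrying the same id ...
        System.out.println("Response belongs to: " + generator.lookup(id)); // replenish-inventory
    }
}
```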
Okay, so whenever you start off a new project,
you define what the success criteria for a project is.
You know, usually this means a bunch of functional
as well as cross-functional requirements.
You go through the inception, you define the scope, everything looks great, you're really happy.
As with most projects,
as with a lot of projects you would deal with, at some point something would go wrong.
You know, either it is the fact that you are behind schedule
or the release date has to be moved up.
Or you are, you know, running short of budget so you have to speed things up.
A tough decision has to be made as to how you can deliver functional value,
or as much value as you can, in the new timeline that you have.
A lot of times, when reprioritization is to be done,
you know, the cross-functional requirements are the first ones that go out of the window.
Because, you know, as a technologist,
you're not completely happy with this decision but you understand.
You understand, until you actually go to production,
everything's working fine for a couple of months, it's a holiday weekend,
and you are the one who has to pick up that pager...
I think you know where this story is going to go soon, right?
You're sitting at the dinner table with your family and the pager goes, and your phone rings,
you barely have to look at your phone, you know exactly what's about to happen, right?
You know, that vendor that you integrated with six months ago just went down.
Holiday weekend, high loads, it just crashed.
And you are the guy who is spending the remainder of your holiday weekend on call.
This is not a situation I wanted to be in.
And no one else on that team wanted to be in it either.
So, you know, because as Murphy's Law states,
"Anything that can go wrong will probably go wrong."
So why not prepare for it?
So here are a few cross-functional requirements that we planned for,
things that we tried our best
to build into the system so that it's resilient.
The first one's kind of obvious.
What happens when you just can't speak with this critical vendor
which is very, very important to run your business, right?
So, taking the stack that we've been looking at,
let's look at what happens when the vendor is not available, right?
Now if you're doing HTTP based communication
and, you know, the vendors are unavailable, you get an immediate response.
The writer would try to send the message and the HTTP call fails, right?
So it's pretty obvious it's instant feedback.
If it's say, JMS, and the JMS channel itself is down,
as soon as you try to send the message,
the queue is not available, so it fails.
And you get immediate feedback.
If the vendor itself is unavailable, but the queue is available,
what ends up happening is the queue will start filling up.
You can put monitoring on top of it
and, you know, alert on the fact that your queue size is going up.
If it goes over a certain range
which you deem as being outside operational parameters, you'll probably get paged,
and you can call up the vendor and check with them what happened.
Right, let's focus on the first two types,
where you get immediate feedback
and errors are being thrown at an extremely rapid pace.
So if you're doing a couple of thousand messages a second,
you're looking at that many errors a second.
So when the vendor's down, the message goes all the way to the writer.
There's an error and an exception is thrown.
Since everything is on the same thread pool,
the exception goes all the way back to the EventBus,
and since the bus is a queue, it will just keep retrying.
All you have to do is set up a retry policy on the queue
so that the message keeps being sent.
So congratulations, you've fixed the problem.
If you have an outage on the vendor side
or on the communication medium,
your system just keeps retrying and everything works.
So let's actually try to visualize what this looks like.
Let's do that with an example.
You have six messages,
marked A to F being sent to three different systems, one, two, or three.
Just for laughs, let's say the first vendor is unavailable;
I marked all those messages red.
Vendors 2 and 3 are available, so they are marked as blue.
The expectation is that any message going to vendor 1
will keep retrying,
and anything which goes to 2 or 3 should work.
Let's say we have two threads
to read the message of the queue and process everything, right?
So they'll pick up the first two messages and try to process them.
Since the first vendor is not available, the first message will throw an error and keep retrying,
just like we talked about.
Since the second vendor is available, that message will go through.
Everything looks great, since the second message went through,
we'll remove it from the queue.
Let's move everything up one level.
Both messages at the head of the queue are now for the unavailable vendor,
so we'll put an error on both of them
and keep retrying until the vendor comes back up.
Now while this looks like, you know, the state that you want to be in,
if you look carefully, there is one problem in this system.
Messages 3 and 4 are marked in blue.
Those messages should have been delivered.
There is nothing wrong with that. That communication channel is completely open.
However, since there's two messages above that
and you have only two readers, you block your entire system.
Congratulations. You have cascading failure.
Drucker's Law which is interestingly a corollary of Murphy's Law
that I didn't know about states,
"If one thing goes wrong, usually everything else will, and at the same time."
So just by trying to fix one problem,
we have actually caused a worse problem
which is not only have we blocked the first vendor,
we've actually stopped sending messages to everything else
and your system's backing up like crazy now.
By the way, you could have replayed that problem with any number of threads and any number of messages,
the system will eventually always come to that state.
So it's mathematically always going to block.
The way we fixed this is by adding,
you know, splitting the route into smaller sections.
What we've done is that
when you're trying to send a message to vendor 1 there's a sub queue 1 up top,
and there is a sub queue 2 in the bottom which is just sending messages to vendor 2.
The reading and the processing of messages is common.
When you process the message,
you realize where it's supposed to go and that's why there is a fork.
The buffers that you see at the front of the sub queues are essentially,
you know, any queue based technology, you can use ActiveMQ or whatever else.
And what you use it for is buffering.
So let's look at an example of what happens
when a message is trying to go to an unavailable vendor.
It comes to the transformer, which figures out
that it needs to go to vendor 1, and it gets put onto that buffer queue.
You then go ahead and process the message all the way to the writer.
Since the vendor is unavailable, it throws an exception.
But this time, it will go back to the buffer.
Because that's the last queue you worked on.
You have a thread pool for the section up top.
There's a thread pool for the section at the bottom.
And there's a thread pool for, you know, just the reader writer.
So what you've done is whenever there's an exception here on the top,
it'll just retry in that smaller group.
Now if you're trying to send a message to an available vendor...
You will see that it just goes through the system, goes to buffer 2,
no problems, it will be delivered.
Now when you have extended downtime,
you know, the first message is already being retried.
But as more messages keep coming in,
they will keep getting queued behind the first message.
The first message continues to be retried,
but the second, third, fourth, the nth message all stay there.
If there's a message to an available vendor, you notice that it's going through now.
There's nothing blocking it.
And this is how you actually alleviated the problem
that you accidentally caused earlier
which was the cascading failure we talked about.
Now when the vendor becomes available again,
all the messages flow through without any system performance degradation.
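Here's a simplified in-memory sketch of the per-vendor buffer idea; the real system used broker-backed buffer queues (ActiveMQ or similar), so treat this as an illustration of the isolation, not the implementation:

```java
// Each vendor gets its own buffer and its own worker, so retries against a dead
// vendor never block messages headed to a healthy one.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class VendorBuffer {
    private final String vendorName;
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    private final boolean vendorAvailable;

    VendorBuffer(String vendorName, boolean vendorAvailable) {
        this.vendorName = vendorName;
        this.vendorAvailable = vendorAvailable;
        Thread worker = new Thread(this::drain, vendorName + "-writer");
        worker.setDaemon(true);
        worker.start();
    }

    void enqueue(String message) { buffer.add(message); }

    private void drain() {
        try {
            while (true) {
                String message = buffer.take();
                while (!send(message)) {
                    Thread.sleep(200); // retry only inside this vendor's lane
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private boolean send(String message) {
        if (!vendorAvailable) return false; // simulate an outage
        System.out.println(vendorName + " <- " + message);
        return true;
    }
}

public class BufferedGatewayDemo {
    public static void main(String[] args) throws InterruptedException {
        VendorBuffer vendor1 = new VendorBuffer("vendor-1", false); // down
        VendorBuffer vendor2 = new VendorBuffer("vendor-2", true);  // up

        vendor1.enqueue("A"); vendor1.enqueue("B"); // pile up and keep retrying
        vendor2.enqueue("C"); vendor2.enqueue("D"); // delivered immediately, nothing blocks them
        Thread.sleep(500);
    }
}
```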
Let's look at the next challenge.
Rate limiting of outbound messages.
Now...
Rate limiting seems like a pretty standard problem.
You know, you need to do this mainly for two reasons.
One, you've built this shiny new architecture where everything is cloud based.
If you're on AWS or something,
you probably have elastic scale,
so every time your traffic load goes up,
you can just keep scaling and take that traffic.
Just because you are scalable does not mean that your partners are, right?
So if you scale up, you will handle the traffic fine,
but when you start sending those increased messages to your partners,
if they can't scale at the same rate you can,
you will essentially, you know, have a denial of service attack on them.
This is probably not the best thing for your relationship,
you know, you'll probably piss them off.
So the nice thing to do really is to put a small component in there
which makes sure that you don't send them more than a certain number of requests.
This is something that you'd usually want to talk about during contract negotiations.
Either they agree that,
"Yup, we can scale up, we don't have a problem," in which case you don't need to worry about it,
or, if you're talking to a system
that doesn't have dynamic scalability,
put one of these in.
Usually, if you cause a denial of service
and end up bringing their systems down,
you actually lose more time.
You're better off sending fewer messages per second
over a longer period
than having a large spike of messages
and an outage for multiple hours after that.
Also, with some vendors,
we've actually seen that if you send them more than a certain number of messages a second,
they will end up blocking your account.
And good luck being on call during a holiday weekend,
trying to explain to them
that you want your account unblocked and it's affecting your business,
because they're barely available and it's going to take forever.
So make sure you put this software in place.
By the way, that story about account blocking, it's real, it does happen.
Thankfully, we never had to face that
but it's definitely part of the contract.
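A hedged sketch of what such a rate-limiting component could look like, using a tiny token bucket. The limit of five requests per second is made up, and a production system might use a library limiter instead:

```java
// Blocks outbound sends so we never exceed the rate agreed with the vendor.
public class TokenBucketRateLimiter {
    private final int permitsPerSecond;
    private double availableTokens;
    private long lastRefillNanos = System.nanoTime();

    TokenBucketRateLimiter(int permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
        this.availableTokens = permitsPerSecond;
    }

    // Blocks until the vendor's agreed request rate allows one more message out.
    synchronized void acquire() throws InterruptedException {
        refill();
        while (availableTokens < 1) {
            Thread.sleep(10);
            refill();
        }
        availableTokens -= 1;
    }

    private void refill() {
        long now = System.nanoTime();
        availableTokens = Math.min(permitsPerSecond,
                availableTokens + (now - lastRefillNanos) / 1_000_000_000.0 * permitsPerSecond);
        lastRefillNanos = now;
    }

    public static void main(String[] args) throws InterruptedException {
        TokenBucketRateLimiter limiter = new TokenBucketRateLimiter(5); // hypothetical contract: 5 req/s
        for (int i = 1; i <= 10; i++) {
            limiter.acquire();
            System.out.println("sent request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```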
Okay, moving on to the next problem.
You know, what happens when a vendor asks you to retry.
In HTTP terms it looks like, say, an HTTP 503
with a Retry-After header which specifies the time.
So it could say, you know, "I'm doing my deployment right now.
Why don't you come back after 10 minutes?"
So this is something we've seen as a pattern.
Sometimes a vendor returns a custom message type that says:
whenever you see this header, know that we are unavailable,
please come back after a certain amount of time.
Or sometimes there's no time specified, in which case you just keep retrying.
And in the system that we built,
the way we do that is that
when you get a response and you identify that it is one of those retry kind of messages,
you send the message from the response side back to the buffer.
The advantage of doing this
is that you don't have to process the message from the beginning.
If you transformed it the first time,
it will look the same way as you're doing it the second time,
the third time, or the nth time.
The data doesn't really change, right?
So if you convert an apple into a half eaten apple, it will always look the same.
I guess that's a bad example, but you know what I mean.
So the transformation always looks the same.
So what we are doing is that when we transform the message
in the correlation ID generator phase,
we actually put a copy of the original message
as well as the transformed message in that database.
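Putting that together, a rough sketch of the retry handling might look like this. The in-memory store and the Retry-After parsing are illustrative assumptions, not the actual gateway code:

```java
// On a 503 with Retry-After, re-queue the already-transformed message (kept in the
// correlation store) after the requested delay instead of re-processing from scratch.
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class RetryHandler {
    private final Map<String, String> transformedMessages = new ConcurrentHashMap<>(); // correlationId -> payload
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void rememberTransformed(String correlationId, String payload) {
        transformedMessages.put(correlationId, payload);
    }

    // Called for every vendor response.
    void onResponse(String correlationId, int statusCode, Optional<Long> retryAfterSeconds,
                    Consumer<String> buffer) {
        if (statusCode == 503) {
            String payload = transformedMessages.get(correlationId);
            long delay = retryAfterSeconds.orElse(0L); // no Retry-After header means retry immediately
            scheduler.schedule(() -> buffer.accept(payload), delay, TimeUnit.SECONDS);
            System.out.println("re-queued " + correlationId + " after " + delay + "s");
        } else {
            transformedMessages.remove(correlationId); // success: nothing left to retry
        }
    }

    void shutdown() { scheduler.shutdown(); }
}

public class RetryDemo {
    public static void main(String[] args) throws InterruptedException {
        RetryHandler handler = new RetryHandler();
        handler.rememberTransformed("req-1", "ORDER{item=403}");
        handler.onResponse("req-1", 503, Optional.of(1L), msg -> System.out.println("buffer <- " + msg));
        Thread.sleep(1500); // give the scheduled re-queue time to fire
        handler.shutdown();
    }
}
```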
What we're also doing is that on the response side...
You know, when we get the response, if the response is a successful one,
we kind of save it in the database as well.
If it is one of the retry type of messages,
we will kind of maintain a count just mainly for debugging to,
just statistics to know how many times we had this problem.
So if you had to retry a message eight times,
that message will have account there and we'd know.
If it's around zero message,
we had a problem in terms of delivery
and even though the message came in at say 11am,
it actually got delivered say at 11:30am, you got a 30 minute delay.
So we can calculate statistics like that which are important.
Especially if the customer calls up and says,
"I clicked the button 30 minutes ago, why don't I see, why didn't anything happen?"
So those kind of queries, it's easier for us to find out what happened.
There are two more things we get as a result of this small change to handle retries.
One is that we have a record of every message
which has come into our system,
the way we transformed it, the way we sent it out,
and the response we got back.
So in case, you know,
you ever have to be audited to make sure that you did the right thing,
you have a full log of everything that's happened.
And if you've worked at a financial institution,
you probably know how important
the ability to audit your messages is.
And, you know, as Vineet probably mentioned in his talk,
having an event-sourced system is a huge asset;
auditability is pretty easy on an event-sourced system.
On a gateway, this is kind of the equivalent of that
because we have every message that came through,
we have a record and we can prove why we sent a specific message,
and nothing was tampered with.
Also as a result of this
we kind of got the ability to do something on real environments
that we normally wouldn't have.
In case we made a mistake on processing the response...
Let's say there's a defect on the response side of the transformer, right,
and while processing the message you just blew up, you threw an exception.
Well, since you have the response that the vendor sent to you,
all you have to do is fix the defect,
push it to production and take the same message and put it back into the queue
and sort of just push it through the system again.
Your system will think that it's a new message
and just process it all the way and the system will correct for itself,
you don't have to build anything elaborate;
the basic framework that we require is already there.
So that kind of system actually helps you recover in case you had errors as well.
Okay, let's go to the next challenge
which is what happens when vendors throw an error that you don't know of?
For this we'll take a quick example.
So let's say you are running out of cheesecakes,
you know, one can't really have enough cheesecakes in their life.
So what you'll do is, you know,
I basically visited the office, let's say I ate all of them,
so your system then has to go out and order new cheesecake.
So when the system announced the fact that,
you know, you didn't have enough cheesecakes,
the gateway notices that and places an order.
Your vendor normally should respond with the success message
but instead, just gives you a message
that absolutely does not make sense for the request you sent.
So you are confused especially
because the joke that the vendor just made is at least four months behind schedule...
Actually before 1/8, look it up.
So you're confused about what actually happened,
and the only thing you can do at this point,
when you don't know how to process the message,
is to find a human being to deal with it.
So this is where we introduce one of the last systems
into our ecosystem, a BPM,
represented here by Steve Carell from The Office.
So the reason why we have a BPM here is that,
you know, there's inherent strength in some things that computers do.
Computers are great at processing large volumes of data
and doing them really, really quickly.
But when you have an actual decision to be made,
when you don't know what's going on,
you know, that's something which is best left to a human being to do.
So a business process manager is a system
that we use to allow us to sort of manage cases.
So what we do at such a point,
when you have this kind of an error,
is create a case in this case management system.
Now there's multiple systems out there that we can deal with,
our client actually had one of these systems available, so we just used those.
If you've not used a case management system,
you can try envisioning maybe a ticket management system
which is kind of similar.
You send a request and it creates a case
and it notes that something had happened.
Now you are expecting that an agent looks at that case,
tries to understand what happened
and figure out what needs to be done to fix the problem.
Like the problem might not always be technical.
One of the cases we had was that a vendor deleted their ZIP code file,
their standard table where they have all US ZIP codes.
As a result, whenever we sent them an order,
they didn't recognize any of the addresses
because they'd go to do a check and just say, "This is not a valid US address,"
when it clearly was.
When these kind of errors just start flooding your system,
you know, a human being can look at it and be like, "Well, the address looks fine, I verified it."
So what this means is that the vendor is at fault.
So somebody picks up the phone and actually calls them up and gets them to fix it.
Now if you have the same problem
and you have say a couple of thousand requests per second
coming through your system,
by the time the agent logs in they might see half a million different cases.
I don't know about you, but I don't want to be the guy
who logs in first thing in the morning and sees half a million new cases
that I have to solve by the end of the day, right?
It does not make me happy, I'd totally end up quitting on something, I don't know.
So not wanting to have our people in that position,
we had to figure out a more intelligent way to handle these cases.
So what we did was,
we tried to group cases together based on their similarity.
Usually even though the errors are unknown,
you can look at a few key characteristics and know how to group them together.
You know where the error occurred, you know what kind of error it was,
you can take that, along with the system that you got the message from,
hash it all together,
and most of the time you'd get the same hash.
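A small sketch of that grouping idea; the fields chosen for the hash and the key format are illustrative:

```java
// Group unknown errors into one case by hashing their key characteristics:
// which system sent it, where the error occurred, and what kind of error it was.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

record VendorError(String sourceSystem, String errorLocation, String errorKind) {
    String groupingKey() {
        // Same three characteristics -> same key -> appended to the same case.
        return Integer.toHexString((sourceSystem + "|" + errorLocation + "|" + errorKind).hashCode());
    }
}

public class CaseGroupingDemo {
    public static void main(String[] args) {
        Map<String, AtomicInteger> openCases = new ConcurrentHashMap<>();

        // Half a million "invalid US address" errors from the same vendor...
        for (int i = 0; i < 500_000; i++) {
            VendorError error = new VendorError("vendor-1", "address-validation", "INVALID_ZIP");
            openCases.computeIfAbsent(error.groupingKey(), k -> new AtomicInteger()).incrementAndGet();
        }

        // ...end up as a single case with 500,000 occurrences, not 500,000 cases.
        openCases.forEach((caseId, count) ->
                System.out.println("case " + caseId + " with " + count + " occurrences"));
    }
}
```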
So when the zip code example that I mentioned happened,
even though there were half a million cases,
the number of hashes generated was one.
What we did was tell the BPM to look for that specific code,
and if the code is the same, not make a new case but just append it to the existing one.
So instead of the agent looking at half a million cases first thing in the morning,
what they actually saw was one case with half a million data points attached.
Since this case management tool also allowed automated fixes to be applied,
when we built a script to actually fix this problem,
all they had to do was select everything, apply the fix,
probably go grab a coffee or something,
and just watch the system fix itself, right?
So that's kind of the power of this system.
The last challenge we're gonna be looking at today
is what happens when you have a blue-green deployment being done on a system
when you are speaking to a third party system over JMS.
Now you've built this brand new architecture which is really resilient,
which is always available, takes messages,
processes them, sends them through;
it never loses anything, it's perfect, it's beautiful,
it's always available, except when you are about to deploy something,
because the moment you deploy something, you will probably bring it down, right?
There's a brief moment that your system will be unavailable.
There are multiple ways to tackle this in terms of deployments,
one of them is blue-green deployments
which is kind of a strategy that we as a whole selected for our platform.
So we'll go through a quick sort of crash course
I guess of what blue-green deployments look like.
Now if you have your user kind of trying to use your website and make calls,
what you do is you send the request that comes in from the user's browser into a,
you know, a system like HAProxy
or something else which balances the load.
Now when you deploy that standard stack that you see on your left,
which is basically a group of domains with the EventBus and your gateway,
let's call this stack the blue stack.
The load balancer sends every request
which comes in from the user to the domains.
And the gateway has also been configured to talk to the vendor at the bottom.
So, so far this kind of looks similar to what we've been talking about, right?
When you want to do a new deployment,
what we do is you take the entire stack that you see on the left
and basically deploy a newer version of every one of those services.
So let's call this stack the green stack.
So now you have the same set of services and this side is inactive.
It's called inactive because it's not actually taking live traffic,
it's just there, you want to make sure that it started up correctly, everything looks great,
you've done your automated system checks and you're like, "Yup, this is ready to go prime time."
This is the newest and the latest and the greatest version of our code
and we want to make sure it goes live.
So in blue-green deployments, there is this step called the swap
where as soon as you hit that button, the inactive side will start taking traffic.
So if you just noticed the green side
which was recently deployed is now the active side and the blue side is inactive.
Once you notice that all the traffic has drained from the inactive side, everything's great,
you will kill that entire section and you have just one side on production.
So this is blue-green deployments done correctly.
But this is not what actually happens
when we deploy code and do a blue-green deployment.
What actually happens is that
when you have the active side fully connected and everything's working,
as soon as you deploy your newer version of the code,
since the gateway knows the address of the vendor,
the vendor queues,
and it's been configured to always be connected,
guess what's the first thing it does when it starts up?
It goes out and connects to the vendor.
Now it has no messages to send to the vendor, so that's fine,
however, the active side is sending messages.
Now you have two different versions of the software
which are both trying to read the responses,
which means there's a 50% chance that your inactive side,
which you've not confirmed is correctly working,
is actually taking traffic. This is an anti-pattern.
Now unless you're ready you don't want it to take traffic.
So the side that you wanted to be inactive is actually active.
This is blue-green swaps done incorrectly.
Well, actually we've not even done swap yet,
so this is just blue-green done incorrectly at this point.
So the way you fix this problem
is the same way you fix a lot of problems in software development.
We fix this problem by adding another level of abstraction.
So what we did was we added this layer called the messaging proxy.
Now this is kind of similar to the thing that you see on the top;
we use HAProxy, which is the same component we used as our load balancer up there.
Except this time, we configured it in TCP mode.
So as soon as you try to connect to a queue,
the gateway has been given the address of the message proxy
and not that of the third party system.
The message proxy essentially, well, as the name suggests,
proxies the connection to the vendor.
So in this diagram, the gateway on the blue side
thinks that it is talking to the third party system
but it's actually talking to the proxy which is just relaying messages through, right?
It's really, really quick, it's a transparent proxy, it's not a performance hit
but there is an advantage.
The advantage is that when the green side comes up...
It tries to connect both,
you know, to read messages and to write messages.
What we did was that the blue and the green side have been given different port numbers.
Let's say this side is on ports one and two, and that one is on ports three and four.
So what we've done on the inactive side is the port
which is actually used to read the messages has been disabled.
So when the inactive side tries to connect to it to read messages, it just can't.
As a result what you'll see is that it can't pick up messages.
So that's why the inactive side stays inactive.
The moment you actually do a swap,
not only do you tell the HAProxy on top,
taking the HTTP traffic, to switch over,
you also tell the message proxy at the bottom to switch over.
So it will now make both the ports which have been configured
for the green side go live,
and the other side is left with just the one outgoing connection.
The reason why we still have the outgoing connection on the new inactive side is just prior to swap,
if there's a message which came in and you're still processing it,
you don't want to lose this message,
you want to make sure that it goes through, right?
So let's say a half a second or whatever is the standard processing time,
you know, couple of hundred milliseconds later when you're done processing the message,
it will actually write it out.
If the vendor responds to that message,
it will never go to the inactive side, it will go to the active side.
And all the traffic now starts flowing over to the new active side
and within a few seconds or minutes,
you'll notice that there's no activity on the inactive side at all
(you have monitoring on each of the services and queues),
and that's when you can switch it off.
Congratulations, we've made it through the entire thing.
Time for the summary.
So scale your integrations,
use declarative integration definitions wherever possible,
they are quicker to do and easier to manage.
And they also act as documentation in themselves
of the communication protocol between your systems...
Try to be vendor agnostic
and have vendor replaceability in your systems, right?
Being vendor agnostic allows you to build better, well-abstracted domains
which focus purely on business concepts and not on
system-specific concepts.
And if you do this correctly,
you will have the ability to replace vendors with a lot more ease.
This is great because your business will appreciate the stack a lot.
Making your business happy is probably the best thing a dev
can ever do, apart from actually shipping the project.
Next thing is plan for failure from day one.
Always, you know, build resilience from day one,
always plan that,
you know, something or the other will go wrong...
Just because you're scaling your system dynamically
does not mean that your vendors can scale theirs dynamically; be nice to them.
Have the conversation upfront,
make sure they can either scale as quickly as you
or that you've put in safeguards that you don't cause them problems.
You don't want them to be angry at you.
Have recovery plans in case of failures
and always assume that every single component in your system can fail.
Always have a backup.
Automate error handling wherever possible.
When error handling is not possible, use a BPM.
When building a resilient platform, it is important
to build a system which is not severely affected by issues outside your system.
Building resilient systems definitely makes Spock happy.
So I could just leave that up there and enjoy the dance; I love that kid.
And so thank you once again, my name is Karun Japhet.
And my contact details are up on the screen.
I'd love to hear your feedback on the talk.
So if you're tweeting, please tag my handle or send me direct messages;
the slides are up there as well in case you want to see them.
Some of you have already done that.
And I'm available for questions now.
Thank you.
Can you maybe give us a hint about the technology stack you guys used?
Of course I understand there can be
several platforms used, but which is the one that you guys like,
for the EventBus and for the adapters?
Okay, I'll repeat the question.
Now the question is what kind of technology stack did we use for the EventBus,
mainly around the EventBus and the adapters
and the other software I talked about.
For the EventBus, we ended up choosing ActiveMQ.
For the adapters we built something custom made.
The reason is that we wanted a declarative style to do transformations,
generally most people have pure code, written custom one time,
to be able to transform stuff,
but a lot of our
semi-business-ish users wanted the ability to do transformations themselves.
So you want to write something
which is very, very easy to configure;
it should essentially be configuration
that is easy to understand, to make sure that it's right.
So we built a framework which helps them do that,
and if you want, offline I can show you what some of that looks like;
I have a couple of hidden slides I skipped over.
So, yeah, a custom framework for the transformation bit,
also so we could use it across runtimes.
So this is a real-time environment, but in case you want to do batch processing
with Hadoop or something like that,
we had requirements where we wanted to use the same set of transformations,
because transformations in themselves are treated as assets in the organization.
So when you are talking to a specific vendor the language is kind of defined, right?
So if you have the language defined,
you want to reuse that same language across multiple systems
to make sure you don't rewrite the same code.
So we had to roll out something custom to make sure
that across all the runtimes the organization works with,
we could apply the same thing.
I think you are next and then...
Did you have any cases where your transformations weren't purely functional,
either through mistakes you made or in the platform you developed?
Oh, yes. Yeah, you're right. Sorry, I'll repeat the question.
Did you ever have a case
where your transformations were not really functional and required state to be maintained?
Yeah, we actually grappled with that question quite a bit, right?
So the idealistic view that we built in our heads
when we started the project was: hey, everything should be purely functional,
purely stateless; having state or mutations is not great;
let's build this idealistic garden where everything looks amazing.
And that works very well and does most of the job that we wanted it to.
Every once in a while it does not,
so the answer is yes, we had that, and we had to handle it.
The question we had to answer
was whether we wanted the framework that we were rolling out for transformations
to support mutation out of the box,
or whether we wanted to build a more generic framework where we allowed the user,
the end developer, to actually do that.
So what we did was we built transformation functions
where you could have hooks to say there's a data provider here.
So when you map field A to B
using function X, inside X, A is given to you,
so you just do a query or whatever it is, and that could maintain your state.
So we allowed users to do that from inside that sort of sandbox that we have.
So that's how we solved the problem.
It's kind of tricky, because every once in a while
you have an external integration that could fail,
and that means your transformation overall could fail,
so you have to handle more and more error cases.
But yeah, we ended up doing that,
and making sure that the sandbox we built for them is resilient.
We put retry policies in there to say,
if a database is unavailable, we can't translate a message,
so we throw an exception;
how we handle that is probably a topic in itself for a separate session.
Can you give examples of declarative
integration and configuration as opposed to the theory...
Guess I'm pulling the slide out, give me a sec.
I'll try going full screen, I don't know if keynote will let me do this.
Oh, sorry, you guys can see my desktop, the random wallpaper.
Oops, no, I have to unpack those slides.
Let me try this again.
Okay, cool.
So, to your question earlier, this is what the transformation kind of looks like;
our declarative transforms look like that.
It's actually saying, I need to transform...
I'm gonna skip the concept of a selector and target just for now
but the map bit is the actual transformation,
we are saying that map an item ID to something which is an ITEM ID in upper case.
So your source is on the left and your target's on the right
and you can also pass in a transformation function
inside the two right after the target.
So that's what that looks like.
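As a purely hypothetical illustration of such a declarative mapping rule in Java; the actual framework from the talk is custom and may look quite different:

```java
// A field-mapping rule declared as data (source, target, transform function)
// rather than hand-written getter/setter mapping code.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class MappingRule {
    final String sourceField;
    final String targetField;
    final Function<Object, Object> transform;

    MappingRule(String sourceField, String targetField, Function<Object, Object> transform) {
        this.sourceField = sourceField;
        this.targetField = targetField;
        this.transform = transform;
    }

    static MappingRule map(String source, String target, Function<Object, Object> fn) {
        return new MappingRule(source, target, fn);
    }
}

public class DeclarativeMappingDemo {
    public static void main(String[] args) {
        // "map itemId -> ITEM_ID, uppercasing the value", declared rather than coded by hand
        MappingRule rule = MappingRule.map("itemId", "ITEM_ID", v -> v.toString().toUpperCase());

        Map<String, Object> event = Map.of("itemId", "abc-123");
        Map<String, Object> vendorMessage = new LinkedHashMap<>();
        vendorMessage.put(rule.targetField, rule.transform.apply(event.get(rule.sourceField)));

        System.out.println(vendorMessage); // {ITEM_ID=ABC-123}
    }
}
```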
That's the transformation rule...
I don't have a full example to show, but
a non-declarative one you can imagine as just a Java function
where people take an object of type X
and convert it into an object of type Y.
We didn't want to build object-to-object
physical mapping, where you do
objectA.getField and set it on objectB,
or go through a constructor or whatever; that's the kind of code we didn't want to sit and write,
because it's hard to test and maintain.
Couldn't you use a Spring bean?
We could do that as well.
We actually couldn't use Spring in this particular project;
one of the requirements I mentioned was that
we had to have the ability to run this over multiple runtime environments.
In one of the environments that they wanted us to run in, they couldn't have Spring,
the entire Spring container;
the runtime there would not let you start up a Spring container.
So we had to go Spring-free,
at least for the adapter bit;
the other pieces can have it, like the gateway has Spring on it
because it's the container which runs the work.
Yeah, let's take one more question and then you can take it offline.
I think there was someone here who raised their hand first.
Sorry, I'll get right back.
I was just wondering, in the diagrams,
the calls between reader, transformer, and writer:
what protocols were those? Was there a queue in between
or was it just call to call?
It was just call to call; this is why I said
that when you have an error,
the stack just goes back all the way.
So we just use the same thread across.
In a more SCDF (Spring Cloud Data Flow) manner,
if you've seen that framework,
each of those steps would be a processor in itself
and would be separated by a queue, right?
So that's a similar sort of concept but a slightly different implementation.
One of our key requirements was that since transformation and message sending is an overhead,
and traditionally the organization had direct calls,
they didn't want this framework to take up too much overhead.
So the requirement we were given was:
you can take no more than 10% of the time that the pure call takes.
So if the call is 100ms,
all the persistence and everything else can take no more than 10ms, right?
And as much as we tuned our databases,
each of the database calls we made is five milliseconds for us,
which means you don't have much room left to actually do the transformation.
So if we put queues in, the queues would have to be backed up,
and one of the requirements was that each of them has to be persistent,
so that nothing is lost in case of failure.
That adds the kind of time which makes everything longer,
so putting queues in would have just gone above our budget.
So queues were ruled out.
Sorry, we're running late, so sorry.
So you can take it...
Karun will still be here if you guys can...
I'm still here.
Yeah, I just have couple of points, right?
So one, thanks everyone for coming, I hope you guys enjoyed it.
There is a feedback board outside right near the entrance,
there are stickies, you know, you can leave feedback about the talk,
about the meet-up,
what do you want to hear about,
you know, like other topics that you want to hear about,
we take that feedback seriously and kind of work on it and kind of improve as we go.
Another thing is, yeah, there's still some beer left.
So, yeah, help yourself, you know, mingle, network.
And another thing, if you have any questions after today,
you can either tweet or email Karun.
I guess you should go back to that slide so people can note down that.
Or you can also go to the Discussions tab on Meetup
and actually post questions there
because we'll be monitoring that and answering questions, okay?
Thank you, everyone.
Thanks, everyone.