# Facebook System Design | Instagram System Design | System Design Interview Question

## Metadata

- **Channel:** codeKarle
- **YouTube:** https://www.youtube.com/watch?v=9-hjBGxuiEs
- **Source:** https://ekstraktznaniy.ru/video/39009

## Transcript

### Segment 1 (00:00 - 05:00)

Hello everyone! Welcome to codeKarle. In this video we will talk about how to design a social network like Facebook. The important thing here is designing a system that can scale up to the kind of requirements Facebook has. I have already made a video on how to design Twitter, and I would recommend watching that first, because this video builds on it and takes it to the next level with changes that help the system scale to a much larger volume, at a lower cost, and in a much more scalable way. The same design can be used for any other social network as well: Instagram, LinkedIn, or even Twitter for that matter. So let's get started! Let's look at the functional and non-functional requirements we want this platform to support. On the functional side, the very first requirement is that a user should be able to make a post. A post is anything a user wants to share on the platform: it could contain an image, a video, or just text, and it can eventually be extended to many more content types. Next, a user should be able to like a post, comment on a post, and share a post. For now let's restrict commenting to posts only, not to other comments. The next requirement is adding a friend: a user should be able to go to somebody else's profile and add them as a friend. This is a non-directional relationship, so if I am your friend, you are also my friend. The friendship itself is non-directional, but there is a concept of friendship weightage, which we will come to in a while, and that is a directional concept. Next, a user should be able to see a timeline. Say I have a hundred friends; whatever those hundred friends have posted, I should see all of that content in my timeline.
So that's the timeline functionality. Next, a user should be able to see other users' posts and profiles. By another user's profile we mean the profile page through which you add them as a friend; another user's posts are essentially that user's timeline. It could be your own profile and timeline or some particular user's timeline, accessed through the UI by going to their profile and looking at their posts. The next requirement is an activity log: we should log everything a user does on the platform. For example, if a user posts something, likes something, comments, shares, or searches, all of that should be tracked and visible to the user somewhere. Those are the functional requirements. On the non-functional side, we know that any social network is a very read-heavy system. If I post something, all of my friends will see it, so for every write happening in the system there are probably hundreds of reads. The same was true for Twitter. The system should also have very fast rendering and posting times: if I post something, I should quickly get a success message confirming the post was saved, and whenever I look at somebody's profile, my timeline, or somebody's posts, all of that should render very quickly. But some lag is okay. Lag and latency are two very different things: latency should be low, but lag is acceptable. Lag means that if a user posts something and it takes 10-20 seconds for their friends to see that post, those 20 seconds are fine; but whenever friends do see it, the page should render very quickly. The next thing is access patterns: any social network has a very distinctive access pattern.
Whenever a post is created, it suddenly starts getting a lot of views, likes, and comments. At some point it reaches its peak, the highest level of interaction the post will have; then it slowly decays, and after a couple of days or weeks there is almost no interaction with that post at all. We will use this property specifically for handling images and videos, because that is the heavy content, and it lets us optimize hardware cost. One other thing is that this

### Segment 2 (05:00 - 10:00)

platform needs to be built in a way that is global. Being global has a few aspects. First, there is a huge variety of devices through which people will access the platform. Second, the platform should be available in multiple languages and should be able to handle content in multiple languages. Third, people come to this platform over a wide variety of internet bandwidths, so it should not be very slow in some geographies; if it is, we need enough servers in that locality so that people in that area can still access the platform quickly. So global means we need a global hardware presence as well. Now coming to the scale, these are some really big numbers. Facebook has 1.7 billion daily active users (DAU). That means 1.7 billion users access Facebook at least on the read path; some of them write as well, but most are reading. There are 2.6 billion monthly active users (MAU). Honestly, I'm not sure about the accuracy of this data, and it could be a bit off, but from what I have read these numbers are for Facebook alone, not including WhatsApp and Instagram, which are separate; including them, the number crosses 3 billion, I believe. Ninety-five percent of these people access Facebook through mobile phones, so we need some optimizations specific to mobile devices as well. Now, coming to the volumes behind some of the functional requirements, these are the numbers of actions people take every minute.
They upload one hundred and fifty thousand images every minute, which need to be uploaded to our servers, spread out across various CDNs, and made available to users. They update three hundred thousand statuses every minute, which need to be distributed across the whole user base, and they post five hundred thousand comments every minute. If you look at Twitter's numbers, they were smaller than this, in fact significantly smaller, so there we were able to cache all of the data. In this case, because of the sheer volume, we might not be able to cache everything, and that is where we will make some changes to the Twitter design. If you have not watched that video, I would recommend watching it first before going further. Now, very similar to the previous video, I have categorized users into five segments, and we will handle these segments separately in certain flows. Based on the requirements in your interview, you could make more or fewer categories, but for a Facebook-like system this segmentation should be good enough. Famous users are users with a huge number of friends. In our case we have not included followers as a functional requirement, so this is less of a problem, because I believe Facebook restricts the number of friends you can have, which keeps it a finite, small number. But if we added followers we would have a problem, so let's keep the Famous User category as well. Then there are active users: an active user is a person who has accessed Facebook in the last few days. For all practical purposes in this conversation, we'll consider people who have accessed Facebook in the last 3 days as active, and we'll do certain optimizations to make sure their page loads are fast enough.
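The five-way user segmentation described here can be sketched as a small classifier. This is a minimal illustration, not the video's implementation: only the 3-day active window comes from the transcript, and the famous-friend threshold is an assumed placeholder.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=3)       # from the transcript: active = accessed in last 3 days
FAMOUS_FRIEND_THRESHOLD = 4000          # hypothetical cutoff, not stated in the video

def classify_user(last_access: datetime, friend_count: int,
                  is_live: bool, is_deactivated: bool,
                  now: datetime) -> str:
    """Map a user onto one of the five segments: famous, live, active, passive, inactive."""
    if is_deactivated:                  # soft-deleted / fake / deleted-on-request accounts
        return "inactive"
    if friend_count >= FAMOUS_FRIEND_THRESHOLD:
        return "famous"
    if is_live:                         # currently connected to the platform
        return "live"
    if now - last_access <= ACTIVE_WINDOW:
        return "active"
    return "passive"
```

In a real system the segment would be recomputed on access events and cached against the user ID, as the later sections describe.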
Live users are the users who are browsing the platform right now, so instead of waiting for them to request new posts or new content, we will push them notifications over a connection between their device and our servers whenever new content comes in. Passive users are users who are not regularly active, so we will not cache any content for them; whenever they come into the system, we will generate whatever we need for them on the fly. Inactive users are basically deactivated accounts: fake profiles, or people who requested to have their accounts deleted, people who do not access Facebook at all. We just keep their accounts because it's good to have them, but they are soft-deleted accounts. Now let's go over the overall architecture. As usual, I have divided it into multiple components, and we will look at them one by one. A bit of convention to start with: things in green are the user interaction points. These could be web browsers, but for most of the

### Segment 3 (10:00 - 15:00)

users, they will be mobile applications. Things in black are the load balancers plus reverse proxy, plus the authentication and authorization layer that validates the API calls coming from the outside world into our system. Things in blue are the services we have written: web services, Kafka consumers, Spark jobs, basically the code written by us. Things in red are the databases, data stores, clusters like a Kafka cluster or a Hadoop cluster, and any open-source tools or third-party components we'll be using, such as a CDN. With that, let's start with the flows through which a user comes into the system. This is very similar to the Twitter design, so we'll go over it quickly. User onboarding and login, and user interactions like viewing, creating, and updating a profile, are this green box, and all of that is taken care of by the User Service. The User Service is the primary source of truth for any user information. It provides all the APIs the UI needs, as well as the APIs other systems in the overall infrastructure would need. These would typically be APIs to get a user by user ID, update a user, or get the details of many users in bulk; all of those are powered by the User Service. This service sits on top of a clustered MySQL database; because of the volume, we might need to shard the data across multiple machines, but MySQL works because user information is a very standard, relational set of data points. We don't really need a NoSQL database for it. Another thing about user information is that it is not updated very frequently: most attributes, like email ID, name, and location, are fairly constant, so MySQL is a good fit. The service also sits on top of a Redis cluster, which caches a lot of information.
As in the Twitter video, we'll be using a lot of Redis and a lot of caching here, and we'll go over what is cached at each point. To start with, the very first thing stored in Redis is all the user information. If an API call comes in to the User Service to fetch a user's details, it first looks in Redis; if it finds the information, it returns it from there. If not, it queries the MySQL database, gets the information, updates Redis, and then responds to the caller. Whenever a user logs in, an account is created, or anything else happens, an event is pushed into Kafka, for a number of reasons. For new accounts we will run a fraud check to validate that they are not fake. For account updates, we might want to push that information to other systems; for example, if a person changes their email ID or phone number, we might want to inform the Notification Service about their notification preferences. All of that is taken care of through this Kafka. The next thing is how a user adds somebody as a friend. A user can go to somebody else's profile and add them as a friend; all of those interactions are this green box, handled by the Graph Service. The Graph Service maintains the whole graph of the network: it keeps the relationships and also the weightages of those relationships. What weightages are, we'll come to in a while, but this service is the source of truth if you want to infer how close two people are, or which users a person is connected to; the Graph Service exposes APIs for all those kinds of functionality. Like the User Service, the Graph Service sits on top of a clustered MySQL database. It will need to be sharded because it holds many times more data than the User Service, but that data is again very structured.
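The read path just described is the classic cache-aside pattern. A minimal runnable sketch, with plain dicts standing in for the Redis cluster and the MySQL database:

```python
# Cache-aside lookup as described for the User Service.
# Dicts stand in for Redis and MySQL so the sketch is runnable.
redis_cache = {}
mysql_db = {"u1": {"name": "Alice", "email": "alice@example.com"}}

def get_user(user_id: str):
    user = redis_cache.get(user_id)      # 1. try the cache first
    if user is not None:
        return user
    user = mysql_db.get(user_id)         # 2. cache miss: fall back to the database
    if user is not None:
        redis_cache[user_id] = user      # 3. populate the cache for next time
    return user
```

A production version would add a TTL on the cached entry and invalidate (or overwrite) the key on user updates, so the cache never serves stale profile data for long.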
Think of it as a table with a user ID and a friend ID; there will be other tables that build in other functionality, but at its core it is just one table with that kind of mapping. All the user relationships are also stored in Redis. If I want to quickly find out who the friends of a particular user are, I would not want to query MySQL for that, because those APIs will be used very frequently by other systems. So all that information is also stored in a Redis cluster, where the key is the user ID and the value is the list of friends. I have drawn one cluster here; ideally, because of the volume, I'd want to keep two separate clusters for these

### Segment 4 (15:00 - 20:00)

two use cases, but we could keep one if we provision a bigger cluster; that depends. Besides this, a lot of other attributes also get stored in this Redis. The key everywhere is the user ID, and the values are the following. User details and friends we have already looked at. It also stores the user type against each user ID; remember the user types we talked about: active, live, passive, famous. Based on some logic, the User Service will identify what kind of user it is and update that. It will be notified by some other component: somebody will tell the service that a user is now live, or that a user has become famous, and it will update the type accordingly. Then there is something called relevance tags, which we'll look at in the last section. The idea behind relevance tags is that there is a lot of information that can be shown on Facebook. Say I post a status about sports: instead of sending it to all of my friends, if we restrict it to only the friends who are actually interested in sports, that could lead to an engaging conversation between me and them. People who are not at all interested in sports would just scroll past it, which is a bad experience for them, because they are seeing content they don't care about; it is also an opportunity loss for Facebook, which could have shown something relevant that sparked a conversation, more interactions, and probably more time spent on Facebook. Relevance tags will be generated through analytics, and they will be used when figuring out who needs to be shown a particular status. All right! The next attribute is last access time.
It is the same as the last access time we kept in Twitter. It is used for a lot of functionality, mainly showing how long ago a user was online in chat and things like that. We are not building chat as part of this video, but it could be used for that and other features as well. Now let's look at how a user posts content, and how they retrieve their timeline and other users' posts. First, some of the generic components we are using here. There is a component called the Short URL Service. Say somebody wants to post an external URL on Facebook; it's generally a good idea to keep track of what kinds of URLs are being posted and how often they are being hit. So the Short URL Service converts any URL into a short URL, the short URL is what gets posted on the platform, and when a person clicks on it, it redirects to the long URL. I've made a separate video on short URLs; have a look at that for the details of how it is implemented. Then there is the Asset Service. Remember, a post can contain an image or a video. People come to Facebook from various kinds of devices: phones, laptops, and the aspect ratios on these devices are very different. If a user uploads a square-ish photo, it might look bad on a mobile phone, so we may want to adjust resolutions to make pictures look good. Also, a very high-definition photo or video gives no added advantage when viewed on a mobile phone; it just consumes a lot more data and bandwidth. So it makes sense to reduce the resolution and the number of pixels for that content. All of those pieces are taken care of by the Asset Service.
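One common way to build a short URL service like the one mentioned above is to assign each long URL an auto-incrementing ID and base-62-encode it into a token. This is a hedged sketch of that technique, not the implementation from the referenced video; the in-memory store is a stand-in for a real database.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_id(n: int) -> str:
    """Turn an auto-increment row ID into a short base-62 token."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

# Hypothetical in-memory store mapping token -> long URL.
url_store = {}
next_id = [0]

def shorten(long_url: str) -> str:
    token = encode_id(next_id[0])
    next_id[0] += 1
    url_store[token] = long_url   # the redirect handler would look this up
    return token
```

Because the token is derived from a monotonically increasing ID, every click can also be logged against that ID, which gives the hit-tracking the transcript mentions.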
I have made a separate video on a video streaming and processing platform, something like YouTube or Netflix; this is essentially a whole video-serving platform. It takes care of converting files into multiple formats, bandwidths, and aspect ratios, and it takes care of streaming the content. It also handles one very important thing: remember the access pattern we talked about in the NFRs. When a photo is added, it is accessed a lot in the first few weeks, and then access slowly dies down. Once access to that photo dies down, we don't need to keep it on the CDN; we can remove it so the capacity can be used for other photos. That decision, which content resides on the CDN and which resides on S3, is also made by the Asset Service. We'll be using Amazon S3 as the image and video store here. It also handles one more scenario: say there is a celebrity who has a very old photo

### Segment 5 (20:00 - 25:00)

Now say another celebrity comments on that photo. Suddenly that old photo, which had been removed from the CDN, becomes popular again. This service keeps track of how the CDNs are being accessed and what the hit rate on each photo is, and if a photo is being accessed frequently from S3, it pulls it from S3 and pushes it to all the CDNs it is being accessed from. So this is a video/image serving optimization that accounts for global geographic access patterns. There are also the User Service and Graph Service, which we talked about in the previous section; we'll be using them in a bunch of places. Cool! Now let's start with the flow where a user posts content. That piece is taken care of by this "add a post" box, which talks to the Post Ingestion Service. If the post has image or video content, the Post Ingestion Service talks to the Asset Service; if it has URLs, it talks to the Short URL Service. It then has the final content that it needs to persist. For all post data I'm using Cassandra here, for the same reason as before: the volume of posts is very high, on the order of many thousands per second, and Cassandra can handle that kind of traffic on both the write path and the read path. Other than Cassandra, you could also use HBase here; both give similar performance. I chose Cassandra because it is much easier to set up. Once a post has come in and been saved to Cassandra, there is a service called the Post Service. The Post Service is the source of truth and the owner of post information, and it provides the APIs through which everybody else accesses a particular post.
If you want to get a post by ID, or get many posts by many IDs, all of those APIs are provided by the Post Service. Whenever a post comes in, the Post Ingestion Service reads the post, puts it into Cassandra, and then publishes an event to Kafka. From there it spreads out to other systems, which do their own jobs; at that point the Post Ingestion Service's work is done, and we move on to how a post gets back to users. There is also something labeled Analytics here. It is not a single service but a collection of components: on this Kafka there is a streaming consumer that listens to all incoming events and tags each post using a machine learning model, determining what category the post belongs to. Basically, it has a trained classification model; it runs the post through that model and comes up with tags for it, which will be used to score the post on various relevance parameters. Once this analytics cluster has tagged a post, it puts it back into Kafka, from where it is processed further. We can afford to do this because, remember, we had an NFR saying some lag is okay: if somebody adds a post and I see it a few seconds later, that is fine, and during those few seconds we can run all of this logic to figure out the right set of users who should see the post. Once Analytics puts the event back into Kafka, a component called the Post Processor runs. The Post Processor gets the event from Kafka, which contains the content of the post, some tags, and of course the ID of the user who posted it. It then queries the User Service and Graph Service; these are two separate services that I have clubbed into one box to make the diagram easier to draw.
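The persist-then-publish step of the Post Ingestion Service can be sketched as follows. This is a runnable illustration under stated assumptions: a dict stands in for the Cassandra posts table, a list stands in for the Kafka topic, and the event schema is hypothetical.

```python
import json

# Runnable stand-ins: a dict for Cassandra, a list for the Kafka topic.
cassandra_posts = {}
kafka_topic = []

def ingest_post(post_id: str, user_id: str, text: str):
    """Post Ingestion Service: persist the post first, then emit an
    event for downstream consumers (analytics, post processor)."""
    cassandra_posts[post_id] = {"user_id": user_id, "text": text}
    event = {"type": "post_created", "post_id": post_id, "user_id": user_id}
    kafka_topic.append(json.dumps(event))   # fire-and-forget; consumers fan out later
```

Writing to the store before publishing the event keeps the event stream consistent with the source of truth: a consumer that sees `post_created` can always fetch the post from the Post Service.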
The Post Processor now fetches all the friends of this user, the people who could potentially see this post; that is the graph information. Along with that, it also fetches the relevance tags stored against each user. We do some user profiling, which we'll look at in the next section; it is basically a classification of users into tags. A user could be tagged with, say, politics and sports. If the post is also about sports, that user will see it; if the post is about technology and the user is tagged with politics and sports, the user will not see that post. So the Post Processor fetches all this information about the post and about all the users this person is

### Segment 6 (25:00 - 30:00)

friends with, and comes up with the subset of users who should see this post. It then puts the post into the relevant timelines in Redis. This Redis essentially holds the timelines of all the users; remember the Twitter video, it is very similar to that. Say "user1" created this post and is friends with "user2" and "user3", and based on relevance the Post Processor decides that only U2 should see it. Then, in the timeline of U2, which already contains some posts, it inserts this post, say Pi. Now when user U2 accesses their timeline, they can look it up directly from Redis and don't need to do anything else. Once the posts are in Redis, the Post Processor's job is done. Now to the other side: what happens when a user accesses a timeline? There are two kinds of timelines a person can see. The first is when they go to somebody else's profile and want to see all of that person's posts; that is a very straightforward implementation. The UI calls the Timeline Service, saying "I want to see all the posts of this particular user"; the Timeline Service queries the Post Service for all of that user's posts and displays them. The other case is when a user wants to see their own timeline, which contains the posts of all of their friends. The flow is: the timeline box calls the Timeline Service, which then does several things. It fetches the user's timeline from Redis; but, remembering the Twitter video, that timeline will not contain posts from all users, only from normal users. Why? Because normal users can easily be handled with this model; for famous users we need to do something different.
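The fan-out-on-write step just described, including the relevance filter, can be sketched like this. Dicts stand in for the Graph Service, the per-user relevance tags, and the timeline Redis; the tag-overlap rule is a simplification of whatever scoring the real Post Processor would use.

```python
# In-memory stand-ins for Graph Service and Redis.
friends = {"u1": ["u2", "u3"]}
relevance_tags = {"u2": {"sports"}, "u3": {"politics"}}
timelines = {}  # user_id -> list of post IDs, newest first

def fan_out(post_id: str, author_id: str, post_tags: set):
    """Post Processor: push the post ID onto the timelines of only
    those friends whose relevance tags overlap the post's tags."""
    for friend in friends.get(author_id, []):
        if relevance_tags.get(friend, set()) & post_tags:   # relevance filter
            timelines.setdefault(friend, []).insert(0, post_id)
```

Note that only post IDs are written to the timeline, matching the later point that cached timelines hold IDs and the Post Service hydrates the full payload.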
The reason is that right now we don't have a problem with famous users, because we only have friends; but say we introduce the concept of followers, and famous users start having millions of followers. Then the Post Processor would have to update Redis for millions of users on every post, and that is not scalable. So we'll do exactly what we did in the Twitter design: we give the Timeline Service the responsibility of merging the data for normal and famous users. Whenever a user's timeline needs to be shown, the Timeline Service queries the User and Graph Services to figure out the list of users this person is friends with. It reads the user's timeline from Redis, which contains the posts of all the regular (normal) users, and it also knows which famous users this person is friends with. It queries the Post Service for all the posts of those famous users, then aggregates everything and sends it back. Before sending it back, it can also persist this merged timeline into Redis, along with a timestamp: say at time Ti it records that it inserted this data, containing both the regular users' and famous users' posts. The next time a request comes in, we look at this timestamp: if it is within the last couple of seconds, or maybe a minute, we return the data as is; but if it is more than a few minutes old, meaning the timeline was last refreshed a while back, we query the famous users' posts again, in case they have posted something new. This is how we handle regular active users and famous users. Now, we can do a further optimization for live users.
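The merge-on-read with a freshness timestamp can be sketched as below. All stores are in-memory stand-ins, `fetch_famous_posts` is a hypothetical callback representing the Post Service query, and the refresh window is an assumed value; sorting post IDs stands in for ordering by creation time.

```python
import time

REFRESH_AFTER = 60   # assumed: seconds before famous posts are re-queried

cached = {}  # user_id -> {"ts": last_refresh, "posts": merged timeline}

def read_timeline(user_id, normal_posts, fetch_famous_posts, now=None):
    """Timeline Service read path: serve the cached merged timeline if it
    was refreshed recently; otherwise re-query famous posts and re-merge."""
    now = time.time() if now is None else now
    entry = cached.get(user_id)
    if entry and now - entry["ts"] < REFRESH_AFTER:
        return entry["posts"]                        # fresh enough, serve as-is
    merged = sorted(normal_posts + fetch_famous_posts(), reverse=True)
    cached[user_id] = {"ts": now, "posts": merged}   # persist with timestamp Ti
    return merged
```

The design choice here is pull-on-read for famous users (cheap writes, slightly stale reads) combined with push-on-write for normal users, which is exactly the hybrid the transcript describes.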
Here is the live-user flow: when a post event comes in and the Post Processor figures out which friends should receive the post, it also knows, because it has queried the User Service, that a given user is live right now. So in addition to writing to Redis, it puts another event into a different Kafka topic, saying: this user is live, and this is the new content their timeline needs to show; somebody should go tell them. That is the job of the Live User Service. This service holds open WebSocket connections

### Segment 7 (30:00 - 35:00)

with all the connected clients. These are the regular users on mobile apps or web browsers who are on the platform right now, and this service keeps an open connection with each of them. Whenever the service receives an event saying that a particular user's view needs to be updated, it sends the event to the app over the WebSocket protocol, and the app then shows the new post however it needs to. That is the extra handling we do for live users. Next, there is something called the Archival Service. The timeline Redis cannot hold data for long periods, so what we are saying is: we'll put only today's data into this Redis, all the posts that have come in today. For everything older, we'll use the Archival Service. Remember, a post is created at a certain timestamp; it can accumulate comments and likes, but once created, its creation time never changes. Similarly, once a day's timeline has been created, new things get appended but the old entries don't change; they stay the way they are. So we can cache that timeline, but instead of an in-memory store like Redis, we can use Cassandra for it. What the Archival Service does is: at some point during the day, it fetches all the user timelines that have been created, puts them into an aggregated-timeline Cassandra, and clears the Redis cache. This ensures Redis holds only a finite, small amount of data. For users whose cached timeline doesn't exist, it creates one by simulating the same behavior; it can in fact call the Timeline Service, get the timeline, and save it for all of those users. The format in which it saves is roughly as follows.
It records that user Ui, on date DTi, has a particular timeline. Now, when the Timeline Service needs to build an actual timeline, it will query not only Redis and the Post Service but also the Archival Service, if the user scrolls down far enough to need yesterday's or older data. None of that older data needs to be computed on the fly; it can simply be looked up here, and this store can hold data not only for active users but also for famous, passive, and all other users. That's why I wanted Cassandra here: we can scale it out almost indefinitely. And note that wherever we cache or store timelines, we store only the post IDs, not the actual content; the Post Service is used to construct the full payload. There is one more concern wherever we've used Cassandra: we could create hot spots. Cassandra is a distributed store spread across many machines, and if you don't choose the partition key properly, it can happen that one machine serves most of the live traffic, with all the updates and all the reads hitting that single machine while the others sit idle. This happens if your partition key is something like a date: all of today's updates land on the partition handling today's data, and all the reads come from it too. So we need to be careful choosing partition keys; a user-ID-based key is a fairly good choice, but a purely date-based partition key would definitely create hot spots.
So wherever we are using Cassandra, we need to be doubly sure that we don't create a hot spot. Coming to the last section of this video, let's talk about how likes and comments are handled. So assume there's a UI through which people can comment or like. This could come through various devices - phones, browsers, whatever. So whenever a user presses a like button, the request goes to this Like Service. Now this Like Service is the owner of all the likes information. So whether you want to put a like on something or you want to get the number of likes on a particular post, all of that is powered by this Like Service, okay. Whenever a person adds a like, that information gets stored into Cassandra. Think of it as a table with a userID and a postID, okay. So postID contains the

### Segment 8 (35:00 - 40:00) [35:00]

post which is being liked, and userID is basically the user who's liking it. Let's say you want to add the functionality of a like being put on a comment. This comment would also have an ID, so we can add a parent-type kind of field, saying postID so-and-so, and the post type could be either a post or a comment or anything, right, and then the user who's liking it. Alternatively, if you want to add more types of likes - upvote, downvote, or something of that sort - you could add a like type in there as well. So that becomes the permanent store, but there is also a Redis that is used here. What is that for? Generally, whenever a post is displayed, the number of likes is shown along with the post content. The details are not shown, but the count is, right. That number of likes is powered by Redis. Redis stores the count of likes on a particular post, only for the recent posts, okay. Once a post becomes outdated, this particular entry, which has a TTL (Time To Live) of a few days, will get expired. But for all the recent posts, it should have the count. So whenever a like comes into the system, it gets stored into Cassandra and a Redis update gets called. Redis has an atomic update-and-get (the INCR operation), which takes care of thread safety and all of those issues. That can be used to power the updates. Now once a like has been posted, that information is also put into Kafka, saying this user has liked this particular post. How that will be used, we'll come to in a while. Now let's get to comments. So again there's something called the Comment Service, which is basically the repository of all the comments in the system. Whether you want to make a comment or you want to get all the comments on a post, that is powered by this Service. And this uses Cassandra as a Datastore...
It will have a very simple schema: a postID and a comment text, and whenever you want to get all the comments on a particular post, you just run a WHERE clause on the postID. It'll obviously have a userID, a timestamp and other attributes as well, but those are minor things. Now whenever somebody wants to get all the comments, that is also powered by this Cassandra. We don't really need caching here. The Redis on the likes side was used because it was storing aggregate information - the alternative to the Redis call was a SELECT COUNT(*) query on Cassandra. But here, in the case of comments, it is anyway a lookup by ID (postID); whether it is on Redis or Cassandra, it's not too slow. So we can use Cassandra without any cache here. Now we did not talk about sharing of posts earlier, but sharing a post would be implemented exactly the way a post is implemented. A shared post would be just another post with a parent ID of the original post, so that we can reference where this post is coming from. Now whenever a comment is added, it also gets posted into Kafka for further analytics. Let's come to something called the Activity Tracker. Remember, when we talked about the functional requirements, we said we want to capture all the activity done by a person on Facebook, right. So let's say a person makes a post, or a person likes a post - all of that should be tracked. Anyway, all that information is coming into this Kafka. Whether it is a person liking something, a comment, a post - everything is actually coming here, right. All we need to do is to store that information against a userID, with the timestamp at which the action was taken, and some attributes of that action.
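Storing each Kafka event against its user can be sketched as below. This is a minimal in-memory stand-in: the list plays the role of the Activity Tracker's Cassandra table, and the event field names (`user_id`, `ts`, `type`, `attrs`) are illustrative assumptions, not a real event schema:

```python
# Stand-in for the Activity Tracker's Cassandra table:
# rows of (user_id, timestamp, action, attributes).
activity_rows = []

def handle_event(event: dict):
    """Consumer side: turn one Kafka event into one activity row."""
    activity_rows.append({
        "user_id": event["user_id"],
        "ts": event["ts"],
        "action": event["type"],          # e.g. "like", "post", "search"
        "attrs": event.get("attrs", {}),  # e.g. {"post_id": "p1"}
    })

def activity_for(user_id: str):
    """GET API: one user's activity, most recent first."""
    return sorted((r for r in activity_rows if r["user_id"] == user_id),
                  key=lambda r: r["ts"], reverse=True)

# Events as they might arrive from the shared Kafka topic:
for ev in [
    {"user_id": "u1", "ts": 100, "type": "like",
     "attrs": {"post_id": "p1"}},
    {"user_id": "u1", "ts": 101, "type": "search",
     "attrs": {"query": "cats"}},
]:
    handle_event(ev)
```

The same consumer shape works for every producer, since likes, comments, posts and searches all land on the one topic.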
So this Cassandra of the Activity Tracker would primarily have a userID, a timestamp and an action - the action could be "liked comment ID so-and-so" or "made a post with postID so-and-so" or "searched with a search string" - all of that, right. So all of that would be powered by this Activity Tracker. It would have both insert APIs into Cassandra and get APIs to fetch the data from Cassandra and show it on an Activity UI. This Activity Tracker would listen to the same topic that everybody is writing to. Now coming to Search. Search is implemented in exactly the same way as it was in Twitter, so I will not go over it in much detail. But assume that, since all the events are anyway coming to Kafka, there will be a search consumer that reads all those events and stores them into Elasticsearch, which is very efficient for text-based searching, and there is a Search Service that sits on top of that Elasticsearch, runs the query and returns the data. In some cases we might want to cache the results of the search response, so before sending it to the user we might

### Segment 9 (40:00 - 45:00) [40:00]

cache it in a Redis as well. All of that is abstracted as part of this search box, for which you can look at the Twitter video, where I've explained it in much detail. And while a person is searching, that information also gets pushed into Kafka so that it can be logged by the Activity Tracker. So we have put a lot of information into Kafka, saying that we will do some analytics on it. Let's take some examples of what kind of analytics can be done. Let's say a person posts some content. That content can be classified, using some very standard classification ML algorithms, into certain tags. For example, a post could be tagged as a sports-related post. Now if a person is liking a lot of sports-related posts, that means the person is giving us a signal that they are interested in sports, right. So what we will actually do is have this Spark Streaming Consumer running, which will put all the information about all the activities by a user into this Hadoop Cluster. That would include the posts, the comments, the likes, what not. And then we'll have some algorithm which does this user profiling, saying: if a user is liking a lot of sports-related posts, that means the person is interested in that; if the person is commenting on similar posts, that also gives the same signal, right. So all of that information could be used to generate a user profile which classifies the user's interests. Now on real Facebook we have the option to explicitly like a particular thing, right - let's say I like a sports page, that means I'm explicitly giving a signal. But in this scenario we did not include that as part of the functional requirements, so we can use this data to infer all of those things. So that basically creates a user profile.
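The profiling step just described boils down to counting tags. A minimal sketch, assuming the upstream classifier has already turned each like/comment into a tag string (the tag names and `top_n` cutoff are illustrative):

```python
from collections import Counter

def build_profile(interaction_tags, top_n=2):
    """User profiling: count the tags of posts a user engaged with
    (liked or commented on) and keep the dominant interests."""
    counts = Counter(interaction_tags)
    return [tag for tag, _ in counts.most_common(top_n)]

# Tags of the posts one user liked/commented on, as inferred upstream:
events = ["sports", "sports", "politics", "sports", "music", "music"]
print(build_profile(events))   # ['sports', 'music']
```

The resulting tag list is what would be published back to Kafka for the User Service to store against the user.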
Once the user profile is created, it is put back into Kafka. Remember the very first section where we had a User Service which was storing a User Profile Tag kind of a thing - the User Service basically reads that event and stores those tags against the user, which is then used for various kinds of things in the system that we talked about in that section. Now there is something called Graph Weight. What is that? Let's say Elon Musk and I are friends on Facebook - hypothetically :) Now I might like a lot of his posts, but he might not really like mine, right. What that tells us is that I like the content that he posts, but he does not like the content that I post. So even though we are friends, and even though both of us will be shown some of the content that the other posts, there is some affinity of mine towards seeing his posts rather than somebody else's, right. That information can be captured by the Graph Weight Job, which basically says: out of all my friends, whose posts have I liked the most, or whose posts have I commented on the most? And based on that logic, it will create a profile of mine saying I am more inclined towards these two or three people and less inclined towards the rest of them, so I'll see more posts from these people. That becomes your Graph Weight Job. Now this runs on top of a Spark Cluster; it can query the data within Hadoop, run a couple of Spark jobs, do all the maths and publish the data, right. Now, we have left out something called Trends. Trends is very similar to the trending thing on Twitter that we talked about during the Twitter video. It basically tells us what people on Facebook are talking about right now.
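The Graph Weight Job described above reduces to counting my interactions per friend. A minimal sketch, assuming likes and comments are weighted equally (a simplification; a real job could weight them differently):

```python
from collections import Counter

def graph_weights(my_interactions):
    """my_interactions: friend IDs whose posts I liked or commented on.
    Returns friends ranked by my affinity toward them. Note this is
    directional: my weight toward a friend says nothing about theirs
    toward me."""
    return Counter(my_interactions).most_common()

# One user's engagement events, as pulled from Hadoop:
interactions = ["elon", "elon", "elon", "alice", "bob", "alice"]
print(graph_weights(interactions))
# [('elon', 3), ('alice', 2), ('bob', 1)]
```

The timeline generation can then bias toward posts from the highest-weighted friends.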
Basically what this Streaming Job will do is take all the posts and all the comments, tokenize them by splitting on space characters, remove some very common stop words like 'a', 'and', 'of', 'the', 'for', 'is', and all of that, and then the remaining words will be put into the Trends Service, stack-ranked by their count. So at any point in time we will know which word is the most used word right now. And that will give us some inferences about what kinds of things people are talking about. It cannot just be words - it can be phrases also - and we can add a lot of intelligence to figure out what people are talking about and what they are interested in right now. All of that information could be stored in a Redis cache, because anyway it's temporary information for right now and it will get updated in a while, and there can be a UI kind of a thing built on top of it if there's a need. So these were the functional requirements. Now coming to the non-functional requirements. One thing that I did not explicitly mention: all the services that you see here and in the previous sections are horizontally scalable. So whenever the

### Segment 10 (45:00 - 46:00) [45:00]

traffic increases, you basically add nodes to that particular service, or that particular database, or that particular cache, and it would scale automatically. But that being said, there are a lot of different kinds of technologies used here, and we need to do a lot of alerting and monitoring on all of these things, right. So for example, for a Web Service that we have built, we need to measure the latency it is having, the throughput, the CPU and memory usage, and the disk usage, and if anything spikes up, we need to send out an alert. Similarly, for all the databases, we need to keep track of what the throughput and latency are, how many queries are getting executed, and what the disk usage is - we don't want a database disk to be full at any cost. So all of those things - on the databases, on the caches, on the web servers, on Kafka - need to be monitored, and as soon as they go beyond a particular threshold, there needs to be good enough alerting, so that people get to know that something is going beyond limits and can look into it proactively, so as to make sure that we are able to maintain the uptime and latency and everything, thereby meeting the non-functional requirements. So I think that should be it for a Facebook kind of a system. Now just to reiterate, the same design can be used for any other social network like Instagram, Twitter, LinkedIn or anything of that sort.
