Twitter system design mock interview (with Senior Software Engineer)


Transcript (11 segments)

Segment 1 (00:00 - 05:00)

Tom: Hello, welcome to this system design mock interview. Very pleased to say we have a special guest with us today: it's Eugene, from one of our favorite channels, Crushing Tech Education. Eugene, thanks for coming on the platform.

Eugene: Thanks for having me, Tom.

Tom: It's great to have you on. You're going to be doing a classic system design mock interview question with us, but before you do, just quickly give people an overview of your background.

Eugene: Of course. I'm a senior engineer and manager. I started my career back in 2010 as a software engineer, and over the years I had the opportunity to work as a DevOps and network engineer, an engineering manager, and a VP of technology. I also have a YouTube channel called Crushing Tech Education, where I post a lot of system design content. The idea for the channel was to create content covering all those deep-dive concepts, because that was my frustration back in the day when I tried to learn this material: I just wouldn't find any deep-dive content, and the majority of videos were very shallow and high level. So I guess the channel is my attempt to fix that.

Tom: All right, let's stop chattering and get stuck in, shall we? Eugene, today I want to ask you to design Twitter.

Eugene: Sounds good. Twitter, or now it's X.com, right? Let's do it. First, I'm going to ask a few questions to make sure we're on the same page about which components and workflows you want me to cover. Maybe you can give me some components you want me to focus on, and I'll start from there.

Tom: Yeah, sure. Let's look at posting a tweet, the timeline, and user following.

Eugene: Okay, so we want to cover three main features, which is pretty much the main workflow: I post a tweet, and as a result that tweet goes to all my followers, and their timelines get updated. Can a tweet contain any media, like video or photos?

Tom: Yeah, it can contain links, photos, and video.

Eugene: All right, so we can have any sort of content in a tweet. And when you say timeline, there are two types of timelines, right? The first is the user timeline, which consists of the tweets I post myself, and the second is the home timeline, which is a mixture of tweets from the users I follow, plus perhaps my own tweets and some ads.

Tom: Yeah, correct.

Eugene: And do we also want to support users following each other to subscribe to their updates?

Tom: Yeah.

Eugene: Cool. Do you want me to cover any other concepts, like search, retweets, maybe favorites or likes?

Tom: Maybe Twitter search, but let's see if we have enough time for that.

Eugene: All right, sounds good. Let me talk a little about my assumptions, and stop me if I'm going in the wrong direction or there's a discrepancy in the thought process. It's Twitter, so all the workloads are going to be read-heavy: when people go to Twitter they mostly read, and they read far more tweets than they write. I'd say the ratio is probably one to ten or one to twenty, meaning for every tweet I post, I probably read around twenty. When I post a message, I'd also likely need to index it, making sure it reaches search across Twitter and, at the same time, all my followers' timelines. That's likely not going to be instant. Is it okay if we have eventual consistency here? I'd assume that getting a tweet into all the timelines and into search could take maybe 10 to 30 seconds.

Tom: Yeah, I think 30 to 60 seconds is fine.

Eugene: Okay. I'd also assume the majority of searches and tweet reads target the most recent data, because that's what Twitter is used for: reading the most recent news. But we'd also need to store historical data, right? Because if we

Segment 2 (05:00 - 10:00)

Eugene (cont.): do a search, we'd still want to find data that was posted maybe a year ago.

Tom: Sure, yes.

Eugene: All right. Is there anything else you want me to cover, or any other comments on the requirements or high-level assumptions?

Tom: I think that's all good for now.

Narrator: Eugene did a great job taking the question and reducing it into something more manageable. You should clarify the main workflows and use cases you need to cover, talk through the functional and non-functional assumptions, and confirm them with the interviewer. Then walk through the adjacent features and confirm whether they need to be covered.

Eugene: All right, sounds good. Then I'm going to briefly do some capacity estimation, so we know how much data we'd be storing and how much storage and network capacity the system would need. You should be able to see my screen here, can you see it?

Tom: Yeah.

Eugene: The tweet itself is limited, because we can't post that many characters. It's 160 characters as far as I remember, or 180, so a tweet would be at most about 10 kilobytes. Let me put down some high-level assumptions.

Tom: It's 280 characters, they moved it up, but it's not going to make a big difference.

Eugene: All right. So let's assume a tweet can be 10 kilobytes maximum. For the user base, how many active users do we want: 100 million, or one billion?

Tom: Say a billion users.

Eugene: So let's assume one billion users, and maybe 200 million daily active users who post messages. And on average every user has maybe 200 followers. Does that look good to you?

Tom: Yeah, that makes sense so far.

Eugene: All right, let's estimate the storage and network capacity we'd need. If we have 200 million daily users, let's say 100 million of them post a message every day, so we'd have 100 million tweets per day. Every tweet would be 10 kilobytes of text, but you also said we might have media, so let's say 200 kilobytes per tweet on average with media and pictures. In that case we'd have close to 20 terabytes of data per day. Now let's think about requests per second. With 100 million posts per day, dividing by the number of seconds in a day gives us roughly 1,200 write requests per second on average. During peak times we'd likely see more writes, say 5 to 10x more, which makes 6,000 to 12,000 requests per second. And on average, we would need
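The back-of-envelope math above can be checked with a few lines of arithmetic. All inputs are the rough assumptions agreed in the interview, not measured values:

```python
# Capacity estimation using the interview's assumed numbers.
SECONDS_PER_DAY = 24 * 60 * 60

daily_posters = 100_000_000      # users posting one tweet per day
tweet_bytes   = 200 * 1024       # ~200 KB average tweet incl. media
avg_followers = 20               # followers each tweet is fanned out to

storage_tb_per_day = daily_posters * tweet_bytes / 1024**4
avg_writes_per_s   = daily_posters / SECONDS_PER_DAY
peak_writes_per_s  = avg_writes_per_s * 5            # assume a 5-10x peak
fanout_msgs_per_s  = peak_writes_per_s * avg_followers

print(f"storage/day  : {storage_tb_per_day:.0f} TB")   # close to 20 TB
print(f"avg writes/s : {avg_writes_per_s:.0f}")        # ~1,200
print(f"peak writes/s: {peak_writes_per_s:.0f}")       # ~6,000
print(f"fan-out/s    : {fanout_msgs_per_s:.0f}")       # ~100,000+
```

These match the figures quoted in the interview to within rounding.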

Segment 3 (10:00 - 15:00)

Eugene (cont.): to propagate those tweets to 20 users. So for every write, and at peak we have around 6,000 writes per second, we need to push the message to 20 followers, which makes close to 100,000 fan-out messages. In other words, each second we need not only to write 6,000 to 12,000 tweets but also to push them out to all the followers, roughly 100,000 fan-out messages per second. Okay, do you want me to go to the API now, or should we skip that?

Tom: Yeah, let's have a look at the API for the timeline.

Eugene: All right. I think our API would be relatively straightforward; we can cover tweets and timelines here. When we post a tweet, we'd use the POST method, with a version in the path and then "tweets". That POST request would contain a user ID, the content, and the media. We'd have a separate endpoint for media, because we want to store media separately and just propagate a link to it, so there would likely be a dedicated media endpoint. Those are the tweet endpoints. Then we'd have a few endpoints for user connections: POST and DELETE for users following each other. When a user follows someone we use POST, and when the connection is broken and a user unfollows someone we use the DELETE verb. The path could look like "users/{userId}/follow". For the timeline, as you mentioned, we need GET endpoints, so we'd follow a similar pattern and call it "timelines". It would take a page size as a parameter; we'd also need a user ID, obviously, but that can be passed in the header. That's the home timeline. Then there's the user timeline, which would look similar, with the user ID in the path, and we'd pass the page and page size there as well. So that's roughly how the API would look. Is it good enough, or do you want me to cover something else here?

Tom: Yeah, I think that looks good.

Eugene: Okay. Now that we've covered the high-level functionality and methods, let's define the high-level components. As I think about it, we need a few. The main one would be a tweet service, or tweet processor, responsible for posting a tweet: when we post a tweet, we need to store it and then pass it along for some further processing
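The endpoints described above could be sketched as follows. Paths and parameter names are illustrative reconstructions of what Eugene drew, not the real Twitter API:

```
POST   /v1/tweets                                   body: {content, media_ids}; user id from auth header
POST   /v1/media                                    upload media, returns a media id
POST   /v1/users/{user_id}/follow                   follow a user
DELETE /v1/users/{user_id}/follow                   unfollow a user
GET    /v1/timelines/home?page=&page_size=          home timeline; user id from auth header
GET    /v1/timelines/{user_id}?page=&page_size=     user timeline
```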

Segment 4 (15:00 - 20:00)

Eugene (cont.): for indexing, search, and timeline generation. The second would be some sort of user graph service, responsible for all the user connections: it tracks which users follow each other and serves as the source of truth for that. And the third would be a timeline service, responsible for generating the timelines, storing them, and serving them back to users. Based on the requirements, do these components make sense?

Tom: Yeah, I'm on board with that.

Eugene: All right, sounds good. Now I think we can briefly talk about the database design, just how the tables would look.

Tom: Yeah, I think it would be good to cover the main entities.

Eugene: At a high level, our database design is relatively simple. The main entity is a tweet, which needs to contain a user ID, a tweet ID, likely a timestamp for when the tweet was created, and the content. We might also need links for the media, but those links can live inside the content: when we return the tweet to the user, we use those IDs and links to pull the media. So at a high level, that's it for the tweet, the basic functionality. For the user relations, we'd have a user-relation table with a many-to-many mapping for users who follow each other. If we used a single table for that many-to-many connection, we'd have a follower user ID and a followee user ID. Every time someone follows an account, we'd insert two records into this table: one for the user following another, and one for that user gaining a new follower. It's a bidirectional connection that requires two writes for a single follow on Twitter. It can be optimized a little by separating the follower and followee data, but we'll get to that later once we go through the main design.

Tom: Sure.

Eugene: I think that's good enough for now. There would obviously also be a user table, but it's just high-level user information, so we can skip it.

Tom: Yeah, I'm happy with that.

Narrator: As Eugene did, before getting into the design you should draft some capacity estimation for the system. Then you can walk through the API design. You also want to define a database design for the main entities the system operates on. Eugene did all this very convincingly, and he was nicely set up to start drawing his architecture diagram.

Eugene: All right, sounds good. Let me get to the high-level design and the components. As you mentioned, we want to cover three main features: posting a tweet, users following each other, and timeline generation. I'm going to create a few components for the user actions: the first is "post a tweet", the second is "user follow", and then we have timeline generation, the user timeline and home
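A minimal sketch of the two entities discussed above. Column names are illustrative, filled in from the conversation rather than any real schema:

```
tweets
  tweet_id      unique, time-sortable
  user_id       author
  created_at    timestamp of creation
  content       text, with media links/ids embedded

user_relations                    -- single many-to-many table
  follower_user_id
  followee_user_id                -- one follow = two rows, one per direction
```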

Segment 5 (20:00 - 25:00)

Eugene (cont.): timeline. I'm going to split it like this: on the left are user requests from the internet, call it "internet", and on the right side is our system, just to give a high-level frame. Do you want me to cover things like the API gateway, load balancers, authentication, authorization?

Tom: No, I think we can circle back to those at the end if we have time. Skip them for now.

Eugene: Okay. Then I'll assume we have load balancers and some sort of API gateway sitting in front of the system, and I'll skip those to focus mostly on the service architecture, if that's okay.

Tom: Yeah, great.

Eugene: All right, starting from posting a tweet. As I mentioned, we need a tweet processor service that receives the tweet and makes sure it gets propagated down the system, into search and into other users' timelines. This service also needs to be backed by a database, so let's add one here. It would probably be a NoSQL database, and since we want to focus on storing the most recent data, we might as well choose Cassandra, because it allows wide rows which, in our case, can be sorted by time. So let's assume a persistent database like Cassandra. We'll likely also need some sort of cache so we can retrieve records more efficiently; we can put Redis here. You also mentioned we can post media, so the tweet processor needs to take care of any media in the tweet. We'd need an asset service that the tweet processor can call to handle the media. You also mentioned long links, so perhaps there's a URL-shortener service here as well. So as we receive a tweet with media, we call the asset service, and it stores the media in object storage; there would likely be object storage that is also exposed through a CDN for quicker access. Do you want me to go deeper into the asset service and CDN?

Tom: No, I don't think that's necessary, let's move on.

Eugene: Okay, I'll skip it for now. So as we post a tweet, it goes to the tweet processor; if it has media, like a picture or a video, we call the asset service, store the media in object storage, and receive back a unique ID for that media

Segment 6 (25:00 - 30:00)

Eugene (cont.): that will later be used to retrieve the media and return it to the user. As we store this data, we also need to push it somewhere for further processing, indexing, and so on, so I assume we'd have some sort of queue here; let's say it's something like Kafka, and we'll circle back to this choice. So we store the message, return success to the user, and drop the tweet into Kafka for further processing. From Kafka, we'd need some sort of processor that retrieves those messages and sends them on, either for indexing, for search, or for timeline generation. I'll call it the timeline processor: a service that consumes messages from Kafka and updates the user timelines. Speaking of user timelines, let's circle back there. We need a timeline service, probably an API service responsible for serving the timeline, and it has to return the timeline to the user pretty much instantly, so those timelines need to be cached. I'll put a Redis cache here, and our timeline processor will be responsible for consuming tweets and making sure they land in the appropriate user timelines in the Redis cache. Again, we'll drill into this later; I'm just building the high-level workflow first. Does it look good so far?

Tom: Yeah, looks good, please carry on.

Eugene: All right. As I think about the timeline service, it needs to pull data from Redis, but we might also need tweets from users whose tweets didn't make it into Redis. So I assume we also need a service with direct access to the tweets, because if a user is not active, for instance, they might not have a timeline cached in Redis, and we'd need to pull it from the database: get the tweets from the database and return them to the user. So we'd have a tweet service, a simple API sitting on top of the tweet data that can serve tweets by user ID. This way our timeline service first has access to pre-generated timelines, and if a timeline isn't generated, it can call the tweet service, pull the data, and create the timeline in real time. That's not desirable, but we probably still need it. Okay, and user follow, the last piece I haven't covered yet. As I mentioned, we'd have a user graph service: an API in front of a database. It likely needs to be a key-value database; I'm not sure about Cassandra, so let's call it a key-value DB for now. It stores the user mappings and knows who follows whom. And the timeline service would probably need access to
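The read path just described, serve from the cache when the timeline is pre-generated, otherwise rebuild it on demand, can be sketched like this. Plain dicts stand in for Redis and the tweet service, and all names are illustrative:

```python
# Timeline read path: fast path from cache, slow path merges followees' tweets.
timeline_cache = {}                                     # user_id -> list of tweet ids
followees = {"alice": ["bob", "carol"]}                 # who each user follows
tweets_by_user = {"bob": [103, 101], "carol": [102]}    # newest first

def get_home_timeline(user_id, page_size=50):
    cached = timeline_cache.get(user_id)
    if cached is not None:                 # fast path: pre-generated timeline
        return cached[:page_size]
    merged = []                            # slow path: on-demand rebuild
    for followee in followees.get(user_id, []):
        merged.extend(tweets_by_user.get(followee, []))
    merged.sort(reverse=True)              # time-sortable ids => newest first
    timeline_cache[user_id] = merged       # cache for next time
    return merged[:page_size]

print(get_home_timeline("alice"))   # [103, 102, 101]
```

The slow path is exactly the undesirable-but-necessary fallback Eugene mentions for inactive users.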

Segment 7 (30:00 - 35:00)

Eugene (cont.): this database as well, to retrieve the list of users someone follows. Because if I'm a passive, not very active, user and I open my home timeline, my timeline might not be pre-generated, and the timeline service would need to get the list of users I follow and then retrieve their tweets from Redis or Cassandra. That covers the workflow when the timeline is not generated. Okay, I think that's how it looks at a high level. Let me walk through the whole workflow one more time so it's clear. When we post a tweet, the first thing we do is send it to the tweet processor; if there is media, we handle the media; and the tweet processor stores the tweet in the database and the cache. Ideally we'd store in both and use a write-through cache, but we want to return a 200 OK to the user as soon as possible, so we probably return the 200 once the data is stored in the database, and if the write to the cache fails for some reason, we handle it internally and don't block the user on the cache. Once that's done, we asynchronously push the tweet to the queue, in our case Kafka, and we need to decide how to partition it. We can partition multiple ways. The first is by user ID, but then some users can become hot, posting too often or disproportionately, and we don't want that; ideally we want as even a distribution as possible. So in our case we partition by tweet ID. We push the tweet to Kafka, and the timeline processor consumes it and updates the timelines. How would that look? Let me drill down. Say we post a tweet: it gets to our queue, and the queue becomes the central place where multiple functionalities plug in. We can have a service responsible for indexing and search, we can have various analytics consumers processing further, and the main piece is timeline generation. For that, I think what makes sense is a cluster of consumers that consume the tweets and then get the list of followers of the user

Segment 8 (35:00 - 40:00)

Eugene (cont.): that posted the tweet, and create a list of tuples. We can do it in a loop: we receive a tweet, pull all the users we need to push it to, and create tuples of the user ID that posted the tweet, the user ID the tweet has to go to, and the tweet ID. Then we post all the tuples back to a queue. It could be another Kafka queue, but in this scenario I think it makes sense to use Redis channels, which act similarly to a queue, with a channel per user. From those channels, further consumers read from the corresponding channel and push into the Redis database where the timeline is cached. Let me draw it: a cluster of consumers receives the tweet; for every follower we create a tuple and push it back to the queue, in this case Redis channels; every follower has a separate Redis channel receiving tweets from the main queue; and a service listens to those channels and writes into Redis. What data does Redis store? A key-value mapping between a user and a timeline: the user is the key, the timeline is the value, and the timeline's format is a linked list. This way we can insert into the list and remove from it very quickly; both operations take O(1) time. Does this design for the tweet processing make sense?

Tom: Yeah, makes sense to me.

Eugene: So I've covered the main workflow here: posting a tweet in pretty granular detail, and the important part is Redis. I think storing the timeline as a doubly linked list is a really good optimization, giving us O(1) insert and delete. It means we can keep a linked list of, say, 500 tweets per timeline, and as new tweets arrive we insert them and drop the oldest, all in O(1) time in main memory. There's another important point: we spoke about the system being real
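The fan-out step and the bounded in-memory timeline described above can be sketched together. Dicts and deques stand in for Kafka, the Redis channels, and the Redis lists; `deque(maxlen=500)` mimics the 500-entry linked-list timeline with O(1) insert at the head and O(1) eviction at the tail. All names are illustrative:

```python
from collections import deque

followers = {"bob": ["alice", "dave"]}   # author -> followers (from user graph)
timelines = {}                           # follower -> bounded timeline of tweet ids

def fan_out(author_id, tweet_id):
    # One (author, follower, tweet) delivery per follower: push the tweet
    # to the head of each follower's in-memory timeline.
    for follower_id in followers.get(author_id, []):
        timeline = timelines.setdefault(follower_id, deque(maxlen=500))
        timeline.appendleft(tweet_id)    # O(1); the oldest entry falls off the tail

fan_out("bob", 101)
fan_out("bob", 102)
print(list(timelines["alice"]))   # [102, 101], newest first
```

In real Redis the same pattern is `LPUSH timeline:{user} {tweet_id}` followed by `LTRIM timeline:{user} 0 499`, both cheap operations on Redis's list type.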

Segment 9 (40:00 - 45:00)

Eugene (cont.): time, and the point is that all these tweets, the whole main workflow of a tweet, are handled in main memory, because we don't want to be blocked on writing to the file system or any other slow data source. Ideally, as we post a tweet, all the communication and operations on the tweet happen in RAM, fast-access memory. I think we achieve that by sending the tweet to Kafka, performing the transformations on it there, then storing it in a Redis channel, which also lives mostly in memory, and the final output, the timeline, is in memory as well. This way the majority of users, or all of them, can have their timeline pre-generated and updated in near real time.

Narrator: Eugene has covered a lot of ground here, so let's take a second to review the steps he's taken. He started with a high-level workflow and made some good decisions, such as using a pub/sub system like Kafka to segregate publishing and processing of the tweets. He did a good job of talking through his assumptions as he worked on the diagram; constant communication with the interviewer is really important in a system design interview. Once you have a diagram, take your time to talk through it and explain how the components work together. Eugene went back over the workflow a second time, which helped him catch a few things he hadn't considered or dived into. When you're ready, make sure you do a deep dive on some of the components.

Eugene: The bottleneck here, I think, is the way we partition data, because the main entity of our system is tweets, and we'll have billions of them every single day. The bottleneck is how we store those tweets and how we query them.

Tom: So what options do we have to mitigate that?

Eugene: There are multiple ways; it all depends on how we want to partition the data: per user ID, or by tweet ID. Let's think. If we partition by user ID, all the tweets of a given user live on the same server, which is very convenient because we can easily query all that user's tweets. But it would likely cause bottlenecks and uneven tweet distribution, which we want to avoid; the system becomes hard to manage, because several popular users can end up on the same server and overload it. Since the main entity of our system is the tweet, I think we should partition by tweet ID. But then, to retrieve a user timeline, we'd need to query literally every single partition and collect a subset from each, which isn't very convenient either. The third option is double partitioning: partition by user ID first, then by tweet ID, which gives us two hash rings of data. Even if popular users land on the same server in the first ring, their tweets would still be distributed, because every hash range points to another hash ring. In theory this sounds doable, but the problem is that moving partitions becomes a very complex task: if we want to move users from one segment to another, not only do we move all that data, but we would also need
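The trade-off between the two partitioning keys can be made concrete with a stable hash. Shard count and key formats are illustrative assumptions:

```python
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    # md5 rather than built-in hash() so placement is stable across processes
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Partition by user id: all of one user's tweets land on a single shard,
# which is easy to query but concentrates hot users on one server.
assert shard_for("user:42") == shard_for("user:42")

# Partition by tweet id: one user's tweets scatter across shards, spreading
# write load evenly, but a timeline read must fan out to every shard.
shards = {shard_for(f"tweet:{i}") for i in range(1000)}
print(sorted(shards))
```

With 1,000 tweets over 8 shards the scatter touches essentially every shard, which is exactly why the tweet-ID scheme needs a scatter-gather read.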

Segment 10 (45:00 - 50:00)

Eugene (cont.): to move all the data from one hash ring to another, which is very complex. So I think what we can do is a mixture: partition by tweet ID, and perhaps pilot this double-hashing approach, which is questionable because it's complex to manage, but still an option. Am I going in the right direction with this?

Tom: Yeah, I think so. I think we're coming to the end. Just before we wrap the interview up, I wanted to give you a chance to think: is there anything you'd refine in your design and architecture before we finish?

Eugene: Sure, a few more bottlenecks and improvements. I mentioned the user graph and storing the connections in a single table. Ideally, instead of a single table, we split it into two tables: one dedicated to a user's followers, and the other to the users they follow. We keep the bidirectional connection, but we can split the responsibility and manage and scale these databases separately, which I think is a good improvement. The cost is that when one user follows another, we do two writes to two different databases, which might cause some discrepancy, but from a scaling perspective it's a really good improvement. Also, popular users, which I didn't cover. With this design, with multiple queues and this processing pipeline, it wouldn't really work for popular users: if I have 80 million subscribers and I create 80 million tuples and push them to the queue, it slows down the whole system. So this workflow would apply to generic users with, say, under 10,000 subscribers, and we'd keep a list of popular users, whatever the threshold is, maybe more than 10,000 or 50,000 subscribers, and treat them differently. Say I'm a user and I want to generate my timeline, and I follow some popular users. I get all the tweets from regular users through the main pipeline, and for the popular users I call the tweet service, retrieve their tweets from the main database, and plug them into my timeline. That gives us on-demand requests for the tweets popular users post; that's the workaround for popular users. An additional improvement is the way we generate tweet IDs. As we mentioned, we'll have a lot of tweets, and ideally we want the tweet IDs to be unique and to support causality, meaning we can sort them by timestamp. That's crucial for our system, since the majority of the data we keep in memory and serve is the most recent data. So whatever tweet ID we generate, we want it to be globally unique across the system and to support causality. The way I think Twitter does it is with a 64-bit tweet
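The hybrid push/pull timeline for popular users described above can be sketched like this. Tweets from generic users arrive via the fan-out pipeline (pre-computed cache), while tweets from popular users are pulled on demand at read time; the threshold and all names are illustrative:

```python
POPULAR_THRESHOLD = 10_000

follower_counts = {"celebrity": 80_000_000, "bob": 150}
pushed_timeline = {"alice": [104, 101]}        # filled by the fan-out pipeline
tweets_by_user  = {"celebrity": [103, 102]}    # pulled on demand via tweet service

def home_timeline(user_id, followed_users):
    merged = list(pushed_timeline.get(user_id, []))
    for followed in followed_users:
        # popular users are never fanned out; pull their tweets at read time
        if follower_counts.get(followed, 0) >= POPULAR_THRESHOLD:
            merged.extend(tweets_by_user.get(followed, []))
    merged.sort(reverse=True)                  # time-sortable ids => newest first
    return merged

print(home_timeline("alice", ["celebrity", "bob"]))   # [104, 103, 102, 101]
```

This avoids creating 80 million fan-out tuples per celebrity tweet at the cost of a small merge on every read.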

Segment 11 (50:00 - 54:00)

Eugene (cont.): ID, where one bit is the sign bit, to make sure it's compatible with all languages: with the sign bit fixed, the ID is always positive. Then I think they have 41 bits of timestamp, epoch time, and that timestamp gives us causality inside the ID. Then we need enough unique IDs per tick, so we can use 10 bits for a worker ID, because we'll probably have multiple workers generating these IDs, and 12 bits for an increment. This way we support causality through the timestamp; we support distribution, because there are multiple workers; and we get high availability, because if one worker goes down we're not impacted. With the 12-bit increment we get 4,096 IDs per worker per tick, so we can support tens of thousands of unique tweet IDs per second and never produce a duplicate, while still supporting causality. And when we store messages in Cassandra in a wide row sorted by the epoch part of the ID, we can retrieve the last 20, 100, or 500 messages without touching the older data in the database. That's an additional key point: optimize storage for the most recent data, not the historical data.

Tom: Okay, yeah, that makes a lot of sense, Eugene. We've run out of time, so I think we're going to have to wrap the interview up there.

Narrator: As you come to the end of the interview, you need to think about the corner cases. In the case of Twitter, would your system be able to handle uneven traffic distribution with popular users? The most common bottleneck is the database, so talk about how you could improve database performance; cover the multiple approaches and solutions that are possible, and their trade-offs. Eugene did this really well, rounding off a very impressive interview.

Tom: Great job, Eugene, well done. How was it for you, did you enjoy it?

Eugene: Oh, it was really good. A lot of concepts to cover, so I tried my best to fit it all into the time, but I had a lot of fun. Thanks, Tom.

Tom: Yeah, cool. It would have been nice to keep going for another half an hour, but thanks so much for doing it. I think people are going to find it really interesting and hopefully really useful. Thanks a lot again, and hopefully we'll see you back on here.

Eugene: Sounds good, thanks Tom. Bye.

Tom: Hello! Really hope you found that useful. If you did, you can like and subscribe, and why not come visit us at IGotAnOffer.com? There you can find more videos, useful frameworks, and question guides, all completely free, and you can also book one-to-one expert feedback with our coaches from Google, Meta, Amazon, and more. Thank you, and good luck with your interview.
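The Snowflake-style 64-bit ID layout discussed in the final segment (1 sign bit, 41 timestamp bits, 10 worker-ID bits, 12 sequence bits) can be sketched as follows. Twitter's published scheme uses milliseconds since a custom epoch; the helper function itself is an illustrative simplification (no clock handling or sequence rollover):

```python
# Snowflake-style ID: | 0 | 41-bit timestamp | 10-bit worker | 12-bit sequence |
EPOCH_MS = 1_288_834_974_657      # Twitter's published Snowflake epoch (2010-11-04)

def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    assert 0 <= worker_id < 1024 and 0 <= sequence < 4096
    return ((timestamp_ms - EPOCH_MS) << 22) | (worker_id << 12) | sequence

# IDs generated later sort after earlier ones, so a plain numeric sort
# orders tweets by creation time (the causality property Eugene wants).
a = make_id(EPOCH_MS + 1_000, worker_id=1, sequence=0)
b = make_id(EPOCH_MS + 2_000, worker_id=900, sequence=4095)
assert a < b            # time-sortable
assert a >> 63 == 0     # sign bit stays 0, so the ID is always positive
```

Because the high bits are the timestamp, a Cassandra wide row keyed on these IDs naturally serves "latest N tweets" queries without scanning older data.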
