This Meta SQL Question Tricks Almost Everyone





Segment 1 (00:00 - 04:00)

Most candidates fail this Meta SQL question not because they don't know SQL, but because they use the wrong SQL technique. I'm going to show you the trap and exactly how to avoid it. The trap is the most intuitive solution, but it's slow, inefficient, and it's not the answer they're looking for. The solution that gets you past the first round is a clean, powerful window function. Let me show you exactly how to write it and avoid the trap.

Here's the challenge they give you. The problem statement looks simple: calculate the average duration of a user session. A session is defined as the time between a page load event and the very next page exit event for that same user. Sounds easy, right? But the logs are messy. You have multiple loads, scrolling, and other noise. The interviewer is watching to see if you can cleanly isolate only the direct load-exit pairs. This is where people reach for the self join, trying to connect every load to every exit. But we're going to bypass that mess entirely.

If you use a self join to find the start and the end of each session, you have to join the table to itself on user ID; then again to keep only the latest page load per user and day; then once again to keep only the earliest page exit per user and day. Bonus: a mess of WHERE clauses. This is a very fragile approach. Change greater-than to greater-than-or-equal: bug. Forget the date alignment: cross-day bug. Put the action filter in the wrong place: wrong rows. Miss the IS NULL check: duplicates.

The pro move is to never join at all. I'm going to treat the data like a simple sequence. The plan is two steps. Pro step one: isolate the sequence. We only care about page load and page exit events for each user, in order. Pro step two: look one step back. For every page exit, we'll use a window function, LAG, to peek at the previous row. If the previous row was a page load, we have a valid session. That's it. No messy joins, no complex logic.
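To make the trap concrete, here is a minimal runnable sketch of the kind of fragile self-join-style query the transcript warns against. The table name `facebook_web_log`, the sample rows, and SQLite as the engine are my assumptions, not from the video; the inline comments mark the exact failure points the transcript lists.

```python
import sqlite3

# Hypothetical sample of the raw event log described in the transcript.
rows = [
    (0, "2020-01-01 10:00:00", "page_load"),
    (0, "2020-01-01 10:00:05", "page_load"),    # duplicate load seconds apart
    (0, "2020-01-01 10:00:30", "scroll_down"),  # noise
    (0, "2020-01-01 10:01:00", "page_exit"),
    (1, "2020-01-01 11:00:00", "page_load"),
    (1, "2020-01-01 11:02:00", "page_exit"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facebook_web_log (user_id INT, timestamp TEXT, action TEXT)")
con.executemany("INSERT INTO facebook_web_log VALUES (?, ?, ?)", rows)

# The trap: build the latest load and earliest exit per user and day,
# then stitch them back together with a pile of join conditions.
fragile = """
WITH loads AS (
    SELECT user_id, DATE(timestamp) AS day, MAX(timestamp) AS load_ts
    FROM facebook_web_log
    WHERE action = 'page_load'
    GROUP BY user_id, DATE(timestamp)
),
exits AS (
    SELECT user_id, DATE(timestamp) AS day, MIN(timestamp) AS exit_ts
    FROM facebook_web_log
    WHERE action = 'page_exit'
    GROUP BY user_id, DATE(timestamp)
)
SELECT l.user_id,
       AVG((julianday(e.exit_ts) - julianday(l.load_ts)) * 86400) AS avg_duration_s
FROM loads l
JOIN exits e
  ON e.user_id = l.user_id
 AND e.day     = l.day       -- forget the date alignment: cross-day bug
 AND e.exit_ts > l.load_ts   -- change > to >= and you get a bug
GROUP BY l.user_id
"""
for user_id, avg_s in con.execute(fragile):
    print(user_id, round(avg_s))
```

It works on this tidy sample, but every commented condition is a place where one wrong character silently changes the answer, which is exactly the fragility the transcript describes.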
This approach is more efficient, cleaner, and less bug-prone. This is the data we're working with: the Facebook web log, a simple raw event stream with three columns. user_id: the ID of the user. timestamp: exactly when an action happened. action: what the user did, such as page_load, scroll_down, and, most importantly, page_exit. Now, here's the catch. Look at the data for user ID zero, for example. Notice they have multiple page_load events, sometimes seconds apart, before a single page_exit. A session isn't just the first load and the last exit. A true session is a page load that is followed directly by a page exit. A clunky self join would struggle with this, but our window function will handle it perfectly. Let's prove it.

We'll use two common table expressions (CTEs) to keep this clean. The first CTE is session_events. It selects user_id, timestamp, and action, then uses the WHERE clause with the IN operator to keep only the page_load and page_exit actions. Done. Next we create session_pairs. We select the same columns as in the previous CTE, then use LAG as our secret weapon. PARTITION BY user_id puts each user's activity in its own separate lane, and ORDER BY timestamp orders their events chronologically. LAG just looks over its shoulder at the previous action and timestamp in that same lane. It's clean and hard to mess up.

Now that we've done the hard work, the final step is easy. We select user_id from session_pairs. Next, we subtract the previous timestamp from the current timestamp to get the session duration, then calculate the average duration with the AVG aggregate function. There is also a filter in WHERE that keeps only valid session pairs. Finally, we GROUP BY user_id to get the average per user. The code runs and it's correct. Next time you see a problem that requires comparing sequential rows, your brain should immediately think window function, not self join.
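The two-CTE solution described above can be sketched end to end like this. The exact table and column names (`facebook_web_log`, `user_id`, `timestamp`, `action`) and the sample rows are assumptions based on the video's description of the data, and SQLite is used here only so the query is runnable; the LAG logic itself follows the walkthrough.

```python
import sqlite3

# Hypothetical sample matching the transcript's description:
# user 0 has two loads seconds apart, plus scroll noise, before one exit.
rows = [
    (0, "2020-01-01 10:00:00", "page_load"),
    (0, "2020-01-01 10:00:05", "page_load"),
    (0, "2020-01-01 10:00:30", "scroll_down"),
    (0, "2020-01-01 10:01:00", "page_exit"),
    (1, "2020-01-01 11:00:00", "page_load"),
    (1, "2020-01-01 11:02:00", "page_exit"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facebook_web_log (user_id INT, timestamp TEXT, action TEXT)")
con.executemany("INSERT INTO facebook_web_log VALUES (?, ?, ?)", rows)

query = """
WITH session_events AS (
    -- Step 1: isolate the sequence, keeping only loads and exits.
    SELECT user_id, timestamp, action
    FROM facebook_web_log
    WHERE action IN ('page_load', 'page_exit')
),
session_pairs AS (
    -- Step 2: LAG peeks one row back within each user's own lane.
    SELECT user_id, timestamp, action,
           LAG(action)    OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_action,
           LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_ts
    FROM session_events
)
SELECT user_id,
       AVG((julianday(timestamp) - julianday(prev_ts)) * 86400) AS avg_duration_s
FROM session_pairs
WHERE action = 'page_exit'        -- current row is an exit...
  AND prev_action = 'page_load'   -- ...directly preceded by a load
GROUP BY user_id
"""
for user_id, avg_s in con.execute(query):
    print(user_id, round(avg_s))
```

Note how user 0's duplicate load is handled for free: the exit pairs with the load directly before it (10:00:05), not the first one, because LAG only ever looks one filtered row back.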
Avoiding that one trap completely changes the interview. It shows you value efficiency and readability, and it proves you can handle complex event-based data, which is what every top tech company, including Meta, is built on. If you're prepping for a data role at Meta or other big companies, I'll cover the exact SQL problems they use in real interviews in future videos. Subscribe so you don't miss the next one.
