# The Data Science Industry Has a Problem (And You're Part of It)

## Metadata

- **Channel:** StrataScratch
- **YouTube:** https://www.youtube.com/watch?v=htIMa-BrWNw

## Contents

### [0:00](https://www.youtube.com/watch?v=htIMa-BrWNw) Segment 1 (00:00 - 05:00)

I've been in this industry for too long, and I'm tired of pretending that what most people call data cleaning isn't complete garbage. If you're spending more than 20% of your time on data cleaning, you're doing data science completely wrong. "Always fill nulls." Wrong. "Remove outliers." Wrong. "Standardize everything." Wrong. These are the mantras of people who learned data science from YouTube tutorials and $49 Udemy courses.

Here's what your boot camp didn't teach you, and what your university professors who've never worked outside academia definitely don't know: most messy data problems are actually business logic problems that data scientists are too scared to challenge. You see duplicate order IDs and think "data quality issue." But you're wrong. It's a business process failure, and you're just putting lipstick on a pig.

And here's where it gets really ugly: the investigation phase. This is where you realize that everyone has been lying to you. The product manager says the data should work one way. The engineer says they built it differently because of technical constraints they never documented. The business analyst gives you a third version that matches neither. So you spend days tracing through code repositories, Slack histories, and meeting notes from 18 months ago trying to figure out why customer ID can be null when the schema says it's required. The answer is always the same: someone cut corners, someone made assumptions, and nobody bothered to document it.

So why does this keep happening? Because data science hiring is completely broken and nobody wants to admit it. Companies hire data scientists who can recite machine learning algorithms but can't read a database schema. They hire people who think correlation equals causation and that p-hacking is a legitimate research methodology. These people learned from Kaggle competitions where the data is already clean and the problem is already defined.
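The duplicate-order-ID situation mentioned above is the kind of thing that should be surfaced, not scrubbed. A minimal sketch (plain Python, with made-up example records) that flags the collisions as evidence to take to engineering, instead of silently calling a dedupe function and moving on:

```python
from collections import defaultdict

# Hypothetical raw export: order_id should be unique, but isn't.
orders = [
    {"order_id": "A-100", "amount": 40.0},
    {"order_id": "A-101", "amount": 15.5},
    {"order_id": "A-100", "amount": 40.0},  # exact duplicate: retry? double-submit?
    {"order_id": "A-102", "amount": 99.0},
]

def flag_duplicates(rows):
    """Group rows by order_id and return only the collisions.

    Returning the evidence (instead of dropping duplicates) forces
    the business-process conversation the transcript is describing.
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["order_id"]].append(row)
    return {oid: grp for oid, grp in by_id.items() if len(grp) > 1}

for oid, grp in flag_duplicates(orders).items():
    print(f"order_id {oid} appears {len(grp)} times -- ask engineering why")
```

The point of the sketch is the return value: a dictionary of collisions you can attach to a ticket, not a silently shortened table.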
They've never had to explain to a VP of sales why their conversion rate calculations are wrong, or fight with engineering to get access to the actual production database instead of some sanitized, month-old export. The result: organizations full of people calling themselves data scientists who are fundamentally unqualified to handle real business problems with real, messy business data. If you want to stop being part of the problem, here's what you actually need to do. And fair warning: this is going to require you to grow a backbone and start having uncomfortable conversations with people who outrank you.

Rule number one: demand context or don't touch the data. Stop being a pushover and demand real documentation. When someone hands you a data set and says "just clean it up," you say no. You demand a data dictionary. You demand to know the business logic. You demand to talk to the people who created the data. And when they say "we don't have time for that," you respond with, "Then you don't have time for accurate analysis either." Let them talk about you being difficult. They'll thank you later.

Rule number two: speak money, not math. Nobody cares that your p-value is statistically significant. They care that your analysis affects revenue, costs, or customer satisfaction. When you find a data quality issue, calculate what it's costing the business. Instead of saying "there's an error in the pricing logic," say "this bug is costing us approximately $2.3 million annually in lost revenue." Guess what will happen? The bug will get fixed in two days. The same bug had been sitting in Jira for eight months, labeled as low priority, because the previous data scientist reported it as a technical issue.

Rule number three: stop destroying evidence like an amateur. Every time you drop rows or overwrite values, you're destroying information that might be crucial later. Keep the raw data untouched. Create transformation pipelines that are reversible and auditable.
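Rule number three can be made concrete. A minimal sketch, assuming plain Python and made-up records (the class and field names here are illustrative, not from the video), of a pipeline that never mutates the raw extract and logs every step so any transformation can be audited or replayed:

```python
import copy

# Hypothetical raw extract -- loaded once, never modified afterwards.
RAW = [
    {"customer_id": "C1", "spend": 120.0},
    {"customer_id": None, "spend": 80.0},   # schema says NOT NULL; reality disagrees
    {"customer_id": "C2", "spend": -5.0},   # negative spend: refund or bug?
]

class AuditedPipeline:
    """Apply named transformations to a copy of the raw data,
    keeping an audit log of what was done, why, and the row counts."""

    def __init__(self, raw):
        self.raw = raw                      # untouched evidence
        self.data = copy.deepcopy(raw)      # working copy
        self.log = []                       # (step, reason, rows_before, rows_after)

    def apply(self, name, reason, fn):
        before = len(self.data)
        self.data = fn(self.data)
        self.log.append((name, reason, before, len(self.data)))
        return self

pipe = (
    AuditedPipeline(RAW)
    .apply("flag_null_ids", "schema violation -- keep the rows, mark them",
           lambda rows: [{**r, "null_id": r["customer_id"] is None} for r in rows])
    .apply("quarantine_negative_spend", "pending an answer from billing",
           lambda rows: [r for r in rows if r["spend"] >= 0])
)

for step in pipe.log:
    print(step)
```

Because each step records its reason and row counts, "why did 1 row disappear?" has an answer, and `pipe.raw` still holds the original evidence if a decision has to be reversed.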
Use version control for your data, not just your code. I can't tell you how many times I've seen data scientists panic because they cleaned away something important and have no way to get it back. Don't be that person.

Rule number four: document like you're building a legal case. Document like your career depends on it, because it does. Every assumption, every decision, every conversation needs to be recorded, not in some fancy tool, just in comments in your code or a simple text file. When the VP asks why the customer churn model suddenly changed, you'd better have receipts. The business will blame you for incorrect analysis when the real problem is that you have no documentation to defend your methodology. Your future self will thank you. And more importantly, you'll have evidence when people try to throw you under the bus for problems that weren't your fault in the first place.

Most of you shouldn't be called data scientists. You should be called business intelligence analysts, or reporting specialists, or data analysts. And that's not an insult. Those are valuable roles. But stop pretending you're doing science when you're just making dashboards and running A/B tests someone else designed.
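Rule number four explicitly doesn't require a fancy tool. A minimal sketch, assuming nothing beyond the standard library, of a decision log kept next to the analysis code and committed with it (the entry below is a hypothetical example, not from the video):

```python
import json
from datetime import date

# Decisions appended here as the analysis evolves; the file this is
# dumped to lives in the same repo as the code.
DECISIONS = []

def record(decision, reason, source):
    """Write down the assumption, why it was made, and the receipt."""
    DECISIONS.append({
        "date": str(date.today()),
        "decision": decision,
        "reason": reason,
        "source": source,   # a person, ticket, or thread to point at later
    })

record(
    decision="Orders with null customer_id counted as guest checkouts",
    reason="Checkout service omits the field for guests (undocumented)",
    source="Conversation with the checkout team",  # hypothetical receipt
)

# When the VP asks why the numbers moved, print the receipts.
print(json.dumps(DECISIONS, indent=2))
```

A plain text file works just as well; the point is that every assumption has a date, a reason, and a name attached before anyone asks.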

### [5:00](https://www.youtube.com/watch?v=htIMa-BrWNw&t=300s) Segment 2 (05:00 - 05:00)

Real data science requires the ability to question fundamental business assumptions, challenge stakeholders when they're wrong, and design experiments that actually isolate causal relationships. The industry will be better when we stop inflating titles and start being honest about what we actually do.

---
*Source: https://ekstraktznaniy.ru/video/38865*