David Hirko - AI observability and data as a cybersecurity weakness
49:03

Towards Data Science · 28.09.2022 · 275 views · 5 likes


Video description
David Hirko, co-founder of data observability company Zectonal, joined Jeremie Harris in discussing data observability, data as a new vector for cyberattacks, and the future of enterprise data management.

Intro music:
➞ Artist: Ron Gelinas
➞ Track Title: Daybreak Chill Blend (original mix)
➞ Link to Track: https://youtu.be/d8Y2sKIgFWc

0:00 Intro
3:00 What is data observability?
10:45 “Funny business” with data providers
12:50 Data supply chains
16:50 Various cybersecurity implications
20:30 Deep data inspection
27:20 Observed direction of change
34:00 Steps the average person can take
41:15 Challenges with GDPR transitions
48:45 Wrap-up

Table of contents (10 segments)

Intro

Hello and welcome back to the Towards Data Science podcast. We'll be talking a lot about the data science life cycle today, and if you're a Towards Data Science reader, you'll know that we've had some really interesting articles on that topic as well. So this actually seems like a really good place to remind everyone that if you have an idea for an article on that, or any other interesting data science topic, don't let it sit in your drafts folder. Fire it our way, because I know a bunch of editors at Towards Data Science who would love to check that out.

On to today's episode, which is going to be really interesting, because it's rare that we get to spend an entire episode talking about just data. Let's say that you're a big hedge fund and you want to go out and buy yourself some data. Data is really valuable to you; it's literally going to shape your investment decisions and determine your outcomes. So you go out, you buy your data, and then a cold chill runs down your spine: how do you know that your data supplier gave you the data they said they would? From your perspective, you're just staring down a hundred thousand rows in a spreadsheet, with no way to tell if half of them are made up, or maybe even more for that matter. This might seem like an obvious problem in hindsight, but it's one most of us haven't even thought of. We tend to assume that data is data, and that a hundred thousand rows in a spreadsheet are a hundred thousand legitimate samples. The challenge of making sure that you're dealing with high-quality data, or at least that you have the data you think you do, is called data observability, and it's surprisingly difficult to solve for at scale. In fact, there are now entire companies that specialize in exactly that. One of those companies is called Zectonal, and their founder, David Hirko, will be joining us for today's episode of the podcast.

Dave spent his entire career studying data observability, the challenge of evaluating and understanding data at massive scale. He did that first at AWS in the early days of cloud computing, and now through Zectonal, where he's working on strategies to allow companies to detect issues with their data, whether they're caused by intentional data poisoning or unintentional data quality problems. Dave joined me to talk about data observability, data as a new vector for cyberattacks, and the future of enterprise data management, on this episode of the Towards Data Science podcast.

The topic for the day is a little bit offbeat, because we normally talk about algorithms, scaling, AI alignment, safety, those sorts of things. I think the story we'll be exploring today is not entirely untold, but it is less explored, and it's no less important. I mean, we're really talking about data here, the foundation on which all our models are built. And you've had a really fascinating journey in this whole space, the space of what's now become known as data observability. I'd love to understand, first off, what brought you here, what brought you to this world, and then maybe give listeners

What is data observability?

a bit of a sense of what data observability is as well.

I think the journey to where we are today really starts with my time in the early days at Amazon Web Services, in the early days of cloud computing, where we just saw the massive amounts of data being stored and collected as storage costs started to go down, along with ubiquitous access to massive amounts of compute for really low cost. Everything started with a data-centric world in my mind when joining AWS. Having been there for several years (I always like to say AWS is the second-best company I've worked for, just a phenomenal experience), we decided to start our own company in about 2014 that was really focused on distributed analytics. If you think about those early days of Hadoop and Spark, folks didn't really think you could run those kinds of distributed systems in the cloud. Outside of AWS, a lot of the companies, Cloudera, Hortonworks, were really pushing more of an on-premise view, and so we started a company to focus on that. Then we were involved in a lot of the alternative-data fintech space, where we were starting to see these really esoteric data sets being used by financial institutions, hedge funds and the like, and that was really the first inkling of how to start monitoring data: how should we be treating the data that's feeding these algorithms? So I had an opportunity, after about eight years of running my last company, to start this new venture that we call Zectonal, and that was in February of 2021. We were really focused on data observability. We weren't using the term at the time; I think it's a fairly new term that's taken root over the past couple of months, but the concepts are similar: how do we look at trends in the data that feeds algorithms? I wrote TensorFlow algorithms for several years, and you start to really see that a lot of the errors that are introduced, specifically in training for machine learning algorithms, come from abnormalities inside the data itself. It's not always the algorithm. You measure the performance of a model, ask where the discrepancies are coming from on inference, and almost always it was coming from the data. So, like all founders, we started writing code in February 2021, and here we are today with our product that does data observability. The final note on that is that, having come from a bit of a cybersecurity background (I was fortunate enough to have started Apache Metron, which then incubated all the way through to a top-level Apache project), we felt that security just needed to be embedded in data observability: looking not only at the characteristics of data quality, but at whether there are threats inside the data. I know we'll talk a little more about that later, but that's how we got here.

Right, and actually that's such a great teaser for the massive iceberg we're about to unpack here, with data security being connected to data observability. It's something that took me by surprise, actually, even though I was tangentially aware of some of the security issues you can run into here and there. When I read some of the posts you put together on the topic, I thought: wow, holy crap, there's a lot here. Before we dive into that, though, I do want to unpack this term data observability, and in particular look at some of the metrics, some of the ways you judge data quality. Obviously data observability is somehow tied to the assessment of data quality. Actually, that's a good question: is data observability almost the verb, the action, and data quality the noun, the target of that action? Or how are those related?

That's how we would define it. Our customers define it, historically, as data characterization: you think about the macro and micro trends in how you acquire and ingest data. For a long time we just called it data characterization. I think the industry has taken hold of the term data observability; certainly it helps when you're defining markets and things like Gartner Magic Quadrants, but it's really about macro and micro trends in how you acquire data. Some of the key metrics, which we started to look at early on from my background: service level agreements with data providers. Are you getting your data on time? Do you have an SLA for data arrival? If you're a Wall Street trader building super-fast algorithmic trading platforms, you want to make sure your data is getting there on time, and oftentimes you don't want to be trading when your data doesn't come in. Sometimes it's as simple as a green light or a red light: is your data coming in? We started with that basic concept, and then you get into higher-level data quality characteristics, like: is your data stale? Not only did your data not come in, but is it late? We'd see, especially for machine learning training, that if your data came in a week late, because your upstream data provider somehow didn't get an ETL job done when you thought it would, all of a sudden you've got this model doing inference on what you believe are real-world scenarios, and then this whole bundle of data arrives a week late. What do you do? At the very least you want to know that the algorithm should perhaps be retrained, or maybe it shouldn't, but you want an indication that your data is not necessarily complete; from there, our customers make decisions about whether to retrain or not. Sometimes it's file size, file volume. A lot of times we'll see basic things like: you're expecting 10,000 files a day, did you get ten less? You look at aggregate size per day, aggregate size per month, especially if you're buying data from the commercial market. When you look at ad-tech data sets, location data sets, those are very expensive data sets for a lot of consumers, so you want to know: am I getting everything that I bought? It's beyond the ability of humans to look at that and understand it, so you really need tools and capabilities for it. Those are some of our macro quality concepts. Then we think about the more micro concepts, starting to peel back and look inside the data: am I getting all the right columns as part of this schema? You'd probably not be surprised, but it's surprising to me at least, if you think about tab-separated files and CSV files, how often an unencapsulated comma creates an extra column at the end and throws some downstream analytic system out of whack. At a certain scale you want to be able to detect those things. We see patterns where null values start to get introduced into very large data sets, and you want to quantify that: if my baseline is that 25 percent of the data in a Parquet or CSV file is null, but that creeps up to 50 percent, that's an indication something's amiss, and of course those all have significant downstream impacts on the algorithms that use the data. So we think of data observability as both the macro and the micro trends that are going on.

And I imagine, for listeners who are thinking about this through the lens of some of the episodes we've done on safety, this also starts to become relevant. In one case you mentioned data that comes in late; well, we have a word for that in AI safety, it's called out of distribution, potentially, or you could be looking at a problem like that. So making sure your data is on time is actually a safety problem in many cases. It's really fascinating. I'm curious what some of the, let's say, funny business is that you've seen happen with data providers, because I did read about this a little
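[Editor's note: the macro and micro checks Dave describes above (file-count SLAs, null-rate drift against a baseline, an unencapsulated comma adding a stray column) can be sketched in a few lines of Python. The function names and thresholds below are illustrative assumptions, not Zectonal's actual product or API.]

```python
# A minimal sketch of two "micro" data-observability checks from the
# discussion: (1) detecting rows where a stray comma produced an extra
# column, and (2) alerting when the null rate drifts past a baseline.
# Names and tolerances are hypothetical, chosen for illustration only.
import csv
import io


def column_count_ok(csv_text: str, expected_cols: int) -> bool:
    """Flag files where an unencapsulated comma added an extra column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return all(len(row) == expected_cols for row in rows)


def null_rate(rows: list[list[str]]) -> float:
    """Fraction of empty cells across all rows."""
    cells = [c for row in rows for c in row]
    return sum(1 for c in cells if c == "") / len(cells)


def null_drift_alert(baseline: float, observed: float,
                     tolerance: float = 0.1) -> bool:
    """Alert when the observed null rate drifts past baseline + tolerance."""
    return observed > baseline + tolerance


good = "a,b,c\n1,2,3\n4,5,6\n"
bad = "a,b,c\n1,2,3\n4,5,6,7\n"   # stray comma adds a fourth column
print(column_count_ok(good, 3))   # True
print(column_count_ok(bad, 3))    # False
print(null_drift_alert(0.25, 0.50))  # True: the 25% baseline jumped to 50%
```

In practice these checks would run against file manifests and column statistics at ingest time, before the data ever reaches a training or analytics job.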

“Funny business” with data providers

bit, and I found it almost slightly amusing, but it is also concerning. What are some of the things that your customers, your clients, have discovered when they've turned these sorts of observability tools on, especially with respect to third-party data suppliers?

Sure. Probably the funniest story I've ever heard: it was a fairly significant financial services institution, and they were running their data pipeline using a bunch of Python scripts on somebody's laptop. We were looking at the trend of when that data stopped flowing, as part of our own product and time series, and then we started to correlate it: the data stopped flowing every time that one data scientist closed their laptop lid, and then the pipeline shut down. So you think about non-resilient data pipelines feeding these very expensive, revenue-generating algorithms, and it's something simple and silly like that. It was funny in hindsight; at the time we were just thinking about how to build a more reliable data pipeline. But there's all sorts of things. It really turns a light bulb on in an area of focus that a lot of organizations just aren't looking at. I think we're in an era where everyone understands that their data can be monetized, and so, compared to five years ago, when you'd have to convince people to save their data and store it at low cost, now we're really evangelizing: are you aware of the quality of the data? If you asked a lot of CIOs at big enterprises how much data they are storing, I think you may not always get accurate answers. In fact, we're finding that a lot more CISOs have a better appreciation for their data assets than maybe even the CIOs. But the world's changing; people want to quantify not only that they're storing their data, but what it's worth, and I think we play a part in that.

Yeah, and what a big problem to take on at this point, too. And to do that, one metaphor, one analogy that you've started to lean on, I've noticed, is this idea of the data supply chain. Would you mind unpacking

Data supply chains

that a little bit? What is a data supply chain, in your eyes?

I'll use the example of weather data, because weather data, surprisingly, is used by so many organizations. Think about fusing weather data into other types of analytics: if you look at earnings statements from CEOs, a lot of times they'll blame bad earnings on bad weather patterns, unforeseen weather patterns, so how do you bake that into your algorithms? Weather data is typically bought from a data aggregator, but it's federated through many layers of a supply chain, down to probably an individual piece of hardware that you could put outside in your backyard or in a park. There are literally tens or hundreds of thousands of these devices spread out all over the world, and they all generate data that gets aggregated, usually through multiple different companies. What's really shocking to us is that we'll go into a client and ask, well, who are you buying your weather data from? There are a couple of big-name weather data providers. And then we'll ask: do you know where they're getting their data from? Because they're buying their data from different aggregators and buyers, and so it goes all the way down. You're trying to track the actual source of your data, down to that individual sensor. If you think about traditional supply chains, one you study in history is automotive supply chains: some of the big car manufacturers know where every nut and bolt and screw comes from among their providers. What's so surprising is that our algorithms, many of our training algorithms and analytic insights, come from data where, one step removed, we really have no idea where it comes from. We may be buying from a provider, but we never really ask where that provider is buying from, or where their upstream is. For us, that's really the idea of the data supply chain, and it's a global one.

Oh, sorry, I was just curious: do you think of the data supply chain as including not just where the data comes from, but also the other steps in data processing, the data life cycle, essentially? Is that all part of the supply chain to you?

It is. I mean, data lineage: how do you know people aren't changing things slightly, whether it's the name of a header inside a schema or a column, or introducing some kind of errors? Really, the way we started thinking about data security was because errors were being introduced unintentionally as part of that supply chain. The CEO of Nvidia always talks about data factories, and each one of these aggregators is a data factory: they're taking in inputs and creating outputs, and there are going to be anomalies and errors introduced in that processing. You carry those errors forward, you carry those quality issues forward, all the way to where the final product is being made, and that final product is a model, an inferencing algorithm, a BI analytic. The ability to detect those early is really important.

And I guess, just like a regular supply chain, bottlenecks show up in really weird and unexpected places. You just gave the story of one engineer opening and closing his laptop, causing an entire downstream effect to unfold, not entirely unlike TSMC in the semiconductor supply chain, or the Ever Given, that big container ship that bottled up the Suez, and then all of a sudden our goods are ten times more expensive. The connection seems very apt. As well, from a national security standpoint, increasingly we're starting to look at our supply chains as a source of national security risk: where are people, maybe where is China, inserting stuff into our hardware or whatever else? I'm curious about that aspect: what are some of the security implications that you see the moment you start

Various cybersecurity implications

looking at data through a supply chain lens?

Yeah, we don't know. I mean, we're not that involved in the national security component of it.

Oh, sorry, the cybersecurity aspect.

Yeah, but sure, I'm sure it exists. I would say, too, what's really interesting is the compliance perspective. You think about all these data sovereignty laws: you mentioned China, and the data sovereignty laws there seem to be changing fairly quickly; you look at Europe these days as well. If you're an enterprise consuming data, again, you carry forward some of those data supply chain liabilities, if they exist, or you at least want to know whether they exist. So visibility into your data supply chain is key, and I think more and more organizations are going to have to start looking at that, and then putting a price on high quality versus low quality. It's very difficult. We think of data observability as a way to put a price on commercial data: is it good data, is it bad data? Sometimes you just want cheap, bad data to feed a learning algorithm where you need a lot of scale, and other times you may want to buy more expensive, lower-volume data, just because you'll get a better outcome from it.

And on the cybersecurity side, I know there's a really big story here, and you've done a fascinating study on this Log4Shell vulnerability that we'll get into in just a minute, but I'm curious about the role that data poisoning plays in this picture. Can you unpack that a little, and what are some of the stories you've encountered when it comes to data poisoning?

Yeah. I have been in the cybersecurity field, but I'd say there are a lot of folks who are more intelligent about this than I am. I'm going to use a term, though, that is often used in the cybersecurity world, called fuzzing. Fuzzing is when you intentionally throw a lot of data at a particular software application and see what unintended side effects result; it's a way of figuring out, when you fuzz a system, how it behaves abnormally. For us, our fuzzing experience was really just watching these data pipelines over years, collectively, from different vantage points in different industries. You'd see things break: why did that Kafka node go down in the cluster, why did that Spark job or Elasticsearch not do something it was supposed to do? And then, when you do a deep-dive post-mortem, you start to see that it was probably an unintentional, malformed piece of text inside a Parquet or CSV file, and you say, well, geez, if that caused this, what would happen if somebody did it deliberately? That kind of unintentional fuzzing of these systems, via these classic data transport file codecs like Parquet, Avro, CSV, TSV, was how we learned a lot about how these vulnerabilities may exist, and it's a big part of our data security story. It's not necessarily that we saw intentional packaging of malicious payloads; sometimes they're just unintentional. We talked about the data factory: it could just be a manufacturing glitch in an ETL job that created a partially completed file, or a text string in a CSV, that caused the system to go down.

And how do you detect that when it happens? That seems like such a daunting challenge: there's an issue in the data somewhere. It's anomaly detection, I guess, or something like that?

So we have this term that we call deep data

Deep data inspection

inspection, and it's borrowed as an analogy from network packet inspection. If you think back 20 years, when folks were thinking about securing networks, there was a lot of encryption of data going across the internet, and so people started to look at the actual contents of network packets, to look inside them, and an industry was born around forensic network analysis.

Can I ask a really stupid question: what is a network packet? What exactly should I be picturing?

If you think about the routers and switches that get connected to wireless networks, it's the most primitive piece of data that gets pushed onto the network. It will contain some information about the data, so it could be email, it could be web traffic; it's just a fundamental data-transport building block. We use that analogy with deep data inspection: if there's deep packet inspection, then let's start looking inside the data itself. A CSV file, a Parquet file: we can query the contents of that file, and so we started to develop software to do that. Oftentimes these files will contain tens of thousands of individual data points. For listeners who may not be familiar with a Parquet or CSV file, we always say: think of it as an Excel spreadsheet that's 10,000 rows by 10,000 columns, and all you need is a single cell inside that spreadsheet to contain some kind of malicious payload to trigger some kind of vulnerability. So we look inside and ask: is there something anomalous in here, against patterns we've seen in the past, or against what our research suggests may be occurring? That's what we refer to as deep data inspection.

Okay, and I think that brings us quite smoothly to this Log4Shell vulnerability that you've explored recently. Can you tell that story a little? Because I think it does a great job of illustrating the stakes, potentially, of this sort of inspection.

So Log4Shell came out last December; it was a really big deal. I think the government rated it a 10 out of 10 in terms of its exploitability, the simplicity of the exploit, and every organization reacted very quickly. The first thing we did, just like every other business, was look at the systems we were using at the time, and of course a lot of these distributed processing systems are heavily reliant on Java and Java libraries. We happened to see that a fairly large open-source distributed processing application, which I won't name here, was vulnerable, and everyone patched it fairly quickly; there was a great industry-wide effort to patch Log4j. But, as you and probably our listeners know, oftentimes these systems just go unpatched for long periods of time. So we saw Log4j, we saw how the documented exploits worked, and we took a different perspective on it. We asked: what would happen if the same kind of vulnerability were packaged in some of these file formats that we see every day, that every enterprise works with? Knowing that one of the systems was vulnerable, we ran such a file through its ETL process, and the next thing we knew, we were taking remote-access control of that node in the cluster, from deep inside a virtual private cloud. We were shocked, and, like I said, it was one cell amongst probably 10,000 different data points, so it was really tricky to find, and we just thought: wow, this is data poisoning.

And how does that contrast, sorry, with the standard approach to exploiting the Log4j vulnerability? What would be a more standard vector of attack, besides the data?

I'm not an expert in this, but my understanding was that you'd have these internet-facing web servers with the vulnerable logging library, and if you sent a malformed request where the user agent was a particular string formatted in a very specific way, that would allow you to take over that internet-facing web server, which in a lot of ways was exposed to the internet anyway. If you had a proper defense-in-depth architecture, I'd assume that even if your front-end web server was compromised, hopefully that still wasn't access to the crown jewels of the enterprise. So while that particular attack vector was well documented, the one we found exploited a system well inside of probably most defensive perimeters, from a cybersecurity perspective, and took a path where these kinds of files weren't always scanned; there weren't traditional application firewalls scanning this kind of traffic. So we thought it was a little bit unique, and it was something we published and made everyone aware of. It was an already-patched system, but in our world it was: here, unfortunately, is the future. We always tell people: if you're a data scientist with a Jupyter notebook and you use pandas, be careful when you're downloading a file from the internet to do some exploratory data science, because, as far as we know, how do you know pandas wouldn't get exploited as soon as you open and parse that file, or a Matplotlib library? You have to be a little bit paranoid to live in this world, to think like that, but unfortunately I think that's where things are heading.

Yeah, it's incredible how the attack surface just keeps growing and growing. What you've just described as the initial way of exploiting the vulnerability sounds more like a full frontal assault on an API: okay, let's go after it. But when you start to play this kind of Trojan-horse game and get in through the data... also, a lot of data gets stored not client-side but, like you say, deep server-side, really behind all those security barriers, and what that could do is scary to think about. I'm curious, in that sense, how you've seen things evolve over, say, the last 10 years, as you've been involved in this space. Have you seen the complexity and sophistication of these sorts of vulnerabilities increase? What's the direction of change that you've seen?

Observed direction of change

well I think just in the cyber security you know um Market in general I mean people are becoming more aware of it and it's evolving faster because the repercussions of you know you think about all these data breaches that occurred and the visibility they've had on those organizations and the revenue impact and um so I think people are a lot more aware of it and probably like you know similar to a cat and mouse game I think the people who are developing these really um nefarious kind of on you know vulnerabilities and exploits are having to evolve and get more sophisticated um you know we use uh exclusively a programming language called rust uh which you know is inherently secure is very fast and you know we're starting to see you know more and more malware just this year that's being written in a in a performant kind of you know secure programming language which I think makes it harder for folks who are trying to defense against those kind of things so I think it's just upping the stakes and again when you think about data Lakes data warehouses um like you know Lake data lake houses uh Delta Lakes I mean all those like there really hasn't been an emphasis around security with those and so you think about people that are using that data to kind of inform the analytics and the systems that do it are traditionally not ones that you secure you know um right thinking about securing that Outer Perimeter but like you said a lot of these ETL analytic uh machine learning kind of uh internal kind of systems these distributed systems aren't traditionally ones that people think a lot about securing and I think you wrote about this too this idea that you usually just discover these issues at the very end of the process once your model fails there's like an obvious external manifestation it's usually at the very end of all that process um is that uh are there steps people can take to kind of catch things earlier like what would you recommend to let's start maybe with the 
Enterprise level because that's you know what you're focused on most right now but yeah like I mentioned before I like that term called fuzzing um which is where you just pay attention when things crash and you know I think you know I'm a product of AWS where you know you do very deep postmortem mod outages and uh any kind of incidents and I think for any Enterprise like if you see an ETL job fail analytic that you're somehow able to kind of understand is not being accurate I mean do a deep dive I mean over time you'll start to see these patterns you'll start to put together like ah you know this is what caused it to fail and you know if it turns out a certain way or is formed in a certain way under certain circumstances and just document and retain that knowledge I mean that's how we got started we just started to build that knowledge base from seeing how things fail um and then trying to document and not just fix it but really try to truly understand like what was a real cause of that failure so that would be you know the best advice I'd give to folks just pay attention to it and I guess again the vast majority at least for now the vast majority of these things are accidental or not kind of these intentional failures so hopefully there's I don't know a little bit more statistical regularity in terms of what those failure modes look like is that fair to say it is yeah to be very candid like we have not seen at this point in time like a malicious data person attack in the wild I think um there's a lot of conjecture about it and we documented and published our research about one particular way to do it uh but we certainly have not seen it now that being said a lot of people who spend a lot more time in the cyber security industry than I do that we've talked to and briefed I mean they said like this is probably the future it's not necessarily what's keeping them up at night today but five years from now I think this is going to be a series of threats that I think 
everyone's going to have to deal with.

And I guess you're always climbing down that ladder of sophistication with these attacks: the average person, even the average organized crime syndicate, isn't going to be using the cutting-edge stuff. At first it's nation-state actors and the like, and then gradually, as tools become more available, the attack surface increases and more and more people exploit these sorts of vulnerabilities too.

I suspect that would be the case as well, yeah. Every enterprise deals with and has these types of systems. It's happening all over, all the time. Not that people are being compromised, but these outages are happening, these malformed files are being processed, and people are learning from it. It's not unique to anything we're doing; it's so ubiquitous that others are going to figure this out as well.

But then I guess there's also the challenge of how you even verify whether this is being used in the wild today, because if the attack is successful, it could be introducing vulnerabilities that don't surface until much later, by design. We talk a fair bit about malicious uses of AI and large language models, and there's always this question in that context: is Russia, is China, are our adversaries using these techniques to interfere in democratic processes? And if they are, if you're using a genuinely human-like system to interfere, there's going to be very little evidence of it, because it's human-like. That's almost the point.

That's part of the challenge, it totally is. Think about subtleties in training data: if it were intentional, you could subtly change datasets that you knew would have an
impact on an inferencing algorithm, and it would be so difficult to detect. The one we found and published was really obvious: somebody took control of a server deep inside an enterprise. But I would suspect a more subtle, longer-term approach would be to poison the data in ways that were almost impossible to detect, knowing that the algorithm trained on that data would be subtly changing how it saw the world over time. That could be really difficult to detect.

Let's move to the more individual level, because obviously a lot of our listeners build personal projects. You mentioned the pandas thing, watch out what data you use for that; I'm sure that will have sent a cold chill down the spines of a few people. Obviously we're not talking about this being a present-day, massive-scale concern for the average person, but are there steps you think the average person ought to take when they think about data provenance, like "what data am I going to use for my personal project?" Are there things they should already be thinking about at this stage?
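As an aside, a crude illustration of why subtle poisoning is so hard to catch: a simple statistical screen only flags large shifts. The sketch below is hypothetical, not Zectonal's method; the toy data, threshold of 3 standard deviations, and function name are all assumptions. A coarse injection trips the screen, while a truly subtle one, by design, would stay under it.

```python
import statistics

# Hypothetical sketch (not Zectonal's method): flag a training batch whose
# mean drifts more than 3 baseline standard deviations from a trusted
# reference sample. Subtle poisoning deliberately stays under such thresholds.
def drift_score(baseline, batch):
    """How many baseline standard deviations the batch mean has moved."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(batch) - mu) / sigma

# Assumed toy data: a trusted reference and two incoming batches.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
clean_batch = [10.1, 9.9, 10.0, 10.2]
poisoned_batch = [10.1, 9.9, 10.0, 14.0]  # one crudely inflated value

print(drift_score(baseline, clean_batch))     # well under the threshold
print(drift_score(baseline, poisoned_batch))  # well over the threshold
```

A real screen would look at many moments and feature interactions, not just the mean; the point of the sketch is how little an attacker has to do to stay below any single statistic.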

Steps the average person can take

Not to oversimplify, but I think it's similar to what we all experience in our email inboxes on a daily basis, where we get spam, we get links, we get text messages with links. Know what files you're going to be using, and know that they're from a trusted source. If you're downloading New York City traffic data for an example data science project, that data is probably safe and it's hosted on a well-known site; there's some good intuition there. But if you're looking at a site that doesn't have a domain, maybe it's just an IP address or something sketchy like that, I'd say be careful, because I suspect there's a lot of bad data out there. It's the same hygiene we've all gotten used to about what we open in our email inboxes or which text messages we open; applying that same hygiene to the files we consume from the internet for data science is really important. But surprisingly, you'd think it would be easy, because we're all so trained in that, and yet we just don't do it.

Well, to your point earlier, for some reason we treat data psychologically as a safe category of thing. It's our soft underbelly: we just assume data doesn't get used for adversarial purposes, and shockingly, as you said, that applies to enterprises as much as to individuals. But maybe that's something that's going to change. You can imagine people thinking about data the same way they think about spam email, and that's a pretty big psychological shift for the whole space.

It really is, yeah. You said it spot on: we just treat it differently. We treat how it comes into the enterprise differently; the ingest paths and data pipelines are architected differently than how
more common types of data come into the enterprise, how they come through firewalls, and how we treat them. You hope it doesn't take a series of bad incidents to bring awareness to it, but it is treated differently, absolutely.

Right, and do you imagine a future then, and maybe this speaks to the design of the Zectonal product, where at every stage in the data life cycle there's a dedicated series of checks on data, for deliberate poisoning or just for crappy data for whatever reason? How do you imagine that architecture playing out in the long term?

That's a great question. I don't know exactly how other folks do it; we've spent a little bit of time looking at other data observability products, and there are some great companies out there. But we've architected our product to try to find data quality and data security issues before they get ingested into the data lake. And there are many terms here: data warehouse, data lake, lakehouse. I'm probably using them interchangeably, and shame on me for doing it. But if we think about that enterprise data repository, for us, once bad-quality data or a vulnerability gets in there, it's game over; it's too late. So we think about architecting, not necessarily at the edge, but a product that will look inside data before it gets into the data lake, because it's not enough just to determine that there are data quality issues; we want to prevent them from upstream. I think some architecture paradigms will need to change with that, to your point about where in the data pipeline, where in the life cycle of data, we monitor. It will probably have to be at various different points: what does that data look like
pre-ETL, what does it look like post-ETL, and what does it look like once it's in the data repository? I think that's just a healthy way to look at it.

Interesting. And from almost a business standpoint, I'm curious, because every time I see a company selling to enterprises at scale, with all the legacy tooling that comes with that, the big challenge is customization. Do you find yourselves building custom solutions for enterprises? Is it basically case by case, where you look at the whole architecture of their data supply and say, "okay, you need checks here and here"? Or is it a one-size-fits-all thing, or are you at least seeing some consistent threads?

So I grew up in the era of SaaS business models, working at AWS and seeing the boom in software as a service, and I think the pendulum has swung a little the other way. SaaS business models are great and there are many efficiencies you can drive out of them, but especially when it comes to enterprise data, there's still an aspect that needs to be on premise. We've built our product to be really lightweight; it's almost like a utility you would install on a Linux system. Having these big Spark clusters looking inside the data warehouse just felt like too heavy a lift for a lot of enterprises, so we realized, based on our experience, that the best thing we could do was hand our software to our clients in a really easy-to-use way, without us having visibility or access into their data. Enterprises get very finicky about handing the keys over to third parties, and if you think about a lot of the threats coming into enterprises, they are coming through third parties. So we designed a product that is very lightweight, that we can get up and running in a couple of minutes, a couple of seconds, and that interrogates the data from
there, as opposed to a big, heavy software-as-a-service offering. And we're a post-GDPR company, so you think about those compliance regimes, and now not only countries but even individual states within the US are starting to impose them. I think that's also going to benefit how we've designed and built the product: we can snap right in, and there's no monolithic SaaS product sitting in some Amazon region that needs to support various constituents. We've tried to incorporate a lot of lessons learned about architecture, data governance, data provenance, and data liability into our product, to make it as lightweight as we can.

It's really interesting to hear you reference the policy landscape, the GDPR story, as almost integral to the foundation of the company, or at least the architecture of the product. That's something a lot of people might not realize, especially if they're working at a smaller-scale startup: this is a thing that changes the way companies operate. And it's really interesting to hear about the pre- and post-GDPR periods and how much adjustment that took. It's impossible to say exactly, but what are some of the big challenges involved in moving from a pre-GDPR era to post-GDPR? What are some of the things you've had to do?
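Before moving on: the "look inside data before it gets into the data lake" idea Dave describes can be sketched as a tiny pre-ingest gate that checks whether a file's leading bytes actually match its claimed format. This is a hypothetical illustration, not Zectonal's implementation; the magic-byte table and function names are assumptions, and real deep data inspection goes far beyond header checks.

```python
# Hypothetical pre-ingest gate: before a payload reaches the data lake,
# verify that its leading "magic" bytes match its claimed format, so a
# mislabeled or booby-trapped file is rejected upstream of ETL.
MAGIC_BYTES = {
    "gzip": b"\x1f\x8b",       # RFC 1952 gzip header
    "zip": b"PK\x03\x04",      # ZIP local file header
    "parquet": b"PAR1",        # Apache Parquet leading magic
}

def claimed_format_matches(payload: bytes, claimed: str) -> bool:
    """True if the payload's leading bytes match the claimed format."""
    magic = MAGIC_BYTES.get(claimed)
    if magic is None:
        return False  # unknown format: fail closed
    return payload.startswith(magic)

def ingest_gate(payload: bytes, claimed: str) -> bytes:
    """Pass the payload through only if it looks like what it claims to be."""
    if not claimed_format_matches(payload, claimed):
        raise ValueError("rejected: payload does not look like " + claimed)
    return payload
```

The design choice mirrors the interview: the check runs upstream, before the data repository, because once a bad payload is inside the lake it's "game over."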

Challenges with GDPR transitions

We've seen other companies struggle with it. In a past life we helped companies migrate when GDPR was on the horizon: organizations serving international customers had to quickly come up with a way to create these various storage footprints. We learned from that. There was a lot of engineering time, cost, and resources, people building different cloud-based data repositories, and you've seen the big three cloud providers grow their regions and points of presence in so many different countries to accommodate it. So we looked at that and said, let's design something lightweight, drop-in, really quick. It was a big influence. But for enterprises, at least in the United States, where most of our experience has been, it wasn't so much compliance that drove them as simply safeguarding their data. They don't want third-party technology companies to know what's in their data, and we feel the same way: we don't want to know what's in their data. We want to give them a capability to look at quality metrics, feed those back to them, and provide alerts when things go bad, but we really don't want access into our tool once it's deployed in their environment.

Yeah, it's another liability for you, actually.

It totally is. And I've been in enterprise software sales for 20 years; it makes the sales process a little easier when we lead with that message. So we designed a product to make that more possible, with less friction for our customers.

Okay, well, that's great. Last question: I just want to pick your brain about the future a little. You've talked about some of the trends, moving more and more toward data as an attack vector, among
other things, and the interest in data observability. What are some of the things you expect to happen in the future? Do you see this continuing until eventually data becomes almost like an API attack surface, just one of these walls that needs to be built? Could you speak to that a little?

Yeah. This is really a no-brainer statement, so sorry, but we are so reliant on data to feed our algorithms that we are becoming a byproduct of the data we use. Everything we do, our algorithms, is based on it, so of course it's going to be a target for folks to manipulate, and they're probably thinking of ways of manipulating it that we can't even comprehend at this point in time. Data poisoning as a topic has been around for at least two or three years, and people have been thinking about how to poison machine learning training algorithms. So unfortunately it's just another series of threats, just as the network was something that needed to be secured and built out over a decade's time. We're going to be thinking about how to protect our data repositories.

Another really fascinating topic for us is the emergence of synthetic data. I know you've talked about this on some of your prior podcasts: the idea that you create intentionally fake data to supplement the training of algorithms. We've used it ourselves. Think about fraud detection use cases, the classic imbalanced training dataset; hopefully it is imbalanced, with fewer fraud cases than non-fraud cases, and you can see there's real value in supplementing training data for something like that. But at the same time, as that emerges, enterprises are going to start to really
think about questions like: do I want to store that much synthetic data? Do I want to pay the cost of storing it? Should I secure synthetic data the same way I secure real data? If I'm buying commercial data and paying a hefty price for it, how do I know my data brokers, and there are some great data brokers out there, but not all of them, aren't supplementing their data and just not telling us? They're charging a lot extra, and you think you're getting ten thousand monthly active users when in reality it's five thousand, padded with synthetic data. So I think we're entering a world where data observability is going to play a really big part in differentiating real data from synthetic data. Enterprises are going to have to figure out how they treat it; it's not really for us to decide how they should treat it, but it is up to us in the data observability space to tell our customers, or at least give them an indication, when we might be seeing synthetic data versus actual data.

I'm thinking back now to a couple of podcasts we did, one more recent one in particular about synthetic data, and one of the themes was this idea that you could actually enrich, add value to, your data by using synthetic data. For listeners who haven't heard the episode: imagine taking GPT-3, or another large language model with all the world knowledge it has collected. These models know what a clown is, what the sky is; they know all kinds of concepts. You take your data as an input and train the model to create new data as an output, leveraging all the implicit knowledge it contains, essentially creating an output that accounts for a whole bunch of facts about the world that maybe the original data didn't even
consider or include. So in that context, I'm imagining this is going to be a massive challenge. Can you think of any strategies? Are people even thinking about how to overcome that?

We think about it. We ask ourselves a lot of questions, because we are starting to see it emerge, just as you said. And to your example, it's fascinating, because the whole purpose of creating a lot of these more sophisticated algorithms, in our mind, is to model the real world. Sometimes you do need to supplement, but what happens when the real world you're trying to model changes at a pace faster than the synthetic data can incorporate, because it keeps building on itself? So we ask ourselves a lot: how can we detect and differentiate? Again, we're not saying synthetic data is bad; we've been consumers of it, we see the value. But we think it's really going to be important to differentiate what's real and what's not, and it's going to come down to cost, and to how fast you can pivot your algorithms and your models, just as in your example. It's just another level of awareness about people's data: how can I observe it and understand it a little bit more?

Well, I've got to say, reading a lot of the material you've put together on this topic was like the future coming at you in fast forward, because I wasn't even aware at the time of the idea that data providers would just be fudging the numbers, packing the envelope with a whole bunch of replicated data. And then we've got synthetic data on the horizon, which in a way is value-added, but if you don't know it's synthetic, if you're operating under the assumption it's authentic, original data... It's interesting: we're heading almost toward a world, a snake eating its own tail, where AI systems are going to be trained on top of data from other AIs, synthetic data

Wrap-up

generated by other AI systems, and we may be doing a lot of it without even realizing it. Just a fascinating exploration. Thanks so much, Dave, for the great conversation.

Well, thank you for having me. I really enjoyed this set of topics, so thank you again for having me today.
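For readers who want to play with the broker-padding scenario from the conversation, here is a minimal sketch. It is a hypothetical illustration, not a real synthetic-data detector: exact-duplicate counting catches only the crudest padding, and a broker generating plausible synthetic rows would evade it entirely. The toy data and function name are assumptions.

```python
from collections import Counter

# Hypothetical sketch: a crude padding check. If a broker inflates a
# "unique users" feed by replicating rows, the exact-duplicate rate
# of the delivered dataset will be suspiciously high.
def duplicate_rate(rows):
    """Fraction of rows that are exact copies of an earlier row."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(rows)

# Assumed toy feed: 100 genuine (user_id, segment) rows, then a padded
# version where the broker appends 50 verbatim copies.
genuine = [("user_%d" % i, i % 7) for i in range(100)]
padded = genuine + genuine[:50]

print(duplicate_rate(genuine))  # 0.0
print(duplicate_rate(padded))   # one third of the padded feed is copies
```

Differentiating well-crafted synthetic rows from real ones, the harder problem Dave describes, needs distributional and provenance signals, not row counting.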
