# CSE 519 --- Lecture 18 part 1: Nearest Neighbor Methods (Fall 2024)

## Метаданные

- **Канал:** Steven Skiena
- **YouTube:** https://www.youtube.com/watch?v=ub6cK32Aq6o

## Содержание

### [0:00](https://www.youtube.com/watch?v=ub6cK32Aq6o) Segment 1 (00:00 - 05:00)

okay let me get started here um first of all uh any questions about homework 3 how many people have gotten started on homework 3 a few people actually more than I thought so I'm I'll say I'm impressed with that um any questions about it any how many people are seeing that they have gotten as far as to see to get let's say the NLP em the vision embeddings out okay a bunch of people and are they behaving for you guys what you would kind of expect that that it does a decent job of doing the classification and the picture looks good or is it that this is a uh complete mess whoer said that that who had done it who looked at these things says they're looking at things and they're seeing clusters and they're seeing something happening no one who here is saying they're seeing complete mess okay this is so this is a surprise because when we did it we did not see a complete mess and um I'm going to do a little test a little later in the class of whether you're seeing a complete mess or not um but uh any questions about the homework other than that okay it's good to get started and clear that but this is a surprise okay and uh we will try to clarify that okay anything else on homework 3 okay what I'd like to do to start off is to talk about the course projects so um this is uh you know what what has to be done this semester before you can kind of put this course to bed we have a homework three that is due in um I want to say a week to two weeks correct um we then have a week after that a uh you have to turn in a three-page project proposal okay and part of you may say three pages that's nothing it's important that you think through what your project proposal is I'm going through this today so although you'll be graded on it and although it's only three pages and you may say chat GP and write that it's important to kind of get this do this carefully to make sure you've got a project you're able to do okay that uh that you've worked out kind of the roadblocks that will prevent you from doing a decent project and then these will be graded fairly quickly and then the you know as of the last day of class you will have to turn in a uh you know the actual final write up which I think I said is eight pages long okay any questions about that so the projects are supposed to be done in groups okay I want I like groups of two or three I do not like groups of one because I do think that it is good to interact with somebody on a project like that and I also really don't like groups of four or more okay because uh I think that you know these projects are you know of sufficiently bounded size that they that I don't think it uh it helps to have larger groups um I will do an analysis of what the scores are on projects independent of group size and if I find again theoretically a project with three people should be better than a project with uh with two people if I find that there is some kind of um skewing of scores where I will correct for the group size on some way at the end okay so if we find that three people projects all score wildly better than two people projects that's how it should be in real world you know in the ideal World okay and if so we would then scale down the three people projects you know to give more weight to the two people projects usually there is not that much of a difference to be honest with you okay any questions about that okay so um what is my experience again I encourage you to read this okay the handout I'm going to go through the project topic soon but I'm going to tell you what I find that I always find amazing you guys are going to go spontaneously go off and form groups you're going to go look at my list of projects you're all going to I'm going to tell you which ones I think are the ones which are the most exciting and have the most opportunity to do something interesting you guys will systematically can evaluate them

### [5:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=300s) Segment 2 (05:00 - 10:00)

and you will all heard on the boring uh project where there is nothing to do this is my sense is from the past classes we'll see whether that happens this year some of these projects are you know more open-ended and have um bigger possibilities of doing something those I personally think are more interesting some of them will be very tightly conscripted and they doesn't seem to be anything too brilliant to do okay and you know you can pick this if you want okay and my guess is that many of you will pick that but I think that then what you will often find is that uh that you know when you do the project you're not that excited about it you know what you've done is not look really that great okay and you will decide any questions about that we'll decide if I give you any hints how I think again I encourage you to think follow what I think I'll also tell you most classes don't okay um what I will tell you is during this evaluation time why am I giving you so much time to write a three-page paper as a group the reason is I want you to kind of you know think about multiple project options During the period I want you to show that you have some kind of preliminary results just basically because um when you add preliminary results you'll have some confidence you'll be able to do something it could very well be you pick a project you don't you get any preliminary results you then go and try to do it the week before the end of the semester and suddenly discover you can't get the data set there's no data available for my project I don't know what to do that's what we need to avoid by you having kind of worked that through in the um thing um I'd like to have um groups you know the projects roughly balanced by the number of people doing it um if I see one project everybody herds on and you know it is a chance I may take people with the weakest proposals and tell them to go do something else in the second phase that sometimes happens but uh we'll see I don't expect anywhere near Perfect Balance but uh we'll see um to preserve anonymity for when at least the the grer Tas are grading it make sure you don't include your names or ID numbers on the uh what you call on the papers we are going to have send out a form okay for you to fill out every group's going to have to submit a um group registration form on Google classroom and that's how we're going to know what names map with what projects any questions okay I will warn you that on the um at the end of the project I'm going to have group members rate the have to describe how they partition the credit among the members of the group so if you're working with a partner you'd like that partner you and that partner 50/50 okay or you'd like to be in agreement that this guy didn't really do so much this was the one that carried it when I look at how you guys split the grades if there's inconsistencies I will track it down and make you guys come to an agreement okay so we'll talk about that later I don't think we're going to use peer grading here so uh my plan is this semester I think we have enough Tas and there's usually enough uh reaction against beer grading um I am going to I think we're going to be able to get by with the uh T and I doing the grading um any questions about the rules first of all so what I'd like to talk about are some of my project ideas deas and um so when you look at each entry you will see there is a captain what is the captain is a ta's name who you should probably direct questions to as a first uh as a first procedure so this is the one who will be Fielding the uh what you call it the uh qu any questions about that uh project and uh likely but not certainly be the one that's grading it um the uh you know for each project I give a recommendation of what kind of people I think this is good for okay it's not force of law if I said that this was good for Ms students and PhD student was dying the do it that's fine Vice same is true vice versa

### [10:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=600s) Segment 3 (10:00 - 15:00)

but this is my sense and I give a subtle red uh thing here if I really like this project okay that may mean you don't want to do it but that's what I think okay any questions about that okay so the first one is about and again some of these are going to use you there's different ones that different things that you may or may not know about you factor your skills in okay I assume people know what vanity license plates are correct some people have a uh you know a license plate on their car that spells something out okay and they usually do it in very cryptic ways I don't know if you've seen it trying to figure out what a uh a vanity license plate is saying is an a an interesting task and so that's kind of what I want you to do take a look at find a database of vanity license plates obviously the bigger the better there is something available from New York State I was impressed that it came out because of a Freedom of Information Act law you may remember we talked about where data comes from early on government records if they don't cause any uh you know um you know if they don't violate privacy you can appeal that the government release them and uh so this is like a list of all the requested vanity license plates that went to New York state now some of them might be um were people may make requests to get refused suppose you have a dirty word you want on your license plate okay they will not let you do that okay um but basically the question is do something with license plate data and llms and what do I want to see I might be interested in is there a categorization of what kinds of license plates people request okay that makes sense that feels like something to know about I by the way have always wanted the license PL I never bothered to pay the money for it but I always wanted the license plate NP hard so anyone know what now what would that be interpreted as you know again I'm an algorithms person I'm a uh you know I find happen to enjoy proving things and be complete um but uh you know would someone recognize it how does it deal with okay given a license number can you tell whether it's personalized or randomly allocated meaning allocated by the state okay there are some rules by how license plates are constructed um given a vanity license plate can you tell what it really means it would be interesting to spell it out if you saw NP hard on you would like the system to maybe tell you this guy must be a theoretical computer scientist or something like that okay um you know given a license plate tell me whether it's obscene or illegal legal okay so these are the kinds of questions that I might want you to answer and what I need you to discuss in your proposal if you're going to do this project I need to know what are your data sources okay I give one link okay that doesn't mean that that's the ultimate data source that is out there okay that doesn't have mappings to what these things mean for example okay but you should be able to do it you should do a literature search I am sure there are people who've written papers on this it maybe there's a perfect system out there you can find in which case this is probably not as great a project as I think it is right um you know you will need to express how you're going to evaluate how you're doing okay this may not be so easy for this project okay and so you got to show that you've thought that through okay ideally you'll have done some kind of a preliminary experiment or data analysis just to see how bad a dumb thing works okay what would be a dumb way to inter dumb meaning simple not dumb meaning anyone who tries this is an idiot what would be a simple way to in try to interpret what a license plate means can anyone come up with a simple way a baseline thing yeah search for words and a dick that's exactly what I was thinking okay now obviously you'd like to do better than that okay but that starts to be a start right

### [15:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=900s) Segment 4 (15:00 - 20:00)

um and again you should then have some kind of a vision of how you would do better maybe using llms probably using llms in some kind of a way okay any questions about the project on the license plates okay good a second project that I like as you notice this thing here has to do with analyzing business logos you guys are familiar every business has some kind of a logo for it that's appears on their stationary it may appear on their website what is the logo for McDonald's does anyone have a concept of McDonald's I'm seeing some waving thing right I am seeing a yellow waving thing the golden arches right okay does Apple have a symbol yes does every sleazy startup company in the world have a symbol and the answer is almost everyone does okay and the question now is can you look at a database of logo images we had uh a guy who I work with assembled a database of 1 million company logo images from crunch base crunch Bas is something that seems to analyze startup companies or something like that and um you know we were trying to look at could you tell whether a uh more successful companies had better logos okay um you know it's not crazy to me so and what do we mean by more successful most of these million companies are startups what happens to Startup compan IES does anyone know companies they go out of business that's generally what happens to them okay um some of them get money and get successful they raise money some of them go out of business some of that data is available in crunch base a successful company is one that would get funding or one that succeeded in not going out of business do everybody kind of get that idea you can have higher standards of success but the question now is for example can you predict which countries will be companies will be successful from their logo alone now this doesn't sound sounds crazy but it's not as crazy as it sounds certainly I imagine there is a difference between a logo made by the founders kid and a professional designer okay and do I think that the companies with the logos made by the professional designer do better maybe I think so okay so what would you do with this can you first of all given a logo can you tell what business it it's in when you see the McDonald's golden arch can you say that obviously denotes a restaurant okay or is that a high-tech company okay um so can you do Tech logos look different than other companies okay will which company will be successful from its logo can you tell the difference between an American company and an international company from looking at its logo I think you can to a certain extent if it has Japanese writing in it probably it's probably less likely to be American but not certain okay can you cluster the logos in a natural way okay what logos are similar to what other logos okay and is there a taxonomy that naturally can be formed of what logos look like and uh which are the best and which are the worst in some defensible way any questions about logos what do I want in a project report like that I would like to see um what you call it uh I'd like to see that you have the data and that you've played with it and you've underlined it and you've done something with it the data I linked to is kind of old maybe you want bet show that you can get a better data set maybe you can find of what happens to companies things like this you need to do some kind of exploratory analysis that you describe in your thing so that you have some idea of what that million logo data set looks like if you have a million logos in there a certain number of 100,000 must be

### [20:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=1200s) Segment 5 (20:00 - 25:00)

complete junk okay can you clean that out okay um can you extract simple features and understand it okay which you should decide which of these kind of questions you really want to answer a good project is a one answer one problem dealt with in depth a bad project is a handwave at all of these things does everybody get that idea so we want to see something that's more deep okay and again this is an image processing project on some level you should get some kind of familiarity with dealing with image models maybe off-the-shelf ones will be fine logos are different than normal photographs though they're typically drawings and it may or may not be that one of your models does well these are the kind of things you want to think about any questions about the logo project okay fair enough the third class that I am going to want is something on um Kel challenges okay it turns out that there are three Kel challenges that are running now that are interesting they all came in after this home homor three started or you would homework two or homework three started or else you would have been um maybe gotten one of them for an homework assignment one of them is on data from American uh the American Football okay they give you some data set about where the players are positioned on the field what they are doing before the play starts and based on that can you figure out what the what's going to happen next okay who here follows American football at all okay a lot of you I can tell okay um well um anyway based on where different plays have players in different spots and you know in principle can you look at the location and figure out what you want to do there now if you don't like American football okay there's two other kle challenges that are listed there okay one there is one about trying to figure out um how com they have data about how Physically Active somebody is and how much bad stuff they do on the internet okay and the question is can you maybe predict what kids are going to be getting you know showing problematic internet Behavior that's one of them another one was on stock market prediction given uh you have a mark real time Market can you predict how the stocks are going to do so for all of these this involves kind of doing a uh what do I want you to do you need to figure out which challenge you want to work on you should do some preliminary experiments here you should um have an make sure you know you can how you can measure what you're doing okay and so on any questions about the Kel ones what did I color the Kel challenges at the start hold on if you okay sorry about that let's keep going am I doing bad I'm going the wrong way sorry about this for the Kel challenges if you notice I did not make it red okay that doesn't mean you can't do it might be interest you decide but this one it's no more open-ended than a kle challenge and uh you know so but if you're interested in one of these by all means go ahead question well in this case what would I want to know I'd like to know that you guys have uh are prepared to do something serious on this submission okay that would mean what idea of what the methods are what your preliminary work is what you're going to be doing how attacking it okay so you know again there's if you're just going to do the kle challenge there's limited opportunities for what to do next outside the challenge but again maybe there's visualizations you can do maybe there's some other interesting problem you can find related now that you got data on it the NFL one is relatively open actually

### [25:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=1500s) Segment 6 (25:00 - 30:00)

because they don't really specify what the statistic is or anything like that you have to generate there but uh but you decide any questions okay next one how many people here have ever seen a horse race okay so you're not as enthusiastic about horse racing as you are about NFL or almost a little more enthusiastic about NFL football horse racing is actually a an interesting sport um where people bet on it that's why it's interesting um what I want you to do here is to try to write a develop something that can tell me how well a horse is doing in the race any given time okay so a video of a horse race right here um you'd like to be able to know your horse is three lengths behind halfway through the race what are the chances it's going to win is the kind of thing that I'd like to be able to know and in order to do this and it's kind of interesting because in horse racing good horses sometimes start very far behind Okay bad horses run out to the front get tired and then collapse okay the question is taking a look at try to build a data set for analyzing how uh a horse races are going on okay I'm kind of interested in knowing after every 16th of a mile for each of the horses in the race what is their chance of winning at this point okay when the race started probably the betting odds was the right predictor at the end the picture showing which horse crossed the finish line is a great predictor what about in the middle okay and so for this one I want you to um in your if you're going to do this one I want you to read something about horse racing to understand it I want you to figure out where you're going to get your data about um positions of horses in races there's lots of videos and if you can do video analysis that would be an exciting way to get data about where horses are at any moment in a race some other more modern RAC tracks apparently they put like GPS devices on the horses or something like that so you can see where it is okay so one part of the project is build a database of horse races where the horses are as a function of time okay and if you this by itself might be a tremendous project okay if there's a lot of video analysis and a lot of hard work doing data collection this by itself is great if you can otherwise get data about where each horse is in the race I want you to kind of show that you can do some tabulation and Analysis of this okay and then describe a plan of what you're doing okay any questions about horse racing okay now this Pro next project is um Reddit CLA there claims to be a Kel data set on an analysis of 1 million jokes that were posted to Reddit okay now um the question is can you Analyze This Corpus of jokes and do something interesting with it can you tell me what actually is a joke okay I'm sure that there's cleaning problems just because something was posted the Reddit joke page probably does isn't enough to say that it's a joke okay um which jokes are the same joke okay you can imagine that there are uh you know two jokes that really are the same modulo maybe the name of the person or the ethnic group uh you know um some aspect of the joke is really is structurally same the question of how many different types of jokes are there in the world would be an interesting question to know about can you tell if two jokes are the same can you tell the difference between clean and dirty jokes or kids

### [30:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=1800s) Segment 7 (30:00 - 35:00)

jokes from other jokes okay uh can you cluster the jokes and categorize them and finally maybe you could predict how funny a joke is okay what is important to know if you're going to do this project one is you'd better start by doing a careful look at this joke Oprah and see what's really there I suspect you the more you look at it the more disappointed you will be but that's okay you've got a million things to start with and uh there's stuff there um you should be able to do some kind of a literate search on Rec on recognizing humor with NLP okay there's people who have done this kind of thing you should think through evaluation if you're going to tell how funny a joke is you need some kind of a gold standard or something like that okay and I'd like to see something like that again you should not do the uh most of those tasks you should do one task well rather than a lot of tasks bad okay any questions about the joke assignment okay what other one of our Tas jigesh contributed this one he's interested in studying how machine learning models are trained okay in the bowels of these things how do you train them we will talk about you know stochastic gradient descent and things in the next you know week or two probably about a week or two okay I want to know um what you call it uh in this case there are different strategies that are used to optimize these things and this project is basically about trying to evaluate how these different um optimization criteria work okay and maybe even design a better optimization criteria for training these things um and uh what would you need to do some kind of a literature search on these optimization algorithms you're going to have to figure out which one of your tasks and training sets you're really going to play with okay you're going to want to deal with Baseline models and Analysis using the way current optimizers work and then think about how you know uh you know some kind of a again you could just do a very careful evaluation if it's well thought out otherwise maybe you've got a different approach tweak to one of these optimization methods and can see how well it does any questions about that one this has the advantage that you have a TA that's excited about this okay and so if you're interested in these kind of things that's good yes wait okay wait did I not mark it with a star the the reason I didn't is that it's a my ta's thing so this may reflect Prejudice more than anything else okay and um it require it if this is the and it's going deeper into kind of machine learning things than we have are going to do in this class if you're interested in doing this is a great project okay so don't let my lack of Red Scare you here if that's what you think you want to do okay any questions good that's a good question by the way okay the next one is a project that I like see it's red I gave it last year it was red one group tried it and they made it too complicated and did a terrible job and otherwise nobody did it okay so you can decide if that makes it exciting to you or not but what is my question here suppose I were to give you a set of points here and now I asked you what order were they generated in would you look at the set of points like this and say this was the first one this was the second third one okay probably not you might say it was 1 2 3 to n or does everybody see that now the question really is if I give you a set of points can you make a prediction of what order they were inserted in general just giving you a

### [35:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=2100s) Segment 8 (35:00 - 40:00)

picture okay think back when we were looking at the painting by Jackson Pollock right Jackson Pollock painted he had these drips remember he had these drip paintings which swirled around some drips were before other drips can you figure out what the order kind of was is the same thing so in this case you're given a set of points they were generated by some process that inserted them in an order and your goal is to now say can you make a guess as to what order they were inserted in knowing nothing else other than the points that's kind of what my question here is okay and um a lot of data sets get inserted incrementally if you look at Wikipedia some pages got inserted into Wikipedia before other Pages were inserted in Wikipedia okay when you look at IMDb okay certain pages entries into the movie database were made before other ones probably that reflected the age at which the movie was made okay although not completely actually because presumably IMDb didn't start at the very beginning of movies right probably it started with hot modern movies work backwards for famous old movies and then gradually filled in less famous ones question is given data points figure out the order that they were inserted that's what I'm asking you to do and if I go and do this hold on I'm sorry if I keep messing this up okay what should you do you should build some kind of a do a literature search are there a pro people who've done problems like this before you should look at building a data set a a large enough set of interesting things that you can um where there is an incremental aspect to the points where you know what the order of the points were in inserted in and that gives you a way of reconstructing it probably the simplest way to do that would be to do random walks suppose you build a data set by starting at a particular place and then do a random walk from that point in arbitrary directions to what extent could you reconstruct looking at the points what order they were generated in okay I don't expect you to be able to do perfectly but I better than random okay and I give you various possible questions here that you can answer on this and as I said I gave this last year people didn't choose it I still think that doing the random walk reconstruction is an interesting problem okay and you guys can see what happens okay and if you pick this one demonstrate that you can make okay enough data for a set of experiments or get enough data to make it interesting okay um do some kind of Baseline model to generate an order okay come up with an evaluation showing how measuring how good or bad your order is what might be a statistic to measure if I give you an order if I have a bunch of points and I give you figure out an order of them and I give you a reconstructed or the right order in which the points were inserted how would you measure how good or bad yours you did can anybody come up with a statistic for that okay what Hamming distance was for measuring how many positions you got exactly right that's probably not quite as good because again if I had you know kind of moved one over and I had them all next to each other and they were one was a slight shift the other Hamming distance would say that they were terribly there was no association but that would be pretty good if actually all I did was get the last Point first okay the Hamming distance would be you know infinite but in fact I knew the relative order of all of the points any other statistic that might work here yeah something like Spearman correlation sounds good to me oh okay but we can just figure this out later any questions

### [40:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=2400s) Segment 9 (40:00 - 45:00)

about this one okay I'm getting down towards the end of the list I know we're getting restless last year I had a project where I had people try to take a text in that case the book The Great Gatsby and translate it so it didn't have any ease in it a lip grab you want to have the same book means the same thing but it doesn't have any ease now I want you to do something even harder take a sentence or two sentences and translate it into a palindrome that means what it mean roughly what those sentences mean okay what is a palindrome is a sentence that reads the same forward as backwards A Man A Plan A Canal Panama that is a palindrome the letters used are the same forwards and backwards okay so my question is a palindrome translation of a text is a rendition P such that P is a palindrome and it somehow is close enough ideally to the original text to have the same meaning okay any questions about that so my question is use an llm to try to do this palindrome translation okay I warn you I don't expect the results to be great okay but I do think that there is something interesting that can be done here okay so doing this involves getting into the bowels of an llm okay and what would I want you to do okay I would say start out using chatbots and see what they could do with this problem I expect if you do nothing then I you know just ask a prompt I expect it will do a terrible job but if you could train an llm in the right way or a adjust the uh lost function in one of these in the right way okay you can kind of do something on here okay any questions about what I mean by palindrome translation okay fair enough there is a sort of a an ongoing project I've had uh to try to gather interesting data about colleges and universities if you go to the website sk. guu you will see my guide to colleges and universities okay and um this was something Based on data that we had obtained from all you know about all interesting universities okay and in this project you are going to try to find and build some kind of an interesting new statistic about universities or some other interesting new aspect of dat of universities that you're going to try to find a way to measure from data sources okay the important thing is that they have to be based on something that you can scrape or get on the web they have to cover the full range of what I call universities you can see what I call a university on my website maybe you there's more schools than that um they kind of measure your statistics should measure what it's measuring okay um and that you can somehow validate that on some level and that you can map them properly to University names and codes so again I give you a list of my data that I'm using um what kind of things might I want to know about can you measure the sports program quality of different schools okay now um knowing about smaller teams at colleges like let's say the tennis team okay not everybody know you know there's a lot of Statistics covering the pro you know the uh football team and the basketball team in rankings my question if you look at the broader set of sports is there something interesting you can do okay because again there are a lot of schools like Stony Brook which really don't have top basketball and football teams the full depth of the program goes deeper than that can you measure it can you measure which schools have better teachers okay and again Rate My Professor is one

### [45:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=2700s) Segment 10 (45:00 - 50:00)

website that does this but can you know analyzing that data in an intelligent way is not trivial okay can you tell me how good a program is how many of you came to Stony Brook because you heard it was a good computer science department few of you okay um why did who said it was a good computer science Department don't know why did they say okay question is if I wanted to know how good the programs are at different schools is there a databased way that you can do this that make sense which school schools have better quality uh rankings and is there a way to establ generalize the data that I have be to International Schools if I wanted to rank Indian schools the way I do in the United States is that a sens thing to do is there a good enough data set where this is a meaningful comparison okay so what do I want you to sort of figure out what questions you're interested in answering what data sources you're going to use prove you can get access to it prove that you're not just SC doing something dumb like scraping one site and calling it a day okay uh is there a way that you can um update it and uh you know make sure that there's enough work to do an impressive job building a complete data set here and analyzing it if you're just scraping from One Source this is not very interesting okay any questions about okay and I think that's all the projects I am lisening to any questions about the course projects or what I'm wanting or how you should proceed okay yes the Tas are there basically to answer questions so you know the ta's I generally with the exception of jig n's project I came up with these projects and sentenced the TA to be the leader here so it's not that they have a you know necessarily A you know some will some have know something about the domain but uh but your job is to figure out the project and questions and you know ideas I encourage you to chat with the G the TA about maybe they'll have some ideas maybe they'll have ideas what other people are doing things like that any questions yes do some projects have an edge overs over the others what I will say is you saw that I gave some red and some black I believe that the red ones have more of a range to do something potential to do something interesting and distinctive now is that an edge it's not necessarily an edge you know but uh but if you've got an object project where you know what you're doing doesn't sound like it's going to be that exciting quite likely it won't be that exciting okay and that's probably not going to get rated as well as one that is on the other hand if you have one that is exciting but you have absolutely nothing to no idea how to do anything on it that's a bad project for you does everybody agree so yeah you a boring compl if you do a great job on a boring project you will do great okay uh and you know and there's domains within every boring project that are probably can be less boring okay but it's probably harder you have to make sure you can find that okay and that may or may not you know that takes some judgment and some experience and some thinking and some creativity okay yes can you use D yes you can do use whatever data sets you can find okay now if you're going to tell me that you're going to do the educational project and you're going to say look I found this data set on the web and here it is that's not a very interesting project but on the other hand if you have a better data set for one of these tasks by all means that's a great thing to do okay any questions about it so think about it and uh we can talk about it some more later on any questions okay um fair enough so what I'd like to do from now on for today I'm going

### [50:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=3000s) Segment 11 (50:00 - 55:00)

to do a little bit of a shift from this I'm going to permute the order of some of the lectures in here just because the homework three seems to be dealing with things related to nearest neighbor methods and so I want people I think maybe it'll be more helpful if I talk about these kind of things now and um one of the things that uh that uh I'm again we're faced with you're faced with a on homework three what you is the purpose of homework 3 you will use some NLP model to turn documents into a point in space you will use a vision model to turn images into a point in space now what do you do with points in space okay one popular set of methods for working with these things are nearest neighbor methods what is the idea typically it's a classification thing suppose let's say we have a bunch of training examples let's say if we wanted to do real nearest neighbor classification let's say you have a bunch of examples of what is an image that is a car of an airplane a what were the other cat you have some other you have 10 classes in your data right you could imagine a world where the images you know of are points in space and they have labels they can be used to carve up the whole Space of possible points into regions such that you can def assign every point in a region okay uh the the label of its region here we've got red and blue points if I have a point here that I want to know what color it is how can I figure that out well I look at which point is its closest neighbor if that point is blue I am blue in this case if its nearest neighbor is red then I am red okay and when you do nearest neighbor classification ideally what you've done is you carve up this space into regions the regions are not linear boundaries so we're not dealing with like a you know if we were going to have to separate this by a line it would be a shush you can have a more complicated boundary defined by this nearest neighbor thing it's a high dimensional thing okay and if you have a distance function that lets you find the distance between points nearest neighbor classification is a simple method it is relatively interpretable if you take a look at a neural network and it comes back and it says this image is a car okay what do you do say I don't know the network said it was a car that's a car okay now if you have a nearest neighbor thing and you take a picture of this guy in my glass and you say he's a car you're going to say why is he a car well you could take a look you're going to get what is the nearest neighbor to him and if you look at that you'll say yeah that's a car but it does look a lot like him doesn't it okay now you've got some explanation some interpretability okay and that I like and the fact that the decision boundary can be kind of arbitrarily Wiggly this is a great thing it means that you can fit your data in a nonlinear way this is why I like nearest neighbor methods any questions okay now one thing that nearest neighbors are good for okay is um kind of enabling you to identify um what items are like it or should be like it okay so one thing that you could do for example and I think I may sort of ask you to do this in your uh thing is to look for outliers right I might ask you to say which are the outliers which are the things that don't look like the center or I might say for this particular car image which are the ones that do look like it if I can take a look at what the near neighbors are I can usually understand what my representation is doing right or doing wrong okay I think it's a great way to visualize representations now here is some data on websites

### [55:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=3300s) Segment 12 (55:00 - 60:00)

classification that was done using two methods that that happen to be from my in my lab that doesn't really matter but what matters is I want you to tell me which one is better okay is deep walk better than Rand or not okay and what is this doing this is the uh elements in here are websites and what this is showing you for a particular website what were its nearest neighbors okay so some of you may be familiar with stonybrook. edu right which these are the 10 nearest neighbors according to one algorithm another algorithm which one does a better job does anyone have a feeling here yeah why is deep walk doing a better job well if we look over here let's look at the colleges one do you think the colleges are doing better which one's doing better for measuring capturing similarity these are all New York colleges right these are other colleges so somehow there's clearly a preference that it's capturing some locality is it better or not I don't know but it's doing something different right which is more like BYO does anyone know what BYO is okay CH Chinese company right which one is a is it doing a better job capturing what a BYU is it deep walk or is it Randi does anyone have an opinion say deep walk why is deep walk doing better because you don't think wolf from alpha is like B do okay does anybody else have a feeling I mean I see Chinese companies here right I see Chinese internet companies if I am correct or marketing companies someone knows better correct me by the way and here I'm seeing things that are reference sources but are uh not clearly I think I think of it more as an internet company as a Chinese big Chinese company than I think of it as a reference source which is better deep walk or Rand compared to chase. com does anyone know what Chase is it's a bank okay so which one is doing a better job okay deep walk okay and what about Wikipedia which one looks more like it's matching we what we would think Wikipedia sites are okay I think we keep these are reference sources that I know this is that internet comic I told you about that's great okay so you can form an opinion of which method is better without actually a quantitative evaluation does everybody see this if you understand your domain using this even in the absence of a quantitative evaluation you can have a intelligent sense of which one is better that's one of the things I like about nearest neighbor kinds of things okay any questions now these are T s& plots which is what I ask you to do on homework 3 how many people have built TSN plots of what they are seeing from homework 3 homework three okay this is where I want to get a sense this is what deep walks clusters looked like again these were internet domains I colored them by what the last extension are they edus uk. casa and I see clustering here does everybody see this thing I took my you know several hundred dimensional points these TSN plots are drawing them in two Dimensions okay and I see clusters on the Rend any E1 okay do you see clusters look at this I see clusters do I see as clean

### [1:00:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=3600s) Segment 13 (60:00 - 65:00)

clusters and the answer is no okay but it's clearly nonrandom something is going on here okay some of the Clusters okay it's doing a very good job with others it seems to be kind of fragmented but even there you see like the pink one is fragmented and it's got some continuity so my question for people who've been doing the homework do you see a picture that looks like the one on the left or is it much worse what tell someone people who've been doing the homework I just want to get a sense who here has a TSN plot of something from the homework raise your hand okay now of these is your T TSN plot much worse than the thing on the left you should be holding your hand up if it's bad relative to this okay so people are seeing bad things work this out with the over the um I'm not quite sure I know why that is happening the picture I got wasn't so bad okay so I want you guys to work this out over Piaza and stuff like that and make sure we figure this out okay any questions about it actually one thing to be impressed with is what is TSN doing in order to make a picture like this okay it's taking your high-dimensional data and reducing each point to how many dimensions two now think a little bit about how hard that is okay can you reduce something from 100 Dimensions to two dimensions without adding some form of distortion no okay assuming there's something in that those Dimensions the fact that the two-dimensional projection here is preserving so much visual similarity is actually pretty amazing part of your homework is involving Dimension reduction you're going to be using SVD to kind of reduce the dimensionality here okay it's actually kind of amazing that there is still a visual clustering even after you take this High dimensional thing and reduce it to two Dimensions does everybody agree with that so you should respect it okay not expect it to be perfect but it does sound like there's some kind of a problem here okay any questions okay good so when you're using a nearest neighbor method you need to have a sense of distance between points again on some of the homeworks you had to Define some kind of a distance measure okay between things what are the properties that a distance measure should have okay we say is called a metric if it obeys the following properties which you will all be familiar with from ukian distance the distance between any two things is always greater than or equal to zero the distance from me to you is positive okay when is the distance zero the distance from me to me is zero right um and that's what identity means that the distance we should know the distance is always positive the distance between is only zero when the points are exactly the same the distance function is symmetric I'm as far from you as you are from me and the more complicated one which is the triangle inequality the distance from me to you okay is less than the distance from me to somebody else plus the distance from somebody else to Y to you what does this mean this is the as the crow flies idea the straight line distance from me to you the shortest one is as the crow flies if I have to go out of the way for something okay the sum of the distances gets larger the only way there is equality is if this point Z lies on the shortest the the shortest line segment between X and Y and any questions about that so if you have a distance function that observes all of these properties that is

### [1:05:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=3900s) Segment 14 (65:00 - 70:00)

like ukan distance and that is uh what we grow up thinking about distance functions as being okay so our intuition about distances comes from things that are metrics now many measures of similarity in higher dimensions and in regular are not metrics is the correlation coefficient a metric two points if you could especially in high Dimensions if there was a perfect correlation between the two values that would mean that the points are very similar but the correlation coefficient goes from minus1 to 1 so it's not a normal distance metric in this case actually bigger it's also not even a um what you call it really a distance function should get smaller as things get closer okay there is a way to map a um number that's in the range of minus1 to one so that it becomes a number between like 0 and two okay we're further away you know more extreme values would be far and high correlation would be low but understand the difference between a similarity function and a you know a correlation measure cosine similarity which is the dotproduct of two vectors also has the property that it goes from minus1 to one and often in high Dimensions this is a very good measure to measure the similarity between points on your homework I believe you were given the task of comparing your distance metrics your your points with ukian metric distance and with the cosine distance and we'll see how they perform or not any questions there are other measures of seeming distance that are not met some of you may know about Mutual information measures of probability distributions this the distance but the the mutual information from A to B is not the same as B to a for many of these things so it's not symmetric and if you think about the cheapest airfare if you say that the distance between two places is the price of the distance the cheapest airfare you can get between them does that observe the the triangle inequality and the answer is often not often there's sometimes a way that if you're tricky if you fly to an intermediate City you can reduce the cost of the airf fa okay so some things are metrics some things aren't okay and you know our ability to reason about these distances depends upon whether it's a metric um the most familiar okay uh metric that we are used to is ukian distance what is ukian distance the sum over all dimensions of the absolute value of the difference of the E coordinate of one point minus the other Square it so it's the sum of squares of the dimensional differences and you take the square root of it okay this is what we use as a dimensional thing okay most everybody's familiar with using ukian distance now notice that there are other things we could do to this distance function if we wanted suppose we knew that some Dimensions were more important in other dimensions let's say we're rating people and we have one Dimension is the length of their toenail and the other was how smart they were okay intellectually we would say that maybe we care more about the similarity of their IQ than their toenail length Okay if so you could conceivably generalize this by multiplying each Dimension I by a CO efficient so you weigh you could weigh certain diff Dimensions differently than others that would still leave a metric okay so that's still

### [1:10:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=4200s) Segment 15 (70:00 - 75:00)

fine now the big problem on using distance is if you've got these things is if the diff dimensions are not comparable okay if we have you as a thing where one of your Dimensions was IQ and that's a number between one and 100 and we had another dimension that was um let's say your gender it was zero or one depending upon whether you were male or female those would both be important Dimensions but by definition the one which had the bigger range is going to make a bigger contribution to the distance this is why we talk about zc scores and normalization okay to make sure that if you're going to compute the distance between one you have to make sure that all Dimensions have comparable magnitudes or else they're going to it's going to get a weird you're going to have a weird result any questions yes okay so there's two different issues that we're talking about here you here I think you're talking about correlation between features what happens if we have two things features that are highly correlated with each other let's say one dimension was the length of your left toenail another dimension was the correlation length of your right toenail okay these two are probably going to be highly correlated what happens when you compute the distance of it well the they're being treated equally so toen nailess is going to be getting twice as much vote let's say as IQ under that case if you have one column for IQ you have two very highly correlated um you know variables when we're Computing the distance here we're not taking the correlation of the columns into account okay so if you have a large number of highly correlated features they will come to Domin the distance metric that could be a problem okay that might be an argument that before you comp compute the distance you do some kind of Dimension reduction to eliminate we'll talk about Dimension reduction later but that might be the concern you would have there any questions about that now there is a generalization of ukian distance that is natural and important to know which is that uh that our ukian distance was the differences of each Dimension raised to the second power sum up that and then take the second root by raising it to the 1/2 that's what the square root is right note that this is well defined if we use a different value of K here ukian distance is what we would call the L2 distance because K is equal to two Okay but there are other values of K that are interesting what happens if we have k equals one in a distance metric like this that's going to be the sum of the absolute differences raised to the 1th power which means it under changed basically this is the sum of the dimensional distances okay this is called The Manhattan distance because this is how far you walk on a city grid right does everybody see that in uh if you're in Manhattan you want to go from here to there the shortest distance according to the ukian distance is to fly through buildings can you fly through buildings unless your Superman cannot fly through buildings right you can only move down the X you know north to south or east to west and so what is the distance from here to there it's the number of blocks to the West plus north that you have to go to okay that is exactly what the L1 distance corresponds to what does the L Infinity distance correspond to what happens if I instead say take the sum of the distances of

### [1:15:00](https://www.youtube.com/watch?v=ub6cK32Aq6o&t=4500s) Segment 16 (75:00 - 79:00)

things raise every difference to the power if you don't like Infinity say extremely big right and tend take the Infinity the an extreme the infinitythe what is going to be left from that does anybody have an intuition let's think about it if you have um you know a bunch of differences here let's say that difference this one is bigger than all the others if you raise each of these to the infinitethejackal squared it this would be something like that right so when you raise them to a high power the biggest one is going to dominate all the other terms right and if the other terms are buus next to it okay if you then take the infinitythe you're only going to have basically the largest distance dimensional distance so K1 gives you the ma K Infinity gives you the maximum difference between things any questions about that why do we use the k equal 2 in our world we could have measured lived in a world of Manhattan where everybody measured everything in terms of L1 distance why is it that we use uh the ukian distance in our life does anyone have an idea basically it's because we like the Pythagorean theorem that's the way I would think about it that we're kind of used to this idea that there is a diagonal kind of cuts things off okay but there's you know it's not immediately obvious that the world had to take two as an assumption and depending upon what dimension distance measure you use that will change what you find as a nearest neighbor remember we're interested in nearest neighbors is the20 or 1. 5 further from zero okay if we had k equal 1 well the sum of the dimensional differences here is 2 versus uh three so this point is closer okay if we had one where we were measuring the maximum difference I think I got this backwards did I have this backwards um okay for for ukian distance this is two this is going to be 1 2 * 1. 5 squared okay this point is all right further from it so so as opposed to closest so the further point from one in in a uh Manhattan distance this is further from the origin than that in ukian distance this is further than that for the K Infinity Norm okay we look which has a bigger dimensional difference the X Dimension is diff is bigger so P1 is further from origin so basically if you want to do nearest neighbor you got to specify your distance measure and you got to decide which Norm is better for you we're used to using k equal 2 that's probably a good first try but there is going to be a difference as to whether you're worried about random noise on each Dimension or are you worried about artifacts of big artifacts of outliers between things if there's certain Dimensions being out of whack means everything they're really far away maybe you want a higher K than others any questions which of these are circles I'm showing you what a circle is what is a circle is the set of points that are equal distance from um an or a center right this is what we are used to

---
*Источник: https://ekstraktznaniy.ru/video/39736*