Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment (Paper Explained)
25:59


Yannic Kilcher · 27.09.2021 · 16,738 views · 403 likes

Video description
#neurips #peerreview #nips

The peer-review system at machine learning conferences has come under much criticism in recent years. One major driver was the infamous 2014 NeurIPS experiment, in which a subset of papers was given to two different sets of reviewers. This experiment showed that only about half of all accepted papers were consistently accepted by both committees, demonstrating a significant influence of subjectivity. This paper revisits the data from the 2014 experiment, traces the fate of accepted and rejected papers over the 7 years since, and analyzes how well reviewers can assess future impact, among other things.

OUTLINE:
0:00 - Intro & Overview
1:20 - Recap: The 2014 NeurIPS Experiment
5:40 - How much of reviewing is subjective?
11:00 - Validation via simulation
15:45 - Can reviewers predict future impact?
23:10 - Discussion & Comments

Paper: https://arxiv.org/abs/2109.09774
Code: https://github.com/lawrennd/neurips2014/

Abstract: In this paper we revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. We determine that 50% of the variation in reviewer quality scores was subjective in origin. Further, with seven years passing since the experiment we find that for accepted papers, there is no correlation between quality scores and impact of the paper as measured as a function of citation count. We trace the fate of rejected papers, recovering where these papers were eventually published. For these papers we find a correlation between quality scores and impact. We conclude that the reviewing process for the 2014 conference was good for identifying poor papers, but poor for identifying good papers. We give some suggestions for improving the reviewing process but also warn against removing the subjective element. Finally, we suggest that the real conclusion of the experiment is that the community should place less onus on the notion of top-tier conference publications when assessing the quality of individual researchers. For NeurIPS 2021, the PCs are repeating the experiment, as well as conducting new ones.

Authors: Corinna Cortes, Neil D. Lawrence

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Contents (6 segments)

Intro & Overview

Hi there. Today we'll look at "Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment" by Corinna Cortes and Neil D. Lawrence, who were actually the program chairs of the 2014 NeurIPS conference. So they have access to data that the rest of us sadly don't, but it also allows them to do some pretty cool research on how conference reviewing works and whether it can actually determine the quality of a paper, or how much of it is just random, subjective reviewer decisions. This paper takes up the papers that were subject to the 2014 NeurIPS experiment and tracks them over time: it looks at the papers that were submitted and how they performed in the subsequent years, meaning how many citations they accumulated, both for the accepted and for the rejected papers. They find some pretty interesting results, and we'll dive into them. The paper is not too long and the conclusions are fairly straightforward; I still think it's really cool that people actually follow up on this work.

Recap: The 2014 NeurIPS Experiment

For those of you who don't know, the 2014 NeurIPS experiment was an experiment in assessing how much of conference review is essentially random. What they did, and the paper has a little section about this, was select about 10% of the submissions, 170 papers, which would undergo review by two separate committees. Usually a paper goes to a single committee, a bunch of reviewers plus an area chair, who together decide whether to accept or reject. In this experiment, a paper would instead be given to two different committees, committee one and committee two. Committee one was drawn only from one half of the reviewer pool and committee two only from the other half; the assignment of reviewers to the two pools was random, and the papers that participated were randomly selected as well. Each committee would reach its own decision, accept or reject, and the interesting part is of course how many of those decisions agree or disagree with each other. By the way, a paper would finally be accepted if either of the committees accepted it, i.e., the max of the two decisions. If I recall correctly, this year's NeurIPS conference actually repeats the experiment from 2014, so we'll hopefully get another data point for assessing how conference reviewing has developed over the years, whether it's gotten better or actually worse.

(The authors have decided that the name change is retroactive. I never know, when talking about old NeurIPS conferences, whether I'm supposed to say NIPS 2014 or NeurIPS; in this paper it's NeurIPS.)

So what was the outcome of that experiment? That's pretty interesting. These are still the 2014 numbers, split by committee one and committee two. It's not literally the same committee one each time, of course, but committee one was always drawn from the first half of the reviewer population and committee two from the second half. They did agree on most of the papers: for 101 papers they agreed to reject, and for 22 they agreed to accept. However, for 43 of the papers one committee would accept while the other would reject. So for about 25% of the papers, the two committees disagreed. 25% sounds like a lot, but also maybe not that much, until you look at it in a different way, as the authors put it: had the conference reviewing been run with a different committee, only half of the papers presented at the conference would have been the same. If you always went with committee one you would get one set of accepted papers, and if you always went with committee two you would get another; the mere selection of the committee determines about half of the papers at the conference. So if you walk through the big poster halls or look through the proceedings, keep in mind that half of the papers are there, not purely, but largely because of the random choice of reviewing committee: they wouldn't be there had the committee been a different one. Half the papers. That's kind of crazy, and of course it sparked a lot of discussion. So that's the outset, those were the results from back then, and now we're going into the new analysis.
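As a quick sanity check on those numbers, here is a minimal sketch in plain Python that recomputes the disagreement rate and the "half the papers would differ" statistic from the counts quoted above (the even split of the disputed papers between the two committees is my simplifying assumption):

```python
# Agreement counts from the 2014 NeurIPS experiment as quoted above
# (166 of the ~170 duplicated papers received decisions from both committees).
both_reject = 101
both_accept = 22
disagree = 43  # one committee accepts, the other rejects

total = both_reject + both_accept + disagree

# Fraction of papers on which the two committees disagreed.
disagreement_rate = disagree / total
print(f"disagreement rate: {disagreement_rate:.1%}")  # ~25.9%

# Assumption: each committee accepts its 22 joint accepts plus roughly
# half of the 43 disputed papers. The overlap between the two
# hypothetical conference programs is then just the joint accepts.
accepts_per_committee = both_accept + disagree / 2  # ~43.5 papers each
overlap = both_accept / accepts_per_committee
print(f"overlap between the two hypothetical programs: {overlap:.0%}")  # ~51%
```

The ~51% overlap is exactly the "only half of the papers would have been the same" framing from the paper.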

How much of reviewing is subjective?

They do three distinct analyses. The first one is titled "reviewer calibration": they try to figure out what portion of a reviewer's assessment of a paper is, let's say, objective and what portion is subjective, i.e., what portion of a score is simply due to the reviewer's subjective feelings about the paper and doesn't match any other reviewer's score. For this you can build a model. Let y_{i,j} be the score that the j-th reviewer gives to the i-th paper; being the conference chairs, these authors have prime access to that data. What you observe is y, and you assume it's a combination of three things. First, there is some objective paper quality f_i. This is what the reviewers are trying to predict: when a reviewer enters the number y into the system, they're trying their best to assess f_i. Second, there is b_j, the calibration bias of the j-th reviewer: not everyone sees the one-through-ten scale the same way, so what's a three to me might be a five to you, and including the b_j factor is how we correct for that. Lastly, there is the e_{i,j} factor, the subjective portion of the score: independent of the objective quality of the paper, this is the subjective bonus or penalty that reviewer j gives to paper i. The goal is to figure out how these numbers compare to each other, how much of the score is objective versus subjective, once we have corrected for general reviewer calibration bias. Keep in mind this is a model, a way we imagine the world; all we observe is y.

What you can do, of course, is set up a linear system of all the scores, because every reviewer gives more than one score at this conference and every paper gets scores from more than one reviewer. But it turns out this system is over-parameterized: you have about as many parameters as you have observed numbers, so there aren't enough data points to estimate everything directly. As much fun as over-parameterized models are in deep learning, they're not that good when you want to estimate a linear system, so people come up with regularizers and Bayesian approaches, yada yada. I'll skip all of that and just give you the numbers: in the model the authors fit, the contribution attributable to the objective term f_i and the combined contribution of b_j and e_{i,j} turn out to be almost exactly the same. As they formulate it: "In other words, 50% of a typical reviewer's score is coming from opinion that is particular to that reviewer and not shared with the other reviewers. This figure may seem large, but in retrospect it's perhaps not surprising." This is pretty surprising to me, but not entirely unexpected: I think anyone who's participated in conference peer review would expect a number in approximately this range, because we know the review process is pretty noisy, and very often individual reviewers just give weird scores that you don't understand. And here's the reason you don't understand them: their source is subjective and largely not shared by other reviewers.
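To make the model y_{i,j} = f_i + b_j + e_{i,j} concrete, here is a toy reconstruction of the fitting step in NumPy. This is my own illustrative sketch, not the authors' code: I use a plain ridge penalty as a stand-in for whatever regularizer or Bayesian prior the paper actually uses, and the variance levels are made up.

```python
import numpy as np

# Toy version of the calibration model described above:
# y[i,j] = f[i] + b[j] + e[i,j]. We only observe y for (paper, reviewer)
# pairs that actually happened; the system is over-parameterized, so we
# add an L2 penalty (a crude substitute for the paper's Bayesian treatment).
rng = np.random.default_rng(0)
n_papers, n_reviewers = 200, 60
f_true = rng.normal(size=n_papers)             # objective quality
b_true = rng.normal(scale=0.5, size=n_reviewers)  # per-reviewer calibration

# Each paper gets 3 random reviewers (toy assignment).
pairs = [(i, j) for i in range(n_papers)
         for j in rng.choice(n_reviewers, size=3, replace=False)]
y = np.array([f_true[i] + b_true[j] + rng.normal() for i, j in pairs])

# Design matrix: one indicator column per paper and one per reviewer.
X = np.zeros((len(pairs), n_papers + n_reviewers))
for row, (i, j) in enumerate(pairs):
    X[row, i] = 1.0
    X[row, n_papers + j] = 1.0

lam = 1.0  # ridge strength, chosen arbitrarily for the toy example
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
f_hat, b_hat = theta[:n_papers], theta[n_papers:]

# Residuals play the role of e_{i,j}; compare the objective contribution
# to the reviewer-specific (b + e) contribution, as in the paper's analysis.
e_hat = y - X @ theta
print("var(f_hat):", f_hat.var(),
      " var(b_hat) + var(e_hat):", b_hat.var() + e_hat.var())
```

The paper's finding corresponds to those two printed quantities coming out roughly equal on the real 2014 review data.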

Validation via simulation

Having figured out that about 50% of the variation is due to a reviewer's subjective feelings about a paper, they now try to validate that finding, and for this they run a simulation of the conference. They assume each paper was scored according to the model given above and estimate the accept consistency by averaging across 100,000 samples. The question is: if this is really the correct model, do we get back the consistency found in the experiment? Remember, the result of the 2014 experiment was roughly 50% consistency in acceptance, and the calibration analysis determined about 50% subjectivity in scoring. Do these two numbers match? So they run a simulation where every reviewer has 50% subjectivity, simulate the split into two committees where each committee decides on its own, and check whether the numbers from the experiment come out. And the answer is yes, actually.

The plot shows curves for a bunch of different scenarios, namely different numbers of reviewers per committee. "Random" means there are no reviewers and decisions are just random, and you can see that as the accept rate of the conference goes up, the accept precision of the committees goes up, simply because more papers are accepted and therefore more papers would be the same if you changed the committee. What we're interested in is the curve for three reviewers, the most common scenario at these conferences. The way to read it: if the conference had an accept rate of 50%, we would expect an accept precision of 0.75, i.e., 75%, which means that if we switched the reviewers for all the papers, 75% of the accepted papers would still be the same. Remember that in the experiment only 50% of the papers were the same when the committee was switched, but the conference also didn't have a 50% accept rate: the actual accept rate was something like 23%, and if you look that up on the curve, you land at about a 60% accept precision. This might still seem away from the 50% found in the experiment, but the experiment had so little data that if you calculate bounds on the true accept precision, it lies between 38% and 64%, and the number obtained here, 61%, is within those bounds.

Pretty interesting: this means the model they put up is a close enough approximation to reality that it predicts the experiment's outcome, which gives us a bit of validation that we're on a good track. We can fairly confidently say that the finding that about half of a reviewer's decision on a particular paper comes down to subjectivity is consistent with the experiment. It'll be interesting to see how this develops when the experiment is repeated this year.
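Here is a minimal Monte Carlo sketch of that validation step. It is my own reconstruction under stated assumptions (not the authors' simulation code): scores follow y = f_i + e_{i,j} with the calibration bias b_j already removed, the subjective share of score variance is set to 50%, there are three reviewers per committee, and each committee accepts its top-scoring 23% of papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_accept_precision(n_papers=10_000, n_reviewers=3,
                              accept_rate=0.23, subjective_share=0.5):
    """Fraction of papers accepted by committee 1 that committee 2
    also accepts, under the score model y = f_i + e_{i,j}."""
    # Objective quality, shared by both committees.
    f = rng.normal(size=n_papers)
    # Scale the noise so the subjective share of score variance matches.
    noise_var = subjective_share / (1 - subjective_share)  # = 1.0 at 50%
    committee_scores = []
    for _ in range(2):  # two independent committees
        e = rng.normal(scale=np.sqrt(noise_var),
                       size=(n_papers, n_reviewers))
        committee_scores.append((f[:, None] + e).mean(axis=1))
    s1, s2 = committee_scores
    # Each committee accepts its top `accept_rate` fraction of papers.
    k = int(accept_rate * n_papers)
    acc1 = np.argsort(s1)[-k:]
    acc2 = np.argsort(s2)[-k:]
    return len(np.intersect1d(acc1, acc2)) / k

print(simulate_accept_precision())  # roughly 0.6 under these assumptions
```

Under these toy assumptions the overlap comes out around 60%, in the same ballpark as the ~61% the paper reports for a 23% accept rate.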

Can reviewers predict future impact?

Lastly, they try to figure out whether these reviews are even worth it, so to say: do they actually predict how good a paper is? And how do you measure how good a paper is? By the number of citations, of course. They define citation impact as the log of the number of citations. Yes, there is a debate about whether citations really mean a paper is good or influential, but for better or worse we don't have a better measure right now, and it's been seven years, which is like three generations in machine learning, so there has been enough time for these papers to accumulate citations.

Let's first look at the accepted papers: do the scores the reviewers gave predict in any way whether a paper is going to be cited more or less? Do higher scores indicate more citations? The answer is no, not at all. The correlation is 0.05, which is ever so slightly statistically significant, but not really. So, at least for this particular conference, there is no correlation between reviewer scores and the future impact of the paper.

It becomes a little more interesting when you ask reviewers specific questions. The usual questions, is the paper novel, is it correct, is it well written, and so on, are not necessarily indicators of significance; only a small part of accepting a paper is about significance. If you explicitly ask reviewers "do you think this paper will potentially have a major impact?", you get a slightly higher correlation, but also not really, which means reviewers are kind of bad at estimating whether any given paper will have a big impact, though to be fair, for most papers the answer is probably no by default.

The interesting part is when you ask reviewers about their confidence in their rating. If I understand correctly, it doesn't even matter which rating, but at these conferences you have to provide a confidence score alongside your rating; you say, okay, I think this paper is really good, but I'm not very confident. If you simply correlate the average confidence across a paper's reviews with its impact, you do get a slight correlation, which is interesting. The authors hypothesize that this might come down to something like clarity: if a paper is written very clearly, a reviewer understands it better, which raises their confidence, but a clearer paper also means the rest of the world has an easier time understanding it, and therefore it gets cited more often. That's a good hypothesis, but it's quite interesting that confidence in a paper seems to predict its impact better than a direct assessment of the impact. That's astounding. It's not super astounding that confidence by itself would predict impact, but that it does so better than directly asking people is. I wonder what other weird questions we could ask that would end up correlating with future impact. Do you like the colors of the paper's figures?

So that was the accepted papers. Interestingly, they also trace the fate of the rejected papers: only 414 papers were presented at the final conference, and they go through a lot of work to figure out where the rejected ones ended up, searching for papers with similar or identical titles and authors. This is of course not a perfect process, but they seem to have been able to trace a lot of these papers to their final destinations. A lot of papers were discarded, and some were simply posted on arXiv or elsewhere. For the discarded papers, you don't know whether they somehow morphed into other papers, but it's still pretty interesting to see, though the authors note there are various sources of error in these plots.

Finally, there is the plot of the fate of the rejected papers. They don't say exactly what blue and green mean in this particular plot; in other plots in the same paper they differentiate, for example, between papers that were eventually accepted somewhere else and papers that were not, or that couldn't be traced, so that might be blue and green; I'm not sure, maybe I'm just bad at reading the legend. In any case, if you look at the rejected papers, plotting their calibrated quality scores against impact, there is in fact a correlation, which means that for the rejected papers the reviewers' assessment really does correlate with how the papers ultimately end up doing, though I'm going to guess that, since citation count is involved, the discarded papers can't be included. The conclusion is that for the rejected papers, reviewers can tell which ones are better or worse; for the accepted papers, not so much. And that's what they said at the beginning: the review process is probably good at identifying bad papers, but bad at identifying good papers. This isn't too surprising: it's really easy to recognize a very poor paper, but much harder to recognize just how good a paper is compared to other good papers. So that was the paper.
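To illustrate the kind of analysis described here, a hedged sketch with made-up stand-in data (not the authors' data or code; the `scores` and `citations` arrays are hypothetical, and the +1 inside the log is my assumption to handle uncited papers):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the real data: one calibrated quality score
# and one citation count per paper. The actual data lives in the authors'
# repository: https://github.com/lawrennd/neurips2014/
rng = np.random.default_rng(1)
scores = rng.normal(5, 1.5, size=400)
citations = rng.poisson(lam=np.exp(rng.normal(3, 1, size=400)))

# Citation impact as defined in the paper: log of the citation count.
# The +1 guards against log(0) and is my convention, not necessarily theirs.
impact = np.log(citations + 1)

r, p = stats.pearsonr(scores, impact)
print(f"correlation r = {r:.2f}, p = {p:.3f}")
# With independent synthetic data, r is near 0, mirroring the paper's
# null result (r ~ 0.05) for accepted papers; for the traced rejected
# papers the paper finds a clear positive correlation instead.
```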

Discussion & Comments

They give some recommendations. For example, they say maybe we should assess papers on different criteria than we do now, but they warn against doing away with subjectivity altogether, because as annoying as the subjectivity is, they argue it also guards against collective dominance, against making consistent mistakes: if the entire conference makes consistent mistakes in some direction, the subjectivity might counter that a little bit. I'm not sure that's a super strong argument. I'm generally for noisy processes over super rigid ones, though it seems that conference review right now is a bit too noisy. Personally, I would do away with the accept barrier altogether: you submit to a conference, you get a bunch of scores, and then you have the scores. Why do we need to divide papers into accepted and rejected? It seems better to just put papers out there and let future researchers assess them in retrospect, rather than having three random people with highly subjective opinions do it. But yes, probably a bit of noise is good in a process like this.

They also say that maybe we should not put that much value on publishing at top-tier conferences. I don't know how that's going to work; I wish as well that we could change the collective thinking about our field, but I don't see that as an easy task.

In any case, this was the paper. Let me know your ideas, and let me know how you think this year's experiment is going to turn out: are we going to find more subjectivity, or less? How much disagreement do you think we're going to find? It's going to be interesting. Thanks for listening, and I'll see you next time.
