AutoML can be a tool for good, but there are pitfalls along the way. Rahul Sharma and David Selby tell us how AutoML systems can be used to give us false impressions about the explainability of ML models - maliciously, but also by accident. While this episode isn't about a new, exciting AutoML method, it can tell us a lot about what can go wrong when applying AutoML and what we should think about when we build tools for ML novices to use.
[00:00:00] Hi and thanks for tuning into the AutoML Podcast.
[00:00:03] Today is going to be sort of a meta-episode, you could say.
[00:00:07] We're not necessarily looking at a specific AutoML method.
[00:00:11] We're instead talking to two people that are thinking about
[00:00:14] what implications the use of AutoML can have.
[00:00:17] Especially if we think about disciplines that maybe aren't as close to machine learning
[00:00:22] and the scientific practices that we know and care about.
[00:00:27] So specifically, I'm talking to Rahul Sharma, who is an up-and-coming researcher.
[00:00:32] He's still doing his Masters in Kaiserslautern.
[00:00:35] And David Selby, who's really a statistician.
[00:00:39] He's worked in epidemiology, he's done a PhD in statistics.
[00:00:42] So he's the perfect person to tell us what maybe unintended consequences using AutoML can have.
[00:00:50] And also how we should take that into account and how we think about building AutoML methods
[00:00:57] and how we think about talking to the people that we expect to use our tools.
[00:01:02] So I think that's really an interesting perspective to have for anyone
[00:01:06] that is working towards automating machine learning and making ML easier to apply.
[00:01:12] I really enjoyed talking about the different aspects this topic has
[00:01:16] because obviously the first thing we think about if we think about how statistics interact with AutoML
[00:01:22] or machine learning in papers and how that might not reflect the actual real results.
[00:01:29] We think about malicious actors and cherry picking
[00:01:33] and all of these buzzwords that have been going around different sciences,
[00:01:38] different disciplines of sciences in the last years.
[00:01:40] But if you think about it, and as we talk about it,
[00:01:43] it really becomes clear that it's not just about malicious actors trying to manipulate results
[00:01:50] and a few bad apples.
[00:01:52] It's really that because we work with a lot of data and optimization on top of optimization,
[00:01:58] it's very easy to produce results that inadvertently support your hypothesis,
[00:02:05] even if that's not what you plan on doing.
[00:02:08] And AutoML, unfortunately, can be a tool to do that in the domain of explainable AI.
[00:02:15] And that's what the paper titled X Hacking is about.
[00:02:19] So I hope you enjoy. This is the first podcast episode I'm doing ever.
[00:02:25] So be patient with me and the audio quality on this one.
[00:02:29] But for now, enjoy the conversation and let me know what your thoughts were
[00:02:34] about both the paper and the conversation we had on explainability hacking using AutoML.
[00:02:43] Hello and welcome back to the AutoML podcast.
[00:02:46] I'm Theresa, and today I am hosting two researchers that are working on how we could potentially
[00:02:52] or not we but other people could maliciously use AutoML to deceive you in research papers.
[00:02:58] Those are Vipra Rahul Sharma and David Selby.
[00:03:01] Do you maybe want to quickly introduce yourself before we get into the topic?
[00:03:05] Yeah. Hi, I'm Rahul.
[00:03:07] I am a Master's student in Computer Science at the Rheinland-Pfälzische Technische Universität (RPTU) in Kaiserslautern.
[00:03:13] And I'm also working as a research assistant at the German Research Center for Artificial Intelligence
[00:03:20] in the research group, Data Science and its Applications.
[00:03:23] And this is David Selby.
[00:03:25] Hi, I'm David Selby.
[00:03:26] I'm a Senior Researcher in the Data Science and its Applications group at the German
[00:03:30] Research Center for Artificial Intelligence.
[00:03:33] I moved to Germany from the UK a couple of years ago.
[00:03:37] So you, I think, picked a very interesting topic that's not represented that much
[00:03:42] in the AutoML community, namely what actually happens if someone doesn't use AutoML tools
[00:03:48] for what we would consider good purposes, like making machine learning more efficient
[00:03:52] or solving hard machine learning problems, but actually uses them maliciously.
[00:03:56] In this case, when it comes to interpretability of models,
[00:04:00] what prompted this line of thinking in the first place?
[00:04:03] Did you encounter something like this in a paper?
[00:04:06] It basically comes from the idea of p-hacking from statistics, where basically
[00:04:12] if you have a hypothesis and you want to prove that hypothesis,
[00:04:17] you could basically do different types of tests and just report those tests
[00:04:22] which gave you a significant result in the form of a p-value.
[00:04:25] So taking this idea from p-hacking, we thought, can it be transferred to X-hacking
[00:04:32] in general, which is what we call explainability hacking in our paper?
[00:04:35] So the idea is that we have explainability metrics like SHAP values
[00:04:40] and different models, and we want to have these black box models explained
[00:04:45] with the help of SHAP values, but can we give different explanations
[00:04:49] without actually changing what the influence should be in terms of accuracy
[00:04:55] or any other metric for performance?
[00:04:58] So that is basically the idea.
[00:05:00] It comes from p-hacking, and p-hacking has been talked about a lot in statistics,
[00:05:05] but I think the idea of X-hacking is not that prevalent yet
[00:05:10] and we need to talk about it.
[00:05:12] So we basically show a way that it could be done automatically
[00:05:17] and without much influence on the models or the metric itself,
[00:05:23] and how it can be done basically with an AutoML model
[00:05:28] or an off-the-shelf AutoML library that you can just take.
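To make the p-hacking analogy concrete, here is a minimal sketch of the selective-reporting pattern Rahul describes, using nothing but synthetic noise and scipy; the variable names and numbers are hypothetical, not taken from the paper.

```python
# Minimal sketch of p-hacking: test many unrelated "predictors" against a noise
# outcome and report only the one that happens to come out significant.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
y = rng.normal(size=200)                    # outcome: pure noise
candidates = rng.normal(size=(200, 40))     # 40 predictors with no real relationship to y

p_values = [pearsonr(candidates[:, j], y)[1] for j in range(candidates.shape[1])]
best = int(np.argmin(p_values))

# Selective reporting: mention only the test that "worked".
print(f"Feature {best} is associated with y (p = {p_values[best]:.3f})")
# With 40 tests at alpha = 0.05, a couple will look "significant" purely by chance.
```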
[00:05:33] And can you maybe quickly walk us through what such a malicious user would do
[00:05:37] to get a better result?
[00:05:39] How would that result look like in the first place for people
[00:05:41] that are maybe not too familiar with them?
[00:05:44] So yeah, the idea would be that you have data sets
[00:05:48] and you want to test your hypothesis on a data set.
[00:05:52] For example, you take a tabular data set, an image data set, anything,
[00:05:56] and then you have certain features and you want to say that
[00:06:00] okay these features have influence on the results, right?
[00:06:04] And there are two ways that you could do it: you could design your own model
[00:06:08] which takes more domain expertise about the data set that you have
[00:06:13] and the models that you want to use from like literature research or anything
[00:06:17] or you could just use an over-the-top, off-the-shelf model
[00:06:21] by running AutoML on it and it finds out all of the possible data processing steps
[00:06:27] and the models and the hyperparameters associated.
[00:06:30] And in the end, the malicious part would be that the AutoML gave
[00:06:35] actually many results to you with like good accuracies and all.
[00:06:39] But within those models you actually, what we say in the paper is that
[00:06:44] you cherry pick the model which gives you the explanations that you want to show
[00:06:48] to the research community and not necessarily show all the other stuff
[00:06:53] that has been put out by AutoML but had like very different explanations.
[00:06:58] So you have a hypothesis and you pick the model that fits your hypothesis,
[00:07:03] but you don't show the other models which didn't actually conform to
[00:07:08] the hypothesis that you had but had equal performance.
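As a rough illustration of the cherry-picking Rahul describes here, the sketch below stands a handful of scikit-learn models in for an AutoML leaderboard and then reports only the one whose SHAP explanation suits a preferred hypothesis. It assumes the shap package is available; the tolerance, feature index and model choices are hypothetical, not taken from the paper.

```python
# Sketch of cherry-picking by explanation: several candidate models stand in for an
# AutoML leaderboard; among those with similar accuracy, report the one where the
# feature our "hypothesis" is about looks least important. Illustrative only.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = [RandomForestClassifier(n_estimators=100, random_state=s).fit(X_tr, y_tr)
              for s in range(3)]
candidates.append(GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr))

feature_of_interest = 3                     # the feature our hypothesis is about
reports = []
for model in candidates:
    acc = model.score(X_te, y_te)
    sv = shap.TreeExplainer(model).shap_values(X_te)
    vals = np.asarray(sv[1] if isinstance(sv, list) else sv)
    if vals.ndim == 3:                      # some shap versions add a class axis
        vals = vals[..., 1]
    reports.append((acc, np.abs(vals).mean(axis=0)[feature_of_interest]))

# The questionable step: among models with accuracy close to the best,
# report only the one that downplays the feature of interest.
best_acc = max(acc for acc, _ in reports)
defensible = [r for r in reports if r[0] >= best_acc - 0.02]
reported_acc, reported_imp = min(defensible, key=lambda r: r[1])
print(f"Reported model: accuracy {reported_acc:.3f}, "
      f"mean |SHAP| of feature {feature_of_interest} = {reported_imp:.3f}")
```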
[00:07:11] It's ultimately about how you report your results.
[00:07:15] So rather than necessarily the internal mechanism of your model
[00:07:20] if you're interested in an applied area like for example medicine or social science
[00:07:24] and you have some hypothesis that this particular intervention or exposure
[00:07:29] or characteristic either helps to explain a result
[00:07:34] or doesn't help to explain the result, then at the end of the paper
[00:07:38] where you say we looked at the importance of this feature
[00:07:41] or the direction of this feature depending on how you explain the model
[00:07:44] or in traditional statistics that the p-value or the effect size associated
[00:07:49] with this feature, with this variable, this predictor,
[00:07:52] you want for whatever reason because people in your field are expecting
[00:07:57] this result or because you want to come up with something surprising
[00:08:00] to say well it turns out that X appears to be associated with Y
[00:08:06] or if you're being really ambitious X causes Y, although maybe you shouldn't do that.
[00:08:11] And coming from statistics normally the way that statisticians work
[00:08:16] is that they are interested in how the world works
[00:08:20] and what is the data generating process
[00:08:24] and how would this drug interact with this particular disease or outcome
[00:08:31] whereas in the machine learning world traditionally people are interested
[00:08:34] in prediction performance basically and then everything else is a black box
[00:08:38] or at least even if it isn't a black box you're not that interested in how it works.
[00:08:43] You just want to come up with better predictions according to whatever metric you want to use.
[00:08:47] And so this is kind of the perfect storm where in machine learning
[00:08:52] people are starting to take an interest in explainability
[00:08:55] and at the same time in statistics of people starting to take an interest in machine learning methods.
[00:08:59] And so what better opportunity for machine learners or machine learning engineers
[00:09:04] or ML users or whatever you want to call them to make all the same mistakes
[00:09:07] that people in statistics have been making for years.
[00:09:11] Yeah, and that's exactly how I read the paper.
[00:09:15] Obviously you write a lot about how you could maliciously use these approaches
[00:09:20] but it also sounds to me like you could also just not be very familiar with
[00:09:24] how AutoML and the underlying machine learning model would interact in such a scenario.
[00:09:30] So where this explanation hacking at least gets harder is the concept of a defensible model.
[00:09:37] So maybe let's talk about this concept first and then go through the ways
[00:09:41] how you describe how X hacking can happen in practice.
[00:09:44] So what makes a model defensible? What does that mean?
[00:09:47] Yeah, well in our paper we describe a model as defensible if
[00:09:52] it actually conforms to the task that you want to do
[00:09:57] which is basically like you have a machine learning model
[00:10:00] and then you want to have a better accuracy on that.
[00:10:03] So a model would be considered defensible if it has good performance,
[00:10:08] which basically means that you have good performance of the model
[00:10:13] and that conforms to the beliefs you had about the data set and your hypothesis.
[00:10:20] And then when you look into it from the explanation point of view,
[00:10:27] you could say that yes, because my model was giving me good accuracy,
[00:10:33] I would want to kind of believe the explanations that it's going to give as well.
[00:10:39] And I would add to that, it depends who you're defending it from.
[00:10:43] So my stereotype of the machine learning community is
[00:10:46] oh well if it's got good performance then that's good and no further questions.
[00:10:50] But at the same time in different fields everywhere is different
[00:10:53] and so there might be certain, maybe there's a list of best practices
[00:10:57] or there's certain models which are popular or your PhD supervisor won't accept anything
[00:11:02] unless it has a particular way of filling in missing values
[00:11:07] or selecting the features or modeling framework.
[00:11:10] And so when you're reporting the paper,
[00:11:13] which are the things that reviewers would ask you difficult questions about?
[00:11:17] And so if they say well your performance doesn't sound that great
[00:11:20] then that's one thing but it could be
[00:11:22] oh well why didn't you use random forests for this?
[00:11:24] Everybody uses random forests in this field.
[00:11:28] And so it's not necessarily thinking about expert machine learning researchers
[00:11:35] but also in applied fields where everybody has their own practices and specialisms.
[00:11:41] And so what defensible means is different for different fields
[00:11:47] and potentially you could quantify it according to going through the literature
[00:11:51] and then saying well I can defend this because I can find a recent reference
[00:11:55] where this was used and then the buck doesn't stop with you
[00:12:00] because then you can say well you know take it up with this fine reference
[00:12:04] that I found that says that random forests with this particular hyperparameter are great
[00:12:09] therefore it wasn't even my decision that's just what the literature says.
[00:12:12] Or conversely if your predictive performance is good then you can say
[00:12:16] well look I've got better predictive performance than either some paper found in the literature
[00:12:21] or some baseline that we've also trained
[00:12:23] and then of course the baseline needs to be something that is also defensible
[00:12:27] so it can't necessarily be oh I'll always guess the answer is one
[00:12:31] and it might have to be something a little bit more sophisticated
[00:12:34] whether it's logistic regression or a random forest with default parameters
[00:12:38] or something like that.
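One way to read David's point is that "defensible" can be made quantitative. The sketch below keeps only candidate models whose accuracy is within a small tolerance of a conventional reference model; the logistic regression reference and the 0.02 tolerance are assumptions chosen for illustration, not the paper's definition.

```python
# Sketch of a quantitative "defensibility" filter: a candidate counts as defensible
# here if its accuracy is no worse than a conventional reference minus a tolerance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reference = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
reference_acc = reference.score(X_te, y_te)

candidates = {
    "small_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "big_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "stump": DecisionTreeClassifier(max_depth=1, random_state=0),
}

tolerance = 0.02
for name, model in candidates.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    verdict = "defensible" if acc >= reference_acc - tolerance else "hard to defend"
    print(f"{name}: {acc:.3f} vs reference {reference_acc:.3f} -> {verdict}")
```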
[00:12:40] I think what I can take away from that is that
[00:12:42] if I want to make a model defensible against myself,
[00:12:44] I definitely have to look out for more things than just
[00:12:48] don't cherry pick explanations on purpose
[00:12:50] and we're going to talk about some best practices you suggest
[00:12:53] of how one could defend their own models against oneself later.
[00:12:59] Maybe let's first go into the cherry picking
[00:13:01] because that's something you could talk about as well
[00:13:03] and I think that's the maybe easiest way
[00:13:07] and most obvious way for most people
[00:13:09] to think about how you would actively X-hack something, right?
[00:13:13] As you describe it, you run an AutoML framework,
[00:13:16] you run it maybe even multiple times with different settings
[00:13:19] you get a range of models out of it
[00:13:21] and you just report whatever fits your needs.
[00:13:25] What do you think empirically?
[00:13:27] To what magnitude can that affect your results?
[00:13:30] I mean you showed some of that as well, right?
[00:13:33] Yeah exactly.
[00:13:34] So the idea about the cherry picking that you got is correct
[00:13:37] that's what we have done right now
[00:13:39] but when it comes to the magnitude
[00:13:42] to which we can influence our results
[00:13:44] that basically goes in the direction of actually pushing autoML
[00:13:49] to generate bad results
[00:13:51] and not just running autoML over and over
[00:13:54] and see which results that you got.
[00:13:56] So we also are yes, working in the direction
[00:13:59] where you can actually just push autoML
[00:14:01] to generate bad results
[00:14:03] in terms of explanation hacking
[00:14:06] and it also depends on the type of data set that you have.
[00:14:11] So right now we have done it on the tabular data sets
[00:14:15] and the magnitude to which we can influence the results
[00:14:20] or the explanations
[00:14:22] is basically that we have shown certain results
[00:14:25] on what kind of features could be hacked
[00:14:28] based on the dependency and independency
[00:14:31] of the features within each other.
[00:14:33] The features that are independent of each other
[00:14:36] are a little bit harder to hack
[00:14:38] in the context of X-hacking,
[00:14:40] but that is only done in the sense
[00:14:44] that you know about the data generating process
[00:14:47] of the data set that you have
[00:14:49] and you know about the measured and unmeasured confounders also.
[00:14:53] So it becomes possible only when you know everything about the data
[00:14:58] to not allow X hacking to be done.
[00:15:01] It becomes a little bit tough for the features to X hack
[00:15:04] if you know everything about the data
[00:15:06] but that's not generally the case in real life.
[00:15:09] And also we have shown from certain data sets
[00:15:12] from the OpenML-CC18 benchmark,
[00:15:15] we have 23 data sets for binary classification
[00:15:18] and running AutoML and not touching anything in it
[00:15:21] and not touching the explanations themselves,
[00:15:24] we could just look for the models which are defensible
[00:15:28] over certain data sets and we'll see that
[00:15:31] for most of the data sets that we have tested this on,
[00:15:34] it does work and the features basically are getting flipped.
[00:15:39] The importance of the features is getting affected by it.
[00:15:43] So the top feature in a particular model,
[00:15:46] compared to the baseline, is getting pushed down in rank
[00:15:50] to basically the third or the fifth most important feature.
[00:15:54] And we saw that in a range of benchmark data sets
[00:15:58] from OpenML classification benchmark.
[00:16:01] And just using this cherry picking strategy,
[00:16:03] we're able to show that for certain data sets
[00:16:06] if your goal, now we don't necessarily have any particular interest
[00:16:10] in any one of these data sets,
[00:16:12] so you pick the most important feature in a baseline model
[00:16:15] and then you say how easy is it to flip this
[00:16:18] just by cherry picking other models from the AutoML pipeline.
[00:16:21] And for some of the data sets,
[00:16:23] you can do this on almost all of the models you pick.
[00:16:26] So most of the models have a different most important feature
[00:16:29] and then for some of them you can't do it for any,
[00:16:31] so that feature was always the most important.
[00:16:34] Depending on what your objective is,
[00:16:36] your interest might not just be in saying,
[00:16:38] oh I want to knock it off the top spot
[00:16:40] of the most highly ranked important feature.
[00:16:42] You might say, oh well I want to flip the direction
[00:16:44] or if the scale has a particular interpretation in this problem,
[00:16:49] then you might want to get the number below some amount.
[00:16:52] And if you're p-hacking,
[00:16:53] then that would be not necessarily saying
[00:16:55] that this is the most important feature,
[00:16:57] but you're trying to say that the p-value is above or below 0.05,
[00:17:01] which is a slightly arbitrary threshold
[00:17:04] that is used in the literature
[00:17:06] and that's not necessarily data dependent.
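The measurement David describes, how often the baseline's top-ranked feature can be knocked off the top spot, can be sketched in a few lines. Here a grid of gradient-boosting models stands in for the AutoML search; the data set, hyperparameter grid and shap usage are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of the rank-flipping check: find the most important feature of a baseline
# model, then count how many alternative models put a different feature on top.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def top_feature(model):
    vals = np.asarray(shap.TreeExplainer(model).shap_values(X_te))
    if vals.ndim == 3:                      # some shap versions add a class axis
        vals = vals[..., 1]
    return int(np.abs(vals).mean(axis=0).argmax())

baseline = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
baseline_top = top_feature(baseline)

# Alternative "pipelines": the same model family with different hyperparameters.
alternatives = [GradientBoostingClassifier(max_depth=d, learning_rate=lr,
                                           random_state=0).fit(X_tr, y_tr)
                for d in (1, 2, 4) for lr in (0.05, 0.3)]
flips = sum(top_feature(m) != baseline_top for m in alternatives)
print(f"Baseline top feature: {baseline_top}; "
      f"displaced in {flips}/{len(alternatives)} alternatives")
```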
[00:17:08] Yeah, but even though you say that there are some data sets
[00:17:11] where it's harder to manipulate the features,
[00:17:13] I found that actually quite impressive
[00:17:15] how big the discrepancies you can introduce actually are
[00:17:19] and how massive this x-hacking influence actually is.
[00:17:22] Because it seems like on some data sets
[00:17:24] that means you can totally do exactly what you described,
[00:17:27] you can set up a hypothesis that you want to say
[00:17:29] and you can relatively easily,
[00:17:31] probably even without having a large amount of domain knowledge
[00:17:34] in AutoML itself,
[00:17:36] just use AutoML as a tool to produce fraudulent data in a way.
[00:17:41] On the other hand though,
[00:17:42] there's also aspects of it
[00:17:44] that I found more interesting from a different direction.
[00:17:46] I mean, you already said there's standard data sets in some domains
[00:17:49] where you just try things, right?
[00:17:51] And oftentimes you have a choice of what data sets you want to evaluate on,
[00:17:54] what methods you want to evaluate on.
[00:17:56] Sometimes you also just want to leave out outliers and reporting
[00:17:59] for reasons one or another,
[00:18:01] although obviously you should report that in your paper.
[00:18:03] Is that also already x-hacking?
[00:18:05] Is it already x-hacking if I say,
[00:18:07] probably this won't actually be true for data set
[00:18:10] five out of the five I can potentially use,
[00:18:14] so I'm just using the first four.
[00:18:17] Arguably yes.
[00:18:18] I mean, it depends on your definition of x-hacking.
[00:18:20] So you could have within a data set,
[00:18:23] x-hacking where you say,
[00:18:25] well, okay, my data set is my input
[00:18:27] and then I just want to look at all the different explanations
[00:18:29] I can get from the single data set.
[00:18:31] But in principle, I mean,
[00:18:32] it can even go down to if you're collecting the data yourself
[00:18:35] and you're sampling it in a biased way,
[00:18:37] then if the ultimate output is this explanation
[00:18:41] in the same way that the ultimate output of,
[00:18:44] you know, a classical frequentist statistical analysis would be
[00:18:48] a p-value (it shouldn't be a p-value,
[00:18:50] but often it's considered that it is),
[00:18:52] then it could be the data collection.
[00:18:54] It could be how you deal with outliers.
[00:18:56] It could be how you deal with missing values.
[00:18:58] It could be which features you choose to,
[00:19:00] you know, you just exclude some completely
[00:19:02] or you measure them in a different way
[00:19:04] or you transform them.
[00:19:06] Although ultimately then that comes down to
[00:19:08] just fishing generally.
[00:19:10] And so here we're focusing a little bit more on,
[00:19:13] okay, once you've got the data in front of you,
[00:19:16] how hard can you torture the data to get it to confess?
[00:19:21] So what you're saying is that using autoML for x-hacking
[00:19:24] in that way is actually,
[00:19:25] or can be a lot more deliberate
[00:19:27] than simply trying to preventively, you know,
[00:19:29] transform some features and hope that it works
[00:19:31] because you're not really sure what the result will be,
[00:19:34] but you can optimize for it if you use autoML.
[00:19:36] Well, so this is an interesting point.
[00:19:38] I'm not aware of any autoML frameworks
[00:19:40] that actually select your data sets as well as the models,
[00:19:43] which would be an interesting choice
[00:19:45] where maybe there's a module that searches the web
[00:19:48] and finds a different data set and says,
[00:19:50] this one doesn't give the result you want,
[00:19:52] but you know, I found another data set in TCGA
[00:19:55] or the UCI repository or some government portal
[00:19:58] or something like that.
[00:20:00] But one of the original motivations for this project
[00:20:03] is that we're starting to see autoML frameworks
[00:20:05] where they don't just optimize,
[00:20:08] let's say, the hyperparameters of the number of trees
[00:20:11] in your random forest or the depth
[00:20:13] or how many layers or what kind of activation functions
[00:20:16] you have in your neural network,
[00:20:18] but also these, depending on where you consider
[00:20:21] the data analysis to start, the pre-processing steps.
[00:20:24] So dealing with the missing values,
[00:20:26] selecting your features before they even get into a model,
[00:20:28] so not necessarily through regularization,
[00:20:30] but through some other criterion,
[00:20:32] and dealing with outliers.
[00:20:34] And often these are done as,
[00:20:36] for want of a better word, manually,
[00:20:39] and then you say, okay, now my data set is ready
[00:20:41] and now I'll put it into my pipeline.
[00:20:44] Now, of course, if the way in which you fill in
[00:20:47] your missing values or you select the features
[00:20:50] or you move the outliers or maybe even transform the variables
[00:20:53] or even some sort of encodings perhaps,
[00:20:56] if those are tied to your objective,
[00:20:59] then obviously the way in which you measure something
[00:21:02] can very easily become, you know,
[00:21:04] a measure very quickly becomes a target
[00:21:06] if you're aware of it.
[00:21:08] And this is the sort of thing that I worked on in my PhD,
[00:21:11] so this idea of, well, if you measure
[00:21:13] the scientific literature in a certain way
[00:21:15] but other people are aware of
[00:21:17] this is how they're being measured,
[00:21:19] then they'll change their behavior to accommodate that.
[00:21:22] And so if a certain way of filling in the missing values
[00:21:26] causes your predictive performance to go up,
[00:21:29] then that might be great,
[00:21:31] but of course there might be that, you know,
[00:21:33] oh, well, we're deleting all the difficult cases
[00:21:35] and then suddenly our predictive performance is amazing
[00:21:37] and, you know, that comes down to what
[00:21:39] your external validation data is
[00:21:41] and that sort of thing.
[00:21:43] But then if you can introduce the idea
[00:21:46] of an explanation into your metric,
[00:21:48] then it very quickly comes into this idea of,
[00:21:51] well, okay, if I fill in the missing values in this way,
[00:21:54] if I remove these troublesome values in this way,
[00:21:56] if I get rid of this awkward, redundant feature,
[00:22:01] then I end up with the explanation that I wanted
[00:22:03] and I can always claim, oh, well,
[00:22:05] these are all reasonable steps because
[00:22:07] these are all things that's a vanilla autoML pipeline
[00:22:11] or at least a more recent one that has data processing in.
[00:22:13] It could have picked those steps normally.
[00:22:16] So, you know, I've not done anything wrong
[00:22:19] and ultimately the idea is not necessarily in saying,
[00:22:22] oh, well, I've done autoML, but rather,
[00:22:25] oh, no, I came up with this model all on my own
[00:22:27] and then concealing the fact that you used autoML
[00:22:29] or at a slightly more challenging level,
[00:22:32] oh, I did use autoML,
[00:22:34] but then you maybe are a bit vague about what the search space was.
[00:22:37] So maybe you searched for more models
[00:22:40] than you actually revealed that you searched for
[00:22:42] and you say, well, you know,
[00:22:44] I thought that, you know, it searched all different kinds
[00:22:47] of mean and median imputation, but what you don't reveal
[00:22:50] is that also it tried multiple imputation, for example,
[00:22:53] and that had the results you didn't want.
[00:22:55] So you just don't mention that you also searched that space.
[00:22:58] And so this is this idea of selective reporting
[00:23:01] or it's also been called the file drawer problem.
[00:23:04] So, you know, oh, I got this result that I didn't like.
[00:23:07] I'll put that in the file drawer and forget about it
[00:23:09] because there's no journal of null results
[00:23:12] or conference of null results,
[00:23:14] assuming that the thing that gets your, you know,
[00:23:17] gets you your research grant or your promotion
[00:23:20] or your job or your prestige is a publication
[00:23:23] that has a particular explanation in it.
[00:23:25] I appreciate in machine learning that's not often the case
[00:23:27] and it might be I've come up with an entirely new framework
[00:23:30] or, you know, this is faster than before
[00:23:32] or, you know, has better accuracy
[00:23:34] or is able to use better information.
[00:23:36] But if you're able to say, oh, well,
[00:23:39] we showed a surprising result about the explanations
[00:23:41] of the fairness of the model, then that's, you know,
[00:23:44] an incentive that has emerged recently or a long time ago,
[00:23:48] depending on what your domain is and what methods you use.
[00:23:51] Yeah, absolutely. And also what we can't forget in this
[00:23:54] is that machine learning applications become more and more common
[00:23:57] in different application domains.
[00:23:59] And obviously that's a use case
[00:24:01] that we also build AutoML tools for, right?
[00:24:04] Those are the communities where we say, yeah,
[00:24:06] ML should be easy to use, use an AutoML tool for your use case.
[00:24:10] And it seems to me that there, as you said,
[00:24:13] it's communities colliding, right?
[00:24:15] Where there's little knowledge of statistics,
[00:24:17] maybe a little knowledge of machine learning
[00:24:19] and then either mistakes can happen
[00:24:21] or we can get into the more malicious way
[00:24:23] that you also talk about directly searching
[00:24:26] for good explanations in a multi-objective way.
[00:24:29] So this is then the part where it becomes really,
[00:24:32] really hard to detect.
[00:24:33] Obviously if you just do, you know,
[00:24:35] some questionable reporting,
[00:24:37] maybe people can read between the lines,
[00:24:39] but in multi-objective search,
[00:24:42] you probably don't want to report that at all.
[00:24:44] And so here you also did some experiments.
[00:24:47] You also showed some results.
[00:24:49] Does it get even worse than cherry-picking
[00:24:52] if we explicitly look for explanations
[00:24:55] as a metric in AutoML systems?
[00:24:57] Well, yeah, it depends basically.
[00:24:59] So when you know about the data,
[00:25:02] as I said like a few minutes ago,
[00:25:04] we are doing this on the tabular datasets right now, right?
[00:25:07] The target is like a binary classification.
[00:25:10] And the experiment that we had done
[00:25:13] is basically run a simulation of datasets
[00:25:16] where the two features are like linearly correlated.
[00:25:21] So if there's a dependency between features,
[00:25:23] then it becomes easier to X-hack
[00:25:27] the values of the explanation metric
[00:25:30] that we use, which is SHAP right now.
[00:25:33] And then we saw that if we do not have dependent features,
[00:25:36] so the features are independent,
[00:25:38] then it is possible to X-hack,
[00:25:40] but at the cost of your accuracy going down.
[00:25:44] So that is in the experiments that we have done.
[00:25:48] We see that, okay, if the features are independent,
[00:25:52] then we cannot really X-hack without losing accuracy,
[00:25:56] but those are simulation experiments.
[00:25:59] And that means that you need to know
[00:26:01] the generating process in a real world example,
[00:26:04] and also the confounders, the confounded or unconfounded features
[00:26:11] which basically create all this multicollinearity in your data,
[00:26:15] to actually specifically know that, okay, we cannot X-hack this feature.
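A toy version of the simulation Rahul mentions can be set up in a few lines: one feature drives a binary outcome and a second feature is merely a correlated proxy of it. Different, similarly accurate models can then spread SHAP credit differently across the two. The data-generating choices and explainer below are assumptions for illustration only, not the paper's simulation code.

```python
# Toy simulation: only x1 is causal for y, x2 is a noisy copy of x1. Models of
# similar accuracy may credit x1 or x2, which is what makes correlated features
# easier to "X-hack" than truly independent ones.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)                     # correlated proxy of x1
X = np.column_stack([x1, x2])
y = (x1 + 0.5 * rng.normal(size=n) > 0).astype(int)    # outcome driven by x1 only

for model in (LogisticRegression(), RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X, y)
    # Model-agnostic SHAP explainer with a small background sample.
    explainer = shap.Explainer(lambda d: model.predict_proba(d)[:, 1], X[:100])
    vals = np.abs(explainer(X[:200]).values).mean(axis=0)
    print(f"{type(model).__name__}: accuracy {model.score(X, y):.3f}, "
          f"mean |SHAP| x1={vals[0]:.3f}, x2={vals[1]:.3f}")
```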
[00:26:21] So is that basically something that reviewers could use
[00:26:25] to look out for actually directed, malicious applications of AutoML?
[00:26:30] As you said, in the defensible model definition,
[00:26:32] if the accuracy goes down and if we have features
[00:26:35] that are more correlated,
[00:26:37] then the likelihood is just very low that we can explicitly X-hack that.
[00:26:41] Well, it depends on what you report in your paper.
[00:26:44] So if everybody has access to your original data set
[00:26:48] and all of your exploratory data analysis
[00:26:51] where maybe in a small scale analysis,
[00:26:54] you make your scatter plot matrix
[00:26:56] of all the different predictors against one another,
[00:26:59] and then you see that some are correlated and some are not,
[00:27:02] and then you might select out features on that.
[00:27:04] But normally you don't include all of that information in a paper, right?
[00:27:08] You do your exploratory data analysis,
[00:27:10] you get a basic idea of how you're going to proceed,
[00:27:12] and then your actual paper doesn't include all of that information
[00:27:15] because of either pragmatic or old-fashioned limits
[00:27:19] about how long your paper can be,
[00:27:21] and people don't necessarily...
[00:27:23] Once you've found your path,
[00:27:25] you tend to get rid of your script,
[00:27:27] so you don't include it in the final repository,
[00:27:30] and people don't necessarily report all the way from the source data
[00:27:34] through to the conclusion,
[00:27:36] okay, well, we looked at the features,
[00:27:38] we've done this exploratory data analysis
[00:27:40] and here's the relationship between the features,
[00:27:42] and then this is the final model we went for.
[00:27:45] But just thinking about directed search,
[00:27:48] I think the easy answer would be,
[00:27:50] if you're looking for something directly,
[00:27:52] then it's going to be at least faster to find it,
[00:27:56] even if you don't necessarily find a better result
[00:27:59] in a given amount of time.
[00:28:01] The challenge would be,
[00:28:03] of course, if you're doing a directed search
[00:28:05] with these multi-objective optimization problems
[00:28:07] where your explanation is one metric
[00:28:10] and your accuracy or your performance is another,
[00:28:13] you can't then reveal that because then it gives the game away.
[00:28:18] But what you can get away with, of course,
[00:28:21] is revealing what you would consider to be your baseline
[00:28:24] or what candidate models you also include,
[00:28:26] and then this all comes down to how defensible you think it is.
[00:28:30] So how much accuracy are you prepared to lose?
[00:28:32] Can you get away with reporting a couple of models
[00:28:34] that have higher accuracy or lower accuracy,
[00:28:37] and you say, well, the accuracy is acceptable,
[00:28:39] but this model is more explainable, for example?
[00:28:42] Or do you have some quantifiable metric for defensibility,
[00:28:48] so maybe penalize models that use some particular method
[00:28:53] that is hard to explain
[00:28:55] or is not very popular in your field or something like that?
[00:28:58] Or do you penalize yourself on the basis that
[00:29:00] we can use this method, but we can't show the code
[00:29:03] because that will give things away,
[00:29:04] versus we can use this method and we can't show the code
[00:29:06] because all we did was cherry pick.
[00:29:08] And then in your field,
[00:29:09] it depends on what the best practices are.
[00:29:11] So if you're in a field where they insist on producing code,
[00:29:13] then you're going to have to penalize that option.
[00:29:17] But in a lot of fields, you can post your paper,
[00:29:20] you can even post it to papers with code,
[00:29:22] and there's a surprising number of papers on papers with code
[00:29:25] that don't have code or don't have the code
[00:29:27] to reproduce the results.
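The directed search David mentioned a little earlier, where the explanation becomes part of the objective alongside accuracy, can be caricatured as a single scalarized score. The weight, feature index and model grid below are hypothetical; this is a sketch of the idea, not anyone's actual pipeline.

```python
# Sketch of a "directed" objective that rewards accuracy while punishing the
# visibility of one feature in the SHAP explanation. Purely illustrative.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
feature_of_interest, lambda_ = 7, 5.0     # hypothetical target feature and trade-off weight

def hacked_objective(model):
    acc = model.score(X_te, y_te)
    vals = np.asarray(shap.TreeExplainer(model).shap_values(X_te))
    if vals.ndim == 3:                    # some shap versions add a class axis
        vals = vals[..., 1]
    importance = np.abs(vals).mean(axis=0)[feature_of_interest]
    return acc - lambda_ * importance     # reward accuracy, punish the feature's importance

grid = [GradientBoostingClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
        for d in (1, 2, 3, 4)]
best = max(grid, key=hacked_objective)
print("Chosen max_depth:", best.get_params()["max_depth"],
      "accuracy:", round(best.score(X_te, y_te), 3))
```

As the conversation notes, the give-away is that nobody doing this would publish the objective, which is exactly why the reporting questions discussed here matter.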
[00:29:29] Yes, that is indeed surprising.
[00:29:31] It always surprises me why that function exists.
[00:29:34] But yeah, reporting is a big topic anyway, right?
[00:29:37] I feel like standards have been shifting the last years
[00:29:40] and there are now at least some conferences.
[00:29:41] I mean, I know from the AutoML Conference,
[00:29:43] being on the organizing committee there,
[00:29:45] where we actually require code submissions,
[00:29:48] including some sort of reproducibility reviews.
[00:29:51] But in machine learning,
[00:29:53] at least that's not necessarily the case.
[00:29:55] So what would you say, except for code,
[00:29:59] which I can totally see would be very helpful
[00:30:01] in detecting things like this,
[00:30:03] what should be ideally reported?
[00:30:05] Because you also said that reproducibility checklists
[00:30:08] as they exist now can actually,
[00:30:11] maybe because they're not complete in this way,
[00:30:14] kind of give a false sense of security
[00:30:17] if there's things missing.
[00:30:18] So what do you think is missing right now from reporting?
[00:30:21] So if you have a measure, then it will become a target.
[00:30:24] So if you have a checklist
[00:30:25] and everyone is using the checklist,
[00:30:26] then people can look for ways to get around the checklist.
[00:30:28] And if they insist that you publish code,
[00:30:30] assuming that people are malicious actors,
[00:30:33] and I think we should be clear
[00:30:35] that most people are not, right?
[00:30:36] So this is more about mistakes
[00:30:38] or I forgot to do this ablation study
[00:30:41] or I forgot to try this other model
[00:30:43] or I just by chance happened to get this result
[00:30:46] and I run it with a different seed
[00:30:48] and I get a different result.
[00:30:49] Yeah, maybe let me just interrupt.
[00:30:51] Yeah, because that's also something important.
[00:30:53] I think also loads of people outside of AutoML
[00:30:55] might not realize how much of a bias AutoML
[00:30:59] can introduce into your data.
[00:31:00] Also distinctions between the tuning set and the test set,
[00:31:04] I think those are nuances that are sometimes lost
[00:31:06] on people outside of the field.
[00:31:07] So I think that's a very good point
[00:31:09] that the malicious case is obviously an edge case,
[00:31:11] hopefully here and most people will just make mistakes.
[00:31:15] But I think as far as reporting goes,
[00:31:18] if you can have code that is,
[00:31:22] well, let's say a minimal working example
[00:31:25] that you can run from start to finish
[00:31:27] and get your result, then that would be great.
[00:31:29] But in practice, that's not always possible
[00:31:31] because you can't always share your original data set
[00:31:34] or maybe you can physically,
[00:31:36] or rather should I say you can from a legal point of view
[00:31:40] but it's not practical because you're working
[00:31:42] on some gigantic data set that no reviewer wants
[00:31:46] to download a 50 gigabyte data set onto their laptop
[00:31:48] just because they had a hunch
[00:31:50] and they want to check something.
[00:31:51] Especially if the experiment takes three days to run
[00:31:54] and that sort of thing.
[00:31:56] But it would be nice if at the very least
[00:31:58] you could say, oh, well, here is code.
[00:32:00] Here is an anonymized, like pseudonymized data set
[00:32:04] with a similar structure
[00:32:06] and you can at least see the steps that we are following.
[00:32:09] But ultimately if someone wants to lie,
[00:32:11] then they will find ways of getting around these things
[00:32:16] because you could have a whole paper that says,
[00:32:19] oh yeah, we use this method
[00:32:20] and they just didn't use that method at all.
[00:32:22] And I mean, how would you tackle that?
[00:32:26] Ultimately, I think it's, yeah, if you can show willing
[00:32:31] and you can say, well, we followed these steps
[00:32:33] and here's why we followed these steps
[00:32:35] and maybe we didn't include in the paper
[00:32:37] but we tried these other methods
[00:32:39] and the results are available on GitHub or somewhere else.
[00:32:42] Then that's at least a step in the right direction.
[00:32:44] I don't think it needs to be really stringent guidelines
[00:32:48] on everything needs to be runnable in a notebook
[00:32:51] and you click a button and all your entire paper
[00:32:53] will come out because it's not practical.
[00:32:56] But at the very least, most people,
[00:32:58] depending on the field are not publishing code anyway
[00:33:01] or they're not explaining the data pre-processing steps.
[00:33:05] And so if they, at the very least,
[00:33:07] even if they don't reveal in the paper
[00:33:09] but it's in supplementary materials,
[00:33:10] they say, well, we followed this step, this step
[00:33:12] and this step and here's why,
[00:33:14] then that would be a step in the right direction.
[00:33:16] And I think if the moment you start introducing
[00:33:18] mandatory checklists,
[00:33:20] then you start getting into the issue of,
[00:33:22] am I doing this to tick the box
[00:33:23] and not because I think it's best practice.
[00:33:26] In the statistical literature,
[00:33:28] they have these statements on p-values
[00:33:30] and they have these papers that are about,
[00:33:33] here are all the strategies you could use to p-hack
[00:33:35] and the basic idea is,
[00:33:37] and so this is a cautionary tale,
[00:33:39] so here is all the things that you shouldn't do
[00:33:41] and are ill-advised.
[00:33:44] But I wonder if maybe the guidance
[00:33:46] is not necessarily to be directed at authors of papers
[00:33:51] or companies producing reports,
[00:33:53] but rather at reviewers and the general public
[00:33:56] and the consumers.
[00:33:57] So even if a paper is published,
[00:33:59] you can say, well, I looked at the paper,
[00:34:02] but I couldn't understand
[00:34:04] how they got the data into this format.
[00:34:07] And then there's another paper
[00:34:08] that had a different result
[00:34:09] and I could understand what they did.
[00:34:11] And so ultimately I'm more likely to trust this one,
[00:34:14] to cite this one,
[00:34:15] to use their method again
[00:34:17] and then it gets into a post-publication peer review
[00:34:20] or systematic reviews or sort of meta-analyses.
[00:34:24] So I don't know if I necessarily have
[00:34:27] specific recommendations
[00:34:30] for exactly what you should report
[00:34:32] because every field reports slightly different things
[00:34:35] and some things are more practical than others to report,
[00:34:38] depending on the structure of your data
[00:34:40] and what its best practice is.
[00:34:42] And you might have a paper that says,
[00:34:44] oh well, we used the statistical environment R
[00:34:46] for our code and that's considered enough, right?
[00:34:49] And then others might actually reveal their code,
[00:34:52] but then you don't know what shape the data
[00:34:55] was in going into it because they reveal their code
[00:34:57] without their data.
[00:34:58] So I think you can always do more.
[00:35:02] Yeah, I guess you're right at that.
[00:35:04] I think the thing that's nice
[00:35:06] about having reproducibility checklists
[00:35:08] is obviously the lighter load on reviewers.
[00:35:11] But I agree that it's very easy to check a lot of boxes
[00:35:15] on reproducibility checklists
[00:35:16] without actually having a very clean scientific process
[00:35:20] in the paper.
[00:35:21] And unfortunately, especially reading a paper like this,
[00:35:26] it really seems to me that even though reviewers are busy
[00:35:29] and we all review a lot of papers,
[00:35:31] maybe we have to read the papers more closely
[00:35:34] and be aware of the many possible pitfalls.
[00:35:38] I, for example, probably would never have reviewed a paper
[00:35:41] thinking about X-hacking before.
[00:35:43] Maybe it comes down to reviewer education in a way.
[00:35:47] It also depends on the attitude of the reviewers.
[00:35:50] So if you have a checklist and you say,
[00:35:52] did they report the method that they used
[00:35:55] for dealing with missing data, for example?
[00:35:58] Now, depending on the attitude of the reviewer,
[00:36:03] you might say, well, oh, they did.
[00:36:05] Oh, no, they didn't because they didn't do it
[00:36:07] in enough detail for my liking,
[00:36:09] or they didn't include the code for how they did it,
[00:36:11] or they didn't explain why they did it,
[00:36:13] or they didn't explain how they handled missing data.
[00:36:16] So do I just think, oh, well,
[00:36:18] they just forgot to mention that in the paper?
[00:36:20] Or do you think, well, if the only output's a paper,
[00:36:24] then what's the point in doing something
[00:36:25] in a scientific analysis if you don't report it?
[00:36:27] So I'm going to assume the worst
[00:36:29] and they did the worst possible thing,
[00:36:30] that they deliberately imputed every missing value with 999
[00:36:35] because that gave them the result they wanted
[00:36:37] because they didn't say that they didn't do that
[00:36:39] in the paper.
[00:36:40] And so even if you have a checklist,
[00:36:43] in the same way that people have guidelines
[00:36:45] on how well written was the paper,
[00:36:47] did they reference well?
[00:36:49] It depends where you're coming from.
[00:36:51] If you already don't like a paper,
[00:36:53] then you can rate it poorly or well,
[00:36:56] or you might think, oh, it's just another box to check.
[00:36:59] I'll just do a Control-F
[00:37:01] and look for the words missing data or imputation
[00:37:04] and then see if it's mentioned
[00:37:05] and then just tick the box.
[00:37:06] So even with a checklist, I think,
[00:37:09] it could make things easier for reviewers.
[00:37:11] But on the other hand,
[00:37:12] if they weren't interested in looking at missing data,
[00:37:15] then they think, oh, now that's extra thing
[00:37:16] for me to check rather than,
[00:37:18] or it makes it easier to structure how I rate the paper.
[00:37:22] Yeah, that's true as well.
[00:37:25] Though maybe we should mention
[00:37:26] that you do have some suggestions
[00:37:28] of best practices for AutoML
[00:37:31] when also dealing with explanations.
[00:37:32] Do you maybe want to tell listeners
[00:37:35] what they could think about including in their reporting?
[00:37:40] We have, not like an exhaustive list,
[00:37:43] but best practices for AutoML
[00:37:45] to find out if X-hacking has been done:
[00:37:47] the first is probably explanation histograms
[00:37:50] and the second is pipeline analysis.
[00:37:52] And when it comes to,
[00:37:54] so you could say that, okay,
[00:37:55] AutoML generates pipelines
[00:37:57] and then it does preprocessing, model selection
[00:38:00] and hyperparameter selection
[00:38:01] and then finally the result.
[00:38:02] So could people analyze the entire pipelines
[00:38:06] to see where a researcher might have done something
[00:38:11] to produce the X-hacked results.
[00:38:13] That is kind of a little bit tough today,
[00:38:16] because I think unless they really give the entire pipeline
[00:38:22] to check, to let the other researchers verify that,
[00:38:27] okay, this could be done and it's reproducible.
[00:38:29] In a standard format
[00:38:31] that you can compare with other papers.
[00:38:33] Here's my pipeline, here's your pipeline.
[00:38:35] How are they different?
[00:38:36] How are they the same?
[00:38:37] Exactly.
[00:38:38] So if that is available,
[00:38:39] then I think it becomes easier to reproduce results
[00:38:42] to some extent provided that the data set is there
[00:38:45] and all the stuff is being given.
[00:38:47] But if the pipeline is not there
[00:38:49] and they report the result as a model in the end,
[00:38:52] then it becomes imperative, I think,
[00:38:54] to say how can we do the analysis
[00:38:57] assuming that the result came from an AutoML solution.
[00:39:01] And that is like an open area of research right now.
[00:39:04] How can we take a paper
[00:39:06] and then assume that there's a pipeline that is over there
[00:39:09] and get the results?
[00:39:10] And probably in that sense,
[00:39:12] if we do have the access to raw data
[00:39:16] and we set up our own AutoML solution
[00:39:20] and try to run pipelines to find the pipelines
[00:39:23] which could fit what the researchers have done,
[00:39:26] and then have an explanation histogram of the metrics.
[00:39:32] And then we draw a distribution over the explanation metrics
[00:39:36] and then see that, okay, what they have reported
[00:39:40] is coming at the tail of the distribution
[00:39:42] and it probably is an outlier
[00:39:44] and we could have a certain red flag on that.
[00:39:47] Okay, this could be X-hacked.
[00:39:50] But we cannot explicitly say that it is X-hacked,
[00:39:53] but we can have a view on it, like it could be.
[00:39:58] So it could be raised in the review process
[00:40:01] that, okay, maybe we need more working out
[00:40:05] of what people have done
[00:40:06] and then more clarity from the researchers.
[00:40:10] So as an author,
[00:40:11] you could produce this explanation histogram
[00:40:14] where you take the metric that you wanted to report
[00:40:19] and you consider all of the pipelines
[00:40:22] that were evaluated in your AutoML pipeline.
[00:40:27] So all of the models that it spat out
[00:40:30] or all of the ones that had accuracy above a certain level.
[00:40:34] And then you plot the distribution
[00:40:35] of your explanation metric across all of those models
[00:40:38] and then you compare it to the one that you reported.
[00:40:42] So either you say, here is a distribution of explanations.
[00:40:46] And so my answer is a distribution.
[00:40:49] And so that's a very probabilistic sort of Bayesian kind of way
[00:40:53] almost you might say to reporting results.
[00:40:56] Or if you still think,
[00:40:57] well I need to report one model and one value,
[00:40:59] then you can say,
[00:41:00] well I'm reporting this one value,
[00:41:02] but here's the distribution.
[00:41:04] And you can see that my explanation is more
[00:41:08] or less in the middle of this distribution.
[00:41:10] So it's not an outlier.
[00:41:11] It's not, oh, well,
[00:41:12] there's plenty of other models that said that,
[00:41:15] you know, X is not associated with Y.
[00:41:17] And we're here at the far right end tail
[00:41:20] where it does cause Y or the far left end tail
[00:41:23] where it causes Y not to happen.
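The explanation histogram is easy to produce if the AutoML run's evaluations have been logged. A minimal sketch, with fake logged results standing in for whatever your tool actually records:

```python
# Sketch of an explanation histogram: for every pipeline the search evaluated,
# keep its accuracy and the mean SHAP value of the feature being reported on,
# then plot the distribution and mark the single model chosen for the paper.
# The `results` list is fake data standing in for an AutoML run's log.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
results = [(rng.uniform(0.80, 0.90), rng.normal(0.05, 0.02)) for _ in range(300)]
reported_shap = 0.11                       # value from the model chosen for the paper

best_acc = max(acc for acc, _ in results)
defensible = [s for acc, s in results if acc >= best_acc - 0.02]

plt.hist(defensible, bins=30)
plt.axvline(reported_shap, color="red", linestyle="--", label="reported model")
plt.xlabel("mean SHAP value of the feature of interest")
plt.ylabel("number of defensible pipelines")
plt.legend()
plt.show()
# If the reported value sits far out in the tail, the explanation is not robust to
# the modelling decisions, whether or not it was deliberately cherry-picked.
```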
[00:41:26] Yeah.
[00:41:27] I think that's a pretty sensible thing to include
[00:41:30] since from my understanding,
[00:41:32] it's really also something that's free
[00:41:34] if you use AutoML anyway,
[00:41:36] that's maybe important to say you've run this before.
[00:41:38] So you can just plot it nicely
[00:41:40] and that gives some more security.
[00:41:42] Although I can also see discussions coming up,
[00:41:44] what happens if it happens to be an outlier,
[00:41:47] but it's not actually cherry picked.
[00:41:49] Well, so here's the thing.
[00:41:50] If you pick a model for whatever reason,
[00:41:53] and then if your model is not sensitive
[00:41:55] to your analysis decisions,
[00:41:56] it doesn't matter if you cherry picked it deliberately,
[00:41:59] it's still not a robust result.
[00:42:02] And so we're not saying that people necessarily
[00:42:06] are taking this strategy of deliberately going in,
[00:42:09] malicious, I want to have this result,
[00:42:11] I for whatever reason don't think that I can just get this result
[00:42:14] by doing everything properly,
[00:42:17] but rather, oh well,
[00:42:19] I picked this model according to some metric
[00:42:21] and then its explanation of the model.
[00:42:23] And then it just turns out that the explanation actually
[00:42:26] is not robust to the modeling decisions.
[00:42:29] And actually there are plenty of other models
[00:42:31] that gave a very different explanation.
[00:42:34] Yeah.
[00:42:35] I meant like if you actually take something
[00:42:37] like the incumbent of your AutoML system
[00:42:39] and that is an outlier,
[00:42:41] I think in that case,
[00:42:42] that might actually be an issue in tooling
[00:42:42] that might actually be an issue in the tooling
[00:42:45] that we have right now,
[00:42:49] this function of get incumbent, get best performance, whatever.
[00:42:52] And oftentimes you don't necessarily look at things
[00:42:56] like the Pareto curve or distribution of explanations beforehand,
[00:43:00] which kind of leads me into the implication
[00:43:03] this has for AutoML researchers,
[00:43:05] because obviously yeah,
[00:43:06] that's an important point to consider for reviewers
[00:43:08] and practitioners using AutoML.
[00:43:11] But AutoML researchers build a lot of the algorithms
[00:43:14] and also tooling to support all of that, right?
[00:43:17] And if we just think about accuracy performance,
[00:43:20] not necessarily about things like explanations,
[00:43:23] that's maybe not the most useful for people there.
[00:43:25] Is there anything that you would think would be useful takeaways
[00:43:30] in the realms of tooling for AutoML
[00:43:32] that could help support making better model decisions?
[00:43:36] I have a suggestion that if we are able to show
[00:43:41] that by using AutoML
[00:43:44] X-hacking could be done,
[00:43:45] then probably just include the explanation itself
[00:43:49] as a criterion when getting the models trained
[00:43:54] inside the AutoML framework.
[00:43:55] I'm not coming from the field of AutoML,
[00:43:57] but I think if the fairness and interpretability
[00:44:01] in AI is like picking up,
[00:44:03] and the faster it picks up, the faster it becomes imperative
[00:44:06] to include explainability also for the results
[00:44:10] inside the AutoML framework itself.
[00:44:13] And I think there are very few AutoML solutions
[00:44:16] that include some kind of explainability
[00:44:18] as a metric also in them.
[00:44:21] Because some of these explanation metrics
[00:44:23] are more or less computationally costly to compute,
[00:44:26] especially if you want to compute it for every feature.
[00:44:28] But if you said,
[00:44:29] well, there's a particular feature I'm interested in,
[00:44:31] and this is something that I want to report later,
[00:44:34] then maybe your AutoML framework
[00:44:36] can compute that along the way
[00:44:38] so that you're not having to save literally every model,
[00:44:41] but you can at least save a Shapley value
[00:44:44] or a p-value.
[00:44:47] I don't think people necessarily use AutoML
[00:44:49] for models that produce p-values.
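What David describes just above, computing the explanation along the way, is sketched below: a small hook that records, for each fitted candidate, its validation score and the SHAP importance of one pre-declared feature. No particular AutoML framework's callback API is assumed; the function and names are hypothetical.

```python
# Sketch of logging an explanation metric during the search instead of saving every
# model. evaluate_candidate() would be called from your own search loop or from an
# AutoML tool's callback hook (hypothetical; no specific framework API is assumed).
import numpy as np
import shap

EXPLANATION_LOG = []        # (candidate_id, validation_score, mean |SHAP| of the feature)
FEATURE_OF_INTEREST = 3     # declared up front, before the search starts

def evaluate_candidate(candidate_id, fitted_model, X_val, y_val):
    score = fitted_model.score(X_val, y_val)
    vals = np.asarray(shap.TreeExplainer(fitted_model).shap_values(X_val))
    if vals.ndim == 3:                      # some shap versions add a class axis
        vals = vals[..., 1]
    importance = float(np.abs(vals).mean(axis=0)[FEATURE_OF_INTEREST])
    EXPLANATION_LOG.append((candidate_id, score, importance))
    return score                            # the search keeps optimizing its usual metric

# Usage sketch with a single candidate model:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
evaluate_candidate("gbc-default",
                   GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
                   X_val, y_val)
print(EXPLANATION_LOG)
```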
[00:44:52] The other thing that I would possibly suggest,
[00:44:55] although this sort of goes against the grain of AutoML,
[00:44:58] is just having a facility for doing,
[00:45:02] for some definition of uniform,
[00:45:04] a uniform random search of,
[00:45:06] well, here is a general distribution of models
[00:45:10] that I could have considered
[00:45:11] independently of the thing I'm optimizing for.
[00:45:14] And so here's a representative sample
[00:45:16] of all possible model decision combinations.
[00:45:21] And then you can report this histogram,
[00:45:24] because it could be that the metric you're optimizing for
[00:45:27] is also correlated with the thing
[00:45:29] that you want to report for whatever reason.
[00:45:31] So you could say, well, okay, my quest for accuracy
[00:45:36] or some particular measure of performance
[00:45:38] is not biasing my search,
[00:45:40] or you might want to show conversely
[00:45:42] that poor accuracy models give the same explanations
[00:45:46] or different explanations.
[00:45:48] And so just being able to run as a sense check
[00:45:51] a different optimization
[00:45:53] or even a non-optimized random search,
[00:45:56] even though you're looking in a nested hyperparameter space,
[00:45:59] so how can you get a reasonable sample
[00:46:01] and how long would you have to run to do that?
[00:46:04] Then that might just be a nice thing to have.
[00:46:07] But it's not obvious what a random sample
[00:46:12] of the entire decision space would be.
[00:46:15] And yeah, how random it is
[00:46:17] and how representative it is,
[00:46:19] I guess depends on your metric.
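David's non-optimized random search as a sense check might look something like the sketch below: draw configurations uniformly from a declared (tiny, hypothetical) pipeline space, fit them, and keep their scores so the optimized run can later be compared against an unbiased reference sample. The search space and sample size are made up for illustration.

```python
# Sketch of a uniform random sample of a small, hypothetical pipeline space, run
# independently of whatever metric the real AutoML search is optimizing.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

random.seed(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

SEARCH_SPACE = {
    "imputer_strategy": ["mean", "median"],
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 6],
}

for _ in range(10):                                    # uniform draws, no optimization
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    pipe = make_pipeline(
        SimpleImputer(strategy=cfg["imputer_strategy"]),
        RandomForestClassifier(n_estimators=cfg["n_estimators"],
                               max_depth=cfg["max_depth"], random_state=0),
    ).fit(X_tr, y_tr)
    print(cfg, f"accuracy={pipe.score(X_te, y_te):.3f}")
# The same loop could also record the explanation metric for each draw, giving the
# "representative sample of all possible model decisions" mentioned above.
```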
[00:46:22] Yeah, probably also on your landscape,
[00:46:24] because there's discussions of how much you need to sample
[00:46:27] in a space to actually get enough information.
[00:46:30] So I guess that would also be relatively domain specific,
[00:46:33] and then we come back to domain experts
[00:46:35] needing to be able to use AutoML tools well
[00:46:38] and the expertise that might not be there in the overlap.
[00:46:42] But explanations are really only one area
[00:46:46] where I could see this becoming a problem.
[00:46:49] Optimizing for one metric and accidentally
[00:46:52] not looking into the AutoML process
[00:46:55] enough to then pick a good model.
[00:46:58] So for me, I would say, yeah,
[00:47:00] definitely usability of these tools is a big topic.
[00:47:03] But is there also something,
[00:47:06] if we think about external motivations beyond research,
[00:47:10] I mean, you mentioned this as well,
[00:47:12] there's some demands on things like interpretability
[00:47:14] and also fairness and things that are coming from outside,
[00:47:17] like policy, where we should be careful
[00:47:21] with the practices we recommend, right?
[00:47:23] Because as you said, industry now has to start showing baselines.
[00:47:27] Will this lead to more either accidental
[00:47:31] or obvious manipulations in AutoML?
[00:47:33] Is there something the community can do against that?
[00:47:37] If you were to treat fairness as just another metric
[00:47:40] to optimize for, then it's very easy to say,
[00:47:43] well, I've ticked the box and I've optimized for this metric,
[00:47:46] but there's already research that looks into,
[00:47:49] well, you can say that you've got this fairness metric,
[00:47:52] but actually the underlying model is perhaps not so fair
[00:47:56] compared to alternative explanations.
[00:47:59] I think it all comes down to you're optimizing for something
[00:48:03] other than raw predictive performance.
[00:48:08] But do you actually incorporate that
[00:48:12] into your loss function somehow?
[00:48:14] Or do you limit the search space somehow
[00:48:17] because there are certain classes of models
[00:48:19] that you would know in advance are going to be problematic?
[00:48:22] Or do you say, oh, well, I'm not going to include deep learning
[00:48:25] because deep learning is harder to explain
[00:48:27] than a tree-based model or something like that.
[00:48:32] But that's not really...
[00:48:34] I mean, that's a decision that you would make at the start anyway,
[00:48:37] when designing your search space.
[00:48:40] I think in terms of reporting,
[00:48:44] you could think about models that are easier to explain than others.
[00:48:50] But anything else I think throughout the AutoML pipeline
[00:48:54] eventually comes down to something that's quantitative
[00:48:56] that you can code into the model somehow.
[00:48:59] And fairness, you could argue, is a special case of explainability
[00:49:03] depending on how you measure it.
[00:49:05] So it sort of comes back around to the original problem
[00:49:08] that we were trying to highlight.
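To make the "fairness as just another metric" point concrete, here is a minimal sketch under entirely made-up assumptions: a synthetic dataset, a sensitive attribute derived from one feature, a hand-picked penalty weight, and a search space restricted in advance to easier-to-explain model classes. Every candidate is scored on accuracy and on a demographic-parity gap, and both numbers are reported rather than only the one being optimised.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=1)
sensitive = (X[:, 0] > 0).astype(int)  # pretend feature 0 marks a protected group
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(X, y, sensitive, random_state=1)

def demographic_parity_gap(y_pred, s):
    # |P(y_hat = 1 | group 0) - P(y_hat = 1 | group 1)|
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())

# Search space deliberately limited to model classes that are easier to explain
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=3),
              DecisionTreeClassifier(max_depth=6)]

for model in candidates:
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = (y_pred == y_te).mean()
    gap = demographic_parity_gap(y_pred, s_te)
    # Report both; the combined score shows one way to fold fairness into the objective
    print(f"{type(model).__name__:25s} acc={acc:.3f}  dp_gap={gap:.3f}  "
          f"combined={acc - 0.5 * gap:.3f}")
```

The penalised "combined" score is only one option; the discussion above is exactly about whether such a quantity belongs in the loss function, in a constraint on the search space, or only in the reporting.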
[00:49:11] Yeah, it also comes back around to the fact that
[00:49:14] we can not only get around reproducibility checklists if we want to,
[00:49:17] we can probably also get around legislation in the same way.
[00:49:21] So what would you say then is a good way forward
[00:49:25] for dealing with the danger of X hacking
[00:49:27] if it's really hard to give guidelines?
[00:49:30] Because you also say, and we've talked about why,
[00:49:33] that fraud detection systems are likely not the way to deal with it.
[00:49:37] Do you see just more work into the topic showing
[00:49:40] the danger a bit more?
[00:49:42] Do you see maybe more education on the side of AutoML researchers
[00:49:46] how to properly use the tools?
[00:49:48] Do you see maybe even some sort of standard expectations
[00:49:52] of what we see in papers going forward?
[00:49:54] What would be your hopes?
[00:49:57] Well, yeah.
[00:49:59] So there could be like very intensive work
[00:50:02] that could be done to do actual fraud detection
[00:50:06] inside all of these pipelines.
[00:50:08] I don't know how realistic that is,
[00:50:10] but I think it comes down to asking more questions
[00:50:14] during the review process and in your own research
[00:50:19] with the peers that you're working with:
[00:50:21] how many questions can I ask at every stage?
[00:50:24] I think you can do that on your own.
[00:50:28] But we also tend to assume that nobody is going to produce hacked results.
[00:50:35] Do we trust them, and do we need to verify them?
[00:50:40] From a bird's eye view,
[00:50:42] these are the kinds of questions that need to be asked.
[00:50:45] And then if you feel, after reading a paper,
[00:50:48] that enough has been provided to verify
[00:50:52] or to reproduce all the results,
[00:50:55] not that people necessarily do that,
[00:50:58] but if you feel, okay, I trust all of these results,
[00:51:01] then that is the reviewer's,
[00:51:04] or the human, point of view.
[00:51:06] And then there is the automation point of view:
[00:51:08] people can work with the latest available technology
[00:51:12] to do some kind of technical fraud detection,
[00:51:16] and that would be another way to find out
[00:51:23] whether some X-hacking
[00:51:25] or any other kind of malicious thing has been done.
[00:51:28] But I think in the current scenario,
[00:51:31] in the AutoML community,
[00:51:34] it's important to talk about the fact
[00:51:36] that this could be done at every stage of the pipeline,
[00:51:39] not just in the data pre-processing step,
[00:51:45] but also in model selection and hyperparameter selection.
[00:51:49] And yeah, I think it's more about talking about it
[00:51:52] and being aware that X-hacking
[00:51:55] could be a major issue if the target that we are working towards,
[00:52:01] or the problem that we have at the end,
[00:52:04] be it education or policy making,
[00:52:07] could suffer an adverse effect
[00:52:09] when AutoML is being used,
[00:52:11] or when the practice of X-hacking is used more generally.
[00:52:14] And I think in terms of education,
[00:52:17] it would be a case of, whenever you're teaching about explainable AI,
[00:52:22] that you also mention the possible downsides.
[00:52:26] If you're teaching about AutoML,
[00:52:28] then you could mention the issues of fairness and explainability.
[00:52:32] And if people with a more traditional statistical education
[00:52:35] in the biosciences are being taught about machine learning,
[00:52:38] then obviously you'd want to teach about these parallels
[00:52:43] between p-hacking
[00:52:45] and the possibility of fairwashing and explainability hacking.
[00:52:50] But ultimately, I think it also comes down to this:
[00:52:54] it's all well and good for us in a computer science department
[00:52:58] to talk about these things,
[00:53:00] but perhaps the greatest risk is in an applied domain
[00:53:05] where people haven't done 10 courses in computer science,
[00:53:09] but they've done a couple of machine learning courses
[00:53:13] and then they've got this AutoML tool and say,
[00:53:15] oh, it makes everything easier.
[00:53:17] How can you reach these people who might,
[00:53:20] not through malice,
[00:53:22] but perhaps through ignorance or poor training,
[00:53:24] use AutoML and then put it out in a paper
[00:53:27] where maybe there's less scrutiny?
[00:53:29] So do we need a machine learning expert on review panels
[00:53:33] for a medical journal, for example,
[00:53:36] or conversely, do you need someone who's an expert
[00:53:39] in fairness, explainability, causality
[00:53:43] on a machine learning panel that wouldn't otherwise deal with it?
[00:53:47] So everybody should talk to each other more.
[00:53:51] I think that's a great note to end this on.
[00:53:54] I very much agree with that.
[00:53:56] Thank you two for giving us a look into this really interesting paper on X-hacking.
[00:54:00] And I do hope that the AutoML community will talk about
[00:54:03] these dangers of X-hacking, of fairwashing, and everything a bit more
[00:54:07] and be able to also reach out to other communities
[00:54:10] and just all produce better science in the end
[00:54:13] because that's what we want, right?
[00:54:15] So if people want to ask you more questions about this,
[00:54:18] where can they find you best?
[00:54:20] They can find us through our website,
[00:54:23] which is data-sci-ax.de.
[00:54:26] We're also on Twitter.
[00:54:28] And you can find out more about the DFKI at dfki.de.
[00:54:32] We're in the Data Science and its Applications group.
[00:54:35] And we are hiring, so please get in touch.
[00:54:39] That's good to know for some of our listeners, I hope.
[00:54:42] Well, thank you very much.
[00:54:44] And for everyone listening at home,
[00:54:46] I hope you tune in to the next episode as well.
[00:54:48] See you then.