TY - GEN
T1 - Data Poisoning Attacks against Outcome Interpretations of Predictive Models
AU - Zhang, Hengtong
AU - Gao, Jing
AU - Su, Lu
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/8/14
Y1 - 2021/8/14
N2 - The past decades have witnessed significant progress towards improving the accuracy of predictions powered by complex machine learning models. Despite much success, the lack of model interpretability prevents the usage of these techniques in life-critical systems such as medical diagnosis and self-driving systems. Recently, the interpretability issue has received much attention, and one critical task is to explain why a predictive model makes a specific decision. We refer to this task as outcome interpretation. Many outcome interpretation methods have been developed to produce human-understandable interpretations by utilizing intermediate results of the machine learning models, such as gradients and model parameters. Although the effectiveness of outcome interpretation approaches has been shown in a benign environment, their robustness against data poisoning attacks (i.e., attacks at the training phase) has not been studied. As the first work towards this direction, we aim to answer an important question: Can training-phase adversarial samples manipulate the outcome interpretation of target samples? To answer this question, we propose a data poisoning attack framework named IMF (Interpretation Manipulation Framework), which can manipulate the interpretations of target samples produced by representative outcome interpretation methods. Extensive evaluations verify the effectiveness and efficiency of the proposed attack strategies on two real-world datasets.
KW - adversarial learning
KW - model interpretation
UR - https://www.scopus.com/pages/publications/85114910094
U2 - 10.1145/3447548.3467405
DO - 10.1145/3447548.3467405
M3 - Conference contribution
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 2165
EP - 2173
BT - KDD 2021 - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021
Y2 - 14 August 2021 through 18 August 2021
ER -