Crowd Embeddings
Type: Case Study
Datasets: Wikipedia edits
Technique: Supervised machine learning, word embeddings
Developed by: Jigsaw
Case Study: Perspective API
Screenshot of the Perspective API website (October, 2017)
Perspective API is a machine learning tool developed by the Google-owned company Jigsaw, which aims to identify toxic messages in the comments section of various platforms. The project has worked in collaboration with Wikipedia, The New York Times, The Guardian, and The Economist.
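As an illustration of what the tool does in practice, below is a minimal sketch of scoring a single comment through the API's AnalyzeComment endpoint. The API key, the example comment, and the lack of error handling are assumptions for the sake of the example, not details taken from the Detox project.

```python
# Minimal sketch: request a TOXICITY score for one comment from Perspective API.
# The API key and the example comment are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, obtained from Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)

payload = {
    "comment": {"text": "You are a complete idiot."},
    "languages": ["en"],
    "requestedAttributes": {"TOXICITY": {}},
}

response = requests.post(URL, json=payload)
response.raise_for_status()

# The response nests the score under attributeScores -> TOXICITY -> summaryScore.
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.2f}")  # a value between 0 and 1
```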
The collaboration between Perspective API and Wikipedia lives under the name Detox. The project is based on a method that combines crowdsourcing and machine learning to analyse personal attacks at scale. Two intentions seem to be at stake: research into harassment on Wikipedia's Talk pages, and the creation of the largest annotated dataset on harassment.
The project uses supervised machine learning techniques, a logistic regression algorithm, and two datasets (a sketch of this setup follows the list below):
- 95M comments from English Wikipedia talk pages made between 2001 and 2015
- 1M annotations by 4000 crowd workers on 100,000 comments from English Wikipedia talk pages, where each comment is annotated 10 times.
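To make the training setup concrete, here is a hedged sketch of how a logistic regression attack classifier could be trained on aggregated crowd annotations. The file name, column names, majority-vote aggregation, and the use of scikit-learn with character n-gram features are illustrative assumptions, not a description of the actual Detox codebase.

```python
# Illustrative sketch only: train a logistic regression classifier on
# crowd-annotated comments, in the spirit of the Detox setup.
# File and column names are assumptions, not the project's actual schema.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical annotations file: one row per (comment, worker) judgement,
# with columns rev_id, comment, attack (0/1).
annotations = pd.read_csv("attack_annotations.csv")

# Each comment was judged ~10 times; aggregate the judgements by majority vote.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments = annotations.groupby("rev_id")["comment"].first()

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=0)

# Character n-gram features feeding a logistic regression classifier.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=10000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print("Held-out accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```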
Findings from the paper published by Jigsaw & Wikipedia:
- This leads to several interesting findings: while anonymously contributed comments are 6 times more likely to be an attack, they contribute less than half of the attacks. Similarly less than half of attacks come from users with little prior participation; and perhaps surprisingly, approximately 30% of attacks come from registered users with over 100 contributions.
- Moreover, the crowdsourced data may also result in other forms of unintended bias.
This brings up key questions for our method and more generally for applications of machine learning to analysis of comments: who defines the truth for the property in question? How much do classifiers vary depending on who is asked? What is the subsequent impact of applying a model with unintended bias to help an online community’s discussion?
The Detox project includes a section on biases, which is published under the name "Fairness".