Data Workers: Difference between revisions
From Algolit
Revision as of 16:44, 6 February 2019
About
This exhibition shows a selection of algoliterary works made by members of Algolit, an artistic research group with a focus on F/LOSS code and texts, based in Brussels. While artificial intelligences are being created to serve, entertain, record, and know us, they are usually hidden behind interfaces. In these works of fiction, the algorithmic storytellers leave the invisible underworld to become interlocutors. The works give voice to the robots: algorithmic models that read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts thereafter. This exhibition is an attempt to grasp and multiply voices which are absent in our representations of the world. It allows the robots to enter into dialogue with us humans. It allows us to understand their reasoning, to demystify their behaviour, to encounter their personalities, without having to study intensively for years. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.
Stations
The origins of the Mundaneum go back to the late nineteenth century. The project was created by two young Belgian jurists, Paul Otlet (1868-1944), the father of documentation, and Henri La Fontaine (1854-1943), Nobel Peace Prize winner. It aimed to gather all the world’s knowledge and file it using the Universal Decimal Classification (UDC) system that they had created. At first it was an international institutions bureau dedicated to knowledge and fraternity. In the 20th century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates, postcards and other bibliographic cards. These were put together and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The archive only moved to Mons in 1998.
Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at world level, the institutions of intellectual work: libraries, museums and universities. This project was never realised. The Mundaneum project soon faced the scale of the technical development of its era. It suffered from its own utopia. The Mundaneum is the result of a visionary dream. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, the collection is very fragmented and incomplete.
The same can be said for artificial intelligences today. Reading about them, one finds that the visionary dream has been there since the beginning of their development in the 1950s.
Nowadays the promise has attained mythical dimensions. When looking at the concrete applications, the collection is truly innovative and fascinating, but rather fragmented and incomplete.
Works:
- Algoliterator
Writers
Data workers need data to work with. The data that is used in the context of this exhibition is written language. Where does it come from? Who is writing? Machine learning relies on many types of writing. We could say that every human being who has access to the internet is an algorithm writer each time they interact with it by adding reviews, writing Wikipedia articles, or writing emails. Machine learning algorithms are not critics: they take whatever they are given, no matter the writing style, no matter the CV of the author, no matter the spelling mistakes. In fact, mistakes make it better: the more variety, the better it can anticipate. Sometimes, the authors are not particularly aware of what happens to their oeuvre: offline material, such as printed literature, is digitized too and turned into prediction fodder. Some writing is in English, some in French, and some in Python. The latter is done by writers with intent, the programmers who scrawled the code we're now discussing. The algorithm can be a writer too: some neural networks write their own rules. And for the rest, the code that is still wrestling with the subtleties of human language, there are human editors who take over. Poets, playwrights and novelists start their exciting new careers as ventriloquists for AI assistants.
Works:
- Data Workers Podcast
Cleaners
Algolit chooses to work with texts that are free of copyright. This means that they are published under a Creative Commons 4.0 license (which is rare), or that they are in the public domain because the author died more than 70 years ago. This condition has the great advantage that we can use texts without asking permission or giving explanations, that we often discover unprecedented pearls and that we help to make datasets available for others online. The disadvantage is that we cannot use epubs or other contemporary text formats, and we are often at the mercy of cleaning up documents. We are not alone in this.
Books are scanned at high resolution, page by page. This is intense human work and often the reason why archives and libraries transfer their collections to a commercial company like Google. The photos are converted into text via OCR (Optical Character Recognition), software that recognizes letters but often makes mistakes. Improving the texts is, again, intense human work. This is work for freelancers via low-paid platforms like Mechanical Turk, or for volunteers, such as the community around the Gutenberg Proofreaders, who do fantastic work. Whoever does it and wherever it is done, cleaning up texts is a huge job for which there is no structural automation yet.
Works:
- Cleaning for Poems
Informants
All machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extricate relations from: here is a collection of texts that is about cooking. What can you tell me about units of measurement?
The study material should be chosen carefully: as we know, a badly written textbook can lead a student to forfeit the whole subject. A good textbook is preferably not a textbook at all. This is where the dataset comes in: neatly arranged, well disciplined rows and columns lining up on the screen waiting to be read by the machine. (What would be the equivalent of close reading for a machine?)
Each dataset collects different information about the world and, like all collections, it is imbued with the collector's bias. You will hear this expression very often: data is the new oil. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is clean. With each process, each questionnaire, each column title, each CSS selector scraped, it becomes cleaner and cleaner and finds its residence within the performative logic of the dataset.
Some datasets combine the logic of the machine with the logic of humans. The algorithms which require supervision multiply the subjectivities of both data collectors and annotators, then propel and propagate what they've been taught. You will listen to extracts of some of the datasets that pass as default in the machine learning field, as well as other stories of human teaching.
Works:
- An Ethnography of Datasets
- The Annotator
Readers
We communicate with computers through language. We click on icons that have a description in words, we tap words on keyboards, we use our voice to give them instructions. Sometimes we believe that the computer can read our thoughts and forget that it is an extensive calculator. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital "A" is 65, or 01000001 in binary.
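As a small illustration of how letters become numbers, the following Python sketch prints the character code of each letter in a word, in decimal and as eight binary digits. It relies only on the built-in `ord` and `format` functions; the helper name `char_codes` and the sample word are chosen for this example and do not come from the exhibition.

```python
def char_codes(text):
    """Return (character, decimal code, 8-bit binary string) for each character."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in text]


# The first line printed is: A 65 01000001
for ch, number, bits in char_codes("Ada"):
    print(ch, number, bits)
```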
In all models, whether rule-based, classical machine learning or neural networks, words undergo a different type of translation into numbers in order to capture the semantic meaning of language. This is done by counting. Some models count the frequency of single words, some count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs, or noun or verb phrases; this is called part-of-speech analysis. Some just replace words in a text by an index number, which also speeds up processing. In the following section we present a few technologies to do so.
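The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the exhibition: the tokenizer is deliberately naive, the function names are invented for this example, and part-of-speech counting is left out because it would require a trained tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(tokens):
    """Count the frequency of single words."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count the frequency of pairs of neighbouring words."""
    return Counter(zip(tokens, tokens[1:]))


def to_indices(tokens):
    """Replace each word by an index number, assigned in order of first appearance."""
    vocabulary = {}
    indices = []
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
        indices.append(vocabulary[token])
    return indices, vocabulary


tokens = tokenize("The machine reads the text and the machine counts")
print(word_counts(tokens).most_common(2))
print(bigram_counts(tokens).most_common(1))
print(to_indices(tokens)[0])
```

The index representation is the most compact of the three: once the vocabulary is built, the text itself is just a list of integers.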
Works:
Learners
A machine learns from the data it reads. This learning is also called the training & test phase. The machine searches for patterns in the data by reducing it, for example to its most common or unique words. It does this by making a series of calculations according to existing formulas. The formulas, or 'classifiers', often have a long history, which is embedded in mathematics and statistics.
In software packages you don't get to see the individual personality of the classifiers. They are packed in underlying modules or libraries, which you can call up as a programmer with one line of code. For this exhibition, we have therefore developed three party games that show the learning process of three simple, but frequently used classifiers and their evaluators.
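To show what such a classifier does when it is not hidden behind one line of library code, here is a minimal Naive Bayes sketch in plain Python. It is an illustration under stated assumptions, not the implementation behind the exhibition's games: the function names `train` and `predict`, the add-one smoothing and the four tiny training sentences are all chosen for this example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and words per label from (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(model, words):
    """Return the label with the highest log-probability for the given words."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how often this label occurred in training.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen word/label pairs from scoring zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch.
training_data = [
    ("a wonderful and fascinating book".split(), "positive"),
    ("wonderful poems and a fascinating story".split(), "positive"),
    ("a boring and incomplete book".split(), "negative"),
    ("boring story and a terrible ending".split(), "negative"),
]
model = train(training_data)
print(predict(model, "a fascinating story".split()))
print(predict(model, "a terrible book".split()))
```

Everything the classifier "knows" is contained in the word counts per label, which is why the choice of training texts matters so much.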
Works:
- Naive Bayes game
- Perceptron game
- Linear Regression game
Oracles
Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing, or 'Natural Language Processing', 'NLP' for short. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, and for spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have a say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and question answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
Here you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
Works:
- Reverse Algebra
- Pos/Neg Classifier Model
- Topic Analysis Naive Bayes books