Machine learning systems are heavily dependent on data, unlike traditional software systems, which rely on hand-written logic. For many use cases, machine learning systems are simply not built because large-scale real data for training is unavailable. This has motivated researchers to find novel methods of generating training data for those use cases. The key consideration in generating training data is to ensure that the resulting machine learning systems perform well not just in training and testing environments but in real environments too. These novel methods are mostly simple, yet they enable rapid construction of bulk training datasets, which has propelled machine learning research in the corresponding domains.

A novel technique for generating bulk training datasets

Information can be stored in structured (tables), unstructured (emails) and semi-structured (invoices) text formats. The majority of organizational knowledge is contained in unstructured text (emails and documents). There is no simple way to extract information from an unstructured text corpus. Traditional approaches treat the unstructured text as if it were structured and apply hand-written rules to extract from it. These rules generalize poorly, because unstructured text is unstructured in many different ways. This created demand for systems that can read and comprehend an unstructured text corpus and then extract the answer to a query from that context (the text corpus).

Machine learning systems for this use case were not available because there were no large-scale training datasets of (context, query, answer) triples. At the Neural Information Processing Systems (NeurIPS) conference in 2015, a team of researchers, Karl Moritz Hermann, Edward Grefenstette, Lasse Espeholt, Will Kay and Mustafa Suleyman from Google DeepMind, and Tomáš Kočiský and Phil Blunsom from the University of Oxford, presented a novel technique in their paper Teaching Machines to Read and Comprehend for building real natural language training datasets. The technique transforms news stories from the CNN and Daily Mail websites into datasets of (context, query, answer) triples.

The technique takes a news story, for example one titled 'RAISE 2020 marked the new epoch of India's AI journey', as input. The story body is used as the context, and the title is processed to form the query and its answer.

A machine learning-based entity detection system can detect the names of persons, places, organizations, quantities and so on in text. The entity detection system is applied to the title to identify its entities, which in this case yields 'RAISE 2020'. The entity is then replaced with a placeholder in the title to form the query, whose answer is 'RAISE 2020'.
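The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: `detect_entities` is a toy stand-in that looks entities up from a hand-supplied list, whereas the paper used a real entity detection system, and the `@placeholder` token name is chosen here for illustration.

```python
def detect_entities(text, known_entities):
    """Toy stand-in for an entity detection system: return the
    entities from a hand-supplied list that appear in the text."""
    return [e for e in known_entities if e in text]

def make_triples(context, title, known_entities):
    """Build (context, query, answer) triples: each detected entity
    in the title is replaced with a placeholder to form a query,
    and the removed entity becomes the answer."""
    triples = []
    for entity in detect_entities(title, known_entities):
        query = title.replace(entity, "@placeholder")
        triples.append((context, query, entity))
    return triples

story = "RAISE 2020, a global summit on artificial intelligence, ..."
title = "RAISE 2020 marked the new epoch of India's AI journey"
triples = make_triples(story, title, known_entities=["RAISE 2020", "India"])
# One triple per detected entity, e.g.
# (story, "@placeholder marked the new epoch of India's AI journey", "RAISE 2020")
```

Note that a single title can yield several triples, one per detected entity, which is part of why the method scales to bulk dataset construction.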

This technique generated about a million data records (triples) from the CNN and Daily Mail websites. The titles of news stories on CNN and Daily Mail are abstractive rather than extractive, so the generated datasets mimic real natural language datasets. This triggered the development of a new class of machine learning model, the Question-Answering (QA) model, which is trained to extract the answer to a query from the context. QA-model-based machine learning systems have been deployed in many places to extract information from emails, manuals and text transcripts of conversations.

Interesting approaches to generating data

A machine learning-based gun detection system is trained with recordings from staged 'active shooter' events. Translation pairs of 500,000 documents (books, dictionaries, newspapers and online news) in English and their Vietnamese translations produced by professional translators are used to train an English-Vietnamese translator. Camera trajectories generated from YouTube videos are used for research in 3D computer vision. Machine learning of 3D Structure from Motion uses videos recorded with a handheld camera while riding a bicycle.
