Normalization of Noisy Text
Abstract:
This project is based on the ACL 2015 Shared Task (see http://noisy-text.github.io/norm-shared-task.html). User-generated content (UGC), such as the text of Twitter messages, is notoriously varied in content and composition, and often contains ungrammatical sentence structures, non-standard words and domain-specific entities. Many NLP tools have been observed to lose accuracy on UGC data, which motivates methods that normalise the content before NLP tools are applied to it.
This task focuses on text normalisation: mapping non-standard words in English Twitter messages to their canonical forms. Specifically, we aim to correct non-standard spellings (e.g., toook for took), expand informal abbreviations (e.g., tmrw for tomorrow), and normalise phonetic substitutions (e.g., 4eva for forever).
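To make the three kinds of normalisation concrete, here is a minimal Python sketch of a lookup-based baseline. It is only an illustration, not the shared task's reference method: the lexicon, the vocabulary and the function names (NORMALISATION_LEXICON, squeeze_repeats, normalise_token, normalise) are all hypothetical, and a real system would use a far larger lexicon or a learned model.

```python
import re

# Hypothetical lookup table for abbreviations and phonetic substitutions.
# The entries are the examples from the task description, nothing more.
NORMALISATION_LEXICON = {
    "tmrw": "tomorrow",
    "4eva": "forever",
}


def squeeze_repeats(token, keep):
    """Shorten every run of 3+ identical characters to `keep` characters."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, token)


def normalise_token(token, vocabulary):
    """Return a canonical form for `token`, or `token` itself if none is found."""
    lowered = token.lower()
    # 1. Abbreviations and phonetic substitutions: direct lexicon lookup.
    if lowered in NORMALISATION_LEXICON:
        return NORMALISATION_LEXICON[lowered]
    # 2. Tokens already in the vocabulary are left untouched.
    if lowered in vocabulary:
        return token
    # 3. Elongated spellings: try squeezing repeated letters to 2, then to 1.
    for keep in (2, 1):
        candidate = squeeze_repeats(lowered, keep)
        if candidate in vocabulary:
            return candidate
    # 4. Unknown token: leave it as it is rather than guess.
    return token


def normalise(tokens, vocabulary):
    """Normalise a tokenised message one token at a time."""
    return [normalise_token(t, vocabulary) for t in tokens]


if __name__ == "__main__":
    vocab = {"i", "took", "it"}  # toy vocabulary, assumed for the example
    print(normalise(["i", "toook", "it", "tmrw", "4eva"], vocab))
```

Leaving unknown tokens unchanged is a deliberate choice in this sketch: many tokens in Twitter data (names, hashtags, domain-specific entities) are out of vocabulary but should not be altered.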
Recommended reading:
Project prerequisites:
- Basics of natural language processing (ready to present)
- Basics of machine learning
- Working with text corpora (ready to present)
- Programming language: Python
Associated topics:
natural language processing, machine learning
Planned lectures:
- Basics of NLP
- Working with text corpora
About lecturer:
Mr. Vsevolod Dyomkin,
Grammarly Inc.