Normalization of Noisy Text

Abstract:

This project is based on the ACL 2015 Shared Task (see http://noisy-text.github.io/norm-shared-task.html). User-generated content (UGC), such as the text of Twitter messages, is notoriously varied in content and composition, often containing ungrammatical sentence structures, non-standard words and domain-specific entities. Accuracy drops on UGC data have been observed in many NLP tasks, which motivates methods that normalise the content before NLP tools are applied to it.

This task focuses on text normalisation: mapping non-standard words in English Twitter messages to their canonical forms. Specifically, we aim to correct non-standard spellings (e.g., toook for took), expand informal abbreviations (e.g., tmrw for tomorrow), and normalise phonetic substitutions (e.g., 4eva for forever).
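
To make the three kinds of correction concrete, here is a minimal Python sketch of a lookup-based baseline. It is only an illustration, not the method prescribed by the shared task: the LEXICON and VOCAB tables are tiny hypothetical stand-ins for resources that a real system would build from annotated data and a full dictionary, and the function names are invented for this example. Tokens are lowercased and split on whitespace for simplicity.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
# LEXICON maps known non-standard forms to canonical forms.
LEXICON = {"tmrw": "tomorrow", "4eva": "forever"}
# VOCAB is a stand-in for a dictionary of canonical English words.
VOCAB = {"i", "it", "you", "see", "love", "took"}

# Patterns tried in order: shorten runs of 3+ identical characters to 2,
# then shorten any run of identical characters to 1.
SQUEEZE_RULES = ((r"(.)\1{2,}", r"\1\1"), (r"(.)\1+", r"\1"))


def normalise_token(token):
    """Return a canonical form for one token, or the token itself if unknown."""
    word = token.lower()
    if word in VOCAB:
        return word  # already a standard word
    if word in LEXICON:
        return LEXICON[word]  # abbreviation or phonetic substitution
    for pattern, replacement in SQUEEZE_RULES:
        candidate = re.sub(pattern, replacement, word)
        if candidate in VOCAB:
            return candidate  # repeated-letter misspelling, e.g. "toook"
    return word  # leave anything else untouched


def normalise(text):
    """Normalise a whitespace-tokenised message token by token."""
    return " ".join(normalise_token(token) for token in text.split())


if __name__ == "__main__":
    print(normalise("i toook it"))
    print(normalise("see you tmrw"))
    print(normalise("love you 4eva"))
```

A baseline like this only handles forms it has already seen or that differ from a dictionary word by repeated letters. Unseen variants and context-dependent cases are where the machine learning listed in the prerequisites comes in.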

Project prerequisites:

Basics of natural language processing (ready to present)
Basics of machine learning
Working with text corpora (ready to present)
Programming language: Python

Associated topics:

natural language processing, machine learning

Planned lectures:

Basics of NLP
Working with text corpora

About lecturer:

Mr. Vsevolod Dyomkin,
Grammarly Inc.

X Summer School

Achievements and Applications of Contemporary Informatics, Mathematics and Physics
August 4-18, 2015, Kyiv (Ukraine)
