Creating a Ukrainian-Language NER System
Abstract:
Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as names of persons, organizations, and locations, time expressions, quantities, monetary values, percentages, etc.
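To make the task concrete, here is a minimal sketch of what "locate and classify" can look like in practice. It labels tokens with the common BIO scheme (B- begins an entity, I- continues it, O is outside any entity) using a tiny dictionary lookup. The gazetteer entries, the label names PER/LOC/ORG, and the function name `bio_tag` are illustrative assumptions, not part of the project materials, and a dictionary lookup is only a stand-in for a real recognizer.

```python
# Toy gazetteer (hypothetical data): entity phrase as a token tuple -> category.
GAZETTEER = {
    ("Тарас", "Шевченко"): "PER",
    ("Київ",): "LOC",
    ("Grammarly",): "ORG",
}


def bio_tag(tokens, gazetteer=GAZETTEER):
    """Label tokens with BIO tags using longest-match gazetteer lookup."""
    tags = ["O"] * len(tokens)
    max_len = max(len(phrase) for phrase in gazetteer)
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            # No phrase matched here; leave the token as "O" and move on.
            i += 1
    return tags


# "Taras Shevchenko visited Kyiv."
tokens = ["Тарас", "Шевченко", "відвідав", "Київ", "."]
for token, tag in zip(tokens, bio_tag(tokens)):
    print(token, tag)
```

A lookup like this only matches exact surface forms, so it misses inflected names, which is one reason a learned model is needed for Ukrainian.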
NER is a popular NLP task, and the main challenge in building a robust NER system is access to a sufficiently large corpus of annotated data. Such data is not available for every language, and Ukrainian is one of those that lack it, but unsupervised and semi-supervised approaches may help fill the gap.
We will use the unannotated Ukrainian language corpus (https://github.com/mariana-scorp/lt-project) as a starting point. We will need to develop some of our own datasets and annotations, and will either adapt one of the existing NER algorithms or come up with our own variation.
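One family of semi-supervised approaches that fits this setting is seed-based bootstrapping: start from a handful of known names, learn which context words tend to precede them in unannotated text, then use those contexts to find new names. The sketch below is only an illustration of that idea under simplifying assumptions, not the method the project will necessarily adopt. The function name `bootstrap`, its parameters, and the three example sentences are hypothetical; it uses a single left-context word as a pattern and capitalization as the only candidate filter.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_pattern_hits=2):
    """Grow a set of entity names from seeds using shared left-context words."""
    known = set(seeds)
    for _ in range(rounds):
        # 1. Count the words that immediately precede already-known entities.
        contexts = Counter(
            sent[i - 1]
            for sent in sentences
            for i in range(1, len(sent))
            if sent[i] in known
        )
        # Keep only context words seen often enough to be trusted.
        patterns = {w for w, c in contexts.items() if c >= min_pattern_hits}
        # 2. A capitalized token right after a trusted context word
        #    becomes a new candidate entity.
        found = {
            sent[i]
            for sent in sentences
            for i in range(1, len(sent))
            if sent[i - 1] in patterns and sent[i][:1].isupper()
        }
        if found <= known:
            break  # nothing new was discovered; stop early
        known |= found
    return known


# Hypothetical pre-tokenized sentences; "місто" means "city".
sentences = [
    ["Ми", "відвідали", "місто", "Київ"],
    ["Він", "любить", "місто", "Львів"],
    ["Вона", "описала", "місто", "Харків"],
]
locations = bootstrap(sentences, seeds={"Київ", "Львів"})
print(sorted(locations))
```

Here the two seeds make "місто" a trusted context, which in turn surfaces "Харків". On a real corpus such a loop drifts quickly without stricter pattern scoring, and inflected forms would need lemmatization or morphological analysis first.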
Recommended reading:
- Coursera NLP course - Week 4, Named entity recognition and Maximum Entropy Sequence Models
- A survey of named entity recognition and classification
- Learning a Part-of-Speech Tagger from Two Hours of Annotation
- Design Challenges and Misconceptions in Named Entity Recognition - advanced
Project prerequisites:
- Basics of natural language processing (ready to present)
- Basics of machine learning
- Linear classification models
- Semi-supervised and unsupervised ML approaches
- Working with text corpora (ready to present)
- Programming language: Python
Associated topics:
natural language processing, semi-supervised and unsupervised machine learning
Planned lectures:
- Basics of NLP
- Working with text corpora
About lecturer:
Mr. Vsevolod Dyomkin,
Grammarly Inc.