Creation of a Ukrainian-language NER system
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.
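To make the "locate and classify" definition concrete, here is a minimal sketch in Python of what NER output commonly looks like. It assumes the widely used BIO tagging scheme (B- marks the first token of an entity, I- a continuation, O a non-entity token). The function name `bio_to_spans`, the label names PER, LOC, and DATE, and the hand-tagged example sentence are illustrative assumptions, not part of the project.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens and current_label is not None:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == current_label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif tag.startswith(("B-", "I-")):
            # A new entity starts (a stray I- tag is treated like B-).
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        else:
            # "O" tag: close any open entity.
            flush()
            current_tokens = []
            current_label = None
    flush()
    return spans


# Hand-tagged illustration: "Taras Shevchenko was born in Moryntsi in 1814".
tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях", "у", "1814", "році"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "B-DATE", "O"]

print(bio_to_spans(tokens, tags))
# [('Тарас Шевченко', 'PER'), ('Моринцях', 'LOC'), ('1814', 'DATE')]
```

In this framing, an NER system is whatever produces the `tags` sequence; the grouping step above only turns those per-token decisions into readable entities.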
NER is a popular NLP task, and the main challenge in building a robust NER system is access to a sufficiently large corpus of annotated data. Such data is not available for every language, Ukrainian in particular, but unsupervised and semi-supervised approaches offer a potential way around this.
We will use the unannotated Ukrainian language corpus (https://github.com/mariana-scorp/lt-project) as a starting point. We will need to develop some datasets and annotations of our own, and either adapt one of the existing NER algorithms or come up with our own variation.
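One way a semi-supervised approach could start from unannotated text is seed-based bootstrapping: begin with a handful of known entity names, learn which context words tend to precede them, and then use those contexts to propose new names. The Python sketch below is a toy illustration of that idea only, not the algorithm the project will adopt. The function `bootstrap_gazetteer`, its parameters, and the example sentences are all hypothetical. It also matches surface forms directly, whereas a real system for Ukrainian would likely need lemmatization to cope with case inflection.

```python
from collections import Counter
from typing import List, Set


def bootstrap_gazetteer(
    sentences: List[List[str]],
    seeds: Set[str],
    rounds: int = 2,
    min_count: int = 2,
) -> Set[str]:
    """Grow a set of entity tokens from seeds using left-context words."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: count the words that directly precede known entities.
        contexts: Counter = Counter()
        for sent in sentences:
            for prev, tok in zip(sent, sent[1:]):
                if tok in entities:
                    contexts[prev.lower()] += 1
        trusted = {word for word, n in contexts.items() if n >= min_count}

        # Step 2: capitalized tokens after a trusted context become candidates.
        new = set()
        for sent in sentences:
            for prev, tok in zip(sent, sent[1:]):
                if prev.lower() in trusted and tok[:1].isupper() and tok not in entities:
                    new.add(tok)

        if not new:
            break  # nothing learned in this round, so stop early
        entities |= new
    return entities


# Toy corpus (hypothetical): three sentences with "в" + a city name, one without.
sentences = [
    ["Я", "живу", "в", "Києві"],
    ["Вона", "працює", "в", "Львові"],
    ["Ми", "були", "в", "Одесі"],
    ["Він", "читає", "книгу"],
]
seeds = {"Києві", "Львові"}

locations = bootstrap_gazetteer(sentences, seeds)
```

With these inputs the preposition "в" is seen before both seeds, so it becomes a trusted context and "Одесі" is added to the set. On real text such a loose pattern would also pick up many non-entities, which is why the threshold `min_count` and some manually annotated evaluation data matter.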
- Coursera NLP course - Week 4, Named entity recognition and Maximum Entropy Sequence Models
- A survey of named entity recognition and classification
- Learning a Part-of-Speech Tagger from Two Hours of Annotation
- Design Challenges and Misconceptions in Named Entity Recognition - advanced
- Basics of natural language processing (ready to present)
- Basics of machine learning
- Linear classification models
- Semi-supervised and unsupervised ML approaches
- Working with text corpora (ready to present)
- Programming language: Python
natural language processing, semi-supervised and unsupervised machine learning
- Basics of NLP
- Working with text corpora
Mr. Vsevolod Dyomkin,