Task 7 : Identification of professions and occupations (ProfNER) in Spanish tweets

In this task, modified from previous years, participants must develop one or more systems that classify Spanish tweets according to whether they mention professions and occupations, and that detect the text spans of those mentions. The task presents multiple challenges. First, classification must account for class imbalance: only around 24% of the tweets contain profession or occupation mentions. Second, span detection will require advanced named entity recognition approaches.

Participants will be provided with a labeled training set containing tweet texts as well as profession and occupation annotations. They can choose to participate in one or both subtasks.

  • Training data: 8,000 tweets
  • Test data: 2,000 tweets

For more information, please visit https://temu.bsc.es/smm4h-spanish/

Register your team here : https://forms.gle/1qs3rdNLDxAph88n6
After registration approval, you will be invited to join the Google group for the task. A link to the dataset is available in the Google group's banner. If you do not receive the invite, please use the link below to request to join the Google group, giving your team name.
Google groups : https://groups.google.com/g/smm4h21-task-7
Link to Codalab : Available Feb 1 2021

Evaluation Period for Task 7 :

Test Dataset Release : 1st Mar 2021 12:00am UTC
Predictions Due : 3rd Mar 2021 11:59pm UTC (3:59pm PST)
All submissions are automated and time limits are enforced by Codalab. No extensions will be provided.

Subtask 7a : Tweet classification

Given a tweet, participants of this subtask will be required to submit only the binary annotations Profession/noProfession (1/0). A tweet should be assigned the label 1 if and only if it has one or more mentions of an occupation.

[Figure: Example of input and output (prediction) information for subtask 7a]

Predictions for each task should be contained in a single .tsv (tab-separated values) file as shown below. This file (and only this file) should be compressed into a .zip file. Please upload this zip file as your submission.

[Figure: Example of annotation tab-separated file for subtask 7a]
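
The packaging step described above (one .tsv file, and only that file, inside a .zip archive) could be sketched roughly as follows. The column names `tweet_id` and `label`, the file name `predictions.tsv`, and the function name are illustrative assumptions, not the official format; check the example file distributed with the task data before submitting.

```python
import csv
import io
import zipfile


def build_submission_zip(predictions, tsv_name="predictions.tsv"):
    """Return the bytes of a .zip archive holding a single .tsv file.

    `predictions` maps a tweet id to a binary label
    (1 = Profession, 0 = noProfession).
    The header names below are assumptions, not the official format.
    """
    text = io.StringIO()
    writer = csv.writer(text, delimiter="\t", lineterminator="\n")
    writer.writerow(["tweet_id", "label"])  # hypothetical column names
    for tweet_id, label in predictions.items():
        if label not in (0, 1):
            raise ValueError(f"label for {tweet_id} must be 0 or 1, got {label!r}")
        writer.writerow([tweet_id, label])

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Write the .tsv as the only member of the archive.
        zf.writestr(tsv_name, text.getvalue())
    return archive.getvalue()
```

Writing the returned bytes to disk, for example with `open("submission.zip", "wb")`, would give a file ready for upload.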

Evaluation Metric : Submissions will be ranked by Precision, Recall and F1-score for the Profession class (1)
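
For local validation, the positive-class metrics can be approximated with a few lines of code. This is a minimal sketch, not the official scorer: the function name is hypothetical, and it assumes that a tweet missing from the predictions counts as label 0.

```python
def positive_class_scores(gold, predicted):
    """Precision, recall and F1 for the Profession class (label 1).

    `gold` and `predicted` map tweet ids to 0/1 labels. Only tweets present
    in `gold` are scored; a tweet missing from `predicted` is treated as 0
    (an assumption, not a documented rule of the task).
    """
    tp = fp = fn = 0
    for tweet_id, gold_label in gold.items():
        pred_label = predicted.get(tweet_id, 0)
        if pred_label == 1 and gold_label == 1:
            tp += 1
        elif pred_label == 1 and gold_label == 0:
            fp += 1
        elif pred_label == 0 and gold_label == 1:
            fn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because only around 24% of tweets are positive, these class-specific scores are more informative than plain accuracy when comparing systems.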

Subtask 7b : Profession/occupation span detection

Participants of this second subtask will be required to detect the spans of expressed professions and occupations in each tweet of the test set.
Participants must find the beginning and end of each occupation mention and classify it into the corresponding category. The corpus contains 4 mention categories, but participants will only be evaluated on 2 of them: PROFESION [profession] and SITUACION_LABORAL [working status].

[Figure: Example of input and output (prediction) information for subtask 7b]

Predictions for each task should be contained in a single .tsv (tab-separated values) file as shown below. This file (and only this file) should be compressed into a .zip file. Please upload this zip file as your submission.

[Figure: Annotation tab-separated file for the previous tweet]

Evaluation Metric : Submissions will be ranked by Precision, Recall and F1-score for the PROFESION and SITUACION_LABORAL classes, counting only predictions whose spans overlap entirely with the annotated spans.
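
A rough local approximation of this span-level evaluation is sketched below. It is not the official scorer and rests on several assumptions: "overlap entirely" is read as an exact match of start offset, end offset, and category; scores are micro-averaged over both evaluated categories; and the tuple layout and function name are hypothetical.

```python
EVALUATED_CATEGORIES = {"PROFESION", "SITUACION_LABORAL"}


def strict_span_scores(gold, predicted):
    """Precision, recall and F1 over exactly matching spans.

    `gold` and `predicted` are iterables of
    (tweet_id, start, end, category) tuples (an assumed layout).
    Spans in categories outside EVALUATED_CATEGORIES are ignored.
    """
    gold_set = {span for span in gold if span[3] in EVALUATED_CATEGORIES}
    pred_set = {span for span in predicted if span[3] in EVALUATED_CATEGORIES}

    # A prediction is correct only if the whole tuple matches a gold tuple.
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this reading, a prediction that is off by a single character at either boundary earns no credit, so boundary handling (for example, around hashtags and punctuation) deserves attention during development.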

Contact information: Antonio Miranda (antonio.miranda@bsc.es) or visit https://temu.bsc.es/smm4h-spanish/

Frequently asked questions (FAQs):

Do I have to participate in all the subtasks?
No. You may choose to participate in one or both subtasks. During evaluation, you will be allowed to make two submissions for each subtask.