As information on the web continually expands across different languages, cross-lingual information retrieval (CLIR) systems, which enable users to search in one language and retrieve documents in another, are becoming increasingly important. Research in CLIR for African languages is also growing, and advancing it requires African CLIR test collections to adequately evaluate systems. Collections have been curated that either include some African languages or focus solely on them; however, these collections are mostly created via translation or synthetically, and can be prone to bias and translation issues.
The goal of the CIRAL track is to promote the research and evaluation of CLIR for African languages. With the intent of curating a human-annotated test collection through a community shared task, the track entails retrieval between English and four African languages: Hausa, Somali, Swahili and Yoruba. Given the low-resourced nature of African languages, the track also focuses on fostering CLIR research and evaluation in low-resource settings, and hence on the development of retrieval systems well suited to such tasks.
Track participants are tasked with developing retrieval systems that return documents in a specified African language when issued a query in English. Retrieval is done at the passage level: queries are formulated as natural-language questions, and passages relevant to a given query are those that answer the question. More details on the training and test sets are provided in the Dataset section.
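As a concrete illustration, one simple starting point for this setup is a translate-then-retrieve baseline: translate the English query into the document language, then run lexical retrieval over the passage collection. The sketch below assumes the rank_bm25 package and a toy Swahili collection; the passages, IDs and query are placeholders, and any MT system can be used for the translation step. This is only one possible approach, not an official baseline.

```python
from rank_bm25 import BM25Okapi

# Toy Swahili passage collection; in practice this would be the full CIRAL collection.
passages = {
    "doc-001": "rais alihutubia taifa jana jioni kuhusu bajeti mpya",
    "doc-002": "timu ya taifa ilishinda mechi ya kufuzu kombe la dunia",
}
doc_ids = list(passages.keys())
bm25 = BM25Okapi([passages[d].split() for d in doc_ids])

def search(translated_query: str, k: int = 10):
    """Rank passages with BM25 for a query already translated into the document language."""
    scores = bm25.get_scores(translated_query.split())
    return sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)[:k]

# English query "Who won the World Cup qualifier?" translated (by any MT system) into Swahili:
print(search("nani alishinda mechi ya kufuzu kombe la dunia"))
```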
Each team is required to submit run files obtained from their retrieval systems in the standard TREC format. Teams are expected to submit 2 to 3 runs per language, with a cap of 3; if a team submits more than 3 runs for any language, the top 3 will be selected based on the team's own ranking. Run files can be submitted using this form.
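For reference, a TREC-format run file has one line per retrieved passage with six whitespace-separated columns: query ID, the literal string Q0, passage ID, rank, score, and a run tag. The snippet below is a minimal sketch of writing results in that format; the query and passage IDs shown are placeholders, not actual CIRAL identifiers.

```python
def write_trec_run(results, run_tag, path):
    """Write ranked results ({query_id: [(doc_id, score), ...]}) in standard TREC run format."""
    with open(path, "w", encoding="utf-8") as f:
        for qid, ranking in results.items():
            for rank, (doc_id, score) in enumerate(ranking, start=1):
                # Columns: query_id  Q0  doc_id  rank  score  run_tag
                f.write(f"{qid} Q0 {doc_id} {rank} {score} {run_tag}\n")

# Placeholder IDs for illustration only.
write_trec_run({"q1": [("doc-002", 12.3), ("doc-001", 9.8)]}, "myteam-bm25", "run.swa.txt")
```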
Evaluation is done by creating pools for each query and manually judging the retrieved passages for binary relevance (pooling depth k = 50). Using these judgements, the submitted run files are evaluated with standard retrieval metrics such as MAP, which accounts for both precision and recall. We also evaluate for early precision using nDCG@10 and P@10.
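Mirroring this setup, the following sketch shows how a run could be scored against the relevance judgements using the pytrec_eval package; the query/passage IDs and judgements are toy placeholders, and the official scoring may use trec_eval directly.

```python
import pytrec_eval

# Toy qrels in the standard TREC layout (query_id -> doc_id -> binary relevance).
qrels = {"q1": {"doc-001": 0, "doc-002": 1}}
# Toy run (query_id -> doc_id -> retrieval score).
run = {"q1": {"doc-001": 9.8, "doc-002": 12.3}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg_cut.10", "P.10"})
per_query = evaluator.evaluate(run)

# Average each metric over all queries.
for metric in ("map", "ndcg_cut_10", "P_10"):
    mean = sum(scores[metric] for scores in per_query.values()) / len(per_query)
    print(metric, round(mean, 4))
```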
Each team is required to submit a working note detailing its proposed retrieval system and approach to the task. The required format for the working note is the ACM SIG template, with a maximum of 5 pages. Submissions should be made to the track's email at ciralproject23@gmail.com.
For each language, a static collection of passages extracted from news articles is provided. The training set comprises the static collection, approximately 10 queries per language, and some binary relevance judgements for each query. The test set comprises approximately 30 queries per language. The statistics of the collection are documented in the dataset repo and can also be found in the table below, which will be updated as the dataset is curated.
The datasets will be made available in this Google Drive according to the release date for each set. Participants can email ciralproject23@gmail.com to request access to the folder of interest.
| Language | # Train Queries | # Test Queries | # Passages |
|---|---|---|---|
| Hausa (hau) | 10 | 30 | 715,355 |
| Somali (som) | 10 | 30 | 1,015,567 |
| Swahili (swa) | 10 | 30 | 981,658 |
| Yoruba (yor) | 8 | 27 | 82,095 |
| Date | Milestone |
|---|---|
| 22nd May 2023 | Track Website Opens |
| 22nd May 2023 | Registration for Track Begins |
| 22nd May 2023 | Training Data Released |
| 12th Jul 2023 | Test Data Released |
| 1st Aug 2023 | Run Submission Deadline |
| 15th Aug 2023 | Declaration of Results |
| 15th Sep 2023 | Working Note Submission |
| 5th Oct 2023 | Review Notifications |
| 15th Oct 2023 | Final Version of Working Note |
Coming soon…
University of Waterloo