As information on the web continues to expand across different languages, cross-lingual information retrieval (CLIR) systems, which enable users to search in one language and retrieve documents in another, are becoming increasingly important. Research in CLIR for African languages is also growing, and this work requires African CLIR test collections to adequately evaluate systems and expand research. Collections have been curated that either include some African languages or focus solely on African languages; however, these collections are mostly created via translation or synthetic generation and can be prone to bias and translation artifacts.
The goal of the CIRAL track is to promote research on and evaluation of CLIR for African languages. With the intent of curating a human-annotated test collection through a community shared task, the track involves retrieval between English and four African languages: Hausa, Somali, Swahili and Yoruba. Given the low-resourced nature of African languages, the track also focuses on fostering CLIR research and evaluation in low-resource settings, and hence on the development of retrieval systems well suited to such tasks.
Track participants are tasked with developing retrieval systems that return documents in a specified African language when issued a query in English. Retrieval is done at the passage level: queries are formulated as natural-language questions, and passages relevant to a given query are those that contain answers to the question. More details on the training and test sets are provided in the Dataset section.
Each team is required to submit run files obtained from their retrieval systems in the standard six-column TREC format. Submissions will be received according to the team's own ranking, and participating teams are encouraged to make at least 2 submissions for each language. At least the top 3 ranked submissions from each team will be included in the pools. Up to 1,000 passages can be retrieved per query; results with more than 1,000 will be truncated. Run files can be submitted using this form.
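As a minimal sketch, the snippet below writes results in the standard six-column TREC run format (query id, `Q0`, passage id, rank, score, run tag). The in-memory structure of `results`, the output file name, and the run tag are illustrative placeholders, not part of the track's tooling.

```python
# Sketch: writing retrieval results in six-column TREC run format.
# `results` maps each query id to a list of (passage_id, score) pairs;
# this structure and the file name are assumptions for illustration.
results = {
    "query_1": [("passage_17", 12.3), ("passage_42", 11.8)],
}

run_tag = "my_team_bm25"  # hypothetical run identifier

with open("run.hausa.txt", "w", encoding="utf-8") as f:
    for query_id, ranked in results.items():
        # Sort by descending score and keep at most 1,000 passages per query.
        ranked = sorted(ranked, key=lambda x: x[1], reverse=True)[:1000]
        for rank, (passage_id, score) in enumerate(ranked, start=1):
            f.write(f"{query_id} Q0 {passage_id} {rank} {score} {run_tag}\n")
```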
Evaluation is done by creating pools for each query and manually judging the binary relevance of retrieved passages (pooling depth k = 20). Using the resulting judgments, the submitted run files are evaluated with the following standard retrieval metrics: Recall@100 and nDCG@20.
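For illustration only (official scoring will presumably rely on standard tooling such as trec_eval), the sketch below computes the two metrics for a single query under binary relevance; the function and variable names are assumptions.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant passages found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=20):
    """nDCG with binary gains: gain is 1 if a passage is judged relevant."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, pid in enumerate(ranked_ids[:k]) if pid in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

# ranked_ids: passage ids in ranked order for one query (from the run file);
# relevant_ids: set of passage ids judged relevant for that query (from qrels).
```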
For each language, a static collection of passages extracted from news articles is provided. The training set comprises the static collection, approximately 10 queries per language, and some binary relevance judgments for each query. The queries and qrels in the training set serve as samples to help analyze relevance and explore approaches, with the qrels used for evaluation. Statistics of the collections are documented in the dataset repo and can also be found in the table below.
The datasets will be made available in their respective Hugging Face repos (Corpus, Queries and Judgements) according to the release date for each set. Participants can request access to the training and test sets.
| Language | # Train Queries | # Test Queries | # Passages |
|---|---|---|---|
| Hausa (hau) | 10 | 85 | 715,355 |
| Somali (som) | 10 | 100 | 1,015,567 |
| Swahili (swa) | 10 | 85 | 981,658 |
| Yoruba (yor) | 10 | 100 | 82,095 |
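As noted above, the corpus, queries, and judgements are released through Hugging Face repos. A minimal sketch of loading a corpus with the `datasets` library follows; the repository ID, configuration name, and field names are hypothetical placeholders, and gated repos may require logging in with `huggingface-cli login` first.

```python
# Hypothetical sketch of loading a CIRAL corpus with the Hugging Face
# `datasets` library. The repository ID, configuration name, and field
# names are placeholders; use the identifiers published in the track's
# Corpus/Queries/Judgements repos.
from datasets import load_dataset

corpus = load_dataset("ciral/corpus-placeholder", "hausa", split="train")

for passage in corpus.select(range(3)):
    print(passage["docid"], passage["text"][:80])
```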
| Milestone | Date |
|---|---|
| Track website opens | |
| Registration for track begins | |
| Training data released | |
| Test data released | |
| Run submission deadline | |
| Declaration of results | |
| Working note submission | |
| Final version of working note | |
- Run results are ready and are available on the leaderboard!
- Test queries out now: Dataset Repo
- Check out baselines here
- Just starting out with IR/CIRAL? Quick Start, Documentation
- Train queries and qrels released for Swahili and Somali here
- Updated version of corpora available in Hugging Face repo
- Training data released for Hausa and Yoruba here
- Corpora available here
Track organizers are affiliated with the University of Waterloo and Huawei Noah's Ark Lab.