Wikimedia Quality Images

Accessing the Dataset

The code for retrieving the metadata of Wikimedia Quality Images, downloading the images, and filtering and processing them is available in Paul’s GitHub repo. Some of the preprocessed datasets are also available for download on the Arbutus cloud:

  • WQI-metadata.tar.zst (63 MB) The metadata files
  • WQI-people-4k-llama-sim.tar (5.1 GB) Images containing people, resized to (up to) 4K, with images smaller than 1080p removed. We use keywords to select images with people, and apply heuristics to avoid images with significantly similar contents (including using Qwen 3.0, through the llama-cpp-python library, to judge the similarity of the content descriptions).
  • WQI-people-4k-llama-sim.tar Same as above, but uses an LLM to judge the similarity of the content descriptions.
  • WQI-archive_0000000.tar to WQI-archive_0000623.tar The complete archive of Wikimedia Quality Images at the time of download. Each tar file contains the (up to) 500 images listed in one metadata JSON file. The dataset was archived at the beginning of 2024.
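The resolution filter described above (drop images below 1080p, downscale anything larger than 4K while preserving aspect ratio) can be sketched as a pure function on image dimensions. The thresholds and orientation handling here are assumptions, not the project's exact rules:

```python
def keep_and_target_size(width, height,
                         min_dim=(1920, 1080), max_dim=(3840, 2160)):
    """Return the (possibly downscaled) target size, or None to drop.

    Hypothetical policy: images whose dimensions fall below 1080p are
    filtered out; larger images are scaled down to fit within 4K
    (3840x2160), preserving aspect ratio. Orientation-agnostic: a
    portrait 2160x3840 image is treated like a landscape 3840x2160 one.
    """
    lo, hi = sorted((width, height))          # short and long side
    min_lo, min_hi = sorted(min_dim)
    if lo < min_lo or hi < min_hi:
        return None                           # below 1080p: filtered out
    max_hi, max_lo = sorted(max_dim, reverse=True)
    scale = min(1.0, max_hi / hi, max_lo / lo)  # never upscale
    return (round(width * scale), round(height * scale))
```

For example, a 6000x4000 photo would be kept and scaled to 3240x2160, while a 1280x720 one would be dropped.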

The code for processing the images is released under an MIT-like licence. Individual images are released under different licences: they are either in the public domain (or CC0) or under a CC-BY or CC-BY-SA licence. Proper attribution is given in this project through the metadata files.
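Since attribution lives in the per-image metadata, building an attribution string is a matter of reading the relevant fields from each record. A minimal sketch, assuming hypothetical field names ("title", "artist", "license") — check the actual WQI-metadata JSON schema for the real keys:

```python
import json

def attribution_line(record):
    """Build a human-readable attribution string from one metadata record.

    The keys "title", "artist", and "license" are hypothetical
    placeholders for whatever the WQI metadata files actually use.
    """
    parts = [record.get("title", "Untitled"),
             "by " + record.get("artist", "unknown"),
             record.get("license", "unknown licence")]
    return ", ".join(parts)

# Hypothetical record, in the shape such a metadata entry might take:
sample = json.loads(
    '{"title": "Sunset", "artist": "A. Photographer", "license": "CC-BY-SA 4.0"}'
)
```

Note that CC-BY and CC-BY-SA both require attribution of this kind when images are redistributed, so any derived dataset should carry these fields along.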

We are making the dataset processing tools and preprocessed datasets available to the research community free of charge. If you find this dataset useful in your research, we kindly ask that you give proper attribution to the original sources of the images and reference our project:

@inproceedings{yang2025towards,
title={Towards a Universal Image Degradation Model via Content-Degradation Disentanglement},
author={Yang, Wenbo and Wang, Zhongling and Wang, Zhou},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={12966--12975},
year={2025}
}

Disclaimer

This project is not affiliated with the Wikimedia Foundation or Wikipedia.