Published June 5, 2020 | Version 1.0

MNIST Large Scale data set

  • KTH Royal Institute of Technology

Description

Motivation 

The MNIST Large Scale data set is based on the classic MNIST data set but contains large scale variations up to a factor of 16. The motivation behind creating this data set was to enable testing the ability of different algorithms to learn in the presence of large scale variability and, specifically, to generalise over wide scale ranges to new scales not present in the training set.

The MNIST Large Scale data set was originally introduced in the paper:

[1] Y. Jansson and T. Lindeberg (2021) “Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges”, International Conference on Pattern Recognition (ICPR 2020), pp. 1181–1188. An extended version preprint, which includes additional information about data set creation, is available as arXiv:2004.01536.

A more extensive experimental description of this data set, including a published account of the details of data set creation as well as compact performance measures that can serve as benchmarks for scale generalisation performance, is given in:

[2] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights
The data set is freely available under the condition that you reference both the original MNIST data set: 

[3] LeCun, Y., Bottou, L., & Haffner, P. (1998). “Gradient-based learning applied to document recognition”. Proceedings of the IEEE, 86(11): 2278–2324

and this derived version, either of the references [1] or [2] (preferably [2]).

The data set is made available on request. If you are interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset
The MNIST Large Scale data set is based on the classic MNIST data set [3] but contains large scale variations up to a factor of 16. The data set is created by rescaling the original MNIST images with varying scale factors and embedding the resulting images in 112x112 images with a uniform background, followed by smoothing and soft thresholding to reduce discretisation artifacts. The details of data set creation are described in [1] and [2].
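As an illustration, the rescale-and-embed step can be sketched roughly as follows. This is a simplified nearest-neighbour version for intuition only; the interpolation, smoothing and soft-thresholding details actually used are those described in [1] and [2] and are not reproduced here:

```python
import numpy as np

def rescale_and_embed(digit, scale, out_size=112, bg=0.0):
    """Sketch of the rescale-and-embed step: nearest-neighbour
    rescaling of a square digit image, centred in an
    out_size x out_size canvas with a uniform background.
    (The original pipeline additionally applies smoothing and
    soft thresholding to reduce discretisation artifacts.)"""
    n = max(1, int(round(digit.shape[0] * scale)))
    # nearest-neighbour index map from output pixels to input pixels
    idx = np.minimum(np.floor(np.arange(n) / scale).astype(int),
                     digit.shape[0] - 1)
    scaled = digit[np.ix_(idx, idx)]
    canvas = np.full((out_size, out_size), bg, dtype=np.float32)
    top = (out_size - n) // 2
    # centre the digit, cropping if it is larger than the canvas
    t, s = max(0, top), max(0, -top)
    m = min(n - s, out_size)
    canvas[t:t + m, t:t + m] = scaled[s:s + m, s:s + m]
    return canvas
```

For example, a 28x28 digit rescaled by a factor 2 occupies the central 56x56 pixels of the 112x112 canvas, while a factor 8 digit is cropped to the canvas.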

All training data sets are created from the first 50,000 examples in the original MNIST training set, while the validation data sets are created from the last 10,000 images of the original MNIST training set. The test data sets are created from the 10,000 images in the original MNIST test set.

There are three data sets (7.0 GB each) for single-scale training at three different scales (1, 2 and 4); each also includes test and validation data for the same scale:

    mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr1p000_scte1p000.h5
    mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr2p000_scte2p000.h5
    mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr4p000_scte4p000.h5

In addition, there are 17 data sets (1.0 GB each) for testing generalisation ability to scales not present in the training set. These data sets comprise the scale factors 2^(k/4) for integer k in the range [-4, 12], i.e. spanning the scale range [1/2, 8]:

    mnist_large_scale_te10000_outsize112-112_scte0p500.h5
    mnist_large_scale_te10000_outsize112-112_scte0p595.h5
    mnist_large_scale_te10000_outsize112-112_scte0p707.h5
    mnist_large_scale_te10000_outsize112-112_scte0p841.h5

    mnist_large_scale_te10000_outsize112-112_scte1p000.h5
    mnist_large_scale_te10000_outsize112-112_scte1p189.h5
    mnist_large_scale_te10000_outsize112-112_scte1p414.h5
    mnist_large_scale_te10000_outsize112-112_scte1p682.h5

    mnist_large_scale_te10000_outsize112-112_scte2p000.h5
    mnist_large_scale_te10000_outsize112-112_scte2p378.h5
    mnist_large_scale_te10000_outsize112-112_scte2p828.h5
    mnist_large_scale_te10000_outsize112-112_scte3p364.h5

    mnist_large_scale_te10000_outsize112-112_scte4p000.h5
    mnist_large_scale_te10000_outsize112-112_scte4p757.h5
    mnist_large_scale_te10000_outsize112-112_scte5p657.h5
    mnist_large_scale_te10000_outsize112-112_scte6p727.h5
    mnist_large_scale_te10000_outsize112-112_scte8p000.h5

The above data sets were used for the experiments presented in Figure 2 and Figure 4 in [1].  The numerical performance scores for a vanilla CNN and the different scale channel architectures evaluated in the paper are given in Table I in [1].
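The 17 test scales and the corresponding file names follow a regular pattern, which can be generated as follows. The `scale_token` helper is just an illustration of the naming convention, in which the scale is written with three decimals and the decimal point replaced by 'p':

```python
# the 17 test scales are 2**(k/4) for k = -4..12, spanning [1/2, 8]
scales = [2 ** (k / 4) for k in range(-4, 13)]

def scale_token(s):
    """File-name token for a scale factor, e.g. 0.595 -> '0p595'."""
    return f"{s:.3f}".replace('.', 'p')

names = [f"mnist_large_scale_te10000_outsize112-112_scte{scale_token(s)}.h5"
         for s in scales]
```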

To evaluate the ability of different algorithms to learn from data with large scale variations when only a limited number of training samples is available, there is also a data set where the training, validation and test data all span the scale range [1, 4]:

    mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr1-4_scte1-4.h5

This data set was used for the experiment presented in Figure 5 in [1]. The numerical performance scores for a vanilla CNN and the different scale channel architectures evaluated in [1] are given in Table III in [1]. When evaluating how the performance varies with the number of training samples for this data set, the first n samples from the training set should be used for training, while the full test set should be used for testing.
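The first-n-samples protocol above can be expressed as a small helper. `training_subsets` is a hypothetical name introduced here for illustration; the point is that the subsets are nested, since each one takes the first n samples in stored order:

```python
def training_subsets(x_train, y_train, sizes=(1000, 10000, 50000)):
    """Nested training subsets for the sample-efficiency experiment:
    the first n samples in stored order, so every smaller subset is
    contained in the larger ones. The full test set is reused for
    evaluation at every subset size."""
    for n in sizes:
        yield n, x_train[:n], y_train[:n]
```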


Instructions for loading the data set
The data sets are saved in HDF5 format. Each of the four training data sets is stored as six partitions in its HDF5 file (/x_train, /x_val, /x_test, /y_train, /y_val, /y_test) and can be loaded in Python as follows:

import h5py
import numpy as np

with h5py.File(<filename>, 'r') as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)

    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

or in Matlab as:

    x_train = h5read(<filename>, '/x_train');
    x_val = h5read(<filename>, '/x_val');
    x_test = h5read(<filename>, '/x_test');

    y_train = h5read(<filename>, '/y_train');
    y_val = h5read(<filename>, '/y_val');
    y_test = h5read(<filename>, '/y_test');

The 17 test data sets can be loaded in Python as:

with h5py.File(<filename>, 'r') as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

or in Matlab as:

    x_test = h5read(<filename>, '/x_test');
    y_test = h5read(<filename>, '/y_test');

(The test data sets additionally contain a single training and validation sample in the /x_train, /x_val, /y_train and /y_val partitions, to ensure compatibility with code that always loads all three subsets. These samples are not intended to be used.)

Note that the greyscale images are stored in the HDF5 files using row-major (C-style) order, i.e. as [n_samples, xdim, ydim, n_channels], where the size of the channel dimension is 1.
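For frameworks that expect channels-first input (e.g. a [N, C, H, W] layout, as in PyTorch), the loaded arrays can simply be transposed. In this sketch, x is a placeholder zero array standing in for a batch loaded from one of the HDF5 files:

```python
import numpy as np

# a batch as stored in the HDF5 files: [n_samples, xdim, ydim, n_channels]
x = np.zeros((10, 112, 112, 1), dtype=np.float32)

# reorder to channels-first [N, C, H, W]
x_nchw = np.transpose(x, (0, 3, 1, 2))
print(x_nchw.shape)  # (10, 1, 112, 112)
```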

For convenience, we also provide a Jupyter notebook and a Matlab script for loading and inspecting the data sets at https://github.com/spacemir/MNISTLargeScaleDataset.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

