Edited by: Thomas Serre, Brown University, USA
Reviewed by: Sébastien M. Crouzet, Centre de Recherche Cerveau et Cognition, France; Michelle R. Greene, Stanford University, USA
*Correspondence: Johannes Mohr
This article was submitted to Perception Science, a section of the journal Frontiers in Psychology
†These authors have contributed equally to this work.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Photographic stimuli are often used for studying human perception. To faithfully represent our natural viewing environment, these stimuli should be free of potential artifacts. If stimulus material for scientific experiments is generated from photographs that were created for a different purpose, such as advertisement or art, the scene layout and focal depth might not be typical of our visual world. For instance, in advertising photos, particular objects are often centered and in focus. In visual search experiments, this can lead to the so-called central viewing bias and an unwanted pre-segmentation of focused objects (Wichmann et al.,
To meet the need for publicly available stimulus material in which these artifacts are avoided, we introduce in this paper the BOiS—Berlin Object in Scene database, which provides controlled photographic stimulus material for the assessment of human visual search behavior under natural conditions. The BOiS database comprises high-resolution photographs of 130 cluttered scenes. In each scene, one particular object was chosen as the search target. The scene was then photographed three times: with the target object at an expected location, at an unexpected location, or absent.
Moreover, the database contains 240 different views of each target object in front of a black background. These images provide different visual cues of the target before the search is initiated. All photos were taken under controlled conditions with respect to photographic parameters and layout and were corrected for optical distortions.
The BOiS database allows investigating the top-down influence of scene context by providing contextual prior maps for each scene that quantify people's expectations of finding the target object at a particular location. These maps were obtained by averaging the individual expectations of 10 subjects and can be used to model context effects on the search process.
Last but not least, the database includes segmentation masks of each target object in the two corresponding scene images, as well as a list of semantic information on the target object, the scene, and the two chosen locations. Moreover, we provide bottom-up saliency measures and contextual prior values at the two target object locations. While originally aimed at visual search, our database can also provide stimuli for experiments on scene viewing and object recognition, or serve as a test environment for computer vision algorithms.
Our goal was to obtain scenes that were as naturalistic as possible, using visual environments encountered in everyday life. Therefore, for each of the 130 scenes a particular target object was chosen from among the objects contained within the scene, such as a pen lying on a desk, a sofa cushion in a living room or a watering can in a garden (see the example in Figures
Each scene in our database is labeled as an indoor or outdoor scene and assigned to a further sub-category. The 107 indoor scenes are subdivided into 32 kitchen scenes, 12 bathroom scenes, 9 bedroom scenes, 16 living room scenes, 15 office scenes, 8 hall scenes, 7 hospital scenes, 5 garage scenes, 2 bakery scenes, and 1 shop scene. The 23 outdoor scenes are subdivided into 13 garden scenes, 4 park scenes, 4 terrace scenes, 1 garage scene, and 1 street scene. For each scene, we also name the associated target object and describe the two locations where the target object was photographed within the scene by their relation to nearby objects. For the “expected” location, there usually were semantically associated objects next to or close to the target object. For instance, the bottle cap was screwed onto a bottle, the rubber duck was sitting on the rim of a bathtub, or the garden watering can was standing next to a garden shed. For the “unexpected” location, the surrounding objects did not have a semantic relation to the target object. For example, the bottle cap was lying on top of washed dishes, the rubber duck was sitting on the bathroom floor, and the garden watering can was standing on top of the garden shed.
In order to avoid potential artifacts, no graphics editing was applied and target objects were physically placed at different locations within the scene. No humans or animals appeared in the pictures in order to avoid any attentional biases caused by their presence (Suzuki and Cavanagh,
All scenes were taken from a natural human perspective. The layout of all scenes was similar: The central area and the outer margin of the image were kept clear of the target object. This allows for generating different cuts of the original scene by zooming and shifting, providing more control over size and relative position of the target object in the final stimulus. Furthermore, it was ensured that the two locations for one particular target object had a similar distance to the image center (minimum distance at expected location: 605.6 ± 88.1 pixels, at unexpected location: 618.1 ± 93.5 pixels; no significant difference,
Target objects had a varying absolute size, ranging from a few millimeters up to about 60 cm, such as a coin, a pair of scissors, a cooking pot, or a car tire. We acquired multiple views of each target object in front of a black background (see Figure
For all images we used the same hardware and pre-defined setting ranges. All photographic images of scenes and target objects were taken with a Canon EOS 450D camera with a Canon EF-S 18-55/3.5-5.6 IS lens. The aperture was fixed at f/9.5 (aperture priority), exposure time ranged from 5 to 1/1500 s, and the ISO speed rating was either ISO 400 or ISO 800, depending on lighting conditions. Flash was only used when indoor lighting conditions were inadequate. Focal length was 18 mm; in some images, zoom was used with a focal length of up to 33 mm. All pictures were recorded at a high resolution of 4272 × 2848 pixels and saved in Canon RAW image file format (.CR2). Taking the particular lens into account, all photos were automatically corrected for optical errors (optical and geometric distortion, vignetting, chromatic aberrations, lens softness, noise reduction, white balancing) using DxO Optics Pro 9.
One important top-down influence arises from scene context (Castelhano and Heaven,
For this purpose, 10 naïve subjects (5 male and 5 female students living in Berlin, between 20 and 30 years old) segmented the scenes with missing target objects (Figures
We provide segmentations of the target objects in the scenes and use these to include values for several bottom-up features that are potentially related to the detectability of the target object within the scene.
Each target object is associated with a particular scene, and appears in two images of this scene. Our database contains binary masks for these scene images in which the pixels belonging to the target object are marked. These masks were obtained by exact manual segmentation in the high resolution image and are therefore very accurate (for an example see Figures
We applied these target object segmentations to the contextual prior maps in order to validate our subjective assessment of expected and unexpected target object locations. For each of the 130 pairs, we averaged the contextual prior over the pixels taken up by the object, for both the expected and the unexpected target object location. Thus, each version of the scene was represented by a single number reflecting how expected the target object was at that particular location. The difference in this value across the two versions of the scene tells us about the difference in expectation. A paired
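This per-location computation can be sketched as follows. This is a minimal illustration using numpy; the toy arrays and the function name `mean_prior_over_mask` are ours for demonstration and are not part of the database:

```python
import numpy as np

def mean_prior_over_mask(prior_map, mask):
    """Average the contextual prior over the pixels covered by the object mask."""
    prior = prior_map.astype(float)
    prior /= prior.sum()            # normalize so the whole map sums to 1
    return float(prior[mask > 0].mean())

# Toy 4x4 prior map with a high-prior region in the upper-right corner
prior = np.array([[0., 0., 1., 1.],
                  [0., 0., 2., 2.],
                  [0., 0., 0., 0.],
                  [0., 0., 0., 0.]])

mask_expected = np.zeros((4, 4))
mask_expected[:2, 2:] = 1     # hypothetical object overlapping the high-prior area
mask_unexpected = np.zeros((4, 4))
mask_unexpected[2:, :2] = 1   # hypothetical object in a zero-prior area

e = mean_prior_over_mask(prior, mask_expected)    # higher value
u = mean_prior_over_mask(prior, mask_unexpected)  # lower value
```

Collecting these paired values over all 130 scenes yields the two samples entering the statistical comparison.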
In 129 of the scenes the expected target object locations had a higher maximum contextual prior value than the corresponding unexpected target object location. For these scenes, our subjective assessment was thus confirmed. In one scene our categorization into “expected” and “unexpected” location did not correspond to the results of contextual prior maps (scene 118). When using our database to study prior information on visual search it may thus be prudent to exclude scene ID 118.
Our database includes a datasheet listing a set of values for bottom-up features that are potentially related to the detectability of the target object within a scene: target object size; minimum and mean distance to the image center; conspicuity map values for luminance, color, and orientation; and saliency of the target object.
Larger size of an object increases its visibility within a scene and its likelihood of coincidental detection. Therefore, the object size could modulate attention. We defined the size of the target object as the number of pixels covered by the target object mask.
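Given the binary masks described above, this measure reduces to a pixel count. As a concrete sketch (numpy; the mask below is a hypothetical example, not taken from the database):

```python
import numpy as np

def target_size(mask):
    """Object size = number of non-zero pixels in the binary segmentation mask."""
    return int(np.count_nonzero(mask))

# Hypothetical full-resolution mask containing a 100 x 150 pixel object
mask = np.zeros((2848, 4272), dtype=np.uint8)
mask[1000:1100, 2000:2150] = 255
size = target_size(mask)
```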
When looking at images, humans tend to first fixate the area close to the center of the image (central fixation bias). Therefore, the distance to the image center might be inversely correlated with the detectability of the target (Tatler and Melcher,
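The minimum and mean center distances can be derived directly from the object mask. A minimal sketch with numpy (the function name and the toy mask are ours):

```python
import numpy as np

def center_distances(mask):
    """Minimum and mean Euclidean pixel distance from the object to the image center."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                      # coordinates of object pixels
    d = np.hypot(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)
    return float(d.min()), float(d.mean())

# Hypothetical 5 x 5 mask: one object pixel at the center, one two pixels to its right
mask = np.zeros((5, 5))
mask[2, 2] = 1   # at the image center -> distance 0
mask[2, 4] = 1   # two pixels right of center -> distance 2
d_min, d_mean = center_distances(mask)
```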
Saliency maps form the basis of standard models of bottom-up visual attention. We calculated conspicuity and saliency maps based on the model of Itti and Koch (Itti et al.,
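The core idea of such conspicuity maps is center-surround contrast computed at multiple spatial scales. The following is a deliberately crude sketch of this idea for the luminance channel only, using differences of Gaussians; it is not the actual Itti-Koch implementation used for the database values, and the function name and parameters are our own assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def luminance_conspicuity(image, sigmas=(2, 4), surround_factor=4):
    """Center-surround (difference-of-Gaussians) luminance conspicuity,
    summed over a few scales and normalized to the range [0, 1]."""
    lum = image.astype(float)
    if lum.ndim == 3:                      # RGB -> mean luminance
        lum = lum.mean(axis=2)
    consp = np.zeros_like(lum)
    for s in sigmas:
        center = gaussian_filter(lum, s)
        surround = gaussian_filter(lum, s * surround_factor)
        consp += np.abs(center - surround)  # center-surround contrast at this scale
    rng = consp.max() - consp.min()
    return (consp - consp.min()) / rng if rng > 0 else consp

# A bright square on a dark background should pop out
img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0
cmap = luminance_conspicuity(img)
```

The full model additionally computes color-opponency and orientation channels and combines them into a single saliency map; target saliency is then read out at the object location.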
The database is publicly available under the following address:
The first folder “original_scenes_DxO” contains the distortion-corrected scene images as PNG files with a resolution of 4272 × 2848 pixels. The scenes without the target object are named O_<id>L_DxO.png, the scenes with the target object at the expected location O_<id>E_DxO.png, and the scenes with the target object at the unexpected location O_<id>U_DxO.png. The second folder “target_objects” contains a subfolder DxO, in which for each target object <id> one can find a folder TO_<id>_DxO with the corrected PNG images of the different views of the target object, named TO_<id>_<vertical angle>_<horizontal viewpoint>_DxO.png. These were down-sampled to 1152 × 768 pixel resolution in order to keep the total size of the database practicable. The third folder “masks” contains the segmentation masks of the scenes with the target object in the expected (O_<id>E_m.png) and unexpected (O_<id>U_m.png) location at 4272 × 2848 pixel resolution. The last folder “cpms” contains the contextual prior maps as gray-level images (CPM_<id>.png) at 4272 × 2848 pixel resolution. Note that these maps are scaled to use the whole gray-value range for visualization purposes and need to be normalized such that the sum over all pixels is 1 to ensure comparability across scenes. We have also included the 130 individual contextual prior segmentations of our 10 subjects at 600 × 400 pixel resolution (Individual_priors.zip). These files are named <subject_id>_O_<id>L_s.png, where <subject_id> ranges from 2 to 11, and the gray values (0, 85, 170, 255) in the images encode the four prior probability levels from lowest to highest.
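In code, the normalization of a contextual prior map and the decoding of an individual prior segmentation might look as follows. This is a sketch under the conventions stated above (gray values 0, 85, 170, 255 for the four levels); the function names are ours, and image loading into a numpy array is left to the reader's preferred library:

```python
import numpy as np

def normalize_cpm(cpm_gray):
    """Rescale a gray-level contextual prior map so that all pixels sum to 1."""
    cpm = cpm_gray.astype(float)
    return cpm / cpm.sum()

def decode_individual_prior(seg_gray):
    """Map the four gray values of an individual segmentation to prior levels 0..3."""
    levels = {0: 0, 85: 1, 170: 2, 255: 3}   # gray value -> level (lowest..highest)
    out = np.zeros(seg_gray.shape, dtype=int)
    for gray, level in levels.items():
        out[seg_gray == gray] = level
    return out

# Toy 2x2 examples standing in for the actual PNG contents
cpm = normalize_cpm(np.array([[0, 255], [85, 170]], dtype=np.uint8))
seg = decode_individual_prior(np.array([[0, 85], [170, 255]], dtype=np.uint8))
```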
All meta-information for the database is provided in BOiSmeta.xls. This Excel spreadsheet contains a table with the semantic information (see Section Target Objects in Realistic Scenes), as well as a table with the contextual prior values (see Section Contextual Prior Maps) and bottom-up feature values (see Section Bottom-Up Features) for both target object locations within the scene.
All authors jointly developed the concept of the database. JS generated all the photos. JS and JM compiled and annotated the database, and wrote the manuscript. AL, JW, FW, and KO carefully revised the manuscript for important intellectual content.
This work was funded by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Focus: Neurotechnology (FKZ: 01GQ0850 – A4) as well as the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.