Speed Biases With Real-Life Video Clips

Rossi, Federica; Montanaro, Elisa; de’Sperati, Claudio

doi:10.3389/fnint.2018.00011

ORIGINAL RESEARCH article

Front. Integr. Neurosci., 16 March 2018
Volume 12 - 2018 | https://doi.org/10.3389/fnint.2018.00011


Federica Rossi¹

Elisa Montanaro²

Claudio de’Sperati¹,³*

¹Laboratory of Action, Perception and Cognition, Faculty of Psychology, Vita-Salute San Raffaele University, Milan, Italy
²Department of Neuroscience Rita Levi Montalcini, University of Turin, Turin, Italy
³Experimental Psychology Unit, Division of Neuroscience, San Raffaele Scientific Institute, Milan, Italy

We live almost literally immersed in an artificial visual world, especially motion pictures. In this exploratory study, we asked whether the best speed for reproducing a video is its original, shooting speed. By using adjustment and double staircase methods, we examined speed biases in viewing real-life video clips in three experiments, and assessed their robustness by manipulating visual and auditory factors. With the tested stimuli (short clips of human motion, mixed human-physical motion, physical motion and ego-motion), speed underestimation was the rule rather than the exception, although it depended largely on clip content, ranging on average from 2% (ego-motion) to 32% (physical motion). Manipulating display size or adding arbitrary soundtracks did not modify these speed biases. Estimated speed was not correlated with estimated duration of these same video clips. These results indicate that the sense of speed for real-life video clips can be systematically biased, independently of the impression of elapsed time. Measuring subjective visual tempo may integrate traditional methods that assess time perception: speed biases may be exploited to develop a simple, objective test of reality flow, to be used for example in clinical and developmental contexts. From the perspective of video media, measuring speed biases may help to optimize video reproduction speed and validate “natural” video compression techniques based on sub-threshold temporal squeezing.

Introduction

Motion pictures—videos—are becoming pervasive in our everyday life. Yet, videos might easily fool us. For example, we recently showed that observers fail to notice large speed manipulations when viewing a soccer match video clip (de’Sperati and Baud Bovy, 2017). In that study, we also found a small but reliable tendency toward speed underestimation, which suggests that the best speed for reproducing a video may not be its original shooting speed. This may sound rather counterintuitive, as we tend to implicitly assume that shooting speed and reproduction speed should coincide, for otherwise motion rendering would be sub-optimal or even artifactual¹. Yet, this may not always be true.

There are several reasons why real-life motion pictures could generate a “wrong” speed impression, or speed bias. In general, this happens whenever the scene does not match expectations, either implicit or explicit, about how the world should appear (Shi et al., 2013; Shi and Burr, 2016). Scene content and context may induce the viewer to expect that events should unfold at a different pace, for example when an action is performed at a particularly slow rhythm or when motion cues are poor. In this respect, there are virtually no limits as to the potential mismatches between expectations and particular visual scenes. Sometimes the wrong expectations apply even to basic physical facts (intuitive physics, McCloskey and Kohl, 1983; McCloskey et al., 1983; Pittenger, 1985; Kaiser et al., 1992; Kubricht et al., 2017). Conversely, we may be particularly well tuned to certain patterns of biological motion, either based on purely visual mechanisms or through visuo-motor coupling (de’Sperati and Stucchi, 1995, 1997, 2000; Viviani et al., 1997; de’Sperati and Viviani, 1997; Thornton, 1998; Runeson et al., 2000; Cattaneo and Rizzolatti, 2009; Gallese et al., 2009; Lacquaniti et al., 2014). Low-level factors are also important in interpreting visual motion. For example, depending on spatial and temporal frequency content, image contrast can introduce distortions in perceived speed (Anstis, 2003; Burr and Thompson, 2011). The frequently reported speed underestimation at low contrast can be an effect of different weights of low-pass and band-pass cortical filtering (Thompson et al., 2006) or a consequence of a low-speed prior (Weiss et al., 2002). Thus, video clips depicting different real-life scenes might produce different speed biases for a variety of reasons. The first aim of this study is to verify whether this is indeed the case.

Viewing conditions could also modify the impression of speed. For example, if scaling mechanisms for speed constancy (McKee and Smallman, 1998; Distler et al., 2000; Thornton et al., 2014) do not fully compensate for display size or viewing distance, watching a video on a mobile phone may be different from watching it on a computer monitor or home TV. Likewise, watching a muted video may be different from watching it with an accompanying soundtrack, and in turn the type of soundtrack can convey a feeling of urgency or relaxation, possibly impacting on the impression of visual tempo (Recanzone, 2003; Soto-Faraco and Väljamäe, 2012). Indeed, there is ample evidence that music can influence the sense of time, and specific neural correlates have been proposed (Schäfer et al., 2013). Moreover, music, rhythm and movements are tightly intertwined, and this relationship extends to visual metrical perception, including a specific effect of visual motion on auditory tempo (Su and Jonikaitis, 2011; Su and Salazar-López, 2016). Thus, speed biases may depend on these “accessory” factors as well. Playing video clips at a slightly different speed might compensate for these effects, if present—think for example of an automatic equalization system for compensating different perceived speeds on small mobile screens vs. TV screens. The second aim of this study is to verify whether screen size or soundtrack can modify the dynamic appearance of videos.

The impression that a given visual scene is too slow (speed underestimation) may depend on the fact that its duration is perceived to be too long (and vice-versa). One reason could be that, subjectively, the scene is not “filling time” sufficiently. Indeed, according to an influential model based on the idea that temporal cognition depends on the accumulation of event ticks, perceived time is dilated when a visual stimulus “fills” it more densely, for example because of a higher speed or higher temporal frequency (Brown, 1995; Block and Zakay, 1997; Lacquaniti et al., 2014). Thus, it is possible that speed underestimation is associated with duration overestimation, although with complex motion scenes additional factors may influence time perception, especially when human actions are involved (Grivel et al., 2011; Carrozzo and Lacquaniti, 2013; Sgouramani and Vatakis, 2014). Alternatively, a lack of correlation between speed bias and duration estimation would suggest that measuring the sense of speed is not just another way to measure perceived elapsed time, but characterizes a distinct function at the interface between perception and cognition. The third aim of this study is to verify whether perceived speed and perceived duration of real-life video clips are correlated.

To address these three issues, we assessed the subjectively estimated “natural” speed of short video clips depicting various real-life scenes. This was achieved by measuring the point of subjective equality (PSE; Ehrenstein and Ehrenstein, 1999), which provided an estimate of speed bias. We then searched for a correlation between speed bias and errors in estimating temporal intervals, evaluated through a duration reproduction task (Grondin, 2010), and tested the robustness of speed bias by manipulating display size and soundtrack.

Experiment 1

This experiment assessed the ability to estimate the “natural” reproduction speed of real-life video clips, i.e., the original shooting speed, as a function of visual content. Observers adjusted the video speed in real time until a point at which speed was not perceived as too high or too low (PSE). To mimic real-life viewing conditions, no stimulus standard was provided for comparison, so that observers had to rely entirely on their internal expectations. In Bayesian terms, this amounts to emphasizing the role of priors in perceptual decisions (“reference memory”, Shi et al., 2013). We investigated the effects of clip type, display size and repeated presentations on PSE. The same subjects were also tested in a duration reproduction task based on the same video material.

Methods

Participants

Fifteen participants (mean age = 30.00 years, nine females) volunteered for the experiments. They had normal or corrected-to-normal vision, and were naïve as to the purpose of the experiment. This study was carried out in accordance with the recommendations of San Raffaele Ethical Committee. The protocol was approved by the San Raffaele Ethical Committee. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

Stimuli and Task

We used four short video clips (duration, 30 s), displaying physical motion, human motion—first-person and third-person perspectives—and mixed human-physical motion. Three videos were shot in-house with a smartphone in HD format (30 fps, 1280 × 720 pixels, with a f/2.6 lens and 60° FOV), and one video was obtained, with permission, from a web collection of naturalistic landscapes, again in HD format². Video clips were shot in fixed-camera mode¹, except for the video clip in the first-person perspective (ego-motion). Temporal calibration was performed by recording a visual stimulus flashing at 1 Hz for 60 s. The recorded flash frequency turned out to be 0.999 Hz, which corresponds to an error of 0.1%.

Video clips represented rather uniform scenes and were displayed on a black background. C1 (jumping man—only human motion) is a frontal shot of a young man jumping in front of a building wall. C2 (foot dribbling—mixed human-physical motion) shows the same man dribbling a soccer ball in front of the same wall. C3 (water waves—physical motion) is a wide shot of an undertow of the sea, with the seashore in the front, surrounded by a few rocks, and the sea in the background. C4 (ego-motion) is a first-person perspective of a walk in a crowded street. The original video clips are available on request from the corresponding author.

Video clips were displayed on a 21″ LCD monitor at a 60 Hz refresh rate and a viewing distance of about 60 cm. We distinguish between refresh rate, i.e., the frequency at which the display’s visual buffer is updated, and frame rate, i.e., the frequency at which different frames are displayed. The frame rate cannot be higher than the refresh rate, but the refresh rate can be higher than the frame rate. In order to reproduce the video clips at a variable speed, participants changed the actual frame rate by means of two keyboard keys (speed increase and decrease). Given the 60 Hz refresh rate and the original 30 Hz video frame rate, reproducing a video at its original speed means displaying the same frame twice in two subsequent refresh cycles (2:1 ratio). Doubling the original video speed means displaying a video frame every refresh cycle (1:1 ratio), while halving it means displaying one video frame for four refresh cycles (4:1 ratio). Intermediate speeds are achieved by implementing appropriate ratios between the frame rate and the refresh rate. This solution for variable-speed video reproduction avoids two visual artifacts, namely, video tearing, which would result from the crude disabling of the V-sync signal, and unnatural motion, which could result from frame resampling/interpolation. In the debriefing, observers did not report motion irregularities, and videos appeared smooth. Programs were written in Matlab using the Psychophysics Toolbox extensions, and were run under Windows 7 on an Intel-based PC with on-board graphics. In this experiment videos were muted.
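The relation between playback speed and refresh cycles can be illustrated with a minimal sketch. This is not the Matlab/Psychophysics Toolbox code used in the study: the function name `frame_schedule` and the accumulator approach are assumptions made for illustration only. For each 60 Hz refresh cycle, the sketch picks the index of the source frame to display, so that frames are only repeated or skipped, never interpolated.

```python
from fractions import Fraction

REFRESH_HZ = 60   # display refresh rate reported in the text
SOURCE_FPS = 30   # original frame rate of the video clips


def frame_schedule(speed, n_refresh):
    """Return the source-frame index shown on each refresh cycle.

    Hypothetical helper (not the authors' implementation): `speed` is the
    playback speed relative to the original (1 = original, 2 = double).
    The frame position advances by speed * SOURCE_FPS / REFRESH_HZ per
    refresh, and the integer part selects the frame to display.
    """
    step = Fraction(speed) * SOURCE_FPS / REFRESH_HZ
    return [int(i * step) for i in range(n_refresh)]


if __name__ == "__main__":
    print(frame_schedule(1, 8))    # original speed: each frame shown twice
    print(frame_schedule(2, 4))    # double speed: one frame per refresh
    print(frame_schedule(0.5, 8))  # half speed: each frame shown four times
```

With this scheme, intermediate speeds produce an irregular but bounded pattern of repeated frames, which is one way to realize the "appropriate ratios" mentioned above.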

In this and our previous study (de’Sperati and Baud Bovy, 2017), we refer to video clip speed although technically we did not implement a gradual speed change but only discrete removal or insertion of single video frames at proper time positions. This choice was meant to prevent video quality deterioration, as a gradual frame rate change obtained by disabling synchronization with the vertical retrace signal could introduce tearing, and interpolation could generate the impression of unnatural motion. What in fact legitimates the use of the term “speed”, rather than “time jumps” or the like, is the observers’ subjective impression of smooth scenes unfolding at all tested speeds, at least in our experimental conditions. Arguably, this smoothness sensation depends on the temporal integration of global motion (Burr and Santoro, 2001; Vaina et al., 2003).

Speed Estimation (Adjustment Task)

Observers were presented with the video clips at a randomized initial speed (frame rate range: 15–60 fps). Their task was to adjust the speed by means of two keyboard keys (in 0.1% steps) in order to reach the speed that they judged to be the original shooting speed, at which point they could skip to the next trial. Each video clip lasted 30 s, and this was also the maximum time available for speed adjustment, after which the program passed automatically to the next trial. Observers were never shown the videos at the original speed as the standard for comparison, and were instructed to be as natural as possible in trying to re-establish the original video speed, as if they were trying to fix the reproduction speed of their old, non-calibrated videotape player. A few familiarization trials with similar video clips preceded the beginning of the experimental session.

Duration Estimation (Interval Reproduction Task)

The same participants were also tested for their ability to reproduce the duration of video clip pieces by means of a prospective interval reproduction task (Grondin, 2010), which was administered after the adjustment task. Observers were shown short pieces of the same four video clips used in the adjustment task for a variable duration randomly extracted from a uniform distribution in the 0.5–5.5 s range, randomly intermingled, starting at a random clip position (five different durations, for a total of 60 trials for each subject). At the end, observers had to reproduce the clip piece duration by holding a keyboard key for the same amount of time.

Experimental Design and Data Analysis

In this experiment we tested four video clips (C1, human motion; C2, human-physical motion; C3, physical motion; C4, ego-motion), three clip sizes (21″, 10.5″ and 5.25″), and two repetitions (blocked), for a total of 24 trials for each participant. All factors were within-subject, and their presentation order was randomized within each block.

For the statistical analyses, we used both means and medians as estimators of central tendency, together with 95% bootstrap confidence intervals (method: bias corrected and accelerated percentile with 1000 runs). The normality assumption was checked with the Shapiro-Wilk test, and outliers were detected with the Grubbs test. Repeated-measures ANOVAs were used, applying the Greenhouse-Geisser correction whenever necessary, in which case the degrees of freedom were non-integer. The null hypothesis was rejected at p < 0.05, while in pairwise comparisons we used p < 0.01 to reduce multiple comparison effects. Effect size was assessed through partial eta squared (η²), while associations were tested with Pearson’s r or Spearman’s rho according to the measurement scales of the variables. For the duration reproduction task, linear fitting of perceived vs. objective durations was performed through a robust linear least-squares method.
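As an illustration of the bias-corrected and accelerated (BCa) bootstrap mentioned above, the following is a minimal sketch in plain Python. It is not the analysis code used in the study; the function name `bca_ci`, its defaults, and the fixed seed are assumptions introduced for illustration.

```python
import random
from statistics import NormalDist, mean


def bca_ci(data, stat=mean, n_boot=1000, alpha=0.05, seed=0):
    """Sketch of a BCa bootstrap confidence interval for `stat`.

    `data` must be a list; `seed` is fixed only to make the sketch
    reproducible (an illustrative choice, not taken from the paper).
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    # Bootstrap replicates: resample with replacement, then sort.
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    nd = NormalDist()
    # Bias correction z0: based on the share of replicates below the estimate.
    below = sum(b < theta_hat for b in boots)
    # Keep the proportion strictly inside (0, 1) so inv_cdf is defined.
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(prop)

    # Acceleration a: from jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jmean = mean(jack)
    num = sum((jmean - j) ** 3 for j in jack)
    den = 6 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted_quantile(q):
        # Shift the nominal quantile q according to z0 and a.
        z = nd.inv_cdf(q)
        adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        idx = min(max(int(adj * n_boot), 0), n_boot - 1)
        return boots[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)
```

Calling `bca_ci(values)` on a list of per-subject estimates returns a (lower, upper) pair. An interval whose lower limit lies above 30 fps would correspond to a PSE significantly higher than the original frame rate.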

Results

Three participants were excluded from the analyses because technical problems corrupted their recordings. Thus, in the following we report data from 12 participants. Four outlier values (1% of the trials) were replaced by mean values.

As an initial step, we considered the adjustment history for all participants in all trials (Figure 1). The plots illustrate the instantaneous frame rate (gray traces), expressed as the delay between two video frames (Inter-Frame Interval, IFI). IFI changed upon the observer’s key strokes, starting with an initial random value. In general, participants reported feeling comfortable with the task, and they converged rather regularly towards the final estimated speed. The overall adjustment behavior is plotted as the mean IFI change (mean adjustment rate) over successive 5-s time intervals (red curves). We preferred this quantity to the number of key strokes because we noted that, to adjust speed, observers tended to keep holding the keyboard keys instead of pressing them frequently and briefly. Thus, the number of key strokes was not a faithful index of effective adjustment. As is evident from the figure, adjustments decreased rather regularly over time, to reach values close to zero towards the end of the trial. This indicates that trials did not end while observers were still adjusting. The mean adjustment rates over the entire trial duration were quite similar across display sizes and repetitions, the only statistically significant effect being the video clip main factor (F(3,33) = 5.680, p = 0.003, η² = 0.341).
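The reported effect size can be cross-checked against the F statistic and its degrees of freedom. Partial eta squared is defined as SS_effect / (SS_effect + SS_error), which is algebraically equal to F·df_effect / (F·df_effect + df_error). The short sketch below applies this identity; the function name `partial_eta_squared` is introduced here for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # Main effect of video clip on mean adjustment rate, as reported above.
    print(round(partial_eta_squared(5.680, 3, 33), 3))
```

For the main effect of video clip on adjustment rate, this reproduces the reported value of 0.341.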

FIGURE 1

Figure 1. Adjustment behavior in Experiment 1. Video speed adjustments are represented through instantaneous Inter-Frame Interval (IFI, gray traces) over time in each individual trial, for each clip (columns) and display size (rows). Blue arrows indicate the mean final IFI value, which is the inverse of point of subjective equality (PSE). Red curves represent the mean adjustment rate (IFI change) over time with 95% confidence intervals. The horizontal blue dotted line indicates the original IFI of the videos (33 ms).

The final value of the adjusted speed reached at the end of the trial is the PSE, expressed as a frame rate value, which we took as a measure of speed judgment (PSE = final IFI⁻¹). Figure 2A shows the PSE values measured in individual trials, separately for clip type (horizontal axis) and display size (color), while Figure 2B shows the average PSE values for each subject. Because no statistically significant interactions were found, in Figures 2C–E we then plotted the mean values and confidence intervals separately for each factor (i.e., video clip, display size and repetition), superimposed on the mean values of individual observers (gray curves). The only significant factor found to affect estimated speed was the video clip type (main effect of video clip, F(1.337,14.707) = 7.009, p = 0.013, η² = 0.389). Pairwise contrasts showed only one non-significant comparison, namely, C2 vs. C4. Despite the non-significant interactions, to further examine whether speed judgments are indeed insensitive to display size, we ran four separate ANOVAs, one for each video clip. Again, no statistically significant effects of display size were found, except for a display size × repetition interaction with C4 (F(2,22) = 5.305, p = 0.018, η² = 0.325).

FIGURE 2

Figure 2. Point of subjective equality (PSE) in Experiment 1. (A) Single-trial data. (B) Single-subject data. (C–E) Effects of the three tested factors. In (B–E), means and 95% confidence intervals are reported. Gray lines in (C–E) are individual subjects’ data. PSE is expressed in frames per second (fps). The horizontal dotted line indicates the original video speed (30 fps). L, M and S stand for Large, Medium and Small display size (21″, 10.5″ and 5.25″ respectively).

These speed judgments indicated a tendency toward speed underestimation. Indeed, the mean PSE values were higher than the objective video clip frame rate (30 fps) by 9%, 3%, 25% and 4% for C1, C2, C3 and C4, respectively (Figure 2C), and the lower limits of confidence intervals did not cross the 30-fps level, except in one case (C2). Similarly, by considering single subjects, in nine of them (75%) the PSEs were significantly above the objective video clip frame rate, again as shown by means and confidence intervals, and only in one subject was PSE significantly smaller than 30 fps (Figure 2B). The pattern of results was practically identical when computing median values (7%, 3%, 23% and 4% for C1, C2, C3 and C4, respectively, where only C2 was not significantly different from the 30-fps reference value), and also after logarithmic data transformation (geometric means: 9%, 3%, 21%, and 5% for C1, C2, C3 and C4, respectively, where only C2 was not significantly different from the 30-fps reference value). These results indicate that observers tended to judge the original video clip speed to be too low. Note that the single distributions were far from being uniform (Figure 2A), and rather tended to be slightly leptokurtic (mean kurtosis index = 0.823), that is, observers did show a speed preference.
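The conversion from a final IFI to a PSE in fps, and from a PSE to the percentage values quoted above, can be sketched as follows. The function names and the sign convention (a positive value means a PSE above 30 fps, i.e., speed underestimation) are assumptions made for illustration. The example input of 37.5 fps is a hypothetical value, not a measured one.

```python
ORIGINAL_FPS = 30.0  # original frame rate of the video clips


def pse_from_ifi(ifi_ms):
    """Convert a final inter-frame interval (ms) into a PSE in fps."""
    return 1000.0 / ifi_ms


def speed_bias_percent(pse_fps, original_fps=ORIGINAL_FPS):
    """Percent deviation of the PSE from the original frame rate.

    Positive values mean the observer set a faster-than-original speed,
    i.e., the original speed was judged too slow (speed underestimation).
    """
    return 100.0 * (pse_fps / original_fps - 1.0)


if __name__ == "__main__":
    # Hypothetical example: a PSE of 37.5 fps is 25% above 30 fps.
    print(speed_bias_percent(37.5))
```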

We checked for a possible correlation between the randomly assigned initial video clip speed and the final observer’s estimation, but the two variables were not correlated at the trial-wise level (r = −0.010, p = 0.866). By contrast, we found a significant trial-wise correlation between total adjustment behavior, computed as the sum of instantaneous IFI changes produced in each trial, and speed estimation (r = 0.211, p < 0.001), which suggests that the more observers adjust, the more they underestimate video speed, regardless of the initial speed.

Figure 3 reports the results of the duration reproduction task, with the estimated video clip durations plotted against their objective durations. Reproduced durations tended to be shorter than original durations, an effect that seems to be more pronounced at longer stimulus durations. For each clip type and display size we computed a slope, which is an index of temporal overestimation (>1) or underestimation (<1). The mean slope was 0.89 (range = 0.81–0.93 across subjects) and was significantly less than the unitary slope (t(11) = −7.869, p < 0.001). At variance with PSE in the speed estimation task, it did not depend on the video clip type (F(3,33) = 2.410, p = 0.085, η² = 0.180). Moreover, slope was not correlated with PSE, either subject-wise (accounted variance = 4%, p = 0.537, N = 12), subject-and-clip-wise (accounted variance = 3%, p = 0.222, N = 48), or subject-and-clip-and-size-wise (accounted variance = 2%, p = 0.085, N = 144). The intercept, which could reflect very short-range temporal processing, was on average negative across subjects and significantly different from zero (−57 ms, t(11) = −10.950, p < 0.001). It was likewise uncorrelated with PSE (subject-wise: accounted variance < 1%, p = 0.969, N = 12; subject-and-clip-wise: accounted variance < 1%, p = 0.658, N = 48; subject-and-clip-and-size-wise: accounted variance < 1%, p = 0.422, N = 144).
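To make the slope and intercept measures concrete, the sketch below fits a straight line to reproduced vs. objective durations. It uses ordinary least squares, which is a simplification: the study reports a robust least-squares method, whose exact settings are not given here. The function name `fit_line` and the example durations are hypothetical.

```python
def fit_line(objective, reproduced):
    """Ordinary least-squares fit: reproduced = slope * objective + intercept.

    Simplified stand-in for the robust fit used in the study (assumption).
    """
    n = len(objective)
    mx = sum(objective) / n
    my = sum(reproduced) / n
    sxx = sum((x - mx) ** 2 for x in objective)
    sxy = sum((x - mx) * (y - my) for x, y in zip(objective, reproduced))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data (seconds): five durations, as in one clip-by-size cell.
    objective = [1.0, 2.0, 3.0, 4.0, 5.0]
    reproduced = [0.9 * t + 0.05 for t in objective]
    slope, intercept = fit_line(objective, reproduced)
    print(slope, intercept)  # a slope below 1 indicates temporal underestimation
```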

FIGURE 3

Figure 3. (A) Correlation between objective and perceived video clip duration. Each dot represents a trial. The oblique dotted line represents unitary slope with no offset. (B) Lack of correlation between speed estimation and duration reproduction, as measured through slope. Each dot (N = 144) represents, for each subject, clip and display size, the slope computed over five trials with different stimulus durations, and the PSE computed as the mean of the two repetitions in the speed estimation task. The horizontal dotted line divides the speed underestimation region (upward) from the speed overestimation region (downward). The vertical dotted line divides the time underestimation region (leftward) from the time overestimation region (rightward). (C) Same as (B) but for intercept. Colors code video clip type (C1–C4).

Finally, we found a significant positive correlation between PSE and subject age (accounted variance = 40%, p = 0.024, N = 12). However, the sample size was small and most subjects were between 20 and 30 years old, thus we cautiously suggest that this result remains to be confirmed with a participant sample better tailored to studying age effects.

Discussion

Experiment 1 showed a tendency toward speed underestimation, that is, a tendency toward perceiving the flow of visual events as too slow. As a consequence, observers adjusted video speed to a rate higher than the original shooting speed. The lack of a significant effect of repetition suggests that speed judgments are rather stable over time. Speed underestimation depended on video clip type, but not on display size or repetition. The largest systematic speed error was found with physical motion (C3, water waves; PSE 25% above the original frame rate), and smaller errors were found with human motion (C1, jumping man; 9%), ego-motion (C4, walking in the crowd; 4%), and mixed human-and-physical motion (C2, foot dribbling; 3%).

As for display size, we failed to find significant effects on PSE, either globally or with single ANOVAs. This suggests that, as far as speed is concerned, it makes no difference whether a movie is watched on a computer monitor (the largest display size used in this experiment) or on a mobile phone (approximately the smallest display size). Clearly, it is possible that stronger manipulations, such as immersive display viewing, would turn out to be effective in modifying video speed judgments.

As for the duration reproduction task, we did not find evidence that duration estimation is related to speed estimation. First, at variance with speed estimation, temporal estimation did not depend on clip type. Second, no correlation was found between the performances in the two tasks, regardless of whether we considered the subject, subject-and-clip, or subject-and-clip-and-size level. Note that the lack of significant effects of repetition in the speed estimation task makes the presence of carry-over effects due to the fixed task sequence unlikely.

Experiment 2

It is possible that the tendency toward speed underestimation found in Experiment 1 depended on the fact that the video clips were muted. Indeed, as anticipated, acoustic stimuli can modulate temporal processing and have a tight relation to movement (Recanzone, 2003). Therefore, Experiment 2 tested whether soundtracks can influence video speed judgments. To this aim, we asked observers to perform the same adjustment task already used in Experiment 1, with the same video clips, but this time accompanied by various soundtracks. We asked whether three acoustic manipulations (tempo variations of a metronome beating; tempo variations of a musical piece; musical pieces and white noise allegedly capable of evoking different arousal states), and in one case, volume manipulation, could modify the PSE for video speed.