Anticipation in turn-taking: mechanisms and information sources

Riest, Carina; Jorschick, Annett B.; de Ruiter, Jan P.

doi:10.3389/fpsyg.2015.00089

ORIGINAL RESEARCH article

Front. Psychol., 02 February 2015

Sec. Psychology of Language

Volume 6 - 2015 | https://doi.org/10.3389/fpsyg.2015.00089

This article is part of the Research Topic Turn-Taking in Human Communicative Interaction.

Faculty for Linguistics and Literary Studies, Bielefeld University, Bielefeld, Germany

During conversations participants alternate smoothly between speaker and hearer roles with only brief pauses and overlaps. There are two competing types of accounts about how conversationalists accomplish this: (a) the signaling approach and (b) the anticipatory (‘projection’) approach. We wanted to investigate, first, the relative merits of these two accounts, and second, the relative contribution of semantic and syntactic information to the timing of next turn initiation. We performed three button-press experiments using turn fragments taken from natural conversations to address the following questions: (a) Is turn-taking predominantly based on anticipation or on reaction, and (b) what is the relative contribution of semantic and syntactic information to accurate turn-taking. In our first experiment we gradually manipulated the information available for anticipation of the turn end (providing information about the turn end in advance to completely removing linguistic information). The results of our first experiment show that the distribution of the participants’ estimation of turn-endings for natural turns is very similar to the distribution for pure anticipation. We conclude that listeners are indeed able to anticipate a turn-end and that this strategy is predominantly used in turn-taking. In Experiment 2 we collected purely reacted responses. We used the distributions from Experiments 1 and 2 together to estimate a new dependent variable called Reaction Anticipation Proportion. We used this variable in our third experiment where we manipulated the presence vs. absence of semantic and syntactic information by low-pass filtering open-class and closed class words in the turn. The results suggest that for turn-end anticipation, both semantic and syntactic information are needed, but that the semantic information is a more important anticipation cue than syntactic information.

Introduction

Participants in a conversation have a number of tasks that they have to perform simultaneously. They have to comprehend the speaker’s utterance while at the same time they need to prepare their response to that utterance, preferably before the current speaker ends their turn. Despite the complexity of these processes the alternation between the speaker and the hearer roles is generally timed with only short pauses and overlaps (Sacks et al., 1974). This conversational phenomenon is an important part of the turn-taking organization.

There are two competing main approaches providing an explanation for the turn-taking organization: the anticipatory approach, in which it is assumed that participants are able to predict the end of a turn in advance, and the signaling approach, which assumes that listeners perceive specific signals to detect the end of a turn.

The aim of this study was first to determine the relative contribution of these two proposed mechanisms to turn-taking and second, to investigate which linguistic information sources listeners predominantly use for end-of-turn anticipation. To this end, we conducted a series of button-press experiments with turns from natural conversations while manipulating both the respective critical information sources and the task.

The anticipatory approach argues that the precise timing in conversations can only be explained by the listeners’ ability to make accurate predictions about the end of the speaker’s utterances. Depending on the assumed anticipatory model listeners use various kinds of information to anticipate. The first to claim that listeners are able to anticipate a turn ending were Sacks et al. (1974). In their famous and often-cited turn-taking model they provide an explanation for the characteristic smooth speaker transitions in natural conversation. According to their model, turns consist of syntactic building blocks called turn-constructional units. Listeners are able to predict the end of a turn-constructional unit. At this point a speaker change becomes relevant. This point in time is called a transition-relevance place. When a turn arrives at a transition-relevance place it is possible (a) for the current speaker to select another speaker, or (b) for another speaker to self-select and start talking. If neither option (a) nor (b) is used the current speaker can produce another turn.

In contrast, the signaling approach assumes that turn transitions are regulated by an exchange of conventional vocal or gestural signals (e.g., Yngve, 1970). So in this approach, participants in a conversation do not anticipate these signals but react to them after having perceived them. Influential proponents of the signaling approach who did numerous studies on finding explicit turn taking signals are Duncan (1972, 1973), Duncan and Niederehe (1974), and Duncan and Fiske (1977). They assume that there exist definite signals that are displayed and responded to according to specific rules. According to Duncan (1972) such signals are composed of one or more of six behavioral cues: (1) any phrase-final intonation other than sustained, intermediate pitch level, (2) drawl on the final syllable or on the stressed syllable of a terminal clause, (3) the termination of any hand gesticulation, (4) sociocentric sequences (stereotyped expressions like “you know,” “isn’t it,” etc.), (5) drop in pitch and/or loudness in conjunction with one of the sociocentric expressions, or (6) termination of a grammatical clause. According to Duncan and Fiske (1977) speakers always produce at least one of these turn transition cues at the end of their turn, to which listeners react by initiating their next turn. The more cues a speaker produces the more likely a change of speaker role is at that point.

The standard argument against the signaling approach is that the relevant cues occur too late in the speaker’s turn to enable timely speaker changes. As a counter-argument, Heldner and Edlund (2010) note that the timing of floor changes is not as precise as it is often claimed. In their analysis of three different conversational corpora 41–45% of between-speaker intervals were longer than 200 ms. They claim that these intervals are potentially long enough for people to react to end-of-turn signals. Their argumentation is based on the distribution of observed delays and pauses in conversational turn-transfers. In their view, pauses longer than 200 ms could also plausibly be explained by assuming they were reactions to signals (p. 566), while pauses shorter than 200 ms could correspond to anticipation (55–59% of the turn transitions in the investigated corpora). Their reaction threshold explanation is based on minimal response times, which were investigated under maximally favorable conditions. Their argument for this strict threshold is that interlocutors are highly trained to recognize gaps, when they can start their turn. But even if one assumes higher thresholds reaching up to 600 ms (Jescheniak et al., 2003; Indefrey and Levelt, 2004; Schnur et al., 2006) Heldner and Edlund (2010) argue that the proportion of responses which can be explained by reaction would be lower, but would not be eliminated.

We want to suggest that the presence of gaps longer than 200 ms does not necessarily mean that the turn before the gap was reacted to. Speakers often intentionally delay the production of so-called ‘dispreferred’ responses, which leads to longer pauses (see, e.g., Levinson, 1983; Kendrick and Torreira, 2014). So pauses longer than 200 ms are not necessarily caused by reaction, but can also be caused by an anticipated response that was nevertheless intentionally delayed. Conversely, response times of shorter than 200 ms need not always be caused by anticipation, but can be early reactions to perceived signals (false alarms). Hence, using a fixed cut-off value does not give us an accurate estimate of the relative number of anticipated and reacted turn transitions.

One possible criticism regarding the anticipatory approach is that Sacks et al. (1974) do not explain the mechanism responsible for anticipation, and more specifically, which information listeners use to ‘project’ when a turn is going to end (Sacks et al., 1974; Power and Dal Martello, 1986; O’Connell et al., 1990). Sacks et al. (1974) present only observational evidence suggesting that syntax and intonation play an important role in this process. But in the last decade possible mechanisms of turn-end anticipation have been investigated in more depth.

To investigate the role of intonational contour and lexico-syntactic cues in end-of-turn anticipation, De Ruiter et al. (2006) performed a button-press experiment presenting turns taken from natural Dutch conversations to participants. The instruction was to press a button when they thought the turn was going to end. They presented unaltered turns as well as manipulated turns where the lexico-syntactic information was absent but the intonational contour remained intact and vice versa. The intonational contour was manipulated by completely flattening the pitch, leaving duration, rhythm and intensity intact. The lexico-syntactic information was manipulated by low-pass filtering the original turn fragment. In this way, words could no longer be identified, but the pitch contour remained intact. The results show that for unaltered turns, the average response time was about 200 ms before the turn was finished. This indicates that rather than waiting for the end of the turn and then reacting, the participants tried to anticipate the turn ending. With the intonation contour absent but the lexico-syntactic information intact, the participants were still able to accurately anticipate the turn ending. But the anticipation accuracy deteriorated significantly in the absence of the lexico-syntactic information. The authors concluded that the lexico-syntactic structure is necessary (and perhaps even sufficient) for accurate end-of-turn projection. They suggested that the syntactic structure provides constraining information about the upcoming words and serves as a temporal resource for the listeners to monitor the unfolding turn. An important difference between the task used by De Ruiter et al. (2006) and turn-taking in natural communication is that listeners in the experiment do not need to prepare and produce an utterance. This actually led to more accurate responses in the experiment compared to the responses in the natural conversations from which the experimental stimuli were culled. Hence, we believe that the results from this methodology are at least qualitatively generalizable to the natural situation.

Keitel et al. (2013) used eye-tracking methodology to investigate the influence of semantic content and intonation on anticipation ability during development. They presented recordings of actors performing conversations to three different age groups (prelinguistic 6–12 months, linguistic 24–36 months, adults) while measuring their gaze. The conversations were presented either with normal or flattened intonation. If a gaze was shifted from the current to the next speaker at least 500 ms before the end of the current turn, it was considered anticipatory. But if the gaze shifted after the listener began to speak the gaze shift was coded as reactive. The results showed that in contrast to younger infants, children at the age of three are already able to reliably anticipate the end of turns. Furthermore, intonation influenced anticipation only in this specific age group, suggesting that at that age they rely more strongly on intonational information for anticipation than adults. The authors explained this finding by noting that the syntactic and semantic competence of the 3-year-olds is not yet adult-like. This is in line with the finding that adults tend to rely on prosody for the detection of turn-ends only when neither semantic nor syntactic information is available (Grosjean and Hirt, 1996).

A comparable study was done by Casillas and Frank (2013) who also investigated which linguistic cues children use to anticipate a turn ending. In contrast to Keitel et al. (2013) they tested 1–7 year-olds and instead of using conversations done by actors, they measured the children’s gaze shifts while watching videos of conversations between puppets. Casillas and Frank (2013) found that even 1 and 2-year-olds anticipated turn endings, and that their anticipation correlated positively with the duration of the gap between two successive turns. They also manipulated the prosodic or lexical information (or both) of the conversations, and compared question with non-question turns. In their general discussion, they write that “Question effects are strongest when both prosodic and lexical cues are present, contrary to prior findings with adult listeners that found lexical information sufficient to predict upcoming turn-end boundaries (De Ruiter et al., 2006)” (emphasis in original). We are not convinced that there is a clear contradiction between their study and the result of De Ruiter et al. (2006) for the following reasons. First, the study by Casillas and Frank (2013) does not provide enough information to assess whether there is a statistically significant effect corresponding to this specific claim. Second, in the study by De Ruiter et al. (2006), the factor Question vs. No-Question was not investigated. (In Stivers et al. (2009) the data from De Ruiter et al. (2006) was reanalyzed and indeed showed no difference between responses to questions and non-questions, but that was only for the natural data.) Finally, it is possible, perhaps even plausible, that asking actors to record a conversation speaking “as if they were on a children’s television show” (p. 2) will result in prosodic patterns that are more exaggerated than in natural speech, due to the explicit child-directedness of the actors’ speech. 
For these reasons, we do not (yet) see a clear contradiction between the results of Casillas and Frank (2013) and those of De Ruiter et al. (2006).

To investigate how listeners use lexico-syntactic information to anticipate turn-ends Magyari and De Ruiter (2012) conducted a gating study. They used the experimental stimuli of De Ruiter et al.’s (2006) study and selected turns of which the ends were either predicted with a high or with a low accuracy in the button-press experiment. The results showed that the proportion of the correct guesses of upcoming words was higher when the accuracy of button-press in the original experiment was higher. Furthermore, in the gating study the participants expected more words to come with those turns that resulted in button presses that occurred too late in De Ruiter et al.’s (2006) study. They concluded that listeners make predictions in advance about which, and therefore how many, words will follow in a turn. These predictions help to estimate the remaining duration of the turn.

The idea that lexico-syntactic information serves as source for listeners’ anticipation performance is also supported by conversation-analytic studies (e.g., Ford and Thompson, 1996; Selting, 1996; Caspers, 2003). Caspers (2003) showed in her quantitative investigation that turn transitions are always located at syntactic completion points. She concluded that syntax constitutes the main information source for end-of-turn projection. Similar findings, based on a quantitative analysis of standard German, have been presented in Selting (1996), who concluded that listeners primarily exploit syntactic structure to project turn endings. Ford and Thompson (1996) found in their analysis of an American English face-to-face corpus that speaker change most frequently occurred when syntactic completion was combined with intonational as well as pragmatic completion. They concluded that syntax operates together with intonation and pragmatics to project the end of turns (see also Gravano and Hirschberg, 2011). As not all these studies found a perfect correspondence of syntactic completion points to turn-transitions, it remains an intriguing question how the distinction between those syntactic completions that are, and those that aren’t treated as turn-ends by the listeners is made. Unfortunately, this question cannot be satisfactorily answered by studying correlations in dialog corpora, but would require explicit experimentation to be able to distinguish correlation from causation.

To summarize, there is evidence from multiple sources that listeners are able to anticipate the end of the speaker’s turn (De Ruiter et al., 2006; Casillas and Frank, 2013; Keitel et al., 2013). But the mere existence of an anticipation ability does not imply that it is actually used to predict when a turn is finished in natural communication. Furthermore, Heldner and Edlund (2010) argued that turn-taking could at least partially be explained by assuming that conversationalists simply react to signals. Thus, the first question we want to investigate in this study is: is turn-taking based on anticipation or on reaction?

Experiment 1

To determine the relative role of anticipation and reaction in turn-taking we conducted a button-press experiment using the same experimental methodology as in De Ruiter et al. (2006). We took turns from natural conversations and asked the participants to indicate the end of the turn by pressing a button. In the turns, we manipulated the information available for anticipation of the turn end and studied the effect of this manipulation on the projection accuracy. Our manipulations ranged from providing complete advance information about the turn-end to completely removing all linguistic information from the turn. (These manipulations are described in detail below.) The logic is that if the projection accuracy in responding to the original (unchanged) turns is comparable to responses to turns with advance information, then this is evidence for anticipation. On the other hand, if the projection performance to the natural turns is similar to the responses to the turns without or with substantially reduced linguistic information, this is evidence for people reacting to the perceived end of the turn.

Materials and Methods

Compliance with ethics guidelines

The experimental methods used in this project have been approved by the Ethics Board of Bielefeld University. Informed consent was obtained from all subjects.

Participants

Eighty native speakers of German participated in Experiment 1 (56 females, 24 males).

Stimulus collection

The stimulus collection procedure is the same as the one described in De Ruiter et al. (2006). For maximum ecological validity we took our stimuli from a natural German ‘telephone’ corpus (audio-only conversation), which we recorded in our lab. We recorded 16 native speakers of German in eight dyadic conversations (four female–male, three female–female, one male–male). The participants in each dyad were friends. For the stimulus collection we told the participants to just talk about anything they liked and gave them no further instruction. Each dyad’s conversation lasted 20 min, resulting in a total of 160 min of recorded conversation.

For the audio recordings we put the participants in two separate rooms and required them to wear closed headphones. Directional microphones were placed on a table in front of them. We established a telephone-like connection between them, such that both participants could hear both themselves and their interlocutor. The speech of each of the two participants was recorded separately on the two channels of a stereo recording device. This way, we avoided cross talk between the participants in our recordings. The participants rapidly got used to the recording situation and the resulting conversations appeared natural and lively.

After recording the corpus, the conversations were transcribed, registering overlaps, pauses, laughter, turn beginnings and endings, assessments (Goodwin, 1986), and continuers (Schegloff, 1982). In addition we measured the Floor Transfer Offset (FTO) of 1597 turn transitions. The FTO value is defined “as the difference (in seconds) between the time that turn starts and the moment the previous turn ends” (De Ruiter et al., 2006, p. 516). Hence, a gap between two turns is characterized by a positive FTO value and an overlap by a negative one. Figure 1 shows the distribution of the FTO values.
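
The FTO definition quoted above can be illustrated with a minimal Python sketch. The function name and the example timestamps are hypothetical and are not taken from the corpus; the sketch only shows the sign convention (positive for gaps, negative for overlaps).

```python
def floor_transfer_offset(prev_turn_end_s, next_turn_start_s):
    """FTO in seconds: start of the next turn minus end of the previous turn."""
    return next_turn_start_s - prev_turn_end_s


# Hypothetical timestamps (in seconds), for illustration only.
gap = floor_transfer_offset(prev_turn_end_s=12.40, next_turn_start_s=12.65)
overlap = floor_transfer_offset(prev_turn_end_s=20.10, next_turn_start_s=19.90)

print(f"gap FTO: {gap:+.2f} s")          # positive value: silence between turns
print(f"overlap FTO: {overlap:+.2f} s")  # negative value: next speaker started early
```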

FIGURE 1. Floor Transfer Offset (FTO) distribution of the German telephone corpus.

Although the general shape of the FTO distribution resulting from the German telephone corpus looks similar to the Dutch FTO distribution from De Ruiter et al. (2006), the distributions differ in a number of aspects. There are small differences in the means, variances, skewness, and kurtosis (see Table 1)¹.

TABLE 1. Comparison of Dutch and German telephone corpora.

From this corpus we randomly selected 100 target turns and an additional 16 turns for practice purposes. We took care that the turns contained at least five words so that the participants in the planned button-press experiments had enough information on which to base their responses. Furthermore, we made sure that the random selection reflected the distribution of pauses and overlaps of the natural conversations. We also balanced the sex of the speaker in the target turns (50% female, 50% male). The total number of different speakers in our target stimuli was 16. Table 2 presents some descriptive statistics of the target turns.

TABLE 2. Descriptive statistics of target turns.

After selecting the target turns, we extracted them into individual sound files using Praat (Boersma and Weenink, 2012) and created four different versions of each stimulus. These versions were as follows.

Natural-Turn. The target turn was presented as it occurred in the natural conversations. In this condition the participants had access to all potentially relevant information to base their anticipation or reaction on.

Advance-Knowledge. The participants could first read the content (a literal transcription) of the turn before they heard the target stimulus. Because the participants knew in advance how the turn was going to end, they were, in principle, maximally able to anticipate the turn end. This condition measured the response distribution of anticipated responses.

Scrambled-Word-Order. We randomly changed the order of the words within the target turn using Praat. The pauses between the words in the original were assigned to the subsequent word. The resulting stimuli therefore had the same duration as the Natural-Turn stimuli. In this condition there was no sequential word-order information to base the anticipation on, but there were still words present. Thus, the predictability of a word on the basis of its preceding words was removed, i.e., the cloze probability (Taylor, 1953) of the words in the resulting turns was very low. In contrast to the Natural-Turn condition, anticipation of the turn end on the basis of sequential lexical information was made impossible.
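
The scrambling logic can be sketched abstractly in Python. This is not the Praat procedure used for the stimuli; it is a simplified illustration operating on hypothetical (label, duration) units, where each unit stands for a word together with the pause that preceded it. Because whole units are reordered, the total duration is unchanged.

```python
import random

# Hypothetical units: (word label, duration in ms including the preceding pause).
units = [("w1", 310), ("w2", 180), ("w3", 420), ("w4", 260), ("w5", 390)]


def scramble_units(word_units, seed=0):
    """Return a randomly reordered copy of the word units."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(word_units)
    rng.shuffle(shuffled)
    return shuffled


scrambled = scramble_units(units)

original_duration = sum(duration for _, duration in units)
scrambled_duration = sum(duration for _, duration in scrambled)
print(original_duration, scrambled_duration)  # identical totals
```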

Noise. The Noise condition was created using a Praat script that convolved the speech stimulus of the natural turn with white noise. The resulting sample of constant noise had the same duration and frequency spectrum as the original fragment. This condition served as a comparative baseline from which all linguistic information that could be used for anticipation was removed. The only way to be certain that the turn has ended in this condition is to react to the turn end. This condition measured the response distributions when the participants had no choice but to react to the end of the turn.

In order to control for subjective loudness between conditions and stimuli we adjusted the loudness of all stimuli to a reference sone value.

Design

Each participant was presented with four trial blocks (Natural-Turn, Advance-Knowledge, Scrambled-Word-Order, Noise), each containing 25 target turns. Within each block there were four practice trials followed by the 25 target turns. We created eight different experimental lists. In the first four lists we permuted the block order according to a Latin-square design. The remaining four lists were the same as the first four lists with the block order as well as the presentation order of the stimuli reversed. Each of the target turns appeared in all four conditions across the lists but none appeared twice within the same experimental list.
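
One way to picture the eight block orders is the following Python sketch. It assumes a cyclic Latin square, which is only one of several possible Latin squares (the text does not say which one was used), and it covers the block order only, not the reversal of the stimulus order within blocks.

```python
CONDITIONS = ["Natural-Turn", "Advance-Knowledge", "Scrambled-Word-Order", "Noise"]


def cyclic_latin_square(items):
    """Each item occurs exactly once in every row and in every column."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


# Lists 1-4: block orders from the (assumed) cyclic Latin square.
first_four = cyclic_latin_square(CONDITIONS)
# Lists 5-8: the same lists with the block order reversed.
last_four = [list(reversed(order)) for order in first_four]
block_orders = first_four + last_four

for list_number, order in enumerate(block_orders, start=1):
    print(list_number, order)
```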

Procedure

The participants received a written instruction that they had to listen to short audio fragments, taken from real conversations, and to press a button as soon as they thought the speaker in the fragments would finish speaking. They were informed that they would be presented with four different blocks, and that in one of these blocks they had to first read the content of the fragment before they heard the corresponding audio fragment. Furthermore, they were informed that in two blocks the stimuli were manipulated acoustically. The stimuli were presented to them via closed headphones. We randomly assigned the participants to one of the eight experimental lists (10 per list).

The participants were presented first with the four practice trials and after that with the corresponding trial block. After each practice block the participants got the chance to ask the experimenter questions. Each experimental block contained a visual countdown from 3 to 1 followed by the auditory presentation of the stimuli. As soon as the participants pressed the button the sound was immediately cut off. In this way we made sure that the participants got no feedback about their performance. The trial block Advance-Knowledge differed from the other trial blocks because after the visual countdown the participants were presented with a written sentence, representing the content of the turn. After pressing the button the sentence disappeared and the acoustic presentation of the stimulus started.

For the presentation of the stimuli we used the E-Prime software package (Schneider et al., 2012a,b), which also allowed us to record the time from stimulus onset to button press.

Results and discussion

We first calculated the BIAS, which is defined as response time minus the duration of the target turn. Figure 2 shows the BIAS distributions for the four different conditions. Figure 3 shows an overview of the average BIAS per condition. The average BIAS is negative in all conditions, which gives a first hint that participants tried to anticipate the turn ending, rather than wait until the turn fragment was over.
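
The BIAS measure is a simple difference, as the following Python sketch shows. The trial values are invented for illustration and do not come from the experiment; a negative BIAS means the button was pressed before the turn ended.

```python
from statistics import mean


def bias_ms(response_time_ms, turn_duration_ms):
    """BIAS: response time minus turn duration, both measured from stimulus onset."""
    return response_time_ms - turn_duration_ms


# Hypothetical trials: (response time in ms, turn duration in ms).
trials = [(1850, 2000), (3100, 3050), (2400, 2600)]
biases = [bias_ms(rt, duration) for rt, duration in trials]
average_bias = mean(biases)

print(biases)        # per-trial BIAS values in ms
print(average_bias)  # negative average: responses tended to precede the turn end
```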

FIGURE 2. Response distributions per condition from Experiment 1.

FIGURE 3. Average BIAS of responses per condition as measured in Experiment 1. Asterisk indicates statistical significance at the 0.05 level.

An ANOVA for the dependent variable BIAS showed a significant effect for presentation condition (by subjects: F1(3,315) = 23.259, p < 0.001; by items: F2(3,297) = 18.82, p < 0.001). Bonferroni-corrected paired t-tests, pairing over identical turn fragments from the two conditions under comparison, revealed that the Natural-Turn condition led to significantly more negative BIAS than the Noise and the Scrambled-Word-Order conditions. The latter condition led to significantly more negative BIAS than the Noise condition. In contrast, the BIAS in the Advance-Knowledge and the Natural-Turn conditions did not differ significantly from each other.

Conventional significance tests are designed to reject the null hypothesis without fault in the limit of infinite sample size. This is characterized by vanishing p-values and unbounded t-values. In contrast, if the null hypothesis is true, the p-values do not converge to any limit value as the sample size goes to infinity. Correspondingly, under the null hypothesis, all p-values are equally likely (Rouder et al., 2009). Hence, it is not possible to claim evidence favoring a null hypothesis using conventional significance tests. We therefore also performed a Bayesian analysis (Jeffreys, 1961; Kass and Raftery, 1995) for the Advance-Knowledge and the Natural-Turn condition by comparing them using a Bayesian paired t-test (Rouder et al., 2009). To be consistent with Morey and Rouder (2011) and Rouder et al. (2012) we used a Cauchy prior with scale parameter for the standardized effect size in combination with a Jeffreys prior on the variance. The analysis was performed using the BayesFactor package (Morey et al., 2014) for R (R Development Core Team, 2009). An overview of a common textual interpretation of Bayes Factor values is presented in Table 3.

TABLE 3. Evidence Categories for Bayes Factor, adapted from Jeffreys (1961), cited in Wetzels et al. (2011).

The Bayesian paired t-test using item means for the variable BIAS revealed that the null hypothesis, stating that Advance-Knowledge and Natural-Turn condition are equal in anticipation accuracy, is twelve times more likely than the alternative hypothesis that these two conditions differ in button press accuracy (BF = 0.08). This provides “strong” evidence for the null hypothesis.

Comparing the subject means of the BIAS variable with the Bayesian paired t-test resulted in “substantial” evidence (BF = 0.1) for the null hypothesis. This analysis allows us to conclude that there is no statistically reliable difference between the BIAS in the Advance-Knowledge and the Natural-Turn condition. So the participants’ button press accuracy with the natural turns was just as good as when they had advance information about the content of the turn. This finding suggests that participants are indeed able to anticipate a turn ending, and that they are using this strategy to predict when a turn is going to end.
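
The relation between a reported Bayes Factor below 1 and the statement that the null hypothesis is several times more likely is a reciprocal, which the short Python sketch below works through for the two BIAS values. The function name is illustrative, and the direction of the ratio (alternative over null) is an assumption inferred from how values below 1 are interpreted in the text.

```python
def support_for_null(bf_alternative_over_null):
    """Convert a Bayes Factor for the alternative into the ratio favoring the null."""
    return 1.0 / bf_alternative_over_null


by_items = support_for_null(0.08)    # about 12.5, described as "twelve times" in the text
by_subjects = support_for_null(0.1)  # about 10

print(round(by_items, 1), round(by_subjects, 1))
```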

The significant difference between the Scrambled-Word-Order and Noise conditions indicates that having access to words (even though they were in the wrong order) still allowed the participants to anticipate better than chance.

Although there was no significant difference in the button press accuracy between the Advance-Knowledge and the Natural-Turn condition, the participants could still have reacted to signals to a certain extent. If the participants used both anticipation and reaction as a strategy this should result in a lower response consistency. To investigate the response consistency over conditions we computed the Entropy for every stimulus/condition pair (Shannon, 1948). The Shannon Entropy is a measure of uncertainty: the more the responses are distributed over different intervals the higher the Entropy. If the participants used only one strategy to estimate when the turn is over, the Entropy should be lower. However, if the participants used both reaction and anticipation, their responses should be more highly distributed, resulting in a higher Entropy.

In Figure 4 the average Shannon Entropy (using a bin-width of 250 ms; see De Ruiter et al., 2006 for details) is shown for every condition. We can only show a by-item analysis as these Entropy values can only be meaningfully computed for individual stimuli over entire response distributions.
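
As a rough illustration of this measure, the Python sketch below bins the responses to one stimulus into 250 ms intervals and computes the Shannon Entropy of the resulting distribution. The logarithm base (2) and the alignment of bin edges at zero are assumptions made for this sketch; the exact procedure is described in De Ruiter et al. (2006). The response values are invented.

```python
import math
from collections import Counter


def shannon_entropy_bits(biases_ms, bin_width_ms=250):
    """Shannon Entropy (in bits) of responses binned into fixed-width intervals."""
    if not biases_ms:
        raise ValueError("at least one response is required")
    bin_counts = Counter(math.floor(b / bin_width_ms) for b in biases_ms)
    n = len(biases_ms)
    return -sum((c / n) * math.log2(c / n) for c in bin_counts.values())


# Hypothetical BIAS values (ms) for two stimuli.
consistent = [-120, -80, -200, -30]  # all responses fall into one 250 ms bin
spread = [-600, -300, -100, 200]     # each response falls into a different bin

entropy_consistent = shannon_entropy_bits(consistent)
entropy_spread = shannon_entropy_bits(spread)
print(entropy_consistent, entropy_spread)
```

Responses concentrated in a single bin give an Entropy of zero, while responses spread evenly over four bins give two bits, which matches the interpretation of higher Entropy as lower response consistency.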

FIGURE 4. Average Shannon Entropy per stimulus/condition as measured in Experiment 1. Asterisk indicates statistical significance at the 0.05 level.

As in the BIAS analysis, an ANOVA of the Entropy showed a main effect for condition (F2(3,297) = 62.5, p < 0.001). Bonferroni-corrected paired t-tests revealed that all differences between individual conditions were significant (p < 0.001), the exception being the difference between Advance-Knowledge and Natural-Turn.

Again we compared the Entropy values in the Advance-Knowledge and Natural-Turn conditions using a Bayesian paired t-test. The analysis (BF = 0.2) provided “substantial” evidence for the null hypothesis (no difference between Advance-Knowledge and Natural-Turn in button press consistency).

The analysis of the participants’ button press consistency supports the interpretation of the BIAS results. The results showed that the Entropy in the Natural-Turn condition and the Advance-Knowledge condition was comparable. Thus, in the Natural-Turn condition the participants were able to consistently and accurately anticipate the turn-end and consequently used anticipation as a strategy to tell when a turn was over.

In contrast, in the Scrambled-Word-Order and the Noise conditions the Entropy values were significantly higher than in the other two conditions. This suggests that the participants tried to anticipate the turn-end rather than just waiting for the end of the fragment, which led to significantly more broadly distributed responses. In addition, the average Entropy in the Scrambled-Word-Order condition was significantly lower than in the Noise condition. This corresponds to the BIAS analysis above, where participants in the Scrambled-Word-Order condition were significantly more accurate in detecting the turn end. Hence, the participants are more consistent and accurate in end-of-turn projection when they have access to words than when they only hear noise. One explanation of this finding could be that even with scrambled word order listeners are able to recognize the basic meaning of the turn, enabling them to roughly guess when the turn finishes. Additionally, it is possible that once the participants “gambled” that a certain word was the last word, they could anticipate the end of that word, as suggested by research on auditory word recognition (Marslen-Wilson and Welsh, 1978; McClelland and Elman, 1986; Marslen-Wilson, 1987).

We showed in Experiment 1 that listeners in dialog are indeed able to anticipate the end of the speaker’s turn and that they consistently use this ability to predict when a turn is going to end. When listening to the natural turns the participants showed the same response accuracy and consistency as when they knew the end of the turn in advance. Our results are in line with earlier findings that listeners anticipate turn endings and that natural language is predictable to a certain degree (De Ruiter et al., 2006; Magyari and De Ruiter, 2012; Casillas and Frank, 2013; Keitel et al., 2013; Magyari et al., 2014). Hence, in the first experiment we were able to show that anticipation is the primary mechanism underlying smooth turn-taking, and that participants consistently use this strategy to detect a turn ending. Thus, our results support the turn-taking model proposed by Sacks et al. (1974). Nevertheless, reaction to the turn end might well serve as a kind of “backup” mechanism in cases where anticipation of the turn ending is, for whatever reason, not possible.

We now have an empirically derived distribution of anticipation times, from a task in which the participants were asked to anticipate turn-ends and had the information to do so. To find out about the distributional properties of the reaction process, which we assume also plays a role, we need to study the reaction time distribution of participants who had no information with which to anticipate (as in the Noise condition of Experiment 1) and who, in addition, were instructed not to anticipate but rather to respond to the end of the stimulus. To this end, we conducted Experiment 2.

Experiment 2

Heldner and Edlund (2010) suggested that turn transitions with a gap longer than 200 ms are potentially explainable by assuming that participants respond to signals at the end of the turn. As we discussed in the introduction, this assumption does not capture the stochastic nature of the time course of the two processes involved. Instead, we assume that distributions of natural floor transfer are actually a stochastic mixture of an anticipation and a reaction time distribution. We wanted to empirically estimate the distribution of reacted responses in order to be able to estimate the proportion of turn-transitions that we were reasonably sure were reactions to (and not anticipations of) the turn end.
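The mixture assumption stated above can be written compactly. The notation below is introduced here only for illustration and is not taken from the article: f_A and f_R denote the densities of anticipated and reacted response times, t the response time relative to the turn end, and p the unknown proportion of reacted responses.

```latex
f_{\mathrm{obs}}(t) = (1 - p)\, f_A(t) + p\, f_R(t), \qquad 0 \le p \le 1
```

Under this reading, a single fixed threshold such as 200 ms cannot cleanly separate the two processes wherever the two densities overlap, which is why both component distributions need to be estimated empirically.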

An empirically estimated anticipation distribution is provided by the Advance-Knowledge condition of Experiment 1. In Experiment 2 we wanted to obtain the other distribution, based on pure reaction time. To this end, we used the Noise stimuli from Experiment 1, but now explicitly instructed the participants to respond only after they perceived the end of the fragment.