Word frequency cues word order in adults: cross-linguistic evidence

Gervain, Judit; Sebastian-Galles, Nuria; Diaz, Begona; Laka, Itziar; Mazuka, Reiko; Yamane, Naoto; Nespor, Marina; Mehler, Jacques

doi:10.3389/fpsyg.2013.00689

ORIGINAL RESEARCH article

Front. Psychol., 02 October 2013

Sec. Psychology of Language

Volume 4 - 2013 | https://doi.org/10.3389/fpsyg.2013.00689


Judit Gervain¹,²*

Núria Sebastián-Gallés³

Begoña Díaz³

Itziar Laka⁴

Reiko Mazuka⁵,⁶

Naoto Yamane⁵

Marina Nespor⁷

Jacques Mehler⁷

¹Laboratoire Psychologie de la Perception, Université Paris Descartes, Sorbonne Paris Cité, Paris, France
²Laboratoire Psychologie de la Perception, CNRS, Paris, France
³Department of Technology, Center for Brain and Cognition, Universitat Pompeu Fabra, Barcelona, Spain
⁴Linguistics and Basque Studies, Psycholinguistics Laboratory, University of the Basque Country (UPV/EHU), Vitoria-Gasteiz, Spain
⁵Laboratory for Language Development, RIKEN Brain Science Institute, Wako-shi, Japan
⁶Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
⁷Language, Cognition and Development Lab, Cognitive Neuroscience Sector, SISSA - International School for Advanced Studies, Trieste, Italy

One universal feature of human languages is the division between grammatical functors and content words. From a learnability point of view, functors might provide entry points or anchors into the syntactic structure of utterances due to their high frequency. Despite its potentially universal scope, this hypothesis has not yet been tested on typologically different languages and on populations of different ages. Here we report a corpus study and an artificial grammar learning experiment testing the anchoring hypothesis in Basque, Japanese, French, and Italian adults. We show that adults are sensitive to the distribution of functors in their native language and use them when learning new linguistic material. However, compared to infants' performance on a similar task, adults exhibit a slightly different behavior, matching the frequency distributions of their native language more closely than infants do. This finding bears on the issue of the continuity of language learning mechanisms.

Introduction

Speakers of English readily recognize “'Twas brillig, and the slithy toves / Did gyre and gimble in the wabe …,” the first lines of Lewis Carroll's Jabberwocky poem, as having an English-like grammatical structure despite the absence of any meaning. This is a striking illustration of a universal feature of human languages: grammatical functors (in the quote, items such as “and,” “the,” and “in”) define and signal sentence structure, while content words carry meaning. Languages differ with respect to which universally available content or function word categories they instantiate and how they implement them, but the major divide between function and content words is present in all the world's languages (Fukui, 1986; Abney, 1987). Using a cross-linguistic perspective, the present paper investigates whether this feature of human languages contributes to their parsing and learning in adult speakers. The main goal of the paper is to show that the anchoring property of function words applies in typologically different languages and can be used even by adults, i.e., speakers with a full-blown linguistic competence.

Function words have been hypothesized to contribute to language learning in at least two ways. First, they often help categorize content words. In English, for instance, nouns are typically preceded by determiners such as a, the, some, etc., whereas verbs are often preceded by auxiliaries, such as have, is, etc. and they take suffixes like -ing, -ed, etc. Formally described first by structuralist linguists (Bloomfield, 1933; Saussure et al., 1983), this role of functors has been central to formalist as well as statistical theories of language, and numerous behavioral and computational studies have established its psychological relevance (e.g., Thorne, 1968; Morgan et al., 1996; Redington et al., 1998; Shi et al., 1998, 1999; Mintz, 2002; Shi and Werker, 2003; Shi et al., 2006).

Second, functors have been assumed to cue rules and increase the learnability of structural generalizations in language. Intuitively, functors, due to their high frequency, act as anchor points with respect to which the structural roles and sequential positions of other constituents can be encoded and remembered. This hypothesis has been explored in a number of artificial grammar learning studies, asking whether (artificial) languages with and without the functor/content word distinction show different degrees of learnability (Braine, 1966; Green, 1979; Morgan and Newport, 1981; Mori and Moeser, 1983; Morgan et al., 1987; Valian and Coulson, 1988; Valian and Levitt, 1996; Wang et al., 2011). These studies, discussed in detail below, confirm that functors contribute to the learnability of linguistic structure by serving as structural flags or anchor points for English-speaking participants.

However, functors are realized differently across languages, one important difference being their relative order with respect to content words. Whether functors serve as anchor points in languages with different configurations remains largely unexplored. A few recent studies with infants exposed to languages other than English (Höhle et al., 2001; Gervain et al., 2008; Hochmann et al., 2010) suggest that young learners of typologically diverse languages do use functors as entry points into language structure. However, the number of languages investigated remains limited (German, Italian, and Japanese). Further, it is also unknown whether the anchoring role of functors in these typologically different languages is present only at the beginning of the language acquisition process, or whether it is a strategy that even adults rely on when parsing novel linguistic material. Arguments have been put forth in the literature for both the continuity and the discontinuity of language learning and competence across the life span (Guasti, 2002; Santelmann et al., 2002).

The aim of the current study is, therefore, twofold. First, it seeks to investigate parsing strategies based on the distribution of frequent items, used by adult speakers of two pairs of typologically different languages: Basque and Japanese as well as Italian and French. French and Basque have never been investigated from this perspective in the literature before. Second, by testing adult speakers in a task adapted from one of the previous infant studies (Gervain et al., 2008), we also seek to explore the life span continuity of this learning strategy. The inclusion of Italian and Japanese participants will allow us to compare their performance with the Italian and Japanese infants tested in the previous study.

Below, we first provide a brief description of the relevant typological differences between Basque, Japanese, Italian, and French, highlighting the fact that Basque and Japanese are typically characterized as functor-final, whereas French and Italian are functor-initial languages. Afterwards, we discuss the anchoring hypothesis in detail, reviewing the relevant behavioral studies, and whether such a language learning mechanism might be employed across the life span. We then report a corpus study showing that functors do indeed act as utterance-final anchors in Basque and Japanese and utterance-initial in Italian and French in actual language use. This is followed by an artificial grammar study, which shows that adult native speakers use functors as entry points to the structure of an unknown language and they do so using the word orders and frequency distributions characteristic of their native language. When comparing the performance of the adults in our study and that of the infants in the previous study (Gervain et al., 2008), we find both important similarities and differences. The discussion of their theoretical implications closes the paper.

Functors and Content Words in the World's Languages

The distinction between grammatical functors and semantically loaded content words is universal (Fukui, 1986; Abney, 1987; Morgan et al., 1996). However, the world's languages show systematic differences in the way they use functors and position them with respect to content words. English, French, and Italian, for instance, use prepositions in front of nouns (e.g., English: on the table). By contrast, Japanese, Basque, and Turkish have postpositions [e.g., Japanese: Kobe ni, gloss: Kobe to, “to Kobe” (Kobe is a city in Japan)]. The relative order of functors and content words correlates with other word order phenomena (Greenberg, 1963; Dryer, 1992). Languages with prepositions typically use a Verb-Object (VO) basic word order (e.g., English: eat an apple), whereas postpositional languages are usually Object-Verb (OV) languages (e.g., Japanese: ringo-wo taberu, gloss: apple.acc eat, “eat an apple”). The order of constituents in other phrase types (e.g., embedded clauses, possessives, etc.) correlates with basic word order (Greenberg, 1963; Dryer, 1992). Determining the relative order of functors and content words might thus be a powerful cue to a large number of syntactic structures in a language.

Another important typological difference between languages is whether they realize functors as bound or free morphemes. Bound morphemes are morphologically dependent on, i.e., attached to, other words (e.g., read-ing), whereas free morphemes are independent words (e.g., to read). Functors that appear as free morphemes in VO languages often surface as bound morphemes in OV languages (e.g., prepositions vs. suffixes, respectively; English: on the table vs. Basque: mahai-a-n, gloss: table-the-on). In fact, a systematic relationship exists between word order, on the one hand, and the bound vs. free nature and the position of functors, on the other (Greenberg, 1963; Dryer, 1992). When functors are realized as bound morphemes, there is a general tendency in the world's languages to realize them as suffixes, not prefixes. However, this general preference for suffixes is modulated by word order: suffixation is predominant in OV languages, but it is also common in VO languages, whereas prefixes are rare in OV languages and exclusive prefixation only exists in VO languages, never in OV languages.

Since word order varies across languages, young infants face the task of having to learn this morphosyntactic property from the speech input when acquiring their native language. Babies seem to accomplish this at an early age. In fact, their first multiword utterances (toward the end of the second year of life) follow the basic word order of the target language (Brown, 1973). Does the anchoring function of grammatical markers play a role?

Recently, Gervain et al. (2008) reported that infants as young as 8 months of age are able to form a rudimentary representation of word order on the basis of word frequencies. Given the correlation of the relative position of functors and content words with other word order phenomena within a language (Dryer, 1992), keeping track of the relative order of frequent and infrequent words, i.e., functors and content words, might provide infants with a heuristic cue to a rudimentary representation of basic word order. This is exactly what Gervain et al. (2008) found. Using an artificial grammar learning task, they showed that 8-month-old Italian and Japanese infants had opposite expectations about the relative order of frequent and infrequent words, mirroring the opposite word orders of their native languages, well before they start to talk¹. Italian infants preferred the test items with frequent-infrequent (FI) order, while Japanese infants looked longer at the infrequent-frequent (IF) order. This suggests that the distribution of functors might indeed contribute to language learning.

Functors Contribute to the Learnability of Language

Braine (1966) was one of the first to study how frequent or constant marker elements influence grammar learning. Linear order is a fundamental aspect of natural languages. However, typically what is important in grammar is not the exact ordinal position of a word in a sequence, but the position of constituents with respect to each other (Chomsky, 1957). For instance, in English wh-questions, the auxiliary follows the wh-phrase irrespective of the length of that phrase. Consequently, it may occupy different ordinal positions within different sentences, but always the same position with respect to the wh-phrase (e.g., [Where] are they? [How many] are they?, but not *How are many they?). Therefore, it is important to know what mechanisms enable humans to learn languages on the basis of information about underlying (non-adjacent) dependencies rather than ordinal position. Braine (1966) tested this in 9–10-year-old children, giving them artificial grammar learning tasks in which success depended on learning the positions of non-frequent variable tokens (“content words”) with respect to constant marker elements (“function words”). The positions to be learnt could be immediately adjacent to or one position removed from the marker element (as in the structure fPQ, where f is a marker, and P and Q are content words; a natural language example would be: the blue car). The results suggested that participants readily learned both relative positions. This, as the author pointed out, was a necessary prerequisite for natural language acquisition. Indeed, Braine (1963a) observed that functors often play the role of “pivot” during young infants' two-word stage production. These utterances often contain a closed-class word, which is productively combined with an open-class word (my daddy, my mommy, my milk …). The early appearance of some functors in language production points to their role in the acquisition of grammatical structure².

Green (1979) investigated the importance of the reliability of functors as markers. In a first experiment, he presented three different grammars to three groups of adult participants. The first group saw well-formed strings from a grammar containing functional markers and content words, which co-occurred in a systematic way (“effective markers”). The second group was familiarized with a grammar having markers and content words, which co-occurred randomly (“useless markers”). The third group was presented with a grammar having no markers (“no markers”). The author found that there was some learning in all three conditions, but learners of “effectively marked” grammars performed significantly better than participants in the other two conditions. Green (1979) summarized these findings in the “marker hypothesis,” which has the following three tenets. First, in all learnable languages, there is a small set of words or morphemes (i.e., function words), the “markers,” each of which is associated with one or, at most, a few syntactic constructions/categories. Second, sentences are easier to parse when they contain markers. Third, a language without markers would be very difficult or impossible to parse and, hence, to learn³.

Morgan et al. (1987) conducted similar experiments, comparing learning in artificial grammars that had (i) no markers, (ii) inconsistent markers, or (iii) consistent markers. They focused mainly on how, if at all, markers help learners discover the hierarchical phrase structure of the input. Importantly, they tested free and bound functors, i.e., function words and grammatical suffixes, separately. In the experiment that tested free function words, three grammars were used. One (“no markers”) contained only content words, and no function words. A second (“inconsistent markers”) used both function words and content words, but function words appeared randomly between content words. A third (“consistent markers”) had both function words and content words, in such a way that function words indicated phrase boundaries. Apart from the functors, all three grammars were generated by the same phrase structure rules. Participants learned the linear order and sequential co-occurrence patterns of words in all conditions. However, those in the consistent markers condition performed better than the others. Moreover, only they succeeded in a subsequent constituency test. The second experiment tested the same three grammars but using bound morphemes instead of free markers. The results were similar to the ones obtained before. Morgan et al. (1987) concluded that markers, both free and bound, provided efficient cues to hierarchical phrase structure.

Some of the above studies tested children; others tested adults. Can we consider language learning in a lifespan perspective?

The Continuity of Grammar Learning Mechanisms Across Life

The question whether linguistic abilities are best characterized by continuity or discontinuity across life has been of considerable theoretical importance for language acquisition research and linguistic theory (e.g., Weissenborn et al., 1992; Guasti, 2002). The related issue of a critical or sensitive period for language learning and its implications for neural plasticity have also attracted much attention (for a recent review, see Fava et al., 2011).

Younger learners clearly outperform older ones in several domains of language learning, especially those related to the sound patterns and morphosyntactic regularities of language, while vocabulary is less challenging to learn even at an older age. It is well-established that native phonological perception and production is very difficult to achieve if the first exposure to a language occurs after early childhood (Dupoux et al., 1997; Pallier et al., 1997; Sebastian-Galles and Soto-Faraco, 1999; Best and McRoberts, 2003). In morphosyntax, adults' disadvantage is somewhat less marked, although fully native-like proficiency is still hard to achieve (Johnson and Newport, 1989; Long, 1990). However, even in the case of comparable performance, infant and adult populations might not rely on the same learning mechanisms or linguistic competence to achieve a similar performance. Specifically, in the domain of morphosyntactic acquisition, the aspect of language relevant for our study, a series of studies by Newport and colleagues (Johnson and Newport, 1989; Newport, 1990; Goldowsky and Newport, 1993; Hudson Kam and Newport, 2005; Wonnacott et al., 2008; Hudson Kam and Newport, 2009) suggests that younger and older learners might use two basic learning mechanisms, memory-based/statistical learning and rule extraction, differently at least under some conditions. Captured by the “less is more” hypothesis, Newport (Newport, 1990; Goldowsky and Newport, 1993) argues that infants tend to rely more heavily on rule extraction and regularization, because their memory capacity, more limited than that of adults, prevents them from memorizing large data sets item by item. Given limited memory, the most efficient way to encode and learn a dataset is to extract regularities. Adults, not limited by memory constraints in the same way, might rely more on memorizing, item-based or statistical learning instead. 
Indeed, late learners of a language have been observed to memorize entire unanalyzed chunks or sequences, which young learners tend to decompose instead (e.g., for American Sign Language; Newport, 1984). Similarly, experiments testing how faithfully younger and older learners encode morphosyntactic properties that appear probabilistically or inconsistently in the input found that children were more likely to regularize inconsistencies, while adults only did so for the most inconsistent features (Hudson Kam and Newport, 2009). Taken together, these results suggest that both statistical learning and rule extraction are available to young and adult learners alike. However, the two age groups might employ these mechanisms somewhat differently, predicting both similarities and differences in how morphosyntax is learned at different ages.

The Current Study

In the present study, we seek to extend the existing research on the role of frequent functors in parsing and learning new linguistic material. Gervain et al.'s (2008) study tested one language per word order type. It remains to be determined whether the specificities of the two languages suffice to account for the results or whether they are generalizable to typologically similar languages. The first hypothesis we test in this study is that the frequency-based strategy will generalize to other languages. In addition, it remains unexplored whether adult speakers rely on the position of frequent words in their native language when segmenting novel linguistic input. The second hypothesis examined here is that adults might also use frequent words as anchor points when parsing new material, although they might rely on somewhat different mechanisms than infants.

The present paper thus examines these hypotheses in adult speakers of four languages representing the above discussed typological variants. Japanese, Basque, Italian, and French were selected. French is a functor-initial V(erb)–O(bject), Preposition–Noun language (e.g., manger une pomme, gloss: eat an apple, “eat an apple”; sur la table, gloss: on the table, “on the table”), similar to Italian in most morphosyntactic properties. By contrast, Basque, like Japanese, is a functor-final OV, Noun-Postposition language (e.g., sagarra jan, gloss: apple eat, “eat an apple”; mahai-a-n, gloss: table-the-on, “on the table”). However, unlike Japanese, Basque has postnominal determiners (e.g., gizon-a, gloss: man-the, “the man”; lehendakari-a, gloss: president-the, “the president”). This makes Basque even more consistently functor-final than Japanese, as most bound morpheme functors occupy final positions in syntactic phrases (Hualde and Ortiz de Urbina, 2003; de Rijk and de Coene, 2008). For free functors, no marked difference exists between the two languages.

While these linguistic descriptions hold at the grammatical level, from a learnability point of view it is important to show that they manifest themselves in a statistically reliable fashion in actual language use, serving as input for language learning. Therefore, our first aim is to show that the most frequent elements are indeed functors in all four languages and that they occupy phrase-initial positions in French and Italian, and phrase-final positions in Basque and Japanese.

Study 1: A Corpus Study

In this analysis, we investigate whether the sequential position of functors in actual French, Italian, Basque, and Japanese corpora follows the distributions predicted by the anchoring hypothesis. Since functors typically constitute at least the 20–30 most frequent morphemes in a language, we operationalized this question by calculating the proportion of frequent item final and frequent item initial “phrases” in two privileged positions within utterances, i.e., their beginnings and ends. Utterance boundaries were chosen because they are identifiable via perceptual, i.e., non-grammatical, non-structural, cues (Aslin et al., 1996). Examining phrases within utterances would have been circular, as the anchoring hypothesis provides a bootstrapping strategy precisely to break into and bracket the internal syntactic structure of utterances. Utterance boundaries, by contrast, can be identified without any grammatical knowledge through prosodic and phonological cues.

Materials and Methods

Corpora

We used corpora from four languages. In Japanese, Italian, and French, transcriptions of actual speech directed to infants and young children were available. For Basque, currently no such database exists. We, therefore, used a variety of written sources, mainly texts extracted from newspapers as well as children's books.

The four languages we examined have different orthographic traditions. In Japanese, for example, postpositions are written as separate words, while in Basque, they are attached to the noun. To eliminate such differences, agglutinative affixes (i.e., elements of inflectional morphology attaching to the word stem, e.g., Basque mendi-tik, gloss: mountain from) were encoded as separate morphemes⁴. Thus, the corpora were tagged and segmented into morphemes. This allowed us to take both free and bound functors into account, better reflecting adult speakers' full-fledged (but, of course, implicit) knowledge of their native grammar. The corpora were phonologically transcribed to provide a more realistic signal using phonotypical pronunciations in a semi-automated manner (Roach et al., 1996).

For each language, we collected 10 subcorpora from independent sources, i.e., different speakers or different texts, in order to have independent data points for each language, allowing statistical analysis. Each subcorpus in each language comprised 500 utterances, for a total of 5000 utterances per language. We relied on the original corpora for the definition of utterances. We counted as a single utterance whatever was transcribed as such in the original corpora. Equating the number of utterances per subcorpus and per language allowed us to better compare languages and to control for potential biases resulting from sample size and data sparsity, which might affect linguistic variability and frequency counts, at least for medium or low frequency linguistic features (Biber and Finegan, 1991; De Haan, 1992).

For Basque, we used 10 randomly chosen extracts from written sources (two newspapers and eight books) courtesy of The Basque Language Institute (http://www.ei.ehu.es/). Each subcorpus was 500 sentences long, thus this corpus comprised 5000 utterances. For Japanese, we made use of the corpus of infant-directed Japanese collected at the Laboratory of Language Development, Brain Science Institute, RIKEN (Mazuka et al., 2006). For our purposes, we extracted 500-utterance-long samples from 10 mothers' utterances addressed to their infants during free play or directed story-telling (using specific story books), but we excluded their conversations with adults, e.g., the experimenter. Our full corpus thus comprises 5000 utterances. For Italian, we used 500-utterance-long samples from the utterances of 10 adults from four Italian language subcorpora (Antelmi, 2004; Antinucci and Parisi, 1973; Cipriani et al., 1989; Tonelli et al., 1995; Salerni et al., 2007) of the CHILDES database (MacWhinney, 2000). The full corpus comprises 5000 utterances. For French, we used 500-utterance-long samples of the speech of 10 adults from the Paris subcorpus (Morgenstern and Parisse, 2007; Leroy et al., 2009; Morgenstern and Sekali, 2009) of the CHILDES database. The resulting full corpus consists of 5000 utterances.

Measures

We used the multiword utterances of the corpora to calculate how often frequent and infrequent items appear at phrase-initial and phrase-final positions at utterance boundaries. Single word utterances were discarded, as they are uninformative with respect to word order⁵. Frequent and infrequent items were defined as having a relative frequency of occurrence higher and lower, respectively, than a predefined threshold T = 0.01⁶. (Relative frequency of occurrence is the absolute frequency of occurrence of a given item normalized by the size of the corpus, allowing comparisons across corpora of different sizes.) In this frequency range, most items are closed-class morphemes in all four languages (e.g., Basque: -ko locative suffix, du “has.3sg.transitive”; Japanese: chan diminutive honorific to address children, -wa topic marker; Italian: il masculine definite article, che “what”; French: est “is,” pas negative particle). All other morphemes in the corpora were categorized as infrequent.
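The frequency criterion can be illustrated with a minimal Python sketch. It is not the procedure used to process the corpora: the function name classify_morphemes, the toy corpus, and the choice to normalize by the total number of morpheme tokens are assumptions made for illustration only. Items whose relative frequency exceeds T are labeled F (frequent); all others are labeled I (infrequent).

```python
from collections import Counter

THRESHOLD = 0.01  # T from the text


def classify_morphemes(utterances, threshold=THRESHOLD):
    """Label each morpheme 'F' (frequent) or 'I' (infrequent).

    `utterances` is a list of utterances, each a list of morphemes.
    Assumption: "size of the corpus" means the total number of
    morpheme tokens; items at or below the threshold count as 'I'.
    """
    counts = Counter(m for utt in utterances for m in utt)
    total = sum(counts.values())
    return {
        m: ("F" if c / total > threshold else "I")
        for m, c in counts.items()
    }


# Hypothetical toy corpus: one recurring item plus 150 items seen once.
toy_corpus = [["the", f"w{i}"] for i in range(150)]
toy_labels = classify_morphemes(toy_corpus)
print(toy_labels["the"], toy_labels["w0"])
```

In this toy corpus, "the" accounts for half of all tokens and falls in the frequent category, whereas each item seen once has a relative frequency of 1/300 and falls in the infrequent category.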

Using the frequent and infrequent categories as defined above, we calculated the percentages of the different possible orders at the boundaries of multiword utterances. We obtained these measures in the following way. We identified the first and the last two items of utterances, i.e., bimorphemic “phrases” at the left and right utterance boundaries. If the “phrase” had a [frequent item – infrequent item] order, it was counted as FI. Examples of FI phrases include è rosso (gloss: is red, “(it) is red”) and all' asilo (gloss: in-the daycare, “in the daycare”) [Italian, presented orthographically for ease of exposition]. If it had an [infrequent item – frequent item] structure, it was counted as IF. Examples of IF phrases include zoritxar haren (gloss: misfortune his, “his bad luck”) [Basque]. “Phrases” where both morphemes were of the same category, i.e., [frequent item – frequent item] or [infrequent item – infrequent item], did not enter into the counts⁷.
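The counting procedure can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the script used for the analysis: the function name boundary_orders is hypothetical, the F/I labels are assigned by hand for the toy items, and a two-morpheme utterance is assumed to contribute the same pair at both boundaries.

```python
def boundary_orders(utterances, labels):
    """Count FI and IF two-item 'phrases' at both utterance boundaries.

    `utterances`: list of utterances, each a list of morphemes.
    `labels`: dict mapping each morpheme to 'F' or 'I'.
    Returns the order counts and the number of multiword utterances.
    """
    counts = {"FI": 0, "IF": 0}
    n_multiword = 0
    for utt in utterances:
        if len(utt) < 2:
            continue  # single-item utterances are discarded
        n_multiword += 1
        # Left boundary: first two items; right boundary: last two items.
        for first, second in (utt[:2], utt[-2:]):
            order = labels[first] + labels[second]
            if order in counts:  # FF and II pairs are not counted
                counts[order] += 1
    return counts, n_multiword


# Hand-assigned labels for a toy example (hypothetical, for illustration).
labels = {"il": "F", "gatto": "I", "è": "F", "rosso": "I",
          "zoritxar": "I", "haren": "F", "ciao": "I"}
utterances = [["il", "gatto", "è", "rosso"],
              ["zoritxar", "haren"],
              ["ciao"]]
counts, n = boundary_orders(utterances, labels)
print(counts, n)
```

Because each multiword utterance contributes one pair per boundary, the FI and IF counts divided by the number of multiword utterances can together reach 200%.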

For statistical purposes, we calculated the proportion of FI and IF utterances in each subcorpus in the above defined way, and conducted analyses of variance over the obtained datasets. We expect to find opposite word orders in Japanese and Basque, on the one hand, and Italian and French, on the other. We further predict a possible difference between the two OV languages, as Basque has more functor-final phrases than Japanese. No such difference is expected between Italian and French.
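As an illustration of how per-subcorpus values can be summarized before entering such an analysis, the following Python sketch converts counts into percentages and computes a mean with its standard error across subcorpora. The function names and the numbers are hypothetical, and taking the number of multiword utterances as the denominator is an assumption; the sketch does not reproduce the reported data.

```python
from math import sqrt
from statistics import mean, stdev


def percent_per_subcorpus(order_counts, n_utterances):
    """Percentage of a given order (FI or IF) in each subcorpus.

    Assumption: the denominator is the number of multiword
    utterances in the corresponding subcorpus.
    """
    return [100.0 * c / n for c, n in zip(order_counts, n_utterances)]


def mean_and_sem(values):
    """Mean and standard error of the mean across subcorpora."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical IF counts for three subcorpora of 400 multiword utterances.
if_counts = [300, 320, 280]
n_multiword = [400, 400, 400]
if_percent = percent_per_subcorpus(if_counts, n_multiword)
m, sem = mean_and_sem(if_percent)
print(if_percent, m, round(sem, 2))
```

Each subcorpus then contributes one independent data point per order, which is the unit over which the analyses of variance are computed.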

Results

Figure 1 presents the percentages of FI and IF utterances in the four languages. As expected, the OV languages (Japanese: 38.4% IF, 20.36% FI and Basque: 66.86% IF, 20.36% FI) and the VO languages (Italian: 39.64% IF, 60.16% FI, and French: 39.06% IF, 57.54% FI) show opposite patterns, the former having more IF utterances, the latter more FI ones. All the statistical analyses reported below were also carried out with the additional factor Position (sentence-initial/sentence-final), but as the factor had no significant main effect, nor did it enter into significant interactions in any of the analyses, data from the initial and final positions were pooled.


Figure 1. The results of the corpus study. The x-axis shows the four languages. The y-axis represents the percentage of FI (light gray bars) and IF (dark gray bars) phrases at the boundaries of multiword utterances in the four corpora. Note that the maximum value is 200%, as each utterance contributes two data points (one for the beginning, one for the end). Error bars represent the standard error of the mean.

We carried out a first ANOVA with factors Language Type (OV/VO) and Order (FI/IF) using the proportion of FI and IF “phrases” in the 10 subcorpora per language as the dependent measure. We obtained a significant main effect of Order [F(1, 38) = 6.21, p = 0.017], as more IF than FI phrases were identified in the four languages. We also observed a significant main effect of Language Type [F(1, 38) = 4.52, p = 0.040], because more phrases were identified in the VO than in the OV languages overall. Importantly, the interaction of Language Type and Order was very highly significant [F(1, 38) = 60.56, p < 0.0001], because IF phrases abounded in OV, FI phrases in VO languages. In Scheffé post-hoc tests, the difference between the number of IF and FI phrases was significant in both OV (p < 0.0001) and VO languages (p < 0.001).

In a second ANOVA with factors Language (Basque/Japanese/French/Italian) and Order (FI/IF), we found a main effect of Order [F(1, 36) = 6.06, p = 0.018] due to the greater number of IF phrases in the overall dataset, just as in the first ANOVA. Again, the interaction between Language and Order was significant [F(1, 36) = 20.07, p < 0.0001]. In Scheffé post-hoc tests, the difference between the number of IF and FI items was significant in Basque (p < 0.0001) and in Japanese (p < 0.001), marginal in Italian (p = 0.07), and not significant in French. Further, although the number of IF phrases was numerically greater in Basque than in Japanese, this difference did not reach significance.

Discussion

The analyses confirmed the general prediction that frequent items occupy sequence-initial positions in VO languages and sequence-final positions in OV languages. This distribution makes frequent items reliable cues to word order across typologically different languages. Through their distribution and sequential position, they can provide potential break-in points or anchors for the syntactic bracketing of sentential structure. This general pattern is modulated by language-specific properties. In the two OV languages, this pattern is stronger than in the VO languages. The reason for this difference between the OV and VO languages we tested is that while Japanese and Basque are functor-final both at the level of syntax (OV) and morphology (heavily suffixing), French and Italian are strongly functor-initial syntactically (VO), but have both suffixes and prefixes in their morphology.

In addition, we also observed an overall advantage for IF over FI items in all the languages. This effect was carried partly by the outstandingly high percentage of IF items in Basque, partly by the fact that the two VO languages had more IF items than the OV languages had FI items. The high percentage of IF sequences in Basque derives from the overt realization of the definite article phrase-finally and the generally strong functor-final nature of the language. (Japanese has no overt definite article.) Further, in the light of the discussion above, French and Italian are syntactically functor-initial, but somewhat more mixed morphologically, hence the non-negligible presence of IF phrases in these languages. More generally, this might reflect a universal preference for suffixation (Greenberg, 1957; Hawkins and Gilligan, 1988; Dryer, 1992). Languages with OV order almost exclusively use suffixation, i.e., word- and thus phrase-final morphological functors. However, many VO languages make use of suffixes in addition to prefixes. French and Italian are two cases in point.

As a caveat, we need to point out that the nature of the corpora differed somewhat between the languages. While most of the material in all four languages was derived from infant- and child-directed sources, the Basque corpus also contained some adult-directed material and was derived from written rather than spoken sources—due to the unavailability of infant- or child-directed spoken corpora in Basque. These differences might alter the results somewhat, although we do not expect the overall pattern of results to change considerably.

The corpus analysis has shown that the distribution and sequential position of the most frequent elements correlate with basic word order in the linguistic signal. As a second step in examining the cross-linguistic role of frequent items as predicted by the anchoring hypothesis, we now ask whether adult speakers of the above tested four languages are sensitive to the language-specific distributions of frequent items and whether they use this knowledge when parsing novel linguistic material.

Study 2: An Artificial Grammar Learning Experiment

We tested the word order preferences of adult native speakers of Basque, Japanese, Italian, and French using an artificial grammar learning paradigm. If Basque and Japanese speakers have different preferences for the relative order of frequent and infrequent items than Italian and French speakers, this would provide cross-linguistic evidence for the anchoring or frequency-based learning mechanism and suggest that it is operational throughout life.