Corpus linguistics meets historical linguistics and construction grammar: how far have we come, and where do we go from here?
This paper aims to give an overview of corpus-based research that investigates processes of language change from the theoretical perspective of Construction Grammar. Starting in the early 2000s, a dynamic community of researchers has come together in order to contribute to this effort. Among the different lines of work that have characterized this enterprise, this paper discusses the respective roles of qualitative approaches, diachronic collostructional analysis, multivariate techniques, distributional semantic models, and analyses of network structure. The paper tries to contextualize these approaches and to offer pointers for future research.
Large-scale patterns of number use in spoken and written English
This paper describes patterns of number use in spoken and written English and the main factors that contribute to these patterns. We analysed more than 1.7 million occurrences of numbers between 0 and a billion in the British National Corpus, including conversational speech, presentational speech (e.g., lectures, interviews), imaginative writing (e.g., fiction), and informative writing (e.g., academic books). We find that four main factors affect number frequency: (1) Magnitude - smaller numbers are more frequent than larger numbers; (2) Roundness - round numbers are more frequent than unround numbers of a comparable magnitude, and some round numbers are more frequent than others; (3) Cultural salience - culturally salient numbers (e.g., recent years) are more frequent than non-salient numbers; and (4) Register - more informational texts contain more numbers (in writing), more types of numbers, more decimals, and larger numbers than less informational texts. In writing, we find that the numbers 1-9 are mostly represented by number words (e.g., 'three'), 10-999,999 are mostly represented by numerals (e.g., '14'), and 1 million-1 billion are mostly represented by a mix of numerals and number words (e.g., '8 million'). Altogether, this study builds a detailed profile of number use in spoken and written English.
Investigating genre distinctions through discourse distance and discourse network
The notion of genre has been widely explored using quantitative methods from both lexical and syntactic perspectives. However, discourse structure has rarely been used to examine genre. Because it is mostly concerned with the interrelation of discourse units, discourse structure can play a crucial role in genre analysis, yet few quantitative studies have explored genre distinctions from this perspective. Here, we use two English discourse corpora (RST-DT and GUM) to investigate discourse structure from a novel viewpoint. The RST-DT is divided into four small subcorpora distinguished according to genre, and another corpus (GUM), containing seven genres, is used for cross-verification. Each RST (rhetorical structure theory) tree is converted into a dependency representation using information from the RST annotations, and discourse distance is calculated through a process similar to that used to calculate syntactic dependency distance. Moreover, the dependency representations derived from the two corpora are readily convertible into network data. We then examine different genres in the two corpora by combining discourse distance and discourse network. The two methods complement each other in revealing the distinctiveness of various genres. Accordingly, we propose an effective quantitative method for assessing genre differences using discourse distance and discourse network. This quantitative study can help us better understand the nature of genre.
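The distance and network steps described above can be illustrated with a minimal sketch. It assumes that an RST tree has already been converted into a list of head positions, one per discourse unit, with 0 marking the root; that conversion is not shown, and the function names, the input format, and the toy data are hypothetical rather than taken from the paper. Distance is computed as the mean absolute difference between the linear positions of each unit and its head, by analogy with syntactic dependency distance, and the same links are then counted as undirected network edges.

```python
from collections import Counter

def mean_discourse_distance(heads):
    """Mean absolute positional distance between each unit and its head.

    heads[i] is the 1-based position of the head of unit i+1;
    0 marks the root, which has no head and is skipped (assumed format).
    """
    distances = [abs(dep - head)
                 for dep, head in enumerate(heads, start=1)
                 if head != 0]
    return sum(distances) / len(distances) if distances else 0.0

def node_degrees(heads):
    """Treat each head-dependent link as an undirected edge and count degrees."""
    degrees = Counter()
    for dep, head in enumerate(heads, start=1):
        if head != 0:
            degrees[dep] += 1
            degrees[head] += 1
    return dict(degrees)

# Toy document with five discourse units (illustrative data only).
example_heads = [0, 1, 2, 2, 1]
print(mean_discourse_distance(example_heads))
print(node_degrees(example_heads))
```

Comparing genres would then amount to computing these values per document and aggregating them by subcorpus; the network measures actually used in the study may differ from the simple degree count shown here.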