
The idea that language can be learned and processed based on probabilistic information has been the subject of much criticism. Principal among these criticisms is the issue of data-sparsity: it is claimed that the amount of data required to reliably estimate the parameters of any useful probabilistic model far outstrips the amount of language exposure any person could reasonably receive, not just in childhood, but in an entire lifetime. The most acute form of this objection involves unseen events: because language is such a complex, productive system, many legitimate linguistic constructions will not occur in any sample of language to which a learner is exposed. Simple probabilistic models based on the principle of maximum likelihood will assign these events a probability of 0, incorrectly implying that they are (probabilistically speaking) impossible. In this paper we review the standard arguments regarding data-sparsity and the various methods of probability smoothing that have been proposed in the field of natural language processing to address them. We then propose a similarity-based model for estimating bigram probabilities in language, based on psychological principles of similarity-based generalization. In a series of three computational simulations we show that this method is capable of learning bigram probabilities more efficiently than other methods, and thus results in considerably improved language-modeling performance. We argue that the method we propose, as well as being extremely competitive in engineering terms, also provides a powerful way of ameliorating the problems of data-sparsity, and goes some way to showing how the influential criticisms that have been leveled at probabilistic models of human language learning might be answered in a manner consistent with basic psychological principles.
We argue that these results, in conjunction with the ubiquity of similarity-based mechanisms in cognition, support the hypothesis that human language learners could in fact exploit similarity-based information when acquiring linguistic structure, and that there is no in-principle reason to believe that language is not learned and processed through the exploitation of probabilistic information.
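
To make the unseen-event problem described above concrete, the following is a minimal sketch of a maximum-likelihood bigram estimator. It is an illustration only, not the model proposed in this paper: the function name `bigram_mle` and the toy corpus are hypothetical, and returning 0 for a context word that was never observed is a simplifying assumption.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return a function giving the maximum-likelihood estimate of P(w2 | w1).

    The estimate is count(w1, w2) / count(w1 as a context word).
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # The final token never acts as a context, so it is excluded here.
    context_counts = Counter(tokens[:-1])

    def prob(w1, w2):
        if context_counts[w1] == 0:
            # Assumption for this sketch: an unobserved context yields 0.
            return 0.0
        return bigram_counts[(w1, w2)] / context_counts[w1]

    return prob


# Toy corpus (hypothetical, for illustration only).
corpus = "the dog barked . the cat slept . the dog slept .".split()
p = bigram_mle(corpus)

seen = p("dog", "barked")    # bigram observed in the corpus
unseen = p("cat", "barked")  # legitimate English, but never observed
```

Here "cat barked" is a perfectly acceptable word sequence, yet the maximum-likelihood estimate treats it as impossible simply because it is absent from the sample. Smoothing methods, and the similarity-based approach considered in this paper, are ways of moving some probability mass to such unseen events.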