Plans for The Corpus of Spoken Israeli Hebrew (CoSIH) started to take shape in 1998. CoSIH aimed at compiling a large database of recordings of spoken Israeli Hebrew in order to facilitate research in a range of disciplines. A corpus is a preliminary desideratum for larger projects that cannot otherwise be accomplished. The research potential of such a corpus is extremely large, including, inter alia, applications in the following areas: general and theoretical linguistics, Hebrew language and linguistics, applied linguistics, language engineering, education, and cultural and sociological studies.
CoSIH was designed to include a representative sample of both demographically and contextually defined varieties. According to the model, CoSIH was to consist of a thousand sets of recordings (“cells”) of 5,000 words each, i.e., a corpus of five million words. We have taken a culture-dependent approach to the compilation of CoSIH. CoSIH aspires to bridge the gap between the infinite number of varieties used by the Israeli Hebrew speech community and their representation in the corpus by characterizing their diversity in both demographic and contextual terms. CoSIH seems to be the first and only attempt to establish a representative corpus along the axes of both demographic and contextual variables, based on statistical and analytic criteria.
The informants recorded for CoSIH would be selected by random sampling of the Israeli population, so as to reflect the social structure of the Israeli Hebrew speech community. The corpus would be segmented for analytic purposes using well-defined criteria, although all sociolinguistic data on the recorded informants would be made available to CoSIH’s end users. The working hypothesis of CoSIH is based on the demographic criteria that seem most significant for representing linguistic diversity in Israel: (1) place of birth, familial land of origin, ethnic group or religion; (2) age; (3) education; and (4) sex.
For the analysis of the contextual variables of each discourse, CoSIH’s working hypothesis is based on five variables. There are three primary variables: interpersonal relationships, discourse structure and discourse topic; and two secondary variables: number of participants and medium (i.e., face-to-face or telephone conversation).
A comprehensive study of the demographic and circumstantial variables in Hebrew discourse in Israel remains a desideratum. Therefore, in order to design a proper model for CoSIH, the corpus would be set up in phases, during which a research program would be undertaken to verify the working hypothesis suggested above.
This model was first published online, in both Hebrew and English. The English version eventually found its place in Hary & Izre’el 2003. A more sophisticated model has been published in English in Izre’el, Hary & Rahav 2001.
CoSIH was initiated, designed and operated by a team of Israeli and international scholars:
Core team: Shlomo Izre’el, Tel Aviv University (director); Benjamin Hary, Emory University (principal investigator); John Du Bois, University of California at Santa Barbara (corpus analyst); Mira Ariel, Tel Aviv University (discourse analysis and pragmatics); Giora Rahav, Tel Aviv University (statistics and sociology). Esther Borochovsky-Bar Aba, Tel Aviv University (syntax) joined the team at a later stage.
Advisory board: Eliezer Ben-Rafael, Tel Aviv University (sociolinguistics – sociological aspects); Yaakov Bentolila, Ben Gurion University (sociolinguistics – linguistic aspects); Otto Jastrow, Universität Erlangen-Nürnberg (transcription, phonology, dialectology); Shmuel Bolozky, University of Massachusetts at Amherst (phonology, morphology); Geoffrey Khan, Cambridge University (syntax); Elana Shohamy, Tel Aviv University (language education).
The Present State of CoSIH
As of 2012, this ambitious project still awaits its realization. The limited financial support at our disposal enabled us to compile two sets of recordings: the first was made during the initial preparatory phase, and the second as a pilot study. The initial preparatory phase produced 11 recordings spanning at least 6 hours each, with some being much longer. Although we initially designed a pilot of 20 sets of 3-hour recordings, we eventually ended up with 42 sets, each including between 8 and 16 hours of uninterrupted recording of everyday speech. Taken together, we now possess recordings of 6 to 18 hours each from 53 volunteers, which we believe to be a reasonable source of data for the study of Spoken Hebrew. The recordings, all made between August 2000 and October 2002, are real-life conversations of CoSIH’s informants. As such, they naturally include the speech of both the volunteers who recorded them and their interlocutors.