[bnc] Using the corpus

Can I analyse and search the BNC with corpus and text analysis software?

Yes. The corpus text files are distributed in XML, an open format that many software tools can process. You can also use scripts or write your own software to analyse the BNC. Note that some desktop tools may struggle with a corpus of this size.

What are the funny characters I see when I look at my BNC text?

These characters are XML markup. The markup in the BNC text files records information about the texts, the speakers, and much more. Some tools let you view the texts without the markup. This is an example of the XML:
<s n="12"><w type="TO0" lemma="to">To </w><w type="VVI" lemma="illustrate">illustrate </w><w type="AT0" lemma="the">the </w><w type="NN1" lemma="paradigm"> paradigm</w><c type="PUN"> , </c><w type="NN1" lemma="reference"> reference </w><w type="VBZ" lemma="be"> is </w><w type="VVN" lemma="make"> made </w><w type="PRP" lemma="to"> to </w>...
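
The sample sentence above can be read with any XML parser. The following is a minimal Python sketch, not an official BNC tool. It assumes the element and attribute names shown in the sample (`w`, `c`, `type`, `lemma`); the files you have may use different attribute names, so check them first. The fragment is shortened, and a closing `</s>` tag is added so that it parses on its own. The function names `tokens` and `plain_text` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample sentence. The closing </s> tag is an addition
# so that the fragment is well-formed XML by itself.
SAMPLE = (
    '<s n="12">'
    '<w type="TO0" lemma="to">To </w>'
    '<w type="VVI" lemma="illustrate">illustrate </w>'
    '<w type="AT0" lemma="the">the </w>'
    '<w type="NN1" lemma="paradigm">paradigm</w>'
    '<c type="PUN">, </c>'
    '</s>'
)


def tokens(sentence):
    """Return (text, tag, lemma) for every <w> and <c> element in a sentence."""
    result = []
    for el in sentence.iter():
        if el.tag in ("w", "c"):
            text = (el.text or "").strip()
            # Punctuation elements carry no lemma attribute, so lemma is None.
            result.append((text, el.get("type"), el.get("lemma")))
    return result


def plain_text(sentence):
    """Return the sentence text with all markup removed."""
    return "".join(sentence.itertext()).strip()


sentence = ET.fromstring(SAMPLE)
print(plain_text(sentence))
for text, tag, lemma in tokens(sentence):
    print(text, tag, lemma)
```

The first function keeps the annotation (part-of-speech code and lemma) alongside each word, while the second discards the markup and leaves only readable text.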

How can I use the BNC?

The BNC can be used in many ways:

look at frequency lists
use an online service, such as BNCWeb or the Brigham Young corpus interface
write your own software
use an XML-aware concordancer
use a concordancer that can handle text files
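
As an illustration of the "write your own software" option, the sketch below builds a small lemma frequency list in Python. It is a hedged example, not the method behind any published BNC frequency list. It assumes word elements named `w` with a `lemma` attribute inside `s` elements, as in the XML sample earlier on this page, and no XML namespace; the function name `lemma_frequencies` and the two demo sentences are invented for the example. The file is read as a stream, which helps with a corpus of this size.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def lemma_frequencies(source, tag="w", attr="lemma"):
    """Count attribute values on word elements while streaming through the XML.

    `source` may be a file path or a binary file object.
    """
    counts = Counter()
    for _event, el in ET.iterparse(source, events=("end",)):
        if el.tag == tag:
            value = el.get(attr)
            if value is not None:
                counts[value] += 1
        elif el.tag == "s":
            # The sentence is complete and its words are counted,
            # so drop its children to limit memory use.
            el.clear()
    return counts


# Invented demo data standing in for a corpus file.
demo = io.BytesIO(
    b'<text>'
    b'<s n="1"><w type="AT0" lemma="the">The </w>'
    b'<w type="NN1" lemma="corpus">corpus </w></s>'
    b'<s n="2"><w type="AT0" lemma="the">the </w>'
    b'<w type="NN1" lemma="text">text</w></s>'
    b'</text>'
)

freqs = lemma_frequencies(demo)
for lemma, count in freqs.most_common():
    print(lemma, count)
```

To run it on real data, pass the path of a corpus text file in place of `demo`, and adjust `tag` and `attr` if your files use other names.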

What is the difference between the BNC XML Edition, BNC World, BNC Sampler and BNC Baby corpora?

BNC XML Edition and BNC World are both versions of the whole British National Corpus, containing 100 million words. BNC XML Edition is the current version; BNC World is the former one. The BNC Sampler is a 2 million word subset of BNC World with equal proportions of written and spoken material. BNC Baby consists of four subsets of BNC World: one million words each of spoken conversation, written fiction, newspapers and academic prose.

The following versions are now available for download from the Oxford Text Archive:

BNC XML
BNC Baby
BNC Sampler

I would like to access the sound recordings of the spoken part of the corpus. Where are they stored, and can I get access to them?

Some of the recordings from which the spoken parts of the BNC were transcribed are now stored at the British Library Sound Archive. The AudioBNC project started in 2010 to align this sound data with the BNC transcripts. The corpus aligned with the digital audio can now be queried via an experimental BNCweb service at Lancaster University.

I am writing an article using data from the BNC. How should I cite it?

Please use this formula: "Data cited herein has been extracted from the British National Corpus Online service, managed by the University of Oxford on behalf of the BNC Consortium. All rights in the texts cited are reserved." More information is available on the Copyright page.