When reading Table 1, it is important to keep in mind that many corpora were designed to be representative of the language produced in Chinese-speaking regions, and not necessarily of the language that Chinese-speaking people hear and read daily. In addition, some of these sources are protected by copyright. A main problem with Chinese word frequencies is that Chinese words are not separated by spaces in writing. Because words can consist of one to four or even more characters, the corpus must first be segmented into words, which is labor-intensive, before any information beyond single-character frequencies can be obtained. This situation is currently changing, owing to the availability of automatic parsers and part-of-speech taggers, as we will see below.
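To make the segmentation problem concrete, the following minimal sketch contrasts character frequencies, which can be counted directly, with word frequencies, which require a segmentation step first. It is an illustration only: the lexicon, the two-sentence corpus, and the function name `segment` are hypothetical, and the greedy forward maximum matching used here is a simple stand-in, not the parsers or part-of-speech taggers referred to in the text.

```python
from collections import Counter

# Hypothetical toy lexicon (assumption for illustration only); a real
# segmenter would rely on a large dictionary or a statistical model.
LEXICON = {"我们", "喜欢", "学习", "中文", "我", "学"}
MAX_WORD_LEN = 4  # assumed upper bound on word length for this sketch


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching.

    At each position, take the longest substring found in the lexicon;
    if none matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical two-sentence corpus without any word boundaries.
corpus = ["我们喜欢学习中文", "我喜欢中文"]

# Character frequencies need no segmentation at all.
char_freq = Counter(ch for line in corpus for ch in line)

# Word frequencies depend entirely on the segmentation step.
word_freq = Counter(w for line in corpus for w in segment(line))

print(char_freq.most_common(3))
print(word_freq.most_common(3))
```

Under these assumptions, the word counts are only as good as the lexicon and the matching rule: a different lexicon or a different segmentation strategy would yield different word frequencies from the same text, whereas the character counts would stay the same.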