For each subtitle file, the timing information and any other information unrelated to the film contents (e.g., the names of the subtitle group, translator, proofreader, director, and actors) were removed. The files were then segmented and PoS tagged with the ICTCLAS software (Institute of Computing Technology, Chinese Lexical Analysis System [19]). We used ICTCLAS version 2009 Share, called through the Java Native Interface (JNI). Of the PoS tag sets available in ICTCLAS, we used the Peking University (PKU) PoS tag set [24], [25]. According to a previous study [26], this combination (PKU-ICTCLAS) achieves excellent performance in word segmentation. The result was a corpus of 33.5 million words (46.8 million characters).
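The cleaning and counting steps can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: it assumes SRT-style subtitle files, a hypothetical list of credit keywords, and segmenter output in the form of space-separated `word/tag` tokens in which PKU tags beginning with `w` mark punctuation. Excluding punctuation from the counts is likewise an assumption. The segmentation itself is not shown, since it depends on the ICTCLAS interface.

```python
import re

# Assumed SRT-style time-code line, e.g. "00:01:02,345 --> 00:01:04,000".
TIMECODE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Hypothetical credit keywords: translator, proofreader, subtitle group,
# director, cast. A real list would need to be much longer.
CREDITS = re.compile(r"(翻译|校对|字幕组|导演|主演)")


def clean_subtitle(text):
    """Return the dialogue lines of one subtitle file.

    Drops blank lines, cue numbers, time codes, and credit lines.
    Note: a dialogue line consisting only of digits is also dropped.
    """
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue
        if TIMECODE.match(line) or CREDITS.search(line):
            continue
        kept.append(line)
    return kept


def count_tagged(tagged):
    """Count words and characters in 'word/tag' segmenter output.

    Tokens whose tag starts with 'w' (punctuation in the PKU tag set)
    are skipped; this exclusion is an assumption of the sketch.
    """
    pairs = [t.rsplit("/", 1) for t in tagged.split() if "/" in t]
    words = [word for word, tag in pairs if not tag.startswith("w")]
    return len(words), sum(len(word) for word in words)
```

For example, `count_tagged("我/r 爱/v 电影/n 。/w")` counts three words and four characters, because the final punctuation token is skipped.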