Pedagogically Useful Extractive Summaries for Science Education

Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 177–184, Manchester, August 2008

Sebastian de la Chica, Faisal Ahmad, James H. Martin, Tamara Sumner
Institute of Cognitive Science, Department of Computer Science
University of Colorado at Boulder
sebastian.delachica, faisal.ahmad, james.martin,

Abstract

This paper describes the design and evaluation of an extractive summarizer for educational science content called COGENT. COGENT extends MEAD based on strategies elicited from an empirical study with science domain and instructional design experts. COGENT identifies sentences containing pedagogically relevant concepts for a specific science domain. The algorithms pursue a hybrid approach integrating both domain-independent bottom-up sentence scoring features and domain-aware top-down features. Evaluation results indicate that COGENT outperforms existing summarizers and generates summaries that closely resemble those generated by human experts. COGENT concept inventories also appear to support the computational identification of student misconceptions about earthquakes and plate tectonics.

1   Introduction

Multidocument summarization (MDS) research efforts have resulted in significant advancements in algorithm and system design (Mani, 2001). Many of these efforts have focused on summarizing news articles, but have not significantly explored the research issues arising from processing educational content to support pedagogical applications. This paper describes our research into the application of MDS techniques to educational science content to generate pedagogically useful summaries.

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license. Some rights reserved.

Knowledge maps are graphical representations of domain information laid out as networks of nodes containing rich concept descriptions interconnected using a fixed set of relationship types (Holley and Dansereau, 1984). Knowledge maps are a variant of the concept maps used to capture, assess, and track student knowledge in education research (Novak and Gowin, 1984). Learning research indicates that knowledge maps may be useful cognitive scaffolds, helping users lacking domain expertise to understand the macro-level structure of an information space (O'Donnell et al., 2002). Knowledge maps have emerged as an effective representation to generate conceptual browsers that help students navigate educational digital libraries, such as the Digital Library for Earth System Education (DLESE) (Butcher et al., 2006). In addition, knowledge maps have proven useful for domain and instructional experts to capture domain knowledge from digital library resources and to analyze student understanding for the purposes of providing formative assessments (Ahmad et al., 2007).

Knowledge maps have proven useful both as representations of knowledge for assessment purposes and as learning resources for presentation to students. However, domain knowledge map construction by experts is an expensive knowledge engineering activity. In this paper, we describe our progress towards the automated generation of pedagogically useful extractive summaries from educational texts about a science domain. In the context of automated knowledge map generation, summary sentences correspond to concepts. While the detection of relationships between concepts is also part of our overall research agenda, this paper focuses solely on concept identification using MDS techniques.

The remainder of this paper is organized as follows. First, we review related work in the areas of automated concept extraction from texts and extractive summarization.
We then describe the empirical study we have conducted to understand how domain and instructional design experts identify pedagogically important science concepts in educational digital library resources. Next, we provide a detailed description of the algorithms we have designed based on expert strategies elicited from our empirical study. We then present and discuss our evaluation results using automated summarization metrics and human judgments. Finally, we present our conclusions and future work in this area.

2   Related Work

Our work is informed by efforts to automate the acquisition of ontology concepts from text. OntoLearn (Navigli and Velardi, 2004) extracts domain terminology from a collection of texts using a syntactic parse to identify candidate terms that are filtered based on domain relevance and connected using a semantic interpretation based on word sense disambiguation. The newly identified concepts and relationships are used to update an existing ontology. Knowledge Puzzle focuses on n-grams to produce candidate terms filtered based on term frequency in the input documents and on the number of relationships associated with a given term (Zouaq et al., 2007). This approach leverages pattern extraction techniques to identify concepts and relationships. While these approaches produce ontologies useful for computational purposes, the identified concepts are of a very fine granularity and therefore may yield graphs not suitable for identifying student misconceptions or for presentation back to the student. Clustering by committee has also been used to discover concepts from a text by grouping terms into conceptually related clusters (Lin and Pantel, 2002). The resulting clusters appear to be tightly related, but operate at a very fine level of granularity.
Our approach focuses on sentences as units of knowledge to produce concise representations that may be useful both as computational objects and as learning resources to present back to the student. Therefore, extractive summarization research also informs our work.

Topic representation and topic themes have been used to explore promising MDS techniques (Harabagiu and Lacatusu, 2005). Recent efforts in graph-based MDS have integrated sentence affinity, information richness and diversity penalties to produce very promising results (Wan and Yang, 2006). Finally, MEAD is a widely used multi-document summarization and evaluation platform (Radev et al., 2000). MEAD research efforts have resulted in significant contributions to support the development of summarization applications (Radev et al., 2000). While all these systems have produced promising results in automated evaluations, none have directly targeted educational content as input or the generation of pedagogically useful summaries. We are directly building upon MEAD due to its focus on sentence extraction and its high degree of modularization.

3   Empirical Study

We have conducted a study to capture how human experts construct and use knowledge maps. In this 10-month study, we examined how experts created knowledge maps from educational digital libraries and how they used the maps to assess student work and provide personalized feedback.

In this paper, we focus on the knowledge map construction aspects of the study. Four geology and instructional design experts collaboratively selected 20 resources from DLESE to construct a domain knowledge map on earthquakes and plate tectonics for high school age learners. The experts independently created knowledge maps of individual resources, which they collaboratively merged into the final domain knowledge map in a one-day workshop. The resulting domain knowledge map consisted of 564 nodes containing domain concepts and 578 relationships.
The concepts consist of 7,846 words, or 5% of the total number of words in the original resources. Figure 1 shows a fragment of the domain knowledge map created by our experts.

Figure 1. Fragment of domain knowledge map created by domain and instructional experts

Experts created nodes containing concepts of varying granularity, including nouns, noun phrases, partial sentences, single sentences, and multiple sentences. Our analysis of this domain knowledge map indicates that experts relied on copying-and-pasting (58%) and paraphrasing (37%) to create most domain concepts. Only 5% of the nodes could not be traced directly to the original resources.

Experts used relationship types in a Zipf-like distribution with the top 10 relationship types accounting for 64% of all relationships. The top 2 relationship types each accounted for more than 10% of all relationships: elaborations (19% or 110 links) and examples (14% or 78 links).

We have established the completeness of this domain knowledge map by asking a domain expert to assess its content coverage of nationally-recognized educational goals on earthquakes and plate tectonics for high school age learners using the American Association for the Advancement of Science (AAAS) Benchmarks (Project 2061, 1993). The results indicate adequate content coverage of the relevant AAAS Benchmarks achieved through 82 of the concepts (15%), with the remaining 482 concepts (85%) providing very detailed elaborations of the associated learning goals.

Qualitative analysis of the verbal protocols captured during the study indicates that all experts used external sources to construct the domain knowledge map.
Experts made references to their own knowledge (e.g., “I know that…”), to content learned or taught in geology courses, to other resources used in the study, and to the National Science Education Standards (NSES), a comprehensive collection of nationally-recognized science learning goals for K-12 students (National Research Council, 1996).

We have examined sentence extraction agreement between experts using a kappa measure that accounts for prevalence of judgments and conflicting biases amongst experts, called PABA-kappa (Byrt et al., 1993). The average PABA-kappa value of 0.62 indicates that our experts substantially agree on sentence extraction from digital library resources. While this study was not designed as an annotation project to support summarization evaluation, this level of agreement indicates that the concepts selected by the experts may serve as the reference summary to evaluate the performance of our summarizer.

4   Summarizer for Science Education

Creating a knowledge map from a collection of input texts involves identifying sentences containing important domain concepts, linking concepts, and labeling those links. This paper focuses solely on identifying and extracting pedagogically relevant sentences as domain concepts.

We have designed and implemented an extractive summarizer for educational science content, called COGENT, based on MEAD version 3.11 (Radev et al., 2000). COGENT processes a collection of educational digital library resources by first preprocessing each resource using Tidy to fix improperly formatted HTML code. COGENT then merges multiple web pages into a single HTML document and extracts the contents of each resource into a plain text file. We have extended MEAD with sentence scoring features based on domain content, document structure, and sentence length.

4.1   Domain Content

We have designed two sentence-scoring features that aim to capture the domain content relevance of each sentence: the educational standards feature and the gazetteer feature.

We have developed a feature that models how human experts used external sources to identify and extract concepts. The educational standards feature uses the textual description of the relevant AAAS Benchmarks on earthquakes and plate tectonics for high-school age learners and the associated NSES. Each sentence receives a score based on its similarity to the text contents of the learning goals and educational standards computed using a TFIDF (Term Frequency-Inverse Document Frequency) approach (Salton and Buckley, 1988). We have used KinoSearch, a Perl implementation of the Lucene search engine, to create an index that includes the AAAS Benchmarks learning goal description (boosted by 2), subject (boosted by 8), and keywords (boosted by 2), plus the text of the associated national standards (not boosted). Sentence scores are based on the similarity scores generated by KinoSearch in response to a query consisting of the sentence text.

To account for the large number of examples used by the experts in the domain knowledge map (14% of all links), we have developed a feature that reflects the number and relevance of the geographical names in each sentence. Earth science examples often refer to names of geographical places, including geological formations on the planet. The gazetteer feature leverages the Alexandria Digital Library (ADL) Gazetteer service (Hill, 2000) to check whether named entities identified in each sentence match entries in the ADL Gazetteer. A gazetteer is a georeferencing resource containing information about locations and place-names, including latitude and longitude as well as type information about the corresponding geographical feature.
Each sentence receives a score based on a TFIDF approach where the TF is the number of times a particular location name appears in the sentence and the IDF is the inverse of the count of gazetteer entries matching the location name. If the ADL Gazetteer returns a large number of results for a given place-name, it means there are many geographical locations identified by that name. Our assumption is that unique names may be more pedagogically relevant. For example, Ohio receives an IDF score of 0.0625 because the ADL Gazetteer contains 16 entries so named, while the Mid-Atlantic Ridge, the distinctive underwater mountain range dividing the Atlantic Ocean, receives a score of 1.0 as it appears only once.

4.2   Document Structure

Based on the intuition that the HTML structure of a web site reflects content relevancy, we have developed the hypertext feature. The hypertext feature assigns a higher score to sentences contained under higher level HTML headings.

Heading   Bonus
H1        1/1 = 1.00
H2        1/2 = 0.50
H3        1/3 = 0.33
H4        1/4 = 0.25
H5        1/5 = 0.20
H6        1/6 = 0.17

Table 1. Hypertext feature heading bonus

Within a given heading level, the hypertext feature assigns a higher score to sentences that appear earlier within that level based on both relative paragraph order within the heading and relative sentence position within each paragraph. The equation used to compute the hypertext score for a sentence is

$$\text{hypertext\_score} = \text{heading\_bonus} \times \frac{1}{\sqrt[4]{\text{par\_no}}} \times \frac{1}{\sqrt[4]{\text{sent\_no}}}$$

where heading_bonus is obtained from Table 1, par_no is the paragraph number within the heading, and sent_no is the sentence number within the paragraph. We use the $1/\sqrt[4]{x}$ function to attenuate the contributions to the feature score of later paragraphs and sentences. Initially, we used the same function MEAD uses to modulate its position feature ($1/\sqrt{x}$), but initial experimentation indicated this function decayed too rapidly, resulting in later sentences being over-penalized.

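The gazetteer and hypertext scoring rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the COGENT implementation: the function names are hypothetical, the gazetteer match counts are passed in as a plain dictionary instead of being fetched from the ADL Gazetteer service, and summing the per-name TF × IDF products is an assumption, since the text does not state how they are combined.

```python
def gazetteer_idf(match_count):
    """Inverse of the number of gazetteer entries matching a place-name."""
    # Assumption: a name with no gazetteer match contributes nothing.
    return 1.0 / match_count if match_count > 0 else 0.0


def gazetteer_score(name_counts, gazetteer_matches):
    """Score one sentence from its place-names.

    name_counts: place-name -> occurrences in the sentence (the TF).
    gazetteer_matches: place-name -> number of matching gazetteer entries.
    Assumption: the per-name TF * IDF products are summed.
    """
    return sum(
        tf * gazetteer_idf(gazetteer_matches.get(name, 0))
        for name, tf in name_counts.items()
    )


def hypertext_score(heading_level, par_no, sent_no):
    """Heading bonus attenuated by the fourth root of both positions.

    heading_level: 1 for H1 through 6 for H6 (bonus is 1 / level, Table 1).
    par_no, sent_no: 1-based positions within the heading and the paragraph.
    """
    heading_bonus = 1.0 / heading_level
    return heading_bonus * par_no ** -0.25 * sent_no ** -0.25


if __name__ == "__main__":
    print(gazetteer_idf(16))         # the Ohio example from the text
    print(hypertext_score(2, 1, 2))  # second sentence under an H2 heading
```

Under these assumptions, the second sentence of the first paragraph under an H2 heading scores 0.5 × 2^(-1/4), roughly 0.42, whereas the square-root attenuation used by the MEAD position feature would give roughly 0.35 for the same sentence.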
4.3   Sentence Length

To promote the extraction of sentences containing scientific concepts, we have developed the content word density feature. This feature makes a cut-off decision based on the ratio of content words to function words in a sentence. The content word density feature uses a pre-populated list of function words (a stopword list) to calculate the ratio of content to function words within each sentence, keeping sentences that meet or exceed the ratio of 50%. This cut-off value implies that the extracted sentences contain relatively more content words than function words.

4.4   Sentence Scoring and Selection

We compute the final score of each sentence by adding the scores obtained for the MEAD default configuration features (centroid and position) to the scores for the COGENT features (educational standards, gazetteer, and hypertext). After the sentences have been sorted according to their cumulative scores, we keep sentences that pass the cut-off constraints: a MEAD length feature equal to or greater than 9 and a COGENT content word density equal to or greater than 50%. We use the MEAD cosine re-ranker to eliminate redundant sentences based on a cutoff similarity value of 0.7. Since human experts used only 5% of the total word count in the resources, we have configured MEAD to use a 5% word compression rate.

5   Evaluation

We have evaluated COGENT by processing the 20 digital library resources used in the empirical study and comparing its output against the concepts identified by the experts.

5.1   Quality

To assess the quality of the generated summaries, we have examined three configurations: Random, Default, and COGENT. The Random configuration extracts a random collection of sentences from the input texts. The Default configuration uses the MEAD default centroid, position and length (cut-off value of 9) sentence scoring features.
Finally, the COGENT configuration includes the MEAD default features and the COGENT features. The Default and COGENT configurations use the MEAD cosine function with a threshold of 0.7 to eliminate redundant sentences. All three configurations use a word compression factor of 5%, resulting in summaries of very similar length.

For this evaluation, we leverage ROUGE (Lin and Hovy, 2003) to address the relative quality of the generated summaries based on common n-gram counts and longest common subsequence (LCS). We report on ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-W-1.2 (weighted LCS), and ROUGE-S* (skip bigrams) as they appear to correlate well with human judgments for longer multi-document summaries, particularly ROUGE-1 (Lin, 2004). Table 2 shows the results of this ROUGE-based evaluation including recall (R), precision (P), and balanced f-measure (F).

Measure        Random   Default  COGENT
R-1       R    0.4855   0.4976   0.6073
          P    0.5026   0.5688   0.6034
          F    0.4939   0.5308   0.6054
R-2       R    0.0972   0.1321   0.1907
          P    0.1006   0.1510   0.1895
          F    0.0989   0.1409   0.1901
R-W-1.2   R    0.0929   0.0951   0.1185
          P    0.1533   0.1733   0.1877
          F    0.1157   0.1228   0.1453
R-S*      R    0.2481   0.2620   0.3820
          P    0.2657   0.3424   0.3772
          F    0.2566   0.2969   0.3796

Table 2. Quality evaluation results (5% word compression)

COGENT consistently outperforms the Random and Default baselines based on all four reported ROUGE measures. Given that much of the original research effort on MEAD has centered on news articles, this result is not surprising. Pedagogical content, such as the educational digital library resources used in our work, differs in rhetorical intent, structure and terminology from the news articles leveraged by the MEAD researchers. However, the COGENT features described here are complementary to the default MEAD configuration.
COGENT can best be characterized as a hybrid MDS, integrating bottom-up (centroid, position, length, hypertext, and content word density) and top-down (educational standards and gazetteer) sentence scoring features. This hybrid approach reflects our findings from observing expert behaviors for identifying concepts from educational digital library resources. We believe the overall improvement in quality scores may be due to the COGENT features targeting different dimensions of what constitutes a pedagogically effective summary than the default MEAD features.

To characterize the COGENT summary contents, one of our research team members manually generated a summary corresponding to the best case for an extractive summarizer. This Best Case summary comprises the sentences from the digital library resources that align to the concepts selected by the human experts in our empirical study. Since the experts created concepts of varying granularity, this alignment produces the list of sentences that the experts would have produced if they had only selected single sentences to create concepts for their domain knowledge map. This summary comprises 621 sentences consisting of 13,116 words, or about a 9% word compression.

For this aspect of the evaluation, we have used ROUGE-L, an LCS metric computed using ROUGE. The ROUGE-L computation examines the union LCS between each reference sentence and all the sentences in the candidate summary. We believe this metric may be well-suited to reflect the degree of linguistic surface structure similarity between summaries. We postulate that ROUGE-L may be able to account for the explicitly copy-pasted concepts and to detect the more subtle similarities with paraphrased concepts in the expert-generated domain knowledge map. We have also used the content-based evaluation capabilities of MEAD to report on a cosine measure to capture similarity between the candidate summaries and the reference.
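The two notions of similarity used in this part of the evaluation can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the ROUGE or MEAD code: the tokenizer is a simple lowercase word split, the cosine is computed over raw term frequencies, and `lcs_length` computes only the plain longest common subsequence of two token sequences, not the summary-level union LCS that ROUGE-L uses. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(candidate, reference):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(tokenize(candidate)), Counter(tokenize(reference))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    ref = "Earthquakes occur along plate boundaries"
    cand = "Most earthquakes occur near plate boundaries"
    print(cosine_similarity(cand, ref))
    print(lcs_length(tokenize(cand), tokenize(ref)))
```

The cosine ignores word order entirely, while the LCS rewards tokens that appear in the same relative order, which is why the two measures are reported side by side.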
Table 3 shows the results of this aspect of the evaluation including recall (R), precision (P), and balanced f-measure (F).

Measure      Random (5%)   Default (5%)   COGENT (5%)   Best Case (9%)
R-L     R    0.4814        0.4919         0.6021        0.9669
        P    0.4982        0.5623         0.5982        0.6256
        F    0.4897        0.5248         0.6001        0.7597
Cosine       0.5382        0.6748         0.8325        0.9323

Table 3. Content-based evaluation results (word compression in parentheses)

COGENT consistently outperforms the Random and Default baselines on both the ROUGE-L and cosine measures. Given the cosine value of 0.8325, it appears COGENT extracts sentences containing similar terms in very similar frequency distribution as the experts.

The ROUGE-L scores also consistently indicate that the COGENT summary may be closer to the reference summary in relative word order-