How to automate the extraction and analysis of information for educational purposes


Main aim of the study

In this case study, we set out to examine how online data about citizen science (CS) is shared and communicated on websites, and how this data could be extracted at scale and stored in a central database for later analysis for different purposes. One such purpose, studied in this article, is the use of this information in educational contexts.

Period addressed by the study

For this study, we analysed information about both websites and CS projects. From the 72 websites selected, we extracted information on 4949 CS projects. The list of websites and projects was updated continuously for the duration of the project (2019-2022). The data was extracted from the CS Track database.

Research questions 

This research was focused on answering the following questions:

  • How is CS communicated and promoted on websites?
  • How can automatic methods, such as web scraping and anonymization techniques, be designed, developed and used to extract data from online sites? And how could these methods be applied in compliance with the GDPR?
  • Can this data be used for educational purposes and, if so, how?

Research Context

CS has a wide online presence. It ranges from online platforms dedicated to local, regional or global CS practice (such as the Citizen Science Association (CSA-North America), the European Citizen Science Association (ECSA), the Australian Citizen Science Association (ACSA), Observatorio de la ciencia ciudadana (Spain) or Bürger schaffen Wissen (Germany)), through sites dedicated to a single CS project (such as Mosquito Alert or Cities-Health), to sites that contain information about CS projects but are not oriented to CS practice (such as the sites of a research institute, a museum or a university). Among other objectives, these websites (especially those dedicated to CS) aim to make CS known and to promote participation in, and dissemination of, CS projects (Vohland et al., 2021; Veeckman et al., 2019).

The communication of science through online media might contribute to promoting informal scientific knowledge (Stocklmayer et al., 2010). Furthermore, previous studies suggest that citizens’ participation in CS projects might promote knowledge, skill development, awareness of the real problems addressed by projects, or motivation towards STEM careers (Hiller & Kitsantas, 2014; Bonney et al., 2016; Kobori et al., 2016; Vohland et al., 2021). Building on these assumptions, this study explores how websites communicate about CS projects and how the information available can be used in formal education contexts, for instance, to promote scientific literacy or support teachers’ practice.

Using automatic techniques, which have previously been used to collect and better understand data (Diouf et al., 2019; Ponti et al., 2018), we were able to create a database with more than 4000 projects. By centralising the data from various sites (in compliance with the European General Data Protection Regulation (GDPR)), we expected to be able to analyse the data structures that websites use to report their data and to answer the research questions defined.
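
As an illustration of how personal data could be handled before scraped records are centralised, the following minimal Python sketch redacts e-mail addresses from free text and replaces a contact name with a salted hash. It is not the procedure used in the study: the record fields, the placeholder text, the salt and both helper functions are assumptions made for this example. Note also that salted hashing is pseudonymisation rather than full anonymisation, so the output would still need to be treated with care under the GDPR.

```python
import hashlib
import re

# Simple pattern for e-mail addresses; real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace every e-mail address in free text with a fixed placeholder."""
    return EMAIL_RE.sub("[email removed]", text)


def pseudonymise(value: str, salt: str) -> str:
    """Return a stable token for a personal identifier (salted SHA-256, truncated)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


# Hypothetical scraped record; the field names are illustrative only.
record = {
    "title": "Urban bird count",
    "description": "Questions? Write to jane.doe@example.org for details.",
    "contact_name": "Jane Doe",
}

# Keep the project information, drop or transform the personal data.
clean = {
    "title": record["title"],
    "description": redact_emails(record["description"]),
    "contact_id": pseudonymise(record["contact_name"], salt="demo-salt"),
}
```

The same name always maps to the same token for a given salt, so records can still be grouped by contact without storing the name itself.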

Research Methods applied

For this study, we applied both computational methods (web scraping) and an exploratory study (manual analysis of the extracted data and of website information). Through the manual analysis, we wanted to identify how websites share information about CS online and to analyse their technical architecture, in order to better understand to what extent they apply a metadata standard, in particular the Public Participation in Scientific Research (Citizen Science) metadata standard (PPSR_Core).

Procedures applied

Figure 1 shows the process followed during the analysis (Calvera-Isabal et al., 2023):

  1. Selection of websites that (1) contain CS project information, (2) are from Europe or allow the participation of European citizens and (3) allow automatic data extraction.
  2. Analysis of the websites’ content and characteristics, including how they share information online.
  3. Development and execution of the crawler to extract and store the data.
  4. Analysis of the potential usage of the data in formal education contexts.

Figure 1. Process followed to extract and store the data used for the study.
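
The extract-and-store step of this process can be sketched in a few lines of Python using only the standard library. This is a simplified illustration, not the CS Track crawler: the HTML snippet, the `project` CSS class, the URLs and the table layout are all assumptions, and a real crawler would download pages from each site (respecting its terms of use) rather than parse an inline string.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical listing page; a real crawler would download this HTML instead.
SAMPLE_HTML = """
<ul>
  <li class="project"><a href="/projects/1">Mosquito Alert</a></li>
  <li class="project"><a href="/projects/2">Cities-Health</a></li>
  <li class="news"><a href="/news/9">Site update</a></li>
</ul>
"""


class ProjectLinkParser(HTMLParser):
    """Collect (title, url) pairs from links inside <li class="project">."""

    def __init__(self):
        super().__init__()
        self.projects = []
        self._in_project = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_project = attrs.get("class") == "project"
        elif tag == "a" and self._in_project:
            self._href = attrs.get("href")

    def handle_data(self, data):
        # The first non-empty text after a project link is taken as its title.
        if self._href and data.strip():
            self.projects.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_project = False


def store(projects, conn):
    """Insert projects into a central table, skipping URLs already stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS project (title TEXT, url TEXT UNIQUE)")
    conn.executemany("INSERT OR IGNORE INTO project VALUES (?, ?)", projects)
    conn.commit()


parser = ProjectLinkParser()
parser.feed(SAMPLE_HTML)

conn = sqlite3.connect(":memory:")  # in-memory database for the example
store(parser.projects, conn)
rows = conn.execute("SELECT title, url FROM project ORDER BY url").fetchall()
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` means the crawler can be re-run as the list of websites is updated without creating duplicate project rows.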

Summary of results/findings

After the analysis, we added 4 new categories to the PPSR_Core metadata standard for the CS Track database. We observed that, although information for the mandatory categories is included in 91.56% of the cases, websites still have work to do to adopt the PPSR_Core metadata standard.
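
One plausible way to compute a coverage figure of this kind is to count the records in which every mandatory category is filled in. The Python sketch below assumes this definition; the three field names are hypothetical placeholders (the actual list of mandatory categories should be taken from the PPSR_Core specification), and the sample records are invented for the example.

```python
# Hypothetical mandatory field names; replace with the list from the standard.
MANDATORY = ("name", "description", "url")


def has_mandatory(record: dict) -> bool:
    """True if every mandatory field is present and not blank."""
    return all(str(record.get(field) or "").strip() for field in MANDATORY)


def coverage(records) -> float:
    """Percentage of records that contain all mandatory fields."""
    records = list(records)
    if not records:
        return 0.0
    complete = sum(has_mandatory(r) for r in records)
    return 100.0 * complete / len(records)


# Invented sample: two complete records, one blank field, one missing field.
sample = [
    {"name": "A", "description": "d", "url": "https://example.org/a"},
    {"name": "B", "description": "", "url": "https://example.org/b"},
    {"name": "C", "description": "d", "url": "https://example.org/c"},
    {"name": "D", "description": "d"},
]
share = coverage(sample)
```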

Having access to large-scale CS data, online educational resources, or tools developed or used by CS projects could also help teachers to create learning activities: for instance, by inspiring new activities, by showing how science is addressing real problems, or by enabling participation in CS projects from an educational perspective (using the materials developed). Nevertheless, only 48.61% of the websites analysed have educational material or information about learning, although those that allow online participation (such as Zooniverse) have specific educational sections. Likewise, they include information about tools used in CS projects that teachers could use in the classroom to support or enhance students’ learning. Finally, we expect that, by exploring the extracted data and the available resources, teachers could improve their pedagogical skills and scientific knowledge. Ultimately, this might have an effect on students’ knowledge of and attitude toward science (Chan & Yung, 2018).


To improve the communication of CS projects, as well as the accessibility and analysis of the data, CS platforms might apply the PPSR_Core standard more strictly. This could help citizens find the key information about CS projects, and might motivate them to participate or spark their interest in learning more about the projects. Applying the standard would also facilitate search and the automation of data extraction, allowing algorithms such as named-entity recognition (NER) to extract and classify data, which might improve scientific knowledge about CS (e.g. SDGs or research areas).
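
To give a sense of what automatic classification of project descriptions could look like, the sketch below tags a description with research areas using a simple keyword lookup. This is a deliberately simplified stand-in and not a trained NER model: the area names, the keyword lists and the function name are all assumptions made for this example.

```python
import re

# Hypothetical keyword lists; a real system would rely on a trained NER model
# or a curated vocabulary rather than this small dictionary.
RESEARCH_AREAS = {
    "biodiversity": {"bird", "birds", "insect", "insects", "species"},
    "health": {"mosquito", "disease", "air quality"},
    "astronomy": {"galaxy", "galaxies", "telescope"},
}


def tag_research_areas(description: str) -> list[str]:
    """Return the research areas whose keywords occur as whole words."""
    text = description.lower()
    found = []
    for area, keywords in RESEARCH_AREAS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            found.append(area)
    return sorted(found)


tags = tag_research_areas("Volunteers report mosquito sightings and bird species.")
```

A lookup like this only works well when descriptions are stored in a predictable field, which is one reason a consistently applied metadata standard makes automatic classification easier.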

A correct application of the standard would also help to support educational uses of CS data. Structuring and classifying the information into the categories defined by the metadata standard, and sharing the information that teachers need, might help them to use this data in a formal educational context, inspire them to create learning activities or motivate them to participate in a project. This possibility was explored in other studies presented in this deliverable, in which teachers explained that they use open resources, tools developed by others, their personal experiences and other teachers’ practices as inspiration to create learning activities and adapt their practice.

Link to complete report

A scientific article was published in issue 74 of the journal Comunicar: Calvera-Isabal, M., Santos, P., Hoppe, H., & Schulten, C. (2023). How to automate the extraction and analysis of information for educational purposes. [Cómo automatizar la extracción y análisis de información sobre ciencia ciudadana con propósitos educativos]. Comunicar, 74. https://doi.org/10.3916/C74-2023-02

Link to dataset

Database: https://zenodo.org/record/7356627#.Y39bEnaZNPY

List of descriptors: https://zenodo.org/record/7310445#.Y2zph3aZNPY

References


Bonney, R., Phillips, T. B., Ballard, H.L., & Enck, J.W. (2016). Can citizen science enhance public understanding of science? Public Understanding of Science, 25(1), 2-16. https://doi.org/10.1177/0963662515607406

Chan, K.K.H., & Yung, B.H.W. (2018). Developing pedagogical content knowledge for teaching a new topic: More than teaching experience and subject matter knowledge. Research in Science Education, 48(2), 233-265. https://doi.org/10.1007/s11165-016-9567-1

Diouf, R., Sarr, E.N., Sall, O., Birregah, B., Bousso, M., & Mbaye, S.N. (2019). Web Scraping: State-of-the-Art and Areas of Application. In 2019 IEEE International Conference on Big Data (Big Data) (pp. 6040-6042). https://doi.org/10.1109/BigData47090.2019.9005594

Hiller, S.E., & Kitsantas, A. (2014). The effect of a horseshoe crab citizen science program on middle school student science performance and STEM career motivation. School Science and Mathematics, 114(6), 302-311. https://doi.org/10.1111/ssm.12081

Kobori, H., Dickinson, J.L., Washitani, I., Sakurai, R., Amano, T., Komatsu, N., Kitamura, W., Takagawa, S., Koyama, K., Ogawara, T., & Miller-Rushing, A.J. (2016). Citizen science: a new approach to advance ecology, education, and conservation. Ecological Research, 31(1), 1-19. https://doi.org/10.1007/s11284-015-1314-y

Ponti, M., Hillman, T., Kullenberg, C., & Kasperowski, D. (2018). Getting it right or being top rank: Games in citizen science. Citizen Science: Theory and Practice, 3(1). https://doi.org/10.5334/cstp.101

Veeckman, C.M., Talboom, S., Gijsel, L., Devoghel, H., & Duerinckx, A. (2019). Communicatie bij burgerwetenschap: Een praktische handleiding voor communicatie en betrokkenheid bij citizen science. SCIVIL. https://bit.ly/3PKQz50

Vohland, K., Land-Zandstra, A., Ceccaroni, L., Lemmens, R., Perelló, J., Ponti, M., Samson, R., & Wagenknecht, K. (Eds.) (2021). The science of citizen science. Springer Nature. https://doi.org/10.1007/978-3-030-58278-4


Audience: Citizen Scientist | CS Platforms | CS project initiator/founder