The Chimp & See project
Zooniverse is one of the larger citizen science (CS) web portals. It was launched in 2009 by the Citizen Science Alliance (CSA), which has board members from various institutions such as the Adler Planetarium and the Johns Hopkins University. At the time of writing, the platform has more than 1.6 million registered volunteers participating in citizen science activities. In June 2020, the platform listed 99 active, 122 paused and 46 finished projects. All of these projects involve crowdsourced CS activities with active participation, in which the volunteers annotate and classify entities, among other types of activity. A comprehensive description of the typology of the CS activities in Zooniverse has been published by Michalak (2015).
Chimp & See was started in 2015 by the Max Planck Institute for Evolutionary Anthropology as one of the projects on the Zooniverse platform. The goal of the project is to gain a better understanding of chimpanzee culture, population size and demographics in specific regions of Africa. The activity is an annotation and classification task in which the volunteers identify species in videos. The web-based platform features a video player equipped with interactive tools that allow users to annotate parts of the video, for example to highlight certain behavioural patterns of chimpanzees. The volunteers are not required to have any prior knowledge in the field. They receive instructions for the annotation in an interactive web tutorial.
Goals and indicators
The Zooniverse projects are per se crowdsourced activities with active participation (Haklay, 2013). For each project, a website with a communication forum exists. The actors in such an activity can be categorised by their role as a volunteer, a moderator or a scientist. From an external perspective, it is neither obvious nor trivial how these prescribed roles correspond to the actual roles the participants take on in the discourse or communication in the forum (talk pages). The behavioural patterns and communication structures that can be observed within the forum communication might provide some clues about weaknesses and strengths in the facilitation of such online activities, and might also help us understand how certain discourse patterns can be identified as characteristic of a particular citizen science project. The goal of this work in CS Track is to conduct an initial, exploratory study that characterises one CS project in order to infer more general mechanisms that can be applied to a variety of projects. The “Chimp & See” project was selected as a prototypical project for our analysis. On the one hand, the original version of this project has been completed and its discussion forum is closed. On the other hand, an active successor project exists, leaving space for further exploration and comparison.
To characterise the communication structure and particularly the role of certain users in the discourse, we use centrality measures such as (weighted) in-degree, (weighted) out-degree, and eigenvector centrality to measure different types of importance. Additionally, descriptive statistics about the distribution of communication across the different roles (“who talks to whom?”) give important insights into a project’s communication structure, particularly by telling us who initiates communication and who replies to enquiries.
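The measures named above can be illustrated with a minimal, dependency-free sketch. The user names and edge weights below are invented for illustration, and the power iteration is only one common way to compute eigenvector centrality; NetworkX, which the pipeline uses, offers its own implementations of these measures.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical weighted, directed reference edges: (source, target, weight).
edges = [
    ("vol_a", "mod_1", 3),
    ("vol_b", "mod_1", 2),
    ("mod_1", "sci_1", 4),
    ("sci_1", "mod_1", 1),
    ("mod_1", "vol_a", 2),
]


def weighted_degrees(edges):
    """Return (in_degree, out_degree) dicts that sum edge weights per node."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst, weight in edges:
        out_deg[src] += weight
        in_deg[dst] += weight
    return dict(in_deg), dict(out_deg)


def eigenvector_centrality(edges, max_iter=1000, tol=1e-9):
    """Power iteration: a node scores highly when highly scored nodes point to it."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    x = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Start from the previous vector (a shift that helps convergence),
        # then pass each source's score to its target, scaled by edge weight.
        new = dict(x)
        for src, dst, weight in edges:
            new[dst] += weight * x[src]
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")
```

In this toy graph, vol_b is never referenced by anyone, so it has no in-degree and its eigenvector centrality tends to zero, even though it contributes to the in-degree of mod_1.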
How did we get the results?
The forum data of the Chimp & See talk pages were processed to create a dataset for the analysis. For this purpose, 3218 forum threads with 24531 individual posts were analysed. The forum involved a total of 575 unique user accounts, which represent 10.1% of all active volunteers of the Chimp & See project. These accounts break down into the following (system) roles: 8 moderators, 25 scientists and 542 volunteers. The time window of the forum discussions that we processed runs from 2015-04-03 to 2019-07-05. Three subforums were analysed: help, science, and chat (“community building”). The average length of a discussion thread is 6.5 posts, varying by subforum (help: 5.7; chat: 5; science: 8.7). The overview page that lists all subforums was used as the starting point for collecting the forum posts. A tool that systematically stores the content of websites in this way is called a “crawler”.
For the data collection and processing, a pipeline was implemented in the Python programming language. The crawler used Selenium with a headless browser to access the page content of the forum and BeautifulSoup to extract the relevant data, including paging through multi-page threads.
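The paging logic for multi-page threads can be sketched as follows. This is not the project's crawler: to stay self-contained, the sketch swaps in Python's standard html.parser for BeautifulSoup and a plain callable (fetch_html) for the Selenium browser. The CSS class names "post" and "next" are hypothetical, since the real forum markup is not described here.

```python
from html.parser import HTMLParser


class ThreadPageParser(HTMLParser):
    """Collect post texts and the next-page link from one thread page.

    Assumes (hypothetically) that each post is a <div class="post"> without
    nested <div> elements and that paging uses an <a class="next"> link.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self.next_url = None
        self._in_post = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "post" in classes:
            self._in_post = True
            self.posts.append("")
        elif tag == "a" and "next" in classes:
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_post = False

    def handle_data(self, data):
        if self._in_post:
            self.posts[-1] += data


def crawl_thread(fetch_html, first_url, max_pages=100):
    """Follow next-page links and return all post texts of a thread in order."""
    posts, url, seen = [], first_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ThreadPageParser()
        parser.feed(fetch_html(url))
        posts.extend(text.strip() for text in parser.posts)
        url = parser.next_url
    return posts
```

With Selenium, fetch_html could be a small function that calls driver.get(url) and returns driver.page_source. The seen set and the max_pages limit guard against paging loops.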
Afterwards, the NetworkX library was used to extract the social network from the retrieved forum data. Gephi was used for the ex-post analysis (centrality measures) and for visualising the extracted network with a dynamic graph layout. The network is created as a directed graph in which nodes represent the users and the colouring of nodes indicates the prescribed role (moderator, scientist, volunteer). An edge from one user to another is established either when the user replies to a post of the other user or when the user mentions the other user with the designated @ character. A weight is assigned to each edge, representing the number of replies and mentions. The weight of a node is its out-degree. To characterise the dynamics, time slices are selected for each year.
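The edge construction described above can be sketched with the standard library alone. The post record layout (keys "id", "author", "text", "reply_to") and the mention pattern are assumptions made for illustration. So are two counting choices: a reply that also mentions the same user counts as two references, and self-references are ignored.

```python
import re
from collections import Counter

# Assumed mention syntax: "@" followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")


def build_edges(posts):
    """Return a Counter mapping (source, target) to the number of references.

    Each post is a dict with the hypothetical keys "id", "author", "text"
    and "reply_to" (the id of the post being answered, or None).
    """
    author_of = {post["id"]: post["author"] for post in posts}
    edges = Counter()
    for post in posts:
        # References are all @-mentions plus the author of the answered post.
        targets = MENTION.findall(post["text"])
        if post["reply_to"] is not None:
            targets.append(author_of[post["reply_to"]])
        for target in targets:
            if target != post["author"]:  # assumption: skip self-references
                edges[(post["author"], target)] += 1
    return edges
```

The resulting counter can be loaded into a NetworkX DiGraph by adding one edge per key with the count as its weight attribute.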
To address the dynamic character of forum discussions, we investigated the number of forum posts per time slice. The extracted posts were aggregated per year to gain an overview of the posting activity over time. Chart 1 shows that the number of posts decreases over time, both in total and for every subforum. There are several possible reasons for this. For example, decreased posting activity in the help forum might be explained by a growing knowledge base of already answered questions that new users search before posting. The science forum might be affected in a different way, as it contains announcements for competitions and contests, which generate a lot of traffic. Potentially, more recruitment and retention efforts were made by the project team at the start of the project. Methods of content analysis could be used in future work to explore this issue.
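The yearly aggregation can be sketched in a few lines. The input format, a list of (subforum, date) pairs, and the example dates are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def posts_per_year(posts):
    """Count posts per (subforum, year); each post is a (subforum, date) pair."""
    return Counter((forum, day.year) for forum, day in posts)


def totals_per_year(posts):
    """Count posts per year across all subforums."""
    return Counter(day.year for _, day in posts)


# Hypothetical sample records, not taken from the Chimp & See data.
sample = [
    ("help", date(2015, 5, 1)),
    ("help", date(2015, 9, 12)),
    ("science", date(2015, 6, 3)),
    ("help", date(2016, 2, 20)),
    ("chat", date(2016, 7, 8)),
]
```

Calling posts_per_year(sample) yields one count per subforum and year, which corresponds to the series plotted in a chart of posts over time.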
The distribution of roles among the top users is fairly balanced (8 moderators, 9 scientists and 11 volunteers). It is remarkable that there are also volunteers who are highly motivated to contribute to the discussions. For future analyses, it may be interesting to investigate the incentives for volunteers to participate to such an extent. Following the logic of PageRank or eigenvector centrality, communicating with people of high reputation might induce a feeling of importance. Therefore, we investigate the direction of communication, in particular who references whom in terms of the affiliated forum role. The following figure shows the filtered network (k-core filtering with k=10). The node size corresponds to the eigenvector centrality, whereas the colour corresponds to the role (violet = volunteer; orange = scientist; blue = moderator).
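The idea behind k-core filtering can be sketched as follows: nodes with fewer than k neighbours are removed repeatedly until every remaining node has at least k neighbours. Treating the directed edges as undirected is an assumption of this sketch, and the filter in the visualisation tool may differ in detail.

```python
def k_core(edges, k):
    """Return the node set of the k-core, treating edges as undirected."""
    neighbours = {}
    for src, dst in edges:
        if src != dst:  # self-loops do not count towards the degree
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    changed = True
    while changed:
        changed = False
        # Remove every node that currently has fewer than k neighbours.
        for node in [n for n, nb in neighbours.items() if len(nb) < k]:
            for other in neighbours.pop(node):
                neighbours[other].discard(node)
            changed = True
    return set(neighbours)


# Hypothetical toy network: a triangle with a two-node tail attached.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "d")]
```

For the toy network, the 2-core keeps only the triangle, because removing e leaves d with a single neighbour, so d is removed in the next round.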
It can be observed that all of the moderators are still in the filtered network, where a few have a high predominance and a high reputation. Although there are some volunteers with a relatively high centrality, a large portion of the volunteers has been filtered out. The following figure shows the average centrality per role. Compared to moderators, volunteers have on average a very low centrality.
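The average centrality per role is a simple grouped mean. The sketch below assumes two dictionaries, one mapping users to a centrality value and one mapping users to their role; the names and numbers are invented.

```python
from collections import defaultdict


def mean_by_role(centrality, roles):
    """Average the centrality values of all users that share a role."""
    totals, counts = defaultdict(float), defaultdict(int)
    for user, value in centrality.items():
        totals[roles[user]] += value
        counts[roles[user]] += 1
    return {role: totals[role] / counts[role] for role in totals}


# Hypothetical values for illustration only.
example_centrality = {"mod_1": 0.8, "mod_2": 0.6, "vol_a": 0.1, "vol_b": 0.3}
example_roles = {
    "mod_1": "moderator",
    "mod_2": "moderator",
    "vol_a": "volunteer",
    "vol_b": "volunteer",
}
```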
The following three charts show the relative number of references, normalised by the total number of references over all user roles in the specific forum. A reference is either a direct mention of a user (with the ‘@’ symbol) or a reply to a post in the thread structure of the discussion forum. In the help forum, most references are made by moderators.
When volunteers ask for help, they typically do not know whom to address, whereas moderators might point to scientists and/or mention the user who asked the question. Further analysis of communication patterns is needed to substantiate this. Volunteers typically reference moderators to say “thanks” for prior replies to their request for help. In the chat board, the references are similar, except that there is less need for moderators to refer other users to scientists, which explains the lower bar for this type of reference.
Volunteers are mentioned quite often in this forum, usually because the moderators and scientists welcome them. The chat forum is sometimes used by users to introduce themselves. These introductions could serve as material for further analyses to deepen the understanding of the incentives and backgrounds of volunteers, and particularly their motivation to participate in CS activities.
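The normalisation described above can be sketched as follows, assuming the references of one subforum are available as a mapping from (source user, target user) to a count, plus a mapping from users to roles. All names and counts are hypothetical.

```python
from collections import Counter


def reference_shares(edges, roles):
    """Share of references per (source role, target role); shares sum to 1."""
    counts = Counter()
    for (src, dst), weight in edges.items():
        counts[(roles[src], roles[dst])] += weight
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical references within a single subforum.
help_edges = {
    ("mod_1", "vol_a"): 3,
    ("mod_1", "sci_1"): 1,
    ("vol_a", "mod_1"): 2,
    ("sci_1", "vol_a"): 2,
}
help_roles = {"mod_1": "moderator", "sci_1": "scientist", "vol_a": "volunteer"}
```

Each resulting share answers the question “who talks to whom?” at the level of roles, so the values can be compared across subforums of different sizes.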
The example outlined above is a first case study addressing the question of how to assess communication structures in the online communities of citizen science projects. The methods of social network analysis (SNA) show the importance of certain roles in the mediation of citizen science activities. Network-based measures such as centrality have been used to quantify the importance of certain actors within the communication in the Chimp & See discussion forum. The findings confirm that moderators play an important role in the mediation, specifically in the chat board. In addition, the analysis of the centralities over time reveals that certain volunteers try to establish a network of higher reputation, as they connect to more and more actors with a higher centrality. From the perspective of project facilitators, this is an important finding, because such volunteers are probably willing to spend more of their personal resources and engage more in such online citizen science activities. Spotting such actors might be useful for recruiting moderators or fostering a more active CS community. Although this investigation does not give the full picture, it provides an important message to facilitators: community management matters.
An extensive analysis of the Chimp & See case study has been conducted by the CS Track consortium and published by Amarasinghe, Manske, Hoppe, Santos and Hernández-Leo (2021).
Amarasinghe, I., Manske, S., Hoppe, H. U., Santos, P., & Hernández-Leo, D. (2021). Using network analysis to characterize participation and interaction in a citizen science online community. In D. Hernández-Leo, R. Hishiyama, G. Zurita, B. Weyers, A. Nolte, & H. Ogata (Eds.), Collaboration Technologies and Social Computing. Proceedings of the 27th International Conference, CollabTech 2021 (LNCS, no. 12856, pp. 67-82). Springer, Cham. DOI: 10.1007/978-3-030-85071-5_5
Haklay, M. (2013). Citizen science and volunteered geographic information: Overview and typology of participation. In Crowdsourcing geographic knowledge (pp. 105-122). Springer, Dordrecht.
Herodotou, C., Aristeidou, M., Miller, G., Ballard, H., & Robinson, L. (2020). What Do We Know about Young Volunteers? An Exploratory Study of Participation in Zooniverse. Citizen Science: Theory and Practice, 5(1).
Michalak, K. (2015). Online localization of Zooniverse citizen science projects–on the use of translation platforms as tools for translator education. Teaching English with Technology, 15(3), 61-70.
Muthukadan, B. (2018). Selenium with Python. Retrieved from https://bit.ly/2F2VDmz
Richardson, L. (2020). BeautifulSoup. Retrieved from https://bit.ly/2ZkItYG
Wasserman, S., & Faust, K. (1997). Social network analysis: Methods and applications. Structural Analysis in the Social Sciences, Vol. 8 (Corrected reprint). Cambridge University Press, Cambridge.