(Volume: 1, Issue: 1)
In the current century of industrial and urban development, controlling water pollution and monitoring water quality are major concerns. The World Bank’s annual report (2022) warns: “About 2 billion people lack safely managed drinking water and 3.6 billion lack safely managed sanitation. If nothing changes, the world will not have enough water to meet demand by 2030”. Water quality monitoring helps ensure that water resources remain non-toxic and sustainable, which in turn safeguards food production and irrigation. Researchers who intend to study water quality monitoring, in an effort to restore the water ecosystem, can use the “Water Quality Prediction Data Set” from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Water+Quality+Prediction#). It is a real-world dataset, partly derived from the United States Geological Survey, and it has been publicly available for download since 2017. Its areas of application include health and social science, agriculture, energy generation and sustainable resource management.
In recent years, forecasting methods that use both spatial and temporal dependencies have been much sought after, because they capture the dynamic behavior of water quality more accurately. The dataset is multivariate and supports regression analysis of spatio-temporal water quality in terms of the “power of hydrogen (pH)” value. The input consists of daily samples of water quality measurements related to pH, collected at 37 sites in Georgia (USA). The spatial dependency used to predict the next day’s water quality is derived from the water systems of Atlanta and the eastern coast of Georgia. It is taken into account because these water systems are highly complex and tree-structured, which makes forecasts error-prone if that structure is ignored.
There are 11 indices for the input features, including temperature, dissolved oxygen and specific conductance. The output to be forecast is the value of ‘pH, water, unfiltered, field, standard units (Median)’. The dataset contains 705 instances in total, of which 423 can be used as the training set and the remaining 282 as the testing set. To use the dataset, one should cite the paper: “Liang Zhao, Olga Gkountouna, and Dieter Pfoser. 2019. Spatial Auto-regressive Dependency Interpretable Learning Based on Spatial Topological Constraints. ACM Trans. Spatial Algorithms Syst. 5, 3, Article 19 (August 2019), 28 pages. DOI: https://doi.org/10.1145/3339823”.
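The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each of the 705 instances is assumed to be one day holding an 11-feature vector for each of the 37 sites, the targets are assumed to be already aligned with the day being predicted, and the arrays are filled with synthetic placeholder values rather than the real measurements. The helper name `chronological_split` is hypothetical, and the least-squares baseline deliberately ignores the spatial dependencies; it is not the method of the cited paper.

```python
import numpy as np

# Figures quoted in the text; the (day, site, feature) layout is an assumption.
N_DAYS, N_SITES, N_FEATURES = 705, 37, 11
N_TRAIN = 423  # the remaining 282 days form the testing set


def chronological_split(X, y, n_train=N_TRAIN):
    """Split along the first (time) axis without shuffling (hypothetical helper)."""
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])


def add_intercept(A):
    """Append a column of ones so the linear model can learn an offset."""
    return np.hstack([A, np.ones((A.shape[0], 1))])


# Synthetic placeholder data, NOT the real USGS-derived measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(N_DAYS, N_SITES, N_FEATURES))
y = rng.normal(loc=7.0, scale=0.3, size=(N_DAYS, N_SITES))  # pH-like targets

(X_train, y_train), (X_test, y_test) = chronological_split(X, y)

# Baseline: a single ordinary least-squares model shared by all sites.
A_train = add_intercept(X_train.reshape(-1, N_FEATURES))
w, *_ = np.linalg.lstsq(A_train, y_train.reshape(-1), rcond=None)

# Evaluate on the held-out days with root-mean-square error.
A_test = add_intercept(X_test.reshape(-1, N_FEATURES))
rmse = float(np.sqrt(np.mean((A_test @ w - y_test.reshape(-1)) ** 2)))

print(X_train.shape, X_test.shape)
```

Splitting by time rather than at random keeps future days out of the training set, which matters for a next-day forecasting task. With the real data, the placeholder arrays would be replaced by the loaded measurements, and a spatially aware model could then be compared against this baseline.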