Modelling and Data Analysis
2022. Vol. 12, no. 1, 27–48
doi:10.17759/mda.2022120103
ISSN: 2219-3758 / 2311-9454 (online)
Identification and Classification of Toxic Statements by Machine Learning Methods
Abstract
General Information
Keywords: Natural Language Processing (NLP), Classification, Gradient boosting, XGBoost, CatBoost, Recurrent Neural Network, LSTM, Convolutional Neural Network
Journal rubric: Optimization Methods
Article type: scientific article
DOI: https://doi.org/10.17759/mda.2022120103
Received: 18.01.2022
For citation: Platonov E.N., Rudenko V.Y. Identification and Classification of Toxic Statements by Machine Learning Methods. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2022. Vol. 12, no. 1, pp. 27–48. DOI: 10.17759/mda.2022120103. (In Russ., abstr. in Engl.)
References
- Riz R. Natural language processing in Java. DMK-Press. 2016. 264 p.
- Perspective API. URL: https://www.perspectiveapi.com
- van Aken B., Risch J., Krestel R., Löser A. Challenges for toxic comment classification: An in-depth error analysis. 2018, arXiv:1809.07572.
- Risch J., Krestel R. Toxic Comment Detection in Online Discussions. Deep Learning-Based Approaches for Sentiment Analysis. Springer, Singapore, 2020. P. 85–109.
- Weiss K., Khoshgoftaar T.M., Wang D. A survey of transfer learning // Big Data, 3: 9. 2016. https://doi.org/10.1186/s40537-016-0043-6
- Andrusyak B., Rimel M., Kern R. Detection of Abusive Speech for Mixed Sociolects of Russian and Ukrainian Languages // RASLAN. 2018. P. 77–84.
- Li Y., Yang T. Word Embedding for Understanding Natural Language: A Survey. In: Srinivasan S. (eds) Guide to Big Data Applications. Studies in Big Data, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-53817-4_4
- Liu C. et al. Research of text classification based on improved TF-IDF algorithm // IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE). 2018. P. 218–222.
- word2vec // URL: https://code.google.com/archive/p/word2vec/
- Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space // Proceedings of Workshop at ICLR. 2013.
- Bojanowski P. et al. Enriching word vectors with subword information // Transactions of the Association for Computational Linguistics. 2017. V. 5. P. 135–146.
- Pennington J., Socher R., Manning C. D. Glove: Global vectors for word representation // Proceedings of the conference on empirical methods in natural language processing (EMNLP). 2014. P. 1532–1543.
- Wieting J. et al. From paraphrase database to compositional paraphrase model and back // Transactions of the Association for Computational Linguistics. 2015. V. 3. P. 345–358.
- Chen T., Guestrin C. Xgboost: A scalable tree boosting system // Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. P. 785–794.
- Dorogush A. V., Ershov V., Gulin A. CatBoost: gradient boosting with categorical features support // arXiv preprint arXiv:1810.11363. 2018.
- Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. V. 9(8). P. 1735–1780.
- Staudemeyer R. C., Morris E. R. Understanding LSTM – a tutorial into Long Short-Term Memory Recurrent Neural Networks // arXiv preprint arXiv:1909.09586. 2019. URL: https://arxiv.org/pdf/1909.09586.pdf
- Understanding LSTM Networks. URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Convolutional Neural Network. An Introduction to Convolutional Neural Networks. URL: https://towardsdatascience.com/convolutional-neural-network-17fb77e76c05
- Bai S., Kolter J. Z., Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling // CoRR, abs/1803.01271. 2018. http://arxiv.org/abs/1803.01271
- Quora Insincere Questions Classification. URL: https://www.kaggle.com/c/quora-insincere-questions-classification/data
- Fawcett T. An introduction to ROC analysis // Pattern Recognition Letters. 2006. V. 27. P. 861–874.
- Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning. Springer-Verlag, New York. 2017.
- Eli5 Documentation. URL: https://eli5.readthedocs.io/en/latest/
- Ribeiro M. T., Singh S., Guestrin C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier // KDD'16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. P. 1135–1144.
- glove.840B.300d — pre-trained word vectors GloVe. URL: https://nlp.stanford.edu/projects/glove/
- wiki-news-300d-1M – pre-trained word vectors trained using fastText. URL: https://fasttext.cc/docs/en/english-vectors.html
- paragram_300_sl999 – New Paragram-SL999 300 dimensional embeddings tuned on SimLex999 dataset. URL: https://www.kaggle.com/ranik40/paragram-300-sl999
- GoogleNews-vectors-negative300 — pre-trained word vectors trained using Word2Vec. URL: https://code.google.com/archive/p/word2vec/