Introduction to the "Selenops" Corpus Built from Large-scale Proprietary Crawl Data

Data Platform

DAY 1 16:55-17:10 JST Seminar Room B

Onsite

We built a Japanese web corpus from content retrieved from approximately 1.5 billion URLs that we crawled ourselves. This presentation covers how we built the corpus and how we use it within our company.

The archive is available only internally.

Speaker

Hirata Kodai / LY Corporation

Engineer at a search company

Joined the company as a new graduate in 2023. Involved in the development of web crawlers and projects using crawl data.
