To extract the text from the websites, run the retrieve_website_text script.
You can download the scripts here
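As a rough illustration of what extracting a website's text involves, here is a minimal sketch using only the Python standard library. It is not the code of retrieve_website_text, whose internals are not shown here; the names TextExtractor, html_to_text and fetch_text are hypothetical, and the choice to skip script and style content is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url):
    """Download a page and return its text (hypothetical helper, needs network)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


# Offline demonstration on a small HTML string.
sample_html = (
    "<html><head><style>p{}</style></head>"
    "<body><p>日本語のテキスト</p><script>var x=1;</script></body></html>"
)
print(html_to_text(sample_html))
```

The demonstration runs on a hard-coded HTML string so it works without a network connection; fetch_text would be called with a real URL of your choice.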
The websites_analysis script analyzes the text that was retrieved from the websites. Its output is a list of the most frequently used Japanese Kanji characters and Japanese words on those websites.
You can download the script here
The steps below describe the whole process:
Step 1 - Download all the files and place them in a folder.
Step 2 - Run the "websites_analysis" script. When it finishes, it will have created an "analysis" folder containing two folders: "kanji" and "words". Inside the "words" folder, there will be another folder and three text files.
The "jp_words" text file stores all the data from your "jp_websites" folder. The "split_data" folder holds that same data split into several text files, which the program reads one by one.
The "official_jp_words" text file will contain the final list of Japanese words, each with the number of times it occurs.
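The split-then-count idea described for the "words" folder can be sketched as follows. This is an assumption-laden illustration, not the actual websites_analysis code: the function names, the part_0000.txt file naming and the lines_per_file parameter are hypothetical, and the tokenizer is passed in as a parameter because how the real script segments Japanese words is not stated. The demo uses pre-segmented, space-separated text so that str.split can stand in for a real tokenizer.

```python
import tempfile
from collections import Counter
from pathlib import Path


def split_into_files(text, out_dir, lines_per_file):
    """Write `text` into several files of at most `lines_per_file` lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = text.splitlines(keepends=True)
    paths = []
    for start in range(0, len(lines), lines_per_file):
        # Hypothetical naming scheme: part_0000.txt, part_0001.txt, ...
        path = out_dir / f"part_{start // lines_per_file:04d}.txt"
        path.write_text("".join(lines[start:start + lines_per_file]), encoding="utf-8")
        paths.append(path)
    return paths


def count_tokens(paths, tokenize):
    """Read each file in turn and accumulate token counts across all of them."""
    totals = Counter()
    for path in paths:
        totals.update(tokenize(path.read_text(encoding="utf-8")))
    return totals


# Demo with pre-segmented text; a real run would need a Japanese tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    text = "猫 犬 猫\n鳥 猫\n犬 鳥\n"
    paths = split_into_files(text, tmp, lines_per_file=2)
    totals = count_tokens(paths, str.split)

for word, count in totals.most_common():
    print(word, count)
```

Splitting on line boundaries rather than at a fixed character offset avoids cutting a word in half between two files, and reading one file at a time keeps memory use bounded when the collected text is large.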
Step 3 - Inside the "kanji" folder, you will have three text files. The "kanji_chars" text file stores all the Japanese Kanji characters that were found in the websites.
The "official_kanji" text file will contain the final list of Kanji characters, each with the number of times it occurs.
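Counting Kanji occurrences can be sketched in a few lines. This is only an illustration under stated assumptions, not the script's actual implementation: is_kanji and count_kanji are hypothetical names, and a character is treated as Kanji when it falls in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves out the rarer extension blocks.

```python
from collections import Counter


def is_kanji(ch):
    """True if `ch` is in the basic CJK Unified Ideographs block (assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_kanji(text):
    """Count each Kanji character in `text`, ignoring kana, Latin and punctuation."""
    return Counter(ch for ch in text if is_kanji(ch))


sample = "日本語の本を日本で読む"
counts = count_kanji(sample)

# List the characters from most to least frequent.
for kanji, n in counts.most_common():
    print(kanji, n)
```

In the sample sentence, the hiragana characters are skipped and only the Kanji are tallied.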