PySpark LDA Model Dense Vector from RDD?


In the following example, we load word count vectors representing a corpus of documents.

So you first need to extract these word count vectors from your own corpus before proceeding the way you are trying to.
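To make "word count vectors" concrete, here is a minimal plain-Python sketch of how raw documents could be turned into the kind of numeric rows the example file holds. The toy corpus and the helper names `build_vocabulary` and `to_count_vector` are made up for illustration, and whitespace tokenization is an assumption; a real pipeline would likely need proper tokenization and stop-word handling.

```python
def build_vocabulary(documents):
    """Map each distinct token to a column index (sorted for a stable order)."""
    tokens = {tok for doc in documents for tok in doc.lower().split()}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def to_count_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    vector = [0.0] * len(vocabulary)
    for tok in document.lower().split():
        vector[vocabulary[tok]] += 1.0
    return vector


# Hypothetical toy corpus; replace with your own documents.
documents = [
    "spark runs lda on vectors",
    "lda needs word count vectors",
    "vectors vectors vectors",
]

vocabulary = build_vocabulary(documents)
count_vectors = [to_count_vector(doc, vocabulary) for doc in documents]

for row in count_vectors:
    # One row per document, one column per vocabulary term.
    print(" ".join(str(int(c)) for c in row))
```

Each printed row has one column per vocabulary term, which is the shape LDA expects instead of raw text. In Spark, each row would then typically be wrapped with `pyspark.mllib.linalg.Vectors.dense` and paired with a document id before calling `LDA.train`; `CountVectorizer` from `pyspark.ml.feature` is another option if you work with DataFrames.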

Thanks, @desertnaut! I need to read more about the LDA model. My understanding was that I would feed in 'documents' and, based on the counts of text in relation to each other, it would derive topic probabilities.

You are indeed misinterpreting the example: the file sample_lda_data.txt does not contain text (check it), but word count vectors that have already been extracted from a corpus. This is indicated in the text preceding the example:

Note