Nicolo Cosimo Albanese
2 min read · Jan 14, 2023


Thank you for your comment. The article does not say, as a general statement, that BERTopic / Top2Vec "perform worse" than LDA or other models on longer documents. It says that "they [BERTopic and Top2Vec] work better on shorter text". The main reason BERTopic / Top2Vec tend to shine with shorter documents is that they assume each document belongs to one topic only. In practice, the longer a document is, the more likely it is to discuss multiple subjects. Therefore, longer documents may "violate" the main assumption of BERTopic / Top2Vec.

Other models, such as LDA, do not make this assumption and therefore might be more suitable for longer documents.

The post does not say that "BERTopic performs worse" as a general statement because, in the end, the performance of a topic modeling strategy also depends on the data at hand, their context and domain, and the final business goal. BERTopic may work just fine on longer documents: as described in the same bullet point you mentioned, one may simply pre-process the text and split documents into paragraphs/sections before the topic modeling task. As a consequence, one may extract multiple topics for each document (although this comes with additional points of attention).
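The splitting step described above can be sketched as follows. This is only a minimal illustration, assuming that paragraphs are separated by blank lines; the helper names `split_into_paragraphs` and `topics_per_document` are hypothetical, and the paragraph-level topic labels are placeholders standing in for the output of whichever topic model is used.

```python
import re
from collections import defaultdict


def split_into_paragraphs(documents):
    """Split each document on blank lines (an assumed paragraph delimiter).

    Returns the list of paragraphs and, for each paragraph, the index of
    the document it came from.
    """
    paragraphs, doc_ids = [], []
    for doc_id, text in enumerate(documents):
        for chunk in re.split(r"\n\s*\n", text):
            chunk = chunk.strip()
            if chunk:  # skip empty chunks
                paragraphs.append(chunk)
                doc_ids.append(doc_id)
    return paragraphs, doc_ids


def topics_per_document(doc_ids, paragraph_topics):
    """Group paragraph-level topic labels back by source document."""
    grouped = defaultdict(set)
    for doc_id, topic in zip(doc_ids, paragraph_topics):
        grouped[doc_id].add(topic)
    return {doc_id: sorted(topics) for doc_id, topics in grouped.items()}


documents = [
    "Interest rates rose again this quarter.\n\nThe local team won the cup final.",
    "A new vaccine trial has started.",
]
paragraphs, doc_ids = split_into_paragraphs(documents)

# Placeholder labels, one per paragraph. With BERTopic they could come from
# something like: topics, _ = BERTopic().fit_transform(paragraphs)
# (which needs a far larger corpus than this toy example).
paragraph_topics = [0, 1, 2]
doc_topics = topics_per_document(doc_ids, paragraph_topics)
```

Keeping `doc_ids` next to the paragraphs is what allows several topics to be mapped back to one original document. It is also one of the additional points of attention: the grouping has to be carried along and aggregated after modeling.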

The takeaway of the comparison (table included) is that, in the case of very long texts with multiple subjects, one may start experimenting with LDA rather than BERTopic / Top2Vec, without necessarily discarding BERTopic / Top2Vec entirely as "worse".

You may find additional remarks from the author of BERTopic himself, Maarten Grootendorst, in this YouTube interview (https://www.youtube.com/watch?v=uZxQz87lb84&t=2026s). From minute 33:37 to 35:14, he gives his view on short text vs. long text. The interview was released after this post. Another piece of information you may find useful is that BERTopic released a new feature (version 0.13.0) that aims to create a topic probability distribution for each document (https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html). I did not mention it in the article because it was released afterwards (article's date: 19-09-2022, version 0.13.0 release date: 04-01-2023).

Hope this helps.
