Gensim is implemented in Python and Cython for performance. Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.
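This streaming design means a corpus can be any Python iterable that yields one tokenised document at a time, so the full collection never has to fit in RAM. A minimal sketch of that pattern (the file name and whitespace tokenisation are illustrative, not a Gensim API):

```python
# A corpus streamed from disk one document at a time -- the shape of
# input Gensim's incremental algorithms expect.  Documents are assumed
# to be stored one per line in a plain-text file.

class StreamedCorpus:
    def __init__(self, path):
        self.path = path  # documents stored one per line on disk

    def __iter__(self):
        # Re-opening the file on each pass lets online algorithms make
        # multiple sweeps over data larger than available memory.
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.lower().split()
```

An object like this can be passed wherever Gensim expects a token stream, e.g. `gensim.models.Word2Vec(StreamedCorpus("docs.txt"))`.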

Clients: Amazon Retail, Cisco Security, Channel 4, Juju, Issuu, 12K Research, Stillwater Supercomputing, SiteGround, Capital One.


Operating systems: Linux, Windows, macOS

Versions: Cloud/On-Premise 

Use cases

  • Amazon (Retail): Document similarity.

  • National Institutes of Health (Health): Processing grants and publications with word2vec.

  • Cisco (Security): Large-scale fraud detection.

  • Mindseye (Legal): Similarities in legal documents.

  • Channel 4 (Media): Recommendation engine.

  • Talentpair (HR): Candidate matching in high-touch recruiting.

  • Juju (HR): Non-obvious related job suggestions.

  • Tailwind (Media): Posting interesting and relevant content to Pinterest.

  • Issuu (Media): "Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about."

  • Search Metrics (Content Marketing): Gensim word2vec used for entity disambiguation in Search Engine Optimisation.

  • 12K Research: Document similarity analysis on media articles.

  • Stillwater Supercomputing (Hardware): Document comprehension and association with word2vec.

  • SiteGround (Web hosting): An ensemble search engine that uses different embedding models and similarities, including word2vec, WMD, and LDA.

  • Capital One (Finance): Topic modeling for customer complaints exploration.


  • Memory independence: all algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM: streamed, out-of-core).

  • Intuitive interfaces: easy to plug in your own input corpus/datastream (trivial streaming API) and easy to extend with other Vector Space algorithms (trivial transformation API).

  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), and word2vec deep learning.
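The "trivial transformation API" refers to Gensim's idiom of applying a model to a corpus with indexing, `model[corpus]`, which returns a lazy wrapper so chained transformations stay out-of-core. A minimal stdlib sketch of that idiom (the `Lowercase` and `DropStopwords` classes are illustrative, not Gensim's actual transformations):

```python
# Gensim-style transformation chaining: each transformation wraps a
# streamed corpus via __getitem__ and converts documents lazily, one at
# a time, so nothing is materialised in memory.

class Lowercase:
    def __getitem__(self, corpus):
        # Returns a generator; documents are transformed on demand.
        return ([token.lower() for token in doc] for doc in corpus)

class DropStopwords:
    STOPWORDS = {"the", "a", "of"}

    def __getitem__(self, corpus):
        return ([t for t in doc if t not in self.STOPWORDS] for doc in corpus)

docs = [["The", "Core", "of", "Gensim"], ["A", "Streamed", "Corpus"]]
pipeline = DropStopwords()[Lowercase()[docs]]
print(list(pipeline))  # [['core', 'gensim'], ['streamed', 'corpus']]
```

In Gensim itself the same shape appears as, e.g., `lsi[tfidf[corpus]]`, chaining a TF-IDF model into an LSI model without loading the corpus into RAM.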

Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.