
Unveiling the Essence of Open Source Language Models as the True “Open AI”

In 2015, a nonprofit called OpenAI was formed to create “broadly and evenly distributed” AI. Fast forward to 2024, and OpenAI has transitioned into full-on for-profit mode, hoarding access to LLMs behind a transactional API service. Most recently, they’re seeking a $100 billion valuation.

The past decade of AI progress has been dominated by large tech companies and well-funded labs such as Google, Meta, Anthropic, and OpenAI releasing ever-larger proprietary language models. From Bard and Claude to GPT-4, much of the state of the art in natural language processing (NLP) has remained concentrated in the hands of a few research labs.

However, the long-term future of AI lies not in ever-bigger private models served exclusively through APIs, but in open-source language models built in the open alongside communities.

Open-Source Language Models

In recent years, a handful of startups, universities, and dedicated individuals have helped pioneer this open model of language model development.

The latest model continuing this open-source lineage is H2O-Danube-1.8B. Weighing in at 1.8 billion parameters, Danube demonstrates surprising capability even compared to other publicly available models many times its size. The H2O.ai team meticulously designed, trained, and validated Danube completely transparently, with the full report available on arXiv.

Rather than hoarding access, H2O.ai released Danube’s full parameters and training code openly on HuggingFace. Within days of the initial announcement, curious developers began freely experimenting with the model, showing a pace of experimentation that is simply not feasible with proprietary models. As of writing, the h2o-danube-1.8b-chat model has been downloaded over 500 times on HuggingFace.

Anyone can use the model with the transformers library, using the code below, courtesy of H2O’s HuggingFace repo:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="h2oai/h2o-danube-1.8b-chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# We use the HF Tokenizer chat template to format each message
# https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "user", "content": "Why is drinking water so healthy?"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
res = pipe(
    prompt,
    max_new_tokens=256,
)
print(res[0]["generated_text"])
# <|prompt|>Why is drinking water so healthy?</s><|answer|> Drinking water is healthy for several reasons: [...]
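The output comment above hints at what apply_chat_template builds: the user message wrapped as `<|prompt|>…</s>`, followed by an `<|answer|>` marker for the model to continue from. As a rough illustration only, the sketch below mimics that layout in plain Python without downloading the model. The function name `format_danube_prompt` is hypothetical, and the handling of assistant turns is an assumption; the template shipped with the tokenizer remains the authoritative source.

```python
def format_danube_prompt(messages, add_generation_prompt=True):
    """Illustrative re-creation of the prompt layout seen in the output above."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append("<|prompt|>" + message["content"] + "</s>")
        elif message["role"] == "assistant":
            # Assumption: earlier model replies are wrapped the same way.
            parts.append("<|answer|>" + message["content"] + "</s>")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    if add_generation_prompt:
        # Trailing marker that cues the model to start its answer.
        parts.append("<|answer|>")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Why is drinking water so healthy?"},
]
print(format_danube_prompt(messages))
```

Formatting the prompt as a plain string first (tokenize=False in the original snippet) makes it easy to inspect exactly what the model sees before any text is generated.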

H2O believes collaborating openly remains the ultimate key towards democratizing access to AI and unlocking benefits for the many rather than wealth for the few.

Other Open-Source Language Models

The open-source AI ecosystem continues expanding with developers globally collaborating on shared models. Beyond H2O-Danube-1.8B, numerous noteworthy initiatives aim to prevent concentration of knowledge within walled gardens.

MPT

Developed by startup MosaicML, the MosaicML Pretrained Transformer (MPT) incorporates techniques like FlashAttention and ALiBi-based context length extrapolation to improve efficiency.

Falcon

Developed by the Technology Innovation Institute (TII) in Abu Dhabi, Falcon’s biggest open-source LLM is a whopping 180-billion-parameter beast, reported at release to outperform the likes of LLaMA-2, StableLM, RedPajama, and MPT.

At that size, it’s recommended to have 400 gigabytes of available memory to run the model.
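A quick back-of-the-envelope calculation shows where a figure of that size comes from. The sketch below estimates memory for the weights alone, assuming nominal parameter counts and a fixed number of bytes per parameter. The helper `weight_memory_gb` is hypothetical, and the estimate ignores activations, the attention cache, and framework overhead, which is why practical recommendations sit above the raw number.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
models = [("H2O-Danube-1.8B", 1.8e9), ("Falcon-180B", 180e9)]

for name, params in models:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bfloat16")
```

At 16-bit precision, 180 billion parameters come to roughly 360 GB of weights, so a 400 GB recommendation leaves some headroom for everything else. The same arithmetic puts a 1.8-billion-parameter model at under 4 GB, which is why it fits on a single consumer GPU.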

Mistral

Founded by former Google DeepMind and Meta researchers, Mistral AI released the 7-billion-parameter Mistral 7B model in September 2023. Mistral 7B performs competitively among open models, nearly matching the closed GPT-3 in sample quality.

Legacy Models

Beyond newly launched models, earlier open-source models continue to empower developers. GPT-2 from OpenAI and GPT-J from EleutherAI both hold historical significance despite lagging behind modern architectures. And Transformer models like BERT gave rise to an entire subclass of NLP breakthroughs powering products globally.

The democratization narrative only strengthens thanks to passionate communities generously contributing their creations back to common pools of knowledge.

A More Equitable Future

In many ways, proprietary language models risk recreating many of the inequities the tech industry continues to wrestle with. Concentrating knowledge within wealthy organizations excludes smaller teams from shaping progress early on, and later makes integration prohibitively expensive once models are available purely through transactional APIs.

Open-source models are vital to seeding a more equitable way forward, one where agency lies closer to the diverse communities actually building concrete AI applications. The long arc of progress only bends towards justice when people unite behind the technology itself rather than any one organization seeking to control it.

Danube and the open paradigm it represents offer but one glimpse into an alternate vision, one driven not by short-term profits or prestige but by empowering developers everywhere to build freely on each other’s work. There will always remain space for proprietary work, but the true future of AI lies open.

Community-Driven Innovation

Releasing open-source models draws contributions from a motivated community of developers and researchers. This collaborative style of working in the open unlocks unique opportunities. Experts across organizations can peer review each other’s work to validate techniques.

Researchers can readily replicate and extend new ideas instead of reinventing the wheel. And software engineers can rapidly integrate and deploy innovations into customer offerings.

Perhaps most promisingly, the open paradigm allows niche communities to gather around customizing models for specific use cases. Teams can sculpt versions tailored to particular topics like medicine, law, or finance which outperform generic models. These specialized models then get shared back to benefit the rest of the community. Together, groups make collective progress not possible within any single closed lab.

This article was originally published by Frederik Bussler on Hackernoon.
