theregister.com
Taiwan's new president wants to upgrade from 'silicon island' to 'AI island'
Simon Sharwood
Taiwan's recently elected president, Lai Ching-te, has used his inaugural address to call for the island state to upgrade to an AI nation.
"As we meet the global challenges of adopting more and more smart technologies, we in Taiwan, a 'silicon island,' must do all we can to expedite Taiwan's transformation into an 'AI island'," Lai – who uses the Western name William – told citizens on Monday.
"We must adapt AI for industry and step up the pace of AI innovation and applications," he added, and "must also adapt industry for AI and use AI's computational power to make our nation, our military, our workforce, and our economy stronger."
He also called on Taiwan to make "bold investments" in quantum computing, robotics, the metaverse, precision medicine, and other advanced technologies.
"Our sights are set on making Taiwan the Asian hub of unmanned aerial vehicle supply chains for global democracies, and developing the next generation of medium- and low-orbit communications satellites, bringing Taiwan's space and aerospace industries squarely into the international sphere," he added.
That's part of an ambitious plan to make more military tech, and security and surveillance kit – two of five industries in which Taiwan excels and is trusted globally. Semiconductors is another, with AI and next-generation communications rounding out the list.
The mentions of military strength and serving like-minded democracies are notable, because elsewhere in his speech Lai called on China to "cease their political and military intimidation against Taiwan, share with Taiwan the global responsibility of maintaining peace and stability in the Taiwan Strait as well as the greater region, and ensure the world is free from the fear of war."
And if that doesn't work, he's prepared to fight.
"As we pursue the ideal of peace, we must not harbor any delusions," the incoming president warned. "So long as China refuses to renounce the use of force against Taiwan, all of us in Taiwan ought to understand that even if we accept the entirety of China's position and give up our sovereignty, China's ambition to annex Taiwan will not simply disappear."
"In face of the many threats and attempts of infiltration from China, we must demonstrate our resolution to defend our nation, and we must also raise our defense awareness and strengthen our legal framework for national security."
Lai suggested that defensive effort would be conducted on behalf of Taiwan's citizens, and those of the wider world.
"As we look toward our future, we know that semiconductors will be indispensable. And the AI wave has already swept in. Taiwan has already mastered advanced semiconductor manufacturing, and we stand at the center of the AI revolution," he boasted. "We are a key player in supply chains for global democracies. For these reasons, Taiwan has an influence on global economic development, as well as humanity's well-being and prosperity.”
To safeguard that prosperity, he called for solidarity at home and ongoing demonstrations of thanks for the support of foreign friends.
This is recognition that US support for Taiwan is a major deterrent to any Chinese effort to reclaim the island. One reason for that support is that the US can't do without silicon produced in Taiwan by TSMC if it is to continue enjoying superiority in that arena compared to China.
China's foreign minister, Wang Yi, dismissed Lai's speech and reiterated the CCP position that reunification is inevitable and the only way to guarantee peace across the Taiwan Strait.
"The Taiwan issue is China's internal affair, and the realization of complete national reunification is the unanimous demand of all Chinese people. It is also a historical trend that no force can stop," he declared. ®
https://www.theregister.com/2024/05/21/ ... ai_island/
Here we follow the AI diarrhea generators' conquest of the world
- pigra senlaborulo
There is nothing lower than a scab.
Marx spread his smarts by reading books and writing them. Bakunin, on the other hand, destroyed his brain cells with alcohol. The remaining brain cells got room to grow and develop, and in the end there was just one hell of a big, smart brain cell in Bakunin's skull. So Bakunin was smarter than Marx.
- pigra senlaborulo
An analysis of ChatGPT answers to 517 programming questions finds that 52% of the answers contain incorrect information. In 39% of the cases where an answer was incorrect, users did not notice the error.
https://dl.acm.org/doi/pdf/10.1145/3613904.3642596
- pigra senlaborulo
arstechnica.com
New Windows AI feature records everything you’ve done on your PC
Benj Edwards - 5/20/2024, 9:43 PM
The illusion of privacy —
At a Build conference event on Monday, Microsoft revealed a new AI-powered feature called "Recall" for Copilot+ PCs that will allow Windows 11 users to search and retrieve their past activities on their PC. To make it work, Recall records everything users do on their PC, including activities in apps, communications in live meetings, and websites visited for research. Despite encryption and local storage, the new feature raises privacy concerns for certain Windows users.
"Recall uses Copilot+ PC advanced processing capabilities to take images of your active screen every few seconds," Microsoft says on its website. "The snapshots are encrypted and saved on your PC’s hard drive. You can use Recall to locate the content you have viewed on your PC using search or on a timeline bar that allows you to scroll through your snapshots."
By performing a Recall action, users can access a snapshot from a specific time period, providing context for the event or moment they are searching for. It also allows users to search through teleconference meetings they've participated in and videos watched using an AI-powered feature that transcribes and translates speech.
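Microsoft has not published Recall's internals, but the capture loop described above (a screenshot every few seconds, saved locally with a timestamp) is simple to picture. Below is a toy Python sketch of that idea using Pillow's ImageGrab; it is not Microsoft's implementation, and it leaves out the OCR, encryption, and NPU-backed indexing that make Recall searchable:
# Toy illustration of periodic local screenshotting, NOT Microsoft's Recall.
# Assumes Pillow is installed; ImageGrab works on Windows and macOS.
import time
from datetime import datetime
from pathlib import Path
from PIL import ImageGrab

SNAPSHOT_DIR = Path.home() / "toy_recall"   # hypothetical local folder
INTERVAL_SECONDS = 5                        # stand-in for "every few seconds"

def capture_loop(max_snapshots: int = 10) -> None:
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    for _ in range(max_snapshots):
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        ImageGrab.grab().save(SNAPSHOT_DIR / f"{stamp}.png")  # full-screen capture
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    capture_loop()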
At first glance, the Recall feature seems like it may set the stage for potential gross violations of user privacy. Despite reassurances from Microsoft, that impression persists for second and third glances as well. For example, someone with access to your Windows account could potentially use Recall to see everything you've been doing recently on your PC, which might extend beyond the embarrassing implications of pornography viewing and actually threaten the lives of journalists or perceived enemies of the state.
Despite the privacy concerns, Microsoft says that the Recall index remains local and private on-device, encrypted in a way that is linked to a particular user's account. "Recall screenshots are only linked to a specific user profile and Recall does not share them with other users, make them available for Microsoft to view, or use them for targeting advertisements. Screenshots are only available to the person whose profile was used to sign in to the device," Microsoft says.
Users can pause, stop, or delete captured content and can exclude specific apps or websites. Recall won't take snapshots of InPrivate web browsing sessions in Microsoft Edge or DRM-protected content. However, Recall won't actively hide sensitive information like passwords and financial account numbers that appear on-screen.
Microsoft previously explored a somewhat similar functionality with the Timeline feature in Windows 10, which the company discontinued in 2021, but it didn't take continuous snapshots. Recall also shares some obvious similarities to Rewind, a third-party app for Mac we covered in 2022 that logs user activities for later playback.
As you might imagine, all this snapshot recording comes at a hardware penalty. To use Recall, users will need to purchase one of the new "Copilot Plus PCs" powered by Qualcomm's Snapdragon X Elite chips, which include the necessary neural processing unit (NPU). There are also storage requirements for running Recall: at least 256GB of hard drive space, 50GB of which must be available. The default allocation for Recall on a 256GB device is 25GB, which can store approximately three months of snapshots. Users can adjust the allocation in their PC settings, with old snapshots being deleted once the allocated storage is full.
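Those figures allow a quick back-of-the-envelope check. Assuming one snapshot every five seconds for eight active hours a day (both assumptions, not Microsoft numbers), the default 25GB allocation over roughly 90 days works out to a few hundred megabytes per day and on the order of 50KB per stored snapshot:
# Back-of-the-envelope check of Recall's stated numbers (25GB for ~3 months).
# The 5-second interval and 8 active hours/day are assumptions, not Microsoft figures.
allocation_mb = 25 * 1024            # default 25GB allocation, in MB
days = 90                            # "approximately three months"
per_day_mb = allocation_mb / days
snapshots_per_day = 8 * 3600 / 5     # assumed 8 active hours, one snapshot every 5 s
per_snapshot_kb = per_day_mb * 1024 / snapshots_per_day
print(f"{per_day_mb:.0f} MB/day, ~{per_snapshot_kb:.0f} KB per snapshot")
# -> about 284 MB/day and roughly 51 KB per snapshot under these assumptions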
As far as availability goes, Microsoft says that Recall is still undergoing testing. "Recall is currently in preview status," Microsoft says on its website. "During this phase, we will collect customer feedback, develop more controls for enterprise customers to manage and govern Recall data, and improve the overall experience for users."
https://arstechnica.com/gadgets/2024/05 ... n-your-pc/
- pigra senlaborulo
mikrobitti.fi
Finnish phone maker makes a comeback after its Russia tangle – unveils a phone and a surprise product
Janne Heleskoski
Finland's Jolla has made a comeback and unveiled two new devices. The Jolla Mind2 is a personal AI computer and assistant. It is meant to connect to the user's phone or computer and make managing and processing information easier with the help of AI.
At the same time, the device is meant to provide security and privacy, as data stays on the device and under the user's control instead of in various cloud services.
According to the description, you can handle all your daily chores on the device through a single AI-assisted interface. The device takes care of emails, messages, documents and other content, and you can also communicate with it by speaking. In addition, the plan is to build an open ecosystem around the device, inviting app developers to create new applications for it.
The Jolla Mind2 will initially be offered as a Community Edition aimed at developers and curious hobbyists. Deliveries of pre-orders begin in September. The device is priced at 699 euros.
The other new product is the Jolla C2 smartphone. The "Community Phone" is likewise aimed mainly at curious hobbyists who want to develop and innovate on the Sailfish OS 5.0 operating system. The device will be available in limited quantities for 299 euros, with deliveries of orders starting in August.
Jolla C2. The new Community Phone gives enthusiasts a chance to tinker with the Sailfish operating system.
Jolla also announced a new strategic partnership with the Turkish mobile phone maker Reeder Technology. Reeder is licensing the Sailfish operating system for use in its upcoming devices, and will also handle assembly of the Mind2 AI computer and the Jolla C2 phone.
Jolla was originally born out of Nokia's Meego program to build smartphones, but eventually ended up as a software company whose products have been the Sailfish operating system and Appsupport, aimed at car manufacturers, which lets them easily bring Android apps to their own Linux-based operating systems.
The "old Jolla's" large Russian ownership became a problem, however, after the war in Ukraine began in 2022. The eventual solution was a corporate restructuring, at the end of which the company's former management bought Jolla Oy's business. Sailfish, the phone and the AI computer are now developed by Jollyboys Oy, and the automotive business has likewise been spun off into a new company, Seafarix Oy.
https://www.mikrobitti.fi/uutiset/mb/6a ... df7f4c0ffb
- Spandau Mullet
Congratulations, macook users
This handle mostly writes short posts with hardly any content in the Roskakori section.
- Spandau Mullet
https://www.windowscentral.com/microsof ... and-guides

Oh hell no, goddammit.
During the event, Microsoft showcased one of these new integrations. Microsoft Copilot will be embedded directly in video games, starting with Minecraft. Players will be able to use natural language to ask questions like "How do I craft a sword?" and the Copilot will search your chests and inventories for the necessary materials, or guide you to them if you don't have them. It will also explain how to craft the item, and so on, eliminating the need to alt-tab and read a website for Minecraft guides like ours (RIP Windows Central).
- pigra senlaborulo
Yle is measuring the diversity of voices in its news more precisely than before.
Since this spring, the news desk has had an AI-assisted diversity-of-voices tool in use that shows, among other things, who gets quoted in Yle's online news.
The tool also analyzes text-based news stories for things such as gender balance, interviewees' job titles, the organizations appearing in stories, the topics covered, and how politicians from different parties get heard in Yle's content.
"With the tool we can make better choices and serve the audience better. What is especially valuable is that we can now see, topic by topic, who we give a voice to," says Kati Puustinen, who works at Yle as head of journalistic development.
The automated tool speeds up and simplifies work that has been going on for a long time with the aim of increasing the diversity of voices in the news. Among other things, Yle has counted the gender balance of its published stories by hand since 2016.
"With the tool, there is more data on diversity of voices available than before. We get deeper discussion going inside the newsrooms," Puustinen says.
Looking beyond the higher-education bubble
This year, Yle is aiming to bring its news stories to more Finns whose educational background is comprehensive school or upper secondary education.
The diversity tool helps in reaching the new target group by analyzing, for example, the regional balance of online stories. The tool shows how often different municipalities, or even city districts, appear in Yle's news. The results can be compared with statistics on the educational background of people living in different municipalities and districts.
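Yle has not published the tool itself, but the regional-balance part it describes boils down to counting place-name mentions in article text and comparing the shares against reference statistics. Here is a toy sketch of that kind of tally, with a made-up municipality list, articles, and reference shares; this is not Yle's tool:
# Toy sketch of the regional-balance tally described above; not Yle's actual tool.
# The municipality list, articles, and reference shares are made-up examples.
from collections import Counter

MUNICIPALITIES = ["Helsinki", "Oulu", "Kuopio", "Rovaniemi"]  # illustrative subset

articles = [
    "Helsinki city council debated the new tram line.",
    "A start-up in Oulu is hiring software developers.",
    "Helsinki housing prices keep rising.",
]

mentions = Counter()
for text in articles:
    for name in MUNICIPALITIES:
        if name in text:
            mentions[name] += 1

total = sum(mentions.values()) or 1
# Hypothetical reference shares (e.g. population or education statistics) to compare against.
reference_share = {"Helsinki": 0.45, "Oulu": 0.25, "Kuopio": 0.20, "Rovaniemi": 0.10}

for name in MUNICIPALITIES:
    coverage = mentions[name] / total
    gap = coverage - reference_share[name]
    print(f"{name}: coverage {coverage:.0%}, reference {reference_share[name]:.0%}, gap {gap:+.0%}")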
"We recognize that journalists form a highly educated bubble. In our stories, it is often highly educated experts who explain the world. We want to look into our blind spots and be worthy of everyone's trust. That is why we are pushing toward this kind of goal," Puustinen says.
Yle built the tool on the basis of the experience gained in the Moniäänisyysmittari pilot project coordinated by Tampere University.
According to Puustinen, a trial phase is under way, and other tools are also in development.
"At the same time we are trying to work out how AI could help us in news work in a concrete but responsible way. Could it, for example, spar with us on diversity of voices," Puustinen muses.
https://yle.fi/a/74-20089714?origin=rss
- Iggy DOG
One day this outfit will fade out and our bones will crumble to earth
- pigra senlaborulo
technologyreview.com
GPT-4o’s Chinese token-training data is polluted by spam and porn websites
Zeyi Yang
Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.
On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large language models like those that power such chatbots, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse and compress Chinese prompts.
Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. Besides dictionary words, they also include suffixes, common expressions, names, and more. The more tokens a model encodes, the faster the model can “read” a sentence and the less computing power it consumes, thus making the response cheaper.
Of the 100 results, only three were common enough to be used in everyday conversation; everything else consisted of words and expressions used specifically in the contexts of either gambling or pornography. The longest token, 10.5 Chinese characters long, literally means “_free Japanese porn video to watch.” Oops.
“This is sort of ridiculous,” Cai wrote, and he posted the list of tokens on GitHub.
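A rough reconstruction of that kind of scan takes only a few lines, assuming the openly published tiktoken package and its o200k_base encoding (the tokenizer released alongside GPT-4o); this is a sketch of the idea, not Cai's actual code:
# Sketch of scanning a public token library for the longest Chinese tokens.
# Assumes tiktoken's o200k_base encoding, released with GPT-4o; not Cai's actual code.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def is_mostly_chinese(s: str) -> bool:
    # Count CJK Unified Ideographs and require that they dominate the string.
    han = sum(1 for ch in s if "\u4e00" <= ch <= "\u9fff")
    return len(s) > 0 and han / len(s) > 0.5

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and tokens that are partial UTF-8 byte sequences
    if is_mostly_chinese(text):
        chinese_tokens.append(text)

# Print the 100 longest Chinese tokens by character count.
for tok in sorted(chinese_tokens, key=len, reverse=True)[:100]:
    print(len(tok), tok)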
OpenAI did not respond to questions sent by MIT Technology Review prior to publication.
GPT-4o is supposed to be better than its predecessors at handling multi-language tasks. In particular, the advances are achieved through a new tokenization tool that does a better job compressing texts in non-English languages.
But at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases. Experts say that’s likely due to insufficient data cleaning and filtering before the tokenizer was trained.
Because these tokens are not actual commonly spoken words or phrases, the chatbot can fail to grasp their meanings. Researchers have been able to leverage that and trick GPT-4o into hallucinating answers or even circumventing the safety guardrails OpenAI had put in place.
Why non-English tokens matter
The easiest way for a model to process text is character by character, but that’s obviously more time consuming and laborious than recognizing that a certain string of characters—like “c-r-y-p-t-o-c-u-r-r-e-n-c-y”—always means the same thing. These series of characters are encoded as “tokens” the model can use to process prompts. Including more and longer tokens usually means the LLMs are more efficient and affordable for users—who are often billed per token.
When OpenAI released GPT-4o on May 13, it also released a new tokenizer to replace the one it used in previous versions, GPT-3.5 and GPT-4. The new tokenizer especially adds support for non-English languages, according to OpenAI’s website.
The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
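The compression difference is easy to see for yourself, again assuming tiktoken: cl100k_base is the encoding behind GPT-3.5 and GPT-4, o200k_base the new one behind GPT-4o. The sample sentences below are illustrative and the exact ratio will vary by text; this is not a benchmark:
# Compare how many tokens the old and new tokenizers need for the same text.
# Assumes tiktoken; cl100k_base backs GPT-3.5/GPT-4, o200k_base backs GPT-4o.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")
new_enc = tiktoken.get_encoding("o200k_base")

samples = {
    "English": "Large language models bill users per token.",
    "Chinese": "大型语言模型按词元向用户计费。",  # illustrative sentence, not from the article
}

for lang, text in samples.items():
    old_n = len(old_enc.encode(text))
    new_n = len(new_enc.encode(text))
    print(f"{lang}: {old_n} tokens with cl100k_base -> {new_n} with o200k_base "
          f"({old_n / new_n:.1f}x fewer)")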
Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.
That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”
Polluted data and a lack of cleaning
However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.
The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.
These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
Chinese users have reported that these spam sites appeared frequently in unrelated Google search results this year, including in comments made to Google Search’s support community. It’s likely that these websites also found their way into OpenAI’s training database for GPT-4o’s new tokenizer.
The same issue didn’t exist with the previous-generation tokenizer and Chinese tokens used for GPT-3.5 and GPT-4, says Zhengyang Geng, a PhD student in computer science at Carnegie Mellon University. There, the longest Chinese tokens are common terms like “life cycles” or “auto-generation.”
Das, who worked on the Google Search team for three years, says the prevalence of spam content is a known problem and isn’t that hard to fix. “Every spam problem has a solution. And you don’t need to cover everything in one technique,” he says. Even simple solutions like requesting an automatic translation of the content when detecting certain keywords could “get you 60% of the way there,” he adds.
But OpenAI likely didn’t clean the Chinese data set or the tokens before the release of GPT-4o, Das says: “At the end of the day, I just don’t think they did the work in this case.”
It’s unclear whether any other languages are affected. One X user reported a similar prevalence of porn and gambling content in Korean tokens.
The tokens can be used to jailbreak
Users have also found that these tokens can be used to break the LLM, either getting it to spew out completely unrelated answers or, in rare cases, to generate answers that are not allowed under OpenAI’s safety standards.
Geng of Carnegie Mellon University asked GPT-4o to translate some of the long Chinese tokens into English. The model then proceeded to translate words that were never included in the prompts, a typical result of LLM hallucinations.
He also succeeded in using the same tokens to “jailbreak” GPT-4o—that is, to get the model to generate things it shouldn’t. “It’s pretty easy to use these [rarely used] tokens to induce undefined behaviors from the models,” Geng says. “I did some personal red-teaming experiments … The simplest example is asking it to make a bomb. In a normal condition, it would decline it, but if you first use these rare words to jailbreak it, then it will start following your orders. Once it starts to follow your orders, you can ask it all kinds of questions.”
In his tests, which Geng chooses not to share with the public, he says he can see GPT-4o generating the answers line by line. But when it almost reaches the end, another safety mechanism kicks in, detects unsafe content, and blocks it from being shown to the user.
The phenomenon is not unusual in LLMs, says Sander Land, a machine-learning engineer at Cohere, a Canadian AI company. Land and his colleague Max Bartolo recently drafted a paper on how to detect the unusual tokens that can be used to cause models to glitch. One of the most famous examples was “_SolidGoldMagikarp,” a Reddit username that was found to get ChatGPT to generate unrelated, weird, and unsafe answers.
The problem lies in the fact that sometimes the tokenizer and the actual LLM are trained on different data sets, and what was prevalent in the tokenizer data set is not in the LLM data set for whatever reason. The result is that while the tokenizer picks up certain words that it sees frequently, the model is not sufficiently trained on them and never fully understands what these “under-trained” tokens mean. In the _SolidGoldMagikarp case, the username was likely included in the tokenizer training data but not in the actual GPT training data, leaving GPT at a loss about what to do with the token. “And if it has to say something … it gets kind of a random signal and can do really strange things,” Land says.
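The mismatch Land describes can be illustrated with simple frequency bookkeeping: a token that shows up often in the tokenizer's training corpus but hardly ever in the more heavily filtered LLM training corpus is a candidate glitch token. The sketch below only illustrates that idea with toy data; it is not the detection method from Land and Bartolo's paper:
# Flag potentially under-trained tokens by comparing frequencies across two corpora.
# Toy illustration of the mismatch idea, not Land and Bartolo's actual method.
from collections import Counter

def token_frequencies(corpus: list[list[str]]) -> Counter:
    """Count token occurrences across a corpus of pre-tokenized documents."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts

def flag_undertrained(tokenizer_corpus, llm_corpus, min_tok_freq=100, max_llm_freq=1):
    """Tokens common in the tokenizer corpus but (nearly) absent from the LLM corpus."""
    tok_freq = token_frequencies(tokenizer_corpus)
    llm_freq = token_frequencies(llm_corpus)
    return [t for t, n in tok_freq.items()
            if n >= min_tok_freq and llm_freq.get(t, 0) <= max_llm_freq]

# Toy data: "_SolidGoldMagikarp" is frequent in the scraped tokenizer corpus
# but missing from the filtered corpus the model was actually trained on.
tokenizer_docs = [["_SolidGoldMagikarp"] * 150, ["the", "cat", "sat"]]
llm_docs = [["the", "cat", "sat", "on", "the", "mat"]]
print(flag_undertrained(tokenizer_docs, llm_docs))  # -> ['_SolidGoldMagikarp']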
And different models could glitch differently in this situation. “Like, Llama 3 always gives back empty space but sometimes then talks about the empty space as if there was something there. With other models, I think Gemini, when you give it one of these tokens, it provides a beautiful essay about El Niño, and [the question] didn’t have anything to do with El Niño,” says Land.
To solve this problem, the data set used for training the tokenizer should well represent the data set for the LLM, he says, so there won’t be mismatches between them. If the actual model has gone through safety filters to clean out porn or spam content, the same filters should be applied to the tokenizer data. In reality, this is sometimes hard to do because training LLMs takes months and involves constant improvement, with spam content being filtered out, while token training is usually done at an early stage and may not involve the same level of filtering.
While experts agree it’s not too difficult to solve the issue, it could get complicated as the result gets looped into multi-step intra-model processes, or when the polluted tokens and models get inherited in future iterations. For example, it’s not possible to publicly test GPT-4o’s video and audio functions yet, and it’s unclear whether they suffer from the same glitches that can be caused by these Chinese tokens.
“The robustness of visual input is worse than text input in multimodal models,” says Geng, whose research focus is on visual models. Filtering a text data set is relatively easy, but filtering visual elements will be even harder. “The same issue with these Chinese spam tokens could become bigger with visual tokens,” he says.
Update: The story has been updated to clarify a quote from Sander Land.
https://www.technologyreview.com/2024/0 ... n-polluted
- kantasolu
https://x.com/BobbyAllyn/status/1792679435701014908
https://news.ycombinator.com/item?id=40421225
This Scarlett business is once again really something, and this Sam "WorldCoin" Altman sure seems like one hell of a fine guy.
Well, that statement lays out a damning timeline:
- OpenAI approached Scarlett last fall, and she refused.
- Two days before the GPT-4o launch, they contacted her agent and asked that she reconsider. (Two days! This means they already had everything they needed to ship the product with Scarlett’s cloned voice.)
- Not receiving a response, OpenAI demos the product anyway, with Sam tweeting “her” in reference to Scarlett’s film.
- When Scarlett’s counsel asked for an explanation of how the “Sky” voice was created, OpenAI yanked the voice from their product line.
Perhaps Sam’s next tweet should read “red-handed”.
- Spandau Mullet
Just your basic Silicon Valley startup "better to ask for forgiveness than permission" attitude. But yeah, this guy sucks.
kantasolu wrote: ↑21 May 2024, 22:26
https://x.com/BobbyAllyn/status/1792679435701014908
https://news.ycombinator.com/item?id=40421225
This Scarlett business is once again really something, and this Sam "WorldCoin" Altman sure seems like one hell of a fine guy.
- Marxin Ryyppy
Ed Zitron's latest newsletter is briefly and pithily titled "Sam Altman Is Full Of Shit", and yes, I can trust Ed on this one.
https://www.wheresyoured.at/sam-altman-is-full-of-shit/
Pretty much the same things are gone over there as have already been covered here, but this was a good point:
It shouldn’t come as much of a surprise that Johansson didn’t jump at the chance to work with OpenAI. As a member of the SAG-AFTRA actor’s guild, Johansson was a participant in the 2023 strike that effectively deadlocked all TV and film production for much of that year. A major concern of the guild was the potential use of AI to effectively create a facsimile of an actor, using their likeness but giving them none of the proceeds. The idea that, less than one year after the strike’s conclusion, Johansson would lend her likeness to the biggest AI company in the world is, frankly, bizarre.
Who am I? Who else is there? Who am I? Let's put it this way: who has the best tunes?
- pigra senlaborulo