Three vast corpora, to start

As I am writing and looking at the things I may have to leave undone, it strikes me that we have some amazing sources of oral history, which can and should be systematically converted to text for close study by humans and pattern extraction by machines. I have in mind the following sources:

Jake Feinberg, The Jake Feinberg Show, various episodes.

David Gans and Gary Lambert, Tales From the Golden Road, Sirius XM Satellite Radio, Grateful Dead Channel, various episodes.

Steve Parrish, The Big Steve Show, Sirius XM Satellite Radio, Grateful Dead Channel, various episodes.

For both the Feinberg and Big Steve shows, I have played around with grabbing audio, converting to text using otter.ai, and editing in the otter.ai interface (which is quite good). These transcripts have much to teach us.

Feinberg has interviewed literally several dozen Garcia collaborators, many on more than one occasion, totaling a few hours. He has a great jazz head and draws great stuff out of his guests.

Big Steve’s is mostly his recollections, with interviews not uncommon. He has a very good memory, and some of these stories are pure gold. I have found many factual errors in them, but can you blame the guy for not getting 54 years of memories from roadying for the Dead and Garcia Just Exactly Perfect? There are absolute golden nuggets throughout the show.

But you have to pan for them.

In both cases, the transcriptions emerge from the machine *very rough*. A 45-minute Feinberg interview can take me 90 minutes or two hours to clean, maybe even more. And he’s doing these for broadcast, so care is taken to speak somewhat clearly. Big Steve is talking around giant Grizzly Peak doobies and, for some reason, juggling a mouthful of rocks. Polishing these transcriptions is more like a 3:1 time ratio. The machine isn’t the first not to know what to make of Big Steve, and we humans just process too fucking slowly.
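To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The two ratios come from my own experience described above (about 2:1 at the low end for Feinberg, about 3:1 for Big Steve). The episode count and episode length are purely hypothetical placeholders, not real figures for either show, and `cleanup_hours` is just a name I made up for the illustration.

```python
def cleanup_hours(audio_minutes: float, ratio: float) -> float:
    """Hours of human editing needed for `audio_minutes` of audio,
    given a cleanup-time-to-audio-time ratio."""
    return audio_minutes * ratio / 60.0


# Ratios taken from the rough figures in the post.
FEINBERG_RATIO = 2.0   # 45 minutes of audio, about 90 minutes of cleanup (low end)
BIG_STEVE_RATIO = 3.0  # "more like 3:1"

# ASSUMPTION: hypothetical backlog, invented only to show the arithmetic.
HYPOTHETICAL_EPISODES = 100
HYPOTHETICAL_MINUTES_PER_EPISODE = 45

if __name__ == "__main__":
    total_minutes = HYPOTHETICAL_EPISODES * HYPOTHETICAL_MINUTES_PER_EPISODE
    for name, ratio in [("Feinberg", FEINBERG_RATIO), ("Big Steve", BIG_STEVE_RATIO)]:
        hours = cleanup_hours(total_minutes, ratio)
        print(f"{name}: {hours:.0f} hours of cleanup for {total_minutes} minutes of audio")
```

Swap in real episode counts and running times if anyone ever tallies them; the point is only that the human hours scale linearly with the audio, and the multiplier is never less than two.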

So, it hurts me to know how much I am probably missing from these and countless other sources. I am glad the Good Ol' Grateful Deadcast is making transcriptions. Awesome stuff.

But these three shows alone have been running for *years*, and have generated probably millions of words altogether. These currently exist as sound data, and we need to get them transcribed.

Bottom line: the voice-to-text process is where the time/energy bottleneck still is. Once we have text, the machines can do their things, and we can do ours, and there are many opportunities to apply digital humanities kinds of methods to our corpora. I think it will yield great results. But someone other than me is going to have to get that ball rolling!

Comments

2 responses to “Three vast corpora, to start”

Light Into Ashes

July 14, 2023

That's not to mention all the Garcia interviews that haven't been transcribed yet!

Fate Music

July 15, 2023

I tried to do the Garcia Interviews Project, but it hasn't hit yet.

I wonder if ChatGPT could do that stuff?